Analytics to the masses by JOSE LUIS LÓPEZ at Big Data Spain 2014

by Big Data Spain

Slide 1

BIG DATA ANALYTICS TO THE MASSES
JOSE LUIS LÓPEZ PINO
DATA ENGINEER, GETYOURGUIDE

Slide 2

Big Data Analytics to the masses
Why it has failed and how we can fix it
Jose Luis Lopez Pino

Slide 3

Who am I?
- BI Consultant
- Large-Scale & Distributed
- Founding Data Engineer

Slide 4

Big Data is like tourism
- It seems easy to do.
- But if you aren’t an expert, you can’t make the most of it.

Slide 5

Struggle to analyze Big Data
Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O’Reilly Media, Inc., 2013
Also: Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions

Slide 6

Tools
Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014

Slide 7

Tools (October 2014)
Original: Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014

Slide 8

Deep analytics

Slide 9

Libraries! We need libraries...
- Query languages
- Write your own MR/RDD/Transformations

Slide 10

… comprehensive ones!
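
To illustrate the gap between writing your own transformations and calling a library, here is a minimal sketch in Python (not the language used in the talk). It computes a population variance twice: once with hand-written map and reduce steps over partitions, and once with a single call to the standard library. The function name `variance_by_hand` and the sample data are assumptions made for illustration only.

```python
from functools import reduce
import statistics


def variance_by_hand(partitions):
    """Population variance built from explicit map and reduce steps."""
    # Map: each partition emits (count, sum, sum of squares).
    partials = [
        (len(p), sum(p), sum(x * x for x in p))
        for p in partitions
    ]
    # Reduce: merge the partial aggregates pairwise.
    n, s, ss = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]),
        partials,
    )
    mean = s / n
    return ss / n - mean * mean


# Illustrative data, split into three partitions.
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]

by_hand = variance_by_hand(partitions)

# The library route: one call, no knowledge of the aggregation scheme needed.
by_library = statistics.pvariance([x for p in partitions for x in p])
```

Both routes give the same number, but the first requires the analyst to know how to decompose the statistic, which is the burden a comprehensive library would remove.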

Slide 11

Say it with memes!
- When you do deep analytics in small data using R and CRAN packages
- When you do deep analytics in BIG data using R and CRAN packages

Slide 12

- When you try to program it using MapReduce
- When you try to program it using Apache Spark / Apache Flink
- When you try to use a library scalable to large data sets

Slide 13

Can’t we do it better?
- Make it similar to normal R programs.
- Hide complexity.
- Make file manipulation easier.
- Part of the computing on the cluster and part of the computing on the client.
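
The goals above can be sketched as follows. This is a minimal illustration in Python, not the system presented in the talk: `RemoteVector` is a hypothetical stand-in for data that lives on a cluster, and threads simulate the workers. The point is that the caller uses the same `mean(...)` call for local and remote data, while the heavy part runs per partition and only small partial results come back to the client.

```python
from concurrent.futures import ThreadPoolExecutor


class RemoteVector:
    """Hypothetical vector whose partitions would live on a cluster."""

    def __init__(self, partitions):
        self._partitions = [list(p) for p in partitions]

    def _run_on_cluster(self, func):
        # Simulates shipping `func` to every partition; threads stand in
        # for cluster workers in this sketch.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(func, self._partitions))

    def mean(self):
        # Cluster side: one small (count, sum) pair per partition.
        partials = self._run_on_cluster(lambda p: (len(p), sum(p)))
        # Client side: combine the small partial results locally.
        count = sum(n for n, _ in partials)
        total = sum(s for _, s in partials)
        return total / count


def mean(values):
    """Same call for local lists and RemoteVector, hiding where work runs."""
    if isinstance(values, RemoteVector):
        return values.mean()
    return sum(values) / len(values)


local_result = mean([1.0, 2.0, 3.0, 4.0])
remote_result = mean(RemoteVector([[1.0, 2.0], [3.0, 4.0]]))
```

The user-facing code is the same in both calls; only the type of the argument decides whether the work is split between cluster and client.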

Slide 14

Our approach

Slide 15

Our approach

Slide 16

Behind the scenes: Before

Slide 17

Behind the scenes: After

Slide 18

Without writing significantly different code

Slide 19

Competitive with, or even faster than, native R code on small data

Slide 20

And it scales

Slide 21

Some relevant findings
- Transmission time was not significant.
- Stratosphere/Flink was competitive in highly iterative programs.
- We were not able to do it keeping the code 100% the same.
- Ensemble scenarios are the most exciting ones.

Slide 22

4 Takeaways from this talk
- We still need to bring Big Data to the right people in the right place.
- We need comprehensive libraries.
- We need to move data back and forth.
- Use a syntax that the users are familiar with.

Slide 23

That’s all!
- Have you found this talk interesting?
- Follow me: @jllopezpino
- Interested in a job as SEM Data Analyst (Berlin)?
- Ask me for the details:
- Are you interested in Data + Energy?
- Keep in touch:

Slide 24

17TH ~ 18th NOV 2014 MADRID (SPAIN)