Analytics to the masses by JOSE LUIS LÓPEZ at Big Data Spain 2014

In this talk, we discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks, such as Hadoop or Stratosphere, are not popular among R users because writing programs using these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms.

Big Data Spain

November 25, 2014

Transcript

  1. BIG DATA ANALYTICS TO THE MASSES. JOSE LUIS LÓPEZ PINO, DATA ENGINEER, GETYOURGUIDE
  2. Big Data Analytics to the masses: Why it has failed and how we can fix it. Jose Luis Lopez Pino
  3. Who am I? BI Consultant Large-Scale & Distributed Founding Data Engineer
  4. Big Data is like Tourism: It seems easy to do. But if you aren’t an expert, you can’t make the most of it.
  5. Struggle to analyze Big Data. Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O’Reilly Media, Inc., 2013. Also: Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions
  6. Tools. Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014
  7. Tools (October 2014). Original: Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014
  8. Deep analytics

  9. Libraries! We need libraries... Query languages Write your own MR/RDD/Transformations

  10. … comprehensive ones!

  11. Say it with memes!
    - When you do deep analytics in small data using R and CRAN packages
    - When you do deep analytics in BIG data using R and CRAN packages
  12. (Memes, continued)
    - When you try to program it using MapReduce
    - When you try to program it using Apache Spark / Apache Flink
    - When you try to use a library scalable to large data sets
  13. Can’t we do it better?
    - Make it similar to normal R programs.
    - Hide complexity.
    - Make file manipulation easier.
    - Do part of the computing in the cluster and part of the computing in the client.
  14. Our approach

  15. Our approach

  16. Behind the scenes: Before

  17. Behind the scenes: After

  18. Without writing significantly different code

  19. Competitive with, or even faster than, native R code on small data
  20. And it scales

  21. Some relevant findings
    - Transmission time was not significant.
    - Stratosphere/Flink was competitive in highly iterative programs.
    - We were not able to do it keeping the code 100% the same.
    - Ensemble scenarios are the most exciting ones.
  22. 4 Takeaways from this talk
    - We still need to bring Big Data to the right people in the right place.
    - We need comprehensive libraries.
    - We need to move data back and forth.
    - Use a syntax that the users are familiar with.
  23. That’s all!
    - Have you found this talk interesting?
    - Follow me: @jllopezpino
    - Interested in a job as SEM Data Analyst (Berlin)?
    - Ask me for the details:
    - Are you interested in Data + Energy?
    - Keep in touch:
  24. 17TH ~ 18th NOV 2014 MADRID (SPAIN)