Big Data Journey

Tugdual Grall
September 18, 2015

Generic introduction to Big Data and key components/architectures.

This presentation was delivered during the JUG Summer Camp 2015 in La Rochelle (France)

Transcript

  1. © 2015 MapR Technologies. Big Data Journey with Hadoop & MapR. Tug Grall, tug@mapr.com, @tgrall
  2. YARN

  3. © 2015 MapR Technologies. Big Data Journey. Tug Grall, tug@mapr.com, @tgrall. David Pilato, david@elastic.co, @dadoonet
  4. YARN

  5. WHY?

  6. https://www.domo.com/

  7. Building new applications

  8. None
  9. Can I use my existing tools?

  10. (Big) Data Platform (Big) Data Project

  11. Ingest, Store, Process, Consume

  12. Ingest Data

  13. Copy files into HDFS: hadoop fs -put dailylogs-log.zip /logs/2015/09/10/
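
The target directory on this slide (/logs/2015/09/10/) suggests one directory per day. A minimal Python sketch of building that path and the matching command follows. The helper names `daily_log_dir` and `put_command` are hypothetical, and the `/logs` root is simply taken from the slide; the command is only assembled here, not executed.

```python
from datetime import date
from typing import List


def daily_log_dir(day: date, root: str = "/logs") -> str:
    """Return the per-day target directory, e.g. /logs/2015/09/10/."""
    return f"{root}/{day:%Y/%m/%d}/"


def put_command(local_file: str, day: date) -> List[str]:
    """Build the argument list for `hadoop fs -put` (assembled, not run)."""
    return ["hadoop", "fs", "-put", local_file, daily_log_dir(day)]


cmd = put_command("dailylogs-log.zip", date(2015, 9, 10))
print(" ".join(cmd))
```

On a machine with a Hadoop client installed, the list could be handed to `subprocess.run`; that step is left out so the sketch runs anywhere.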

  14. Import RDBMS data (targets: Files, HBase, Hive)

    sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers --target-dir /incremental_dataset --append
  15. Import RDBMS data

    input {
      jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        jdbc_user => "postgres"
        jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT * from contacts"
      }
    }
  16. What’s “wrong”? Batch????

  17. Streaming: Flume, Kafka, Logstash to the rescue

  18. (Diagram) Log App Events Twitter Sensors … → HDFS MapR-FS Alerts Elasticsearch … DB
  19. (Diagram) Producers: Log App Events Twitter Sensors … → Broker → Consumers: HDFS MapR-FS Alerts Elasticsearch … DB
  20. Stream data into Hadoop using Flume (diagram: four servers feeding Files, HBase, Hive)
  21. Streams using Kafka (diagram: three producers, three consumers; outputs: Files, HBase, Hive, Alert)
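
The producer / broker / consumer layout on the streaming slides can be illustrated with a toy in-memory model. This is only a sketch of the idea: `ToyBroker` is a hypothetical class, not the Kafka API, and the topic, consumer names and log lines are invented for illustration. It shows the property that matters here: each consumer keeps its own read position, so one stream can feed both storage and alerting.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> ordered messages
        self.offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self.logs[topic].append(message)

    def poll(self, topic, consumer):
        """Consumer side: return unseen messages and advance this consumer's offset."""
        start = self.offsets[(topic, consumer)]
        batch = self.logs[topic][start:]
        self.offsets[(topic, consumer)] = start + len(batch)
        return batch


broker = ToyBroker()
for line in ["GET /home 200", "GET /cart 500", "GET /home 200"]:
    broker.publish("weblogs", line)

# Two independent consumers read the same stream for different purposes.
archived = broker.poll("weblogs", "archiver")
alerts = [m for m in broker.poll("weblogs", "alerter") if m.endswith("500")]

# A later message is delivered only once to a consumer that is already caught up.
broker.publish("weblogs", "GET /pay 500")
late = broker.poll("weblogs", "archiver")
```
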
  22. Stream data using Logstash

  23. Data Storage Data Format

  24. How to store your data?

    • Files in a distributed file system
    • Rows in a NoSQL table
    • Index in a search engine
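
To make the three storage options concrete, here is a minimal sketch that stores the same records in each shape, using plain Python structures as stand-ins for a distributed file system, a NoSQL table and a search engine. The sample records and the `search` helper are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical sample records.
movies = [
    {"id": "m1", "title": "Big Data Journey", "year": 2015},
    {"id": "m2", "title": "Journey to the Data Lake", "year": 2014},
]

# 1. Files: one JSON document per line, as it might be written to a file.
file_lines = [json.dumps(m, sort_keys=True) for m in movies]

# 2. Rows in a NoSQL table: row key -> columns, optimized for lookup by key.
table = {m["id"]: {"title": m["title"], "year": m["year"]} for m in movies}

# 3. Index in a search engine: term -> ids of the documents containing it.
index = defaultdict(set)
for m in movies:
    for term in m["title"].lower().split():
        index[term].add(m["id"])


def search(term):
    """Return the sorted ids of the documents whose title contains the term."""
    return sorted(index.get(term.lower(), set()))


print(table["m1"]["title"])
print(search("journey"))
```

The file form suits bulk scans, the table form suits reads by key, and the index form suits lookups by content.
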
  25. Process Data

  26. Data Processing

    • Transform the data
    • Enrich the data
    • Examples:
      • Store data in multiple formats
      • Aggregate data
      • Build Recommendations
      • …
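
The transform, enrich and aggregate steps listed above can be sketched in a few lines. The log format, the `ip_to_country` lookup table and the function names are assumptions made for this illustration only.

```python
from collections import Counter

# Hypothetical raw input: date, client IP, path, HTTP status.
raw_lines = [
    "2015-09-10 10.0.0.1 /home 200",
    "2015-09-10 10.0.0.2 /cart 500",
    "2015-09-10 10.0.0.1 /cart 200",
]

# Hypothetical reference data used for enrichment.
ip_to_country = {"10.0.0.1": "FR", "10.0.0.2": "US"}


def transform(line):
    """Parse a raw text line into a structured event."""
    day, ip, path, status = line.split()
    return {"day": day, "ip": ip, "path": path, "status": int(status)}


def enrich(event):
    """Add a country field looked up from the client IP."""
    return {**event, "country": ip_to_country.get(event["ip"], "unknown")}


events = [enrich(transform(line)) for line in raw_lines]

# Aggregate: hits per country, and server errors per path.
hits_per_country = Counter(e["country"] for e in events)
errors_per_path = Counter(e["path"] for e in events if e["status"] >= 500)
```
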
  27. MapReduce Processing Model

    • Define mappers
    • Shuffling is automatic
    • Define reducers
    • For complex work, chain jobs together
      – Use a higher level language or DSL that does this for you
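
The model above can be sketched as a word count in plain Python. This is a single-process illustration, not Hadoop code: `shuffle` is written out here only to show what the framework does automatically between the user-defined mapper and reducer, and all function names are chosen for this example.

```python
from collections import defaultdict


def mapper(line):
    """User-defined: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Done by the framework: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """User-defined: combine all the values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run the map, shuffle and reduce phases over the input lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data journey", "big data platform"])
print(counts)
```

Chaining jobs means feeding the output of one `run_job`-style step into the mapper of the next, which is the bookkeeping a higher level language or DSL takes over.
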
  28. Apache Spark: Fast Big Data

    – Rich APIs in Java, Scala, Python
    – Interactive shell
    • Fast to Run
      – General execution graphs
      – In-memory storage
  29. Spark: Unified Platform

    • Libraries: Spark SQL, Spark Streaming (Streaming), MLlib (Machine learning), GraphX (Graph computation)
    • Core: Spark (General execution engine)
    • Cluster managers: Mesos, Hadoop YARN
    • Storage: Distributed File System (HDFS, MapR-FS, S3, …)
  30. Elasticsearch / Watcher

  31. None
  32. Query the data

  33. (Diagram) Files, HBase, Hive, Index → Discovery/Analytics

  34. SQL strikes back!

  35. SQL on Hadoop (over Files, HBase, Hive)

    • SQL Shell
    • JDBC / ODBC
    • BI Tools
    • Reporting
  36. Elasticsearch

  37. Kibana as a frontend

  38. Example: Recommendation Platform

  39. (Diagram labels) Machine Learning; MapR Cluster (HBase / MapR DB, MapR-FS); Add recommendations to movies; Capture Ratings; Movies & Recommendations; Movie Database
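
The slide does not say which machine learning method the platform uses. As a stand-in, the sketch below uses simple item co-occurrence ("users who liked X also liked Y") to show the flow of capturing ratings and adding recommendations to each movie. The ratings, the `recommend` helper and the in-memory `movie_db` dictionary are all hypothetical; in the architecture shown, the movie records would live in HBase or MapR DB.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical captured ratings: user -> movies that user rated positively.
liked = {
    "ann": {"m1", "m2", "m3"},
    "bob": {"m1", "m2"},
    "cat": {"m2", "m4"},
}

# Count how often each ordered pair of movies is liked by the same user.
cooc = defaultdict(lambda: defaultdict(int))
for movies in liked.values():
    for a, b in permutations(sorted(movies), 2):
        cooc[a][b] += 1


def recommend(movie, k=2):
    """Return up to k movies most often liked together with `movie`."""
    neighbours = cooc.get(movie, {})
    # Highest count first; ties are broken by movie id for a stable order.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [m for m, _ in ranked[:k]]


# "Add recommendations to movies": attach the result to each movie record.
movie_db = {m: {"recommendations": recommend(m)} for m in sorted(cooc)}
```
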
  40. Conclusion

    • If possible use Streams: Kafka, Logstash
    • Advanced Data Processing and Machine Learning: Spark
    • Expose your data using SQL for your “BI folks”: Drill
    • Aggregation and Full Text Search: Elasticsearch
    • Data Visualisation: Kibana
  41. © 2015 MapR Technologies. Big Data Journey. Tug Grall, tug@mapr.com, @tgrall. David Pilato, david@elastic.co, @dadoonet