Big Data Journey

Tugdual Grall
September 18, 2015

Generic introduction to Big Data and key components/architectures.

This presentation was delivered at JUG Summer Camp 2015 in La Rochelle, France.

Transcript

  1. © 2015 MapR Technologies
    Big Data Journey with Hadoop & MapR
    Tug Grall
    [email protected]
    @tgrall

  2. YARN

  3. Big Data Journey
    Tug Grall
    [email protected]
    @tgrall
    David Pilato
    [email protected]
    @dadoonet

  4. YARN

  5. WHY?

  6. https://www.domo.com/

  7. Building new applications

  9. Can I use my existing tools?

  10. (Big) Data Platform
    (Big) Data Project

  11. Ingest
    Store
    Process
    Consume

  12. Ingest Data

  13. Copy files into HDFS
    hadoop fs -put dailylogs-log.zip /logs/2015/09/10/

  14. Import RDBMS data
    sqoop import --connect jdbc:mysql://db.foo.com/somedb \
        --table customers --target-dir /incremental_dataset --append
    Files
    HBase
    Hive

  15. Import RDBMS data
    input {
      jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        jdbc_user => "postgres"
        jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT * from contacts"
      }
    }

  16. What’s “wrong”?
    Batch????

  17. Streaming
    Flume, Kafka, Logstash
    to the rescue

  18. Log
    App Events
    Twitter
    Sensors

    HDFS
    MapR-FS
    Alerts
    Elasticsearch

    DB

  19. Log
    App Events
    Twitter
    Sensors

    HDFS
    MapR-FS
    Alerts
    Elasticsearch

    DB
    Broker
    Producers Consumers

  20. Stream data into Hadoop using Flume
    Server
    Files
    HBase
    Hive
    Server
    Server
    Server

  21. Streams using Kafka
    Files
    HBase
    Hive
    Producer
    Producer
    Producer
    Consumer
    Consumer
    Consumer
    Alert
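
The producer/broker/consumer layout above can be sketched in a few lines of Python. This is a toy in-memory stand-in, not the Kafka client API: the `ToyBroker` class, the topic name `logs`, and the consumer group names are all hypothetical. It only illustrates one idea, that producers append to a shared log while each consumer group keeps its own read position, so one stream can feed both storage and alerting.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        # Each topic is an append-only list of messages.
        self._topics = defaultdict(list)
        # Each (consumer group, topic) pair remembers how far it has read.
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, group, topic):
        """Return the messages this group has not seen yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(group, topic)]
        self._offsets[(group, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
for line in ["GET /index 200", "GET /cart 500", "GET /home 200"]:
    broker.publish("logs", line)

# Two independent consumers read the same stream for different purposes.
archived = broker.poll("archiver", "logs")
alerts = [m for m in broker.poll("alerting", "logs") if m.endswith(" 500")]
```

Because offsets are tracked per group, adding a new consumer later does not affect the existing ones.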

  22. Stream data using Logstash

  23. Data Storage
    Data Format

  24. How to store your data?
    • Files in a distributed file system
    • Rows in NoSQL Table
    • Index in Search Engine
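
To make the three options concrete, here is a minimal sketch that stores the same two events in each shape using plain Python structures. The sample events and field names are invented for illustration, and the dict and inverted index only mimic what a distributed file system, a NoSQL table, or a search engine would hold.

```python
import json
from collections import defaultdict

# Hypothetical sample events.
events = [
    {"id": "e1", "user": "alice", "msg": "login failed"},
    {"id": "e2", "user": "bob", "msg": "login ok"},
]

# 1. File: one JSON document per line, as it could be written to a file system.
file_contents = "\n".join(json.dumps(e, sort_keys=True) for e in events)

# 2. NoSQL table: rows addressed by a row key, here the event id.
table = {e["id"]: {"user": e["user"], "msg": e["msg"]} for e in events}

# 3. Search index: an inverted index from each word to the ids that contain it.
index = defaultdict(set)
for e in events:
    for word in e["msg"].split():
        index[word].add(e["id"])
```

The file form suits full scans, the table form suits lookups by key, and the index form suits searching by content.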

  25. Process Data

  26. Data Processing
    • Transform the data
    • Enrich the data
    • Examples:
      – Store data in multiple formats
      – Aggregate data
      – Build recommendations
      – …
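
As a small illustration of transform, enrich, and aggregate, the sketch below parses raw log lines, adds a country field, and counts hits per country. The log layout and the `ip_to_country` lookup table are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical raw input: date, client IP, path, HTTP status.
raw_lines = [
    "2015-09-10 10.0.0.1 /index 200",
    "2015-09-10 10.0.0.2 /cart 500",
    "2015-09-10 10.0.0.1 /cart 200",
]

# Hypothetical lookup table used for enrichment.
ip_to_country = {"10.0.0.1": "FR", "10.0.0.2": "US"}


def transform(line):
    """Parse a raw log line into a structured record."""
    date, ip, path, status = line.split()
    return {"date": date, "ip": ip, "path": path, "status": int(status)}


def enrich(record):
    """Add a country field taken from the lookup table."""
    return {**record, "country": ip_to_country.get(record["ip"], "unknown")}


records = [enrich(transform(line)) for line in raw_lines]

# Aggregate: count hits per country.
hits_per_country = Counter(r["country"] for r in records)
```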

  27. MapReduce Processing Model
    • Define mappers
    • Shuffling is automatic
    • Define reducers
    • For complex work, chain jobs together
    – Use a higher level language or DSL that does this for you
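
The model above can be sketched as a word count in plain Python. This runs in a single process and is not the Hadoop API; the function names are illustrative. The `shuffle` step is written out only to show what a framework does automatically between the map and reduce phases.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key; a framework performs this step for you."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Combine all the values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce over the input lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["big data", "Big Data journey"])
```

Chaining jobs means feeding the output of one `run_job` call into the mapper of the next, which is the bookkeeping a higher-level language or DSL takes over.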

  28. Apache Spark: Fast Big Data
    – Rich APIs in Java, Scala, Python
    – Interactive shell
    • Fast to Run
    – General execution graphs
    – In-memory storage

  29. Spark: Unified Platform
    Spark SQL
    Spark Streaming (Streaming)
    MLlib (Machine learning)
    Spark (General execution engine)
    GraphX (Graph computation)
    Mesos
    Distributed File System (HDFS, MapR-FS, S3, …)
    Hadoop YARN

  30. Elasticsearch / Watcher

  32. Query the data

  33. Files
    HBase
    Hive
    Index
    Discovery/Analytics

  34. SQL strikes back!

  35. Files
    HBase
    Hive
    SQL on Hadoop
    • SQL Shell
    • JDBC / ODBC
    • BI Tools
    • Reporting

  36. Elasticsearch

  37. Kibana as a frontend

  38. Example: Recommendation Platform

  39. Machine Learning
    MapR Cluster
    HBase

    MapR DB
    MapR-FS
    Add recommendations to movies
    Capture Ratings
    Movies & Recommendations
    Movie Database

  40. Conclusion
    • If possible, use streams: Kafka, Logstash
    • Advanced data processing and machine learning: Spark
    • Expose your data using SQL for your “BI folks”: Drill
    • Aggregation and full-text search: Elasticsearch
    • Data visualisation: Kibana

  41. Big Data Journey
    Tug Grall
    [email protected]
    @tgrall
    David Pilato
    [email protected]
    @dadoonet
