Big Data Journey

Tugdual Grall
September 18, 2015

Generic introduction to Big Data and key components/architectures.

This presentation was delivered at JUG Summer Camp 2015 in La Rochelle (France).

  1. Import RDBMS data

    sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers \
      --target-dir /incremental_dataset --append

    Files, HBase, Hive
  2. Import RDBMS data

    input {
      jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        jdbc_user => "postgres"
        jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT * from contacts"
      }
    }
  3. How to store your data?

    • Files in a distributed file system
    • Rows in NoSQL Table
    • Index in Search Engine
  4. Data Processing

    • Transform the data
    • Enrich the data
    • Examples:
      • Store data in multiple formats
      • Aggregate data
      • Build Recommendations
      • …
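The transform, enrich, and aggregate steps listed on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the rating events, the `movie_titles` lookup table, and the function names are all hypothetical and do not come from the presentation.

```python
from collections import defaultdict

# Hypothetical raw events and reference data, made up for illustration.
raw_ratings = [
    {"user": "u1", "movie_id": "m1", "rating": "4"},
    {"user": "u2", "movie_id": "m1", "rating": "5"},
    {"user": "u1", "movie_id": "m2", "rating": "3"},
]
movie_titles = {"m1": "Movie One", "m2": "Movie Two"}


def transform(event):
    """Transform: parse the rating string into an integer."""
    return {**event, "rating": int(event["rating"])}


def enrich(event):
    """Enrich: attach the movie title from the reference data."""
    return {**event, "title": movie_titles.get(event["movie_id"], "unknown")}


def aggregate(events):
    """Aggregate: compute the average rating per movie title."""
    totals = defaultdict(lambda: [0, 0])  # title -> [sum of ratings, count]
    for event in events:
        totals[event["title"]][0] += event["rating"]
        totals[event["title"]][1] += 1
    return {title: total / count for title, (total, count) in totals.items()}


processed = [enrich(transform(event)) for event in raw_ratings]
average_by_title = aggregate(processed)
print(average_by_title)
```

On a real cluster the same three steps would run in a distributed engine rather than in a single Python process; the shape of the pipeline is what matters here.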
  5. MapReduce Processing Model

    • Define mappers
    • Shuffling is automatic
    • Define reducers
    • For complex work, chain jobs together
      – Use a higher level language or DSL that does this for you
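The processing model on this slide can be illustrated with a small in-memory simulation. This is a minimal sketch, not Hadoop's API: `run_job` and the mapper and reducer functions are hypothetical names, and the shuffle that a real framework performs automatically across the cluster is replaced here by a dictionary grouping. The second job consumes the output of the first, which is what chaining jobs means.

```python
from collections import defaultdict
from itertools import chain


def run_job(records, mapper, reducer):
    """Run one MapReduce job in memory: map, shuffle, then reduce."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    pairs = chain.from_iterable(mapper(record) for record in records)

    # Shuffle phase: group values by key. A real framework does this
    # automatically between the mappers and the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: each key with its grouped values yields output records.
    return list(chain.from_iterable(
        reducer(key, values) for key, values in sorted(groups.items())
    ))


# Job 1: word count.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, ones):
    yield word, sum(ones)


# Job 2: group words by how often they occur (consumes job 1's output).
def key_by_count(word_and_count):
    word, count = word_and_count
    yield count, word


def collect_words(count, words):
    yield count, sorted(words)


lines = ["big data journey", "big data processing", "data"]
word_counts = run_job(lines, split_words, sum_counts)
words_by_count = run_job(word_counts, key_by_count, collect_words)

print(word_counts)
print(words_by_count)
```

Only the mappers and reducers are user code; the grouping in the middle is generic. Writing the chaining by hand, as in the last lines, is the part a higher level language or DSL would take over.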
  6. Apache Spark: Fast Big Data

    – Rich APIs in Java, Scala, Python
    – Interactive shell
    • Fast to Run
      – General execution graphs
      – In-memory storage
  7. Spark: Unified Platform

    • Spark SQL
    • Spark Streaming (Streaming)
    • MLlib (Machine learning)
    • GraphX (Graph computation)
    • Spark (General execution engine)
    • Mesos, Hadoop YARN
    • Distributed File System (HDFS, MapR-FS, S3, …)
  8. SQL on Hadoop

    Files, HBase, Hive
    • SQL Shell
    • JDBC, ODBC
    • BI Tools
    • Reporting
  9. Machine Learning

    [Diagram: MapR Cluster, HBase, MapR DB, MapR-FS; Add recommendations to movies; Capture Ratings; Movies & Recommendations; Movie Database]
  10. Conclusion

    • If possible use Streams: Kafka, Logstash
    • Advanced Data Processing and Machine Learning: Spark
    • Expose your data using SQL for your “BI folks”: Drill
    • Aggregation and Full Text Search: Elasticsearch
    • Data Visualisation: Kibana