Big Data Journey

Tugdual Grall
September 18, 2015

Generic introduction to Big Data and key components/architectures.

This presentation was delivered during the JUG Summer Camp 2015 in La Rochelle (France)

Transcript

  1. Import RDBMS data

    sqoop import --connect jdbc:mysql://db.foo.com/somedb --table customers \
        --target-dir /incremental_dataset --append

    (diagram: Files, HBase, Hive)
  2. Import RDBMS data

    input {
      jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        jdbc_user => "postgres"
        jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        statement => "SELECT * from contacts"
      }
    }
  3. How to store your data?

    • Files in a distributed file system
    • Rows in NoSQL Table
    • Index in Search Engine
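
The three storage options on this slide can be compared by storing the same records three ways. The following is a minimal in-memory sketch, not a real file system, NoSQL store, or search engine; the sample records and field names are invented for illustration.

```python
import json
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
contacts = [
    {"id": "c1", "name": "Ada Lovelace", "city": "London"},
    {"id": "c2", "name": "Alan Turing", "city": "London"},
]

# 1. File: one JSON document per line, the shape often written to a
#    distributed file system for later batch processing.
file_lines = [json.dumps(c, sort_keys=True) for c in contacts]

# 2. NoSQL-style table: each row is addressed directly by its key.
table = {c["id"]: {"name": c["name"], "city": c["city"]} for c in contacts}

# 3. Search-style inverted index: each lowercased token maps to the ids
#    of the records that contain it.
index = defaultdict(set)
for c in contacts:
    for token in (c["name"] + " " + c["city"]).lower().split():
        index[token].add(c["id"])

print(table["c1"]["name"])      # key lookup
print(sorted(index["london"]))  # token lookup
```

The file form suits full scans, the table form suits lookups by key, and the index form suits lookups by content, which is why the same data is often kept in more than one of them.
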
  4. Data Processing

    • Transform the data
    • Enrich the data
    • Examples:
      • Store data in multiple formats
      • Aggregate data
      • Build Recommendations
      • …
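
As a rough illustration of the transform, enrich, and aggregate steps listed on this slide, here is a small pure-Python sketch. The events, the lookup table, and the function names are all hypothetical; a real pipeline would run these steps on a distributed engine.

```python
from collections import Counter

# Hypothetical raw events and lookup table, invented for illustration.
raw_events = [
    {"user": " Alice ", "movie_id": 1, "rating": "4"},
    {"user": "bob", "movie_id": 1, "rating": "5"},
    {"user": "alice", "movie_id": 2, "rating": "3"},
]
movie_titles = {1: "Movie A", 2: "Movie B"}


def transform(event):
    """Normalize the user name and parse the rating into an int."""
    return {
        "user": event["user"].strip().lower(),
        "movie_id": event["movie_id"],
        "rating": int(event["rating"]),
    }


def enrich(event):
    """Attach the movie title from the lookup table."""
    title = movie_titles.get(event["movie_id"], "unknown")
    return {**event, "title": title}


events = [enrich(transform(e)) for e in raw_events]

# Aggregate: average rating per title.
totals = Counter()
counts = Counter()
for e in events:
    totals[e["title"]] += e["rating"]
    counts[e["title"]] += 1
averages = {title: totals[title] / counts[title] for title in counts}

print(averages)
```
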
  5. MapReduce Processing Model

    • Define mappers
    • Shuffling is automatic
    • Define reducers
    • For complex work, chain jobs together
      – Use a higher level language or DSL that does this for you
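
The model on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the idea, not Hadoop's API: `run_job`, `shuffle`, and the mapper and reducer names are hypothetical helpers. The user supplies the mappers and reducers, the shuffle groups values by key on their behalf, and a second job consumes the output of the first to show chaining.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group mapped values by key; the framework does this automatically."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(records, mapper, reducer):
    """Run one map, shuffle, reduce pass over the records."""
    mapped = (pair for record in records for pair in mapper(record))
    grouped = shuffle(mapped)
    return [reducer(key, values) for key, values in sorted(grouped.items())]


# Job 1: word count.
def count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, ones):
    return word, sum(ones)


# Job 2: group words by how often they occur, fed by job 1's output.
def invert_mapper(word_and_count):
    word, count = word_and_count
    yield count, word


def invert_reducer(count, words):
    return count, sorted(words)


lines = ["big data journey", "big data tools"]
word_counts = run_job(lines, count_mapper, count_reducer)
words_by_count = run_job(word_counts, invert_mapper, invert_reducer)

print(word_counts)
print(words_by_count)
```

Wiring the second job by hand is the kind of plumbing that a higher level language or DSL takes over when many jobs have to be chained.
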
  6. Apache Spark: Fast Big Data

      – Rich APIs in Java, Scala, Python
      – Interactive shell
    • Fast to Run
      – General execution graphs
      – In-memory storage
  7. Spark: Unified Platform

    • Spark SQL
    • Spark Streaming (Streaming)
    • MLlib (Machine learning)
    • Spark (General execution engine)
    • GraphX (Graph computation)
    • Mesos
    • Distributed File System (HDFS, MapR-FS, S3, …)
    • Hadoop YARN
  8. SQL on Hadoop (Files, HBase, Hive)

    • SQL Shell
    • JDBC, ODBC
    • BI Tools
    • Reporting
  9. Machine Learning

    • MapR Cluster: HBase, MapR DB, MapR-FS
    • Capture Ratings
    • Add recommendations to movies
    • Movies & Recommendations
    • Movie Database
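
The slide does not say which algorithm produces the recommendations. As a stand-in, the sketch below uses a deliberately simple co-occurrence count (movies liked by the same users) to show the flow of capturing ratings and adding recommendations to each movie record. The ratings, the `LIKED` threshold, and the `recommend` helper are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical captured ratings: (user, movie, rating on a 1-5 scale).
ratings = [
    ("u1", "Movie A", 5), ("u1", "Movie B", 4),
    ("u2", "Movie A", 4), ("u2", "Movie B", 5), ("u2", "Movie C", 2),
    ("u3", "Movie A", 5), ("u3", "Movie C", 4),
]
LIKED = 4  # assumed threshold: a rating at or above this counts as a "like"

# Collect the set of liked movies per user.
liked_by_user = defaultdict(set)
for user, movie, rating in ratings:
    if rating >= LIKED:
        liked_by_user[user].add(movie)

# Count how often two movies are liked by the same user.
co_likes = defaultdict(lambda: defaultdict(int))
for movies in liked_by_user.values():
    for a, b in combinations(sorted(movies), 2):
        co_likes[a][b] += 1
        co_likes[b][a] += 1


def recommend(movie, top_n=2):
    """Return the movies most often liked together with `movie`."""
    scores = co_likes.get(movie, {})
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]


# Add the recommendations to each movie record, as in the slide's diagram.
movie_db = {
    title: {"recommendations": recommend(title)}
    for title in ["Movie A", "Movie B", "Movie C"]
}
print(movie_db["Movie A"])
```
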
  10. Conclusion

    • If possible use Streams: Kafka, Logstash
    • Advanced Data Processing and Machine Learning: Spark
    • Expose your data using SQL for your “BI folks”: Drill
    • Aggregation and Full Text Search: Elasticsearch
    • Data Visualisation: Kibana