Strata Hadoop 2016 at San Jose

I attended Strata Hadoop in San Jose, and this deck is my look back at the event.

Toyama Hiroshi

April 13, 2016

Transcript

  1. Strata Hadoop 2016

  2. What is Strata Hadoop?
    Presented by O’Reilly and Cloudera.
    Putting cutting-edge big data,
    cutting-edge data science,
    and new business fundamentals to work.

  3. Agenda
    1. Technology
       Apache Spark
       Kafka
       Apache Flink
    2. Topic
       SQL-on-Hadoop
       Data Lake
       AI (Artificial Intelligence)
       Workflow Engine
    3. Expo

  4. Technology

  6. What is Spark?
    fast and general engine for large-scale data processing
    Write applications quickly in Java, Scala, Python, R
    Combine SQL, streaming, and complex analytics.
    Spark runs on Hadoop, Mesos, standalone, or in the cloud.
    It can access diverse data sources including HDFS,
    Cassandra, HBase, and S3.

  7. Spark Includes
    Spark SQL
    Spark Streaming
    MLlib (machine learning)
    GraphX (graph processing)

  8. Spark Community
    wide set of developers from over 200 companies.
    NTT Data
    IBM
    more than 800 developers.
    Among the top three Apache projects.

  10. What is Kafka?
    distributed publish-subscribe system.
    developed by the Apache Software Foundation.
    originally developed by LinkedIn.
    written in Scala.
    low-latency platform for handling real-time data feeds
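
The publish-subscribe idea can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not the Kafka API: the `TinyBroker` class, its `publish` and `poll` methods, and the topic and group names are all hypothetical. It shows the core idea of an append-only log per topic, with each consumer group tracking its own read offset.

```python
from collections import defaultdict


class TinyBroker:
    """In-memory stand-in for a publish-subscribe broker (not the Kafka API)."""

    def __init__(self):
        # Each topic is an append-only list of messages.
        self.topics = defaultdict(list)
        # Each (group, topic) pair remembers the next offset to read.
        self.offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def poll(self, group, topic):
        """Return messages this group has not seen yet, then advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(group, topic)]
        self.offsets[(group, topic)] = len(log)
        return log[start:]


broker = TinyBroker()
broker.publish("clicks", "user1:/home")
broker.publish("clicks", "user2:/cart")
first_batch = broker.poll("analytics", "clicks")   # both messages so far
broker.publish("clicks", "user1:/checkout")
second_batch = broker.poll("analytics", "clicks")  # only the new message
audit_batch = broker.poll("audit", "clicks")       # a new group reads from the start
```

Because offsets are kept per group, the hypothetical "audit" group can join late and still read the whole log, while "analytics" only sees what is new since its last poll.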

  12. Kafka Users
    LinkedIn
    Netflix
    PayPal
    Spotify
    Tuenti
    Uber

  14. Apache Flink
    Apache Top Level Project.
    open source platform for distributed stream and batch
    data processing.
    Flink’s core is a streaming dataflow engine that provides
    data distribution, communication, and fault tolerance for
    distributed computations over data streams.

  15. Apache Flink rivals
    Norikra
    Kinesis Analytics
    PipelineDB
    Spark Streaming

  16. SQL Over Streaming
    Sounds good!

  17. Apache Flink API
    1. DataStream API for unbounded streams, embedded in Java and Scala.
    2. DataSet API for static data, embedded in Java, Scala, and Python.
    3. Table API with a SQL-like expression language, embedded in Java and Scala.
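
To make the difference between unbounded streams and static data concrete, here is a rough sketch in plain Python. It does not use Flink or imitate its API: `tumbling_window_counts` and `endless_events` are hypothetical helpers, and the 10-unit and 5-unit window sizes are arbitrary. The same windowed count runs over a bounded list (the batch case) and over an endless generator (the streaming case), emitting each window as soon as it closes.

```python
from collections import Counter
from itertools import islice


def tumbling_window_counts(events, window_size):
    """Yield (window_start, Counter) for each fixed-size time window.

    `events` is any iterable of (timestamp, key) pairs in timestamp order.
    It may be unbounded, because results are emitted window by window.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_size)
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Only reached for bounded input: flush the last window.
        yield current_start, counts


# Batch case: a static, bounded data set.
bounded = [(0, "a"), (3, "b"), (4, "a"), (10, "a"), (12, "b"), (21, "a")]
batch_result = list(tumbling_window_counts(bounded, window_size=10))


def endless_events():
    """Hypothetical unbounded source: one "tick" event per time unit."""
    t = 0
    while True:
        yield t, "tick"
        t += 1


# Streaming case: take the first two completed windows, then stop reading.
stream_result = list(islice(tumbling_window_counts(endless_events(), 5), 2))
```

The batch run can wait for the end of input and flush the final window. The streaming run never reaches an end, so it relies on the arrival of a later event to close each window.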

  18. Ice Break

  27. Tired yet?
    Okay...
    Let's get to the main feature!

  28. Topic

  29. SQL-on-Hadoop

  30. SQL-on-Hadoop
    Work with data in Hadoop using SQL.
    SQL-on-Hadoop has become a “must have” in the Hadoop
    toolkit and has entered the mainstream.
    Within the big data landscape there are multiple
    approaches to accessing, analyzing, and manipulating
    data in Hadoop.
    Points to compare: ANSI SQL completeness (and the ability
    to tolerate machine-generated SQL), developer and analyst
    skillsets, and architecture tradeoffs.
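
As a rough illustration of why SQL access is attractive, the sketch below computes the same per-city total twice: once with hand-written map and reduce steps, and once as a single SQL statement. It uses Python's built-in `sqlite3` purely as a stand-in query engine, not Hadoop or any of the SQL-on-Hadoop engines; the `sales` table and its rows are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (city, amount) pairs.
rows = [("tokyo", 120), ("san_jose", 80), ("tokyo", 30), ("san_jose", 20)]

# Hand-written version: emit (city, amount) pairs and sum them per key.
grouped = defaultdict(int)
for city, amount in rows:      # "map": one (key, value) pair per row
    grouped[city] += amount    # "reduce": sum the values for each key
manual_totals = dict(grouped)

# Declarative version: the same aggregation stated in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT city, SUM(amount) FROM sales GROUP BY city")
)
conn.close()
```

Both versions produce the same totals, but the SQL form says what to compute and leaves the execution plan to the engine, which is the skill set most analysts already have.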

  31. SQL-on-Hadoop
    Spark SQL
    Impala
    Hive
    Apache Drill
    Presto

  32. Data Lake

  33. What is Data Lake?
    A large-scale storage repository.
    A data lake provides massive storage for any kind of data,
    enormous processing power, and the ability to handle
    virtually limitless concurrent tasks or jobs.
    Includes both unstructured and structured data.

  34. Data Lake levels
    Data Swamps
      Raw data
      Can’t find or use data
    Data Puddles
      Low variety of data and low adoption
      Strong technical skill set requirement
    Data Lake
      Right interface

  35. AI

  36. Artificial Intelligence
    Imitates human feelings and perception.
    A machine learns from audio, image, and text data
    and makes inferences automatically.
    Achieved with Deep Learning.

  37. AI Implementations
    TensorFlow
    AlphaGo
    Siri
    Google Now

  38. Microsoft Tay is funny
    Designed to imitate informal conversation among young people.
    On Twitter, it repeated incendiary and racist remarks.
    It spewed rants about things like drug use and got into trouble.
    http://gigazine.net/news/20160329-reason-microsoft-tay-crazy/

  39. Workflow Engine

  40. What is a Workflow Engine?
    A software application that manages business processes.
    Manages and monitors the state of activities in a
    workflow.
    Can execute any arbitrary sequence of steps.
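
A toy version of these ideas fits in a short Python sketch. It is not how Airflow, Luigi, or any other real engine is implemented: `run_workflow`, the task names, and the state labels are hypothetical. It runs steps in dependency order using the standard library's `graphlib.TopologicalSorter` (Python 3.9+), records a state for every task, and skips tasks whose upstream steps did not succeed.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run tasks in dependency order and record each task's final state.

    tasks: dict of task name -> zero-argument callable
    dependencies: dict of task name -> set of task names it depends on
    """
    # Make sure tasks without declared dependencies are part of the graph.
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    state = {name: "pending" for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        if any(state[dep] != "success" for dep in graph[name]):
            # An upstream task failed or was skipped, so do not run this one.
            state[name] = "skipped"
            continue
        try:
            tasks[name]()
            state[name] = "success"
        except Exception:
            state[name] = "failed"
    return state


log = []


def failing_load():
    """Hypothetical step that always fails, to show state tracking."""
    raise RuntimeError("load failed")


tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": failing_load,
    "report": lambda: log.append("report"),
}
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
final_state = run_workflow(tasks, dependencies)
```

Real engines add scheduling, retries, persistence, and a UI on top, but the core loop of ordering steps and tracking their state is the same idea.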

  41. Workflow Engine
    Airflow
      developed by Airbnb
    Luigi
      developed by Spotify
    Azkaban
    Rundeck
    Digdag
      developed by Treasure Data

  42. EXPO

  56. Summary
    Data lakes are a hot topic in the United States.
    SQL-on-Hadoop has become a must-have.
    AI and Deep Learning are going strong.
    Kafka, Spark, and Flink are going strong.

  57. Thanks!
