Strata Hadoop 2016 at San Jose

I attended Strata Hadoop in San Jose; this is my look back at the event.

Toyama Hiroshi

April 13, 2016

Transcript

  1. Strata Hadoop 2016

  2. What is Strata Hadoop?
    Presented by O’Reilly and Cloudera. Cutting-edge big data, cutting-edge data science, and new business fundamentals put to work.
  3. Agenda
    1. Technology: Apache Spark, Kafka, Apache Flink
    2. Topic: SQL-on-Hadoop, Data Lake, AI (Artificial Intelligence), Workflow Engine
    3. Expo
  4. Technology

  5. None
  6. What is Spark?
    A fast and general engine for large-scale data processing. Write applications quickly in Java, Scala, Python, or R. Combine SQL, streaming, and complex analytics. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
  7. Spark includes: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph processing)

  8. Spark Community
    Built by a wide set of developers from over 200 companies, including NTT Data and IBM. More than 800 developers have contributed. Ranked among the top 3 Apache projects.
  9. None
  10. What is Kafka?
    A distributed publish-subscribe system developed by the Apache Software Foundation and originally developed by LinkedIn. Written in Scala. A low-latency platform for handling real-time data feeds.
  11. None
  12. Kafka users: LinkedIn, Netflix, PayPal, Spotify, Tuenti, Uber
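
The publish-subscribe idea from slide 10 can be sketched in a few lines of plain Python. This is only an in-memory illustration, not the Kafka client API: the `Broker` class, its `publish` and `poll` methods, and the topic and subscriber names are all hypothetical. It assumes each topic is an append-only log and each subscriber keeps its own read position, which is roughly how independent consumers can read the same feed at their own pace.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a publish-subscribe broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only list of messages
        self.offsets = defaultdict(int)  # (subscriber, topic) -> next index to read

    def publish(self, topic, message):
        # Publishers only append; they know nothing about subscribers.
        self.topics[topic].append(message)

    def poll(self, subscriber, topic):
        # Return every message this subscriber has not seen yet, then advance its offset.
        log = self.topics[topic]
        start = self.offsets[(subscriber, topic)]
        self.offsets[(subscriber, topic)] = len(log)
        return log[start:]


broker = Broker()
broker.publish("page_views", {"user": "a", "url": "/home"})
broker.publish("page_views", {"user": "b", "url": "/cart"})
first = broker.poll("analytics", "page_views")   # both messages published so far

broker.publish("page_views", {"user": "a", "url": "/checkout"})
second = broker.poll("analytics", "page_views")  # only the new message
audit = broker.poll("audit", "page_views")       # a late subscriber reads the whole log
```

Because offsets are tracked per subscriber, the late `audit` subscriber still sees all three messages, while `analytics` only receives what is new since its last poll.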

  13. None
  14. Apache Flink
    An Apache Top-Level Project. An open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
  15. Apache Flink rivals: Norikra, Kinesis Analytics, PipelineDB, Spark Streaming

  16. SQL over streaming: sounds good!

  17. Apache Flink APIs
    1. DataStream API for unbounded streams, embedded in Java and Scala
    2. DataSet API for static data, embedded in Java, Scala, and Python
    3. Table API with a SQL-like expression language, embedded in Java and Scala
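
To give a feel for what a computation over an unbounded stream looks like (slides 14 to 17), here is a minimal plain-Python sketch of a tumbling-window count, roughly the streaming counterpart of a SQL `COUNT(*) ... GROUP BY` over fixed time buckets. It does not use the Flink API: the function name `tumbling_window_counts`, the `(timestamp, key)` event shape, and the sample data are assumptions made for illustration, and the events are assumed to arrive in timestamp order.

```python
from collections import Counter


def tumbling_window_counts(events, window_size):
    """Yield (window_start, counts) per fixed-size window.

    `events` is any iterable of (timestamp, key) pairs, assumed to be in
    timestamp order. Results are emitted as soon as a window closes, so the
    input never has to be fully materialized.
    """
    current_window = None
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        if current_window is not None and window_start != current_window:
            # The stream has moved past the current window: emit it and start fresh.
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        # Flush the last open window when the (finite, for this demo) input ends.
        yield current_window, dict(counts)


# Hypothetical sample stream: (timestamp in seconds, event type).
events = [(0, "click"), (3, "view"), (4, "click"), (10, "view"), (12, "view"), (25, "click")]
results = list(tumbling_window_counts(iter(events), window_size=10))
```

With a 10-second window, `results` holds one entry per non-empty window: starts 0, 10, and 20.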
  18. Ice Break

  19. None
  20. None
  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. Getting tired? Okay, let's move on to the main feature!

  28. Topic

  29. SQL-on-Hadoop

  30. SQL-on-Hadoop
    Work with data on Hadoop using SQL. SQL-on-Hadoop has become a “must have” in the Hadoop toolkit and has entered the mainstream. Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Points of comparison include ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.
  31. SQL-on-Hadoop engines: Spark SQL, Impala, Hive, Apache Drill, Presto

  32. Data Lake

  33. What is a Data Lake?
    A large-scale storage repository. A data lake provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. It holds both unstructured and structured data.
  34. Data Lake levels
    Data Swamps: raw data; can’t find or use data.
    Data Puddles: low variety of data and low adoption; strong technical skill set requirement.
    Data Lake: right interface.
  35. AI

  36. Artificial Intelligence
    Technology that imitates human feelings and perception. A machine learns from data such as audio, images, and text, and then makes inferences automatically. It is being adopted together with Deep Learning.
  37. AI implementations: TensorFlow, AlphaGo, Siri, Google Now

  38. Microsoft Tay is funny
    Designed to imitate informal conversation among young people. On Twitter it repeated incendiary and racist remarks and spewed rants about things like taking drugs and getting in trouble. http://gigazine.net/news/20160329-reason-microsoft-tay-crazy/
  39. Workflow Engine

  40. What is a Workflow Engine?
    A software application that manages business processes. It manages and monitors the state of activities in a workflow and can execute any arbitrary sequence of steps.
  41. Workflow engines: Airflow (distributed by Airbnb), Luigi (distributed by Spotify), Azkaban, Rundeck, Digdag (distributed by Treasure Data)
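
What slide 40 describes, running steps in order while tracking the state of each one, can be sketched with the Python standard library (`graphlib`, Python 3.9+). This is not how Airflow, Luigi, or Digdag are implemented or configured: the `run_workflow` function, the state names, and the extract/transform/load tasks are hypothetical, chosen only to show the idea of dependency-ordered execution with state tracking.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task after its dependencies; return (attempted order, final states).

    `tasks` maps a task name to a zero-argument callable.
    `dependencies` maps a task name to the set of task names it depends on.
    A task is skipped when any of its dependencies did not succeed.
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    states = {name: "pending" for name in tasks}
    attempted = []
    for name in TopologicalSorter(graph).static_order():
        if any(states[dep] != "success" for dep in graph[name]):
            states[name] = "skipped"
            continue
        attempted.append(name)
        try:
            tasks[name]()
            states[name] = "success"
        except Exception:
            states[name] = "failed"
    return attempted, states


# Hypothetical three-step pipeline; each task just records that it ran.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}
order, states = run_workflow(tasks, dependencies)
```

Real workflow engines add scheduling, retries, and persistence on top of this; the sketch only covers ordering and per-task state.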
  42. EXPO

  43. None
  44. None
  45. None
  46. None
  47. None
  48. None
  49. None
  50. None
  51. None
  52. None
  53. None
  54. None
  55. None
  56. Summary
    Data lakes are widely discussed in the United States. SQL-on-Hadoop has become a must-have. AI and Deep Learning are going strong. Kafka, Spark, and Flink are going strong.
  57. Thanks!