Strata Hadoop 2016
What is Strata Hadoop?
Presented by O’Reilly and Cloudera.
cutting-edge big data
cutting-edge data science
new business fundamentals to work
What is Spark?
fast and general engine for large-scale data processing
Write applications quickly in Java, Scala, Python, R
Combine SQL, streaming, and complex analytics.
Spark runs on Hadoop, Mesos, standalone, or in the cloud.
It can access diverse data sources including HDFS,
Cassandra, HBase, and S3.
developed by a wide set of developers from over 200 companies.
more than 800 developers have contributed.
Apache Project Top 3.
What is Kafka?
distributed publish-subscribe system.
developed by the Apache Software Foundation.
originally developed by LinkedIn.
written in Scala.
low-latency platform for handling real-time data feeds
Apache Top Level Project.
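The publish-subscribe idea above can be sketched in a few lines of plain Python. This is only a toy, in-memory illustration and not Kafka's API: the class TinyBroker, its publish and poll methods, and the topic and group names are all hypothetical. It shows the two points from the slide: messages are appended to a topic, and each subscribing group reads the same feed at its own pace.

```python
from collections import defaultdict


class TinyBroker:
    """Hypothetical in-memory stand-in for a publish-subscribe log (not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of messages
        self._offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this consumer group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = TinyBroker()
for event in ["click", "view", "click"]:
    broker.publish("page-events", event)

# Two independent groups each see the full feed, at their own pace.
analytics_batch = broker.poll("analytics", "page-events")
billing_batch = broker.poll("billing", "page-events", max_messages=2)
billing_rest = broker.poll("billing", "page-events")
```

Keeping one offset per group is what lets several subscribers share one feed without interfering with each other. A real deployment would add partitions, persistence, and replication, none of which this sketch attempts.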
What is Flink?
open source platform for distributed stream and batch data processing.
Flink’s core is a streaming dataflow engine that provides
data distribution, communication, and fault tolerance for
distributed computations over data streams.
Apache Flink rival
SQL Over Streaming
Apache Flink API
1. DataStream API for unbounded streams embedded in
Java and Scala,
2. DataSet API for static data embedded in Java, Scala, and
Python, and
3. Table API with a SQL-like expression language embedded
in Java and Scala.
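To make the idea of a dataflow over an unbounded stream concrete, here is a minimal sketch using plain Python generators. It does not use Flink or any of its APIs. The functions sensor_readings, tumbling_count_window, and sum_per_key are hypothetical names chosen for illustration, and the count-based window is an assumption picked for simplicity.

```python
import itertools


def sensor_readings():
    """Unbounded source: yields (sensor_id, value) pairs forever."""
    for i in itertools.count():
        yield ("sensor-%d" % (i % 2), i)


def tumbling_count_window(stream, size):
    """Group the stream into consecutive, non-overlapping windows of `size` elements."""
    window = []
    for element in stream:
        window.append(element)
        if len(window) == size:
            yield window
            window = []


def sum_per_key(window):
    """Aggregate one window: total value per sensor id."""
    totals = {}
    for key, value in window:
        totals[key] = totals.get(key, 0) + value
    return totals


# The pipeline is lazy: nothing runs until results are pulled from it.
windows = tumbling_count_window(sensor_readings(), size=4)
results = (sum_per_key(w) for w in windows)
first_two = list(itertools.islice(results, 2))
```

Because the source never ends, the pipeline only ever holds one window in memory and emits a result each time a window closes. That is the property that separates stream processing from a batch job over static data.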
Let's go to the main features!
Control Hadoop with SQL
SQL-on-Hadoop has become a “must have” in the Hadoop
toolkit and has entered the mainstream.
Within the big data landscape there are multiple
approaches to accessing, analyzing, and manipulating
data in Hadoop.
ANSI SQL completeness (and the ability to tolerate
machine-generated SQL), developer and analyst
skillsets, and architecture tradeoffs
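As a rough illustration of why SQL access matters for analyst skillsets, the sketch below runs a standard aggregate query. It uses Python's built-in sqlite3 module purely as a stand-in, so no Hadoop engine is involved. The page_views table and its columns are hypothetical.

```python
import sqlite3

# In-memory database used only as a stand-in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "US", 3), ("u2", "JP", 5), ("u3", "US", 4)],
)

# The analyst only writes SQL; how the data is stored and scanned is hidden.
rows = conn.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY country"
).fetchall()
conn.close()
```

The query text is the part that would carry over to a SQL-on-Hadoop engine, provided that engine supports the same SQL features. That is where ANSI SQL completeness becomes a selection criterion.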
What is a Data Lake?
large-scale storage repository.
A data lake provides massive storage for any kind of data,
enormous processing power, and the ability to handle
virtually limitless concurrent tasks or jobs.
Includes both unstructured and structured data.
Data Lake level
Can’t find or use data
Low variety of data and low adoption
Strong technical skill set requirement
What is AI?
Imitates human feelings and perception.
A machine learns from audio, image, and text data and
makes guesses automatically.
It was introduced together with deep learning.
Microsoft Tay is funny
Designed to imitate the informal conversation of young people.
On Twitter, it repeated incendiary and racist remarks.
It spewed rants about things like drug use and got into trouble.
What is a Workflow?
software application that manages business processes.
manages and monitors the state of activities in a workflow.
can execute any arbitrary sequence of steps.
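The points above can be sketched as a very small workflow runner in plain Python. It is not the API of any particular workflow tool. The function run_workflow, the task names, and the state labels are hypothetical, and ordering steps with the standard library's graphlib module (Python 3.9+) is an assumption made for brevity.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task after its dependencies and record the state of every step.

    tasks: dict of task name -> zero-argument callable
    dependencies: dict of task name -> set of task names that must finish first
    """
    state = {name: "pending" for name in tasks}
    graph = {name: dependencies.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        # Only run a step if everything upstream of it succeeded.
        if not all(state[dep] == "success" for dep in graph[name]):
            state[name] = "skipped"
            continue
        try:
            tasks[name]()
            state[name] = "success"
        except Exception:
            state[name] = "failed"
    return state


executed = []


def fail_load():
    raise RuntimeError("load failed")


tasks = {
    "extract": lambda: executed.append("extract"),
    "transform": lambda: executed.append("transform"),
    "load": fail_load,
    "report": lambda: executed.append("report"),
}
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
final_state = run_workflow(tasks, dependencies)
```

The state dictionary is the monitoring part: after a run it shows which step failed and which downstream steps were skipped as a result. Production tools add scheduling, retries, and persistence on top of this basic loop.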
distributed by Airbnb
distributed by Spotify
distributed by Treasure Data
Data lakes are widely discussed in the United States.
SQL-on-Hadoop has become a must have.
AI and deep learning are going well.
Kafka, Spark, and Flink are going well.