spbdsm#1

Apache Spark as SQL Engine
Data Engineering Approach
Dmitry Timofeev, Data Analyst, Wrike Inc.

Wrike is a collaborative task and project management platform
wrike.com

What is Apache Spark?
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, Python, R.
• Combine SQL, streaming, and complex analytics.
• Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
Apache Spark™ is a fast, general-purpose, in-memory engine for large-scale data processing.

Where did it come from? Original white papers
• "Spark: Cluster Computing with Working Sets" by Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. University of California, Berkeley
• "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. University of California, Berkeley

A few words about data analysts
Or: why don't they want to write code and just want to query, query, query?
• We know SQL
• We love ETL

Spark SQL
Spark SQL is Spark's module for working with structured data.
• DataFrames: seamlessly mix SQL queries with Spark programs;
• Connect to any data source the same way: Hive, Avro, Parquet, JSON, and JDBC;
• Server mode: connect to Spark SQL with your favorite DB client over JDBC.

Spark SQL Distributed SQL Engine. Integration with BI tools

Spark SQL Distributed SQL Engine and my favorite DB tool

Spark SQL Data sources
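A minimal PySpark sketch of what "connect to any data source the same way" can look like, assuming Spark 2.0 or later and an already created `SparkSession`. The helper names `jdbc_options` and `load_sources`, and every path, URL, and credential passed to them, are hypothetical; only the `spark.read.format(...)` reader calls and the JDBC option keys (`url`, `dbtable`, `user`, `password`) come from Spark itself.

```python
from typing import Dict


def jdbc_options(url: str, table: str, user: str, password: str) -> Dict[str, str]:
    """Collect the standard Spark JDBC reader options in one dict."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def load_sources(spark, json_path: str, parquet_path: str, jdbc: Dict[str, str]):
    """Read three different sources through the same DataFrameReader API.

    `spark` is an existing SparkSession; paths and options are placeholders.
    """
    # Same pattern every time: pick a format, pass options, call load().
    logs = spark.read.format("json").load(json_path)
    events = spark.read.format("parquet").load(parquet_path)
    accounts = spark.read.format("jdbc").options(**jdbc).load()
    return logs, events, accounts
```

Each call returns a DataFrame, so the three results can be filtered, joined, or registered as SQL views in the same way regardless of where the data lives. The JDBC read needs the matching database driver on the Spark classpath, and Avro typically needs the separate spark-avro package.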

Spark SQL Mix SQL queries with Spark programs
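One way this mixing might look in PySpark, as a sketch assuming Spark 2.0 or later (the `SparkSession` API). The `tasks` table, its columns, the sample rows, and the function names are invented for illustration and are not taken from the talk.

```python
def top_assignees_query(table: str, limit: int) -> str:
    """Build the SQL half of the pipeline as a plain string."""
    return (
        "SELECT assignee, COUNT(*) AS open_tasks "
        f"FROM {table} WHERE status = 'open' "
        f"GROUP BY assignee ORDER BY open_tasks DESC LIMIT {limit}"
    )


def main() -> None:
    # Imported lazily so the query builder above works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[*]").appName("mix-sql-demo").getOrCreate()
    )

    # Hypothetical sample data standing in for a real source.
    tasks = spark.createDataFrame(
        [(1, "anna", "open"), (2, "boris", "done"), (3, "anna", "open")],
        ["task_id", "assignee", "status"],
    )
    # Expose the DataFrame to SQL under a view name.
    tasks.createOrReplaceTempView("tasks")

    # SQL step: the result of spark.sql() is again a DataFrame ...
    top = spark.sql(top_assignees_query("tasks", 10))
    # ... so the program can continue with the DataFrame API.
    busy = top.filter(F.col("open_tasks") >= 2)
    busy.show()

    spark.stop()
```

Calling `main()` on a machine with PySpark installed runs the whole pipeline locally. The point of the sketch is the hand-off: an analyst can write the aggregation in SQL while the surrounding program keeps working with ordinary DataFrame operations.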

Where did it come from? Original white papers
• "Spark SQL: Relational Data Processing in Spark" by Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia. Databricks Inc.; MIT CSAIL; AMPLab, UC Berkeley

Conclusion
• You can easily create scalable infrastructure;
• Do you dream about cross-DB joins? Welcome!
• Do you want to join logs with your usual DBs? Welcome!
• Your analysts are not programmers? Not a problem!
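A hedged sketch of a cross-DB join with logs, assuming Spark 2.0 or later: each source is registered as a temporary view, after which one SQL statement can join them all. The hosts, databases, table names, columns, and log path below are made up for illustration; a real setup would also pass credentials and ship the JDBC drivers.

```python
# View name -> (Spark reader format, reader options). All values are hypothetical.
SOURCES = {
    "accounts": ("jdbc", {"url": "jdbc:postgresql://billing-db:5432/billing",
                          "dbtable": "accounts"}),
    "users": ("jdbc", {"url": "jdbc:mysql://app-db:3306/app",
                       "dbtable": "users"}),
    "logs": ("json", {"path": "/data/logs/*.json"}),
}

# Logins per billing plan: logs (JSON files) joined with tables from two databases.
JOIN_SQL = """
SELECT a.plan, COUNT(*) AS logins
FROM logs l
JOIN users u ON l.user_id = u.id
JOIN accounts a ON u.account_id = a.id
WHERE l.event = 'login'
GROUP BY a.plan
"""


def register_views(spark, sources) -> None:
    """Load every source and expose it to SQL under its view name."""
    for name, (fmt, options) in sources.items():
        spark.read.format(fmt).options(**options).load().createOrReplaceTempView(name)


def logins_per_plan(spark):
    """Run the cross-source join; `spark` is an existing SparkSession."""
    register_views(spark, SOURCES)
    return spark.sql(JOIN_SQL)
```

Once the views exist, the analyst only has to write the SQL in `JOIN_SQL`; Spark takes care of reading from each system and performing the join.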

Your questions?
To make our team more awesome we need:
• UX Data Analyst
• Billing Operations Analyst
• Data Engineer
[email protected]
