How to process millions of data with Spark and DataProc

Juan Guillermo Gómez
GDE Firebase & GCP & Kotlin & ML
@jggomezt

Juan Guillermo Gómez
➢ Co-Leader and Co-Founder of GDG Cali.
➢ Tech Lead at WordBox & Founder of DevHack.
➢ Consultant and advisor on software architecture, cloud computing and software development.
➢ Experience in several languages and platforms (C, C#, Java, NodeJS, Android, GCP, Firebase, Python, Go, DS, ML).
➢ Google Developer Expert (GDE) in Firebase & GCP & Kotlin & ML.
➢ BS in Systems Engineering and an MS in Software Engineering.
➢ @jggomezt
➢ youtube.com/devhack

A lot of data…

About Data Science
❖ Analyze data and add value to the business.
❖ Find faults and points for improvement.
❖ Process a lot of data in a short time.
❖ Use the best tools to obtain, process and store data.
❖ Automate the processes that obtain, process and store data.

Why do I need to process the data?
❖ Gain insights to make the best decisions quickly.
❖ Create new services faster.
❖ Create new attributes.
❖ Improve our data and its structure.
❖ Change technology for data lakes.

How can millions and millions of records be processed?
❖ Scripts executed on our own PC.
❖ Scripts executed in Colab or Jupyter.
❖ Scripts executed on a super virtual machine.
❖ Scripts executed on a cluster.
❖ Using advanced tools.
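The step from a single script to a cluster comes down to one pattern: split the data into partitions, process each partition independently, then combine the partial results. The following is a minimal sketch of that pattern using only the Python standard library. The function names and the "sum the even values" workload are hypothetical, chosen only for illustration, and the threads merely stand in for worker nodes (in CPython they show the shape of the computation, not a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Split a list into contiguous chunks of nearly equal size."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional record.
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def process_partition(partition):
    """Work done independently on each partition: sum the even values."""
    return sum(x for x in partition if x % 2 == 0)


def run_job(records, num_partitions=4):
    """Partition the data, process partitions concurrently, combine results."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # The "reduce" step merges the per-partition results into one answer.
    return reduce(lambda a, b: a + b, partial_results, 0)
```

Calling `run_job(list(range(1_000_000)))` runs the whole pipeline on one machine. A cluster framework applies the same split, process and combine steps, but places the partitions on different machines and handles failures for you.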

Tools to process millions of data

Spark ?

What is Spark?
❖ Multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
❖ Framework for processing data distributed across multiple nodes.
❖ Framework with high-level components for developing jobs.
❖ Batch and streaming data processing.
❖ SQL analytics.
❖ Data science at scale.
❖ Machine learning.

What is Spark?

Spark Cluster Architecture
Cluster Manager Types:
❖ Standalone
❖ Hadoop YARN
❖ Kubernetes
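In practice, the cluster manager is selected through the master URL passed to Spark (for example via `--master` or `SparkSession.builder.master(...)`). The sketch below shows the URL shape for each manager type listed above, plus `local[*]` for running without a cluster. The helper `master_url` is hypothetical and not part of Spark, and the host names are placeholders.

```python
def master_url(manager, host=None, port=None):
    """Return the Spark master URL for a given cluster manager type."""
    if manager == "local":
        # No cluster: run on the current machine using all available cores.
        return "local[*]"
    if manager == "standalone":
        # Spark's own cluster manager; 7077 is its default master port.
        return f"spark://{host}:{port or 7077}"
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    if manager == "kubernetes":
        # Points at the Kubernetes API server; 443 here is an assumed port.
        return f"k8s://https://{host}:{port or 443}"
    raise ValueError(f"unknown cluster manager: {manager}")
```

For example, `master_url("standalone", "spark-master.example.com")` yields `spark://spark-master.example.com:7077`, which could then be passed to `SparkSession.builder.master(...)`.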

Spark Ecosystem

RDD (Resilient Distributed Dataset)

PySpark

Spark SQL, Datasets and DataFrame
❖ Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.
❖ DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
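To make the "either SQL or a DataFrame API" point concrete, here is a minimal PySpark sketch that computes the same aggregate both ways. It assumes the `pyspark` package is installed and runs Spark locally with `local[*]`. The sample rows, the `orders` view name and all function names are hypothetical, and the plain-Python function is included only as a reference to compare results against.

```python
from collections import defaultdict

# Hypothetical sample data: (customer, amount) pairs.
SAMPLE_ORDERS = [("ana", 120.0), ("luis", 80.0), ("ana", 30.0), ("sofia", 200.0)]

TOTALS_SQL = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""


def totals_in_plain_python(rows):
    """Reference result without Spark: total per customer, largest first."""
    totals = defaultdict(float)
    for customer, amount in rows:
        totals[customer] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def totals_with_sql(spark, rows):
    """Register the rows as a temporary view and query it with SQL."""
    df = spark.createDataFrame(rows, schema=["customer", "amount"])
    df.createOrReplaceTempView("orders")
    return [(r["customer"], r["total"]) for r in spark.sql(TOTALS_SQL).collect()]


def totals_with_dataframe_api(spark, rows):
    """Express the same query with DataFrame methods instead of SQL."""
    from pyspark.sql import functions as F  # imported here so the rest runs without Spark

    df = spark.createDataFrame(rows, schema=["customer", "amount"])
    result = (
        df.groupBy("customer")
        .agg(F.sum("amount").alias("total"))
        .orderBy(F.col("total").desc())
    )
    return [(r["customer"], r["total"]) for r in result.collect()]


def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("orders-demo").getOrCreate()
    try:
        print(totals_with_sql(spark, SAMPLE_ORDERS))
        print(totals_with_dataframe_api(spark, SAMPLE_ORDERS))
    finally:
        spark.stop()
```

Calling `main()` should print the same per-customer totals twice. Replacing `spark.createDataFrame(...)` with a reader such as `spark.read.parquet(...)` or `spark.read.json(...)` leaves the query code unchanged, which is the "common way to access a variety of data sources" mentioned above.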

Structured Streaming

How to install Spark…

DataProc by GCP

What is DataProc?
Fully managed and highly scalable service for running:
➢ Apache Spark
➢ Hadoop
➢ Apache Flink
➢ Presto
➢ Frameworks
➢ 30+ open source tools.
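As an illustration of running Spark on the managed service, the sketch below submits a PySpark script to an existing cluster using the `google-cloud-dataproc` Python client. It assumes that package is installed, that credentials are already configured, and that a cluster exists. The project, region, cluster name and `gs://` path in the usage note are placeholders, not values from this talk, and the helper names are hypothetical.

```python
def build_pyspark_job(cluster_name, main_python_file_uri, args=()):
    """Build the job description for a PySpark script stored in Cloud Storage."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_python_file_uri,
            "args": list(args),
        },
    }


def submit_job(project_id, region, job):
    """Submit the job and block until it finishes."""
    from google.cloud import dataproc_v1  # imported here so the builder works without it

    # Job requests go to the regional endpoint of the cluster.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()
```

A call might look like `submit_job("my-project", "us-central1", build_pyspark_job("my-cluster", "gs://my-bucket/jobs/process.py"))`, where every argument is a placeholder to replace with your own values.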

Thank You! Gracias!
Juan Guillermo Gómez
GDE Firebase & GCP & ML & Kotlin
@jggomezt

Juan Guillermo Gómez Torres
