How to process millions of data with Spark and DataProc

How to process millions of data with PySpark and DataProc

Transcript

  1. Juan Guillermo Gómez, GDE Firebase & GCP & Kotlin & ML, @jggomezt

    How to process millions of data with Spark and DataProc
  2. Juan Guillermo Gómez

    ➢ Co-Leader and Co-Founder of GDG Cali.
    ➢ Tech Lead at WordBox & Founder of DevHack.
    ➢ Consultant and advisor on software architecture, cloud computing and software development.
    ➢ Experience in several languages and platforms (C, C#, Java, NodeJS, Android, GCP, Firebase, Python, Go, DS, ML).
    ➢ Google Developer Expert (GDE) in Firebase & GCP & Kotlin & ML.
    ➢ BS in Systems Engineering and MS in Software Engineering.
    ➢ @jggomezt
    ➢ youtube.com/devhack
  3. About Data Science

    ❖ Analyze data and add value to the business.
    ❖ Find faults and points for improvement.
    ❖ Process a lot of data in a short time.
    ❖ Use the best tools to obtain, process and store data.
    ❖ Automate processes to obtain, process and store data.
  4. Why do I need to process the data?

    ❖ Gain insights to make the best decisions quickly.
    ❖ Create new services faster.
    ❖ Create new attributes.
    ❖ Improve our data and its structure.
    ❖ Changing technology for data lakes.
  5. How can millions and millions of records be processed?

    ❖ Run scripts on our own PC.
    ❖ Run scripts in Colab or Jupyter.
    ❖ Run scripts on a very large virtual machine.
    ❖ Run scripts on a cluster.
    ❖ Use advanced tools.
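
The step from a single script to a cluster comes down to splitting the data into partitions, processing each partition independently, and merging the partial results. The following is a minimal sketch of that idea using only the Python standard library; the function names and the word-count task are illustrative assumptions, not part of the talk. Threads stand in for cluster workers here, so this shows the structure of the computation rather than a real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_partition(partition):
    """Count words in one partition; needs no data from other partitions."""
    return Counter(word for line in partition for word in line.split())


def count_words(records, num_partitions=4):
    """Count words by processing partitions independently, then merging."""
    partitions = split_into_partitions(records, num_partitions)
    # Each partition is handled by its own worker (threads stand in for nodes).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    # Merge the partial results into one final answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Because each partition is counted without looking at the others, the same structure can be spread over several machines, which is the kind of distribution a cluster framework handles.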
  6. What is Spark?

    ❖ Multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
    ❖ Framework for processing data distributed across multiple nodes.
    ❖ Framework with high-level components for developing jobs.
    ❖ Batch and streaming data processing.
    ❖ SQL analytics.
    ❖ Data science at scale.
    ❖ Machine learning.
  7. Spark SQL, Datasets and DataFrames

    ❖ Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.
    ❖ DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
  8. What is DataProc?

    Fully managed and highly scalable service for running:
    ➢ Apache Spark
    ➢ Hadoop
    ➢ Apache Flink
    ➢ Presto
    ➢ Frameworks
    ➢ 30+ open source tools.
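
One way to run a PySpark script on an existing cluster is to submit it as a job through the `google-cloud-dataproc` Python client. The sketch below is an assumption about how that could look, not the setup used in the talk: the project, region, cluster name and `gs://` path passed in are all hypothetical, and `submit_job` needs the client library plus valid Google Cloud credentials.

```python
def build_pyspark_job(cluster_name, main_python_file_uri, args=None):
    """Describe a PySpark job as a plain dict in the shape the Jobs API expects."""
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    if args:
        # Command-line arguments forwarded to the PySpark script.
        job["pyspark_job"]["args"] = list(args)
    return job


def submit_job(project_id, region, job):
    """Submit the job and block until it finishes (requires credentials)."""
    # Imported here so the job description can be built without the client installed.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()
```

A call such as `submit_job("my-project", "us-central1", build_pyspark_job("demo-cluster", "gs://demo-bucket/job.py"))` would then run the script on the named cluster, with every identifier in that call being a placeholder.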