What Is Big Data

A series of talks on data engineering

Yuri Ostapchuk

September 13, 2021

Transcript

  1. WHAT IS BIG DATA? AND WHY YOU MAY CARE
  2. A FEW CASES

    - 1 TB per day at a stock exchange
    - 500+ TB at Facebook each day
    - 10+ TB in 30 minutes of jet flight
    - a solar station can generate more data than the network can transmit to the cloud
    - 73% of data is not used these days
  3. WHAT IS (BIG) DATA?

    A cumulative term for:

    - a huge amount of data
    - business problems related to collecting and retrieving value from the (big) data
    - a set of engineering tools, patterns, programming models and skills to solve those problems
  4. BUSINESS: WHY IT MATTERS

    - data-driven decisions
    - operational efficiency, risk management, etc.
    - cost optimization
    - ..
  5. BUSINESS: LEVELS OF ADOPTION & CUSTOMER PAIN

    A given customer can be at various levels of big-data / AI adoption, which reflects their current needs, problems and pain points.
  10. BUSINESS: LEVELS OF ADOPTION & CUSTOMER PAIN (continued)

    1. early: "Why can't my MySQL instance handle this?"
    2. optimization & strategy required: "shi.. my AWS bill"
    3. "I have money, let's build something": lack of strategy, both operational and technical
    4. data swamp: "where are my socks?"
    5. platform already exists: Analytics, BI, ML
    6. all of them: "Here I planted my potatoes.. want to adopt AI/ML eventually.."
  11. ENGINEERING

    - distributed systems
    - structured, unstructured data
    - collect (ingest, integrate)
    - prepare: clean, enrich, feature extract, pre-aggregate, ..
    - process: batch, real-time
    - store
    - integrate (user interface: SQL front-end engines, REST API front-ends, services)
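
The stages on the ENGINEERING slide (collect, prepare, process, store) can be sketched as a tiny in-memory batch pipeline. This is only an illustration under stated assumptions: the event fields, the `RAW_EVENTS` sample and the dictionary used as a "warehouse" are hypothetical and not taken from the talk; a real pipeline would run on distributed storage and compute.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

# Hypothetical raw events; field names and values are illustrative only.
RAW_EVENTS = [
    {"user": "alice", "amount": "10.5"},
    {"user": "bob", "amount": "n/a"},   # malformed amount, dropped in prepare()
    {"user": "alice", "amount": "4.5"},
    {"user": " Bob ", "amount": "3"},   # needs cleaning (whitespace, case)
]


def collect() -> Iterator[dict]:
    """Collect (ingest): here we simply read from an in-memory list."""
    yield from RAW_EVENTS


def prepare(events: Iterable[dict]) -> Iterator[dict]:
    """Prepare: normalize user names and drop records with unparsable amounts."""
    for event in events:
        try:
            amount = float(event["amount"])
        except ValueError:
            continue
        yield {"user": event["user"].strip().lower(), "amount": amount}


def process(events: Iterable[dict]) -> Dict[str, float]:
    """Process (batch): pre-aggregate the total amount per user."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["user"]] += event["amount"]
    return dict(totals)


def store(totals: Dict[str, float], sink: Dict[str, float]) -> None:
    """Store: write the aggregates into a dict standing in for a warehouse table."""
    sink.update(totals)


warehouse: Dict[str, float] = {}
store(process(prepare(collect())), warehouse)
print(warehouse)
```

Each stage is a plain function over an iterable, so a stage can be swapped (for example, a streaming source instead of `collect`) without touching the others.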
  22. agile doesn't (fully) work

    - a data pipeline is bad to split vertically
    - POC phase & never-ending exploration
    - team: smaller over versatile
    - requirements collection -> data exploration
    - BA -> BI
    - QA -> Data Engineer
    - DevOps -> DataOps Engineer
    - tooling matters A LOT: dev environment, fast access, ..
    - simplicity of architecture over structure
    - optimization over simplicity
    - sacrifice locally to optimize globally
    - communicate: half of something is better than all of nothing
  23. skills

    - 30% coding
    - 25% algorithms, distributed design
    - 20% DevOps
    - 15% SQL and statistics
    - 10% visualization & presentation
  24. HISTORICAL PERSPECTIVE

    - Pre: Oracle, Informatica, MySQL, ..
    - 2004: Hadoop, MapReduce, HDFS
    - 2009-2010: Spark, Kafka
    - Confluent, Databricks, Cloudera, Hortonworks, MapR
    - AWS, Azure, GCP, IBM
  25. GLOSSARY

    - BA vs BI
    - Data Engineer
    - Data Warehouse
    - Data Pipeline
    - Data Lake
    - OLAP, OLTP, OLAP cubes, tensors
    - batch vs stream
  26. GLOSSARY (continued)

    - Data Science vs AI vs ML
    - predictive analytics
    - MPP
    - Hadoop, MapReduce, HDFS
    - structured / unstructured data
    - ETL
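
To make the MapReduce entry in the glossary concrete, here is a minimal single-process sketch of the programming model using the classic word-count example. It is not Hadoop's API: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are assumptions made for illustration, and a real framework would distribute these steps across many machines.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input: each string stands in for one input split.
documents = ["big data is big", "data is data"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because `map_phase` looks at one document at a time and `reduce_phase` at one key at a time, both steps can run in parallel; only the shuffle needs to move data between workers.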