From Machine Learning Startup to Big Data Company

Talk given by Christoph Tavan at Berlin Buzzwords 2015, June 2.

https://2015.berlinbuzzwords.de/session/machine-learning-startup-big-data-company

How do you start a company based on a machine learning idea? And how do you scale it into "Big Data" territory?

In this talk I want to share some insights that I gathered during the last 3 years while founding and successfully scaling a real-time bidding (RTB) company from a two-person startup to a leading technology provider in the field:

- From fancy algorithms to production-proof algorithms.
- From thousands of model evaluations per day to trillions.
- From megabytes to petabytes.
- From real-time to batch to real-time.
- From two people to entire teams of data scientists and engineers.

I want to present real-world examples of pitfalls we faced, bad technology decisions we made, and other things that can and will go wrong, and how to make the best of it!

Buzzwords involved: Hadoop, Kafka, Spark, Impala, Redis, Aerospike, …

Christoph Tavan

June 02, 2015

Transcript

  1. From Machine Learning Startup to Big Data Company – Berlin Buzzwords, June 2, 2015 – Christoph Tavan – @ctavan
  2. 10+ years of research in Machine Learning (Clustering). “Hey, these algorithms seem to be really powerful for marketing, let's make money with it!” [Company]
  3. 10+ years of research in Machine Learning (Clustering). “Hey, these algorithms seem to be really powerful for marketing, let's make money with it!” “We ‘just’ need to build some API around them.” [Company]
  4. 10+ years of research in Machine Learning (Clustering). “Hey, these algorithms seem to be really powerful for marketing, let's make money with it!” “We ‘just’ need to build some API around them.” Well, not quite… [Company]
  5. Finding a product… Start with recommender systems: → Good results → Quick go-live → Each new shop ≙ lots of effort → Hard to scale business-wise
  6. Finding the right product… • Presenting our concepts at conferences • Talking to potential customers of our technology • Trying to find a product that scales business-wise
  7. Finding the right product… • Presenting our concepts at conferences • Talking to potential customers of our technology • Trying to find a product that scales business-wise → Move to online advertising: Real Time Bidding (RTB) → We have to build (even more) infrastructure…
  8. Lesson Learned: From Algorithm to Product… If your algorithms yield extraordinary results in the lab, that doesn't mean you'll have a profitable business within a day or two: prepare to build an actual product! If none of your co-founders is a business insider from your product's field → find one.
  9. Stage 1: Bootstrapping (2011) → Find the right product → Build a proof of concept
  10. Stage 3: Big Data Company (2014) → Scale your product → Add more features
  11. 3 Stages of a (Machine Learning) Company: 1. Bootstrapping: Proof of Concept 2. Funded Startup: Market Entry 3. Big Data Company: Scaling
  12. Structure of a Machine Learning Company • Team • Product ✅ • Architecture: Realtime Architecture, Data Pipeline & Storage, Machine Learning Algorithms [Diagram: Internet of Everything → Events → Models]
  13. Structure of mbr targeting • Team • Product: Real Time Bidding • Architecture: Bidder & Tracking, Data Pipeline & Storage, Click/Conversion Prediction Models [Diagram: Internet → Events → Models]
  14. Interlude: Real Time Bidding (RTB). Website owners sell their ad spaces; advertisers buy impressions for their ads. [Diagram: Website with Ads ↔ Supply Side Platform (SSP) ↔ Demand Side Platform (DSP, mbr); roundtrip <100 ms]
  15. RTB Event Chain: Bid Request → Bid → Impression → Click → Conversion. Goal: predict clicks (or even better: conversions). [Diagram: Website with Ads ↔ SSP ↔ DSP]
  16. RTB Goal • We pay for every impression. • The advertiser wants to pay only for clicks (or conversions). • We need to predict the click probability: bid = payout × p(click), where bid is what we are willing to pay, payout is what we earn in case of success (a constant), and p(click) is the probability of a click → predict from DATA.
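
A minimal sketch of this expected-value logic (function and variable names are mine, not from the deck; a real bidder would also subtract a margin and handle auction mechanics):

```python
def max_bid_cpm(payout_per_click: float, p_click: float) -> float:
    """Break-even bid for one impression, expressed as CPM.

    payout_per_click: what we earn in case of a click (constant).
    p_click: predicted click probability, the part learned from data.
    """
    # Expected earnings per impression, scaled to the customary
    # price-per-1000-impressions (CPM) unit used in RTB auctions.
    return payout_per_click * p_click * 1000.0

# Example: 0.50 EUR per click at a predicted 0.2% click probability
# means we can afford at most 1.00 EUR CPM for this impression.
print(max_bid_cpm(0.50, 0.002))  # 1.0
```
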
  17. Structure of mbr targeting • Team • Product: Real Time Bidding • Architecture: Bidder & Tracking, Data Pipeline & Storage, Click/Conversion Prediction Models [Diagram: Internet → Events → Models]
  18. 3 Stages of a (Machine Learning) Company: 1. Bootstrapping: Proof of Concept 2. Funded Startup: Market Entry 3. Big Data Company: Scaling
  19. Bootstrapping: Proof of Concept • Events (JSON) → log files • Copy log files to storage server (partition by date) • Reporting: extract dimensions and stream to MySQL • Analytics: stream entire log files through Python on the storage server [Diagram: Tracking → LogFiles → rsync → Storage Server; Reporting: MySQL streaming (script); Analytics: Python, bash, jq, …]
  20. Bootstrapping: Proof of Concept (same pipeline, annotated with its pain points: disks full; aggregations too slow; no real queries possible)
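
The “stream entire log files through Python” style of analytics from this stage might have looked roughly like the sketch below (file layout and field names are assumptions for illustration):

```python
import gzip
import json
from collections import Counter
from pathlib import Path

# Assumed layout: one gzipped JSON-lines file per day under /data/logs,
# mirroring the date-partitioned log files rsync'ed to the storage server.
LOG_ROOT = Path("/data/logs")

clicks_per_campaign = Counter()
for day_dir in sorted(LOG_ROOT.iterdir()):
    with gzip.open(day_dir / "events.json.gz", "rt") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "click":
                clicks_per_campaign[event["campaign_id"]] += 1

for campaign_id, clicks in clicks_per_campaign.most_common(10):
    print(campaign_id, clicks)
```

This works until the disks fill up and every question requires another full scan, which is exactly the pain the next stages address.
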
  21. Funded Startup: Market Entry • Events (JSON) → log files • Copy log files to HDFS • Reporting: extract dimensions and stream to MySQL • Analytics: stream entire log files through Python [Diagram: Tracking → LogFiles → hdfs put → HDFS; remaining pain points: aggregations too slow; no real queries possible]
  22. Funded Startup: Market Entry • Events (JSON) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: stream entire log files through Python [Diagram: Tracking → LogFiles → hdfs put → HDFS → Hive/MR/Oozie → Reporting: PostgreSQL; remaining pain point: no real queries possible]
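
A rough sketch of what “ETL with Hive on gzipped JSON” can look like; the table names, columns, and invoking the Hive CLI from Python are illustrative assumptions, not the deck's actual jobs:

```python
import subprocess

# Hive reads gzipped JSON-lines text files in place; get_json_object()
# extracts fields at query time (flexible, but re-parses JSON every run).
QUERY = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (json STRING)
LOCATION '/data/events';

INSERT OVERWRITE TABLE daily_clicks
SELECT get_json_object(json, '$.campaign_id'),
       COUNT(*)
FROM raw_events
WHERE get_json_object(json, '$.type') = 'click'
GROUP BY get_json_object(json, '$.campaign_id');
"""

# An Oozie shell action (or later a Luigi task) would run a step like this.
subprocess.run(["hive", "-e", QUERY], check=True)
```
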
  23. Funded Startup: Market Entry • Events (JSON) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Pain points: no timely ingestion; small-file problem; no reliable structure; duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  24. Big Data Company: Scale • Events (JSON) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Same pain points: no timely ingestion; small-file problem; no reliable structure; duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  25. Big Data Company: Scale • Events (Protobuf) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Remaining pain points: no timely ingestion; small-file problem; duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  26. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Diagram: Tracking → Kafka → HDFS; remaining pain points: duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  27. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Luigi workflow ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Diagram: Tracking → Kafka → HDFS → Hive/MR/Luigi → Reporting: PostgreSQL; remaining pain points: duplicated data; slow ETL; slow analytics]
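
For readers who have not used Luigi: workflows are plain Python classes whose dependencies and outputs the scheduler resolves, which is a large part of why they are more fun to maintain than an XML-based Oozie DSL. A minimal sketch with invented task and path names:

```python
import luigi


class ExtractClicks(luigi.Task):
    """One ETL step: aggregate a day of raw events into click counts."""
    date = luigi.DateParameter()

    def output(self):
        # Luigi re-runs a task only if its output target is missing,
        # which makes backfills and retries idempotent.
        return luigi.LocalTarget(f"/data/clicks/{self.date}.tsv")

    def run(self):
        with self.output().open("w") as out:
            out.write("campaign_id\tclicks\n")  # real job: launch Hive/Spark here


class ExportToPostgres(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractClicks(self.date)  # declared dependency, not cron ordering

    def output(self):
        return luigi.LocalTarget(f"/data/exports/{self.date}.done")

    def run(self):
        # A real task would COPY the aggregate into PostgreSQL;
        # here we only write a completion marker.
        with self.output().open("w") as out:
            out.write("ok\n")


if __name__ == "__main__":
    luigi.run()  # e.g. python etl.py ExportToPostgres --date 2015-06-02
```
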
  28. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Luigi workflow ◦ ETL with Hive and Spark on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Remaining pain points: duplicated data; slow ETL; slow analytics]
  29. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Luigi workflow ◦ ETL with Hive and Spark on Parquet tables ◦ Export to PostgreSQL • Analytics: Hive/Impala/Spark on Parquet tables [Diagram: Tracking → Kafka → HDFS → Hive/Spark/Luigi → Reporting: PostgreSQL]
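
The deck does not show the Kafka-to-Parquet writer itself; the sketch below illustrates the idea with kafka-python and pyarrow (my choice of libraries; the topic name and schema are invented, and the real events were Protobuf rather than the JSON used here to keep the sketch self-contained):

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq
from kafka import KafkaConsumer

consumer = KafkaConsumer("tracking-events", bootstrap_servers="kafka:9092")

schema = pa.schema([("type", pa.string()), ("campaign_id", pa.string())])
writer = pq.ParquetWriter("/data/events/batch-0001.parquet", schema)

batch = []
for message in consumer:
    batch.append(json.loads(message.value))
    # Flushing large row groups is what avoids the small-file problem
    # that plagued the per-log-file `hdfs put` approach.
    if len(batch) >= 10_000:
        writer.write_table(pa.Table.from_pylist(batch, schema=schema))
        batch.clear()
```
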
  30. Data Pipeline Evolution [Diagram of all three stages: 1. Tracking → LogFiles → rsync → Storage Server → Reporting: MySQL streaming (script) / Analytics: Python, bash, jq; 2. Tracking → LogFiles → hdfs put → HDFS → Hive/MR/Oozie → Reporting: PostgreSQL / Analytics: Hive, Impala; 3. Tracking → Kafka → HDFS → Hive/Spark/Luigi → Reporting: PostgreSQL / Analytics: Hive, Impala]
  31. Numbers. 1. Bootstrapping: Proof of Concept: 500 qps, 10 GB/day, 4 TB storage, 1 server. 2. Funded Startup: Market Entry: 5k qps, 100 GB/day, 50 TB Hadoop cluster. 3. Big Data Company: Scaling / Internationalization: 50k+ qps, 1+ TB/day, 1+ PB Hadoop cluster + Kafka cluster.
  32. Lesson Learned: Fast access to all data via SQL • Make all your raw data accessible through SQL! ◦ Make it fast! ◦ If your data scientists can query the raw data fast with SQL, they will do so and find out great things! • Leave other DSLs like MapReduce or Pig to the engineers. ◦ SQL is (almost always) good enough. ◦ Exception: Spark might be an interesting alternative due to its Python integration and straightforward interface.
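
To make that lesson concrete: with something like Impala in front of the Parquet tables, “fast SQL access to raw data” is a few lines away. A sketch using the impyla client (host, table, and columns are invented):

```python
from impala.dbapi import connect

# A data scientist can explore the raw event tables interactively,
# with no MapReduce job or engineer in the loop.
conn = connect(host="impala.example.internal", port=21050)
cursor = conn.cursor()
cursor.execute("""
    SELECT campaign_id, COUNT(*) AS clicks
    FROM events
    WHERE type = 'click' AND day = '2015-06-02'
    GROUP BY campaign_id
    ORDER BY clicks DESC
    LIMIT 10
""")
for campaign_id, clicks in cursor.fetchall():
    print(campaign_id, clicks)
```
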
  33. Lesson Learned: Introducing Hadoop • When is the right time to introduce Hadoop/HDFS? ◦ When “conventional” SQL becomes too slow… ◦ Probably easier now than 3 years ago; depends on team and product… • Yet, many things can be achieved without Hadoop. ◦ Do you really need the extra complexity from the beginning? • When you do introduce it: find people with experience early. ◦ Avoid beginner's mistakes: ▪ the small-file problem, or big unsplittable files ▪ MapReduce jobs in Java where we should have used Hive or Pig
  34. Lesson Learned: Data Format: JSON → Protobuf • JSON is great for the start, use it! ◦ Human-readable ◦ Flexible ◦ You can: … • Later, schemaful data formats may streamline processes: ◦ A reliable schema is required for long-term analysis ◦ Inter-team/component compatibility ◦ In case of a binary format, provide tooling, like: … See also: http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
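
The slide's concrete tooling examples are not captured in the transcript. One plausible shape for such tooling is a cat-style decoder that restores human readability to binary logs; the sketch below is hypothetical throughout: it assumes an `Event` message generated by protoc and a 4-byte length prefix per record.

```python
# Hypothetical "pbcat": stream length-delimited protobuf records as JSON,
# so pipelines like `pbcat < events.bin | jq .` keep working after the
# switch away from human-readable JSON logs.
import struct
import sys

from google.protobuf.json_format import MessageToJson

from events_pb2 import Event  # assumed: generated by protoc from events.proto


def read_messages(stream):
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of input
        (size,) = struct.unpack(">I", header)  # assumed big-endian length prefix
        event = Event()
        event.ParseFromString(stream.read(size))
        yield event


for event in read_messages(sys.stdin.buffer):
    print(MessageToJson(event))
```
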
  35. Team. 1. Bootstrapping: Proof of Concept ◦ Small group of founders ◦ Use the tools you're most productive with. 2. Funded Startup: Market Entry ◦ Small team of generalists ◦ Use widespread technologies (where you can google all problems). 3. Big Data Company: Scaling / Internationalization ◦ Multiple teams of specialists ◦ “Write your own database or streaming framework”.
  36. Building a Machine Learning Company: 1. Find the right product. 2. Prepare for lots of infrastructure development. 3. Provide fast SQL access to all data at all times. 4. When introducing Hadoop, hire an expert. 5. Start with flexible, human-readable data (JSON).