From Machine Learning Startup to Big Data Company

Talk given by Christoph Tavan at Berlin Buzzwords 2015, June 2.

https://2015.berlinbuzzwords.de/session/machine-learning-startup-big-data-company

How do you start a company based on a machine learning idea? And how do you scale it into "Big Data" territory?

In this talk I want to share some insights that I gathered during the last 3 years while founding and successfully scaling a real-time bidding (RTB) company from a two-person startup to a leading technology provider in the field:

- From fancy algorithms to production-proof algorithms.
- From thousands of model evaluations per day to trillions.
- From megabytes to petabytes.
- From real-time to batch to real-time.
- From two people to entire teams of data scientists and engineers.

I want to present real-world examples of pitfalls we faced, bad technology decisions we made, and other things that can and will go wrong, and how to make the best of it!

Buzzwords involved: Hadoop, Kafka, Spark, Impala, Redis, Aerospike, …

Christoph Tavan

June 02, 2015

Transcript

  1. From Machine Learning Startup to Big Data Company – Berlin Buzzwords, June 2, 2015 – Christoph Tavan – @ctavan
  2. 10+ years of research in Machine Learning (Clustering). “Hey, these algorithms seem to be really powerful for marketing, let's make money with it!” [Company]
  3. 10+ years of research in Machine Learning (Clustering). “Hey, these algorithms seem to be really powerful for marketing, let's make money with it!” “We ‘just’ need to build some API around them.” [Company]
  4. 10+ years of research in Machine Learning (Clustering). “Hey, these algorithms seem to be really powerful for marketing, let's make money with it!” “We ‘just’ need to build some API around them.” Well, not quite… [Company]
  5. Finding a product… Start with recommender systems: → Good results → Quick go-live → Each new shop ≙ lots of effort → Hard to scale business-wise
  6. Finding the right product… • Presenting our concepts at conferences • Talking to potential customers of our technology • Trying to find a product that scales business-wise
  7. Finding the right product… • Presenting our concepts at conferences • Talking to potential customers of our technology • Trying to find a product that scales business-wise → Move to online advertising: Real Time Bidding (RTB) → We have to build (even more) infrastructure…
  8. Lesson Learned: From Algorithm to Product… If your algorithms yield extraordinary results in the lab, that doesn't mean you'll have a profitable business within a day or two: prepare to build an actual product! If none of your co-founders is a business insider from your product's field → find one.
  9. Stage 1: Bootstrapping (2011) → Find the right product → Build a proof of concept
  10. Stage 3: Big Data Company (2014) → Scale your product → Add more features
  11. 3 Stages of a (Machine Learning) Company: 1. Bootstrapping: Proof of Concept 2. Funded Startup: Market Entry 3. Big Data Company: Scaling
  12. Structure of a Machine Learning Company • Team • Product ✅ • Architecture: Realtime Architecture, Data Pipeline & Storage, Machine Learning Algorithms [Diagram: Internet of Everything → Events → Models]
  13. Structure of mbr targeting • Team • Product: Real Time Bidding • Architecture: Bidder & Tracking, Data Pipeline & Storage, Click/Conversion Prediction Models [Diagram: Internet → Events → Models]
  14. Interlude: Real Time Bidding (RTB). Website owners sell their ad spaces; advertisers buy impressions for their ads. [Diagram: Website with Ads ↔ Supply Side Platform (SSP) ↔ Demand Side Platform (DSP, mbr); roundtrip <100 ms]
  15. RTB Event Chain: Bid Request → Bid → Impression → Click → Conversion. Goal: predict clicks (or even better: conversions). [Diagram: Website with Ads ↔ SSP ↔ DSP]
  16. RTB Goal • We pay for every impression. • The advertiser wants to pay only for clicks (or conversions). • We need to predict the click probability: bid = payout × p(click), where bid is what we are willing to pay, payout is what we earn in case of success (a constant), and p(click) is the probability of a click → predict from DATA.
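
A minimal sketch of this expected-value logic (function and variable names are mine, not from the deck; a real bidder would also subtract a margin and handle auction mechanics):

```python
def max_bid_cpm(payout_per_click: float, p_click: float) -> float:
    """Break-even bid for one impression, expressed as CPM.

    payout_per_click: what we earn in case of a click (constant).
    p_click: predicted click probability, the part learned from data.
    """
    # Expected earnings per impression, scaled to the customary
    # price-per-1000-impressions (CPM) unit used in RTB auctions.
    return payout_per_click * p_click * 1000.0

# Example: 0.50 EUR per click at a predicted 0.2% click probability
# means we can afford at most 1.00 EUR CPM for this impression.
print(max_bid_cpm(0.50, 0.002))  # 1.0
```
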
  17. Structure of mbr targeting • Team • Product: Real Time Bidding • Architecture: Bidder & Tracking, Data Pipeline & Storage, Click/Conversion Prediction Models [Diagram: Internet → Events → Models]
  18. 3 Stages of a (Machine Learning) Company: 1. Bootstrapping: Proof of Concept 2. Funded Startup: Market Entry 3. Big Data Company: Scaling
  19. Bootstrapping: Proof of Concept • Events (JSON) → log files • Copy log files to storage server (partition by date) • Reporting: extract dimensions and stream to MySQL • Analytics: stream entire log files through Python on the storage server [Diagram: Tracking → LogFiles → rsync → Storage Server; Reporting: MySQL streaming (script); Analytics: Python, bash, jq, …]
  20. Bootstrapping: Proof of Concept (same pipeline, annotated with its pain points: disks full; aggregations too slow; no real queries possible)
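
The “stream entire log files through Python” style of analytics from this stage might have looked roughly like the sketch below (file layout and field names are assumptions for illustration):

```python
import gzip
import json
from collections import Counter
from pathlib import Path

# Assumed layout: one gzipped JSON-lines file per day under /data/logs,
# mirroring the date-partitioned log files rsync'ed to the storage server.
LOG_ROOT = Path("/data/logs")

clicks_per_campaign = Counter()
for day_dir in sorted(LOG_ROOT.iterdir()):
    with gzip.open(day_dir / "events.json.gz", "rt") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "click":
                clicks_per_campaign[event["campaign_id"]] += 1

for campaign_id, clicks in clicks_per_campaign.most_common(10):
    print(campaign_id, clicks)
```

This works until the disks fill up and every question requires another full scan, which is exactly the pain the next stages address.
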
  21. Funded Startup: Market Entry • Events (JSON) → log files • Copy log files to HDFS • Reporting: extract dimensions and stream to MySQL • Analytics: stream entire log files through Python [Diagram: Tracking → LogFiles → hdfs put → HDFS; remaining pain points: aggregations too slow; no real queries possible]
  22. Funded Startup: Market Entry • Events (JSON) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: stream entire log files through Python [Diagram: Tracking → LogFiles → hdfs put → HDFS → Hive/MR/Oozie → Reporting: PostgreSQL; remaining pain point: no real queries possible]
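
A rough sketch of what “ETL with Hive on gzipped JSON” can look like; the table names, columns, and invoking the Hive CLI from Python are illustrative assumptions, not the deck's actual jobs:

```python
import subprocess

# Hive reads gzipped JSON-lines text files in place; get_json_object()
# extracts fields at query time (flexible, but re-parses JSON every run).
QUERY = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (json STRING)
LOCATION '/data/events';

INSERT OVERWRITE TABLE daily_clicks
SELECT get_json_object(json, '$.campaign_id'),
       COUNT(*)
FROM raw_events
WHERE get_json_object(json, '$.type') = 'click'
GROUP BY get_json_object(json, '$.campaign_id');
"""

# An Oozie shell action (or later a Luigi task) would run a step like this.
subprocess.run(["hive", "-e", QUERY], check=True)
```
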
  23. Funded Startup: Market Entry • Events (JSON) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Pain points: no timely ingestion; small-file problem; no reliable structure; duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  24. Big Data Company: Scale • Events (JSON) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Same pain points: no timely ingestion; small-file problem; no reliable structure; duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  25. Big Data Company: Scale • Events (Protobuf) → log files • Copy log files to HDFS • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Remaining pain points: no timely ingestion; small-file problem; duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  26. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Oozie pipeline ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Diagram: Tracking → Kafka → HDFS; remaining pain points: duplicated data; Oozie DSL no fun to maintain; slow ETL; slow analytics]
  27. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Luigi workflow ◦ ETL with Hive and MapReduce on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Diagram: Tracking → Kafka → HDFS → Hive/MR/Luigi → Reporting: PostgreSQL; remaining pain points: duplicated data; slow ETL; slow analytics]
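
For readers who have not used Luigi: workflows are plain Python classes whose dependencies and outputs the scheduler resolves, which is a large part of why they are more fun to maintain than an XML-based Oozie DSL. A minimal sketch with invented task and path names:

```python
import luigi


class ExtractClicks(luigi.Task):
    """One ETL step: aggregate a day of raw events into click counts."""
    date = luigi.DateParameter()

    def output(self):
        # Luigi re-runs a task only if its output target is missing,
        # which makes backfills and retries idempotent.
        return luigi.LocalTarget(f"/data/clicks/{self.date}.tsv")

    def run(self):
        with self.output().open("w") as out:
            out.write("campaign_id\tclicks\n")  # real job: launch Hive/Spark here


class ExportToPostgres(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractClicks(self.date)  # declared dependency, not cron ordering

    def output(self):
        return luigi.LocalTarget(f"/data/exports/{self.date}.done")

    def run(self):
        # A real task would COPY the aggregate into PostgreSQL;
        # here we only write a completion marker.
        with self.output().open("w") as out:
            out.write("ok\n")


if __name__ == "__main__":
    luigi.run()  # e.g. python etl.py ExportToPostgres --date 2015-06-02
```
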
  28. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Luigi workflow ◦ ETL with Hive and Spark on gzipped JSON ◦ Export to PostgreSQL • Analytics: Hive on gzipped JSON ◦ Parquet “materialized views” for faster Impala/Hive [Remaining pain points: duplicated data; slow ETL; slow analytics]
  29. Big Data Company: Scale • Events (Protobuf) → Kafka • Write optimized tables directly in Parquet • Reporting: Luigi workflow ◦ ETL with Hive and Spark on Parquet tables ◦ Export to PostgreSQL • Analytics: Hive/Impala/Spark on Parquet tables [Diagram: Tracking → Kafka → HDFS → Hive/Spark/Luigi → Reporting: PostgreSQL]
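
The deck does not show the Kafka-to-Parquet writer itself; the sketch below illustrates the idea with kafka-python and pyarrow (my choice of libraries; the topic name and schema are invented, and the real events were Protobuf rather than the JSON used here to keep the sketch self-contained):

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq
from kafka import KafkaConsumer

consumer = KafkaConsumer("tracking-events", bootstrap_servers="kafka:9092")

schema = pa.schema([("type", pa.string()), ("campaign_id", pa.string())])
writer = pq.ParquetWriter("/data/events/batch-0001.parquet", schema)

batch = []
for message in consumer:
    batch.append(json.loads(message.value))
    # Flushing large row groups is what avoids the small-file problem
    # that plagued the per-log-file `hdfs put` approach.
    if len(batch) >= 10_000:
        writer.write_table(pa.Table.from_pylist(batch, schema=schema))
        batch.clear()
```
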
  30. Data Pipeline Evolution [Diagram of all three stages: 1. Tracking → LogFiles → rsync → Storage Server → Reporting: MySQL streaming (script) / Analytics: Python, bash, jq; 2. Tracking → LogFiles → hdfs put → HDFS → Hive/MR/Oozie → Reporting: PostgreSQL / Analytics: Hive, Impala; 3. Tracking → Kafka → HDFS → Hive/Spark/Luigi → Reporting: PostgreSQL / Analytics: Hive, Impala]
  31. Numbers. 1. Bootstrapping: Proof of Concept: 500 qps, 10 GB/day, 4 TB storage, 1 server. 2. Funded Startup: Market Entry: 5k qps, 100 GB/day, 50 TB Hadoop cluster. 3. Big Data Company: Scaling / Internationalization: 50k+ qps, 1+ TB/day, 1+ PB Hadoop cluster + Kafka cluster.
  32. Lesson Learned: Fast access to all data via SQL • Make all your raw data accessible through SQL! ◦ Make it fast! ◦ If your data scientists can query the raw data fast with SQL, they will do so and find out great things! • Leave other DSLs like MapReduce or Pig to the engineers. ◦ SQL is (almost always) good enough. ◦ Exception: Spark might be an interesting alternative due to its Python integration and straightforward interface.
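
To make that lesson concrete: with something like Impala in front of the Parquet tables, “fast SQL access to raw data” is a few lines away. A sketch using the impyla client (host, table, and columns are invented):

```python
from impala.dbapi import connect

# A data scientist can explore the raw event tables interactively,
# with no MapReduce job or engineer in the loop.
conn = connect(host="impala.example.internal", port=21050)
cursor = conn.cursor()
cursor.execute("""
    SELECT campaign_id, COUNT(*) AS clicks
    FROM events
    WHERE type = 'click' AND day = '2015-06-02'
    GROUP BY campaign_id
    ORDER BY clicks DESC
    LIMIT 10
""")
for campaign_id, clicks in cursor.fetchall():
    print(campaign_id, clicks)
```
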
  33. Lesson Learned: Introducing Hadoop • When is the right time to introduce Hadoop/HDFS? ◦ When “conventional” SQL becomes too slow… ◦ Probably easier now than 3 years ago; depends on team and product… • Yet, many things can be achieved without Hadoop. ◦ Do you really need the extra complexity from the beginning? • When you do introduce it: find people with experience early. ◦ Avoid beginner's mistakes: ▪ the small-file problem, or big unsplittable files ▪ MapReduce jobs in Java where we should have used Hive or Pig
  34. Lesson Learned: Data Format: JSON → Protobuf • JSON is great for the start, use it! ◦ Human-readable ◦ Flexible ◦ You can: … • Later, schemaful data formats may streamline processes: ◦ A reliable schema is required for long-term analysis ◦ Inter-team/component compatibility ◦ In case of a binary format, provide tooling, like: … See also: http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
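
The slide's concrete tooling examples are not captured in the transcript. One plausible shape for such tooling is a cat-style decoder that restores human readability to binary logs; the sketch below is hypothetical throughout: it assumes an `Event` message generated by protoc and a 4-byte length prefix per record.

```python
# Hypothetical "pbcat": stream length-delimited protobuf records as JSON,
# so pipelines like `pbcat < events.bin | jq .` keep working after the
# switch away from human-readable JSON logs.
import struct
import sys

from google.protobuf.json_format import MessageToJson

from events_pb2 import Event  # assumed: generated by protoc from events.proto


def read_messages(stream):
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # end of input
        (size,) = struct.unpack(">I", header)  # assumed big-endian length prefix
        event = Event()
        event.ParseFromString(stream.read(size))
        yield event


for event in read_messages(sys.stdin.buffer):
    print(MessageToJson(event))
```
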
  35. Team. 1. Bootstrapping: Proof of Concept ◦ Small group of founders ◦ Use the tools you're most productive with. 2. Funded Startup: Market Entry ◦ Small team of generalists ◦ Use widespread technologies (where you can google all problems). 3. Big Data Company: Scaling / Internationalization ◦ Multiple teams of specialists ◦ “Write your own database or streaming framework”.
  36. Building a Machine Learning Company: 1. Find the right product. 2. Prepare for lots of infrastructure development. 3. Provide fast SQL access to all data at all times. 4. When introducing Hadoop, hire an expert. 5. Start with flexible, human-readable data (JSON).