
Data ingestion at Vinted

Talk on data ingestion at Vinted given at Big Data Vilnius meetup (http://www.meetup.com/Vilnius-Hadoop-Meetup/events/220085043/).

Saulius Grigaliunas

February 05, 2015

Transcript

  1. Agenda
     • Event ingestion
       • How does it work?
     • Change (MySQL data) ingestion
       • How will it work?
       • Why doesn’t it work yet?
     • What’s next?
     • Why do all of this?
     • What else do we have in store?
  2. Data sources
     • Event data
     • MySQL data (table dumps/snapshots)
     • Imported data (via CSV)
     • Data from 3rd-party APIs (Facebook Ads API, Zendesk API, adjust.io, etc.)
  3. Big data ingestion @ Vinted
     • Up to 19 billion events / month
     • Up to 0.7 billion events / day
     • Growing
  4. Event ingestion: the not-so-positive stats
     • 500,000 invalid events / day
     • ~60% Android, ~35% web, ~5% iOS
     • Slowly getting better
  5. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  6. Kafka: a high-throughput distributed messaging system
     • Distributed: partitions messages across multiple nodes
     • Reliable: messages replicated across multiple nodes
     • Persistent: all messages are persisted to disk
     • Performant: works just fine with 70,000 message writes per second, up to 700 Mb/s (when replicating/rebalancing)
  7. • Broker: a Kafka server
     • Producer: N producers send messages to Brokers
     • Consumer: N consumers read messages from Brokers, each at its own pace
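      To make the Broker/Producer/Consumer split concrete, here is a minimal producer sketch using the Apache Kafka Java client (the talk shows no client code; the broker host names are invented and the events-hx topic name is borrowed from slide 16):

          import java.util.Properties;
          import org.apache.kafka.clients.producer.KafkaProducer;
          import org.apache.kafka.clients.producer.ProducerRecord;

          public class EventProducerSketch {
              public static void main(String[] args) throws Exception {
                  Properties props = new Properties();
                  // Brokers to bootstrap from; these host names are made up.
                  props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092");
                  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

                  try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                      // One raw JSON event lands on the shared events-hx topic (see slide 16).
                      String eventJson = "{\"event\":\"user.view_screen\",\"user_id\":123}";
                      producer.send(new ProducerRecord<>("events-hx", eventJson));
                  }
              }
          }

      A consumer reads the same topic with its own group and offsets, which is what lets N consumers each proceed at their own pace.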
  8. Kafka @ Vinted
     • 6 Broker nodes
     • ~4 TB of usable space
     • Allows us to safely keep 3+ weeks’ worth of event tracking data (= Hadoop can be down for 3 weeks without data loss)
  9. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  10. http-relay
      • Node.JS
      • Listens on HTTP
      • Accepts JSON event batches
      • Unwraps each batch into single-event JSON objects
      • Enriches each object with metadata coming from the HTTP request headers
      • Sends all events over UDP (to svc-udp-kafka-bridge)
      • CI, continuous deploys
      • A single instance can be down
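      The real http-relay is a Node.JS service; purely as an illustration of the unwrap-and-enrich flow, here is a sketch in Java (the batch/field names, bridge address and UDP port are invented, and Jackson is assumed for JSON handling):

          import java.net.DatagramPacket;
          import java.net.DatagramSocket;
          import java.net.InetAddress;
          import com.fasterxml.jackson.databind.JsonNode;
          import com.fasterxml.jackson.databind.ObjectMapper;
          import com.fasterxml.jackson.databind.node.ObjectNode;

          public class HttpRelaySketch {
              private static final ObjectMapper MAPPER = new ObjectMapper();

              // body: the JSON batch posted by a client; userAgent / clientIp: values taken
              // from the HTTP request headers (all field and key names here are made up).
              static void relay(String body, String userAgent, String clientIp) throws Exception {
                  JsonNode batch = MAPPER.readTree(body);
                  try (DatagramSocket socket = new DatagramSocket()) {
                      InetAddress bridge = InetAddress.getByName("localhost"); // svc-udp-kafka-bridge
                      for (JsonNode event : batch.get("events")) {             // unwrap the batch
                          ObjectNode enriched = ((ObjectNode) event.deepCopy())
                                  .put("user_agent", userAgent)                // enrich with header metadata
                                  .put("client_ip", clientIp);
                          byte[] payload = MAPPER.writeValueAsBytes(enriched);
                          socket.send(new DatagramPacket(payload, payload.length, bridge, 5005));
                      }
                  }
              }
          }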
  11. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  12. svc-udp-kafka-bridge
      • Clojure, ~500 LOC
      • Listens on a UDP port for event JSON objects and relays them to a Kafka broker
      • 47 instances deployed; ~6-10K requests per second through all instances, more during peak hours
      • CI, continuous deploys (a deploy takes about a minute)
      • Uses a clever socket trick (SO_REUSEADDR) to keep processing during deploys
      • Has an internal in-memory buffer used in case Kafka is down; it only lasts ~1 minute, depending on load
      • Mission critical: both instances per host cannot be down
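      A rough Java sketch of the bridge’s receive loop (the real service is Clojure; the port and broker host are invented, the events-hx topic comes from slide 16). The point of the SO_REUSEADDR trick is that the option is set before bind(), so a freshly deployed instance can take over the port while the old one drains:

          import java.net.DatagramPacket;
          import java.net.DatagramSocket;
          import java.net.InetSocketAddress;
          import java.nio.charset.StandardCharsets;
          import java.util.Properties;
          import org.apache.kafka.clients.producer.KafkaProducer;
          import org.apache.kafka.clients.producer.ProducerRecord;

          public class UdpKafkaBridgeSketch {
              public static void main(String[] args) throws Exception {
                  // Create the socket unbound, set SO_REUSEADDR, then bind: this is what lets
                  // a newly deployed instance bind the port while the old one is still running.
                  DatagramSocket socket = new DatagramSocket(null);
                  socket.setReuseAddress(true);
                  socket.bind(new InetSocketAddress(5005));                    // port is made up

                  Properties props = new Properties();
                  props.put("bootstrap.servers", "kafka-broker-1:9092");       // made-up host
                  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                  KafkaProducer<String, String> producer = new KafkaProducer<>(props);

                  byte[] buf = new byte[64 * 1024];
                  while (true) {
                      DatagramPacket packet = new DatagramPacket(buf, buf.length);
                      socket.receive(packet);                                  // one JSON event per datagram
                      String json = new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
                      producer.send(new ProducerRecord<>("events-hx", json));  // relay to Kafka
                  }
              }
          }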
  13. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  14. svc-schema-registry, aka the Škėma service
      • Ruby
      • Stateless JSON schema file server
      • + declarative ETL command generator
      • Centralized, human-readable event registry page (field type descriptions, field purpose descriptions)
      • From just engage.inactive_seller.push_notification_sent to full descriptions of each field
      • Schemas are used to encode event JSON into binary Avro payloads
      • CI, continuous deploys
      • A deploy triggers redeploys of dependent services
      • Schemas automatically update in Hive tables (periodically)
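      A hedged sketch of how a dependent service might use the registry: fetch one schema file over HTTP and parse it with Avro’s Schema.Parser. The base URL and path layout are assumptions; svc-schema-registry’s real endpoints are not shown in the talk:

          import java.io.InputStream;
          import java.net.URL;
          import org.apache.avro.Schema;

          public class SchemaRegistryClientSketch {
              // Fetch one event schema from the registry and parse it with Avro.
              // The URL and .avsc path layout are assumptions for illustration only.
              static Schema fetchSchema(String eventName) throws Exception {
                  URL url = new URL("http://svc-schema-registry.internal/schemas/" + eventName + ".avsc");
                  try (InputStream in = url.openStream()) {
                      return new Schema.Parser().parse(in);
                  }
              }

              public static void main(String[] args) throws Exception {
                  Schema schema = fetchSchema("engage.inactive_seller.push_notification_sent");
                  System.out.println(schema.toString(true)); // pretty-print the field definitions
              }
          }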
  15. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  16. svc-event-passage
      • Clojure
      • Stream processor
      • Flow:
        • Fetch event schemas from svc-schema-registry
        • Validate and encode events pushed by svc-udp-kafka-bridge into Avro payloads
        • Push back to Kafka
      • JSON payloads from the events-hx topic become Avro byte payloads in the event.user.view_screen, event.user.view_item, etc. topics (one per event type, ~200 in total)
      • A single instance processes ~1,500+ events per second during peak hours
      • Can be down for ~5 days
      • CI, continuous deploys
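      The validate-and-encode step sketched in Java with the standard Avro library (svc-event-passage itself is Clojure). Avro’s JsonDecoder expects Avro’s own JSON encoding, so treat this as an illustration of the step rather than a drop-in implementation; the event-type-to-topic naming follows this slide:

          import java.io.ByteArrayOutputStream;
          import org.apache.avro.Schema;
          import org.apache.avro.generic.GenericDatumReader;
          import org.apache.avro.generic.GenericDatumWriter;
          import org.apache.avro.generic.GenericRecord;
          import org.apache.avro.io.BinaryEncoder;
          import org.apache.avro.io.DecoderFactory;
          import org.apache.avro.io.EncoderFactory;
          import org.apache.kafka.clients.producer.KafkaProducer;
          import org.apache.kafka.clients.producer.ProducerRecord;

          public class EventPassageSketch {
              // Decode a JSON event against its schema (this is where malformed events fail
              // validation), re-encode it as binary Avro, and push it to the per-event-type topic.
              static void passEvent(Schema schema, String json, String eventType,
                                    KafkaProducer<String, byte[]> producer) throws Exception {
                  GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
                  GenericRecord record = reader.read(null, DecoderFactory.get().jsonDecoder(schema, json));

                  ByteArrayOutputStream out = new ByteArrayOutputStream();
                  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
                  new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
                  encoder.flush();

                  // e.g. eventType = "user.view_screen" -> topic "event.user.view_screen"
                  producer.send(new ProducerRecord<>("event." + eventType, out.toByteArray()));
              }
          }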
  17. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  18. Camus
      • Java, a LinkedIn project
      • MapReduce batch job (not a service!)
      • Periodically offloads the Avro payloads from the event topics (event.user.view_screen, event.user.view_item, etc.) to the Hadoop Distributed File System (HDFS)
      • + does small transformations (ETL tasks) while offloading
      • Runs periodically, deployed from the warehouse-jobs repo
      • Runs are scheduled via Oozie
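      Camus itself is a MapReduce job driven by its own configuration; purely to show the shape of the offload step, here is a plain (non-MapReduce) sketch that consumes binary Avro from one event topic and appends it to an Avro container file on HDFS. Hosts, paths and the way the schema is obtained are all invented:

          import java.time.Duration;
          import java.util.Collections;
          import java.util.Properties;
          import org.apache.avro.Schema;
          import org.apache.avro.file.DataFileWriter;
          import org.apache.avro.generic.GenericDatumReader;
          import org.apache.avro.generic.GenericDatumWriter;
          import org.apache.avro.generic.GenericRecord;
          import org.apache.avro.io.DecoderFactory;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.kafka.clients.consumer.ConsumerRecord;
          import org.apache.kafka.clients.consumer.ConsumerRecords;
          import org.apache.kafka.clients.consumer.KafkaConsumer;

          public class KafkaToHdfsOffloadSketch {
              public static void main(String[] args) throws Exception {
                  Properties props = new Properties();
                  props.put("bootstrap.servers", "kafka-broker-1:9092");       // made-up host
                  props.put("group.id", "hdfs-offload");
                  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                  props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

                  Schema schema = new Schema.Parser().parse(args[0]);          // schema JSON, e.g. fetched from the registry
                  GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
                  FileSystem fs = FileSystem.get(new Configuration());

                  try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
                       DataFileWriter<GenericRecord> writer =
                               new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                      consumer.subscribe(Collections.singletonList("event.user.view_item"));
                      writer.create(schema, fs.create(new Path("/warehouse/event.user.view_item/part-0.avro")));

                      ConsumerRecords<String, byte[]> batch = consumer.poll(Duration.ofSeconds(10)); // one small batch
                      for (ConsumerRecord<String, byte[]> r : batch) {
                          GenericRecord event = reader.read(null,
                                  DecoderFactory.get().binaryDecoder(r.value(), null));
                          writer.append(event);                                // lands in HDFS as an Avro file
                      }
                  }
              }
          }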
  19. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka (pre-production)
  20. MySQL imports to HDFS
      • Current situation:
        • Dump a predefined list of MySQL tables to HDFS nightly
        • We have ~1-day-old snapshots of MySQL tables
      • Issues:
        • Unreliable tooling
        • Lots of traffic each night to retransfer the same data
        • We still get only a snapshot: table state at time X
        • In analytics we need facts, not state (a view)
        • Cannot build certain reports at all (the data is gone)
  21. How does gatling work?
      • Listens to the MySQL binary log stream
      • Captures change data for all tables, like:
        • User insert: (login = “saulius”, email = “[email protected]”, real_name = “…”, …)
        • User update: (before: login = “saulius”, after: login = “nebesaulius”)
        • User delete
      • Encodes to Avro and pushes to Kafka
      • Why do this?
        • We get facts for free (e.g. user creation, removal)
        • We could recreate table snapshots faster, not just once per day
        • Real-time data from MySQL (stream processing?)
        • We can recreate every table at any point in time (see the sketch below)
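      Why change capture lets us recreate every table at any point in time: replaying the ordered change stream up to a chosen position reproduces the table state as of that moment. A toy sketch (the change-record shape and field names are invented; gatling's real records are Avro-encoded):

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          public class SnapshotRebuildSketch {
              // A simplified, made-up change record; gatling's real records are Avro-encoded
              // and carry the before/after images captured from the binary log.
              record Change(String type, long id, Map<String, String> after) {}

              // Replay the change stream up to a cutoff position to get the table state
              // "as of" that point in time.
              static Map<Long, Map<String, String>> snapshotAt(List<Change> changes, int cutoff) {
                  Map<Long, Map<String, String>> rows = new HashMap<>();
                  for (Change c : changes.subList(0, cutoff)) {
                      switch (c.type()) {
                          case "insert", "update" -> rows.put(c.id(), c.after());
                          case "delete" -> rows.remove(c.id());
                      }
                  }
                  return rows;
              }

              public static void main(String[] args) {
                  List<Change> log = List.of(
                          new Change("insert", 1L, Map.of("login", "saulius")),
                          new Change("update", 1L, Map.of("login", "nebesaulius")),
                          new Change("delete", 1L, Map.of()));
                  System.out.println(snapshotAt(log, 2)); // {1={login=nebesaulius}}
                  System.out.println(snapshotAt(log, 3)); // {}
              }
          }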
  22. What’s next?
      • Precomputed metrics
      • Instead of this:

          SELECT CONCAT(ab.test_name, "_", ab.test_variant) test_id,
                 COUNT(DISTINCT t.user_id) c
          FROM tracking_events.us__users_ex users
          INNER JOIN batch_views.ab_tests ab
            ON ab.portal = 'us'
           AND ab.test_name = 'orl'
           AND cast(users.id as string) = ab.user_id
          INNER JOIN (SELECT tp.user_id
                      FROM tracking_events.us__user_sessions_ex tp
                      WHERE tp.session_date BETWEEN '2014-12-10' AND '2014-12-12') tp
            ON tp.user_id = users.id
          LEFT JOIN (SELECT t.user_id
                     FROM tracking_events.us__items_ex t
                     WHERE t.created_at BETWEEN '2014-12-10' AND '2014-12-12') t
            ON t.user_id = users.id
          GROUP BY ab.test_name, ab.test_variant, users.id
          ORDER BY ab.test_variant ASC

      • do this:

          SELECT lister_count
          FROM listers_metric
          WHERE date BETWEEN '2014-12-10' AND '2014-12-12'
            AND ab_test = 'orl'
          GROUP BY date, test_variant
  23. What’s next?
      • Unified metric computation
      • Metrics computed the same way for all reports, including A/B test result reporting
      • E.g. the same meaning of “active user”
  24. What’s next?
      • One day… OLAP-style reporting. Self-service: we make the data available, people create reports themselves.
  25. Utilize stream processing
      • Build real-time dashboards / reports
      • Detect anomalies in event streams: identify failures quicker than we get a new support ticket
      • Join event streams with application metric or logging streams for root-cause identification?
  26. Contributions back to production
      • Feed rework
      • Intelligent newsletters (like Pinterest sends)
      • Collaborative filtering?
      • … you name it!
  27. Thanks!
      • The Log: What every software engineer should know about real-time data's unifying abstraction
      • All Aboard the Databus! LinkedIn’s Scalable Consistent Change Data Capture Platform
      • Wormhole pub/sub system: Moving data through space and time
      • The “Big Data” Ecosystem at LinkedIn
      • The Unified Logging Infrastructure for Data Analytics at Twitter
      • Kafka: A Distributed Messaging System for Log Processing
      • Building LinkedIn’s Real-time Activity Data Pipeline