
Data ingestion at Vinted



Talk on data ingestion at Vinted given at Big Data Vilnius meetup (http://www.meetup.com/Vilnius-Hadoop-Meetup/events/220085043/).


Saulius Grigaliunas

February 05, 2015

Transcript

  1. Agenda
     • Event ingestion
     • How does it work?
     • Change (MySQL data) ingestion
     • How will it work?
     • Why doesn’t it work yet?
     • What’s next?
     • Why do all of this?
     • What else do we have in store?
  2. Data sources
     • Event data
     • MySQL data (table dumps/snapshots)
     • Imported data (via CSV)
     • Data from 3rd party APIs (Facebook ads api, Zendesk api, adjust.io, etc)
  3. Big data ingestion @ Vinted
     • up to 19 billion events / month
     • up to 0.7 billion events / day
     • growing
  4. Event ingestion: not so positive stats
     • 500 000 invalid events / day
     • ~60% Android, ~35% web, ~5% iOS
     • Slowly getting better
  5. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  6. A high throughput distributed messaging system
     • Distributed: partitions messages across multiple nodes
     • Reliable: messages replicated across multiple nodes
     • Persistent: all messages are persisted to disk
     • Performant: works just fine with 70 000 message writes per second. Up to 700Mb/s (when replicating/rebalancing).
  7. • Broker - Kafka server
     • Producer - N producers send messages to Brokers
     • Consumer - N consumers read messages from Brokers. Each at its own pace.
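
     For illustration, a minimal producer/consumer round trip with the kafka-python client; the broker address is a placeholder, not Vinted's setup, and the events-hx topic name is borrowed from a later slide:

        # Minimal producer/consumer round trip with the kafka-python client.
        # The broker address is a placeholder; the events-hx topic comes from later slides.
        from kafka import KafkaConsumer, KafkaProducer

        producer = KafkaProducer(bootstrap_servers="broker1:9092")
        producer.send("events-hx", b'{"event": "user.view_item", "user_id": 42}')
        producer.flush()  # make sure the message actually reaches the brokers

        # Each consumer group tracks its own offsets, so consumers read at their own pace.
        consumer = KafkaConsumer(
            "events-hx",
            bootstrap_servers="broker1:9092",
            group_id="example-reader",
            auto_offset_reset="earliest",
        )
        for message in consumer:
            print(message.offset, message.value)
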
  8. @ Vinted
     • 6 Broker nodes
     • ~4TB of usable space
     • Allows us to safely keep 3+ weeks worth of event tracking data (= Hadoop can be down for 3 weeks without data loss)
  9. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  10. http-relay
     • Node.JS
     • Listens on HTTP
     • Accepts JSON event batches
     • Unwraps the batch into single event JSON objects
     • Enriches each object with metadata coming from HTTP request headers
     • Sends all events to UDP (svc-udp-kafka-bridge)
     • CI, continuous deploys
     • A single instance can be down
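
     The real http-relay is Node.JS; below is a minimal Python/Flask sketch of the same flow (accept a JSON batch, unwrap it, enrich each event from request headers, forward one UDP datagram per event). The port numbers and header fields are assumptions:

        # Python/Flask sketch of the http-relay flow (the real service is Node.JS).
        # The port numbers and header names are illustrative assumptions.
        import json
        import socket
        from flask import Flask, request

        app = Flask(__name__)
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        BRIDGE_ADDR = ("127.0.0.1", 5140)  # local svc-udp-kafka-bridge instance (assumed port)

        @app.route("/events", methods=["POST"])
        def relay_batch():
            batch = request.get_json(force=True)  # a JSON event batch: a list of event objects
            for event in batch:  # unwrap the batch into single events
                # enrich each event with metadata from the HTTP request
                event["user_agent"] = request.headers.get("User-Agent")
                event["remote_ip"] = request.remote_addr
                udp.sendto(json.dumps(event).encode("utf-8"), BRIDGE_ADDR)  # one datagram per event
            return "", 204

        if __name__ == "__main__":
            app.run(port=8080)
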
  11. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  12. svc-udp-kafka-bridge
     • Clojure
     • ~500 LOC
     • Listens on a UDP port for event JSON objects and relays them to a Kafka broker
     • 47 instances deployed, ~6-10K requests per second through all instances, more during peak hours
     • CI, continuous deploys (a deploy takes about a minute)
     • Uses a clever socket trick (SO_REUSEADDR) to keep processing during deploys
     • Has an internal in-memory buffer used in case Kafka is down; it only lasts ~1 minute depending on load
     • Mission critical: both instances on a host cannot be down at the same time
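
     The real svc-udp-kafka-bridge is ~500 lines of Clojure; the Python sketch below only shows the core loop (a SO_REUSEADDR bind, relaying each datagram to Kafka, and a crude stand-in for the in-memory fallback buffer). The UDP port and broker address are assumptions:

        # Python sketch of the svc-udp-kafka-bridge core loop (the real service is Clojure).
        # The UDP port and broker address are assumed.
        import socket
        from kafka import KafkaProducer

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # SO_REUSEADDR is the socket trick from the slide: a freshly deployed instance can
        # bind the port while the old one is still finishing, so processing never stops.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("0.0.0.0", 5140))

        producer = KafkaProducer(bootstrap_servers="broker1:9092")
        buffer = []  # crude stand-in for the short-lived in-memory buffer used when Kafka is down

        while True:
            payload, _addr = sock.recvfrom(65535)  # one JSON event per datagram
            try:
                producer.send("events-hx", payload)  # raw JSON events go to the events-hx topic
            except Exception:
                buffer.append(payload)  # hold on to events briefly if the brokers are unreachable
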
  13. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  14. svc-schema-registry aka Škėma service
     • Ruby
     • Stateless JSON schema file server
     • + declarative ETL command generator
     • Centralized, human-readable event registry page (field type descriptions, field purpose descriptions)
     • From just engage.inactive_seller.push_notification_sent to full descriptions of each field
     • Schemas are used to encode event JSON files to binary Avro payloads
     • CI, continuous deploys
     • A deploy triggers redeploys of dependent services
     • Schemas automatically update in Hive tables (periodically)
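
     A hedged sketch of how a client of svc-schema-registry might fetch a schema and validate an event, using the Python requests and jsonschema packages; the endpoint URL, schema layout and event fields are assumptions, not the registry's actual API:

        # Hypothetical client of svc-schema-registry: fetch a JSON schema, validate an event.
        # The endpoint URL, schema shape and event fields are assumptions.
        import requests
        from jsonschema import ValidationError, validate

        schema = requests.get(
            "http://svc-schema-registry.internal/schemas/user.view_item.json"  # assumed endpoint
        ).json()

        event = {"event": "user.view_item", "user_id": 42, "item_id": 1001}

        try:
            validate(instance=event, schema=schema)  # JSON Schema validation
        except ValidationError as err:
            print("invalid event:", err.message)  # invalid events get counted/rejected downstream
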
  15. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  16. svc-event-passage
     • Clojure
     • Stream processor
     • Flow:
        • Fetch event schemas from svc-schema-registry
        • Validate and encode events pushed by svc-udp-kafka-bridge to Avro payloads
        • Push back to Kafka
     • JSON payload from the events-hx topic becomes an Avro byte payload in event.user.view_screen, event.user.view_item etc. topics (one for each event type, ~200 in total)
     • A single instance processes ~1500+ events per second during peak hours
     • Can be down for ~5 days
     • CI, continuous deploys
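
     The real svc-event-passage is Clojure; the Python sketch below shows just the validate-and-encode step with fastavro, reading JSON from events-hx and writing Avro bytes to a per-event-type topic. The Avro schema fields and broker address are assumptions; the topic names follow the slide:

        # Python sketch of the validate-and-encode step (the real svc-event-passage is Clojure).
        # The Avro schema fields and broker address are assumptions; topic names follow the slide.
        import io
        import json
        from fastavro import parse_schema, schemaless_writer
        from kafka import KafkaConsumer, KafkaProducer

        schema = parse_schema({
            "type": "record",
            "name": "view_item",
            "namespace": "event.user",
            "fields": [
                {"name": "user_id", "type": "long"},
                {"name": "item_id", "type": "long"},
            ],
        })

        consumer = KafkaConsumer("events-hx", bootstrap_servers="broker1:9092", group_id="event-passage")
        producer = KafkaProducer(bootstrap_servers="broker1:9092")

        for message in consumer:
            event = json.loads(message.value)  # raw JSON pushed by svc-udp-kafka-bridge
            out = io.BytesIO()
            schemaless_writer(out, schema, event)  # encode the event to a binary Avro payload
            producer.send("event.user.view_item", out.getvalue())  # one topic per event type
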
  17. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  18. Camus
     • Java
     • LinkedIn project
     • MapReduce batch job (not a service!)
     • Periodically offloads Avro payloads from event topics (event.user.view_screen, event.user.view_item etc.) to the Hadoop Distributed File System (HDFS)
     • + does small transformations (ETL tasks) while offloading
     • Runs periodically, deployed from the warehouse-jobs repo
     • Runs are scheduled via Oozie
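
     Camus itself is a Java MapReduce job; purely to illustrate the periodic Kafka-to-HDFS offload idea, here is a hedged Python sketch. The WebHDFS endpoint, topic, file layout and batching are assumptions and not how Camus actually works:

        # Conceptual Kafka -> HDFS offload; this is NOT how Camus works internally
        # (Camus is a Java MapReduce job). Topic, paths and the WebHDFS endpoint are assumptions.
        from datetime import datetime
        from hdfs import InsecureClient  # WebHDFS client from the `hdfs` package
        from kafka import KafkaConsumer

        hdfs_client = InsecureClient("http://namenode:50070", user="etl")
        consumer = KafkaConsumer(
            "event.user.view_item",
            bootstrap_servers="broker1:9092",
            group_id="offload-job",
            consumer_timeout_ms=10_000,  # stop iterating once the topic is drained for this run
        )

        # Collect the Avro payloads produced by svc-event-passage since the last run.
        batch = [message.value for message in consumer]
        if batch:
            path = datetime.utcnow().strftime("/data/events/user_view_item/%Y/%m/%d/%H%M.bin")
            # Concatenated payloads as a stand-in; Camus writes proper Avro container files.
            hdfs_client.write(path, data=b"".join(batch))
        consumer.commit()  # remember the offsets for the next scheduled run
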
  19. HTTP -> UDP • UDP -> Kafka • Kafka -> Kafka • Kafka -> HDFS • MySQL -> Kafka • Pre-production
  20. MySQL imports to HDFS
     • Current situation
        • Dump a predefined list of MySQL tables to HDFS nightly
        • We have ~1 day old snapshots of MySQL tables
     • Issues
        • Unreliable tooling
        • Lots of traffic each night to retransfer the same data
        • We still get only a snapshot - table state at time X
        • In analytics we need facts, not state (a view)
        • Cannot build certain reports at all (the data is gone)
  21. How does gatling work?
     • Listens to the MySQL binary log stream
     • Captures change data for all tables, like:
        • User insert: (login = “saulius”, email = “[email protected]”, real_name = “…”, …)
        • User update: (before: login = “saulius”, after: login = “nebesaulius”)
        • User delete
     • Encodes to Avro and pushes to Kafka
     • Why do this?
        • We get facts for free (e.g. user creation, removal)
        • We could recreate table snapshots faster, not just once per day
        • Real time data from MySQL (stream processing?)
        • We can recreate every table at any point in time
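
     A hedged Python sketch of the same change-capture idea using the python-mysql-replication package; gatling is Vinted's own tool, and the connection settings, server_id and topic naming here are placeholders:

        # Change-data-capture sketch with the python-mysql-replication package
        # (gatling is Vinted's own tool; settings, server_id and topic naming are placeholders).
        import json
        from kafka import KafkaProducer
        from pymysqlreplication import BinLogStreamReader
        from pymysqlreplication.row_event import (
            DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
        )

        producer = KafkaProducer(bootstrap_servers="broker1:9092")
        stream = BinLogStreamReader(
            connection_settings={"host": "mysql-master", "port": 3306, "user": "repl", "passwd": "secret"},
            server_id=100,  # must be unique among replication clients
            only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
            blocking=True,  # keep listening to the binary log stream
        )

        for event in stream:  # one event per committed row change
            for row in event.rows:  # inserts/deletes carry "values", updates "before_values"/"after_values"
                change = {"table": event.table, "type": type(event).__name__, "row": row}
                # The real pipeline encodes to Avro; JSON keeps the sketch short.
                producer.send("mysql.changes." + event.table,
                              json.dumps(change, default=str).encode("utf-8"))
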
  22. What’s next?
     • Precomputed metrics
     • Instead of:
        SELECT CONCAT(ab.test_name, "_", ab.test_variant) test_id, COUNT(DISTINCT t.user_id) c
        FROM tracking_events.us__users_ex users
        INNER JOIN batch_views.ab_tests ab
          ON ab.portal = 'us' AND ab.test_name = 'orl' AND cast(users.id as string) = ab.user_id
        INNER JOIN (SELECT tp.user_id FROM tracking_events.us__user_sessions_ex tp
                    WHERE tp.session_date BETWEEN '2014-12-10' AND '2014-12-12') tp
          ON tp.user_id = users.id
        LEFT JOIN (SELECT t.user_id FROM tracking_events.us__items_ex t
                   WHERE t.created_at BETWEEN '2014-12-10' AND '2014-12-12') t
          ON t.user_id = users.id
        GROUP BY ab.test_name, ab.test_variant, users.id
        ORDER BY ab.test_variant ASC
     • Do this:
        SELECT lister_count FROM listers_metric
        WHERE date BETWEEN '2014-12-10' AND '2014-12-12' AND ab_test = 'orl'
        GROUP BY date, test_variant
  23. What’s next?
     • Unified metric computation
     • Metrics computed the same way for all reports, including AB test result reporting
     • E.g. same meaning of “active user”
  24. What’s next?
     • One day… OLAP-style reporting. Self-service: we make the data available - people create reports themselves.
  25. Utilize stream processing
     • Build real-time dashboards / reports
     • Detect anomalies in event streams - identify failures quicker than we get a new support ticket
     • Join event streams with application metric or logging streams for root cause identification?
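
     As one illustration of the anomaly-detection idea, a toy Python sketch that counts events per minute from the events-hx topic and flags a sudden drop; the topic, window size and threshold are assumptions:

        # Toy anomaly check: count events per minute and flag a sudden drop.
        # The topic, window size and "below half the recent average" rule are assumptions.
        import time
        from collections import deque
        from kafka import KafkaConsumer

        consumer = KafkaConsumer("events-hx", bootstrap_servers="broker1:9092", group_id="anomaly-check")

        recent = deque(maxlen=15)  # per-minute counts for the last 15 minutes
        current_minute, count = None, 0

        for message in consumer:
            minute = int(time.time() // 60)  # wall-clock minute bucket
            if current_minute is None:
                current_minute = minute
            if minute != current_minute:
                if len(recent) == recent.maxlen and count < 0.5 * (sum(recent) / len(recent)):
                    print("possible ingestion problem: only", count, "events in the last minute")
                recent.append(count)
                current_minute, count = minute, 0
            count += 1
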
  26. Contributions back to production
     • Feed rework
     • Intelligent newsletters (like Pinterest sends)
     • Collaborative filtering?
     • … you name it!
  27. Thanks!
     • The Log: What every software engineer should know about real-time data’s unifying abstraction
     • All Aboard the Databus! LinkedIn’s Scalable Consistent Change Data Capture Platform
     • Wormhole pub/sub system: Moving data through space and time
     • The “Big Data” Ecosystem at LinkedIn
     • The Unified Logging Infrastructure for Data Analytics at Twitter
     • Kafka: A Distributed Messaging System for Log Processing
     • Building LinkedIn’s Real-time Activity Data Pipeline