
A real-time data ingestion system or: How I learned to stop worrying and take good advice, by Maciej Arciuch at Big Data Spain 2015

Last year, during BDS14, two Allegro engineers shared their experience by presenting the pitfalls and mistakes made when implementing data ingestion pipelines in our company. This time, Maciej Arciuch presents a brand new design and shows how taking good advice can lead to a drastic design change, and how making mistakes can teach us a lot.

The design presented during BDS14 as an anti-pattern was an HDFS-centric system storing semi-structured but poorly documented JSON documents. Moreover, there was no buffer between the traffic and HDFS. Data was not monitored well enough, and low-level, error-prone APIs such as Hadoop MapReduce in Java were used to manipulate it. Data was available only in hourly or daily batches, and a large number of small files caused the name nodes to choke.

Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-29.html#spch29.1

Big Data Spain

October 22, 2015

Transcript

  1. A Real-Time Data Ingestion System Or: How I Learned to Stop Worrying and Love the Bomb Avro (Maciej Arciuch)
  2. Allegro.pl • biggest online auction website in Poland • sites in other countries • “Polish eBay” (but better!)
  3. Clickstream at Allegro.pl • how do our users behave? • ~400 M raw clickstream events daily • collected at the front-end • web and mobile devices • valuable source of information
  4. Legacy system • HDFS, Flume and MapReduce • main issues: ◦ batch processing - per hour or day ◦ data formats ◦ how to make data more accessible for others?
  5. How to do it better? (1) • stream processing: Spark Streaming and Kafka - data available “almost” instantly • new applications: security, recommendations & search
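
As a rough illustration of the Kafka-plus-Spark-Streaming path described above, here is a minimal sketch using the Spark 1.x direct-stream API; the broker addresses, topic name and batch interval are made-up placeholders, not the actual Allegro setup.

```scala
// Minimal Kafka -> Spark Streaming sketch (Spark 1.x direct-stream API).
// Broker list and topic name are hypothetical.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickstreamStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-ingestion")
    val ssc  = new StreamingContext(conf, Seconds(10))   // micro-batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("clickstream-raw")              // hypothetical topic

    // Each record is a raw event; parsing and enrichment logic would live here.
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    events.map(_._2).count().print()                      // placeholder for real processing

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The micro-batch interval is what makes the data available “almost” instantly rather than truly record-by-record, which is also the main con listed on the last slide.
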
  6. How to do it better? (2) Use Avro • mature software, good support in the Hadoop ecosystem • space-efficient • schema: structure + doc placeholder • the same format for stream and batch processing • backward/forward compatibility control
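
To make the “schema: structure + doc placeholder” point concrete, below is a sketch of what such an Avro schema might look like, parsed with Avro's Schema.Parser; the record and field names are illustrative, not the real clickstream schema.

```scala
// Sketch of an Avro schema with "doc" placeholders for a clickstream event.
// Record name, namespace and fields are assumptions for illustration only.
import org.apache.avro.Schema

val clickEventSchema: Schema = new Schema.Parser().parse("""
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "pl.allegro.clickstream",
  "doc": "A single clickstream event collected at the front-end",
  "fields": [
    {"name": "timestamp", "type": "long", "doc": "event time, epoch millis"},
    {"name": "userId", "type": ["null", "string"], "default": null, "doc": "logged-in user, if any"},
    {"name": "url", "type": "string", "doc": "requested URL"},
    {"name": "device", "type": "string", "doc": "web / mobile"}
  ]
}
""")
```
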
  7. How to do it better? (3) Create a central schema repository: • single source of truth • all elements of the system refer to the latest version • validate backward/forward compatibility on commit • immutable schemas • propagate info to Hive metastore, files, HTMLs
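
The “validate backward/forward compatibility on commit” step can be approximated with Avro's built-in SchemaValidatorBuilder; the sketch below assumes the repository keeps the previous versions of a schema and rejects a commit when mutual read compatibility fails. The helper name is hypothetical.

```scala
// Sketch of a commit-time compatibility check using Avro's SchemaValidatorBuilder.
import org.apache.avro.{Schema, SchemaValidationException, SchemaValidatorBuilder}
import scala.collection.JavaConverters._

def isBackwardAndForwardCompatible(candidate: Schema, previousVersions: Seq[Schema]): Boolean = {
  // mutualReadStrategy: old readers must read new data and vice versa
  val validator = new SchemaValidatorBuilder()
    .mutualReadStrategy()
    .validateAll()
  try {
    validator.validate(candidate, previousVersions.asJava)
    true
  } catch {
    case _: SchemaValidationException => false   // reject the commit
  }
}
```
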
  8. How to do it better? (5) New system: • two separate Kafka instances (buffer and destination) • if your infrastructure is down – you still collect data • collectors – only save HTTP requests, no logic • logic in Spark Streaming • dead letter queue – you can reprocess failed messages
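
A minimal sketch of the dead-letter-queue idea follows: events that fail parsing inside the Spark Streaming job are forwarded to a separate Kafka topic so they can be reprocessed later. The topic name, broker address and parse function are assumptions, not the actual implementation.

```scala
// Dead-letter-queue sketch: park malformed events on a separate Kafka topic.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// 'parse' stands in for the real Avro decoding logic; it throws on malformed input.
def withDeadLetterQueue(raw: DStream[(String, String)], parse: String => Array[Byte]): Unit =
  raw.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "dlq-broker:9092")   // hypothetical broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)

      records.foreach { case (_, value) =>
        try {
          val avroBytes = parse(value)
          // ... send avroBytes on towards the destination Kafka cluster ...
        } catch {
          case _: Exception =>
            // malformed event: keep it on the dead letter topic for later reprocessing
            producer.send(new ProducerRecord[String, String]("clickstream-dlq", value))
        }
      }
      producer.close()
    }
  }
```

Keeping the collectors logic-free and pushing all parsing into the streaming job is what makes this split possible: the buffer Kafka keeps accepting raw HTTP requests even when the processing side is down.
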
  9. How to do it better? (6) New system: • data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool) • Hive tables and partitions created automatically (look for camus2hive on GitHub)
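
The camus2hive step essentially registers each hourly Camus output directory as a Hive partition. The sketch below only generates the corresponding HiveQL; the table name, partition columns and path layout are assumptions.

```scala
// Sketch: build the ALTER TABLE statement that maps one hourly Camus
// output directory to a Hive partition. Names and paths are hypothetical.
def addPartitionStatement(day: String, hour: String): String =
  s"""ALTER TABLE clickstream_events ADD IF NOT EXISTS
     |PARTITION (dt='$day', hr='$hour')
     |LOCATION '/data/clickstream/$day/$hour'""".stripMargin

// e.g. run the generated HiveQL with the Hive CLI or beeline:
println(addPartitionStatement("2015-10-16", "13"))
```
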
  10. Why Spark Streaming? • pros: ◦ momentum ◦ good integration with YARN - better resource utilization, easy scaling ◦ good integration with Kafka ◦ reuse batch Spark code • cons: ◦ micro-batching ◦ as complex as Spark