
A real-time data ingestion system or: How I learned to stop worrying and take good advice, by Maciej Arciuch at Big Data Spain 2015

Last year, during BDS14, two Allegro engineers shared their experience by presenting the pitfalls and mistakes made when implementing data ingestion pipelines in our company. This time, Maciej Arciuch presents a brand new design and shows how taking good advice can lead to a drastic design change, and how making mistakes can teach us a lot.

The design presented during BDS14 as an anti-pattern was an HDFS-centric system storing semi-structured but poorly documented JSON documents. Moreover, there was no buffer between the traffic and HDFS. Data was not monitored well enough, and low-level, error-prone APIs such as Hadoop MapReduce in Java were used to manipulate it. Data was available only in hourly or daily batches, and a large number of small files caused the name nodes to choke.

Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-29.html#spch29.1

Big Data Spain

October 22, 2015

Transcript

  1. A Real-Time Data Ingestion System Or: How I Learned to Stop Worrying and Love the Bomb Avro (Maciej Arciuch)
  2. Allegro.pl • biggest online auction website in Poland • sites in other countries • “Polish eBay” (but better!)
  3. Clickstream at Allegro.pl • how do our users behave? • ~400 M raw clickstream events daily • collected at the front-end • web and mobile devices • valuable source of information
  4. Legacy system • HDFS, Flume and MapReduce • main issues: ◦ batch processing - per hour or day ◦ data formats ◦ how to make data more accessible for others?
  5. How to do it better? (1) • stream processing: Spark Streaming and Kafka - data available “almost” instantly • new applications: security, recommendations & search
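
As a rough illustration of the Kafka-plus-Spark-Streaming path described above, here is a minimal sketch using the Spark 1.x direct-stream API; the broker addresses, topic name and batch interval are made-up placeholders, not the actual Allegro setup.

```scala
// Minimal Kafka -> Spark Streaming sketch (Spark 1.x direct-stream API).
// Broker list and topic name are hypothetical.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ClickstreamStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-ingestion")
    val ssc  = new StreamingContext(conf, Seconds(10))   // micro-batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("clickstream-raw")              // hypothetical topic

    // Each record is a raw event; parsing and enrichment logic would live here.
    val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    events.map(_._2).count().print()                      // placeholder for real processing

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The micro-batch interval is what makes the data available “almost” instantly rather than truly record-by-record, which is also the main con listed on the last slide.
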
  6. How to do it better? (2) Use Avro • mature software, good support in the Hadoop ecosystem • space-efficient • schema: structure + doc placeholder • the same format for stream and batch processing • backward/forward compatibility control
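
To make the “schema: structure + doc placeholder” point concrete, below is a sketch of what such an Avro schema might look like, parsed with Avro's Schema.Parser; the record and field names are illustrative, not the real clickstream schema.

```scala
// Sketch of an Avro schema with "doc" placeholders for a clickstream event.
// Record name, namespace and fields are assumptions for illustration only.
import org.apache.avro.Schema

val clickEventSchema: Schema = new Schema.Parser().parse("""
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "pl.allegro.clickstream",
  "doc": "A single clickstream event collected at the front-end",
  "fields": [
    {"name": "timestamp", "type": "long", "doc": "event time, epoch millis"},
    {"name": "userId", "type": ["null", "string"], "default": null, "doc": "logged-in user, if any"},
    {"name": "url", "type": "string", "doc": "requested URL"},
    {"name": "device", "type": "string", "doc": "web / mobile"}
  ]
}
""")
```
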
  7. How to do it better? (3) Create a central schema repository: • single source of truth • all elements of the system refer to the latest version • validate backward/forward compatibility on commit • immutable schemas • propagate info to Hive metastore, files, HTMLs
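
The “validate backward/forward compatibility on commit” step can be approximated with Avro's built-in SchemaValidatorBuilder; the sketch below assumes the repository keeps the previous versions of a schema and rejects a commit when mutual read compatibility fails. The helper name is hypothetical.

```scala
// Sketch of a commit-time compatibility check using Avro's SchemaValidatorBuilder.
import org.apache.avro.{Schema, SchemaValidationException, SchemaValidatorBuilder}
import scala.collection.JavaConverters._

def isBackwardAndForwardCompatible(candidate: Schema, previousVersions: Seq[Schema]): Boolean = {
  // mutualReadStrategy: old readers must read new data and vice versa
  val validator = new SchemaValidatorBuilder()
    .mutualReadStrategy()
    .validateAll()
  try {
    validator.validate(candidate, previousVersions.asJava)
    true
  } catch {
    case _: SchemaValidationException => false   // reject the commit
  }
}
```
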
  8. How to do it better? (5) New system: • two separate Kafka instances (buffer and destination) • if your infrastructure is down – you still collect data • collectors – only save HTTP requests, no logic • logic in Spark Streaming • dead letter queue – you can reprocess failed messages
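
A minimal sketch of the dead-letter-queue idea follows: events that fail parsing inside the Spark Streaming job are forwarded to a separate Kafka topic so they can be reprocessed later. The topic name, broker address and parse function are assumptions, not the actual implementation.

```scala
// Dead-letter-queue sketch: park malformed events on a separate Kafka topic.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// 'parse' stands in for the real Avro decoding logic; it throws on malformed input.
def withDeadLetterQueue(raw: DStream[(String, String)], parse: String => Array[Byte]): Unit =
  raw.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "dlq-broker:9092")   // hypothetical broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)

      records.foreach { case (_, value) =>
        try {
          val avroBytes = parse(value)
          // ... send avroBytes on towards the destination Kafka cluster ...
        } catch {
          case _: Exception =>
            // malformed event: keep it on the dead letter topic for later reprocessing
            producer.send(new ProducerRecord[String, String]("clickstream-dlq", value))
        }
      }
      producer.close()
    }
  }
```

Keeping the collectors logic-free and pushing all parsing into the streaming job is what makes this split possible: the buffer Kafka keeps accepting raw HTTP requests even when the processing side is down.
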
  9. How to do it better? (6) New system: • data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool) • Hive tables and partitions created automatically (look for camus2hive on GitHub)
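
The camus2hive step essentially registers each hourly Camus output directory as a Hive partition. The sketch below only generates the corresponding HiveQL; the table name, partition columns and path layout are assumptions.

```scala
// Sketch: build the ALTER TABLE statement that maps one hourly Camus
// output directory to a Hive partition. Names and paths are hypothetical.
def addPartitionStatement(day: String, hour: String): String =
  s"""ALTER TABLE clickstream_events ADD IF NOT EXISTS
     |PARTITION (dt='$day', hr='$hour')
     |LOCATION '/data/clickstream/$day/$hour'""".stripMargin

// e.g. run the generated HiveQL with the Hive CLI or beeline:
println(addPartitionStatement("2015-10-16", "13"))
```
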
  10. Why Spark Streaming? • pros: ◦ momentum ◦ good integration with YARN - better resource utilization, easy scaling ◦ good integration with Kafka ◦ reuse batch Spark code • cons: ◦ micro-batching ◦ as complex as Spark