Leveraging Apache Kafka for Web Crawling and Data Processing

Fluquid Ltd.
November 21, 2017

Apache Kafka has become the de facto standard for distributed message passing, processing, and storage of data. At Fluquid, we use Kafka to decouple and asynchronously connect multiple applications, and to store data in the context of web crawling and big data processing.

Transcript

  1. Leveraging Apache Kafka for Web Crawling and Data Processing • OpenStack Cork, 2017-11-21 • Image: https://softwareengineeringdaily.com/wp-content/uploads/2015/08/kafka-logo-wide.png
  2. About Me • Johannes Ahlmann • fluquid.com • Sales & Client Intelligence • Intelligent Lead Generation • Large-scale web crawls • Gathering and Enriching Web Data • webdata.org • Share Libraries and Best Practices • Bring Data Scientists and SME Companies together • ForDevelopers • AwesomeAvailableDatasets • Contact: johannes@fluquid.com fluquid
  3. Background: Queues & PubSub • PubSub • multiple subscribers • no scaling • Queues • multiple consumers • single-subscriber • scaling, load balancing
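
The contrast on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Kafka code: the class names `PubSub` and `WorkQueue` and the URL strings are hypothetical, and round-robin is just one possible way to balance load.

```python
class PubSub:
    """Fan-out: every subscriber receives its own copy of each message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        inbox = []
        self.subscribers.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self.subscribers:
            inbox.append(message)


class WorkQueue:
    """Load balancing: each message goes to exactly one consumer."""

    def __init__(self):
        self.consumers = []
        self._next = 0

    def add_consumer(self):
        inbox = []
        self.consumers.append(inbox)
        return inbox

    def publish(self, message):
        # Round-robin over consumers (an assumption made for this sketch).
        inbox = self.consumers[self._next % len(self.consumers)]
        inbox.append(message)
        self._next += 1


def demo():
    pubsub, queue = PubSub(), WorkQueue()
    ps_a, ps_b = pubsub.subscribe(), pubsub.subscribe()
    q_a, q_b = queue.add_consumer(), queue.add_consumer()
    for url in ["u1", "u2", "u3", "u4"]:
        pubsub.publish(url)
        queue.publish(url)
    return ps_a, ps_b, q_a, q_b
```

In `demo()`, both PubSub subscribers see all four messages, while the two queue consumers split them between each other.
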
  4. A high-throughput distributed messaging system • Decouples Data Pipelines • Scalable & Fault-Tolerant • Kafka Functionalities • Messaging • Processing • Storing • Performance (>100k/s) • Batching • Zero Copy I/O • Leverages OS Cache • Durability Image Credit: Confluent
  5. Core Concepts • Producers • Consumers • Brokers • Topics • ZooKeeper • Offsets • Broker Addresses Image Credit: Confluent
  6. Key Idea: Partitioned Log • Very fast, due to zero copy I/O and batching • Uses sendfile and OS buffer cache • Sequential writes to FS • Order guaranteed within partition • Scaling Image Credit: Confluent
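
The partitioned-log idea can be illustrated with a toy in-memory model. The sketch below is not how Kafka stores data on disk. `PartitionedLog` and its methods are hypothetical names, and it only shows that each partition is an append-only sequence addressed by offset, with ordering guaranteed inside a partition but not across partitions.

```python
class PartitionedLog:
    """Toy model: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append a record and return its offset within that partition."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition, offset, max_records=10):
        """Read sequentially starting at `offset`; records are never removed by reading."""
        return self.partitions[partition][offset:offset + max_records]


log = PartitionedLog(num_partitions=2)
first = log.append(0, "page-1")
second = log.append(0, "page-2")
other = log.append(1, "page-3")  # offsets are independent per partition
```

Because reads do not consume records, two readers can fetch the same offset range independently, which is what makes replay and multiple subscribers cheap.
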
  7. Logs & PubSub • Consumers can be transient • Consumer Groups • Delivery Semantics • at least once (default) • at most once • exactly once • Retention Policy • Reprocessing Image Credit: Confluent
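
The difference between at-least-once and at-most-once comes down to whether the consumer commits its offset after or before processing a record. The following is a minimal simulation under that assumption. `run_once` and `crash_at` are hypothetical names, the "crash" is simulated by returning early, and no Kafka client is involved.

```python
def run_once(log, committed, handled, commit_first, crash_at=None):
    """Consume `log` from the committed offset and return the new committed offset.

    commit_first=False: process, then commit (at least once).
    commit_first=True:  commit, then process (at most once).
    crash_at: offset at which the consumer dies between the two steps.
    """
    for offset in range(committed, len(log)):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return committed  # committed but never processed: record is lost
            handled.append(log[offset])
        else:
            handled.append(log[offset])
            if offset == crash_at:
                return committed  # processed but not committed: record is redelivered
            committed = offset + 1
    return committed


records = ["a", "b", "c"]

at_least_once = []
pos = run_once(records, 0, at_least_once, commit_first=False, crash_at=1)
run_once(records, pos, at_least_once, commit_first=False)  # restart after crash

at_most_once = []
pos = run_once(records, 0, at_most_once, commit_first=True, crash_at=1)
run_once(records, pos, at_most_once, commit_first=True)  # restart after crash
```

After the restart, `at_least_once` contains `"b"` twice and `at_most_once` does not contain it at all. Exactly-once is not shown here; it needs idempotent or transactional processing on top of this.
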
  8. Partitions & Replication • Partitions configurable • Partition allocation • round-robin • semantic partition by key • Replication • optional • 1 leader, 0 or more followers • sync or async • flush delay configurable Image Credit: Confluent
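
The two allocation strategies on this slide can be sketched as follows. This is an illustration only: the `Partitioner` class is hypothetical, and `zlib.crc32` is used as a stand-in stable hash, so the partition numbers it produces will not match those of a real Kafka client. Keying by domain is an assumed use case for a crawler, chosen so that all URLs of one site stay in order within one partition.

```python
import itertools
import zlib


class Partitioner:
    """Round-robin for records without a key, hash-based for records with one."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._round_robin)
        # Same key always maps to the same partition (stand-in hash, see above).
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = Partitioner(num_partitions=3)
keyless = [partitioner.partition() for _ in range(4)]
by_domain = [partitioner.partition("example.com") for _ in range(3)]
```

Here `keyless` cycles through the partitions in turn, while every entry in `by_domain` is the same partition number.
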
  9. Kafka Connect Image Credit: Confluent

  10. Kafka Streams • Operations • filter • map • join • aggregate • KStream • KTable • manages local state • Windows stateful Image Credit: Confluent
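
Kafka Streams itself is a Java library, so the following is not its API. It is a plain-Python sketch of what a filter, map, and windowed aggregate pipeline computes, with the result dictionary playing the role of a KTable-like local state. The function `windowed_count`, the event fields `ts`, `key`, and `status`, and the tumbling-window scheme are all assumptions made for illustration.

```python
from collections import defaultdict


def windowed_count(events, window_size, keep=lambda e: True, key_of=lambda e: e["key"]):
    """Count events per (key, window start) over a stream of dict events."""
    table = defaultdict(int)  # local state, updated record by record
    for event in events:
        if not keep(event):  # filter
            continue
        key = key_of(event)  # map: derive the grouping key
        window_start = event["ts"] - event["ts"] % window_size  # tumbling window
        table[(key, window_start)] += 1  # aggregate
    return dict(table)


crawl_events = [
    {"ts": 1, "key": "a.com", "status": 200},
    {"ts": 5, "key": "a.com", "status": 200},
    {"ts": 12, "key": "a.com", "status": 200},
    {"ts": 3, "key": "b.com", "status": 404},
]

ok_per_window = windowed_count(
    crawl_events,
    window_size=10,
    keep=lambda e: e["status"] == 200,
)
```

With a window size of 10, the two early `a.com` events fall into the window starting at 0, the third into the window starting at 10, and the `b.com` event is filtered out.
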
  11. Summary Image Credit: Confluent