Leveraging Apache Kafka for Web Crawling and Data Processing

Fluquid Ltd.
November 21, 2017

Apache Kafka has become the de facto standard for distributed message passing, processing, and storage of data. At Fluquid we use Kafka to decouple and asynchronously connect multiple applications, and to store data for web crawling and big data processing.

Transcript

  1. Leveraging Apache Kafka
    for Web Crawling
    and Data Processing
    OpenStack Cork, 2017-11-21
    Image: https://softwareengineeringdaily.com/wp-content/uploads/2015/08/kafka-logo-wide.png

  2. About Me
    • Johannes Ahlmann
    • fluquid.com
        • Sales & Client Intelligence
        • Intelligent Lead Generation
        • Large-scale web crawls
        • Gathering and Enriching Web Data
    • webdata.org
        • Share Libraries and Best Practices
        • Bring Data Scientists and SME Companies together
        • ForDevelopers
        • AwesomeAvailableDatasets
    • Contact:
        [email protected]
        fluquid

  3. Background: Queues & PubSub
    • PubSub
        • multiple subscribers
        • no scaling
    • Queues
        • multiple consumers
        • single-subscriber
        • scaling, load balancing
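
The contrast between the two models can be sketched in a few lines of Python. This is an illustration only, not Kafka code: the function names are hypothetical, and round-robin is just one assumed way a queue might balance load across consumers.

```python
from itertools import cycle


def pubsub_deliver(messages, subscribers):
    """PubSub: every subscriber receives a copy of every message."""
    return {s: list(messages) for s in subscribers}


def queue_deliver(messages, consumers):
    """Queue: each message goes to exactly one consumer (round-robin here)."""
    delivered = {c: [] for c in consumers}
    for msg, consumer in zip(messages, cycle(consumers)):
        delivered[consumer].append(msg)
    return delivered


if __name__ == "__main__":
    print(pubsub_deliver(["a", "b"], ["s1", "s2"]))
    print(queue_deliver(["a", "b", "c"], ["c1", "c2"]))
```

Adding subscribers to the PubSub version does not reduce the work per subscriber, whereas adding consumers to the queue version splits the messages among them.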

  4. A high-throughput distributed messaging system
    • Decouples Data Pipelines
    • Scalable & Fault-Tolerant
    • Kafka Functionalities
        • Messaging
        • Processing
        • Storing
    • Performance (>100k/s)
        • Batching
        • Zero Copy I/O
        • Leverages OS Cache
    • Durability
    Image Credit: Confluent

  5. Core Concepts
    • Producers
    • Consumers
    • Brokers
    • Topics
    • ZooKeeper
    • Offsets
    • Broker Addresses
    Image Credit: Confluent

  6. Key Idea: Partitioned Log
    • Very fast, due to zero copy I/O and batching
    • Uses sendfile and OS buffer cache
    • Sequential writes to FS
    • Order guaranteed within partition
    • Scaling
    Image Credit: Confluent
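
As a rough mental model of the partitioned log, the sketch below keeps one append-only list per partition and hands back an offset on every append. It is an in-memory toy under that assumption; the class and method names are hypothetical and unrelated to any Kafka client API.

```python
class PartitionedLog:
    """Toy model: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append a record and return its offset within that partition."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition, offset, max_records=10):
        """Read records sequentially, starting at the given offset."""
        return self.partitions[partition][offset:offset + max_records]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=2)
    log.append(0, "page-1")
    log.append(1, "page-2")
    log.append(0, "page-3")
    print(log.read(0, 0))  # records of partition 0, in append order
```

Order is only defined inside a partition: nothing here says whether "page-2" came before or after "page-3", which is the trade-off that allows partitions to be spread across machines.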

  7. Logs & PubSub
    • Consumers can be transient
    • Consumer Groups
    • Delivery Semantics
        • at least once (default)
        • at most once
        • exactly once
    • Retention Policy
    • Reprocessing
    Image Credit: Confluent
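
The difference between at-least-once and at-most-once comes down to whether the consumer commits its offset after or before handling a record. The simulation below is a sketch under that assumption: `run_consumer` and `crash_at` are hypothetical names, the log is a plain list, and a "crash" is simply an early return between the two steps. Exactly-once is not modelled here.

```python
def run_consumer(log, start, handled, commit_first, crash_at=None):
    """Consume `log` from offset `start`; return the last committed offset.

    commit_first=True  models at-most-once  (commit, then process).
    commit_first=False models at-least-once (process, then commit).
    crash_at simulates a failure between the two steps for that offset.
    """
    committed = start
    for offset in range(start, len(log)):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return committed  # committed but never processed: record is lost
            handled.append(log[offset])
        else:
            handled.append(log[offset])
            if offset == crash_at:
                return committed  # processed but not committed: record is redelivered
            committed = offset + 1
    return committed


if __name__ == "__main__":
    log = ["a", "b", "c"]
    seen = []
    pos = run_consumer(log, 0, seen, commit_first=False, crash_at=1)
    run_consumer(log, pos, seen, commit_first=False)  # restart from the committed offset
    print(seen)  # "b" shows up twice under at-least-once
```

Because the log is retained, restarting from an older committed offset is also how reprocessing works in this model.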

  8. Partitions & Replication
    • Partitions configurable
    • Partition allocation
        • round-robin
        • semantic partition by key
    • Replication
        • optional
        • 1 leader, 0 or more followers
        • sync or async
        • flush delay configurable
    Image Credit: Confluent
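
The two partition allocation strategies can be illustrated as follows. This is a sketch, not the Kafka partitioner: `zlib.crc32` stands in for the murmur2 hash that Kafka's Java client applies to keys, and the names `key_partition` and `RoundRobin` are hypothetical.

```python
import zlib


def key_partition(key: bytes, num_partitions: int) -> int:
    """Semantic partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key) % num_partitions


class RoundRobin:
    """Keyless allocation: cycle through the partitions in turn."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._next = 0

    def __call__(self):
        partition = self._next
        self._next = (self._next + 1) % self.num_partitions
        return partition


if __name__ == "__main__":
    rr = RoundRobin(3)
    print([rr() for _ in range(4)])
    print(key_partition(b"example.com", 4))
```

In a crawling context, keying by domain would keep all records for one site in a single partition and therefore in order, while round-robin spreads load evenly but gives up per-key ordering.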

  9. Kafka Connect
    Image Credit: Confluent

  10. Kafka Streams
    Image Credit: Confluent
    • Operations
        • filter
        • map
        • join
        • aggregate
    • KStream
    • KTable
        • manages local state
    • Windows
    stateful
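
Kafka Streams itself is a Java library, so the following Python is only an analogy for the concepts on this slide. It assumes a stream is a sequence of (key, value) records and a table is the latest value per key; all function names are hypothetical, and joins and windows are left out.

```python
from collections import Counter


def stream_filter(records, predicate):
    """Stateless: keep only the records matching the predicate."""
    return ((k, v) for k, v in records if predicate(k, v))


def stream_map(records, fn):
    """Stateless: transform each record independently."""
    return (fn(k, v) for k, v in records)


def to_table(records):
    """Table view: later records for a key overwrite earlier ones."""
    table = {}
    for k, v in records:
        table[k] = v
    return table


def count_by_key(records):
    """Stateful aggregate: needs to remember a running count per key."""
    return dict(Counter(k for k, _ in records))


if __name__ == "__main__":
    events = [("a.com", 200), ("b.com", 404), ("a.com", 500)]
    print(list(stream_filter(events, lambda k, v: v < 500)))
    print(to_table(events))
    print(count_by_key(events))
```

The filter and map steps need no memory between records, while the table and the count both have to keep state, which is the part a stream processor must manage locally.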

  11. Summary
    Image Credit: Confluent
