Leveraging Apache Kafka for Web Crawling and Data Processing

Apache Kafka has become the de facto standard for distributed messaging, stream processing, and data storage. At Fluquid we use Kafka to decouple and asynchronously connect multiple applications, and to store data, in the context of web crawling and big-data processing.

Fluquid Ltd.

November 21, 2017

Transcript

  1. Leveraging Apache Kafka
    for Web Crawling
    and Data Processing
    OpenStack Cork, 2017-11-21
    Image: https://softwareengineeringdaily.com/wp-content/uploads/2015/08/kafka-logo-wide.png

  2. About Me
    • Johannes Ahlmann
    • fluquid.com
    • Sales & Client Intelligence
    • Intelligent Lead Generation
    • Large-scale web crawls
    • Gathering and Enriching Web Data
    • webdata.org
    • Share Libraries and Best Practices
    • Bring Data Scientists and SME Companies together
    • For Developers
    • Awesome Available Datasets
    • Contact:
    [email protected]
    fluquid

  3. Background: Queues & PubSub
    • PubSub
    • multiple subscribers; every subscriber receives every message
    • no scaling / load balancing of consumption
    • Queues
    • multiple consumers
    • each message is delivered to a single consumer
    • scaling, load balancing

  4. A high-throughput distributed messaging system
    • Decouples Data Pipelines
    • Scalable & Fault-Tolerant
    • Kafka Functionalities
    • Messaging
    • Processing
    • Storing
    • Performance (>100k messages/s)
    • Batching
    • Zero Copy I/O
    • Leverages OS Cache
    • Durability
    Image Credit: Confluent
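
Much of that throughput comes from batching on the producer side. A minimal sketch of the relevant producer settings; broker address, topic name and values are placeholders, not Fluquid's actual configuration:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // batch up to 64 KiB per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);         // or wait up to 20 ms to fill a batch
        props.put(ProducerConfig.ACKS_CONFIG, "all");           // durability: ack from all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("crawl-requests", "example.com", "https://example.com/"));
        }
    }
}
```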

  5. Core Concepts
    • Producers
    • Consumers
    • Brokers
    • Topics
    • Zookeeper
    • Offsets
    • Broker Addresses
    Image Credit: Confluent
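
To make the concepts concrete, a minimal consumer sketch: broker addresses go into bootstrap.servers, the topic is subscribed to by name, and Kafka tracks a per-partition offset for the consumer group. All names below are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CrawlResultConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // broker addresses
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "crawl-result-workers");    // consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the oldest retained offset

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("crawl-results")); // topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s%n",
                            record.partition(), record.offset(), record.key());
                }
            }
        }
    }
}
```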

  6. Key Idea: Partitioned Log
    • Very fast due to zero-copy I/O and batching
    • Uses sendfile and OS buffer cache
    • Sequential writes to FS
    • Order guaranteed within partition
    • Scaling
    Image Credit: Confluent
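
Because ordering is only guaranteed within a partition, related records should share a key. A sketch assuming a hypothetical crawl-requests topic keyed by host, so all URLs of one host land in the same partition and are consumed in order:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrdering {
    static void enqueue(KafkaProducer<String, String> producer, String host, String url) {
        // Records with the same key are hashed to the same partition by the
        // default partitioner, so per-host ordering is preserved.
        producer.send(new ProducerRecord<>("crawl-requests", host, url));
    }

    static void example(KafkaProducer<String, String> producer) {
        enqueue(producer, "example.com", "https://example.com/");      // same partition...
        enqueue(producer, "example.com", "https://example.com/about"); // ...consumed in this order
        enqueue(producer, "other.org",   "https://other.org/");        // possibly a different partition
    }
}
```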

  7. Logs & PubSub
    • Consumers can be transient
    • Consumer Groups
    • Delivery Semantics
    • at least once (default)
    • at most once
    • exactly once
    • Retention Policy
    • Reprocessing
    Image Credit: Confluent
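
A sketch of the default at-least-once pattern with a consumer group: offsets are committed only after records are processed, so a crash before the commit means redelivery rather than loss, and reprocessing is just a seek back within the retention window. Topic name and processing logic are placeholders; it assumes enable.auto.commit=false:

```java
import java.time.Duration;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceLoop {
    static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("crawl-results"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // should be idempotent: a crash before commit causes redelivery
            }
            consumer.commitSync(); // commit offsets only after processing (at-least-once)
        }
    }

    static void reprocessFromStart(KafkaConsumer<String, String> consumer) {
        // Rewind to the oldest retained offsets of the currently assigned partitions.
        consumer.seekToBeginning(consumer.assignment());
    }

    static void process(ConsumerRecord<String, String> record) { /* application logic */ }
}
```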

  8. Partitions & Replication
    • Partitions configurable
    • Partition allocation
    • round-robin
    • semantic partition by key
    • Replication
    • optional
    • 1 leader, 0 or more followers
    • sync or async
    • flush delay configurable
    Image Credit: Confluent
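
Partition count and replication factor are chosen per topic at creation time. A sketch using the AdminClient; topic name, partition count, replication factor and broker address are illustrative:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions for parallelism, replication factor 3 for fault tolerance
            NewTopic topic = new NewTopic("crawl-results", 12, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```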

  9. Kafka Connect
    Image Credit: Confluent

  10. Kafka Streams
    Image Credit: Confluent
    • Operations
    • filter
    • map
    • join
    • aggregate
    • KStream
    • KTable
    • manages local state
    • Windows (stateful)
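
A sketch of a small Streams topology using these operations, assuming a hypothetical crawl-results topic keyed by host: the KStream is filtered and mapped, then aggregated into a windowed KTable whose state is kept in a local store:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class PagesPerHostApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pages-per-host");    // also names the local state dir
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> pages = builder.stream("crawl-results");     // key = host, value = page body

        KTable<Windowed<String>, Long> pagesPerHost = pages
                .filter((host, body) -> body != null && !body.isEmpty())     // filter: drop empty fetches
                .mapValues(String::length)                                   // map: body -> size (illustrative)
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofHours(1)))             // stateful, windowed
                .count();                                                    // aggregate, backed by local state

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Joins between KStreams and KTables follow the same builder style; the local state behind pagesPerHost is what the "manages local state" bullet refers to.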

  11. Summary
    Image Credit: Confluent
