
Big Data Architecture at Scrapinghub

Shane Evans
October 03, 2016

In this talk, Shane will discuss the design and architecture of Scrapinghub’s systems for storing and analysing big data. He will introduce HBase, dive into the architecture, and share what he learned designing, implementing and scaling a platform based on it.


Transcript

  1. Big Data Architecture at Scrapinghub
    Shane Evans

  2. About Shane
    ● 9 years of web scraping
    ● Scrapy, Portia, Frontera,
    Scrapy Cloud, etc.
    ● Co-founded Scrapinghub
    ● Decades with Big Data

  3. About Scrapinghub
    We turn web content into useful data
    Our platform is used to scrape over 4 billion web pages a
    month.
    We offer:
    ● Professional Services to handle the web scraping for you
    ● Off-the-shelf datasets so you can get data hassle free
    ● A cloud-based platform that makes scraping a breeze

  4. Who Uses Web Data?
    Used by everyone from individuals to large
    corporations:
    ● Monitor your competitors by analyzing product
    information
    ● Detect fraudulent reviews and sentiment changes
    by mining reviews
    ● Create apps that use public data
    ● Track criminal activity

  5. Web Crawling and Hadoop
    Heavily influenced by the infrastructure at
    Google
    Initial code for Hadoop was factored out of
    the Nutch web crawler

  6. “Getting information off the
    Internet is like taking a drink
    from a fire hydrant.”
    – Mitchell Kapor

  7. Scrapy
    Scrapy is a web scraping framework that
    gets the dirty work related to web crawling
    out of your way.
    Benefits
    ● Open Source
    ● Very popular (16k+ ★)
    ● Battle tested
    ● Highly extensible
    ● Great documentation
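
No code appears on the slide, so here is a minimal, hedged sketch of what a Scrapy spider looks like; the demo site, selectors and field names are illustrative, not from the talk.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Illustrative spider: crawls a small demo site and yields JSON-like items."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # One item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy handles scheduling, retries and dedup.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider quotes_spider.py -o items.jl` streams the scraped items to a JSON Lines file.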

  8. Scraping Infrastructure
    Meet Scrapy Cloud, our PaaS for web crawlers:
    ● Scalable: Crawlers run on our cloud infrastructure
    ● Crawlera add-on
    ● Control your spiders: Command line, API or web UI
    ● Store your data: All data is securely stored in Scrapinghub's
    fault-tolerant database and accessible through the Scrapinghub API

  9. MongoDB - v1.0
    ● Quick to prototype
    ● Filtering, indexing, etc.
    ● Easy to work with JSON

  10. MongoDB - v1.0
    But for our use case, as we started to scale:
    ● Cannot keep hot data in memory
    ● Lock contention
    ● Cannot order data without sorting,
    skip+limit queries slow
    ● Poor space efficiency
    See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
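
To make the "skip+limit queries slow" point concrete, here is a hedged pymongo sketch contrasting offset paging with resuming from the last seen key; the database and collection names are hypothetical.

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client["scraping"]["items"]  # hypothetical collection

# Offset paging: the server still walks and discards the skipped documents,
# so deep pages get progressively slower.
page = list(items.find().sort("_id", ASCENDING).skip(1_000_000).limit(100))

# Range paging on the (indexed) _id: remember the last key and resume there.
if page:
    last_id = page[-1]["_id"]
    next_page = list(
        items.find({"_id": {"$gt": last_id}}).sort("_id", ASCENDING).limit(100)
    )
```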

  11. (image slide, no transcript text)

  12. (image slide, no transcript text)

  13. Typical Web Scraping
    ● Many processes stream data, typically JSON
    ● That data needs to be stored and delivered to customers
    ● Order should be preserved (e.g. logs), sometimes key-value
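
A tiny sketch of the stream shape described above: newline-delimited JSON appended in order, then replayed in the same order. The file name and fields are illustrative.

```python
import json

records = [
    {"url": "http://example.com/1", "title": "first"},
    {"url": "http://example.com/2", "title": "second"},
]

# Writer: one JSON object per line, append-only, so order is preserved.
with open("items.jl", "a", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reader: stream the records back in exactly the order they were written.
with open("items.jl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
```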

  14. Apache Kafka
    Image from Hortonworks
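
The slide itself is only a diagram, so here is a hedged kafka-python sketch of the idea: crawl processes publish items to a topic and a downstream writer consumes them, with keying by job preserving per-job order within a partition. The topic name, key and broker address are assumptions, not from the talk.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# All messages with the same key land in the same partition, so the
# relative order of one crawl job's items is preserved.
producer.send("scraped-items", key=b"job-123", value={"url": "http://example.com"})
producer.flush()

consumer = KafkaConsumer(
    "scraped-items",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    item = message.value  # a dict, delivered in partition order
```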

  15. Data Dashboard
    ● Browse data as the crawl is running
    ● Filter datasets
    ● Summary statistics in real time
    ● Share links to items, logs, requests

  16. Storage Requirements - v2.0
    ● High write volume. Writes are micro-batched
    ● Much of the data is written in order and often immutable (e.g. logs)
    ● Items are semi-structured nested data (like JSON)
    ● Expect exponential growth
    ● Random access from dashboard users, keep summary stats
    ● Sequential reading important (downloading & analyzing)
    ● Store data on disk, many TB per node

  17. Bigtable looks good...
    Google’s Bigtable provides a sparse,
    distributed, persistent
    multidimensional sorted map
    Can express our requirements in what
    Bigtable provides
    Performance characteristics should
    match our workload
    Inspired several open source projects
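
As a mental model of "sparse, distributed, persistent multidimensional sorted map", here is a toy in-memory sketch (nothing distributed or persistent about it): values are addressed by (row, column, timestamp) and rows are kept in key order, which is what makes range scans cheap.

```python
table = {}  # row key -> {column -> {timestamp -> value}}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def scan(start_row, stop_row):
    # Rows come back in sorted key order; Bigtable/HBase keep this order
    # on disk, so a range scan reads a contiguous slice of the table.
    for row in sorted(table):
        if start_row <= row < stop_row:
            yield row, table[row]

put(b"com.example.www/page1", b"contents:html", 1, b"<html>...</html>")
put(b"com.example.www/page2", b"anchor:cnnsi.com", 1, b"CNN")
for row, columns in scan(b"com.example.", b"com.example/"):
    print(row, columns)
```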

  18. Apache HBase
    ● Modelled after Google’s Bigtable
    ● Provides real time random read and write to billions
    of rows with millions of columns
    ● Runs on Hadoop and uses HDFS
    ● Strictly consistent reads and writes
    ● Extensible via server side filters and coprocessors
    ● Java-based
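
For readers coming from Python, a hedged sketch of the same operations through the happybase (Thrift) client rather than the native Java API; the host, table, column family and key layout are assumptions.

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")  # assumed Thrift gateway
table = connection.table("job_items")                   # hypothetical table

# Write: all cells in one put hit a single row, and HBase guarantees
# row-level atomicity.
table.put(b"job123:0000000001", {b"d:item": b"<msgpack bytes>"})

# Real-time random read of a single row by key.
row = table.row(b"job123:0000000001")

# Sequential range scan, returned in key order.
for key, data in table.scan(row_prefix=b"job123:"):
    pass  # process each stored item in key order
```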

  19. HBase Architecture

  20. Key Selection
    Data ordered by key
    Consider
    ● Key ranges assigned to regions
    ● Tall/Narrow vs. Fat/Wide
    ● Avoid hotspotting
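
A hedged sketch of the hotspotting concern: a monotonically increasing key (a timestamp, an auto-increment id) sends every new write to the last region. One common mitigation is a small hash-derived salt prefix; the bucket count below is an assumption, not what Scrapinghub uses.

```python
import hashlib

NUM_SALT_BUCKETS = 16  # assumption: pick to roughly match region/server count

def salted_key(natural_key: bytes) -> bytes:
    # The salt spreads writes over NUM_SALT_BUCKETS contiguous key ranges.
    bucket = hashlib.md5(natural_key).digest()[0] % NUM_SALT_BUCKETS
    return b"%02d|" % bucket + natural_key

print(salted_key(b"2016-10-03T12:00:00/job-42"))
# Readers must fan out one scan per bucket and merge results, so salting
# trades read complexity for even write distribution.
```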

  21. Key Selection Examples
    OpenTSDB:
    Row key: [...]
    Google web table:
    Row keys: URLs with reversed domains
    good use of column families
    Facebook inbox search:
    Row key: User
    Column: Word
    Version: message id
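
A small sketch of the "URLs with reversed domains" trick from the webtable example: reversing the host clusters all pages of a site (and its subdomains) into one contiguous key range that a single scan can cover.

```python
from urllib.parse import urlsplit

def webtable_row_key(url: str) -> str:
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + (parts.path or "/")

print(webtable_row_key("http://www.example.com/products/1"))
# -> "com.example.www/products/1", sorting next to other example.com pages
```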

  22. Key Design: Job Data
    ● Atomic operations are at the row level: we use wide columns, update counts on write
    operations and delete whole rows at once
    ● Order is determined by the binary key: our offsets preserve order
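
A hedged sketch of an order-preserving binary key: fixed-width big-endian integers sort byte-wise in the same order as their numeric values, so HBase's lexicographic key ordering matches write order. The exact field layout is an assumption, not the real Scrapinghub schema.

```python
import struct

def item_key(project_id: int, job_id: int, item_offset: int) -> bytes:
    # ">III": three big-endian unsigned 32-bit ints, always 12 bytes wide.
    return struct.pack(">III", project_id, job_id, item_offset)

keys = [item_key(1, 7, n) for n in (2, 10, 1)]
# Byte-wise (HBase) order equals numeric order of the offsets.
assert sorted(keys) == [item_key(1, 7, 1), item_key(1, 7, 2), item_key(1, 7, 10)]
```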

  23. Job Data Small Regions
    ● Some IDs gradually increased over time, leaving small or empty regions
    ● Small Regions have overhead - min buffer sizes, time to migrate, etc.

  24. Job Data Merged Regions

  25. HBase Values
    We store the entire item record as msgpack-encoded data in a single value
    ● Msgpack is like JSON but fast and small
    ● Storing entire records as a value has low overhead (vs. splitting
    records into multiple key/values in HBase)
    ● Doesn’t handle very large values well, requires us to limit the size
    of single records
    ● We need arbitrarily nested data anyway, so we need some custom
    binary encoding
    ● Write custom Filters to support simple queries
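
A hedged sketch of the value encoding described above: one whole item packed with msgpack into a single HBase value, with a size check before writing; the 1 MB cap is illustrative, not the actual limit.

```python
import msgpack

item = {
    "url": "http://example.com/products/1",
    "title": "Example product",
    "offers": [{"price": "9.99", "currency": "USD"}],  # arbitrarily nested
}

encoded = msgpack.packb(item, use_bin_type=True)
if len(encoded) > 1024 * 1024:  # illustrative single-record size limit
    raise ValueError("record too large for a single value")

decoded = msgpack.unpackb(encoded, raw=False)
assert decoded == item
```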

  26. HBase Deployment
    ● All access is via a service that provides a restricted API
    ● Ensure no long running queries, deal with timeouts everywhere, ...
    ● Tune settings to work with a lot of data per node
    ● Set block size and compression for each Column Family
    ● Do not use block cache for large scans (Scan.setCacheBlocks) and
    ‘batch’ every time you touch fat columns
    ● Scripts to manage regions (balancing, merging, bulk delete)
    ● We host on dedicated servers
    ● Data replicated to backup clusters, where we run analytics

  27. Data Growth
    ● Items, logs and requests are collected in real time
    ● Millions of web crawling jobs each month
    ● Now at 4 billion pages a month and growing
    ● Thousands of separate active projects

  28. Uses for HBase Data
    ● Consumed by customers directly
    ● Operational data (e.g. crawl state) used by
    applications
    ● We export data to HDFS and elsewhere
    ● Analysis using different tools, e.g. Spark

  29. Hardware failures
    http://xkcd.com/1737/
    ● Drive failures happen, and they matter
    ● Measure cost of replacement, time to
    failover
    ● A poorly performing drive is worse than an outright failure

  30. Hardware failures
    http://xkcd.com/1737/
    Don’t copy a solution that
    works at a much larger
    scale

  31. HBase Lessons Learned
    ● It was a lot of work
    ○ API is low level (untyped bytes) - check out Apache Phoenix
    ○ Many parts -> longer learning curve and difficult to debug. Tools
    are getting better
    ● Many of our early problems were addressed in later releases
    ○ reduced memory allocation & GC times
    ○ improved MTTR
    ○ online region merging
    ○ scanner heartbeat

  32. Some advice from Jeff Dean
    ● Use back of the envelope calculations
    ● Plan to scale 10-20x, but rewrite by 100x
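
In that spirit, a back-of-the-envelope sketch: only the 4 billion pages/month figure comes from the talk; the 5 KB average stored per page is purely an assumption.

```python
pages_per_month = 4_000_000_000
avg_stored_bytes_per_page = 5 * 1024          # assumed, not from the talk

monthly_bytes = pages_per_month * avg_stored_bytes_per_page
monthly_tib = monthly_bytes / 1024 ** 4
print(f"~{monthly_tib:.0f} TiB of new data per month")   # roughly 19 TiB

# Plan for 10-20x (~190-370 TiB/month); expect a redesign well before 100x.
```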

  33. Thank you!
    Shane Evans
    [email protected]
    scrapinghub.com
