
Big Data Architecture at Scrapinghub

Shane Evans
October 03, 2016


In this talk, Shane will discuss the design and architecture of Scrapinghub’s systems for storing and analysing big data. He will introduce HBase, dive into the architecture, and share what he learned designing, implementing and scaling a platform based on it.


Transcript

  1. About Shane
     • 9 years web scraping
     • Scrapy, Portia, Frontera, Scrapy Cloud, etc.
     • Co-founded Scrapinghub
     • Decades with Big Data
  2. About Scrapinghub
     We turn web content into useful data. Our platform is used to scrape over 4 billion web pages a month. We offer:
     • Professional Services to handle the web scraping for you
     • Off-the-shelf datasets so you can get data hassle free
     • A cloud-based platform that makes scraping a breeze
  3. Who Uses Web Data?
     Used by everyone from individuals to large corporations:
     • Monitor your competitors by analyzing product information
     • Detect fraudulent reviews and sentiment changes by mining reviews
     • Create apps that use public data
     • Track criminal activity
  4. Web Crawling and Hadoop
     Heavily influenced by the infrastructure at Google. The initial code for Hadoop was factored out of the Nutch web crawler.
  5. “Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor
  6. Scrapy
     Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.
     Benefits:
     • Open Source
     • Very popular (16k+ ★)
     • Battle tested
     • Highly extensible
     • Great documentation
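     As an illustration of the framework on the slide above, here is a minimal Scrapy spider sketch; the spider name, start URL and CSS selectors are hypothetical (it targets the public quotes.toscrape.com practice site), not something from the talk:

     import scrapy


     class QuotesSpider(scrapy.Spider):
         # Hypothetical example spider, not one of Scrapinghub's.
         name = "quotes"
         start_urls = ["http://quotes.toscrape.com/"]

         def parse(self, response):
             # Yield one item (a plain dict) per quote on the page.
             for quote in response.css("div.quote"):
                 yield {
                     "text": quote.css("span.text::text").get(),
                     "author": quote.css("small.author::text").get(),
                 }
             # Follow pagination links and parse them with this same callback.
             next_page = response.css("li.next a::attr(href)").get()
             if next_page is not None:
                 yield response.follow(next_page, callback=self.parse)

     Running it with "scrapy runspider quotes_spider.py -o quotes.jl" writes the scraped items to a JSON Lines file.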
  7. Scraping Infrastructure
     Meet Scrapy Cloud, our PaaS for web crawlers:
     • Scalable: crawlers run on our cloud infrastructure
     • Crawlera add-on
     • Control your spiders: command line, API or web UI (see the sketch below)
     • Store your data: all data is securely stored in Scrapinghub's fault-tolerant database and accessible through the Scrapinghub API
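     A rough sketch of driving Scrapy Cloud from code with the python-scrapinghub client library; the API key, project ID and spider name below are placeholders:

     from scrapinghub import ScrapinghubClient

     client = ScrapinghubClient("YOUR_API_KEY")     # placeholder credentials
     project = client.get_project(123456)           # placeholder project ID

     # Schedule a run of a spider on the platform.
     job = project.jobs.run("myspider")

     # Once the job has finished, its items, logs and requests are
     # available through the same API.
     for item in job.items.iter():
         print(item)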
  8. MongoDB - v1.0
     MongoDB was our first storage backend, but for our use case problems appeared when we started to scale:
     • Cannot keep hot data in memory
     • Lock contention
     • Cannot order data without sorting; skip+limit queries are slow (illustrated below)
     • Poor space efficiency
     See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
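     A sketch of the skip+limit issue with pymongo (collection and page size are made up): skip() still walks past all the skipped documents on the server, while continuing a range on an indexed key does not.

     from pymongo import ASCENDING, MongoClient

     items = MongoClient().scraping.items   # hypothetical database/collection

     # Slow for deep pages: the server scans and discards page * 50 documents.
     page = 2000
     docs = list(items.find().sort("_id", ASCENDING).skip(page * 50).limit(50))

     # Faster pattern: remember the last _id seen and resume the range from it.
     last_id = docs[-1]["_id"]
     next_docs = list(
         items.find({"_id": {"$gt": last_id}}).sort("_id", ASCENDING).limit(50)
     )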
  9. Typical Web Scraping
     • Many processes stream data, typically JSON
     • This data needs to be stored and delivered to customers
     • Order should be preserved (e.g. logs); sometimes access is key-value
  10. Data Dashboard
      • Browse data as the crawl is running
      • Filter datasets
      • Summary statistics in real time
      • Share links to items, logs, requests
  11. Storage Requirements - v2.0
      • High write volume; writes are micro-batched
      • Much of the data is written in order and is often immutable (e.g. logs)
      • Items are semi-structured nested data (like JSON)
      • Expect exponential growth
      • Random access from dashboard users; keep summary stats
      • Sequential reading is important (downloading & analyzing)
      • Store data on disk, many TB per node
  12. Bigtable looks good...
      Google’s Bigtable provides a sparse, distributed, persistent, multidimensional sorted map.
      • Can express our requirements in what Bigtable provides
      • Performance characteristics should match our workload
      • Inspired several open source projects
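     The sorted-map wording can be pictured as a mapping from (row key, column, timestamp) to a value, ordered by row key; a toy in-memory sketch (not how Bigtable or HBase actually store data):

     # Toy model: (row_key, column, timestamp) -> value, iterated in key order.
     table = {
         (b"com.example.www/index.html", b"contents:", 5): b"<html>...</html>",
         (b"com.example.www/index.html", b"anchor:news.example", 9): b"Example News",
         (b"org.apache.hbase/", b"contents:", 3): b"<html>...</html>",
     }

     def scan(start_row, stop_row):
         """Yield cells whose row key falls in [start_row, stop_row), in key order."""
         for key in sorted(table):
             row, column, ts = key
             if start_row <= row < stop_row:
                 yield key, table[key]

     # Range scans over contiguous keys are cheap because rows are kept sorted.
     for key, value in scan(b"com.", b"com.\xff"):
         print(key, value)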
  13. Apache HBase
      • Modelled after Google’s Bigtable
      • Provides real-time random read and write to billions of rows with millions of columns
      • Runs on Hadoop and uses HDFS
      • Strictly consistent reads and writes
      • Extensible via server-side filters and coprocessors
      • Java-based
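     A rough sketch of random reads and writes from Python through the HBase Thrift gateway using the happybase library; the table name, column family and key layout are hypothetical, and this is not necessarily how Scrapinghub's own service talks to HBase:

     import happybase

     # Assumes an HBase Thrift server on localhost:9090 and a table created with
     # an hbase shell command such as: create 'jobdata', {NAME => 'd'}
     connection = happybase.Connection("localhost", port=9090)
     table = connection.table("jobdata")

     # Row keys and values are raw bytes; rows are kept sorted by this binary key.
     row_key = b"\x00\x00\x00\x2a" + b"\x00\x00\x00\x01"   # e.g. <job id><offset>
     table.put(row_key, {b"d:item": b'{"title": "example"}'})

     print(table.row(row_key, columns=[b"d:item"]))
     connection.close()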
  14. Key Selection
      Data is ordered by key. Consider:
      • Key ranges are assigned to regions
      • Tall/Narrow vs. Fat/Wide
      • Avoid hotspotting
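     One standard way to avoid hotspotting when keys increase monotonically is to prefix them with a deterministic salt so writes spread over several key ranges; a sketch (the bucket count and key layout are arbitrary, and salting trades away contiguous range scans, so it is not what the job-data keys described later use):

     import hashlib
     import struct

     SALT_BUCKETS = 16   # arbitrary: one write "stripe" per bucket

     def salted_key(job_id: int, offset: int) -> bytes:
         """Prefix a monotonically increasing key with a 1-byte deterministic salt."""
         body = struct.pack(">QQ", job_id, offset)            # fixed-width, big-endian
         salt = hashlib.md5(body).digest()[0] % SALT_BUCKETS  # same input -> same salt
         return bytes([salt]) + body

     # Consecutive writes land in different key ranges, hence different regions,
     # but a reader now has to fan out over all SALT_BUCKETS prefixes to scan a job.
     print(salted_key(42, 1).hex())
     print(salted_key(42, 2).hex())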
  15. Key Selection Examples
      • OpenTSDB row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
      • Google web table row keys: URLs with reversed domains; good use of column families
      • Facebook inbox search: row key = user, column = word, version = message id
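     The reversed-domain row keys from the web table example keep all pages of a domain (and its subdomains) adjacent in the sorted key space; a small sketch:

     from urllib.parse import urlsplit

     def webtable_row_key(url: str) -> bytes:
         """Build a web-table-style row key: reversed host name, then the path."""
         parts = urlsplit(url)
         reversed_host = ".".join(reversed(parts.hostname.split(".")))
         return (reversed_host + parts.path).encode("utf-8")

     # Both keys share the prefix b"com.example.", so one range scan covers the site.
     print(webtable_row_key("http://www.example.com/index.html"))   # b'com.example.www/index.html'
     print(webtable_row_key("http://maps.example.com/about.html"))  # b'com.example.maps/about.html'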
  16. Key Design: Job Data
      • Atomic operations are at the row level: we use wide columns, update counts on write operations, and delete whole rows at once
      • Order is determined by the binary key: our offsets preserve order
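     The offsets preserve order because fixed-width big-endian integers compare byte-wise in the same order as the numbers they encode; a small check (the exact key layout here is assumed, not taken from the talk):

     import struct

     def item_key(job_id: int, item_offset: int) -> bytes:
         # Byte-wise comparison of this key matches numeric (job_id, offset) order.
         return struct.pack(">QQ", job_id, item_offset)

     keys = [item_key(7, 10), item_key(7, 2), item_key(3, 99)]
     assert sorted(keys) == [item_key(3, 99), item_key(7, 2), item_key(7, 10)]

     # Naive textual keys would not sort numerically: b"7:10" sorts before b"7:2".
     assert sorted([b"7:2", b"7:10"]) == [b"7:10", b"7:2"]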
  17. Job Data Small Regions
      • Some IDs gradually increased over time, leaving small or empty regions
      • Small regions have overhead: minimum buffer sizes, time to migrate, etc.
  18. HBase Values
      We store the entire item record as msgpack-encoded data in a single value.
      • Msgpack is like JSON but fast and small
      • Storing the entire record as one value has low overhead (vs. splitting records into multiple key/values in HBase)
      • Doesn’t handle very large values well, which requires us to limit the size of single records
      • We need arbitrarily nested data anyway, so we need some custom binary encoding
      • We write custom Filters to support simple queries
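     A sketch of the value encoding with the msgpack library; the record and the size cap are illustrative (the deck does not give the actual limit):

     import json
     import msgpack

     MAX_VALUE_BYTES = 1 * 1024 * 1024   # illustrative single-value size limit

     record = {
         "url": "http://example.com/product/1",
         "title": "Example product",
         "prices": [{"currency": "USD", "amount": 9.99}],   # arbitrarily nested data
     }

     packed = msgpack.packb(record, use_bin_type=True)
     if len(packed) > MAX_VALUE_BYTES:
         raise ValueError("record too large to store as a single HBase value")

     # msgpack output is typically smaller than the equivalent JSON text.
     print(len(packed), len(json.dumps(record).encode("utf-8")))

     assert msgpack.unpackb(packed, raw=False) == record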
  19. HBase Deployment
      • All access is via a service that provides a restricted API (see the sketch below)
      • Ensure no long-running queries; deal with timeouts everywhere, ...
      • Tune settings to work with a lot of data per node
      • Set block size and compression for each Column Family
      • Do not use the block cache for large scans (Scan.setCacheBlocks) and ‘batch’ every time you touch fat columns
      • Scripts to manage regions (balancing, merging, bulk delete)
      • We host on dedicated servers
      • Data is replicated to backup clusters, where we run analytics
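     A rough sketch of the kind of bounded read such a restricted service might expose, again using happybase; the host, table and column names are hypothetical, and happybase's batch_size controls client-side scanner fetching rather than the Java Scan.setCacheBlocks/setBatch calls the slide refers to:

     import happybase

     def read_job_items(job_key_prefix: bytes, max_items: int = 1000):
         """Prefix-limited, capped scan so callers cannot run unbounded queries."""
         # timeout is in milliseconds: fail fast instead of hanging on a slow region.
         connection = happybase.Connection("hbase-thrift.internal", timeout=30000)
         try:
             table = connection.table("jobdata")
             for key, data in table.scan(
                 row_prefix=job_key_prefix,   # restrict the scan to one job's rows
                 columns=[b"d:item"],         # do not touch unrelated fat columns
                 batch_size=100,              # fetch results from Thrift in chunks
                 limit=max_items,             # hard cap enforced by the service
             ):
                 yield key, data[b"d:item"]
         finally:
             connection.close()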
  20. Data Growth
      • Items, logs and requests are collected in real time
      • Millions of web crawling jobs each month
      • Now at 4 billion pages a month and growing
      • Thousands of separate active projects
  21. Uses for HBase Data
      • Consumed by customers directly
      • Operational data (e.g. crawl state) used by applications
      • We export data to HDFS and elsewhere
      • Analysis using different tools, e.g. Spark (see the sketch below)
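     A sketch of the kind of analysis that might run over exported data on the backup/analytics cluster, using PySpark; the HDFS path and the spider field name are assumptions, not details from the talk:

     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("crawl-item-counts").getOrCreate()

     # Hypothetical export location: one JSON record per line per scraped item.
     items = spark.read.json("hdfs:///exports/items/2016-10/")

     # Count items per spider (assumes the exported records carry a 'spider' field).
     items.groupBy("spider").count().orderBy("count", ascending=False).show(20)

     spark.stop()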
  22. Hardware failures
      http://xkcd.com/1737/
      • Drive failures happen, and they matter
      • Measure the cost of replacement and the time to failover
      • A poorly performing drive is worse than a failed one
  23. HBase Lessons Learned
      • It was a lot of work
        ◦ The API is low level (untyped bytes) - check out Apache Phoenix
        ◦ Many parts -> longer learning curve and difficult to debug. Tools are getting better
      • Many of our early problems were addressed in later releases
        ◦ Reduced memory allocation & GC times
        ◦ Improved MTTR
        ◦ Online region merging
        ◦ Scanner heartbeat
  24. Some advice from Jeff Dean
      • Use back-of-the-envelope calculations
      • Plan to scale 10-20x, but expect to rewrite by 100x
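     A back-of-the-envelope example using the figure quoted earlier in the deck (4 billion pages a month); the average stored record size is an assumption, not a number from the talk:

     PAGES_PER_MONTH = 4_000_000_000        # from the deck
     SECONDS_PER_MONTH = 30 * 24 * 3600
     AVG_RECORD_BYTES = 20 * 1024           # assumed average stored record size

     writes_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
     ingest_mb_per_second = writes_per_second * AVG_RECORD_BYTES / 1e6
     raw_tb_per_month = PAGES_PER_MONTH * AVG_RECORD_BYTES / 1e12

     print(f"~{writes_per_second:,.0f} writes/s")        # roughly 1,500 writes/s
     print(f"~{ingest_mb_per_second:,.0f} MB/s ingest")  # roughly 32 MB/s
     print(f"~{raw_tb_per_month:,.0f} TB/month raw")     # roughly 82 TB/month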