
Big Data Architecture at Scrapinghub

Shane Evans
October 03, 2016


In this talk, Shane will discuss the design and architecture of Scrapinghub’s systems for storing and analysing big data. He will introduce HBase, dive into the architecture, and share what he learned designing, implementing and scaling a platform based on it.


Transcript

  1. About Shane
     • 9 years web scraping
     • Scrapy, Portia, Frontera, Scrapy Cloud, etc.
     • Co-founded Scrapinghub
     • Decades with Big Data
  2. About Scrapinghub
     We turn web content into useful data. Our platform is used to scrape over 4 billion web pages a month. We offer:
     • Professional Services to handle the web scraping for you
     • Off-the-shelf datasets so you can get data hassle free
     • A cloud-based platform that makes scraping a breeze
  3. Who Uses Web Data?
     Used by everyone from individuals to large corporations:
     • Monitor your competitors by analyzing product information
     • Detect fraudulent reviews and sentiment changes by mining reviews
     • Create apps that use public data
     • Track criminal activity
  4. Web Crawling and Hadoop
     Heavily influenced by the infrastructure at Google. The initial code for Hadoop was factored out of the Nutch web crawler.
  5. “Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor
  6. Scrapy
     Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.
     Benefits:
     • Open Source
     • Very popular (16k+ ★)
     • Battle tested
     • Highly extensible
     • Great documentation
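     As an illustration of the framework on the slide above, here is a minimal Scrapy spider sketch; the spider name, start URL and CSS selectors are hypothetical (it targets the public quotes.toscrape.com practice site), not something from the talk:

     import scrapy


     class QuotesSpider(scrapy.Spider):
         # Hypothetical example spider, not one of Scrapinghub's.
         name = "quotes"
         start_urls = ["http://quotes.toscrape.com/"]

         def parse(self, response):
             # Yield one item (a plain dict) per quote on the page.
             for quote in response.css("div.quote"):
                 yield {
                     "text": quote.css("span.text::text").get(),
                     "author": quote.css("small.author::text").get(),
                 }
             # Follow pagination links and parse them with this same callback.
             next_page = response.css("li.next a::attr(href)").get()
             if next_page is not None:
                 yield response.follow(next_page, callback=self.parse)

     Running it with "scrapy runspider quotes_spider.py -o quotes.jl" writes the scraped items to a JSON Lines file.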
  7. Scraping Infrastructure
     Meet Scrapy Cloud, our PaaS for web crawlers:
     • Scalable: crawlers run on our cloud infrastructure
     • Crawlera add-on
     • Control your spiders: command line, API or web UI (see the sketch below)
     • Store your data: all data is securely stored in Scrapinghub's fault-tolerant database and accessible through the Scrapinghub API
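     A rough sketch of driving Scrapy Cloud from code with the python-scrapinghub client library; the API key, project ID and spider name below are placeholders:

     from scrapinghub import ScrapinghubClient

     client = ScrapinghubClient("YOUR_API_KEY")     # placeholder credentials
     project = client.get_project(123456)           # placeholder project ID

     # Schedule a run of a spider on the platform.
     job = project.jobs.run("myspider")

     # Once the job has finished, its items, logs and requests are
     # available through the same API.
     for item in job.items.iter():
         print(item)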
  8. MongoDB - v1.0
     MongoDB was our first storage backend, but for our use case problems appeared when we started to scale:
     • Cannot keep hot data in memory
     • Lock contention
     • Cannot order data without sorting; skip+limit queries are slow (illustrated below)
     • Poor space efficiency
     See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
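     A sketch of the skip+limit issue with pymongo (collection and page size are made up): skip() still walks past all the skipped documents on the server, while continuing a range on an indexed key does not.

     from pymongo import ASCENDING, MongoClient

     items = MongoClient().scraping.items   # hypothetical database/collection

     # Slow for deep pages: the server scans and discards page * 50 documents.
     page = 2000
     docs = list(items.find().sort("_id", ASCENDING).skip(page * 50).limit(50))

     # Faster pattern: remember the last _id seen and resume the range from it.
     last_id = docs[-1]["_id"]
     next_docs = list(
         items.find({"_id": {"$gt": last_id}}).sort("_id", ASCENDING).limit(50)
     )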
  9. Typical Web Scraping
     • Many processes stream data, typically JSON
     • This data needs to be stored and delivered to customers
     • Order should be preserved (e.g. logs); sometimes access is key-value
  10. Data Dashboard
      • Browse data as the crawl is running
      • Filter datasets
      • Summary statistics in real time
      • Share links to items, logs, requests
  11. Storage Requirements - v2.0
      • High write volume; writes are micro-batched
      • Much of the data is written in order and is often immutable (e.g. logs)
      • Items are semi-structured nested data (like JSON)
      • Expect exponential growth
      • Random access from dashboard users; keep summary stats
      • Sequential reading is important (downloading & analyzing)
      • Store data on disk, many TB per node
  12. Bigtable looks good...
      Google’s Bigtable provides a sparse, distributed, persistent, multidimensional sorted map.
      • Can express our requirements in what Bigtable provides
      • Performance characteristics should match our workload
      • Inspired several open source projects
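     The sorted-map wording can be pictured as a mapping from (row key, column, timestamp) to a value, ordered by row key; a toy in-memory sketch (not how Bigtable or HBase actually store data):

     # Toy model: (row_key, column, timestamp) -> value, iterated in key order.
     table = {
         (b"com.example.www/index.html", b"contents:", 5): b"<html>...</html>",
         (b"com.example.www/index.html", b"anchor:news.example", 9): b"Example News",
         (b"org.apache.hbase/", b"contents:", 3): b"<html>...</html>",
     }

     def scan(start_row, stop_row):
         """Yield cells whose row key falls in [start_row, stop_row), in key order."""
         for key in sorted(table):
             row, column, ts = key
             if start_row <= row < stop_row:
                 yield key, table[key]

     # Range scans over contiguous keys are cheap because rows are kept sorted.
     for key, value in scan(b"com.", b"com.\xff"):
         print(key, value)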
  13. Apache HBase
      • Modelled after Google’s Bigtable
      • Provides real-time random read and write to billions of rows with millions of columns
      • Runs on Hadoop and uses HDFS
      • Strictly consistent reads and writes
      • Extensible via server-side filters and coprocessors
      • Java-based
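     A rough sketch of random reads and writes from Python through the HBase Thrift gateway using the happybase library; the table name, column family and key layout are hypothetical, and this is not necessarily how Scrapinghub's own service talks to HBase:

     import happybase

     # Assumes an HBase Thrift server on localhost:9090 and a table created with
     # an hbase shell command such as: create 'jobdata', {NAME => 'd'}
     connection = happybase.Connection("localhost", port=9090)
     table = connection.table("jobdata")

     # Row keys and values are raw bytes; rows are kept sorted by this binary key.
     row_key = b"\x00\x00\x00\x2a" + b"\x00\x00\x00\x01"   # e.g. <job id><offset>
     table.put(row_key, {b"d:item": b'{"title": "example"}'})

     print(table.row(row_key, columns=[b"d:item"]))
     connection.close()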
  14. Key Selection
      Data is ordered by key. Consider:
      • Key ranges are assigned to regions
      • Tall/Narrow vs. Fat/Wide
      • Avoid hotspotting
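     One standard way to avoid hotspotting when keys increase monotonically is to prefix them with a deterministic salt so writes spread over several key ranges; a sketch (the bucket count and key layout are arbitrary, and salting trades away contiguous range scans, so it is not what the job-data keys described later use):

     import hashlib
     import struct

     SALT_BUCKETS = 16   # arbitrary: one write "stripe" per bucket

     def salted_key(job_id: int, offset: int) -> bytes:
         """Prefix a monotonically increasing key with a 1-byte deterministic salt."""
         body = struct.pack(">QQ", job_id, offset)            # fixed-width, big-endian
         salt = hashlib.md5(body).digest()[0] % SALT_BUCKETS  # same input -> same salt
         return bytes([salt]) + body

     # Consecutive writes land in different key ranges, hence different regions,
     # but a reader now has to fan out over all SALT_BUCKETS prefixes to scan a job.
     print(salted_key(42, 1).hex())
     print(salted_key(42, 2).hex())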
  15. Key Selection Examples
      • OpenTSDB row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
      • Google web table row keys: URLs with reversed domains; good use of column families
      • Facebook inbox search: row key = user, column = word, version = message id
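     The reversed-domain row keys from the web table example keep all pages of a domain (and its subdomains) adjacent in the sorted key space; a small sketch:

     from urllib.parse import urlsplit

     def webtable_row_key(url: str) -> bytes:
         """Build a web-table-style row key: reversed host name, then the path."""
         parts = urlsplit(url)
         reversed_host = ".".join(reversed(parts.hostname.split(".")))
         return (reversed_host + parts.path).encode("utf-8")

     # Both keys share the prefix b"com.example.", so one range scan covers the site.
     print(webtable_row_key("http://www.example.com/index.html"))   # b'com.example.www/index.html'
     print(webtable_row_key("http://maps.example.com/about.html"))  # b'com.example.maps/about.html'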
  16. Key Design: Job Data
      • Atomic operations are at the row level: we use wide columns, update counts on write operations, and delete whole rows at once
      • Order is determined by the binary key: our offsets preserve order
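     The offsets preserve order because fixed-width big-endian integers compare byte-wise in the same order as the numbers they encode; a small check (the exact key layout here is assumed, not taken from the talk):

     import struct

     def item_key(job_id: int, item_offset: int) -> bytes:
         # Byte-wise comparison of this key matches numeric (job_id, offset) order.
         return struct.pack(">QQ", job_id, item_offset)

     keys = [item_key(7, 10), item_key(7, 2), item_key(3, 99)]
     assert sorted(keys) == [item_key(3, 99), item_key(7, 2), item_key(7, 10)]

     # Naive textual keys would not sort numerically: b"7:10" sorts before b"7:2".
     assert sorted([b"7:2", b"7:10"]) == [b"7:10", b"7:2"]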
  17. Job Data Small Regions
      • Some IDs gradually increased over time, leaving small or empty regions
      • Small regions have overhead: minimum buffer sizes, time to migrate, etc.
  18. HBase Values
      We store the entire item record as msgpack-encoded data in a single value.
      • Msgpack is like JSON but fast and small
      • Storing the entire record as one value has low overhead (vs. splitting records into multiple key/values in HBase)
      • Doesn’t handle very large values well, which requires us to limit the size of single records
      • We need arbitrarily nested data anyway, so we need some custom binary encoding
      • We write custom Filters to support simple queries
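     A sketch of the value encoding with the msgpack library; the record and the size cap are illustrative (the deck does not give the actual limit):

     import json
     import msgpack

     MAX_VALUE_BYTES = 1 * 1024 * 1024   # illustrative single-value size limit

     record = {
         "url": "http://example.com/product/1",
         "title": "Example product",
         "prices": [{"currency": "USD", "amount": 9.99}],   # arbitrarily nested data
     }

     packed = msgpack.packb(record, use_bin_type=True)
     if len(packed) > MAX_VALUE_BYTES:
         raise ValueError("record too large to store as a single HBase value")

     # msgpack output is typically smaller than the equivalent JSON text.
     print(len(packed), len(json.dumps(record).encode("utf-8")))

     assert msgpack.unpackb(packed, raw=False) == record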
  19. HBase Deployment
      • All access is via a service that provides a restricted API (see the sketch below)
      • Ensure no long-running queries; deal with timeouts everywhere, ...
      • Tune settings to work with a lot of data per node
      • Set block size and compression for each Column Family
      • Do not use the block cache for large scans (Scan.setCacheBlocks) and ‘batch’ every time you touch fat columns
      • Scripts to manage regions (balancing, merging, bulk delete)
      • We host on dedicated servers
      • Data is replicated to backup clusters, where we run analytics
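     A rough sketch of the kind of bounded read such a restricted service might expose, again using happybase; the host, table and column names are hypothetical, and happybase's batch_size controls client-side scanner fetching rather than the Java Scan.setCacheBlocks/setBatch calls the slide refers to:

     import happybase

     def read_job_items(job_key_prefix: bytes, max_items: int = 1000):
         """Prefix-limited, capped scan so callers cannot run unbounded queries."""
         # timeout is in milliseconds: fail fast instead of hanging on a slow region.
         connection = happybase.Connection("hbase-thrift.internal", timeout=30000)
         try:
             table = connection.table("jobdata")
             for key, data in table.scan(
                 row_prefix=job_key_prefix,   # restrict the scan to one job's rows
                 columns=[b"d:item"],         # do not touch unrelated fat columns
                 batch_size=100,              # fetch results from Thrift in chunks
                 limit=max_items,             # hard cap enforced by the service
             ):
                 yield key, data[b"d:item"]
         finally:
             connection.close()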
  20. Data Growth
      • Items, logs and requests are collected in real time
      • Millions of web crawling jobs each month
      • Now at 4 billion pages a month and growing
      • Thousands of separate active projects
  21. Uses for HBase Data
      • Consumed by customers directly
      • Operational data (e.g. crawl state) used by applications
      • We export data to HDFS and elsewhere
      • Analysis using different tools, e.g. Spark (see the sketch below)
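     A sketch of the kind of analysis that might run over exported data on the backup/analytics cluster, using PySpark; the HDFS path and the spider field name are assumptions, not details from the talk:

     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("crawl-item-counts").getOrCreate()

     # Hypothetical export location: one JSON record per line per scraped item.
     items = spark.read.json("hdfs:///exports/items/2016-10/")

     # Count items per spider (assumes the exported records carry a 'spider' field).
     items.groupBy("spider").count().orderBy("count", ascending=False).show(20)

     spark.stop()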
  22. Hardware failures
      http://xkcd.com/1737/
      • Drive failures happen, and they matter
      • Measure the cost of replacement and the time to failover
      • A poorly performing drive is worse than a failed one
  23. HBase Lessons Learned
      • It was a lot of work
        ◦ The API is low level (untyped bytes) - check out Apache Phoenix
        ◦ Many parts -> longer learning curve and difficult to debug. Tools are getting better
      • Many of our early problems were addressed in later releases
        ◦ Reduced memory allocation & GC times
        ◦ Improved MTTR
        ◦ Online region merging
        ◦ Scanner heartbeat
  24. Some advice from Jeff Dean
      • Use back-of-the-envelope calculations
      • Plan to scale 10-20x, but expect to rewrite by 100x
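     A back-of-the-envelope example using the figure quoted earlier in the deck (4 billion pages a month); the average stored record size is an assumption, not a number from the talk:

     PAGES_PER_MONTH = 4_000_000_000        # from the deck
     SECONDS_PER_MONTH = 30 * 24 * 3600
     AVG_RECORD_BYTES = 20 * 1024           # assumed average stored record size

     writes_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
     ingest_mb_per_second = writes_per_second * AVG_RECORD_BYTES / 1e6
     raw_tb_per_month = PAGES_PER_MONTH * AVG_RECORD_BYTES / 1e12

     print(f"~{writes_per_second:,.0f} writes/s")        # roughly 1,500 writes/s
     print(f"~{ingest_mb_per_second:,.0f} MB/s ingest")  # roughly 32 MB/s
     print(f"~{raw_tb_per_month:,.0f} TB/month raw")     # roughly 82 TB/month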