Slide 1

Big Data Architecture at Scrapinghub Shane Evans

Slide 2

About Shane ● 9 years of web scraping ● Scrapy, Portia, Frontera, Scrapy Cloud, etc. ● Co-founded Scrapinghub ● Decades with Big Data

Slide 3

About Scrapinghub We turn web content into useful data. Our platform is used to scrape over 4 billion web pages a month. We offer: ● Professional Services to handle the web scraping for you ● Off-the-shelf datasets so you can get data hassle-free ● A cloud-based platform that makes scraping a breeze

Slide 4

Who Uses Web Data? Used by everyone from individuals to large corporations: ● Monitor your competitors by analyzing product information ● Detect fraudulent reviews and sentiment changes by mining reviews ● Create apps that use public data ● Track criminal activity

Slide 5

Web Crawling and Hadoop Heavily influenced by the infrastructure at Google Initial code for Hadoop was factored out of the Nutch web crawler

Slide 6

“Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor

Slide 7

Scrapy Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way. Benefits ● Open Source ● Very popular (16k+ ★) ● Battle tested ● Highly extensible ● Great documentation
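
For illustration (not part of the original slides), a minimal Scrapy spider; the spider name, target site and CSS selectors are examples against the quotes.toscrape.com sandbox site:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal example: crawl a sandbox site and yield structured items."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination and keep crawling.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run with `scrapy runspider quotes_spider.py -o items.jl` to stream the items out as JSON lines.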

Slide 8

Scraping Infrastructure Meet Scrapy Cloud, our PaaS for web crawlers: ● Scalable: Crawlers run on our cloud infrastructure ● Crawlera add-on for smart proxy rotation ● Control your spiders: Command line, API or web UI ● Store your data: All data is securely stored in Scrapinghub's fault-tolerant database and accessible through the Scrapinghub API
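
A hedged sketch of driving Scrapy Cloud from code, assuming the python-scrapinghub client; the API key, project ID and spider name are placeholders:

```python
from scrapinghub import ScrapinghubClient

# Placeholders: substitute a real API key, project ID and spider name.
client = ScrapinghubClient("APIKEY")
project = client.get_project(12345)

job = project.jobs.run("myspider")   # schedule a crawl from code
# ... later, once the job has run, stream its stored items back via the API:
for item in job.items.iter():
    print(item)
```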

Slide 9

MongoDB - v1.0 ● Quick to prototype ● Filtering, indexing, etc. ● Easy to work with JSON
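
As a sketch of the v1.0-era approach (the collection layout and field names are assumptions, not the actual schema), prototyping with pymongo really is this quick:

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client.scraped.items   # assumed database/collection names

# Index so items for a job can be filtered and returned in write order.
items.create_index([("job_id", ASCENDING), ("item_no", ASCENDING)])

items.insert_one({"job_id": "53/1/7", "item_no": 0,
                  "url": "http://example.com/p/1", "price": 9.99})

for doc in items.find({"job_id": "53/1/7"}).sort("item_no", ASCENDING):
    print(doc["url"])
```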

Slide 10

MongoDB - v1.0 But for our use, when we started to scale: ● Cannot keep hot data in memory ● Lock contention ● Cannot order data without sorting; skip+limit queries are slow ● Poor space efficiency See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
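
On the skip+limit point in particular, a sketch (same assumed schema as above) of why deep paging hurts and what a range-based query looks like instead:

```python
from pymongo import MongoClient

items = MongoClient("mongodb://localhost:27017").scraped.items  # assumed schema

# Deep paging with skip+limit: the server still walks all skipped documents,
# so page N gets slower as N grows.
slow_page = items.find({"job_id": "53/1/7"}).sort("item_no", 1).skip(1_000_000).limit(100)

# Range-based paging on the indexed key stays cheap regardless of offset.
last_seen = 1_000_000
fast_page = (items.find({"job_id": "53/1/7", "item_no": {"$gt": last_seen}})
             .sort("item_no", 1).limit(100))
```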

Slide 11

No content

Slide 12

No content

Slide 13

Typical Web Scraping ● Many processes stream data, typically JSON ● That data needs to be stored and delivered to customers ● Order should be preserved (e.g. logs); sometimes access is key-value
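
A minimal sketch of what such a stream can look like (format details assumed): newline-delimited JSON with an explicit sequence number so storage can preserve, or restore, the original order:

```python
import json
import sys


def write_stream(records, out=sys.stdout):
    # One JSON object per line; _seq makes the original order explicit.
    for seq, record in enumerate(records):
        out.write(json.dumps({"_seq": seq, **record}) + "\n")


write_stream([
    {"level": "INFO", "msg": "spider opened"},
    {"level": "INFO", "msg": "crawled http://example.com (200)"},
])
```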

Slide 14

Apache Kafka Image from Hortonworks
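
The slide itself is a diagram; as a hedged sketch of the idea (broker address and topic name are made up, and kafka-python is used only for brevity), producers push scraped records onto a topic and consumers read them back in order per partition:

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Keying by job id keeps each job's records in order within one partition.
producer.send("scraped-items", key=b"53/1/7",
              value={"url": "http://example.com/p/1", "price": 9.99})
producer.flush()
```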

Slide 15

Data Dashboard ● Browse data as the crawl is running ● Filter datasets ● Summary statistics in real time ● Share links to items, logs, requests

Slide 16

Storage Requirements - v2.0 ● High write volume. Writes are micro-batched ● Much of the data is written in order and often immutable (e.g. logs) ● Items are semi-structured nested data (like JSON) ● Expect exponential growth ● Random access from dashboard users, keep summary stats ● Sequential reading important (downloading & analyzing) ● Store data on disk, many TB per node

Slide 17

Bigtable looks good... Google's Bigtable provides a sparse, distributed, persistent multidimensional sorted map ● Can express our requirements in what Bigtable provides ● Performance characteristics should match our workload ● Inspired several open source projects
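
A toy model (purely conceptual, not an implementation) of that "sorted map" abstraction: keys are (row key, column, timestamp) tuples, and a scan is just a walk over a key range:

```python
# (row key, column, timestamp) -> value, kept sorted by key.
table = {
    (b"job:53/1/7", b"data:00000001", 1001): b"<msgpack item>",
    (b"job:53/1/7", b"stats:count", 1002): b"42",
    (b"job:53/1/8", b"data:00000001", 1003): b"<msgpack item>",
}


def scan(start_row, stop_row):
    # Like an HBase Scan: iterate the sorted key space within a row range.
    for key in sorted(table):
        row, _column, _ts = key
        if start_row <= row < stop_row:
            yield key, table[key]


for key, value in scan(b"job:53/1/7", b"job:53/1/8"):
    print(key, value)
```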

Slide 18

Apache HBase ● Modelled after Google's Bigtable ● Provides real-time random read and write to billions of rows with millions of columns ● Runs on Hadoop and uses HDFS ● Strictly consistent reads and writes ● Extensible via server-side filters and coprocessors ● Java-based
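
A small sketch of talking to HBase from Python through the Thrift gateway with happybase (host, table and column names are assumptions):

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")   # placeholder host
table = connection.table("job_items")                    # placeholder table

# Write one cell, read it back, then scan all rows for a job.
table.put(b"53/1/7:00000001", {b"data:item": b"<msgpack bytes>"})
print(table.row(b"53/1/7:00000001"))

for key, data in table.scan(row_prefix=b"53/1/7:"):
    print(key, list(data))
```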

Slide 19

HBase Architecture

Slide 20

Key Selection Data is ordered by key. Consider: ● Key ranges are assigned to regions ● Tall/Narrow vs. Fat/Wide ● Avoid hotspotting
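
One common way to avoid hotspotting when keys would otherwise increase monotonically is to salt the key with a small hash bucket; an illustrative scheme, not the deck's actual layout:

```python
import hashlib


def salted_key(job_id: str, buckets: int = 16) -> bytes:
    # A short hash prefix spreads sequential job ids across key ranges,
    # so writes hit many regions instead of always the last one.
    salt = int(hashlib.md5(job_id.encode()).hexdigest(), 16) % buckets
    return f"{salt:02x}:{job_id}".encode()


print(salted_key("53/1/7"))   # something like b'0a:53/1/7'
```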

Slide 21

Key Selection Examples ● OpenTSDB: Row key: [...] ● Google web table: Row keys are URLs with reversed domains; good use of column families ● Facebook inbox search: Row key: User, Column: Word, Version: message id
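
For the web table example, domain reversal is easy to show (illustrative helper, not from the slides):

```python
from urllib.parse import urlsplit


def reversed_domain_key(url: str) -> bytes:
    # com.example.www sorts next to other pages from the same domain,
    # which is what makes per-site scans cheap in the web table design.
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return (host + parts.path).encode()


print(reversed_domain_key("http://www.example.com/products/1"))
# b'com.example.www/products/1'
```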

Slide 22

Key Design: Job Data ● Atomic operations are at the row level: we use wide columns, update counts on write operations and delete whole rows at once ● Order is determined by the binary key: our offsets preserve order

Slide 23

Job Data: Small Regions ● Some IDs gradually increased over time, leaving small or empty regions ● Small regions have overhead: minimum buffer sizes, time to migrate, etc.

Slide 24

Job Data: Merged Regions

Slide 25

HBase Values We store the entire item record as msgpack-encoded data in a single value ● Msgpack is like JSON but fast and small ● Storing entire records as one value has low overhead (vs. splitting records into multiple key/values in HBase) ● Doesn't handle very large values well; requires us to limit the size of single records ● We need arbitrarily nested data anyway, so we need some custom binary encoding ● We write custom Filters to support simple queries
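
A small msgpack round trip showing the nested-record-in-one-value idea (the item fields are made up):

```python
import json

import msgpack

item = {
    "url": "http://example.com/products/1",
    "price": 9.99,
    "variants": [{"size": "M", "stock": 3}, {"size": "L", "stock": 0}],
}

packed = msgpack.packb(item, use_bin_type=True)     # one value per HBase cell
restored = msgpack.unpackb(packed, raw=False)

assert restored == item
print(len(packed), "bytes as msgpack vs", len(json.dumps(item)), "bytes as JSON text")
```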

Slide 26

HBase Deployment ● All access is via a service that provides a restricted API ● Ensure no long running queries, deal with timeouts everywhere, ... ● Tune settings to work with a lot of data per node ● Set block size and compression for each Column Family ● Do not use block cache for large scans (Scan.setCacheBlocks) and ‘batch’ every time you touch fat columns ● Scripts to manage regions (balancing, merging, bulk delete) ● We host on dedicated servers ● Data replicated to backup clusters, where we run analytics
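
A hedged sketch of a bounded, batched read through the Thrift gateway with happybase (names and numbers are placeholders); block-cache hints such as Scan.setCacheBlocks are only reachable from the native Java client, so they are not shown here:

```python
import happybase

connection = happybase.Connection("hbase-thrift-host", timeout=30000)
table = connection.table("job_items")

# batch_size caps how many rows come back per round trip; scan_batching
# slices fat rows into chunks of N columns so one wide row cannot blow up
# a single response; limit keeps the query from running forever.
for key, data in table.scan(row_prefix=b"53/1/7:",
                            batch_size=100,
                            scan_batching=1000,
                            limit=10000):
    pass  # hand the rows off to the API layer
```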

Slide 27

Data Growth ● Items, logs and requests are collected in real time ● Millions of web crawling jobs each month ● Now at 4 billion pages a month and growing ● Thousands of separate active projects

Slide 28

Uses for HBase Data ● Consumed by customers directly ● Operational data (e.g. crawl state) used by applications ● We export data to HDFS and elsewhere ● Analysis using different tools, e.g. Spark

Slide 29

Hardware failures http://xkcd.com/1737/ ● Drive failures happen, and they matter ● Measure the cost of replacement and the time to failover ● A poorly performing drive is worse than an outright failure

Slide 30

Hardware failures http://xkcd.com/1737/ Don’t copy a solution that works at a much larger scale

Slide 31

HBase Lessons Learned ● It was a lot of work ○ The API is low level (untyped bytes) - check out Apache Phoenix ○ Many parts -> a longer learning curve and harder debugging. Tools are getting better ● Many of our early problems were addressed in later releases ○ reduced memory allocation & GC times ○ improved MTTR ○ online region merging ○ scanner heartbeat

Slide 32

Some advice from Jeff Dean ● Use back-of-the-envelope calculations ● Plan to scale 10-20x, but rewrite by 100x
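
A back-of-the-envelope example in that spirit; the 4 billion pages/month figure comes from earlier slides, while the per-page size and replication factor are assumptions:

```python
pages_per_month = 4_000_000_000
avg_stored_bytes = 5 * 1024     # assumption: ~5 KB stored per page
replication = 3                 # assumption: HDFS-style 3x replication

raw_tb = pages_per_month * avg_stored_bytes / 1024**4
print(f"~{raw_tb:.0f} TB/month before replication, "
      f"~{raw_tb * replication:.0f} TB/month on disk")
```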

Slide 33

Thank you! Shane Evans [email protected] scrapinghub.com