Slide 1

Slide 1 text

Big Data at Scrapinghub Shane Evans

Slide 2

Slide 2 text

About Shane ● 9 years web scraping ● Decades with Big Data ● Scrapy, Portia, Frontera, Scrapy Cloud, etc. ● Co-founded Scrapinghub

Slide 3

Slide 3 text

We turn web content into useful data

Slide 4

Slide 4 text

Founded in 2010, largest 100% remote company based outside of the US. We’re 126 teammates in 41 countries.

Slide 5

Slide 5 text

About Scrapinghub Scrapinghub specializes in data extraction. Our platform is used to scrape over 4 billion web pages a month. We offer: ● Professional Services to handle the web scraping for you ● Off-the-shelf datasets so you can get data hassle free ● A cloud-based platform that makes scraping a breeze

Slide 6

Slide 6 text

Who Uses Web Scraping Used by everyone from individuals to multinational companies: ● Monitor your competitors’ prices by scraping product information ● Detect fraudulent reviews and sentiment changes by scraping product reviews ● Track online reputation by scraping social media profiles ● Create apps that use public data ● Track SEO by scraping search engine results

Slide 7

Slide 7 text

“Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor

Slide 8

Slide 8 text

Scrapy Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way. Benefits ● No platform lock-in: Open Source ● Very popular (13k+ ★) ● Battle tested ● Highly extensible ● Great documentation

Slide 9

Slide 9 text

Introducing Portia Portia is a Visual Scraping tool that lets you get data without needing to write code. Benefits ● No platform lock-in: Open Source ● JavaScript dynamic content generation ● Ideal for non-developers ● Extensible ● It’s as easy as annotating a page

Slide 10

Slide 10 text

How Portia Works User provides seed URLs. Follows links: ● Users specify which links to follow (regexp, point-and-click) ● Automatically guesses: finds and follows pagination, infinite scroll, prioritizes content ● Knows when to stop Extracts data: ● Given a sample, extracts the same data from all similar pages ● Understands repetitive patterns ● Manages item schemas Run standalone or on Scrapy Cloud
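To make the "which links to follow (regexp)" idea concrete, here is a minimal sketch of regexp-based link selection. It is not Portia's implementation; the function name select_links, the patterns and the example.com URLs are all assumptions made for illustration.

```python
import re


def select_links(links, follow_patterns, exclude_patterns=()):
    """Keep links that match any follow pattern and no exclude pattern.

    Hypothetical helper: it illustrates regexp-based link selection only.
    """
    follow = [re.compile(p) for p in follow_patterns]
    exclude = [re.compile(p) for p in exclude_patterns]
    selected = []
    for url in links:
        if any(p.search(url) for p in exclude):
            continue  # explicitly excluded, never follow
        if any(p.search(url) for p in follow):
            selected.append(url)
    return selected


# Hypothetical links found on a seed page.
links = [
    "https://example.com/products/123",
    "https://example.com/products?page=2",
    "https://example.com/login",
    "https://example.com/about",
]

# Follow product detail pages and pagination, never the login page.
to_follow = select_links(
    links,
    follow_patterns=[r"/products/\d+$", r"[?&]page=\d+"],
    exclude_patterns=[r"/login"],
)
```

Keeping the exclude patterns separate from the follow patterns means a broad follow rule cannot accidentally pull in pages the user has ruled out.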

Slide 11

Slide 11 text

Portia UI

Slide 12

Slide 12 text

Large Scale Infrastructure Meet Scrapy Cloud, our PaaS for web crawlers: ● Scalable: Crawlers run on our cloud infrastructure ● Crawlera add-on ● Control your spiders: Command line, API or web UI ● Machine learning integration: BigML, MonkeyLearn, among others ● No lock-in: scrapyd, Scrapy or Portia to run spiders on your own infrastructure

Slide 13

Slide 13 text

Data Growth ● Items, logs and requests are collected in real time ● Millions of web crawling jobs each month ● Now at 4 billion a month and growing ● Thousands of separate active projects

Slide 14

Slide 14 text

Data Dashboard ● Browse data as the crawl is running ● Filter and download huge datasets ● Items can have arbitrary schemas

Slide 15

Slide 15 text

MongoDB - v1.0 MongoDB was a good fit to get a demo up and running, but it’s a bad fit for our use at scale ● Cannot keep hot data in memory ● Lock contention ● Cannot order data without sorting, skip+limit queries slow ● Poor space efficiency See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/

Slide 16

Slide 16 text

Storage Requirements - v2.0 ● High write volume. Writes are micro-batched ● Much of the data is written in order and immutable (like logs) ● Items are semi-structured nested data ● Expect exponential growth ● Random access from dashboard users, keep summary stats ● Sequential reading important (downloading & analyzing) ● Store data on disk, many TB per node
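The "writes are micro-batched" requirement can be sketched as a small buffer that hands records to storage in groups. This is an illustration under assumptions, not Scrapinghub's code: the class name MicroBatchWriter, the sink callable and the batch size are hypothetical, and a real writer would probably also flush on a timer.

```python
class MicroBatchWriter:
    """Buffer records and pass them to a sink in small batches.

    Hypothetical sketch: `sink` is any callable that persists a list of
    records, for example one storage round trip per batch.
    """

    def __init__(self, sink, max_records=100):
        self.sink = sink
        self.max_records = max_records
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self):
        # Send whatever is buffered, then start a fresh buffer.
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []


# Usage: collect batches in a list instead of writing to real storage.
batches = []
writer = MicroBatchWriter(batches.append, max_records=3)
for offset in range(7):
    writer.write({"offset": offset})
writer.flush()  # push the final partial batch
batch_sizes = [len(batch) for batch in batches]
```

Batching trades a little latency for far fewer round trips to storage, which matters when items, logs and requests arrive continuously.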

Slide 17

Slide 17 text

Bigtable looks good... ● Google’s Bigtable provides a sparse, distributed, persistent multidimensional sorted map ● We can express our requirements in terms of what Bigtable provides ● Performance characteristics should match our workload ● Inspired several open source projects

Slide 18

Slide 18 text

Apache HBase ● Modelled after Google’s Bigtable ● Provides real time random reads and writes to billions of rows with millions of columns ● Runs on Hadoop and uses HDFS ● Strictly consistent reads and writes ● Extensible via server side filters and coprocessors ● Java-based

Slide 19

Slide 19 text

HBase Architecture

Slide 20

Slide 20 text

HBase Key Selection Key selection is critical ● Atomic operations are at the row level: we use fat columns, update counts on write operations and delete whole rows at once ● Order is determined by the binary key: our offsets preserve order
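Because rows are ordered by the raw bytes of the key, numeric parts of a key have to be encoded so that byte order matches numeric order. The sketch below shows one common way to do that with fixed-width big-endian integers. The key layout (project, spider, job, offset) is an assumption for illustration, not the schema described in the talk.

```python
import struct

# Hypothetical key layout: three 32-bit ids followed by a 64-bit offset,
# all unsigned and big-endian so that bytewise order equals numeric order.
KEY_FORMAT = ">IIIQ"


def make_key(project_id, spider_id, job_id, offset):
    return struct.pack(KEY_FORMAT, project_id, spider_id, job_id, offset)


def parse_key(key):
    return struct.unpack(KEY_FORMAT, key)


offsets = [0, 1, 255, 256, 70000, 2**40]

# Binary keys built from increasing offsets are already in sorted order.
binary_keys = [make_key(7, 2, 15, offset) for offset in offsets]
binary_in_order = sorted(binary_keys) == binary_keys

# Decimal strings are not: b"1099511627776" sorts before b"255".
text_keys = [str(offset).encode("ascii") for offset in offsets]
text_in_order = sorted(text_keys) == text_keys
```

With a layout like this, all rows of one job share a key prefix, so a job's data can be read back sequentially with a single range scan.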

Slide 21

Slide 21 text

HBase Values We store the entire item record as msgpack-encoded data in a single value ● Msgpack is like JSON but fast and small ● Storing entire records as a value has low overhead (vs. splitting records into multiple key/values in HBase) ● Doesn’t handle very large values well, requires us to limit the size of single records ● We need arbitrarily nested data anyway, so we need some custom binary encoding ● Write custom Filters to support simple queries

Slide 22

Slide 22 text

HBase Deployment ● All access is via a single service that provides a restricted API ● Ensure no long running queries, deal with timeouts everywhere, ... ● Tune settings to work with a lot of data per node ● Set block size and compression for each Column Family ● Do not use block cache for large scans (Scan.setCacheBlocks) and ‘batch’ every time you touch fat columns ● Scripts to manage regions (balancing, merging, bulk delete) ● We host in Hetzner, on dedicated servers ● Data replicated to backup clusters, where we run analytics

Slide 23

Slide 23 text

HBase Lessons Learned ● It was a lot of work ○ API is low level (untyped bytes) - check out Apache Phoenix ○ Many parts -> longer learning curve and difficult to debug. Tools are getting better ● Many of our early problems were addressed in later releases ○ reduced memory allocation & GC times ○ improved MTTR ○ online region merging ○ scanner heartbeat

Slide 24

Slide 24 text

Broad Crawls

Slide 25

Slide 25 text

Broad Crawls Frontera allows us to build large scale web crawlers in Python: ● Scrapy support out of the box ● Distribute and scale custom web crawlers across servers ● Crawl Frontier Framework: large scale URL prioritization logic ● Aduana to prioritize URLs based on link analysis (PageRank, HITS)

Slide 26

Slide 26 text

Broad Crawls Many uses of Frontera: ○ News analysis, Topical crawling ○ Plagiarism detection ○ Sentiment analysis (popularity, likeability) ○ Due diligence (profile/business data) ○ Lead generation (extracting contact information) ○ Track criminal activity & find lost persons (DARPA)

Slide 27

Slide 27 text

Frontera Motivation Frontera started when we needed to identify frequently changing hubs. We had to crawl about 1 billion pages per week.

Slide 28

Slide 28 text

Frontera Architecture Supports both local and distributed mode ● Scrapy for crawl spiders ● Kafka for message bus ● HBase for storage and frontier maintenance ● Twisted.Internet for async primitives ● Snappy for compression

Slide 29

Slide 29 text

Frontera: Big and Small hosts Ordering of URLs across hosts is important: ● Politeness: a single host crawled by one Scrapy process ● Each Scrapy process crawls multiple hosts Challenges we found at scale: ● Queue flooded with URLs from the same host. ○ Underuse of spider resources. ● Additional per-host (per-IP) queue and metering algorithm. ● URLs from big hosts are cached in memory. ○ Found a few huge hosts (>20M docs) ● All queue partitions were flooded with huge hosts. ● Two MapReduce jobs: queue shuffling, and limiting every host to at most 100 docs.
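The flooding problem and the per-host queue idea can be illustrated with a small in-memory sketch. It is not Frontera's code (the talk describes Kafka, HBase and MapReduce jobs for this). The class name PerHostQueue, the cap of 3 and the example URLs are assumptions chosen to keep the example short.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse


class PerHostQueue:
    """One FIFO per host, with a cap on queued URLs per host.

    Hypothetical sketch: batches are filled round-robin across hosts so a
    single huge host cannot take over a whole batch.
    """

    def __init__(self, max_per_host=100):
        self.max_per_host = max_per_host
        self.queues = OrderedDict()  # host -> deque of URLs

    def push(self, url):
        host = urlparse(url).netloc
        if len(self.queues.get(host, ())) >= self.max_per_host:
            return False  # host is at its cap, drop the URL
        self.queues.setdefault(host, deque()).append(url)
        return True

    def next_batch(self, size):
        """Take up to `size` URLs, one per host per round."""
        batch = []
        while len(batch) < size and self.queues:
            for host in list(self.queues):
                queue = self.queues[host]
                batch.append(queue.popleft())
                if not queue:
                    del self.queues[host]  # nothing left for this host
                if len(batch) == size:
                    break
        return batch


frontier = PerHostQueue(max_per_host=3)
for i in range(10):
    frontier.push(f"http://big.example/page{i}")  # only 3 are kept
frontier.push("http://small.example/a")
frontier.push("http://tiny.example/b")
batch = frontier.next_batch(4)
```

In this run the batch interleaves the three hosts, and the big host contributes a second URL only after every other host has had a turn.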

Slide 30

Slide 30 text

Frontera: tuning Breadth-first strategy: huge number of DNS requests ● Recursive DNS server on every spider node, upstream to Verizon & OpenDNS ● Scrapy patch for a large thread pool for DNS resolving and timeout customization Intensive network traffic from workers to services ● Throughput between workers and Kafka/HBase ~ 1Gbit/s ● Thrift compact protocol for HBase ● Message compression in Kafka with Snappy Batching and caching to achieve performance

Slide 31

Slide 31 text

Duplicate Content The web is full of duplicate content. Duplicate Content negatively impacts: ● Storage ● Re-crawl performance ● Quality of data Efficient algorithms for Near Duplicate Detection, like SimHash, are applied to estimate similarity between web pages to avoid scraping duplicated content.

Slide 32

Slide 32 text

Near Duplicate Detection Uses

Compare prices of products scraped from different retailers by finding near duplicates in a dataset:

Title | Store | Price
ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95

Merge similar items to avoid duplicate entries:

Name | Summary | Location
Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695
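As a rough illustration of matching the two laptop listings above, the sketch below scores titles by token overlap (Jaccard similarity). This is a deliberately simple stand-in and not the near-duplicate method from the talk. The tokenizer and any threshold for calling two items "the same" are assumptions.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens; dots are kept so '12.5' stays whole."""
    return set(re.findall(r"[a-z0-9.]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


acme = "ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB)"
xyz = "Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5'', HDD 320)"
unrelated = "Saint Fin Barre's Cathedral"

same_product = jaccard(acme, xyz)      # 6 shared tokens out of 14 distinct
different = jaccard(acme, unrelated)   # no shared tokens
```

The two laptop titles share the brand, model, CPU, screen size and disk size tokens, so they score far higher than an unrelated record even though the wording differs.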

Slide 33

Slide 33 text

What we’re seeing... ● More data is available than ever ● Scrapinghub can provide web data in a usable format ● We’re combining multiple data sources and analyzing ● The technology to use big data is rapidly improving and becoming more accessible ● Data Science is everywhere

Slide 34

Slide 34 text

Thank you! Shane Evans [email protected] scrapinghub.com