Slide 1

Slide 1 text

Frontera: open source, large-scale web crawling framework
Alexander Sibiryakov, October 15, 2015, [email protected]

Slide 2

Slide 2 text

• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive resolution, large-scale prediction of malicious download attempts.
Hola a todos!

Slide 3

Slide 3 text

• Over 2 billion requests per month (~800 per second)
• Focused crawls & broad crawls
We help turn web content into useful data.
(The slide shows an example of extracted JSON data: an article title, "'Extreme poverty' to fall below 10% of world population for first time", its theguardian.com URL, and a points count.)

Slide 4

Slide 4 text

Broad crawl use cases
• News analysis
• Topical crawling
• Price change monitoring
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Lead generation (extracting contact information)
• Tracking criminal activity & finding lost persons (DARPA)

Slide 5

Slide 5 text

Task
• Crawl the Spanish web to gather statistics about hosts and their sizes.
• Limit the crawl to the .es zone.
• Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on.
• Finishing condition: no hosts remain with fewer than 100 crawled documents.
• Low costs.

Slide 6

Slide 6 text

Spanish internet (.es) in 2012*
• Domain names registered - 1.56M (39% growth per year)
• Web servers in the zone - 283.4K (33.1%)
• Hosts - 4.2M (21%)
• Spanish web sites in the DMOZ catalog - 22,043
* Source: OECD Communications Outlook 2013 report

Slide 7

Slide 7 text

Solution
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• Twisted.Internet - async primitives for use in workers.
• Snappy - efficient compression algorithm for IO-bound applications.
* Network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.

Slide 8

Slide 8 text

Architecture (diagram): Kafka topics form the data bus between the crawling strategy workers (SW) and the storage workers (DB).

Slide 9

Slide 9 text

1. Big and small hosts problem
• When the crawler gets a huge number of links from a single host, simple prioritization models let the queue be flooded with URLs from that host.
• That causes underuse of spider resources.
• We added an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory (see the sketch below).
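A minimal, illustrative sketch of the per-host queueing and metering idea; class names and the per-host limit are hypothetical and not the actual Frontera implementation:

```python
from collections import defaultdict, deque

class PerHostQueue(object):
    """Keeps a separate FIFO per host and meters how many URLs each host
    may contribute to a downloader batch, so one big host cannot flood it."""

    def __init__(self, max_per_host_per_batch=10):
        self.queues = defaultdict(deque)   # host -> URLs cached in memory
        self.max_per_host = max_per_host_per_batch

    def push(self, host, url):
        self.queues[host].append(url)

    def next_batch(self, batch_size):
        batch = []
        # Round-robin over hosts, taking at most max_per_host URLs from each.
        for host, q in list(self.queues.items()):
            for _ in range(min(self.max_per_host, len(q))):
                batch.append(q.popleft())
                if len(batch) >= batch_size:
                    return batch
            if not q:
                del self.queues[host]
        return batch
```

The metering cap keeps the batch mixed even when one host contributes millions of URLs; the rest of that host's URLs simply wait in its in-memory queue.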

Slide 10

Slide 10 text

2. DDoSing the Amazon AWS DNS service
• The breadth-first strategy means previously unknown hosts are visited first, which generates a huge number of DNS requests.
• Solution: a recursive DNS server on each downloading node, with upstreams set to Verizon and OpenDNS.
• We used dnsmasq.

Slide 11

Slide 11 text

3. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy uses a thread pool to resolve DNS names to IPs.
• When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks.
• Scrapy reported numerous errors related to DNS name resolution and timeouts.
• We added options to Scrapy for adjusting the thread pool size and timeout (see the settings sketch below).
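These knobs are exposed as regular Scrapy settings in later releases; a minimal settings.py sketch (the values are examples, not the ones used in the talk):

```python
# settings.py -- example values only
REACTOR_THREADPOOL_MAXSIZE = 20   # size of the thread pool used for (blocking) DNS lookups
DNS_TIMEOUT = 20                  # seconds to wait for a DNS answer before failing the request
DNSCACHE_ENABLED = True           # cache resolved hostnames inside Scrapy
DNSCACHE_SIZE = 10000             # number of hostnames kept in the in-memory DNS cache
```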

Slide 12

Slide 12 text

4. Overloaded HBase region servers during state checks
• The crawler extracts hundreds of links per document on average.
• Before adding these links to the queue, they need to be checked against already crawled URLs (to avoid repeated visits).
• On small volumes SSDs were just fine; after the table grew we had to move to HDDs, and response times went up dramatically.
• Fix: a host-local fingerprint function for HBase keys (sketched below).
• Fix: tuning the HBase block cache so that the average host's states fit into one block.
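A sketch of the host-local key idea: prefixing the row key with a hash of the hostname keeps all URLs of one host in adjacent HBase rows, so a state check for a host touches few blocks. The exact key layout used by Frontera may differ; this is illustrative only.

```python
from hashlib import sha1
from zlib import crc32
try:                                  # Python 2.7, as listed in the software requirements
    from urlparse import urlparse
except ImportError:                   # also works on Python 3
    from urllib.parse import urlparse

def hostname_local_fingerprint(url):
    """Row key sketch: crc32(hostname) prefix + sha1(url) fingerprint.
    URLs of the same host share a 4-byte prefix and sort next to each other."""
    host = urlparse(url).hostname or ''
    prefix = '%08x' % (crc32(host.encode('utf-8')) & 0xffffffff)
    return prefix + sha1(url.encode('utf-8')).hexdigest()
```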

Slide 13

Slide 13 text

5. Intensive network traffic from workers to services
• We observed up to 1 Gbit/s of throughput between the workers, Kafka and HBase.
• Switched to the Thrift compact protocol for HBase communication.
• Enabled message compression in Kafka using Snappy (see the sketch below).
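A hedged sketch of what these two switches can look like with happybase and a current kafka-python client; host names and the topic name are placeholders, and the original 2015 code used the client libraries available at the time:

```python
import happybase
from kafka import KafkaProducer

# HBase over Thrift using the compact protocol (framed transport is
# typically required when the Thrift server runs in compact/framed mode).
hbase = happybase.Connection('hbase-thrift-host',     # placeholder host
                             protocol='compact',
                             transport='framed')

# Kafka producer with Snappy message compression (needs python-snappy installed).
producer = KafkaProducer(bootstrap_servers='kafka-host:9092',   # placeholder host
                         compression_type='snappy')
producer.send('spider-log', b'serialized message')              # placeholder topic
producer.flush()
```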

Slide 14

Slide 14 text

6. Further query and traffic optimizations for HBase
• State checks accounted for the lion's share of requests and network throughput.
• Consistency was another requirement.
• We created a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host, so the caches of different workers do not overlap.

Slide 15

Slide 15 text

State cache
• All operations are batched:
- if a key is absent from the cache, it is requested from HBase,
- every ~4K documents the cache is flushed to HBase.
• When the cache reaches 3M elements (~1 GB), it is flushed and cleaned up.
• A Least-Recently-Used (LRU) eviction policy seems a good fit here (see the sketch below).
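A simplified sketch of such a batched state cache; the backend object, method names and thresholds are illustrative assumptions, not the real Frontera code:

```python
class StateCache(object):
    """Caches per-URL crawl state in memory, batching reads and writes to HBase."""

    def __init__(self, backend, flush_every=4000, max_size=3000000):
        self.backend = backend          # assumed to expose get_states()/put_states()
        self.cache = {}                 # fingerprint -> state
        self.dirty = {}                 # entries not yet written to HBase
        self.flush_every = flush_every
        self.max_size = max_size
        self.seen_since_flush = 0

    def fetch(self, fingerprints):
        """Batched read: only keys absent from the cache are requested from HBase."""
        missing = [fp for fp in fingerprints if fp not in self.cache]
        if missing:
            self.cache.update(self.backend.get_states(missing))

    def update(self, fingerprint, state):
        self.cache[fingerprint] = state
        self.dirty[fingerprint] = state
        self.seen_since_flush += 1
        if self.seen_since_flush >= self.flush_every:   # flush every ~4K documents
            self.flush()
        if len(self.cache) >= self.max_size:            # ~3M elements: flush and clean up
            self.flush()
            self.cache.clear()   # the talk suggests LRU eviction here instead of a full clear

    def flush(self):
        if self.dirty:
            self.backend.put_states(self.dirty)         # one batched write
            self.dirty = {}
        self.seen_since_flush = 0
```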

Slide 16

Slide 16 text

Spider priority queue (slot)
• A cell has an array of:
- fingerprint,
- crc32(hostname),
- URL,
- score.
• Dequeueing takes the top N cells.
• Such a design is vulnerable to huge hosts.
• This can be partially solved by a scoring model that takes the known document count per host into account.
A sketch of the cell layout and top-N dequeueing follows below.
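An illustrative pure-Python model of a queue cell and top-N dequeueing (the real queue lives in HBase; names and values here are placeholders):

```python
import heapq
from collections import namedtuple
from zlib import crc32

# One cell of the spider priority queue (slot).
QueueCell = namedtuple('QueueCell', ['fingerprint', 'host_crc32', 'url', 'score'])

class SlotQueue(object):
    def __init__(self):
        self._heap = []   # max-heap emulated with negated scores

    def push(self, cell):
        heapq.heappush(self._heap, (-cell.score, cell))

    def pop_top_n(self, n):
        """Dequeue the N highest-scored cells. A single huge host with many
        high-scored pages can dominate the batch -- the problem described above."""
        return [heapq.heappop(self._heap)[1]
                for _ in range(min(n, len(self._heap)))]

queue = SlotQueue()
queue.push(QueueCell(fingerprint='sha1-of-url',                      # placeholder
                     host_crc32=crc32(b'example.es') & 0xffffffff,
                     url='http://example.es/', score=0.5))
batch = queue.pop_top_n(10)
```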

Slide 17

Slide 17 text

7. The big and small hosts problem (strikes back!)
• During the crawl we found a few very large hosts (>20M docs).
• All queue partitions were flooded with pages from these few huge hosts, because of the queue design and the scoring model used.
• We wrote two MapReduce jobs:
- queue shuffling,
- limiting every host to no more than 100 documents (a simplified sketch follows).
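The per-host limiting job can be thought of as a map/reduce over queue records: group by host, keep at most 100. A hedged local sketch of that logic (the real jobs ran as MapReduce over HBase; record layout and the tie-breaking rule are assumptions):

```python
from itertools import groupby
from operator import itemgetter

MAX_DOCS_PER_HOST = 100

def limit_per_host(queue_records):
    """queue_records: iterable of (host, url, score) tuples from a queue dump.
    Emits at most MAX_DOCS_PER_HOST records per host, keeping (for example)
    the highest-scored ones."""
    records = sorted(queue_records, key=lambda r: (r[0], -r[2]))  # by host, score desc
    for host, group in groupby(records, key=itemgetter(0)):
        for i, record in enumerate(group):
            if i >= MAX_DOCS_PER_HOST:
                break
            yield record
```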

Slide 18

Slide 18 text

Hardware requirements
• A single-threaded Scrapy spider gives 1200 pages/min from about 100 websites crawled in parallel.
• Spider-to-worker ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example:
- 12 spiders ~ 14.4K pages/min,
- 3 SW and 3 DB workers,
- 18 cores in total.

Slide 19

Slide 19 text

Software requirements
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS service.
CDH (100% open source Hadoop package)

Slide 20

Slide 20 text

Maintaining Cloudera Hadoop on Amazon EC2
• CDH is very sensitive to free space on the root partition, where parcels and Cloudera Manager storage live.
• We moved them to a separate EBS volume using symbolic links.
• The EBS volume should be at least 30 GB; baseline IOPS are enough.
• Initial hardware was 3 × m3.xlarge (4 CPU, 15 GB RAM, 2×40 GB SSD).
• After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3×2 TB HDD).

Slide 21

Slide 21 text

Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M pages crawled overall,
• 1.5 months,
• 22 websites with more than 50M pages.

Slide 22

Slide 22 text

Where are the rest of the web servers?!

Slide 23

Slide 23 text

Bow-tie model. A. Broder et al., Computer Networks 33 (2000) 309-320.

Slide 24

Slide 24 text

Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005.

Slide 25

Slide 25 text

Graph Structure in the Web — Revisited. Meusel, Vigna, WWW 2014.

Slide 26

Slide 26 text

Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy- and HBase-based backends are included).
• Canonical URL resolution abstraction: each document has many URLs; which one to use?
• Scrapy ecosystem: good documentation, big community, ease of customization.

Slide 27

Slide 27 text

Distributed Frontera features
• Communication layer is Apache Kafka: topic partitioning, offset mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module (an illustrative sketch follows below).
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
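A hypothetical sketch of what a crawling strategy module could look like. Class and method names are illustrative only, not the exact distributed-frontera API; see the project documentation for the real interface.

```python
class BreadthFirstStrategy(object):
    """Illustrative crawling strategy: score pages by click distance from the
    seeds and stop when the crawl goal is reached."""

    def __init__(self, max_depth=3):
        self.max_depth = max_depth

    def add_seeds(self, seeds):
        # Seeds get the highest score so they are scheduled first.
        # 'meta' and 'fingerprint' are assumed fields of the request objects.
        return {seed.meta['fingerprint']: 1.0 for seed in seeds}

    def page_crawled(self, response, links):
        depth = response.meta.get('depth', 0) + 1
        if depth > self.max_depth:
            return {}
        # Score decays with distance from the seeds -> breadth-first ordering.
        return {link.meta['fingerprint']: 1.0 / depth for link in links}

    def finished(self):
        # Crawl goal check, e.g. no hosts left with fewer than 100 crawled documents.
        return False
```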

Slide 28

Slide 28 text

References
• Distributed Frontera: https://github.com/scrapinghub/distributed-frontera
• Frontera: https://github.com/scrapinghub/frontera
• Documentation:
- http://distributed-frontera.readthedocs.org/
- http://frontera.readthedocs.org/

Slide 29

Slide 29 text

Future plans
• A lighter version, without HBase and Kafka, communicating over sockets.
• A revisiting strategy out of the box.
• A watchdog solution: tracking website content changes.
• A PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub services.
• Testing on larger volumes.

Slide 30

Slide 30 text

Contribute!
• Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
• A truly resource-intensive task: CPU, network, disks.
• Made at Scrapinghub, the company where Scrapy was created.
• It plans to become an Apache Software Foundation project.

Slide 31

Slide 31 text

We’re hiring! http://scrapinghub.com/jobs/

Slide 32

Slide 32 text

Mandatory sales slide
Crawl the web, at scale:
• cloud-based platform
• smart proxy rotator
Get data, hassle-free:
• off-the-shelf datasets
• turn-key web scraping
try.scrapinghub.com/BDS15

Slide 33

Slide 33 text

Gracias! Thank you! Alexander Sibiryakov, [email protected]