Big Data at Scrapinghub

Talk from the Cork Big Data and Analytics meetup
http://www.meetup.com/Cork-Big-Data-Analytics-Group/events/229772532/

Cork-based company Scrapinghub offers tools to turn web-based content into useful data, including a cloud-based web crawling platform, off-the-shelf datasets and turn-key web scraping services. At this meetup, director and co-founder Shane Evans will give an overview and history of the company, discuss the data architecture and provide an insight into their data and analytics plans for the future.

Shane Evans

April 04, 2016
Transcript

  1. Big Data at Scrapinghub
    Shane Evans

  2. About Shane
    ● 9 years web scraping
    ● Decades with Big Data
    ● Scrapy, Portia, Frontera,
    Scrapy Cloud, etc.
    ● Co-founded Scrapinghub

  3. We turn web content into useful data

  4. Founded in 2010, largest 100% remote company based outside of the US
    We’re 126 teammates in 41 countries

  5. About Scrapinghub
    Scrapinghub specializes in data extraction. Our platform is
    used to scrape over 4 billion web pages a month.
    We offer:
    ● Professional Services to handle the web scraping for you
    ● Off-the-shelf datasets so you can get data hassle free
    ● A cloud-based platform that makes scraping a breeze

  6. Who Uses Web Scraping
    Used by everyone from individuals to
    multinational companies:
    ● Monitor your competitors’ prices by scraping
    product information
    ● Detect fraudulent reviews and sentiment changes
    by scraping product reviews
    ● Track online reputation by scraping social media
    profiles
    ● Create apps that use public data
    ● Track SEO by scraping search engine results

  7. “Getting information off the
    Internet is like taking a drink
    from a fire hydrant.”
    – Mitchell Kapor

  8. Scrapy
    Scrapy is a web scraping framework that
    gets the dirty work related to web crawling
    out of your way.
    Benefits
    ● No platform lock-in: Open Source
    ● Very popular (13k+ ★)
    ● Battle tested
    ● Highly extensible
    ● Great documentation

  9. Introducing Portia
    Portia is a Visual Scraping tool that lets you
    get data without needing to write code.
    Benefits
    ● No platform lock-in: Open Source
    ● JavaScript dynamic content generation
    ● Ideal for non-developers
    ● Extensible
    ● It’s as easy as annotating a page

  10. How Portia Works
    User provides seed URLs:
    Follows links
    ● Users specify which links to follow (regexp, point-and-click)
    ● Automatically guesses: finds and follows pagination, infinite scroll, prioritizes content
    ● Knows when to stop
    Extracts data
    ● Given a sample, extracts the same data from all similar pages
    ● Understands repetitive patterns
    ● Manages item schemas
    Run standalone or on Scrapy Cloud
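
The regexp-based link following described on this slide can be sketched roughly as below. This is an illustrative sketch only, not Portia's actual code: the function `select_links`, its parameters, and the example URLs are all assumptions made for the example.

```python
import re
from urllib.parse import urljoin, urlparse


def select_links(base_url, hrefs, follow_patterns, same_host_only=True):
    """Return absolute URLs from hrefs that match any of the follow patterns.

    Hypothetical helper: resolves relative links against base_url, optionally
    drops links to other hosts, and removes duplicates while keeping order.
    """
    compiled = [re.compile(pattern) for pattern in follow_patterns]
    base_host = urlparse(base_url).netloc
    selected = []
    seen = set()
    for href in hrefs:
        url = urljoin(base_url, href)
        if same_host_only and urlparse(url).netloc != base_host:
            continue
        if url in seen:
            continue
        if any(pattern.search(url) for pattern in compiled):
            seen.add(url)
            selected.append(url)
    return selected


links = select_links(
    "http://example.com/catalog/",
    ["page/2", "/product/42", "/about",
     "http://other.example.org/product/1", "page/2"],
    [r"/catalog/page/\d+$", r"/product/\d+$"],
)
print(links)
```

Here the pagination link and the product link are kept, while the "about" page, the off-site link and the repeated pagination link are dropped.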

  11. Large Scale Infrastructure
    Meet Scrapy Cloud, our PaaS for web crawlers:
    ● Scalable: Crawlers run on our cloud infrastructure
    ● Crawlera add-on
    ● Control your spiders: Command line, API or web UI
    ● Machine learning integration: BigML, MonkeyLearn, among others
    ● No lock-in: scrapyd, Scrapy or Portia to run spiders on your own
    infrastructure

  12. Data Growth
    ● Items, logs and requests are collected in real time
    ● Millions of web crawling jobs each month
    ● Now at 4 billion pages scraped a month and growing
    ● Thousands of separate active projects

  13. Data Dashboard
    ● Browse data as the crawl is running
    ● Filter and download huge datasets
    ● Items can have arbitrary schemas

  14. MongoDB - v1.0
    MongoDB was a good fit to get a demo up and
    running, but it’s a bad fit for our use at scale
    ● Cannot keep hot data in memory
    ● Lock contention
    ● Cannot order data without sorting; skip+limit
    queries are slow
    ● Poor space efficiency
    See https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/

  15. Storage Requirements - v2.0
    ● High write volume. Writes are micro-batched
    ● Much of the data is written in order and immutable (like logs)
    ● Items are semi-structured nested data
    ● Expect exponential growth
    ● Random access from dashboard users, keep summary stats
    ● Sequential reading important (downloading & analyzing)
    ● Store data on disk, many TB per node
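
As an illustration of the micro-batched writes mentioned on this slide, here is a minimal sketch of a buffering writer. The class `MicroBatchWriter` and its thresholds are hypothetical and not taken from the talk; a production version would most likely also flush on a timer so that a slow crawl does not hold records back indefinitely.

```python
class MicroBatchWriter:
    """Buffer records and hand them to `flush` in small batches.

    Hypothetical sketch: `flush` is any callable that accepts a list of
    byte strings, for example a function that issues one storage write.
    """

    def __init__(self, flush, max_items=100, max_bytes=64 * 1024):
        self._flush = flush
        self._max_items = max_items
        self._max_bytes = max_bytes
        self._buffer = []
        self._size = 0

    def write(self, record):
        self._buffer.append(record)
        self._size += len(record)
        # Flush as soon as either threshold is reached.
        if len(self._buffer) >= self._max_items or self._size >= self._max_bytes:
            self.flush()

    def flush(self):
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []
            self._size = 0


batches = []
writer = MicroBatchWriter(batches.append, max_items=3)
for i in range(7):
    writer.write(b"item-%d" % i)
writer.flush()  # push out the final partial batch
print([len(batch) for batch in batches])
```

With `max_items=3`, seven writes end up as two full batches and one partial batch, so the backing store sees three write calls instead of seven.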

  16. Bigtable looks good...
    Google’s Bigtable provides a sparse,
    distributed, persistent
    multidimensional sorted map
    Can express our requirements in what
    Bigtable provides
    Performance characteristics should
    match our workload
    Inspired several open source projects

  17. Apache HBase
    ● Modelled after Google’s Bigtable
    ● Provides real time random read and write to billions of rows with
    millions of columns
    ● Runs on Hadoop and uses HDFS
    ● Strictly consistent reads and writes
    ● Extensible via server side filters and coprocessors
    ● Java-based

  18. HBase Architecture

  19. HBase Key Selection
    Key selection is critical
    ● Atomic operations are at the row level: we use fat columns, update counts on write
    operations and delete whole rows at once
    ● Order is determined by the binary key: our offsets preserve order

  20. HBase Values
    We store the entire item record as msgpack-encoded data in a single value
    ● Msgpack is like JSON but fast and small
    ● Storing entire records as a value has low
    overhead (vs. splitting records into multiple
    key/values in HBase)
    ● HBase doesn’t handle very large values well, which
    requires us to limit the size of single records
    ● We need arbitrarily nested data anyway, so we
    need some custom binary encoding
    ● Write custom Filters to support simple queries
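
To give a feel for the "like JSON but small" point, the sketch below hand-encodes a small subset of the MessagePack format using only the standard library and compares the result with compact JSON. It is an illustration only: real code would use a MessagePack library, and the names `pack`, `encode_record` and `MAX_RECORD_BYTES`, as well as the 1 MB limit, are assumptions rather than values from the talk.

```python
import json
import struct

MAX_RECORD_BYTES = 1024 * 1024  # hypothetical per-record size limit


def pack(obj):
    """Encode None, bool, int, float, str, list and dict as MessagePack.

    Covers only short strings and small containers; anything else raises.
    """
    if obj is None:
        return b"\xc0"
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj < 0x80:
            return struct.pack("B", obj)            # positive fixint
        return b"\xd3" + struct.pack(">q", obj)     # int 64
    if isinstance(obj, float):
        return b"\xcb" + struct.pack(">d", obj)     # float 64
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) < 32:
            return struct.pack("B", 0xA0 | len(data)) + data   # fixstr
        if len(data) < 65536:
            return b"\xda" + struct.pack(">H", len(data)) + data  # str 16
        raise ValueError("string too long for this sketch")
    if isinstance(obj, (list, tuple)):
        if len(obj) > 15:
            raise ValueError("array too long for this sketch")
        return struct.pack("B", 0x90 | len(obj)) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("map too large for this sketch")
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return struct.pack("B", 0x80 | len(obj)) + body
    raise TypeError("unsupported type: %r" % type(obj))


def encode_record(item, max_bytes=MAX_RECORD_BYTES):
    """Encode a whole item as one value, refusing oversized records."""
    value = pack(item)
    if len(value) > max_bytes:
        raise ValueError("record exceeds %d bytes" % max_bytes)
    return value


item = {"title": "ThinkPad X220", "price": 599.89,
        "tags": ["laptop", "lenovo"], "specs": {"ram_gb": 4}}
packed = encode_record(item)
as_json = json.dumps(item, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))
```

The nested item becomes a single opaque value, which is the shape that would be stored in one HBase cell, and the size check mirrors the need to cap individual records.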

  21. HBase Deployment
    ● All access is via a single service that provides a restricted API
    ● Ensure no long running queries, deal with timeouts everywhere, ...
    ● Tune settings to work with a lot of data per node
    ● Set block size and compression for each Column Family
    ● Do not use block cache for large scans (Scan.setCacheBlocks) and
    ‘batch’ every time you touch fat columns
    ● Scripts to manage regions (balancing, merging, bulk delete)
    ● We host in Hetzner, on dedicated servers
    ● Data replicated to backup clusters, where we run analytics

  22. HBase Lessons Learned
    ● It was a lot of work
    ○ API is low level (untyped bytes) - check out Apache Phoenix
    ○ Many parts -> longer learning curve and difficult to debug. Tools
    are getting better
    ● Many of our early problems were addressed in later releases
    ○ reduced memory allocation & GC times
    ○ improved MTTR
    ○ online region merging
    ○ scanner heartbeat

  23. Broad Crawls

  24. Broad Crawls
    Frontera allows us to build large scale web crawlers in Python:
    ● Scrapy support out of the box
    ● Distribute and scale custom web crawlers across servers
    ● Crawl Frontier Framework: large scale URL prioritization logic
    ● Aduana to prioritize URLs based on link analysis (PageRank, HITS)

  25. Broad Crawls
    Many uses of Frontera:
    ○ News analysis, Topical crawling
    ○ Plagiarism detection
    ○ Sentiment analysis (popularity, likeability)
    ○ Due diligence (profile/business data)
    ○ Lead generation (extracting contact information)
    ○ Track criminal activity & find lost persons (DARPA)

  26. Frontera Motivation
    Frontera started when we needed to identify frequently changing
    hubs
    We had to crawl about 1 billion pages per week

  27. Frontera Architecture
    Supports both local and distributed mode
    ● Scrapy for crawl spiders
    ● Kafka for message bus
    ● HBase for storage and frontier
    maintenance
    ● twisted.internet for async primitives
    ● Snappy for compression

  28. Frontera: Big and Small hosts
    Ordering of URLs across hosts is important:
    ● Politeness: a single host crawled by one Scrapy process
    ● Each Scrapy process crawls multiple hosts
    Challenges we found at scale:
    ● Queue flooded with URLs from the same host.
    ○ Underuse of spider resources.
    ● Additional per-host (per-IP) queue and metering
    algorithm.
    ● URLs from big hosts are cached in memory.
    ○ We found a few very large hosts (>20M docs)
    ● All queue partitions were flooded with huge hosts.
    ● Two MapReduce jobs: queue shuffling, limit all hosts to
    100 docs MAX.

  29. Frontera: tuning
    Breadth-first strategy: a huge number of DNS requests
    ● Recursive DNS server on every spider node, upstream to
    Verizon & OpenDNS
    ● Scrapy patch for large thread pool for DNS resolving and
    timeout customization
    Intensive network traffic from workers to services
    ● Throughput between workers and Kafka/HBase ~ 1Gbit/s
    ● Thrift compact protocol for HBase
    ● Message compression in Kafka with Snappy
    Batching and caching to achieve performance

  30. Duplicate Content
    The web is full of duplicate content.
    Duplicate Content negatively impacts:
    ● Storage
    ● Re-crawl performance
    ● Quality of data
    Efficient algorithms for Near Duplicate Detection, like SimHash, are
    applied to estimate similarity between web pages to avoid scraping
    duplicated content.
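
A minimal SimHash sketch is shown below to make the idea concrete. It is a simplified illustration, not the implementation used in production: it weights every word token equally, uses SHA-256 as the token hash, and the names `simhash` and `hamming_distance` are chosen for the example.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def simhash(text):
    """Return a 64-bit SimHash fingerprint of the word tokens in text."""
    counts = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


page_a = "Saint Fin Barre's Cathedral, designed by William Burges, Cork"
page_b = "saint fin barre's cathedral designed by william burges cork!"
distance = hamming_distance(simhash(page_a), simhash(page_b))
print(distance)
```

Pages whose fingerprints differ in only a few bits are treated as near duplicates; the threshold is a tuning choice that depends on the data.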

  31. Near Duplicate Detection Uses
    Compare prices of products scraped from different retailers by finding
    near duplicates in a dataset:
    Title | Store | Price
    ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
    Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95
    Merge similar items to avoid duplicate entries:
    Name | Summary | Location
    Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
    St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695

  32. What we’re seeing...
    ● More data is available than ever
    ● Scrapinghub can provide web data in a usable format
    ● We’re combining multiple data sources and analyzing them
    ● The technology to use big data is rapidly improving and
    becoming more accessible
    ● Data Science is everywhere

  33. Thank you!
    Shane Evans
    scrapinghub.com