Slide 1

Scraping the Web with Scrapinghub for Finance

Slide 2

We turn web content into useful data

Slide 3

About Scrapinghub

Scrapinghub specializes in data extraction. Our platform is used to scrape over 4 billion web pages a month.

We offer:
● Professional Services to handle the web scraping for you
● Off-the-shelf datasets so you can get data hassle free
● A cloud-based platform that makes scraping a breeze

Slide 4

Founded in 2010, we are the largest 100% remote company based outside of the US. We’re 134 teammates in 48 countries.

Slide 5

“Getting information off the Internet is like taking a drink from a fire hydrant.” – Mitchell Kapor

Slide 6

Scrapy

Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.

Benefits:
● No platform lock-in: Open Source
● Very popular (13k+ ★)
● Battle tested
● Highly extensible
● Great documentation
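
To give a concrete feel for the framework, here is a minimal sketch of a spider crawling quotes.toscrape.com, the Scrapy demo site; the selectors apply only to that site and the field names are illustrative.

import scrapy

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider: crawls the quotes.toscrape.com demo site
    # and yields one item per quote block.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        # Follow the pagination link, if any, and parse the next page the same way.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Saved as quotes_spider.py, it can be run with "scrapy runspider quotes_spider.py -o quotes.json" to collect the scraped items as JSON.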

Slide 7

Portia

Portia is a Visual Scraping tool that lets you get data without needing to write code.

Benefits:
● No platform lock-in: Open Source
● Supports JavaScript-generated dynamic content
● Ideal for non-developers
● Extensible
● It’s as easy as annotating a page

Slide 8

Portia

Slide 9

Large Scale Infrastructure

Meet Scrapy Cloud, our PaaS for web crawlers:
● Scalable: Crawlers run on EC2 instances or dedicated servers
● Crawlera add-on
● Control your spiders: command line, API or web UI (see the sketch below)
● Machine learning integration: BigML, MonkeyLearn
● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
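
As a rough sketch of the "command line, API or web UI" point, the snippet below drives Scrapy Cloud through the python-scrapinghub client; the API key, project ID and spider name are placeholders, and the same steps can also be done with the shub command-line tool or the web UI.

from scrapinghub import ScrapinghubClient  # pip install scrapinghub

# Placeholder API key and project ID -- substitute your own.
client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(12345)

# Start a job for an already-deployed spider (e.g. one uploaded with `shub deploy`).
job = project.jobs.run("quotes")

# Once the job has produced output, its items can be streamed back.
for item in job.items.iter():
    print(item)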

Slide 10

Broad Crawls

Frontera allows us to build large-scale web crawlers in Python:
● Scrapy support out of the box (see the settings sketch below)
● Distribute and scale custom web crawlers across servers
● Crawl Frontier Framework: large-scale URL prioritization logic
● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
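
As a sketch of the out-of-the-box Scrapy support, Frontera is wired into a Scrapy project through its settings. The module paths below follow Frontera's documented Scrapy integration but may differ between versions, and myproject.frontera_settings is a placeholder module name.

# settings.py of a Scrapy project handing scheduling over to Frontera (sketch).
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module holding the Frontera-specific settings (backend, crawl strategy, etc.).
FRONTERA_SETTINGS = 'myproject.frontera_settings'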

Slide 11

Web Scraping Use Cases

Slide 12

Competitive Pricing

Companies use web scraping to monitor the pricing and ratings of competitors:
● Scrape online retailers
● Structure the data in a search engine or DB (see the item sketch below)
● Create an interface to search for products
● Sentiment analysis for product rankings
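
As a small illustration of structuring the scraped data before loading it into a DB or search engine, here is a hypothetical Scrapy item definition; the field names are assumptions, not a fixed schema.

import scrapy

# Illustrative schema for price monitoring; adjust fields to the retailers being scraped.
class ProductPriceItem(scrapy.Item):
    retailer = scrapy.Field()
    product_name = scrapy.Field()
    price = scrapy.Field()
    currency = scrapy.Field()
    rating = scrapy.Field()
    scraped_at = scrapy.Field()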

Slide 13

Monitor Resellers

We help a leading IT manufacturer monitor the activities of their resellers:
● Tracking and watching out for stolen goods
● Pricing agreement violations
● Customer support responses on complaints
● Product line quality checks

Slide 14

Lead Generation

Mine scraped data to identify who to target in a company for your outbound sales campaigns:
● Locate possible leads in your target market
● Identify the right contacts within each one
● Augment the information you already have on them

Slide 15

Real Estate

Crawl property websites and use the data obtained in order to:
● Estimate house prices
● Track rental values
● Follow housing stock movements
● Gain insight into real estate agents and homeowners

Slide 16

Fraud Detection

Monitor for sellers that offer products violating the ToS of credit card companies, including:
● Drugs
● Weapons
● Gambling

Identify stolen cards and IDs on the Dark Web:
● Forums where hackers share ID numbers / PINs

Slide 17

Company Reputation

Sentiment analysis of a company or product through newsletters, social networks and other natural language data sources:
● NLP to create an associated sentiment indicator (see the sketch below)
● Tracking the relevant news supporting the indicator can lead to market insights into long-term trends
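
One possible way to turn scraped text into such a sentiment indicator is NLTK's VADER analyzer, sketched below with made-up headlines; a production indicator would also weight sources, volumes and time.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off download of the lexicon VADER needs

# Made-up headlines standing in for scraped news or social posts about a company.
headlines = [
    "ACME beats earnings expectations for the third quarter in a row",
    "Regulators open an investigation into ACME's accounting practices",
]

analyzer = SentimentIntensityAnalyzer()
scores = [analyzer.polarity_scores(text)["compound"] for text in headlines]

# Naive indicator: the average compound score, ranging from -1 (negative) to +1 (positive).
print(sum(scores) / len(scores))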

Slide 18

Consumer Behavior

Extract data from forums and websites like Reddit to evaluate consumer reviews and commentary:
● Volume of comments across brands
● Topics of discussion
● Comparisons with other brands and products
● Evaluate product launches and marketing tactics

Slide 19

Tracking Legislation

Monitor bills and regulations that are being discussed in Congress. Access court judgments and opinions in order to:
● Follow discussions
● Try to forecast legislative outcomes
● Track regulations that impact different economic sectors

Slide 20

Hiring

Crawl and extract data from job boards and other sources in order to:
● Understand hiring trends in different sectors or regions
● Find candidates for jobs, or future leaders
● Spot and rescue employees who are shopping for a new job

Slide 21

Monitoring Corruption

Journalists and analysts can create Open Data by extracting information from difficult-to-access government websites:
● Track the activities of lobbyists
● Spot patterns in the behavior of government officials
● Detect disruptions in the economy due to corruption allegations

Slide 22

Thank you! scrapinghub.com