Scrapy & Scrapinghub - Speaker Deck

Slide 1

Slide 1 text

Scrapy & Scrapinghub
Lessons learned building a company around an open source project
Pablo Hoffman - PyCon Uruguay 2013

Slide 2

Slide 2 text

Agenda
1. Who am I?
2. Scrapy
3. Scrapinghub

Slide 3

Slide 3 text

Who am I?
Hacker, father, entrepreneur
Scrapy co-creator
Scrapinghub co-founder
Loving Python since 2005

Slide 4

Slide 4 text

Scrapy

Slide 5

Slide 5 text

Scrapy → Motivation
Scrapers back then were messy, ad-hoc
  urllib + BeautifulSoup + cross fingers
Let’s make a framework!
  conventions
  write well-structured code
  share code & patterns between sites
  crawl politely, efficiently & reliably
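To make "crawl politely" concrete, here is a minimal sketch using only the Python standard library. It is not Scrapy's implementation; the `PoliteGate` class, the `example-bot` user agent, and the sample robots.txt content are all hypothetical names chosen for illustration. The idea is simply to check robots.txt rules and to space out requests to the same host.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, robots_txt, user_agent="example-bot", delay=1.0,
                 clock=time.monotonic):
        # Parse robots.txt rules from a string (no network access needed).
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        self.delay = delay          # minimum seconds between hits to one host
        self.clock = clock          # injectable clock, handy for testing
        self.last_hit = {}          # host -> time of the last recorded request

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before requesting this URL's host again."""
        last = self.last_hit.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (self.clock() - last))

    def record(self, url):
        """Remember that a request to this URL's host was just made."""
        self.last_hit[urlparse(url).netloc] = self.clock()


if __name__ == "__main__":
    gate = PoliteGate("User-agent: *\nDisallow: /private/\n", delay=2.0)
    for url in ["http://example.com/a", "http://example.com/private/x"]:
        if gate.allowed(url):
            time.sleep(gate.wait_time(url))
            gate.record(url)  # a real crawler would fetch the page here
```

An ad-hoc scraper has to carry logic like this around by hand for every site, which is the kind of repetition a framework can absorb. Scrapy itself exposes settings for this sort of behavior (for example `ROBOTSTXT_OBEY` and `DOWNLOAD_DELAY`).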

Slide 6

Slide 6 text

Scrapy → Early stages
direction less clear
no community, no users
slow progress
messy code
minimal documentation
faith, patience & hard work

Slide 7

Slide 7 text

Scrapy → Growing up
Documentation gets better over time
Support can be top notch from day 1
good support → more users engaged → community grows!

Slide 8

Slide 8 text

Scrapy → Evolution
[Chart: interest in Scrapy over time. Source: Google Trends]

Slide 9

Slide 9 text

Scrapy → Today
GitHub: #12 in Top Python projects; 3,000+ watchers, 750 forks
Stack Overflow: 1,500 questions
Twitter: 1,000 followers
Mailing list: 1,600 members, 200 messages/month
IRC: 50 users

Slide 10

Slide 10 text

Scrapinghub

Slide 11

Slide 11 text

Scrapinghub → Motivation
We love Scrapy! → So how do we keep working on it? → Let’s make a business around it!
Crawling at scale is hard & expensive → Let’s bring that to everyone!

Slide 12

Slide 12 text

Scrapinghub → Inspiration
Consulting
  Cloudera (Hadoop, et al.)
  LucidWorks (Lucene)
SaaS
  Many search SaaS (Elasticsearch, Lucene)
  Automattic (WordPress)
  GetSentry (Sentry)
PaaS
  Heroku, Amazon

Slide 13

Slide 13 text

Scrapinghub → Conception
Validate the business
  freelancing sites
  real customers, concrete projects
  don’t worry about scaling (at first)
  start consulting, productize later
Our case
  2 years writing Scrapy crawlers at Insophia → there was a real business

Slide 14

Slide 14 text

Scrapinghub → Community
Python community
  web crawling & data mining → hot topics
  free, organic, tech-savvy referrals
  bidirectional: give and you shall receive
Help, don’t sell
  grateful users → potential customers
Always answer
  even if it takes a while, and the answer is “No”

Slide 15

Slide 15 text

Scrapinghub → Management
Many useful OSS practices
  remote interactions
  different time zones
  meritocracy
  code reviews
OSS improvements
  Scrapinghub & Scrapy must grow together
Also important
  Self-management, keep track of hours

Slide 16

Slide 16 text

Scrapinghub → Team
Fully distributed team
  requires discipline, communication, responsiveness
  time-zone friendly to US, EU, Asia
Internal tools
  Google Apps, HipChat, GitHub, Trello
2010
  1 person full-time
  2 people part-time
Today
  35 people full-time
  17 countries

Slide 17

Slide 17 text

Scrapinghub → Hiring
Attract good developers
  remote work, flexible times
  open source (your work is yours)
  very technical team, good developers
Worldwide pool of developers
  more tailored skills, already know our tools
Our case
  Stack Overflow Careers + Trello + trial runs

Slide 18

Slide 18 text

Scrapinghub → Consulting
Easier to bootstrap
  harder to scale
Position yourself as experts
  still working on it :)
Helps to understand customer needs
  find patterns → devise product / open source project
Our case
  Scrapy web crawlers (main source of revenue)
  Scrapy consultancy, tuning & training

Slide 19

Slide 19 text

Scrapinghub → Product & Services
Solve recurrent, common, tedious tasks
Our goal
  infrastructure & services for running web crawlers
Our products
  Scrapy Cloud (PaaS)
  Autoscraping (SaaS)
  Crawlera, Splash (developer APIs)

Slide 20

Slide 20 text

Closing thoughts
Proud to watch our open source baby grow
Happy to make my living with it
Confident that it has survived, and will survive, any company behind it
Hopeful that Scrapinghub will, someday, conquer the world :)
Love your open source project!

Slide 21

Slide 21 text

Questions?
Get involved!
http://scrapy.org
http://scrapinghub.com