These are the slides for my talk "Building the Crawling System I wanted to have" from the Technologieplauscherl Meetup in Linz (AT) on 2022-11-30.
You can also re-watch the livestream at https://youtu.be/KaC0AZuVCpE?t=2430
Building the Crawling System I wanted to have

@crwlrsoft | @chrolear | @[email protected]
> Name: Christian Olear, a.k.a. otsch
> Previously worked as a web developer for 13 years
> Now in the process of building a business based on crawling / web data
> @chrolear on 🐦 | 📖 otsch.codes | @[email protected] 🐘
Crawling vs. Scraping

• Crawling: loading documents and following the links in them, to load those as well.
• Scraping: extracting data from one document.
I prefer to use the term Crawling

• Some people actually do both in separate processes
• Doing so, you're
  • parsing documents twice
  • unnecessarily using a lot of disk space or memory
• In my experience, a lot of people combine crawling & scraping
• Instead of always saying "crawling & scraping", I prefer the term crawling
💩 Craping
Why I started building a crawling system

• Have been building crawlers in my day job since 2008
• Over time, a lot of helper libraries have been released
• Some mainly focused on crawling, some on scraping
• Most only for HTTP and HTML
• Basically no implementation of a polite crawler/bot
Starting the project in 2018

• Definitely thought a bit too ambitiously in the beginning
• Wanted to build a URL parser, HTTP client, DOM library, …
• Started with the URL parser
The url package

• Parses URLs to objects, to access and modify all components separately
• Works with IDNs
• Even tells you the subdomain, domain and domain suffix, using Mozilla's Public Suffix List
• Resolves relative paths against a base URL
• Advanced query string API
• …
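A quick sketch of what that looks like in practice, based on the crwlr/url docs (the example URL is made up, and the method names are from memory, so treat this as a sketch rather than a definitive reference):

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Url\Url;

$url = Url::parse('https://www.example.com/articles?page=2');

var_dump($url->subdomain());    // 'www'
var_dump($url->domain());       // 'example.com'
var_dump($url->domainSuffix()); // 'com'

// Resolve a relative reference against this URL as base:
var_dump((string) $url->resolve('/contact')); // 'https://www.example.com/contact'
```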
The actual crawling system
The basic idea

• Everything is a "Step" (loading and also extracting data)
• Steps have input(s) and output(s), just like a simple function
• The library provides a lot of common steps
• They can be arranged to build a crawler
A step takes one input (at a time) and may yield outputs
One step's output is the next step's input
In a crawler, data cascades from one step to the next
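To make that cascade concrete, here is a hedged sketch of a small crawler built from the library's common steps, following the examples in the crwlr docs. The target URL and CSS selectors are made up; the class and step names are taken from the docs as I remember them:

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/articles'); // initial input

$crawler->addStep(Http::get())                        // load the list page
    ->addStep(Html::getLinks('#list a.article'))      // yield one link per article
    ->addStep(Http::get())                            // load each article page
    ->addStep(Html::first('article')->extract(['title' => 'h1']));

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```

Each step's outputs become the inputs of the next step, so one list-page input fans out into many article results.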
Steps are Generators

• Crawlers are often long-running and memory-consuming
• Therefore, steps have to return a Generator
• Had a crawler that previously used more than 4 GB of memory => with Generators, down to a max. usage below 0.5 GB
What is a Generator?

• Functions returning an array return the whole array (all its elements) at once
• Functions returning a Generator yield element by element. So the first returned element (output) can be processed further before the program has even created the next element.
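In plain PHP (no library involved), the difference looks like this:

```php
<?php

// Returns the whole array at once - all elements exist in memory together.
function allAtOnce(): array
{
    return [1, 2, 3];
}

// Yields element by element - each value can be processed further
// before the next one is even created.
function oneByOne(): Generator
{
    foreach ([1, 2, 3] as $number) {
        yield $number;
    }
}

foreach (oneByOne() as $number) {
    echo $number . "\n"; // prints 1, then 2, then 3
}
```

With a crawler, the "elements" are loaded pages and extracted records, so processing them one by one keeps memory usage flat no matter how many pages are crawled.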
The Crawler class

• You always need your own Crawler class
• At the very least, it defines a user agent
• You can also customize the Loader and the Logger
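A minimal sketch of such a class, assuming the method names from the crwlr docs (the `MyCustomLogger` class is hypothetical, just standing in for any PSR-3 logger):

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCustomizedCrawler extends HttpCrawler
{
    // Required: define the user agent your crawler identifies as.
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }

    // Optional: provide your own PSR-3 logger instead of the default CLI logger.
    protected function logger(): LoggerInterface
    {
        return new MyCustomLogger(); // hypothetical logger class
    }
}
```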
Loader?

• Currently there is only the HttpLoader
• By default, it loads pages using the Guzzle HTTP client
• You can also switch to a headless Chrome (uses the chrome-php/chrome composer package under the hood)
• It is home to the politeness features, which also have some settings
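Switching the HttpLoader to a headless Chrome is a one-liner; the method name below is what I believe the crwlr docs use, so treat it as a sketch:

```php
// Assuming $crawler extends HttpCrawler; method name per the crwlr docs.
$crawler->getLoader()->useHeadlessBrowser();
```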
What about Pagination?
New Pagination Feature
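As a hedged sketch of the feature (introduced around v0.7, per the crwlr docs; the URL and CSS selector are made up): the `Http::get()` step gets a `paginate()` method that automatically follows the pagination links found via a selector.

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/articles');

// Load the first list page, then automatically load all further pages
// linked from the element matching the given CSS selector.
$crawler->addStep(Http::get()->paginate('.pagination'));
```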
There's also Loops

But I'll probably remove that before v1.0 😅
Step Groups

• Can be used if you need to feed multiple steps with the same input
• You can also combine the outputs into one group output
• Example: you want to extract data from HTML, and additionally from JSON-LD within a <script> tag in the same HTML document
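Roughly, a group looks like this (a sketch based on the crwlr docs; the selectors are made up, and both group steps receive the same HTML document as input):

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;

// Feed the same input to two extraction steps and merge
// their outputs into one combined group output.
$crawler->addStep(
    Crawler::group()
        ->addStep(Html::first('article')->extract(['title' => 'h1']))
        ->addStep(Html::first('aside')->extract(['related' => 'a']))
        ->combineToSingleOutput()
);
```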
Composing the Crawling Result

• By default, the final results are the last step's outputs
• By using addKeysToResult() (array outputs) or setResultKey() (scalar outputs), a step adds data to the final crawling result
• This way, you can compose results from data coming from different steps
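A sketch of both methods in use, under the assumption that the steps and selectors below match the crwlr docs (URLs and selectors are made up):

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Scalar outputs (the link URLs) go into the result under the key 'url'.
$crawler->addStep(Html::getLinks('a.article')->setResultKey('url'));

$crawler->addStep(Http::get());

// Array outputs: add all extracted keys to the final result.
$crawler->addStep(
    Html::first('article')
        ->extract(['title' => 'h1', 'date' => '.date'])
        ->addKeysToResult()
);
```

Each final result then contains `url`, `title` and `date`, even though those values come from different steps.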
Writing Custom Steps

Easy:
• Extend the Step class
• Implement the invoke() method
• Optionally use validateAndSanitizeInput()
• You can use the PSR-3 logger via $this->logger
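A minimal custom step might look like this (a sketch: the class name and regex are mine, the `Step` base class and method signatures are as I recall them from the crwlr docs):

```php
<?php

use Crwlr\Crawler\Steps\Step;

class ExtractEmails extends Step
{
    // Optional: make sure the input is usable before invoke() runs.
    protected function validateAndSanitizeInput(mixed $input): string
    {
        return (string) $input;
    }

    // Yield one output per email address found in the input string.
    protected function invoke(mixed $input): Generator
    {
        $this->logger->info('Looking for email addresses');

        if (preg_match_all('/[\w.+-]+@[\w-]+\.[\w.-]+/', $input, $matches)) {
            foreach ($matches[0] as $email) {
                yield $email;
            }
        }
    }
}
```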
Politeness built in

The default HttpCrawler class automatically:
• Loads robots.txt files and sticks to the rules (if you use a BotUserAgent)
• Waits a little bit between requests. The wait time is based on how long responses take to be delivered. (Can also be set to 0 if you want)
• Reacts to 429 (Too Many Requests) responses (also checks the Retry-After header)
Logger

• The Crawler class uses a PSR-3 LoggerInterface
• It automatically passes it on to all the steps you add
• If you don't provide your own, there's a default "CliLogger"
Output Filters

• There are also a lot of filters that can be applied to any step
• Outputs not matching the filter are not yielded
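For example, restricting link outputs to a single host might look like this (a sketch; the `Filter` class and its static constructors are named per the crwlr docs as I remember them, and the host is made up):

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

// Only yield links whose URL points to the given host.
$crawler->addStep(
    Html::getLinks()->where(Filter::urlHost('www.example.com'))
);
```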
Unique Output/Input

Method uniqueInput()
• Don't call a step twice with the same input
• E.g. helpful if, in a pagination loop, the same page link could be found twice

Method uniqueOutput()
• Don't return the same output twice

Both can take a key as an argument if the input/output is an array or object.
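In use, that might look like this (a sketch; selectors and keys are made up):

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Don't load the same URL twice, even if it was found on multiple pages.
$crawler->addStep(Http::get()->uniqueInput());

// Don't yield the same extracted record twice; compare by the 'title' key.
$crawler->addStep(
    Html::first('article')->extract(['title' => 'h1'])->uniqueOutput('title')
);
```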
Stores

• A convenient way to store results
• The Crawler automatically calls the Store with each result
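A store can be as simple as this sketch (the class name and file path are mine; the `Store` base class and `store()` signature are as I recall them from the crwlr docs):

```php
<?php

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class JsonLinesStore extends Store
{
    public function store(Result $result): void
    {
        // Append each crawling result as one JSON line (made-up file path).
        file_put_contents(
            'results.jsonl',
            json_encode($result->toArray()) . "\n",
            FILE_APPEND
        );
    }
}

$crawler->setStore(new JsonLinesStore());
```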
Cache

• During development, you can add a (PSR-6) cache
• It caches all the HTTP responses
• So you can test changes to the crawler without waiting for actual responses
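Adding the cache to the loader might look like this (a sketch; the `FileCache` class name and the directory are assumptions based on the crwlr docs):

```php
use Crwlr\Crawler\Cache\FileCache;

// Cache all HTTP responses in a local directory during development.
$crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cachedir'));
```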
Demo Time
Project Status

• v0.7 is already in the pipeline, with a lot of nice features and improvements
• There will presumably be some more 0.x tags. Planned features:
  • Improve composing results
  • Step to get data from RSS (and other) feeds
  • Further loaders (FTP, filesystem?)
  • …
• Plan to tag v1.0 in the next months
Want to contribute?

• Low-hanging fruit: filters
• Or, of course, also steps, if you have any ideas
• Approach me anytime to discuss ideas
  • Later today
  • On Twitter/Mastodon
  • Via crwlr.software
Check it out!

• There are a lot more features that I haven't mentioned yet
• If you have any need or idea for crawling => try it!
• If you do, please tell me what you think about it! 🙏
• Docs: crwlr.software
Thank you for your attention!
Questions?