These are the slides for my talk "Building the Crawling System I wanted to have" from the Technologieplauscherl Meetup in Linz (AT) on 2022-11-30.
You can also re-watch the livestream at https://youtu.be/KaC0AZuVCpE?t=2430
Building the Crawling System I wanted to have

@crwlrsoft | @chrolear | @[email protected]
> Name: Christian Olear, a.k.a. otsch
> Previously worked as a web developer for 13 years
> Now in the process of building a business based on crawling / web data
> @chrolear on 🐦 | 📖 otsch.codes | @[email protected] 🐘
Crawling vs. Scraping

• Crawling: loading documents and following the links in them, to load those as well.
• Scraping: extracting data from one document.
I prefer to use the term Crawling

• Some people actually do both in separate processes
• Doing so, you're
  • parsing documents twice
  • unnecessarily using a lot of disk space or memory
• In my experience, a lot of people combine crawling & scraping
• Instead of always saying "crawling & scraping", I prefer the term crawling
💩 Craping
Why I started building a crawling system

• Have been building crawlers in my day job since 2008
• Over time, a lot of helper libraries have been released
• Some mainly focused on crawling, some on scraping
• Most only for HTTP and HTML
• Basically no implementation of a polite crawler/bot
Starting the project in 2018

• Definitely thought a bit too ambitiously in the beginning
• Wanted to build a URL parser, HTTP client, DOM library, …
• Started with the URL parser
The url package

• Parses URLs to objects, to access and modify all components separately
• Works with IDNs
• Even tells you the subdomain, domain and domain suffix, using Mozilla's Public Suffix List
• Resolves relative paths against a base URL
• Advanced query string API
• …
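A quick sketch of what that looks like in practice, based on the crwlr/url docs (the example URL is made up, and the method names are from memory, so treat this as a sketch rather than a definitive reference):

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Url\Url;

$url = Url::parse('https://www.example.com/articles?page=2');

var_dump($url->subdomain());    // 'www'
var_dump($url->domain());       // 'example.com'
var_dump($url->domainSuffix()); // 'com'

// Resolve a relative reference against this URL as base:
var_dump((string) $url->resolve('/contact')); // 'https://www.example.com/contact'
```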
The actual crawling system
The basic idea

• Everything is a "Step" (loading and also extracting data)
• Steps have input(s) and output(s), just like a simple function
• The library provides a lot of common steps
• They can be arranged to build a crawler
A step takes one input (at a time) and may yield outputs
One step's output is the next step's input
In a crawler, data cascades from one step to the next
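To make that cascade concrete, here is a hedged sketch of a small crawler built from the library's common steps, following the examples in the crwlr docs. The target URL and CSS selectors are made up; the class and step names are taken from the docs as I remember them:

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/articles'); // initial input

$crawler->addStep(Http::get())                        // load the list page
    ->addStep(Html::getLinks('#list a.article'))      // yield one link per article
    ->addStep(Http::get())                            // load each article page
    ->addStep(Html::first('article')->extract(['title' => 'h1']));

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```

Each step's outputs become the inputs of the next step, so one list-page input fans out into many article results.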
Steps are Generators

• Crawlers are often long-running and memory-consuming
• Therefore, steps have to return a Generator
• Had a crawler that previously used more than 4 GB of memory => with Generators, down to a max. usage below 0.5 GB
What is a Generator?

• Functions returning an array return the whole array (all its elements) at once
• Functions returning a Generator yield element by element. So the first returned element (output) can be processed further before the program has even created the next element.
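In plain PHP (no library involved), the difference looks like this:

```php
<?php

// Returns the whole array at once - all elements exist in memory together.
function allAtOnce(): array
{
    return [1, 2, 3];
}

// Yields element by element - each value can be processed further
// before the next one is even created.
function oneByOne(): Generator
{
    foreach ([1, 2, 3] as $number) {
        yield $number;
    }
}

foreach (oneByOne() as $number) {
    echo $number . "\n"; // prints 1, then 2, then 3
}
```

With a crawler, the "elements" are loaded pages and extracted records, so processing them one by one keeps memory usage flat no matter how many pages are crawled.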
The Crawler class

• You always need your own Crawler class
• At the very least, it defines a user agent
• You can also customize the Loader and the Logger
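A minimal sketch of such a class, assuming the method names from the crwlr docs (the `MyCustomLogger` class is hypothetical, just standing in for any PSR-3 logger):

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCustomizedCrawler extends HttpCrawler
{
    // Required: define the user agent your crawler identifies as.
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }

    // Optional: provide your own PSR-3 logger instead of the default CLI logger.
    protected function logger(): LoggerInterface
    {
        return new MyCustomLogger(); // hypothetical logger class
    }
}
```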
Loader?

• Currently there is only the HttpLoader
• By default, it loads pages using the Guzzle HTTP client
• You can also switch to a headless Chrome (uses the chrome-php/chrome composer package under the hood)
• It is home to the politeness features, which also have some settings
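Switching the HttpLoader to a headless Chrome is a one-liner; the method name below is what I believe the crwlr docs use, so treat it as a sketch:

```php
// Assuming $crawler extends HttpCrawler; method name per the crwlr docs.
$crawler->getLoader()->useHeadlessBrowser();
```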
What about Pagination?
New Pagination Feature
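As a hedged sketch of the feature (introduced around v0.7, per the crwlr docs; the URL and CSS selector are made up): the `Http::get()` step gets a `paginate()` method that automatically follows the pagination links found via a selector.

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/articles');

// Load the first list page, then automatically load all further pages
// linked from the element matching the given CSS selector.
$crawler->addStep(Http::get()->paginate('.pagination'));
```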
There's also Loops

But I'll probably remove that before v1.0 😅
Step Groups

• Can be used if you need to feed multiple steps with the same input
• You can also combine the outputs into one group output
• Example: you want to extract data from HTML, and additionally from JSON-LD within a <script> tag in the same HTML document
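Roughly, a group looks like this (a sketch based on the crwlr docs; the selectors are made up, and both group steps receive the same HTML document as input):

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;

// Feed the same input to two extraction steps and merge
// their outputs into one combined group output.
$crawler->addStep(
    Crawler::group()
        ->addStep(Html::first('article')->extract(['title' => 'h1']))
        ->addStep(Html::first('aside')->extract(['related' => 'a']))
        ->combineToSingleOutput()
);
```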
Composing the Crawling Result

• By default, the final results are the last step's outputs
• By using addKeysToResult() (array outputs) or setResultKey() (scalar outputs), a step adds data to the final crawling result
• This way, you can compose results from data coming from different steps
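A sketch of both methods in use, under the assumption that the steps and selectors below match the crwlr docs (URLs and selectors are made up):

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Scalar outputs (the link URLs) go into the result under the key 'url'.
$crawler->addStep(Html::getLinks('a.article')->setResultKey('url'));

$crawler->addStep(Http::get());

// Array outputs: add all extracted keys to the final result.
$crawler->addStep(
    Html::first('article')
        ->extract(['title' => 'h1', 'date' => '.date'])
        ->addKeysToResult()
);
```

Each final result then contains `url`, `title` and `date`, even though those values come from different steps.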
Writing Custom Steps

Easy:
• Extend the Step class
• Implement the invoke() method
• Optionally use validateAndSanitizeInput()
• You can use the PSR-3 logger via $this->logger
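A minimal custom step might look like this (a sketch: the class name and regex are mine, the `Step` base class and method signatures are as I recall them from the crwlr docs):

```php
<?php

use Crwlr\Crawler\Steps\Step;

class ExtractEmails extends Step
{
    // Optional: make sure the input is usable before invoke() runs.
    protected function validateAndSanitizeInput(mixed $input): string
    {
        return (string) $input;
    }

    // Yield one output per email address found in the input string.
    protected function invoke(mixed $input): Generator
    {
        $this->logger->info('Looking for email addresses');

        if (preg_match_all('/[\w.+-]+@[\w-]+\.[\w.-]+/', $input, $matches)) {
            foreach ($matches[0] as $email) {
                yield $email;
            }
        }
    }
}
```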
Politeness built in

The default HttpCrawler class automatically:
• Loads robots.txt files and sticks to the rules (if you use a BotUserAgent)
• Waits a little bit between requests. The wait time is based on how long responses take to be delivered. (Can also be set to 0 if you want)
• Reacts to 429 (Too Many Requests) responses (also checks the Retry-After header)
Logger

• The Crawler class uses a PSR-3 LoggerInterface
• It automatically passes it on to all the steps you add
• If you don't provide your own, there's a default "CliLogger"
Output Filters

• There are also a lot of filters that can be applied to any step
• Outputs not matching the filter are not yielded
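For example, restricting link outputs to a single host might look like this (a sketch; the `Filter` class and its static constructors are named per the crwlr docs as I remember them, and the host is made up):

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

// Only yield links whose URL points to the given host.
$crawler->addStep(
    Html::getLinks()->where(Filter::urlHost('www.example.com'))
);
```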
Unique Output/Input

Method uniqueInput()
• Don't call a step twice with the same input
• E.g. helpful if, in a pagination loop, the same page link could be found twice

Method uniqueOutput()
• Don't return the same output twice

Both can take a key as an argument if the input/output is an array or object.
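In use, that might look like this (a sketch; selectors and keys are made up):

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Don't load the same URL twice, even if it was found on multiple pages.
$crawler->addStep(Http::get()->uniqueInput());

// Don't yield the same extracted record twice; compare by the 'title' key.
$crawler->addStep(
    Html::first('article')->extract(['title' => 'h1'])->uniqueOutput('title')
);
```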
Stores

• A convenient way to store results
• The Crawler automatically calls the Store with each result
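A store can be as simple as this sketch (the class name and file path are mine; the `Store` base class and `store()` signature are as I recall them from the crwlr docs):

```php
<?php

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class JsonLinesStore extends Store
{
    public function store(Result $result): void
    {
        // Append each crawling result as one JSON line (made-up file path).
        file_put_contents(
            'results.jsonl',
            json_encode($result->toArray()) . "\n",
            FILE_APPEND
        );
    }
}

$crawler->setStore(new JsonLinesStore());
```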
Cache

• During development, you can add a (PSR-6) cache
• It caches all the HTTP responses
• So you can test changes to the crawler without waiting for actual responses
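Adding the cache to the loader might look like this (a sketch; the `FileCache` class name and the directory are assumptions based on the crwlr docs):

```php
use Crwlr\Crawler\Cache\FileCache;

// Cache all HTTP responses in a local directory during development.
$crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cachedir'));
```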
Demo Time
Project Status

• v0.7 is already in the pipeline, with a lot of nice features and improvements
• There will presumably be some more 0.x tags. Planned features:
  • Improve composing results
  • Step to get data from RSS (and other) feeds
  • Further loaders (FTP, filesystem?)
  • …
• Plan to tag v1.0 in the next months
Want to contribute?

• Low-hanging fruit: filters
• Or, of course, also steps, if you have any ideas
• Approach me anytime to discuss ideas
  • Later today
  • On Twitter/Mastodon
  • Via crwlr.software
Check it out!

• There are a lot more features that I haven't mentioned yet
• If you have any need or idea for crawling => try it!
• If you do, please tell me what you think about it! 🙏
• Docs: crwlr.software
Thank you for your attention!
Questions?