
Building the Crawling System I wanted to have

These are the slides for my talk "Building the Crawling System I wanted to have" from the Technologieplauscherl Meetup in Linz (AT) on 2022-11-30.

You can also re-watch the livestream at https://youtu.be/KaC0AZuVCpE?t=2430

Christian Olear

November 30, 2022


Transcript

  1. Name: Christian Olear a.k.a. otsch
     • Previously worked as a Web Developer for 13 years
     • Now in the process of building a business based on crawling / web data
     • @chrolear on 🐦 | 📖 otsch.codes | @[email protected] 🐘
  2. Crawling vs. Scraping
     • Crawling: loading documents, following the links in them, and loading those as well
     • Scraping: extracting data from one document
  3. I prefer to use the term Crawling
     • Some people actually do both in separate processes
     • Doing so, you're
       • parsing documents twice
       • unnecessarily using a lot of disk space or memory
     • In my experience, a lot of people combine crawling & scraping
     • Instead of always saying „crawling & scraping“, I prefer the term crawling
  4. Why I started building a crawling system
     • Built crawlers since 2008 in my day job
     • Over time, a lot of helper libraries have been released
     • Some mainly focused on crawling, some on scraping
     • Most only for HTTP and HTML
     • Basically no implementation of a polite crawler/bot
  5. Starting the project in 2018
     • Definitely thought a bit too ambitiously in the beginning
     • Wanted to build a URL parser, an HTTP client, a DOM library, …
     • Started with the URL parser
  6. The url package
     • Parse URLs to objects to access and modify all components separately
     • Works with IDNs
     • Even tells you subdomain, domain, and domain suffix, using the Mozilla Public Suffix List
     • Resolve relative paths against a base URL
     • Advanced query string API
     • …
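
     A short sketch of how the package can be used; the class and method names are taken from my reading of the crwlr/url docs and may differ slightly in detail:

         use Crwlr\Url\Url;

         $url = Url::parse('https://jobs.example.com/listings?page=2');

         // Access individual components separately.
         $url->scheme();       // 'https'
         $url->host();         // 'jobs.example.com'
         $url->subdomain();    // 'jobs'
         $url->domain();       // 'example.com'
         $url->domainSuffix(); // 'com' (via the Mozilla Public Suffix List)

         // Resolve a relative path against this URL.
         $detail = $url->resolve('/listings/123');
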
  7. The basic idea
     • Everything is a „Step“ (loading and also extracting data)
     • Steps have input(s) and output(s), just like a simple function
     • The library provides a lot of common steps that can be arranged to build a crawler (see the sketch below)
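
     As an illustration, a crawler arranged from such steps could look roughly like this; the step names (Http::get(), Html::getLinks(), Html::first()) follow the crwlr/crawler docs as I remember them, so treat the exact API as an assumption:

         use Crwlr\Crawler\Steps\Html;
         use Crwlr\Crawler\Steps\Loading\Http;

         $crawler = new MyCrawler(); // your own Crawler class, see the Crawler class slide below

         $crawler
             ->input('https://www.example.com/articles')   // initial input
             ->addStep(Http::get())                        // load the list page
             ->addStep(Html::getLinks('article a'))        // outputs: links to detail pages
             ->addStep(Http::get())                        // load each detail page
             ->addStep(
                 Html::first('article')
                     ->extract(['title' => 'h1', 'date' => '.published'])
             );

         foreach ($crawler->run() as $result) {
             // each result holds the data extracted by the last step
         }
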
  8. Steps are Generators
     • Crawlers are often long-running and memory-consuming
     • Therefore steps have to return a Generator
     • Had a crawler that previously went up to more than 4 GB of memory => with Generators, maximum usage dropped below 0.5 GB
  9. What is a Generator?
     • Functions returning an array return the whole array (all its elements) at once
     • Functions returning a Generator yield element by element, so the first returned element (output) can be processed further before the program has even created the next element (see the sketch below)
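
     A minimal plain-PHP illustration of the difference:

         // Returns everything at once – all elements live in memory at the same time.
         function allNumbers(int $limit): array
         {
             $numbers = [];

             for ($i = 1; $i <= $limit; $i++) {
                 $numbers[] = $i;
             }

             return $numbers;
         }

         // Yields one element at a time – the caller can process each number
         // before the next one is even created.
         function numbers(int $limit): Generator
         {
             for ($i = 1; $i <= $limit; $i++) {
                 yield $i;
             }
         }

         foreach (numbers(1_000_000) as $number) {
             // process $number without holding a million elements in memory
         }
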
  10. The Crawler class
     • You always need your own Crawler class
     • And at least define a user agent
     • You can also customize the Loader and the Logger (see the sketch below)
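
     A minimal sketch of such a class, assuming the HttpCrawler base class and BotUserAgent from the crwlr/crawler package; the exact signatures are from memory of the docs and may differ between versions:

         use Crwlr\Crawler\HttpCrawler;
         use Crwlr\Crawler\UserAgents\BotUserAgent;
         use Crwlr\Crawler\UserAgents\UserAgentInterface;

         class MyCrawler extends HttpCrawler
         {
             // The one thing you have to define: the user agent.
             // A BotUserAgent identifies the crawler as a bot and enables the
             // robots.txt handling mentioned on the politeness slide.
             protected function userAgent(): UserAgentInterface
             {
                 return new BotUserAgent('MyBot');
             }

             // Optionally you could also override the methods providing
             // the Loader and the Logger to customize them.
         }
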
  11. Loader?
     • Currently there is only the HttpLoader
     • By default it loads pages using the Guzzle HTTP client
     • You can also switch to using a headless Chrome (uses the chrome-php/chrome composer package under the hood; see the sketch below)
     • Is home to the politeness features, which also have some settings
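
     If I recall the docs correctly, switching to the headless browser looks roughly like this; useHeadlessBrowser() is the method name as I remember it, so treat it as an assumption:

         $crawler = new MyCrawler();

         // Tell the HttpLoader to load pages via headless Chrome
         // (chrome-php/chrome) instead of the Guzzle HTTP client.
         $crawler->getLoader()->useHeadlessBrowser();
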
  12. Step Groups
     • Can be used if you need to feed multiple steps with the same input
     • You can also combine the outputs to one group output
     • Example: you want to extract data from HTML and additionally from JSON-LD within a <script> tag in the same HTML document (see the sketch below)
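
     Roughly, a group could be added like this; Crawler::group() is the factory method I remember from the docs, and the JSON-LD step is a hypothetical custom step:

         use Crwlr\Crawler\Crawler;
         use Crwlr\Crawler\Steps\Html;

         $crawler->addStep(
             Crawler::group()                    // both child steps receive the same input (the HTML document)
                 ->addStep(Html::first('article')->extract(['title' => 'h1']))
                 ->addStep(new ExtractJsonLd())  // hypothetical custom step reading the JSON-LD <script> tag
         );

         // The group also offers a way to combine the child steps' outputs into
         // one group output (method name omitted here, see the docs).
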
  13. Composing the Crawling-Result
     • By default, the final results are the last step's outputs
     • By using addKeysToResult() (array outputs) or setResultKey() (scalar outputs), a step adds data to the final crawling result
     • This way you can compose results with data coming from different steps (see the sketch below)
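
     The two method names are the ones from the slide; how they are attached to steps here is my assumption about the usage:

         $crawler
             ->input('https://www.example.com/products')
             ->addStep(Http::get())
             ->addStep(
                 Html::getLink('.product a')
                     ->setResultKey('url')        // scalar output => becomes one key in the final result
             )
             ->addStep(Http::get())
             ->addStep(
                 Html::first('.product-detail')
                     ->extract(['title' => 'h1', 'price' => '.price'])
                     ->addKeysToResult()          // array output => all its keys are added to the final result
             );
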
  14. Writing Custom Steps
     Easy:
     • Extend the Step class
     • Implement the invoke() method
     • Optionally use validateAndSanitizeInput()
     • You can use the PSR-3 Logger via $this->logger (see the sketch below)
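
     A sketch of a custom step; the base class and the invoke() signature are taken from memory of the crwlr/crawler docs, so details may differ:

         use Crwlr\Crawler\Steps\Step;
         use Generator;

         class ExtractEmailAddresses extends Step
         {
             // Optionally normalize/validate the incoming input before invoke() is called.
             protected function validateAndSanitizeInput(mixed $input): string
             {
                 return (string) $input;
             }

             protected function invoke(mixed $input): Generator
             {
                 // The PSR-3 logger the crawler passed on to this step.
                 $this->logger->info('Looking for email addresses');

                 if (preg_match_all('/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i', $input, $matches)) {
                     foreach ($matches[0] as $email) {
                         yield $email; // one output per found address
                     }
                 }
             }
         }
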
  15. Politeness built in
     The default HttpCrawler class automatically
     • loads robots.txt files and sticks to the rules (if you use a BotUserAgent)
     • waits a little bit between requests; the wait time is based on how long responses take to be delivered (can also be set to 0 if you want)
     • reacts to 429 (Too Many Requests) responses (also checks the Retry-After header)
  16. Logger
     • The Crawler class uses a PSR-3 LoggerInterface
     • It automatically passes it on to all the steps you add
     • If you don't provide your own, there's a default „CliLogger“
  17. Output Filters
     • There are also a lot of filters that can be applied to any step
     • Outputs not matching the filter are not yielded (see the sketch below)
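
     For illustration, applying a filter to a step might look like this; the where() method and the Filter class are how I remember the docs describing it, so treat the exact names and namespace as assumptions:

         use Crwlr\Crawler\Steps\Filters\Filter;
         use Crwlr\Crawler\Steps\Html;

         $crawler->addStep(
             Html::each('.product')
                 ->extract(['title' => 'h2', 'price' => '.price'])
                 ->where('title', Filter::stringContains('Sale'))   // outputs not matching are not yielded
         );
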
  18. Unique Output/Input
     Method uniqueInput()
     • Don't call a step twice with the same input
     • E.g. helpful if, in a pagination loop, the same page link could be found twice
     Method uniqueOutput()
     • Don't return the same output twice
     Both can take a key as argument if the input/output is an array or object (see the sketch below)
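
     Usage is presumably as simple as calling the methods on a step; the steps shown here are just illustrations:

         $crawler->addStep(
             Html::getLinks('.pagination a')
                 ->uniqueOutput()        // never yield the same link twice
         );

         $crawler->addStep(
             Http::get()
                 ->uniqueInput()         // never load the same URL twice
         );
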
  19. Stores
     • Convenient way to store results
     • The Crawler automatically calls the Store with each result (see the sketch below)
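
     A sketch of a custom store, assuming a Store base class with a store() method that receives each Result, and a setStore() method on the crawler; these names are from memory of the docs:

         use Crwlr\Crawler\Result;
         use Crwlr\Crawler\Stores\Store;

         class JsonLinesStore extends Store
         {
             public function __construct(private string $filePath) {}

             // Called by the crawler once per final result.
             public function store(Result $result): void
             {
                 file_put_contents($this->filePath, json_encode($result->toArray()) . PHP_EOL, FILE_APPEND);
             }
         }

         $crawler->setStore(new JsonLinesStore(__DIR__ . '/results.jsonl'));
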
  20. Cache
     • During development you can add a (PSR-6) Cache
     • It caches all the HTTP responses
     • So you can test changes to the crawler without waiting for actual responses (see the sketch below)
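
     The package ships a file-based cache for this, if I remember correctly; the class name and the setCache() call on the loader are assumptions from my reading of the docs:

         use Crwlr\Crawler\Cache\FileCache;

         $crawler = new MyCrawler();

         // Cache all HTTP responses on disk, so repeated test runs during
         // development don't hit the target site again.
         $crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cachedir'));
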
  21. Project Status
     • v0.7 is already in the pipeline with a lot of nice features and improvements
     • There will presumably be some more 0.x tags. Planned features:
       • Improve composing results
       • Step to get data from RSS (and other) feeds
       • Further loaders (FTP, filesystem?)
       • …
     • Plan to tag v1.0 in the next months
  22. Want to contribute?
     • Low-hanging fruit: Filters
     • Or, of course, also steps, if you have any ideas
     • Approach me anytime to discuss ideas
       • later today
       • on Twitter/Mastodon
       • via crwlr.software
  23. Check it out!
     • There are a lot more features that I haven't mentioned yet
     • If you have any need or idea for crawling => try it!
     • If you do, please tell me what you think about it! 🙏
     • Docs: crwlr.software