


These are the slides for my talk "Building the Crawling System I wanted to have" from the Technologieplauscherl Meetup in Linz (AT) on 2022-11-30.

You can also re-watch the livestream at https://youtu.be/KaC0AZuVCpE?t=2430

Christian Olear

November 30, 2022



  1. Building the Crawling System I wanted to have
    @chrolear | @[email protected]

  2. > Name: Christian Olear a.k.a. otsch

    > Previously worked as a web developer for 13 years

    > Now in the process of building a business
    based on crawling / web data

    > @chrolear on 🐦 | 📖 otsch.codes | @[email protected] 🐘

  3. @crwlrsoft
    Crawling: Loading documents and following the links in them,
    and loading those as well.

    Scraping: Extracting data from one document.

  4. I prefer to use the term Crawling
    • Some people actually do both in separate processes

    • Doing so, you're

    • parsing documents twice

    • unnecessarily using a lot of disk space or memory

    • In my experience, a lot of people combine crawling & scraping

    • Instead of always saying "crawling & scraping",
    I prefer the term crawling

  5. (image slide)

  6. Why I started building a crawling system
    • Built crawlers since 2008 in my day job

    • Over time, a lot of helper libraries have been released

    • Some mainly focused on crawling, some on scraping

    • Most only for HTTP and HTML

    • Basically no implementation of a polite crawler/bot

  7. Starting the project in 2018
    • Definitely thought a bit too ambitiously in the beginning

    • Wanted to build a URL parser, HTTP client, DOM lib., …

    • Started with the URL parser

  8. The url package
    • Parse URLs to objects to access and modify all components separately

    • Works with IDNs

    • Even tells you subdomain, domain, and domain suffix, using the Mozilla Public Suffix List

    • Resolve relative paths against a base URL

    • Advanced query string API

    • …

  9. The actual crawling system

  10. (image slide)

  11. The basic idea
    • Everything is a "Step" (loading and also extracting data)

    • Steps have input(s) and output(s), just like a simple function

    • The lib provides a lot of common Steps

    • that can be arranged to build a crawler
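The step-and-cascade idea can be sketched as a toy model. The library itself is PHP; this Python sketch is purely illustrative, and none of the names (`Step`, `run_crawler`, the example steps) are the actual crwlr API:

```python
from typing import Callable, Iterable, Iterator

# A step is modeled as a function taking one input and yielding outputs.
Step = Callable[[object], Iterator[object]]

def apply_step(step: Step, inputs: Iterable[object]) -> Iterator[object]:
    """Feed every input into the step; each call may yield zero or more outputs."""
    for inp in inputs:
        yield from step(inp)

def run_crawler(steps: list[Step], inputs: Iterable[object]) -> Iterator[object]:
    """Cascade data through the steps: one step's outputs become the
    next step's inputs."""
    outputs: Iterable[object] = inputs
    for step in steps:
        outputs = apply_step(step, outputs)
    yield from outputs

# Two tiny example steps: one input -> one output, and one input -> many outputs.
def double(x):
    yield x * 2

def split_digits(x):
    yield from (int(d) for d in str(x))

results = list(run_crawler([double, split_digits], [12, 7]))
# 12 -> 24 -> 2, 4   and   7 -> 14 -> 1, 4
```

Because every stage is lazy, each input flows through the whole pipeline before the next one is even produced.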

  12. A step takes one input (at a time)
    and may yield outputs

  13. One step's output is the next step's input

  14. In a crawler, data cascades from one step to the next

  15. Steps are Generators
    • Crawlers are often long-running and memory-consuming

    • Therefore, steps have to return a Generator

    • Had a crawler that previously used more than 4 GB
    of memory => with Generators, max. usage dropped below 0.5 GB

  16. What is a Generator?
    • Functions returning an array return the whole array (all its elements)
    at once

    • Functions returning a Generator yield element by element.
    So the first returned element (output) can be processed further before
    the program has even created the next element.
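The same contrast exists in Python, which makes for a compact, measurable demonstration of the memory point (the slide's numbers come from a PHP crawler; this is just the general mechanism):

```python
import tracemalloc

def as_list(n):
    return [i * i for i in range(n)]   # builds all n elements up front

def as_generator(n):
    for i in range(n):
        yield i * i                    # produces one element at a time

# Both produce the same values ...
assert list(as_generator(5)) == as_list(5) == [0, 1, 4, 9, 16]

# ... but the generator never holds the whole sequence in memory at once.
tracemalloc.start()
sum(as_list(100_000))
list_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.reset_peak()
sum(as_generator(100_000))
gen_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()
```

`gen_peak` stays a tiny fraction of `list_peak`, which is exactly why long-running crawl steps yield rather than return arrays.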

  17. The Crawler class
    • You always need your own
    Crawler class

    • And at least define
    a user agent

    • You can also customize
    the Loader and
    the Logger
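Structurally, "you always need your own Crawler class with at least a user agent" looks like an abstract base class with one required method. A toy Python model of that design (the real library is PHP and its class and method names differ):

```python
from abc import ABC, abstractmethod

class Crawler(ABC):
    """Toy model: subclasses must at least define a user agent."""

    @abstractmethod
    def user_agent(self) -> str:
        ...

class MyCrawler(Crawler):
    def user_agent(self) -> str:
        # Identify your bot honestly, ideally with a contact URL.
        return "MyBot/1.0 (+https://example.com/bot-info)"

crawler = MyCrawler()
```

Instantiating `Crawler` directly fails, which is the point: there is no anonymous default crawler.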

  18. • Currently there is only the HttpLoader

    • By default it loads pages using the Guzzle HTTP client

    • You can also switch to using a headless Chrome (uses the chrome-php/chrome Composer package under the hood)

    • Is home to the politeness features

  19. What about

  20. New Pagination Feature

  21. There's also Loops
    But I'll probably remove
    that before v1.0 😅

  22. Step Groups
    • Can be used if you need to feed multiple steps
    with the same input.

    • You can also combine the outputs to one
    group output.

    • Example: You want to extract data from HTML and
    additionally from JSON-LD within a script tag
    in the same HTML document.

  23. Composing the Crawling Result
    • By default, the final results are the last step's outputs

    • By using addKeysToResult() (array outputs)
    or setResultKey() (scalar outputs), a step
    adds data to the final crawling result.

    • This way you can compose results with
    data coming from different steps.
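The composing idea reduces to: scalar outputs need an explicit key, array outputs are merged key by key. A toy Python sketch of that merge logic (the real `addKeysToResult()`/`setResultKey()` are PHP methods; this helper just mirrors the idea):

```python
def add_to_result(result: dict, output, result_key=None) -> dict:
    """Merge one step's output into the final crawling result."""
    if isinstance(output, dict):
        result.update(output)        # "array" output: merge its keys
    else:
        result[result_key] = output  # scalar output: needs an explicit key
    return result

result = {}
add_to_result(result, "https://example.com/post/1", result_key="url")  # e.g. from the loading step
add_to_result(result, {"title": "Hello World", "author": "Jane Doe"})  # e.g. from an extraction step
```

Steps that don't add result data just pass their outputs along; only keyed contributions end up in the composed result.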

  24. Writing Custom Steps
    • Extend the Step class

    • Implement the invoke() method

    • Optionally use

    • You can use the PSR-3
    Logger via $this->logger

  25. Politeness built in
    • The default HttpCrawler class automatically

    • loads robots.txt files and sticks to the rules (if you use a BotUserAgent)

    • waits a little bit between requests. The wait time is based on how
    long responses take to be delivered. (Can also be set to 0.)

    • reacts to 429 (Too Many Requests) responses (also checks the Retry-After header)

  26. • The Crawler class uses a PSR-3 LoggerInterface

    • It automatically passes it on to all the steps you add

    • If you don't provide your own, there's a default "CliLogger"

  27. Output Filters
    • There are also a lot of Filters that can be applied to any step.

    • Outputs not matching the filter are not yielded.
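"Outputs not matching the filter are not yielded" is a generator wrapped around a predicate. A minimal Python sketch (not the library's Filter API, just the mechanic):

```python
from typing import Callable, Iterator

def filtered(outputs, predicate: Callable[[object], bool]) -> Iterator[object]:
    """Pass a step's outputs through; drop everything the filter rejects."""
    for out in outputs:
        if predicate(out):
            yield out

urls = ["https://example.com/a", "http://example.com/b", "https://example.com/c"]
https_only = list(filtered(urls, lambda u: u.startswith("https://")))
```

Because the wrapper is itself a generator, filtering keeps the lazy, low-memory behavior of the step pipeline.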

  28. Unique Output/Input
    Method uniqueInput()

    • Don't call a step twice with the same input

    • E.g. helpful if, in a pagination loop, the same page link
    could be found twice.

    Method uniqueOutput()

    • Don't return the same output twice

    Both can take a key as argument if the input/output is an array or object.

  29. Stores
    • Convenient way to store results

    • Crawler automatically calls
    the Store with each result
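The Store contract is small: the crawler hands every final result to a store object, one at a time. A toy in-memory version of that idea (the real interface is PHP; names here are illustrative):

```python
class InMemoryStore:
    """Collects results; a real store would write to a DB, CSV, etc."""
    def __init__(self):
        self.rows = []

    def store(self, result: dict) -> None:
        self.rows.append(result)

def run_and_store(results, store) -> None:
    # The crawler calls the store automatically with each result.
    for result in results:
        store.store(result)

store = InMemoryStore()
run_and_store([{"title": "a"}, {"title": "b"}], store)
```

Because results arrive one by one, a store never needs to hold the whole crawl's output in memory.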

  30. • During development you can add a (PSR-6) Cache

    • It caches all the HTTP responses

    • So you can test changes to the crawler without waiting for
    actual responses
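The development-cache idea is plain memoization of the loader: identical requests are answered from the cache instead of hitting the network again. A minimal sketch (not PSR-6, just the mechanic):

```python
def make_cached_loader(load):
    """Wrap a loader function so repeated URLs skip the network."""
    cache = {}
    calls = {"network": 0}

    def cached_load(url: str):
        if url not in cache:
            calls["network"] += 1      # only cache misses hit the network
            cache[url] = load(url)
        return cache[url]

    return cached_load, calls

# A stand-in for the real HTTP loader:
loader, calls = make_cached_loader(lambda url: f"<html>{url}</html>")
first = loader("https://example.com")
second = loader("https://example.com")  # served from cache
```

Re-running the crawler during development then replays the cached responses, which is why changes can be tested without waiting on the real server.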

  31. Demo Time

  32. Project Status
    • v0.7 is already in the pipeline, with a lot of nice features and improvements

    • There will presumably be some more 0.x tags. Planned features:

    • Improve composing results

    • Step to get data from RSS (and other) feeds

    • Further loaders (FTP, …)

    • …

    • Plan to tag v1.0 in the next months

  33. Want to contribute?
    • Low-hanging fruit: Filters

    • Or, of course, also Steps, if you have any ideas

    • Approach me anytime to discuss ideas

    • later today

    • on Twitter/Mastodon

    • via crwlr.software

  34. Check it out!
    • There are a lot more features that I haven't mentioned yet

    • If you have any need or idea for crawling => try it!

    • If you do, please tell me what you think about it! 🙏

    • Docs: crwlr.software

  35. Thank you for your attention!

  36. Questions?