
building-the-crawling-system-plauscherl.pdf

These are the slides for my talk "Building the Crawling System I wanted to have" from the Technologieplauscherl Meetup in Linz (AT) on 2022-11-30.

You can also re-watch the livestream at https://youtu.be/KaC0AZuVCpE?t=2430

Christian Olear

November 30, 2022

Transcript

  1. Building the Crawling System I wanted to have
     @crwlrsoft
     @chrolear | @[email protected]

  2. > Name: Christian Olear a.k.a. otsch
     > Previously worked as a web developer for 13 years
     > Now in the process of building a business based on crawling / web data
     > @chrolear on 🐦 | 📖 otsch.codes | @[email protected] 🐘

  3. Crawling vs. Scraping
     Crawling: loading documents, following the links in them, and loading those as well.
     Scraping: extracting data from one document.

  4. I prefer to use the term "crawling"
     • Some people actually do both in separate processes
     • Doing so, you're
     • parsing documents twice
     • unnecessarily using a lot of disk space or memory
     • In my experience, a lot of people combine crawling & scraping
     • Instead of always saying "crawling & scraping", I prefer the term "crawling"

  5. 💩 Craping

  6. Why I started building a crawling system
     • I have been building crawlers in my day job since 2008
     • Over time, a lot of helper libraries have been released
     • Some mainly focused on crawling, some on scraping
     • Most only for HTTP and HTML
     • Basically no implementation of a polite crawler/bot

  7. Starting the project in 2018
     • Definitely thought a bit too ambitiously in the beginning
     • Wanted to build a URL parser, an HTTP client, a DOM library, …
     • Started with the URL parser

  8. The url package
     • Parses URLs to objects, to access and modify all components separately
     • Works with IDNs
     • Even tells you the subdomain, domain, and domain suffix, using Mozilla's Public Suffix List
     • Resolves relative paths against a base URL
     • Advanced query string API
     • …
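     A minimal sketch of working with the url package (based on the crwlr/url docs; the example URL and the printed values are illustrative):

       <?php

       use Crwlr\Url\Url;

       require 'vendor/autoload.php';

       // Parse a URL into an object with access to all of its components.
       $url = Url::parse('https://www.crwlr.software/packages/url?foo=bar');

       echo $url->scheme();       // https
       echo $url->host();         // www.crwlr.software
       echo $url->subdomain();    // www
       echo $url->domain();       // crwlr.software
       echo $url->domainSuffix(); // software (via Mozilla's Public Suffix List)

       // Resolve a relative path against this URL as the base.
       echo $url->resolve('/blog'); // https://www.crwlr.software/blog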

  9. The actual crawling system

  10. (image-only slide)

  11. The basic idea
     • Everything is a "step" (loading and also extracting data)
     • Steps have input(s) and output(s), just like a simple function
     • The library provides a lot of common steps
     • that can be arranged to build a crawler

  12. A step takes one input (at a time) and may yield outputs

  13. One step's output is the next step's input

  14. In a crawler, data cascades from one step to the next
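     To make the cascade concrete, a hedged sketch of a small step pipeline (the target site and selectors are made up; the MyCrawler class is shown on slide 17):

       <?php

       use Crwlr\Crawler\Steps\Html;
       use Crwlr\Crawler\Steps\Loading\Http;

       require 'vendor/autoload.php';

       $crawler = new MyCrawler(); // see slide 17

       $crawler
           ->input('https://www.example.com/articles')  // initial input
           ->addStep(Http::get())                       // output: the listing page
           ->addStep(Html::getLinks('article a'))       // outputs: one URL per link
           ->addStep(Http::get())                       // each URL is loaded in turn
           ->addStep(Html::first('article')->extract(['title' => 'h1']));

       foreach ($crawler->run() as $result) {
           var_dump($result->toArray());
       }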

  15. Steps are generators
     • Crawlers are often long-running and memory-consuming
     • Therefore, steps have to return a Generator
     • Had a crawler that previously used more than 4 GB of memory => with generators, down to a max. usage below 0.5 GB

  16. What is a generator?
     • Functions returning an array return the whole array (all its elements) at once
     • Functions returning a Generator yield element by element. So the first returned element (output) can be processed further before the program has even created the next element.
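     In plain PHP (not from the slides, just the standard generator pattern):

       <?php

       // Returns everything at once: the whole array lives in memory.
       function allNumbers(): array
       {
           $numbers = [];
           for ($i = 1; $i <= 1_000_000; $i++) {
               $numbers[] = $i;
           }
           return $numbers;
       }

       // Yields element by element: the caller can process the first
       // number before the next one has even been created.
       function numberGenerator(): Generator
       {
           for ($i = 1; $i <= 1_000_000; $i++) {
               yield $i;
           }
       }

       foreach (numberGenerator() as $number) {
           // Only one $number is held in memory at a time.
       }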

  17. The Crawler class
     • You always need your own Crawler class
     • And at least define a user agent
     • You can also customize the Loader and the Logger
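     A minimal sketch of such a class (matching the examples in the crwlr.software docs; the bot name is made up):

       <?php

       use Crwlr\Crawler\HttpCrawler;
       use Crwlr\Crawler\UserAgents\BotUserAgent;
       use Crwlr\Crawler\UserAgents\UserAgentInterface;

       class MyCrawler extends HttpCrawler
       {
           // Defining a user agent is the one thing you always have to do.
           // A BotUserAgent identifies the crawler as a bot, which also
           // activates the robots.txt handling (see slide 25).
           protected function userAgent(): UserAgentInterface
           {
               return BotUserAgent::make('MyBot');
           }
       }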

  18. Loader?
     • Currently there is only the HttpLoader
     • By default it loads pages using the guzzle HTTP client
     • You can also switch to a headless Chrome (uses the chrome-php/chrome composer package under the hood)
     • It is home to the politeness features, which also have some settings
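     Switching to the headless Chrome might look like this (a sketch; treat the useHeadlessBrowser() method as an assumption if your version differs):

       <?php

       $crawler = new MyCrawler(); // see slide 17

       // Load pages with a headless Chrome (chrome-php/chrome under the
       // hood) instead of the default guzzle HTTP client.
       $crawler->getLoader()->useHeadlessBrowser();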

  19. What about pagination?

  20. New pagination feature
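     The slide demos the feature; as a rough sketch of the paginator API that landed around v0.7 (the paginate() method and the CSS selector are assumptions based on the docs):

       <?php

       use Crwlr\Crawler\Steps\Loading\Http;

       // Load the first listing page and automatically also load all
       // further pages that the pagination element links to.
       $crawler->addStep(
           Http::get()->paginate('.pagination')
       );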

  21. There are also loops, but I'll probably remove them before v1.0 😅

  22. Step groups
     • Can be used if you need to feed multiple steps with the same input.
     • You can also combine the outputs to one group output.
     • Example: you want to extract data from HTML and additionally from JSON-LD within a <script> tag in the same HTML document.
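     A hedged sketch of a step group (Crawler::group() and combineToSingleOutput() per the docs; the selectors are made up):

       <?php

       use Crwlr\Crawler\Crawler;
       use Crwlr\Crawler\Steps\Html;

       // Both steps in the group receive the same HTML document as
       // input, and their outputs are combined into one group output.
       $crawler->addStep(
           Crawler::group()
               ->addStep(Html::first('article')->extract(['title' => 'h1']))
               ->addStep(Html::first('footer')->extract(['imprint' => '.imprint']))
               ->combineToSingleOutput()
       );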

  23. Composing the crawling result
     • By default, the final results are the last step's outputs.
     • By using addKeysToResult() (array outputs) or setResultKey() (scalar outputs), a step adds data to the final crawling result.
     • This way you can compose results with data coming from different steps.
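     For illustration, a hedged sketch using both methods (the site and selectors are made up):

       <?php

       use Crwlr\Crawler\Steps\Html;
       use Crwlr\Crawler\Steps\Loading\Http;

       $crawler
           ->input('https://www.example.com/articles')
           ->addStep(Http::get())
           // Scalar output: store each article URL under the key 'url' …
           ->addStep(Html::getLinks('article a')->setResultKey('url'))
           // … while the URL still cascades on to the next step.
           ->addStep(Http::get())
           // Array output: add these keys to the same final result.
           ->addStep(
               Html::first('article')
                   ->extract(['title' => 'h1', 'date' => '.date'])
                   ->addKeysToResult()
           );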

  24. Writing custom steps
     Easy:
     • Extend the Step class
     • Implement the invoke() method
     • Optionally use validateAndSanitizeInput()
     • You can use the PSR-3 logger via $this->logger
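     A hedged sketch of a custom step (method signatures as documented for the 0.x versions; the email extraction itself is made up):

       <?php

       use Crwlr\Crawler\Steps\Step;
       use Generator;
       use InvalidArgumentException;

       class ExtractEmails extends Step
       {
           // Optionally normalize/validate the input before invoke() gets it.
           protected function validateAndSanitizeInput(mixed $input): string
           {
               if (!is_string($input)) {
                   throw new InvalidArgumentException('Input must be a string.');
               }

               return trim($input);
           }

           protected function invoke(mixed $input): Generator
           {
               $this->logger->info('Extracting email addresses'); // PSR-3 logger

               preg_match_all('/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i', $input, $matches);

               foreach ($matches[0] as $email) {
                   yield $email; // outputs are yielded one by one
               }
           }
       }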

  25. Politeness built in
     • The default HttpCrawler class automatically
     • loads robots.txt files and sticks to the rules (if you use a BotUserAgent)
     • waits a little between requests; the wait time is based on how long responses take to be delivered (can also be set to 0 if you want)
     • reacts to 429 (Too Many Requests) responses (and also checks the Retry-After header)

  26. Logger
     • The Crawler class uses a PSR-3 LoggerInterface
     • It automatically passes it on to all the steps you add
     • If you don't provide your own, there's a default "CliLogger"

  27. Output filters
     • There are also a lot of filters that can be applied to any step.
     • Outputs not matching the filter are not yielded.
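     For example (a sketch; the Filter class is per the docs, the concrete filter and selector are assumptions):

       <?php

       use Crwlr\Crawler\Steps\Filters\Filter;
       use Crwlr\Crawler\Steps\Html;

       // Only yield links containing '/blog/'; all other outputs of
       // this step are dropped.
       $crawler->addStep(
           Html::getLinks('a')->where(Filter::stringContains('/blog/'))
       );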

  28. Unique output/input
     Method uniqueInput()
     • Don't call a step twice with the same input
     • E.g. helpful if, in a pagination loop, the same page link could be found twice.
     Method uniqueOutput()
     • Don't return the same output twice
     Both can take a key as argument if the input/output is an array or object.
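     For example (a sketch; the steps and the key are made up):

       <?php

       use Crwlr\Crawler\Steps\Html;

       // Yield each link only once, even if it is found multiple times.
       $crawler->addStep(Html::getLinks('a')->uniqueOutput());

       // For array outputs, deduplicate by one key instead of the whole array.
       $crawler->addStep(
           Html::each('article')->extract(['title' => 'h1'])->uniqueOutput('title')
       );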

  29. Stores
     • A convenient way to store results
     • The crawler automatically calls the store with each result
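     A hedged sketch of a custom store (the Store base class and store() signature per the docs; writing JSON lines is just an example):

       <?php

       use Crwlr\Crawler\Result;
       use Crwlr\Crawler\Stores\Store;

       class JsonLinesStore extends Store
       {
           // Called by the crawler once for every final result.
           public function store(Result $result): void
           {
               file_put_contents(
                   __DIR__ . '/results.jsonl',
                   json_encode($result->toArray()) . PHP_EOL,
                   FILE_APPEND
               );
           }
       }

       $crawler->setStore(new JsonLinesStore());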

  30. Cache
     • During development you can add a (PSR-6) cache
     • It caches all the HTTP responses
     • So you can test changes to the crawler without waiting for actual responses
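     A sketch (the FileCache class ships with the package as far as I know; the directory is an assumption):

       <?php

       use Crwlr\Crawler\Cache\FileCache;

       // Cache all HTTP responses on disk, so repeated dev runs don't
       // hit the target site again.
       $crawler->getLoader()->setCache(new FileCache(__DIR__ . '/cachedir'));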

  31. Demo time

  32. Project status
     • v0.7 is already in the pipeline, with a lot of nice features and improvements
     • There will presumably be some more 0.x tags. Planned features:
     • Improved result composition
     • A step to get data from RSS (and other) feeds
     • Further loaders (FTP, filesystem?)
     • …
     • Plan to tag v1.0 within the next months

  33. Want to contribute?
     • Low-hanging fruit: filters
     • Or, of course, also steps, if you have any ideas
     • Approach me anytime to discuss ideas
     • later today
     • on Twitter/Mastodon
     • via crwlr.software

  34. Check it out!
     • There are a lot more features that I haven't mentioned yet
     • If you have any need or idea for crawling => try it!
     • If you do, please tell me what you think about it! 🙏
     • Docs: crwlr.software

  35. Thank you for your attention!
     @crwlrsoft
     @chrolear | @[email protected]

  36. Questions?