The making of Jardinero

«In the end, what's the difference between any passionate software engineer and an artist?»

I firmly believe that creativity is the very heart of excellent engineering: for example, the gradual refinement steps to achieve beautiful code are conceptually similar to the process of creating an exquisite painting or sculpting a statue out of marble...

In my personal, endless pursuit of Knowledge, I've noticed that even different domains can actually influence each other - and natural languages are by far one of my favorite inspiration sources when developing software.

In this presentation we're going to unfold - with technology and fun! - the story of the software architecture that I created to savor the nuances of the Spanish language.

Gianluca Costa

May 03, 2022

Transcript

  1. Gianluca Costa
    The making of Jardinero
    Story of a software engineer who wanted to learn Spanish
    Latest update: 2022-05-03

  2. Foreword
    «
    In the end, what's the difference between any passionate
    software engineer and an artist?
    »
    I firmly believe that creativity is the very heart of excellent engineering: for example,
    the gradual refinement steps to achieve beautiful code are conceptually similar to
    the process of creating an exquisite painting or sculpting a statue out of marble...
    In my personal, endless pursuit of Knowledge, I've noticed that even different
    domains can actually influence each other - and natural languages are by far one of my
    favorite inspiration sources when developing software.
    In this presentation we're going to unfold - with technology and fun! - the story of
    the software architecture that I created to savor the nuances of the Spanish language.

  3. Part 1
    Exploring Spanish

  4. My problem: understanding Spanish morphology
    «
    How many Spanish words end with -tad?
    »
    «
    And how many Spanish words end with -dad?
    »
    «
    What's the difference between them?
    »
    The above questions - and far more sophisticated ones, often involving verb stems -
    were becoming more and more frustrating as I progressed through my enthusiastic
    exploration of the Spanish language...
    Most unfortunately, however, looking for answers definitely required in-depth analysis
    - more than what I was able to perform in traditional dictionaries.

  5. Let's create our custom dictionary!
    On the web there is a variety of tools and libraries targeting almost every natural
    language, in different technologies - so I won't mention specific existing solutions.
    What I needed was a tool having these traits:
    1. Extracting data from Wikcionario, the Spanish edition of Wiktionary
    2. Based on my own linguistic model of the Spanish grammar
    3. Supporting arbitrarily complex queries - for example, via SQL
    In the past I created other projects - for the JVM - like Esprit for French and Balmung
    for German...
    ...so, which ecosystem could I choose this time?

  6. Modern Python for the backend...
    I started programming in Python back in 2004 - when Python 2.3 was still the latest
    release... I actually discovered it when reading Bruce Eckel's masterwork - «Thinking in
    Java» - and it quickly became my main language from 2004 to late 2007.
    However, I never stopped using it over the decades - especially for DevOps scripts and
    text-processing utilities...
    ...which is why I decided to explore its modern version - Python 3.10 - relying on type
    annotations enforced by Mypy; furthermore, after years of courses and projects with
    Node, I wanted to apply similar patterns to Flask as well.
    Alas, a doubt arose: would Python be fast enough for such a mountain of data?
    Spoiler alert: it was excellent - far beyond my expectations!

  7. ...and a TypeScript + React frontend!
    I also had to design a UI for the app: I discarded a mere command-line interface,
    because I wanted the user to enter multiline language-related queries with ease.
    At first, given the simplicity of the UI, I took into account Tkinter, because of its
    portability - not without a bit of vintage curiosity, indeed...
    ...but, in the end, I resolved to further explore the websocket technology - which I had
    previously adopted in my Node / React full-JS web stack named Ulysses... of course,
    while leveraging the magnificent elegance of React's declarative syntax.
    Oh, and I adore the hypnotic rainbow SVG spinner that you can find in Ulysses, too!
    Actually, since the beginning, it was clear that the very purpose of Jardinero was to
    explore - not only a natural language (Spanish), but also technological patterns.

  8. Jardinero in action

  9. Part 2
    The architectural components

  10. Architectural overview

  11. Eos-core: modern, type-checked utilities
    Eos-core is far from being a trivial utility library, because it is:
    - modern, written in Python 3.10
    - type-checked - with consistency checks performed by Mypy
    - dependency-free - only requiring your Python standard library
    - general-purpose - from parallelism to I/O, from functional programming to
      adaptive queue agents
    For details, please visit its GitHub project page.

  12. WikiPrism: parsing wikis and creating dictionaries
    WikiPrism focuses on:
    - lightning-fast wiki parsing - using SAX to extract Page objects from XML files
    - term extraction from pages - according to your language-specific algorithm
    - dictionary creation; in WikiPrism's model, a dictionary supports:
      - storing terms in an arbitrary data store
      - querying the dictionary - even via a custom DSL
    In particular, WikiPrism provides a SqliteDictionary backed by SQLite.
    For details, please visit its GitHub project page.
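
    A minimal sketch of SAX-based page extraction, using only the standard library.
    The Page class, the handler and the sample XML below are hypothetical stand-ins,
    not WikiPrism's actual API; the element names simply follow the usual layout of a
    MediaWiki XML dump.

```python
import xml.sax
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for a parsed wiki page (not WikiPrism's class)."""

    title: str
    text: str


class PageHandler(xml.sax.ContentHandler):
    """Collects the <title> and <text> of every <page> while streaming."""

    def __init__(self) -> None:
        super().__init__()
        self.pages = []
        self._current_tag = ""
        self._title = []
        self._text = []

    def startElement(self, name, attrs):
        if name == "page":
            self._title, self._text = [], []
        self._current_tag = name

    def characters(self, content):
        # SAX may deliver a single text node in several chunks, so accumulate.
        if self._current_tag == "title":
            self._title.append(content)
        elif self._current_tag == "text":
            self._text.append(content)

    def endElement(self, name):
        if name == "page":
            self.pages.append(Page("".join(self._title), "".join(self._text)))
        self._current_tag = ""


def parse_pages(xml_source: str) -> list[Page]:
    """Parses the XML string and returns the pages found, in document order."""
    handler = PageHandler()
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return handler.pages


SAMPLE = """<mediawiki>
  <page><title>libertad</title><revision><text>sustantivo femenino</text></revision></page>
  <page><title>ciudad</title><revision><text>sustantivo femenino</text></revision></page>
</mediawiki>"""

pages = parse_pages(SAMPLE)
print([page.title for page in pages])
```

    Because the handler keeps only the page being read, memory usage does not grow
    with the size of the dump, which is the main appeal of SAX for huge XML files.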

  13. Cervantes: creating a dictionary of Spanish terms
    Cervantes is built upon WikiPrism to extract Spanish terms from Wikcionario and
    classify them into grammar categories stored in SQLite tables.
    It can be referenced as a standalone Python library, but its interface actually exports all
    the functions required by Jardinero's extension protocol, therefore:
    «
    Cervantes is also a linguistic module - a plugin - for Jardinero
    »
    The DDL schema created by Cervantes is documented in the project's README file and
    perfectly supports all the SQL constructs - including joins.
    For details, please visit its GitHub project page.

  14. Jardinero: merging everything into a hybrid web app
    Jardinero is the colorful tip of the whole architecture - a web application to:
    - create a new dictionary backed by a SQLite db - one for each linguistic module
    - run queries - and instantly see the results in a good-looking HTML table
    - support new languages via linguistic modules, written in Python. You can even
      customize each dictionary and its query language
    Should you need more sophisticated analysis tools after creating a dictionary, you can
    also access the databases stored in the $HOME/.jardinero directory.

  15. Jardinero - Running a WikiPrism pipeline

  16. Extending Jardinero with Python
    A linguistic module is a plugin for Jardinero, but it's merely a Python module - or a
    package - just declaring 3 functions:
    - get_wiki_url() -> str: returns the URL of the BZ2-compressed XML wiki source file
    - extract_terms(page: Page) -> list[TTerm]: given a page, extracts dictionary terms
    - create_sqlite_dictionary(connection: Connection) -> SqliteDictionary[TTerm]:
      creates a SqliteDictionary - or an instance of a subclass, e.g. for a custom DSL
    Once a linguistic module is ready and installed in your Python distribution (for example,
    via pip), you can just run:
    python -OO -m info.gianlucacosta.jardinero
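
    The slides do not show how Jardinero inspects a linguistic module, so the
    following is only a sketch of what a reflection-based check could look like: it
    verifies that a module declares the three functions listed above. The helper
    missing_functions(), the toy module and the placeholder URL are all hypothetical.

```python
import types

# The three function names come from the list above.
REQUIRED_FUNCTIONS = ("get_wiki_url", "extract_terms", "create_sqlite_dictionary")


def missing_functions(module: types.ModuleType) -> list[str]:
    """Returns the required function names that the module does not declare."""
    return [
        name
        for name in REQUIRED_FUNCTIONS
        if not callable(getattr(module, name, None))
    ]


# A toy module built in memory; a real one would be imported by name,
# for example with importlib.import_module(...).
toy = types.ModuleType("toy_linguistic_module")
toy.get_wiki_url = lambda: "https://example.org/wiki-dump.xml.bz2"  # placeholder
toy.extract_terms = lambda page: []

print(missing_functions(toy))  # ['create_sqlite_dictionary']
```

    A check of this kind relies only on names and callability, so any module or
    package exposing the three functions qualifies, with no base class to inherit.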

  17. Part 3
    Evolving a prototype into an architecture

  18. Start by facing the risks
    Even though I wanted a nice web UI, my very first goal was to create a working CLI
    prototype - extracting terms from Wikcionario and creating my SQLite dictionary.
    Of course, there were doubts:
    - would Python be fast enough to parse such a huge data source, with no help from
      dedicated C extensions?
    - would my regular expressions be able to correctly extract all the data I needed?
    Although a comprehensive test suite ensured the correctness of the parsing process, the
    performance side was initially far from optimal...

  19. «Python is too slow!» - or maybe not
    The very first execution took more than 73 minutes on a 10th-gen i3 processor - but
    that did not come as a surprise, because my codebase was focused on exploration
    rather than performance; in other words, I added a bit of redundancy to start with a
    tentative but expressive codebase that would allow me to:
    - play with model evolution while still exploring the problem
    - easily remove abstractions when compelled by performance needs
    Therefore, I gradually simplified the codebase and, most importantly, I optimized
    database serialization by introducing buffers.
    Finally, I reached a 10-minute run, later optimized down to around 6 minutes!
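
    To illustrate the idea of buffering database writes, here is a minimal sketch
    based on the standard sqlite3 module. The BufferedWriter class, the words table
    and the capacity value are hypothetical; this is not the actual implementation
    used in the project.

```python
import sqlite3


class BufferedWriter:
    """Hypothetical buffer: collects rows and writes them in batches."""

    def __init__(self, connection: sqlite3.Connection, capacity: int) -> None:
        self._connection = connection
        self._capacity = capacity
        self._rows = []
        self.flush_count = 0

    def add(self, word: str) -> None:
        self._rows.append((word,))
        if len(self._rows) >= self._capacity:
            self.flush()

    def flush(self) -> None:
        if not self._rows:
            return
        # One executemany() and one commit per batch, instead of one per row.
        self._connection.executemany(
            "INSERT INTO words (word) VALUES (?)", self._rows
        )
        self._connection.commit()
        self._rows.clear()
        self.flush_count += 1


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE words (word TEXT NOT NULL)")

writer = BufferedWriter(connection, capacity=2)
for word in ["libertad", "ciudad", "verdad", "amistad", "bondad"]:
    writer.add(word)
writer.flush()  # write whatever is still in the buffer

row_count = connection.execute("SELECT COUNT(*) FROM words").fetchone()[0]
print(row_count, writer.flush_count)  # 5 3
```

    The final flush() matters: without it, rows still sitting in a partially
    filled buffer would never reach the database.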

  20. Lower-level languages are not always the solution
    First of all, measure the execution time: does it satisfy your expectations? If not:
    1. revise and simplify your algorithm - maybe evaluating alternatives in terms of O(...)
    2. ensure you're following the best practices for the runtime and for your external
    dependencies - including libraries and storage technologies
    3. use more performant dependencies - especially ones designed for your scenario
    4. try expressing CPU-intensive parts as extensions written in a lower-level language
    You should consider switching to another tech stack only when everything else fails.
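
    As for the very first step, measuring, a wall-clock timer from the standard
    library is often enough to get started. The timed() helper below is a
    hypothetical name introduced only for this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def timed(action: Callable[[], T]) -> tuple[T, float]:
    """Runs the action and returns its result with the elapsed seconds."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


result, seconds = timed(lambda: sum(i * i for i in range(100_000)))
print(f"Computed {result} in {seconds:.4f} s")
```

    For finer-grained answers (which function is slow, and why), the standard
    cProfile module is a natural next step.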

  21. The second step: creating a basic web UI
    After ensuring the engine was able to fulfill the performance constraints, the next
    major issue was the gap between the Python backend and the React frontend...
    ...and the solution was opening a websocket via Socket.IO.
    So, I designed an essential UI to test bidirectional server communication:

  22. Refinement cycles - baking the cake
    The project was growing steadily, as a monolith - since my architectural focus was
    always on namespacing and protocols:
    «
    As long as the namespace structure is clear, and well-defined
    interaction rules are in place, the monolith is still tidy and can be
    split at the most convenient moment
    »
    In particular, on the React side, the communication protocol with the websocket is
    transparent to the UI because of a type-checked hook, which:
    - expects the callbacks to be invoked upon message arrival from the server
    - returns a sort of remote control object to send messages to the backend

  23. Splitting the monolith, step by step
    Finally, I decided to gradually split the monolith.
    Eos-core - stemming from the utils package - was the first library that I extracted,
    because it was the most stable component, as well as the most generic.
    I still kept WikiPrism in the monolith for a while, since I wanted to consolidate and
    optimize the parsing engine a bit more.
    Even after WikiPrism became a standalone project, Cervantes was still integrated with
    Jardinero: only after introducing the reflection-based plugin protocol - and dedicated
    pipeline scripts - was I able to extract Cervantes as a dev dependency.

  24. «El broche de oro» - the finishing touch
    Writing the documentation actually led to a couple of interesting breakthroughs that
    contributed to further simplification - thus achieving a more coherent, minimalist
    model.
    I also reused and extended the GitHub Actions pipeline I had recently added to a
    legacy project.
    Last but not least, playing with vector graphics via Inkscape to create logos is always a
    pleasure!
    In the end, voilà how a huge monolith for exploring the Spanish language became the
    flexible architecture - open to any natural language - expressed by Jardinero.

  25. Part 4
    Takeaways from each project

  26. Eos-core: creating a general-purpose library
    - Huge, measurable test coverage via pytest: more than 98%
    - Multi-threading: utilities like Atomic, SafeThread, CancelableThread,
      CancelableThreadHandle + higher-order functions to create adaptive queue agents
    - Multi-processing: InThreadPool to overcome problems while testing and
      debugging on Windows, PoolFacade to add a queue with capacity in front of
      process pools
    - Database optimizations: buffering when writing to db via BufferedDbSerializer

  27. WikiPrism: remove inessential abstractions
    - Achieve lightning performance by deleting redundant abstractions - while never
      hindering elegance and expressiveness. Result: from 73+ mins to about 6 mins
    - In-depth exploration of threads and processes, CPU-bound vs I/O-bound, the GIL
    - Interruptible SAX parser for high-speed XML parsing
    - Transition from SQLAlchemy to raw SQL declared via custom decorators
    - More optimizations: objects with slots, adaptive queue agents, batch writing, ...
    - Dictionary customization of the repository pattern to achieve:
      - independence from the storage technology
      - custom DSL support

  28. WikiPrism: pipeline architecture

  29. Cervantes: the fine art of linguistic regexes
    - The most difficult task in Cervantes was writing effective and performant regexes; I
      wanted to include as many wiki variations as possible, while excluding noise and
      using minimalist regexes
    - Consequently, Cervantes is by far the most tested component, with test scenarios
      including both common situations and a wide variety of edge cases
    - Furthermore, I wanted to create a reusable library while maintaining its role as the
      initial monolith kernel: in the end, I opted for a transparent protocol à la Python
    - The [A-Z] regex does not capture accented vowels and ñ! I chose \w instead
    - It was great to revisit SQL to create and populate the SQLite dictionary
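
    The remark about [A-Z] versus \w can be checked in a few lines. The word list
    is made up for this sketch, and the accented letters are written as escapes
    to make sure they are single precomposed characters.

```python
import re

# "canción" and "año", with the accented letters as precomposed characters.
words = ["ciudad", "canci\u00f3n", "a\u00f1o", "libertad"]

# An ASCII range only covers the 26 unaccented letters.
ascii_only = [word for word in words if re.fullmatch(r"[A-Za-z]+", word)]

# For str patterns in Python 3, \w is Unicode-aware by default.
unicode_aware = [word for word in words if re.fullmatch(r"\w+", word)]

print(ascii_only)     # ['ciudad', 'libertad']
print(unicode_aware)  # ['ciudad', 'canción', 'año', 'libertad']
```

    Keep in mind that \w also matches digits and the underscore, so it is broader
    than "any letter"; whether that matters depends on the text being parsed.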

  30. Jardinero: combining Python and TypeScript
    - Hybrid full stack - backend in Python with Flask, frontend in React with TypeScript
    - __main__.py in package, enabling relative, non-ambiguous imports
    - Websockets - both point-to-point and sending broadcast messages via a
      background thread: if you start a pipeline, notifications are delivered to every
      browser displaying the app
    - Global CSS and CSS modules in Sass, via custom Webpack configuration
    - React hooks to conceal complexity, especially when combining technologies via a
      specific protocol
    - Mixed build pipeline: Poe + yarn, driven by Poetry

  31. Designing a Python build pipeline
    Poetry project - in lieu of traditional scripts - leveraging:
    - Type enforcement via Mypy
    - Linting with flake8 and formatting with black
    - Import sorting provided by isort
    - Poe plugin to add yarn-like scripts to Poetry
    - Actual yarn subproject - with its own scripts - to create frontend artifacts
    - Twine, for compliance checks
    - Pipeline integrated with GitHub Actions - triggered when pushing a tag

  32. Playing with modern Python
    - match statements
    - intermediate variables in collection comprehensions
    - logging with conditional __debug__ compilation for extra performance
    - advanced usage of pytest - with very high coverage rates
    - @dataclass - including frozen and slots
    - different flavors of decorators

  33. Conclusion...
    It was marvelous to explore, at the very same time:
    - Spanish and its morphology
    - modern Python
    - React and its real-time interactions with a non-JS backend, while adopting several
      Node patterns, especially in CPU-bound/IO-bound scenario evaluations
    - sophisticated data extraction from a non-trivial data source, via my beloved
      regular expressions

  34. ...but what about the initial question?
    Last but not least, if you start Jardinero, you'll immediately see the query answering the
    dilemma discussed at the beginning of the presentation, whose result is clear:
    «
    Jardinero could only find 13 words ending with -tad, versus
    662 words ending with -dad!
    »
    More detailed queries reveal that there seems to be no distinguishing trait among
    those 13 words - so they are probably just to be memorized.
    Despite the simplicity of this example, I think it properly reveals Jardinero's wide range
    of applicability in the vast domain of morphology!
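
    A query of this kind can be sketched with the standard sqlite3 module. The
    nouns table and its handful of rows are hypothetical toy data: the real schema
    is the one documented in the Cervantes README, and the counts below say nothing
    about the actual dictionary.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical single-table schema, only for this sketch.
connection.execute("CREATE TABLE nouns (entry TEXT PRIMARY KEY)")
connection.executemany(
    "INSERT INTO nouns (entry) VALUES (?)",
    [
        ("libertad",),
        ("amistad",),
        ("ciudad",),
        ("verdad",),
        ("bondad",),
        ("jardinero",),
    ],
)


def count_by_suffix(suffix: str) -> int:
    """Counts the entries ending with the given suffix."""
    cursor = connection.execute(
        "SELECT COUNT(*) FROM nouns WHERE entry LIKE ?", ("%" + suffix,)
    )
    return cursor.fetchone()[0]


tad_count = count_by_suffix("tad")
dad_count = count_by_suffix("dad")
print(tad_count, dad_count)  # 2 3
```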

  35. Further references
    Jardinero - Extensible web application for exploring natural languages
    Cervantes - Extract a compact Spanish dictionary from Wikcionario, with elegance
    WikiPrism - Parse wiki pages and create dictionaries, fast, with Python
    Eos-core - Type-checked, dependency-free utility library for modern Python
    Python
    Flask
    Poetry and Poe
    TypeScript
    React
    Socket.IO

  36. Thank you!
