The making of Jardinero

Gianluca Costa
The making of Jardinero: story of a software engineer who wanted to learn Spanish
Latest update: 2022-05-03

Foreword
« In the end, what's the difference between any passionate software engineer and an artist? »
I firmly believe that creativity is the very heart of excellent engineering: for example, the gradual refinement steps to achieve beautiful code are conceptually similar to the process of creating an exquisite painting or sculpting a statue out of marble...
In my personal, endless pursuit of Knowledge, I've noticed that even different domains can actually influence each other - and natural languages are by far one of my favorite inspiration sources when developing software.
In this presentation we're going to unfold - with technology and fun! - the story of the software architecture that I created to savor the nuances of the Spanish language.

Part 1 Exploring Spanish

My problem: understanding Spanish morphology
« How many Spanish words end with -tad? »
« And how many Spanish words end with -dad? »
« What's the difference between them? »
The above questions - and far more sophisticated ones, often involving verb stems - were becoming more and more frustrating as I progressed through my enthusiastic exploration of the Spanish language...
Unfortunately, answering them required in-depth analysis - more than I was able to perform with traditional dictionaries.

Let's create our custom dictionary!
On the web there is a variety of tools and libraries targeting almost every natural language, in different technologies - so I won't mention specific existing solutions.
What I needed was a tool with these traits:
1. Extracting data from Wikcionario, the Spanish edition of Wiktionary
2. Based on my own linguistic model of the Spanish grammar
3. Supporting arbitrarily complex queries - for example, via SQL
In the past I created other projects - for the JVM - like Esprit for French and Balmung for German...
...so, which ecosystem could I choose this time?

Modern Python for the backend...
I started programming in Python back in 2004 - when Python 2.3 was still the latest release... I actually discovered it when reading Bruce Eckel's masterwork - «Thinking in Java» - and it quickly became my main language from 2004 to late 2007.
However, I never stopped using it over the decades - especially for DevOps scripts and text-processing utilities...
...which is why I decided to explore its modern version - Python 3.10 - relying on type annotations enforced by Mypy; furthermore, after years of courses and projects with Node, I wanted to apply similar patterns to Flask as well.
Alas, a doubt arose: would Python be fast enough for such a mountain of data?
Spoiler alert: it was excellent - far beyond my expectations!

...and a TypeScript + React frontend!
I also had to design a UI for the app: I discarded a mere command-line interface, because I wanted the user to enter multiline language-related queries with ease.
At first, given the simplicity of the UI, I considered Tkinter, because of its portability - not without a bit of vintage curiosity, indeed...
...but, in the end, I resolved to further explore the websocket technology - which I had previously adopted in my Node / React full-JS web stack named Ulysses... of course, while leveraging the magnificent elegance of React's declarative syntax.
Oh, and I adore the hypnotic rainbow SVG spinner that you can find in Ulysses, too!
Actually, since the beginning, it was clear that the very purpose of Jardinero was to explore - not only a natural language (Spanish), but also technological patterns.

Jardinero in action

Part 2 The architectural components

Architectural overview

Eos-core: modern, type-checked utilities
Eos-core is far from being a trivial utility library, because it is:
- modern, written in Python 3.10
- type-checked - with consistency checks performed by Mypy
- dependency-free - only requiring your Python standard library
- general-purpose - from parallelism to I/O, from functional programming to adaptive queue agents
For details, please visit its GitHub project page.

WikiPrism: parsing wikis and creating dictionaries
WikiPrism focuses on:
- lightning-fast wiki parsing - using SAX to extract Page objects from XML files
- term extraction from pages - according to your language-specific algorithm
- dictionary creation; in WikiPrism's model, a dictionary supports:
  - storing terms into an arbitrary data storage
  - querying the dictionary - even via a custom DSL
In particular, WikiPrism provides a SqliteDictionary backed by SQLite.
For details, please visit its GitHub project page.
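
To make the SAX idea concrete, here is a minimal sketch based only on Python's standard xml.sax module. It is not WikiPrism's code: the Page class below is a hypothetical stand-in, the sample XML is made up, and the element names (page, title, text) are assumed to follow the usual layout of MediaWiki XML dumps.

```python
import xml.sax
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    # Hypothetical stand-in, not WikiPrism's actual Page class
    title: str
    text: str


class PageHandler(xml.sax.ContentHandler):
    """Builds a Page for every <page> element and passes it to a callback."""

    def __init__(self, on_page):
        super().__init__()
        self._on_page = on_page
        self._current_tag = None
        self._chunks = {"title": [], "text": []}

    def startElement(self, name, attrs):
        if name == "page":
            # A new page starts: forget what was collected so far
            self._chunks = {"title": [], "text": []}
        self._current_tag = name if name in self._chunks else None

    def characters(self, content):
        # SAX may deliver a single text node in several pieces
        if self._current_tag is not None:
            self._chunks[self._current_tag].append(content)

    def endElement(self, name):
        self._current_tag = None
        if name == "page":
            title = "".join(self._chunks["title"])
            text = "".join(self._chunks["text"])
            self._on_page(Page(title, text))


# Made-up sample, far smaller than a real wiki dump
SAMPLE = b"""<mediawiki>
  <page><title>libertad</title><revision><text>sample text 1</text></revision></page>
  <page><title>ciudad</title><revision><text>sample text 2</text></revision></page>
</mediawiki>"""

pages = []
xml.sax.parseString(SAMPLE, PageHandler(pages.append))

for page in pages:
    print(page.title)
```

Since the handler emits each page as soon as its closing tag arrives, memory usage does not depend on the size of the XML file, which is the main reason to prefer SAX over a DOM parser for a full wiki dump.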

Cervantes: creating a dictionary of Spanish terms
Cervantes is built upon WikiPrism to extract Spanish terms from Wikcionario and classify them into grammar categories stored in SQLite tables.
It can be referenced as a standalone Python library, but its interface actually exports all the functions required by Jardinero's extension protocol, therefore:
« Cervantes is also a linguistic module - a plugin - for Jardinero »
The schema created by Cervantes is documented in the project's README file and supports all the SQL constructs - including joins.
For details, please visit its GitHub project page.

Jardinero: merging everything into a hybrid web app
Jardinero is the colorful tip of the whole architecture - a web application to:
- create a new dictionary backed by a SQLite db - one for each linguistic module
- run queries - and instantly see the results in a good-looking HTML table
- support new languages via linguistic modules, written in Python. You can even customize each dictionary and its query language
Should you need more sophisticated analysis tools after creating a dictionary, you can also access the databases stored in the $HOME/.jardinero directory.

Jardinero - Running a WikiPrism pipeline

Extending Jardinero with Python
A linguistic module is a plugin for Jardinero, but it's merely a Python module - or a package - just declaring 3 functions:
- get_wiki_url() -> str: returns the URL of the BZ2-compressed XML wiki source file
- extract_terms(page: Page) -> list[TTerm]: given a page, extracts dictionary terms
- create_sqlite_dictionary(connection: Connection) -> SqliteDictionary[TTerm]: creates a SqliteDictionary - or an instance of a subclass, e.g. for a custom DSL
Once a linguistic module is ready and installed in your Python distribution (for example, via pip), you can just run:
python -OO -m info.gianlucacosta.jardinero <linguistic module to import>
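
As an illustration of how a host application might check such a protocol via reflection, here is a small sketch. It is not Jardinero's actual loading code: only the three function names come from the list above, while the helper missing_functions, the demo module and the example URL are hypothetical.

```python
import types

# The three function names that a linguistic module is expected to declare
REQUIRED_FUNCTIONS = ("get_wiki_url", "extract_terms", "create_sqlite_dictionary")


def missing_functions(module: types.ModuleType) -> list[str]:
    """Returns the names of the required functions that the module lacks."""
    return [
        name
        for name in REQUIRED_FUNCTIONS
        if not callable(getattr(module, name, None))
    ]


# Hypothetical in-memory module, standing in for one loaded via
# importlib.import_module("<linguistic module to import>")
demo = types.ModuleType("demo_linguistic_module")
demo.get_wiki_url = lambda: "https://example.org/wiki-dump.xml.bz2"
demo.extract_terms = lambda page: []

print(missing_functions(demo))  # create_sqlite_dictionary is still missing

demo.create_sqlite_dictionary = lambda connection: None
print(missing_functions(demo))  # nothing is missing now
```

A check of this kind needs no base class or registration step: any module exposing the right callables qualifies, which is what keeps the protocol transparent for plugin authors.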

Part 3 Evolving a prototype into an architecture

Start by facing the risks
Even though I wanted a nice web UI, my very first goal was to create a working CLI prototype - extracting terms from Wikcionario and creating my SQLite dictionary.
Of course, there were doubts:
- would Python be fast enough to parse such a huge data source, with no help from dedicated C extensions?
- would my regular expressions be able to correctly extract all the data I needed?
Although a comprehensive test suite ensured the correctness of the parsing process, the performance side was initially far from optimal...

«Python is too slow!» - or maybe not
The very first execution took more than 73 minutes on a 10th-gen i3 processor - but that did not come as a surprise, because my codebase was focused on exploration rather than performance; in other words, I added a bit of redundancy to start with a tentative but expressive codebase that would allow me to:
- play with model evolution while still exploring the problem
- easily remove abstractions when compelled by performance needs
Therefore, I gradually simplified the codebase and, most importantly, I optimized database serialization by introducing buffers.
Finally, I reached a 10-minute run, later optimized down to around 6 minutes!
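
The buffering idea can be sketched with the standard sqlite3 module alone. The class below is a hypothetical illustration, not the serializer used in the project: rows are accumulated in memory and written in batches, so that the database sees one transaction per batch instead of one per row.

```python
import sqlite3


class BufferedInserter:
    """Hypothetical helper: collects rows and writes them in batches."""

    def __init__(self, connection, sql, max_buffer_size=1000):
        self._connection = connection
        self._sql = sql
        self._max_buffer_size = max_buffer_size
        self._buffer = []
        self.flush_count = 0

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._max_buffer_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        # The connection as a context manager commits once per batch
        with self._connection:
            self._connection.executemany(self._sql, self._buffer)
        self._buffer.clear()
        self.flush_count += 1


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE terms (entry TEXT)")  # toy table

inserter = BufferedInserter(connection, "INSERT INTO terms VALUES (?)")
for i in range(2500):
    inserter.add((f"term{i}",))
inserter.flush()  # write the rows still waiting in the buffer

row_count = connection.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
print(row_count, inserter.flush_count)
```

The buffer size is a trade-off between memory usage and the number of round trips to the database, so it is worth measuring a few values on real data.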

Lower-level languages are not always the solution
First of all, measure the execution time: does it satisfy your expectations? If not:
1. revise and simplify your algorithm - maybe evaluating alternatives in terms of O(...)
2. ensure you're following the best practices for the runtime and for your external dependencies - including libraries and storage technologies
3. use more performant dependencies - especially ones designed for your scenario
4. try expressing CPU-intensive parts as extensions written in a lower-level language
You should consider switching to another tech stack only when everything else fails.
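
As a small illustration of "measure first, then revise the algorithm", the sketch below times the same membership test on a list and on a set. The timed helper and the sample data are made up for this example, and the measured values will differ from machine to machine.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Stores the elapsed wall-clock seconds of the block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Made-up data: 5000 known words, 500 candidates (half of them known)
known_words = [f"palabra{i}" for i in range(5000)]
candidates = [f"palabra{i}" for i in range(0, 10000, 20)]

timings = {}

with timed("list lookup - O(n) per test", timings):
    found_in_list = [word for word in candidates if word in known_words]

known_set = set(known_words)
with timed("set lookup - O(1) on average per test", timings):
    found_in_set = [word for word in candidates if word in known_set]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Both versions return the same words; only the data structure changes, which is the kind of step-1 revision that often makes a lower-level rewrite unnecessary.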

The second step: creating a basic web UI
After ensuring the engine was able to fulfill the performance constraints, the next major issue was the gap between the Python backend and the React frontend...
...and the solution was opening a websocket via Socket.IO.
So, I designed an essential UI to test bidirectional server communication.

Refinement cycles - baking the cake
The project was growing steadily, as a monolith - since my architectural focus was always on namespacing and protocols:
« As long as the namespace structure is clear, and well-defined interaction rules are in place, the monolith is still tidy and can be split at the most convenient moment »
In particular, on the React side, the communication protocol with the websocket is transparent to the UI because of a type-checked hook, which:
- expects the callbacks to be invoked upon message arrival from the server
- returns a sort of remote control object to send messages to the backend

Splitting the monolith, step by step
Finally, I decided to gradually split the monolith.
Eos-core - stemming from the utils package - was the first library that I extracted, because it was the most stable component, as well as the most generic.
I still kept WikiPrism in the monolith for a while, since I wanted to consolidate and optimize the parsing engine a bit more.
Even after WikiPrism became a standalone project, Cervantes was still integrated with Jardinero: only after introducing the reflection-based plugin protocol - and dedicated pipeline scripts - was I able to extract Cervantes as a dev dependency.

«El broche de oro» - the finishing touch
Writing the documentation actually led to a couple of interesting breakthroughs that contributed to further simplification - thus achieving a more coherent, minimalist model.
I also reused and extended the GitHub Actions pipeline I had recently added to a legacy project.
Last but not least, playing with vector graphics via Inkscape to create logos is always a pleasure!
In the end, voilà how a huge monolith for exploring the Spanish language became the flexible architecture - open to any natural language - expressed by Jardinero.

Part 4 Takeaways from each project

Eos-core: creating a general-purpose library
- Huge, measurable test coverage via pytest: more than 98%
- Multi-threading: utilities like Atomic, SafeThread, CancelableThread, CancelableThreadHandle + higher-order functions to create adaptive queue agents
- Multi-processing: InThreadPool to overcome problems while testing and debugging on Windows, PoolFacade to add a queue with capacity in front of process pools
- Database optimizations: buffering when writing to db via BufferedDbSerializer

WikiPrism: remove inessential abstractions
- Achieve lightning performance by deleting redundant abstractions - while never hindering elegance and expressiveness. Result: from 73+ mins to about 6 mins
- In-depth exploration of threads and processes, CPU-bound vs I/O-bound, the GIL
- Interruptible SAX parser for high-speed XML parsing
- Transition from SQLAlchemy to raw SQL declared via custom decorators
- More optimizations: objects with slots, adaptive queue agents, batch writing, ...
- Dictionary: customization of the repository pattern to achieve:
  - independence from the storage technology
  - custom DSL support

WikiPrism: pipeline architecture

Cervantes: the fine art of linguistic regexes
- The most difficult task in Cervantes was writing effective and performant regexes; I wanted to include as many wiki variations as possible, while excluding noise and using minimalist regexes
- Consequently, Cervantes is by far the most tested component, with test scenarios including both common situations and a wide variety of edge cases
- Furthermore, I wanted to create a reusable library while maintaining its role as the initial monolith kernel: in the end, I opted for a transparent protocol à la Python
- The [A-Z] regex does not capture accented vowels and ñ! I chose \w instead
- It was great to revisit SQL to create and populate the SQLite dictionary
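
The remark about accented letters is easy to reproduce with the standard re module. The snippet below is only a demonstration with made-up sample words, not one of the regexes used in Cervantes; the words are normalized to NFC so that each accented letter is a single code point.

```python
import re
import unicodedata

# An ASCII-only character class versus the Unicode-aware \w
ASCII_ONLY = re.compile(r"[a-zA-Z]+")
UNICODE_WORD = re.compile(r"\w+")

# Sample words; NFC ensures that "ñ" and "ó" are single code points
words = [
    unicodedata.normalize("NFC", word)
    for word in ["ciudad", "niño", "canción", "libertad"]
]

ascii_matches = [word for word in words if ASCII_ONLY.fullmatch(word)]
unicode_matches = [word for word in words if UNICODE_WORD.fullmatch(word)]

print(ascii_matches)
print(unicode_matches)
```

Keep in mind that \w also accepts digits and the underscore, so a stricter pattern may be needed when those characters count as noise.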

Jardinero: combining Python and TypeScript
- Hybrid full stack - backend in Python with Flask, frontend in React with TypeScript
- __main__.py in package, enabling relative, non-ambiguous imports
- Websockets - both point-to-point and sending broadcast messages via a background thread: if you start a pipeline, notifications are delivered to every browser displaying the app
- Global CSS and CSS modules in Sass, via custom Webpack configuration
- React hooks to conceal complexity, especially when combining technologies via a specific protocol
- Mixed build pipeline: Poe + yarn, driven by Poetry

Designing a Python build pipeline
Poetry project - in lieu of traditional scripts - leveraging:
- Type enforcement via Mypy
- Linting with flake8 and formatting with black
- Import sorting provided by isort
- Poe plugin to add yarn-like scripts to Poetry
- Actual yarn subproject - with its own scripts - to create frontend artifacts
- Twine, for compliance checks
Pipeline integrated with GitHub Actions - triggered when pushing a tag

Playing with modern Python
- match statements
- intermediate variables in collection comprehensions
- logging with conditional __debug__ compilation for extra performance
- advanced usage of pytest - with very high coverage rates
- @dataclass - including frozen and slots
- different flavors of decorators

Conclusion...
It was marvelous to explore, at the very same time:
- Spanish and its morphology
- modern Python
- React and its real-time interactions with a non-JS backend, while adopting several Node patterns, especially in CPU-bound/IO-bound scenario evaluations
- sophisticated data extraction from a non-trivial data source, via my beloved regular expressions

...but what about the initial question?
Last but not least, if you start Jardinero, you'll immediately see the query answering the dilemma discussed at the beginning of the presentation, whose result is clear:
« Jardinero could only find 13 words ending with -tad, versus 662 words ending with -dad! »
More detailed queries reveal that there seems to be no distinguishing trait among those 13 words - so they probably just have to be memorized.
Despite the simplicity of this example, I think it properly reveals Jardinero's wide range of applicability in the vast domain of morphology!
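
A query of this kind can be sketched against a toy SQLite database. The table name nouns, its single column and the sample words are hypothetical: the real schema is the one documented in the Cervantes README, and the counts printed here only describe the toy data, not the 13 versus 662 result above.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Hypothetical single-column table, not the actual Cervantes schema
connection.execute("CREATE TABLE nouns (entry TEXT PRIMARY KEY)")
connection.executemany(
    "INSERT INTO nouns VALUES (?)",
    [
        (word,)
        for word in ["libertad", "amistad", "ciudad", "verdad", "bondad", "jardín"]
    ],
)

# substr(entry, -3) takes the last 3 characters of each entry
rows = connection.execute(
    """
    SELECT substr(entry, -3) AS ending, COUNT(*) AS total
    FROM nouns
    WHERE entry LIKE '%tad' OR entry LIKE '%dad'
    GROUP BY ending
    """
)
counts = dict(rows)

print(counts)
```

Since the dictionary is a plain SQLite file, the same statement can also be run from any SQLite client pointed at the databases in the $HOME/.jardinero directory.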

Further references
- Jardinero - Extensible web application for exploring natural languages
- Cervantes - Extract a compact Spanish dictionary from Wikcionario, with elegance
- WikiPrism - Parse wiki pages and create dictionaries, fast, with Python
- Eos-core - Type-checked, dependency-free utility library for modern Python
- Python
- Flask
- Poetry and Poe
- TypeScript
- React
- Socket.IO

Thank you!
