«In the end, what's the difference between any passionate software engineer and an artist?»
I firmly believe that creativity is the very heart of excellent engineering: for example, the gradual refinement steps to achieve beautiful code are conceptually similar to the process of creating an exquisite painting or sculpting a statue out of marble...
In my personal, endless pursuit of Knowledge, I've noticed that even different domains can actually influence each other - and natural languages are by far one of my favorite inspiration sources when developing software.
In this presentation we're going to unfold - with technology and fun! - the story of the software architecture that I created to savor the nuances of the Spanish language.
The making of
Story of a software engineer who wanted to learn Spanish
Latest update: 2022-05-03
My problem: understanding Spanish morphology
How many Spanish words end with -tad?
And how many Spanish words end with -dad?
What's the difference between them?
The above questions - and far more sophisticated ones, often involving verb stems -
were becoming more and more frustrating as I progressed through my enthusiastic
exploration of the Spanish language...
Most unfortunately, however, looking for answers definitely required in-depth analysis
- more than what I was able to perform in traditional dictionaries.
Let's create our custom dictionary!
On the web there is a variety of tools and libraries targeting almost every natural
language, in different technologies - so I won't mention specific existing solutions.
What I needed was a tool having these traits:
1. Extracting data from Wikcionario, the Spanish edition of Wiktionary
2. Based on my own linguistic model of the Spanish grammar
3. Supporting arbitrarily complex queries - for example, via SQL
In the past I created other projects - for the JVM - like Esprit for French and Balmung
...so, which ecosystem could I choose this time?
Modern Python for the backend...
I started programming in Python back in 2004 - when Python 2.3 was still the latest
release... I actually discovered it when reading Bruce Eckel's masterwork - «Thinking in
Java» - and it quickly became my main language from 2004 to late 2007.
However, I never stopped using it over the decades - especially for DevOps scripts and
...which is why I decided to explore its modern version - Python 3.10 - relying on type
annotations enforced by Mypy; furthermore, after years of courses and projects with
Node, I wanted to apply similar patterns to Flask as well.
Alas, a doubt arose: would Python be fast enough for such a mountain of data?
Spoiler alert: it was excellent - far beyond my expectations!
...and a TypeScript + React frontend!
I also had to design a UI for the app: I discarded a mere command-line interface,
because I wanted the user to enter multiline language-related queries with ease.
At first, given the simplicity of the UI, I took into account Tkinter, because of its
portability - not without a bit of vintage curiosity, indeed...
...but, in the end, I resolved to further explore the websocket technology - which I had
previously adopted in my Node / React full-JS web stack named Ulysses... of course,
while leveraging the magnificent elegance of React's declarative syntax.
Oh, and I adore the hypnotic rainbow SVG spinner that you can find in Ulysses, too!
Actually, since the beginning, it was clear that the very purpose of Jardinero was to
explore - not only a natural language (Spanish), but also technological patterns.
Jardinero in action
The architectural components
Eos-core: modern, type-checked utilities
Eos-core is far from being a trivial utility library, because it is:
modern, written in Python 3.10
type-checked - with consistency checks performed by Mypy
dependency-free - only requiring your Python standard library
general-purpose - from parallelism to I/O, from functional programming to
adaptive queue agents
For details, please visit its GitHub project page.
WikiPrism: parsing wikis and creating dictionaries
WikiPrism focuses on:
lightning-fast wiki parsing - using SAX to extract Page objects from XML files
term extraction from pages - according to your language-specific algorithm
dictionary creation; in WikiPrism's model, a dictionary supports:
storing terms in an arbitrary data store
querying the dictionary - even via a custom DSL
In particular, WikiPrism provides a SqliteDictionary backed by SQLite
For details, please visit its GitHub project page.
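The SAX-based extraction described above can be illustrated with the standard library alone. This is only a minimal sketch, not WikiPrism's actual code: the Page dataclass, the PageHandler class and the sample XML are hypothetical stand-ins, assuming a MediaWiki-like dump where each page element contains a title and a text element.

```python
import io
import xml.sax
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    # Hypothetical stand-in: WikiPrism's real Page class may differ.
    title: str
    text: str


class PageHandler(xml.sax.ContentHandler):
    """Builds a Page for every <page> element and passes it to a callback."""

    def __init__(self, on_page):
        super().__init__()
        self._on_page = on_page
        self._current_tag = None
        self._title_parts = []
        self._text_parts = []

    def startElement(self, name, attrs):
        if name == "page":
            self._title_parts, self._text_parts = [], []
        self._current_tag = name

    def characters(self, content):
        # SAX may deliver the text of a single element in several chunks.
        if self._current_tag == "title":
            self._title_parts.append(content)
        elif self._current_tag == "text":
            self._text_parts.append(content)

    def endElement(self, name):
        self._current_tag = None
        if name == "page":
            page = Page("".join(self._title_parts), "".join(self._text_parts))
            self._on_page(page)


# Made-up sample data, only loosely shaped like a wiki dump.
SAMPLE_XML = b"""<mediawiki>
  <page><title>libertad</title><revision><text>first sample body</text></revision></page>
  <page><title>ciudad</title><revision><text>second sample body</text></revision></page>
</mediawiki>"""

pages = []
xml.sax.parse(io.BytesIO(SAMPLE_XML), PageHandler(pages.append))
print([page.title for page in pages])
```

Since the handler only keeps the current page in memory, the same approach scales to very large files - which is the usual reason to prefer SAX over loading a whole DOM tree.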
Cervantes: creating a dictionary of Spanish terms
Cervantes is built upon WikiPrism to extract Spanish terms from Wikcionario and
classify them into grammar categories stored in SQLite tables.
It can be referenced as a standalone Python library, but its interface actually exports all
the functions required by Jardinero's extension protocol, therefore:
Cervantes is also a linguistic module - a plugin - for Jardinero
The DDL schema created by Cervantes is documented in the project's README file and
perfectly supports all the SQL constructs - including joins.
For details, please visit its GitHub project page.
Jardinero: merging everything into a hybrid web app
Jardinero is the colorful tip of the whole architecture - a web application to:
create a new dictionary backed by a SQLite db - one for each linguistic module
run queries - and instantly see the results in a good-looking HTML table
support new languages via linguistic modules, written in Python. You can even
customize each dictionary and its query language
Should you need more sophisticated analysis tools after creating a dictionary, you can
also access the databases stored in the $HOME/.jardinero directory.
Jardinero - Running a WikiPrism pipeline
Extending Jardinero with Python
A linguistic module is a plugin for Jardinero, but it's merely a Python module - or a
package - just declaring 3 functions:
get_wiki_url() -> str: returns the URL of the BZ2-compressed XML wiki source file
extract_terms(page: Page) -> list[TTerm]: given a page, extracts dictionary terms
create_sqlite_dictionary(connection: Connection) -> SqliteDictionary[TTerm]:
creates a SqliteDictionary - or an instance of a subclass, e.g. for a custom DSL
Once a linguistic module is ready and installed in your Python distribution (for example,
via pip), you can just run:
python -OO -m info.gianlucacosta.jardinero
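To make the idea of a module-level protocol more concrete, here is a minimal sketch of how a host application could load a module by name and check, via reflection, that it declares the 3 functions listed above. It is not Jardinero's actual loader: load_linguistic_module, the toy_language module and its example.org URL are hypothetical names introduced only for illustration.

```python
import importlib
import sys
import types

REQUIRED_FUNCTIONS = ("get_wiki_url", "extract_terms", "create_sqlite_dictionary")


def load_linguistic_module(module_name):
    """Imports a module by name and verifies that it declares the 3 functions."""
    module = importlib.import_module(module_name)
    missing = [
        name
        for name in REQUIRED_FUNCTIONS
        if not callable(getattr(module, name, None))
    ]
    if missing:
        raise TypeError(f"{module_name} is missing: {', '.join(missing)}")
    return module


# Toy in-memory module, registered only to make the sketch runnable;
# a real linguistic module would be installed, for example, via pip.
toy = types.ModuleType("toy_language")
toy.get_wiki_url = lambda: "https://example.org/toy-wiki.xml.bz2"
toy.extract_terms = lambda page: []
toy.create_sqlite_dictionary = lambda connection: None
sys.modules["toy_language"] = toy

loaded = load_linguistic_module("toy_language")
print(loaded.get_wiki_url())
```

With this kind of duck-typed check, a plugin needs no base class and no registration step: declaring the expected functions is enough.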
Evolving a prototype into an architecture
Start by facing the risks
Even though I wanted a nice web UI, my very first goal was to create a working CLI
prototype - extracting terms from Wikcionario and creating my SQLite dictionary.
Of course, there were doubts:
would Python be fast enough to parse such a huge data source, with no help from
dedicated C extensions?
would my regular expressions be able to correctly extract all the data I needed?
Although a comprehensive test suite ensured the correctness of the parsing process, the
performance side was initially far from optimal...
«Python is too slow! » - or maybe not
The very first execution took more than 73 minutes on a 10th-gen i3 processor - but
that did not come as a surprise, because my codebase was focused on exploration
rather than performance; in other words, I added a bit of redundancy to start with a
tentative but expressive codebase that would allow me to:
play with model evolution while still exploring the problem
easily remove abstractions when compelled by performance needs
Therefore, I gradually simplified the codebase and, most importantly, I optimized
database serialization by introducing buffers.
Finally, I reached a 10-minute run, later optimized down to around 6 minutes!
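As an illustration of buffered database serialization, the following minimal sketch accumulates rows in memory and writes them in batches with executemany(). It is not the BufferedDbSerializer from Eos-core: the BufferedWriter class, the terms table and the buffer size are assumptions made up for this example.

```python
import sqlite3


class BufferedWriter:
    """Accumulates rows and writes them in batches via executemany()."""

    def __init__(self, connection, insert_sql, max_buffer_size=1000):
        self._connection = connection
        self._insert_sql = insert_sql
        self._max_buffer_size = max_buffer_size
        self._buffer = []

    def write(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._max_buffer_size:
            self.flush()

    def flush(self):
        if self._buffer:
            with self._connection:  # one transaction (and one commit) per batch
                self._connection.executemany(self._insert_sql, self._buffer)
            self._buffer.clear()


connection = sqlite3.connect(":memory:")
# Hypothetical table, used only for this sketch.
connection.execute("CREATE TABLE terms (entry TEXT NOT NULL)")

writer = BufferedWriter(
    connection, "INSERT INTO terms (entry) VALUES (?)", max_buffer_size=2
)
for entry in ["libertad", "ciudad", "verdad"]:
    writer.write((entry,))
writer.flush()  # writes the rows still sitting in the buffer

row_count = connection.execute("SELECT COUNT(*) FROM terms").fetchone()[0]
print(row_count)
```

The gain usually comes from paying the transaction overhead once per batch rather than once per row; the best buffer size depends on the data and is worth measuring.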
Lower-level languages are not always the solution
First of all, measure the execution time: does it satisfy your expectations? If not:
1. revise and simplify your algorithm - maybe evaluating alternatives in terms of O(...)
2. ensure you're following the best practices for the runtime and for your external
dependencies - including libraries and storage technologies
3. use more performant dependencies - especially ones designed for your scenario
4. try expressing CPU-intensive parts as extensions written in a lower-level language
You should consider switching to another tech stack only when everything else fails.
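The first two steps - measuring, then revising the algorithm - can be sketched in a few lines. The timed helper and the word lists below are hypothetical; the example simply times the same function with a list (linear membership test) and with a set (hash-based membership test). Actual timings depend on the machine, so none are shown here.

```python
import time


def timed(function, *args):
    """Returns (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


def count_known(words, known):
    # "word in known" scans the whole collection when known is a list,
    # but is a hash lookup when known is a set.
    return sum(1 for word in words if word in known)


known_list = [f"word{i}" for i in range(2_000)]
known_set = set(known_list)
words = [f"word{i}" for i in range(0, 4_000, 2)]

slow_result, slow_seconds = timed(count_known, words, known_list)
fast_result, fast_seconds = timed(count_known, words, known_set)

print(slow_result == fast_result)
print(f"list: {slow_seconds:.4f}s - set: {fast_seconds:.4f}s")
```

For more reliable numbers, the standard timeit module repeats the measurement several times.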
The second step: creating a basic web UI
After ensuring the engine was able to fulfill the performance constraints, the next
major issue was the gap between the Python backend and the React frontend...
...and the solution was opening a websocket via Socket.IO.
So, I designed an essential UI to test bidirectional server communication:
Refinement cycles - baking the cake
The project was growing steadily, as a monolith - since my architectural focus was
always on namespacing and protocols:
As long as the namespace structure is clear, and well-defined
interaction rules are in place, the monolith is still tidy and can be
split at the most convenient moment
In particular, on the React side, the communication protocol with the websocket is
transparent to the UI because of a type-checked hook, which:
expects the callbacks to be invoked upon message arrival from the server
returns a sort of remote control object to send messages to the backend
Splitting the monolith, step by step
Finally, I decided to gradually split the monolith.
Eos-core - stemming from the utils package - was the first library that I extracted,
because it was the most stable component, as well as the most generic.
I still kept WikiPrism in the monolith for a while, since I wanted to consolidate and
optimize the parsing engine a bit more.
Even after WikiPrism became a standalone project, Cervantes was still integrated with
Jardinero: only after introducing the reflection-based plugin protocol - and dedicated
pipeline scripts - was I able to extract Cervantes as a dev dependency.
«El broche de oro» - the finishing touch
Writing the documentation actually led to a couple of interesting breakthroughs that
contributed to further simplification - thus achieving a more coherent, minimalist
I also reused and extended the GitHub Actions pipeline I had recently added to a
Last but not least, playing with vector graphics via Inkscape to create logos is always a
In the end, voilà how a huge monolith for exploring the Spanish language became the
flexible architecture - open to any natural language - expressed by Jardinero.
Takeaways from each project
Eos-core: creating a general-purpose library
Huge, measurable test coverage via pytest: more than 98%
Multi-threading: utilities like Atomic, SafeThread, CancelableThread,
CancelableThreadHandle + higher-order functions to create adaptive queue agents
Multi-processing: InThreadPool to overcome problems while testing and
debugging on Windows, PoolFacade to add a queue with capacity in front of
Database optimizations: buffering when writing to db via BufferedDbSerializer
WikiPrism: remove inessential abstractions
Achieve lightning performance by deleting redundant abstractions - while never
hindering elegance and expressiveness. Result: from 73+ mins to about 6 mins
In-depth exploration of threads and processes, CPU-bound vs I/O-bound, the GIL
Interruptible SAX parser for high-speed XML parsing
Transition from SQLAlchemy to raw SQL declared via custom decorators
More optimizations: objects with slots, adaptive queue agents, batch writing, ...
Dictionary customization of the repository pattern to achieve:
independence from the storage technology
custom DSL support
WikiPrism: pipeline architecture
Cervantes: the fine art of linguistic regexes
The most difficult task in Cervantes was writing effective and performant regexes; I
wanted to include as many wiki variations as possible, while excluding noise and
using minimalist regexes
Consequently, Cervantes is by far the most tested component, with test scenarios
including both common situations and a wide variety of edge cases
Furthermore, I wanted to create a reusable library while maintaining its role as the
initial monolith kernel: in the end, I opted for a transparent protocol à la Python
The [A-Z] regex does not capture accented vowels and ñ! I chose \w instead
It was great to revisit SQL to create and populate the SQLite dictionary
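The remark about [A-Z] and \w can be verified with a short sketch. The two patterns below are simplified illustrations, not regexes taken from Cervantes; the words are normalized to NFC so that each accented letter is a single code point.

```python
import re
import unicodedata

ASCII_LETTERS = re.compile(r"[a-zA-Z]+")
UNICODE_WORD = re.compile(r"\w+")  # for str patterns, \w is Unicode-aware by default

words = [unicodedata.normalize("NFC", word) for word in ["ciudad", "año", "canción"]]

results = {
    word: (
        ASCII_LETTERS.fullmatch(word) is not None,
        UNICODE_WORD.fullmatch(word) is not None,
    )
    for word in words
}

for word, (ascii_match, unicode_match) in results.items():
    print(word, ascii_match, unicode_match)
```

Only the \w pattern accepts the words containing ñ or an accented vowel. Keep in mind that \w also matches digits and the underscore, so it is broader than "any letter".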
Jardinero: combining Python and TypeScript
Hybrid full stack - backend in Python with Flask, frontend in React with TypeScript
__main__.py in package, enabling relative, non-ambiguous imports
Websockets - both point-to-point and sending broadcast messages via a
background thread: if you start a pipeline, notifications are delivered to every
browser displaying the app
Global CSS and CSS modules in Sass, via custom Webpack configuration
React hooks to conceal complexity, especially when combining technologies via a
Mixed build pipeline: Poe + yarn, driven by Poetry
Designing a Python build pipeline
Poetry project - in lieu of traditional scripts - leveraging:
Type enforcement via Mypy
Linting with flake8 and formatting with black
Import sorting provided by isort
Poe plugin to add yarn-like scripts to Poetry
Actual yarn subproject - with its own scripts - to create frontend artifacts
Twine, for compliance checks
Pipeline integrated with GitHub Actions - triggered when pushing a tag
Playing with modern Python
intermediate variables in collection comprehensions
logging with conditional __debug__ compilation for extra performance
advanced usage of pytest - with very high coverage rates
@dataclass - including frozen and slots
different flavors of decorators
It was marvelous to explore, at the very same time:
Spanish and its morphology
React and its real-time interactions with a non-JS backend, while adopting several
Node patterns, especially in CPU-bound/IO-bound scenario evaluations
sophisticated data extraction from a non-trivial data source, via my beloved
...but what about the initial question?
Last but not least, if you start Jardinero, you'll immediately see the query answering the
dilemma discussed at the beginning of the presentation, whose result is clear:
Jardinero could only find 13 words ending with -tad, versus
662 words ending with -dad!
More detailed queries reveal that there seems to be no distinguishing trait among
those 13 words - so they are probably just to be memorized.
Despite the simplicity of this example, I think it properly reveals Jardinero's wide range
of applicability in the vast domain of morphology!
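The kind of query behind the -tad / -dad comparison can be sketched with an in-memory SQLite database. The nouns table and its six sample words are a toy assumption - not the schema documented in Cervantes' README - so the counts below have nothing to do with the real figures reported above.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical single-table schema, NOT the one created by Cervantes.
connection.execute("CREATE TABLE nouns (entry TEXT PRIMARY KEY)")
connection.executemany(
    "INSERT INTO nouns (entry) VALUES (?)",
    [("libertad",), ("amistad",), ("ciudad",), ("verdad",), ("bondad",), ("mesa",)],
)

# In SQLite, LIKE evaluates to 1 or 0, so SUM counts the matching rows.
QUERY = """
SELECT
    SUM(entry LIKE '%tad') AS tad_count,
    SUM(entry LIKE '%dad') AS dad_count
FROM nouns
"""

tad_count, dad_count = connection.execute(QUERY).fetchone()
print(tad_count, dad_count)
```

On this toy data the query prints 2 3; against a full dictionary, the very same pattern can be combined with joins and further conditions for more detailed analysis.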
Jardinero - Extensible web application for exploring natural languages
Cervantes - Extract a compact Spanish dictionary from Wikcionario, with elegance
WikiPrism - Parse wiki pages and create dictionaries, fast, with Python
Eos-core - Type-checked, dependency-free utility library for modern Python
Poetry and Poe