(webmaster@) —SRE/Platform for Geomagical Labs, part of IKEA —We do CV/AR for the home DjangoCon AU 2023 2 Hi there, I'm Noah Kantrowitz. I'm an SRE at Geomagical Labs. We do computer vision and augmented reality stuff for IKEA.
not here to talk about work. For this talk I'm going to use one of my side projects, Farm RPG. It's a free online web and mobile game and, importantly for our needs here, like most RPGs it has lots of fun data and I don't have to run everything past a million lawyers!
big gamer, don't worry. The big thing to understand is that games tend to have highly interconnected data: items drop from monsters, are used in recipes, recipes come from quests, etc etc. In SQL this looks like pretty much every table having at least one foreign key to another table, forming more of a web than the tree you would have in a simple REST application.
6 This isn't really part of the main topic here but just in case anyone is wondering "did he really spend a year building an ETL system for a free internet game?" then yes, I totally did, because it was fun and a great way to learn these tools, many of which are now deployed at my day job. Big shoutout to fun side projects where you can move at your own speed and no one bugs you if it's down for a week. Anyways, moving on!
I've said this talk is about ETL, but what does that even mean? The core is quite literal: extract data from somewhere, run it through some kind of transformation, and load it into a database.
a web scraper but the two are very similar. Maybe the site you are scraping is an internal API or a database rather than an HTTP endpoint but it's all built around the same structure.
and again —Cool but not very small DjangoCon AU 2023 9 To touch on it briefly, a lot of fancy folks have been thoughtleadering about ELT instead of ETL. It's the same core idea but rather than storing just the transformed data, you first store the raw stuff so that you could re-transform it later if needed. If that's a feature you need then by all means pursue it, but if your raw data is big it can dramatically increase both system complexity and storage requirements. And since we're here to talk about small systems, I don't think it's for us.
scraping, I would be remiss if I didn't remind everyone that hostile scraping is generally against the terms and conditions of websites, so make sure you have permission before scraping any website, API, or data source you don't own. If you have questions about what is and isn't allowed, speak to either the owner or your trusted legal professional, or both.
HTML/XML/JSON —Reshape the data to make it easy to query —Possibly import-time aggregations like averages —Loaders: Django ORM, maybe DRF or Pydantic DjangoCon AU 2023 11 The scrapers aren't all of ETL, we need the T and L too. Transforms in many of these systems are going to have two steps: first parsing into structured data, either with something simple like json dot loads or a more complex parser like BeautifulSoup. Then we usually need to mold the structured data into a better shape for the queries we want to run later. This can mean SQL normalization or denormalization, breaking things apart into discrete smaller models or combining them into bigger models, depending on your use cases. It can also be more mundane stuff like renaming fields because your schema is a little different than the upstream or you need to combine two slightly different sources, things like that. Data aggregation can be done either at transform time or later on at query time. The more rigorous definitions would say it should be done here, especially if the data is so big that aggregation at query time is impossible without really bad performance, but Postgres is really fast and being a small system our data is probably fine to aggregate later on if you want to take the easier route. And then finally we have loaders, which for us is probably just Django ORM calls because that's what's easy. More complex schema layers like Django Rest Framework's serializers or Pydantic could also be used here if you need them.
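To make the transform and load steps concrete, here is a minimal sketch, not the project's actual code: Thing is a stand-in Django model, and the upstream field names and reshaping are illustrative assumptions.

    import json

    def transform(raw_text):
        # Parse the raw payload, then reshape it into the field names our schema uses.
        rows = json.loads(raw_text)
        return [
            {"id": row["key"], "name": row["fullName"].strip()}
            for row in rows
        ]

    async def load(rows):
        # Loader: plain Django ORM calls, using the async "a"-prefixed methods.
        # Thing is a hypothetical model with id and name fields.
        for row in rows:
            await Thing.objects.aupdate_or_create(
                id=row["id"], defaults={"name": row["name"]}
            )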
on the ORM later DjangoCon AU 2023 12 Async and Django have been a long journey. And that journey isn't over yet, but async Django is still great and you can use it today for real production applications. We'll touch more on the limitations of the async database layer in a bit.
—Portability and development env DjangoCon AU 2023 13 Why use async Django as the basis for this system? It lets us keep everything together in one codebase and one running server. ETL pipelines have a well-deserved reputation for being finicky creatures: one tool falls out of sync with another and all of a sudden the whole process jams until a sad on-call engineer comes by to fix it. Having fewer services, in this case just a Django server and Postgres, means fewer places for things to go wrong and, when they do go wrong, clearer solutions. Restart the one thing and you're back in business. It's also convenient because as a single project, we can easily move the deployment between tools or frameworks, and run a local version for development.
AU 2023 14 Microservices have been the cool way to do things for a while now and they can certainly be useful when projects get huge, when they cross team boundaries. But we're doing small and small doesn't need microservices. All the benefits we just mentioned about service simplicity apply here too, and we avoid the infrastructure and organizational complexity of a dozen services. Embrace the monolith.
if needed DjangoCon AU 2023 15 Lots of talks, as well as the Django documentation, have covered the basics of starting an async app. Starting with Django 4.2 it's all very similar to the non-async versions. Make your new project and app the same way as always; when you start adding views you'll mark them as async. All the default middleware is async-compatible so you only need to think about those if you start adding extra stuff, but fortunately most major projects are async compatible and Django will adapt things for you if they aren't, though at a performance cost.
uvicorn etl.asgi:application —--reload for development —--access-log --no-server-header —--ssl-keyfile=... --ssl-certfile=... DjangoCon AU 2023 16 One thing Django doesn't currently include is an async-compatible web server, but the community has several for us to pick from. I personally like Uvicorn as it supports a reload on change mode like the built-in develop server which makes local dev easier. For production you can turn on access logging, disable the server header to reduce attack fingerprinting, and hopefully turn on TLS.
we're going to need background tasks to run the data pipeline bits. The standard Django answer for that is Celery and Celery Beat, but that's a whole big set of things to run. Cron jobs with custom management commands are smaller, but we can still do better. One big advantage of an async system is we can spawn additional background tasks to run concurrently with the web application, so let's do that!
await asyncio.sleep(n) DjangoCon AU 2023 18 The core of any looping task in async Python is going to look something like this. Run the extractor function, sleep, repeat.
asyncio.create_task(coro) DjangoCon AU 2023 19 To integrate this with Django, we need to hook into the server startup process and the easiest way to do that is the ready callback in app configs. We want to launch our async task in the background so Django will carry on initializing and our extractor will loop and fire off every 30 seconds, and create_task does exactly that.
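Put together, a minimal sketch of the loop plus the ready hook might look like this. EtlConfig, run_every, and extract_things are illustrative names, and it assumes ready() is called while the ASGI server's event loop is running, as the talk describes for Uvicorn, so create_task has a loop to attach to.

    # apps.py: launch a looping extractor at startup.
    import asyncio
    from django.apps import AppConfig

    async def run_every(seconds, fn):
        # Run the coroutine function forever, sleeping between runs.
        while True:
            try:
                await fn()
            except Exception:
                # Log this in real code; a failed run just waits for the next cycle.
                pass
            await asyncio.sleep(seconds)

    class EtlConfig(AppConfig):
        name = "etl"

        def ready(self):
            from .extractors import extract_things  # hypothetical extractor coroutine
            # Fire-and-forget: Django keeps starting up while the loop runs in the background.
            asyncio.create_task(run_every(30, extract_things))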
about failure modes —Make task status models if needed DjangoCon AU 2023 20 Unfortunately there is no free lunch, so we do need to think about downsides too. With Celery and RabbitMQ, we have a much more durable system: if a task is accepted by the queue then no matter what it's going to run to completion at least once. With tasks existing only as part of our running server, if something crashes then it's just gone. We can add exception handlers and retries but if the process dies then again, no more tasks. For ETL systems this is usually okay though: if we miss it this hour, we'll get it next hour. As long as each scrape pulls in more than one hour of data at a time, we have a built-in buffer for failures, and the final results converge to the same point no matter what. For cases where we can't externally handle convergence, we can also make database models to track the status of specific tasks just like we would have with RabbitMQ.
AU 2023 21 As an example, rather than having a background task for sending a single email, we could instead store a database row for each pending email. Every few minutes we try sending anything that is pending to go out, and if it succeeds then clear it from the database. This gives us safety at a similar level to RabbitMQ, even if things fail at the wrong time, it will send at least once. This is more work though, so decide for each use case how much failure matters.
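As a sketch of that pattern, with made-up model and field names and Django's synchronous send_mail wrapped for async use; the send_pending coroutine would be driven by the same looping helper from earlier.

    from asgiref.sync import sync_to_async
    from django.core.mail import send_mail
    from django.db import models

    class PendingEmail(models.Model):
        to = models.EmailField()
        subject = models.CharField(max_length=200)
        body = models.TextField()

    async def send_pending():
        # Anything still in the table has not been sent yet; retry it every cycle,
        # and only delete the row once the send succeeded.
        async for email in PendingEmail.objects.all():
            await sync_to_async(send_mail)(email.subject, email.body, None, [email.to])
            await email.adelete()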
about concurrency —Still improving DjangoCon AU 2023 22 The general rule of the async ORM is any method which would talk to the database has a version prefixed with "a" for async. aget, afirst, asave, aupdate. If you ever miss an "a" and call a synchronous method, it will raise an exception to remind you. The two big limitations still present are that transactions don't work in async code and that queries can't overlap. The lack of overlap is mostly an internal detail: you can still run multiple concurrent queries from asyncio's point of view, but internally they get handled by a single worker thread so the database sees them sequentially. The end result is you don't get a performance benefit from the concurrency yet.
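For example, assuming the same hypothetical Thing model:

    async def rename_thing(pk, new_name):
        thing = await Thing.objects.aget(pk=pk)
        thing.name = new_name
        # asave() is the async twin of save(); calling plain save() from async
        # code raises SynchronousOnlyOperation to remind you.
        await thing.asave()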
await sync_to_async(_transaction)() DjangoCon AU 2023 23 Async transactions require a bit more complexity than usual. If we did things normally, the transactions might get interleaved with other database queries at the asyncio level, so we have to put it all in one synchronous block and use asgiref's sync_to_async helper. This is definitely something the Django team is looking to improve in the future. More broadly, we've only recently started seeing true async support in database driver libraries like psycopg3, which should open a lot of doors for better things in Django's async ORM.
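A minimal sketch of that wrapping, again with hypothetical model and field names:

    from asgiref.sync import sync_to_async
    from django.db import transaction

    def _transaction(pk, delta):
        # Everything inside the atomic block runs synchronously in one worker
        # thread, so no other queries can interleave with it.
        with transaction.atomic():
            thing = Thing.objects.select_for_update().get(pk=pk)
            thing.count += delta
            thing.save()

    async def adjust_count(pk, delta):
        await sync_to_async(_transaction)(pk, delta)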
fetching from web servers is such a common extractor, it's very likely you'll need an async-compatible HTTP client. My recommendation is httpx: it's got a very simple API, solid testing support via respx, and will generally be very familiar to anyone used to requests. Aiohttp is also available; it's generally faster in terms of raw HTTP performance but it lacks HTTP/2 support, which usually negates that benefit. Also unlike the ORM, both of these can fully overlap requests so you can speed up fetching that way, though of course make sure you don't overload the server on the other side.
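A small sketch of that overlapping behaviour with httpx; the URL list is a placeholder.

    import asyncio
    import httpx

    async def fetch_pages(urls):
        # Unlike ORM queries, HTTP requests really can overlap, so gather them.
        async with httpx.AsyncClient() as client:
            responses = await asyncio.gather(*(client.get(u) for u in urls))
        return [r.json() for r in responses]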
slides is always a bit questionable, but I want to go through some real examples from my case study project so you can see how simple this all is in practice.
resp.raise_for_status() for row in resp.json(): await Thing.objects.aupdate_or_create( id=row["key"], defaults={"name": row["fullName"]}, ) DjangoCon AU 2023 26 The most basic version of this. Make an HTTP request, throw things into the database. Use that looping async call helper we saw earlier and congrats, you've got yourself a mini ETL system. Maybe this is enough for you, but if not the nice thing about having this in code rather than built out of the config files of a dozen microservices is we can easily extend and customize it.
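Filling in the parts that didn't fit on the slide, a complete version might look roughly like the sketch below; the URL is a placeholder and the upstream field names are taken from the slide.

    import httpx

    async def extract_things():
        # Extract: one HTTP request for the whole data set.
        async with httpx.AsyncClient() as client:
            resp = await client.get("https://example.com/api/things")
        resp.raise_for_status()
        # Transform + load: reshape each row and upsert it through the ORM.
        for row in resp.json():
            await Thing.objects.aupdate_or_create(
                id=row["key"],
                defaults={"name": row["fullName"]},
            )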
data=data) await sync_to_async(ser.is_valid)(raise_exception=True) await sync_to_async(ser.save)() DjangoCon AU 2023 27 Another level up of complexity would be using Django REST Framework to unpack the incoming data. This is great when the API is returning you complex nested stuff. DRF doesn't support async stuff itself but fortunately sync_to_async has us covered.
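Expanded into a full function, with ThingSerializer standing in for whatever DRF serializer you have defined:

    from asgiref.sync import sync_to_async

    async def load_with_drf(data):
        # DRF serializers are synchronous, so wrap the calls that may hit the
        # database (unique validators, the actual save) with sync_to_async.
        ser = ThingSerializer(data=data)
        await sync_to_async(ser.is_valid)(raise_exception=True)
        await sync_to_async(ser.save)()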
= ...aupdate_or_create(...) seen_ids.append(thing.id) await ...exclude(id__in=seen_ids).adelete() DjangoCon AU 2023 28 A common pattern in this kind of code is wanting to clean up values in the database which no longer exist upstream. This approach doesn't scale well but it's very easy and simple, and works great up to a few thousand rows. After that scaling point, you'll likely want to move the diffing code into Python rather than the database.
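Spelled out, the delete-what-we-didn't-see pattern is roughly this sketch, with the same assumed field names as before:

    async def sync_things(rows):
        seen_ids = []
        for row in rows:
            thing, _ = await Thing.objects.aupdate_or_create(
                id=row["key"], defaults={"name": row["fullName"]}
            )
            seen_ids.append(thing.id)
        # Anything we didn't see this run no longer exists upstream, so drop it.
        await Thing.objects.exclude(id__in=seen_ids).adelete()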
_registry[fn] = n return fn return decorator DjangoCon AU 2023 29 Setting things up in each ready() callback is okay, but something more like Celery's decorators can be nice as you get more of them. Django has a helper method for this called autodiscover modules. You need some data structure called underscore registry, and then pass in the name of the submodule to look for in each app and the module or class the registry is in. Combine that with a decorator to add things to the registry and a single ready hook to install all the tasks from it.
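Building on the earlier loop sketch, a Celery-ish registry might look like this; module layout, decorator name, and the interval values are all illustrative.

    # etl/tasks_registry.py: a tiny registry plus a decorator to fill it.
    _registry = {}

    def repeat_every(seconds):
        def decorator(fn):
            _registry[fn] = seconds
            return fn
        return decorator

    # etl/apps.py: discover every app's "tasks" submodule, then start the loops.
    import asyncio
    from django.apps import AppConfig
    from django.utils.module_loading import autodiscover_modules

    class EtlConfig(AppConfig):
        name = "etl"

        def ready(self):
            # Importing <app>.tasks runs the decorators and fills the registry.
            autodiscover_modules("tasks")
            for fn, seconds in _registry.items():
                asyncio.create_task(run_every(seconds, fn))  # run_every from the earlier sketch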
check next >= now DjangoCon AU 2023 30 Having a loop which runs every 30 seconds is good for some cases, but often we want more complex scheduling like we would get from cron. Fortunately there is a fantastic library for this named croniter. It handles all the complex math around timing, if you track the last time each task has run either in a database model or another in-memory dict then it will tell you when the next time to run the task is. You can make one background loop the same way we had before which checks once a second or once a minute and kicks off anything ready to run. Yes this is repeating some logic that cron or celery beat gives you for free, but it's about 10 lines of code and cuts down operational complexity a ton.
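A sketch of that scheduler loop using croniter; the schedule table, the in-memory last-run dict, and the one-minute polling interval are just examples.

    import asyncio
    from datetime import datetime, timezone
    from croniter import croniter

    SCHEDULES = {}   # filled elsewhere, e.g. {extract_things: "0 * * * *"} for hourly
    _last_run = {}

    async def scheduler():
        while True:
            now = datetime.now(timezone.utc)
            for fn, spec in SCHEDULES.items():
                last = _last_run.setdefault(fn, now)
                # croniter gives us the next fire time after the last run.
                if croniter(spec, last).get_next(datetime) <= now:
                    _last_run[fn] = now
                    asyncio.create_task(fn())
            await asyncio.sleep(60)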
as an acronym technically only covers the data ingestion, but really we probably want to do something with this data or we wouldn't be collecting it. And a relatively common thing to want to do is to run some queries on it. Those could be SQL queries, but you probably already know how to do that; we've been using the ORM all over the place already, so let's look at something more fun.
cachable DjangoCon AU 2023 32 I want to talk about GraphQL but first a disclaimer. GraphQL does not scale. It's phenomenal for small systems and I'll talk about why but getting good performance out of it as things get bigger is incredibly difficult, verging on impossible. If your queries are extremely cache friendly that can be a solution, but otherwise beware the dragons.
{ nested1 } } } DjangoCon AU 2023 33 I can't possibly go over everything that GraphQL includes, that's another talk, but here's the really quick version. GraphQL queries specify a tree of nested fields you would like to retrieve. Usually there's going to be one top-level field which is the type of thing you are querying, that might have some filters on it if you don't want all objects, and then fields inside that you would like in the response, including nested stuff. In Django terms, fields mean database fields, so columns in the simple case and foreign key fields or many-to-manys or their reverse managers for nested data.
{ quantity, quest { title, image, text DjangoCon AU 2023 34 If GraphQL has these sharp downsides, why use it at all? Because it's super cool for exploring deeply interlinked data. This query is looking at 3 tables in SQL: Items, Quests, and a many-to-many through table between them. If we were making a dedicated API endpoint for this, it would be super easy to write a SQL query for. But what if we don't want to build a dedicated view for every query? We could make a view that takes in a raw SQL query and runs it, but even ignoring the security issues with that, the data would come out as the fully expanded join matrix and require some reshaping by clients, or the clients would have to effectively implement the SQL join themselves. GraphQL gives us a middle ground. Performance isn't as good as the specific view, but it's good enough in small cases and in return we get a generic system that can safely answer many kinds of client queries.
slow —Numeric aggregation - "mean where type = foo" —But you can pre-compute —Poorly linked data - no ForeignKeys DjangoCon AU 2023 35 Other than large data sets and places where API performance is at a premium, the two other major places to avoid GraphQL are when you want to answer numeric queries and poorly linked data. GraphQL gives you a great way to control which fields are sent back, but it doesn't have anything like SQL's numeric aggregation support. You could, for example, request all item values where type is foo, but you would have to compute the average client side. In some small cases that's okay but it's not delightful. Sometimes a middle ground is pre-computing the aggregations you need in the ETL side of things and then exposing it as a model field in GraphQL. It's also much less helpful when you have a lot of disconnected tables. It can work but making a select star as JSON view is usually a lot easier in those cases.
for GraphQL in Django is Strawberry. The core library implements the schema management and data flow layers, and then strawberry-django library adds in adapters to load data using the Django ORM and easier schema definitions for Django model types. You'll see some guides referencing strawberry django plus which was an enhancement library on top of strawberry django but that has since been merged back into strawberry django so you don't need it anymore.
= field() schema = gql.Schema(query=Query) DjangoCon AU 2023 37 A Strawberry schema defines the top level of the query namespace, kind of like a root urls.py does for HTTP. It takes a root query type, which then references other GraphQL types, and those reference more, and so on defining the whole tree of available data. Strawberry makes heavy use of type annotations to define things, along with class decorators.
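A sketch of that root Query type, written with the current strawberry / strawberry-django names rather than the older gql alias shown on the slide; the app module paths and the ItemType / QuestType names are assumptions, with the types themselves sketched a bit further on.

    import strawberry
    import strawberry_django

    from items.types import ItemType    # hypothetical app modules holding the
    from quests.types import QuestType  # Django-backed types defined below

    @strawberry.type
    class Query:
        # Each field maps to a model-backed type and is loaded through the ORM.
        items: list[ItemType] = strawberry_django.field()
        quests: list[QuestType] = strawberry_django.field()

    schema = strawberry.Schema(query=Query)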
auto required_for: list[QuestRequired] reward_for: list[QuestReward] DjangoCon AU 2023 38 A slight annoyance of Strawberry is having to restate all our model definitions as Strawberry types, but at least the majority of fields get their types inferred from the model. We only need to get explicit around inter-type links, so foreign keys, many to manys, and their related managers. I prefer to store the types for each app in that app and just import them into the central query type.
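For example, a couple of hypothetical types for models like the ones in this project might look like the following; the model and relation names are illustrative.

    import strawberry_django
    from strawberry import auto
    from . import models

    @strawberry_django.type(models.Quest)
    class QuestType:
        id: auto
        title: auto
        image: auto
        text: auto

    @strawberry_django.type(models.Item)
    class ItemType:
        id: auto
        name: auto
        # Links to other models need explicit annotations; these through-model
        # types (QuestRequiredType, QuestRewardType) follow the same pattern.
        required_for: list["QuestRequiredType"]
        reward_for: list["QuestRewardType"]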
@gql.django.order(models.Item) class ItemOrder: id: auto name: auto DjangoCon AU 2023 39 GraphQL doesn't offer much in terms of data slice and dice other than picking what fields to include, but there is a bit. Filters allow relatively simple WHERE checks on the returned data, and orders let you sort the response data, though as a word of warning orders are currently a bit buggy and absolutely wreck query performance.
quest { id, title DjangoCon AU 2023 40 To show some examples of how query performance works let's compare some GraphQL queries to the resulting SQL. This is the same query we saw before.
like that, we've already got a not lovely SQL expansion. If you can't read that from the back, first it ran a select from the items table and then a select on the quests and many to many table with a huge "item id in" condition.
nested fields in, things get very complicated very fast. Also, while the queries only took 200 milliseconds, the overall response time was 13 seconds. And this is my development environment so it's running against tables that only have a few hundred rows. If that was in the millions, which is nothing for a normal SQL query, it would get out of hand pretty fast. That said, for some use cases a 5 minute query can be okay. You wouldn't want that in front of every user but for automation, computers have plenty of patience.
Django path("graphql", GraphQLView.as_view(schema=schema)) DjangoCon AU 2023 43 Before we stop talking about Strawberry, one more word of warning, don't use their async Django view. It doesn't integrate well with the async ORM right now so you get errors during data fetching. And because Django can't overlap queries yet, it wouldn't actually help anyway. Hopefully this will be improved in the future.
particular place where ETL and GraphQL combine really well is static site generators. Gatsby has a lot of native support for it and things like Pelican allow easily slotting in an HTTP-based data source. Set up a build every hour in your CI system of choice and you've got a very easy way to build dashboards against complex data.
45 Another useful feature of GraphQL is making a query and then listening to live data updates. Going back to the dashboards example, this can allow for live-updating graphs without a very big code footprint. For Strawberry, you will need to install and configure Channels for this to work. Because we're explicitly targeting a single process, the in-memory channel layer is an easy way to get started, though a Postgres-based backend exists too. Check the Strawberry docs for details on the setup, you'll need to adjust some ASGI configuration to make everything work together.
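As a rough sketch of what a Strawberry subscription can look like: a simple polling feed on the hypothetical Thing model, where a real feed might react to change events instead of sleeping on a timer.

    import asyncio
    from typing import AsyncGenerator
    import strawberry

    @strawberry.type
    class Subscription:
        @strawberry.subscription
        async def thing_count(self) -> AsyncGenerator[int, None]:
            # Push a fresh count every few seconds for live-updating dashboards.
            while True:
                yield await Thing.objects.acount()
                await asyncio.sleep(5)

    schema = strawberry.Schema(query=Query, subscription=Subscription)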
BotConfig(AppConfig): def ready(self): create_task(client.start(token)) DjangoCon AU 2023 48 The core of the integration is the same as we saw with the ETL tasks, make a Django app, spawn something from the ready callback.
—client.start() - cooperates with others DjangoCon AU 2023 49 When reading the docs for any async library you're trying to integrate, make sure to note the difference between a blocking "run the thing" function and an async task. Using discord py as an example, client dot run is a synchronous function that launches an async event loop for you. This is good for tutorials but not what we want in an existing async server. For discord py, the underlying async task is client dot start so we can use that directly. Other libraries may call them something different but make sure you are using the correct one.
return if message.content == "!count": n = await Thing.objects.acount() await message.channel.send( f"There are {n} things") DjangoCon AU 2023 50 With those basics in place, making a chat bot that can reach into Django data is pretty easy. We've got the full power of the ORM here, or any other libraries we want to pull in.
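Filled out, the bot from the slide is roughly this sketch using discord.py, with the token handling and the Django-side startup (the create_task call from the previous slide) left out.

    import discord

    # Reading message text requires the message_content intent, which also has
    # to be enabled for the bot in the Discord developer portal.
    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents)

    @client.event
    async def on_message(message):
        if message.author == client.user:
            return
        if message.content == "!count":
            n = await Thing.objects.acount()
            await message.channel.send(f"There are {n} things")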
= client.get_channel(id) await channel.send("Batch complete") DjangoCon AU 2023 51 And we can use it for high-level logging or notifications too. Combined with the cron pattern we built before, this can also send daily reports to a chat channel or email using aiosmtplib.
= await proc.stdin.readline() if cmd == 'count': n = await Thing.objects.acount() proc.stdout.write( f"There are {n} things\n") def ready(): create_task( asyncssh.listen(process_factory=handle_ssh)) DjangoCon AU 2023 52 Chat bots are old news, what if we SSH into our ETL server? Not into the server it is running on, into the server itself. Is this actually better? Probably not, but it's cool and I'm sure there is some edge case you could use to justify this.
53 For some of my projects, I integrate down to the hardware level, either with a Streamdeck or just poking at IOT devices in my home. This level of silly is just for my own fun but that's okay too.
Okay, back to reality. I've spent this whole talk singing the praises of small systems, but what if your system starts out small and then grows? Django has you covered.
2 == server_id —Adjust your aggregations DjangoCon AU 2023 55 One common problem is there just being too much data to transform and load from a single process. Usually this will only apply when the pre-transform data is very large so fitting things in memory all at once during the transform is hard. A common option for this is to shard the ingest and run it across multiple servers. Each server gets an ID number and it only handles corresponding data. This can get much more complex with hash rings and vector clocks but start simple and build as you need.
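A minimal sketch of that sharding check; it uses a stable CRC hash rather than Python's built-in hash(), since hash() is salted per process and the servers would disagree with each other.

    import os
    import zlib

    SERVER_ID = int(os.environ["SERVER_ID"])        # e.g. 0 or 1
    SERVER_COUNT = int(os.environ["SERVER_COUNT"])  # e.g. 2

    def mine(key: str) -> bool:
        # Only handle rows whose key hashes to this server's slot.
        return zlib.crc32(key.encode()) % SERVER_COUNT == SERVER_ID

    async def extract_shard(rows):
        for row in rows:
            if not mine(row["key"]):
                continue
            await Thing.objects.aupdate_or_create(
                id=row["key"], defaults={"name": row["fullName"]}
            )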
—ProcessPoolExecutor or aiomultiprocess —(PEP 703 in the future?) DjangoCon AU 2023 56 It's fairly common for transforms to want to do some number crunching. Working with CPU-bound Python is about 4 talks all on its own, but the really short version is that if the call is something which drops the GIL, like a NumPy or PyTorch function, then you can use sync_to_async with thread_sensitive set to false. This will run the function in a background thread where it can eat CPU cycles to its heart's content and then pass control back to your async function afterwards. If the code doesn't drop the GIL then it's harder: either a ProcessPoolExecutor from the futures library or aiomultiprocess, which is an async wrapper around multiprocessing.
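Both options in sketch form; numpy_heavy and pure_python_heavy are placeholder functions (the latter must live at module level so the process pool can pickle it).

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    from asgiref.sync import sync_to_async

    pool = ProcessPoolExecutor()

    async def transform(data):
        # GIL-dropping code (NumPy, PyTorch): a plain worker thread is enough.
        part_one = await sync_to_async(numpy_heavy, thread_sensitive=False)(data)
        # Pure-Python crunching holds the GIL, so push it to another process.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(pool, pure_python_heavy, part_one)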
if your needs grow beyond all this, the big tools are still right there. Maybe you swap your home grown sharded loader for Hadoop and Hive. You might have to rewrite some Django ORM calls for py hive but the rest of your code keeps on trucking.
Django is great for building ETLs GraphQL is an excellent way to query There's many cool async libraries Our tools can grow as our needs grow DjangoCon AU 2023 58 So to recap what we've talked about. ETL systems let us move data around and pull out the most important bits we need. Async Django isn't perfect but it's very usable and is a solid basis for small-scale ETL systems. GraphQL pairs well with both of them as a generic query interface, though with performance concerns at scale. Async Python has many fun libraries to add extra functionality to your service. And that there is value in starting small and simple and letting a tool grow with the use cases rather than investing in massive complexity up front.