
Swiss Army Django: Small Footprint ETL (with notes) - DjangoCon US

Noah Kantrowitz

October 17, 2023

Transcript

  1. Noah Kantrowitz —He/him —coderanger.net | cloudisland.nz/@coderanger —Kubernetes (ContribEx) and Python

    (webmaster@) —SRE/Platform for Geomagical Labs, part of IKEA —We do CV/AR for the home DjangoCon US 2023 2 Hi there, I'm Noah Kantrowitz. I'm an SRE at Geomagical Labs. We do computer vision and augmented reality stuff for IKEA.
  2. ETL DjangoCon US 2023 3 And I'm here to talk

    about building small-footprint ETL systems using Django.
  3. Case study: Farm RPG DjangoCon US 2023 4 But I'm

    not here to talk about work. For this talk I'm going to use one of my side projects, Farm RPG. It's a free online web and mobile game, and, importantly for our purposes here, like most RPGs it has lots of fun data and I don't have to run everything past a million lawyers!
  4. Game data DjangoCon US 2023 5 If you're not a

    big gamer, don't worry. The big thing to understand is that games tend to have highly interconnected data: items drop from monsters, are used in recipes, recipes come from quests, and so on. In SQL this means pretty much every table has at least one foreign key to another table, forming more of a web than the tree you would see in a simple REST application.
  5. Why build this? It's fun! And educational! DjangoCon US 2023

    6 This isn't really part of the main topic here but just in case anyone is wondering "did he really spend a year building an ETL system for a free internet game?" then yes, I totally did, because it was fun and a great way to learn these tools, many of which are now deployed at my dayjob. Big shoutout to fun side projects where you can move at your own speed and no one bugs you if it's down for a week. Anyways, moving on!
  6. What is ETL? Extract Transform Load DjangoCon US 2023 7

    I've said this talk is about ETL, but what does that even mean? The core is quite literal, extract data from somewhere, run it through some kind of transformation, and load it into a database.
  7. Web scrapers DjangoCon US 2023 8 Not every ETL is

    a web scraper but the two are very similar. Maybe the site you are scraping is an internal API or a database rather than an HTTP endpoint but it's all built around the same structure.
  8. Scrape Responsibly DjangoCon US 2023 9 Because we're talking about

    scraping, I would be remiss if I didn't remind everyone that hostile scraping is generally against the terms and conditions of websites. Make sure you have permission before scraping any website, API, or data source you don't own. If you have questions about what is and isn't allowed, speak to the owner, your trusted legal professional, or both.
  9. The shape of things —Extractors: scraper/importer cron jobs —Transforms: parse

    HTML/JSON/etc then munge —Loaders: Django ORM, maybe DRF or Pydantic DjangoCon US 2023 10 With scrapers as the E phase, we still need the T and the L.
  10. Robots Data in disguise —Parsed to structured data —json.loads() —BeautifulSoup

    —struct.unpack() —Reshape the data to make it easy to query —Possibly import-time aggregations like averages DjangoCon US 2023 11 Transforms usually have three big parts. First we have to take the output of the extractors, usually a big pile of bytes, and turn it into structured data. This can be as simple as json.loads() or as complex as HTML parsing or binary unpacking. Next we need to reshape it into something better for the queries we want to run later. This can mean SQL normalization or denormalization, or more mundane stuff like renaming fields or adjusting the types of values. And then there's an optional step of updating numeric aggregations.
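    A minimal sketch of the first two parts, parse then reshape (the field names here are invented for illustration):

    import json

    def transform(raw: bytes) -> list[dict]:
        # Parse: bytes -> structured data.
        rows = json.loads(raw)
        # Reshape: rename fields and fix types for the queries we'll run later.
        return [
            {"id": int(row["key"]), "name": row["fullName"].strip()}
            for row in rows
        ]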
  11. Aggregations

    # Query time
    async def myview():
        return await Item.objects.aaggregate(Avg("value"))

    # Ingest time
    avg = await Item.objects.aaggregate(Avg("value"))
    await ItemStats.objects.aupdate_or_create(defaults={"avg": avg})

    DjangoCon US 2023 12 In big ETL systems, aggregation handling is a complex topic with whole map-reduce tools dedicated to it. Because we aren't aiming for maximum scale, we can take a more laid-back approach. The simplest option is just don't: run your aggregates at query time via SQL. Postgres is really fast, so in a lot of cases this works just fine. Alternatively, we can store them back into the database during ingest if we want to compute them only once.
  12. Loaders Update_or_create Or maybe DRF DjangoCon US 2023 13 And

    then finally we need to store the structured data somewhere. This is usually so simple in Django that we don't even think of it as a phase of its own. Just throw things into the ORM and let it write out to the database.
  13. Async & Django Yes, you can really use it! More

    on the ORM later DjangoCon US 2023 14 Async and Django has been a long journey. And that journey isn't over yet, but async Django is still great and you can use it today for real production applications. We'll touch more on the limitation of the async database layer in a bit.
  14. Why async? —Everything in one box —Fewer services, fewer problems

    —Portability and development env DjangoCon US 2023 15 Why use async Django as the basis for this system? It lets us keep everything together in one codebase and one running server. ETL pipelines have a well-deserved reputation for being finicky creatures: one tool falls out of sync with another and all of a sudden the whole process jams until a sad on-call engineer comes by to fix it. Having fewer services, in this case just a Django server and Postgres, means fewer places for things to go wrong, and when they do go wrong, clearer solutions. Restart the one thing and you're back in business. It's also convenient because, as a single project, we can easily move the deployment between tools or frameworks and run a local version for development.
  15. Monoliths! Reports of my death have been greatly exaggerated DjangoCon

    US 2023 16 Microservices have been the cool way to do things for a while now and they can certainly be useful when projects get huge, when they cross team boundaries. But we're doing small and small doesn't need microservices. All the benefits we just mentioned about service simplicity apply here too, and we avoid the infrastructure and organizational complexity of a dozen services. Embrace the monolith.
  16. Writing an async app —Normal models —Async views —Async middleware

    if needed DjangoCon US 2023 17 Lots of talks, as well as the Django documentation, have covered the basics of starting an async app. Starting with Django 4.2 it's all very similar to the non-async versions. Make your new project and app the same way as always, and when you start adding views you'll mark them as async. All the default middleware is async-compatible, so you only need to think about that if you start adding extra stuff, but fortunately most major projects are async-compatible, and Django will adapt things for you if they aren't, though at a performance cost.
  17. Running an async app —Async server not included —python -m

    uvicorn etl.asgi:application —--reload for development —--access-log --no-server-header —--ssl-keyfile=... --ssl-certfile=... DjangoCon US 2023 18 One thing Django doesn't currently include is an async-compatible web server, but the community has several for us to pick from. I personally like Uvicorn as it supports a reload on change mode like the built-in server which makes local dev easier. And you can use it for production too with a few tweaked options.
  18. Background tasks —Celery beat? —Cron? —Async! DjangoCon US 2023 19

    For any ETL system we're going to need background tasks to run the data pipeline bits. The standard Django answer for that is Celery and Celery Beat but that's a whole big set of things to run. Cron jobs with custom management commands are smaller but we can still do better. One big advantage of an async system is we can spawn additional background tasks to run concurrently with the web application, so let's do that!
  19. Task factories

    async def do_every(fn, n):
        while True:
            await fn()
            await asyncio.sleep(n)

    DjangoCon US 2023 20 The core of any looping task in async Python is going to look something like this. Run the extractor function, sleep, repeat.
  20. Django integration

    class MyAppConfig(AppConfig):
        def ready(self):
            coro = do_every(my_extractor, 30)
            asyncio.create_task(coro)

    DjangoCon US 2023 21 To integrate this with Django, we need to hook into the server startup process and the easiest way to do that is the ready callback in app configs. We want to launch our async task in the background so Django will carry on initializing and our extractor will loop and fire off every 30 seconds, and create_task does exactly that.
  21. The other shoe —Crashes can happen —Plan for convergence —Think

    about failure modes —Make task status models if needed DjangoCon US 2023 22 Unfortunately there is no free lunch, so we do need to think about downsides too. With Celery and RabbitMQ, we have a much more durable system: if a task is accepted by the queue then no matter what it's going to run to completion at least once. With tasks existing only as part of our running server, if something crashes then it's just gone. We can add exception handlers and retries, but if the process dies then again, no more tasks. For ETL systems this is usually okay though; if we miss it this hour, we'll get it next hour. As long as each scrape pulls in more than one hour of data at a time, we have a built-in buffer for failures, and the final results converge to the same point no matter what. For cases where we can't externally handle convergence, we can also make database models to track the status of specific tasks, just like we would have with RabbitMQ.
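    As a minimal sketch of the exception handling mentioned above, the earlier loop can be hardened so one failed run doesn't kill the whole task (the logging setup is my own assumption):

    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    async def do_every(fn, n):
        while True:
            try:
                await fn()
            except Exception:
                # Log and carry on; the next cycle gets another chance.
                logger.exception("Periodic task %s failed", fn.__name__)
            await asyncio.sleep(n)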
  22. Modeling for failure —create_task(send_email()) —await Emails.objects.acreate(...) —create_task(do_every(send_all_emails)) —await email.adelete() DjangoCon

    US 2023 23 As an example, rather than having a background task for sending a single email, we could instead store a database row for each pending email. Every few minutes we try sending anything that is pending to go out, and if it succeeds then clear it from the database. This gives us safety at a similar level to RabbitMQ: even if things fail at the wrong time, it will send at least once. This is more work though, so decide for each use case how much failure matters.
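    A sketch of that pattern (the model and the deliver() helper are invented names, not from the talk):

    from django.db import models

    class PendingEmail(models.Model):
        # One row per email we still owe; a crash leaves the row in place.
        to = models.EmailField()
        subject = models.CharField(max_length=200)
        body = models.TextField()

    async def send_all_emails():
        # Run every few minutes via do_every(); delete only after success,
        # so a failure means a retry rather than a lost email.
        async for email in PendingEmail.objects.all():
            await deliver(email)  # hypothetical sending helper
            await email.adelete()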
  23. Async ORM —Mind your a*s —Transactions and sync_to_async —Don't worry

    about concurrency —Still improving DjangoCon US 2023 24 The general rule of the async ORM is that any method which would talk to the database has a version prefixed with "a" for async: aget, afirst, asave, aupdate. If you ever miss an "a" and call a synchronous method, it will raise an exception to remind you. The two big limitations still present are that transactions don't work in async code and that queries can't overlap. The lack of overlap is mostly an internal detail: you can still run multiple concurrent queries from asyncio's point of view, but internally they get handled by a single worker thread so the database sees them sequentially. The end result is you don't get a performance benefit from the concurrency yet.
  24. Async transactions

    @transaction.atomic
    def _transaction():
        Foo.objects.create()
        Bar.objects.create()

    async def my_view():
        await sync_to_async(_transaction)()

    DjangoCon US 2023 25 Async transactions require a bit more complexity than usual. If we did things normally, the transactions might get interleaved with other database queries at the asyncio level, so we have to put it all in one synchronous block and use asgiref's sync_to_async helper method. This is definitely something the Django team is looking to improve in the future. More broadly, we've only recently started seeing true async support in database driver libraries like psycopg3, which should open a lot of doors for better things in Django's async ORM.
  25. Async HTTP HTTPX Or AIOHTTP DjangoCon US 2023 26 Because

    fetching from web servers is such a common extractor, it's very likely you'll need an async-compatible HTTP client. My recommendation is httpx: it's got a very simple API, solid testing support via respx, and will generally be very familiar to anyone used to requests. Aiohttp is also available; it's generally faster in terms of raw HTTP performance, but it lacks HTTP/2 support, which usually negates that benefit. Also, unlike the ORM, both of these can fully overlap requests, so you can speed up fetching that way, though of course make sure you don't overload the server on the other side.
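    A sketch of overlapping fetches with httpx (the base URL and paths are placeholders):

    import asyncio
    import httpx

    async def fetch_all(paths):
        # Unlike ORM queries, HTTP requests can truly overlap,
        # so fire them all at once and gather the results.
        async with httpx.AsyncClient(base_url="https://example.com") as client:
            resps = await asyncio.gather(*(client.get(p) for p in paths))
        return [r.json() for r in resps]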
  26. Examples! DjangoCon US 2023 27 I know showing code on

    slides is always a bit questionable, but I want to go through some real examples from my case study project so you can see how simple this all is in practice.
  27. The simple case

    async def scrape_things():
        resp = await client.get("/api/things/")
        resp.raise_for_status()
        for row in resp.json():
            await Thing.objects.aupdate_or_create(
                id=row["key"],
                defaults={"name": row["fullName"]},
            )

    DjangoCon US 2023 28 The most basic version of this. Make an HTTP request, throw things into the database. Use that looping async call helper we saw earlier and congrats, you've got yourself a mini ETL system. Maybe this is enough for you, but if not the nice thing about having this in code rather than built out of the config files of a dozen microservices is we can easily extend and customize it.
  28. Foreign keys

    # {"user": 25}
    defaults={"user_id": row["user"]}

    # {"user": "foo"}
    user = await User.objects.only("id").aget(email=row["user"])
    defaults={"user": user}

    DjangoCon US 2023 29 When you're pulling in inter-table links there are two common cases. If the data gives a primary key, and you're using those as your PKs, you can just pass it directly. Otherwise you'll want to fetch the object first. This can get a bit slow at scale, so consider an LRU cache here.
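    functools.lru_cache doesn't work on coroutines, so the simplest version of that cache is a plain dict (a sketch, assuming the email-to-user mapping is stable during a run):

    _user_ids: dict[str, int] = {}

    async def user_id_for(email):
        # Cache the email -> pk mapping so repeated rows skip the query.
        if email not in _user_ids:
            user = await User.objects.only("id").aget(email=email)
            _user_ids[email] = user.id
        return _user_ids[email]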
  29. DRF serializers

    item = await Item.objects.filter(id=data["id"]).afirst()
    ser = ItemAPISerializer(instance=item, data=data)
    await sync_to_async(ser.is_valid)(raise_exception=True)
    await sync_to_async(ser.save)()

    DjangoCon US 2023 30 Another level up of complexity would be using Django REST Framework to unpack the incoming data. This is great when the API is returning you complex nested stuff. DRF doesn't support async stuff itself but fortunately sync_to_async has us covered.
  30. Decorators —autodiscover_modules("tasks") —_registry = {} —register_to=mod

    def every(n):
        def decorator(fn):
            _registry[fn] = n
            return fn
        return decorator

    DjangoCon US 2023 31 Setting things up in each ready() callback is okay, but something more like Celery's decorators can be nice as you get more of them. Django has a helper method for this called autodiscover_modules. You need a data structure named _registry, and then you pass in the name of the submodule to look for in each app and the module or class the registry lives in. Combine that with a decorator to add things to the registry and a single ready hook to install all the tasks from it, as sketched below.
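    A sketch of the whole pattern wired together (the module layout and names are my assumption):

    # myproject/tasks.py
    _registry = {}

    def every(n):
        def decorator(fn):
            _registry[fn] = n
            return fn
        return decorator

    # myproject/apps.py
    import asyncio
    from django.apps import AppConfig
    from django.utils.module_loading import autodiscover_modules

    class TasksConfig(AppConfig):
        name = "myproject"

        def ready(self):
            # Import each app's tasks.py so the @every decorators run.
            from myproject import tasks
            autodiscover_modules("tasks", register_to=tasks)
            for fn, interval in tasks._registry.items():
                asyncio.create_task(do_every(fn, interval))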
  31. Real-er cron —Croniter is amazing —In-memory or database —Loop and

    check next <= now DjangoCon US 2023 32 Having a loop which runs every 30 seconds is good for some cases, but often we want more complex scheduling like we would get from cron. Fortunately there is a fantastic library for this named croniter. It handles all the complex math around timing: if you track the last time each task ran, either in a database model or another in-memory dict, it will tell you the next time to run the task. You can make one background loop the same way we had before which checks once a second or once a minute and kicks off anything ready to run. Yes, this is repeating some logic that cron or celery beat gives you for free, but it's about 10 lines of code and cuts down operational complexity a ton.
  32.

    while True:
        for cron in decorators._registry.values():
            if (cron.next_run_at is None or cron.next_run_at <= now) and (
                cron.previous_started_at is None
                or (
                    cron.previous_finished_at is not None
                    and cron.previous_started_at <= cron.previous_finished_at
                )
            ):
                cron.previous_started_at = now
                asyncio.create_task(_do_cron(cron))
        await asyncio.sleep(1)

    DjangoCon US 2023 33 This is what my version looks like. Using the decorator pattern and a loop, we check if it's time to run again and if the previous run finished, so we don't overlap tasks if they take too long.
  33.

    await cron.fn()
    cron.previous_finished_at = now
    cron.next_run_at = (
        croniter(cron.cronspec, now, hash_id=cron.name)
        .get_next(ret_type=datetime)
        .replace(tzinfo=dtimezone.utc)
    )

    DjangoCon US 2023 34 And then the inner section, running the actual function, updating our tracking data using croniter, and we're good to go.
  34. History —django-pghistory —@pghistory.track(pghistory.Snapshot(), exclude=["modified_at"]) DjangoCon US 2023 35 One very

    useful case for a small ETL system is versioning data that isn't natively version controlled. django-pghistory handles all the heavy lifting for you, and because it's implemented at the database level using triggers, it's very efficient. You do need to be sure to exclude columns that change frequently though, or you'll end up with huge history tables.
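    A sketch of what that looks like on a model (pghistory's decorator API has varied between versions, so treat this as illustrative):

    import pghistory
    from django.db import models

    @pghistory.track(pghistory.Snapshot(), exclude=["modified_at"])
    class Item(models.Model):
        name = models.CharField(max_length=100)
        value = models.IntegerField()
        # Changes on every save, so excluded to keep the history table small.
        modified_at = models.DateTimeField(auto_now=True)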
  35. Incremental load

    max = await Trades.objects.aaggregate(Max("id"))
    resp = await client.get(...,
        params={"since": max.get("id__max") or 0})

    DjangoCon US 2023 36 Another common thing is to want to extract only new records. This can be based on row ID or time or something else, but in any case it's super easy to write using the ORM.
  36. Multi-stage transforms

    one = await step_one()
    two = await step_two(one)
    three, four = await gather(
        step_three(two),
        step_four(two),
    )

    DjangoCon US 2023 37 Most big ETL systems use a complex DAG runner to split up transforms into reusable bits. We can do that but without nearly as much glue. If you're on very new Python, check out task groups as well.
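    On Python 3.11+, the task-group version of the same fan-out looks like this (a sketch reusing the hypothetical steps above):

    import asyncio

    async def pipeline():
        one = await step_one()
        two = await step_two(one)
        # If either task raises, TaskGroup cancels its sibling.
        async with asyncio.TaskGroup() as tg:
            three = tg.create_task(step_three(two))
            four = tg.create_task(step_four(two))
        return three.result(), four.result()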
  37. We have data Now what? DjangoCon US 2023 38 ETL

    as an acronym technically only covers the data ingestion, but really we probably wanted to do something with this data or we wouldn't be collecting it. And a relatively common thing to want to do is to run some queries on it. Those could be SQL queries, but you probably already know how to do that; we've been using the ORM all over the place already, so let's look at something more fun.
  38. GraphQL Here be dragons Use only for small Or very

    cacheable DjangoCon US 2023 39 I want to talk about GraphQL, but first a disclaimer: GraphQL does not scale. It's phenomenal for small systems, and I'll talk about why, but getting good performance out of it as things get bigger is incredibly difficult, verging on impossible. If your queries are extremely cache friendly that can be a solution, but otherwise beware the dragons.
  39. Query basics

    query {
      topLevelField(filter: {field1: "foo"}) {
        field2
        field3 {
          nested1
        }
      }
    }

    DjangoCon US 2023 40 I can't possibly go over everything that GraphQL includes, that's another talk, but here's the really quick version. GraphQL queries specify a tree of nested fields you would like to retrieve. Usually there's going to be one top-level field which is the type of thing you are querying, that might have some filters on it if you don't want all objects, and then fields inside that you would like in the response, including nested stuff. In Django terms, fields map to database fields: columns in the simple case, and foreign keys, many-to-manys, or their reverse managers for nested data.
  40. Why GraphQL?

    query {
      items {
        name, image, value
        requiredFor {
          quantity,
          quest { title, image, text }
        }
      }
    }

    DjangoCon US 2023 41 If GraphQL has this scaling problem, why use it at all? Because it's super cool for exploring deeply interlinked data. This query is looking at 3 tables in SQL: Items, Quests, and a many-to-many through table between them. If we were making a dedicated API endpoint for this, it would be super easy to write a SQL query for. But what if we don't want to build a dedicated view for every query? We could make a view that takes in a raw SQL query and runs it, but even ignoring the security issues with that, the data would come out as the fully expanded join matrix and require some reshaping by clients, or the clients would have to effectively implement the SQL join themselves. GraphQL gives us a middle ground. Performance isn't as good as the specific view, but it's good enough in small cases, and in return we get a generic system that can safely answer many kinds of client queries.
  41. Strawberry DjangoCon US 2023 42 The best tool I've found

    for GraphQL in Django is Strawberry. The core library implements the schema management and data flow layers, and then the strawberry-django library adds in adapters to load data using the Django ORM and easier schema definitions for Django model types.
  42. Model types

    @gql.django.type(models.Item)
    class Item:
        id: int
        name: auto
        image: auto
        required_for: list[QuestRequired]
        reward_for: list[QuestReward]

    DjangoCon US 2023 43 A slight annoyance of Strawberry is having to restate all our model definitions as Strawberry types, but at least the majority of fields get their types inferred from the model. We only need to get explicit around inter-type links, so foreign keys, many to manys, and their related managers. I prefer to store the types for each app in that app and just import them into the central query type.
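    For context, a sketch of how such types hang off the central query type and get served (import paths vary between strawberry-django versions, so this is illustrative):

    import strawberry
    from strawberry.django.views import AsyncGraphQLView

    @strawberry.type
    class Query:
        # Resolved through the Django ORM by strawberry-django.
        items: list[Item] = gql.django.field()

    schema = strawberry.Schema(query=Query)

    # urls.py
    # path("graphql", AsyncGraphQLView.as_view(schema=schema)),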
  43. Site generators Gatsby Next.js Pelican DjangoCon US 2023 44 One

    particular place where ETL and GraphQL combine really well is static site generators. Gatsby has a lot of native support for it and things like Pelican allow easily slotting in an HTTP-based data source. Set up a build every hour in your CI system of choice and you've got a very easy way to build dashboards against complex data.
  44. Subscriptions —Channels —In-memory? —Channels-Postgres otherwise —ASGI Router DjangoCon US 2023

    45 Another useful feature of GraphQL is making a query and then listening to live data updates. Going back to the dashboards example, this can allow for live-updating graphs without a very big code footprint. For Strawberry, you will need to install and configure Channels for this to work. Because we're explicitly targeting a single process, the in-memory channel layer is an easy way to get started, though a Postgres-based backend exists too.
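    The in-memory layer is a one-liner in settings (this is the standard Channels configuration):

    # settings.py
    CHANNEL_LAYERS = {
        "default": {
            "BACKEND": "channels.layers.InMemoryChannelLayer",
        },
    }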
  45. Enough about boring stuff Async for fun? DjangoCon US 2023

    46 GraphQL is certainly useful and interesting but it's not very fun. What kind of weird and wonderful stuff can we do with our little ETL server?
  46. Discord-py Or Bolt-python for Slack Or Bottom for IRC if

    you're old DjangoCon US 2023 47 Integrating a chat bot is fun, right?
  47.

    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents)

    class BotConfig(AppConfig):
        def ready(self):
            create_task(client.start(token))

    DjangoCon US 2023 48 The core of the integration is the same as we saw with the ETL tasks, make a Django app, spawn something from the ready callback.
  48. Chat bot

    @client.event
    async def on_message(message):
        if message.author == client.user:
            return
        if message.content == "!count":
            n = await Thing.objects.acount()
            await message.channel.send(
                f"There are {n} things")

    DjangoCon US 2023 49 With those basics in place, making a chat bot that can reach into Django data is pretty easy. We've got the full power of the ORM here, or any other libraries we want to pull in.
  49. Notifications

    async def scrape_things():
        # Do the ETL
        ...
        channel = client.get_channel(id)
        await channel.send("Batch complete")

    DjangoCon US 2023 50 And we can use it for high-level logging or notifications too. Combined with the cron pattern we built before, this can also send daily reports to a chat channel.
  50. Email

    msg = EmailMessage()
    msg["From"] = "etl@server"
    msg["To"] = "[email protected]"
    msg["Subject"] = "Batch complete"
    msg.set_content(log)
    await aiosmtplib.send(msg)

    DjangoCon US 2023 51 Using aiosmtplib we can do similar notifications via email too.
  51. Let's get weirder: SSH

    async def handle_ssh(proc):
        proc.stdout.write('Welcome!\n> ')
        cmd = await proc.stdin.readline()
        if cmd.strip() == 'count':  # readline() keeps the trailing newline
            n = await Thing.objects.acount()
            proc.stdout.write(
                f"There are {n} things\n")

    def ready():
        create_task(
            asyncssh.listen(process_factory=handle_ssh))

    DjangoCon US 2023 52 Chat bots are old news, what if we SSH into our ETL server? Not into the server it is running on, into the server itself. Is this actually better? Probably not, but it's cool and I'm sure there is some edge case you could use to justify this.
  52. The sky's the limit Streamdeck Aioserial Twisted DjangoCon US 2023

    53 For some of my projects, I integrate down to the hardware level, either with a Streamdeck or just poking at IOT devices in my home. This level of silly is just for my own fun but that's okay too. You can also use Twisted libraries via their asyncio reactors.
  53. Back to serious Scaling? Let's go DjangoCon US 2023 54

    Okay, back to reality. I've spent this whole talk singing the praises of small systems, but what if your system starts out small and then grows? Django has you covered.
  54. Ingest sharding —Too much data to load? —if hash(url) %

    2 == server_id —Adjust your aggregations DjangoCon US 2023 55 One common problem is there just being too much data to transform and load from a single process. Usually this will only apply when the pre-transform data is very large so fitting things in memory all at once during the transform is hard. A common option for this is to shard the ingest and run it across multiple servers. Each server gets an ID number and it only handles corresponding data. This can get much more complex with hash rings and vector clocks but start simple and build as you need.
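    A sketch of the simple modulo version (one caveat: Python's built-in hash() is randomized per process, so use a stable hash when the check must agree across servers):

    import hashlib

    NUM_SHARDS = 2
    SHARD_ID = 0  # set per server, e.g. from an environment variable

    def is_mine(url):
        # Stable across processes and machines, unlike the built-in hash().
        digest = hashlib.sha256(url.encode()).digest()
        return int.from_bytes(digest[:8], "big") % NUM_SHARDS == SHARD_ID

    # In the extractor loop: skip any URL where is_mine(url) is False.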
  55. CPU bound? —Does the library drop the GIL? —sync_to_async(…, thread_sensitive=False)

    —ProcessPoolExecutor or aiomultiprocess —(PEP 703 in the future?) DjangoCon US 2023 56 Transforms very often want to do some number crunching. Working with CPU-bound Python is about 4 talks all on its own, but the really short version is that if the call is something which drops the GIL, like a NumPy or PyTorch function, then you can use sync_to_async with thread_sensitive set to False. This will run the function in a background thread where it can eat CPU cycles to its heart's content and then pass control back to your async function afterwards. If the code doesn't drop the GIL then it's harder: either a ProcessPoolExecutor from concurrent.futures or aiomultiprocess, which is an async wrapper around multiprocessing.
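    A sketch of both options (crunch_numbers is a stand-in for your own CPU-heavy function):

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    from asgiref.sync import sync_to_async

    async def transform_gil_friendly(batch):
        # Fine when crunch_numbers drops the GIL (NumPy, PyTorch, etc.).
        return await sync_to_async(crunch_numbers, thread_sensitive=False)(batch)

    _pool = ProcessPoolExecutor()

    async def transform_pure_python(batch):
        # Pure-Python CPU work needs a separate process to escape the GIL.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(_pool, crunch_numbers, batch)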
  56. Big still allowed Microservices too DjangoCon US 2023 57 And

    if your needs grow beyond all this, the big tools are still right there. Maybe you swap your home-grown sharded loader for Hadoop and Hive. You might have to rewrite some Django ORM calls for PyHive, but the rest of your code keeps on trucking.
  57. In review ETL systems are useful for massaging data Async

    Django is great for building ETLs GraphQL is an excellent way to query There are many cool async libraries Our tools can grow as our needs grow DjangoCon US 2023 58 So to recap what we've talked about. ETL systems let us move data around and pull out the most important bits we need. Async Django isn't perfect, but it's very usable and a solid basis for small-scale ETL systems. GraphQL pairs well with both of them as a generic query interface, though with performance concerns at scale. Async Python has many fun libraries to add extra functionality to your service. And there is value in starting small and simple and letting a tool grow with the use cases, rather than investing in massive complexity up front.