Swiss Army Django
Small Footprint ETL
Noah Kantrowitz - August 18, 2023
DjangoCon AU 2023
Noah Kantrowitz
—Kubernetes (ContribEx) and Python (webmaster@)
—SRE/Platform for Geomagical Labs, part of IKEA
—We do CV/AR for the home
Case Study: Farm RPG
Why build this?
It's fun!
And educational!
What is ETL?
Web Scrapers
—ELT: Extract, Load, Transform
—Keep raw data
—Transform it again and again
—Cool but not very small
The shape of things
—Extractors: scraper/importer cron jobs
—Transforms: parse HTML/XML/JSON
—Reshape the data to make it easy to query
—Possibly import-time aggregations
—Loaders: Django ORM, maybe DRF or Pydantic
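As a rough illustration of the transform step, the sketch below parses a raw JSON payload and reshapes each row into a flat record that is easy to load and query. The `id` and `fullName` keys follow the scraper example later in the deck; the `ThingRecord` dataclass is a hypothetical stand-in for a model, not something from the talk.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ThingRecord:
    """Hypothetical flat record, shaped like the row a loader would save."""

    id: int
    name: str


def transform(raw: str) -> list[ThingRecord]:
    """Parse a raw JSON payload and reshape each row for loading."""
    rows = json.loads(raw)
    return [
        # Normalize types and whitespace here so loaders and queries stay simple.
        ThingRecord(id=int(row["id"]), name=row["fullName"].strip())
        for row in rows
    ]
```

Keeping the transform a pure function of the raw payload makes it easy to test without a database or a network connection.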
Async & Django
Yes, you can
really use it!
More on the ORM later
Why async?
—Everything in one box
—Fewer services, fewer problems
—Portability and development env
Reports of my death
have been greatly exaggerated
Writing an async app
—Normal models
—Async views
—Async middleware if needed
Running an async app
—Async server not included
—python -m uvicorn etl.asgi:application
—--reload for development
—--access-log --no-server-header
—--ssl-keyfile=... --ssl-certfile=...
Task factories
async def do_every(fn, n):
    while True:
        await fn()
        await asyncio.sleep(n)
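A minimal sketch of how such a task factory might be scheduled alongside other work and stopped again, using only the standard library. The `tick` coroutine and the short interval are hypothetical stand-ins for a real extractor and its schedule.

```python
import asyncio


async def do_every(fn, n):
    """Call the coroutine function `fn` forever, sleeping `n` seconds between calls."""
    while True:
        await fn()
        await asyncio.sleep(n)


async def demo():
    calls = []

    async def tick():  # hypothetical stand-in for an extractor
        calls.append(len(calls))

    # Schedule the loop as a background task on the running event loop.
    task = asyncio.create_task(do_every(tick, 0.01))
    await asyncio.sleep(0.05)  # other work happens while the task loops
    task.cancel()  # stop the loop, for example on shutdown
    try:
        await task
    except asyncio.CancelledError:
        pass
    return calls
```

Calling `asyncio.run(demo())` returns the list of recorded calls. Keeping a reference to the task, as `demo` does, also prevents it from being garbage collected while it is still running.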
Django integration
class MyAppConfig(AppConfig):
    def ready(self):
        coro = do_every(my_extractor, 30)
        # Assumption: ready() runs inside the server's running event loop
        asyncio.create_task(coro)
The other shoe
—Crashes can happen
—Plan for convergence
—Think about failure modes
—Make task status models if needed
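One way to plan for crashes is to let a failed run be logged and retried on the next tick instead of killing the loop. The sketch below is a hedged variant of a periodic task loop; the name `do_every_safely`, the logger name, and the `max_runs` parameter (present only so the loop can terminate in a test) are assumptions, not part of the talk.

```python
import asyncio
import logging

logger = logging.getLogger("etl.tasks")  # hypothetical logger name


async def do_every_safely(fn, n, max_runs=None):
    """Run `fn` every `n` seconds; a failing run is logged and retried next tick.

    Returns (runs, failures) once `max_runs` is reached; loops forever if it is None.
    """
    runs = failures = 0
    while max_runs is None or runs < max_runs:
        try:
            await fn()
        except Exception:
            # asyncio.CancelledError is not an Exception subclass (Python 3.8+),
            # so cancelling the task still stops the loop.
            failures += 1
            logger.exception("task %r failed; retrying on the next tick", fn)
        runs += 1
        await asyncio.sleep(n)
    return runs, failures
```

This only converges if each run is idempotent, so that repeating a half-finished batch ends in the same state as a clean run.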
Modeling for failure
—await Emails.objects.acreate(...)
—await email.adelete()
Async ORM
—Mind your a*s
—Transactions and sync_to_async
—Don't worry about concurrency
—Still improving
The simple case
async def scrape_things():
    resp = await client.get("/api/things/")
    for row in resp.json():
        await Thing.objects.aupdate_or_create(
            id=row["id"],  # assumed lookup key
            defaults={"name": row["fullName"]},
        )
DRF serializers
# .afirst() and ser.save are assumed completions of the truncated lines
item = await Item.objects.filter(id=data["id"]).afirst()
ser = ItemAPISerializer(instance=item, data=data)
await sync_to_async(ser.is_valid)(raise_exception=True)
await sync_to_async(ser.save)()
Diff delete
seen_ids = []
for row in data:
    thing, _ = await ...aupdate_or_create(...)
    seen_ids.append(thing.id)
await ...exclude(id__in=seen_ids).adelete()
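The diff-delete idea can be sketched without a database: upsert every incoming row, remember which ids were seen, then delete everything else. Here `store` is a hypothetical in-memory stand-in for a table keyed by id, and `sync_with_diff_delete` is an illustrative name rather than anything from the talk.

```python
def sync_with_diff_delete(store: dict[int, dict], rows: list[dict]) -> set[int]:
    """Upsert every incoming row, then delete whatever was not seen.

    Returns the set of ids that were deleted.
    """
    seen_ids = set()
    for row in rows:
        # Update the existing entry or create a new one.
        store[row["id"]] = {**store.get(row["id"], {}), **row}
        seen_ids.add(row["id"])
    # Anything still in the store but absent from this batch is stale.
    stale = set(store) - seen_ids
    for stale_id in stale:
        del store[stale_id]
    return stale
```

One design caveat: an empty batch deletes everything, so it may be worth skipping the delete step when an extractor returns no rows because of an upstream error.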
Why GraphQL?
query {
  items {
    name, image, value
    requiredFor {
      quantity, quest {
        title, image, text
      }
    }
  }
}
When not to use GraphQL
—Big data - it's too slow
—Numeric aggregation - "mean where type = foo"
—But you can pre-compute
—Poorly linked data - no ForeignKeys
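For the numeric aggregation case, pre-computing at import time might look roughly like the sketch below: group rows by type once, store the mean per group, and let queries read the stored number. The `type` and `value` keys and the function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def precompute_mean_by_type(rows):
    """Group rows by `type` and compute the mean `value` of each group once."""
    values = defaultdict(list)
    for row in rows:
        values[row["type"]].append(row["value"])
    # The result can be saved next to the data and queried like any other field.
    return {type_: mean(vals) for type_, vals in values.items()}
```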
class Query:
    items: list[Item] = field()
    quests: list[Quest] = field()

schema = gql.Schema(query=Query)
Model Types
class Item:
    id: int
    name: auto
    image: auto
    required_for: list[QuestRequired]
    reward_for: list[QuestReward]
Filters and Orders
class ItemFilter:
    id: auto
    can_mail: auto

class ItemOrder:
    id: auto
    name: auto
query {
  items {
    id, name
    requiredFor {
      quantity, quest {
        id, title
      }
    }
  }
}
Strawberry Views
Don't use their async view, it doesn't like Django
path("graphql", GraphQLView.as_view(schema=schema))
Site Generators
Enough about
boring stuff
Async for fun?
Or Bolt-python for Slack
Or Bottom for IRC if you're old
intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)
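class BotConfig(AppConfig):
    def ready(self):
        # Assumed body: schedule the bot on the already-running loop
        asyncio.create_task(client.start(TOKEN))  # TOKEN is a placeholder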
Control the loop
—client.run() - expects to own the loop
—client.start() - cooperates with others
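The difference between the two entry points can be sketched with the standard library alone. `FakeClient` below is a hypothetical stand-in for a chat client, not the real discord.py class: its `run()` creates and owns an event loop, while `start()` is a coroutine that shares whatever loop is already running.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for a chat client with both entry points."""

    def __init__(self):
        self.events = []

    async def start(self):
        # Cooperative entry point: runs on the loop that is already running.
        for i in range(3):
            self.events.append(f"client {i}")
            await asyncio.sleep(0)

    def run(self):
        # Blocking entry point: creates and owns its own event loop.
        asyncio.run(self.start())


async def main():
    client = FakeClient()

    async def extractor():  # stand-in for an ETL task sharing the loop
        for i in range(3):
            client.events.append(f"etl {i}")
            await asyncio.sleep(0)

    # Both coroutines share one loop and take turns at each await.
    await asyncio.gather(client.start(), extractor())
    return client.events
```

Inside a server that already runs a loop, only the cooperative form fits; `asyncio.run()` raises `RuntimeError` when called from a running event loop.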
Chat bot
@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.content == "!count":
        n = await Thing.objects.acount()
        await message.channel.send(f"There are {n} things")
async def scrape_things():
    # Do the ETL ...
    channel = client.get_channel(id)
    await channel.send("Batch complete")
Let's get weirder: SSH
async def handle_ssh(proc):
    proc.stdout.write('Welcome!\n> ')
    cmd = await proc.stdin.readline()
    if cmd.strip() == 'count':  # readline() keeps the trailing newline
        n = await Thing.objects.acount()
        proc.stdout.write(f"There are {n} things\n")
def ready():
The sky's the limit
Back to serious
Scaling? Let's go
Ingest sharding
—Too much data to load?
—if hash(url) % 2 == server_id
—Adjust your aggregations
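One caveat when sketching the sharding check in Python: the built-in `hash()` of a string is salted per process unless `PYTHONHASHSEED` is fixed, so two servers may disagree about which shard a URL belongs to. A minimal alternative, assuming a cryptographic digest is an acceptable cost, derives the shard from a stable hash. The function names and the default of two shards are illustrative.

```python
import hashlib


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard index that is stable across processes and hosts."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # The first 8 bytes provide plenty of entropy for picking a shard.
    return int.from_bytes(digest[:8], "big") % num_shards


def should_ingest(url: str, server_id: int, num_shards: int = 2) -> bool:
    """True if the server with this id is responsible for the URL."""
    return shard_for(url, num_shards) == server_id
```

Each extractor would then skip URLs for which `should_ingest(url, server_id)` is false, so every URL is claimed by exactly one server.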
CPU bound?
—Does the library drop the GIL?
—sync_to_async(…, thread_sensitive=False)
—ProcessPoolExecutor or aiomultiprocess
—(PEP 703 in the future?)
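A hedged sketch of moving CPU-bound work off the event loop with a standard-library executor. `parse_heavy` is a hypothetical stand-in for an expensive transform, and `make_executor` is an illustrative helper: a thread pool only helps when the library releases the GIL, while a process pool suits pure-Python work.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def parse_heavy(n: int) -> int:
    """Hypothetical CPU-heavy transform; a sum of squares stands in for parsing."""
    return sum(i * i for i in range(n))


def make_executor(pure_python: bool) -> Executor:
    """Processes for pure-Python CPU work, threads if the library drops the GIL."""
    return ProcessPoolExecutor() if pure_python else ThreadPoolExecutor()


async def run_cpu_bound(executor: Executor, fn, *args):
    """Run `fn(*args)` in the executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, fn, *args)


async def main(executor: Executor) -> list[int]:
    # Fan several jobs out to the pool and wait for all of them.
    jobs = (run_cpu_bound(executor, parse_heavy, n) for n in (10, 100))
    return await asyncio.gather(*jobs)
```

With a process pool, the function and its arguments must be picklable, and on platforms that spawn worker processes the pool should be created under an `if __name__ == "__main__":` guard.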
Big still allowed
Microservices too
In review
ETL systems are useful for massaging data
Async Django is great for building ETLs
GraphQL is an excellent way to query
There are many cool async libraries
Our tools can grow as our needs grow
