Swiss Army Django
Small Footprint ETL
Noah Kantrowitz - coderanger.net
DjangoCon US 2023 1
Noah Kantrowitz
—He/him
—coderanger.net | cloudisland.nz/@coderanger
—Kubernetes (ContribEx) and Python (webmaster@)
—SRE/Platform for Geomagical Labs, part of IKEA
—We do CV/AR for the home
ETL
Case study: Farm RPG
Game data
Why build this?
It's fun!
And educational!
What is ETL?
Extract
Transform
Load
Web scrapers
Scrape
Responsibly
The shape of things
—Extractors: scraper/importer cron jobs
—Transforms: parse HTML/JSON/etc then munge
—Loaders: Django ORM, maybe DRF or Pydantic
Robots Data in disguise
—Parsed to structured data
  —json.loads()
  —BeautifulSoup
  —struct.unpack()
—Reshape the data to make it easy to query
—Possibly import-time aggregations like averages
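As a rough illustration of the bullets above, the sketch below parses a JSON payload and a packed binary record, then reshapes them into rows that are easy to look up, including an import-time average. The field names (`key`, `fullName`, `prices`), the binary layout, and the helper names are hypothetical and are not taken from the Farm RPG data.

```python
import json
import struct
from statistics import mean

# Hypothetical raw inputs; a real extractor would fetch these.
RAW_JSON = (
    '[{"key": 1, "fullName": "Wood", "prices": [2, 4]},'
    ' {"key": 2, "fullName": "Stone", "prices": [10]}]'
)
# Hypothetical binary layout: little-endian uint16 id, then float32 weight.
RAW_BINARY = struct.pack("<Hf", 1, 0.5)


def transform_items(raw):
    """Parse JSON and reshape it into rows keyed by id, with an average."""
    rows = {}
    for item in json.loads(raw):
        rows[item["key"]] = {
            "name": item["fullName"],
            "avg_price": mean(item["prices"]),
        }
    return rows


def transform_weight(raw):
    """Unpack one fixed-size binary record into a dict."""
    item_id, weight = struct.unpack("<Hf", raw)
    return {"id": item_id, "weight": weight}


if __name__ == "__main__":
    print(transform_items(RAW_JSON))
    print(transform_weight(RAW_BINARY))
```

Keeping each transform a plain function of its raw input makes it testable without touching the network or the database.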
Aggregations
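from django.db.models import Avg
from django.http import JsonResponse

# Query time: recompute the average on every request
async def myview(request):
    return JsonResponse(await Item.objects.aaggregate(Avg("value")))

# Ingest time: compute once per load and store the result
stats = await Item.objects.aaggregate(avg=Avg("value"))
await ItemStats.objects.aupdate_or_create(
    defaults={"avg": stats["avg"]})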
Loaders
update_or_create()
Or maybe DRF
Async & Django
Yes, you can
really use it!
More on the ORM later
Why async?
—Everything in one box
—Fewer services, fewer problems
—Portability and development env
Monoliths!
Reports of my death
have been greatly exaggerated
Writing an async app
—Normal models
—Async views
—Async middleware if needed
Running an async app
—Async server not included
—python -m uvicorn etl.asgi:application
  —--reload for development
  —--access-log --no-server-header
  —--ssl-keyfile=... --ssl-certfile=...
Background tasks
—Celery beat?
—Cron?
—Async!
Task factories
import asyncio

async def do_every(fn, n):
    # Run fn, wait n seconds, repeat forever
    while True:
        await fn()
        await asyncio.sleep(n)
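One thing the loop above leaves open is what happens when `fn` raises: the exception ends the loop and the schedule stops. A minimal sketch of a more forgiving variant follows. The name `do_every_safe`, the `max_runs` parameter, and the `flaky_extractor` stand-in are all hypothetical additions for illustration.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def do_every_safe(fn, n, max_runs=None):
    """Call fn every n seconds, logging failures instead of exiting.

    max_runs is a hypothetical knob so the loop can be stopped in demos
    and tests; None means run forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await fn()
        except Exception:
            # Keep the schedule alive; the next run gets another chance.
            logger.exception("periodic task %r failed", fn)
        runs += 1
        await asyncio.sleep(n)
    return runs


calls = []


async def flaky_extractor():
    """Stand-in extractor that fails on its first call only."""
    calls.append(len(calls))
    if len(calls) == 1:
        raise RuntimeError("upstream is down")


if __name__ == "__main__":
    asyncio.run(do_every_safe(flaky_extractor, 0.01, max_runs=3))
    print(calls)
```

Catching `Exception` rather than `BaseException` means task cancellation still propagates, so the loop can be shut down cleanly.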
Django integration
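import asyncio
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    def ready(self):
        # create_task() requires an already-running event loop
        coro = do_every(my_extractor, 30)
        asyncio.create_task(coro)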
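A caveat when spawning tasks this way: the event loop keeps only weak references to tasks, so the asyncio documentation recommends holding your own reference until the task finishes. The sketch below shows one common way to do that. The `spawn` helper and the `_background_tasks` set are hypothetical names, not part of Django or of the talk.

```python
import asyncio

# Strong references to running background tasks. The event loop itself
# only keeps weak references, so an unreferenced task may be collected.
_background_tasks = set()


def spawn(coro):
    """Schedule coro on the running loop and keep it alive until done."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


async def main():
    results = []

    async def job():
        await asyncio.sleep(0)
        results.append("ran")

    task = spawn(job())
    tracked_while_running = task in _background_tasks
    await task
    await asyncio.sleep(0)  # give the done callback a chance to run
    return results, tracked_while_running, len(_background_tasks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In an app config, `ready()` could call a helper like this in place of a bare `asyncio.create_task(...)`.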
The other shoe
—Crashes can happen
—Plan for convergence
—Think about failure modes
—Make task status models if needed
Modeling for failure
—create_task(send_email())
—await Emails.objects.acreate(...)
—create_task(do_every(send_all_emails))
—await email.adelete()
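The bullets above contrast a fire-and-forget task with a durable queue: record the intent in the database, let a periodic task send everything pending, and delete a row only after it has been sent. Here is a minimal sketch of that shape, using the standard library's `sqlite3` as a stand-in for the Django ORM. The table layout, `deliver`, `queue_email`, and `pending` are hypothetical.

```python
import asyncio
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, recipient TEXT)")
sent = []


async def deliver(recipient):
    """Hypothetical mail transport; fails for one address."""
    if recipient == "bad@example.com":
        raise ConnectionError("SMTP unavailable")
    sent.append(recipient)


def queue_email(recipient):
    """Record the intent durably instead of spawning a task."""
    db.execute("INSERT INTO emails (recipient) VALUES (?)", (recipient,))
    db.commit()


async def send_all_emails():
    """Try every pending row; delete a row only after delivery succeeds."""
    rows = db.execute("SELECT id, recipient FROM emails").fetchall()
    for email_id, recipient in rows:
        try:
            await deliver(recipient)
        except ConnectionError:
            continue  # row stays queued and is retried on the next run
        db.execute("DELETE FROM emails WHERE id = ?", (email_id,))
        db.commit()


def pending():
    """List recipients still waiting to be sent."""
    query = "SELECT recipient FROM emails ORDER BY id"
    return [row[0] for row in db.execute(query)]


if __name__ == "__main__":
    queue_email("a@example.com")
    queue_email("bad@example.com")
    asyncio.run(send_all_emails())
    print(sent, pending())
```

This gives at-least-once delivery: a crash between sending and deleting means the message is sent again on the next pass, which is usually the easier failure mode to live with.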
Async ORM
—Mind your a*s
—Transactions and sync_to_async
—Don't worry about concurrency
—Still improving
The simple case
async def scrape_things():
    # client is assumed to be a shared async HTTP client
    resp = await client.get("/api/things/")
    resp.raise_for_status()
    for row in resp.json():
        await Thing.objects.aupdate_or_create(
            id=row["key"],
            defaults={"name": row["fullName"]},
        )
Real-er cron
—Croniter is amazing
—In-memory or database
—Loop and check now >= next
while True:
    now = timezone.now()  # assumed: django.utils.timezone
    for cron in decorators._registry.values():
        # Due (or never scheduled), and the previous run is not still in flight
        if (cron.next_run_at is None or cron.next_run_at <= now) and (
            cron.previous_started_at is None
            or (
                cron.previous_finished_at is not None
                and cron.previous_started_at <= cron.previous_finished_at
            )
        ):
            cron.previous_started_at = now
            asyncio.create_task(_do_cron(cron))
    await asyncio.sleep(1)

async def _do_cron(cron):
    await cron.fn()
    now = timezone.now()  # assumed, as above
    cron.previous_finished_at = now
    cron.next_run_at = (
        croniter(cron.cronspec, now, hash_id=cron.name)
        .get_next(ret_type=datetime)
        .replace(tzinfo=dtimezone.utc)
    )
History
—django-pghistory
—@pghistory.track(pghistory.Snapshot(), exclude=["modified_at"])
Incremental load
latest = await Trades.objects.aaggregate(Max("id"))
resp = await client.get(...,
    query={"since": latest.get("id__max") or 0})
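To make the incremental pattern concrete, here is a self-contained sketch that uses `sqlite3` in place of the Django ORM and an in-memory list in place of the remote API. The `fetch_trades` function, the `UPSTREAM` data, and the table layout are hypothetical stand-ins.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, price REAL)")

# Hypothetical upstream data; a real extractor would call an HTTP API.
UPSTREAM = [{"id": i, "price": float(i * 10)} for i in range(1, 6)]


def fetch_trades(since):
    """Stand-in for an API that supports a `since` filter."""
    return [trade for trade in UPSTREAM if trade["id"] > since]


def load_new_trades():
    """Fetch only rows newer than the highest id already stored."""
    (latest,) = db.execute("SELECT MAX(id) FROM trades").fetchone()
    new_rows = fetch_trades(since=latest or 0)
    db.executemany(
        "INSERT INTO trades (id, price) VALUES (:id, :price)", new_rows
    )
    db.commit()
    return len(new_rows)


if __name__ == "__main__":
    print(load_new_trades())  # first run loads everything
    print(load_new_trades())  # second run finds nothing new
```

The approach relies on upstream ids only ever increasing. If old rows can change, a timestamp or version column is a safer cursor.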
Multi-stage transforms
one = await step_one()
two = await step_two(one)
three, four = await gather(
    step_three(two),
    step_four(two),
)
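For readers who want to run the staged pipeline, the sketch below fills in trivial step bodies. The step contents are made up for illustration. Only the sequencing (two sequential stages, then two independent stages run together with `asyncio.gather`) mirrors the slide.

```python
import asyncio


async def step_one():
    """Extract: pretend to fetch raw records."""
    await asyncio.sleep(0)
    return [1, 2, 3]


async def step_two(raw):
    """Transform: normalize the raw records."""
    return [value * 10 for value in raw]


async def step_three(values):
    """Independent aggregate: total."""
    return sum(values)


async def step_four(values):
    """Independent aggregate: maximum."""
    return max(values)


async def pipeline():
    one = await step_one()
    two = await step_two(one)
    # Steps three and four depend only on `two`, so they can overlap.
    three, four = await asyncio.gather(step_three(two), step_four(two))
    return three, four


if __name__ == "__main__":
    print(asyncio.run(pipeline()))
```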
We have data
Now what?
GraphQL
Here be dragons
Use only for small
Or very cachable
Why GraphQL?
query {
  items {
    name, image, value
    requiredFor {
      quantity, quest {
        title, image, text
      }
    }
  }
}
Strawberry
Model types
@gql.django.type(models.Item)
class Item:
    id: int
    name: auto
    image: auto
    required_for: list[QuestRequired]
    reward_for: list[QuestReward]
Site generators
Gatsby
Next.js
Pelican
Let's get weirder: SSH
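async def handle_ssh(proc):
    proc.stdout.write('Welcome!\n> ')
    # readline() keeps the trailing newline, so strip before comparing
    cmd = (await proc.stdin.readline()).strip()
    if cmd == 'count':
        n = await Thing.objects.acount()
        proc.stdout.write(
            f"There are {n} things\n")

def ready(self):
    # Host, port, and host key options are omitted here
    create_task(
        asyncssh.listen(process_factory=handle_ssh))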
The sky's the limit
Streamdeck
Aioserial
Twisted
Back to serious
Scaling? Let's go
Ingest sharding
—Too much data to load?
—if hash(url) % 2 == server_id
—Adjust your aggregations
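A practical note on the hash check above: Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is fixed, so two servers would not agree on who owns a URL. A minimal sketch using a stable digest instead follows. The function names `shard_for` and `urls_for_server` are hypothetical.

```python
import hashlib


def shard_for(url, num_shards):
    """Map a URL to a shard index that is identical on every server."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def urls_for_server(urls, server_id, num_shards):
    """Return the subset of urls this server is responsible for."""
    return [u for u in urls if shard_for(u, num_shards) == server_id]


if __name__ == "__main__":
    urls = [f"https://example.com/item/{i}" for i in range(6)]
    for server_id in range(2):
        print(server_id, urls_for_server(urls, server_id, 2))
```

Every URL lands on exactly one server, and the assignment survives restarts because it depends only on the URL bytes.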
CPU bound?
—Does the library drop the GIL?
—sync_to_async(…, thread_sensitive=False)
—ProcessPoolExecutor or aiomultiprocess
—(PEP 703 in the future?)
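As an example of the first two bullets, `hashlib` releases the GIL while hashing large inputs, so pushing that work onto threads lets it overlap with the event loop. The sketch below uses plain `asyncio.to_thread` (Python 3.9+) so that it runs without Django. In a Django project, `sync_to_async(..., thread_sensitive=False)` fills a similar role. The function names are hypothetical.

```python
import asyncio
import hashlib


def checksum(data):
    """CPU-heavy, but hashlib releases the GIL for large inputs."""
    return hashlib.sha256(data).hexdigest()


async def checksum_many(blobs):
    """Hash blobs in worker threads so the event loop stays responsive."""
    return await asyncio.gather(
        *(asyncio.to_thread(checksum, blob) for blob in blobs)
    )


if __name__ == "__main__":
    blobs = [b"a" * 100_000, b"b" * 100_000]
    print(asyncio.run(checksum_many(blobs)))
```

For pure-Python CPU work that holds the GIL, threads will not help, and a process pool is the usual fallback.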
Big still allowed
Microservices too
In review
ETL systems are useful for massaging data
Async Django is great for building ETLs
GraphQL is an excellent way to query
There are many cool async libraries
Our tools can grow as our needs grow