Swiss Army Django
Small Footprint ETL
Noah Kantrowitz - coderanger.net - August 18, 2023
DjangoCon AU 2023 1
Slide 2
Noah Kantrowitz
—He/him
—coderanger.net | cloudisland.nz/@coderanger
—Kubernetes (ContribEx) and Python (webmaster@)
—SRE/Platform for Geomagical Labs, part of IKEA
—We do CV/AR for the home
Slide 3
ETL
Slide 4
Case Study: Farm RPG
Slide 5
Game
Data
Slide 6
Why build this?
It's fun!
And educational!
Slide 7
What is ETL?
Extract
Transform
Load
Slide 8
Web Scrapers
Slide 9
ELT?
—Extract, Load, Transform
—Keep raw data
—Transform it again and again
—Cool but not very small
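A rough illustration of the ELT idea above: the raw payload is stored untouched, and a transform can be re-run over the stored copies whenever its logic changes. The in-memory `RAW_STORE` list, the field names, and both transform functions are hypothetical stand-ins, not something shown in the talk.

```python
import json

# Hypothetical in-memory stand-in for a table of raw, untransformed payloads.
RAW_STORE: list[str] = []


def load_raw(payload: str) -> None:
    """Load step: keep the payload exactly as it was extracted."""
    RAW_STORE.append(payload)


def transform_v1(payload: str) -> dict:
    """First attempt at a transform: keep only the name."""
    row = json.loads(payload)
    return {"name": row["fullName"]}


def transform_v2(payload: str) -> dict:
    """Later revision: also keep the value, without re-scraping anything."""
    row = json.loads(payload)
    return {"name": row["fullName"], "value": row.get("value", 0)}


def retransform(transform) -> list[dict]:
    """Re-run a transform over every stored raw payload."""
    return [transform(payload) for payload in RAW_STORE]


load_raw('{"key": 1, "fullName": "Apple", "value": 5}')
first_pass = retransform(transform_v1)
second_pass = retransform(transform_v2)
```

The cost is the one the slide names: every raw payload has to be kept around, which is why this is less of a small-footprint approach.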
Slide 10
Scrape
Responsibly
Slide 11
The shape of things
—Extractors: scraper/importer cron jobs
—Transforms: parse HTML/XML/JSON
—Reshape the data to make it easy to query
—Possibly import-time aggregations like averages
—Loaders: Django ORM, maybe DRF or Pydantic
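The transform step described above can be sketched with the standard library alone. This is only an illustration: the payload shape and the `key`, `fullName`, and `prices` fields are assumptions made up for the example, and a real loader would hand the resulting rows to the ORM instead of keeping them in a list.

```python
import json
from statistics import mean

# Hypothetical payload, shaped like something a scraper might extract.
RAW = (
    '{"items": ['
    '{"key": 1, "fullName": "Apple", "prices": [4, 6]}, '
    '{"key": 2, "fullName": "Board", "prices": [10]}'
    ']}'
)


def transform(raw: str) -> list[dict]:
    """Parse the payload and reshape each item into a flat, query-friendly row."""
    rows = []
    for item in json.loads(raw)["items"]:
        rows.append(
            {
                "id": item["key"],
                "name": item["fullName"],
                # Import-time aggregation: store the average, not the samples.
                "avg_price": mean(item["prices"]),
            }
        )
    return rows


rows = transform(RAW)
```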
Slide 12
Async & Django
Yes, you can
really use it!
More on the ORM later
Slide 13
Why async?
—Everything in one box
—Fewer services, fewer problems
—Portability and development env
Slide 14
Monoliths!
Reports of my death
have been greatly exaggerated
Slide 15
Writing an async app
—Normal models
—Async views
—Async middleware if needed
Slide 16
Running an async app
—Async server not included
—python -m uvicorn etl.asgi:application
—--reload for development
—--access-log --no-server-header
—--ssl-keyfile=... --ssl-certfile=...
Slide 17
Background
tasks
Slide 18
Task factories
import asyncio

async def do_every(fn, n):
    # Call fn, wait n seconds, repeat forever.
    while True:
        await fn()
        await asyncio.sleep(n)
Slide 19
Django integration
import asyncio
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    def ready(self):
        # create_task() requires a running event loop.
        coro = do_every(my_extractor, 30)
        asyncio.create_task(coro)
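The asyncio documentation notes that the event loop only keeps weak references to tasks, so a task that is created and then forgotten can be garbage-collected before it finishes. A minimal sketch of one way to hold on to background tasks, using only the standard library; the `spawn` helper and the `BACKGROUND_TASKS` set are hypothetical names, not part of the talk or of Django:

```python
import asyncio

# Hypothetical module-level registry that keeps strong references to tasks.
BACKGROUND_TASKS: set[asyncio.Task] = set()


def spawn(coro) -> asyncio.Task:
    """Schedule a coroutine on the running loop and remember the task."""
    task = asyncio.create_task(coro)
    BACKGROUND_TASKS.add(task)
    # Forget the task once it finishes so the set does not grow forever.
    task.add_done_callback(BACKGROUND_TASKS.discard)
    return task


async def demo() -> int:
    async def work() -> int:
        await asyncio.sleep(0)
        return 42

    task = spawn(work())
    assert task in BACKGROUND_TASKS
    return await task


result = asyncio.run(demo())
```

Under that assumption, `ready()` could call `spawn(do_every(my_extractor, 30))` in place of the bare `asyncio.create_task(coro)`.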
Slide 20
The other shoe
—Crashes can happen
—Plan for convergence
—Think about failure modes
—Make task status models if needed
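A minimal sketch of the "plan for convergence" idea, assuming the goal is that one failed run should not end the periodic loop. `do_every_safely`, its `max_runs` parameter, and the logger name are hypothetical additions for illustration; the talk's own `do_every` has no error handling.

```python
import asyncio
import logging

logger = logging.getLogger("etl.tasks")  # hypothetical logger name


async def do_every_safely(fn, n, max_runs=None):
    """Call ``fn`` every ``n`` seconds, logging failures instead of dying.

    ``max_runs`` exists only so the demo below terminates; a real
    background task would loop forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await fn()
        except Exception:
            # One bad run should not end the loop; the next run catches up.
            # CancelledError is not an Exception subclass (Python 3.8+),
            # so cancelling the task still stops it.
            logger.exception("background task %r failed", fn)
        runs += 1
        await asyncio.sleep(n)


calls = []


async def flaky():
    """Demo job that fails on its first run and succeeds afterwards."""
    calls.append(len(calls))
    if len(calls) == 1:
        raise RuntimeError("first run fails")


asyncio.run(do_every_safely(flaky, 0, max_runs=3))
```

This only helps if each run is written to converge, that is, to redo or skip whatever an earlier failed run left behind.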
Slide 21
Modeling for failure
—create_task(send_email())
—await Emails.objects.acreate(...)
—create_task(do_every(send_all_emails))
—await email.adelete()
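One reading of the four steps above: rather than firing off `send_email()` directly, where a crash loses the work, record the email as a row first, let a periodic task send everything that is pending, and delete each row only after it has been sent. The sketch below mimics that with an in-memory dict in place of the `Emails` model; `PENDING`, `SENT`, `queue_email`, and the failing `send_email` are all hypothetical.

```python
import asyncio

# Hypothetical in-memory stand-in for an Emails table.
PENDING: dict[int, str] = {}
SENT: list[str] = []
ATTEMPTS = 0


async def send_email(body: str) -> None:
    """Pretend mail delivery that fails on the very first attempt."""
    global ATTEMPTS
    ATTEMPTS += 1
    if ATTEMPTS == 1:
        raise ConnectionError("mail server unavailable")
    SENT.append(body)


async def queue_email(email_id: int, body: str) -> None:
    """Record the intent first (the slide's ``acreate`` step)."""
    PENDING[email_id] = body


async def send_all_emails() -> None:
    """Try every pending email; delete a row only after it was sent."""
    for email_id, body in list(PENDING.items()):
        try:
            await send_email(body)
        except ConnectionError:
            continue  # leave the row in place so a later pass retries it
        del PENDING[email_id]  # the slide's ``adelete`` step


async def demo() -> None:
    await queue_email(1, "hello")
    await queue_email(2, "world")
    await send_all_emails()  # first pass: email 1 fails, email 2 is sent
    await send_all_emails()  # second pass: email 1 is retried and sent


asyncio.run(demo())
```

A crash between sending and deleting would cause a duplicate send on the next pass, which is one of the failure modes worth deciding on up front.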
Slide 22
Async ORM
—Mind your a*s
—Transactions and sync_to_async
—Don't worry about concurrency
—Still improving
Slide 26
The simple case
async def scrape_things():
    resp = await client.get("/api/things/")
    resp.raise_for_status()
    for row in resp.json():
        await Thing.objects.aupdate_or_create(
            id=row["key"],
            defaults={"name": row["fullName"]},
        )
Slide 27
DRF serializers
item = await Item.objects.filter(id=data["id"]).afirst()
ser = ItemAPISerializer(instance=item, data=data)
await sync_to_async(ser.is_valid)(raise_exception=True)
await sync_to_async(ser.save)()
Slide 28
Diff delete
seen_ids = []
for row in data:
    # aupdate_or_create() must be awaited and returns (object, created).
    thing, _ = await ...aupdate_or_create(...)
    seen_ids.append(thing.id)
await ...exclude(id__in=seen_ids).adelete()
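The same diff-delete logic, sketched against a plain dict so it can be run without a database. `TABLE`, `sync_things`, and the row fields are hypothetical stand-ins for the model and ORM calls elided above.

```python
# Hypothetical in-memory stand-in for a Thing table, keyed by id.
TABLE: dict[int, dict] = {
    1: {"name": "Old apple"},
    2: {"name": "Stale row"},
}


def sync_things(data: list[dict]) -> None:
    """Upsert every incoming row, then delete rows the source no longer has."""
    seen_ids = set()
    for row in data:
        TABLE[row["key"]] = {"name": row["fullName"]}  # update or create
        seen_ids.add(row["key"])
    # Diff delete: anything not seen in this batch is gone upstream.
    for stale_id in set(TABLE) - seen_ids:
        del TABLE[stale_id]


sync_things([
    {"key": 1, "fullName": "Apple"},
    {"key": 3, "fullName": "Board"},
])
```

Note that an empty batch deletes every row, so it may be worth skipping the delete step when the extract step returned nothing or failed.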
Slide 34
Why GraphQL?
query {
  items {
    name, image, value
    requiredFor {
      quantity, quest {
        title, image, text
      }
    }
  }
}
Slide 35
When not to use GraphQL
—Big data - it's too slow
—Numeric aggregation - "mean where type = foo"
—But you can pre-compute
—Poorly linked data - no ForeignKeys
Slide 36
Strawberry
Slide 37
Schema
@gql.type
class Query:
    items: list[Item] = field()
    quests: list[Quest] = field()

schema = gql.Schema(query=Query)
Slide 38
Model Types
@gql.django.type(models.Item)
class Item:
    id: int
    name: auto
    image: auto
    required_for: list[QuestRequired]
    reward_for: list[QuestReward]
Slide 39
Filters and Orders
@gql.django.filter(models.Item)
class ItemFilter:
    id: auto
    can_mail: auto

@gql.django.order(models.Item)
class ItemOrder:
    id: auto
    name: auto
Slide 40
Performance?
query {
  items {
    id, name
    requiredFor {
      quantity, quest {
        id, title
      }
    }
  }
}
Slide 43
Strawberry Views
Don't use their async view; it doesn't like Django
path("graphql", GraphQLView.as_view(schema=schema))
Slide 44
Site Generators
Gatsby
Next.js
Pelican
Slide 46
Enough about
boring stuff
Async for fun?
Slide 47
discord.py
Or Bolt-python for Slack
Or Bottom for IRC if you're old
Slide 48
intents = discord.Intents.default()
intents.message_content = True
client = discord.Client(intents=intents)

class BotConfig(AppConfig):
    def ready(self):
        create_task(client.start(token))
Slide 49
Control the loop
—client.run() - expects to own the loop
—client.start() - cooperates with others
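The difference between the two styles can be shown with the standard library alone. In this sketch `asyncio.run()` stands in for any blocking entry point that expects to own the loop, and `library_main` is a hypothetical stand-in for a library's long-running coroutine; nothing here calls discord.py itself.

```python
import asyncio


async def library_main() -> str:
    """Hypothetical stand-in for a library's long-running coroutine."""
    await asyncio.sleep(0)
    return "library finished"


async def app() -> tuple[str, str]:
    # Style 1: a blocking entry point that starts its own loop.
    # asyncio.run() refuses to run while another loop is already running.
    coro = library_main()
    try:
        asyncio.run(coro)
        outcome = "ran"
    except RuntimeError:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        outcome = "refused"

    # Style 2: a plain coroutine can be scheduled next to everything else.
    task = asyncio.create_task(library_main())
    return outcome, await task


result = asyncio.run(app())
```

When the ASGI server already owns the loop, only the second style fits, which is why a coroutine entry point is the one to hand to `create_task`.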
Slide 50
Chat bot
@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.content == "!count":
        n = await Thing.objects.acount()
        await message.channel.send(
            f"There are {n} things")
Slide 51
Notifications
async def scrape_things():
    # Do the ETL ...
    channel = client.get_channel(id)
    await channel.send("Batch complete")
Slide 52
Let's get weirder: SSH
async def handle_ssh(proc):
    proc.stdout.write('Welcome!\n> ')
    # readline() keeps the trailing newline, so strip it before comparing.
    cmd = (await proc.stdin.readline()).strip()
    if cmd == 'count':
        n = await Thing.objects.acount()
        proc.stdout.write(
            f"There are {n} things\n")

def ready(self):
    create_task(
        asyncssh.listen(process_factory=handle_ssh))
Slide 53
The sky's the limit
Streamdeck
Aioserial
Pytradfri
Slide 54
Back to serious
Scaling? Let's go
Slide 55
Ingest sharding
—Too much data to load?
—if hash(url) % 2 == server_id
—Adjust your aggregations
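One caveat when turning the `hash(url) % 2` shorthand into real code: Python's built-in `hash()` is salted per process for strings unless `PYTHONHASHSEED` is fixed, so two servers would not agree on which shard a URL belongs to. A minimal sketch using a stable digest instead; `shard_for`, `should_ingest`, and the example URLs are hypothetical.

```python
import hashlib


def shard_for(url: str, num_servers: int) -> int:
    """Map a URL to a shard number that is stable across processes."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then take the remainder.
    return int.from_bytes(digest[:8], "big") % num_servers


def should_ingest(url: str, server_id: int, num_servers: int = 2) -> bool:
    """True when this server is responsible for the given URL."""
    return shard_for(url, num_servers) == server_id


urls = [f"https://example.com/api/things/{i}" for i in range(10)]
mine = [u for u in urls if should_ingest(u, server_id=0)]
theirs = [u for u in urls if should_ingest(u, server_id=1)]
```

Every URL lands on exactly one server, so aggregations computed at import time only see one shard's share of the data and have to be combined afterwards.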
Slide 56
CPU bound?
—Does the library drop the GIL?
—sync_to_async(…, thread_sensitive=False)
—ProcessPoolExecutor or aiomultiprocess
—(PEP 703 in the future?)
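A standard-library sketch of the thread-offloading option above. It uses `asyncio.to_thread` as a stand-in for `sync_to_async(…, thread_sensitive=False)`, and `hashlib` as the example of a library that releases the GIL while working on large inputs; `checksum` and `checksum_all` are hypothetical names.

```python
import asyncio
import hashlib


def checksum(blob: bytes) -> str:
    """Blocking, CPU-heavy work; hashlib releases the GIL on large inputs."""
    return hashlib.sha256(blob).hexdigest()


async def checksum_all(blobs: list[bytes]) -> list[str]:
    # asyncio.to_thread (Python 3.9+) runs each call in a worker thread so
    # the event loop stays free to serve other coroutines.
    return await asyncio.gather(
        *(asyncio.to_thread(checksum, blob) for blob in blobs)
    )


blobs = [bytes([i]) * 1_000_000 for i in range(4)]
digests = asyncio.run(checksum_all(blobs))
```

For pure-Python work that holds the GIL, threads keep the loop responsive but do not add parallelism, which is where the process-based options on the slide come in.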
Slide 57
Big still allowed
Microservices too
Slide 58
In review
ETL systems are useful for massaging data
Async Django is great for building ETLs
GraphQL is an excellent way to query
There are many cool async libraries
Our tools can grow as our needs grow