Slide 1

Slide 1 text

Swiss Army Django: Small Footprint ETL
Noah Kantrowitz - coderanger.net
DjangoCon US 2023

Slide 2

Slide 2 text

Noah Kantrowitz
- He/him
- coderanger.net | cloudisland.nz/@coderanger
- Kubernetes (ContribEx) and Python (webmaster@)
- SRE/Platform for Geomagical Labs, part of IKEA
- We do CV/AR for the home

Slide 3

Slide 3 text

ETL

Slide 4

Slide 4 text

Case study: Farm RPG

Slide 5

Slide 5 text

Game data

Slide 6

Slide 6 text

Why build this?
It's fun! And educational!

Slide 7

Slide 7 text

What is ETL?
- Extract
- Transform
- Load

Slide 8

Slide 8 text

Web scrapers

Slide 9

Slide 9 text

Scrape Responsibly

Slide 10

Slide 10 text

The shape of things
- Extractors: scraper/importer cron jobs
- Transforms: parse HTML/JSON/etc., then munge
- Loaders: Django ORM, maybe DRF or Pydantic

Slide 11

Slide 11 text

Robots Data in disguise
- Parsed to structured data
  - json.loads()
  - BeautifulSoup
  - struct.unpack()
- Reshape the data to make it easy to query
- Possibly import-time aggregations like averages
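A minimal sketch of the transform step above, using only the standard library. The `key` and `fullName` field names echo the later scraper slide; the `price` field and the binary record layout are assumptions made up for illustration.

```python
import json
import struct


def parse_json_item(payload: str) -> dict:
    """Parse a JSON document and reshape it into flat, query-friendly fields."""
    raw = json.loads(payload)
    return {
        "id": raw["key"],
        "name": raw["fullName"].strip(),
        # Hypothetical nested field, flattened and coerced to an int.
        "value": int(raw["price"]["amount"]),
    }


def parse_binary_item(blob: bytes) -> dict:
    """Unpack a fixed-width record: little-endian uint32 id, then uint16 value."""
    item_id, value = struct.unpack("<IH", blob)
    return {"id": item_id, "value": value}


json_row = parse_json_item(
    '{"key": 7, "fullName": " Apple ", "price": {"amount": "25"}}'
)
binary_row = parse_binary_item(struct.pack("<IH", 7, 25))
```

Both parsers return plain dicts, so the loader does not need to know which wire format the data arrived in.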

Slide 12

Slide 12 text

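Aggregations

    from django.db.models import Avg

    # Query time: compute the average on every request
    async def myview():
        return await Item.objects.aaggregate(Avg("value"))

    # Ingest time: compute once per load and store the number.
    # aaggregate() returns a dict, so pull the value out by key.
    stats = await Item.objects.aaggregate(avg=Avg("value"))
    await ItemStats.objects.aupdate_or_create(
        defaults={"avg": stats["avg"]})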

Slide 13

Slide 13 text

Loaders
update_or_create
Or maybe DRF

Slide 14

Slide 14 text

Async & Django
Yes, you can really use it!
More on the ORM later

Slide 15

Slide 15 text

Why async?
- Everything in one box
- Fewer services, fewer problems
- Portability and development env

Slide 16

Slide 16 text

Monoliths!
Reports of my death have been greatly exaggerated

Slide 17

Slide 17 text

Writing an async app
- Normal models
- Async views
- Async middleware if needed

Slide 18

Slide 18 text

Running an async app
- Async server not included
- python -m uvicorn etl.asgi:application
- --reload for development
- --access-log --no-server-header
- --ssl-keyfile=... --ssl-certfile=...

Slide 19

Slide 19 text

Background tasks
- Celery beat?
- Cron?
- Async!

Slide 20

Slide 20 text

Task factories

    async def do_every(fn, n):
        while True:
            await fn()
            await asyncio.sleep(n)
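One possible hardening of the task factory, sketched under the assumption that a failed run should be logged and retried on the next tick rather than ending the loop. The `flaky` job and the `demo` driver are hypothetical and exist only to exercise the loop.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def do_every(fn, n):
    """Call the coroutine function fn forever, sleeping n seconds between runs."""
    while True:
        try:
            await fn()
        except Exception:
            # Log and keep looping so the next run gets a chance to converge.
            # CancelledError is not an Exception subclass, so cancel() still works.
            logger.exception("periodic task %r failed", fn)
        await asyncio.sleep(n)


async def demo():
    calls = []

    async def flaky():
        calls.append(len(calls))
        if len(calls) == 1:
            raise RuntimeError("first run fails")

    task = asyncio.create_task(do_every(flaky, 0.01))
    await asyncio.sleep(0.1)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return calls


calls = asyncio.run(demo())
```

Without the `try`/`except`, the first exception would end the task silently and the job would never run again until the process restarts.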

Slide 21

Slide 21 text

Django integration

    class MyAppConfig(AppConfig):
        def ready(self):
            coro = do_every(my_extractor, 30)
            asyncio.create_task(coro)

Slide 22

Slide 22 text

The other shoe
- Crashes can happen
- Plan for convergence
- Think about failure modes
- Make task status models if needed

Slide 23

Slide 23 text

Modeling for failure
- create_task(send_email())
- await Emails.objects.acreate(...)
- create_task(do_every(send_all_emails))
- await email.adelete()
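The pattern in these bullets can be read as: record the work first, let a periodic task drain it, and delete a row only after it succeeds. A rough sketch of that reading follows, with an in-memory list standing in for the `Emails` table and a hypothetical `deliver` function that fails once to simulate an outage.

```python
import asyncio

# Hypothetical in-memory stand-in for the Emails table.
outbox = []
sent = []
fail_once = {"welcome"}


async def deliver(email):
    """Pretend to send; fail the first attempt for selected subjects."""
    if email["subject"] in fail_once:
        fail_once.discard(email["subject"])
        raise ConnectionError("SMTP unavailable")
    sent.append(email["subject"])


async def send_all_emails():
    """Try every queued email; remove a row only after it was delivered."""
    for email in list(outbox):
        try:
            await deliver(email)
        except ConnectionError:
            continue  # The row stays queued; the next pass retries it.
        outbox.remove(email)


async def demo():
    outbox.append({"subject": "welcome"})
    outbox.append({"subject": "report"})
    await send_all_emails()  # "welcome" fails and stays queued
    pending_after_first = [e["subject"] for e in outbox]
    await send_all_emails()  # the retry succeeds
    return pending_after_first


pending_after_first = asyncio.run(demo())
```

Because the queue is the source of truth, a crash between passes loses nothing: whatever is still queued is picked up by the next run.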

Slide 24

Slide 24 text

Async ORM
- Mind your a*s
- Transactions and sync_to_async
- Don't worry about concurrency
- Still improving

Slide 25

Slide 25 text

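Async transactions

    from asgiref.sync import sync_to_async
    from django.db import transaction

    # Keep the transaction in a sync function and call it from async code
    @transaction.atomic
    def _transaction():
        Foo.objects.create()
        Bar.objects.create()

    async def my_view():
        await sync_to_async(_transaction)()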

Slide 26

Slide 26 text

Async HTTP
HTTPX
Or AIOHTTP

Slide 27

Slide 27 text

Examples!

Slide 28

Slide 28 text

The simple case

    async def scrape_things():
        resp = await client.get("/api/things/")
        resp.raise_for_status()
        for row in resp.json():
            await Thing.objects.aupdate_or_create(
                id=row["key"],
                defaults={"name": row["fullName"]},
            )

Slide 29

Slide 29 text

Foreign keys

    # {"user": 25}
    defaults={"user_id": row["user"]}

    # {"user": "foo"}
    user = await User.objects.only("id")\
        .aget(email=row["user"])
    defaults={"user": user}

Slide 30

Slide 30 text

DRF serializers

    item = await Item.objects.filter(id=data["id"])\
        .afirst()
    ser = ItemAPISerializer(instance=item, data=data)
    await sync_to_async(ser.is_valid)(raise_exception=True)
    await sync_to_async(ser.save)()

Slide 31

Slide 31 text

Decorators
- autodiscover_modules("tasks")
- _registry = {}
- register_to=mod

    def every(n):
        def decorator(fn):
            _registry[fn] = n
            return fn
        return decorator

Slide 32

Slide 32 text

Real-er cron
- Croniter is amazing
- In-memory or database
- Loop and check next <= now

Slide 33

Slide 33 text

    while True:
        for cron in decorators._registry.values():
            if (cron.next_run_at is None or cron.next_run_at <= now) and (
                cron.previous_started_at is None
                or (
                    cron.previous_finished_at is not None
                    and cron.previous_started_at <= cron.previous_finished_at
                )
            ):
                cron.previous_started_at = now
                asyncio.create_task(_do_cron(cron))
        await asyncio.sleep(1)

Slide 34

Slide 34 text

    await cron.fn()
    cron.previous_finished_at = now
    cron.next_run_at = (
        croniter(cron.cronspec, now, hash_id=cron.name)
        .get_next(ret_type=datetime)
        .replace(tzinfo=dtimezone.utc)
    )
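The "is this job due?" condition from the scheduler loop is easier to check when pulled into a pure function. The sketch below assumes a simple dataclass holding the same three timestamps; `CronState` and `is_due` are hypothetical names, and croniter is left out so that only the decision logic is shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CronState:
    next_run_at: Optional[datetime] = None
    previous_started_at: Optional[datetime] = None
    previous_finished_at: Optional[datetime] = None


def is_due(cron: CronState, now: datetime) -> bool:
    """Due when the scheduled time has passed and no earlier run is in flight."""
    time_reached = cron.next_run_at is None or cron.next_run_at <= now
    not_running = cron.previous_started_at is None or (
        cron.previous_finished_at is not None
        and cron.previous_started_at <= cron.previous_finished_at
    )
    return time_reached and not_running


now = datetime(2023, 10, 16, 12, 0, tzinfo=timezone.utc)
never_run = CronState()
still_running = CronState(previous_started_at=now - timedelta(seconds=5))
finished_and_due = CronState(
    next_run_at=now - timedelta(seconds=1),
    previous_started_at=now - timedelta(minutes=1),
    previous_finished_at=now - timedelta(seconds=30),
)
not_yet = CronState(
    next_run_at=now + timedelta(minutes=5),
    previous_started_at=now - timedelta(minutes=1),
    previous_finished_at=now - timedelta(seconds=30),
)
```

The second clause prevents overlapping runs: a job that started but has not recorded a finish time is skipped until it completes.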

Slide 35

Slide 35 text

History
- django-pghistory
- @pghistory.track(pghistory.Snapshot(), exclude=["modified_at"])

Slide 36

Slide 36 text

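Incremental load

    from django.db.models import Max

    # Highest id already loaded; None on the first run, hence "or 0"
    latest = await Trades.objects.aaggregate(Max("id"))
    # HTTPX and AIOHTTP both take query-string values as params=
    resp = await client.get(...,
        params={"since": latest.get("id__max") or 0})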
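The same high-water-mark idea can be sketched without Django or HTTP. Here `UPSTREAM`, `local_trades`, and `fetch_since` are hypothetical stand-ins for the remote API and the `Trades` table.

```python
# Hypothetical upstream feed and local table, both plain lists of dicts.
UPSTREAM = [{"id": i, "price": 10 * i} for i in range(1, 6)]
local_trades = []


def fetch_since(since):
    """Stand-in for the HTTP call: return only rows newer than `since`."""
    return [row for row in UPSTREAM if row["id"] > since]


def incremental_load():
    """Load only rows above the highest id already stored locally."""
    high_water = max((row["id"] for row in local_trades), default=0)
    new_rows = fetch_since(high_water)
    local_trades.extend(new_rows)
    return len(new_rows)


first = incremental_load()   # Empty table: everything is new.
UPSTREAM.append({"id": 6, "price": 60})
second = incremental_load()  # Only the row added since the last run.
third = incremental_load()   # Nothing new upstream.
```

This only works when upstream ids increase monotonically; a feed that backfills older ids would need a different cursor, such as a modification timestamp.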

Slide 37

Slide 37 text

Multi-stage transforms

    one = await step_one()
    two = await step_two(one)
    three, four = await gather(
        step_three(two),
        step_four(two),
    )
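A runnable version of the staged pipeline, with trivial placeholder steps. The step bodies are invented for illustration; the point is that steps one and two run in sequence while three and four, which both depend only on step two, run concurrently.

```python
import asyncio


async def step_one():
    return [3, 1, 2]


async def step_two(rows):
    return sorted(rows)


async def step_three(rows):
    return sum(rows)


async def step_four(rows):
    return max(rows)


async def pipeline():
    one = await step_one()
    two = await step_two(one)
    # gather() returns results in the order the awaitables were passed.
    three, four = await asyncio.gather(step_three(two), step_four(two))
    return three, four


result = asyncio.run(pipeline())
```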

Slide 38

Slide 38 text

We have data
Now what?

Slide 39

Slide 39 text

GraphQL
Here be dragons
Use only for small
Or very cachable

Slide 40

Slide 40 text

Query basics

    query {
      topLevelField(filter: {field1: "foo"}) {
        field2
        field3 {
          nested1
        }
      }
    }

Slide 41

Slide 41 text

Why GraphQL?

    query {
      items {
        name, image, value
        requiredFor {
          quantity,
          quest {
            title, image, text

Slide 42

Slide 42 text

Strawberry

Slide 43

Slide 43 text

Model types

    @gql.django.type(models.Item)
    class Item:
        id: int
        name: auto
        image: auto
        required_for: list[QuestRequired]
        reward_for: list[QuestReward]

Slide 44

Slide 44 text

Site generators
Gatsby
Next.js
Pelican

Slide 45

Slide 45 text

Subscriptions
- Channels
- In-memory?
- Channels-Postgres otherwise
- ASGI Router

Slide 46

Slide 46 text

Enough about boring stuff
Async for fun?

Slide 47

Slide 47 text

Discord-py
Or Bolt-python for Slack
Or Bottom for IRC if you're old

Slide 48

Slide 48 text

    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents)

    class BotConfig(AppConfig):
        def ready(self):
            create_task(client.start(token))

Slide 49

Slide 49 text

Chat bot

    @client.event
    async def on_message(message):
        if message.author == client.user:
            return
        if message.content == "!count":
            n = await Thing.objects.acount()
            await message.channel.send(
                f"There are {n} things")

Slide 50

Slide 50 text

Notifications

    async def scrape_things():
        # Do the ETL
        ...
        channel = client.get_channel(id)
        await channel.send("Batch complete")

Slide 51

Slide 51 text

Email

    # Standard library message class, not django.core.mail.EmailMessage
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "etl@server"
    msg["To"] = "me@example.com"
    msg["Subject"] = "Batch complete"
    msg.set_content(log)
    await aiosmtplib.send(msg)

Slide 52

Slide 52 text

Let's get weirder: SSH

    async def handle_ssh(proc):
        proc.stdout.write('Welcome!\n> ')
        # readline() keeps the trailing newline, so strip it before comparing
        cmd = (await proc.stdin.readline()).strip()
        if cmd == 'count':
            n = await Thing.objects.acount()
            proc.stdout.write(
                f"There are {n} things\n")

    def ready():
        create_task(
            asyncssh.listen(process_factory=handle_ssh))

Slide 53

Slide 53 text

The sky's the limit
Streamdeck
Aioserial
Twisted

Slide 54

Slide 54 text

Back to serious
Scaling? Let's go

Slide 55

Slide 55 text

Ingest sharding
- Too much data to load?
- if hash(url) % 2 == server_id
- Adjust your aggregations
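One caveat when turning the sharding bullet into code: Python's built-in `hash()` for strings is randomized per process unless `PYTHONHASHSEED` is fixed, so two servers would not agree on which shard a URL belongs to. A sketch using a stable digest instead follows; `shard_for`, `should_ingest`, and the example URLs are hypothetical.

```python
import hashlib


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard number that is identical on every server."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def should_ingest(url: str, server_id: int, num_shards: int = 2) -> bool:
    """True when this server is responsible for loading the given URL."""
    return shard_for(url, num_shards) == server_id


urls = [f"https://example.com/api/things/{i}" for i in range(20)]
# Count how many servers claim each URL; every entry should be exactly 1.
claims = [sum(should_ingest(u, sid) for sid in range(2)) for u in urls]
```

Every URL is claimed by exactly one server, so the shards never double-load a row and never skip one.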

Slide 56

Slide 56 text

CPU bound?
- Does the library drop the GIL?
- sync_to_async(…, thread_sensitive=False)
- ProcessPoolExecutor or aiomultiprocess
- (PEP 703 in the future?)
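As an illustration of the "does the library drop the GIL?" case, the sketch below hashes several large buffers in worker threads. It uses `asyncio.to_thread` from the standard library as a stand-in for `sync_to_async(..., thread_sensitive=False)`, and relies on `hashlib` releasing the GIL while hashing large inputs. The `checksum` helpers and the buffer sizes are assumptions for the example.

```python
import asyncio
import hashlib


def checksum(blob: bytes) -> str:
    """CPU-heavy sync function; hashlib releases the GIL for large inputs."""
    return hashlib.sha256(blob).hexdigest()


async def checksum_all(blobs):
    # Each call runs in a worker thread, so the event loop stays responsive.
    return await asyncio.gather(
        *(asyncio.to_thread(checksum, blob) for blob in blobs)
    )


blobs = [bytes([i]) * 1_000_000 for i in range(4)]
digests = asyncio.run(checksum_all(blobs))
```

For pure-Python work that holds the GIL, threads give no speedup, and a process pool is the more likely fit.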

Slide 57

Slide 57 text

Big still allowed
Microservices too

Slide 58

Slide 58 text

In review
- ETL systems are useful for massaging data
- Async Django is great for building ETLs
- GraphQL is an excellent way to query
- There are many cool async libraries
- Our tools can grow as our needs grow

Slide 59

Slide 59 text

Thank you
