Swiss Army Django: Small Footprint ETL

Presented at DjangoCon AU 2023

Noah Kantrowitz

August 19, 2023

Transcript

  1. Noah Kantrowitz
     - He/him
     - coderanger.net | cloudisland.nz/@coderanger
     - Kubernetes (ContribEx) and Python (webmaster@)
     - SRE/Platform for Geomagical Labs, part of IKEA
     - We do CV/AR for the home
  2. ELT?
     - Extract, Load, Transform
     - Keep raw data
     - Transform it again and again
     - Cool but not very small
  3. The shape of things
     - Extractors: scraper/importer cron jobs
     - Transforms: parse HTML/XML/JSON
     - Reshape the data to make it easy to query
     - Possibly import-time aggregations like averages
     - Loaders: Django ORM, maybe DRF or Pydantic
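
The three stages listed on this slide can be sketched in plain Python. This is only an illustration under assumptions: the payload shape, the field names (`key`, `fullName`, `price`), and the in-memory `store` dict are hypothetical stand-ins, and in the design the talk describes the loader would be the Django ORM rather than a dict.

```python
import json
from statistics import mean


def extract() -> str:
    """Extractor: a real one would scrape or call an API on a schedule.

    Here it returns a canned JSON payload (hypothetical data).
    """
    return json.dumps([
        {"key": 1, "fullName": "Widget", "price": 10.0},
        {"key": 2, "fullName": "Gadget", "price": 30.0},
    ])


def transform(raw: str) -> tuple[list[dict], dict]:
    """Transform: parse, reshape for easy querying, pre-compute aggregates."""
    rows = [
        {"id": item["key"], "name": item["fullName"], "price": item["price"]}
        for item in json.loads(raw)
    ]
    stats = {"mean_price": mean(row["price"] for row in rows)}
    return rows, stats


def load(rows: list[dict], store: dict) -> None:
    """Loader: upsert by primary key (a dict stands in for the database)."""
    for row in rows:
        store[row["id"]] = row


store: dict = {}
rows, stats = transform(extract())
load(rows, store)
```

Computing `mean_price` at import time is the kind of pre-computed aggregation the slide mentions: the query side then reads a stored number instead of aggregating on every request.
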
  4. Async & Django
     Yes, you can really use it! More on the ORM later.
  5. Why async?
     - Everything in one box
     - Fewer services, fewer problems
     - Portability and development env
  6. Running an async app
     - Async server not included
     - python -m uvicorn etl.asgi:application
     - --reload for development
     - --access-log --no-server-header
     - --ssl-keyfile=... --ssl-certfile=...
  7. Task factories
         import asyncio

         async def do_every(fn, n):
             # Run the coroutine function fn, then sleep n seconds, forever.
             while True:
                 await fn()
                 await asyncio.sleep(n)
  8. The other shoe
     - Crashes can happen
     - Plan for convergence
     - Think about failure modes
     - Make task status models if needed
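
One failure mode worth planning for is a single run raising an exception and ending the periodic loop for good. The sketch below is an assumption-laden variant of the `do_every` helper from the previous slide, not the speaker's code: the name `do_every_safely` and the logging setup are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def do_every_safely(fn, n):
    """Call the coroutine function fn every n seconds, surviving failed runs."""
    while True:
        try:
            await fn()
        except Exception:
            # Log and carry on so the next run gets a chance to converge.
            # asyncio.CancelledError derives from BaseException, so task
            # cancellation still propagates and shutdown stays clean.
            logger.exception("Periodic task %r failed", fn)
        await asyncio.sleep(n)
```

This only helps if each run is convergent, meaning that repeating it after a partial failure brings the data to the same end state.
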
  9. Async ORM
     - Mind your a*s
     - Transactions and sync_to_async
     - Don't worry about concurrency
     - Still improving
  10. The simple case
          async def scrape_things():
              resp = await client.get("/api/things/")
              resp.raise_for_status()
              for row in resp.json():
                  await Thing.objects.aupdate_or_create(
                      id=row["key"],
                      defaults={"name": row["fullName"]},
                  )
  11. DRF serializers
          item = await Item.objects.filter(id=data["id"]).afirst()
          ser = ItemAPISerializer(instance=item, data=data)
          await sync_to_async(ser.is_valid)(raise_exception=True)
          await sync_to_async(ser.save)()
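  12. Diff delete
          seen_ids = []
          for row in data:
              # aupdate_or_create() must be awaited and returns (object, created)
              thing, _created = await ...aupdate_or_create(...)
              seen_ids.append(thing.id)
          # Delete every row that was not present in this batch
          await ...exclude(id__in=seen_ids).adelete()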
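
The diff-delete pattern on this slide can be shown without a database. In this sketch a plain dict stands in for the table; the function name `sync_rows` and the row shape are hypothetical, chosen only to make the upsert-then-delete logic visible.

```python
def sync_rows(store: dict[int, dict], data: list[dict]) -> set[int]:
    """Upsert every incoming row, then delete rows missing from the source.

    Returns the ids that were deleted.
    """
    seen_ids = set()
    for row in data:
        store[row["id"]] = row  # upsert: insert new rows, overwrite old ones
        seen_ids.add(row["id"])
    # Anything in the store that the source no longer reports is stale.
    stale_ids = set(store) - seen_ids
    for stale_id in stale_ids:
        del store[stale_id]
    return stale_ids


store = {1: {"id": 1, "name": "old"}, 2: {"id": 2, "name": "gone upstream"}}
deleted = sync_rows(store, [{"id": 1, "name": "new"}, {"id": 3, "name": "added"}])
```

One caveat with this pattern: an empty or truncated response from the source would delete every row, so a guard on suspiciously small batches may be worth adding.
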
  13. Why GraphQL?
          query {
            items {
              name, image, value
              requiredFor {
                quantity,
                quest {
                  title, image, text
                }
              }
            }
          }
  14. When not to use GraphQL
      - Big data - it's too slow
      - Numeric aggregation - "mean where type = foo"
      - But you can pre-compute
      - Poorly linked data - no ForeignKeys
  15. Schema
          @gql.type
          class Query:
              items: list[Item] = field()
              quests: list[Quest] = field()

          schema = gql.Schema(query=Query)
  16. Model Types
          @gql.django.type(models.Item)
          class Item:
              id: int
              name: auto
              image: auto
              required_for: list[QuestRequired]
              reward_for: list[QuestReward]
  17. Filters and Orders
          @gql.django.filter(models.Item)
          class ItemFilter:
              id: auto
              can_mail: auto

          @gql.django.order(models.Item)
          class ItemOrder:
              id: auto
              name: auto
  18. Performance?
          query {
            items {
              id, name
              requiredFor {
                quantity,
                quest {
                  id, title
                }
              }
            }
          }
  19. Strawberry Views
      Don't use their async view; it doesn't like Django.
          path("graphql", GraphQLView.as_view(schema=schema))
  20. (untitled slide)
          intents = discord.Intents.default()
          intents.message_content = True
          client = discord.Client(intents=intents)

          class BotConfig(AppConfig):
              def ready(self):
                  create_task(client.start(token))
  21. Control the loop
      - client.run() - expects to own the loop
      - client.start() - cooperates with others
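
The distinction on this slide can be illustrated with plain asyncio. In the sketch below, `fake_bot` and `fake_scraper` are hypothetical stand-ins for a `start()`-style client coroutine and a periodic ETL task; both are scheduled on a loop that is already running, so neither has to own it.

```python
import asyncio


async def fake_bot(events: list[str]) -> None:
    """Stand-in for a long-running start()-style coroutine."""
    while True:
        events.append("bot")
        await asyncio.sleep(0.01)


async def fake_scraper(events: list[str]) -> None:
    """Stand-in for a periodic ETL task."""
    while True:
        events.append("scraper")
        await asyncio.sleep(0.01)


async def main(run_for: float = 0.05) -> list[str]:
    events: list[str] = []
    # Schedule both coroutines on the current loop; they take turns
    # whenever one of them awaits.
    tasks = [
        asyncio.create_task(fake_bot(events)),
        asyncio.create_task(fake_scraper(events)),
    ]
    await asyncio.sleep(run_for)
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return events


events = asyncio.run(main())
```

A blocking `run()`-style entry point generally starts and manages its own event loop, which is why it cannot be called from inside a loop that a server such as uvicorn is already running.
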
  22. Chat bot
          @client.event
          async def on_message(message):
              if message.author == client.user:
                  return
              if message.content == "!count":
                  n = await Thing.objects.acount()
                  await message.channel.send(f"There are {n} things")
  23. Notifications
          async def scrape_things():
              # Do the ETL ...
              channel = client.get_channel(id)
              await channel.send("Batch complete")
  24. Let's get weirder: SSH
          async def handle_ssh(proc):
              proc.stdout.write('Welcome!\n> ')
              # readline() keeps the trailing newline, so strip it before comparing
              cmd = (await proc.stdin.readline()).strip()
              if cmd == 'count':
                  n = await Thing.objects.acount()
                  proc.stdout.write(f"There are {n} things\n")

          def ready():
              create_task(asyncssh.listen(process_factory=handle_ssh))
  25. Ingest sharding
      - Too much data to load?
      - if hash(url) % 2 == server_id
      - Adjust your aggregations
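
One practical detail when applying the sharding rule on this slide: Python's built-in `hash()` is salted per process for strings by default, so two servers would not agree on which shard a URL belongs to. A minimal sketch of a stable alternative follows, assuming SHA-256 as the hash; the helper names `shard_for` and `should_ingest` are hypothetical.

```python
import hashlib


def shard_for(url: str, num_servers: int) -> int:
    """Map a URL to a shard index that is stable across processes and hosts."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # The first 8 bytes are plenty for an evenly spread modulo.
    return int.from_bytes(digest[:8], "big") % num_servers


def should_ingest(url: str, server_id: int, num_servers: int = 2) -> bool:
    """True if this server is responsible for ingesting the given URL."""
    return shard_for(url, num_servers) == server_id
```

Each extractor would call `should_ingest(url, server_id)` before fetching, so every URL is handled by exactly one server.
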
  26. CPU bound?
      - Does the library drop the GIL?
      - sync_to_async(…, thread_sensitive=False)
      - ProcessPoolExecutor or aiomultiprocess
      - (PEP 703 in the future?)
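
The executor option on this slide can be sketched with the standard library alone. The example below hands blocking work to an executor through `loop.run_in_executor`; `count_primes` and `run_in_pool` are hypothetical names. It uses a `ThreadPoolExecutor` to stay portable, which only helps when the work releases the GIL; for pure-Python number crunching like this, the slide's `ProcessPoolExecutor` would be the swap-in.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor


def count_primes(limit: int) -> int:
    """Deliberately CPU-heavy pure-Python work (hypothetical example)."""
    return sum(
        all(n % d for d in range(2, int(n ** 0.5) + 1))
        for n in range(2, limit)
    )


async def run_in_pool(pool: Executor, fn, *args):
    """Run a blocking function in an executor without stalling the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, fn, *args)


async def main() -> int:
    # Swap in concurrent.futures.ProcessPoolExecutor for work that holds
    # the GIL; the function and its arguments must then be picklable.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await run_in_pool(pool, count_primes, 100)


result = asyncio.run(main())
```

Passing the executor in as a parameter keeps the choice between threads and processes in one place.
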
  27. In review
      - ETL systems are useful for massaging data
      - Async Django is great for building ETLs
      - GraphQL is an excellent way to query
      - There are many cool async libraries
      - Our tools can grow as our needs grow