Slide 1

Slide 1 text

Swiss Army Django
Small Footprint ETL
Noah Kantrowitz - coderanger.net - August 18, 2023
DjangoCon AU 2023

Slide 2

Slide 2 text

Noah Kantrowitz
—He/him
—coderanger.net | cloudisland.nz/@coderanger
—Kubernetes (ContribEx) and Python (webmaster@)
—SRE/Platform for Geomagical Labs, part of IKEA
—We do CV/AR for the home

Slide 3

Slide 3 text

ETL

Slide 4

Slide 4 text

Case Study: Farm RPG

Slide 5

Slide 5 text

Game Data

Slide 6

Slide 6 text

Why build this?
It's fun!
And educational!

Slide 7

Slide 7 text

What is ETL?
Extract
Transform
Load

Slide 8

Slide 8 text

Web Scrapers

Slide 9

Slide 9 text

ELT?
—Extract, Load, Transform
—Keep raw data
—Transform it again and again
—Cool but not very small

Slide 10

Slide 10 text

Scrape Responsibly

Slide 11

Slide 11 text

The shape of things
—Extractors: scraper/importer cron jobs
—Transforms: parse HTML/XML/JSON
—Reshape the data to make it easy to query
—Possibly import-time aggregations like averages
—Loaders: Django ORM, maybe DRF or Pydantic
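
As a rough illustration of the three stages listed on this slide, here is a minimal sketch in plain Python. The payload, the field names (`key`, `fullName`, `price`), and the in-memory `store` dict are all made up for the example; in the talk's setup the extractor would be an HTTP scraper and the loader would be the Django ORM.

```python
import json
from statistics import mean

# Hypothetical raw payload; a real extractor would fetch this over HTTP.
RAW = '[{"key": 1, "fullName": "Apple", "price": 4}, {"key": 2, "fullName": "Pear", "price": 6}]'


def extract() -> str:
    """Extract: return the raw document exactly as the source served it."""
    return RAW


def transform(raw: str) -> list[dict]:
    """Transform: parse the JSON and reshape rows so they are easy to query."""
    return [
        {"id": row["key"], "name": row["fullName"], "price": row["price"]}
        for row in json.loads(raw)
    ]


def load(rows: list[dict], store: dict) -> None:
    """Load: upsert rows by id and refresh an import-time aggregate."""
    for row in rows:
        store.setdefault("things", {})[row["id"]] = row
    store["average_price"] = mean(r["price"] for r in store["things"].values())


store: dict = {}
load(transform(extract()), store)
```

Keeping the stages as separate functions means each one can be swapped independently, for example replacing `load` with ORM calls while leaving `transform` untouched.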

Slide 12

Slide 12 text

Async & Django
Yes, you can really use it!
More on the ORM later

Slide 13

Slide 13 text

Why async?
—Everything in one box
—Fewer services, fewer problems
—Portability and development env

Slide 14

Slide 14 text

Monoliths!
Reports of my death have been greatly exaggerated

Slide 15

Slide 15 text

Writing an async app
—Normal models
—Async views
—Async middleware if needed

Slide 16

Slide 16 text

Running an async app
—Async server not included
—python -m uvicorn etl.asgi:application
—--reload for development
—--access-log --no-server-header
—--ssl-keyfile=... --ssl-certfile=...

Slide 17

Slide 17 text

Background tasks

Slide 18

Slide 18 text

Task factories

    import asyncio

    async def do_every(fn, n):
        while True:
            await fn()
            await asyncio.sleep(n)
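
One thing to watch with the loop on this slide: if `fn()` raises, the exception ends the `while` loop and the task never runs again. The following is a sketch of one possible variant that logs the failure and keeps going. The names `do_every_safely`, `flaky`, and `demo` are hypothetical and not from the talk.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def do_every_safely(fn, n):
    """Call fn() every n seconds, logging failures instead of exiting."""
    while True:
        try:
            await fn()
        except Exception:
            # CancelledError is a BaseException, so cancellation still
            # propagates and stops the loop.
            logger.exception("periodic task %r failed", fn)
        await asyncio.sleep(n)


async def demo():
    calls = []

    async def flaky():
        calls.append(len(calls))
        if len(calls) == 1:
            raise RuntimeError("first run fails")

    task = asyncio.create_task(do_every_safely(flaky, 0.01))
    await asyncio.sleep(0.1)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(calls)
```

Running `asyncio.run(demo())` shows the task being called again after its first run failed.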

Slide 19

Slide 19 text

Django integration

    import asyncio
    from django.apps import AppConfig

    class MyAppConfig(AppConfig):
        def ready(self):
            # create_task() requires a running event loop
            coro = do_every(my_extractor, 30)
            asyncio.create_task(coro)
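
The asyncio documentation notes that the event loop only keeps weak references to tasks, so a task whose return value is discarded can in principle be garbage collected before it finishes. A minimal sketch of one way to guard against that is below. The helper name `spawn` and the module-level `_background_tasks` set are assumptions made for this example, not part of the talk.

```python
import asyncio

# Strong references to running background tasks (hypothetical module global).
_background_tasks: set = set()


def spawn(coro) -> asyncio.Task:
    """Schedule coro on the running loop and keep it referenced until done."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    # Drop the reference once the task finishes so the set does not grow.
    task.add_done_callback(_background_tasks.discard)
    return task


async def demo():
    async def work():
        await asyncio.sleep(0)
        return "done"

    task = spawn(work())
    tracked_while_running = task in _background_tasks
    result = await task
    await asyncio.sleep(0)  # give the done callback a chance to run
    return tracked_while_running, result, len(_background_tasks)
```

In `ready()`, a helper like this could be called in place of the bare `asyncio.create_task(coro)`.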

Slide 20

Slide 20 text

The other shoe
—Crashes can happen
—Plan for convergence
—Think about failure modes
—Make task status models if needed

Slide 21

Slide 21 text

Modeling for failure
—create_task(send_email())
—await Emails.objects.acreate(...)
—create_task(do_every(send_all_emails))
—await email.adelete()
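
One reading of this slide is: rather than firing off `send_email()` directly, record the intent as a row, let a periodic task work through the rows, and delete each row only after its send succeeded. The sketch below shows that idea with an in-memory dict standing in for the `Emails` table. The failing `send_email` and the `pending` and `sent` containers are invented for the illustration.

```python
import asyncio

# Stand-in for the Emails table; the talk stores these rows with the Django ORM.
pending: dict[int, str] = {}
sent: list[str] = []
_attempts = 0


async def send_email(address: str) -> None:
    """Hypothetical sender that fails on its first call to simulate a crash."""
    global _attempts
    _attempts += 1
    if _attempts == 1:
        raise ConnectionError("mail server unavailable")
    sent.append(address)


async def send_all_emails() -> None:
    """Try every pending row; delete a row only after its send succeeded."""
    for row_id, address in list(pending.items()):
        try:
            await send_email(address)
        except ConnectionError:
            continue  # the row stays queued and is retried on the next pass
        del pending[row_id]


async def demo():
    pending[1] = "a@example.com"  # like Emails.objects.acreate(...)
    pending[2] = "b@example.com"
    await send_all_emails()  # first pass: row 1 fails, row 2 is sent
    after_first = sorted(pending)
    await send_all_emails()  # second pass converges
    return after_first, sorted(pending), sent
```

Because the pending work lives in storage rather than in a task object, a crash between passes loses nothing: the next pass picks up whatever is still queued.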

Slide 22

Slide 22 text

Async ORM
—Mind your a*s
—Transactions and sync_to_async
—Don't worry about concurrency
—Still improving

Slide 23

Slide 23 text

Async transactions

    from asgiref.sync import sync_to_async
    from django.db import transaction

    @transaction.atomic
    def _transaction():
        Foo.objects.create()
        Bar.objects.create()

    async def my_view():
        await sync_to_async(_transaction)()

Slide 24

Slide 24 text

Async HTTP
HTTPX
Or AIOHTTP

Slide 25

Slide 25 text

Examples!

Slide 26

Slide 26 text

The simple case

    async def scrape_things():
        resp = await client.get("/api/things/")
        resp.raise_for_status()
        for row in resp.json():
            await Thing.objects.aupdate_or_create(
                id=row["key"],
                defaults={"name": row["fullName"]},
            )

Slide 27

Slide 27 text

DRF serializers

    item = await Item.objects.filter(id=data["id"])\
        .afirst()
    ser = ItemAPISerializer(instance=item, data=data)
    await sync_to_async(ser.is_valid)(raise_exception=True)
    await sync_to_async(ser.save)()

Slide 28

Slide 28 text

Diff delete

    seen_ids = []
    for row in data:
        # aupdate_or_create() must be awaited and returns (object, created)
        thing, _ = await ...aupdate_or_create(...)
        seen_ids.append(thing.id)
    await ...exclude(id__in=seen_ids).adelete()
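
To make the diff-delete idea concrete without a database, here is a small sketch that uses a plain dict as the "table". The function name `sync_table` and the sample rows are hypothetical; the comments point at the ORM call each step stands in for.

```python
def sync_table(table: dict, rows: list[dict]) -> set:
    """Upsert every scraped row, then delete ids the source no longer lists."""
    seen_ids = set()
    for row in rows:
        table[row["id"]] = row["name"]  # like aupdate_or_create(...)
        seen_ids.add(row["id"])
    stale = set(table) - seen_ids  # like exclude(id__in=seen_ids)
    for stale_id in stale:
        del table[stale_id]  # like adelete()
    return stale


table = {1: "Apple", 2: "Pear", 3: "Plum"}
removed = sync_table(table, [{"id": 1, "name": "Apple"}, {"id": 3, "name": "Damson"}])
```

After the call, row 2 is gone because the source stopped listing it, and row 3 carries its updated name.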

Slide 29

Slide 29 text

Decorators
—autodiscover_modules("tasks")
—_registry = {}
—register_to=mod

    def every(n):
        def decorator(fn):
            _registry[fn] = n
            return fn
        return decorator

Slide 30

Slide 30 text

Real-er cron
—Croniter is amazing
—In-memory or database
—Loop and check now >= next
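
The "loop and check" step can be sketched with only the standard library. In this example the schedule is an in-memory dict and fixed intervals stand in for cron expressions. The function `due_jobs`, the `INTERVALS` table, and the job names are all hypothetical. With croniter, the `next_after` callable could wrap `croniter(expr, now).get_next(datetime)`.

```python
from datetime import datetime, timedelta


def due_jobs(schedule: dict, now: datetime, next_after) -> list:
    """Return names of jobs whose next run time has passed, and reschedule them.

    schedule maps job name -> next run time; next_after(name, now) returns
    the following run time for that job.
    """
    due = []
    for name, next_run in schedule.items():
        if now >= next_run:
            due.append(name)
            schedule[name] = next_after(name, now)
    return due


# Hypothetical fixed intervals standing in for cron expressions.
INTERVALS = {"scrape": timedelta(minutes=5), "report": timedelta(hours=1)}
start = datetime(2023, 8, 18, 12, 0)
schedule = {"scrape": start, "report": start + timedelta(minutes=30)}


def next_run(name, now):
    return now + INTERVALS[name]


first = due_jobs(schedule, start, next_run)
second = due_jobs(schedule, start + timedelta(minutes=1), next_run)
```

A background task would call `due_jobs` on each pass of its loop and run whatever comes back. Storing `schedule` in the database instead of memory lets the next run times survive a restart.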

Slide 31

Slide 31 text

We have data
Now what?

Slide 32

Slide 32 text

GraphQL
Here be dragons
Use only for small
Or very cacheable

Slide 33

Slide 33 text

Query Basics

    query {
      topLevelField(filter: {field1: "foo"}) {
        field2
        field3 {
          nested1
        }
      }
    }

Slide 34

Slide 34 text

Why GraphQL?

    query {
      items {
        name, image, value
        requiredFor {
          quantity,
          quest {
            title, image, text

Slide 35

Slide 35 text

When not to use GraphQL
—Big data - it's too slow
—Numeric aggregation - "mean where type = foo"
—But you can pre-compute
—Poorly linked data - no ForeignKeys

Slide 36

Slide 36 text

Strawberry

Slide 37

Slide 37 text

Schema

    @gql.type
    class Query:
        items: list[Item] = field()
        quests: list[Quest] = field()

    schema = gql.Schema(query=Query)

Slide 38

Slide 38 text

Model Types

    @gql.django.type(models.Item)
    class Item:
        id: int
        name: auto
        image: auto
        required_for: list[QuestRequired]
        reward_for: list[QuestReward]

Slide 39

Slide 39 text

Filters and Orders

    @gql.django.filter(models.Item)
    class ItemFilter:
        id: auto
        can_mail: auto

    @gql.django.order(models.Item)
    class ItemOrder:
        id: auto
        name: auto

Slide 40

Slide 40 text

Performance?

    query {
      items {
        id, name
        requiredFor {
          quantity,
          quest {
            id, title

Slide 41

Slide 41 text


Slide 42

Slide 42 text


Slide 43

Slide 43 text

Strawberry Views
Don't use their async view, it doesn't like Django

    path("graphql", GraphQLView.as_view(schema=schema))

Slide 44

Slide 44 text

Site Generators
Gatsby
Next.js
Pelican

Slide 45

Slide 45 text

Subscriptions
—Channels
—In-memory?
—Channels-Postgres otherwise
—ASGI Router

Slide 46

Slide 46 text

Enough about boring stuff
Async for fun?

Slide 47

Slide 47 text

Discord-py
Or Bolt-python for Slack
Or Bottom for IRC if you're old

Slide 48

Slide 48 text

    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents)

    class BotConfig(AppConfig):
        def ready(self):
            create_task(client.start(token))

Slide 49

Slide 49 text

Control the loop
—client.run() - expects to own the loop
—client.start() - cooperates with others

Slide 50

Slide 50 text

Chat bot

    @client.event
    async def on_message(message):
        if message.author == client.user:
            return
        if message.content == "!count":
            n = await Thing.objects.acount()
            await message.channel.send(
                f"There are {n} things")

Slide 51

Slide 51 text

Notifications

    async def scrape_things():
        # Do the ETL
        ...
        channel = client.get_channel(id)
        await channel.send("Batch complete")

Slide 52

Slide 52 text

Let's get weirder: SSH

    async def handle_ssh(proc):
        proc.stdout.write('Welcome!\n> ')
        # readline() keeps the trailing newline, so strip it before comparing
        cmd = (await proc.stdin.readline()).strip()
        if cmd == 'count':
            n = await Thing.objects.acount()
            proc.stdout.write(
                f"There are {n} things\n")

    def ready():
        create_task(
            asyncssh.listen(process_factory=handle_ssh))

Slide 53

Slide 53 text

The sky's the limit
Streamdeck
Aioserial
Pytradfri

Slide 54

Slide 54 text

Back to serious
Scaling? Let's go

Slide 55

Slide 55 text

Ingest sharding
—Too much data to load?
—if hash(url) % 2 == server_id
—Adjust your aggregations
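
One caveat when turning the `hash(url) % 2` idea into code: Python's built-in `hash()` for strings is randomized per process by default, so two servers would not agree on which shard a URL belongs to. The sketch below substitutes a `hashlib` digest, which is stable across processes and machines. The function names and example URLs are hypothetical.

```python
import hashlib


def shard_for(url: str, num_servers: int) -> int:
    """Map a URL to a shard index that is identical on every server."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def should_ingest(url: str, server_id: int, num_servers: int = 2) -> bool:
    """Return True if this server is responsible for ingesting the URL."""
    return shard_for(url, num_servers) == server_id


urls = [f"https://example.com/api/things/{i}" for i in range(10)]  # hypothetical
owners = [[sid for sid in range(2) if should_ingest(u, sid)] for u in urls]
```

Every URL ends up with exactly one owner, so no row is scraped twice and none is skipped.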

Slide 56

Slide 56 text

CPU bound?
—Does the library drop the GIL?
—sync_to_async(…, thread_sensitive=False)
—ProcessPoolExecutor or aiomultiprocess
—(PEP 703 in the future?)
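
A standard-library sketch of handing CPU-heavy work to an executor so the event loop stays responsive. It uses `loop.run_in_executor` rather than `sync_to_async`, and a thread pool so the demo runs anywhere. `count_primes`, `run_blocking`, and `demo` are hypothetical names for this example.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor


def count_primes(limit: int) -> int:
    """Deliberately CPU-heavy pure-Python work (hypothetical transform step)."""
    return sum(
        1 for n in range(2, limit)
        if all(n % d for d in range(2, int(n ** 0.5) + 1))
    )


async def run_blocking(executor: Executor, fn, *args):
    """Run fn(*args) in the executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, fn, *args)


async def demo():
    # A thread pool only gives real parallelism if the work releases the GIL.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await asyncio.gather(
            run_blocking(pool, count_primes, 100),
            run_blocking(pool, count_primes, 1000),
        )
```

For pure-Python work like `count_primes`, passing a `ProcessPoolExecutor` instead should give actual parallelism, provided the function and its arguments can be pickled and the program's entry point is guarded with `if __name__ == "__main__":` on platforms that spawn worker processes.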

Slide 57

Slide 57 text

Big still allowed
Microservices too

Slide 58

Slide 58 text

In review
ETL systems are useful for massaging data
Async Django is great for building ETLs
GraphQL is an excellent way to query
There are many cool async libraries
Our tools can grow as our needs grow

Slide 59

Slide 59 text

Thank you

Slide 60

Slide 60 text

Questions?

Slide 61

Slide 61 text
