

Let's get you started with asynchronous programming | Ryan Varley | PyData Global 2024

Asynchronous programming can be intimidating for many due to its unique syntax, paradigm, and differing behaviour in scripts versus Jupyter notebooks.

Data professionals are likely to make numerous API calls and handle I/O-bound operations in R&D projects. When deploying data science solutions, they frequently create endpoints that perform multiple network calls. Async programming is incredibly useful in these scenarios but is often overlooked due to its steep learning curve.

But it’s not that complicated—and I'll prove it. In this talk, I will demystify the basics, along with some advanced concepts, from a practical perspective. By the end, you'll be ready to get started and implement significant performance improvements in your network or I/O-bound code.

Watch this talk if you’ve been intimidated by async and await for a while and are ready to change that.

Resources
---
More from the talk - https://blog.ryanvarley.com/p/pydata-get-started-with-python-async-in-25-minutes-talk
Video 1 - https://www.youtube.com/watch?v=gRCMZuAJvAk
Video 2 - https://www.youtube.com/watch?v=aFd1GjTRfig


Ryan Varley

December 04, 2024

Transcript

  1. Assumption: you have used Python but not used async (much).
     Aim: you leave here with a basic understanding, and write some async code!
     Warning: async is a tool; not everything is a nail.
  2. Why is it confusing?
     • The paradigm is different
     • Multiple frameworks
     • It has changed throughout Python 3*
     • It behaves differently in different environments**
     • Rabbit holes***

     * asyncio became part of the standard library in 3.4; await and async def became a thing in 3.5. This is not a history talk.
     ** More soon
     *** We're already on the 3rd*
  3. What is I/O? It's network calls
     • API requests
     • Saving files
       ◦ To the cloud / another machine (network)
       ◦ To local disk (fast, maxes out writes anyway)
     • DB requests
       ◦ How many DBs run on the same machine? Network!
       ◦ Async is not well supported by DB libraries.
  4. We are going to focus on asyncio
     • It's been part of the standard library since 3.4; it's here to stay
     • There are alternatives, but we won't discuss them
       ◦ Older: Tornado, gevent, Twisted
       ◦ Modern: asyncio, Curio
       ◦ Latest: Trio, AnyIO, uvloop
     • Everything here is in Python 3.12.2
     • The code should work from 3.7 onwards, and mostly from 3.5 onwards, though I haven't verified it.
  5. What does async do?
     • Lets us do other things while we wait for I/O
       ◦ Generally more I/O
     • For example, we want to make 20 API calls where each call takes 1 s
       ◦ Synchronously: ~20 s
       ◦ Asynchronously: ~1 s
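A minimal sketch of that timing claim, using asyncio.sleep as a stand-in for the 1 s API call (scaled down to 0.1 s; fake_api_call is a made-up name, not the talk's video API):

```python
import asyncio
import time

async def fake_api_call(i):
    # Simulated endpoint: asyncio.sleep stands in for a 0.1 s network call
    await asyncio.sleep(0.1)
    return i

async def main():
    # 20 concurrent "calls" complete in roughly the time of one
    return await asyncio.gather(*(fake_api_call(i) for i in range(20)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"{len(results)} calls in {elapsed:.2f}s")
```

Run synchronously the same 20 sleeps would take ~2 s; gathered, the wall time stays close to a single call.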
  6. Let's write one! Sync version

     import requests

     API_URL = "http://localhost:8000"

     def get_video(video_id):
         response = requests.get(
             f"{API_URL}/videos/{video_id}",
         )
         response.raise_for_status()
         return response.json()

     get_video(video_id="656c94f1875e6b1ap3ae5f19")
     # {'id': '656c94f1875e6b1ap3ae5f19',
     #  'title': 'JWST: Looking Beyond The Pretty Pictures', …}
  7. Let's write one! Async version (aiohttp - requests is not async)

     import aiohttp

     async def get_video_async(video_id):
         # Generally you would reuse the session
         async with aiohttp.ClientSession() as session:
             # async with ensures resources are released correctly
             async with session.get(
                 f"{API_URL}/videos/{video_id}"
             ) as response:
                 response.raise_for_status()
                 # Need to await the response
                 return await response.json()

     get_video_async("656c94f1875e6b1ap3ae5f19")
     # <coroutine object get_video_async at 0x12824bc40>
  8. Coroutines
     A coroutine is a special function that can be paused and resumed.
     async def defines a function that returns a coroutine instead of the return value.
     This coroutine is lazy and must be awaited ("run") to get the return value.
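The laziness described above can be seen in a few lines (add is a toy function invented for this illustration):

```python
import asyncio

async def add(a, b):
    # Defined with async def, so calling it does NOT run the body
    return a + b

coro = add(1, 2)
kind = type(coro).__name__  # a coroutine object, not the number 3

# Awaiting it (here via asyncio.run, which starts an event loop) runs it
result = asyncio.run(coro)
print(kind, result)
```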
  9. How do we run this coroutine?
     If it ran now, it would be synchronous.
     await blocks subsequent code in the same coroutine:

     video = await get_video_async("656c94f1875e6b1ap3ae5f19")

     video = await get_video_async("656c94f1875e6b1ap3ae5f19")
     channel = await get_video_channel(video['channel_id'])
 10. How do we run this coroutine?

     FastAPI - await in an async endpoint:
         @app.get("/video")
         async def get_video(video_id: str):
             response = await get_video_async(video_id)
             return response

     Python REPL:
         File "<stdin>", line 1
         SyntaxError: 'await' outside function

     IPython / Jupyter:
         {'id': '656c94f1666e6b1de3ae5f19',
          'title': 'JWST: Looking Beyond The Pretty Pictures', …}

     Script / pytest:
         File "<stdin>", line 1
         SyntaxError: 'await' outside function

     python -m asyncio:
         video = await get_video_async("656c94f1875e6b1ap3ae5f19")
 11. In a script we need to start the event loop ourselves: asyncio.run(MY_ENTRY_FUNCTION())

     import asyncio

     async def main():
         video = await get_video_async("656c94f1666e6b1de3ae5f19")
         print(video)

     if __name__ == "__main__":
         asyncio.run(main())

     # {'id': '656c94f1875e6b1ap3ae5f19',
     #  'title': 'JWST: Looking Beyond The Pretty Pictures', ...}

     If you try to run this in Jupyter you will get:
         RuntimeError: asyncio.run() cannot be called from a running event loop
     The event loop is already running - you need to await instead.
 12. If it's async, it needs to be async all the way*

     [Diagram: sync code at the edges calls asyncio.run(main()) and asyncio.run(main2());
      everything inside those calls is async.]

     * Everything in here must be async… or there's no point (mostly)
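The "async all the way" rule can be sketched as a call chain (fetch_data, business_logic and entry_point are hypothetical names for the layers in the diagram):

```python
import asyncio

async def fetch_data():
    # The I/O layer is async...
    await asyncio.sleep(0)
    return "data"

async def business_logic():
    # ...so every caller on the path down to the I/O must be async too,
    # because only a coroutine can await
    return await fetch_data()

def entry_point():
    # The single sync/async boundary: plain sync code starts the event
    # loop once at the top. Calling business_logic() here without run()
    # would just hand back an un-run coroutine object.
    return asyncio.run(business_logic())

print(entry_point())
```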
 13. The simplest entry point
     • If you are using FastAPI already, convert some endpoints to async!
     • If you use Jupyter, next time you would reach for requests, use async

     FastAPI:
         @app.get("/video")
         async def get_video(video_id: str):
             response = await get_video_async(video_id)
             return response
 14. await will run the coroutine if…
     • Python REPL: started with python -m asyncio
     • IPython / Jupyter: works out of the box
     • FastAPI: the endpoint is defined with async def
     • pytest: you use a plugin (e.g. pytest-asyncio, anyio)
     • Script: asyncio.run() is used to call the top-level function
 15. Let's make 100 calls to an endpoint

     async def get_video_async(video_id, session):  # session is passed in
         async with session.get(
             f"{API_URL}/videos/{video_id}",
         ) as response:
             response.raise_for_status()
             return await response.json()

     async def fetch_all_videos(video_ids):
         # Define limits
         connector = aiohttp.TCPConnector(limit=100, limit_per_host=10)
         async with aiohttp.ClientSession(connector=connector) as session:
             tasks = [get_video_async(video_id, session) for video_id in video_ids]
             # * expands the list into arguments
             return await asyncio.gather(*tasks)

     videos = await my_code.fetch_all_videos(VIDEO_IDS)

     Timings: 0.2 s to run 1; 2.1-12 s to run 100; 1.8-2.7 s to run 100 (max 10 at a time).

     asyncio.gather groups a list of awaitables into one awaitable, running them concurrently.
 16. Let's do something more complicated: a content warnings service talking to a video service
     1. Get the video data
     2. Get the transcript
     3. Run our content warnings model
     4. Update the video with content warnings
 17. Let's write a job

     API_URL = "http://localhost:8000"

     async def get_video(video_id, session):
         async with session.get(
             f"{API_URL}/videos/{video_id}"
         ) as response:
             response.raise_for_status()
             return await response.json()

     async def get_video_transcript(video_id, session):
         async with session.get(
             f"{API_URL}/videos/{video_id}/transcript"
         ) as response:
             response.raise_for_status()
             return await response.text()

     def is_nasa(text):
         # Simulate the model taking 1 s
         [hashlib.sha512(b"a" * 10**8).hexdigest() for i in range(10)]
         if "nasa" in text.lower().split():
             return True
         return False
 18. Let's write a job (continued)

     async def generate_content_warnings(video_id, session):
         video = await get_video(video_id, session)
         transcript = await get_video_transcript(video_id, session)
         text = f"{video['title']} {video['description']} {transcript}"
         warning_ids = []
         result = is_nasa(text)
         if result:
             warning_ids.append("nasa")
         await save_video_content_warnings(video_id, warning_ids)

     async def main():
         async with aiohttp.ClientSession() as session:
             tasks = [
                 generate_content_warnings(video_id, session)
                 for video_id in VIDEO_IDS
             ]
             await asyncio.gather(*tasks)

     if __name__ == "__main__":
         asyncio.run(main())
 19. We are overwhelming the API - limit per host

     Before:
         async def main():
             async with aiohttp.ClientSession() as session:
                 tasks = [
                     generate_content_warnings(video_id, session)
                     for video_id in VIDEO_IDS
                 ]
                 await asyncio.gather(*tasks)

     After:
         async def main():
             connector = aiohttp.TCPConnector(
                 limit=100, limit_per_host=10)
             async with aiohttp.ClientSession(connector=connector) as session:
                 tasks = [
                     generate_content_warnings(video_id, session)
                     for video_id in VIDEO_IDS
                 ]
                 await asyncio.gather(*tasks)

     (The limits apply to the session's requests, i.e. the session.get calls in the helpers.)
     But let's do something else instead.
 20. We are overwhelming the API - semaphores

     Before: get_video and get_video_transcript as written on slide 17.

     After:
         video_semaphore = asyncio.Semaphore(5)
         transcript_semaphore = asyncio.Semaphore(10)

         async def get_video(video_id, session):
             async with video_semaphore:
                 async with session.get(
                     f"{API_URL}/videos/{video_id}",
                 ) as response:
                     response.raise_for_status()
                     return await response.json()

         async def get_video_transcript(video_id, session):
             async with transcript_semaphore:
                 async with session.get(
                     …
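A self-contained version of the semaphore pattern, with asyncio.sleep standing in for the HTTP request and a stats dict (my addition, not from the slides) to verify the concurrency cap:

```python
import asyncio

async def limited_call(i, semaphore, stats):
    async with semaphore:                 # waits here if 5 calls are already in flight
        stats["in_flight"] += 1
        stats["max"] = max(stats["max"], stats["in_flight"])
        await asyncio.sleep(0.01)         # stand-in for the HTTP request
        stats["in_flight"] -= 1
    return i

async def main():
    semaphore = asyncio.Semaphore(5)      # at most 5 concurrent "requests"
    stats = {"in_flight": 0, "max": 0}
    results = await asyncio.gather(
        *(limited_call(i, semaphore, stats) for i in range(20))
    )
    return results, stats["max"]

results, max_in_flight = asyncio.run(main())
print(max_in_flight)
```

The semaphore is created inside the coroutine so it binds to the running loop, which avoids cross-loop errors on older Python versions.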
 21. We are running video and transcript in sequence

     Before (awaits one after the other):
         async def generate_content_warnings(video_id, session):
             video = await get_video(video_id, session)
             transcript = await get_video_transcript(video_id, session)
             …

     After (runs both concurrently):
         async def generate_content_warnings(video_id, session):
             video_task = get_video(video_id, session)
             transcript_task = get_video_transcript(video_id, session)
             video, transcript = await asyncio.gather(
                 video_task, transcript_task)
             text = f"{video['title']} {video['description']} {transcript}"
             warning_ids = []
             result = is_nasa(text)
             if result:
                 warning_ids.append("nasa")
             await save_video_content_warnings(video_id, warning_ids)
 22. The model run is blocking

     Before (is_nasa runs on the event loop and blocks everything):
         async def generate_content_warnings(video_id, session):
             video_task = get_video(video_id, session)
             transcript_task = get_video_transcript(video_id, session)
             video, transcript = await asyncio.gather(
                 video_task, transcript_task)
             text = f"{video['title']} {video['description']} {transcript}"
             warning_ids = []
             result = is_nasa(text)
             if result:
                 warning_ids.append("nasa")
             await save_video_content_warnings(video_id, warning_ids)

     After (hand it off to an executor):
         async def generate_content_warnings(video_id, session):
             video_task = get_video(video_id, session)
             transcript_task = get_video_transcript(video_id, session)
             video, transcript = await asyncio.gather(
                 video_task, transcript_task)
             text = f"{video['title']} {video['description']} {transcript}"
             warning_ids = []
             loop = asyncio.get_running_loop()
             result = await loop.run_in_executor(
                 None, is_nasa, text)
             if result:
                 warning_ids.append("nasa")
             await save_video_content_warnings(video_id, warning_ids)
 23. Aside - run_in_executor

     loop = asyncio.get_running_loop()
     result = await loop.run_in_executor(
         None, is_nasa, text)
     # Will run in a thread; CPU-bound work can still block (the GIL)

     from concurrent.futures import ProcessPoolExecutor

     loop = asyncio.get_running_loop()
     with ProcessPoolExecutor() as executor:
         result = await loop.run_in_executor(executor, is_nasa, text)
     # Is now multiprocessing, but comes with the same pitfalls
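A runnable sketch of the thread-executor variant above; slow_model is a hypothetical stand-in for the talk's content-warnings model, burning some CPU via hashlib:

```python
import asyncio
import hashlib

def slow_model(text):
    # Hypothetical CPU-heavy model call (not the talk's real model)
    hashlib.sha512(text.encode() * 10_000).hexdigest()
    return "nasa" in text.lower().split()

async def main():
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor; the event loop
    # stays free to schedule other coroutines while slow_model runs
    return await loop.run_in_executor(None, slow_model, "we love nasa launches")

result = asyncio.run(main())
print(result)
```

Swapping None for a ProcessPoolExecutor instance moves the work to a separate process, sidestepping the GIL at the cost of pickling arguments.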
 24. So are you ready to use async?
     • If you went away and used async for the first time, let me know!
     • I would love to hear your feedback

     Thank you!
     Ryan Varley, PyData Global | 3rd December 2024
     https://www.linkedin.com/in/ryanvarley/
     https://rynv.uk/async-pydata-global-24/