[PyCon KR] The next-generation SQLAlchemy with asyncio

Joongi Kim
October 02, 2021

  1. The next generation
    SQLAlchemy with asyncio
    Joongi Kim / Lablup Inc.

  2. (image-only slide)

  3. § SQLAlchemy in 1 minute
    – Core API & ORM
    § Python databases in the asyncio ecosystem
    – aiopg + aiopg.sa / asyncpg + asyncpgsa
    § Technical issues when using aiopg & SQLAlchemy
    § Landing of SQLAlchemy v1.4
    § Technical issues during the migration
    Agenda

  4. § “The de-facto standard DBAPI & ORM for Python programs”
    SQLAlchemy in 1 minute
    (Layer diagram: the ORM sits on top of Core; Core comprises Schema/Types, the
    SQL expression language, and the Engine with its Dialect and Connection Pool;
    Core drives the DBAPI, which talks to the Database.)
    https://speakerdeck.com/zzzeek/introduction-to-sqlalchemy-pycon-2013?slide=8

  5. § 1-to-1 translation of chained Python constructor calls into SQL statements
    SQLAlchemy Core API
    >>> from sqlalchemy import Table, Column, Integer, String, MetaData
    >>> metadata = MetaData()
    >>> users = Table('users', metadata,
    ... Column('id', Integer, primary_key=True),
    ... Column('name', String),
    ... Column('fullname', String),
    ... )
    >>> metadata.create_all(engine)
    https://docs.sqlalchemy.org/en/14/core/tutorial.html

  6. § 1-to-1 translation of chained Python constructor calls into SQL statements
    SQLAlchemy Core API
    >>> ins = users.insert().values(name='jack', fullname='Jack Jones')
    >>> str(ins)
    'INSERT INTO users (name, fullname) VALUES (:name, :fullname)'
    >>> s = select(users.c.name, users.c.fullname)
    >>> result = conn.execute(s)
    >>> for row in result:
    ... print(row)
    (u'jack', u'Jack Jones')
    (u'wendy', u'Wendy Williams')
    https://docs.sqlalchemy.org/en/14/core/tutorial.html

  7. § 1-to-1 mapping of Python objects to SQL table rows
    SQLAlchemy ORM
    https://docs.sqlalchemy.org/en/14/orm/tutorial.html
    >>> from sqlalchemy.orm import declarative_base
    >>> from sqlalchemy import Column, Integer, String
    >>> Base = declarative_base()
    >>> class User(Base):
    ... __tablename__ = 'users'
    ... id = Column(Integer, primary_key=True)
    ... name = Column(String)
    ... fullname = Column(String)
    ... nickname = Column(String)
    ...
    ... def __repr__(self):
    ... return "<User(name='%s', fullname='%s', nickname='%s')>" % (
    ... self.name, self.fullname, self.nickname,
    ... )
    >>> Base.metadata.create_all(engine)

  8. § 1-to-1 mapping of Python objects to SQL table rows
    SQLAlchemy ORM
    https://docs.sqlalchemy.org/en/14/orm/tutorial.html
    >>> from sqlalchemy.orm import sessionmaker
    >>> Session = sessionmaker(bind=engine)
    >>> session = Session()
    >>> ed_user = User(name='ed', fullname='Ed Jones', nickname='edsnickname')
    >>> session.add(ed_user)
    >>> session.commit()
    >>> for instance in session.query(User).order_by(User.id):
    ... print(instance.name, instance.fullname)
    ed Ed Jones
    wendy Wendy Williams
    mary Mary Contrary
    fred Fred Flintstone

  9. § SQLAlchemy core binding for asyncio
    – aiopg + aiopg.sa
    – asyncpg + asyncpgsa
    § ORM on the Greenlet ecosystem
    – peewee
    § Yet another ORM with multi-database backends
    – databases + orm
    Python ORMs in the asyncio ecosystem

  10. aiopg + aiopg.sa (Core only!)
    import asyncio
    import sqlalchemy as sa
    from aiopg.sa import create_engine

    metadata = sa.MetaData()
    tbl = sa.Table(
        'tbl', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('val', sa.String(255)),
    )

    async def create_table(conn):
        await conn.execute('DROP TABLE IF EXISTS tbl')
        await conn.execute('CREATE TABLE tbl (id serial PRIMARY KEY, val varchar(255))')

    async def main():
        async with create_engine(
            user='postgres', password='helloworld',
            host='127.0.0.1', port=15432,
            database='testing',
        ) as engine:
            async with engine.acquire() as conn:
                await create_table(conn)
            async with engine.acquire() as conn:
                await conn.execute(tbl.insert().values(val='abc'))
                async for row in conn.execute(tbl.select()):
                    print(row.id, row.val)

  11. asyncpg + asyncpgsa (Core only!)
    import asyncio
    import sqlalchemy as sa
    import asyncpgsa

    metadata = sa.MetaData()
    tbl = sa.Table(
        'tbl', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('val', sa.String(255)),
    )

    async def create_table(conn):
        await conn.execute('DROP TABLE IF EXISTS tbl')
        await conn.execute('CREATE TABLE tbl (id serial PRIMARY KEY, val varchar(255))')

    async def main():
        pool = await asyncpgsa.create_pool(
            host='127.0.0.1', port=15432,
            database='testing',
            user='postgres', password='helloworld',
            min_size=5, max_size=10,
        )
        async with pool.acquire() as conn:
            await create_table(conn)
        async with pool.acquire() as conn:
            await conn.execute(tbl.insert().values(val='abc'))
            for row in await conn.fetch(tbl.select()):
                print(row['id'], row['val'])

  12. databases + SQLAlchemy (Core only)
    import asyncio
    import sqlalchemy as sa
    from databases import Database

    metadata = sa.MetaData()
    tbl = sa.Table(…)

    async def create_table(conn):
        await conn.execute('DROP TABLE IF EXISTS tbl')
        await conn.execute('CREATE TABLE tbl (id serial PRIMARY KEY, val varchar(255))')

    async def main():
        database = Database(
            'postgresql://postgres:helloworld@127.0.0.1:15432/testing'
        )
        await database.connect()
        async with database.connection() as conn:
            async with conn.transaction():
                await create_table(conn)
        async with database.connection() as conn:
            async with conn.transaction():
                await conn.execute(query=tbl.insert(), values={'val': 'abc'})
                for row in await conn.fetch_all(query=tbl.select()):
                    print(row['id'], row['val'])
        await database.disconnect()

  13. § So, we could not use the ORM. 😭
    – Introduced lots of boilerplate code to convert between Row objects and actual
      Python objects.
    § Still, SQLAlchemy offers:
    – Structured schema version management via alembic
    – Seamless custom type conversion for complex-type columns (e.g., JSON)
    § Binding
    – Tried asyncpgsa at first but later moved to aiopg...
    Backend.AI + SQLAlchemy

  14. § Unable to cancel the async task while beginning a transaction
    – The timing window is as narrow as a slit, but it happens in production!
    – Solution: had to wrap many DB transactions used in long-running websocket
      connection handlers with asyncio.shield()...
      ▹ asyncio.shield() itself is not an ideal API since it may still get cancelled upon
        event loop shutdown.
    – Patched 1.5 years after the first report (aio-libs/aiopg#332)
    Technical issues with aiopg
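    The shield workaround above can be sketched with plain asyncio (a minimal
    illustration; run_transaction and handler are hypothetical stand-ins, not
    Backend.AI code):

    ```python
    import asyncio

    async def run_transaction(results: list) -> None:
        # Stand-in for a DB transaction that must not be torn down mid-flight.
        await asyncio.sleep(0.05)
        results.append("committed")

    async def handler(results: list) -> None:
        # Shield the transaction so that cancelling this handler (e.g. a dropped
        # websocket connection) does not abort the transaction halfway through.
        await asyncio.shield(run_transaction(results))

    async def main() -> list:
        results: list = []
        task = asyncio.create_task(handler(results))
        await asyncio.sleep(0.01)
        task.cancel()                    # simulate the websocket closing early
        try:
            await task
        except asyncio.CancelledError:
            pass                         # the handler is cancelled...
        await asyncio.sleep(0.1)         # ...but the shielded transaction finishes
        return results

    print(asyncio.run(main()))  # → ['committed']
    ```

    Note the caveat from the slide: shield() only protects against cancellation of
    the outer task; when the event loop itself shuts down, the inner task can still
    be cancelled.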

  15. § No Windows support!
    – Backend.AI plans to support Windows setups.
    – AbstractEventLoop.{add_reader(), remove_reader()}
      are not implemented on ProactorEventLoop
    Technical issues with aiopg

  16. § Handling schema migrations
    – One answer for SQLAlchemy: alembic
    – Tricky case: adding/removing enum constants
    ▹ Need to cast enum constants to text back and forth
    ▹ Need to handle removal for downgrade
    – Easy-to-mistake case: referring to another table's schema in migrations
      ▹ Migrations may run at arbitrary git commits & branches!
      ▹ Referred table schemas must be duplicated in the migration script to reflect
        their definitions at the time the migration was written.
    Technical issues with Alembic
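    The enum dance boils down to a fixed sequence of SQL statements. This helper is
    a hypothetical sketch (the type, table, and column names are made up) meant only
    to make the text-casting steps concrete:

    ```python
    def enum_swap_statements(type_name: str, table: str, column: str,
                             new_values: list) -> list:
        """PostgreSQL statements to replace an enum type: rename the old type
        aside, create the new one, cast the column through text, drop the old."""
        quoted = ", ".join(f"'{v}'" for v in new_values)
        return [
            f"ALTER TYPE {type_name} RENAME TO {type_name}__old;",
            f"CREATE TYPE {type_name} AS ENUM ({quoted});",
            f"ALTER TABLE {table} ALTER COLUMN {column} "
            f"TYPE {type_name} USING {column}::text::{type_name};",
            f"DROP TYPE {type_name}__old;",
        ]

    # upgrade: add SCHEDULED; a downgrade calls this again with the old value list
    # (after handling rows that carry the removed constant)
    for stmt in enum_swap_statements(
            "sessionstatus", "sessions", "status",
            ["PENDING", "SCHEDULED", "RUNNING"]):
        print(stmt)
    ```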

  17. Table definitions in Alembic migrations
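    The slide's code screenshot is not reproduced in this transcript, but the
    pattern can be sketched as follows (the users table is illustrative): the
    migration script defines its own frozen copy of any referred table instead of
    importing the live model, so it stays correct at any git commit:

    ```python
    import sqlalchemy as sa

    # Inside the Alembic migration script: do NOT import the application's
    # models. Duplicate the referred table as it looked when this migration
    # was written.
    metadata = sa.MetaData()
    users = sa.Table(
        'users', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('name', sa.String(64)),
    )

    def upgrade():
        # e.g. op.bulk_insert(users, [...]) can now use the frozen definition
        pass
    ```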

  18. § Incremental updates of PostgreSQL JSONB columns
    – How to “update” a hierarchy of nested objects by merging with the existing value?
    – Solution: wrote a custom helper that generates the SQL value expression
    Technical issues with SQLAlchemy
    from typing import Any, Mapping, Tuple

    import sqlalchemy as sa
    from sqlalchemy.dialects import postgresql as psql

    def sql_json_merge(
        col,
        key: Tuple[str, ...],
        obj: Mapping[str, Any],
        *,
        _depth: int = 0,
    ):
        expr = sa.func.coalesce(
            col if _depth == 0 else col[key[:_depth]],
            sa.text("'{}'::jsonb"),
        ).concat(
            sa.func.jsonb_build_object(
                key[_depth],
                (
                    sa.func.coalesce(col[key], sa.text("'{}'::jsonb"))
                    .concat(sa.func.cast(obj, psql.JSONB))
                    if _depth == len(key) - 1
                    else sql_json_merge(col, key, obj=obj, _depth=_depth + 1)
                ),
            ),
        )
        return expr

    sa.update(tbl).values(status_data=sql_json_merge(
        tbl.c.status_data,
        ("scheduler",),
        {"last_try": datetime.now(tzutc()).isoformat(),
         "passed_predicates": ["concurrency", "keypair_resource_policy"]},
    ))

    {
        "scheduler": {
            "first_try": "2021-05-01T02:01:01.00Z",
            "last_try": "2021-05-01T02:03:04.55Z",
            "passed_predicates": ["concurrency"],
        }
    }
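    In plain Python terms, the expression generated above performs this per-level
    merge (a stdlib-only illustration of the semantics, not the actual
    implementation):

    ```python
    def json_merge(value, path, obj):
        # Merge `obj` into the nested object at `path`, preserving existing
        # sibling keys at every level (mirrors coalesce(...) || jsonb_build_object).
        value = dict(value or {})
        node = value
        for k in path[:-1]:
            node[k] = dict(node.get(k) or {})
            node = node[k]
        last = path[-1]
        node[last] = {**(node.get(last) or {}), **obj}
        return value

    current = {"scheduler": {"first_try": "t0", "last_try": "t0",
                             "passed_predicates": ["concurrency"]}}
    updated = json_merge(current, ("scheduler",),
                         {"last_try": "t1",
                          "passed_predicates": ["concurrency",
                                                "keypair_resource_policy"]})
    print(updated["scheduler"]["first_try"])  # → t0 (preserved by the merge)
    ```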

  19. § https://www.sqlalchemy.org/blog/2021/03/15/sqlalchemy-1.4.0-released/
    – “This is the most extensive release for SQLAlchemy in at least ten years”
    – Native asyncio support for both Core and ORM!
    – Type annotation support (sqlalchemy2-stubs)
    – Universal & transparent SQL compilation cache
    – 2.0-style transaction models (optional)
    – Core & ORM queries with unified query composition
    Landing of SQLAlchemy v1.4
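    For instance, the unified select() construct accepts column expressions
    directly (2.0 style) and is shared by Core and ORM queries; a small offline
    sketch (compiles to a SQL string, no database needed; the users table is
    illustrative):

    ```python
    import sqlalchemy as sa

    metadata = sa.MetaData()
    users = sa.Table(
        'users', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('name', sa.String),
    )

    # 1.4/2.0 style: positional column arguments instead of the legacy list form
    stmt = sa.select(users.c.id, users.c.name).where(users.c.name == 'jack')
    print(str(stmt))
    ```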

  20. § Recommended upgrade path
    – https://docs.sqlalchemy.org/en/14/changelog/migration_20.html
    SQLAlchemy Roadmap
    1.3 → 1.4 → 2.0
    ✓ A stepping stone for existing apps
    ✓ Ensure no deprecation warnings raised
    ✓ Subtle changes
    ✓ Changed behavior of SELECT and join() methods
    ✓ Python 3.6+ required
    ✓ Turn on "RemovedIn20Warning"
    ✓ Explicit connection context manager required
    ✓ Explicit binding for metadata required
    ✓ Add "future" flag to Engine and Session objects for new transaction behaviors
    ✓ No library-level autocommit
    ✓ Rows are now named tuples instead of dicts
    ✓ session.execute() everywhere in ORM
    ✓ ...

  21. SQLAlchemy v1.4 Core
    import asyncio
    import sqlalchemy as sa
    from sqlalchemy.ext.asyncio import create_async_engine

    metadata = sa.MetaData()
    tbl = sa.Table(…)

    async def create_table(conn):
        await conn.run_sync(metadata.drop_all)
        await conn.run_sync(metadata.create_all)

    async def main():
        engine = create_async_engine(
            "postgresql+asyncpg://postgres:helloworld@127.0.0.1:15432/testing"
        )
        async with engine.begin() as conn:
            await create_table(conn)
        async with engine.begin() as conn:
            await conn.execute(sa.insert(tbl).values(val='abc'))
            result = await conn.stream(sa.select(tbl))
            async for row in result:
                print(row.id, row.val)
        await engine.dispose()

  22. SQLAlchemy v1.4 ORM
    import asyncio
    import sqlalchemy as sa
    from sqlalchemy.ext.asyncio import (
        AsyncSession, create_async_engine
    )
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class A(Base):
        __tablename__ = "a"
        __mapper_args__ = {"eager_defaults": True}
        id = sa.Column(sa.Integer, primary_key=True)
        val = sa.Column(sa.String(255))
        created_at = sa.Column(
            sa.DateTime, server_default=sa.func.now(),
        )

    async def create_table(conn):
        await conn.run_sync(Base.metadata.drop_all)
        await conn.run_sync(Base.metadata.create_all)

    async def main():
        engine = create_async_engine(
            "postgresql+asyncpg://...",
        )
        async with engine.begin() as conn:
            await create_table(conn)
        async_session = sessionmaker(
            bind=engine,
            expire_on_commit=False,
            class_=AsyncSession,
        )
        async with async_session() as session:
            async with session.begin():
                session.add_all(
                    [A(val="abc"), A(val="def")]
                )
            await session.commit()
        async with async_session() as session:
            result = await session.execute(sa.select(A))
            for a in result.scalars():
                print(a.id, a.val, a.created_at)
        await engine.dispose()

  23. § Invisible errors
    – Transactions silently hang!
      ▹ It should have reported an “operation in progress” DBAPI error
    – Patched through a series of v1.4 patch releases (v1.4.13+)
      ▹ e.g., nested transaction begin() was missing
    Technical issues during SA 1.4 migration

  24. § Swallowing of task cancellation
    – Patched in v1.4.19
      (discovered, but the fix was not yet released, at the time of writing this CFP...)
    – Yes, “proper” cancellation is a difficult problem in asyncio...
    § greenlet is missing!
    – pip install "SQLAlchemy[postgresql_asyncpg]"
    – Update poetry to 1.1.8 or later!
    – Some envs may still need to install greenlet manually (e.g., M1 Mac) ...
    Technical issues during SA 1.4 migration

  25. § Transaction isolation levels
    – Status transitions of Backend.AI sessions 1) involve many async tasks on multiple
      different nodes
      ▹ They should be globally synchronized to guarantee resource constraints!
    – READ COMMITTED vs. REPEATABLE READ vs. SERIALIZABLE
    Technical issues during SA 1.4 migration
    1) A Backend.AI session is a collection of containers running on one or more computing agent nodes in GPU clusters.
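    Why the isolation level matters here can be shown with a toy stdlib simulation
    (an asyncio.Lock stands in for a SERIALIZABLE transaction or advisory lock;
    this models the lost-update hazard, not PostgreSQL itself):

    ```python
    import asyncio

    async def unsafe_increment(state: dict) -> None:
        # READ ... (other tasks interleave at the await) ... WRITE: updates are lost
        value = state["slots_used"]
        await asyncio.sleep(0)           # yield point, like a network round-trip
        state["slots_used"] = value + 1

    async def safe_increment(state: dict, lock: asyncio.Lock) -> None:
        # The lock makes the read-modify-write cycle atomic, as SERIALIZABLE
        # isolation (with retries) or a PostgreSQL advisory lock would.
        async with lock:
            value = state["slots_used"]
            await asyncio.sleep(0)
            state["slots_used"] = value + 1

    async def main() -> tuple:
        unsafe = {"slots_used": 0}
        await asyncio.gather(*(unsafe_increment(unsafe) for _ in range(10)))
        safe = {"slots_used": 0}
        lock = asyncio.Lock()
        await asyncio.gather(*(safe_increment(safe, lock) for _ in range(10)))
        return unsafe["slots_used"], safe["slots_used"]

    print(asyncio.run(main()))  # the unsafe counter ends up far below 10
    ```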

  26. § Transaction isolation levels
    – Solution: SERIALIZABLE + “proper” retrying of transactions + advisory lock
    + marking “read-only” transactions explicitly for performance
    Technical issues during SA 1.4 migration
    class ExtendedAsyncSAEngine(sqlalchemy.ext.asyncio.AsyncEngine):

        @aiotools.actxmgr
        async def begin_readonly(self, deferrable: bool = False) -> AsyncIterator[AsyncConnection]:
            async with self.connect() as conn:
                conn_with_exec_opts = await conn.execution_options(
                    postgresql_readonly=True,
                    postgresql_deferrable=deferrable,
                )
                async with conn_with_exec_opts.begin():
                    yield conn_with_exec_opts

        @aiotools.actxmgr
        async def advisory_lock(self, lock_id: int) -> AsyncIterator[None]:
            async with self.connect() as lock_conn:
                await lock_conn.exec_driver_sql(f"SELECT pg_advisory_lock({lock_id:d})")
                try:
                    yield
                finally:
                    await lock_conn.exec_driver_sql(f"SELECT pg_advisory_unlock({lock_id:d})")

  27. § Transaction isolation levels
    – Solution: SERIALIZABLE + “proper” retrying of transactions + advisory lock
    + marking “read-only” transactions explicitly for performance
    Technical issues during SA 1.4 migration
    async def enumerate_instances(self, check_shadow=True):
        async with self.db.begin_readonly() as conn:
            query = (sa.select('*').select_from(agents))
            ...
            async for row in (await conn.stream(query)):
                yield row

    async def update_instance(self, inst_id, updated_fields):
        async def _update() -> None:
            async with self.db.begin() as conn:
                query = (
                    sa.update(agents)
                    .values(**updated_fields)
                    .where(agents.c.id == inst_id)
                )
                await conn.execute(query)
        await execute_with_retry(_update)

    TQueryResult = TypeVar('TQueryResult')

    async def execute_with_retry(
        txn_func: Callable[[], Awaitable[TQueryResult]],
    ) -> TQueryResult:
        max_retries: Final = 10
        num_retries = 0
        while True:
            if num_retries == max_retries:
                raise RuntimeError("serialization failed")
            try:
                return await txn_func()
            except DBAPIError as e:
                num_retries += 1
                if getattr(e.orig, 'pgcode', None) == '40001':
                    await asyncio.sleep((num_retries - 1) * 0.02)
                    continue
                raise

  28. § Planning to remove boilerplate like the code below, which we needed because there was no async ORM...
    – Still ongoing!
    SA 1.4 ORM adoption
    @attr.s(auto_attribs=True, slots=True)
    class PendingSession:
        kernels: List[KernelInfo]
        access_key: AccessKey
        session_id: SessionId
        ...
        internal_data: Optional[MutableMapping[str, Any]]
        preopen_ports: List[int]

        @classmethod
        def from_row(cls, row: Row) -> PendingSession:
            return cls(
                kernels=[],
                access_key=row['access_key'],
                session_creation_id=row['session_creation_id'],
                session_id=row['session_id'],
                ...
                internal_data=row['internal_data'],
                preopen_ports=row['preopen_ports'],
            )

  29. § Transaction isolation levels
    – The bigger picture of the problem
      ▹ Is an RDBMS suitable for strongly-consistent distributed state management?
      ▹ In early versions of Backend.AI, we used Redis but migrated to
        PostgreSQL in favor of well-structured data schemas as more features were added.
    – Why now (with SQLAlchemy v1.4 and not with aiopg)?
      ▹ Maybe due to differences in their internal connection pooling mechanisms...
      ▹ SA reuses the underlying DBAPI connection but recreates the connection
        wrapper (“fairy”) objects every time
    Technical issues during SA 1.4 migration

  30. § SQLAlchemy v1.4 is out!
    – with asyncio-native ORM and Core APIs based on asyncpg
    § Bleeding edge is bleeding edge...
    – It takes time for everyone (even experienced developers)
      to get deeply accustomed to asyncio (+ cancellation)
    § Strongly-consistent state management on an RDBMS is difficult.
    – Consider non-RDBMS solutions if possible
    – Use a proper transaction isolation level and retry serialization failures
    Summary

  31. Bonus: What it's like replacing the DB engine of a live product
    EDS TV Commercial (2000)
    https://www.youtube.com/watch?v=L2zqTYgcpfg

  32. § https://github.com/achimnol/pycon-kr2021-samples
    § https://docs.sqlalchemy.org/en/14/
    – https://docs.sqlalchemy.org/en/14/_static/dragons.png
    § https://aiopg.readthedocs.io/en/stable/examples.html
    § https://asyncpgsa.readthedocs.io/en/latest/
    References