Rapid prototyping in BBC News with Python and AWS

Talk given at EuroPython 2022

BBC News Labs is an innovation team within BBC R&D. We work with journalists and production teams to build prototypes that demonstrate and trial new ideas, whether to help journalists or to bring new experiences to audiences.

Working in short project cycles, it's important for us to be able to quickly build processing pipelines connected to BBC services, test and iterate on ideas and demonstrate working prototypes. We make use of modern cloud technologies to accelerate delivery and reduce friction.

In this talk I will share our ways of working, our ideation and research methods, and the tools we use to build, deploy and iterate quickly: the BBC's cloud deployment platform and serverless AWS services such as Lambda, Step Functions and Serverless Postgres.

Ben Nuttall

July 15, 2022

Transcript

  1. @ben_nuttall
    Rapid prototyping in BBC News with
    Python and AWS

  2. @ben_nuttall
    Ben Nuttall

    Software Engineer, BBC News Labs

    Former Community Manager at
    Raspberry Pi

    Based in Cambridgeshire, UK

    bennuttall.com

    twitter.com/ben_nuttall

    github.com/bennuttall

  3. @ben_nuttall
    COVID

    I was looking forward to attending
    EuroPython in person for the first
    time since 2019 (Basel)

    Unfortunately, I recently got
    COVID

    Thank you to EuroPython for
    making a remote-friendly
    conference

  4. @ben_nuttall
    BBC News Labs

    Multi-disciplinary innovation team within
    BBC News & BBC R&D

    Prototypes of new audience experiences

    Solutions to help journalists

    Research and trying out ideas

    bbcnewslabs.co.uk

    twitter.com/bbc_news_labs

  5. @ben_nuttall
    Projects

    IDX (Identify the X)

    Automated clipping of content in live
    radio for social media

    mosromgr

    Processing TV/radio running orders to
    extract structured metadata

    BBC Images

    Image metadata enrichment pipeline

  6. @ben_nuttall
    Project cycles

    3 x 2-week sprints

    2 weeks of tweaks

    Timeline: Sprint 1 → Sprint 2 → Sprint 3 → Tweaks → Sprint 1

  7. @ben_nuttall
    Project cycles

    Research week

    3 x 2-week sprints

    Wrap-up week

    Timeline: Research week → Sprint 1 → Sprint 2 → Sprint 3 → Wrap-up week → Small projects

  8. @ben_nuttall
    Ideation

    Start with department objectives

    Devise "how might we..." statements

    Explode and converge

    Determine project objectives

    Research

    Bootstrapping

    Spikes

    Sprint goals

    Ticketing

  9. @ben_nuttall
    Research week

    Identify stakeholders

    Set up calls with journalists

    Learn about existing systems and
    workflows

    Get access to systems & data and get
    to know them

    Set up shadowing

  10. @ben_nuttall
    Shadowing

    Sit with a journalist or producer

    Watch them do their job using
    existing tools

    Work out what their workflows are

    Look for pain points, inefficiencies,
    slowness, manual work that could be
    automated

  11. @ben_nuttall
    AWS services for building processing
    pipelines

    Lambda functions

    Step functions / state machines

    Databases (DynamoDB, RDS, Timestream)

    S3

    SNS/SQS

    CloudWatch

  12. @ben_nuttall
    AWS Lambda

    Run code without managing server infrastructure

    Pay for compute time instead of provisioning for peak capacity

    Python, Node.js, Go, etc.

  13. @ben_nuttall
    Step functions / state machines

    Workflow design

    Sequence of Lambdas

    Lambdas can be implemented in
    different languages

    Failures, retries, parallelization

  14. @ben_nuttall
    Step functions / state machines

    Execute with initial data

    Pass new data on

    Parallel paths and decision
    logic

    Specify retry logic

    Whole state machine succeeds
    or fails

    Easy access to data, exception
    info and lambda logs
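
The retry, branching and parallel features listed above are declared in the state machine definition (Amazon States Language). What follows is a minimal sketch written as a Python dict, not the definition used in any News Labs project: the state names, the `$.has_body` field and the Lambda ARNs are all hypothetical. The small helper only checks that top-level `Next` and `Default` targets exist.

```python
import json

# Hypothetical Lambda ARNs; real ones would come from your own deployment.
FETCH_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:fetch-episode"
CLIP_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:make-clip"
PUBLISH_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:publish-page"

definition = {
    "StartAt": "Fetch",
    "States": {
        "Fetch": {
            "Type": "Task",
            "Resource": FETCH_ARN,
            # Retry a failed task up to 3 times, doubling the wait each time.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Next": "HasBody",
        },
        "HasBody": {
            # Decision logic: branch on a value in the data passed along.
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.has_body",
                "BooleanEquals": True,
                "Next": "Process",
            }],
            "Default": "NothingToDo",
        },
        "Process": {
            # Parallel paths: both branches receive the same input.
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "Clip",
                 "States": {"Clip": {"Type": "Task", "Resource": CLIP_ARN, "End": True}}},
                {"StartAt": "Publish",
                 "States": {"Publish": {"Type": "Task", "Resource": PUBLISH_ARN, "End": True}}},
            ],
            "End": True,
        },
        "NothingToDo": {"Type": "Succeed"},
    },
}


def undefined_targets(states: dict) -> set:
    """Return top-level transition targets that are not defined as states."""
    targets = set()
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets - set(states)


# JSON text that could be pasted into a state machine definition.
asl_json = json.dumps(definition, indent=2)
```

Keeping the definition as data makes it easy to sanity-check in a unit test before deploying.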

  16. @ben_nuttall
    Step functions / state machines

    def lambda_handler(event: dict, context=None) -> dict:
        ...
        event['thing'] = do_thing(data)
        return event
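
Because each handler takes the event and returns it with something added, a sequence of handlers can be exercised locally without AWS. The sketch below assumes the default Step Functions behaviour where one Task state's output becomes the next state's input; the two handlers and the `run_locally` helper are hypothetical names invented for illustration.

```python
def add_title(event: dict, context=None) -> dict:
    """First step: derive a title from the file id and pass everything on."""
    event["title"] = event["file_id"].replace("-", " ").title()
    return event


def add_summary(event: dict, context=None) -> dict:
    """Second step: relies on the 'title' key added by the previous step."""
    event["summary"] = f"{event['title']} ({len(event['body'])} lines)"
    return event


def run_locally(steps, event: dict) -> dict:
    """Feed each handler's return value to the next, like a simple sequence of states."""
    for step in steps:
        event = step(event)
    return event


result = run_locally(
    [add_title, add_summary],
    {"file_id": "morning-bulletin", "body": ["line one", "line two"]},
)
```

A local runner like this is only a testing convenience; retries, branching and parallel paths still live in the state machine definition.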

  17. @ben_nuttall
    Pydantic

    Data parsing and settings
    management using Python type
    annotations

    Parse and validate a lambda’s input
    data and configuration

  18. @ben_nuttall
    Pydantic models

    from typing import Optional
    from datetime import datetime, timedelta
    from pydantic import BaseModel

    class InputEvent(BaseModel):
        file_id: str
        ncs_id: Optional[str]
        start_time: datetime
        duration: timedelta
        body: list[str] = []

  19. @ben_nuttall
    Pydantic models

    from .models import InputEvent
    from .utils import do_thing

    def lambda_handler(event, context=None):
        input_event = InputEvent(**event)
        do_thing(input_event.thing_id)
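
To make the validation step concrete, here is a self-contained sketch of a handler that parses its input with a model like the one above. It is an illustration rather than the project's code: `ncs_id` is given an explicit `None` default so the example behaves the same across Pydantic versions, and returning the failing field names (instead of letting the exception propagate and fail the state) is a choice made only to keep the example easy to inspect.

```python
from datetime import datetime, timedelta
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class InputEvent(BaseModel):
    file_id: str
    ncs_id: Optional[str] = None  # explicit default: optional in any Pydantic version
    start_time: datetime          # ISO 8601 strings are parsed into datetime
    duration: timedelta           # plain numbers are read as seconds
    body: List[str] = []


def lambda_handler(event: dict, context=None) -> dict:
    try:
        input_event = InputEvent(**event)
    except ValidationError as exc:
        # Report which fields failed; each error's "loc" starts with the field name.
        return {"ok": False, "errors": [err["loc"][0] for err in exc.errors()]}
    end_time = input_event.start_time + input_event.duration
    return {
        "ok": True,
        "file_id": input_event.file_id,
        "end_time": end_time.isoformat(),
    }


good = lambda_handler(
    {"file_id": "abc", "start_time": "2022-07-15T09:00:00", "duration": 90}
)
bad = lambda_handler({"start_time": "not a date", "duration": 90})
```

After parsing, the rest of the handler works with typed attributes rather than raw dictionary values.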

  20. @ben_nuttall
    Pydantic settings

    from pydantic import BaseSettings

    class Settings(BaseSettings):
        cert_file_path: str
        key_file_path: str

        @property
        def cert(self):
            return (self.cert_file_path, self.key_file_path)

        class Config:
            env_prefix = 'MOS_'
            env_file = '.env'

  21. @ben_nuttall
    Pydantic settings

    import requests
    from .models import Settings

    settings = Settings()

    def fetch_thing(url):
        r = requests.get(url, cert=settings.cert)
        return r.json()
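
With `env_prefix = 'MOS_'`, the settings fields are filled from environment variables such as `MOS_CERT_FILE_PATH` and `MOS_KEY_FILE_PATH`. The sketch below imitates only that naming convention using the standard library, so it is a stand-in and not Pydantic itself: it has no `.env` file support and no type coercion, and `load_settings` is a hypothetical helper.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    cert_file_path: str
    key_file_path: str

    @property
    def cert(self) -> tuple:
        # (cert, key) pair in the form accepted by requests' cert= argument.
        return (self.cert_file_path, self.key_file_path)


def load_settings(environ=os.environ, prefix: str = "MOS_") -> Settings:
    """Build Settings from PREFIX + FIELD_NAME variables, failing fast if any are missing."""
    values, missing = {}, []
    for field in fields(Settings):
        key = prefix + field.name.upper()
        if key in environ:
            values[field.name] = environ[key]
        else:
            missing.append(key)
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return Settings(**values)


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings(
    {"MOS_CERT_FILE_PATH": "/tmp/cert.pem", "MOS_KEY_FILE_PATH": "/tmp/key.pem"}
)
```

Failing at start-up when a variable is missing means a misconfigured Lambda errors immediately rather than part-way through a request.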

  22. @ben_nuttall
    AWS databases

    DynamoDB

    Serverless

    NoSQL tables

    JSON data storage

    Timestream

    Serverless time series database

    SQL optimised for time series data

    RDS

    Managed SQL databases

    Serverless option available

  23. @ben_nuttall
    Serverless PostgreSQL

    Amazon Aurora PostgreSQL-
    Compatible Edition

    DB instance class: serverless v1

    Specify capacity range and scaling
    configuration

    Web service data API

  24. @ben_nuttall
    Serverless PostgreSQL - CloudFormation

    Database:
      Type: AWS::RDS::DBCluster
      Properties:
        DBClusterIdentifier: !Ref DBClusterName
        MasterUsername: !Ref DBUsername
        MasterUserPassword: !Ref DBPassword
        DatabaseName: !Ref DBName
        Engine: aurora-postgresql
        EngineMode: serverless
        ScalingConfiguration:
          AutoPause: true
          MinCapacity: !Ref DBMinCapacity
          MaxCapacity: !Ref DBMaxCapacity
          SecondsUntilAutoPause: !Ref DBSecondsUntilAutoPause
        EnableHttpEndpoint: true

  25. @ben_nuttall
    Serverless PostgreSQL

  26. @ben_nuttall
    Serverless PostgreSQL

    Access via boto3

    Or preferably use aurora-data-api or sqlalchemy-aurora-data-api

    Connect using AWS Secrets Manager
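
For the plain boto3 route, the Data API is reached through the `rds-data` client and its `execute_statement` call, which takes the cluster ARN and the Secrets Manager secret ARN instead of a database connection. The sketch below shows the request and response shapes under stated assumptions: `fetch_rows`, the `FakeClient` stand-in, the ARNs and the table contents are hypothetical, and only string parameters are handled. In real use the client would be `boto3.client("rds-data")`.

```python
def fetch_rows(client, cluster_arn: str, secret_arn: str, database: str,
               sql: str, params: dict) -> list:
    """Run a query through the Data API and return the rows as dicts."""
    response = client.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database=database,
        sql=sql,  # named parameters are written as :name in the SQL
        parameters=[
            {"name": name, "value": {"stringValue": value}}
            for name, value in params.items()
        ],
        includeResultMetadata=True,  # needed to get column names back
    )
    columns = [col["name"] for col in response["columnMetadata"]]
    rows = []
    for record in response["records"]:
        # Each field is a one-item dict such as {"stringValue": "..."} or {"isNull": True}.
        values = [
            None if field.get("isNull") else next(iter(field.values()))
            for field in record
        ]
        rows.append(dict(zip(columns, values)))
    return rows


class FakeClient:
    """Stand-in for boto3.client("rds-data") so the example runs offline."""

    def execute_statement(self, **kwargs):
        self.last_call = kwargs
        return {
            "columnMetadata": [{"name": "episode_pid"}, {"name": "title"}],
            "records": [[{"stringValue": "p0001"}, {"isNull": True}]],
        }


client = FakeClient()
rows = fetch_rows(
    client,
    cluster_arn="arn:aws:rds:eu-west-1:123456789012:cluster:example",
    secret_arn="arn:aws:secretsmanager:eu-west-1:123456789012:secret:example",
    database="mos",
    sql="SELECT episode_pid, title FROM episodes WHERE episode_pid = :episode_pid",
    params={"episode_pid": "p0001"},
)
```

Unpacking the typed field dicts by hand like this is the boilerplate that the wrapper libraries named on the slide take care of.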

  27. @ben_nuttall
    News Labs Apps Portal

    EC2 web server hosting static files in S3

    Access via BBC Login or BBC certificate

    Every project can re-use the infrastructure

    Great for SPAs and static sites

  28. @ben_nuttall
    Static HTML websites with Chameleon

    Devise a website content structure with a layout template

    Create Chameleon templates for each page type

    Create logic layer for retrieving data required for each page write

    Create lambda for writing/rewriting relevant pages

    e.g. new episode processed:

    write new episode page ///index.htm

    update brand index page //index.htm

    update homepage /index.htm

    Create CLI for manual rewrites

  29. @ben_nuttall
    Structlog

    Structured logging

    Looks great when running locally -
    easy to see relevant information

    JSON logging support ideal for
    running in AWS - can access and
    search structured logs in
    CloudWatch

    Encourages good logging practice!

    import os

    import structlog

    # Switch to JSON output when running in AWS
    if os.environ.get('MOS_LOGGING') == 'JSON':
        processors = [
            structlog.stdlib.add_log_level,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ]
        structlog.configure(processors=processors)
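
To show what a structured log line looks like once it reaches CloudWatch, here is a stand-in built on the standard library's `logging` module rather than structlog. The logger name, the `context` key and the field names are hypothetical; with structlog the equivalent call would pass the fields as keyword arguments, as in `log.info("episode_processed", episode_pid="p0001")`.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, merging in key-value context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname.lower()}
        # "context" is set through the extra= argument of the logging call.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("mos_example")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = make_logger(stream)
log.info(
    "episode_processed",
    extra={"context": {"episode_pid": "p0001", "pages_written": 3}},
)
line = stream.getvalue().strip()
```

Because every line is a JSON object with named fields, logs can be filtered by field value instead of by matching free text.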

  30. @ben_nuttall
    Structlog

  31. @ben_nuttall
    Pydantic database settings

    from pydantic import BaseSettings

    class DbSettings(BaseSettings):
        arn: str
        secret_arn: str
        name: str

        class Config:
            env_prefix = 'MOS_DB_'
            env_file = '.env'

  32. @ben_nuttall
    sqlalchemy

    from sqlalchemy import Column, ForeignKey, String, DateTime
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Episode(Base):
        __tablename__ = 'episodes'

        episode_pid = Column(String, primary_key=True)
        version_pid = Column(String, nullable=False, unique=True)
        brand_pid = Column(String, ForeignKey('brands.brand_pid'), nullable=False)
        title = Column(String, nullable=False)
        image_pid = Column(String, nullable=False)
        service_id = Column(String, ForeignKey('services.service_id'), nullable=False)
        start_time = Column(DateTime)
        end_time = Column(DateTime)
        synopsis = Column(String, nullable=False)

  33. @ben_nuttall
    sqlalchemy

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session

    class MosDatabase:
        def __init__(self):
            # DbSettings and Episode are the classes shown on the previous slides
            settings = DbSettings()
            # No host or credentials in the URL: the Data API driver uses the ARNs
            self.engine = create_engine(
                f'postgresql+auroradataapi://:@/{settings.name}',
                connect_args=dict(
                    aurora_cluster_arn=settings.arn,
                    secret_arn=settings.secret_arn,
                )
            )

        def get_episode(self, episode_pid: str) -> Episode:
            with Session(self.engine) as session:
                query = Episode.__table__.select().where(Episode.episode_pid == episode_pid)
                return session.execute(query).mappings().one()

  34. @ben_nuttall
    Lambda function URLs and FastAPI

    Dedicated HTTP endpoint for a Lambda
    function

    Serverless REST API

    FastAPI (built on Starlette and Pydantic)
    makes it very easy to provide a REST API

    Serverless API for a serverless
    database

    Define in/out data structure with
    Pydantic

    Easily add authentication
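
A function URL invokes the Lambda with an HTTP-style event (payload format version 2.0) and expects a status code, headers and body in return. The sketch below handles that event by hand to show the shapes involved; it does not use FastAPI, which would replace the manual routing and would typically sit behind an ASGI adapter such as Mangum (an assumption, since the slide does not name one). The `/episodes/` route and the in-memory `EPISODES` data are hypothetical.

```python
import json

# Hypothetical in-memory data standing in for the serverless database.
EPISODES = {"p0001": {"episode_pid": "p0001", "title": "Example episode"}}


def respond(status: int, payload: dict) -> dict:
    """Build the response structure a function URL turns into an HTTP reply."""
    return {
        "statusCode": status,
        "headers": {"content-type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event: dict, context=None) -> dict:
    method = event["requestContext"]["http"]["method"]
    path = event["rawPath"]
    if method != "GET":
        return respond(405, {"detail": "Method not allowed"})
    prefix = "/episodes/"
    if path.startswith(prefix):
        episode = EPISODES.get(path[len(prefix):])
        if episode is not None:
            return respond(200, episode)
    return respond(404, {"detail": "Not found"})


def fake_event(method: str, path: str) -> dict:
    """Minimal subset of a function URL event, for local testing only."""
    return {"rawPath": path, "requestContext": {"http": {"method": method}}}


found = lambda_handler(fake_event("GET", "/episodes/p0001"))
missing = lambda_handler(fake_event("GET", "/episodes/nope"))
```

Writing this by hand for more than a couple of routes gets tedious quickly, which is where a framework that declares routes and Pydantic models pays off.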

  35. @ben_nuttall
    Learning

    Move fast and learn things!

    Take learnings into the next project

    Use spikes to try ideas out

    No project is perfect

    No hard rules - determine good practice and keep improving

    Knowledge share

    Prioritise for delivery

    Use research week and wrap-up week wisely

  36. @ben_nuttall
    Rapid prototyping in BBC News with
    Python and AWS
