Rapid prototyping in BBC News with Python and AWS

Talk given at EuroPython 2022

BBC News Labs is an innovation team within BBC R&D. We work with journalists and production teams to build prototypes that demonstrate and trial new ideas for helping journalists or bringing new experiences to audiences.

Working in short project cycles, it's important for us to be able to quickly build processing pipelines connected to BBC services, test and iterate on ideas and demonstrate working prototypes. We make use of modern cloud technologies to accelerate delivery and reduce friction.

In this talk I will share our ways of working, our ideation and research methods, and the tools we use to build, deploy and iterate quickly, including the BBC's cloud deployment platform and serverless AWS services such as Lambda, Step Functions and Serverless Postgres.

Ben Nuttall

July 15, 2022

Transcript

  1. @ben_nuttall Ben Nuttall • Software Engineer, BBC News Labs • Former Community Manager at Raspberry Pi • Based in Cambridgeshire, UK • bennuttall.com • twitter.com/ben_nuttall • github.com/bennuttall
  2. COVID • I was looking forward to attending EuroPython in person for the first time since 2019 (Basel) • Unfortunately, I recently got COVID • Thank you to EuroPython for making a remote-friendly conference
  3. BBC News Labs • Multi-disciplinary innovation team within BBC News & BBC R&D • Prototypes of new audience experiences • Solutions to help journalists • Research and trying out ideas • bbcnewslabs.co.uk • twitter.com/bbc_news_labs
  4. Projects • IDX (Identify the X) – Automated clipping of content in live radio for social media • mosromgr – Processing TV/radio running orders to extract structured metadata • BBC Images – Image metadata enrichment pipeline
  5. Project cycles • 3 x 2-week sprints • 2 weeks of tweaks • [Timeline diagram: Sprint 1, Sprint 2, Sprint 3, Tweaks, Sprint 1]
  6. Project cycles • Research week • 3 x 2-week sprints • Wrap-up week • [Timeline diagram: Research week, Sprint 1, Sprint 2, Sprint 3, Wrap-up week, Small projects]
  7. Ideation • Start with department objectives • Devise "how might we..." statements • Explode and converge • Determine project objectives • Research • Bootstrapping • Spikes • Sprint goals • Ticketing
  8. Research week • Identify stakeholders • Set up calls with journalists • Learn about existing systems and workflows • Get access to systems & data and get to know them • Set up shadowing
  9. Shadowing • Sit with a journalist or producer • Watch them do their job using existing tools • Work out what their workflows are • Look for pain points, inefficiencies, slowness, manual work that could be automated
  10. AWS services for building processing pipelines • Lambda functions • Step functions / state machines • Databases (DynamoDB, RDS, Timestream) • S3 • SNS/SQS • CloudWatch
  11. AWS Lambda • Run code without managing server infrastructure • Pay for compute time instead of provisioning for peak capacity • Python/NodeJS/Go/etc
  12. Step functions / state machines • Workflow design • Sequence of Lambdas • Lambdas can be implemented in different languages • Failures, retries, parallelization
  13. Step functions / state machines • Execute with initial data • Pass new data on • Parallel paths and decisional logic • Specify retry logic • Whole state machine succeeds or fails • Easy access to data, exception info and lambda logs
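
The retry logic, decisional logic and parallel paths listed above can be sketched as an Amazon States Language definition built in Python. This is a minimal illustration, not the team's actual pipeline: the state names, the `$.has_speech` field and the Lambda ARNs are all hypothetical.

```python
import json

# Hypothetical Lambda ARNs; real values would come from your own deployment.
FETCH_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:fetch"
TRANSCRIBE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:transcribe"
TAG_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:tag"


def task_branch(name: str, arn: str) -> dict:
    """Build a single-task branch for use inside a Parallel state."""
    return {"StartAt": name, "States": {name: {"Type": "Task", "Resource": arn, "End": True}}}


def build_definition() -> dict:
    """Return a state machine definition: a task with retries, a choice, then parallel branches."""
    return {
        "StartAt": "Fetch",
        "States": {
            "Fetch": {
                "Type": "Task",
                "Resource": FETCH_ARN,
                # Retry failed invocations up to 3 times with exponential backoff.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "HasSpeech",
            },
            "HasSpeech": {
                # Decisional logic: branch on a value in the data passed between states.
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.has_speech", "BooleanEquals": True, "Next": "Enrich"}
                ],
                "Default": "Skip",
            },
            "Enrich": {
                # Parallel paths: both branches receive the same input.
                "Type": "Parallel",
                "Branches": [
                    task_branch("Transcribe", TRANSCRIBE_ARN),
                    task_branch("Tag", TAG_ARN),
                ],
                "End": True,
            },
            "Skip": {"Type": "Succeed"},
        },
    }


definition_json = json.dumps(build_definition(), indent=2)
```

The resulting JSON string is the kind of document a state machine's definition expects; in practice it would more likely live in a CloudFormation template than be generated at runtime.
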
  15. Step functions / state machines

        def lambda_handler(event: dict, context=None) -> dict:
            ...
            event['thing'] = do_thing(data)
            return event
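
The handler pattern above, where each Lambda adds its result to the event and returns it for the next state, can be tried locally without AWS. The sketch below chains two handlers in plain Python; the handler names, the event keys and the `run_locally` helper are hypothetical and only imitate a simple sequential state machine.

```python
def count_words_handler(event: dict, context=None) -> dict:
    # Add this step's result under its own key and pass the whole event on.
    event["word_count"] = len(event["text"].split())
    return event


def classify_handler(event: dict, context=None) -> dict:
    # Later steps can rely on keys added by earlier steps.
    event["is_long"] = event["word_count"] > 5
    return event


def run_locally(handlers, event: dict) -> dict:
    """Feed each handler's output into the next, like a sequence of states."""
    for handler in handlers:
        event = handler(event)
    return event


result = run_locally(
    [count_words_handler, classify_handler],
    {"text": "rapid prototyping with Python"},
)
```

Running handlers in sequence like this is a convenient way to unit test the data flow before wiring the functions into a real state machine.
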
  16. Pydantic • Data parsing and settings management using Python type annotations • Parse and validate a lambda’s input data and configuration
  17. Pydantic models

        from datetime import datetime, timedelta
        from typing import Optional

        from pydantic import BaseModel

        class InputEvent(BaseModel):
            file_id: str
            # Explicit default keeps this field optional in both pydantic v1 and v2
            ncs_id: Optional[str] = None
            start_time: datetime
            duration: timedelta
            body: list[str] = []
  18. Pydantic models

        from .models import InputEvent
        from .utils import do_thing

        def lambda_handler(event, context=None):
            # Validate the raw event dict; pydantic raises ValidationError on bad input
            input_event = InputEvent(**event)
            do_thing(input_event.file_id)
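  19. Pydantic settings

        # pydantic v1 import; in pydantic v2, BaseSettings lives in the pydantic-settings package
        from pydantic import BaseSettings

        class Settings(BaseSettings):
            cert_file_path: str
            key_file_path: str

            @property
            def cert(self):
                # (cert, key) tuple in the form requests accepts for client certificates
                return (self.cert_file_path, self.key_file_path)

            class Config:
                env_prefix = 'MOS_'  # values are read from e.g. MOS_CERT_FILE_PATH
                env_file = '.env'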
  20. Pydantic settings

        import requests

        from .models import Settings

        settings = Settings()

        def fetch_thing(url):
            r = requests.get(url, cert=settings.cert)
            return r.json()
  21. AWS databases • DynamoDB – Serverless – NoSQL tables – JSON data storage • Timestream – Serverless time series database – SQL optimised for time series data • RDS – Managed SQL databases – Serverless option available
  22. Serverless PostgreSQL • Amazon Aurora PostgreSQL-Compatible Edition • DB instance class: serverless v1 • Specify capacity range and scaling configuration • Web service data API
  23. Serverless PostgreSQL - CloudFormation

        Database:
          Type: AWS::RDS::DBCluster
          Properties:
            DBClusterIdentifier: !Ref DBClusterName
            MasterUsername: !Ref DBUsername
            MasterUserPassword: !Ref DBPassword
            DatabaseName: !Ref DBName
            Engine: aurora-postgresql
            EngineMode: serverless
            ScalingConfiguration:
              AutoPause: true
              MinCapacity: !Ref DBMinCapacity
              MaxCapacity: !Ref DBMaxCapacity
              SecondsUntilAutoPause: !Ref DBSecondsUntilAutoPause
            EnableHttpEndpoint: true
  24. Serverless PostgreSQL • Access via boto3 • Or preferably use aurora-data-api or sqlalchemy-aurora-data-api • Connect using AWS Secrets Manager
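
For the "access via boto3" route, a rough sketch of a query through the RDS Data API might look like the following. The `fetch_rows` helper, the string-only parameter handling and the ARNs in the usage comment are assumptions made for illustration; the client is passed in so the function can be exercised without AWS credentials.

```python
def fetch_rows(client, cluster_arn, secret_arn, database, sql, params=None):
    """Run a SQL statement via the RDS Data API and return rows as dicts.

    `client` is expected to behave like boto3.client("rds-data").
    Only string parameters are handled in this sketch.
    """
    response = client.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,  # credentials are looked up in AWS Secrets Manager
        database=database,
        sql=sql,
        parameters=[
            {"name": name, "value": {"stringValue": value}}
            for name, value in (params or {}).items()
        ],
        includeResultMetadata=True,  # needed to get column names back
    )
    columns = [column["name"] for column in response["columnMetadata"]]
    rows = []
    for record in response["records"]:
        # Each field is a one-key dict such as {"stringValue": "x"} or {"isNull": True}.
        values = [
            None if field.get("isNull") else next(iter(field.values()))
            for field in record
        ]
        rows.append(dict(zip(columns, values)))
    return rows


# Example usage with a real client (hypothetical ARNs and table):
#
#   import boto3
#   client = boto3.client("rds-data")
#   rows = fetch_rows(
#       client,
#       cluster_arn="arn:aws:rds:eu-west-1:123456789012:cluster:example",
#       secret_arn="arn:aws:secretsmanager:eu-west-1:123456789012:secret:example",
#       database="mos",
#       sql="SELECT title FROM episodes WHERE episode_pid = :pid",
#       params={"pid": "p0001"},
#   )
```

The amount of unpacking needed here is one reason the slide suggests preferring aurora-data-api or sqlalchemy-aurora-data-api, which wrap this interface in a standard database API.
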
  25. News Labs Apps Portal • EC2 web server hosting static files in S3 • Access via BBC Login or BBC certificate • Every project can re-use the infrastructure • Great for SPAs and static sites
  26. Static HTML websites with Chameleon • Devise a website content structure with a layout template • Create Chameleon templates for each page type • Create logic layer for retrieving data required for each page write • Create lambda for writing/rewriting relevant pages – e.g. new episode processed: write new episode page /<brand>/<episode>/index.htm, update brand index page /<brand>/index.htm, update homepage /index.htm • Create CLI for manual rewrites
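
The "new episode processed" example above implies a small piece of logic that works out which pages need rewriting. A minimal sketch, assuming a hypothetical `pages_to_rewrite` helper and leaving out the Chameleon rendering and S3 upload steps:

```python
from pathlib import PurePosixPath


def pages_to_rewrite(brand: str, episode: str) -> list[str]:
    """List the site paths affected when a new episode has been processed."""
    root = PurePosixPath("/")
    return [
        str(root / brand / episode / "index.htm"),  # new episode page
        str(root / brand / "index.htm"),            # brand index page
        str(root / "index.htm"),                    # homepage
    ]
```

A lambda or CLI command could then render the matching template for each returned path and write the result to the static site bucket.
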
  27. Structlog • Structured logging • Looks great when running locally - easy to see relevant information • JSON logging support ideal for running in AWS - can access and search structured logs in CloudWatch • Encourages good logging practice!

        import os

        import structlog

        # Switch to JSON output when running in AWS
        if os.environ.get('MOS_LOGGING') == 'JSON':
            processors = [
                structlog.stdlib.add_log_level,
                structlog.processors.StackInfoRenderer(),
                structlog.processors.format_exc_info,
                structlog.processors.JSONRenderer(),
            ]
            structlog.configure(processors=processors)
  28. Pydantic database settings

        from pydantic import BaseSettings

        class DbSettings(BaseSettings):
            arn: str
            secret_arn: str
            name: str

            class Config:
                env_prefix = 'MOS_DB_'
                env_file = '.env'
  29. sqlalchemy

        from sqlalchemy import Column, ForeignKey, String, DateTime
        from sqlalchemy.orm import declarative_base

        Base = declarative_base()

        class Episode(Base):
            __tablename__ = 'episodes'

            episode_pid = Column(String, primary_key=True)
            version_pid = Column(String, nullable=False, unique=True)
            brand_pid = Column(String, ForeignKey('brands.brand_pid'), nullable=False)
            title = Column(String, nullable=False)
            image_pid = Column(String, nullable=False)
            service_id = Column(String, ForeignKey('services.service_id'), nullable=False)
            start_time = Column(DateTime)
            end_time = Column(DateTime)
            synopsis = Column(String, nullable=False)
  30. sqlalchemy

        from sqlalchemy import create_engine
        from sqlalchemy.orm import Session

        class MosDatabase:
            def __init__(self):
                # DbSettings is the pydantic settings class from the database settings slide
                settings = DbSettings()
                self.engine = create_engine(
                    f'postgresql+auroradataapi://:@/{settings.name}',
                    connect_args=dict(
                        aurora_cluster_arn=settings.arn,
                        secret_arn=settings.secret_arn,
                    ),
                )

            def get_episode(self, episode_pid: str) -> Episode:
                with Session(self.engine) as session:
                    query = Episode.__table__.select().where(Episode.episode_pid == episode_pid)
                    return session.execute(query).mappings().one()
  31. Lambda function URLs and FastAPI • Dedicated HTTP endpoint for a Lambda function • Serverless REST API • FastAPI (built on Starlette and Pydantic) makes it very easy to provide a REST API – Serverless API for a serverless database – Define in/out data structure with Pydantic – Easily add authentication
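
To show what a function URL delivers to a Lambda, here is a bare handler written without FastAPI. It is a sketch only: the `/episodes/<pid>` route and the in-memory `EPISODES` dict are hypothetical stand-ins for a database lookup, and a FastAPI app would normally take over the routing and validation, typically behind an ASGI adapter.

```python
import json

# Hypothetical stand-in for a database lookup.
EPISODES = {"p0001": {"title": "Example episode"}}


def _response(status: int, body: dict) -> dict:
    """Build the response shape a Lambda function URL expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def lambda_handler(event: dict, context=None) -> dict:
    # Function URL events carry the HTTP method and path in these fields.
    method = event["requestContext"]["http"]["method"]
    path = event["rawPath"]
    if method == "GET" and path.startswith("/episodes/"):
        episode_pid = path.rsplit("/", 1)[-1]
        episode = EPISODES.get(episode_pid)
        if episode is not None:
            return _response(200, episode)
    return _response(404, {"detail": "Not found"})
```

Hand-written routing like this gets unwieldy quickly, which is the gap FastAPI fills with declared routes and Pydantic request and response models.
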
  32. Learning • Move fast and learn things! • Take learnings into the next project • Use spikes to try ideas out • No project is perfect • No hard rules - determine good practice and keep improving • Knowledge share • Prioritise for delivery • Use research week and wrap-up week wisely