Rapid prototyping in BBC News with Python and AWS

Ben Nuttall
September 17, 2022

Talk given at PyCon UK 2022

Transcript

  1. RAPID PROTOTYPING
    IN BBC NEWS WITH
    PYTHON AND AWS
    BEN NUTTALL
    BBC NEWS LABS

  3. IT’S GOOD TO BE BACK

  5. https://www.youtube.com/watch?v=QCte3cOx49U

  6. • Software Engineer, BBC News Labs
    • Former Community Manager at Raspberry Pi
    • PyPI critical project maintainer
    • Based in Cambridgeshire
    • bennuttall.com
    • twitter.com/ben_nuttall
    • github.com/bennuttall
    Ben Nuttall

  7. • Multi-disciplinary innovation team within BBC News & BBC R&D
    • Prototypes of new audience experiences
    • Solutions to help journalists
    • Research and trying out ideas
    • bbcnewslabs.co.uk
    • twitter.com/bbc_news_labs
    BBC News Labs

  8. IDX (Identify the X)
    Automated clipping of content in live radio for social media
    mosromgr
    Processing TV/radio running orders to extract structured metadata
    BBC Images
    Image metadata enrichment pipeline
    Projects

  9. • 3 x 2-week sprints
    • 2 weeks of tweaks
    Project cycles
    Timeline: Sprint 1, Sprint 2, Sprint 3, Tweaks, Sprint 1

  10. • Research week
    • 3 x 2-week sprints
    • Wrap-up week
    Project cycles
    Timeline: Research week, Sprint 1, Sprint 2, Sprint 3, Wrap-up week
    Small projects

  11. • Start with department objectives
    • Devise "how might we..." statements
    • Explode and converge
    • Determine project objectives
    • Research
    • Bootstrapping
    • Spikes
    • Sprint goals
    • Ticketing
    Ideation

  12. • Identify stakeholders
    • Set up calls with journalists
    • Learn about existing systems and workflows
    • Get access to systems & data and get to know them
    • Set up shadowing
    Research week

  13. • Sit with a journalist or producer
    • Watch them do their job using existing tools
    • Work out what their workflows are
    • Look for pain points, inefficiencies, slowness, manual
    work that could be automated
    Shadowing

  14. • Lambda functions
    • Step functions / state machines
    • Databases (DynamoDB, RDS, Timestream, etc)
    • S3
    • SNS/SQS
    • CloudWatch
    AWS services for building processing pipelines

  15. • Run code without managing server infrastructure
    • Pay for compute time instead of provisioning for peak capacity
    • Python/JavaScript/Go/etc
    • Python 3.9
    AWS Lambda

  16. • Workflow design
    • Sequence of Lambdas
    • Lambdas can be implemented in different languages
    • Failures, retries, parallelisation
    Step functions / state machines

  17. • Execute with initial data
    • Pass new data on
    • Parallel paths and decisional logic
    • Specify retry logic
    • Whole state machine succeeds or fails
    • Easy access to data, exception info and lambda logs
    Step functions / state machines

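The bullets above (retry logic, parallel paths, decisional logic) correspond to fields of the Amazon States Language. A minimal sketch of such a definition follows, written as a Python dict so it can be dumped to JSON. The state names, the `$.has_result` field, the placeholder ARNs and the `undefined_targets` helper are all hypothetical and not taken from the talk.

```python
import json

# Placeholder ARN; a real definition would reference deployed Lambda functions.
LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:{name}"


def task(name, **extra):
    """Build a Task state that invokes one (hypothetical) Lambda function."""
    return {"Type": "Task", "Resource": LAMBDA_ARN.format(name=name), **extra}


definition = {
    "StartAt": "ParseInput",
    "States": {
        # Retry logic is declared on the state, not inside the Lambda code.
        "ParseInput": task(
            "parse-input",
            Retry=[{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            Next="HasResult",
        ),
        # Decisional logic: branch on a value in the data passed along.
        "HasResult": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.has_result", "BooleanEquals": True, "Next": "Enrich"}
            ],
            "Default": "Done",
        },
        # Parallel paths: each branch is a small state machine of its own.
        "Enrich": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "FetchA", "States": {"FetchA": task("fetch-a", End=True)}},
                {"StartAt": "FetchB", "States": {"FetchB": task("fetch-b", End=True)}},
            ],
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}


def undefined_targets(machine):
    """Return transition targets that do not name a top-level state."""
    states = machine["States"]
    targets = {machine["StartAt"]}
    for state in states.values():
        targets.update(state[key] for key in ("Next", "Default") if key in state)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return sorted(targets - set(states))


print(json.dumps(definition, indent=2))
```

Checking transitions locally like this is only a convenience; the service validates the definition itself when the state machine is created.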

  19. Step functions / state machines

    def lambda_handler(event: dict, context=None) -> dict:
        ...
        event['thing'] = do_thing(data)
        return event
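
A sketch of how this pattern composes: each handler adds its result to the event and returns it, so the output of one Lambda becomes the input of the next. The two handlers and the `run_locally` helper are hypothetical; they only imitate a sequential state machine on one machine and are not the talk's code.

```python
def count_words(event: dict, context=None) -> dict:
    """First step: add a word count to the event and pass it on."""
    event["word_count"] = len(event["text"].split())
    return event


def flag_long_text(event: dict, context=None) -> dict:
    """Second step: relies on the key added by the previous step."""
    event["is_long"] = event["word_count"] > 3
    return event


def run_locally(handlers, initial_event: dict) -> dict:
    """Call each handler in order, feeding it the previous handler's output."""
    event = dict(initial_event)  # copy so the caller's dict is left untouched
    for handler in handlers:
        event = handler(event)
    return event


result = run_locally(
    [count_words, flag_long_text],
    {"text": "rapid prototyping in BBC News"},
)
print(result)
```

Because every step returns the whole event, later steps can read anything an earlier step produced, which is also what makes the data easy to inspect in a finished execution.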

  20. • Data parsing and settings management using Python type
    annotations
    • Parse and validate a lambda’s input data and configuration
    Pydantic

  21. Pydantic models

    from typing import Optional
    from datetime import datetime, timedelta

    from pydantic import BaseModel

    class InputEvent(BaseModel):
        file_id: str
        ncs_id: Optional[str]
        start_time: datetime
        duration: timedelta
        body: list[str] = []

  22. Pydantic models

    from .models import InputEvent
    from .utils import do_thing

    def lambda_handler(event, context=None):
        input_event = InputEvent(**event)
        do_thing(input_event.thing_id)
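
A self-contained sketch of what validating a Lambda's input with a Pydantic model can look like. The model mirrors the `InputEvent` fields shown earlier, except that `ncs_id` is given an explicit `None` default here so the example behaves the same on Pydantic 1 and 2; the `end_time` key and the sample values are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class InputEvent(BaseModel):
    file_id: str
    ncs_id: Optional[str] = None  # explicit default: optional on Pydantic 1 and 2
    start_time: datetime          # ISO 8601 strings are parsed into datetime
    duration: timedelta           # plain numbers are read as seconds
    body: List[str] = []


def lambda_handler(event, context=None):
    # Raises ValidationError on bad input, which fails this step of the pipeline.
    input_event = InputEvent(**event)
    end_time = input_event.start_time + input_event.duration
    event["end_time"] = end_time.isoformat()  # hypothetical output key
    return event


good = lambda_handler(
    {"file_id": "abc123", "start_time": "2022-09-17T10:00:00", "duration": 90}
)
print(good["end_time"])

try:
    lambda_handler({"start_time": "not a date"})
    rejected = False
except ValidationError:
    rejected = True
print("bad event rejected:", rejected)
```

Letting the exception propagate, rather than catching it inside the handler, means a malformed event shows up as a failed execution instead of passing silently down the pipeline.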

  23. Pydantic settings

    from pydantic import BaseSettings

    class Settings(BaseSettings):
        cert_file_path: str
        key_file_path: str

        @property
        def cert(self):
            return (self.cert_file_path, self.key_file_path)

        class Config:
            env_prefix = 'MOS_'
            env_file = '.env'

  24. Pydantic settings

    import requests

    from .models import Settings

    settings = Settings()

    def fetch_thing(url):
        r = requests.get(url, cert=settings.cert)
        return r.json()

  25. • Amazon Aurora PostgreSQL-Compatible Edition
    • DB instance class: serverless v1
    • Specify capacity range and scaling configuration
    • Web service data API
    Serverless PostgreSQL

  26. Serverless PostgreSQL - CloudFormation

    Database:
      Type: AWS::RDS::DBCluster
      Properties:
        DBClusterIdentifier: !Ref DBClusterName
        MasterUsername: !Ref DBUsername
        MasterUserPassword: !Ref DBPassword
        DatabaseName: !Ref DBName
        Engine: aurora-postgresql
        EngineMode: serverless
        ScalingConfiguration:
          AutoPause: true
          MinCapacity: !Ref DBMinCapacity
          MaxCapacity: !Ref DBMaxCapacity
          SecondsUntilAutoPause: !Ref DBSecondsUntilAutoPause
        EnableHttpEndpoint: true

  27. Serverless PostgreSQL

  28. • Access via boto3
    • Or preferably use aurora-data-api or sqlalchemy-aurora-data-api
    • Connect using AWS Secrets Manager
    Serverless PostgreSQL

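A rough sketch of the plain boto3 route, to show why a wrapper library is attractive: the Data API wants every parameter and every returned field wrapped in a typed dict. The function names, the query and the stub client are hypothetical. The `settings` keys mirror the `arn`, `secret_arn` and `name` fields of the talk's `DbSettings` class, and the table and column come from its `Episode` model. With real credentials the client would be `boto3.client("rds-data")`; a stub is used here so the sketch runs without AWS access.

```python
def to_data_api_parameters(values: dict) -> list:
    """Wrap plain string values in the typed structure the Data API expects."""
    return [
        {"name": name, "value": {"stringValue": str(value)}}
        for name, value in values.items()
    ]


def fetch_episode_titles(client, settings: dict, brand_pid: str) -> list:
    """Run one parameterised query and unwrap the typed result fields."""
    response = client.execute_statement(
        resourceArn=settings["arn"],       # ARN of the Aurora cluster
        secretArn=settings["secret_arn"],  # Secrets Manager secret with DB credentials
        database=settings["name"],
        sql="SELECT title FROM episodes WHERE brand_pid = :brand_pid",
        parameters=to_data_api_parameters({"brand_pid": brand_pid}),
    )
    # Each record is a list of fields; each field is a dict such as {"stringValue": ...}.
    return [record[0]["stringValue"] for record in response["records"]]


class FakeDataApiClient:
    """Stand-in for boto3.client("rds-data") that returns canned rows."""

    def __init__(self):
        self.calls = []

    def execute_statement(self, **kwargs):
        self.calls.append(kwargs)
        return {
            "records": [
                [{"stringValue": "Episode one"}],
                [{"stringValue": "Episode two"}],
            ]
        }


settings = {"arn": "CLUSTER_ARN", "secret_arn": "SECRET_ARN", "name": "DB_NAME"}
client = FakeDataApiClient()
titles = fetch_episode_titles(client, settings, "b0000001")
print(titles)
```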

  29. • EC2 web server hosting static files in S3
    • Access via BBC Login or BBC certificate
    • Every project can re-use the infrastructure
    • Great for SPAs and static sites
    News Labs Apps Portal

  30. • Devise a website content structure with a layout template
    • Create Chameleon templates for each page type
    • Create logic layer for retrieving data required for each page write
    • Create lambda for writing/rewriting relevant pages
    • e.g. new episode processed:
      • write new episode page ///index.htm
      • update brand index page //index.htm
      • update homepage /index.htm
    • Create CLI for manual rewrites
    Static HTML websites with Chameleon

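The rewrite step described above (one processed episode triggers an episode page, a brand index and a homepage) can be sketched as follows. This is an illustration under several assumptions: `string.Template` stands in for Chameleon templates so the example has no third-party dependency, pages are written to a temporary local directory, and the `brand/episode/index.htm` layout, the data shape and all names are hypothetical.

```python
import tempfile
from pathlib import Path
from string import Template

# string.Template stands in for Chameleon page templates in this sketch.
EPISODE_PAGE = Template("<h1>$title</h1>")
BRAND_PAGE = Template("<h1>$brand</h1><ul>$items</ul>")
HOME_PAGE = Template("<h1>Brands</h1><ul>$items</ul>")


def pages_for_new_episode(site: dict, brand_id: str, episode_id: str) -> dict:
    """Return {relative path: html} for every page affected by one new episode."""
    brand = site[brand_id]
    episode_items = "".join(f"<li>{t}</li>" for t in brand["episodes"].values())
    brand_items = "".join(f"<li>{b['title']}</li>" for b in site.values())
    return {
        # Hypothetical layout: one directory per brand, one per episode.
        f"{brand_id}/{episode_id}/index.htm": EPISODE_PAGE.substitute(
            title=brand["episodes"][episode_id]
        ),
        f"{brand_id}/index.htm": BRAND_PAGE.substitute(
            brand=brand["title"], items=episode_items
        ),
        "index.htm": HOME_PAGE.substitute(items=brand_items),
    }


def write_pages(root, pages: dict) -> None:
    """Write each rendered page, creating parent directories as needed."""
    for relative_path, html in pages.items():
        target = Path(root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html, encoding="utf-8")


site = {"brand1": {"title": "Example Brand", "episodes": {"ep1": "First episode"}}}
pages = pages_for_new_episode(site, "brand1", "ep1")
output_dir = tempfile.mkdtemp()
write_pages(output_dir, pages)
print(sorted(pages))
```

Keeping "which pages does this event touch" separate from "write these pages" means the same two functions can serve both the Lambda and a command-line tool for manual rewrites. Unlike Chameleon, `string.Template` does not escape HTML, so this stand-in is not suitable for untrusted text.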

  31. Structlog

    • Structured logging
    • Looks great when running locally
      • easy to see relevant information
    • JSON logging support ideal for running in AWS
      • can access and search structured logs in CloudWatch
    • Encourages good logging practice!

    import os

    import structlog

    if os.environ.get('MOS_LOGGING') == 'JSON':
        processors = [
            structlog.stdlib.add_log_level,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ]
        structlog.configure(processors=processors)

  32. Structlog

  33. Pydantic database settings

    from pydantic import BaseSettings

    class DbSettings(BaseSettings):
        arn: str
        secret_arn: str
        name: str

        class Config:
            env_prefix = 'MOS_DB_'
            env_file = '.env'

  34. sqlalchemy

    from sqlalchemy import Column, ForeignKey, String, DateTime
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Episode(Base):
        __tablename__ = 'episodes'

        episode_pid = Column(String, primary_key=True)
        version_pid = Column(String, nullable=False, unique=True)
        brand_pid = Column(String, ForeignKey('brands.brand_pid'), nullable=False)
        title = Column(String, nullable=False)
        image_pid = Column(String, nullable=False)
        service_id = Column(String, ForeignKey('services.service_id'), nullable=False)
        start_time = Column(DateTime)
        end_time = Column(DateTime)
        synopsis = Column(String, nullable=False)