Rapid prototyping in BBC News with Python and AWS

Ben Nuttall
September 17, 2022

Talk given at PyCon UK 2022

Transcript

  1. RAPID PROTOTYPING IN BBC NEWS WITH PYTHON AND AWS
    BEN NUTTALL
    BBC NEWS LABS
  2. None
  3. IT’S GOOD TO BE BACK

  4. None
  5. https://www.youtube.com/watch?v=QCte3cOx49U

  6. Ben Nuttall
    • Software Engineer, BBC News Labs
    • Former Community Manager at Raspberry Pi
    • PyPI critical project maintainer
    • Based in Cambridgeshire
    • bennuttall.com
    • twitter.com/ben_nuttall
    • github.com/bennuttall
  7. BBC News Labs
    • Multi-disciplinary innovation team within BBC News & BBC R&D
    • Prototypes of new audience experiences
    • Solutions to help journalists
    • Research and trying out ideas
    • bbcnewslabs.co.uk
    • twitter.com/bbc_news_labs
  8. Projects
    • IDX (Identify the X): automated clipping of content in live radio for social media
    • mosromgr: processing TV/radio running orders to extract structured metadata
    • BBC Images: image metadata enrichment pipeline
  9. Project cycles
    • 3 x 2-week sprints
    • 2 weeks of tweaks
    Timeline: Sprint 1, Sprint 2, Sprint 3, Tweaks, Sprint 1
  10. Project cycles
    • Research week
    • 3 x 2-week sprints
    • Wrap-up week
    Timeline: Research week, Sprint 1, Sprint 2, Sprint 3, Wrap-up week
    Small projects
  11. Ideation
    • Start with department objectives
    • Devise "how might we..." statements
    • Explode and converge
    • Determine project objectives
    • Research
    • Bootstrapping
    • Spikes
    • Sprint goals
    • Ticketing
  12. Research week
    • Identify stakeholders
    • Set up calls with journalists
    • Learn about existing systems and workflows
    • Get access to systems & data and get to know them
    • Set up shadowing
  13. Shadowing
    • Sit with a journalist or producer
    • Watch them do their job using existing tools
    • Work out what their workflows are
    • Look for pain points, inefficiencies, slowness, manual work that could be automated
  14. AWS services for building processing pipelines
    • Lambda functions
    • Step functions / state machines
    • Databases (DynamoDB, RDS, Timestream, etc.)
    • S3
    • SNS/SQS
    • CloudWatch
  15. AWS Lambda
    • Run code without managing server infrastructure
    • Pay for compute time instead of provisioning for peak capacity
    • Python/JavaScript/Go/etc.
    • Python 3.9
  16. Step functions / state machines
    • Workflow design
    • Sequence of Lambdas
    • Lambdas can be implemented in different languages
    • Failures, retries, parallelisation
  17. Step functions / state machines
    • Execute with initial data
    • Pass new data on
    • Parallel paths and decisional logic
    • Specify retry logic
    • Whole state machine succeeds or fails
    • Easy access to data, exception info and lambda logs
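The sequencing and retry logic listed on this slide are declared in the state machine definition, written in Amazon States Language. The following is a minimal sketch of a two-step definition built as a Python dict. The state names, Lambda ARNs and retry values are hypothetical placeholders and are not taken from the talk.

```python
import json

# Hypothetical Lambda ARNs, for illustration only.
PARSE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:parse-running-order"
STORE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:store-metadata"

definition = {
    "Comment": "Two Lambdas run in sequence; the first retries on failure",
    "StartAt": "Parse",
    "States": {
        "Parse": {
            "Type": "Task",
            "Resource": PARSE_ARN,
            # Up to 3 retries, waiting 5s, 10s, then 20s between attempts.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "Store",
        },
        # The output of "Parse" becomes the input of "Store".
        "Store": {"Type": "Task", "Resource": STORE_ARN, "End": True},
    },
}

# The JSON string is what a state machine resource takes as its definition.
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

If every retry of "Parse" is exhausted and no catcher is defined, the whole execution is marked as failed, which matches the "whole state machine succeeds or fails" point.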
  19. Step functions / state machines
        def lambda_handler(event: dict, context=None) -> dict:
            ...
            event['thing'] = do_thing(data)
            return event
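To make the "pass new data on" pattern concrete, here is a small local sketch that chains two handlers of the shape shown above. The handler names, the event keys and the `run_locally` helper are hypothetical. In AWS the state machine, not Python code, passes each Lambda's return value to the next step.

```python
def parse_handler(event: dict, context=None) -> dict:
    # Add a new key and return the whole event so later steps can see it.
    event["words"] = event["text"].split()
    return event


def count_handler(event: dict, context=None) -> dict:
    # Relies on the key added by the previous step.
    event["word_count"] = len(event["words"])
    return event


def run_locally(handlers, initial_event: dict) -> dict:
    """Feed each handler's return value into the next one, in order."""
    event = dict(initial_event)
    for handler in handlers:
        event = handler(event)
    return event


result = run_locally(
    [parse_handler, count_handler],
    {"text": "rapid prototyping in BBC News"},
)
print(result["word_count"])
```

Because each handler returns the event it received plus its own additions, the initial data stays available all the way down the sequence.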
  20. Pydantic
    • Data parsing and settings management using Python type annotations
    • Parse and validate a lambda’s input data and configuration
  21. Pydantic models
        from typing import Optional
        from datetime import datetime, timedelta
        from pydantic import BaseModel

        class InputEvent(BaseModel):
            file_id: str
            ncs_id: Optional[str]
            start_time: datetime
            duration: timedelta
            body: list[str] = []
  22. Pydantic models
        from .models import InputEvent
        from .utils import do_thing

        def lambda_handler(event, context=None):
            input_event = InputEvent(**event)
            do_thing(input_event.thing_id)
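A self-contained sketch of what this validation does with good and bad input, assuming pydantic is installed. The model mirrors the `InputEvent` slide, with `ncs_id` given an explicit `None` default so it behaves the same on pydantic 1.x and 2.x. The returned dict shape and the sample payloads are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

from pydantic import BaseModel, ValidationError


class InputEvent(BaseModel):
    file_id: str
    ncs_id: Optional[str] = None
    start_time: datetime
    duration: timedelta
    body: list[str] = []


def lambda_handler(event: dict, context=None) -> dict:
    try:
        # Strings and numbers are parsed into datetime/timedelta here.
        input_event = InputEvent(**event)
    except ValidationError as exc:
        # Illustration only: a real Lambda might re-raise so the step fails.
        return {"ok": False, "error_count": len(exc.errors())}
    end_time = input_event.start_time + input_event.duration
    return {"ok": True, "file_id": input_event.file_id, "end_time": end_time.isoformat()}


# An ISO 8601 string and a number of seconds are accepted for the typed fields.
good = lambda_handler(
    {"file_id": "abc", "start_time": "2022-09-17T10:00:00", "duration": 1800}
)
# Missing start_time and duration are both reported in one ValidationError.
bad = lambda_handler({"file_id": "abc"})
print(good)
print(bad)
```

Validating at the top of the handler means the rest of the code can rely on typed attributes instead of checking dictionary keys by hand.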
  23. Pydantic settings
        from pydantic import BaseSettings

        class Settings(BaseSettings):
            cert_file_path: str
            key_file_path: str

            @property
            def cert(self):
                return (self.cert_file_path, self.key_file_path)

            class Config:
                env_prefix = 'MOS_'
                env_file = '.env'
  24. Pydantic settings
        import requests
        from .models import Settings

        settings = Settings()

        def fetch_thing(url):
            r = requests.get(url, cert=settings.cert)
            return r.json()
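With `env_prefix = 'MOS_'`, pydantic fills `cert_file_path` from the environment variable `MOS_CERT_FILE_PATH` (name matching is case-insensitive by default). The sketch below imitates only that naming rule using the standard library, to make the mapping concrete. `load_prefixed` is a hypothetical helper, not part of pydantic, and it ignores `.env` files and type validation. The file paths are placeholders.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    cert_file_path: str
    key_file_path: str

    @property
    def cert(self):
        # Same (cert, key) tuple shape that requests accepts for cert=.
        return (self.cert_file_path, self.key_file_path)


def load_prefixed(cls, prefix: str, environ=None):
    """Build cls from variables named <prefix><FIELD_NAME_IN_UPPER_CASE>."""
    environ = os.environ if environ is None else environ
    values = {}
    for field in fields(cls):
        env_name = prefix + field.name.upper()
        if env_name not in environ:
            raise KeyError(f"missing environment variable {env_name}")
        values[field.name] = environ[env_name]
    return cls(**values)


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {
    "MOS_CERT_FILE_PATH": "/tmp/client.crt",
    "MOS_KEY_FILE_PATH": "/tmp/client.key",
}
settings = load_prefixed(Settings, "MOS_", fake_env)
print(settings.cert)
```

Note that in pydantic 2.x `BaseSettings` is provided by the separate `pydantic-settings` package rather than by `pydantic` itself.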
  25. Serverless PostgreSQL
    • Amazon Aurora PostgreSQL-Compatible Edition
    • DB instance class: serverless v1
    • Specify capacity range and scaling configuration
    • Web service data API
  26. Serverless PostgreSQL - CloudFormation
        Database:
          Type: AWS::RDS::DBCluster
          Properties:
            DBClusterIdentifier: !Ref DBClusterName
            MasterUsername: !Ref DBUsername
            MasterUserPassword: !Ref DBPassword
            DatabaseName: !Ref DBName
            Engine: aurora-postgresql
            EngineMode: serverless
            ScalingConfiguration:
              AutoPause: true
              MinCapacity: !Ref DBMinCapacity
              MaxCapacity: !Ref DBMaxCapacity
              SecondsUntilAutoPause: !Ref DBSecondsUntilAutoPause
            EnableHttpEndpoint: true
  27. Serverless PostgreSQL

  28. Serverless PostgreSQL
    • Access via boto3
    • Or preferably use aurora-data-api or sqlalchemy-aurora-data-api
    • Connect using AWS Secrets Manager
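A rough sketch of the plain boto3 route: the `rds-data` client's `execute_statement` call takes the cluster ARN, a Secrets Manager secret ARN, the database name and the SQL. The table and column names, the ARNs and the `fetch_episode_titles` helper are assumptions for illustration. `StubClient` is a stand-in so the example runs without AWS credentials.

```python
def fetch_episode_titles(client, cluster_arn, secret_arn, database, brand_pid):
    """Query titles through the Data API using a named SQL parameter."""
    response = client.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database=database,
        sql="SELECT title FROM episodes WHERE brand_pid = :brand_pid",
        parameters=[{"name": "brand_pid", "value": {"stringValue": brand_pid}}],
    )
    # Each record is a list of typed field dicts, one per selected column.
    return [record[0]["stringValue"] for record in response["records"]]


class StubClient:
    """Stand-in for boto3.client('rds-data'); returns canned records."""

    def __init__(self):
        self.calls = []

    def execute_statement(self, **kwargs):
        self.calls.append(kwargs)
        return {
            "records": [
                [{"stringValue": "Episode one"}],
                [{"stringValue": "Episode two"}],
            ]
        }


stub = StubClient()
titles = fetch_episode_titles(
    stub,
    cluster_arn="arn:aws:rds:eu-west-1:123456789012:cluster:example",  # hypothetical
    secret_arn="arn:aws:secretsmanager:eu-west-1:123456789012:secret:example",  # hypothetical
    database="example_db",  # hypothetical
    brand_pid="b0000000",  # hypothetical
)
print(titles)
```

Against a real cluster, `boto3.client("rds-data")` would be passed in place of the stub. Unpacking the typed field dicts by hand is the kind of boilerplate that wrapper libraries such as those named on the slide aim to hide.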
  29. News Labs Apps Portal
    • EC2 web server hosting static files in S3
    • Access via BBC Login or BBC certificate
    • Every project can re-use the infrastructure
    • Great for SPAs and static sites
  30. Static HTML websites with Chameleon
    • Devise a website content structure with a layout template
    • Create Chameleon templates for each page type
    • Create logic layer for retrieving data required for each page write
    • Create lambda for writing/rewriting relevant pages
      • e.g. new episode processed:
        • write new episode page /<brand>/<episode>/index.htm
        • update brand index page /<brand>/index.htm
        • update homepage /index.htm
    • Create CLI for manual rewrites
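The "new episode processed" example amounts to a small mapping from an event to the pages that need rewriting. A minimal sketch of that mapping follows. The function name and the sample brand and episode identifiers are hypothetical. Rendering the templates and writing the files are left out.

```python
def pages_to_rewrite(brand: str, episode: str) -> list:
    """Return the site paths affected when a new episode is processed."""
    return [
        f"/{brand}/{episode}/index.htm",  # new episode page
        f"/{brand}/index.htm",            # brand index page
        "/index.htm",                     # homepage
    ]


paths = pages_to_rewrite("example-brand", "example-episode")
for path in paths:
    print(path)
```

The same function can back both the rewriting lambda and a CLI for manual rewrites, so the two cannot drift apart on which pages an episode touches.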
  31. Structlog
    • Structured logging
    • Looks great when running locally
      • easy to see relevant information
    • JSON logging support ideal for running in AWS
      • can access and search structured logs in CloudWatch
    • Encourages good logging practice!
        import os

        import structlog

        if os.environ.get('MOS_LOGGING') == 'JSON':
            processors = [
                structlog.stdlib.add_log_level,
                structlog.processors.StackInfoRenderer(),
                structlog.processors.format_exc_info,
                structlog.processors.JSONRenderer(),
            ]
            structlog.configure(processors=processors)
  32. Structlog

  33. Pydantic database settings
        from pydantic import BaseSettings

        class DbSettings(BaseSettings):
            arn: str
            secret_arn: str
            name: str

            class Config:
                env_prefix = 'MOS_DB_'
                env_file = '.env'
  34. sqlalchemy
        from sqlalchemy import Column, ForeignKey, String, DateTime
        from sqlalchemy.orm import declarative_base

        Base = declarative_base()

        class Episode(Base):
            __tablename__ = 'episodes'

            episode_pid = Column(String, primary_key=True)
            version_pid = Column(String, nullable=False, unique=True)
            brand_pid = Column(String, ForeignKey('brands.brand_pid'), nullable=False)
            title = Column(String, nullable=False)
            image_pid = Column(String, nullable=False)
            service_id = Column(String, ForeignKey('services.service_id'), nullable=False)
            start_time = Column(DateTime)
            end_time = Column(DateTime)
            synopsis = Column(String, nullable=False)