
Rapid prototyping in BBC News with Python and AWS

Ben Nuttall
September 17, 2022

Talk given at PyCon UK 2022

Transcript

  1. Ben Nuttall
    • Software Engineer, BBC News Labs
    • Former Community Manager at Raspberry Pi
    • PyPI critical project maintainer
    • Based in Cambridgeshire
    • bennuttall.com
    • twitter.com/ben_nuttall
    • github.com/bennuttall
  2. BBC News Labs
    • Multi-disciplinary innovation team within BBC News & BBC R&D
    • Prototypes of new audience experiences
    • Solutions to help journalists
    • Research and trying out ideas
    • bbcnewslabs.co.uk
    • twitter.com/bbc_news_labs
  3. Projects
    • IDX (Identify the X): automated clipping of content in live radio for social media
    • mosromgr: processing TV/radio running orders to extract structured metadata
    • BBC Images: image metadata enrichment pipeline
  4. Project cycles
    • 3 x 2-week sprints
    • 2 weeks of tweaks
    [Timeline: Sprint 1 | Sprint 2 | Sprint 3 | Tweaks | Sprint 1]
  5. Project cycles (small projects)
    • Research week
    • 3 x 2-week sprints
    • Wrap-up week
    [Timeline: Research week | Sprint 1 | Sprint 2 | Sprint 3 | Wrap-up week]
  6. Ideation
    • Start with department objectives
    • Devise "how might we..." statements
    • Explode and converge
    • Determine project objectives
    • Research
    • Bootstrapping
    • Spikes
    • Sprint goals
    • Ticketing
  7. Research week
    • Identify stakeholders
    • Set up calls with journalists
    • Learn about existing systems and workflows
    • Get access to systems & data and get to know them
    • Set up shadowing
  8. Shadowing
    • Sit with a journalist or producer
    • Watch them do their job using existing tools
    • Work out what their workflows are
    • Look for pain points, inefficiencies, slowness, manual work that could be automated
  9. AWS services for building processing pipelines
    • Lambda functions
    • Step functions / state machines
    • Databases (DynamoDB, RDS, Timestream, etc.)
    • S3
    • SNS/SQS
    • CloudWatch
  10. AWS Lambda
    • Run code without managing server infrastructure
    • Pay for compute time instead of provisioning for peak capacity
    • Python/JavaScript/Go/etc.
    • Python 3.9
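
A Python Lambda handler is simply a function that receives the invocation payload and a context object. The sketch below is a minimal, hypothetical example to illustrate the shape of a pipeline step; the event fields and the process helper are placeholders, not from the talk:

    import json


    def process(file_id):
        # Placeholder for a real pipeline step.
        return {'file_id': file_id, 'status': 'processed'}


    def lambda_handler(event, context=None):
        # 'event' is the JSON payload the function was invoked with.
        result = process(event['file_id'])
        # Whatever is returned here becomes the input of the next step.
        return {'statusCode': 200, 'body': json.dumps(result)}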
  11. Step functions / state machines
    • Workflow design
    • Sequence of Lambdas
    • Lambdas can be implemented in different languages
    • Failures, retries, parallelisation
  12. Step functions / state machines
    • Execute with initial data
    • Pass new data on
    • Parallel paths and decisional logic
    • Specify retry logic
    • Whole state machine succeeds or fails
    • Easy access to data, exception info and lambda logs
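
To make the sequencing and retry behaviour concrete, here is a hedged sketch of a two-step state machine defined in Amazon States Language and started with boto3. The function names, ARNs and state names are placeholders, not taken from the talk:

    import json

    import boto3

    # Two Lambdas in sequence; the output of FetchItem becomes the input of ProcessItem.
    definition = {
        'StartAt': 'FetchItem',
        'States': {
            'FetchItem': {
                'Type': 'Task',
                'Resource': 'arn:aws:lambda:eu-west-1:123456789012:function:fetch-item',
                'Retry': [
                    {'ErrorEquals': ['States.TaskFailed'], 'MaxAttempts': 3, 'IntervalSeconds': 5},
                ],
                'Next': 'ProcessItem',
            },
            'ProcessItem': {
                'Type': 'Task',
                'Resource': 'arn:aws:lambda:eu-west-1:123456789012:function:process-item',
                'End': True,
            },
        },
    }

    sfn = boto3.client('stepfunctions')
    state_machine = sfn.create_state_machine(
        name='example-pipeline',
        definition=json.dumps(definition),
        roleArn='arn:aws:iam::123456789012:role/example-sfn-role',
    )

    # Kick off an execution with some initial data.
    sfn.start_execution(
        stateMachineArn=state_machine['stateMachineArn'],
        input=json.dumps({'file_id': 'abc123'}),
    )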
  14. Pydantic
    • Data parsing and settings management using Python type annotations
    • Parse and validate a lambda's input data and configuration
  15. Pydantic models
    from typing import Optional
    from datetime import datetime, timedelta
    from pydantic import BaseModel


    class InputEvent(BaseModel):
        file_id: str
        ncs_id: Optional[str]
        start_time: datetime
        duration: timedelta
        body: list[str] = []
  16. Pydantic models
    from .models import InputEvent
    from .utils import do_thing


    def lambda_handler(event, context=None):
        # Validate the incoming event against the InputEvent model.
        input_event = InputEvent(**event)
        do_thing(input_event.file_id)
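
Constructing the model is where the parsing and validation happen: a bad payload fails immediately with a ValidationError rather than partway through the pipeline. A small sketch, using the InputEvent model from slide 15 and made-up values:

    from pydantic import ValidationError

    from .models import InputEvent  # the model defined on slide 15

    event = {
        'file_id': 'abc123',
        'start_time': '2022-09-17T10:00:00',  # parsed into a datetime
        'duration': 90,                        # seconds, parsed into a timedelta
    }
    input_event = InputEvent(**event)          # ncs_id is Optional, body defaults to []

    try:
        InputEvent(**{'file_id': 'abc123'})    # missing start_time and duration
    except ValidationError as e:
        print(e)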
  17. Pydantic settings
    from pydantic import BaseSettings


    class Settings(BaseSettings):
        cert_file_path: str
        key_file_path: str

        @property
        def cert(self):
            return (self.cert_file_path, self.key_file_path)

        class Config:
            env_prefix = 'MOS_'
            env_file = '.env'
  18. Pydantic settings
    import requests

    from .models import Settings

    settings = Settings()


    def fetch_thing(url):
        r = requests.get(url, cert=settings.cert)
        return r.json()
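
With env_prefix = 'MOS_', each field is populated from a correspondingly named environment variable (or from the .env file). A quick sketch of the mapping, with made-up paths:

    import os

    from .models import Settings  # the Settings model from slide 17

    os.environ['MOS_CERT_FILE_PATH'] = '/etc/certs/client.crt'
    os.environ['MOS_KEY_FILE_PATH'] = '/etc/certs/client.key'

    settings = Settings()
    print(settings.cert)  # ('/etc/certs/client.crt', '/etc/certs/client.key')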
  19. Serverless PostgreSQL
    • Amazon Aurora PostgreSQL-Compatible Edition
    • DB instance class: serverless v1
    • Specify capacity range and scaling configuration
    • Web service data API
  20. Serverless PostgreSQL - CloudFormation
    Database:
      Type: AWS::RDS::DBCluster
      Properties:
        DBClusterIdentifier: !Ref DBClusterName
        MasterUsername: !Ref DBUsername
        MasterUserPassword: !Ref DBPassword
        DatabaseName: !Ref DBName
        Engine: aurora-postgresql
        EngineMode: serverless
        ScalingConfiguration:
          AutoPause: true
          MinCapacity: !Ref DBMinCapacity
          MaxCapacity: !Ref DBMaxCapacity
          SecondsUntilAutoPause: !Ref DBSecondsUntilAutoPause
        EnableHttpEndpoint: true
  21. Serverless PostgreSQL
    • Access via boto3
    • Or preferably use aurora-data-api or sqlalchemy-aurora-data-api
    • Connect using AWS Secrets Manager
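
Because the cluster has EnableHttpEndpoint: true, queries can go through the Data API instead of a persistent database connection. A minimal sketch using boto3's rds-data client; the ARNs, database name and table are placeholders:

    import boto3

    rds_data = boto3.client('rds-data')

    response = rds_data.execute_statement(
        resourceArn='arn:aws:rds:eu-west-1:123456789012:cluster:example-cluster',
        secretArn='arn:aws:secretsmanager:eu-west-1:123456789012:secret:example-db-credentials',
        database='example_db',
        sql='SELECT episode_pid, title FROM episodes WHERE brand_pid = :brand_pid',
        parameters=[{'name': 'brand_pid', 'value': {'stringValue': 'brand_pid_123'}}],
    )

    for record in response['records']:
        print(record)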
  22. News Labs Apps Portal
    • EC2 web server hosting static files in S3
    • Access via BBC Login or BBC certificate
    • Every project can re-use the infrastructure
    • Great for SPAs and static sites
  23. Static HTML websites with Chameleon
    • Devise a website content structure with a layout template
    • Create Chameleon templates for each page type
    • Create logic layer for retrieving data required for each page write
    • Create lambda for writing/rewriting relevant pages
      • e.g. new episode processed:
        • write new episode page /<brand>/<episode>/index.htm
        • update brand index page /<brand>/index.htm
        • update homepage /index.htm
    • Create CLI for manual rewrites
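
A rough sketch of the write step, assuming a Chameleon template and an S3 bucket for the rendered pages; the template, bucket name and data fields are illustrative, not from the talk:

    import boto3
    from chameleon import PageTemplate

    # A trivial episode page template; real templates would come from files (e.g. via PageTemplateLoader).
    episode_template = PageTemplate("""
    <html>
      <body>
        <h1>${episode['title']}</h1>
        <p>${episode['synopsis']}</p>
      </body>
    </html>
    """)

    s3 = boto3.client('s3')


    def write_episode_page(bucket, episode):
        # Render the episode page and write it to its /<brand>/<episode>/index.htm path.
        html = episode_template(episode=episode)
        key = f"{episode['brand_pid']}/{episode['episode_pid']}/index.htm"
        s3.put_object(Bucket=bucket, Key=key, Body=html.encode('utf-8'), ContentType='text/html')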
  24. Structlog
    • Structured logging
    • Looks great when running locally
      • easy to see relevant information
    • JSON logging support ideal for running in AWS
      • can access and search structured logs in CloudWatch
    • Encourages good logging practice!

    import os

    import structlog

    if os.environ.get('MOS_LOGGING') == 'JSON':
        processors = [
            structlog.stdlib.add_log_level,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ]
        structlog.configure(processors=processors)
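
Once configured, loggers come from structlog and events are logged as key/value pairs, which is what makes them searchable in CloudWatch. A short sketch with made-up event names and fields:

    import structlog

    log = structlog.get_logger()

    log.info('episode_processed', episode_pid='p0000001', duration_seconds=1.3)
    log.warning('missing_metadata', episode_pid='p0000001', field='image_pid')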
  25. Pydantic database settings
    from pydantic import BaseSettings


    class DbSettings(BaseSettings):
        arn: str
        secret_arn: str
        name: str

        class Config:
            env_prefix = 'MOS_DB_'
            env_file = '.env'
  26. sqlalchemy
    from sqlalchemy import Column, ForeignKey, String, DateTime
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()


    class Episode(Base):
        __tablename__ = 'episodes'

        episode_pid = Column(String, primary_key=True)
        version_pid = Column(String, nullable=False, unique=True)
        brand_pid = Column(String, ForeignKey('brands.brand_pid'), nullable=False)
        title = Column(String, nullable=False)
        image_pid = Column(String, nullable=False)
        service_id = Column(String, ForeignKey('services.service_id'), nullable=False)
        start_time = Column(DateTime)
        end_time = Column(DateTime)
        synopsis = Column(String, nullable=False)
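
A hedged sketch of querying the Episode model with a SQLAlchemy session. The connection string below is a placeholder; in the pipeline the engine would instead be created via sqlalchemy-aurora-data-api, with the cluster ARN, secret ARN and database name taken from the DbSettings model on slide 25:

    from sqlalchemy import create_engine, select
    from sqlalchemy.orm import Session

    from .models import Episode  # the model defined above

    # Placeholder connection string; swap in the aurora-data-api engine for real use.
    engine = create_engine('postgresql+psycopg2://user:password@localhost/example_db')

    with Session(engine) as session:
        stmt = (
            select(Episode)
            .where(Episode.brand_pid == 'brand_pid_123')
            .order_by(Episode.start_time)
        )
        for episode in session.scalars(stmt):
            print(episode.episode_pid, episode.title)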