Rapid prototyping in BBC News with Python and AWS

Slide 1

Slide 1 text

@ben_nuttall Rapid prototyping in BBC News with Python and AWS

Slide 2

Slide 2 text

Ben Nuttall
● Software Engineer, BBC News Labs
● Former Community Manager at Raspberry Pi
● Based in Cambridgeshire, UK
● bennuttall.com
● twitter.com/ben_nuttall
● github.com/bennuttall

Slide 3

Slide 3 text

COVID
● I was looking forward to attending EuroPython in person for the first time since 2019 (Basel)
● Unfortunately, I recently got COVID
● Thank you to EuroPython for making a remote-friendly conference

Slide 4

Slide 4 text

BBC News Labs
● Multi-disciplinary innovation team within BBC News & BBC R&D
● Prototypes of new audience experiences
● Solutions to help journalists
● Research and trying out ideas
● bbcnewslabs.co.uk
● twitter.com/bbc_news_labs

Slide 5

Slide 5 text

Projects
● IDX (Identify the X)
  – Automated clipping of content in live radio for social media
● mosromgr
  – Processing TV/radio running orders to extract structured metadata
● BBC Images
  – Image metadata enrichment pipeline

Slide 6

Slide 6 text

Project cycles
● 3 x 2-week sprints
● 2 weeks of tweaks
[Timeline diagram: Sprint 1, Sprint 2, Sprint 3, Tweaks, Sprint 1]

Slide 7

Slide 7 text

Project cycles
● Research week
● 3 x 2-week sprints
● Wrap-up week
[Timeline diagram: Research week, Sprint 1, Sprint 2, Sprint 3, Wrap-up week, Small projects]

Slide 8

Slide 8 text

Ideation
● Start with department objectives
● Devise "how might we..." statements
● Explode and converge
● Determine project objectives
● Research
● Bootstrapping
● Spikes
● Sprint goals
● Ticketing

Slide 9

Slide 9 text

Research week
● Identify stakeholders
● Set up calls with journalists
● Learn about existing systems and workflows
● Get access to systems & data and get to know them
● Set up shadowing

Slide 10

Slide 10 text

Shadowing
● Sit with a journalist or producer
● Watch them do their job using existing tools
● Work out what their workflows are
● Look for pain points, inefficiencies, slowness, manual work that could be automated

Slide 11

Slide 11 text

AWS services for building processing pipelines
● Lambda functions
● Step functions / state machines
● Databases (DynamoDB, RDS, Timestream)
● S3
● SNS/SQS
● CloudWatch

Slide 12

Slide 12 text

AWS Lambda
● Run code without managing server infrastructure
● Pay for compute time instead of provisioning for peak capacity
● Python/NodeJS/Go/etc

Slide 13

Slide 13 text

Step functions / state machines
● Workflow design
● Sequence of Lambdas
● Lambdas can be implemented in different languages
● Failures, retries, parallelization

Slide 14

Slide 14 text

Step functions / state machines
● Execute with initial data
● Pass new data on
● Parallel paths and decisional logic
● Specify retry logic
● Whole state machine succeeds or fails
● Easy access to data, exception info and lambda logs
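
The points above (sequence of Lambdas, decisional logic, retry logic) can be sketched as a state machine definition in Amazon States Language, built here as a Python dict and serialised to JSON. This is a minimal sketch, not the definition used by News Labs: the state names, the `item_count` field and the Lambda ARNs are hypothetical placeholders.

```python
import json

# Hypothetical Lambda ARNs: placeholders, not real resources.
PARSE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:parse-input"
STORE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:store-output"

definition = {
    "Comment": "Illustrative two-step pipeline with a retry and a choice",
    "StartAt": "Parse",
    "States": {
        "Parse": {
            "Type": "Task",
            "Resource": PARSE_ARN,
            # Retry logic is declared per state, not inside the Lambda code.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "HasItems",
        },
        "HasItems": {
            # Decisional logic based on the data passed on by the previous state.
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.item_count", "NumericGreaterThan": 0, "Next": "Store"}
            ],
            "Default": "NothingToDo",
        },
        "Store": {"Type": "Task", "Resource": STORE_ARN, "End": True},
        "NothingToDo": {"Type": "Succeed"},
    },
}


def undefined_targets(definition: dict) -> list:
    """Return the names of states that are referenced but never defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return sorted(targets - set(states))


definition_json = json.dumps(definition, indent=2)
```

The small `undefined_targets` helper is only a local sanity check before deploying; Step Functions performs its own validation when the state machine is created.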

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Step functions / state machines

    def lambda_handler(event: dict, context=None) -> dict:
        ...
        event['thing'] = do_thing(data)
        return event
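
The handler pattern above (take the event, add a key, return the event) can be exercised locally by chaining handlers in plain Python. This is a rough sketch for local experimentation only: `add_word_count`, `add_summary` and `run_pipeline` are hypothetical names, and the retry loop merely imitates what Step Functions would do per state.

```python
import time


def add_word_count(event: dict, context=None) -> dict:
    # First step: derive a value from the input and add it to the event.
    event["word_count"] = len(event["text"].split())
    return event


def add_summary(event: dict, context=None) -> dict:
    # Second step: relies on the key added by the previous step.
    event["summary"] = f"{event['word_count']} words"
    return event


def run_pipeline(handlers, event: dict, max_attempts: int = 3, delay: float = 0.0) -> dict:
    """Pass the event through each handler in order, retrying failed steps."""
    for handler in handlers:
        for attempt in range(1, max_attempts + 1):
            try:
                event = handler(event)
                break
            except Exception:
                # Give up (and fail the whole run) after the last attempt.
                if attempt == max_attempts:
                    raise
                time.sleep(delay)
    return event


result = run_pipeline(
    [add_word_count, add_summary],
    {"text": "rapid prototyping in BBC News"},
)
```

Unlike this local loop, the real service passes each state's output to the next state as JSON, so anything a handler returns has to be JSON-serialisable.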

Slide 17

Slide 17 text

Pydantic
● Data parsing and settings management using Python type annotations
● Parse and validate a lambda’s input data and configuration

Slide 18

Slide 18 text

Pydantic models

    from typing import Optional
    from datetime import datetime, timedelta

    from pydantic import BaseModel

    class InputEvent(BaseModel):
        file_id: str
        ncs_id: Optional[str]
        start_time: datetime
        duration: timedelta
        body: list[str] = []

Slide 19

Slide 19 text

Pydantic models

    from .models import InputEvent
    from .utils import do_thing

    def lambda_handler(event, context=None):
        # Parse and validate the raw event dict into a typed model
        input_event = InputEvent(**event)
        # file_id is a field defined on InputEvent
        do_thing(input_event.file_id)
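
To show what the parsing step buys a handler, here is a small self-contained sketch based on the `InputEvent` model. The sample events and the shape of the returned dict are hypothetical. `ncs_id` is given an explicit `None` default here so the field is optional in both Pydantic 1.x and 2.x.

```python
from datetime import datetime, timedelta
from typing import Optional

from pydantic import BaseModel, ValidationError


class InputEvent(BaseModel):
    file_id: str
    ncs_id: Optional[str] = None
    start_time: datetime
    duration: timedelta
    body: list[str] = []


def lambda_handler(event, context=None):
    try:
        # Strings and numbers from the JSON event are coerced to rich types.
        input_event = InputEvent(**event)
    except ValidationError as exc:
        # Report how many fields failed instead of crashing deeper in the code.
        return {"ok": False, "error_count": len(exc.errors())}
    end_time = input_event.start_time + input_event.duration
    return {
        "ok": True,
        "file_id": input_event.file_id,
        "end_time": end_time.isoformat(),
    }


# Hypothetical events: one valid, one missing required fields.
good = lambda_handler(
    {"file_id": "abc123", "start_time": "2022-07-13T09:00:00", "duration": 90}
)
bad = lambda_handler({"start_time": "not a date"})
```

Because `start_time` and `duration` arrive as a `datetime` and a `timedelta`, the handler can do date arithmetic directly; the invalid event is rejected at the boundary.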

Slide 20

Slide 20 text

Pydantic settings

    from pydantic import BaseSettings

    class Settings(BaseSettings):
        cert_file_path: str
        key_file_path: str

        @property
        def cert(self):
            return (self.cert_file_path, self.key_file_path)

        class Config:
            env_prefix = 'MOS_'
            env_file = '.env'

Slide 21

Slide 21 text

Pydantic settings

    import requests

    from .models import Settings

    settings = Settings()

    def fetch_thing(url):
        r = requests.get(url, cert=settings.cert)
        return r.json()

Slide 22

Slide 22 text

AWS databases
● DynamoDB
  – Serverless
  – NoSQL tables
  – JSON data storage
● Timestream
  – Serverless time series database
  – SQL optimised for time series data
● RDS
  – Managed SQL databases
  – Serverless option available

Slide 23

Slide 23 text

Serverless PostgreSQL
● Amazon Aurora PostgreSQL-Compatible Edition
● DB instance class: serverless v1
● Specify capacity range and scaling configuration
● Web service data API

Slide 24

Slide 24 text

Serverless PostgreSQL - CloudFormation

    Database:
      Type: AWS::RDS::DBCluster
      Properties:
        DBClusterIdentifier: !Ref DBClusterName
        MasterUsername: !Ref DBUsername
        MasterUserPassword: !Ref DBPassword
        DatabaseName: !Ref DBName
        Engine: aurora-postgresql
        EngineMode: serverless
        ScalingConfiguration:
          AutoPause: true
          MinCapacity: !Ref DBMinCapacity
          MaxCapacity: !Ref DBMaxCapacity
          SecondsUntilAutoPause: !Ref DBSecondsUntilAutoPause
        EnableHttpEndpoint: true

Slide 25

Slide 25 text

Serverless PostgreSQL

Slide 26

Slide 26 text

Serverless PostgreSQL
● Access via boto3
● Or preferably use aurora-data-api or sqlalchemy-aurora-data-api
● Connect using AWS Secrets Manager
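
For the plain boto3 route, a query through the Data API might look roughly like the sketch below. `fetch_rows` is a hypothetical helper, and the table and column names are illustrative. The client is passed in so the function can be tried without AWS credentials; `FakeDataApiClient` is a stand-in that returns a response in the shape of the Data API's `execute_statement` result.

```python
def fetch_rows(client, cluster_arn, secret_arn, database, sql, parameters=None):
    """Run a SQL statement via the RDS Data API and return rows as dicts."""
    response = client.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,  # credentials are read from AWS Secrets Manager
        database=database,
        sql=sql,
        parameters=parameters or [],
        includeResultMetadata=True,  # needed to get column names back
    )
    columns = [col["name"] for col in response.get("columnMetadata", [])]
    rows = []
    for record in response.get("records", []):
        # Each field is a one-key dict such as {"stringValue": "..."} or {"isNull": True}.
        values = [
            None if field.get("isNull") else next(iter(field.values()))
            for field in record
        ]
        rows.append(dict(zip(columns, values)))
    return rows


class FakeDataApiClient:
    """Stand-in for boto3.client('rds-data'), for local experimentation only."""

    def execute_statement(self, **kwargs):
        self.last_call = kwargs
        return {
            "columnMetadata": [{"name": "episode_pid"}, {"name": "title"}],
            "records": [[{"stringValue": "p0000001"}, {"isNull": True}]],
        }


client = FakeDataApiClient()  # in AWS: client = boto3.client("rds-data")
rows = fetch_rows(
    client,
    cluster_arn="arn:aws:rds:eu-west-1:123456789012:cluster:example",  # placeholder
    secret_arn="arn:aws:secretsmanager:eu-west-1:123456789012:secret:example",  # placeholder
    database="example",
    sql="SELECT episode_pid, title FROM episodes WHERE episode_pid = :pid",
    parameters=[{"name": "pid", "value": {"stringValue": "p0000001"}}],
)
```

The amount of unpacking needed for typed fields is one reason to prefer a wrapper such as aurora-data-api or sqlalchemy-aurora-data-api, as the slide suggests.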

Slide 27

Slide 27 text

News Labs Apps Portal
● EC2 web server hosting static files in S3
● Access via BBC Login or BBC certificate
● Every project can re-use the infrastructure
● Great for SPAs and static sites

Slide 28

Slide 28 text

Static HTML websites with Chameleon
● Devise a website content structure with a layout template
● Create Chameleon templates for each page type
● Create logic layer for retrieving data required for each page write
● Create lambda for writing/rewriting relevant pages
  – e.g. new episode processed:
    ● write new episode page ///index.htm
    ● update brand index page //index.htm
    ● update homepage /index.htm
● Create CLI for manual rewrites

Slide 29

Slide 29 text

Structlog
● Structured logging
● Looks great when running locally - easy to see relevant information
● JSON logging support ideal for running in AWS - can access and search structured logs in CloudWatch
● Encourages good logging practice!

    import os

    import structlog

    # Switch to JSON output when running in AWS
    if os.environ.get('MOS_LOGGING') == 'JSON':
        processors = [
            structlog.stdlib.add_log_level,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ]
        structlog.configure(processors=processors)

Slide 30

Slide 30 text

Structlog

Slide 31

Slide 31 text

Pydantic database settings

    from pydantic import BaseSettings

    class DbSettings(BaseSettings):
        arn: str
        secret_arn: str
        name: str

        class Config:
            env_prefix = 'MOS_DB_'
            env_file = '.env'

Slide 32

Slide 32 text

sqlalchemy

    from sqlalchemy import Column, ForeignKey, String, DateTime
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Episode(Base):
        __tablename__ = 'episodes'

        episode_pid = Column(String, primary_key=True)
        version_pid = Column(String, nullable=False, unique=True)
        brand_pid = Column(String, ForeignKey('brands.brand_pid'), nullable=False)
        title = Column(String, nullable=False)
        image_pid = Column(String, nullable=False)
        service_id = Column(String, ForeignKey('services.service_id'), nullable=False)
        start_time = Column(DateTime)
        end_time = Column(DateTime)
        synopsis = Column(String, nullable=False)

Slide 33

Slide 33 text

sqlalchemy

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session

    class MosDatabase:
        def __init__(self):
            # DbSettings is the class from the "Pydantic database settings" slide
            settings = DbSettings()
            self.engine = create_engine(
                f'postgresql+auroradataapi://:@/{settings.name}',
                connect_args=dict(
                    aurora_cluster_arn=settings.arn,
                    secret_arn=settings.secret_arn,
                )
            )

        def get_episode(self, episode_pid: str) -> Episode:
            with Session(self.engine) as session:
                query = Episode.__table__.select().where(Episode.episode_pid == episode_pid)
                # Returns a single row as a mapping; raises if there is not exactly one match
                return session.execute(query).mappings().one()

Slide 34

Slide 34 text

Lambda function URLs and FastAPI
● Dedicated HTTP endpoint for a Lambda function
● Serverless REST API
● FastAPI (built on Starlette and Pydantic) makes it very easy to provide a REST API
  – Serverless API for a serverless database
  – Define in/out data structure with Pydantic
  – Easily add authentication
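
To make the "dedicated HTTP endpoint" point concrete, the sketch below shows a bare handler for a function URL request without any framework, using only the standard library. It is not the FastAPI approach described above. It only illustrates the event and response shapes that a framework would otherwise hide. The `EPISODES` data and the URL layout are hypothetical.

```python
import json

# Hypothetical in-memory data standing in for a database lookup.
EPISODES = {"p0000001": {"title": "Example episode"}}


def _response(status: int, payload: dict) -> dict:
    # Function URLs expect a status code, optional headers and a string body.
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event, context=None):
    # Function URL events carry the HTTP method and path in these fields.
    method = event["requestContext"]["http"]["method"]
    path = event.get("rawPath", "/")

    if method != "GET":
        return _response(405, {"detail": "Method not allowed"})

    # Treat the last path segment as the episode identifier.
    episode_pid = path.strip("/").split("/")[-1]
    episode = EPISODES.get(episode_pid)
    if episode is None:
        return _response(404, {"detail": "Episode not found"})
    return _response(200, {"episode_pid": episode_pid, **episode})


found = lambda_handler(
    {"rawPath": "/episodes/p0000001", "requestContext": {"http": {"method": "GET"}}}
)
missing = lambda_handler(
    {"rawPath": "/episodes/nope", "requestContext": {"http": {"method": "GET"}}}
)
```

The manual routing, validation and serialisation in this handler are what FastAPI and Pydantic take over in the approach described on the slide.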

Slide 35

Slide 35 text

Learning
● Move fast and learn things!
● Take learnings into the next project
● Use spikes to try ideas out
● No project is perfect
● No hard rules - determine good practice and keep improving
● Knowledge share
● Prioritise for delivery
● Use research week and wrap-up week wisely

Slide 36

Slide 36 text

Rapid prototyping in BBC News with Python and AWS