BBC Dynamic Semantic Publishing (Sport|Olympics) SemTechBiz UK

Future Media © BBC MMXII
BBC Dynamic Semantic Publishing [DSP]
SemTechBiz UK 2012
•  Jem Rayfield: BBC Future Media
•  Peter Haase: fluid Operations
•  Borislav Popov: Ontotext

Outline
•  BBC News Online
•  BBC World Cup 2010
•  BBC Sport 2012 + Olympics
•  BBC News Mobile
•  Data management

Radio since 1922, TV since 1930, Web since 1994

BBC News Online: http://bbc.co.uk/news

BBC News [Static Publishing]

Static News Architecture

BBC CPS/CMS Asset Authoring

BBC CPS/CMS Index Authoring

Static News: The Good
1) Simple
2) Scales cheaply
3) Difficult to break [bad rendering logic etc.]
4) Handles high load

Static News: The Bad
1) Relational taxonomic meta model
2) Static! Inflexible! SSI!
3) Document publishing
4) Content not re-usable
5) Content not re-purposable
6) Difficult to personalize
7) Publication per output

BBC World Cup 2010 http://bbc.co.uk/worldcup

World Cup 2010
1.  32 teams, 8 groups, 736 players -> 776 pages
2.  Fixtures & Results, Groups & Teams pages
3.  Too many web pages for too few journalists
4.  Improve the publishing system to help achieve all of this

Page Per Player http://news.bbc.co.uk/sport/football/world_cup_2010/groups_and_teams/team/england/wayne_rooney

Page Per Team

Page Per Group

Semantic publishing: USER EXPERIENCE, ONTOLOGY, TRIPLE STORE

BBC Sport: Open Sport Ontology http://www.bbc.co.uk/ontologies/sport

Journalism © BBC MMIX Extendable Domain Driven Asset Tagging

Open Ontology/Dataset reuse: Event | GeoNames | FOAF | etc.

Infer… player->team->competition
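
The player->team->competition inference can be illustrated with a minimal Python sketch. It assumes a single hypothetical rule (if X is a member of a team and that team competes in a competition, then X competes in it too). The predicate name memberOf, the short identifiers and the function name are assumptions made for illustration; in the deck the actual rules live in the ontology and are applied by the triple store's reasoner.

```python
def infer_competes_in(triples):
    """Forward-chain one rule until no new triples appear:
    (x memberOf team) and (team competesIn comp) => (x competesIn comp)."""
    closure = set(triples)
    while True:
        derived = {
            (x, "competesIn", comp)
            for (x, p1, team) in closure if p1 == "memberOf"
            for (t, p2, comp) in closure if p2 == "competesIn" and t == team
        }
        if derived <= closure:  # fixed point reached
            return closure
        closure |= derived


# Hypothetical asserted facts (identifiers shortened for readability)
asserted = {
    ("wayne_rooney", "memberOf", "england"),
    ("england", "competesIn", "world_cup_2010"),
}
closure = infer_competes_in(asserted)
```

Tagging a story with the player is then enough for it to surface on the team and competition aggregations, because the extra triple is derived rather than authored.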

Graffiti: Suggest -> Tag [Player]

Graffiti: Suggest -> Tag [Location] (GeoNames)

Graffiti Demo… (maybe video, depending on wifi…)

World Cup DSP Architecture

API Stack

<http://www.chelseafc.com/>
    domain:documentType <http://www.bbc.co.uk/things/document-types/homepage> ,
        <http://www.bbc.co.uk/things/document-types/external> .
<http://www.bbc.co.uk/sport/football/teams/chelsea>
    domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> ,
        <http://www.bbc.co.uk/things/document-types/homepage> .
<http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id>
    a sport:CompetitiveSportingOrganisation ;
    domain:canonicalName "Chelsea"^^<xsd:string> ;
    domain:document <http://www.chelseafc.com/> ,
        <http://www.bbc.co.uk/sport/football/teams/chelsea> ;
    domain:externalId <http://dbpedia.org/resource/Chelsea_F.C.> ,
        <urn:sports-stats:137316635> ;
    domain:name "Chelsea" ;
    domain:shortName "Chelsea"^^<xsd:string> ;
    sport:competesIn <http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id> .
<http://dbpedia.org/resource/Chelsea_F.C.>
    domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/dbpedia> .
<urn:sports-stats:137316635>
    domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/bbc-sport-stats> .
<http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id>
    domain:canonicalName "Premier League"^^<xsd:string> ;
    domain:externalId <urn:sports-stats:118996114> ;
    sport:competitionType <http://www.bbc.co.uk/things/competition-types/domestic-league> .
GET https://api.live.bbc.co.uk/dsp/sport/football/teams/chelsea
Accept: text/rdf+n3
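
The request at the end of this slide can be sketched in Python as follows. The sketch only builds the request object and does not send it, on the assumption that the endpoint was an internal BBC service and is not publicly reachable. The helper name build_team_request and the sport/team parameters are hypothetical; only the URL and the Accept value come from the slide.

```python
from urllib.request import Request

DSP_BASE = "https://api.live.bbc.co.uk/dsp"  # base URL taken from the slide


def build_team_request(sport, team, accept="text/rdf+n3"):
    """Build (but do not send) a GET for a team's RDF description."""
    url = f"{DSP_BASE}/sport/{sport}/teams/{team}"
    return Request(url, headers={"Accept": accept}, method="GET")


req = build_team_request("football", "chelsea")
```

Content negotiation through the Accept header lets the same URL serve N3 to one client and RDF/JSON to another.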

PHP->EasyRDF->API
PHP render layer consumes RDF from the REST API via EasyRDF (http://www.aelius.com/njh/easyrdf/)
EasyRDF is an open PHP library (primary committer: Nicholas Humfrey, BBC)
    // Request options: client certificate plus content negotiation headers
    protected function getOptions() {
        return array(
            "config" => array("usecert" => true),
            "headers" => array(
                "Accept" => "application/rdf+json",
                "X-Expect" => "http://www.bbc.co.uk/things/platforms/hiweb"
            )
        );
    }
    // Fetch the team description and load it into an EasyRDF graph
    $options = $this->getOptions();
    $response = $this->get("https://api.test.bbc.co.uk/dsp/sport/football/teams/chelsea", $options);
    $this->data = new EasyRdf_Graph("http://www.bbc.co.uk", $response->getBody());
    // Every resource typed as a competitive sporting organisation
    $teams = $this->data->allOfType("sport:CompetitiveSportingOrganisation");
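
What the allOfType call does can be approximated in a few lines of Python over the application/rdf+json serialisation, which nests subject -> predicate -> list of objects carrying "type" and "value" keys. This is a sketch, not EasyRDF's implementation. The sample response body is hand-written for illustration, and the expanded class URI is an assumption derived from the ontology URL http://www.bbc.co.uk/ontologies/sport shown earlier in the deck.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumed expansion of sport:CompetitiveSportingOrganisation
TEAM_CLASS = "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation"
CHELSEA = "http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id"
LEAGUE = "http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id"


def all_of_type(graph, class_uri):
    """Return the subjects in an RDF/JSON graph whose rdf:type is class_uri."""
    return [
        subject
        for subject, predicates in graph.items()
        if any(o.get("type") == "uri" and o.get("value") == class_uri
               for o in predicates.get(RDF_TYPE, []))
    ]


# Hand-written stand-in for the API response body (illustrative only)
body = json.dumps({
    CHELSEA: {RDF_TYPE: [{"type": "uri", "value": TEAM_CLASS}]},
    LEAGUE: {"http://www.bbc.co.uk/ontologies/domain/canonicalName":
             [{"type": "literal", "value": "Premier League"}]},
})
teams = all_of_type(json.loads(body), TEAM_CLASS)
```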

Rationale
•  Automated content publishing
•  Huge increase in content breadth (number of manageable pages)
•  Content re-use and re-purposing, increasing reach
•  Simplified content management
•  Journalist headcount reduction
•  Multi-dimensional entry points and semantic navigation
•  Improved user experience with high levels of user engagement
•  Dynamic, state (time|event) and semantic driven page layout
•  Personalized content aggregations
•  Open data and APIs

World Cup statistics: the GOOD
•  750+ dynamic aggregations/pages (Player, Squad, Group, etc.)
•  Average unique page requests a day: 2 million+
•  Average OWLIM SPARQL queries a day: 1 million
•  100s of RDF statement updates/inserts per minute with full OWL reasoning and associated inference
•  Multi data center, fully resilient, clustered 6 node triple store
•  RDF graph model ideally suited to model domain representations such as sport

World Cup statistics: the BAD
•  Sports stories and indices are static
•  Sport content is not responsive or personalized
•  RDF store unable to handle thousands of statistic updates a second
•  RDF store forward-chained closures are expensive and increase write latency
•  RDF graph model and SPARQL not ideally suited to the BBC's News and Sport document publication model

BBC Sport 2012; Online Refresh http://bbc.co.uk/sport

Sport Refresh 2012
•  Page per athlete [10,000+], page per country [200+], page per discipline [400-500], page per venue, page per team -> a lot of output…
•  Almost real-time statistics and live event pages
•  Time-coded, metadata-annotated, on-demand video: 58,000 hours of content
•  Far too many web pages for far too few journalists
•  DSP annotation architecture to automate content aggregation

Sport Refresh 2012; Red -> Bright Yellow

10000+ Dynamic Aggregations

Lots of Dynamic (Live) sports stats

Dynamic Navigation

Static Stories + Dynamic includes

Olympics
•  27 Live Video Streams
•  Live Stats overlays
•  Stats -> Ontology driven aggregations

Dynamic Schedule

Video delivery

Augment architecture with a Content Store
1.  Atomic content assets stored in MarkLogic XML store
2.  XML content queryable via XQuery
3.  Content assets searchable
4.  Sports statistics searchable/queryable via XQuery
5.  Ontological SPARQL via BigOWLIM, asset XQuery via MarkLogic

API Stack: MarkLogic, BigOWLIM

Static Sports stats (OLD)

Dynamic Sports stats

Olympic Stats delivery
Broadcast Data Feed (Delta Tre) -> Olympic Feeds Receiver -> MQ (high & low priority) -> Olympic Feeds Service -> BigOWLIM or Content store
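
The high and low priority queue between the receiver and the feeds service can be sketched in Python as below. This is an in-process illustration built on heapq, not the message broker used in production. The slides do not say which messages are high priority or how the service chooses a store, so the message shapes and the routing rule here are hypothetical.

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1  # lower number is served first


class FeedQueue:
    """Two-level priority queue; FIFO within each priority level."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def put(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def get(self):
        _, _, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)


def route(message):
    """Hypothetical rule: ontology entities go to the triple store,
    everything else (e.g. statistics) to the content store."""
    return "triple-store" if message["kind"] == "entity" else "content-store"


queue = FeedQueue()
queue.put(LOW, {"kind": "entity", "id": "athlete-profile"})
queue.put(HIGH, {"kind": "stat", "id": "live-result"})
queue.put(LOW, {"kind": "stat", "id": "schedule-update"})

delivered = []
while queue:
    msg = queue.get()
    delivered.append((msg["id"], route(msg)))
```

The live result overtakes the two low priority messages even though it arrived second, which is the point of splitting the queue.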

“You could run Nasa on that” (Lee Pollington)

MarkLogic 5 & XA & Atomikos

DSP Architecture

Ontology Aware NLP
•  Information Workbench
•  OWLIM
•  (Spice) GATE+Ontotext

Concept Extraction Zoom IN
•  Multi device capability

Concept Extraction Zoom IN
•  Entity Extraction and Disambiguation
•  Based on an ontology and a knowledge base
•  Continuous improvement through editorial feedback

Concept Extraction Zoom IN
[Diagram: load-balanced Concept Extraction Service (APP 1 … APP N-1, APP N) backed by an OWLIM Enterprise Cluster; labels: Extract, Update, Retrain, Curation, Extraction, Update]

Concept Extraction Objectives
•  Recognize concepts known from the KB
•  Solve ambiguities
•  Estimate confidence in extraction quality
•  Rank by relevance to the article
•  Continuously improve through journalist feedback

Know it all VS Know nothing Extraction
•  BBC-developed ontology of sports entity classes and relationships
•  Extensive data set of teams, players and types of sports
•  Semi-automatically distilled subset of GeoNames
•  All these served through an OWLIM Enterprise Cluster

Challenge: Over-generation
•  Finding mentions of entities from the knowledge base: not an issue
•  Millions of entities lead to a high rate of over-generation and high ambiguity
•  Up to 20 candidates for a mention in some cases

Disambiguation Approach
•  Disambiguation based on a multitude of features in the vicinity of the mention and training of a Maximum Entropy classifier
•  Graph-based disambiguation using relatedness of entities
•  Geospatial awareness used to disambiguate entities
•  Accuracy in the range of 75-90% (F1)

Disambiguation of Locations
•  Geospatial distance: a feature of OWLIM
•  Super region: GeoNames hierarchy and containment relations, e.g. parentFeature
•  RDF Rank
•  Human approval score (on the basis of curated documents)
•  Class/code based priority: a fine-grained ontology may allow rule-based or machine-learned prioritization of classes and entities, building on what has already been learned
•  Asset geo association: some entities could be disambiguated by using the asset domain association. BBC UK local sport is more likely to talk about national entities.
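
As an illustration of the geospatial distance feature alone, the Python sketch below picks the location candidate closest to a reference point using the haversine formula. It is not OWLIM's implementation, and in the approach described above distance is only one feature among several. The candidate list, the reference point and the coordinates (approximate) are assumptions chosen for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_candidate(candidates, context_point):
    """Pick the candidate whose coordinates are closest to the context point."""
    return min(candidates,
               key=lambda c: haversine_km(*c["coords"], *context_point))


# Approximate coordinates, for illustration only
candidates = [
    {"name": "London, England", "coords": (51.51, -0.13)},
    {"name": "London, Ontario", "coords": (42.98, -81.25)},
]
context = (53.48, -2.24)  # roughly Manchester, standing in for a UK-focused asset
best = nearest_candidate(candidates, context)
```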

Entity Relevance: Objective
•  Rank entities by their relatedness to the article
•  Accuracy 75%
•  We consider various frequencies of entity mentions in the article and in the entire set of articles
•  Positions in the article fields or in the first paragraphs of the body boost the relevance
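
One way to combine the two signals named above (mention frequencies in the article versus the whole collection, and a boost for prominent positions) is a weighted TF-IDF style score. The Python sketch below is an assumption about how such a score could look, not the formula used in the BBC system; the field weights, entity identifiers and document frequencies are all made up for the example.

```python
from math import log

FIELD_BOOST = {"title": 3.0, "lead": 2.0, "body": 1.0}  # hypothetical weights


def relevance(entity, article, doc_freq, n_docs):
    """Position-weighted mention count times a smoothed inverse document frequency."""
    weighted_tf = sum(FIELD_BOOST[field] * mentions.count(entity)
                      for field, mentions in article.items())
    idf = log((1 + n_docs) / (1 + doc_freq.get(entity, 0)))
    return weighted_tf * idf


# Entity mentions per article field (illustrative data)
article = {
    "title": ["chelsea"],
    "lead": ["chelsea", "london"],
    "body": ["london", "london", "premier_league"],
}
doc_freq = {"chelsea": 50, "london": 400, "premier_league": 200}
ranked = sorted(doc_freq,
                key=lambda e: relevance(e, article, doc_freq, n_docs=1000),
                reverse=True)
```

Here the club outranks the city even though the city is mentioned more often, because the club appears in the title and is rarer across the collection.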

Continuous Adaptation
•  Annotated articles are manually curated by a journalist
•  The resulting annotations trigger adaptation of the extraction as they are being stored in OWLIM
•  Immediate update of the gazetteer models
•  Regular update of the statistical models for disambiguation and relevance ranking
•  Adapting only on a set of recent documents

Ontology Aware NLP and Semantic Disambiguation
[Diagram: pipeline over OWLIM: Generic Analysis -> KB Gazetteer -> Disambiguation -> Relevance Ranking -> CES APP, with a Curate -> Update -> Retrain & Adapt feedback loop]
Example text: Ex-England boss Sven-Goran Eriksson says a "smear campaign" has been aimed at Roy Hodgson for omitting Rio Ferdinand.
Candidates from the gazetteer: Roy Hodgson: coach? Roy Hodgson: hockey player? …
After disambiguation: Roy Hodgson: coach (accepted), Roy Hodgson: hockey player (rejected); Rio Ferdinand (accepted); Sven-Goran Eriksson (accepted)
Relevance ranking:
1.  Eriksson (78%)
2.  Roy Hodgson (69%)
3.  Rio Ferdinand (58%)
4.  …

Plenty of Caching

Sport Stats REST API

Sport Stats REST API
SSL accessible API
GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/table/ais/competition/118996114
GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/table/ais/competition/118996114
Accept: application/json
GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/videprinter
GET https://api.int.bbc.co.uk/sportsdata/statsapi/formula1/year/2012/calendar
Accept: application/json
etc…

Olympics API (Semantic RDF)
/tripod2012 /athletes /{uid} /countries /{country} /countries-iso /{iso} /sports /stories /{discipline} /{discipline}/events /{discipline}/events/stories /{discipline}/events/{event} /metadata /disciplines /onestowatch /countries/{countryUrlName} /london2012 /sports/{disciplineUrlName} /sports/{disciplineUrlName}/events/{eventUrlName} /podium/events /{rscCode} /record/events /venues /{urlName}

Olympics API (RDF)
@prefix tag: <http://www.bbc.co.uk/ontologies/tag/> .
@prefix domain: <http://www.bbc.co.uk/ontologies/domain/> .
@prefix sesame: <http://www.openrdf.org/schema/sesame#> .
@prefix owlim: <http://www.ontotext.com/> .
@prefix oly: <http://www.bbc.co.uk/ontologies/2012olympics/> .
@prefix par: <http://purl.org/vocab/participation/schema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id>
    a sport:Person ;
    rdfs:label "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
    foaf:name "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
    domain:canonicalName "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
    foaf:givenName "Usain"^^xsd:string ;
    foaf:familyName "Bolt"^^xsd:string ;
    domain:name "Usain Bolt"^^xsd:string ;
    oly:dateOfBirth "1986-08-21"^^xsd:date ;
    oly:gender "M"^^xsd:string ;
    oly:height "195.0"^^xsd:float ;
    oly:weight "94.0"^^xsd:float ;
    oly:worldOlympicDream "true"^^xsd:boolean ;
    sport:discipline <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> ;
    sport:competesIn <http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id> .
<http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id>
    a sport:SportsDiscipline ;
    domain:name "Athletics"^^xsd:string ;
    domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics> .
<http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id>
    a sport:MedalCompetition ;
    domain:name "Men's 100m"^^xsd:string ;
    domain:shortName "Men's 100m"^^xsd:string ;
    domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics/events/mens-100m> ;
    oly:measurementType <http://www.bbc.co.uk/things/measurement-types/time> ;
    domain:externalId <urn:ioc2012:ATM001000> .
<http://www.bbc.co.uk/things/903ef380-bdae-4a45-9a8b-5e5a270a7d6c#id>
    oly:oneToWatch <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> .
<http://news.bbc.co.uk/sport1/hi/athletics/16554814.stm#asset>
    tag:tag <http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing> ;
    dc:title "Event Guide: ATHLETICS"^^xsd:string ;
    asset:storyType <http://www.bbc.co.uk/things/story-types/profile> ;
    domain:document <http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm> .
<http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing>
    tag:taggedWithTag <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> .
<http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm>
    domain:platform <http://www.bbc.co.uk/things/platforms/mobile> .
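
The tagging pattern in the data above is a two-hop link: an asset points to a tag node via tag:tag, and the tag node points to a domain concept via tag:taggedWithTag. The Python sketch below follows that path over a handful of triples copied from the slide to find the assets for a concept. In the deck this lookup would be a SPARQL query against the triple store; the in-memory tuple list and the function name are illustrative assumptions.

```python
TAG = "http://www.bbc.co.uk/ontologies/tag/"
DOMAIN = "http://www.bbc.co.uk/ontologies/domain/"

ASSET = "http://news.bbc.co.uk/sport1/hi/athletics/16554814.stm#asset"
TAG_NODE = "http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing"
ATHLETICS = "http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id"
USAIN_BOLT = "http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id"

# Subset of the triples shown on the slide
triples = [
    (ASSET, TAG + "tag", TAG_NODE),
    (TAG_NODE, TAG + "taggedWithTag", ATHLETICS),
    (ASSET, DOMAIN + "document",
     "http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm"),
]


def assets_tagged_with(triples, concept):
    """Follow asset -tag:tag-> tag node -tag:taggedWithTag-> concept."""
    tag_nodes = {s for (s, p, o) in triples
                 if p == TAG + "taggedWithTag" and o == concept}
    return sorted(s for (s, p, o) in triples
                  if p == TAG + "tag" and o in tag_nodes)


athletics_assets = assets_tagged_with(triples, ATHLETICS)
```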

Olympics API (Stats XML)
/olympicdata/ /athletes /simulator/{scenarioName} /{guid} /bdf-log /pid/{pid} /pid/{pid}/chapter-points /{logId} /chapter-points /pid/{pid} /{logId} /days-to-go /live-text /assets /assets/{id} /medallists /athletes/{guid} /{medalGroup} /medals /athletes/{athleteGuid} /countries/{country} /medaltable /countries/{country} /disciplines/medals /disciplines/{rsc} /overall /sportcontent /obsvideosessions /full /update /podium /countries/{country} /disciplines/{rsccode} /events/{rsccode} /latest

Olympics API (Stats XML)
/olympicdata/ /pulse /beat /beats /records /athletes/{guid} /events/{rsc} /results /athletes/{guid} /schedule /detail/days/{date} /detail/disciplines-code/{rsccode} /detail/disciplines-code/{rsccode}/events/{eventrsccode} /detail/disciplines-code/{rsc}/days/{date} /detail/disciplines/{rsc}/days/{date} /detail/disciplines/{urlname} /detail/disciplines/{urlname}/days/{date} /detail/disciplines/{urlname}/events/{eventrsccode} /overview/days /overview/disciplines /overview/disciplines/{rsc} /stats /count/{directory:.+} /sessionCode/{sessionCode} /simulator/{scenarioName} /{documentSerial} /team /{odfid} /unit-status /{sessionCode} /video /days/{date} /videosessionid/{videoSessionId} /{pid}

Olympic numbers
•  Unique browser requests per day peaked at just over 8 million in the UK and 11 million globally
•  Cumulative unique browsers online totalled 34.6 million

Olympic numbers
•  On the busiest day, the BBC delivered 2.8 petabytes, with the peak traffic moment occurring when Bradley Wiggins won Gold and we shifted 700 Gb/s
•  106 million requests for BBC Olympic video content across all online platforms
[Chart: number of people watching individual streams]
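
To put the two traffic figures on the same scale, the short Python calculation below converts the busiest day's volume into an average rate and compares it with the quoted peak. It assumes decimal petabytes (10^15 bytes), that the volume was spread over a full 24 hours, and that "Gb/s" means gigabits per second; none of these is stated on the slide, and the derived values are not figures from the deck.

```python
PETABYTE = 10 ** 15            # assumes decimal (SI) petabytes
SECONDS_PER_DAY = 24 * 60 * 60


def average_gbps(total_bytes, seconds=SECONDS_PER_DAY):
    """Average throughput in gigabits per second for a volume moved over a period."""
    return total_bytes * 8 / seconds / 1e9


avg = average_gbps(2.8 * PETABYTE)   # busiest-day volume from the slide
peak_to_average = 700 / avg          # 700 Gb/s peak from the slide
```

Under these assumptions the daily average comes out at roughly 259 Gb/s, so the peak moment ran at a little under three times the average rate for that day.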

Dynamic News mobile
•  Multi device capability
•  Responsive Web design
•  Built on a dynamic service API
•  New re-usable content model
•  Dynamic assets

News Index API
Including story data on news index XML
HTTP GET https://api.live.bbc.co.uk/content/asset/news/technology/
HTTP Headers
X-Candy-Audience: Domestic
X-Candy-Platform: EnhancedMobile
Accept: application/json
Or
HTTP Headers
X-Candy-Audience: Domestic
X-Candy-Platform: EnhancedMobile
Accept: application/xml
Contextualised output
•  Audience
•  Platform
•  Response type

News Story API
Including story data on news index XML
HTTP GET https://api.live.bbc.co.uk/content/asset/news/uk-17829360
HTTP Headers
X-Candy-Audience: Domestic
X-Candy-Platform: EnhancedMobile
Accept: application/json
Or
HTTP Headers
X-Candy-Audience: Domestic
X-Candy-Platform: EnhancedMobile
Accept: application/xml
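
The three context dimensions (audience, platform, response type) map directly onto request headers, which the following Python sketch makes explicit. It only describes the request as a plain dictionary and sends nothing; the function name and the dictionary layout are hypothetical, while the URL paths and header names are taken from the slides.

```python
CONTENT_BASE = "https://api.live.bbc.co.uk/content/asset"  # from the slides


def build_asset_request(path, audience="Domestic", platform="EnhancedMobile",
                        response_type="application/json"):
    """Describe a contextualised GET for a news index or story asset."""
    return {
        "method": "GET",
        "url": f"{CONTENT_BASE}/{path}",
        "headers": {
            "X-Candy-Audience": audience,
            "X-Candy-Platform": platform,
            "Accept": response_type,
        },
    }


story_json = build_asset_request("news/uk-17829360")
index_xml = build_asset_request("news/technology/",
                                response_type="application/xml")
```

Keeping the context in headers rather than in the path means one asset URL can serve every audience, platform and format combination.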

Instance Data Management
•  Authoring
•  Making it easy for the end user (abstracting from linked data technology)
•  Highly customizable interface, driven by the ontology
•  Interlinking and integration with other sources
•  E.g. Linked Open Data sources (DBpedia, GeoNames, ...)
•  Assets such as images, video, audio
•  Editorial & Publishing Workflows
•  Provenance and change management
•  Support for user roles
•  Fine-grained access control

Need for User Roles and Access Control
•  Journalist: view instance data
•  Subeditor: edit instance data
•  Media Manager: edit instance data; approve/reject instance data edits
•  Data Architect: edit instance data and ontology data; publish instance data
•  Administrator: approve/publish ontology edits; configuration and ACL changes
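
A role table like the one above translates naturally into a mapping from role to permitted actions. The Python sketch below encodes one reading of the slide, whose layout did not survive extraction, so the exact assignment of actions to roles is an assumption; the permission identifiers are hypothetical and this is not the Information Workbench's access control API.

```python
# One reading of the roles slide; permission names are hypothetical.
PERMISSIONS = {
    "journalist": {"view_instance_data"},
    "subeditor": {"edit_instance_data"},
    "media_manager": {"edit_instance_data", "approve_instance_edits",
                      "reject_instance_edits"},
    "data_architect": {"edit_instance_data", "edit_ontology_data",
                       "publish_instance_data"},
    "administrator": {"approve_ontology_edits", "publish_ontology_edits",
                      "change_configuration", "change_acl"},
}


def can(role, action):
    """True if the role is explicitly granted the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, frozenset())
```

A check such as can("journalist", "edit_instance_data") is False under this table, while can("media_manager", "approve_instance_edits") is True.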

Staging Architecture
[Diagram: Journalist, Data Architect, ... use a Web-Frontend (Browser) over HTTP to reach the Information Workbench (Instance Data Management), which talks SPARQL/RDF to the Data Layer: a Staging Database (unpublished data) and a Live Database (published data)]

Information Workbench Linked Data Frontend: Semantic Wiki + Rich Widgets
•  Semantic Wiki for presentation and authoring of data
•  Declarative specification of the UI based on available pool of widgets and declarative wiki-based syntax
•  Widgets have direct access to the database
•  Type-based template mechanism
[Screenshots: Wiki Page in Edit Mode … and Displayed Result Page]

Summary Pages for Instances

Structured Data View

Visual Exploration of the Database

Data Management Ontology Visualization
•  Special types of graphs for certain entity types possible, e.g. to visualize ontology

Authoring Instance Data
•  Instance creation/editing wizards based on ontology

Ontology-driven Forms
•  Generated automatically based on the schema (domain and range definitions)
•  Auto-suggestions based on the ontology
•  Input can be validated based on range definitions
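
Validating form input against a property's range can be sketched in Python as below. The property names and datatypes are borrowed from the Olympics RDF example earlier in the deck, but treating those datatypes as the declared ranges, and the parser table itself, are assumptions for illustration rather than the Information Workbench's mechanism.

```python
from datetime import date

# Parsers for a few XSD datatypes; each raises ValueError on bad input
RANGE_PARSERS = {
    "xsd:date": date.fromisoformat,
    "xsd:float": float,
    "xsd:string": str,
}

# Assumed range declarations (property -> datatype)
SCHEMA = {
    "oly:dateOfBirth": "xsd:date",
    "oly:height": "xsd:float",
    "oly:gender": "xsd:string",
}


def validate(prop, raw):
    """Return (True, parsed_value) if raw fits the property's range,
    otherwise (False, error_message)."""
    parser = RANGE_PARSERS[SCHEMA[prop]]
    try:
        return True, parser(raw)
    except ValueError as exc:
        return False, str(exc)


ok, born = validate("oly:dateOfBirth", "1986-08-21")
bad, reason = validate("oly:height", "tall")
```

Because the form fields and their checks both derive from the schema, adding a property to the ontology extends the form without any UI code changes.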

User-specified Forms
•  Forms can easily be customized, extending the schema definition
•  Supports users in interlinking existing entities by offering schema- or query-based suggestions

Change Management and Editorial Workflow
[Diagram: states Draft, Approved, Rejected, Published; transitions Edit (Editor), Approve (Reviewer), Reject (Reviewer), Publish (Publisher)]
•  All changes are logged and carry a state
•  Changes are initially in draft state
•  Changes can be approved or rejected
•  Approved changes can be published to the live database
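
The workflow above is a small state machine, sketched here in Python. The states, actions and roles come from the slide; where the Edit transition starts and ends is not recoverable from the transcript, so the two edit entries below are assumptions, as are the class and method names.

```python
# (current state, action) -> (next state, role allowed to perform it)
TRANSITIONS = {
    ("draft", "edit"): ("draft", "editor"),        # assumed: editing keeps a draft
    ("rejected", "edit"): ("draft", "editor"),     # assumed: rework after rejection
    ("draft", "approve"): ("approved", "reviewer"),
    ("draft", "reject"): ("rejected", "reviewer"),
    ("approved", "publish"): ("published", "publisher"),
}


class Change:
    """A logged change that carries a workflow state."""

    def __init__(self):
        self.state = "draft"                 # changes start in draft state
        self.log = [("create", "draft")]     # every step is logged

    def apply(self, action, role):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a change in state {self.state}")
        new_state, required_role = TRANSITIONS[key]
        if role != required_role:
            raise PermissionError(f"{action} requires role {required_role}")
        self.state = new_state
        self.log.append((action, new_state))
        return new_state


change = Change()
change.apply("approve", "reviewer")
change.apply("publish", "publisher")
```

Only changes that reach the published state would be copied from the staging database to the live database.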

Editorial Workflow in the UI

Platform future…
•  BBC sport site re-engineered to use fully dynamic approach (News Mobile style)
•  BBC news high web site re-engineered to use fully dynamic approach (News Mobile style)
•  MarkLogic as CMS repository (iSite)
•  MarkLogic Binary storage R&D
•  Etc…

Shameless plug for developers…
Do you, or does someone you know, have extensive developer experience in Java + Scala + SPARQL?
Apply online: https://careers.bbc.co.uk/fe/tpl_bbc01.asp?newms=se
Send your CV to: [email protected]
