BBC Dynamic Semantic Publishing (Sport|Olympics) SemTechBiz UK

jemrayfield
September 20, 2012

http://semtechbizuk2012.semanticweb.com/sessionPop.cfm?confid=67&proposalid=4945

In this talk, we describe the latest developments in the transformational technology strategy the BBC Future Media & Technology department is using to evolve from a relational content model and static publishing framework to a fully dynamic semantic publishing (DSP) architecture. This approach uses linked data technology to automate the aggregation, publishing and re-purposing of interrelated content objects according to an ontological domain-modelled information architecture, providing a greatly improved user experience and high levels of user engagement. The BBC's World Cup web site was the first showcase of DSP, and probably the first major implementation of semantic web technologies on a commercial media site. In 2012, the BBC will launch two new sites based on DSP: The 2012 Olympics and a completely redesigned BBC Sports site. We will focus on two recent additions to the implementation of DSP that empower these new developments: 1) fluid Operations’ Information Workbench for the authoring, curation and publishing of ontology and instance data following an editorial workflow, and 2) OntoText’s concept extraction and semantic disambiguation service (CES) facilitating journalist-moderated content annotation. CES combines machine learning with human feedback utilizing the underlying ontological domain model for a high level of quality and accuracy. In BBC’s DSP architecture, the CES makes use of instance data published with the Information Workbench in realtime, i.e. instance data changes are reflected immediately within concept extraction results. With this powerful combination, BBC journalists are enabled to publish higher quality semantic content annotations in a much more dynamic and automated fashion.

• Overview of Dynamic Semantic Publishing
• fluid Operations’ Information Workbench supporting the semantic authoring and publishing workflows
• OntoText’s concept extracting service designed for the sports domain
• Examples of DSP: BBC Sports and 2012 Olympics

Jem Rayfield is a Senior Technical Architect in the Future Media and Technology division of the British Broadcasting Corporation (BBC), specifically focusing on News, Sport & Knowledge products. This places him at the centre of BBC online architectural strategy and implementation decisions. Prior to working at the BBC, Jem was Technology Director at Razorfish, architecting solutions for numerous clients including O2 and the Financial Times. In his free time, Jem enjoys listening to and playing (badly) a wide and eclectic range of music. He also enjoys spending time at the gym.

Borislav Popov is the head of the semantic annotation and search group at Ontotext. He has led the product development of the KIM semantic platform for the last several years and leads the implementation of numerous semantics-based solutions for Ontotext clients in the media, government and defense sectors.

Peter Haase is a senior architect at fluid Operations, where he leads the research and development activities at the interface of semantic technologies and cloud computing. Previously, Peter was at the Institute of Applied Informatics and Formal Description Methods (AIFB) at the University of Karlsruhe, where he obtained his PhD in 2006. Before joining the AIFB, he worked at IBM's Silicon Valley Labs on the development of DB2 until 2003. His research interests include ontology management and evolution, decentralized information systems and the Semantic Web. At the AIFB, he previously worked in the EU IST projects SWAP (Semantic Web and Peer-to-Peer) and SEKT (Semantically Enabled Knowledge Technologies) and was a project leader for the EU IST project NeOn (Lifecycle Support for Networked Ontologies).

Transcript

  1. Future Media © BBC MMXII BBC Dynamic Semantic Publishing [DSP]

    SemtechBizUK 2012 •  Jem Rayfield : BBC Future Media •  Peter Haase : fluid Operations •  Borislav Popov : Ontotext
  2. Future Media © BBC MMXII Outline BBC News Online BBC

    World Cup 2010 BBC Sport 2012 + Olympics BBC News Mobile Data management
  3. Future Media © BBC MMXII Static News The Good 1)

    Simple 2) Scales cheaply 3) Difficult to break [bad rendering logic etc..] 4) Handles high load
  4. Future Media © BBC MMXII Static News The BAD 1) 

    Relational taxonomic meta model 2) Static! Inflexible! SSI! 3) Document publishing 4) Content non re-usable 5) Content non repurpose-able 6) Difficult to personalize 7) Publication per output
  5. Future Media © BBC MMXII 1.  32 teams, 8 groups,

    736 players → 776 pages 2.  Fixtures & Results, Groups & Teams pages 3.  Too many web pages for too few journalists 4.  Improve the publishing system to help achieve all of this World Cup 2010
  6. Journalism © BBC MMIX GET https://api.live.bbc.co.uk/dsp/sport/football/teams/chelsea Accept: text/rdf+n3

    <http://www.chelseafc.com/>
        domain:documentType <http://www.bbc.co.uk/things/document-types/homepage> ,
            <http://www.bbc.co.uk/things/document-types/external> .

    <http://www.bbc.co.uk/sport/football/teams/chelsea>
        domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> ,
            <http://www.bbc.co.uk/things/document-types/homepage> .

    <http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id>
        a sport:CompetitiveSportingOrganisation ;
        domain:canonicalName "Chelsea"^^<xsd:string> ;
        domain:document <http://www.chelseafc.com/> ,
            <http://www.bbc.co.uk/sport/football/teams/chelsea> ;
        domain:externalId <http://dbpedia.org/resource/Chelsea_F.C.> ,
            <urn:sports-stats:137316635> ;
        domain:name "Chelsea" ;
        domain:shortName "Chelsea"^^<xsd:string> ;
        sport:competesIn <http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id> .

    <http://dbpedia.org/resource/Chelsea_F.C.>
        domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/dbpedia> .

    <urn:sports-stats:137316635>
        domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/bbc-sport-stats> .

    <http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id>
        domain:canonicalName "Premier League"^^<xsd:string> ;
        domain:externalId <urn:sports-stats:118996114> ;
        sport:competitionType <http://www.bbc.co.uk/things/competition-types/domestic-league> .
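As a rough illustration of the request on slide 6, the Python sketch below builds (but does not send) a GET for the Chelsea team resource with an `Accept: text/rdf+n3` header. The URL and media type come from the slide; the helper name `build_rdf_request` is hypothetical, and the endpoint itself may require client certificates or may no longer be reachable.

```python
from urllib.request import Request

# URL and media type as shown on slide 6.
DSP_TEAM_URL = "https://api.live.bbc.co.uk/dsp/sport/football/teams/chelsea"


def build_rdf_request(url: str, media_type: str = "text/rdf+n3") -> Request:
    """Hypothetical helper: build a GET request that asks for an RDF serialisation."""
    return Request(url, headers={"Accept": media_type}, method="GET")


request = build_rdf_request(DSP_TEAM_URL)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Nothing is sent over the network here; passing `request` to `urllib.request.urlopen` would perform the call, assuming the service accepts unauthenticated clients.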
  7. Journalism © BBC MMIX PHP->EasyRDF->API PHP Render layer consumes RDF

    from REST API via EasyRDF (http://www.aelius.com/njh/easyrdf/) EasyRDF open PHP library (Primary committer Nicholas Humfrey BBC)

    // Request options: client certificate plus content negotiation headers
    protected function getOptions() {
        return array(
            "config" => array("usecert" => true),
            "headers" => array(
                "Accept" => "application/rdf+json",
                "X-Expect" => "http://www.bbc.co.uk/things/platforms/hiweb"
            )
        );
    }

    // Fetch the RDF for a team and load it into an EasyRDF graph
    $options = $this->getOptions();
    $response = $this->get("https://api.test.bbc.co.uk/dsp/sport/football/teams/chelsea", $options);
    $this->data = new EasyRdf_Graph("http://www.bbc.co.uk", $response->getBody());
    // Select every resource typed as a competitive sporting organisation
    $teams = $this->data->allOfType("sport:CompetitiveSportingOrganisation");
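To make the type lookup at the end of slide 7 more concrete, here is a minimal Python sketch of the same idea over an `application/rdf+json` body, using only the standard library. It is not how EasyRDF is implemented. The `sport:` namespace URI is an assumption (the slides never expand that prefix), the `domain:` namespace is the one declared on slide 28, and the sample body is a hand-made fragment based on the Chelsea data on slide 6.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SPORT_NS = "http://www.bbc.co.uk/ontologies/sport/"    # assumed expansion of sport:
DOMAIN_NS = "http://www.bbc.co.uk/ontologies/domain/"  # as declared on slide 28

CHELSEA = "http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id"
PREMIER_LEAGUE = "http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id"


def all_of_type(rdf_json: dict, type_uri: str) -> list:
    """Return the subjects whose rdf:type values include type_uri, sorted."""
    return sorted(
        subject
        for subject, predicates in rdf_json.items()
        if any(obj.get("value") == type_uri for obj in predicates.get(RDF_TYPE, []))
    )


# Hand-made RDF/JSON fragment standing in for the API response body.
response_body = json.dumps({
    CHELSEA: {
        RDF_TYPE: [{"type": "uri", "value": SPORT_NS + "CompetitiveSportingOrganisation"}],
        DOMAIN_NS + "name": [{"type": "literal", "value": "Chelsea"}],
    },
    PREMIER_LEAGUE: {
        DOMAIN_NS + "canonicalName": [{"type": "literal", "value": "Premier League"}],
    },
})

graph = json.loads(response_body)
teams = all_of_type(graph, SPORT_NS + "CompetitiveSportingOrganisation")
print(teams)
```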
  8. Future Media © BBC MMXII Rationale •  Automated content publishing

    •  Huge increase in content breadth (number of manageable pages) •  Content re-use and re-purposing, increasing reach •  Simplified content management •  Journalist headcount reduction •  Multi-dimensional entry points and semantic navigation •  Improved user experience with high levels of user engagement •  Dynamic, state (time|event) and semantic driven page layout •  Personalized content aggregations •  Open data and API’s
  9. Future Media © BBC MMXII •  750+ Dynamic aggregations/pages (Player,

    Squad, Group, etc..) •  Average unique page requests a day : 2 million + •  Average OWLIM SPARQL queries a day : 1 million •  100s RDF statement updates/inserts per minute with full OWL reasoning and associated inference. •  Multi data center fully resilient, clustered 6 node triple store •  RDF graph model ideally suited to model domain representations such as sport World Cup statistics the GOOD
  10. Future Media © BBC MMXII •  Sports stories and indices

    static •  Sport content not responsive or personalized •  RDF Store unable to handle thousands of statistic updates a second •  RDF Store forward-chained closures expensive increase write latency •  RDF graph model and SPARQL not ideally suited to the BBC’s News and Sport document publication model World Cup statistics the BAD
  11. Future Media © BBC MMXII Sport Refresh 2012 •  Page

    per Athlete [10,000+], Page per country [200+], Page per Discipline [400-500], Page per venue, Page per team → A lot of output… •  Almost real time statistics and live event pages •  Time coded, metadata annotated, on demand video, 58,000 hours of content •  Far too many web pages for far too few journalists •  DSP annotation architecture to automate content aggregation
  12. Future Media © BBC MMXII Olympics - 27 Live Video

    Streams Live Stats overlays Stats -> Ontology driven aggregations
  13. Future Media © BBC MMXII Augment architecture with a Content

    Store 1.  Atomic content assets stored in MarkLogic XML store 2.  XML content queryable via XQuery 3.  Content Assets searchable 4.  Sports statistics searchable/queryable via XQuery 5.  Ontological SPARQL via BigOWLIM, assets XQuery via MarkLogic
  14. Future Media © BBC MMXII Olympic Stats delivery Broadcast Data

    Feed Delta Tre Olympic Feeds Receiver MQ (high & low priority) Olympic Feeds Service BigOWLIM or Content store
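Slide 14 only states that feed messages pass through an MQ with high and low priority. As a sketch of what that ordering implies, the Python example below uses an in-memory heap so that high-priority messages are always consumed first, with arrival order preserved within a priority. The class name `FeedQueue` and the message texts are invented for illustration, and a real deployment would use a message broker rather than an in-process queue.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # a smaller number is served first


class FeedQueue:
    """Hypothetical two-level priority queue; FIFO within each priority."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker that keeps arrival order

    def put(self, priority: int, message: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._sequence), message))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = FeedQueue()
queue.put(LOW, "athlete biography update")
queue.put(HIGH, "men's 100m final result")
queue.put(LOW, "venue description update")
queue.put(HIGH, "medal table change")

drained = [queue.get() for _ in range(len(queue))]
print(drained)
```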
  15. Future Media © BBC MMXII Ontology Aware NLP •  Information

    Workbench •  OWLIM •  (Spice) GATE+Ontotext
  16. Future Media © BBC MMXII Concept Extraction Zoom IN • 

    Entity Extraction and Disambiguation •  Based on an ontology and a knowledge base •  Continuous improvement through editorial feedback
  17. Future Media © BBC MMXII Concept Extraction Zoom IN OWLIM

    Enterprise Cluster Concept Extraction Service load balancing A P P N A P P N-1 A P P 1 … Extract Update Retrain Curation Extraction Update
  18. Future Media © BBC MMXII Concept Extraction Objectives •  Recognize

    concepts known from the KB •  Solve ambiguities •  Estimate confidence in extraction quality •  Rank by relevance to the article •  Continuously improve through journalist feedback
  19. Future Media © BBC MMXII Know it all VS Know

    nothing Extraction •  BBC-developed ontology of sports entity classes and relationships •  Extensive data set of teams, players and types of sports •  Semi-automatically distilled subset of GeoNames •  All these served through an OWLIM Enterprise Cluster
  20. Future Media © BBC MMXII Challenge: Over-generation •  Finding mentions

    of entities from the knowledge base: not an issue •  Due to the millions of entities – high rate of over generation and high ambiguity •  Up to 20 candidates for a mention in some cases
  21. Future Media © BBC MMXII Disambiguation Approach •  Disambiguation based

    on multitude of features in the vicinity of the mention and training of a Max Entropy classifier •  Graph based disambiguation using relatedness of entities •  Geospatial awareness used to disambiguate entities •  Accuracy in the range of 75-90% (F1)
  22. Future Media © BBC MMXII Disambiguation of Locations •  Geospatial

    distance - a feature of OWLIM •  Super region – GeoNames hierarchy and containment relations, e.g. parentFeature •  RDF Rank •  Human approval score (on the basis of curated documents) •  Class/code based priority – fine grained ontology may allow a rule or machine learning prioritization of classes and entities based on learning we already have. •  Asset geo association - some entities could be disambiguated by using the asset domain association. BBC UK local sports is more likely to talk about national entities.
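The first feature on slide 22, geospatial distance, can be sketched in a few lines of Python. This example does not use OWLIM's geospatial support; it computes great-circle distance with the haversine formula and picks the candidate closest to a location already known from the article. The place names and coordinates are approximate values chosen for illustration, and in the described system distance is only one of several features.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b) -> float:
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_candidate(candidates: dict, context_point) -> str:
    """Return the candidate name whose coordinates lie closest to context_point."""
    return min(candidates, key=lambda name: haversine_km(candidates[name], context_point))


# Approximate coordinates, for illustration only.
boston_candidates = {
    "Boston, Lincolnshire": (52.98, -0.03),
    "Boston, Massachusetts": (42.36, -71.06),
}
lincoln_uk = (53.23, -0.54)  # a location already resolved elsewhere in the article

best = nearest_candidate(boston_candidates, lincoln_uk)
print(best)
```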
  23. Future Media © BBC MMXII Entity Relevance: Objective •  Rank

    entities by their relatedness to the article •  Accuracy 75% •  We consider various frequencies of entity mentions in the article and in the entire set of articles •  Positions in the article fields or in the first paragraphs of the body boost the relevance
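Slide 23 names the ingredients of relevance ranking (mention frequency in the article, frequency across all articles, and a boost for prominent positions) but not the formula. The Python sketch below combines them in a TF-IDF style score with a multiplicative lead boost. The formula, the boost value and all counts are assumptions made for illustration, not the scoring used by the concept extraction service.

```python
from math import log


def relevance(entity, mentions, doc_freq, corpus_size, lead_entities, lead_boost=1.5):
    """Assumed score: in-article frequency weighted by rarity, boosted for lead positions."""
    tf = mentions.get(entity, 0)
    idf = log((1 + corpus_size) / (1 + doc_freq.get(entity, 0))) + 1.0
    score = tf * idf
    if entity in lead_entities:
        score *= lead_boost
    return score


def rank_entities(mentions, doc_freq, corpus_size, lead_entities):
    """Return the mentioned entities ordered from most to least relevant."""
    return sorted(
        mentions,
        key=lambda e: relevance(e, mentions, doc_freq, corpus_size, lead_entities),
        reverse=True,
    )


# Made-up counts for one article and a corpus of 1,000 articles.
mentions = {"Roy Hodgson": 3, "Rio Ferdinand": 2, "England": 3}
doc_freq = {"Roy Hodgson": 40, "Rio Ferdinand": 25, "England": 600}
lead_entities = {"Roy Hodgson"}  # entities found in the headline or first paragraphs

ranking = rank_entities(mentions, doc_freq, 1000, lead_entities)
print(ranking)
```

With these counts, "England" is mentioned as often as "Roy Hodgson" but ranks last because it appears in most of the corpus.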
  24. Future Media © BBC MMXII Continuous Adaptation •  Annotated articles

    are manually curated by a journalist •  The resulting annotations trigger adaptation of the extraction as they are being stored in OWLIM •  Immediate update of the gazetteer models •  Regular update of the statistical models for disambiguation and relevance ranking •  Adapting only on a set of recent documents
  25. Future Media © BBC MMXII Ontology Aware NLP and Semantic

    Disambiguation OWLIM Generic Analysis … KB Gazetteer … … … … Disambiguation … … … Relevance Ranking Ex-England boss Sven-Goran Eriksson says a "smear campaign" has been aimed at Roy Hodgson for omitting Rio Ferdinand. ? Roy Hodgson: coach ? Roy Hodgson: hockey player ? ………. V Roy Hodgson: coach -  Roy Hodgson: hockey player -  ………. V Rio Ferdinand -  ……. -  ………. V Sven-Goran Eriksson -  ……. -  ………. CES APP 1.  Eriksson (78%) 2.  Roy Hodgson (69%) 3.  Rio Ferdinand (58%) 4.  … Curate Update Retrain & Adapt
  26. Future Media © BBC MMXII Sport Stats REST API SSL

    Accessible API GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/table/ais/competition/118996114 GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/table/ais/competition/118996114 Accept: application/json GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/videprinter GET https://api.int.bbc.co.uk/sportsdata/statsapi/formula1/year/2012/calendar Accept: application/json etc……etc…..etc….
  27. Future Media © BBC MMXII Olympics API (Semantic RDF) /tripod2012

    /athletes /{uid} /countries /{country} /countries-iso /{iso} /sports /stories /{discipline} /{discipline}/events /{discipline}/events/stories /{discipline}/events/{event} /metadata /disciplines /onestowatch /countries/{countryUrlName} /london2012 /sports/{disciplineUrlName} /sports/{disciplineUrlName}/events/{eventUrlName} /podium/events /{rscCode} /record/events /venues /{urlName}
  28. Future Media © BBC MMXII Olympics API (RDF)

    @prefix tag: <http://www.bbc.co.uk/ontologies/tag/> .
    @prefix domain: <http://www.bbc.co.uk/ontologies/domain/> .
    @prefix sesame: <http://www.openrdf.org/schema/sesame#> .
    @prefix owlim: <http://www.ontotext.com/> .
    @prefix oly: <http://www.bbc.co.uk/ontologies/2012olympics/> .
    @prefix par: <http://purl.org/vocab/participation/schema#> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .

    <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id>
        a sport:Person ;
        rdfs:label "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
        foaf:name "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
        domain:canonicalName "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
        foaf:givenName "Usain"^^xsd:string ;
        foaf:familyName "Bolt"^^xsd:string ;
        domain:name "Usain Bolt"^^xsd:string ;
        oly:dateOfBirth "1986-08-21"^^xsd:date ;
        oly:gender "M"^^xsd:string ;
        oly:height "195.0"^^xsd:float ;
        oly:weight "94.0"^^xsd:float ;
        oly:worldOlympicDream "true"^^xsd:boolean ;
        sport:discipline <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> ;
        sport:competesIn <http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id> .

    <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id>
        a sport:SportsDiscipline ;
        domain:name "Athletics"^^xsd:string ;
        domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics> .

    <http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id>
        a sport:MedalCompetition ;
        domain:name "Men's 100m"^^xsd:string ;
        domain:shortName "Men's 100m"^^xsd:string ;
        domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics/events/mens-100m> ;
        oly:measurementType <http://www.bbc.co.uk/things/measurement-types/time> ;
        domain:externalId <urn:ioc2012:ATM001000> .

    <http://www.bbc.co.uk/things/903ef380-bdae-4a45-9a8b-5e5a270a7d6c#id>
        oly:oneToWatch <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> .

    <http://news.bbc.co.uk/sport1/hi/athletics/16554814.stm#asset>
        tag:tag <http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing> ;
        dc:title "Event Guide: ATHLETICS"^^xsd:string ;
        asset:storyType <http://www.bbc.co.uk/things/story-types/profile> ;
        domain:document <http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm> .

    <http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing>
        tag:taggedWithTag <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> .

    <http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm>
        domain:platform <http://www.bbc.co.uk/things/platforms/mobile> .
  29. Future Media © BBC MMXII Olympics API (Stats XML) /olympicdata/

    /athletes /simulator/{scenarioName} /{guid} /bdf-log /pid/{pid} /pid/{pid}/chapter-points /{logId} /chapter-points /pid/{pid} /{logId} /days-to-go /live-text /assets /assets/{id} /medallists /athletes/{guid} /{medalGroup} /medals /athletes/{athleteGuid} /countries/{country} /medaltable /countries/{country} /disciplines/medals /disciplines/{rsc} /overall /sportcontent /obsvideosessions /full /update /podium /countries/{country} /disciplines/{rsccode} /events/{rsccode} /latest
  30. Future Media © BBC MMXII Olympics API (Stats XML) /olympicdata/

    /pulse /beat /beats /records /athletes/{guid} /events/{rsc} /results /athletes/{guid} /schedule /detail/days/{date} /detail/disciplines-code/{rsccode} /detail/disciplines-code/{rsccode}/events/{eventrsccode} /detail/disciplines-code/{rsc}/days/{date} /detail/disciplines/{rsc}/days/{date} /detail/disciplines/{urlname} /detail/disciplines/{urlname}/days/{date} /detail/disciplines/{urlname}/events/{eventrsccode} /overview/days /overview/disciplines /overview/disciplines/{rsc} /stats /count/{directory:.+} /sessionCode/{sessionCode} /simulator/{scenarioName} /{documentSerial} /team /{odfid} /unit-status /{sessionCode} /video /days/{date} /videosessionid/{videoSessionId} /{pid}
  31. Future Media © BBC MMXII Unique Browser Requests per day

    Peaked at just over 8 million UK and 11 million Globally. Cumulative Unique browsers online total 34.6 million Olympic numbers
  32. Future Media © BBC MMXII On the busiest day, the

    BBC delivered 2.8 petabytes, with the peak traffic moment occurring when Bradley Wiggins won Gold and we shifted 700 Gb/s. 106 million requests for BBC Olympic video content across all online platforms Number of people watching individual Streams → Olympic numbers
  33. Future Media © BBC MMXII Dynamic News mobile •  Multi

    device capability •  Responsive Web design •  Built on a dynamic service API •  New re-usable content model •  Dynamic assets
  34. Future Media © BBC MMXII News Index API Including story

    data on news index XML HTTP GET https://api.live.bbc.co.uk/content/asset/news/technology/ HTTP Headers X-Candy-Audience: Domestic X-Candy-Platform: EnhancedMobile Accept: application/json Or HTTP Headers X-Candy-Audience: Domestic X-Candy-Platform: EnhancedMobile Accept: application/xml Contextualised output •  Audience •  Platform •  Response type
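The three context dimensions on slide 34 (audience, platform, response type) map directly onto request headers. The Python sketch below assembles that header set and rejects response types other than the two shown on the slide. The header names and values are taken from the slide; the function name `candy_headers` and the short `json`/`xml` keys are hypothetical.

```python
# Media types shown on slide 34, keyed by a short (hypothetical) response-type name.
MEDIA_TYPES = {"json": "application/json", "xml": "application/xml"}


def candy_headers(audience: str, platform: str, response_type: str) -> dict:
    """Hypothetical helper: build the contextualisation headers for a content API call."""
    try:
        accept = MEDIA_TYPES[response_type]
    except KeyError:
        raise ValueError(f"unsupported response type: {response_type!r}") from None
    return {
        "X-Candy-Audience": audience,
        "X-Candy-Platform": platform,
        "Accept": accept,
    }


headers = candy_headers("Domestic", "EnhancedMobile", "json")
for name, value in headers.items():
    print(f"{name}: {value}")
```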
  35. Future Media © BBC MMXII News Story API Including story

    data on news index XML HTTP GET https://api.live.bbc.co.uk/content/asset/news/uk-17829360 HTTP Headers X-Candy-Audience: Domestic X-Candy-Platform: EnhancedMobile Accept: application/json Or HTTP Headers X-Candy-Audience: Domestic X-Candy-Platform: EnhancedMobile Accept: application/xml
  36. Future Media © BBC MMXII Instance Data Management •  Authoring

    •  Making it easy for the end user (abstracting from linked data technology) •  Highly customizable interface, driven by the ontology •  Interlinking and integration with other sources •  E.g. Linked Open Data sources (DBpedia, Geonames, ...) •  Assets such as images, video, audio •  Editorial & Publishing Workflows •  Provenance and change management •  Support for user roles •  Fine granular access control
  37. Need for User Roles and Access Control •  Journalist View

    Instance Data •  Subeditor Edit instance data •  Media Manager Edit instance data Approve/reject instance data edits •  Data Architect Edit instance data and ontology data edits Publish instance data •  Administrator Approve/publish ontology edits Configuration ACL changes
  38. Staging Architecture Staging Database Live Database Data Layer Information Workbench

    (Instance Data Management) SPARQL/RDF HTTP Journalist, Data Architect, ... Web-Frontend (Browser) Unpublished Data Published Data
  39. Information Workbench Linked Data Frontend: Semantic Wiki + Rich Widgets

    •  Semantic Wiki for presentation and authoring of data •  Declarative specification of the UI based on available pool of widgets and declarative wiki-based syntax •  Widgets have direct access to the database •  Type-based template mechanism Wiki Page in Edit Mode … … and Displayed Result Page
  40. Data Management Ontology Visualization •  Special types of graphs for

    certain entity types possible, e.g. to visualize ontology
  41. Ontology-driven Forms •  Generated automatically based on the schema (domain

    and range definitions) •  Auto-suggestions based on the ontology •  Input can be validated based on range definitions
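Validating form input against range definitions, as slide 41 describes, could look roughly like the Python sketch below. It is not the Information Workbench implementation: the schema is a hand-written dictionary, and the three properties and their ranges are inferred from the datatypes of the literals on slide 28.

```python
from datetime import date


def _parses(parser):
    """Build a checker that accepts a string if parser(value) does not raise ValueError."""
    def check(value: str) -> bool:
        try:
            parser(value)
        except ValueError:
            return False
        return True
    return check


# One checker per supported range datatype.
RANGE_CHECKS = {
    "xsd:date": _parses(date.fromisoformat),
    "xsd:float": _parses(float),
    "xsd:boolean": lambda value: value in ("true", "false"),
}

# Assumed property ranges, inferred from the literals on slide 28.
SCHEMA_RANGES = {
    "oly:dateOfBirth": "xsd:date",
    "oly:height": "xsd:float",
    "oly:worldOlympicDream": "xsd:boolean",
}


def validate(prop: str, value: str) -> bool:
    """Check a form value against the range declared for its property."""
    return RANGE_CHECKS[SCHEMA_RANGES[prop]](value)


print(validate("oly:dateOfBirth", "1986-08-21"))
print(validate("oly:height", "tall"))
```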
  42. User-specified Forms •  Forms can easily be customized, extending the

    schema definition •  Supports users in interlinking existing entities by offering schema- or query-based suggestions
  43. Change Management and Editorial Workflow Draft Approved Rejected Published Approve

    (Reviewer) Publish (Publisher) Reject (Reviewer) Edit (Editor) •  All changes are logged and carry a state •  Changes are initially in draft state •  Changes can be approved or rejected •  Approved changes can be published to the live database
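The editorial workflow on slide 43 is a small state machine, sketched below in Python. States, actions and roles are those named on the slide. The slide does not make clear from which states Edit is allowed, so permitting it on draft and rejected changes is an assumption, as are the class and variable names.

```python
# (current state, action) -> (required role, next state)
TRANSITIONS = {
    ("draft", "approve"): ("reviewer", "approved"),
    ("draft", "reject"): ("reviewer", "rejected"),
    ("approved", "publish"): ("publisher", "published"),
    # Assumption: editing a draft or a rejected change yields a draft again.
    ("draft", "edit"): ("editor", "draft"),
    ("rejected", "edit"): ("editor", "draft"),
}


class Change:
    """A logged change to instance data that starts in the draft state."""

    def __init__(self, description: str):
        self.description = description
        self.state = "draft"
        self.log = []  # (from_state, action, to_state) tuples

    def apply(self, action: str, role: str) -> str:
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a change in state {self.state}")
        required_role, next_state = TRANSITIONS[key]
        if role != required_role:
            raise PermissionError(f"{action} requires role {required_role}")
        self.log.append((self.state, action, next_state))
        self.state = next_state
        return self.state


change = Change("Update Chelsea short name")
change.apply("approve", "reviewer")
change.apply("publish", "publisher")
print(change.state, change.log)
```

Only approved changes can reach the published state, which matches the last bullet on the slide.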
  44. Future Media © BBC MMXII BBC sport site re-engineered to

    use fully dynamic approach (News Mobile style) BBC news high web site re-engineered to use fully dynamic approach (News Mobile style) MarkLogic as CMS repository (iSite) MarkLogic Binary storage R&D Etc….etc.. Platform future…..
  45. Journalism © BBC MMIX Shameless plug for developers… Do you

    or does someone you know have Extensive developer experience in: JAVA + Scala + SPARQL Apply online: https://careers.bbc.co.uk/fe/tpl_bbc01.asp?newms=se Send your CV to: [email protected]