jemrayfield
September 14, 2012

BBC Dynamic Semantic Publishing (Sport|Olympics) XMLAmsterdam

http://www.xmlamsterdam.com/2012/sessions#closingkeynote

Jem describes the latest developments in the transformational technology strategy the BBC Future Media & Technology department is using to evolve from a relational content model and static publishing framework to a fully dynamic semantic publishing (DSP) architecture. This approach uses linked data technology to automate the aggregation, publishing and re-purposing of interrelated content objects according to an ontological domain-modelled information architecture, providing a greatly improved user experience and high levels of user engagement. The BBC's World Cup web site was the first showcase of DSP, and probably the first major implementation of semantic web technologies on a commercial media site. In 2012, the BBC launched three new sites based on DSP: the 2012 Olympics site, a completely redesigned BBC Sport site and the new News Mobile site.

Transcript

  1. Future Media © BBC MMXII BBC Dynamic Semantic Publishing [DSP]

    XMLAmsterdam 2012 •  Jem Rayfield : Lead Technical Architect •  BBC Future Media
  2. Outline

    •  BBC News Online
    •  BBC World Cup 2010
    •  BBC Sport 2012 + Olympics
    •  BBC News Mobile
  3. Static News: The Good

    1) Simple
    2) Scales cheaply
    3) Difficult to break [bad rendering logic etc.]
    4) Handles high load
  4. Static News: The BAD

    1) Relational taxonomic meta model
    2) Static! Inflexible! SSI!
    3) Document publishing
    4) Content not re-usable
    5) Content not re-purposable
    6) Difficult to personalize
    7) Publication per output
  5. World Cup 2010

    1.  32 teams, 8 groups, 736 players → 776 pages
    2.  Fixtures & Results, Groups & Teams pages
    3.  Too many web pages for too few journalists
    4.  Improve the publishing system to help achieve all of this
  6. Journalism © BBC MMIX

    GET https://api.live.bbc.co.uk/dsp/sport/football/teams/chelsea
    Accept: text/rdf+n3

    <http://www.chelseafc.com/>
        domain:documentType <http://www.bbc.co.uk/things/document-types/homepage> ,
                            <http://www.bbc.co.uk/things/document-types/external> .

    <http://www.bbc.co.uk/sport/football/teams/chelsea>
        domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> ,
                            <http://www.bbc.co.uk/things/document-types/homepage> .

    <http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id>
        a sport:CompetitiveSportingOrganisation ;
        domain:canonicalName "Chelsea"^^<xsd:string> ;
        domain:document <http://www.chelseafc.com/> ,
                        <http://www.bbc.co.uk/sport/football/teams/chelsea> ;
        domain:externalId <http://dbpedia.org/resource/Chelsea_F.C.> ,
                          <urn:sports-stats:137316635> ;
        domain:name "Chelsea" ;
        domain:shortName "Chelsea"^^<xsd:string> ;
        sport:competesIn <http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id> .

    <http://dbpedia.org/resource/Chelsea_F.C.>
        domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/dbpedia> .

    <urn:sports-stats:137316635>
        domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/bbc-sport-stats> .

    <http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id>
        domain:canonicalName "Premier League"^^<xsd:string> ;
        domain:externalId <urn:sports-stats:118996114> ;
        sport:competitionType <http://www.bbc.co.uk/things/competition-types/domestic-league> .
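The RDF on this slide is a small graph: a team resource points at its documents and, via sport:competesIn, at a competition. As a rough illustration of how such a graph can be walked, here is a minimal Python sketch that holds a few of the slide's statements as plain tuples and answers "which competitions does Chelsea compete in?". The `match` and `competitions_for` helpers are hypothetical names introduced for this sketch; a real deployment would run SPARQL against the triple store instead.

```python
# A few statements copied from the slide; prefixed names are kept as plain strings.
CHELSEA = "http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id"
PREMIER_LEAGUE = "http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id"

TRIPLES = [
    (CHELSEA, "rdf:type", "sport:CompetitiveSportingOrganisation"),
    (CHELSEA, "domain:canonicalName", "Chelsea"),
    (CHELSEA, "domain:document", "http://www.chelseafc.com/"),
    (CHELSEA, "domain:document", "http://www.bbc.co.uk/sport/football/teams/chelsea"),
    (CHELSEA, "sport:competesIn", PREMIER_LEAGUE),
    (PREMIER_LEAGUE, "domain:canonicalName", "Premier League"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def competitions_for(triples, team_name):
    """Follow canonicalName -> competesIn -> canonicalName (a two-hop join)."""
    names = []
    for team, _, _ in match(triples, p="domain:canonicalName", o=team_name):
        for _, _, competition in match(triples, s=team, p="sport:competesIn"):
            names += [o for _, _, o in match(triples, s=competition, p="domain:canonicalName")]
    return names


print(competitions_for(TRIPLES, "Chelsea"))
```

The same pattern-matching idea, with wildcards for unknown positions, is what a SPARQL basic graph pattern expresses declaratively.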
  7. PHP->EasyRDF->API

    PHP render layer consumes RDF from the REST API via EasyRDF (http://www.aelius.com/njh/easyrdf/)
    EasyRDF: open source PHP library (primary committer: Nicholas Humfrey, BBC)

    // Request options: client certificate plus content negotiation headers
    protected function getOptions() {
        return array(
            "config" => array("usecert" => true),
            "headers" => array(
                "Accept" => "application/rdf+json",
                "X-Expect" => "http://www.bbc.co.uk/things/platforms/hiweb"
            )
        );
    }

    // Fetch the team graph and load it into EasyRDF
    $options = $this->getOptions();
    $response = $this->get("https://api.test.bbc.co.uk/dsp/sport/football/teams/chelsea", $options);
    $this->data = new EasyRdf_Graph("http://www.bbc.co.uk", $response->getBody());
    // Select every resource typed as a sporting organisation
    $teams = $this->data->allOfType("sport:CompetitiveSportingOrganisation");
  8. Rationale

    •  Automated content publishing
    •  Huge increase in content breadth (number of manageable pages)
    •  Content re-use and re-purposing, increasing reach
    •  Simplified content management
    •  Journalist headcount reduction
    •  Multi-dimensional entry points and semantic navigation
    •  Improved user experience with high levels of user engagement
    •  Dynamic, state (time|event) and semantic driven page layout
    •  Personalized content aggregations
    •  Open data and APIs
  9. World Cup statistics: the GOOD

    •  750+ dynamic aggregations/pages (Player, Squad, Group, etc.)
    •  Average unique page requests a day: 2 million+
    •  Average OWLIM SPARQL queries a day: 1 million
    •  100s of RDF statement updates/inserts per minute with full OWL reasoning and associated inference
    •  Multi data center, fully resilient, clustered 6 node triple store
    •  RDF graph model ideally suited to model domain representations such as sport
  10. World Cup statistics: the BAD

    •  Sports stories and indices static
    •  Sport content not responsive or personalized
    •  RDF store unable to handle thousands of statistic updates a second
    •  RDF store forward-chained closures are expensive and increase write latency
    •  RDF graph model and SPARQL not ideally suited to the BBC's News and Sport document publication model
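To make the write-latency point concrete, the following Python sketch shows the general idea of forward chaining: every write triggers materialisation of the statements it implies, so one explicit statement can turn into several stored ones. This is a toy fixpoint loop over a single rule (type inheritance through rdfs:subClassOf), not OWLIM's reasoner; the `ex:` class and instance names are invented for the example.

```python
def forward_chain(triples):
    """Materialise rdf:type statements implied by rdfs:subClassOf until nothing new appears."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in closed if p == "rdfs:subClassOf"]
        for s, p, o in list(closed):
            if p != "rdf:type":
                continue
            for sub, sup in subclass_pairs:
                inferred = (s, "rdf:type", sup)
                if o == sub and inferred not in closed:
                    closed.add(inferred)
                    changed = True
    return closed


# Hypothetical mini ontology plus one instance statement.
explicit = {
    ("ex:Striker", "rdfs:subClassOf", "ex:Player"),
    ("ex:Player", "rdfs:subClassOf", "ex:Person"),
    ("ex:player1", "rdf:type", "ex:Striker"),
}

closed = forward_chain(explicit)
inferred = closed - explicit
print(len(explicit), "explicit,", len(inferred), "inferred")
```

Queries against the closed set are cheap because the inferences are already stored; the cost moves to write time, which is the trade-off the slide complains about under high update rates.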
  11. Sport Refresh 2012

    •  Page per athlete [10,000+], page per country [200+], page per discipline [400-500], page per venue, page per team → A lot of output…
    •  Almost real time statistics and live event pages
    •  Time coded, metadata annotated, on demand video, 58,000 hours of content
    •  Far too many web pages for far too few journalists
    •  DSP annotation architecture to automate content aggregation
  12. Olympics

    •  27 live video streams
    •  Live stats overlays
    •  Stats -> Ontology driven aggregations
  13. Augment architecture with a Content Store

    1.  Atomic content assets stored in MarkLogic XML store
    2.  XML content queryable via XQuery
    3.  Content assets searchable
    4.  Sports statistics searchable/queryable via XQuery
    5.  Ontological SPARQL via BigOWLIM, assets XQuery via MarkLogic
  14. Olympic Stats delivery

    Diagram labels, in order:
    •  Broadcast Data Feed
    •  Delta Tre
    •  Olympic Feeds Receiver
    •  MQ (high & low priority)
    •  Olympic Feeds Service
    •  BigOWLIM or Content store
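The slide mentions a message queue with high and low priority between the feed receiver and the feed service. A minimal single-process Python sketch of that idea is shown below, assuming that urgent messages should always be delivered before routine ones and that order is preserved within each priority. The `FeedQueue` class and the example message texts are hypothetical; the deck does not say which message types were routed to which priority, nor which MQ product was used.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # the smaller number is served first


class FeedQueue:
    """Single-process stand-in for a pair of high and low priority queues."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def publish(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._order), message))

    def consume(self):
        """Yield messages, draining high priority before low priority."""
        while self._heap:
            _, _, message = heapq.heappop(self._heap)
            yield message


queue = FeedQueue()
queue.publish(LOW, "athlete biography update")    # assumed low priority
queue.publish(HIGH, "men's 100m final result")    # assumed high priority
queue.publish(LOW, "venue description update")
queue.publish(HIGH, "medal table change")

delivered = list(queue.consume())
print(delivered)
```

Separating priorities this way keeps bulk reference data from delaying results that live pages are waiting on.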
  15. Ontology Aware NLP

    •  Information Workbench
    •  OWLIM
    •  (Spice) GATE+Ontotext
  16. Ontology Aware NLP and Semantic Disambiguation

    Diagram labels: OWLIM •  Generic Analysis •  KB Gazetteer •  Disambiguation •  Relevance Ranking
    Example text: Ex-England boss Sven-Goran Eriksson says a "smear campaign" has been aimed at Roy Hodgson for omitting Rio Ferdinand.
    Candidates: ? Roy Hodgson: coach ? Roy Hodgson: hockey player ? …
    Disambiguated: V Roy Hodgson: coach - Roy Hodgson: hockey player - … V Rio Ferdinand - … V Sven-Goran Eriksson - …
    CES APP: 1.  Eriksson (78%) 2.  Roy Hodgson (69%) 3.  Rio Ferdinand (58%) 4.  …
    Feedback loop: Curate •  Update •  Retrain & Adapt
  17. Entity Relevance: Objective

    •  Rank entities by their relatedness to the article
    •  Accuracy 75%
    •  We consider various frequencies of entity mentions in the article and in the entire set of articles
    •  Positions in the article fields or in the first paragraphs of the body boost the relevance
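The two signals named on this slide (mention frequencies in the article versus the whole article set, and a boost for early positions) can be combined in many ways. The Python sketch below uses a TF-IDF style product with a fixed multiplier for entities that appear in the title or the first paragraph. The formula, the `lead_boost` value and the sample corpus are assumptions made for illustration; the deck does not describe the actual scoring function.

```python
import math


def relevance(entity, article, corpus, lead_boost=1.5, lead_paragraphs=1):
    """Frequency in the article times inverse document frequency, boosted for early mentions."""
    paragraphs = article["paragraphs"]
    tf = sum(p.count(entity) for p in paragraphs)
    if tf == 0:
        return 0.0
    docs_with_entity = sum(
        1 for doc in corpus if any(entity in p for p in doc["paragraphs"])
    )
    # Smoothed inverse document frequency: entities common across the corpus score lower.
    idf = math.log((1 + len(corpus)) / (1 + docs_with_entity)) + 1.0
    in_lead = entity in article.get("title", "") or any(
        entity in p for p in paragraphs[:lead_paragraphs]
    )
    return tf * idf * (lead_boost if in_lead else 1.0)


# Hypothetical two-article corpus, loosely based on the example sentence in the deck.
article = {
    "title": "Hodgson faces 'smear campaign'",
    "paragraphs": [
        "Sven-Goran Eriksson says a smear campaign has been aimed at Roy Hodgson.",
        "Roy Hodgson omitted Rio Ferdinand from the squad.",
    ],
}
other = {"title": "Transfer news", "paragraphs": ["Rio Ferdinand is linked with a move."]}
corpus = [article, other]

entities = ["Roy Hodgson", "Rio Ferdinand", "Sven-Goran Eriksson"]
ranked = sorted(entities, key=lambda e: relevance(e, article, corpus), reverse=True)
print(ranked)
```

With this toy data the ranking differs from the percentages shown elsewhere in the deck, which is expected: the real system also uses ontology context and curated training data.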
  18. Confidence and Relevance

    The relevance of an entity in an arbitrary document may depend on:
    •  Text context and the vicinity of an entity/concept within the text. (Confidence)
    •  Ontological graph context and the vicinity of an entity/concept within the graph's knowledge model
    •  The frequencies of entities in the corpus and document. (Relevance)
  19. Disambiguation of Locations

    •  Geospatial distance - a feature of OWLIM
    •  Super region - GeoNames hierarchy and containment relations, e.g. parentFeature
    •  RDF Rank
    •  Human approval score (on the basis of curated documents)
    •  Class/code based priority - fine grained ontology may allow a rule or machine learning prioritization of classes and entities based on learning we already have.
    •  Asset geo association - some entities could be disambiguated by using the asset domain association. BBC UK local sports is more likely to talk about national entities.
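As an illustration of the geospatial distance feature only, the sketch below picks, among same-named place candidates, the one closest on average to locations already resolved in the same article. It uses the haversine great-circle formula in plain Python rather than OWLIM's geospatial support, and the candidate list and coordinates are approximate, hand-entered assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def disambiguate(candidates, context_points):
    """Pick the candidate with the smallest mean distance to already-resolved locations."""
    def mean_distance(candidate):
        total = sum(
            haversine_km(candidate["lat"], candidate["lon"], lat, lon)
            for lat, lon in context_points
        )
        return total / len(context_points)
    return min(candidates, key=mean_distance)


# Two places called "Newcastle" (approximate coordinates).
candidates = [
    {"name": "Newcastle upon Tyne, UK", "lat": 54.978, "lon": -1.618},
    {"name": "Newcastle, NSW, Australia", "lat": -32.928, "lon": 151.782},
]
context = [(54.906, -1.383)]  # Sunderland, UK, assumed already resolved in the same article

best = disambiguate(candidates, context)
print(best["name"])
```

In practice this would be one feature among the others listed on the slide (hierarchy, RDF Rank, curated approval scores), not a decision rule on its own.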
  20. Sport Stats REST API

    SSL Accessible API

    GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/table/ais/competition/118996114
    GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/table/ais/competition/118996114
    Accept: application/json
    GET https://api.live.bbc.co.uk/sportsdata/statsapi/football/videprinter
    GET https://api.int.bbc.co.uk/sportsdata/statsapi/formula1/year/2012/calendar
    Accept: application/json
    etc…
  21. Olympics API (Semantic RDF)

    /tripod2012
    /athletes /{uid} /countries /{country} /countries-iso /{iso} /sports /stories /{discipline} /{discipline}/events /{discipline}/events/stories /{discipline}/events/{event} /metadata /disciplines /onestowatch /countries/{countryUrlName} /london2012 /sports/{disciplineUrlName} /sports/{disciplineUrlName}/events/{eventUrlName} /podium/events /{rscCode} /record/events /venues /{urlName}
  22. Olympics API (RDF)

    @prefix tag: <http://www.bbc.co.uk/ontologies/tag/> .
    @prefix domain: <http://www.bbc.co.uk/ontologies/domain/> .
    @prefix sesame: <http://www.openrdf.org/schema/sesame#> .
    @prefix owlim: <http://www.ontotext.com/> .
    @prefix oly: <http://www.bbc.co.uk/ontologies/2012olympics/> .
    @prefix par: <http://purl.org/vocab/participation/schema#> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .

    <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id>
        a sport:Person ;
        rdfs:label "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
        foaf:name "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
        domain:canonicalName "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
        foaf:givenName "Usain"^^xsd:string ;
        foaf:familyName "Bolt"^^xsd:string ;
        domain:name "Usain Bolt"^^xsd:string ;
        oly:dateOfBirth "1986-08-21"^^xsd:date ;
        oly:gender "M"^^xsd:string ;
        oly:height "195.0"^^xsd:float ;
        oly:weight "94.0"^^xsd:float ;
        oly:worldOlympicDream "true"^^xsd:boolean ;
        sport:discipline <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> ;
        sport:competesIn <http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id> .

    <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id>
        a sport:SportsDiscipline ;
        domain:name "Athletics"^^xsd:string ;
        domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics> .

    <http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id>
        a sport:MedalCompetition ;
        domain:name "Men's 100m"^^xsd:string ;
        domain:shortName "Men's 100m"^^xsd:string ;
        domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics/events/mens-100m> ;
        oly:measurementType <http://www.bbc.co.uk/things/measurement-types/time> ;
        domain:externalId <urn:ioc2012:ATM001000> .

    <http://www.bbc.co.uk/things/903ef380-bdae-4a45-9a8b-5e5a270a7d6c#id>
        oly:oneToWatch <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> .

    <http://news.bbc.co.uk/sport1/hi/athletics/16554814.stm#asset>
        tag:tag <http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing> ;
        dc:title "Event Guide: ATHLETICS"^^xsd:string ;
        asset:storyType <http://www.bbc.co.uk/things/story-types/profile> ;
        domain:document <http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm> .

    <http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing>
        tag:taggedWithTag <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> .

    <http://news.bbc.co.uk/sport1/mobile/athletics/16554814.stm>
        domain:platform <http://www.bbc.co.uk/things/platforms/mobile> .
  23. Olympics API (Stats XML)

    /olympicdata/
    /athletes /simulator/{scenarioName} /{guid} /bdf-log /pid/{pid} /pid/{pid}/chapter-points /{logId} /chapter-points /pid/{pid} /{logId} /days-to-go /live-text /assets /assets/{id} /medallists /athletes/{guid} /{medalGroup} /medals /athletes/{athleteGuid} /countries/{country} /medaltable /countries/{country} /disciplines/medals /disciplines/{rsc} /overall /sportcontent /obsvideosessions /full /update /podium /countries/{country} /disciplines/{rsccode} /events/{rsccode} /latest
  24. Olympics API (Stats XML)

    /olympicdata/
    /pulse /beat /beats /records /athletes/{guid} /events/{rsc} /results /athletes/{guid} /schedule /detail/days/{date} /detail/disciplines-code/{rsccode} /detail/disciplines-code/{rsccode}/events/{eventrsccode} /detail/disciplines-code/{rsc}/days/{date} /detail/disciplines/{rsc}/days/{date} /detail/disciplines/{urlname} /detail/disciplines/{urlname}/days/{date} /detail/disciplines/{urlname}/events/{eventrsccode} /overview/days /overview/disciplines /overview/disciplines/{rsc} /stats /count/{directory:.+} /sessionCode/{sessionCode} /simulator/{scenarioName} /{documentSerial} /team /{odfid} /unit-status /{sessionCode} /video /days/{date} /videosessionid/{videoSessionId} /{pid}
  25. Olympics API (XML – Video catch-up example)

    https://api.stage.live.co.uk/olympicdata/public/videos/catchup

    <?xml version="1.0" encoding="UTF-8"?>
    <catchup poll-interval-in-seconds="60">
      <slot index="1" type="session">
        <pid>b017t8b4</pid>
        <videoSessionId>OBS-1284552321</videoSessionId>
        <sessionCode>FB001</sessionCode>
        <discipline RSC="FB0000000" name="Football" url="http://www.bbc.co.uk/sport/olympics/2012/sports/football" urlName="football" icon=" http://static.live.bbci.co.uk/ivp2012/images/icons/sports/football.png"/>
        <scheduleStart>2001-12-17T09:30:47Z</scheduleStart>
        <scheduleEnd>2001-12-17T09:30:47Z</scheduleEnd>
        <actualStart>2001-12-17T09:30:47Z</actualStart>
        <actualEnd>2001-12-17T09:30:47Z</actualEnd>
        <available>true</available>
        <editorsPick>true</editorsPick>
        <live>false</live>
        <title>28/11/2011</title>
        <shortSynopsis>A desperate Roxy makes the unwise decision to dig for dirt on Derek.</shortSynopsis>
        <myImageBaseUrl>http://node2.bbcimg.co.uk/iplayer/images/episode/</myImageBaseUrl>
      </slot>
      <slot index="2" type="session">
        <pid>hbtest1</pid>
        <videoSessionId>OBS-64835275893</videoSessionId>
        <sessionCode>HB001</sessionCode>
        <discipline RSC="HB0000001"/>
        <scheduleStart>2001-12-17T09:30:47Z</scheduleStart>
        <scheduleEnd>2001-12-17T09:30:47Z</scheduleEnd>
        <actualStart>2001-12-17T09:30:47Z</actualStart>
        <actualEnd>2001-12-17T09:30:47Z</actualEnd>
        <available>true</available>
        <editorsPick>true</editorsPick>
        <live>false</live>
      </slot>
    </catchup>
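A client of the catch-up feed would typically read the polling interval and then pick out a few fields per slot. The Python sketch below does that with the standard library, using a trimmed copy of the example document from this slide. The `parse_catchup` function and the shape of the returned dictionaries are choices made for this sketch, not part of the BBC API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's example response (only the fields used below).
CATCHUP_XML = """<catchup poll-interval-in-seconds="60">
  <slot index="1" type="session">
    <pid>b017t8b4</pid>
    <sessionCode>FB001</sessionCode>
    <discipline RSC="FB0000000" name="Football" urlName="football"/>
    <available>true</available>
    <live>false</live>
  </slot>
  <slot index="2" type="session">
    <pid>hbtest1</pid>
    <sessionCode>HB001</sessionCode>
    <discipline RSC="HB0000001"/>
    <available>true</available>
    <live>false</live>
  </slot>
</catchup>"""


def parse_catchup(xml_text):
    """Return (poll interval in seconds, list of slot summaries)."""
    root = ET.fromstring(xml_text)
    poll_interval = int(root.get("poll-interval-in-seconds"))
    slots = []
    for slot in root.findall("slot"):
        discipline = slot.find("discipline")
        slots.append({
            "index": int(slot.get("index")),
            "pid": slot.findtext("pid"),
            "session": slot.findtext("sessionCode"),
            # The name attribute is optional in the example, so this may be None.
            "discipline": discipline.get("name") if discipline is not None else None,
            "available": slot.findtext("available") == "true",
            "live": slot.findtext("live") == "true",
        })
    return poll_interval, slots


poll_interval, slots = parse_catchup(CATCHUP_XML)
print(poll_interval, [s["pid"] for s in slots])
```

The second slot in the example carries only an RSC code on its discipline element, so a consumer has to tolerate missing attributes, as the sketch does.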
  26. Olympics API (Video log/chapter points)

    GET https://api.test.bbc.co.uk/olympicdata/bdf-log/pid/p00g2lqp

    <document>
      <logEvent>
        <header>
          <documentGroup>general</documentGroup>
          <documentType>videoLogging</documentType>
          <rsc code="FE0000000"/>
          <documentSerial>fdf38714-45d3-4d89-5100-8883be340700</documentSerial>
          <timeStamp>2012-07-10T10:42:38+01:00</timeStamp>
          <video>
            <pid>p00g2lqp</pid>
            <timeCode>2012-07-10T08:56:38Z</timeCode>
          </video>
          <taggingtool>Version 0.1.10</taggingtool>
        </header>
        <Log>
          <LogId>BBC-TT-1341913358.6783</LogId>
          <Action>UPDATE</Action>
          <Date>2012-07-10</Date>
          <TimeCode>09:56:38</TimeCode>
          <RSC>FE0000000</RSC>
          <Keywords>
            <Keyword>BBC free text</Keyword>
          </Keywords>
          <bbcLabel>Test One</bbcLabel>
        </Log>
      </logEvent>
      <logEvent>
        <header>
          <documentGroup>general</documentGroup>
          <documentType>videoLogging</documentType>
          <rsc code=""/>
          <documentSerial>20dc4019-d3be-4bb1-6a75-256ba7622b5d</documentSerial>
          <timeStamp>2012-07-10T10:42:51+01:00</timeStamp>
          <video>
            <pid>p00g2lqp</pid>
            <timeCode>2012-07-10T09:16:38Z</timeCode>
          </video>
          <taggingtool>Version 0.1.10</taggingtool>
        </header>
        <Log>
  27. Olympic numbers

    •  Unique browser requests per day peaked at just over 8 million in the UK and 11 million globally.
    •  Cumulative unique browsers online total: 34.6 million
  28. Olympic numbers

    •  On the busiest day, the BBC delivered 2.8 petabytes, with the peak traffic moment occurring when Bradley Wiggins won Gold and we shifted 700 Gb/s.
    •  106 million requests for BBC Olympic video content across all online platforms
    •  [Chart: number of people watching individual streams]
  29. Dynamic News mobile

    •  Multi device capability
    •  Responsive Web design
    •  Built on a dynamic service API
    •  New re-usable content model
    •  Dynamic assets
  30. MarkLogic's handy XInclude resolution

    Including story data on news index XML

    <item>
      <xi:include href="http://www.bbc.co.uk/asset/13447877"
                  xpointer="xmlns(bbc=http://www.bbc.co.uk/content/asset) xpointer(/bbc:story/bbc:itemMeta)">
        <xi:fallback>
          <!-- Unable to find href="http://www.bbc.co.uk/asset/13447877" xpointer="xmlns(bbc=http://www.bbc.co.uk/content/asset) xpointer(/bbc:story/bbc:itemMeta)" -->
        </xi:fallback>
      </xi:include>
    ...
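To show what resolving such an include amounts to, here is a small hand-rolled Python sketch: each xi:include is replaced by the itemMeta element of the referenced story, and the children of xi:fallback are used when the asset cannot be found. This is not MarkLogic's implementation and it does not evaluate XPointer expressions; the /bbc:story/bbc:itemMeta selection from the slide is hard-coded. The in-memory `ASSETS` store, the `index` root element, the headline element and the `unavailable` marker are all invented for the example.

```python
import copy
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"
BBC = "{http://www.bbc.co.uk/content/asset}"

# Hypothetical asset store: href -> stored story document.
ASSETS = {
    "http://www.bbc.co.uk/asset/13447877": ET.fromstring(
        '<bbc:story xmlns:bbc="http://www.bbc.co.uk/content/asset">'
        "<bbc:itemMeta><bbc:headline>Example headline</bbc:headline></bbc:itemMeta>"
        "<bbc:body>Example body</bbc:body>"
        "</bbc:story>"
    )
}


def resolve_includes(parent, assets):
    """Replace xi:include children with the story's itemMeta, or with the fallback content."""
    new_children = []
    for child in list(parent):
        if child.tag != XI + "include":
            resolve_includes(child, assets)  # recurse into ordinary elements
            new_children.append(child)
            continue
        story = assets.get(child.get("href"))
        target = story.find(BBC + "itemMeta") if story is not None else None
        if target is not None:
            included = copy.deepcopy(target)  # leave the stored asset untouched
            included.tail = child.tail
            new_children.append(included)
        else:
            fallback = child.find(XI + "fallback")
            if fallback is not None:
                new_children.extend(list(fallback))
    parent[:] = new_children
    return parent


index = ET.fromstring(
    '<index xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<item><xi:include href="http://www.bbc.co.uk/asset/13447877"/></item>'
    '<item><xi:include href="http://www.bbc.co.uk/asset/missing">'
    "<xi:fallback><unavailable/></xi:fallback></xi:include></item>"
    "</index>"
)
resolve_includes(index, ASSETS)
items = index.findall("item")
print([child.tag for item in items for child in item])
```

The appeal of doing this inside the content store is that an index document only holds references, so an edited story shows up in every index that includes it without republishing the index.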
  31. News Index API

    Including story data on news index XML

    HTTP GET https://api.live.bbc.co.uk/content/asset/news/technology/

    HTTP Headers
      X-Candy-Audience: Domestic
      X-Candy-Platform: EnhancedMobile
      Accept: application/json
    Or HTTP Headers
      X-Candy-Audience: Domestic
      X-Candy-Platform: EnhancedMobile
      Accept: application/xml

    Contextualised output
    •  Audience
    •  Platform
    •  Response type
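The three context dimensions on this slide (audience, platform, response type) map directly onto request headers. The Python sketch below builds, but deliberately does not send, such requests with the standard library, since the API shown in the deck is not assumed to be publicly reachable. The `build_request` helper and its defaults are hypothetical; the URL and header names are taken from the slide.

```python
import urllib.request

BASE = "https://api.live.bbc.co.uk/content/asset"


def build_request(path, audience="Domestic", platform="EnhancedMobile",
                  response_type="application/json"):
    """Build (without sending) a GET request carrying the three context values."""
    return urllib.request.Request(
        f"{BASE}/{path.lstrip('/')}",
        headers={
            "X-Candy-Audience": audience,
            "X-Candy-Platform": platform,
            "Accept": response_type,
        },
        method="GET",
    )


index_json = build_request("news/technology/")
index_xml = build_request("news/technology/", response_type="application/xml")

# urllib stores header names with only the first letter capitalised;
# HTTP header names are case-insensitive, so this does not change the request.
print(index_json.full_url)
print(index_json.get_header("Accept"), index_xml.get_header("Accept"))
```

Keeping the context in headers rather than in the path means the same resource URL serves every audience, platform and format combination.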
  32. News Story API

    Including story data on news index XML

    HTTP GET https://api.live.bbc.co.uk/content/asset/news/uk-17829360

    HTTP Headers
      X-Candy-Audience: Domestic
      X-Candy-Platform: EnhancedMobile
      Accept: application/json
    Or HTTP Headers
      X-Candy-Audience: Domestic
      X-Candy-Platform: EnhancedMobile
      Accept: application/xml
  33. Platform future…

    •  BBC Sport site re-engineered to use fully dynamic approach (News Mobile style)
    •  BBC News high web site re-engineered to use fully dynamic approach (News Mobile style)
    •  MarkLogic as CMS repository (iSite)
    •  MarkLogic binary storage
    •  R&D
    •  Etc.