Cesar_de_Pablo_-_SocialTV_-_BigData_Spain_2013.pdf

Big Data Spain
November 14, 2013

Transcript

  1. Textalytics: Meaning-as-a-Service. Real Time Semantic Search for Social TV streams
      César de Pablo Sánchez, Daedalus. Nov. 7th 2013, Big Data Spain (Madrid)
  2. None
  3. None
  4. The plot
      1. What's Social TV?
      2. Monitoring Social TV conversations. A preliminary architecture
      3. Understanding the buzz. Textalytics
      4. Organizing the mess. SenseiDB
      5. Lessons learned
  5. Social TV, Second Screen, Transmedia

  6. Not just TV: Sports, Elections, Alerts

  7. Big Data? Volume, Velocity, Variety

  8. Users? Viewers, Channels, Brands

  9. Viewers? Participate, Vote, Influence, Confirm beliefs, Keep updated, Belong to group

  10. Viewers? Participate, Influence, Confirm beliefs, Keep updated

  11. Channels? Understand, React, Measure

  12. Channels? Understand, React

  13. Brands? Select programs, Reputation, Find public

  14. Brands? Reputation, Find public

  15. Monitoring Social TV conversations. The architecture

  16. tracker gateway HTTP Stream pipeline Pull EPG

  17. Understanding the buzz Textalytics API

  18. API
      - NLP and Semantics API
      - Multilingual: EN, ES (FR, IT, PT, CA)
      - REST service: JSON and XML
      - Combines the best of all worlds
        - Deep language analysis
        - Comprehensive resources: linguistics and DBs
        - Ontology
        - Rule-based methods
        - Statistics and machine learning methods
  19. API layers shown on the slide
      - High-level semantic APIs, close to business scenarios: Media Analysis API, Semantic Publishing API, … API
      - Core API, building blocks: Topics, Sentiment, Classification, Linked Data, POS
      - Configuration and Linguistic Resources
  20. Media Analysis API, Semantic Publishing API, Core API

  21. Core API
      - Topics Extraction
      - Text Classification
      - Sentiment Analysis
      - Language Identification
      - Lemmatization, POS and Parsing
      - Speech Recognition and Speaker Diarization
      - Semantic Linked Data Viewer
      - Spell, Grammar and Style
      - User Demographics
  22. Language identification
      - Given a text, identify a list of languages, or just one
      - 62 languages
      - Uses language n-gram signatures
      - Social TV
        - Filter: TV hashtags often imply a language
        - Sometimes hashtags are multilingual, but not relevant for users
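
The n-gram signature idea on this slide can be illustrated with a minimal sketch. It is not the Textalytics implementation: the two tiny training sentences, the character-trigram choice, the cosine scoring and the function names are all assumptions made for illustration, and a real system would learn signatures from far larger corpora.

```python
from collections import Counter
import math

# Hypothetical toy training data; real signatures need much more text.
TRAINING = {
    "en": "the match tonight is great and the show was very good with the best players",
    "es": "el partido de esta noche es muy bueno y el programa fue genial con los mejores jugadores",
}

def ngram_signature(text, n=3):
    """Count character n-grams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

SIGNATURES = {lang: ngram_signature(text) for lang, text in TRAINING.items()}

def identify_languages(text, signatures=SIGNATURES, top=1):
    """Return the `top` best-matching language codes, best first."""
    signature = ngram_signature(text)
    ranked = sorted(signatures,
                    key=lambda lang: cosine(signature, signatures[lang]),
                    reverse=True)
    return ranked[:top]
```

Calling `identify_languages(tweet_text, top=2)` returns a ranked list, which matches the slide's "a list of languages, or just one" behaviour; a hashtag-based filter could then simply skip this step for programmes whose language is already known.
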
  23. None
  24. Text Classification
      - Theme labels: IPTC
      - Relevance
      - Multiple labels
      - Tailored for short text (tweets)
      - Define your own models and categories
      - Social TV: filter on topic content
  25. Sentiment analysis
      - Document-level classification
        - Positive/Negative/Neutral
        - Subjective/Objective
      - Tailored for short texts
      - Handles Twitter jargon: RT, @, hashtags, emoticons, spelling errors, disfluencies
      - Other features
        - Entity-level sentiment
        - Segment-level sentiment
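
As a rough illustration of what "handles Twitter jargon" can involve, the sketch below normalizes a tweet before it would reach a sentiment classifier. The placeholder tokens (`<user>`, `<url>`, `<emo_pos>`, `<emo_neg>`), the emoticon table and the function name are assumptions for this example, not part of the Textalytics API.

```python
import re

# Hypothetical, deliberately tiny emoticon lexicon.
EMOTICONS = {
    ":-)": " <emo_pos> ", ":)": " <emo_pos> ", ":D": " <emo_pos> ",
    ":-(": " <emo_neg> ", ":(": " <emo_neg> ",
}

def normalize_tweet(text):
    """Turn a raw tweet into a list of lowercased, normalized tokens."""
    text = re.sub(r"^RT\s+", "", text)              # drop the retweet marker
    text = re.sub(r"https?://\S+", " <url> ", text)  # mask links
    text = re.sub(r"@\w+:?", " <user> ", text)       # mask user mentions
    text = re.sub(r"#(\w+)", r" \1 ", text)          # keep the hashtag word
    for emoticon, token in EMOTICONS.items():        # map emoticons to polarity tokens
        text = text.replace(emoticon, token)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # squeeze elongations: goooool -> gool
    return text.lower().split()
```

Squeezing repeated characters to two, rather than one, keeps a trace of the emphasis, which a classifier may use as a weak intensity signal.
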
  26. Topics Extraction
      - People: Ben Bernanke, Mariano Rajoy…
      - Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
      - Economic entities: Ibex35, Dax Xetra…
      - Locations: London, USA, Paris…
      - Concepts: risk premium, Prime Minister, parliamentary intervention, stock index, economic situation…
      - Time references: today, yesterday, around 11 in the morning…
      - Monetary quantities: 104 dollars, 1 euro…
      - 12 main types
      - Ontology with > 200 types
        - Instances: BBVA
        - Classes: bank
        - fictional/historic
      - SocialTV: populate custom
  27. Entity Linking
      - Linking entities to their 'real' representation
      - Linking to several LOD sources
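
One simple way to picture entity linking is an alias table that maps every surface form (name variants, user handles, hashtags) to one canonical entity carrying its ontology type and links into Linked Open Data sources. The sketch below is an assumption-laden illustration: the entity ids, the `example.org` URIs, the handles and the greedy longest-match strategy are all hypothetical and are not taken from Textalytics.

```python
import re

# Hypothetical knowledge base; URIs point at example.org, not at real LOD sources.
KNOWLEDGE_BASE = {
    "ent:bbva": {
        "type": "Organization>Company>Bank",
        "uris": ["http://example.org/lod-a/BBVA", "http://example.org/lod-b/bbva"],
        "aliases": ["BBVA", "Banco Bilbao Vizcaya Argentaria", "@bbva", "#bbva"],
    },
    "ent:chicote": {
        "type": "Person",
        "uris": ["http://example.org/lod-a/Alberto_Chicote"],
        "aliases": ["Chicote", "Alberto Chicote", "@albertochicote", "#chicote"],
    },
}

def tokenize(text):
    """Lowercase tokens; a leading @ or # stays attached to its word."""
    return re.findall(r"[@#]?\w+", text.lower())

def build_alias_index(knowledge_base):
    """Map each alias (as a token tuple) to its canonical entity id."""
    return {tuple(tokenize(alias)): entity_id
            for entity_id, entry in knowledge_base.items()
            for alias in entry["aliases"]}

def link_entities(text, index):
    """Greedy longest-match lookup of aliases; returns entity ids in text order."""
    tokens = tokenize(text)
    longest = max(len(key) for key in index)
    found, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            entity_id = index.get(tuple(tokens[i:i + size]))
            if entity_id:
                found.append(entity_id)
                i += size
                break
        else:
            i += 1  # no alias starts at this token
    return found

ALIAS_INDEX = build_alias_index(KNOWLEDGE_BASE)
```

Because the name, the handle and the hashtag all resolve to the same id, mentions can later be counted per entity rather than per string.
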
  28. Organizing the mess. SenseiDB

  29. SenseiDB
      - Open source, distributed, real-time, semi-structured database
      - From LinkedIn SNA: powers the LinkedIn home page and LinkedIn Signals
      - Integrates other open source technologies:
        - Zoie: search engine
        - Bobo: faceted search
        - Apache Kafka: pub-sub system
      - http://www.senseidb.com/
  30. SenseiDB features
      - 'Hybrid' Information Retrieval and Database
      - Full text search
      - Structured and faceted search
      - Fast real-time updates with low latency and high throughput (pull model)
      - Single table/collection
      - BQL: a SQL-like language
      - Eventual consistency
      - Distributed: sharding and partitioning
      - Hadoop integration
  31. Faceted search
      - Amazon.com?
      - Identify relevant attributes to use as filters
      - Predefined facets
        - Define a table schema
        - Define fields as facets: facet schema
  32. Faceted search in depth
      - Field types
        - Basic: string, int, short, long, float, double, char
        - Complex: date and text (analyzed, term vectors)
      - Facet types
        - Simple: 1 row, 1 value
        - Hierarchical: path c>b>a
        - Range: define ranges
        - Multi: 1 row, n values
        - Histogram: define bins and their size
        - TimeRange: for real-time data
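
The facet types listed on this slide differ mainly in how one row contributes to the counts. The following sketch shows that difference with plain Python counters; it does not use SenseiDB or Bobo, and the field names and sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical rows of a single flat "tweets" table.
TWEETS = [
    {"lang": "es", "hashtags": ["lavoz", "tv"], "retweets": 3, "entity_type": "Person>Chef"},
    {"lang": "es", "hashtags": ["lavoz"], "retweets": 25, "entity_type": "Person>Singer"},
    {"lang": "en", "hashtags": [], "retweets": 120, "entity_type": "Organization>Company>Bank"},
]

def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)

def multi_facet(rows, field):
    """Multi facet: each row contributes every value in its list."""
    return Counter(value for row in rows for value in row[field])

def range_facet(rows, field, ranges):
    """Range facet: each row is counted in the half-open range [low, high) it falls in."""
    counts = Counter()
    for row in rows:
        for low, high in ranges:
            if low <= row[field] < high:
                counts[(low, high)] += 1
    return counts

def path_facet(rows, field, separator=">"):
    """Hierarchical facet: a row counts for its path and for every ancestor prefix."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(separator)
        for depth in range(1, len(parts) + 1):
            counts[separator.join(parts[:depth])] += 1
    return counts
```

A histogram facet can be seen as a range facet whose ranges are generated from a bin size, and a time-range facet as one whose ranges are expressed relative to the current time.
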
  33. Using facets for semantic search
      - Define a facet for:
        - entities/concepts → tweets about Chicote, including all variants + user + hashtags
        - each entity type → navigate by type, e.g. popular people
        - classification/sentiment/emotions → positive tweets about Chicote
        - users or hashtags → popular users / popular mentions / correlated hashtags
        - URLs
        - Time range facets
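
To make the "positive tweets about Chicote" example concrete, the sketch below stores the semantic annotations as facet fields on each flat document, selects on two facets at once, and then counts hashtags over the selection. The document layout, field names, sample data and helper functions are assumptions for illustration only; in the talk this role is played by SenseiDB facets.

```python
from collections import Counter

# Hypothetical annotated tweets, one flat document per tweet.
DOCS = [
    {"user": "@ana", "entities": ["chicote"], "sentiment": "positive",
     "hashtags": ["pesadillaenlacocina"]},
    {"user": "@luis", "entities": ["chicote"], "sentiment": "positive",
     "hashtags": ["pesadillaenlacocina", "tv"]},
    {"user": "@eva", "entities": ["chicote"], "sentiment": "negative",
     "hashtags": ["tv"]},
    {"user": "@ana", "entities": ["bbva"], "sentiment": "positive",
     "hashtags": ["economia"]},
]

def select(docs, **selections):
    """Keep documents whose facet fields match every requested value."""
    def matches(doc, field, wanted):
        value = doc.get(field)
        if isinstance(value, list):   # multi-valued facet: membership test
            return wanted in value
        return value == wanted        # single-valued facet: equality test
    return [doc for doc in docs
            if all(matches(doc, field, wanted) for field, wanted in selections.items())]

def correlated(docs, field, limit=5):
    """Most frequent values of a multi-valued facet within a selection."""
    counts = Counter(value for doc in docs for value in doc.get(field, []))
    return counts.most_common(limit)
```

For example, `correlated(select(DOCS, entities="chicote", sentiment="positive"), "hashtags")` gives the hashtags that co-occur with positive mentions of the entity, which is the "correlated hashtags" navigation named on the slide.
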
  34. BQL: search, filter and facets
      - Examples
      - Facets support basic analytics tasks defined as facets
      - Relevance may be defined in the query (text queries)
  35. Architecture

  36. Real-time indexing
      - Data events: add and delete
      - Data streams: a succession of data events
      - Gateways
        - Read data events from data streams
        - File
        - JDBC
        - JMS
        - Kafka
        - Custom: Twitter
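
The event, stream and gateway vocabulary of this slide can be sketched in a few lines. This is an illustrative model only: the `DataEvent` class, the JSON-lines record layout (`op`, `id`, `text`) and the in-memory index are assumptions, not SenseiDB's gateway interface.

```python
import json
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

@dataclass
class DataEvent:
    op: str                      # "add" or "delete"
    doc_id: str
    doc: Optional[dict] = None   # payload, only present for "add"

def json_lines_gateway(lines: Iterable[str]) -> Iterator[DataEvent]:
    """Hypothetical file-style gateway: turns a stream of JSON lines into data events."""
    for line in lines:
        record = json.loads(line)
        doc = {"text": record["text"]} if record["op"] == "add" else None
        yield DataEvent(op=record["op"], doc_id=record["id"], doc=doc)

class InMemoryIndex:
    """Tiny inverted index that applies each event immediately (no batch commit)."""
    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def apply(self, event: DataEvent) -> None:
        self._remove(event.doc_id)   # an add for a known id acts as an update
        if event.op == "add":
            self.docs[event.doc_id] = event.doc
            for term in event.doc["text"].lower().split():
                self.postings[term].add(event.doc_id)

    def _remove(self, doc_id: str) -> None:
        old = self.docs.pop(doc_id, None)
        if old is not None:
            for term in old["text"].lower().split():
                self.postings[term].discard(doc_id)

    def search(self, term: str) -> list:
        return sorted(self.postings.get(term.lower(), ()))
```

A custom Twitter gateway would follow the same shape: it would read from the streaming connection instead of a file and emit one add event per tweet.
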
  37. Scalability
      - ZooKeeper to manage replicas
      - Low indexing latency (no batch commit)
      - Low search latency, low volatility
      - Horizontally scalable: shards
      - Shards may be replicated N times
      - Elastic: nodes can be added to accommodate growth
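
The shard and replica layout described here can be pictured with a small scatter-gather sketch. It is a simplification under stated assumptions: routing uses a CRC32 hash of a hypothetical `id` field, replicas are plain Python lists, and cluster coordination (the role the slide assigns to ZooKeeper) is not modelled at all.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Stable hash routing: the same id always lands on the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

class Cluster:
    def __init__(self, num_shards, replicas):
        # shards[s][r] is the document list held by replica r of shard s
        self.shards = [[[] for _ in range(replicas)] for _ in range(num_shards)]
        self._round = 0

    def index(self, doc):
        """Write the document to every replica of its shard."""
        for replica in self.shards[shard_for(doc["id"], len(self.shards))]:
            replica.append(doc)

    def search(self, term, limit=10):
        """Scatter to one replica per shard (round robin), then merge by recency."""
        hits = []
        for shard in self.shards:
            replica = shard[self._round % len(shard)]
            hits.extend(doc for doc in replica
                        if term in doc["text"].lower().split())
        self._round += 1
        return sorted(hits, key=lambda doc: doc["time"], reverse=True)[:limit]
```

Adding shards spreads the indexing and search work, while adding replicas spreads the query load, since each query touches only one replica per shard.
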
  38. Other features
      - Batch indexing via Hadoop: ETL
      - Simple analytics by batch indexing
      - Customized relevance models
      - MapReduce functions over facets
        - Sum, avg, min, max
        - DistinctCount
      - Activity values: volatile values such as likes
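
The aggregations named on this slide (sum, avg, min, max, distinct count) can be computed per facet value in a map step on each shard and merged in a reduce step. The sketch below shows one way to keep the partial results mergeable; the function names, field names and sample rows are hypothetical, and it does not use SenseiDB's own MapReduce interface.

```python
import math

def _empty():
    return {"count": 0, "sum": 0, "min": math.inf, "max": -math.inf, "distinct": set()}

def map_shard(rows, facet, metric, distinct_field):
    """Map step: partial aggregates per facet value for one shard."""
    partial = {}
    for row in rows:
        agg = partial.setdefault(row[facet], _empty())
        agg["count"] += 1
        agg["sum"] += row[metric]
        agg["min"] = min(agg["min"], row[metric])
        agg["max"] = max(agg["max"], row[metric])
        agg["distinct"].add(row[distinct_field])
    return partial

def reduce_partials(partials):
    """Reduce step: merge shard partials, then derive avg and distinct count."""
    merged = {}
    for partial in partials:
        for key, agg in partial.items():
            out = merged.setdefault(key, _empty())
            out["count"] += agg["count"]
            out["sum"] += agg["sum"]
            out["min"] = min(out["min"], agg["min"])
            out["max"] = max(out["max"], agg["max"])
            out["distinct"] |= agg["distinct"]
    return {key: {"sum": agg["sum"], "avg": agg["sum"] / agg["count"],
                  "min": agg["min"], "max": agg["max"],
                  "distinct_count": len(agg["distinct"])}
            for key, agg in merged.items()}
```

The average is carried as a sum and a count, and the distinct count as a set, because averages and distinct counts from separate shards cannot be combined directly.
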
  39. Lessons learned

  40. Conclusions
      - SenseiDB is fast at searching/indexing, with no variance
      - A couple of nodes are enough to handle Spanish SocialTV
      - Love the query language and time operators
      - Supports real-time exploration
  41. Limitations
      - SenseiDB
        - Documentation is still scarce
        - Single table model: flat users and reputation
        - Tricks to store complex facets
      - Social TV Tracker
        - Group entity mentions across tweets
        - Relevance is tricky: ad hoc
        - Manageability
        - Integrate history
  42. Comparison
      - Solr
        - Near-real-time updates
        - Soft commits
        - Simple facets
        - Popular: great tools
      - Storm, S4?
      - ElasticSearch
        - Batch commits
        - Online facets
        - Aggregation after facets
        - Much better plugin system
  43. Thanks and Q&A @zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics