Cesar_de_Pablo_-_SocialTV_-_BigData_Spain_2013.pdf

Big Data Spain
November 14, 2013

Transcript

  1. Textalytics: Meaning-as-a-Service. Real-Time Semantic Search for Social TV Streams

    César de Pablo Sánchez, Daedalus. Nov. 7th, 2013. Big Data Spain (Madrid)
  2. The plot

    1. What's Social TV? 2. Monitoring Social TV conversations: a preliminary architecture 3. Understanding the buzz: Textalytics 4. Organizing the mess: SenseiDB 5. Lessons learned
  3. API

    • NLP and Semantics API • Multilingual: EN, ES (FR, IT, PT, CA) • REST service: JSON and XML • Combines the best of all worlds • Deep language analysis • Comprehensive resources: linguistics and DBs • Ontology • Rule-based methods • Statistical and machine learning methods
  4. High-level semantic API – close to business scenarios

    • Core API – building blocks: Topics, Sentiment, Classif., Linked Data, POS • Configuration and Linguistic Resources • Media Analysis API • Semantic Publishing API • … API
  5. Core API

    • Topics Extraction • Text Classification • Sentiment Analysis • Language Identification • Lemmatization, POS and Parsing • Speech Recognition and Speaker Diarization • Semantic Linked Data Viewer • Spell, Grammar and Style • User Demographics
  6. Language identification

    • Given a text, identify a list of languages, or just one • 62 languages • Uses language n-gram signatures • Social TV • Filter – TV hashtags often imply a language • Sometimes hashtags are multilingual – but not relevant for users
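
The n-gram signature idea on this slide can be sketched as follows. This is a minimal illustration, not the Textalytics implementation: the two toy training samples, the trigram size, and the cosine scoring are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy samples; real signatures are built from far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and watches the show tonight",
    "es": "el programa de esta noche es muy bueno y los concursantes cocinan para el jurado",
}
SIGNATURES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def rank_languages(text: str) -> list[tuple[str, float]]:
    """Return every known language with its score, best match first."""
    profile = ngram_profile(text)
    scores = [(lang, similarity(profile, sig)) for lang, sig in SIGNATURES.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Taking the first entry of `rank_languages(...)` gives the "just one" answer; keeping the whole ranking gives the language list.
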
  7. Text Classification

    • Theme labels – IPTC • Relevance • Multiple labels • Tailored for short texts (tweets) • Define your own models and categories • Social TV – filter on topic content
  8. Sentiment analysis

    • Document-level classification • Positive/Negative/Neutral • Subjective/Objective • Tailored for short texts • Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies • Other features • Entity-level sentiment • Segment-level sentiment
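
As a rough illustration of what handling Twitter jargon can involve before sentiment classification, the sketch below strips retweet prefixes, mentions, and URLs, and collapses stretched letters. The regular expressions and the `normalize_tweet` helper are hypothetical; the slides do not describe how Textalytics does this.

```python
import re

RETWEET = re.compile(r"^RT\s+@\w+:?\s*")  # leading "RT @user:" marker
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")
REPEATS = re.compile(r"(\w)\1{2,}")       # three or more of the same character


def normalize_tweet(text: str) -> dict:
    """Split a tweet into clean text plus the jargon found in it."""
    is_retweet = bool(RETWEET.match(text))
    body = RETWEET.sub("", text)
    mentions = MENTION.findall(body)
    hashtags = [tag.lower() for tag in HASHTAG.findall(body)]
    body = URL.sub("", body)
    body = MENTION.sub("", body)
    body = HASHTAG.sub(r"\1", body)       # keep the hashtag word, drop the '#'
    body = REPEATS.sub(r"\1\1", body)     # "Greaaaat" -> "Greaat"
    return {
        "text": " ".join(body.split()),
        "is_retweet": is_retweet,
        "mentions": mentions,
        "hashtags": hashtags,
    }
```

The cleaned `text` would feed a classifier, while `mentions` and `hashtags` remain available as separate features.
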
  9. Topics Extraction

    • People: Ben Bernanke, Mariano Rajoy… • Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve… • Economic entities: Ibex35, Dax Xetra… • Locations: London, USA, Paris… • Concepts: risk premium, President of the Government, parliamentary speech, stock market index, economic situation… • Time references: today, yesterday, around 11 in the morning… • Monetary amounts: 104 dollars, 1 euro… • 12 main types • Ontology with > 200 types • Instances – BBVA • Classes – bank • fictional/historic • SocialTV: • populate custom
  10. SenseiDB

    • Open source, distributed, real-time, semi-structured database • From LinkedIn SNA: powering the LinkedIn home page and LinkedIn Signal • Integrates other open source technologies: • Zoie – search engine • Bobo – faceted search • Apache Kafka – pub-sub system • http://www.senseidb.com/
  11. SenseiDB features

    • "Hybrid" information retrieval – database • Full-text search • Structured and faceted search • Fast real-time updates with low latency and high throughput – pull model • Single table/collection • BQL – a SQL-like language • Eventual consistency • Distributed – sharding and partitioning • Hadoop integration
  12. Faceted search

    • Amazon.com? • Identify relevant attributes to use as filters • Predefined facets • Define a table schema • Define fields as facets – facet schema
  13. Faceted search in depth

    • Field types • Basic: string, int, short, long, float, double, char • Complex: date and text (analyzed, term vectors) • Facet types • Simple: 1 row – 1 value • Hierarchical – path c>b>a • Range – define ranges • Multi: 1 row – n values • Histogram – define bins and their size • TimeRange – for real-time data
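
To make the facet types above concrete, here is a small in-memory sketch of how simple, multi, range, and hierarchical facets differ when counting. It does not use SenseiDB or Bobo; the sample rows, field names, and helper functions are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical tweet rows; field names are made up for this example.
TWEETS = [
    {"lang": "es", "hashtags": ["topchef", "chicote"], "retweets": 3, "topic": "tv/cooking/topchef"},
    {"lang": "es", "hashtags": ["topchef"], "retweets": 120, "topic": "tv/cooking/topchef"},
    {"lang": "en", "hashtags": [], "retweets": 15, "topic": "tv/news"},
]


def simple_facet(rows, field):
    """Simple facet: one row contributes exactly one value."""
    return Counter(row[field] for row in rows)


def multi_facet(rows, field):
    """Multi facet: one row contributes n values."""
    return Counter(value for row in rows for value in row[field])


def range_facet(rows, field, ranges):
    """Range facet: count rows falling in each half-open [low, high) range."""
    counts = Counter()
    for row in rows:
        for low, high in ranges:
            if low <= row[field] < high:
                counts[(low, high)] += 1
    return counts


def path_facet(rows, field, sep="/"):
    """Hierarchical facet: a row counts for every prefix of its path."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts
```

A histogram facet is the same idea as `range_facet` with equally sized bins generated from a start, an end, and a bin size.
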
  14. Using facets for semantic search

    • Define a facet for: • entities/concepts → tweets about Chicote – include all variants + user + hashtags • each entity type → navigate by type – popular people • classification/sentiment/emotions → positive tweets about Chicote • users or hashtags → popular users / popular mentions / correlated hashtags • URLs • Time range facets
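
The "positive tweets about Chicote" and "correlated hashtags" navigation can be sketched as a facet drill-down: filter on selected facet values, then count the remaining facets. The data and the `drill_down` and `facet_counts` helpers below are hypothetical and run in memory rather than against SenseiDB.

```python
from collections import Counter

# Hypothetical annotated tweets; in the talk these annotations would come
# from the semantic analysis step.
TWEETS = [
    {"entities": ["Chicote"], "sentiment": "P", "hashtags": ["topchef", "chicote"], "user": "ana"},
    {"entities": ["Chicote"], "sentiment": "N", "hashtags": ["topchef"], "user": "luis"},
    {"entities": ["Chicote", "Top Chef"], "sentiment": "P", "hashtags": ["topchef", "final"], "user": "ana"},
    {"entities": ["Rajoy"], "sentiment": "N", "hashtags": ["debate"], "user": "eva"},
]


def _as_list(value):
    """Treat single-valued and multi-valued fields uniformly."""
    return value if isinstance(value, list) else [value]


def drill_down(rows, **selections):
    """Keep rows whose fields contain every selected facet value."""
    return [
        row for row in rows
        if all(wanted in _as_list(row[field]) for field, wanted in selections.items())
    ]


def facet_counts(rows, field):
    """Count facet values over the rows that survived the drill-down."""
    counts = Counter()
    for row in rows:
        counts.update(_as_list(row[field]))
    return counts


positive = drill_down(TWEETS, entities="Chicote", sentiment="P")
correlated_hashtags = facet_counts(positive, "hashtags")
popular_users = facet_counts(positive, "user")
```
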
  15. BQL – search, filter and facets

    • Examples • Facets support basic analytics tasks • Relevance may be defined in the query – text queries
  16. Real-time indexing

    • Data events – add and delete • Data streams – a succession of data events • Gateways • Read data events from data streams • File • JDBC • JMS • Kafka • Custom: Twitter
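
The event and gateway model on this slide can be illustrated with a toy consumer. The sketch assumes a made-up line format (`ADD <id> <text>` / `DEL <id>`) and hypothetical `DataEvent`, `file_gateway`, and `InMemoryIndex` names; SenseiDB's real gateway interfaces are Java and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class DataEvent:
    version: int                 # position of the event in the stream
    doc_id: str
    doc: Optional[dict] = None   # None marks a delete event


def file_gateway(lines: Iterable[str]) -> Iterator[DataEvent]:
    """Hypothetical gateway: turn 'ADD id text' / 'DEL id' lines into events."""
    for version, line in enumerate(lines, start=1):
        op, _, rest = line.strip().partition(" ")
        if op == "ADD":
            doc_id, _, text = rest.partition(" ")
            yield DataEvent(version, doc_id, {"text": text})
        elif op == "DEL":
            yield DataEvent(version, rest)


class InMemoryIndex:
    """Applies each event as it arrives, with no batch commit step."""

    def __init__(self) -> None:
        self.docs: dict = {}
        self.version = 0

    def consume(self, events: Iterable[DataEvent]) -> None:
        for event in events:
            if event.doc is None:
                self.docs.pop(event.doc_id, None)
            else:
                self.docs[event.doc_id] = event.doc
            self.version = event.version
```

Because the index pulls events from the gateway, swapping the file source for a Kafka or Twitter reader only means writing another generator that yields `DataEvent` objects.
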
  17. Scalability

    • ZooKeeper to coordinate replicas • Low indexing latency (no batch commit) • Low search latency – low volatility • Horizontally scalable – shards • Shards may be replicated N times • Elastic – nodes can be added to accommodate growth
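
A simplified picture of shards replicated N times: each document is routed to one shard by a stable hash, written to all of that shard's replicas, and a search fans out to one replica per shard. The `Cluster` class and CRC32 routing are assumptions for this sketch; the slides do not say how SenseiDB assigns documents to shards.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable routing: the same id always maps to the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class Cluster:
    def __init__(self, num_shards: int, replicas: int) -> None:
        # shards[s][r] is replica r of shard s, modelled as a plain dict.
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]

    def index(self, doc_id: str, doc: dict) -> None:
        """Write the document to every replica of its shard."""
        for replica in self.shards[shard_for(doc_id, len(self.shards))]:
            replica[doc_id] = doc

    def search(self, predicate) -> list:
        """Fan out to one replica per shard and merge the matching ids."""
        hits = []
        for replicas in self.shards:
            replica = replicas[0]  # any replica would return the same result
            hits.extend(doc_id for doc_id, doc in replica.items() if predicate(doc))
        return sorted(hits)
```

Growing such a cluster elastically would also require moving documents when the shard count changes, which this sketch leaves out.
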
  18. Other features

    • Batch indexing via Hadoop – ETL • Simple analytics by batch indexing • Customized relevance models • MapReduce functions over facets • Sum, avg, min, max • DistinctCount • Activity values – volatile values – likes
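
The aggregate functions listed here (sum, avg, min, max, distinct count) amount to grouping rows by a facet value and reducing each group. The following is an in-memory sketch with a hypothetical `aggregate` helper and made-up field names, not SenseiDB's MapReduce API.

```python
from collections import defaultdict


def aggregate(rows, facet, metric, distinct=None):
    """Group rows by a facet value and compute basic statistics per group."""
    # "Map" step: bucket rows by facet value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[facet]].append(row)

    # "Reduce" step: collapse each bucket into summary numbers.
    result = {}
    for key, members in groups.items():
        values = [member[metric] for member in members]
        stats = {
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        if distinct is not None:
            stats["distinct_count"] = len({member[distinct] for member in members})
        result[key] = stats
    return result


# Hypothetical rows: sentiment facet, retweet counts, and authoring user.
ROWS = [
    {"sentiment": "P", "retweets": 10, "user": "ana"},
    {"sentiment": "P", "retweets": 30, "user": "ana"},
    {"sentiment": "N", "retweets": 4, "user": "luis"},
]
summary = aggregate(ROWS, facet="sentiment", metric="retweets", distinct="user")
```
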
  19. Conclusions

    • SenseiDB is fast at searching/indexing – no variance • A couple of nodes are enough to handle Spanish Social TV • Love the query language and time operators • Supports real-time exploration
  20. Limitations

    • SenseiDB • Documentation is still scarce • Single table model – flat users and reputation • Tricks to store complex facets • Social TV Tracker • Group entity mentions across tweets • Relevance is tricky – ad hoc • Manageability • Integrate history
  21. Comparison

    • Solr • Near-real-time updates • Soft commits • Simple facets • Popular – great tools • Storm, S4? • ElasticSearch • Batch commits • Online facets • Aggregation after facets • Much better plugin system