Cesar_de_Pablo_-_SocialTV_-_BigData_Spain_2013.pdf

Textalytics: Meaning-as-a-Service Real Time Semantic Search for Social TV streams
César de Pablo Sánchez Daedalus Nov. 7th 2013 Big Data Spain (Madrid)

The plot
1. What's Social TV?
2. Monitoring Social TV conversations. A preliminary architecture
3. Understanding the buzz. Textalytics
4. Organizing the mess. SenseiDB
5. Lessons learned

Social TV Second Screen Transmedia

Not just TV
- Sports
- Elections
- Alerts

Big Data?
- Volume
- Velocity
- Variety

Users?
- Viewers
- Channels
- Brands

Viewers?
- Participate
- Vote
- Influence
- Confirm beliefs
- Keep updated
- Belong to group

Channels?
- Understand
- React
- Measure

Brands?
- Select programs
- Reputation
- Find public

Monitoring Social TV conversations. The architecture

[Architecture diagram: tracker, gateway, HTTP stream, pipeline, pull EPG]

Understanding the buzz Textalytics API

API
- NLP and Semantics API
- Multilingual: EN, ES (FR, IT, PT, CA)
- REST service: JSON and XML
- Combine best of all worlds
  - Deep language analysis
  - Comprehensive resources: linguistics and DBs
  - Ontology
  - Rule-based methods
  - Statistics and machine learning methods

- High-level semantic API – close to business scenarios
- Core API – building blocks
[Diagram: Topics, Sentiment, Classification, Linked Data, POS; Configuration and Linguistic Resources; Media Analysis API, Semantic Publishing API, … API]

[Diagram: Media Analysis API, Semantic Publishing API, Core API]

Core API
- Topics Extraction
- Text Classification
- Sentiment Analysis
- Language identification
- Lemmatization, POS and Parsing
- Speech Recognition and Speaker Diarization
- Semantic Linked Data Viewer
- Spell, Grammar and Style
- User Demographics

Language identification
- Given a text, identify a list of languages, or just one
- 62 languages
- Uses language n-gram signatures
- Social TV
  - Filter: TV hashtags often imply a language
  - Sometimes hashtags are multilingual, but not relevant for users
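
The "n-gram signatures" idea can be sketched as follows. This is a minimal illustration of rank-based character n-gram profiles, not the Textalytics implementation; the two tiny training samples, the function names and the profile size are assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams of a text, ranked."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in grams.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the signature pay a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


# Toy training text, far too small for real use (assumption for illustration).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the show is on tonight",
    "es": "el programa de esta noche es muy bueno y la gente lo comenta en la red",
}
SIGNATURES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the known languages sorted from best to worst match."""
    profile = ngram_profile(text)
    return sorted(SIGNATURES, key=lambda lang: out_of_place(profile, SIGNATURES[lang]))
```

Returning a ranked list rather than a single label mirrors the "a list of languages, or just one" option on the slide: the caller can take the first element or keep several candidates.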

Text Classification
- Theme labels – IPTC
- Relevance
- Multiple labels
- Tailored for short text (tweets)
- Define your own models and categories
- Social TV – filter on topic content

Sentiment analysis
- Document level classification
  - Positive/Negative/Neutral
  - Subjective/Objective
- Tailored for short texts
- Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
- Other features
  - Entity level sentiment
  - Segment level sentiment
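
Handling Twitter jargon usually starts with a normalization pass before any classifier runs. The sketch below shows one possible pre-processing step using regular expressions; the emoticon table, the feature names and the sample tweet are invented for illustration and do not describe how Textalytics works internally.

```python
import re

# Tiny emoticon lexicon (assumption; a real one would be much larger).
EMOTICONS = {":)": "positive", ":-)": "positive", ":D": "positive",
             ":(": "negative", ":-(": "negative"}


def normalize_tweet(text):
    """Split a raw tweet into cleaned text plus Twitter-specific features."""
    features = {
        "is_retweet": bool(re.match(r"RT\b", text)),
        "mentions": re.findall(r"@(\w+)", text),
        "hashtags": re.findall(r"#(\w+)", text),
        "emoticons": [EMOTICONS[t] for t in text.split() if t in EMOTICONS],
    }
    clean = re.sub(r"^RT\s+", "", text)             # drop the retweet marker
    clean = re.sub(r"@\w+:?", "", clean)            # drop user mentions
    clean = re.sub(r"#(\w+)", r"\1", clean)         # keep the hashtag word itself
    clean = " ".join(t for t in clean.split() if t not in EMOTICONS)
    clean = re.sub(r"(\w)\1{2,}", r"\1\1", clean)   # squeeze elongations: "aaaa" -> "aa"
    return clean, features
```

The cleaned text would go to the sentiment classifier, while the extracted mentions, hashtags and emoticon polarities can be kept as separate signals.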

Topics Extraction
- People: Ben Bernanke, Mariano Rajoy…
- Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
- Economic entities: Ibex35, Dax Xetra…
- Locations: London, USA, Paris…
- Concepts: risk premium, Prime Minister, parliamentary address, stock index, economic situation…
- Time references: today, yesterday, around 11 in the morning…
- Monetary amounts: 104 dollars, 1 euro…
- 12 main types
- Ontology with > 200 types
  - Instances – BBVA
  - Classes – bank
  - fictional/historic
- SocialTV:
  - populate custom

Entity Linking
- Linking entities to their 'real' representation
- Linking to several LOD sources

Organizing the mess. SenseiDB

SenseiDB
- Open source, distributed, realtime, semi-structured database
- From LinkedIn SNA: powering LinkedIn home and LinkedIn Signals
- Integrates other open source technologies:
  - Zoie – search engine
  - Bobo – faceted search
  - Apache Kafka – pub-sub system
- http://www.senseidb.com/

SenseiDB features
- 'Hybrid' Information Retrieval – Database
  - Full text search
  - Structured and faceted search
- Fast real time updates with low latency and high throughput – pull model
- Single table/collection
- BQL – a SQL-like language
- Eventual consistency
- Distributed – sharding and partitioning
- Hadoop integration

Faceted search
- Amazon.com?
- Identify relevant attributes to use as filters
- Predefined facets
  - Define a table schema
  - Define fields as facets – facet schema

Faceted search in depth
- Field types
  - Basic: string, int, short, long, float, double, char
  - Complex: date and text (analyzed, term vectors)
- Facet types
  - Simple: 1 row – 1 value
  - Hierarchical – path c>b>a
  - Range – define ranges
  - Multi: 1 row – n values
  - Histogram – define bins and their size
  - TimeRange – for real time data
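
To make the facet types concrete, here is a small in-memory sketch of how simple, multi, hierarchical and range facets differ when counting. It illustrates the concepts only and is not SenseiDB code; the rows, field names and the "/" path separator are assumptions.

```python
from collections import Counter

# Hypothetical rows, one per tweet.
ROWS = [
    {"lang": "es", "hashtags": ["tv", "masterchef"], "topic": "tv/cooking", "retweets": 3},
    {"lang": "es", "hashtags": ["tv"], "topic": "tv/news", "retweets": 120},
    {"lang": "en", "hashtags": [], "topic": "sports/football", "retweets": 15},
]


def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)


def multi_facet(rows, field):
    """Multi facet: each row may contribute zero or more values."""
    return Counter(value for row in rows for value in row[field])


def path_facet(rows, field, sep="/"):
    """Hierarchical facet: a row counts for its path and every ancestor."""
    counts = Counter()
    for row in rows:
        parts = row[field].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts


def range_facet(rows, field, ranges):
    """Range facet: count rows whose value falls in each [low, high) range."""
    counts = Counter()
    for row in rows:
        for low, high in ranges:
            if low <= row[field] < high:
                counts[(low, high)] += 1
    return counts
```

For example, `path_facet(ROWS, "topic")` counts two rows under `tv` and one each under `tv/cooking` and `tv/news`, which is what lets a user drill down level by level.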

Using facets for semantic search
- Define a facet for:
  - entities/concepts → tweets about Chicote – include all variants + user + hashtags
  - each entity type → navigate by type – popular people
  - classification/sentiment/emotions → positive tweets about Chicote
  - users or hashtags → popular users / popular mentions / correlated hashtags
  - URLs
  - Time range facets
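
The "positive tweets about Chicote" query can be sketched as a selection over facet fields that were filled in at indexing time. Everything below is a toy model: the alias table, the sentiment labels and the document layout are assumptions for illustration, and in the real system the entity and sentiment values would come from the semantic analysis step.

```python
from collections import Counter

# Hypothetical alias table: name variants, user handle and hashtag map to one entity.
ALIASES = {
    "chicote": "Alberto Chicote",
    "alberto chicote": "Alberto Chicote",
    "@albertochicote": "Alberto Chicote",
    "#chicote": "Alberto Chicote",
}


def annotate(tweet_id, text, sentiment):
    """Build an indexable document whose facet fields carry the semantic annotations."""
    lowered = text.lower()
    entities = sorted({entity for alias, entity in ALIASES.items() if alias in lowered})
    hashtags = [token for token in lowered.split() if token.startswith("#")]
    return {"id": tweet_id, "text": text, "entities": entities,
            "sentiment": sentiment, "hashtags": hashtags}


def search(docs, **selections):
    """Keep documents whose facet fields contain every selected value."""
    def matches(doc):
        for field, wanted in selections.items():
            value = doc[field]
            values = value if isinstance(value, list) else [value]
            if wanted not in values:
                return False
        return True
    return [doc for doc in docs if matches(doc)]


def facet_counts(hits, field):
    """Count facet values over a result set, e.g. to find correlated hashtags."""
    counts = Counter()
    for doc in hits:
        value = doc[field]
        counts.update(value if isinstance(value, list) else [value])
    return counts
```

A call such as `search(docs, entities="Alberto Chicote", sentiment="positive")` then plays the role of the facet selection, and `facet_counts(hits, "hashtags")` gives the correlated hashtags for that selection.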

BQL – search, filter and facets
- Examples
- Basic analytics tasks can be defined as facets
- Relevance may be defined in the query – text queries

Architecture

Real time indexing
- Data events – add and delete
- Data streams – succession of data events
- Gateways
  - Read data events from data streams
  - File
  - JDBC
  - JMS
  - Kafka
  - Custom: Twitter
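
The relationship between data events, data streams and gateways can be sketched as a generator that feeds an index. The event layout used here (a "type" key with "add" or "delete") is an assumption for the example, not SenseiDB's actual wire format, and the index is a plain dictionary.

```python
import json


class InMemoryIndex:
    """Stand-in for the search index: documents keyed by id."""

    def __init__(self):
        self.docs = {}

    def apply(self, event):
        """Apply one data event; only add and delete exist."""
        if event["type"] == "add":
            self.docs[event["data"]["id"]] = event["data"]
        elif event["type"] == "delete":
            self.docs.pop(event["id"], None)
        else:
            raise ValueError(f"unknown event type: {event['type']}")


def file_gateway(lines):
    """Gateway over a file-like stream: one JSON data event per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def consume(gateway, index):
    """Pull events from the gateway and apply them one by one (no batch commit)."""
    count = 0
    for event in gateway:
        index.apply(event)
        count += 1
    return count
```

A custom Twitter gateway would have the same shape: a generator that turns incoming statuses into add events, so the indexing loop does not need to know where the stream comes from.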

Scalability
- ZooKeeper to keep replicas
- Low indexing latency (no batch commit)
- Low search latency – low volatility
- Horizontally scalable – shards
- Shards may be replicated N times
- Elastic – nodes can be added to accommodate growth
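
Sharding with replication can be illustrated with a toy cluster that routes each document to one shard and answers queries by asking one replica of every shard. The modulo routing and round-robin replica choice are assumptions made for this sketch; cluster coordination through ZooKeeper is not modeled.

```python
import itertools


class Cluster:
    """Toy cluster: num_shards partitions, each stored on `replicas` copies."""

    def __init__(self, num_shards, replicas):
        self.num_shards = num_shards
        # shard id -> list of replica stores, each a dict of uid -> document
        self.shards = {s: [dict() for _ in range(replicas)] for s in range(num_shards)}
        self._queries = itertools.count()

    def index(self, uid, doc):
        """Route the document to one shard and write it to all its replicas."""
        for replica in self.shards[uid % self.num_shards]:
            replica[uid] = doc

    def search(self, predicate):
        """Scatter the query to one replica per shard and merge the hits."""
        pick = next(self._queries)  # rotate replicas across queries
        hits = []
        for replicas in self.shards.values():
            replica = replicas[pick % len(replicas)]
            hits.extend(uid for uid, doc in replica.items() if predicate(doc))
        return sorted(hits)
```

Adding shards spreads the index over more nodes, while adding replicas spreads query load; every replica of a shard returns the same hits, so any of them can serve a query.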

Other features
- Batch indexing via Hadoop – ETL
- Simple analytics by batch indexing
- Customized relevance models
- MapReduce functions over facets
  - Sum, avg, min, max
  - DistinctCount
- Activity values – volatile values – likes
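
The aggregation functions listed above fit a map/reduce pattern: each shard computes partial statistics per facet value, and the partials are merged afterwards. The sketch below shows that pattern in plain Python; the function names and document fields are hypothetical and this is not the SenseiDB API.

```python
def map_shard(docs, facet, metric, distinct_field):
    """Map step, run per shard: partial statistics grouped by facet value."""
    partial = {}
    for doc in docs:
        value = doc[metric]
        stats = partial.setdefault(
            doc[facet],
            {"sum": 0, "count": 0, "min": value, "max": value, "distinct": set()},
        )
        stats["sum"] += value
        stats["count"] += 1
        stats["min"] = min(stats["min"], value)
        stats["max"] = max(stats["max"], value)
        stats["distinct"].add(doc[distinct_field])
    return partial


def reduce_partials(partials):
    """Reduce step: merge per-shard partials and derive avg and distinct count."""
    merged = {}
    for partial in partials:
        for key, stats in partial.items():
            if key not in merged:
                merged[key] = {**stats, "distinct": set(stats["distinct"])}
                continue
            total = merged[key]
            total["sum"] += stats["sum"]
            total["count"] += stats["count"]
            total["min"] = min(total["min"], stats["min"])
            total["max"] = max(total["max"], stats["max"])
            total["distinct"] |= stats["distinct"]
    return {
        key: {
            "sum": s["sum"],
            "avg": s["sum"] / s["count"],
            "min": s["min"],
            "max": s["max"],
            "distinct_count": len(s["distinct"]),
        }
        for key, s in merged.items()
    }
```

Note that the average is carried as a sum and a count, and the distinct count as a set, because neither can be merged correctly from already finalized per-shard values.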

Lessons learned

Conclusions
- SenseiDB is fast at searching/indexing – no variance
- A couple of nodes are enough to handle Spanish SocialTV
- Love the query language and time operators
- Supports real time exploration

Limitations
- SenseiDB
  - Documentation is still scarce
  - Single table model – flat users and reputation
  - Tricks to store complex facets
- Social TV Tracker
  - Group entity mentions across tweets
  - Relevance is tricky – ad hoc
  - Manageability
  - Integrate history

Comparison
- Solr
  - Near real time updates
  - Soft commits
  - Simple facets
  - Popular – great tools
- ElasticSearch
  - Batch commits
  - On line facets
  - Aggregation after facets
  - Much better plugin system
- Storm, S4?

Thanks and QA @zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics
