Cesar_de_Pablo_-_SocialTV_-_BigData_Spain_2013.pdf

Slide 1

Slide 1 text

Textalytics: Meaning-as-a-Service
Real Time Semantic Search for Social TV streams
César de Pablo Sánchez, Daedalus
Nov. 7th 2013, Big Data Spain (Madrid)

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

The plot
1. What's Social TV?
2. Monitoring Social TV conversations. A preliminary architecture
3. Understanding the buzz. Textalytics
4. Organizing the mess. SenseiDB
5. Lessons learned

Slide 5

Slide 5 text

Social TV Second Screen Transmedia

Slide 6

Slide 6 text

Not just TV Sports Elections Alerts

Slide 7

Slide 7 text

Big Data? Volume Velocity Variety

Slide 8

Slide 8 text

Users? Viewers Channels Brands

Slide 9

Slide 9 text

Viewers? Participate Vote Influence Confirm beliefs Keep updated Belong to group

Slide 10

Slide 10 text

Viewers? Participate Influence Confirm beliefs Keep updated

Slide 11

Slide 11 text

Channels? Understand React Measure

Slide 12

Slide 12 text

Channels? Understand React

Slide 13

Slide 13 text

Brands? Select programs Reputation Find public

Slide 14

Slide 14 text

Brands? Reputation Find public

Slide 15

Slide 15 text

Monitoring Social TV conversations. The architecture

Slide 16

Slide 16 text

tracker gateway HTTP Stream pipeline Pull EPG

Slide 17

Slide 17 text

Understanding the buzz Textalytics API

Slide 18

Slide 18 text

API
- NLP and Semantics API
- Multilingual: EN, ES (FR, IT, PT, CA)
- REST service: JSON and XML
- Combine the best of all worlds
  - Deep language analysis
  - Comprehensive resources: linguistics and DBs
  - Ontology
  - Rule-based methods
  - Statistics and machine learning methods

Slide 19

Slide 19 text

- High-level semantic API – close to business scenarios
- Core API – building blocks
Topics, Sentiment, Classification, Linked Data, POS
Configuration and Linguistic Resources
Media Analysis API, Semantic Publishing API, … API

Slide 20

Slide 20 text

Media Analysis Semantic Publishing Core API API

Slide 21

Slide 21 text

Core API
- Topics Extraction
- Text Classification
- Sentiment Analysis
- Language identification
- Lemmatization, POS and Parsing
- Speech Recognition and Speaker Diarization
- Semantic Linked Data Viewer
- Spell, Grammar and Style
- User Demographics

Slide 22

Slide 22 text

Language identification
- Given a text, identify a list of languages, or just one
- 62 languages
- Uses language n-gram signatures
- Social TV
  - Filter – TV hashtags often imply a language
  - Sometimes hashtags are multilingual – but not relevant for users
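The slide only says that detection relies on "language n-gram signatures". As an illustration, here is a minimal Python sketch of one common way to do that: ranked character n-gram profiles compared with an out-of-place distance. The tiny `SAMPLES` texts, the function names and the parameters are assumptions made for this example, not the Textalytics implementation.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (1..n_max), most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the signature get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, signatures):
    """Return the language whose signature is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(signatures, key=lambda lang: out_of_place(doc, signatures[lang]))


# Toy training texts (hypothetical); real signatures need far more data per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what "
          "they were watching on the television tonight with their friends",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es lo "
          "que ellos estaban viendo en la televisión esta noche con sus amigos",
}
SIGNATURES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

A call such as `identify("qué gran programa esta noche", SIGNATURES)` returns the key of the closest signature; with signatures this small the answer on short tweets is unreliable, which is one reason a hashtag-based filter is useful as a complement.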

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Text Classification
- Theme labels – IPTC
- Relevance
- Multiple labels
- Tailored for short text (tweets)
- Define your own models and categories
- Social TV – filter on topic content

Slide 25

Slide 25 text

Sentiment analysis
- Document level classification
  - Positive/Negative/Neutral
  - Subjective/Objective
- Tailored for short texts
  - Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
- Other features
  - Entity level sentiment
  - Segment level sentiment

Slide 26

Slide 26 text

Topics Extraction
- People: Ben Bernanke, Mariano Rajoy…
- Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
- Economic entities: Ibex35, Dax Xetra…
- Locations: London, USA, Paris…
- Concepts: risk premium, President of the Government, parliamentary address, stock index, economic situation…
- Time references: today, yesterday, around 11 in the morning…
- Monetary amounts: 104 dollars, 1 euro…
- 12 main types
- Ontology with > 200 types
- Instances – BBVA
- Classes – bank
- fictional/historic
- SocialTV:
  - populate custom

Slide 27

Slide 27 text

Entity Linking
- Linking entities to their 'real' representation
- Linking to several LOD sources

Slide 28

Slide 28 text

Organizing the mess. SenseiDB

Slide 29

Slide 29 text

SenseiDB
- Open source, distributed, realtime, semi-structured database
- From LinkedIn SNA: powering LinkedIn home and LinkedIn Signals
- Integrates other open source technologies:
  - Zoie – search engine
  - Bobo – faceted search
  - Apache Kafka – pub-sub system
- http://www.senseidb.com/

Slide 30

Slide 30 text

SenseiDB features
- 'Hybrid' Information Retrieval – Database
  - Full text search
  - Structured and faceted search
- Fast real time updates with low latency and high throughput – pull model
- Single table/collection
- BQL – a SQL-like language
- Eventual consistency
- Distributed – sharding and partitioning
- Hadoop integration

Slide 31

Slide 31 text

Faceted search
- Amazon.com?
- Identify relevant attributes to use as filters
- Predefined facets
  - Define a table schema
  - Define fields as facets – facet schema

Slide 32

Slide 32 text

Faceted search in depth
- Field types
  - Basic: string, int, short, long, float, double, char
  - Complex: date and text (analyzed, termvectors)
- Facet types
  - Simple: 1 row – 1 value
  - Hierarchical – Path c>b>a
  - Range – define ranges
  - Multi: 1 row – n values
  - Histogram – define bins and their size
  - TimeRange – for real time data
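To make the difference between the simple, multi and range facet types concrete, here is a small Python sketch that counts facet values over an in-memory list of rows. It only illustrates the semantics; the field names, the sample tweets and the helper functions are hypothetical and this is not how SenseiDB or Bobo implement facets.

```python
from collections import Counter

# Hypothetical rows: one flat record per tweet.
tweets = [
    {"lang": "es", "hashtags": ["masterchef", "tv"], "retweets": 3},
    {"lang": "es", "hashtags": ["masterchef"], "retweets": 120},
    {"lang": "en", "hashtags": [], "retweets": 15},
]


def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)


def multi_facet(rows, field):
    """Multi facet: each row contributes zero or more values."""
    return Counter(value for row in rows for value in row[field])


def range_facet(rows, field, ranges):
    """Range facet: count rows whose value falls in each half-open [low, high) range."""
    counts = Counter()
    for row in rows:
        for label, (low, high) in ranges.items():
            if low <= row[field] < high:
                counts[label] += 1
    return counts


RETWEET_RANGES = {
    "0-9": (0, 10),
    "10-99": (10, 100),
    "100+": (100, float("inf")),
}
```

Calling `multi_facet(tweets, "hashtags")` counts every hashtag of every tweet, so a tweet with two hashtags appears under both values, whereas `simple_facet(tweets, "lang")` places each tweet in exactly one bucket.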

Slide 33

Slide 33 text

Using facets for semantic search
- Define a facet for:
  - entities/concepts → tweets about Chicote – include all variants + user + hashtags
  - each entity type → navigate by type – popular people
  - classification/sentiment/emotions → positive tweets about Chicote
  - users or hashtags → popular users / popular mentions / correlated hashtags
  - URLs
  - Time range facets
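One way to read this slide: the semantic annotations of each tweet are flattened into a single record whose fields can then be declared as facets. The Python sketch below shows that flattening and a filter such as "positive tweets about Chicote". The annotation layout, the field names and the sample tweet are assumptions for illustration; they are not the actual Textalytics output format or the tracker's schema.

```python
def to_facet_record(tweet, annotations):
    """Flatten a tweet plus its (hypothetical) semantic annotations into one flat row."""
    entities = annotations["entities"]
    return {
        "id": tweet["id"],
        "text": tweet["text"],
        "user": tweet["user"],
        "timestamp": tweet["timestamp"],
        # Multi-valued facets: one row, n values.
        "hashtags": sorted({h.lower() for h in tweet["hashtags"]}),
        "entities": sorted({e["canonical"] for e in entities}),
        "people": sorted({e["canonical"] for e in entities if e["type"] == "Person"}),
        # Simple facet: one row, one value.
        "sentiment": annotations["sentiment"],
    }


def matches(record, **filters):
    """True if every filter value equals the field, or is contained in a multi-valued field."""
    for field, wanted in filters.items():
        value = record[field]
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


# Made-up sample: the mention variant is resolved to one canonical entity name.
sample_tweet = {
    "id": 1,
    "text": "Grande @albertochicote en #PesadillaEnLaCocina",
    "user": "viewer42",
    "hashtags": ["PesadillaEnLaCocina"],
    "timestamp": 1383850800,
}
sample_annotations = {
    "entities": [{"canonical": "Alberto Chicote", "type": "Person"}],
    "sentiment": "positive",
}
sample_record = to_facet_record(sample_tweet, sample_annotations)
```

With this layout, `matches(sample_record, entities="Alberto Chicote", sentiment="positive")` expresses the "positive tweets about Chicote" filter, because all name variants were mapped to the same canonical value before indexing.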

Slide 34

Slide 34 text

BQL – search, filter and facets
- Examples
- Basic analytics tasks are supported when defined as facets
- Relevance may be defined in the query – text queries

Slide 35

Slide 35 text

Architecture

Slide 36

Slide 36 text

Real time indexing
- Data events – add and delete
- Data streams – succession of data events
- Gateways
  - Read data events from data streams
  - File
  - JDBC
  - JMS
  - Kafka
  - Custom: Twitter
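The event model on this slide can be sketched in a few lines of Python: a gateway yields add and delete events from a stream, and the index applies each one as soon as it arrives, with no batch commit. The event layout (`type`, `id`, `text`) and the class below are hypothetical stand-ins, not the SenseiDB gateway interface.

```python
from collections import defaultdict


class RealTimeIndex:
    """Toy inverted index that is updated one data event at a time."""

    def __init__(self):
        self.docs = {}                    # doc id -> text
        self.postings = defaultdict(set)  # term -> set of doc ids

    def apply(self, event):
        doc_id = event["id"]
        # A delete, or an add for an existing id, first removes the old version.
        if doc_id in self.docs:
            self._remove(doc_id)
        if event["type"] == "add":
            self.docs[doc_id] = event["text"]
            for term in event["text"].lower().split():
                self.postings[term].add(doc_id)

    def _remove(self, doc_id):
        for term in self.docs.pop(doc_id).lower().split():
            self.postings[term].discard(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def consume(stream, index):
    """Gateway loop: read data events from a stream and apply them in order."""
    for event in stream:
        index.apply(event)
```

Any iterable of events can play the role of the stream here, for example a generator reading lines from a file or wrapping a Kafka consumer; each document is searchable right after its event is applied.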

Slide 37

Slide 37 text

Scalability
- ZooKeeper to keep track of replicas
- Low indexing latency (no batch commit)
- Low search latency – low volatility
- Horizontally scalable – shards
- Shards may be replicated N times
- Elastic – nodes can be added to accommodate growth
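As a rough illustration of sharding with replication, the Python sketch below routes each document to one shard by hashing its id, writes it to every replica of that shard, and answers a search by querying one replica per shard and merging the hits. The hash-modulo routing and the round-robin replica choice are assumptions for the example; the slides do not describe SenseiDB's actual routing strategy.

```python
import zlib
from itertools import cycle


def shard_for(doc_id, num_shards):
    """Stable hash routing: the same id always maps to the same shard."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


class Cluster:
    """Toy cluster: num_shards partitions, each replicated `replicas` times."""

    def __init__(self, num_shards, replicas):
        # nodes[s][r] holds the documents stored on replica r of shard s.
        self.nodes = [[{} for _ in range(replicas)] for _ in range(num_shards)]
        # One round-robin picker per shard to spread searches over its replicas.
        self._pickers = [cycle(range(replicas)) for _ in range(num_shards)]

    def index(self, doc_id, doc):
        for replica in self.nodes[shard_for(doc_id, len(self.nodes))]:
            replica[doc_id] = doc

    def search(self, predicate):
        hits = []
        for shard, picker in zip(self.nodes, self._pickers):
            replica = shard[next(picker)]
            hits.extend(doc for doc in replica.values() if predicate(doc))
        return hits
```

Adding shards spreads the indexing load, while adding replicas spreads the query load; note that plain modulo routing moves most documents when the shard count changes, so a real system needs a more careful partitioning scheme to stay elastic.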

Slide 38

Slide 38 text

Other features
- Batch indexing via Hadoop – ETL
- Simple analytics by batch indexing
- Customized relevance models
- MapReduce functions over facets
  - Sum, avg, min, max
  - DistinctCount
- Activity values – volatile values – likes
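The aggregation functions listed here (sum, avg, min, max, distinct count) amount to a group-by over a facet. The Python sketch below shows those semantics on a list of flat rows; the function name, its parameters and the sample rows are hypothetical, and this is not the SenseiDB MapReduce API.

```python
from collections import defaultdict


def aggregate(rows, facet, metric, distinct=None):
    """Group rows by a facet value and compute sum/avg/min/max of a numeric field.

    If `distinct` is given, also count the distinct values of that field per group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[facet]].append(row)

    result = {}
    for value, members in groups.items():
        numbers = [member[metric] for member in members]
        stats = {
            "sum": sum(numbers),
            "avg": sum(numbers) / len(numbers),
            "min": min(numbers),
            "max": max(numbers),
        }
        if distinct is not None:
            stats["distinct_count"] = len({member[distinct] for member in members})
        result[value] = stats
    return result


# Made-up rows: retweets per tweet, grouped by the TV show they mention.
rows = [
    {"show": "masterchef", "user": "ana", "retweets": 3},
    {"show": "masterchef", "user": "ana", "retweets": 120},
    {"show": "lavoz", "user": "luis", "retweets": 15},
]
stats = aggregate(rows, facet="show", metric="retweets", distinct="user")
```

Here `stats["masterchef"]` summarizes the two tweets about that show, and its `distinct_count` says how many different users wrote them.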

Slide 39

Slide 39 text

Lessons learned

Slide 40

Slide 40 text

Conclusions
- SenseiDB is fast at searching/indexing – no variance
- A couple of nodes are enough to handle Spanish SocialTV
- Love the query language and time operators
- Supports real time exploration

Slide 41

Slide 41 text

Limitations
- SenseiDB
  - Documentation is still scarce
  - Single table model – flat users and reputation
  - Tricks to store complex facets
- Social TV Tracker
  - Group entity mentions across tweets
  - Relevance is tricky – ad hoc
  - Manageability
  - Integrate history

Slide 42

Slide 42 text

Comparison
- Solr
  - Near-RT updates
  - Soft commits
  - Simple facets
  - Popular – great tools
- Storm, S4?
- ElasticSearch
  - Batch commits
  - Online facets
  - Aggregation after facets
  - Much better plugin system

Slide 43

Slide 43 text

Thanks and Q&A @zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics