Big Data Science Challenges in Media

Talk by Chandan Rajah, Chief Architect Big Data @Sky, at the Data Science London (@ds_ldn) meetup on 12/02/2013.

Data Science London

February 18, 2013

Transcript

  1. Why Big Data Science?

    Big Data Value & Vision
    • Machine learning (clustering, classification, regression, pattern mining, behaviour analysis, semantic analysis, topic extraction)
    • Real time analytics & recommendations
    • Central smelting pot
    • Cost to data benefits

    Volume & Variety
    • 10 million subscribers; 10 different touch points
    • Petabytes of data; structured and unstructured
    • Event logs, program data, content metadata, purchase history, etc.
    • Too big for traditional data warehouse

    Velocity & Veracity
    • 140 MB/s (approx. 12 TB/day)
    • Too fast; 95% of the data dropped
    • Inconsistent data structure
    • No single version of truth
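
As a quick sanity check on the velocity figures above, the sketch below converts a sustained rate of 140 MB/s into a daily volume. It assumes decimal units (1 TB = 1,000,000 MB) and a constant rate over 24 hours; the function name `daily_volume_tb` is illustrative and does not come from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
MB_PER_TB = 1_000_000           # decimal units (an assumption, not stated on the slide)


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Return the volume in TB produced in one day at a constant rate in MB/s."""
    return rate_mb_per_s * SECONDS_PER_DAY / MB_PER_TB


if __name__ == "__main__":
    # 140 MB/s * 86,400 s = 12,096,000 MB, i.e. roughly 12 TB
    print(f"{daily_volume_tb(140):.1f} TB/day")  # prints "12.1 TB/day"
```

Under these assumptions the result agrees with the "approx. 12 TB/day" quoted on the slide.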
  2. Big Data Science Challenges

    Data Quality | Feature Extraction | Machine Learning | Visualisation & Verification | Productizing

    • Dirty unstructured data with inconsistent labels
    • Start but no end events
    • Field shifts between extracts
    • XML fragmented data; 100k frags
    • Data too big to run in R; requires subsampling and effective implementation
    • 100s of features; too big for Scala / Scalding tuple
    • No clearly identifiable keys
    • Algorithm implementation issues (e.g. parallelism, scalability, testability)
    • Collaborative filtering, topic modelling, incremental clustering, sentiment analysis
    • Real time versus batch algorithm design
    • Visualisation tool support
    • Automated testing frameworks
    • R -> Scala / Scalding not easy
    • Disaster recovery & cross data centre
    • On the fly analytics; data streams
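
The slide notes that data too big to run in R requires subsampling, but it does not say how the sample was drawn. One common way to take a fixed-size uniform sample from a stream of unknown length in a single pass is reservoir sampling (Algorithm R). The sketch below illustrates that technique as an assumption; it is not presented as the method used in the talk, and the name `reservoir_sample` and the example event strings are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from stream in one pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical event stream; a generator keeps memory use at O(k).
    events = (f"event-{n}" for n in range(1_000_000))
    print(reservoir_sample(events, k=5, seed=42))
```

Because only `k` items are held in memory at any time, a sample of this kind could be written out and loaded into R even when the full extract cannot.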