Topological Data Analysis - Visualisation and analysis of complex datasets

DataRefiner
November 11, 2014

Transcript

  1. The motivation: get answers to questions you didn’t ask yet

    Instead of asking the data specific questions, we can use traditional tools to join different data sources and prepare a holistic dataset. This dataset can then be processed automatically using topological data analysis and presented as a map of dependencies and correlations.
  2. Topological invariants

    A topological invariant is a map f that assigns the same object to homeomorphic spaces; that is, if X and Y are homeomorphic, then f(X) = f(Y).

    Homology is a machine that converts local data about a space into global algebraic structure.

    Reference: Wikipedia, 2010.
  3. Topology Data Analysis Pipeline

    a. Compute a combinatorial model approximating the structure of the underlying space.
    b. Compute topological invariants of this structure.
    c. Represent these topological invariants in 2D space.
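Steps a and b of the pipeline can be illustrated with a toy sketch. It is not the implementation behind the deck: it assumes a small, made-up point cloud in the plane, uses the edges of a Vietoris-Rips complex at a hypothetical scale `epsilon` as the combinatorial model, and computes only the simplest invariant, the number of connected components (Betti number b0). Step c, the 2D representation, is not covered.

```python
from itertools import combinations
from math import dist


def rips_edges(points, epsilon):
    """Step a: edges of a Vietoris-Rips complex (pairs no farther apart than epsilon)."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) <= epsilon
    ]


def betti_0(n_vertices, edges):
    """Step b: count connected components with union-find."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(v) for v in range(n_vertices)})


# Two well-separated clumps of points (illustrative data, not from the deck).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
edges = rips_edges(points, epsilon=0.5)
components = betti_0(len(points), edges)
print(f"{len(edges)} edges, {components} connected components")
```

Changing `epsilon` changes the model: a very small value leaves every point isolated, while a very large one merges everything into a single component.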
  4. Morse Theory and Reeb Graph

    Theorem: Suppose h : X → ℝ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.

    Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.
  5. Case study: 20 Newsgroups

    The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

    • 18,820 documents
    • From 6 to 5,000 words each
    • 20 newsgroups (classes)

    20 Newsgroups: academic dataset (unsupervised)
  6. Case study: 20 Newsgroups - Data Transformation

    Source text:
    re albert sabin in article apr nntpd cxo dec com sharpe nmesis enet dec com system privileged account writes in article c ftjt sunfish usd edu rfox charlie usd edu rich fox univ of south dakota writes in article apr rambo atlanta

    After stemming and stop words removal:
    albert sabin articl apr nntpd cxo dec sharp nmesi enet dec system privileg account write articl ftjt sunfish usd edu rfox charli usd edu rich fox univ south dakota write articl apr rambo atlanta …

    After text vectorisation (data format for TDA: 18820 documents × 20000 words):

    doc   account  albert  articl  apr  atlanta  cxo  dec  edu  …
    564         3       0       5    1        0    0    0    0
    565         0       1       2    0        1    0    0    0
    566         0       0       0    2        0    0    1    2
    567         1       0       0    0        0    0    0    0
    568         0       0       2    0        0    2    0    0
    569         2       2       1    4        5    2    3    4
    570         0       2       0    2        2    0    0    2
    571         0       0       0    2        2    0    1    0
    572         1       0       1    0        0    2    0    0
    573         0       2       0    0        0    1    0    2
    574         0       1       0    2        2    0    0    1
    575         3       0       3    1        0    0    5    1
    576         0       0       0    0        0    3    3    2
    577         1       1       1    0        5    3    4    0
    578         0       0       0    3        7    3    5    7
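The transformation above (stop word removal, stemming, then counting terms per document) can be sketched as follows. This is a minimal illustration, not the deck's actual preprocessing: the stop-word list is a tiny hypothetical one, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real pipeline would use a much longer one.
STOP_WORDS = {"re", "in", "of", "com", "c", "the", "a"}
# Suffixes tried in order by the crude stemmer below (an assumption, not Porter's rules).
SUFFIXES = ("ed", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise on letters, drop stop words, stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def vectorise(documents):
    """Return a sorted vocabulary and a documents x terms count matrix."""
    counts = [Counter(preprocess(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Illustrative documents built from words in the sample above.
docs = [
    "system privileged account writes in article",
    "rich fox writes in article",
]
vocabulary, matrix = vectorise(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` corresponds to one document and each column to one stemmed term, which is the same layout as the count table on the slide.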
  7. Case study: 20 Newsgroups - General topology

    [Figure: general topology of the 20 Newsgroups dataset. Labelled regions: rec.motorcycles, misc.forsale, comp.sys.ibm.pc.hardware, talk.politics.misc, sci.space, sci.crypt, sci.electronics, rec.sport.hockey, rec.sport.basketball. Keyword annotations: monitor, computer, not, drive, intern, low, fire, hour.]
  8. Case study: Netflix competition

    A dataset from the open Netflix competition for the best collaborative filtering algorithm to predict user ratings for films:

    • 100,480,507 ratings
    • 480,189 users
    • 17,770 movies
    • 2.1 GB CSV file
  9. Case study: Netflix competition - Data Transformation

    Source data: [100,480,507 : 3], about 300 million elements.
    Data format for TDA: [17,770 : 480,189] (movies by users), about 8.5 billion elements.
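The element counts above follow directly from the shapes, and the pivot itself can be sketched on a few made-up ratings. The sparse dictionary layout below is an assumption for illustration only; the deck does not say how the pivoted matrix was stored.

```python
from collections import defaultdict

# Made-up (user_id, movie_id, rating) triples in the long source format.
ratings = [
    (1, 10, 5),
    (1, 11, 3),
    (2, 10, 4),
    (3, 12, 1),
]


def pivot_sparse(triples):
    """Pivot long-format ratings into {movie: {user: rating}}, storing only known cells."""
    table = defaultdict(dict)
    for user, movie, rating in triples:
        table[movie][user] = rating
    return dict(table)


def dense_cells(n_movies, n_users):
    """Number of cells a dense movies x users matrix would hold."""
    return n_movies * n_users


table = pivot_sparse(ratings)
stored = sum(len(row) for row in table.values())
print(f"stored cells: {stored}, dense cells: {dense_cells(3, 3)}")

# Shapes quoted on the slide.
long_elements = 100_480_507 * 3
pivot_elements = dense_cells(17_770, 480_189)
print(f"long format: {long_elements:,} elements")
print(f"dense pivot: {pivot_elements:,} elements")
```

In the toy example only 4 of the 9 possible cells carry a rating, which is the same effect that inflates the real dataset when every movie and user pair gets a cell.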
  10. Case study: Netflix competition - Data Transformation

    Challenges:

    • Pivoting transforms 300 million data items into 8.5 billion data items, which requires more than 200 GB of RAM.
    • The current TDA algorithm implementation has O(n log n) computational and memory complexity, which makes the pivoted data even harder to process as is.
  11. Case study: Netflix competition - Result comparison

    [Figure panels: Locally Linear Embedding, PCA, Local Tangent Space Alignment, Hessian LLE, Topological Data Analysis, Spectral Embedding.]
  12. Case study: Badoo

    A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.

    • 88,567 users
    • 867 metrics
  13. Case study: Badoo - Data Transformation

    Used aggregated representations of user activities per day:

    • Number of likes
    • Number of dislikes
    • Number of matches
    • Profiles visited
    • Photos uploaded
    • Number of messages sent (no content analysed)
    • Number of message replies
    • Interactions with different app features
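The per-day aggregation listed above can be sketched as a simple count over an event log. Everything concrete here is hypothetical: the log format, the event names and the choice of one column per (day, event type) pair are assumptions for illustration, not a description of how the 867 metrics were actually built.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (user_id, day, event_type). Names are illustrative only.
events = [
    ("u1", "2014-08-01", "like"),
    ("u1", "2014-08-01", "like"),
    ("u1", "2014-08-02", "message_sent"),
    ("u2", "2014-08-01", "dislike"),
    ("u2", "2014-08-01", "profile_visit"),
]


def aggregate(log):
    """Count events per user and per (day, event_type) pair."""
    per_user = defaultdict(Counter)
    for user, day, event_type in log:
        per_user[user][(day, event_type)] += 1
    return per_user


def to_matrix(per_user):
    """Turn the counts into a users x metrics matrix with a shared column order."""
    columns = sorted({key for counts in per_user.values() for key in counts})
    users = sorted(per_user)
    rows = [[per_user[u][c] for c in columns] for u in users]
    return users, columns, rows


users, columns, rows = to_matrix(aggregate(events))
print(columns)
for user, row in zip(users, rows):
    print(user, row)
```

Message content never enters the log in this sketch, only the fact that a message was sent, in line with the "no content analysed" note.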
  14. Case study: Badoo - Results

    • Gender can be identified very well purely from activity
    • Found a group of male users who act very similarly to female users; this needs further analysis
    • Performed a segmentation of users for future analysis with the product team
    • Blocked users were separated very well and can be clearly distinguished by TDA
    • Found several interesting correlations between user actions and success on the app
  15. Links

    Topology And Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf
    Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf
    Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf
    Netflix Diagram (3200x3200): http://datarefiner.com/netflix17770movies.png
    Netflix Diagram with movie titles (17000x17000, 86MB): http://datarefiner.com/netflix17770movies_annotation.png