Topological Data Analysis - Visualisation and analysis of complex datasets

Topological Data Analysis
Visualisation and analysis of complex datasets
Edward Kibardin

Data Map

The motivation
Instead of asking data-specific questions, we can use traditional tools to join different data sources and prepare a holistic dataset.
This dataset can be processed automatically using topological data analysis and presented as a map of dependencies and correlations.
= Get answers to questions you didn’t ask yet

Topological invariants
A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is: X ≅ Y ⇒ f(X) = f(Y)
Homology is a machine that converts local data about a space into global algebraic structure.
Reference: Wikipedia, 2010.

Topology Data Analysis Pipeline
a.  Compute a combinatorial model approximating the structure of the underlying space
b.  Then compute topological invariants of this structure
c.  Represent these topological invariants in 2D space
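Steps a and b of the pipeline can be illustrated on a toy point cloud. The sketch below is only an illustration, not the implementation behind the talk: it assumes a Vietoris-Rips complex as the combinatorial model (a close relative of the Čech complex), builds simplices only up to triangles, and uses the Euler characteristic as the invariant. The function names and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def rips_complex(points, epsilon):
    """Return vertices, edges and triangles of a Vietoris-Rips complex.

    A simplex is included when all of its vertices are pairwise within
    `epsilon` of each other. Only simplices up to dimension 2 are built.
    """
    vertices = list(range(len(points)))
    edges = [(i, j) for i, j in combinations(vertices, 2)
             if dist(points[i], points[j]) <= epsilon]
    edge_set = set(edges)
    triangles = [(i, j, k) for i, j, k in combinations(vertices, 3)
                 if (i, j) in edge_set and (i, k) in edge_set and (j, k) in edge_set]
    return vertices, edges, triangles


def euler_characteristic(vertices, edges, triangles):
    """Alternating count of simplices: V - E + F."""
    return len(vertices) - len(edges) + len(triangles)


# Hypothetical sample: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
v, e, t = rips_complex(square, epsilon=1.0)
chi = euler_characteristic(v, e, t)
print(len(v), len(e), len(t), chi)
```

With epsilon = 1.0 only the four sides of the square become edges and no triangle is filled in, so the complex is a loop and its Euler characteristic is 0, the same as a circle.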

Combinatorial Representations
The Čech Complex

Barcodes
Reference: Robert Adler, TOPOS: Applied topologists do it with persistence
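A barcode records the scale at which each topological feature appears and the scale at which it disappears. As a minimal sketch, the code below computes only the 0-dimensional bars (connected components) of a point cloud. It assumes that an edge enters the complex at a scale equal to the distance between its endpoints; the function name and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist, inf


def zero_dim_barcode(points):
    """Return (birth, death) bars for connected components.

    Every point is born at scale 0. Edges are added in order of length;
    each edge that merges two components ends one bar at that length.
    The last surviving component never dies (death = inf).
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, length))
    if points:
        bars.append((0.0, inf))
    return bars


# Hypothetical sample: two tight pairs of points that are far apart.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = zero_dim_barcode(points)
print(bars)
```

The two short bars end at scale 1.0, where each pair merges; the long bar ending at 9.0 shows that the two clusters persist as separate components over a wide range of scales.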

Morse Theory and Reeb Graph
Theorem: Suppose h : X → ℝ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.
Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.

Case study: 20 Newsgroups
20 Newsgroups academic dataset (unsupervised)
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
•  18,820 documents
•  From 6 to 5000 words each
•  20 newsgroups (classes)

Case study: 20 Newsgroups
Data Transformation
Source text:
re albert sabin in article apr nntpd cxo dec com sharpe nmesis enet dec com system privileged account writes in article c ftjt sunfish usd edu rfox charlie usd edu rich fox univ of south dakota writes in article apr rambo atlanta
After stemming and stop words removal:
albert sabin articl apr nntpd cxo dec sharp nmesi enet dec system privileg account write articl ftjt sunfish usd edu rfox charli usd edu rich fox univ south dakota write articl apr rambo atlanta …
After text vectorisation, data format for TDA (18820 documents × 20000 words):
     …  account  albert  articl  apr  atlanta  cxo  dec  edu  …
564           3       0       5    1        0    0    0    0
565           0       1       2    0        1    0    0    0
566           0       0       0    2        0    0    1    2
567           1       0       0    0        0    0    0    0
568           0       0       2    0        0    2    0    0
569           2       2       1    4        5    2    3    4
570           0       2       0    2        2    0    0    2
571           0       0       0    2        2    0    1    0
572           1       0       1    0        0    2    0    0
573           0       2       0    0        0    1    0    2
574           0       1       0    2        2    0    0    1
575           3       0       3    1        0    0    5    1
576           0       0       0    0        0    3    3    2
577           1       1       1    0        5    3    4    0
578           0       0       0    3        7    3    5    7
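The transformation from raw text to a document-by-word count matrix can be sketched in a few lines. This is an illustration under stated assumptions, not the preprocessing used in the talk: the stop-word list is a tiny hypothetical one, stemming is left out for brevity, and the two sample documents are fragments of the source text shown on the slide.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the talk).
STOP_WORDS = {"in", "of", "re", "the", "a", "to", "and"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def vectorise(documents):
    """Return (vocabulary, count matrix) with one row per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


docs = [
    "Re: Albert Sabin in article",
    "System privileged account writes in article",
]
vocabulary, matrix = vectorise(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` plays the role of one numbered row in the slide's table, and each column corresponds to one vocabulary word.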

Case study: 20 Newsgroups
Topology for different values of the epsilon parameter

Case study: 20 Newsgroups
General topology
[Figure: topology map with regions labelled rec.motorcycles, misc.forsale, comp.sys.ibm.pc.hardware, talk.politics.misc, sci.space, sci.crypt, sci.electronics, rec.sport.hockey and rec.sport.basketball; keyword annotations include monitor, computer, not, drive, intern, low, fire and hour]

Case study: 20 Newsgroups
Highly detailed topology

Case study: Netflix competition
A dataset from the Netflix open competition for the best collaborative filtering algorithm to predict user ratings for films:
•  100,480,507 ratings
•  480,189 users
•  17,770 movies
•  2.1 GB CSV file

Case study: Netflix competition
Data Transformation
Source data: [100,480,507:3], 300 million elements
Data format for TDA (movies by users): [17,770:480,189], 8.5 billion elements

Case study: Netflix competition
Data Transformation
Challenges:
•  During pivoting we transform 300 million data items into 8.5 billion data items, which requires more than 200 GB of RAM
•  The current TDA algorithm implementation has O(n log(n)) computational and memory complexity, which makes the pivoted dataset even harder to compute as is
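The element counts quoted for the pivot follow directly from the dataset sizes, and they also show how sparse the pivoted matrix is. The sketch below reproduces the arithmetic and shows one possible sparse pivot that stores only the ratings that exist. The sparse layout is an assumption made for illustration, not the approach described in the talk, and `pivot_sparse` and the sample triples are hypothetical.

```python
from collections import defaultdict

N_RATINGS = 100_480_507
N_USERS = 480_189
N_MOVIES = 17_770

long_elements = N_RATINGS * 3         # (user, movie, rating) triples
dense_elements = N_MOVIES * N_USERS   # one cell per (movie, user) pair
density = N_RATINGS / dense_elements  # fraction of cells that hold a rating


def pivot_sparse(ratings):
    """Pivot (user, movie, rating) triples into {movie: {user: rating}}."""
    table = defaultdict(dict)
    for user, movie, rating in ratings:
        table[movie][user] = rating
    return dict(table)


# Hypothetical sample triples, only to show the shape of the result.
sample = [(1, "m1", 5), (2, "m1", 3), (1, "m2", 4)]
pivoted = pivot_sparse(sample)

print(f"{long_elements:,} elements in long format")
print(f"{dense_elements:,} cells in the dense pivot")
print(f"{density:.2%} of the cells contain a rating")
```

Only a little over 1% of the dense cells hold a rating, which is why a dense pivot inflates 300 million elements into 8.5 billion.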

Case study: Netflix competition

Case study: Netflix competition
Horror Movies

Case study: Netflix competition
Science Fiction / Fantasy Series

Case study: Netflix competition
Concerts

Case study: Netflix competition
Result comparison
[Figure panels: Locally Linear Embedding, PCA, Local Tangent Space Alignment, Hessian LLE, Topological Data Analysis, Spectral Embedding]

Case study: Badoo
A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.
•  88,567 users
•  867 metrics

Case study: Badoo
Data Transformation
Used aggregated representations of user activities per day:
•  Number of likes
•  Number of dislikes
•  Number of matches
•  Profiles visited
•  Photos uploaded
•  Number of messages sent (no content analysed)
•  Number of message replies
•  Interactions with different app features

Case study: Badoo
Messages sent / received

Case study: Badoo
Results
•  Gender can be identified very well purely from activity
•  Found a group of male users who act very similarly to female users; this needs further analysis
•  Performed a segmentation of users for future analysis with the product team
•  Blocked users were separated very well by TDA
•  Found several interesting correlations between user actions and success on the app

Case study: Badoo
DataRefiner tool

Links
Topology And Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf
Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf
Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf
Netflix Diagram (3200x3200): http://datarefiner.com/netflix17770movies.png
Netflix Diagram with movie titles (17000x17000, 86MB): http://datarefiner.com/netflix17770movies_annotation.png

Please sign up for free beta access:
www.datarefiner.com
info@datarefiner.com
