Slide 1

Slide 1 text

Enhancing Topological Data Analysis with semi-supervised Deep Learning
Edward Kibardin

Slide 2

Slide 2 text

Data Map

Slide 3

Slide 3 text

The motivation

Slide 4

Slide 4 text

The motivation
Instead of asking data-specific questions, we can use traditional tools to join different data sources and prepare a holistic dataset.
+ This dataset can be automatically processed using topological data analysis and presented as a map of dependencies and correlations.
= Get answers to questions you haven't asked yet.

Slide 5

Slide 5 text

A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is: $X \cong Y \implies f(X) = f(Y)$.
Homology is a machine that converts local data about a space into global algebraic structure.
Reference: Wikipedia, 2010. Topological invariants
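As a small illustration of an invariant (not taken from the talk), the Euler characteristic, the alternating sum of the number of simplices in each dimension, takes the same value on any two triangulations of homeomorphic spaces. The helper names `closure` and `euler_characteristic` and the three example complexes are assumptions made for this sketch.

```python
from itertools import combinations

def closure(maximal_simplices):
    """Return all non-empty faces (as sorted tuples) of the given simplices."""
    faces = set()
    for simplex in maximal_simplices:
        vertices = sorted(simplex)
        for k in range(1, len(vertices) + 1):
            faces.update(combinations(vertices, k))
    return faces

def euler_characteristic(maximal_simplices):
    """Alternating sum of simplex counts: vertices - edges + triangles - ..."""
    return sum((-1) ** (len(face) - 1) for face in closure(maximal_simplices))

# Two different triangulations of a circle: a hollow triangle and a hollow square.
triangle = [(0, 1), (1, 2), (0, 2)]
square = [(0, 1), (1, 2), (2, 3), (0, 3)]
# A filled triangle is a disc, which is not homeomorphic to a circle.
disc = [(0, 1, 2)]

chi_triangle = euler_characteristic(triangle)
chi_square = euler_characteristic(square)
chi_disc = euler_characteristic(disc)
print(chi_triangle, chi_square, chi_disc)
```

The two circles agree on the invariant while the disc differs, which is the sense in which an invariant can tell spaces apart but never separates homeomorphic ones.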

Slide 6

Slide 6 text

Topological Data Analysis Pipeline
a. Compute a combinatorial model approximating the structure of the underlying space
b. Then compute topological invariants of this structure
c. Represent these topological invariants in 2D space
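Steps a and b can be sketched in a few lines. The talk does not say which combinatorial model or which invariant the tool uses, so this sketch assumes the simplest choices: an epsilon-neighbourhood graph as the model and the number of connected components (the 0th Betti number) as the invariant. The function names and the sample points are hypothetical.

```python
import math

def epsilon_graph(points, eps):
    """Step a (assumed model): connect every pair of points at distance <= eps."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                edges.append((i, j))
    return edges

def count_components(n, edges):
    """Step b (assumed invariant): number of connected components, via union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = n
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    return components

# Hypothetical data: two well-separated groups of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
for eps in (0.5, 1.5, 9.0):
    print(eps, count_components(len(points), epsilon_graph(points, eps)))
```

The component count depends on the scale: a very small eps leaves every point isolated, a very large one merges everything, and intermediate values expose the two groups.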

Slide 7

Slide 7 text

Barcodes
Reference: Robert Adler, TOPOS: Applied topologists do it with persistence
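A barcode records, for each topological feature, the scale at which it appears and the scale at which it disappears. The sketch below computes only the simplest case, the 0-dimensional barcode (connected components) of a point cloud, by adding edges in order of length. It is an illustrative sketch rather than the method used in the talk; the function name `h0_barcode` is hypothetical, and the death value is taken to be the full edge length (some conventions use half of it).

```python
import math

def h0_barcode(points):
    """Return (birth, death) bars for connected components as the scale grows."""
    n = len(points)
    # All pairwise edges, sorted by length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, length))  # two components merge: one bar ends here
    if n:
        bars.append((0.0, math.inf))  # the last surviving component never dies
    return bars

# Hypothetical data: two well-separated groups of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
bars = h0_barcode(points)
for birth, death in bars:
    print(birth, death)
```

Short bars correspond to points that merge quickly within a group; the one long finite bar reflects the gap between the two groups, which is the kind of persistent feature a barcode is meant to highlight.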

Slide 8

Slide 8 text

Morse Theory and Reeb Graph
Theorem: Suppose $h : X \to \mathbb{R}$ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.
Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.

Slide 9

Slide 9 text

Case study: Badoo
A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.
• 88,567 users
• 867 metrics

Slide 10

Slide 10 text

Case study: Badoo
Data Transformation
Used aggregated representations of user activities per day:
• Number of likes
• Number of dislikes
• Number of matches
• Profiles visited
• Photos uploaded
• Number of messages sent (no content analysed)
• Number of message replies
• Interactions with different app features

Slide 11

Slide 11 text

Case study: Badoo

Slide 12

Slide 12 text

Case study: Badoo Messages sent / received

Slide 13

Slide 13 text

Case study: Badoo
Results
• Gender can be identified very well purely from activity
• Found a group of male users who behave very similarly to female users; subject to further analysis
• Performed segmentation of users with the product team, for potential product features
• Distilled a group of blocked users
• Found several interesting correlations between usage and success on the app

Slide 14

Slide 14 text

Case study: 20 Newsgroups
20 Newsgroups academic dataset (semi-supervised)
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
• 18,820 documents
• From 6 to 5,000 words each
• 20 newsgroups (classes)

Slide 15

Slide 15 text

Case study: 20 Newsgroups
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

Slide 16

Slide 16 text

Case study: 20 Newsgroups
Data Transformation
Source text:
re albert sabin in article apr nntpd cxo dec com sharpe nmesis enet dec com system privileged account writes in article c ftjt sunfish usd edu rfox charlie usd edu rich fox univ of south dakota writes in article apr rambo atlanta
Stemming and stop words removal:
albert sabin articl apr nntpd cxo dec sharp nmesi enet dec system privileg account write articl ftjt sunfish usd edu rfox charli usd edu rich fox univ south dakota write articl apr rambo atlanta …
Text vectorisation:
Data format for TDA (18820 documents × 1000 words)
     account  albert  articl  apr  atlanta  cxo  dec  edu  …
564        3       0       5    1        0    0    0    0
565        0       1       2    0        1    0    0    0
566        0       0       0    2        0    0    1    2
567        1       0       0    0        0    0    0    0
568        0       0       2    0        0    2    0    0
569        2       2       1    4        5    2    3    4
570        0       2       0    2        2    0    0    2
571        0       0       0    2        2    0    1    0
572        1       0       1    0        0    2    0    0
573        0       2       0    0        0    1    0    2
574        0       1       0    2        2    0    0    1
575        3       0       3    1        0    0    5    1
576        0       0       0    0        0    3    3    2
577        1       1       1    0        5    3    4    0
578        0       0       0    3        7    3    5    7
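The transformation on this slide (stop-word removal, stemming, then a document-by-term count matrix) can be sketched as follows. This is a minimal illustration, not the talk's pipeline: the stop-word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (the stems on the slide look like the output of a Porter-style stemmer, which this does not reproduce). The `vocabulary_size=1000` argument mirrors the 1000 words mentioned on the slide.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"in", "of", "the", "a", "re", "com", "c"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

def vectorize(documents, vocabulary_size):
    """Build a document-by-term count matrix over the most frequent terms."""
    tokenized = [tokenize(doc) for doc in documents]
    totals = Counter(term for doc in tokenized for term in doc)
    vocabulary = sorted(term for term, _ in totals.most_common(vocabulary_size))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Two short hypothetical documents.
docs = [
    "In article rich fox writes",
    "System privileged account writes in article",
]
vocabulary, matrix = vectorize(docs, vocabulary_size=1000)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` plays the role of one numbered row in the table on the slide: one document, one count per vocabulary term.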

Slide 17

Slide 17 text

Topology for different epsilon parameter Case study: 20 Newsgroups

Slide 18

Slide 18 text

Detailed topology Case study: 20 Newsgroups

Slide 19

Slide 19 text

Detailed topology (user group overlay) Case study: 20 Newsgroups

Slide 20

Slide 20 text

Selecting first two clusters Case study: 20 Newsgroups

Slide 21

Slide 21 text

Baseball cluster
Case study: 20 Newsgroups
“pitch” > 1.2
This must be baseball
speed game margin realist chip ucdavi edu gari built villanova huckabai basebal game and shade hour that damn long don plai hour game watch game for that long butt fall asleep and watch channel surf pitch catch color

Slide 22

Slide 22 text

Motorcycles cluster
Case study: 20 Newsgroups
“bike” > 1.114
This must be motorcycles
ride sixteen dai had put test drive honda final saturdai rain fact clear warm and sunni and wind di week ago long cool ride hawk cycl for test ride had sold and deliv demo fifteen hour arriv and demo vfr bike lock showroom surround bike and not like move todai even bike us dirt bike us street bike car and big tent full outlandishli fat tour bike trailer squeez park lot sort fat bike convent shelli and dave run msf each time classroom and back lot usual free cookout distribut severli affect will bike perform such load cling back rest secur shift increas chanc surf collect wisdom request can afford leather pant boot and jean can make you knee protector rollerblad us bean and sell
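The two cluster rules (“pitch” > 1.2 and “bike” > 1.114) can be read as simple threshold tests that turn a topological cluster into seed labels. The sketch below shows that reading; how the talk scales its feature values is not stated, so the per-document values here are hypothetical, as are the names `RULES` and `seed_label`.

```python
# Rules taken from the two cluster slides: (term, threshold, label).
RULES = [
    ("pitch", 1.2, "baseball"),
    ("bike", 1.114, "motorcycles"),
]

def seed_label(features):
    """Return the label of the first rule the document satisfies, else None."""
    for term, threshold, label in RULES:
        if features.get(term, 0.0) > threshold:
            return label
    return None

# Hypothetical per-document feature values (the scaling is an assumption).
documents = {
    "doc_a": {"pitch": 2.4, "game": 1.1},
    "doc_b": {"bike": 3.0, "ride": 0.9},
    "doc_c": {"pitch": 0.3, "bike": 0.2},
}
labels = {name: seed_label(feats) for name, feats in documents.items()}
print(labels)
```

Documents that match no rule stay unlabeled, which is what makes the subsequent learning step semi-supervised.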

Slide 23

Slide 23 text

Denoising Autoencoders
High-level feature extraction

Slide 24

Slide 24 text

Denoising Autoencoders
A power of destruction
[Figure panels: no destroyed inputs, 25% destruction, 50% destruction; Neuron A (0%, 10%, 20%, 50% destruction); Neuron B (0%, 10%, 20%, 50% destruction)]
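The "destruction" levels refer to corrupting the input before the autoencoder tries to reconstruct the clean version. A minimal sketch of one common corruption, masking noise, is shown below; it assumes each component is zeroed independently with the given probability, which may differ in detail from what the talk used. The function name `corrupt` is hypothetical.

```python
import random

def corrupt(x, destruction, rng):
    """Masking noise: zero each input component with probability `destruction`."""
    return [0.0 if rng.random() < destruction else value for value in x]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x = [1.0] * 1000        # hypothetical input vector
for level in (0.0, 0.25, 0.5):
    corrupted = corrupt(x, level, rng)
    zeroed = sum(1 for value in corrupted if value == 0.0)
    print(f"destruction={level:.2f}: {zeroed} of {len(x)} inputs zeroed")
```

During training the network receives `corrupted` as input but is scored against the original `x`, so it cannot succeed by copying its input and has to learn structure that survives the noise.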

Slide 25

Slide 25 text

Denoising Autoencoders
Example of learning weights for MNIST dataset

Slide 26

Slide 26 text

Stacked Denoising Autoencoders
High-level feature extraction

Slide 27

Slide 27 text

Stacked Denoising Autoencoders
20 Newsgroups network structure (learning)
x: vector of word counts for each document (1000) → Encoder 1, f_θ^(1) (500) → Encoder 2, f_θ^(2) (500)
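The layer sizes on this slide (1000, 500, 500) can be made concrete with a forward pass through two stacked encoders. The sketch below only shows the shapes: the weights are random and untrained, the sigmoid activation is an assumption (the slide does not name one), and `init_layer` and `encode` are hypothetical helper names. In layer-wise pretraining each encoder would first be trained as a denoising autoencoder on the output of the layer below.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one encoder layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def encode(layer, x):
    """One encoder f_theta(x) = sigmoid(W x + b); the sigmoid is an assumption."""
    weights, biases = layer
    return [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
        for row, b in zip(weights, biases)
    ]

rng = random.Random(0)
encoder_1 = init_layer(1000, 500, rng)  # f_theta^(1): 1000 -> 500
encoder_2 = init_layer(500, 500, rng)   # f_theta^(2): 500 -> 500

x = [float(rng.randint(0, 3)) for _ in range(1000)]  # hypothetical word counts
h1 = encode(encoder_1, x)   # first-level features
h2 = encode(encoder_2, h1)  # second-level features
print(len(x), len(h1), len(h2))
```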

Slide 28

Slide 28 text

Stacked Denoising Autoencoders
20 Newsgroups network structure (fine-tuning)
x: vector of word counts for each document (1000) → Encoder 1, f_θ^(1) (500) → Encoder 2, f_θ^(2) (500) → Topological fine-tuner
Labels for selected points feed the topological fine-tuner, with fine-tuning weights passed back to Encoder 1 and Encoder 2.

Slide 29

Slide 29 text

Learning high-level features for the two selected clusters
Case study: 20 Newsgroups
Weights for autoencoder layer 1, corruption level = 0.2

Slide 30

Slide 30 text

Result of learning first two groups
Case study: 20 Newsgroups
Figure labels: Labeled baseball, Unlabeled baseball, Labeled Motorcycles, Unlabeled Motorcycles, Autos, Pc.hardware, Mac.hardware

Slide 31

Slide 31 text

Result of learning five groups
Case study: 20 Newsgroups
Figure labels: Mac.hardware, Baseball, Pc.hardware, Autos, Motorcycles, Sci.med, Politics.misc, Politics.mideast, Hockey

Slide 32

Slide 32 text

Final result for 2nd layer
Case study: 20 Newsgroups
Figure labels: Motorcycles, Christian, Atheism, Religion.misc, Politics.guns, Politics.misc, Politics.mideast, Sci.crypt, Sci.med, Hockey, Baseball, Autos, Forsale, Mac.hardware, Electronics, Sci.space, Comp.graphics, Windows.x, Ms-windows.misc, Pc.hardware

Slide 33

Slide 33 text

Links
• Topology And Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf
• Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf
• Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf
• Extracting and Composing Robust Features with Denoising Autoencoders (Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol): http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf

Slide 34

Slide 34 text

Please sign up for free beta access:
www.datarefiner.com
info@datarefiner.com