Topological Data Analysis meets Deep Learning

Theory and usage examples of Topological Data Analysis and Deep Learning for semi-supervised data segmentation

DataRefiner

May 01, 2015

Transcript

  1. The motivation: get answers to questions you didn’t ask yet

    Instead of asking the data specific questions, we can use traditional tools to join different data sources and prepare a holistic dataset. This dataset can be processed automatically using topological data analysis and presented as a map of dependencies and correlations.
  2. A topological invariant is a map f that assigns the same object to homeomorphic spaces, that is: $X \cong Y \implies f(X) = f(Y)$.

    Homology is a machine that converts local data about a space into global algebraic structure.

    Reference: Wikipedia, 2010. Topological invariants
  3. Topological Data Analysis Pipeline

    a. Compute a combinatorial model approximating the structure of the underlying space
    b. Then compute topological invariants of this structure
    c. Represent these topological invariants in 2D space
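Steps (a) and (b) of the pipeline can be illustrated with a minimal sketch. It is not the tooling used in the deck: the combinatorial model here is assumed to be a simple neighbourhood graph (points closer than a hypothetical threshold `eps` are joined), and the invariants are the Betti numbers b0 (connected components) and b1 (independent loops) of that graph. Step (c), the 2D representation, is left out.

```python
import math
from itertools import combinations


def neighbourhood_graph(points, eps):
    """Step (a): join every pair of points whose distance is below eps."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) < eps]


def betti_numbers(n_vertices, edges):
    """Step (b): Betti numbers (b0, b1) of a graph seen as a 1-dimensional complex."""
    parent = list(range(n_vertices))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = n_vertices
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    b0 = components
    b1 = len(edges) - n_vertices + b0  # Euler characteristic: V - E = b0 - b1
    return b0, b1


# Eight points sampled from a circle: one component, one loop.
circle = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
          for k in range(8)]
edges = neighbourhood_graph(circle, eps=1.0)
b0, b1 = betti_numbers(len(circle), edges)
print(b0, b1)  # 1 1
```

With `eps=1.0` only neighbouring samples are joined, so the graph recovers the single loop of the circle. A real pipeline would use higher-dimensional simplices and a range of scales rather than one fixed threshold.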
  4. Morse Theory and Reeb Graph

    Theorem: Suppose $h : X \to \mathbb{R}$ is a discrete Morse function. Then X is homotopy equivalent to a CW-complex with exactly one cell of dimension p for each critical simplex of dimension p.

    Reference: Teng Ma; Zhuangzhi Wu; Pei Luo; Lu Feng. Reeb graph computation through spectral clustering, 2011.
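A standard consequence of the theorem, added here as a worked illustration rather than as part of the original slide: write $m_p$ for the number of critical simplices of dimension $p$ and $b_p$ for the $p$-th Betti number of $X$ (both symbols are notation introduced for this note). Because the Euler characteristic and the Betti numbers are homotopy invariants, they can be read off the smaller CW-complex:

```latex
\chi(X) \;=\; \sum_{p \ge 0} (-1)^p \, m_p ,
\qquad
m_p \;\ge\; b_p \quad \text{for every } p .
```

For example, take the boundary of a triangle (a circle) with a discrete Morse function that has one critical vertex and one critical edge. Then $m_0 = m_1 = 1$, so $\chi = 1 - 1 = 0$, which agrees with $b_0 = b_1 = 1$.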
  5. Case study: Netflix competition

    A dataset from the Netflix open competition for the best collaborative filtering algorithm to predict user ratings for films:
    •  100,480,507 ratings
    •  480,189 users
    •  17,770 movies
    •  2.1 GB of CSV data
  6. Case study: Yelp Dataset Challenge

    Sample of Yelp data from the greater Phoenix, AZ metropolitan area, including:
    •  15,585 businesses
    •  111,561 business attributes
    •  11,434 check-in sets
    •  70,817 users
    •  a social graph with 151,516 edges
    •  113,993 tips
    •  335,022 reviews

    http://www.yelp.com/dataset_challenge
  7. Case study: Yelp Dataset Challenge

    Visualisation: cluster examination

    Cluster characteristics:
    •  Check-ins on Mondays at 0:00
    •  Fridays have very few check-ins
  8. Case study: Badoo

    A subset of user activity in the United States. Aggregated activity metrics over two weeks in August 2014.
    •  88,567 users
    •  867 metrics
  9. Case study: Badoo, Data Transformation

    Used aggregated representations of user activities per day:
    •  Number of likes
    •  Number of dislikes
    •  Number of matches
    •  Profiles visited
    •  Photos uploaded
    •  Number of messages sent (no content analysed)
    •  Number of message replies
    •  Interactions with different app features
  10. Case study: Badoo, Results

    •  Gender can be identified very well purely from activity
    •  Found a group of male users who act very similarly to female users (subject to further analysis)
    •  Performed segmentation of users for potential product features together with the product team
    •  Distilled a group of blocked users
    •  Found several interesting correlations between usage and success on the app
  11. Case study: 20 Newsgroups

    20 Newsgroups academic dataset (semi-supervised)

    The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
    •  18,820 documents
    •  From 6 to 5000 words each
    •  20 newsgroups (classes)
  12. Case study: 20 Newsgroups

    alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc
  13. Case study: 20 Newsgroups, Data Transformation

    Source text:
    re albert sabin in article apr nntpd cxo dec com sharpe nmesis enet dec com system privileged account writes in article c ftjt sunfish usd edu rfox charlie usd edu rich fox univ of south dakota writes in article apr rambo atlanta

    After stemming and stop words removal:
    albert sabin articl apr nntpd cxo dec sharp nmesi enet dec system privileg account write articl ftjt sunfish usd edu rfox charli usd edu rich fox univ south dakota write articl apr rambo atlanta …

    After text vectorisation (data format for TDA: 18820 documents × 1000 words):

         account  albert  articl  apr  atlanta  cxo  dec  edu  …
    564        3       0       5    1        0    0    0    0
    565        0       1       2    0        1    0    0    0
    566        0       0       0    2        0    0    1    2
    567        1       0       0    0        0    0    0    0
    568        0       0       2    0        0    2    0    0
    569        2       2       1    4        5    2    3    4
    570        0       2       0    2        2    0    0    2
    571        0       0       0    2        2    0    1    0
    572        1       0       1    0        0    2    0    0
    573        0       2       0    0        0    1    0    2
    574        0       1       0    2        2    0    0    1
    575        3       0       3    1        0    0    5    1
    576        0       0       0    0        0    3    3    2
    577        1       1       1    0        5    3    4    0
    578        0       0       0    3        7    3    5    7
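The transformation on this slide (stop words removal, stemming, then a document-by-word count matrix) can be sketched as follows. This is an illustration only: the stop-word list is a tiny hypothetical one, and `crude_stem` is a rough stand-in for a real stemmer such as Porter's, not the tool used for the deck.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the deck's list).
STOP_WORDS = {"in", "of", "the", "a", "and", "re", "com"}


def crude_stem(word):
    """Rough stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise, drop stop words and single letters, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens
            if t not in STOP_WORDS and len(t) > 1]


def vectorise(documents, vocab_size):
    """Build a count matrix over the vocab_size most frequent stems."""
    processed = [preprocess(d) for d in documents]
    totals = Counter(t for doc in processed for t in doc)
    vocab = sorted(term for term, _ in totals.most_common(vocab_size))
    rows = []
    for doc in processed:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


docs = [
    "system privileged account writes in article",
    "rich fox writes in article",
]
vocab, matrix = vectorise(docs, vocab_size=1000)
print(vocab)
for row in matrix:
    print(row)
```

Each row of `matrix` corresponds to one document and each column to one stem in `vocab`, which mirrors the layout of the table on the slide (there with 1000 word columns).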
  14. Case study: 20 Newsgroups

    Baseball cluster

    “pitch” > 1.2. This must be baseball.

    speed game margin realist chip ucdavi edu gari built villanova huckabai basebal game and shade hour that damn long don plai hour game watch game for that long butt fall asleep and watch channel surf pitch catch color
  15. Case study: 20 Newsgroups

    Motorcycles cluster

    “bike” > 1.114. This must be motorcycles.

    ride sixteen dai had put test drive honda final saturdai rain fact clear warm and sunni and wind di week ago long cool ride hawk cycl for test ride had sold and deliv demo fifteen hour arriv and demo vfr bike lock showroom surround bike and not like move todai even bike us dirt bike us street bike car and big tent full outlandishli fat tour bike trailer squeez park lot sort fat bike convent shelli and dave run msf each time classroom and back lot usual free cookout distribut severli affect will bike perform such load cling back rest secur shift increas chanc surf collect wisdom request can afford leather pant boot and jean can make you knee protector rollerblad us bean and sell
  16. Denoising Autoencoders

    A power of destruction

    [Figure panels: no destroyed inputs; 25% destruction; 50% destruction; Neuron A (0%, 10%, 20%, 50% destruction); Neuron B (0%, 10%, 20%, 50% destruction)]
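The "destruction" levels refer to corrupting the input before the autoencoder sees it; the network is then trained to reconstruct the clean input. A minimal sketch of masking noise, the corruption described in the denoising autoencoder paper listed under Links, is shown below. The function name `corrupt` and the toy input are illustrative assumptions.

```python
import random


def corrupt(x, destruction, rng):
    """Masking noise: zero each component independently with probability `destruction`."""
    return [0.0 if rng.random() < destruction else v for v in x]


rng = random.Random(0)
x = [1.0] * 1000  # toy input; a real input would be a word-count vector

for level in (0.0, 0.25, 0.5):
    x_tilde = corrupt(x, level, rng)
    zeroed = sum(1 for v in x_tilde if v == 0.0)
    print(f"destruction={level:.2f}: {zeroed / len(x):.1%} of components zeroed")
```

The fraction of zeroed components fluctuates around the requested level because each component is dropped independently.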
  17. Stacked Denoising Autoencoders

    20 Newsgroups network structure (learning)

    x: vector of word counts for each document (1000) → Encoder 1, $f_\theta^{(1)}$ (500) → Encoder 2, $f_\theta^{(2)}$ (500)
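The layer sizes on this slide can be made concrete with a forward pass through two stacked encoders. This is a sketch under assumptions: the sigmoid activation, the random initialisation and the fake input are not taken from the deck, and no training is performed here.

```python
import math
import random


def make_encoder(n_in, n_out, rng):
    """Untrained encoder: random affine map followed by a sigmoid (assumed activation)."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out

    def encode(x):
        return [
            1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
            for row, b in zip(weights, biases)
        ]

    return encode


rng = random.Random(0)
encoder_1 = make_encoder(1000, 500, rng)  # 1000 word counts -> 500 features
encoder_2 = make_encoder(500, 500, rng)   # 500 features -> 500 features

x = [float(rng.randint(0, 3)) for _ in range(1000)]  # fake word-count vector
h1 = encoder_1(x)
h2 = encoder_2(h1)
print(len(x), len(h1), len(h2))  # 1000 500 500
```

In a stacked denoising autoencoder each encoder is usually trained in turn on corrupted versions of the layer below it, and the second encoder takes the first encoder's output as its input, as `h2 = encoder_2(h1)` does here.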
  18. Stacked Denoising Autoencoders

    20 Newsgroups network structure (fine-tuning)

    x: vector of word counts for each document (1000) → Encoder 1, $f_\theta^{(1)}$ (500) → Encoder 2, $f_\theta^{(2)}$ (500)

    [Diagram elements: labels for selected points; topological fine-tuner; fine-tuning weights for Encoder 1; fine-tuning weights for Encoder 2]
  19. Case study: 20 Newsgroups

    Learning high-level features for two selected clusters

    [Figure: weights for autoencoder layer 1, corruption level = 0.2]
  20. Case study: 20 Newsgroups

    Result of learning first two groups

    [Figure labels: Labeled baseball, Unlabeled baseball, Labeled Motorcycles, Unlabeled Motorcycles, Autos, Pc.hardware, Mac.hardware]
  21. Case study: 20 Newsgroups

    Result of learning five groups

    [Figure labels: Mac.hardware, Baseball, Pc.hardware, Autos, Motorcycles, Sci.med, Politics.misc, Politics.mideast, Hockey]
  22. Case study: 20 Newsgroups

    Final result for 2nd layer

    [Figure labels: Motorcycles, Christian, Atheism, Religion.misc, Politics.guns, Politics.misc, Politics.mideast, Sci.crypt, Sci.med, Hockey, Baseball, Autos, Forsale, Mac.hardware, Electronics, Sci.space, Comp.graphics, Windows.x, Ms-windows.misc, Pc.hardware]
  23. Links

    Topology and Data (Gunnar Carlsson): http://www.ams.org/journals/bull/2009-46-02/S0273-0979-09-01249-X/S0273-0979-09-01249-X.pdf

    Discrete Morse Theory and Persistent Homology (Kevin P. Knudson): http://www.math.fsu.edu/~hironaka/FSUUF/knudson.pdf

    Topological Persistence and Simplification (Herbert Edelsbrunner, David Letscher, Afra Zomorodian): http://math.uchicago.edu/~shmuel/AAT-readings/Data%20Analysis%20/PersTop.pdf

    Extracting and Composing Robust Features with Denoising Autoencoders (Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol): http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf