Super-Fast Clustering Report in MapR

1 ©MapR Technologies - Confidential Super-Fast Clustering
Report from MapR workshop

2
§ Contact:
  – [email protected]
  – @ted_dunning
§ Twitter for this talk
  – #mapr_uk
§ Slides and such:
  – http://info.mapr.com/ted-uk-05-2012

3 Company Background
§ MapR provides the industry’s best Hadoop Distribution
  – Combines the best of the Hadoop community contributions with significant internally financed infrastructure development
§ Background of Team
  – Deep management bench with extensive analytic, storage, virtualization, and open source experience
  – Google, EMC, Cisco, VMWare, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
§ Proven
  – MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
  – Strategic OEM relationship with EMC and Cisco
  – Over 1,000 installs

4 We Also Do …
§ Open source development
  – Zookeeper
  – Hadoop
  – Mahout
  – Stuff
§ Partner workshops
  – Machine learning
  – Information architecture
  – Cluster design

6 The Problem
§ A certain bank
  – had lots of customers
  – had lots of prospective customers
  – had a non-trivial number of fraudulent customers
  – had a non-trivial number of fraudulent merchants
§ They also
  – collected data
  – built models
  – collected more data
  – built more models

7 But …
§ These models were arduous to build
§ And hard to test
§ So people suggested something simpler
§ Like k-nearest neighbor

8 What’s that?
§ Find the k nearest training examples
§ Use the average value of the target variable from them
§ This is easy … but hard
  – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
  – hard because of the stunning amount of math
  – also hard because we need top 50,000 results
§ Initial prototype was massively too slow
  – 3K queries x 200K examples takes hours
  – needed 20M x 25M in the same time
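
The k-nearest-neighbor idea on this slide can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the prototype described in the talk: the function names, the Euclidean metric, and the toy data are all assumptions made for the example.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query."""
    # One distance computation per training example; nsmallest keeps the k best.
    nearest = heapq.nsmallest(
        k, range(len(examples)), key=lambda i: euclidean(query, examples[i])
    )
    return sum(targets[i] for i in nearest) / len(nearest)


# Toy data (hypothetical): three examples near the origin, one far away.
examples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
targets = [0.0, 1.0, 1.0, 10.0]
print(knn_predict((0.1, 0.1), examples, targets, k=3))
```

Brute force needs one distance computation per (query, example) pair, so 3K x 200K is 6 x 10^8 computations, while 20M x 25M is 5 x 10^14. That gap is why the prototype could not simply be scaled up.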

9 What We Did
§ Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid
§ Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute
§ Super-fast clustering
  – Kmeans, StreamingKmeans

10 Projection Search
[Figure: scatter plot, X Axis versus Y Axis]
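
The slide itself only shows a plot, so the following is a hedged sketch of one common way a projection-based search can work, not the Mahout ProjectionSearch code. It assumes that points are projected onto a few random directions, that candidates are taken from a window around the query's position in each sorted projection, and that exact distances are computed only for those candidates. The class name and the `num_projections` and `search_size` parameters are hypothetical.

```python
import bisect
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ProjectionSearchSketch:
    """Approximate nearest-neighbor search using random 1-D projections."""

    def __init__(self, points, num_projections=3, search_size=4, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.search_size = search_size
        self.directions = []
        self.tables = []
        for _ in range(num_projections):
            d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(dot(d, d))
            d = [x / norm for x in d]
            self.directions.append(d)
            # Sorted list of (projected value, point index) for this direction.
            self.tables.append(sorted((dot(d, p), i) for i, p in enumerate(points)))

    def search(self, query, k):
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (dot(d, query), -1))
            lo = max(0, pos - self.search_size)
            hi = min(len(table), pos + self.search_size)
            candidates.update(i for _, i in table[lo:hi])
        # Exact distances are computed only for the candidate set.
        return sorted(candidates, key=lambda i: euclidean(query, self.points[i]))[:k]


points = [(float(i), float(i % 3)) for i in range(10)]
searcher = ProjectionSearchSketch(points, search_size=10)
print(searcher.search((2.1, 2.0), k=3))
```

Points that are close in the original space stay close in every projection, so a small window usually contains the true neighbors. With a window as large as the data set, as in the usage above, the search degenerates to an exact brute-force scan.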

11 K-means Search
[Figure: scatter plot, X Axis versus Y Axis]
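
Again the slide is only a plot, so this is an illustrative sketch rather than the KmeansSearch class itself. It assumes the usual cluster-then-probe idea: group the reference points with k-means once, then at query time scan only the clusters whose centers are nearest to the query. The function names and the `probes` parameter are hypothetical, and a plain Lloyd-style k-means stands in for whatever clustering the real searcher uses.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centers):
    """Return, for each center, the indices of the points closest to it."""
    groups = [[] for _ in centers]
    for idx, p in enumerate(points):
        nearest = min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
        groups[nearest].append(idx)
    return groups


def lloyd_kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd-style k-means; returns (centers, groups of point indices)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = assign(points, centers)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous center
                cols = zip(*(points[i] for i in members))
                centers[c] = [sum(col) / len(members) for col in cols]
    return centers, assign(points, centers)


def kmeans_search(query, points, centers, groups, k, probes=2):
    """Scan only the `probes` clusters whose centers are nearest to query."""
    order = sorted(range(len(centers)), key=lambda c: euclidean(query, centers[c]))
    candidates = [i for c in order[:probes] for i in groups[c]]
    return sorted(candidates, key=lambda i: euclidean(query, points[i]))[:k]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, groups = lloyd_kmeans(points, k=2)
print(kmeans_search((10.2, 10.1), points, centers, groups, k=1, probes=2))
```

Probing fewer clusters is faster but can miss neighbors that sit just across a cluster boundary, so `probes` trades speed against accuracy.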

12 But These Require k-means!
§ Need a new k-means algorithm to get speed
§ Streaming k-means is
  – One pass (through the original data)
  – Very fast (20 μs per data point with threads)
  – Very parallelizable

13 How It Works
§ For each point
  – Find approximately nearest centroid (distance = d)
  – If d > threshold, new centroid
  – Else possibly new cluster
  – Else add to nearest centroid
§ If centroids > K ~ C log N
  – Recursively cluster centroids with higher threshold
§ Result is large set of centroids
  – these provide approximation of original distribution
  – we can cluster centroids to get a close approximation of clustering original
  – or we can just use the result directly
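
The steps on this slide can be sketched as follows. This is a minimal single-threaded illustration under stated assumptions, not the StreamingKmeans implementation. The nearest centroid is found by exact brute force rather than approximately. "Possibly new cluster" is interpreted as creating a new centroid with probability d / threshold. Re-clustering the centroids is written as a loop that raises the threshold by a hypothetical `growth` factor. The constant C = 3 in the usage is arbitrary.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def add_point(centroids, point, weight, threshold, rng):
    """Fold one weighted point into centroids, a list of [coordinates, weight]."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    j = min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i][0]))
    d = euclidean(point, centroids[j][0])
    # Far away: always a new centroid. Otherwise: a new centroid with
    # probability d / threshold (an assumption of this sketch).
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        center, w = centroids[j]
        total = w + weight
        merged = [(c * w + p * weight) / total for c, p in zip(center, point)]
        centroids[j] = [merged, total]


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over points; returns (centroids, final threshold)."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        add_point(centroids, point, 1.0, threshold, rng)
        # Too many centroids: cluster the centroids themselves with a
        # higher threshold until the count is back under the limit.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for center, w in old:
                add_point(centroids, center, w, threshold, rng)
    return centroids, threshold


data_rng = random.Random(1)
data = [
    (data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
    for cx, cy in [(0.0, 0.0), (5.0, 5.0)]
    for _ in range(200)
]
max_centroids = int(3 * math.log(len(data)))  # K ~ C log N with C = 3
centroids, final_threshold = streaming_kmeans(data, max_centroids, threshold=0.05)
print(len(centroids), sum(w for _, w in centroids))
```

Each centroid carries the total weight of the points merged into it, so the resulting weighted centroids can be handed to an ordinary k-means run or used directly as a summary of the data.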

14 Parallel Speedup?
[Figure: time per point (μs) versus number of threads, with series for the threaded version, the non-threaded version, and perfect scaling]

16 Warning, Recursive Descent
§ Inner loop requires finding nearest centroid
§ With lots of centroids, this is slow
§ But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)

17
§ Contact:
  – [email protected]
  – @ted_dunning
§ Slides and such:
  – http://info.mapr.com/ted-uk-05-2012

Data Science London

18 Thank You