
Approximate Methods for Scalable Data Mining

Andrew Clegg, Data Analytics & Visualization Team
Pearson Technology talk at Data Science London 24/04/13

Transcript

  1. Approximate methods for scalable data mining
     Andrew Clegg, Data Analytics & Visualization Team, Pearson Technology
     Twitter: @andrew_clegg

  2. What are approximate methods? Trading accuracy for scalability.
     • Often use probabilistic data structures, a.k.a. sketches or signatures
     • Mostly stream-friendly: they allow you to query data you haven't even kept!
     • Generally simple to parallelize
     • Predictable error rate (which can be tuned)

  3. What are approximate methods? Trading accuracy for scalability.
     • Represent characteristics or a summary of the data
     • Use much less space than the full dataset (generally via hashing tricks)
       – Can alleviate disk, memory and network bottlenecks
     • Generally incur more CPU load than exact methods
       – This may not be true overall in a distributed system ([de]serialization, for example)
       – Many data-centric systems have CPU to spare anyway

  4. Why approximate methods? A real-life example: counting unique terms per
     time bucket across ElasticSearch shards. Each cluster node computes the
     unique terms per bucket for its shard, the master node merges these into
     the globally unique terms per bucket, and the client receives the number
     of globally unique terms per bucket. (Icons from Dropline Neu!
     http://findicons.com/pack/1714/dropline_neu)

  5. Why approximate methods? A real-life example, continued. But what if each
     bucket contains a LOT of terms, and what if there are too many to fit in
     memory? Every stage has a cost: memory to hold the term sets, CPU to
     serialize them, network transfer, CPU to deserialize, and CPU & memory to
     merge and count the sets.

  6. Cardinality estimation: approximate distinct counts. Intuitive
     explanation: long runs of trailing 0s in random bit strings are rare, but
     the more bit strings you look at, the more likely you are to see a long
     one. So "longest run of trailing 0s seen" can be used as an estimator of
     "number of unique bit strings seen". Example bit strings:
     01110001 11101010 00100101 11001100 11110100 11101100 00010100 00000001
     00000010 10001110 01110100 01101010 01111111 00100010 00110000 00001010
     01000100 01111010 01011101 00000100

  7. Cardinality estimation: probabilistic counting, the basic algorithm.
     Counting the items:
     • Let n = 0
     • For each input item:
       – Hash the item into a bit string
       – Count the trailing zeroes in that bit string
       – If this count > n: let n = count
     Calculating the count:
     • n = longest run of trailing 0s seen
     • Estimated cardinality ("count distinct") = 2^n
     ... and that's it! This is an estimate, but not a great one. But...

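The basic algorithm above is short enough to sketch in full. Here is an illustrative Python version (not from the slides; the choice of MD5 as the hash and the function names are mine):

```python
import hashlib

def trailing_zeros(n: int) -> int:
    """Count trailing zero bits in a positive integer."""
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def estimate_cardinality(items) -> int:
    """Probabilistic counting: estimate distinct count as
    2 ** (max trailing zeros seen across all hashed items)."""
    max_zeros = 0
    for item in items:
        # Hash each item into a (pseudo-)random bit string
        bits = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_zeros = max(max_zeros, trailing_zeros(bits))
    return 2 ** max_zeros
```

Because the estimate is always a power of two, and a single lucky hash can double it, the error is large; the next slide's HyperLogLog improves on this by averaging many such estimators.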
  8. The HyperLogLog algorithm: billions of distinct values in 1.5KB of RAM
     with 2% relative error. Cool properties:
     • Stream-friendly: no need to keep the data
     • Error rates are predictable and tunable
     • Size and speed stay constant
     • Trivial to parallelize: combine two HLL counters by taking the max of
       each register
     (Image: http://www.aggregateknowledge.com/science/blog/hll.html)

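To show what those registers and the max-merge look like, here is a toy HyperLogLog in Python. This is a simplified sketch under my own assumptions (MD5 hash, p = 10), not the tuned algorithm from the paper: the small-range and large-range bias corrections are omitted.

```python
import hashlib

class HLL:
    """Toy HyperLogLog: 2**p registers, each holding the max 'rank' seen.
    Omits the small/large-range corrections of the real algorithm."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def _rank(self, bits):
        # Position of the lowest 1-bit: geometrically distributed,
        # i.e. trailing zeros + 1
        rank = 1
        while bits & 1 == 0 and rank < 64:
            bits >>= 1
            rank += 1
        return rank

    def add(self, item):
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)  # low p bits choose a register
        self.registers[idx] = max(self.registers[idx], self._rank(h >> self.p))

    def merge(self, other):
        # Combining two HLL counters = taking the max of each register
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for large m
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
```

Note that `merge` is exactly the parallelization trick from the slide: each shard can build its own counter, and the maxes combine losslessly.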
  9. Resources on cardinality estimation:
     • HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
     • Java implementations: https://github.com/clearspring/stream-lib/
     • Algebird, which implements HyperLogLog (and much more!) in Scalding: https://github.com/twitter/algebird
     • Simmer, which wraps Algebird in a Hadoop Streaming command line: https://github.com/avibryant/simmer
     • Our ElasticSearch plugin: https://github.com/ptdavteam/elasticsearch-approx-plugin
     • MetaMarkets blog: http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
     • Aggregate Knowledge blog, including a JavaScript implementation and D3 visualization: http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

  10. Bloom filters: a set membership test with a chance of false positives.
      Hash each item n times ⇒ n indices into a bit field, and set those bits.
      To test an item w, check its n bits: at least one 0 means w definitely
      isn't in the set; all 1s mean w probably is in the set.
      (Image: http://en.wikipedia.org/wiki/Bloom_filter)

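A minimal Bloom filter along the lines of the diagram might look like this in Python (illustrative only; salting a single hash function stands in for n independent hash functions, and the sizes are arbitrary):

```python
import hashlib

class BloomFilter:
    """Simple Bloom filter: num_hashes hash functions indexing a bit array."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _indices(self, item):
        # Derive the k indices by salting one hash with the hash number
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # Any 0 => definitely not in set; all 1s => probably in set
        return all(self.bits[idx] for idx in self._indices(item))
```

False positives happen when an unseen item's bits were all set by other items; false negatives never happen.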
  11. Count-min sketch: frequency histogram estimation with a chance of
      over-counting. Each item is hashed by h1, h2, h3 into counter arrays A1,
      A2, A3, incrementing one counter per array; here "foo" and "bar" each
      add +1 to separate counters in A1 and A2, but collide in A3, which
      reaches +2. Collisions can only inflate counters, so the estimate is the
      minimum across arrays: count("foo") = min(1, 1, 2) = 1. More hashes /
      arrays ⇒ a reduced chance of overcounting.

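To make the update and query rules concrete, here is an illustrative count-min sketch in Python (the width/depth values and the salted-MD5 hashing are my own choices):

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: depth arrays of width counters;
    the frequency estimate is the minimum over the arrays."""

    def __init__(self, width=1000, depth=3):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row plays the role of h1, h2, h3...
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess
        return min(self.tables[row][self._index(item, row)]
                   for row in range(self.depth))
```

The estimate can overshoot the true count (if the item collides in every row) but can never undershoot it.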
  12. Random hyperplanes: locality-sensitive hashing for approximate nearest
      neighbours. Each hash bit records which side of a random hyperplane (h1,
      h2, h3) an item falls on, e.g. Hash(Item1) = 011, Hash(Item2) = 001. As
      the cosine distance between two items (the angle θ) decreases, the
      probability of a hash match increases: the bitwise Hamming distance
      between signatures correlates with the cosine distance.

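One way to sketch the random-hyperplane trick in Python (illustrative; drawing each hyperplane's normal from a Gaussian, and the function names, are my own assumptions):

```python
import random

def random_hyperplanes(dim, num_bits, seed=42):
    """One random hyperplane (normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on,
    i.e. the sign of the dot product with the plane's normal."""
    return tuple(1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def hamming(sig1, sig2):
    """Bitwise Hamming distance between two signatures."""
    return sum(a != b for a, b in zip(sig1, sig2))
```

Vectors pointing the same way (cosine distance 0) get identical signatures, and the expected Hamming distance grows with the angle between two vectors, which is what makes the signatures usable for approximate nearest-neighbour search.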
  13. Feature hashing: high-dimensional machine learning without a feature
      dictionary. Instead of looking up tokens ("reduce", "the", "size", "of",
      "your", "feature", "vector", "with", "this", "one", "weird", "old",
      "trick") in a dictionary, hash each one straight to an index in the
      feature vector, e.g. h("reduce") = 9, h("the") = 3, h("size") = 1, and
      add +1 at that position. The effect of collisions on overall
      classification accuracy is surprisingly small! Multiple hashes, or a
      1-bit "sign hash", can reduce collision effects if necessary.

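The hashing trick above, including the 1-bit sign hash, can be sketched as follows (illustrative Python; the 16-dimensional vector and the particular bit used for the sign are my own choices):

```python
import hashlib

def hash_features(tokens, dim=16):
    """Map tokens straight to vector indices via hashing; a 1-bit 'sign hash'
    decides +1/-1 so collisions tend to cancel out rather than accumulate."""
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % dim                        # which feature slot
        sign = 1.0 if (h >> 64) & 1 else -1.0  # 1-bit sign hash
        vec[idx] += sign
    return vec
```

No dictionary is stored anywhere: the vector dimension is fixed up front, and any token, seen before or not, maps to a slot.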
  14. Thanks for listening. And some further reading: a great ebook is
      available free from http://infolab.stanford.edu/~ullman/mmds.html