Nubank Machine Learning Meetup

Juan Lopes

May 24, 2017

Transcript

  1. JUAN LOPES
     R&D software engineer @ INTELIE
     Ph.D. student @ COPPE/UFRJ
     twitter.com/juanplopes
     github.com/juanplopes
  2. STREAM PROCESSING, ALSO KNOWN AS
     • REACTIVE PROGRAMMING: in the context of programming paradigms;
     • DATA STREAM MINING: in the context of machine learning;
     • DATA FLOW PROGRAMMING: in the context of programming languages;
     • EVENT STREAM PROCESSING: in the context of systems monitoring and data analysis.
  3. STREAM PROCESSING ENGINES: THEY REALLY LOVE SQL

       SELECT count(*) as fails, timestamp, local, description
       FROM HttpMonitor(type = "error" OR description = "Timeout::Error")
           .std:groupwin(description)
           .win:time(15 minutes)

       var forward = inputStream.AlterEventStartTime(
           s => s.StartTime.AddSeconds(1));
       var query = from evt in inputStream
                   from prev in forward
                   where prev.Value < threshold && evt.Value > threshold
                   select new { Time = evt.Time, Low = prev.Value, High = evt.Value };
  4. STREAM PROCESSING ENGINES: THEY REALLY LOVE SQL

       SELECT STREAM "SuspectLoginFailures"."accountNumber",
              "loginFailureCount", "transactionType", "amount"
       FROM "SuspectLoginFailures" OVER "lastFew"
       JOIN "Transactions" OVER "lastFew"
         ON "SuspectLoginFailures"."accountNumber" = "Transactions"."accountNumber"
       WHERE ("transactionType" = 'isDebit')
       WINDOW "lastFew" AS (RANGE INTERVAL '1' MINUTE PRECEDING);

       CREATE OUTPUT STREAM TickStats AS
       SELECT openval() AS StartOfTimeSlice,
              avg(NumberTicks) AS AvgTicksPerSecond,
              stdev(NumberTicks) AS StdevTicksPerSecond,
              lastval(NumberTicks) AS LastTicksPerSecond,
              FeedName
       FROM TicksPerSecond [ SIZE 20 ADVANCE 1 ON StartOfTimeSlice PARTITION BY FeedName ]
       GROUP BY FeedName;
  5. STREAM PROCESSING ENGINES: THEY REALLY LOVE SQL

       SELECT T.firstc, T.lastc, T.Ac1, T.Bc1, T.avgCc1, T.Dc1
       FROM S0
       MATCH_RECOGNIZE (
           MEASURES
               first(C.c2) as firstc,
               last(C.c2) as lastc,
               avg(C.c1) as avgCc1,
               A.c1 as Ac1,
               B.c1 as Bc1,
               D.c1 as Dc1
           PATTERN(A B C* D)
           DEFINE
               A as A.c1 = 30,
               B as B.c2 = 10.0,
               C as C.c1 = 7,
               D as D.c1 = 40
       ) as T

       declare Sale
           @role( event )
       end

       declare window Ticks
           Sale() over window:length(5) from entry-point MyEntryPoint
       end

       rule "More than 2 sale successes in 5 events"
       when
           Number($cnt : intValue, intValue > 2)
               from accumulate( Sale(saleHappened == "Y") from window Ticks, count(1) )
       then
           System.out.println( "A sale has happened over " + $cnt + " events" );
       end
  6. CHAINED COMPUTATION MODEL: SIMILAR TO UNIX...

       [juanplopes ~]$ find . -name '*.conf' -print0 |
           xargs -0 grep -l -Z mem_limit |
           xargs -0 -I {} cp {} {}.bak

     PIPELINES!
  7. CHAINED COMPUTATION MODEL: AN EXAMPLE QUERY

       Sales product:Pampers* size:(G|XG|XXG)
         => sum(value#) as sales every minute
         => sales:regression->slope as slope over last hour
         => @filter slope < 0
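
The last two stages of the query above (a regression slope over a window of per-minute sums, then a filter on negative slopes) can be sketched in plain Python. This is only an illustration of the idea, not the engine's implementation; the function names `slope` and `falling` are hypothetical, and the x-axis is assumed to be the sample index.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def falling(windows):
    """Keep only the windows whose trend is negative (slope < 0)."""
    return [w for w in windows if slope(w) < 0]
```

Here each window would hold the per-minute sales sums of the last hour; `falling` plays the role of `@filter slope < 0`.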
  8. FILTER AUTOMATON MINIMIZES STRING COMPARISONS
     [Figure: automaton branching on "field: type" and "field: status"; edge labels http, 2, 3, 404, ??, 00; matches combined through and/or nodes]
     Example filters:
       type:http status:404
       type:http status:200
       type:http status:(2?? | 3??)
  9. PROBABILISTIC DATA STRUCTURES: LESS RESOURCES, SOME ERROR
     BLOOM FILTER: Cardinality testing, probabilistic counting;
     COUNT-MIN SKETCH: Multiset frequency, stream quantiles, scalar product;
     MINHASH: Locality sensitive hashing, set similarity;
     HYPERLOGLOG: Probabilistic counting with ridiculously low memory usage.
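
Taking the first structure in the list as an example of the "less resources, some error" trade-off: a Bloom filter answers "possibly seen" or "definitely not seen" from a fixed-size bit array. The sketch below is a minimal illustration; the class name, the default sizes, and the use of SHA-256 with a seed prefix as the hash family are all assumptions made here, not something taken from the talk.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "definitely not added"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added is always reported as present; an item that was never added is usually, but not always, reported as absent.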
  10. PROBABILISTIC DATA STRUCTURES: LESS RESOURCES, SOME ERROR
      [Timeline, 1970 to 2010]
      1970s: BLOOM FILTERS [Blo70]
      1980s: FM-SKETCH [FM85]
      1990s: AMS PAPER [AMS96], MINHASH [Bro97], LSH THEORY [IM98]
      2000s: SIMHASH [Cha02], KMV-SKETCH [BYJK+02], SPECTRAL BLOOM [CM03], LOGLOG [DF03], CM-SKETCH [CM05], HYPERLOGLOG [FFGM08]
      2010s: HYPERLOGLOG++ [HNH13]
  11. [Blo70] Burton H Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
      [FM85] Philippe Flajolet and G Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.
      [AMS96] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM, 1996.
      [Bro97] Andrei Z Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21–29. IEEE, 1997.
      [IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
      [Cha02] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.
      [BYJK+02] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1–10. Springer, 2002.
      [CM03] Saar Cohen and Yossi Matias. Spectral bloom filters. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 241–252. ACM, 2003.
      [DF03] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. In Algorithms-ESA 2003, pages 605–617. Springer, 2003.
      [CM05] Graham Cormode and S Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
      [FFGM08] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings, 2008.
      [HNH13] Stefan Heule, Marc Nunkesser, and Alexander Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology, pages 683–692. ACM, 2013.
  12. “NOTIFY ME WHEN THE 99th PERCENTILE OF ALL RESPONSE TIMES TODAY BECOMES GREATER THAN 1 SECOND?”
  13. IT IS PROVEN IMPOSSIBLE TO COMPUTE EXACT QUANTILES IN A SINGLE PASS WITH LESS THAN O(N) MEMORY
      Munro, J. I., & Paterson, M. S. (1980). Selection and sorting with limited storage. Theoretical computer science, 12(3), 315-323.
      http://wrap.warwick.ac.uk/46321/1/WRAP_Munro_cs-rr-024.pdf
  14. [Figure: sorted values by position]
      position:  0     1     2     ...  N/2-1  N/2   N/2+1  ...  99N/100  ...  N-1
      value:     0.01  0.02  0.02  ...  0.35   0.39  0.42   ...  1.25     ...  48
      MEDIAN: position N/2. 99th PERCENTILE: position 99N/100.
  15. COUNT-MIN SKETCH TO THE RESCUE
      row 1:  0 0 0 0 0 0 0 0
      row 2:  0 0 0 0 0 0 0 0
  16. COUNT-MIN SKETCH TO THE RESCUE
      A[“some value”] += 3
      h1(“some value”) = 5
      h2(“some value”) = 2
      row 1:  0 0 0 0 0 3 0 0
      row 2:  0 0 3 0 0 0 0 0
  17. COUNT-MIN SKETCH TO THE RESCUE
      A[“another value”] += 5
      h1(“another value”) = 1
      h2(“another value”) = 2
      row 1:  0 5 0   0 0 3 0 0
      row 2:  0 0 3+5 0 0 0 0 0
  18. COUNT-MIN SKETCH TO THE RESCUE
      A[“some value”]?
      h1(“some value”) = 5
      h2(“some value”) = 2
      row 1:  1 5 0 10 4 3 7 9
      row 2:  1 0 8 3  1 0 0 1
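
The update and query steps walked through on these slides can be sketched as a small Python class. On the last slide, the query for “some value” reads the cell at position 5 of the first row (3) and the cell at position 2 of the second row (8), and the smaller of the two, 3, is the estimate. The sketch below follows the same two-row, eight-column layout; the class name and the seeded SHA-256 hash functions are assumptions for illustration, so the actual positions will differ from the slide's h1 and h2.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=2, width=8):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, simulated by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        # A[key] += amount: bump one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += amount

    def estimate(self, key):
        # A[key]?: collisions only inflate counters, so take the minimum.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))
```

With non-negative updates the estimate never undercounts; it overcounts only when the key collides with other keys in every row.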
  19. FENWICK TREES: I TOLD YOU THERE WOULD BE TREES
      4 9 4 1 2 3 9 0 5 1 5 3 3 3 8 5
  20. [Figure: tree of pairwise sums over the array]
      4 9 4 1 2 3 9 0 5 1 5 3 3 3 8 5
      13 5 5 9 6 8 6 13
      18 14 14 19
      32 33
      65
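
The tree of partial sums on this slide is what a Fenwick tree (binary indexed tree) stores in compact form: it keeps a subset of those sums in a flat array, so that both a point update and a prefix sum take O(log n) steps. A minimal sketch, with a hypothetical class name and a 0-based public interface chosen here for convenience:

```python
class FenwickTree:
    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)  # 1-based internally

    def add(self, index, delta):
        """Add delta to the element at 0-based position index."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i  # move to the next node covering this position

    def prefix_sum(self, count):
        """Sum of the first count elements."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # drop the lowest set bit
        return total
```

Loading the 16 values from the slide and asking for the prefix sums of the first 2, 4, 8, and 16 elements gives 13, 18, 32, and 65, matching the left edge of the tree.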
  21. [Figure: counts per distinct value; "..." on the slide marks elided values]
      count:  4     9     4     1     1     3     5
      value:  0.01  0.02  0.03  0.04  0.35  0.99  42
  22. 4 9 4 1 2 3 9 0 5 1 5 3 3 3 8 5
  23. IT IS POSSIBLE TO ESTIMATE QUANTILES IN A SINGLE PASS WITH LESS THAN O(N) MEMORY
      Munro, J. I., & Paterson, M. S. (1980). Selection and sorting with limited storage. Theoretical computer science, 12(3), 315-323.
      http://wrap.warwick.ac.uk/46321/1/WRAP_Munro_cs-rr-024.pdf
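
One way to see why estimation is possible: if values are counted per bucket rather than stored individually, a quantile can be located by walking the cumulative counts until the target rank is reached. The sketch below is a simplified stand-in for the approach on the slides. It uses fixed, hypothetical bucket boundaries and a plain list of counters with a linear scan, not a count-min sketch or a Fenwick tree, and it returns the upper bound of the bucket that holds the requested quantile.

```python
import bisect


class BucketQuantiles:
    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # upper bound of each bucket
        self.counts = [0] * (len(self.bounds) + 1)   # last bucket is overflow
        self.total = 0

    def add(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile(self, q):
        """Upper bound of the bucket holding the q-th quantile (0 < q <= 1)."""
        if self.total == 0:
            raise ValueError("no values recorded")
        target = q * self.total
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")
```

Memory is proportional to the number of buckets, not to the number of values seen, and the error is bounded by the bucket width. Replacing the linear scan with a Fenwick tree over the counters would make the rank lookup logarithmic.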
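  24. COUNT-MIN SKETCH: IT IS THAT GOOD
      [Figure: two sketches joined by "+"]
      sketch 1, row 1:  1 5 0 10 4 3 7 9
      sketch 1, row 2:  1 0 8 3  1 0 0 1
      sketch 2, row 1:  2 3 9 1  0  0 2 4
      sketch 2, row 2:  9 4 4 11 13 7 0 1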
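
The "+" on this slide appears to illustrate that count-min sketches can be combined: two sketches built with the same dimensions and the same hash functions can be merged by adding their counters cell by cell, and the result equals the sketch of the combined stream. A minimal sketch of that merge, operating on plain lists of rows; the function name is hypothetical, and the split of the slide's numbers into two 2x8 sketches is an assumption.

```python
def merge_sketches(a, b):
    """Cell-wise sum of two count-min sketches given as lists of rows."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("sketches must have the same depth and width")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Numbers as read from the slide (assumed layout: two rows of eight counters).
first = [[1, 5, 0, 10, 4, 3, 7, 9], [1, 0, 8, 3, 1, 0, 0, 1]]
second = [[2, 3, 9, 1, 0, 0, 2, 4], [9, 4, 4, 11, 13, 7, 0, 1]]
merged = merge_sketches(first, second)
```

This property is what allows one sketch per time slice or per tree node, combined on demand at query time.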
  25. [Figure: HyperLogLog register update]
      hash: 01000101…  PARTITION #2, ρ(x) = 3
      register index:  0 1 2 3 4 5 6 7
      register value:  0 0 3 0 0 0 0 0
  26. [Figure: HyperLogLog registers combined by HARMONIC MEAN]
      register index:  0 1 2 3  4 5 6 7
      register value:  1 8 3 12 9 1 0 7
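
The two steps on these slides, splitting a hash into a register index plus ρ and then combining the registers through a harmonic mean, can be sketched as follows. For the hash 01000101 with 3 index bits, the leading bits 010 select register 2 and the remaining bits 00101 have their first 1 at position 3, so ρ = 3. This is a bare-bones illustration: the 64-bit SHA-256 prefix and the default of 1024 registers are assumptions made here, and the small-range and large-range corrections from the HyperLogLog paper are left out.

```python
import hashlib


def split_hash(h, total_bits, index_bits):
    """Split a hash into (register index, rho of the remaining bits)."""
    rest_bits = total_bits - index_bits
    index = h >> rest_bits                    # leading bits pick the register
    rest = h & ((1 << rest_bits) - 1)         # remaining bits
    rho = rest_bits - rest.bit_length() + 1   # 1-based position of the first 1 bit
    return index, rho


class HyperLogLog:
    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.m = 1 << index_bits
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash (assumed choice)
        index, rho = split_hash(h, 64, self.index_bits)
        self.registers[index] = max(self.registers[index], rho)

    def raw_estimate(self):
        # Bias constant from the HyperLogLog paper, valid for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Normalized harmonic mean of 2**register over all registers.
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
```

Each register only remembers the largest ρ it has seen, so adding the same item twice changes nothing, which is what makes the structure count distinct items.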
  27. [Figure: two HyperLogLog register arrays combined with max]
      register index:  0 1 2 3  4  5  6  7
      registers A:     3 1 8 9  15 12 14 2
      registers B:     1 8 3 12 9  1  0  7
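
Combining two HyperLogLog sketches, as this slide shows with "max", amounts to taking the larger value of each pair of registers. A minimal sketch using the two arrays from the slide; the function name is hypothetical, and both sketches are assumed to share the same hash function and register count.

```python
def merge_registers(a, b):
    """Register-wise maximum of two HyperLogLog register arrays."""
    if len(a) != len(b):
        raise ValueError("register arrays must have the same length")
    return [max(x, y) for x, y in zip(a, b)]


merged = merge_registers([3, 1, 8, 9, 15, 12, 14, 2],
                         [1, 8, 3, 12, 9, 1, 0, 7])
```

The merged registers are identical to those a single sketch would hold after seeing both streams, so the union's cardinality can be estimated without revisiting the data.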
  28. THANKS
      => Machine Learning Meetup • Nubank • May 24th 2017
      twitter.com/juanplopes
      github.com/juanplopes