Slide 1

Slide 1 text

Joshua Thijssen (@jaytaph)

Slide 2

Slide 2 text

Joshua Thijssen
Consultant and trainer @ NoxLogic
Founder of TechAnalyze.io
Symfony Rainbow Books author
Mastering the SPL author
Blog: http://adayinthelifeof.nl
Email: jthijssen@noxlogic.nl
Twitter: @jaytaph

Slide 3

Slide 3 text

https://dutchtechrecruitment.nl/

Slide 4

Slide 4 text

Disclaimer: I'm neither a (mad) scientist nor a mathematician.

Slide 5

Slide 5 text

German Tank Problem

Slide 6

Slide 6 text

6

Slide 7

Slide 7 text

Observed serial number: 15

Slide 8

Slide 8 text

7

Slide 9

Slide 9 text

Observed serial numbers: 53, 72, 8, 15

Slide 10

Slide 10 text

N ≈ m + (m / k) - 1
k = number of elements
m = largest number

Slide 11

Slide 11 text

72 + (72 / 4) - 1 = 89
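The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming serial numbers start at 1 and the observed ones are a random sample; the function name `estimate_total` is made up for illustration.

```python
def estimate_total(serials):
    """Estimate the population size as m + m/k - 1 from observed serial numbers."""
    k = len(serials)   # number of observed elements
    m = max(serials)   # largest observed serial number
    return m + m / k - 1


observed = [53, 72, 8, 15]  # the four numbers used in the slide's calculation
print(estimate_total(observed))  # 72 + 72/4 - 1 = 89.0
```

More observations pull the estimate closer to the largest number seen, because the expected gap above it shrinks.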

Slide 15

Slide 15 text

Month         Intelligence estimate   Statistical estimate   Actual
June 1940     1000                    169                    122
June 1941     1550                    244                    271
August 1942   1550                    327                    342

Source: https://en.wikipedia.org/wiki/German_tank_problem

Slide 16

Slide 16 text

11

Slide 20

Slide 20 text

➡ Data leakage.
➡ User IDs, invoice IDs, etc.
➡ Used to approximate the number of iPhones sold in 2008.
➡ Calculate approximations of datasets from (incomplete) information.

Slide 21

Slide 21 text

12

Slide 22

Slide 22 text

➡ Avoid leaking (semi-)sequential data.
➡ Adding randomness and offsets will NOT solve the issue.
➡ Use UUIDs (better: time-based short IDs; you don't need UUIDs).
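One possible shape for a time-based short ID is sketched below. The layout (a 45-bit millisecond timestamp followed by 35 random bits, encoded as 16 base32 characters) and the names `short_id` and `encode_base32` are assumptions made for illustration, not a scheme prescribed by the talk. Such an ID reveals roughly when a record was created, but not how many records exist.

```python
import secrets
import time

# 32 symbols, skipping the easily confused letters i, l, o and u.
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"


def encode_base32(value, length):
    """Encode a non-negative integer as a fixed-length base32 string."""
    chars = []
    for _ in range(length):
        chars.append(ALPHABET[value & 31])  # lowest 5 bits first
        value >>= 5
    return "".join(reversed(chars))


def short_id(now_ms=None):
    """Hypothetical ID: 45-bit millisecond timestamp + 35 random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    timestamp_part = encode_base32(now_ms & ((1 << 45) - 1), 9)  # 9 chars * 5 bits
    random_part = encode_base32(secrets.randbits(35), 7)         # 7 chars * 5 bits
    return timestamp_part + random_part


print(short_id())
```

Two IDs created in the same millisecond differ only in the random part, so the number of random bits should be sized to the expected write rate.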

Slide 23

Slide 23 text

Collecting (big) data is easy. Analyzing big data is the hard part.

Slide 24

Slide 24 text

Confirmation Bias

Slide 25

Slide 25 text

2 4 6
Z = {…, −2, −1, 0, 1, 2, …}

Slide 26

Slide 26 text

21%

Slide 27

Slide 27 text

5   8   ?   ?
If a card shows an even number on one face, then its opposite face is blue.

Slide 28

Slide 28 text

< 10%

Slide 29

Slide 29 text

coke   beer   35   17
If you drink beer, then you must be 18 years or older.

Slide 32

Slide 32 text

Cognitive adaptation for social exchange

Slide 33

Slide 33 text

Hint: try to place your "technical problem" in a more social context.

Slide 34

Slide 34 text

BDD

Slide 35

Slide 35 text

5   8   ?   ?
If a card shows an even number on one face, then its opposite face is blue.

Slide 38

Slide 38 text

TESTING

Slide 39

Slide 39 text

➡ Step 1: Write code
➡ Step 2: Write tests
➡ Step 3: Profit

Slide 40

Slide 40 text

public function isLeapYear($year)
{
    return ($year % 4 == 0);
}

testIs1996ALeapYear();
testIs2000ALeapYear();
testIs2004ALeapYear();
testIs2008ALeapYear();
testIs2012ALeapYear();
testIs1997NotALeapYear();
testIs1998NotALeapYear();
testIs2001NotALeapYear();
testIs2013NotALeapYear();

Source: https://www.sundoginteractive.com/blog/confirmation-bias-in-unit-testing

Slide 42

Slide 42 text

public function isLeapYear($year)
{
    return ($year % 4 == 0);
}

Source: https://www.sundoginteractive.com/blog/confirmation-bias-in-unit-testing

Slide 43

Slide 43 text

➡ Tests were written based on actual code.
➡ Tests were written to CONFIRM actual code, not to DISPROVE actual code!
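To make the difference concrete, here is a small Python sketch (the talk's own example is PHP) that checks the divisible-by-4 rule against years picked to try to break it. The function names are hypothetical; the second function uses the Gregorian calendar rule for comparison.

```python
def is_leap_year_naive(year):
    """Same rule as the slide's PHP function: divisible by 4."""
    return year % 4 == 0


def is_leap_year(year):
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Cases chosen to try to break the rule rather than to confirm it.
disproving_cases = {1900: False, 2000: True, 2100: False}

for year, expected in disproving_cases.items():
    naive_ok = is_leap_year_naive(year) == expected
    print(year, "naive rule passes" if naive_ok else "naive rule FAILS")
```

None of the years in the slide's test list is a century year other than 2000, which is why all nine tests pass against the naive rule.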

Slide 44

Slide 44 text

TDD

Slide 45

Slide 45 text

➡ Step 1: Write tests
➡ Step 2: Write code
➡ Step 3: Profit: less prone to confirmation bias (there is no code yet to be biased by!)

Slide 46

Slide 46 text

Birthday paradox

Slide 47

Slide 47 text

Question: how many people are needed for a > 50% chance that two of them share a birthday?
4 March, 18 September, 5 December, 25 July, 2 February, 9 October

Slide 48

Slide 48 text

23 people

Slide 49

Slide 49 text

366 people = 100%
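Both numbers can be checked with a short calculation. This is a sketch assuming 365 equally likely birthdays (leap days ignored); the function name is made up for illustration.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


for n in (10, 22, 23, 50, 366):
    print(n, round(shared_birthday_probability(n), 3))
```

The probability first exceeds 50% at 23 people and reaches exactly 1 at 366, when there are more people than days.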

Slide 50

Slide 50 text

Collisions occur more often than you realize

Slide 51

Slide 51 text

Hash collisions

Slide 52

Slide 52 text

A 16-bit hash needs only about 300 values before the collision probability exceeds 50%.
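The figure for 16 bits follows from the same birthday calculation. The sketch below assumes a hash whose outputs are uniformly distributed; both function names are hypothetical.

```python
import math


def values_for_collision(bits, target=0.5):
    """Smallest number of random values whose collision probability reaches `target`."""
    space = 2 ** bits
    p_no_collision = 1.0
    n = 0
    while 1.0 - p_no_collision < target:
        p_no_collision *= (space - n) / space  # value n + 1 must avoid the first n
        n += 1
    return n


def approx_values_for_half(bits):
    """Birthday-bound approximation sqrt(2 * N * ln 2) for a 50% collision chance."""
    return math.sqrt(2 * (2 ** bits) * math.log(2))


print(values_for_collision(16))           # exact count for a 16-bit hash
print(round(approx_values_for_half(16)))  # approximation, close to 300
print(round(approx_values_for_half(32)))  # a 32-bit hash: tens of thousands
```

The approximation grows with the square root of the hash space, so every two extra bits only double the number of values you can store before a collision becomes likely.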

Slide 53

Slide 53 text

Watch out for:
➡ Hashes that are too small.
➡ Unique data.
➡ Your data might be less "protected" than you think.

Slide 54

Slide 54 text

Heisenberg uncertainty principle

Slide 55

Slide 55 text

It's not about Star Trek (Heisenberg compensators)

Slide 56

Slide 56 text

nor crystal meth

Slide 57

Slide 57 text

Δx · Δp ≥ ħ / 2
x = position
p = momentum (mass × velocity)
ħ = 0.0000000000000000000000000000000001054571800 (1.054571800E-34)

Slide 58

Slide 58 text

The more precisely you know one property, the less precisely you can know the other.

Slide 59

Slide 59 text

This is NOT about observing!

Slide 60

Slide 60 text

Observer effect: heisenbug

Slide 61

Slide 61 text

It's about trade-offs

Slide 62

Slide 62 text

Benford's law

Slide 63

Slide 63 text

Numbers beginning with 1 are more common than numbers beginning with 9.

Slide 64

Slide 64 text

Default behavior for many naturally occurring sets of numbers.

Slide 65

Slide 65 text

51

Slide 66

Slide 66 text

find . -name \*.php -exec wc -l {} \; | sort | cut -b 1 | uniq -c

Slide 67

Slide 67 text

find . -name \*.php -exec wc -l {} \; | sort | cut -b 1 | uniq -c

1073 1
 886 2
 636 3
 372 4
 352 5
 350 6
 307 7
 247 8
 222 9
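To see how closely these counts follow Benford's law, they can be compared with the expected share log10(1 + 1/d) for each leading digit d. This is a minimal sketch; the dictionary simply copies the counts from the output above.

```python
import math

# First-digit counts copied from the `uniq -c` output on this slide.
observed = {1: 1073, 2: 886, 3: 636, 4: 372, 5: 352, 6: 350, 7: 307, 8: 247, 9: 222}


def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


total = sum(observed.values())
for digit, count in observed.items():
    print(f"{digit}: observed {count / total:.1%}, Benford {benford_expected(digit):.1%}")
```

The observed shares decrease from 1 to 9 as the law predicts, although line counts of source files only follow the exact percentages loosely.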

Slide 68

Slide 68 text

53

Slide 69

Slide 69 text

Bayesian filtering

Slide 70

Slide 70 text

What's the probability of an event, based on conditions that might be related to the event?

Slide 71

Slide 71 text

What is the chance that a message is spam when it contains certain words?

Slide 72

Slide 72 text

P(A|B) = P(B|A) · P(A) / P(B)

P(A|B)   Probability of event A, given event B (conditional)
P(A)     Probability of event A
P(B)     Probability of event B
P(B|A)   Probability of event B, given event A

Slide 73

Slide 73 text

➡ Figure out the probability that a {mail, tweet, comment, review} is {spam, negative}, etc.

Slide 74

Slide 74 text

➡ 10 out of 50 comments are "negative".
➡ 25 out of 50 comments use the word "horrible".
➡ 8 comments with the word "horrible" are marked as "negative".

Slide 75

Slide 75 text

negative: 10 comments
"horrible": 25 comments
both (overlap): 8 comments
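Plugging these counts into Bayes' theorem gives the chance that a comment containing "horrible" is negative. This is a worked sketch with A = "comment is negative" and B = "comment contains horrible"; the variable names are made up for illustration.

```python
total_comments = 50
negative = 10                # comments marked "negative"
with_horrible = 25           # comments containing "horrible"
negative_with_horrible = 8   # comments that are both

p_negative = negative / total_comments                          # P(A)   = 0.2
p_horrible = with_horrible / total_comments                     # P(B)   = 0.5
p_horrible_given_negative = negative_with_horrible / negative   # P(B|A) = 0.8

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_negative_given_horrible = p_horrible_given_negative * p_negative / p_horrible
print(round(p_negative_given_horrible, 2))  # 0.32, i.e. 8 of the 25 "horrible" comments
```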

Slide 76

Slide 76 text

61

Slide 77

Slide 77 text

➡ More words?
➡ Complex algorithm,
➡ but we can assume that words are independent of each other:
➡ Naive Bayes approach

Slide 78

Slide 78 text

63

Slide 79

Slide 79 text

Do we have to know beforehand which comments are negative?

Slide 80

Slide 80 text

TRAINING SET

Slide 81

Slide 81 text

Negative: "Your product is horrible and does not work properly. Also, you suck."
Positive: "I had a horrible experience with another product. But yours really worked well. Thank you!"
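A toy Naive Bayes classifier trained on just these two comments could look like the sketch below. It assumes add-one (Laplace) smoothing and a very simple tokenizer, neither of which comes from the talk, and all function names are hypothetical. Two training comments are far too few for real use.

```python
import math
import re
from collections import Counter

training_set = [
    ("negative", "Your product is horrible and does not work properly. Also, you suck."),
    ("positive", "I had a horrible experience with another product. "
                 "But yours really worked well. Thank you!"),
]


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of training comments
    for label, text in samples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    vocabulary = set().union(*word_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total_words = sum(counts.values())
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, label_counts = train(training_set)
print(classify("you suck", word_counts, label_counts))
print(classify("worked really well, thank you", word_counts, label_counts))
```

Because "horrible" appears in both the negative and the positive example, it carries little weight here; the other words decide the outcome.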

Slide 82

Slide 82 text

➡ You might want to filter stop-words first.
➡ You might want to make sure negations are handled properly: "not great" => negative.
➡ Bonus points if you can spot sarcasm.

Slide 83

Slide 83 text

➡ Collaborative filtering (Mahout):
➡ If a user likes products A, B and C, what is the chance that they like product D?

Slide 84

Slide 84 text

Mess up your (training) data, and nothing can save you (except a training set reboot).

Slide 85

Slide 85 text

➡ 30% chance of acceptance for a CFP
➡ 5 CFPs
Binomial probability

Slide 86

Slide 86 text

➡ 30% chance of acceptance for a CFP
➡ 5 CFPs
1 - (0.7 * 0.7 * 0.7 * 0.7 * 0.7) = 1 - 0.168 = 0.832
83% chance of getting selected at least once!
Binomial probability
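The same result follows from the binomial distribution, which also gives the chance of every other number of acceptances. This is a sketch assuming the five submissions are independent and each has the same 30% chance; the function name is hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_cfps, p_accept = 5, 0.3

at_least_once = 1 - binomial_pmf(0, n_cfps, p_accept)
print(round(at_least_once, 3))  # 1 - 0.7**5 = 0.832

# Full distribution: chance of exactly k acceptances out of 5.
for k in range(n_cfps + 1):
    print(k, round(binomial_pmf(k, n_cfps, p_accept), 3))
```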

Slide 87

Slide 87 text

http://farm1.static.flickr.com/73/163450213_18478d3aa6_d.jpg

Slide 88

Slide 88 text

Find me on twitter: @jaytaph
Find me for development and training: www.noxlogic.nl / www.techademy.nl
Find me on email: jthijssen@noxlogic.nl
Find me for blogs: www.adayinthelifeof.nl