Slide 1

Slide 1 text

1 Joshua Thijssen jaytaph

Slide 2

Slide 2 text

Disclaimer: I'm not a (mad) scientist nor a mathematician. 2

Slide 3

Slide 3 text

Second disclaimer: I will only tell lies 3

Slide 4

Slide 4 text

German Tank Problem 4

Slide 5

Slide 5 text

5

Slide 6

Slide 6 text

5 15

Slide 7

Slide 7 text

6

Slide 8

Slide 8 text

6 53 72 8 15

Slide 9

Slide 9 text

7

N ≈ m + m/k − 1

k = number of elements (observed serial numbers)
m = largest number observed

Slide 10

Slide 10 text

72 + (72 / 4) - 1 = 89 8
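The calculation above is the frequentist estimator N ≈ m + m/k − 1. A minimal PHP sketch of it (the function name and the concrete sample array are illustrative, not from the slides):

// German-tank estimate: N ≈ m + m/k − 1, where m is the largest
// serial number seen and k is the number of serials observed.
function estimatePopulation(array $serials): float
{
    $k = count($serials);
    $m = max($serials);
    return $m + ($m / $k) - 1;
}

echo estimatePopulation([53, 72, 8, 15]); // 72 + 72/4 − 1 = 89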

Slide 11

Slide 11 text

9

               Intelligence   Statistics   Actual
June 1940      1000           169
June 1941      1550           244
August 1942    1550           327

https://en.wikipedia.org/wiki/German_tank_problem

Slide 12

Slide 12 text

9

               Intelligence   Statistics   Actual
June 1940      1000           169          122
June 1941      1550           244
August 1942    1550           327

https://en.wikipedia.org/wiki/German_tank_problem

Slide 13

Slide 13 text

9

               Intelligence   Statistics   Actual
June 1940      1000           169          122
June 1941      1550           244          271
August 1942    1550           327

https://en.wikipedia.org/wiki/German_tank_problem

Slide 14

Slide 14 text

9

               Intelligence   Statistics   Actual
June 1940      1000           169          122
June 1941      1550           244          271
August 1942    1550           327          342

https://en.wikipedia.org/wiki/German_tank_problem

Slide 15

Slide 15 text

10

Slide 16

Slide 16 text

10 ➡ Data leakage.

Slide 17

Slide 17 text

10 ➡ Data leakage. ➡ User IDs, invoice IDs, etc.

Slide 18

Slide 18 text

10 ➡ Data leakage. ➡ User IDs, invoice IDs, etc. ➡ Used to approximate the number of iPhones sold in 2008.

Slide 19

Slide 19 text

10 ➡ Data leakage. ➡ User IDs, invoice IDs, etc. ➡ Used to approximate the number of iPhones sold in 2008. ➡ Calculate approximations of datasets from (incomplete) information.

Slide 20

Slide 20 text

➡ Avoid leaking (semi-)sequential data. ➡ Adding randomness and offsets will NOT solve the issue. ➡ Use UUIDs (or better: time-based short IDs; you don't always need UUIDs) 11
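A minimal sketch of that last point (the function name is mine, not from the slides): keep the auto-increment primary key internal and expose only a random, non-guessable identifier in URLs and APIs, so outsiders cannot count your users or invoices.

// Random public identifier (PHP 7+): store it alongside the
// internal auto-increment ID and use only this value externally.
function generatePublicId(int $bytes = 16): string
{
    return bin2hex(random_bytes($bytes));
}

$invoicePublicId = generatePublicId(); // e.g. "9f2c4e…", 32 hex chars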

Slide 21

Slide 21 text

12 Collecting (big) data is easy. Analyzing big data is the hard part.

Slide 22

Slide 22 text

Confirmation Bias 13

Slide 23

Slide 23 text

2 4 6 Z = {…, −2, −1, 0, 1, 2, …} 14

Slide 24

Slide 24 text

21% 15

Slide 25

Slide 25 text

16 5 8 ? ? If a card shows an even number on one face, then its opposite face is blue.

Slide 26

Slide 26 text

< 10% 17

Slide 27

Slide 27 text

18 coke beer 35 17 If you drink beer then you must be 18 yrs or older.

Slide 28

Slide 28 text

18 coke beer 35 17 If you drink beer then you must be 18 yrs or older.

Slide 29

Slide 29 text

18 coke beer 35 17 If you drink beer then you must be 18 yrs or older.

Slide 30

Slide 30 text

Cognitive adaptations for social exchange 19

Slide 31

Slide 31 text

Hint: try to place your "technical problem" in a more social context. 20

Slide 32

Slide 32 text

BDD 21

Slide 33

Slide 33 text

22 5 8 ? ? If a card shows an even number on one face, then its opposite face is blue.

Slide 34

Slide 34 text

22 5 8 ? ? If a card shows an even number on one face, then its opposite face is blue.

Slide 35

Slide 35 text

22 5 8 ? ? If a card shows an even number on one face, then its opposite face is blue.

Slide 36

Slide 36 text

TESTING 23

Slide 37

Slide 37 text

24 ➡ Step 1: Write code ➡ Step 2: Write tests ➡ Step 3: Profit

Slide 38

Slide 38 text

public function isLeapYear($year) {
    return ($year % 4 == 0);
}

testIs1996ALeapYear();
testIs2000ALeapYear();
testIs2004ALeapYear();
testIs2008ALeapYear();
testIs2012ALeapYear();
testIs1997NotALeapYear();
testIs1998NotALeapYear();
testIs2001NotALeapYear();
testIs2013NotALeapYear();

25 https://www.sundoginteractive.com/blog/confirmation-bias-in-unit-testing

Slide 39

Slide 39 text

public function isLeapYear($year) {
    return ($year % 4 == 0);
}

testIs1996ALeapYear();
testIs2000ALeapYear();
testIs2004ALeapYear();
testIs2008ALeapYear();
testIs2012ALeapYear();
testIs1997NotALeapYear();
testIs1998NotALeapYear();
testIs2001NotALeapYear();
testIs2013NotALeapYear();

25 https://www.sundoginteractive.com/blog/confirmation-bias-in-unit-testing

Slide 40

Slide 40 text

public function isLeapYear($year) {
    return ($year % 4 == 0);
}

26 https://www.sundoginteractive.com/blog/confirmation-bias-in-unit-testing

Slide 41

Slide 41 text

27 ➡ Tests were written based on actual code. ➡ Tests were written to CONFIRM actual code, not to DISPROVE actual code!
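For contrast, a sketch of the full Gregorian rule that a disproving test would have forced; the code and the test name below are mine, not from the slides.

// Full leap-year rule: divisible by 4, except century years,
// unless the century is also divisible by 400 (2000 yes, 1900 no).
function isLeapYear(int $year): bool
{
    return ($year % 4 == 0 && $year % 100 != 0) || ($year % 400 == 0);
}

// A disproving test the original suite never asked for:
// testIs1900NotALeapYear();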

Slide 42

Slide 42 text

28 TDD

Slide 43

Slide 43 text

29 ➡ Step 1: Write tests ➡ Step 2: Write code ➡ Step 3: Profit, as you are less prone to confirmation bias (there is nothing to be biased by yet!)

Slide 44

Slide 44 text

Birthday paradox 30

Slide 45

Slide 45 text

31 Question: how many people do you need for a >50% chance that at least two of them share a birthday? 4 March, 18 September, 5 December, 25 July, 2 February, 9 October

Slide 46

Slide 46 text

23 people 32
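A quick sketch to verify the 23-people claim (the function name is mine):

// Probability that at least two of $n people share a birthday,
// assuming 365 equally likely birthdays (ignoring 29 February).
function sharedBirthdayProbability(int $n): float
{
    $pAllDifferent = 1.0;
    for ($i = 0; $i < $n; $i++) {
        $pAllDifferent *= (365 - $i) / 365;
    }
    return 1 - $pAllDifferent;
}

echo sharedBirthdayProbability(23); // ≈ 0.507, just over 50%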

Slide 47

Slide 47 text

366 persons = 100% 33

Slide 48

Slide 48 text

Collisions occur more often than you realize 34

Slide 49

Slide 49 text

Hash collisions 35

Slide 50

Slide 50 text

With a 16-bit hash, roughly 300 values are enough for a >50% collision probability. 36
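The ~300 figure follows from the same birthday mathematics: for a b-bit hash, roughly sqrt(2 · 2^b · ln 2) random values give a >50% collision chance. A sketch (the function name is mine):

// Approximate number of random values after which a $bits-bit hash
// reaches a >50% collision probability: sqrt(2 * 2^bits * ln 2).
function collisionThreshold(int $bits): float
{
    return sqrt(2 * (2 ** $bits) * log(2));
}

echo round(collisionThreshold(16)); // ≈ 301
echo round(collisionThreshold(32)); // ≈ 77163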

Slide 51

Slide 51 text

Watch out for: 37 ➡ Hashes that are too small. ➡ Data you assume is unique. ➡ Your data might be less "protected" than you think.

Slide 52

Slide 52 text

Heisenberg uncertainty principle 38

Slide 53

Slide 53 text

It's not about Star Trek (Heisenberg compensators) 39

Slide 54

Slide 54 text

nor crystal meth 40

Slide 55

Slide 55 text

41

Δx · Δp ≥ ħ / 2

x = position
p = momentum (mass × velocity)
ħ ≈ 0.0000000000000000000000000000000001054571800 (1.054571800E-34)

Slide 56

Slide 56 text

The more precise you know one property, the less you know the other. 42

Slide 57

Slide 57 text

This is NOT about observing! 43

Slide 58

Slide 58 text

Observer effect ("heisenbug") 44

Slide 59

Slide 59 text

It's about trade-offs 45

Slide 60

Slide 60 text

Benford's law 46

Slide 61

Slide 61 text

Numbers beginning with 1 are more common than numbers beginning with 9. 47

Slide 62

Slide 62 text

Default behavior for many naturally occurring numbers. 48

Slide 63

Slide 63 text

49

Slide 64

Slide 64 text

find . -name \*.php -exec wc -l {} \; | sort | cut -b 1 | uniq -c 50

Slide 65

Slide 65 text

find . -name \*.php -exec wc -l {} \; | sort | cut -b 1 | uniq -c

   1073 1
    886 2
    636 3
    372 4
    352 5
    350 6
    307 7
    247 8
    222 9

50
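Benford's law predicts the share of each leading digit as log10(1 + 1/d); a short sketch to compare against the counts above:

// Expected leading-digit distribution under Benford's law.
foreach (range(1, 9) as $d) {
    printf("%d: %4.1f%%\n", $d, 100 * log10(1 + 1 / $d));
}
// 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%,
// 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%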

Slide 66

Slide 66 text

51

Slide 67

Slide 67 text

Bayesian filtering 52

Slide 68

Slide 68 text

What's the probability of an event, based on conditions that might be related to the event. 53

Slide 69

Slide 69 text

What is the chance that a message is spam when it contains certain words? 54

Slide 70

Slide 70 text

55

Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)

P(A|B): probability of event A, given event B (conditional)
P(A):   probability of event A
P(B):   probability of event B
P(B|A): probability of event B, given event A

Slide 71

Slide 71 text

56 ➡ Figure out the probability a {mail, tweet, comment, review} is {spam, negative} etc.

Slide 72

Slide 72 text

➡ 10 out of 50 comments are "negative". ➡ 25 out of 50 comments use the word "horrible". ➡ 8 comments with the word "horrible" are marked as "negative". 57

Slide 73

Slide 73 text

58

negative: 10 comments
"horrible": 25 comments
negative and "horrible": 8 comments
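Plugging those numbers into Bayes' theorem gives the probability that a comment containing "horrible" is negative (the variable names are mine):

// P(negative | "horrible") = P("horrible" | negative) * P(negative) / P("horrible")
$pNegative           = 10 / 50; // 10 of 50 comments are negative
$pHorrible           = 25 / 50; // 25 of 50 comments contain "horrible"
$pHorribleIfNegative =  8 / 10; // 8 of the 10 negative comments contain "horrible"

echo $pHorribleIfNegative * $pNegative / $pHorrible; // 0.32, a 32% chance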

Slide 74

Slide 74 text

59

Slide 75

Slide 75 text

60 ➡ More words? ➡ Complex algorithm, ➡ but we can (naively) assume that words are independent of each other ➡ the Naive Bayes approach

Slide 76

Slide 76 text

61

Slide 77

Slide 77 text

62 But how do we know beforehand which comments are negative?

Slide 78

Slide 78 text

TRAINING SET 63

Slide 79

Slide 79 text

64 "Your product is horrible and does not work properly. Also, you suck." "I had a horrible experience with another product. But yours really worked well. Thank you!" Negative: Positive:

Slide 80

Slide 80 text

$trainingset = [
    'negative' => [
        'count' => 1,
        'words' => [
            'product'  => 1,
            'horrible' => 1,
            'properly' => 1,
            'suck'     => 1,
        ],
    ],
    'positive' => [
        'count' => 1,
        'words' => [
            'horrible'   => 1,
            'experience' => 1,
            'product'    => 1,
            'thank'      => 1,
        ],
    ],
];

65

Slide 81

Slide 81 text

66

$trainingset = [
    'negative' => [
        'count' => 631,
        'words' => [
            'product'  => 521,
            'horrible' => 52,
            'properly' => 36,
            'suck'     => 272,
        ],
    ],
    'positive' => [
        'count' => 1263,
        'words' => [
            'horrible'   => 62,
            'experience' => 16,
            'product'    => 311,
            'great'      => 363,
            'thank'      => 63,
        ],
    ],
];
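A minimal classifier sketch over the $trainingset structure above; the classify() function and the add-one smoothing are my own simplification, not taken from the slides.

// Naive Bayes: score each class as log P(class) plus the summed
// log probabilities of the words, with add-one smoothing.
function classify(array $trainingset, array $words): string
{
    $totalDocs = array_sum(array_column($trainingset, 'count'));

    $best = '';
    $bestScore = -INF;
    foreach ($trainingset as $class => $data) {
        $score = log($data['count'] / $totalDocs);
        $totalWords = array_sum($data['words']);
        $vocabulary = count($data['words']);
        foreach ($words as $word) {
            $wordCount = $data['words'][$word] ?? 0;
            $score += log(($wordCount + 1) / ($totalWords + $vocabulary));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $class;
        }
    }
    return $best;
}

echo classify($trainingset, ['horrible', 'suck']); // "negative"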

Slide 82

Slide 82 text

67 ➡ You might want to filter stop-words first. ➡ You might want to make sure negations are handled properly: "not great" => negative. ➡ Bonus points if you can spot sarcasm.

Slide 83

Slide 83 text

➡ Collaborative filtering (Mahout): ➡ If a user likes products A, B and C, what is the chance that they will like product D? 68

Slide 84

Slide 84 text

69 Mess up your (training) data, and nothing can save you (except a training set reboot)

Slide 85

Slide 85 text

➡ Binomial probability 70

Slide 86

Slide 86 text

71 ➡ 30% chance of acceptance for a CFP ➡ 5 CFPs

Slide 87

Slide 87 text

71 ➡ 30% chance of acceptance for a CFP ➡ 5 CFPs

1 - (0.7 * 0.7 * 0.7 * 0.7 * 0.7) = 1 - 0.168 = 0.832

An 83% chance of getting selected at least once!
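The same calculation, generalized (the function name is mine):

// Probability of at least one acceptance out of $n submissions,
// each with independent acceptance probability $p.
function atLeastOneAcceptance(int $n, float $p): float
{
    return 1 - (1 - $p) ** $n;
}

echo atLeastOneAcceptance(5, 0.30); // ≈ 0.83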

Slide 88

Slide 88 text

Ockham's Razor 72

Slide 89

Slide 89 text

73 Among competing hypotheses, the one with the fewest assumptions should be selected.

Slide 90

Slide 90 text

74 Everything should be made as simple as possible, but no simpler.

Slide 91

Slide 91 text

YAGNI 75

Slide 92

Slide 92 text

76 Actually, ➡ The principle of plurality: plurality should not be posited without necessity. ➡ The principle of parsimony: it is pointless to do with more what can be done with less.

Slide 93

Slide 93 text

➡ Every element you add needs design, development, maintenance, connectivity, support, and so on. ➡ When "adding" elements, you are not adding, you are multiplying! 77

Slide 94

Slide 94 text

78 Food for thought: Would Ockham accept a Service Oriented Architecture?

Slide 95

Slide 95 text

http://farm1.static.flickr.com/73/163450213_18478d3aa6_d.jpg 79

Slide 96

Slide 96 text

80
Find me on Twitter: @jaytaph
Find me for development and training: www.noxlogic.nl / www.techademy.nl
Find me on email: [email protected]
Find me for blogs: www.adayinthelifeof.nl