Joshua Thijssen
Consultant and trainer @ NoxLogic
Founder of TechAnalyze.io
Symfony Rainbow Books author
Mastering the SPL author
Blog: http://adayinthelifeof.nl
Email: jthijssen@noxlogic.nl
Twitter: @jaytaph
Slide 3
https://dutchtechrecruitment.nl/
Slide 4
Disclaimer:
I'm not a (mad) scientist nor a mathematician.
Slide 5
German Tank Problem
Slide 9
Observed serial numbers: 53, 72, 8, 15
Slide 10
k = number of elements
m = largest number
Slide 11
72 + (72 / 4) - 1 = 89
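The calculation above can be sketched in a few lines of Python. This is a minimal sketch, assuming serial numbers start at 1 and each one is observed at most once; the function name `estimate_total` is made up for illustration, and the sample is the four numbers shown on the slides.

```python
def estimate_total(serials):
    """Estimate the population size from a sample of serial numbers.

    Uses m + m/k - 1, where m is the largest observed serial number
    and k is the number of observations (assumes numbering starts at 1).
    """
    k = len(serials)   # number of elements
    m = max(serials)   # largest number
    return m + m / k - 1


print(estimate_total([53, 72, 8, 15]))  # 89.0
```

With more observations the m / k correction shrinks, so the estimate moves closer to the largest serial number seen.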
Slide 15
Month         Intelligence   Statistics   Actual
June 1940     1000           169          122
June 1941     1550           244          271
August 1942   1550           327          342
https://en.wikipedia.org/wiki/German_tank_problem
Slide 20
➡ Data leakage.
➡ User IDs, invoice IDs, etc.
➡ Used to approximate the number of iPhones sold in 2008.
➡ Calculate approximations of datasets with (incomplete) information.
Slide 22
➡ Avoid leaking (semi-)sequential data.
➡ Adding randomness and offsets will NOT solve the issue.
➡ Use UUIDs
(better: time-based short IDs, you don't need UUIDs)
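Both options from the list above can be sketched in Python. The UUID comes from the standard library. The `short_time_id` helper is a hypothetical scheme invented for this example (a millisecond timestamp followed by random hex digits), not a format the slides prescribe. One caveat: a time-based ID still reveals roughly when a record was created, even though it no longer reveals how many records exist.

```python
import secrets
import time
import uuid


def short_time_id(random_bytes=5):
    """Build an ID from the current time plus random bytes (hypothetical scheme)."""
    millis = int(time.time() * 1000)
    # Timestamp in hex, followed by unpredictable random hex digits.
    return f"{millis:x}{secrets.token_hex(random_bytes)}"


print(uuid.uuid4())     # random UUID, no counter inside
print(short_time_id())  # shorter, sortable by creation time
```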
Slide 23
Collecting (big) data is easy.
Analyzing big data is the hard part.
Slide 24
Confirmation Bias
Slide 25
2 4 6
Z={…,−2,−1,0,1,2,…}
Slide 26
21%
Slide 27
5 8 ? ?
If a card shows an even number on one face,
then its opposite face is blue.
Slide 28
< 10%
Slide 29
coke beer 35 17
If you drink beer
then you must be 18 yrs or older.
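The beer rule can only be broken by two of the four cards: the beer drinker (whose age is hidden) and the 17-year-old (whose drink is hidden). A minimal sketch of that reasoning, assuming each card shows either a drink as a string or an age as an integer; `must_check` and `LEGAL_AGE` are names made up for this example.

```python
LEGAL_AGE = 18


def must_check(card):
    """Return True if turning this card over could reveal a rule violation.

    The rule is "if you drink beer, then you must be 18 or older".
    A card shows either a drink (str) or an age (int); the other side is hidden.
    """
    if isinstance(card, int):
        # An age card can only break the rule if the person is under age.
        return card < LEGAL_AGE
    # A drink card can only break the rule if the drink is beer.
    return card == "beer"


cards = ["coke", "beer", 35, 17]
to_check = [card for card in cards if must_check(card)]
print(to_check)  # ['beer', 17]
```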
Slide 32
Cognitive Adaptation
for social exchange
Slide 33
hint:
Try and place your "technical problem" in a more social context.
Slide 34
BDD
Slide 35
5 8 ? ?
If a card shows an even number on one face,
then its opposite face is blue.
What's the probability of an event, based on conditions that might be related to the event?
Slide 71
What is the chance that a message is spam when it contains certain words?
Slide 72
P(A|B): probability of event A, given event B (conditional)
P(A): probability of event A
P(B): probability of event B
P(B|A): probability of event B, given event A
Slide 73
➡ Figure out the probability that a {mail, tweet, comment, review} is {spam, negative}, etc.
Slide 74
➡ 10 out of 50 comments are "negative".
➡ 25 out of 50 comments use the word "horrible".
➡ 8 comments with the word "horrible" are marked as "negative".
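The numbers above are enough to apply Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B). The sketch below takes A to be "the comment is negative" and B to be "the comment contains the word horrible", and reads the third bullet as 8 comments being both negative and containing "horrible". The variable names are made up for this example.

```python
total = 50
negative = 10
horrible = 25
horrible_and_negative = 8

p_negative = negative / total                                  # P(A)   = 0.2
p_horrible = horrible / total                                  # P(B)   = 0.5
p_horrible_given_negative = horrible_and_negative / negative   # P(B|A) = 0.8

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_negative_given_horrible = p_horrible_given_negative * p_negative / p_horrible
print(round(p_negative_given_horrible, 2))  # 0.32
```

As a cross-check, the same answer falls out of counting directly: 8 of the 25 comments containing "horrible" are negative, and 8 / 25 = 0.32.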
➡ More words?
➡ Complex algorithm,
➡ but we can assume that words are independent of each other
➡ Naive Bayes approach
Slide 79
We must know beforehand which comments are negative?
Slide 80
TRAINING SET
Slide 81
Negative:
"Your product is horrible and does not work properly. Also, you suck."
Positive:
"I had a horrible experience with another product. But yours really worked well. Thank you!"
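To make the training-set idea concrete, here is a minimal naive Bayes sketch trained on the two labelled comments above. It assumes words are independent given the label and uses add-one (Laplace) smoothing so unseen words do not zero out a score. The smoothing, the tokenizer, and every function name are choices made for this example, not something the slides specify, and two comments are far too few for real use.

```python
import math
import re
from collections import Counter

# Training set: the two labelled comments from the slide.
TRAINING_SET = [
    ("negative", "Your product is horrible and does not work properly. Also, you suck."),
    ("positive", "I had a horrible experience with another product. "
                 "But yours really worked well. Thank you!"),
]


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count documents per label and word occurrences per label."""
    doc_counts = Counter()
    word_counts = {}
    for label, text in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return doc_counts, word_counts


def classify(text, doc_counts, word_counts):
    """Return the label with the highest naive Bayes score (in log space)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log P(label)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # log P(word | label) with add-one smoothing
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


doc_counts, word_counts = train(TRAINING_SET)
print(classify("you suck", doc_counts, word_counts))                # negative
print(classify("worked well, thank you", doc_counts, word_counts))  # positive
```

Note that "horrible" appears in both comments, so on its own it carries little signal in this tiny training set.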
Slide 82
➡ You might want to filter stop-words first.
➡ You might want to make sure negations are handled properly: "not great" => negative.
➡ Bonus points if you can spot sarcasm.
Slide 83
➡ Collaborative filtering (Mahout):
➡ If a user likes products A, B and C, what is the chance that they like product D?
Slide 84
Mess up your (training) data, and nothing can save you
(except a training set reboot)
Slide 86
Binomial probability
➡ 30% chance of acceptance for a CFP
➡ 5 CFPs
1 - (0.7 * 0.7 * 0.7 * 0.7 * 0.7) = 1 - 0.168 = 0.832
83% chance of getting selected at least once!
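The same figure can be reproduced with the binomial formula. This is a small sketch that assumes the five submissions are independent and each has the same 30% acceptance chance; `binomial_pmf` is a helper written for this example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 5, 0.3
# "At least once" is the complement of "zero acceptances".
p_at_least_once = 1 - binomial_pmf(0, n, p)
print(round(p_at_least_once, 3))  # 0.832
```

The helper also answers related questions, for example `binomial_pmf(2, 5, 0.3)` for the chance of exactly two acceptances.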
Find me on Twitter: @jaytaph
Find me for development and training:
www.noxlogic.nl / www.techademy.nl
Find me on email: jthijssen@noxlogic.nl
Find me for blogs: www.adayinthelifeof.nl