Jan Stępień - Tracking those who Track
Munich DataGeeks
July 02, 2013
Talk by Jan Stępień at the first Munich DataGeeks Meetup
Date: 02.07.2013
Transcript
Tracking those who track us
Jan Stępień
My name is Jan Stępień and I come from Warsaw
Data analysis is not just big data
Data analysis is fun
It all started with: ads, tracking, “like” buttons, other irrelevant things
1. Use an adblock plugin
2. Block all network communication to unwelcome domains
[Diagram, shown on two slides: My machine, website.com, ads.website.com]
Let’s capture all those requests!
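The slides do not say how the requests were captured. One plausible setup, assumed here, is to point the unwelcome domains at the local machine and let a tiny web server write every incoming request to SQLite. The field names in the talk look like CGI-style environment keys, and WSGI uses the same names in upper case. The table name `requests`, the column subset, and the helpers `open_db`, `record_request` and `make_app` are all hypothetical.

```python
import sqlite3
import time

# Hypothetical subset of the columns listed in the talk.
FIELDS = ["http_host", "http_referer", "http_user_agent",
          "request_method", "path_info", "query_string", "remote_addr"]


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) requests table."""
    conn = sqlite3.connect(path)
    columns = ", ".join(f"{name} TEXT" for name in FIELDS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS requests ({columns}, timestamp REAL)")
    return conn


def record_request(conn, environ, now=None):
    """Store one request; WSGI keys are the upper-case form of our columns."""
    values = [environ.get(name.upper()) for name in FIELDS]
    values.append(time.time() if now is None else now)
    placeholders = ", ".join("?" for _ in values)
    conn.execute(f"INSERT INTO requests VALUES ({placeholders})", values)
    conn.commit()


def make_app(conn):
    """Build a WSGI app that logs each request and answers with an empty 204."""
    def app(environ, start_response):
        record_request(conn, environ)
        start_response("204 No Content", [])
        return [b""]
    return app
```

Such an app could be served with `wsgiref.simple_server.make_server` from the standard library. Binding to port 80 normally needs elevated privileges, and HTTPS requests would not be captured by this sketch.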
03.2012 – 06.2013
106 414 requests
322 distinct days
approx. 330 requests per day
SQLite3 + Incanter + R + Weka
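Summary figures such as the request count, the number of distinct days and the daily average can be read straight out of SQLite. The sketch below assumes a table named `requests` with a Unix-epoch `timestamp` column; both names are assumptions, since the talk does not show its schema.

```python
import sqlite3


def summarize(conn):
    """Return (total requests, distinct days, average requests per active day)."""
    total, days = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT date(timestamp, 'unixepoch')) "
        "FROM requests"
    ).fetchone()
    return total, days, (total / days if days else 0.0)
```

With the numbers from the talk, 106414 requests over 322 days give about 330.5 requests per day, consistent with the "approx. 330" on the slide.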
http_if_none_match, http_referer, http_accept_encoding, http_accept, http_cookie, http_connection, http_host, http_user_agent, http_version, path_info, http_accept_charset, http_accept_language, http_cache_control, http_if_modified_since, request_method, request_path, request_uri, query_string, remote_host, remote_addr, script_name, server_name, server_port, server_protocol, http_dnt, timestamp
timestamp
[Chart: requests per month, 03.2012 to 06.2013; y-axis 0 to 15k]
[Chart: requests per hour of day, 00 to 23; y-axis 0 to 500]
[Chart: requests per day of week, Mo to Su; y-axis 0 to 8k]
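Breakdowns by month, hour of day and day of week can be produced by bucketing the timestamps. This is a minimal sketch that assumes Unix-epoch timestamps interpreted as UTC; the talk does not state which time zone or tool was used, and `histograms` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timezone

WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]


def histograms(timestamps):
    """Count requests per calendar month, hour of day and weekday."""
    per_month, per_hour, per_weekday = Counter(), Counter(), Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        per_month[moment.strftime("%Y-%m")] += 1
        per_hour[moment.hour] += 1
        per_weekday[WEEKDAYS[moment.weekday()]] += 1
    return per_month, per_hour, per_weekday
```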
http_host
www.google-analytics.com         36197
static.adzerk.net                13983
edge.quantserve.com              11659
www.facebook.com                  9641
ad.doubleclick.net                3822
pagead2.googlesyndication.com     3764
s.youtube.com                     2173
b.scorecardresearch.com           1974
pubads.g.doubleclick.net          1465
googleads.g.doubleclick.net       1231
48.9% of requests sent to domains owned by Google
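A share like this can be computed by matching each host against a list of domains owned by one company. The suffix list below is an assumption and certainly incomplete; the talk does not say which domains were attributed to Google. The counts are the ten hosts shown on the slide.

```python
from collections import Counter

# Top ten hosts as listed on the slide.
TOP_HOSTS = Counter({
    "www.google-analytics.com": 36197,
    "static.adzerk.net": 13983,
    "edge.quantserve.com": 11659,
    "www.facebook.com": 9641,
    "ad.doubleclick.net": 3822,
    "pagead2.googlesyndication.com": 3764,
    "s.youtube.com": 2173,
    "b.scorecardresearch.com": 1974,
    "pubads.g.doubleclick.net": 1465,
    "googleads.g.doubleclick.net": 1231,
})

# Assumed, incomplete list of Google-owned domain suffixes (not from the talk).
GOOGLE_SUFFIXES = ("google-analytics.com", "doubleclick.net",
                   "googlesyndication.com", "youtube.com", "google.com")


def is_google(host):
    """True if the host equals or is a subdomain of an assumed Google domain."""
    return any(host == s or host.endswith("." + s) for s in GOOGLE_SUFFIXES)


def google_requests(host_counts):
    """Sum the request counts of all hosts attributed to Google."""
    return sum(n for host, n in host_counts.items() if is_google(host))
```

The six Google hosts in the top ten account for 48652 of the 106414 requests, about 45.7%. The 48.9% quoted in the talk presumably also counts Google hosts outside the top ten.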
http_referer
22902 distinct referrers
4692 distinct domains
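Counting distinct referrers and the domains behind them takes only a URL parser. In this sketch "domain" means the full hostname of the referrer URL, which is an assumption; the talk may have grouped hosts differently. `referrer_stats` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def referrer_stats(referrers):
    """Return (number of distinct referrer URLs, number of distinct hostnames)."""
    distinct = {r for r in referrers if r}
    domains = {urlsplit(r).hostname for r in distinct}
    domains.discard(None)  # referrers without a parsable host
    return len(distinct), len(domains)
```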
Let’s try to combine this dataset with something else
Weather influence?
ogimet.com
Humidity, min/max/avg temperature, cloud coverage, visibility, rain/snow, wind speed/direction, etc.
No correlations!
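The talk does not say which correlation measure was used. A common first check is the Pearson coefficient between the daily request count and a daily weather value; the sketch below is that generic formula and not necessarily what was run for the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Values near zero for every weather variable would match the "No correlations!" result, while values near 1 or -1 would indicate a strong linear relationship.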
Tags at stackoverflow.com
http://stackoverflow.com/questions/123/title
data.stackexchange.com
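Question URLs of this form carry a numeric question ID, which can be extracted from the referrers and then looked up on data.stackexchange.com to obtain the tags. Only the extraction step is sketched here; the pattern and the helper `question_ids` are assumptions, as the talk does not show how the join was done.

```python
import re

# Matches URLs such as http://stackoverflow.com/questions/123/title
QUESTION_URL = re.compile(r"^https?://stackoverflow\.com/questions/(\d+)(?:/|$)")


def question_ids(referrers):
    """Return the sorted, de-duplicated question IDs found in referrer URLs."""
    ids = set()
    for referrer in referrers:
        match = QUESTION_URL.match(referrer or "")
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)
```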
Thanks, wordle.net!
Can my WWW traffic be grouped into clusters?
1. Group requests into 15 minute intervals
2. Count domains per interval
5008 intervals
Each described by over 4500 values
1. Select requests from popular domains
2. Group requests into 15 minute intervals
3. Count domains per interval
5008 intervals
Each described by 95 values
Only 2% of cells with non-zero values
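The three steps above can be sketched as follows. The popularity threshold `min_requests` is hypothetical, since the talk does not say how the 95 popular domains were chosen, and intervals without any request are left out, which is also an assumption.

```python
from collections import Counter, defaultdict

INTERVAL_SECONDS = 15 * 60


def interval_matrix(requests, min_requests=1):
    """Build an interval x domain count matrix from (unix_timestamp, host) pairs."""
    requests = list(requests)
    popularity = Counter(host for _, host in requests)
    popular = sorted(h for h, n in popularity.items() if n >= min_requests)
    keep = set(popular)
    counts = defaultdict(Counter)
    for ts, host in requests:
        if host in keep:
            counts[int(ts) // INTERVAL_SECONDS][host] += 1
    intervals = sorted(counts)
    matrix = [[counts[i][h] for h in popular] for i in intervals]
    return intervals, popular, matrix


def density(matrix):
    """Fraction of cells holding a non-zero count."""
    cells = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value)
    return nonzero / cells if cells else 0.0
```

Each row of the matrix describes one 15 minute interval and each column one domain, so `density` corresponds to the share of non-zero cells mentioned on the slide.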
Principal Component Analysis
95 domains → 16 descriptors
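To illustrate the kind of reduction PCA performs, here is a dependency-free sketch that finds the leading principal components by power iteration on the covariance matrix and projects the rows onto them. It is not the implementation used in the talk, and it does not reproduce how the number 16 was chosen; a numerical library would be the practical choice for a matrix of this size.

```python
import random
from math import sqrt


def _matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def pca(rows, n_components, iterations=500, seed=0):
    """Return (components, projected rows) using power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        start_norm = sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iterations):
            w = _matvec(cov, v)
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return components, projected
```

The sign of each component is arbitrary, so only the direction of a component is meaningful.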
X-means: a K-means-based clustering algorithm
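X-means extends K-means by also estimating the number of clusters. The sketch below shows only the underlying plain K-means step (Lloyd's algorithm) with a fixed `k`; it is not X-means and not the implementation used in the talk. The random initialisation and the helper name `kmeans` are assumptions.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means: return (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```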
cluster 0: 1268
cluster 1: 702
cluster 2: 651
cluster 3: 2387
What is the meaning behind these clusters?
cluster 3: stackoverflow.com
cluster 2: reddit.com, redditmedia.com, bbc.co.uk
cluster 1: linkedin.com, dictionary.reference.com, meetup.com
cluster 0: rubyonrails.pl, developer.android.com, tex.stackexchange.com, amazon.com, youtube.com
How accurate is this clustering?
Let’s build a classifier on the original data
   0    1    2    3   ← classified as
1188   29   11   40   cluster 0
  47  654    1    0   cluster 1
  10    1  622   18   cluster 2
  50    0   18 2319   cluster 3
cluster 0: rubyonrails.pl, developer.android.com, amazon.com, youtube.com
cluster 1: linkedin.com, dictionary.reference.com, meetup.com
cluster 2: reddit.com, redditmedia.com, bbc.co.uk
cluster 3: stackoverflow.com
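The confusion matrix from the slide can be summarised as an overall accuracy and a per-cluster recall. The sketch assumes that rows are the clusters assigned by the clustering and columns are the classifier's predictions, as the "classified as" label suggests; the row sums equal the cluster sizes 1268, 702, 651 and 2387 given earlier.

```python
# Confusion matrix from the slide: rows = actual cluster, columns = predicted.
CONFUSION = [
    [1188, 29, 11, 40],
    [47, 654, 1, 0],
    [10, 1, 622, 18],
    [50, 0, 18, 2319],
]


def accuracy(matrix):
    """Share of intervals on the diagonal, i.e. classified into their own cluster."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_cluster(matrix):
    """For each cluster, the share of its intervals that were classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]
```

For this matrix, 4783 of 5008 intervals lie on the diagonal, an accuracy of about 95.5%.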
Let’s wrap up
Data analysis is not just big data
Data analysis is fun
Thank you very much
The picture of Warsaw is © Dennis Jarvis 2009