MindYourPrivacy: Design and Implementation of a Visualization System for Third-Party Web Tracking - IEEE PST 2014

ytakano
May 31, 2016

Third-party web tracking is a serious privacy issue.
Advertisement sites and social networking sites stealthily collect users' web browsing history for purposes such as targeted advertising or predicting trends.
Unfortunately, very few Internet users realize this, and their privacy has been infringed upon since they have no means of recognizing the situation.
This paper presents the design and implementation of a system called MindYourPrivacy that visualizes third-party web tracking and clarifies the entities threatening users' privacy.
The implementation adopts deep packet inspection, DNS-SOA-record-based categorization, and HTTP-referred graph analysis to visualize collectors of web browsing histories without device dependency.
To demonstrate the effectiveness of our proof-of-concept implementation, we conducted an experiment at an IT camp where 129 attendees discussed IT technologies for four days.
The experiment's results revealed that visualizing web tracking effectively influences users' perception of privacy.
An analysis of the user data we collected at the camp also revealed that MCODE clustering and some graph-theoretic features are useful for detecting advertising sites that potentially collect user information through web tracking for their own purposes.

Transcript

  1. IEEE 12th Annual Conference on Privacy, Security and Trust (PST 2014)

    MindYourPrivacy: Design and Implementation of a Visualization System for Third-Party Web Tracking
    Yuuki Takano, Satoshi Ohta, Takeshi Takahashi, Ruo Ando, Tomoya Inoue
  2. Introduction

    ❖ The amount of third-party web tracking is growing each year.
      ❖ Online privacy is now a significant issue.
      ❖ SNSs and targeted ads can associate individuals' real names with tracking information.
    ❖ We propose MindYourPrivacy to visualize third-party web tracking and show it to users.
      ❖ Deep-packet-inspection-based architecture
      ❖ Supports heterogeneous browsers and devices
    ❖ We tested MindYourPrivacy at a workshop (WIDE Camp 2013 Autumn in Japan) with 129 attendees.
      ❖ Clustering the web graph built from user traffic helps detect ad sites.
      ❖ Some graph-theoretic features also help detect ad sites heuristically.
  3. Related Work: Web Tracking Mechanism

    ❖ A third-party web tracker typically tracks users via cookies, ETags, or Flash storage.

    [Figure: first-party web servers serve contents that embed web bugs, ads, and social widgets; the third-party web tracker keeps a tracking ID in a cookie, ETags, Flash storage, etc.]
  4. Related Work: Web Tracking Detection Techniques

    ❖ ShareMeNot
      ❖ Swaps links to known data-collection sites such as Facebook
      ❖ Roesner et al., "Detecting and defending against third-party tracking on the web", USENIX NSDI 2012
    ❖ Lightbeam
      ❖ Visualizes the web graph between first- and third-party sites
      ❖ https://www.mozilla.org/lightbeam/
    ❖ AdBlock Plus
      ❖ Signature-based ad detection and blocking
      ❖ https://adblockplus.org/en/firefox
  5. Related Work: Measurements

    ❖ Several researchers have reported on third-party web trackers.
    ❖ One study measured third-party trackers on Alexa's top 500 domains.
      ❖ Roesner et al., "Detecting and defending against third-party tracking on the web", USENIX NSDI 2012

    [Figure: Top 20 Trackers on Alexa's Top 500 Domains (Roesner et al., NSDI 2012, Figure 6: "Prevalence of Trackers on Top 500 Domains. Trackers are counted on domains, i.e., if a particular tracker appears on two pages of a domain, it is counted once.")]
  6. MindYourPrivacy: Design Principle

    ❖ We designed and implemented MindYourPrivacy, a visualization system for third-party web tracking.
      ❖ Its goal is to show third-party web trackers clearly to users.
    ❖ Design principles of MindYourPrivacy
      ❖ Independence from browsers and devices
        ❖ The variety of OSes and devices (Linux, Windows, MacOS, and smartphone OSes such as Android and iOS) complicates the problem.
        ❖ A deep-packet-inspection-based approach supports heterogeneous browsers and devices.
      ❖ Accessibility and comprehensibility of the analysis results
        ❖ Easy to access: MindYourPrivacy serves analysis results as an HTML file via an HTTP server.
        ❖ Easy to understand: trackers are visualized as a tag cloud, and the web graph file is provided for further analysis.
  7. Design and Implementation: Web Tracker Identification Methodology (1)

    ❖ HTTP Referrer Web Graph Analysis
      ❖ Generate a web graph from the HTTP referrer header.
      ❖ If a site is referred to by many other sites, MindYourPrivacy treats it as a suspicious site that may be tracking users.
    ❖ Domain Aggregation
      ❖ To show users which organizations track them, MindYourPrivacy aggregates domains at either the second or third level.
      ❖ platform.twitter.com and platform0.twitter.com are aggregated into twitter.com.
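
The two steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the tiny `TWO_LABEL_SUFFIXES` set (a stand-in for a real public-suffix list), and the sample requests are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny set of two-label public suffixes.
# It only decides whether to keep two labels or three.
TWO_LABEL_SUFFIXES = {"co.jp", "ne.jp", "or.jp", "ac.jp", "co.uk"}


def aggregate_domain(host):
    """Collapse a host name to its second- or third-level domain."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def build_referrer_graph(requests):
    """Build directed edges (referring site -> referred site).

    `requests` is an iterable of (host, referrer_url) pairs, i.e. the
    Host header and the referrer header of each captured HTTP request.
    """
    edges = set()
    for host, referrer in requests:
        if not referrer:
            continue  # request carried no referrer
        ref_host = urlsplit(referrer).hostname
        if ref_host is None:
            continue  # referrer was not an absolute URL
        src = aggregate_domain(ref_host)
        dst = aggregate_domain(host)
        if src != dst:  # ignore links within the same aggregated site
            edges.add((src, dst))
    return edges


# Hypothetical sample traffic.
sample = [
    ("platform.twitter.com", "http://news.example.com/a"),
    ("platform0.twitter.com", "http://blog.example.org/b"),
    ("www.i-mobile.co.jp", "http://news.example.com/a"),
    ("static.example.com", "http://news.example.com/a"),
    ("www.example.net", ""),
]
edges = build_referrer_graph(sample)
```

With this edge direction, a site embedded by many pages ends up with many incoming edges, which is the signal the slide describes.
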
  8. Design and Implementation: Web Tracker Identification Methodology (2)

    ❖ DNS-SOA-Record-Based Grouping
      ❖ Aggregate domains by DNS SOA record.
      ❖ facebook.com and facebook.net are aggregated into dns.facebook.com, which is their DNS SOA record.
      ❖ Krishnamurthy and Wills, "Privacy diffusion on the web: a longitudinal perspective", WWW 2009
    ❖ Weighted Site Ranking of User Data Leakage
      ❖ MindYourPrivacy shows not only web trackers but also the sites that leak to them.
      ❖ Leaking sites are scored; the details are omitted here (see our paper).
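
A minimal sketch of the SOA-based grouping follows. It assumes the SOA lookups have already been done and are available as a plain dictionary; `group_by_soa` and the table contents are hypothetical, with the Facebook entries taken from the slide's example. A real system would fill the table by querying DNS for each domain's SOA record.

```python
from collections import defaultdict


def group_by_soa(domains, soa_lookup):
    """Group domains that share the same SOA name.

    `soa_lookup` maps a domain to the name taken from its SOA record.
    Domains without an entry form a group of their own.
    """
    groups = defaultdict(list)
    for domain in sorted(set(domains)):
        groups[soa_lookup.get(domain, domain)].append(domain)
    return dict(groups)


# Hypothetical lookup results; the two entries mirror the slide's example.
soa_lookup = {
    "facebook.com": "dns.facebook.com",
    "facebook.net": "dns.facebook.com",
}
groups = group_by_soa(["facebook.net", "facebook.com", "example.org"], soa_lookup)
```

Grouping on the SOA name lets domains with unrelated spellings be attributed to the same operator, which plain suffix-based aggregation cannot do.
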
  9. Design and Implementation: System Model

    ❖ MindYourPrivacy captures the traffic of users' web access.
    ❖ Analyzed results are shown via MindYourPrivacy's web server.
    ❖ Users need not install or configure specific applications.

    [Figure: users' web access passes through a router to the Internet; MindYourPrivacy captures the outgoing traffic and returns the analyzed result to users via HTTP.]
  10. Design and Implementation: Implementation Architecture

    ❖ Catenaccio DPI
      ❖ Captures traffic from the network interface
      ❖ Reconstructs TCP streams and stores the captured data in a NoSQL DB
      ❖ Written in C++
    ❖ NoSQL DB
      ❖ MongoDB is used as the database
    ❖ Tracking Analyzer
      ❖ Analyzes the measurement data
      ❖ Written in JavaScript and Python
    ❖ HTML/Graph File Generator
      ❖ Generates the visualized results
      ❖ Written in Python
    ❖ HTML Server
      ❖ Serves the HTML/graph files to users

    [Figure: component diagram. Components: NW/IF, Catenaccio DPI, NoSQL DB, Tracking Analyzer, HTML/Graph File Generator, HTML Server. Data-flow labels: L2 datagram, measurement data, analyzed result, HTML/graph files.]
  11. Design and Implementation: Web User Interface

    ❖ Suspicious web trackers are visualized as a tag cloud.
    ❖ Domains are grouped by DNS SOA record.
    ❖ Referring sites are shown in the right pane.
  12. Experiment at WIDE Camp 2013 Autumn

    ❖ We tested MindYourPrivacy at WIDE Camp 2013 Autumn.
    ❖ WIDE Camp 2013 Autumn (Sep. 10 - Sep. 13)
      ❖ A workshop for Internet researchers, operators, and developers
      ❖ 129 attendees, most of whom were either IT specialists or students majoring in IT
      ❖ Every attendee agreed to the experiment (for research purposes only).
    ❖ We captured and analyzed the attendees' web browsing traffic.
  13. Experiment: User Traffic Analysis (1)

    ❖ We obtained 734,194 HTTP requests and 1,661 individual source IP addresses (IPv4 and IPv6).
    ❖ A directed web graph was generated from the HTTP referrer header.
      ❖ It has 3,966 nodes and 12,941 edges.
    ❖ We analyzed this web graph to find web trackers.
  14. Experiment: User Traffic Analysis (2)

    ❖ To find web trackers, we extracted the most-referred sites from the web graph.
    ❖ Advertisement and social sites, which tend to track users, have many incoming links.

    Table II: Top-Five Most-Referred Sites

    Site                    # of incoming links
    google-analytics.com    847
    facebook.com            437
    twitter.com             393
    doubleclick.net         380
    google.com              356
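
Ranking sites by incoming links amounts to counting in-degrees in the directed web graph. The sketch below assumes edges are (referring site, referred site) pairs; the function name and the toy edges are hypothetical and unrelated to the camp data.

```python
from collections import Counter


def most_referred(edges, n=5):
    """Return the n sites with the most distinct referring sites."""
    in_degree = Counter(dst for _src, dst in set(edges))
    # Sort by descending in-degree, then by name so ties are deterministic.
    ranked = sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical toy graph: three pages embed one tracker, two embed a widget.
toy_edges = [
    ("a.example", "tracker.example"),
    ("b.example", "tracker.example"),
    ("c.example", "tracker.example"),
    ("a.example", "widget.example"),
    ("b.example", "widget.example"),
    ("a.example", "b.example"),
]
top = most_referred(toy_edges, n=2)
```
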
  15. Experiment: User Traffic Analysis (3)

    ❖ We then applied a clustering technique (MCODE) to the web graph.
    ❖ As a result, many ad sites were found in the same cluster.

    [Paper excerpt shown on the slide, cropped at the edges:]
    ❖ The referred graph is provided in .dot and .sif formats, so users can download it and analyze or visualize it with tools such as Cytoscape.
    ❖ 2,309 and 2,671 requests were observed for platform.twitter.com and www.facebook.com, respectively, but only about 100 unique values per cookie (except fr on www.facebook.com, which had 397). The figure of about 100 likely reflects the number of attendees or devices.
    ❖ Applying MCODE clustering to the referred graph placed many ad sites in the rank 1 cluster: doubleclick.net, amazon-adsystem.com, googleadservices.com, i-mobile.co.jp, advg.jp, adingo.jp, iogous.com, admeld.com, criteo.com.
    ❖ Ad sites generally tend to collect user information for business purposes, so the privacy issues they present are a concern.
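
The excerpt mentions exporting the referred graph in .dot and .sif formats for external tools. A minimal sketch of such an export is shown below; the function names, the graph name `referrer_graph`, and the relation label `refers` are hypothetical choices, not taken from MindYourPrivacy.

```python
def to_dot(edges):
    """Serialize directed edges as a Graphviz DOT digraph."""
    lines = ["digraph referrer_graph {"]
    for src, dst in sorted(edges):
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


def to_sif(edges, relation="refers"):
    """Serialize directed edges as SIF lines: source, relation, target."""
    return "\n".join(f"{src}\t{relation}\t{dst}" for src, dst in sorted(edges))


# Hypothetical toy edges (referring site -> referred site).
toy_edges = {("a.example", "tracker.example"), ("b.example", "tracker.example")}
dot_text = to_dot(toy_edges)
sif_text = to_sif(toy_edges)
```

Either text can be written to a file and opened in a graph tool, where a clustering step such as MCODE can then be run interactively.
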
  16. Experiment: User Traffic Analysis (4)

    ❖ We analyzed the cluster in terms of graph-theoretic features.
    ❖ We found that ad sites differ clearly from other sites in the number of incoming links, the number of outgoing links, and neighborhood connectivity.
      ❖ Ad sites have many incoming links but few outgoing links.
      ❖ Ad sites' neighborhood connectivity is relatively low.

    Fig. 6: Rank 1 Cluster by MCODE (include loops = false, degree cutoff = 2, haircut = true, fluff = false, node score cutoff = 0.2, k-core = 2, and max. depth = 100)

    Table IV: Feature Vector of Rank 1 Cluster's Edge (Average and Unbiased Variance)

                # of incoming links    # of outgoing links    Neighborhood connectivity
                avg.     var.          avg.     var.          avg.      var.
    ad sites    90.2     12405.4       15.2     3972.9        46.0      3972.9
    others      30.2     3972.9        29.7     569.3         130.2     5212.0
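
The three features compared on this slide can be computed directly from an edge list. The sketch below is an assumption-laden illustration: it takes neighborhood connectivity to be the average number of neighbors of a node's neighbors, ignoring edge direction, which may differ in detail from the definition used by the tool that produced Table IV. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def node_features(edges):
    """Compute in-degree, out-degree and neighborhood connectivity per node."""
    preds, succs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        succs[src].add(dst)
        preds[dst].add(src)
    nodes = set(preds) | set(succs)
    # Undirected neighbor sets, excluding the node itself.
    neighbors = {n: (preds[n] | succs[n]) - {n} for n in nodes}
    features = {}
    for n in nodes:
        nbrs = neighbors[n]
        if nbrs:
            connectivity = sum(len(neighbors[m]) for m in nbrs) / len(nbrs)
        else:
            connectivity = 0.0
        features[n] = {
            "in": len(preds[n]),
            "out": len(succs[n]),
            "neighborhood_connectivity": connectivity,
        }
    return features


# Hypothetical toy graph: three sites embed one ad site.
toy_edges = [
    ("s1.example", "ad.example"),
    ("s2.example", "ad.example"),
    ("s3.example", "ad.example"),
    ("s1.example", "s2.example"),
]
features = node_features(toy_edges)
```

In the toy graph the ad site has many incoming and no outgoing links, and its neighbors are themselves sparsely connected, which mirrors the pattern the slide reports for ad sites.
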
  17. Experiment: User Traffic Analysis (5)

    ❖ The Do Not Track (DNT) flag announces to third-party trackers a user's wish not to be tracked.
    ❖ However, only 40,650 DNT-enabled requests (40,605/734,194 = 6%) were observed.
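
As a rough sketch of how such a share could be measured, the snippet below counts requests whose headers carry `DNT: 1`. The function name and the sample header dictionaries are hypothetical. The last lines redo the slide's arithmetic: the slide quotes both 40,650 and 40,605 as the count, and either value is about 5.5% of 734,194, which rounds to the stated 6%.

```python
def dnt_share(requests):
    """Return the fraction of requests that send the header DNT: 1.

    `requests` is an iterable of header dictionaries, one per request.
    """
    total = 0
    enabled = 0
    for headers in requests:
        total += 1
        # Header names are case-insensitive, so normalize them first.
        normalized = {name.lower(): value.strip() for name, value in headers.items()}
        if normalized.get("dnt") == "1":
            enabled += 1
    return enabled / total if total else 0.0


# Hypothetical sample of captured request headers.
sample = [
    {"Host": "a.example", "DNT": "1"},
    {"Host": "b.example"},
    {"Host": "c.example", "dnt": "0"},
    {"Host": "d.example"},
]
share = dnt_share(sample)

# The slide's figures: both quoted counts give roughly the same percentage.
percent_a = 100 * 40650 / 734194
percent_b = 100 * 40605 / 734194
```
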
  18. Conclusion and Future Work

    ❖ We proposed MindYourPrivacy, a visualization system for third-party web tracking.
      ❖ Browser- and device-independent architecture
      ❖ Visualizes web trackers as a tag cloud
    ❖ We tested MindYourPrivacy at WIDE Camp 2013 Autumn and analyzed users' web browsing traffic.
      ❖ Generated a web graph from the HTTP referrer header and analyzed it
      ❖ Found that graph clustering and some graph-theoretic features are useful for finding web trackers
    ❖ Future work: adopting the more sophisticated approaches revealed by the experiment, as well as a signature-based approach.