
MindYourPrivacy: Design and Implementation of a Visualization System for Third-Party Web Tracking - IEEE PST 2014

ytakano
May 31, 2016

Third-party web tracking is a serious privacy issue.
Advertisement sites and social networking sites stealthily collect users' web browsing history for purposes such as targeted advertising or predicting trends.
Unfortunately, very few Internet users realize this, and their privacy has been infringed upon since they have no means of recognizing the situation.
This paper presents the design and implementation of a system called MindYourPrivacy that visualizes third-party web tracking and clarifies the entities threatening users' privacy.
The implementation adopts deep packet inspection, DNS-SOA-record-based categorization, and HTTP-referred graph analysis to visualize collectors of web browsing histories without device dependency.
To demonstrate the effectiveness of our proof-of-concept implementation, we conducted an experiment at an IT technology camp, where 129 attendees discussed IT technologies for four days.
The experiment's results revealed that visualizing web tracking effectively influences users' perception of privacy.
Analysis of the user data we collected at the camp also revealed that MCODE clustering and some graph-theoretic features are useful for detecting advertising sites that potentially collect user information by web tracking for their own purposes.

Transcript

  1. IEEE 12th Annual Conference on Privacy, Security and Trust (PST 2014)
    MindYourPrivacy: Design and Implementation of a Visualization System for Third-Party Web Tracking
    Yuuki Takano, Satoshi Ohta, Takeshi Takahashi, Ruo Ando, Tomoya Inoue

  2. Introduction
    ❖ The amount of third-party web tracking is growing each year.
      ❖ Online privacy is now a significant issue.
      ❖ SNSs and targeted ads can associate the real names of individuals with tracking information.
    ❖ We propose MindYourPrivacy to visualize third-party web tracking.
      ❖ deep-packet-inspection-based architecture
      ❖ supports heterogeneous browsers and devices
    ❖ We tested MindYourPrivacy at a workshop (WIDE Camp 2013 Autumn in Japan) with 129 attendees.
      ❖ Clustering the web graph built from user traffic helps to detect ad sites.
      ❖ Some graph-theoretic features also help to detect ad sites heuristically.

  3. Related Work
    Web Tracking Mechanism
    ❖ A third-party web tracker typically tracks users with cookies, ETags, or Flash storage.
    [Figure labels: web bug (1x1 pict); ads; social widgets; First-party Web servers; Third-party Web tracker; tracking id (cookie, Etags, flash storage, etc.); contents]

  4. platform.twitter.com
    guest_id=v1%3A135875454567229819
    twll=l%3D1363156464

  5. platform.twitter.com
    guest_id=v1%3A135875454567229819
    twll=l%3D1363156464
    YES. Twitter knows our tendency.
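The cookie values on this slide are percent-encoded. As a minimal sketch (the helper name `parse_cookie_header` is hypothetical and not part of MindYourPrivacy), a captured `Cookie` request header could be split into decoded name/value pairs like this:

```python
from urllib.parse import unquote


def parse_cookie_header(header):
    """Split a Cookie request header into a dict of decoded name/value pairs.

    Hypothetical helper for illustration; it ignores malformed pairs.
    """
    cookies = {}
    for pair in header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            # Percent-decoding turns %3A into ":" and %3D into "=".
            cookies[name] = unquote(value)
    return cookies


example = "guest_id=v1%3A135875454567229819; twll=l%3D1363156464"
decoded = parse_cookie_header(example)
```

A long, stable value such as `guest_id` is the kind of identifier a tracker can use to link requests coming from the same browser.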

  6. Related Work
    Web Tracking Detection Techniques
    ❖ ShareMeNot
      ❖ swaps links to known data-collection sites such as Facebook
      ❖ Roesner et al., “Detecting and defending against third-party tracking on the web”, USENIX NSDI 2012
    ❖ Lightbeam
      ❖ visualizes the web graph between first- and third-party sites
      ❖ https://www.mozilla.org/lightbeam/
    ❖ AdBlock Plus
      ❖ signature-based ad detection and blocking
      ❖ https://adblockplus.org/en/firefox

  7. Related Work
    Measurements
    ❖ Several researchers have reported measurements of third-party web trackers.
    ❖ One study reported the third-party trackers found within Alexa’s top 500 domains.
      ❖ Roesner et al., “Detecting and defending against third-party tracking on the web”, USENIX NSDI 2012
    [Figure: Top 20 Trackers on Alexa’s Top 500 Domains, from Roesner et al., NSDI 2012, Figure 6: “Prevalence of Trackers on Top 500 Domains. Trackers are counted on domains, i.e., if a particular tracker appears on two pages of a domain, it is counted once.”]

  8. MindYourPrivacy
    Design Principle
    ❖ We designed and implemented MindYourPrivacy, a visualization system for third-party web tracking.
      ❖ Its goal is to clearly show third-party web trackers to users.
    ❖ Design principles of MindYourPrivacy
      ❖ Independence from browsers and devices
        ❖ The variety of OSes and devices (Linux, Windows, MacOS, and smartphone OSes such as Android and iOS) complicates the problem.
        ❖ We adopt a deep-packet-inspection-based approach to support heterogeneous browsers and devices.
      ❖ Accessibility and comprehensibility of the analysis results
        ❖ easy to access: MindYourPrivacy provides analysis results as an HTML file served by an HTTP server
        ❖ easy to understand: trackers are visualized as a tag cloud, and web graph files are provided for further analysis

  9. Design and Implementation
    Web Tracker Identification Methodology (1)
    ❖ HTTP Referrer Web Graph Analysis
      ❖ generate a web graph by using the HTTP Referer header
      ❖ if a site is referred to by many other sites, MindYourPrivacy assumes that it is a suspicious site tracking users
    ❖ Domain Aggregation
      ❖ to show users which organizations track them, MindYourPrivacy aggregates domains at either the second or third level
      ❖ platform.twitter.com and platform0.twitter.com are aggregated to twitter.com
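The two steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MindYourPrivacy code: the function names are hypothetical, the input is assumed to be `(requested host, Referer URL)` pairs, and the suffix list that decides between second- and third-level aggregation is a small made-up sample.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: a tiny, incomplete sample of suffixes under which the
# organization name sits at the third level (e.g. i-mobile.co.jp).
THIRD_LEVEL_SUFFIXES = {"co.jp", "ne.jp", "or.jp", "ac.jp", "co.uk"}


def aggregate_domain(host):
    """Reduce a host name to its second- or third-level domain."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in THIRD_LEVEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def build_referrer_graph(requests):
    """Map each referred site to the set of distinct sites that refer to it."""
    incoming = defaultdict(set)
    for host, referer in requests:
        source_host = urlsplit(referer).hostname if referer else None
        if source_host is None:
            continue  # no usable Referer header, so no edge
        source, target = aggregate_domain(source_host), aggregate_domain(host)
        if source != target:  # skip first-party (self) references
            incoming[target].add(source)
    return incoming


def most_referred(incoming, top_n=5):
    """Rank sites by number of distinct referring sites, highest first."""
    ranked = sorted(incoming.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(site, len(sources)) for site, sources in ranked[:top_n]]


sample = [
    ("platform.twitter.com", "http://news.example.org/story"),
    ("platform0.twitter.com", "http://blog.example.net/post"),
    ("platform.twitter.com", "https://twitter.com/home"),
    ("spad.i-mobile.co.jp", "http://news.example.org/story"),
]
graph = build_referrer_graph(sample)
ranking = most_referred(graph)
```

Sites that end up with many distinct referrers are the candidates the slide calls suspicious; a production system would more likely rely on a maintained public-suffix list than on a hand-written set.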

  10. Design and Implementation
    Web Tracker Identification Methodology (2)
    ❖ DNS-SOA-Record-Based Grouping
      ❖ aggregate domains by DNS SOA record
      ❖ facebook.com and facebook.net are aggregated into dns.facebook.com, which is their DNS SOA record
      ❖ Krishnamurthy and Wills, “Privacy diffusion on the web: a longitudinal perspective”, WWW 2009
    ❖ Weighted Site Ranking of User Data Leakage
      ❖ MindYourPrivacy shows not only web trackers but also the sites that leak data to trackers
      ❖ leaking sites are scored; the details are omitted here (see our paper)
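SOA-based grouping can be sketched as follows. The function `group_by_soa` and its `soa_lookup` parameter are hypothetical: the lookup is injected so that the example runs without network access, and the stub table below is illustrative data, not the result of a real DNS query (only the facebook entries mirror the example on the slide).

```python
from collections import defaultdict


def group_by_soa(domains, soa_lookup):
    """Group domains by the primary name server named in their SOA record.

    soa_lookup is a caller-supplied function returning the SOA primary
    name server for a domain, or None when no record is available.
    """
    groups = defaultdict(list)
    for domain in domains:
        primary = soa_lookup(domain)
        # Fall back to the domain itself when no SOA record is known.
        key = primary.lower().rstrip(".") if primary else domain
        groups[key].append(domain)
    return dict(groups)


# Illustrative stub standing in for real DNS queries.
STUB_SOA = {
    "facebook.com": "dns.facebook.com.",
    "facebook.net": "dns.facebook.com.",
}

grouped = group_by_soa(
    ["facebook.com", "facebook.net", "example.org"], STUB_SOA.get
)
```

Injecting the lookup keeps the grouping logic testable; in a live deployment it would be backed by a DNS resolver library of the implementer's choice.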

  11. Design and Implementation
    System Model
    ❖ MindYourPrivacy captures the traffic of users’ web access
    ❖ analyzed results are shown via MindYourPrivacy’s web server
    ❖ users need not install or configure specific applications
    [Figure labels: Users; Router; The Internet; MindYourPrivacy; Web Access; Outgoing Traffic; Traffic Capture; Analyzed Result via HTTP]

  12. Design and Implementation
    Implementation Architecture
    ❖ Catenaccio DPI
      ❖ captures traffic from the network interface
      ❖ reconstructs TCP streams and stores captured data in the NoSQL DB
      ❖ written in C++
    ❖ NoSQL DB
      ❖ MongoDB is used as the database
    ❖ Tracking Analyzer
      ❖ analyzes measurement data
      ❖ written in JavaScript and Python
    ❖ HTML/Graph File Generator
      ❖ generates visualized results
      ❖ written in Python
    ❖ HTML Server
      ❖ serves HTML/Graph files to users
    [Figure labels: NW/IF; L2 Datagram; Catenaccio DPI; Measurement Data; NoSQL DB; Tracking Analyzer; Analyzed Result; HTML/Graph File Generator; HTML/Graph Files; HTML Server]

  13. Design and Implementation
    Web User Interface
    ❖ suspicious web trackers are visualized as a tag cloud
    ❖ domains are grouped by DNS SOA records
    ❖ referring sites are shown in the right pane

  14. Experiment at WIDE Camp 2013 Autumn
    ❖ We tested MindYourPrivacy at WIDE Camp 2013 Autumn.
    ❖ WIDE Camp 2013 Autumn (Sep. 10 - Sep. 13)
      ❖ a workshop for Internet researchers, operators, and developers
      ❖ 129 attendees, most of whom are either IT specialists or students majoring in IT
      ❖ every attendee agreed to the experiment (for research purposes only)
    ❖ We captured and analyzed the attendees’ web browsing traffic.

  15. Experiment
    User Traffic Analysis (1)
    ❖ We obtained 734,194 HTTP requests and 1,661 individual source IP addresses (IPv4 and IPv6).
    ❖ A directed web graph is generated by using the HTTP Referer header.
    ❖ The graph has 3,966 nodes and 12,941 edges.
    ❖ We analyze this web graph to find web trackers.

  16. Experiment
    User Traffic Analysis (2)
    ❖ To find web trackers, we extract the most-referred sites from the web graph.
    ❖ Advertisement and social sites, which tend to track users, have many incoming links.
    TABLE II: Top-five Most-referred Sites
    Site                   # of incoming links
    google-analytics.com   847
    facebook.com           437
    twitter.com            393
    doubleclick.net        380
    google.com             356

  17. Experiment
    User Traffic Analysis (3)
    ❖ We then applied a clustering technique (MCODE) to the web graph.
    ❖ As a result of clustering, many ad sites were found in the rank 1 cluster.
    [Paper excerpt, partly cropped: ad sites found in the rank 1 cluster by MCODE: doubleclick.net, amazon-adsystem.com, googleadservices.com, i-mobile.co.jp, advg.jp, adingo.jp, iogous.com, admeld.com, criteo.com.]
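MCODE itself is a Cytoscape app and is not reproduced here. One ingredient it builds on is the k-core of a graph, that is, the largest subgraph in which every node has at least k neighbours. The following is a minimal sketch of k-core extraction only, with a hypothetical function name and an undirected adjacency dictionary assumed as input; it is not the full MCODE algorithm.

```python
def k_core(adjacency, k):
    """Return the k-core of an undirected graph.

    adjacency maps each node to the set of its neighbours and is assumed
    to be symmetric. Nodes with fewer than k neighbours are removed
    repeatedly until none remain.
    """
    core = {node: set(neighbours) for node, neighbours in adjacency.items()}
    changed = True
    while changed:
        changed = False
        low_degree = [node for node, nbrs in core.items() if len(nbrs) < k]
        for node in low_degree:
            for neighbour in core.pop(node):
                if neighbour in core:
                    core[neighbour].discard(node)
            changed = True
    return core


# A triangle (a, b, c) with one pendant node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
core2 = k_core(toy, 2)
```

In the toy graph the pendant node drops out and the triangle remains, which is the kind of dense neighbourhood a clustering method would then score.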

  18. Experiment
    User Traffic Analysis (4)
    ❖ We analyzed the cluster in terms of graph-theoretic features.
    ❖ We found that the ad sites’ numbers of incoming links, numbers of outgoing links, and neighborhood connectivity are quite different from those of other sites.
      ❖ ad sites have many incoming links but few outgoing links
      ❖ ad sites’ neighborhood connectivity is relatively low
    Fig. 6: Rank 1 Cluster by MCODE (include loops = false, degree cutoff = 2, haircut = true, fluff = false, node score cutoff = 0.2, k-core = 2, and max. depth = 100)
    TABLE IV: Feature Vector of Rank 1 Cluster’s Edge (Average and Unbiased Variance)
                # of incoming links   # of outgoing links   Neighborhood connectivity
                avg.     var.         avg.     var.         avg.      var.
    ad sites    90.2     12405.4      15.2     3972.9       46.0      3972.9
    others      30.2     3972.9       29.7     569.3        130.2     5212.0
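The three features in Table IV can be computed from an edge list as sketched below. The function names are hypothetical, and neighborhood connectivity is assumed here to mean the average number of neighbours of a node's neighbours (ignoring edge direction), which may differ in detail from the tool used for the slides.

```python
from collections import defaultdict
from statistics import mean, variance


def node_features(edges):
    """Return {node: (in_links, out_links, neighborhood_connectivity)}."""
    predecessors, successors = defaultdict(set), defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    nodes = set(predecessors) | set(successors)
    # Undirected neighbourhood, excluding self-loops.
    neighbours = {n: (predecessors[n] | successors[n]) - {n} for n in nodes}
    features = {}
    for node in nodes:
        degrees = [len(neighbours[m]) for m in neighbours[node]]
        connectivity = mean(degrees) if degrees else 0.0
        features[node] = (
            len(predecessors[node]),
            len(successors[node]),
            connectivity,
        )
    return features


def summarise(values):
    """Average and unbiased (sample) variance; needs two or more values."""
    return mean(values), variance(values)


toy_edges = [("a", "ad"), ("b", "ad"), ("c", "ad"), ("ad", "a")]
toy_features = node_features(toy_edges)
```

In the toy graph the node `ad` has three incoming links, one outgoing link, and neighbours that are themselves poorly connected, which mirrors the pattern the slide reports for ad sites.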

  19. Experiment
    User Traffic Analysis (5)
    ❖ The Do Not Track (DNT) flag announces users’ wish not to be tracked to third-party trackers.
    ❖ However, only 40,650 DNT-enabled requests (40,605/734,194 ≈ 6 %) were observed.
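Counting DNT-enabled requests amounts to checking each request for the header `DNT: 1`. A minimal sketch, assuming each captured request is available as a dictionary of header names to values (the function name `dnt_ratio` is hypothetical):

```python
def dnt_ratio(requests):
    """Return the fraction of requests that carry the header DNT: 1.

    requests is an iterable of dicts mapping header names to values;
    header names are compared case-insensitively.
    """
    total = 0
    enabled = 0
    for headers in requests:
        total += 1
        normalized = {name.lower(): value.strip() for name, value in headers.items()}
        if normalized.get("dnt") == "1":
            enabled += 1
    return enabled / total if total else 0.0


sample_requests = [
    {"Host": "example.org", "DNT": "1"},
    {"Host": "example.org", "dnt": "0"},
    {"Host": "example.net"},
    {"Host": "example.net", "Dnt": " 1 "},
]
ratio = dnt_ratio(sample_requests)
```

For comparison, 40,605 / 734,194 evaluates to roughly 0.055.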

  20. Conclusion and Future Work
    ❖ We proposed MindYourPrivacy, a visualization system for third-party web tracking.
      ❖ browser- and device-independent architecture
      ❖ visualizes web trackers as a tag cloud
    ❖ We tested MindYourPrivacy at WIDE Camp 2013 Autumn and analyzed users’ web browsing traffic.
      ❖ generated a web graph from HTTP referrers and analyzed it
      ❖ revealed that graph clustering and some graph-theoretic features are useful for finding web trackers
    ❖ Future work: adopting the more sophisticated approaches revealed by the experiment, and a signature-based approach.

  21. EOF