Measurement Lab (M-Lab)—the largest collection of open internet performance data on the planet—collects hundreds of thousands of consumer internet performance tests daily and provides that data in the public domain for research, analysis, and advocacy. Data has been piling up since 2009 (over five petabytes of information about the quality of experience on the internet), and more data is generated every day. Big data at this scale presents interesting challenges for everything from readability, visualization, and navigation to public access and affordability. The M-Lab data allows anyone to explore how their internet experience is mediated by all of the various actors that make up the internet.
In this talk, I share recent work with the M-Lab team to develop a data processing pipeline, API, and visualizations that make the data more accessible to anyone interested in exploring the open internet through consumer measurements, covering both the technical and design aspects of the project.
To see a live version of this presentation (with all the gif goodness) you can see it here: https://docs.google.com/presentation/d/1RBMzIIvfyE1NDRPJZHvFBpJ5BEg2cE21L3pUtz29qfI/pub?start=false&loop=false&delayms=60000&slide=id.p
Visualizing the health of the internet
Measurement Lab + Bocoup
Get in touch!
M-Lab is an open, distributed server platform on which researchers deploy
active Internet measurement tools, conduct advanced networking research, and
empower the public with useful information about their broadband connections.
M-Lab's data is open to anyone.
Using it isn't easy, though... we'll get into that.
What is Measurement Lab?
Why does it exist?
The goal of M-Lab is to advance network research and empower the public with useful
information about their broadband connections.
1. A lack of well-provisioned and well-connected measurement servers in
geographically distributed areas.
2. The difficulty of sharing large Internet measurement datasets between different researchers.
3. Legislators' lack of broadband measurement data in their efforts to craft public policy.
M-Lab has sites around the world (and growing). Each site has one or more servers, and users run speed tests against them.
What measurements matter to consumers?
This is what happens when you request a website that is hosted on servers somewhere far away, for example, in Turkey.
ISP speed tests measure only the route to the ISP's first router. M-Lab emulates the real user experience much better by measuring the full path.
Goal: build a web application that can answer the question, what does performance look like
● at a specific location? (which could be city / region / country)
● over any time period? (which could be day / month / year, or any custom range)
● for a specific ISP? (consumer or transit)
All of the above, in any combination.
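As a rough sketch of how those three dimensions compose, here is a tiny query-filter builder. The parameter and field names are illustrative assumptions, not the actual M-Lab API:

```python
from datetime import date

def build_query_filters(location=None, start=None, end=None, isp=None):
    """Combine any subset of location, time period, and ISP into one
    filter dict. Every dimension is optional, so any combination works.
    (Names here are illustrative, not M-Lab's real API.)"""
    filters = {}
    if location:
        filters["location"] = location          # city / region / country
    if start and end:
        filters["start"] = start.isoformat()    # any date range
        filters["end"] = end.isoformat()
    if isp:
        filters["isp"] = isp                    # consumer or transit ISP
    return filters
```

Because each dimension is independent, the same function serves "one city over one month" and "one ISP over all time" alike.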
Who is our audience?
A local ISP in Zimbabwe ran a promotion in Dec 2016, during which they raised bandwidth limits to attract new subscribers.
A regulation change in Brazil resulted in lower speeds for consumers.
It's all open source
The data pipeline
Our data: NDT Tests
NDT reports upload and download speeds and attempts to determine what problems limit speeds.
Raw data for the brave ones.
Raw Data files: https://console.cloud.google.com/storage/browser/m-lab/ndt
How much data is that?
Source: friends with root access...
819,217,639 tests in BigQuery
All test data: 243 TB on disk
Probably >100 TB of actual data...
"Data so large we aren't even sure how much we have"
- Download/upload flag
- IPs for client and server
- Actual measurements
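The fields above can be sketched as a minimal record type. The field names are illustrative assumptions, not the actual NDT/BigQuery schema:

```python
from dataclasses import dataclass

@dataclass
class NDTTest:
    """Simplified sketch of one NDT test row (illustrative names,
    not the real schema)."""
    is_download: bool       # download/upload flag
    client_ip: str          # IP of the client running the test
    server_ip: str          # IP of the M-Lab server it hit
    throughput_mbps: float  # the actual measurement
```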
Data pipeline, the simple version
A crash course in Dataflow
● Provides a simple, powerful programming model for
building both batch and streaming parallel data processing
pipelines in a single codebase.
● Based on Apache Beam SDK
● Can run on Open Source runtimes like Spark or Flink
Example pipeline transformation: adding time zones (element-wise)
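An element-wise transform like this is just a per-row function; in the real pipeline it would run inside a Dataflow/Beam ParDo. The `timestamp` and `tz` field names are assumptions for illustration:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def add_local_time(row):
    """Element-wise transform: attach a local timestamp to a test row.

    In Dataflow this would be the body of a ParDo; each row is processed
    independently, which is what makes the step trivially parallel.
    Field names ('timestamp', 'tz') are illustrative assumptions.
    """
    utc = datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)
    row["local_time"] = utc.astimezone(ZoneInfo(row["tz"])).isoformat()
    return row
```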
Date range queries fetch multiple keys.
Slightly less data...
Data on disk: 3.65 TB, stored over 30 nodes
Pipeline runs weekly & keys get overwritten if data changes.
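One way to get both properties described above (range scans over dates, plus idempotent weekly re-runs) is to key each aggregate row by metric and day, so rewriting a period simply overwrites the same keys. The key format below is an illustrative assumption, not M-Lab's actual schema:

```python
from datetime import date, timedelta

def date_keys(metric, start, end):
    """Sketch of row-key generation for a date-range query.

    One row per (metric, day), keyed 'metric!YYYY-MM-DD': the keys for a
    date range are contiguous, so fetching a range is a simple scan, and
    the weekly pipeline re-emits the same keys, overwriting stale rows.
    (Key layout is an assumption for illustration.)
    """
    days = (end - start).days + 1
    return [f"{metric}!{(start + timedelta(days=d)).isoformat()}"
            for d in range(days)]
```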
The data API
*NA - North America
The Front End
React + Redux + D3.js
Initial data analysis with sample data
or aggregations to verify data
structure and confirm expected
results and methods. Also useful for
early algorithm prototyping.
We identified and removed rogue IPs that skewed the actual data medians (testing IPs?)
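The kind of sanity filter described here can be sketched as dropping any client IP that contributes an outsized share of tests, since one automated rig can drag the median toward its own numbers. The 1% threshold and field names are arbitrary assumptions:

```python
from collections import Counter
from statistics import median

def drop_rogue_ips(tests, max_share=0.01):
    """Drop tests from client IPs contributing an outsized share.

    A single automated test rig hammering the platform from one IP can
    skew the median; removing IPs above a share threshold restores it.
    (Threshold and field names are illustrative assumptions.)
    """
    counts = Counter(t["client_ip"] for t in tests)
    limit = max_share * len(tests)
    return [t for t in tests if counts[t["client_ip"]] <= limit]
```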
Hand-drawn sketches were translated into high-fidelity Sketch mockups. These made it quick to adjust navigation, component priority, and assumptions about user intent.
Reusable components make for quick UI extension through clearly defined parameters, minimal reliance on data structure, and no intra-component dependency (except through Redux).
● Long running pipelines make for slow debugging.
● Not cheap infrastructure.
● Data coverage is better where the connectivity is better.
● Requires human intervention.
● We don't have the "why".
● No ability to add annotations or context.
Read more: https://bocoup.com/blog/visualizing-the-health-of-the-internet-with-measurement-lab
Additional Open Source Modules by Peter Beshai: