
Visualizing the Health of the Internet with Measurement Lab

Measurement Lab (M-Lab)—the largest collection of open internet performance data on the planet—collects hundreds of thousands of consumer internet performance tests daily and provides that data in the public domain for research, analysis, and advocacy. Data has been piling up since 2009 (over five petabytes of information about the quality of experience on the internet), and more data is generated every day. Big data at this scale presents interesting challenges for everything from readability, visualization, and navigation to public access and affordability. The M-Lab data allows anyone to explore how their internet experience is mediated by all of the various actors that make up the internet.

In this talk, I share recent work with the M-Lab team to develop a data processing pipeline, API, and visualizations that make the data more accessible to anyone interested in exploring the open internet through consumer measurements, covering both the technical and design aspects of the project.

A live version of this presentation (with all the gif goodness) is available here: https://docs.google.com/presentation/d/1RBMzIIvfyE1NDRPJZHvFBpJ5BEg2cE21L3pUtz29qfI/pub?start=false&loop=false&delayms=60000&slide=id.p


Irene Ros

May 25, 2017

Transcript

  1. Visualizing the health of the internet Measurement Lab + Bocoup

    Irene Ros irene@bocoup.com @ireneros
  2. Get in touch! hello@bocoup.com http://bocoup.com/datavis

  3. M-Lab is an open, distributed server platform on which

    researchers can deploy active Internet measurement tools, conduct advanced networking research, and empower the public with useful information about their broadband connections. M-Lab's data is open to anyone [1] 1. Using it isn't easy... We'll get into that. What is Measurement Lab?
  4. Why does it exist? The goal of M-Lab is to

    advance network research and empower the public with useful information about their broadband connections. Driving challenges: 1. a lack of well-provisioned and well-connected measurement servers in geographically distributed areas. 2. the difficulty of sharing large Internet measurement datasets between different research projects. 3. legislators' lack of broadband measurement data in their efforts to craft public policy. http://www.measurementlab.net/publications/measurement-lab-overview.pdf
  5. M-Lab has sites around the world (and growing). Each site

    has one or more servers.
  6. Users run speed tests = data

  7. What measurements matter to consumers?

  8. None
  9. This is what happens when you request a website that

    is hosted on servers somewhere far away, for example, Turkey.
  10. ISP tests like Speedtest.net measure the route only as far as the ISP's first router.

  11. M-Lab emulates the consumer experience much better by measuring

    the full route from consumer to content.
  12. Goal - Build a web application that can tell us:

    What does performance look like • at a specific location? (which could be city/ region/ country/ continent) • over any time period? (which could be day / month / year over any period) • for a specific ISP (consumer or transit) All of the above in any combination.
  13. Who is our audience?

  14. None
  15. None
  16. None
  17. None
  18. None
  19. None
  20. None
  21. None
  22. A local ISP in Zimbabwe ran a promotion in Dec 2016

    during which it raised bandwidth limits to attract new subscribers.
  23. A regulation change in Brazil resulted in lower speeds for consumers.

  24. It's all open source: https://github.com/m-lab

  25. The data pipeline

  26. Our data: NDT Tests https://github.com/ndt-project/ndt/wiki/NDTDataFormat NDT reports upload and download

    speeds and attempts to determine what problems limit speeds.
  27. Raw data for the brave ones. Raw Data files: https://console.cloud.google.com/storage/browser/m-lab/ndt

    https://www.measurementlab.net/tools/ndt/
  28. How much data is that? Source: friends with root access...

    819,217,639 tests in BigQuery. All test data: 243 TB on disk, probably >100 TB of actual data... "Data so large we aren't even sure how much we have."
  29. BigQuery Schema https://www.measurementlab.net/data/bq/schema/ https://rawgit.com/pboothe/bigquerygenerator/master/generator.html

    Relevant fields: - Download/upload flag - Location - IPs for client and server - Actual measurements
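For orientation, here is a minimal sketch of pulling those fields with the google-cloud-bigquery Java client. The table name and field paths below are illustrative stand-ins (check the schema link above for the real ones); the project's actual base query is linked on the next slide.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.TableResult;

    public class NdtSampleQuery {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Illustrative query over stand-in table/field names; the real
        // schema is documented at the links above.
        String sql =
            "SELECT connection_spec.client_ip AS client_ip, "
          + "       connection_spec.server_ip AS server_ip, "
          + "       connection_spec.data_direction AS direction, "
          + "       web100_log_entry.log_time AS log_time "
          + "FROM `my-project.ndt.web100` "  // placeholder table name
          + "LIMIT 10";

        QueryJobConfiguration config =
            QueryJobConfiguration.newBuilder(sql).setUseLegacySql(false).build();

        TableResult result = bigquery.query(config);
        for (FieldValueList row : result.iterateAll()) {
          System.out.printf("%s -> %s (direction=%s)%n",
              row.get("client_ip").getStringValue(),
              row.get("server_ip").getStringValue(),
              row.get("direction").getStringValue());
        }
      }
    }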
  30. https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/data/bigquery/queries/base_downloads_ip_by_hour.sql

  31. https://www.measurementlab.net/data/bq/quickstart/

  32. None
  33. None
  34. None
  35. Data pipeline, the simple version

  36. Data pipeline, complicated version

  37. A crash course in Dataflow • Provides a simple, powerful

    programming model for building both batch and streaming parallel data processing pipelines in a single code stream. • Based on Apache Beam SDK • Can run on Open Source runtimes like Spark or Flink https://github.com/GoogleCloudPlatform/DataflowJavaSDK
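As a concrete (if toy) illustration of that model, a minimal batch pipeline written against the Beam Java SDK might look like the sketch below: read lines, apply an element-wise ParDo, write results. The bucket paths are placeholders, and this is not code from the M-Lab pipeline.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class MinimalPipeline {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.csv"))
         .apply("Uppercase", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             c.output(c.element().toUpperCase());
           }
         }))
         .apply("WriteLines", TextIO.write().to("gs://my-bucket/output/result"));

        // The same code runs on the Dataflow service or on open source
        // runners like Spark and Flink, selected via pipeline options.
        p.run().waitUntilFinish();
      }
    }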
  38. None
  39. https://www.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing PTransforms:

  40. AddLocalTimePipeline: an example pipeline transformation, adding local time zones (element-wise). https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/src/main/java/mlab/dataviz/AddLocalTimePipeline.java
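The real transform lives in the file linked above; the sketch below only illustrates the element-wise idea, with hypothetical field names ("log_time", "timezone", "local_hour") and string-encoded values assumed for the sake of the example.

    import com.google.api.services.bigquery.model.TableRow;
    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.ZonedDateTime;
    import org.apache.beam.sdk.transforms.DoFn;

    // Element-wise: each row independently gets a derived local-time field.
    public class AddLocalTimeFn extends DoFn<TableRow, TableRow> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        TableRow row = c.element().clone();  // never mutate the input element
        long utcSeconds = Long.parseLong((String) row.get("log_time"));
        ZoneId zone = ZoneId.of((String) row.get("timezone"));  // e.g. "America/New_York"
        ZonedDateTime local = Instant.ofEpochSecond(utcSeconds).atZone(zone);
        row.set("local_hour", local.getHour());
        c.output(row);
      }
    }

Such a DoFn would be applied with ParDo.of(new AddLocalTimeFn()) inside a larger pipeline, one output row per input row.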

  41. https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/src/main/java/mlab/dataviz/HistoricPipeline.java

  42. None
  43. https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/data/bigtable/client_loc_by_year.json

  44. https://github.com/m-lab/mlab-vis-pipeline/tree/master/dataflow/data/bigtable Date range queries fetch multiple keys.
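Because Bigtable sorts rows lexicographically by key, encoding location and date into the row key turns a date-range query into one contiguous scan. Below is a sketch using the HBase-compatible Bigtable client; the project/instance names, table name, and key layout are illustrative, and the real key layouts live in the JSON configs linked above.

    import com.google.cloud.bigtable.hbase.BigtableConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DateRangeScan {
      public static void main(String[] args) throws Exception {
        try (Connection conn = BigtableConfiguration.connect("my-project", "my-instance");
             Table table = conn.getTable(TableName.valueOf("client_loc_by_day"))) {
          // Illustrative key layout: <location>|<yyyy-MM-dd>. All rows for one
          // location over a date range are adjacent, so a single Scan covers them.
          Scan scan = new Scan(
              Bytes.toBytes("naus|2016-01-01"),    // start row (inclusive)
              Bytes.toBytes("naus|2017-01-01"));   // stop row (exclusive)
          for (Result row : table.getScanner(scan)) {
            System.out.println(Bytes.toString(row.getRow()));
          }
        }
      }
    }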

  45. None
  46. Slightly less data... Data on disk: 3.65 TB, stored across

    30 nodes. The pipeline runs weekly, and keys get overwritten if data changes.
  47. The data API

  48. None
  49. None
  50. *Na - North America

  51. The Front End

  52. Tech Stack React + Redux + D3.js + Webpack +

    Karma
  53. Design Process

  54. Initial data analysis with sample data or aggregations to verify

    data structure and confirm expected results and methods. Also useful for early algorithm prototyping. We identified and were able to remove rogue IPs that skewed actual data medians (testing IPs?)
  55. Hand-drawn sketches, translated into high-fidelity Sketch mockups used

    for soliciting feedback. Quick to adjust navigation, component priority, and assumptions about user intent.
  56. Reusable components make for quick UI extension through clearly defined

    parameters, minimal reliance on data structure, and no inter-component dependencies (except through the shared Redux store).
  57. Challenges • Long-running pipelines make for slow debugging. •

    The infrastructure is not cheap. • Data coverage is better where the connectivity is better. • Requires human intervention. • We don't have the "why". • No ability to add annotations or context.
  58. Read more: https://bocoup.com/blog/visualizing-the-health-of-the-internet-with-measurement-lab Additional Open Source Modules by Peter Beshai:

    • https://github.com/pbeshai/d3-line-chunked • https://github.com/pbeshai/d3-interpolate-path • https://github.com/pbeshai/react-url-query
  59. Thank you Irene Ros irene@bocoup.com @ireneros Get in touch! hello@bocoup.com