
Visualizing the Health of the Internet with Measurement Lab

Measurement Lab (M-Lab)—the largest collection of open internet performance data on the planet—collects hundreds of thousands of consumer internet performance tests daily and provides that data in the public domain for research, analysis, and advocacy. Data has been piling up since 2009 (over five petabytes of information about the quality of experience on the internet), and more data is generated every day. Big data at this scale presents interesting challenges for everything from readability, visualization, and navigation to public access and affordability. The M-Lab data allows anyone to explore how their internet experience is mediated by all of the various actors that make up the internet.

In this talk, I share recent work with the M-Lab team to develop a data processing pipeline, API, and visualizations that make the data more accessible to anyone interested in exploring the open internet through consumer measurements, covering both the technical and design aspects of the project.

A live version of this presentation (with all the gif goodness) is available here: https://docs.google.com/presentation/d/1RBMzIIvfyE1NDRPJZHvFBpJ5BEg2cE21L3pUtz29qfI/pub?start=false&loop=false&delayms=60000&slide=id.p


Irene Ros

May 25, 2017

Transcript

  1. Visualizing the health of the internet Measurement Lab + Bocoup

    Irene Ros irene@bocoup.com @ireneros
  2. Get in touch! hello@bocoup.com http://bocoup.com/datavis

  3. M-Lab is an open, distributed server platform on which

    researchers can deploy active Internet measurement tools, conduct advanced networking research, and empower the public with useful information about their broadband connections. M-Lab's data is open to anyone [1] 1. Using it isn't easy... We'll get into that. What is Measurement Lab?
  4. Why does it exist? The goal of M-Lab is to

    advance network research and empower the public with useful information about their broadband connections. Driving challenges: 1. a lack of well-provisioned and well-connected measurement servers in geographically distributed areas. 2. the difficulty of sharing large Internet measurement datasets between different research projects. 3. legislators' lack of broadband measurement data in their efforts to craft public policy. http://www.measurementlab.net/publications/measurement-lab-overview.pdf
  5. M-Lab has sites around the world (and growing). Each site

    has one or more servers.
  6. Users run speed tests = data

  7. What measurements matter to consumers?

  8. None
  9. This is what happens when you request a website that

    is hosted on servers somewhere far away, for example, Turkey.
  10. ISP tests like Speedtest.net measure the route only as far as the ISP's first router.

  11. M-Lab emulates the consumer experience much better by measuring

    the full route from consumer to content.
  12. Goal - Build a web application that can tell us:

    What does performance look like • at a specific location? (which could be city/ region/ country/ continent) • over any time period? (which could be day / month / year over any period) • for a specific ISP (consumer or transit) All of the above in any combination.
  13. Who is our audience?

  14. None
  15. None
  16. None
  17. None
  18. None
  19. None
  20. None
  21. None
  22. A local ISP in Zimbabwe ran a promotion in Dec 2016

    during which it raised bandwidth limits to attract new subscribers.
  23. A regulation change in Brazil resulted in lower speeds for consumers.

  24. It's all open source: https://github.com/m-lab

  25. The data pipeline

  26. Our data: NDT Tests https://github.com/ndt-project/ndt/wiki/NDTDataFormat NDT reports upload and download

    speeds and attempts to determine what problems limit speeds.
  27. Raw data for the brave ones. Raw Data files: https://console.cloud.google.com/storage/browser/m-lab/ndt

    https://www.measurementlab.net/tools/ndt/
  28. How much data is that? Source: friends with root access...

    819,217,639 tests in BigQuery. All test data: 243 TB on disk, probably >100 TB of actual data... "Data so large we aren't even sure how much we have."
  29. BigQuery Schema https://www.measurementlab.net/data/bq/schema/ https://rawgit.com/pboothe/bigquerygenerator/master/generator.html

    Relevant fields: - Download/upload flag - Location - IPs for client and server - Actual measurements
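For orientation, here is a minimal sketch of pulling those fields with the google-cloud-bigquery Java client. The table name and field paths below are illustrative stand-ins (check the schema link above for the real ones); the project's actual base query is linked on the next slide.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.cloud.bigquery.QueryJobConfiguration;
    import com.google.cloud.bigquery.TableResult;

    public class NdtSampleQuery {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Illustrative query over stand-in table/field names; the real
        // schema is documented at the links above.
        String sql =
            "SELECT connection_spec.client_ip AS client_ip, "
          + "       connection_spec.server_ip AS server_ip, "
          + "       connection_spec.data_direction AS direction, "
          + "       web100_log_entry.log_time AS log_time "
          + "FROM `my-project.ndt.web100` "  // placeholder table name
          + "LIMIT 10";

        QueryJobConfiguration config =
            QueryJobConfiguration.newBuilder(sql).setUseLegacySql(false).build();

        TableResult result = bigquery.query(config);
        for (FieldValueList row : result.iterateAll()) {
          System.out.printf("%s -> %s (direction=%s)%n",
              row.get("client_ip").getStringValue(),
              row.get("server_ip").getStringValue(),
              row.get("direction").getStringValue());
        }
      }
    }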
  30. https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/data/bigquery/queries/base_downloads_ip_by_hour.sql

  31. https://www.measurementlab.net/data/bq/quickstart/

  32. None
  33. None
  34. None
  35. Data pipeline, the simple version

  36. Data pipeline, complicated version

  37. A crash course in Dataflow • Provides a simple, powerful

    programming model for building both batch and streaming parallel data processing pipelines in a single code stream. • Based on Apache Beam SDK • Can run on Open Source runtimes like Spark or Flink https://github.com/GoogleCloudPlatform/DataflowJavaSDK
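As a concrete (if toy) illustration of that model, a minimal batch pipeline written against the Beam Java SDK might look like the sketch below: read lines, apply an element-wise ParDo, write results. The bucket paths are placeholders, and this is not code from the M-Lab pipeline.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class MinimalPipeline {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*.csv"))
         .apply("Uppercase", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             c.output(c.element().toUpperCase());
           }
         }))
         .apply("WriteLines", TextIO.write().to("gs://my-bucket/output/result"));

        // The same code runs on the Dataflow service or on open source
        // runners like Spark and Flink, selected via pipeline options.
        p.run().waitUntilFinish();
      }
    }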
  38. None
  39. https://www.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing PTransforms:

  40. AddLocalTimePipeline: an example pipeline transformation, adding local time zones (element-wise). https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/src/main/java/mlab/dataviz/AddLocalTimePipeline.java
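The real transform lives in the file linked above; the sketch below only illustrates the element-wise idea, with hypothetical field names ("log_time", "timezone", "local_hour") and string-encoded values assumed for the sake of the example.

    import com.google.api.services.bigquery.model.TableRow;
    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.ZonedDateTime;
    import org.apache.beam.sdk.transforms.DoFn;

    // Element-wise: each row independently gets a derived local-time field.
    public class AddLocalTimeFn extends DoFn<TableRow, TableRow> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        TableRow row = c.element().clone();  // never mutate the input element
        long utcSeconds = Long.parseLong((String) row.get("log_time"));
        ZoneId zone = ZoneId.of((String) row.get("timezone"));  // e.g. "America/New_York"
        ZonedDateTime local = Instant.ofEpochSecond(utcSeconds).atZone(zone);
        row.set("local_hour", local.getHour());
        c.output(row);
      }
    }

Such a DoFn would be applied with ParDo.of(new AddLocalTimeFn()) inside a larger pipeline, one output row per input row.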

  41. https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/src/main/java/mlab/dataviz/HistoricPipeline.java

  42. None
  43. https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/data/bigtable/client_loc_by_year.json

  44. https://github.com/m-lab/mlab-vis-pipeline/tree/master/dataflow/data/bigtable Date range queries fetch multiple keys.
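Because Bigtable sorts rows lexicographically by key, encoding location and date into the row key turns a date-range query into one contiguous scan. Below is a sketch using the HBase-compatible Bigtable client; the project/instance names, table name, and key layout are illustrative, and the real key layouts live in the JSON configs linked above.

    import com.google.cloud.bigtable.hbase.BigtableConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DateRangeScan {
      public static void main(String[] args) throws Exception {
        try (Connection conn = BigtableConfiguration.connect("my-project", "my-instance");
             Table table = conn.getTable(TableName.valueOf("client_loc_by_day"))) {
          // Illustrative key layout: <location>|<yyyy-MM-dd>. All rows for one
          // location over a date range are adjacent, so a single Scan covers them.
          Scan scan = new Scan(
              Bytes.toBytes("naus|2016-01-01"),    // start row (inclusive)
              Bytes.toBytes("naus|2017-01-01"));   // stop row (exclusive)
          for (Result row : table.getScanner(scan)) {
            System.out.println(Bytes.toString(row.getRow()));
          }
        }
      }
    }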

  45. None
  46. Slightly less data... Data on disk: 3.65 TB, stored across

    30 nodes. The pipeline runs weekly, and keys get overwritten if data changes.
  47. The data API

  48. None
  49. None
  50. *Na - North America

  51. The Front End

  52. Tech Stack React + Redux + D3.js + Webpack +

    Karma
  53. Design Process

  54. Initial data analysis with sample data or aggregations to verify

    data structure and confirm expected results and methods. Also useful for early algorithm prototyping. We identified and were able to remove rogue IPs that skewed actual data medians (testing IPs?)
  55. Hand-drawn sketches, translated into high-fidelity Sketch mockups used

    for soliciting feedback. Quick to adjust navigation, component priority, and assumptions about user intent.
  56. Reusable components make for quick UI extension through clearly defined

    parameters, minimal reliance on data structure, and no inter-component dependencies (except through the shared Redux store).
  57. Challenges • Long-running pipelines make for slow debugging. •

    The infrastructure is not cheap. • Data coverage is better where the connectivity is better. • Requires human intervention. • We don't have the "why". • No ability to add annotations or context.
  58. Read more: https://bocoup.com/blog/visualizing-the-health-of-the-internet-with-measurement-lab Additional Open Source Modules by Peter Beshai:

    • https://github.com/pbeshai/d3-line-chunked • https://github.com/pbeshai/d3-interpolate-path • https://github.com/pbeshai/react-url-query
  59. Thank you Irene Ros irene@bocoup.com @ireneros Get in touch! hello@bocoup.com