Measurement Lab (M-Lab)—the largest collection of open internet performance data on the planet—collects hundreds of thousands of consumer internet performance tests daily and provides that data in the public domain for research, analysis, and advocacy. Data has been piling up since 2009 (over five petabytes of information about the quality of experience on the internet), and more data is generated every day. Big data at this scale presents interesting challenges for everything from readability, visualization, and navigation to public access and affordability. The M-Lab data allows anyone to explore how their internet experience is mediated by all of the various actors that make up the internet.
In this talk, I share recent work with the M-Lab team to develop a data processing pipeline, API, and visualizations that make the data more accessible to anyone interested in exploring the open internet through consumer measurements, covering both the technical and design aspects of the project.
To see a live version of this presentation (with all the gif goodness) you can see it here: https://docs.google.com/presentation/d/1RBMzIIvfyE1NDRPJZHvFBpJ5BEg2cE21L3pUtz29qfI/pub?start=false&loop=false&delayms=60000&slide=id.p
Visualizing the health of the internet
Measurement Lab + Bocoup
Get in touch!
M-Lab is an open, distributed server platform on which researchers deploy
active Internet measurement tools, conduct advanced networking research, and
empower the public with useful information about their broadband connections.
M-Lab's data is open to anyone.
Using it isn't easy, though... we'll get into that.
What is Measurement Lab?
Why does it exist?
The goal of M-Lab is to advance network research and empower the public with useful
information about their broadband connections.
1. A lack of well-provisioned and well-connected measurement servers in
geographically distributed areas.
2. The difficulty of sharing large Internet measurement datasets between different researchers.
3. Legislators' lack of broadband measurement data in their efforts to craft public policy.
M-Lab has sites around the world (and growing). Each site has one or more servers, and users run speed tests against them.
What measurements matter to consumers?
This is what happens when you request a website that is hosted on servers somewhere far away, for example, in Turkey.
ISP speed tests measure only the route to the ISP's first router. M-Lab emulates the real user experience much better by measuring the full path.
Goal: build a web application that can answer the question, what does performance look like
● at a specific location? (which could be city / region / country)
● over any time period? (which could be day / month / year, or any custom range)
● for a specific ISP? (consumer or transit)
All of the above, in any combination.
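As a rough sketch of how those three dimensions compose, here is a tiny query-filter builder. The parameter and field names are illustrative assumptions, not the actual M-Lab API:

```python
from datetime import date

def build_query_filters(location=None, start=None, end=None, isp=None):
    """Combine any subset of location, time period, and ISP into one
    filter dict. Every dimension is optional, so any combination works.
    (Names here are illustrative, not M-Lab's real API.)"""
    filters = {}
    if location:
        filters["location"] = location          # city / region / country
    if start and end:
        filters["start"] = start.isoformat()    # any date range
        filters["end"] = end.isoformat()
    if isp:
        filters["isp"] = isp                    # consumer or transit ISP
    return filters
```

Because each dimension is independent, the same function serves "one city over one month" and "one ISP over all time" alike.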
Who is our audience?
A local ISP in Zimbabwe ran a promotion in Dec 2016, during which they raised bandwidth limits to attract new subscribers.
A regulation change in Brazil resulted in lower speeds for consumers.
It's all open source
The data pipeline
Our data: NDT Tests
NDT reports upload and download speeds and attempts to determine what problems limit speeds.
Raw data for the brave ones.
Raw Data files: https://console.cloud.google.com/storage/browser/m-lab/ndt
How much data is that?
Source: friends with root access...
819,217,639 tests in BigQuery
All test data: 243 TB on disk
Probably >100 TB of actual data...
"Data so large we aren't even sure how much we have"
- Download/upload flag
- IPs for client and server
- Actual measurements
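The fields above can be sketched as a minimal record type. The field names are illustrative assumptions, not the actual NDT/BigQuery schema:

```python
from dataclasses import dataclass

@dataclass
class NDTTest:
    """Simplified sketch of one NDT test row (illustrative names,
    not the real schema)."""
    is_download: bool       # download/upload flag
    client_ip: str          # IP of the client running the test
    server_ip: str          # IP of the M-Lab server it hit
    throughput_mbps: float  # the actual measurement
```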
Data pipeline, the simple version
A crash course in Dataflow
● Provides a simple, powerful programming model for
building both batch and streaming parallel data processing
pipelines in a single codebase.
● Based on Apache Beam SDK
● Can run on Open Source runtimes like Spark or Flink
Example pipeline transformation: adding time zones (element-wise)
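An element-wise transform like this is just a per-row function; in the real pipeline it would run inside a Dataflow/Beam ParDo. The `timestamp` and `tz` field names are assumptions for illustration:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def add_local_time(row):
    """Element-wise transform: attach a local timestamp to a test row.

    In Dataflow this would be the body of a ParDo; each row is processed
    independently, which is what makes the step trivially parallel.
    Field names ('timestamp', 'tz') are illustrative assumptions.
    """
    utc = datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)
    row["local_time"] = utc.astimezone(ZoneInfo(row["tz"])).isoformat()
    return row
```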
Date range queries fetch multiple keys.
Slightly less data...
Data on disk: 3.65 TB, stored over 30 nodes
Pipeline runs weekly & keys get overwritten if data changes.
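One way to get both properties described above (range scans over dates, plus idempotent weekly re-runs) is to key each aggregate row by metric and day, so rewriting a period simply overwrites the same keys. The key format below is an illustrative assumption, not M-Lab's actual schema:

```python
from datetime import date, timedelta

def date_keys(metric, start, end):
    """Sketch of row-key generation for a date-range query.

    One row per (metric, day), keyed 'metric!YYYY-MM-DD': the keys for a
    date range are contiguous, so fetching a range is a simple scan, and
    the weekly pipeline re-emits the same keys, overwriting stale rows.
    (Key layout is an assumption for illustration.)
    """
    days = (end - start).days + 1
    return [f"{metric}!{(start + timedelta(days=d)).isoformat()}"
            for d in range(days)]
```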
The data API
*NA - North America
The Front End
React + Redux + D3.js
Initial data analysis with sample data
or aggregations to verify data
structure and confirm expected
results and methods. Also useful for
early algorithm prototyping.
We identified and removed rogue IPs that skewed the actual data medians (testing IPs?)
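The kind of sanity filter described here can be sketched as dropping any client IP that contributes an outsized share of tests, since one automated rig can drag the median toward its own numbers. The 1% threshold and field names are arbitrary assumptions:

```python
from collections import Counter
from statistics import median

def drop_rogue_ips(tests, max_share=0.01):
    """Drop tests from client IPs contributing an outsized share.

    A single automated test rig hammering the platform from one IP can
    skew the median; removing IPs above a share threshold restores it.
    (Threshold and field names are illustrative assumptions.)
    """
    counts = Counter(t["client_ip"] for t in tests)
    limit = max_share * len(tests)
    return [t for t in tests if counts[t["client_ip"]] <= limit]
```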
Hand-drawn sketches were translated into high-fidelity Sketch mockups. These made it quick to adjust navigation, component priority, and assumptions about user intent.
Reusable components make for quick UI extension through clearly defined parameters, minimal reliance on data structure, and no intra-component dependency (except through Redux).
● Long running pipelines make for slow debugging.
● Not cheap infrastructure.
● Data coverage is better where the connectivity is better.
● Requires human intervention.
● We don't have the "why".
● No ability to add annotations or context.
Read more: https://bocoup.com/blog/visualizing-the-health-of-the-internet-with-measurement-lab
Additional Open Source Modules by Peter Beshai: