Slide 1

Slide 1 text

Visualizing the Health of the Internet
Measurement Lab + Bocoup
Irene Ros | irene@bocoup.com | @ireneros

Slide 2

Slide 2 text

Get in touch! hello@bocoup.com http://bocoup.com/datavis

Slide 3

Slide 3 text

What is Measurement Lab?
M-Lab is an open, distributed server platform for researchers to deploy active Internet measurement tools, conduct advanced networking research, and empower the public with useful information about their broadband connections.
M-Lab's data is open to anyone [1]
1. Using it isn't easy... We'll get into that.

Slide 4

Slide 4 text

Why does it exist?
The goal of M-Lab is to advance network research and empower the public with useful information about their broadband connections.
Driving challenges:
1. A lack of well-provisioned and well-connected measurement servers in geographically distributed areas.
2. The difficulty of sharing large Internet measurement datasets between different research projects.
3. Legislators' lack of broadband measurement data in their efforts to craft public policy.
http://www.measurementlab.net/publications/measurement-lab-overview.pdf

Slide 5

Slide 5 text

M-Lab has sites around the world (and growing). Each site has one or more servers.

Slide 6

Slide 6 text

Users run speed tests = data

Slide 7

Slide 7 text

What measurements matter to consumers?

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

This is what happens when you request a website that is hosted on servers somewhere far away, for example, Turkey.

Slide 10

Slide 10 text

ISP tests measure the route only as far as the ISP's first router (e.g., Speedtest.net).

Slide 11

Slide 11 text

M-Lab emulates the consumer experience much more closely by measuring the full route from the consumer to the content.

Slide 12

Slide 12 text

Goal: build a web application that can tell us what performance looks like
● at a specific location (city / region / country / continent),
● over any time period (day / month / year, over any range),
● for a specific ISP (consumer or transit).
All of the above, in any combination.

Slide 13

Slide 13 text

Who is our audience?

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

A local ISP in Zimbabwe ran a promotion in December 2016, raising bandwidth limits to attract new subscribers.

Slide 23

Slide 23 text

A regulation change in Brazil resulted in lower speeds for consumers.

Slide 24

Slide 24 text

It's all open source: https://github.com/m-lab

Slide 25

Slide 25 text

The data pipeline

Slide 26

Slide 26 text

Our data: NDT tests
https://github.com/ndt-project/ndt/wiki/NDTDataFormat
NDT reports upload and download speeds and attempts to determine what problems limit speeds.

Slide 27

Slide 27 text

Raw data, for the brave.
Raw data files:
https://console.cloud.google.com/storage/browser/m-lab/ndt
https://www.measurementlab.net/tools/ndt/

Slide 28

Slide 28 text

How much data is that?
● 819,217,639 tests in BigQuery
● All test data: 243 TB on disk, probably >100 TB of actual data
"Data so large we aren't even sure how much we have"
Source: friends with root access...

Slide 29

Slide 29 text

BigQuery Schema
https://www.measurementlab.net/data/bq/schema/
https://rawgit.com/pboothe/bigquerygenerator/master/generator.html
Relevant fields:
- Download/upload flag
- Location
- IPs for client and server
- Actual measurements
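As a rough sketch of pulling those fields out of BigQuery with the Python client: the throughput expression below follows M-Lab's documented web100 formula, but treat the table name and exact field paths as assumptions and verify them against the schema links above.

```python
# Sketch: query NDT download tests with the BigQuery Python client.
# Table and field names follow M-Lab's published web100 NDT schema,
# but are illustrative -- check the schema links above before relying on them.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials/project

QUERY = """
SELECT
  web100_log_entry.log_time AS log_time,
  connection_spec.client_geolocation.country_code AS country,
  -- bytes acked * 8 / microseconds spent sending = Mbps
  8 * web100_log_entry.snap.HCThruOctetsAcked /
      (web100_log_entry.snap.SndLimTimeRwin +
       web100_log_entry.snap.SndLimTimeCwnd +
       web100_log_entry.snap.SndLimTimeSnd) AS download_mbps
FROM `measurement-lab.ndt.web100`          -- illustrative table name
WHERE connection_spec.data_direction = 1   -- 1 = download (server-to-client)
LIMIT 1000
"""

for row in client.query(QUERY).result():
    print(row.log_time, row.country, row.download_mbps)
```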

Slide 30

Slide 30 text

https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/data/bigquery/queries/base_downloads_ip_by_hour.sql

Slide 31

Slide 31 text

https://www.measurementlab.net/data/bq/quickstart/

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Data pipeline, the simple version

Slide 36

Slide 36 text

Data pipeline, complicated version

Slide 37

Slide 37 text

A crash course in Dataflow
● Provides a simple, powerful programming model for building both batch and streaming parallel data processing pipelines in a single code stream.
● Based on the Apache Beam SDK.
● Can run on open source runtimes like Spark or Flink.
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
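As a minimal illustration of that model, here is a tiny Beam pipeline using the Python SDK (the M-Lab pipeline linked above is written against the Java SDK); the input data is made up.

```python
# Minimal Apache Beam pipeline: a PCollection flowing through chained PTransforms.
import apache_beam as beam

with beam.Pipeline() as p:  # runner defaults to DirectRunner locally
    (p
     | "Read" >> beam.Create(["10.0.0.1,25.3", "10.0.0.2,4.1"])   # fake "ip,mbps" lines
     | "Parse" >> beam.Map(lambda line: line.split(","))
     | "ToPairs" >> beam.Map(lambda parts: (parts[0], float(parts[1])))
     | "Print" >> beam.Map(print))
```

The same code runs on Cloud Dataflow by switching the runner in the pipeline options.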

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

PTransforms:
https://www.slideshare.net/VadimSolovey/dataflow-a-unified-model-for-batch-and-streaming-data-processing

Slide 40

Slide 40 text

AddLocalTimePipeline
Example pipeline transformation: adding time zones (element-wise).
https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/src/main/java/mlab/dataviz/AddLocalTimePipeline.java
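A rough element-wise sketch of the same idea in Python (the linked pipeline is Java); the timezonefinder lookup and the field names are assumptions for illustration, not what the real pipeline does.

```python
# Sketch of an element-wise "add local time" transform, assuming each element
# is a dict with a UTC epoch timestamp and client coordinates.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

import apache_beam as beam
from timezonefinder import TimezoneFinder  # assumed time-zone lookup, illustration only

class AddLocalTime(beam.DoFn):
    def setup(self):
        self.tf = TimezoneFinder()

    def process(self, row):
        tz_name = self.tf.timezone_at(lat=row["client_lat"], lng=row["client_lon"])
        if tz_name:
            utc_dt = datetime.fromtimestamp(row["log_time"], tz=timezone.utc)
            row = dict(row, local_time=utc_dt.astimezone(ZoneInfo(tz_name)).isoformat())
        yield row
```

In a pipeline this would be applied with `beam.ParDo(AddLocalTime())`.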

Slide 41

Slide 41 text

https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/src/main/java/mlab/dataviz/HistoricPipeline.java

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

https://github.com/m-lab/mlab-vis-pipeline/blob/master/dataflow/data/bigtable/client_loc_by_year.json

Slide 44

Slide 44 text

https://github.com/m-lab/mlab-vis-pipeline/tree/master/dataflow/data/bigtable
Date range queries fetch multiple keys.
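A hedged sketch of what such a date-range scan can look like with the Bigtable Python client; the project, instance, table, row-key layout, column family, and qualifier names here are assumptions for illustration, not the pipeline's actual configuration.

```python
# Sketch: when date components are encoded in the row key, a date-range query
# becomes a scan over a contiguous range of keys.
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")                 # placeholder project
table = client.instance("my-instance").table("client_loc_by_month")  # assumed names

# Row keys assumed to look like "<location>|<YYYY-MM>".
row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=b"nausma|2016-01", end_key=b"nausma|2016-12", end_inclusive=True)

for row in table.read_rows(row_set=row_set):
    # Column family and qualifier are assumptions for illustration.
    cell = row.cells["data"][b"download_speed_mbps_median"][0]
    print(row.row_key, cell.value)
```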

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Slightly less data...
● Data on disk: 3.65 TB, stored across 30 nodes.
● The pipeline runs weekly, and keys get overwritten if data changes.

Slide 47

Slide 47 text

The data API

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

*Na - North America
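As a purely hypothetical illustration of the kind of location-plus-time query the data API answers for the front end, a request might look like the sketch below; the host, path, and parameter names are invented for illustration and are not the real API routes.

```python
# Hypothetical client call to a data API serving location/time aggregates.
import requests

resp = requests.get(
    "https://viz-api.example.com/locations/nausma/time/month/metrics",  # invented route
    params={"startdate": "2016-01", "enddate": "2016-12"},               # invented params
    timeout=30,
)
resp.raise_for_status()
for point in resp.json().get("results", []):
    print(point.get("date"), point.get("download_speed_mbps_median"))
```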

Slide 51

Slide 51 text

The Front End

Slide 52

Slide 52 text

Tech Stack React + Redux + D3.js + Webpack + Karma

Slide 53

Slide 53 text

Design Process

Slide 54

Slide 54 text

Initial data analysis with sample data or aggregations to verify the data structure and confirm expected results and methods. Also useful for early algorithm prototyping. We identified and removed rogue IPs that skewed the data medians (testing IPs?).
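As an illustration only (not the project's actual cleaning step), here is one way such rogue IPs could be flagged with pandas on a sample export; the file and column names are assumptions.

```python
# Sketch: flag IPs that contribute an implausible share of all tests
# (e.g., automated test clients) and see how the median shifts without them.
import pandas as pd

tests = pd.read_csv("ndt_sample.csv")  # assumed columns: client_ip, download_mbps

counts = tests["client_ip"].value_counts()
rogue_ips = counts[counts > counts.quantile(0.999)].index  # threshold is a guess

before = tests["download_mbps"].median()
after = tests[~tests["client_ip"].isin(rogue_ips)]["download_mbps"].median()
print(f"median: {before:.2f} -> {after:.2f} Mbps after dropping {len(rogue_ips)} IPs")
```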

Slide 55

Slide 55 text

Hand-drawn sketches, translated into high-fidelity Sketch mockups used to solicit feedback. Quick to adjust navigation, component priority, and assumptions about user intent.

Slide 56

Slide 56 text

Reusable components make for quick UI extension through clearly defined parameters, minimal reliance on data structure, and no intra-component dependencies (except through the shared Redux store).

Slide 57

Slide 57 text

Challenges
● Long-running pipelines make for slow debugging.
● Not cheap infrastructure.
● Data coverage is better where the connectivity is better.
● Requires human intervention.
● We don't have the "why".
● No ability to add annotations or context.

Slide 58

Slide 58 text

Read more: https://bocoup.com/blog/visualizing-the-health-of-the-internet-with-measurement-lab
Additional open source modules by Peter Beshai:
● https://github.com/pbeshai/d3-line-chunked
● https://github.com/pbeshai/d3-interpolate-path
● https://github.com/pbeshai/react-url-query

Slide 59

Slide 59 text

Thank you
Irene Ros | irene@bocoup.com | @ireneros
Get in touch! hello@bocoup.com