Predictive Analytics, Cities, and Public Health

TOM SCHENK JR., KPMG
@TOMSCHENKJR

IN CHICAGO, WE BELIEVE THAT THE POWER OF TECHNOLOGY IS DRIVEN BY THE PEOPLE WHO USE AND BENEFIT FROM IT.
Source: techplan.cityofchicago.org

Residents and city staff report potholes through the 311 system, and the data are then published on the City’s #opendata portal, which is updated daily. data.cityofchicago.org

Chicago has released more #opendata, including important items such as
red light and speed camera violations, problem landlords, and public chauffeurs. data.cityofchicago.org/view/caas-knxs

#OPENDATA PROVIDES A MEANS TO CREATE AN ECOSYSTEM AROUND DATA,
WHICH INCLUDES MULTIPLE STAKEHOLDERS AND INITIATIVES THAT EXTEND BEYOND TRANSPARENCY.

Civic Tech Community
Chicago has a large, vibrant, and productive civic community. It is led by Chicago residents interested in technology and society who, along with non-profits, help Chicagoans.
© Tom Schenk Jr, 2016. CC-BY

Using #opendata, this service developed by the civic community alerts
individuals to street sweeping activity by providing email, text, or calendar alerts. sweeparound.us

Chicago Flu Shots was developed to make it easy to find flu-shot locations across Chicago. The code was created by a volunteer and is #opensource, so the site was adopted by Boston, Philadelphia, and San Francisco. chicagoflushots.org

Academia
The City of Chicago has a number of high-quality research universities and groups willing to engage in projects with the city. We can leverage the #opendata portal and the data itself to create cooperative relationships.

#Predictions: Using advanced research techniques to forecast and predict events in the city.
#Optimization: Optimizing the allocation of resources across the city for a more efficient city.
#Evaluation: Evaluating programs, including the effectiveness of advanced analytics.

Image adapted from Michael Mooney’s Little Chicago (CC-BY 2.0).

Using #opendata
Chicago leveraged the #opendata portal to share data with external researchers, using the city’s premier method of sharing data and saving time on data-sharing agreements to create #predictions.

Significant Predictors
• Establishments with previous critical or serious violations
• Three-day average high temperature
• Nearby garbage and sanitation complaints
• Nearby burglaries
• Whether the establishment has a tobacco or alcohol license
• Length of time since last inspection
• Length of time the establishment has been operating
• Inspector assigned

The model predicts the likelihood of a food establishment having a critical violation, the type of violation most likely to lead to foodborne illness. Over a dozen #opendata sources were used to help define the model. Ultimately, ten different variables proved useful in creating #predictions of critical violations.
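As a rough illustration of how predictors like these could be turned into a risk ranking, here is a minimal sketch. It assumes a logistic form, which the slides do not state, and every feature name and coefficient below is hypothetical, not a fitted value from the Chicago model.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT the
# fitted values from the Chicago food inspection model.
WEIGHTS = {
    "past_critical_violation": 0.9,      # 1 if a prior critical/serious violation
    "three_day_avg_high_temp_f": 0.01,   # degrees Fahrenheit
    "nearby_sanitation_complaints": 0.05,
    "years_since_last_inspection": 0.4,
}
INTERCEPT = -3.0


def violation_probability(features):
    """Logistic score: 1 / (1 + exp(-(intercept + sum of weight * value)))."""
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_by_risk(establishments):
    """Return establishment ids ordered from highest to lowest predicted risk."""
    return sorted(
        establishments,
        key=lambda eid: violation_probability(establishments[eid]),
        reverse=True,
    )
```

Inspectors would then visit establishments in the order returned by `rank_by_risk`, which is the sense in which the model prioritizes rather than replaces inspections.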

The #predictions revealed an opportunity to deliver results faster. Within the first half of the work period, 69% of critical violations would have been found by inspectors using a data-driven approach. During the same period, only 55% of violations were found using the status quo method.
[Chart: share of critical violations found, data-driven vs. status quo]

The food inspection model is able to deliver results faster. Comparing the data-driven approach with the status quo, the data-driven approach found violations an average of 7.4 days sooner in the 60-day pilot. That means the #predictions would have led inspectors to find more violations sooner.
IMPROVEMENT: 7 days
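The two comparisons above (share of violations found in the first half, and average days gained) can be sketched as follows. The slides do not say exactly how the figures were computed, so this is one plausible reading; the function names and input shapes are assumptions.

```python
def share_found_in_first_half(found_flags):
    """Share of all critical violations found in the first half of the inspections.

    found_flags: list of booleans in the order inspections were performed,
    True where the inspection found a critical violation.
    """
    total = sum(found_flags)
    if total == 0:
        return 0.0
    half = len(found_flags) // 2
    return sum(found_flags[:half]) / total


def mean_days_sooner(status_quo_day, data_driven_day):
    """Average number of days earlier a violation is found under the data-driven order.

    Both arguments map an establishment id to the pilot day on which its
    violation was (or would have been) found; they must share the same keys.
    """
    diffs = [status_quo_day[eid] - data_driven_day[eid] for eid in status_quo_day]
    return sum(diffs) / len(diffs)
```

Reordering the same set of inspections by predicted risk leaves the total number of violations unchanged; only the timing of their discovery moves.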

OPTIMIZING FOOD INSPECTIONS
Discovering critical violations sooner rather than later reduces the risk of patrons becoming ill, which helps reduce medical expenses, lost time at work, and even a limited number of fatalities.

The data science team has built a website which lets
CDPH prioritize inspections based on projected risk.

http://chicago.github.io/food-inspections-evaluation/

#OPENSOURCE
The analytical model will be released as an open source project on GitHub, allowing other cities to study or even adopt the model. No other city has released its analytic models before this release.

WEST NILE VIRUS
CURTAILING VECTOR-BORNE DISEASES

WEST NILE VIRUS
• Between 5 and 884 human cases reported annually in Illinois since 2002
• 2,371 confirmed human infections since 2002
• Most people who become infected with West Nile virus never develop any symptoms
• About 1 in 5 people who are infected will develop flu-like symptoms
• Less than 1% of people who are infected will develop a serious neurologic illness

PREVENTION
The Chicago Department of Public Health (CDPH) uses a multi-pronged approach to fight the spread of WNV:
• Larvicide in stormwater drains
• DNA tests of mosquitoes (pictured)
• Spraying when WNV is present

DNA MONITORING
• At any given time there are 60+ traps in Chicago collecting (mostly) Culex pipiens and Culex restuans mosquitoes
• The traps are collected twice a week
• Batches of up to 50 mosquitoes are DNA tested
• The data are published on https://data.cityofchicago.org/
• The results and model predictions are displayed in WindyGrid

SEASONAL MODEL
• We use a generalized linear mixed-effects model
• Incorporates season and regional bias
• Predicts likelihood of WNV one week in advance
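One way such a model could be written, assuming a binary outcome with a logit link and a random intercept per region; the slides do not give the specification, so every symbol below is an assumption made for illustration.

```latex
% Assumed notation (not from the slides):
%   y_{i,t+1} = 1 if trap i tests positive for WNV in week t+1
%   x_{i,t}   = fixed-effect covariates observed in week t (e.g. seasonal terms)
%   r(i)      = region containing trap i
\operatorname{logit}\,\Pr\left(y_{i,t+1} = 1\right)
  = \beta_0 + \mathbf{x}_{i,t}^{\top}\boldsymbol{\beta} + u_{r(i)},
\qquad u_{r} \sim \mathcal{N}\left(0, \sigma_u^{2}\right)
```

The fixed effects carry the seasonal pattern, while the random intercept lets each region sit consistently above or below the citywide average, which is one reading of "regional bias".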

WE WERE ABLE TO IDENTIFY WNV ONE WEEK IN ADVANCE IN OUT-OF-SAMPLE DATA 78% OF THE TIME, AND OUR PREDICTION WAS CORRECT 65% OF THE TIME

CLEAR WATER
PROTECTING CHICAGO BEACHES

“PRIOR-DAY” BEACH MODELS
[Diagram: yesterday’s results at each beach, together with hydrometeorological predictors, feed today’s predictions for the same beach (Beach 1 results to Beach 1 predictions, Beach 2 results to Beach 2 predictions)]

PREDICTIONS IN 2016
Prior-day forecasting methods are very noisy. They are accurate overall, but often fail to predict elevated E. coli levels. In Chicago, the true positive rate (sensitivity) is around 5 percent.
[Chart: specificity, sensitivity, precision, and accuracy, on a 0% to 100% scale]
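The four metrics named on this slide all come from the same confusion-matrix counts. The sketch below shows the standard definitions; the counts in the example are hypothetical and chosen only to show how high accuracy can coexist with very low sensitivity when elevated days are rare.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.

    tp: elevated days correctly flagged     fp: clean days wrongly flagged
    tn: clean days correctly left unflagged fn: elevated days missed
    Assumes every denominator is non-zero.
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # share of flags that were right
        "accuracy": (tp + tn) / total,   # share of all calls that were right
    }


# Hypothetical counts, not the Chicago data.
example = classification_metrics(tp=5, fp=20, tn=880, fn=95)
```

With these made-up counts, accuracy is 88.5% while sensitivity is only 5%: the model is right most of the time simply because most beach days are not elevated.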

BEACH CORRELATIONS
Our research, along with a limited number of other studies, found that results at Chicago beaches tend to be correlated with each other on a given day.

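Prior-Day Model
$y_t = f(y_{t-1}, x_1, x_2, \cdots, x_n)$
where $y_t$ is today’s prediction, $y_{t-1}$ is yesterday’s culture-based result, and $x_1, \ldots, x_n$ are hydrometeorological predictors.

Hybrid Model
$y_{i \in C} = f(y_{j \in C})$
where $y_{i \in C}$ is today’s prediction at beach $i$, and $y_{j \in C}$ is the “lead” qPCR result on the same day at beach $j$ in the same “cluster” $C$ of beaches.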

Hybrid Model vs. Prior-Day Model
[Diagram: in the hybrid model, today’s qPCR testing at beach 1 in group 1 drives today’s predictions at beaches 2, 3, and 4 in group 1; the prior-day model relies on yesterday’s results]
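The data flow of the hybrid approach can be sketched as below. The prediction function here is a deliberately simple stand-in (flag every beach in a group when the lead beach exceeds a threshold); it is not the project’s model, and the threshold, names, and data shapes are all hypothetical.

```python
def hybrid_predictions(clusters, lead_results, threshold):
    """Predict same-day advisories from each group's lead beach.

    clusters: group id -> {"lead": beach name, "others": [beach names]}
    lead_results: lead beach name -> today's qPCR result
    threshold: hypothetical cutoff above which a result counts as elevated
    Returns a dict mapping each non-lead beach to True (elevated) or False.
    """
    flags = {}
    for cluster in clusters.values():
        elevated = lead_results[cluster["lead"]] > threshold
        for beach in cluster["others"]:
            flags[beach] = elevated
    return flags
```

The point of the structure is timing: the input is measured on the same day as the prediction, rather than being a culture result that is already a day old.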

CLUSTERING BEACHES
We used a simple k-means clustering algorithm to group the beaches. Beaches were grouped into 5 clusters given the availability of qPCR equipment. Five beaches, one per cluster, are tested with qPCR and used to predict results at the remaining beaches.
$J(S, \mu) = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$
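For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the standard iteration (Lloyd’s algorithm), which minimizes the within-cluster sum of squared distances. The slides do not say which features were clustered, so the input is just a list of numeric vectors; seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=100):
    """Group points into k clusters; returns (assignment, centroids).

    points: list of equal-length numeric tuples, e.g. one vector per beach
    assignment[i] is the cluster index of points[i].
    """
    centroids = [list(p) for p in points[:k]]  # deterministic seeding
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids
```

In practice a library implementation with multiple random restarts would be a safer choice, since k-means can settle in a poor local minimum depending on the starting centroids.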

The map shows that the five clusters were usually, but not strictly, geographically correlated. Some beaches were excluded because they have unique features, namely breakwaters.

SUMMER 2017 PILOT
Predictions for summer 2017 yielded similar accuracy, but sensitivity increased 175%, from 4% to 11%, while precision grew from 17% to 27%. “False positives” remained consistent.
[Chart: specificity, sensitivity, precision, and accuracy for Hybrid (2017) vs. Prior-Day (2016)]
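The 175% figure is a relative increase, not a change in percentage points. A short check of the arithmetic, using only the numbers stated on the slide (the helper name is ours):

```python
def relative_increase(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


sensitivity_gain = relative_increase(4, 11)   # sensitivity: 4% to 11%
precision_gain = relative_increase(17, 27)    # precision: 17% to 27%
```

Sensitivity rose 7 percentage points, which is 175% of its starting value of 4%; by the same calculation, the precision change from 17% to 27% is a relative increase of roughly 59%.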

We tested the impact of adding typical hydrometeorological variables in the hybrid format. This model had the same functional form but added predictors to better estimate Enterococci levels.
Variables shown: Precipitation, Lake Levels, Sunlight, Wind, Tidal Levels, Human Density

MULTIVARIATE HYBRID MODEL
Hydrometeorological variables did not add significant improvements. Overall AUC was 0.753 for the hybrid-only model and 0.744 for the multivariate hybrid model.
[Chart: specificity, sensitivity, precision, and accuracy for Hybrid Multivariate vs. Hybrid-only]
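AUC can be read as the probability that a randomly chosen elevated day receives a higher score than a randomly chosen non-elevated day. The sketch below computes it directly from that definition; it is a generic illustration, not the project’s evaluation code, and the pairwise loop is only suitable for small samples.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: truthy for positive cases (e.g. elevated days), falsy otherwise
    scores: model scores, higher meaning more likely positive
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Because AUC depends only on how cases are ranked, two models with values as close as 0.753 and 0.744 order the beach days almost equally well.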

Technical Documentation
The project was released with an academic-quality technical paper instructing others on the variables and statistical methodology used in the project. In addition to the source code, the paper will help researchers adopt this approach.

Reproducible Research
The technical paper was written as a highly reproducible “knitr” document, allowing other researchers to understand how summary numbers were calculated. Each statement in the project can be traced to an original source.

Open science is part of a workflow that extends from data collection to posting results and generating the final publications.
[Diagram labels: Test Collector, Online App, Lab Supervisors, Open Data: Test Results, Open Data: Hydrometeorological, Open Data: Predictions, GitHub: Source Code, GitHub: Reproducible Paper]
Sources:
Code: 10.5281/zenodo.1420460
Paper: 10.5281/zenodo.1434260
Pre-print: 10.1101/250480

CITIZEN SCIENCE PROJECT
The project was primarily completed by citizen data scientists who volunteered their time at the weekly Chi Hack Night meetings.

1,000 HOURS
Total hours dedicated to this project through volunteers, Chi Hack Night, and students.

[Insert GIF of someone vomiting]

CROSS-REF FOOD POISONING
Work, Gym, Restaurant, Airport, School
“food poisoning symptoms”, “diarrhea”

FINDER had higher precision for finding restaurants with critical violations. Over 52% of FINDER-recommended inspections resulted in violations, compared to almost 23% of inspections based on complaints reported to the City of Chicago.
[Chart: FINDER inspections vs. Complaints]
Source: 10.1038/s41746-018-0045-1

THANK YOU
Contact Info:
Tom Schenk Jr.
Director of Analytics, KPMG
@tomschenkjr

Websites:
data.cityofchicago.org
github.com/Chicago

Clear Water:
Code: 10.5281/zenodo.1420460
Paper: 10.5281/zenodo.1434260
Pre-print: 10.1101/250480

Cross-referencing Food Poisoning:
10.1038/s41746-018-0045-1
