Just because you can, doesn't mean you should. We'll discuss using data efficiently, review the choices we face and determine how data literacy, statistical reasoning and algorithmic thinking help.

Jonathan Wallace @jonathanwallace
Modern Data Literacy
SoundBoard Cross-Functional Online Marketing Conference
Good afternoon, everyone, and thanks for coming out. My name is Jonathan Wallace and I’m glad you’re here to listen to my talk. It has been five years since I last presented at this conference, and I want to thank the organizers for having me out again. Today, I’ve come to talk about what I think is an important topic of which everyone should be aware: modern data literacy.

Goals: I want you to… • Statistical Reasoning • Algorithmic Thinking • Curiosity (http://www.contentrow.com/tools/link-bait-title-generator)
Real quick, these are my goals for the talk. Really, I want your interest piqued, because with only twenty minutes we’re only going to cover these topics superficially, with a few examples.

2014! (http://www.bignerdranch.com/about-us/nerds/jonathan-wallace.html)
But first, let me share why I’m worth listening to on this topic. Last time I was here, I shared that I’d been a web developer for the past seven years, five of which were at a consulting company. Believe it or not, it took a lot of takes to get that shot!

Since then… I’ve had a few more jobs associated with data (https://www.linkedin.com/in/jonathan-wallace-8888ba9/)
• Director of Medical Billing Company • Principal Engineer at Stitch Fix • Representative of Ga. House District 119 • V.P. of Engineering at Softgiving
Anyway, since then, I’ve done all the things listed here. I was the director of a medical billing company where we analyzed claims for medical labs and helped them get paid by insurance companies. We had over 30 labs, so there was a decent, but small, flow of data with which to contend.

After that, I was at Stitch Fix, where our primary database was over half a terabyte in size. Not huge when compared to FB, Apple, Amazon, Netflix, or Google, but still pretty big! Let’s talk about that for a quick moment.

From 2017! • Snapchat users share 527,760 photos • Users watch 4,146,600 YouTube videos • 456,000 tweets are sent on Twitter
Have you heard? These stats are for every minute. What do you think those companies are doing with that data? What kinds of questions are they trying to answer? How much data do you think you generate?

Technology (https://www.flickr.com/photos/hagdorned/15021067991/)
While you’re pondering those questions, let’s talk about technology. Technology is what allows us to generate those mind-boggling amounts of data. It is a force multiplier. It isn’t inherently good or bad. What does this mean when it comes to data? If technology is neutral, does that mean the data is neutral?

No, data is not neutral. It isn’t neutral because the data you acquire depends on what questions you ask. How do you capture it? How do you store it? The temptation is to grab all the data you can and then, later, try to figure out what it means.

I’m here to make the case that you’ll be better off knowing the question you want to ask before you get started. What do you want to know?

Statistical Problem Solving • Formulate your question • Collect data • Analyze data • Interpret Results
Here are the basics. First, you need to formulate your question. This might seem elementary and obvious, but I assure you it is not. People often skip this.

A Simple Example: Improve the conversions on our website
Say you have a website with a signup process that contains multiple steps. Or maybe, even simpler, multiple forms. What might your question be? “At which step in our signup process do we see the largest drop-off?”

Great, now we know what to measure. Let’s measure the number of views for each page. Knowing your question is the first step. But this example, although it highlights formulating a question, doesn’t address some of the problems that happen with data at large scales.
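To make the drop-off question concrete, here is a minimal Python sketch. The step names and page-view counts are made up for illustration and are not from any real site; the function simply compares each page's views with the next page's views.

```python
# Hypothetical page-view counts for a four-step signup funnel (made-up numbers).
page_views = [
    ("landing", 1000),
    ("account details", 620),
    ("payment", 310),
    ("confirmation", 280),
]

def largest_drop_off(views):
    """Return (from_step, to_step, lost_fraction) for the steepest drop
    between consecutive steps. Assumes every step has at least one view."""
    drops = []
    for (step, count), (next_step, next_count) in zip(views, views[1:]):
        lost = (count - next_count) / count
        drops.append((step, next_step, lost))
    return max(drops, key=lambda d: d[2])

step, next_step, lost = largest_drop_off(page_views)
print(f"Largest drop-off: {step} -> {next_step} ({lost:.0%} of visitors lost)")
```

With these invented numbers the steepest loss is between the second and third steps, which is the step you would investigate first.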

Statistical Problem Solving • Formulate your question • Collect data • Analyze data • Interpret Results
Now we’re talking about scale and about collecting data. I’m not going into great detail because, with a strong technology team, you shouldn’t have to know the details; they, or an application that commoditizes that work, should handle it for you. But we’ll talk a little about scale and magnitudes.

Scale (https://www.flickr.com/photos/jimchoate/26697097528/)
How long is one million seconds? 11.6 days. How long is one billion seconds? 32 years. How can we manage one million of anything, let alone one billion? How do we cogently handle that many data points?
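The two conversions are easy to check. This short Python sketch assumes an average year of 365.25 days, which is my choice for the rounding and not something stated on the slide.

```python
SECONDS_PER_DAY = 60 * 60 * 24   # 86,400 seconds in a day
DAYS_PER_YEAR = 365.25           # average year length (an assumption)

million_in_days = 1_000_000 / SECONDS_PER_DAY
billion_in_years = 1_000_000_000 / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"One million seconds is about {million_in_days:.1f} days")
print(f"One billion seconds is about {billion_in_years:.1f} years")
```

A thousand times more seconds turns days into decades, which is the jump in magnitude the slide is pointing at.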

We have to group the data. We have to aggregate similar data points into groups. But what problems arise when we start grouping data?

Statistical Problem Solving • Formulate your question • Collect data • Analyze data • Interpret Results
The problems that arise when grouping data speak to analysis. Let’s look at a quick one.

Aggregation Bias: You have a classroom of ten students and the average grade is 93. (https://www.flickr.com/photos/nwabr/5917202414/)
Here’s one: aggregation bias. What can we say about how any particular student is doing? We can’t say anything. We would need more data about our grouping.

Aggregation Bias: Scores are nine 100s and one 30.
That’s easy enough. Remember, we’re working with a small data set, so it is easy to reason and think about, but consider if we were talking about ten thousand or a billion numbers.
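A small Python sketch shows why the average alone tells us nothing about an individual student. The first list is the class described on the slide; the second class, where everyone scored 93, is a hypothetical comparison I added for illustration.

```python
from statistics import mean

# The class from the slide: nine scores of 100 and one score of 30.
skewed_class = [100] * 9 + [30]
# A hypothetical second class (not from the talk) where everyone scored 93.
uniform_class = [93] * 10

# Both classes report the same average...
print(mean(skewed_class), mean(uniform_class))
# ...but the lowest score in each class is very different.
print(min(skewed_class), min(uniform_class))
```

Two very different classrooms collapse into the same single number once they are aggregated.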

“Average”: Mean: 93, Median: 100, Mode: 100
I initially used the word “average,” and that is misleading because we don’t know the distribution of the numbers. Here are three common mathematical concepts that help us understand the data set without knowing the details.

“Average”: Standard deviation: 22.136, Variance: 490
We can look at other numbers to help us understand the data we’ve collected, and we should. It is important to understand how the data is distributed. Is it randomly distributed? Or does it have a fun distribution shape? Alright, I have one more thing to cover before I get to a real-world example.
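These summary numbers can be reproduced with Python's standard `statistics` module. One detail worth noting: the slide's 490 and 22.136 match the sample (n - 1) definitions; the population versions are shown alongside for comparison and are my addition.

```python
import statistics

# Nine scores of 100 and one score of 30, as on the slide.
scores = [100] * 9 + [30]

print("mean:     ", statistics.mean(scores))
print("median:   ", statistics.median(scores))
print("mode:     ", statistics.mode(scores))
# variance() and stdev() use the sample (n - 1) definitions.
print("variance: ", statistics.variance(scores))
print("stdev:    ", round(statistics.stdev(scores), 3))
# The population (n) versions give smaller values for the same scores.
print("pvariance:", statistics.pvariance(scores))
print("pstdev:   ", statistics.pstdev(scores))
```

A standard deviation of about 22 on a class whose median is 100 is the hint that at least one score sits far from the rest.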

Algorithmic Thinking (https://bl.ocks.org/bryik/a3d0d7a0d9d69e6afe0fd8b8b3becec1)
Does anyone know what this is? This is a graph. More specifically, this is called a complete graph because every vertex (that is, each colored circle) is connected to every other vertex via an edge. What do the vertices represent? Well, they can represent anything.

Algorithmic Thinking (https://www.flickr.com/photos/alanchan/2269845817/)
Let’s think of them as people. There’s something called the handshake problem. How many handshakes have to occur in this room for everyone to shake everyone else’s hand?

The answer can be modeled by this graph. We can manually count up the number of edges, but there’s an easier way. (It is 15!)

N * (N - 1) / 2
Here’s the formula for calculating the number of edges, or connections, between those vertices: 6 * 5 / 2 = 15.
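As a quick check of the formula, this Python sketch compares it with a brute-force count of every pair. The function name and the six placeholder names are mine, chosen only for illustration.

```python
from itertools import combinations

def handshakes(n):
    """Number of edges in a complete graph with n vertices: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Brute-force check: list every unordered pair of six people and count them.
people = ["A", "B", "C", "D", "E", "F"]  # placeholder names
pairs = list(combinations(people, 2))

print(handshakes(len(people)), len(pairs))
```

Both numbers agree, so the formula saves us from counting edges by hand.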

But what happens when the room is a lot larger? You can see that the number of connections, or edges, grows quickly. In this case, there are 351 edges.

Now let me show you my application from Softgiving. This is one folder in one directory in the application. Here’s where we’ve modeled the domain. We have 59 models. That would be 1,711 connections if each model were connected to every other model. The other day, my boss asked about adding multiple accounts per organization. We would now have another dimension, or another node, in the application. This would lead to 1,770 connections. You can see how each new variable, new dimension, or new grouping that we add may add complexity to how we capture, store, and associate data.
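The jump from 1,711 to 1,770 can be sketched in Python. This assumes the worst case from the talk, where every model is connected to every other model; the function name is hypothetical.

```python
def complete_graph_edges(n):
    """Edges in a complete graph with n vertices."""
    return n * (n - 1) // 2

models = 59
before = complete_graph_edges(models)      # every model linked to every other
after = complete_graph_edges(models + 1)   # one more model added

print(before, after, after - before)
```

Adding one node to a fully connected graph adds one new edge per existing node, so the 60th model brings 59 potential new connections with it.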

And for fun, here’s a complete graph with 44 nodes and 946 edges. You can visually see how the complexity grows. Getting pretty dark, isn’t it?

Algorithmic Thinking (https://www.flickr.com/photos/corporate-traveller/6826708843/)
So, back to the handshake problem, and I promise it has real-world relevance in the next section. The number of handshakes needed for everyone to shake everyone’s hand gets large as the number of people goes up.

N * (N - 1) / 2
Here’s that formula again. Notice how I’ve made the N’s bigger. The minus one and the divide by two are not what dictate the size of the result. The value represented by N is what dictates it. In programming terms, we think about a function like this, and we describe the growth rate of a function in terms of Big O notation.

• O(1) • O(n) • O(n*n)
So you can think about it this way. A big O of one is constant. If I want to say hi to everyone in this room, I wave my hand and say hello.

If I want to shake each of your hands individually and there are 20 people, then n is twenty and I shake twenty hands. That is O(n). But if I want to build a strong community in this room, then I would like everyone to shake everyone’s hand. That is O(n*n).
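The three growth rates can be sketched in Python by counting the "work" each greeting takes. The function names are made up for this illustration, and the room sizes 20, 40, and 80 are arbitrary.

```python
def wave_to_room(n):
    """O(1): one wave greets everyone, however many people are present."""
    return 1

def shake_each_hand(n):
    """O(n): the speaker shakes hands with each of the n people once."""
    return sum(1 for _ in range(n))

def everyone_shakes_everyone(n):
    """O(n*n): every pair among the n people shakes hands once."""
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            count += 1
    return count

for n in (20, 40, 80):
    print(n, wave_to_room(n), shake_each_hand(n), everyone_shakes_everyone(n))
```

Each time the room doubles, the wave stays at one, the individual handshakes double, and the everyone-to-everyone handshakes roughly quadruple.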

Spoiler: this is why politicians are gluttons for events and interviews. They can meet a lot of people at once. With a speech and a hand wave, they’re very efficient. But they also take the time to shake individual hands as much as they can. Finally, the very good ones focus on community building, which gets people to shake each other’s hands. Okay, on to the real-world example.

Real World (https://www.redandblack.com/athensnews/breaking-democratic-candidate-jonathan-wallace-wins-district-state-house-seat/article_dd1849c2-c432-11e7-9f0e-238235d06a0a.html)
This was me on Nov. 7th, 2017, in a special election for Georgia House District 119.

Real World (https://www.redandblack.com/athensnews/breaking-jonathan-wallace-loses-seat-in-state-house-district-race/article_d813c034-e22f-11e8-b0f8-1f98f862af60.html)
This was not me three hundred and sixty-four days later, on Nov. 6th, 2018. Let’s talk about how being ignorant of a distribution helped lead to this result.

Statistical Problem Solving • Formulate your question • Collect data • Analyze data • Interpret Results
The question in a campaign is straightforward: how many votes do I need to win? Every campaign does this. You look at history and examine similar races, make some explicit assumptions, and then formulate a strategy.

Real World
We used a tool called Votebuilder. Votebuilder provides arbitrary metrics called scores. These scores predict the support for one particular party over another. When we looked at the score we used for formulating our strategy, we knew that the distribution on the scale of 0-100 was not uniform.

Real World (https://en.wikipedia.org/wiki/Multimodal_distribution)
We knew the distribution was bimodal. On one side, you had a cluster of Democrats / progressives. On the other, you had a cluster of Republicans / conservatives. So we established our win number and the size of our universe, 12,000, based on that score.

Statistical Problem Solving • Formulate your question • Collect data • Analyze data • Interpret Results
For collecting data, we used a different tool.

Real World
We used a tool called MiniVAN, and also pen and paper. When someone knocked on a constituent’s door and shook their hand, they would ask questions like, “Will you vote for Jonathan Wallace?” This data was then analyzed on a regular basis to adjust our strategy.

Statistical Problem Solving • Formulate your question • Collect data • Analyze data • Interpret Results
We analyzed our results and thought we were doing well. I ended with 11,929 votes, which was more than I thought I needed. So what went wrong?

Real World
When we dug into the question of how we were going to collect data, we made a mistake related to distributions, i.e., how the data was grouped.

Remember, Votebuilder provides arbitrary metrics called scores. These scores predict the support for one particular party over another.

We knew it was bimodal. On one side, you had a cluster of Democrats / progressives. On the other, you had a cluster of Republicans / conservatives.

We assumed that if we went down to a score of 30, towards the conservative end, we would have a total universe of 12,000 votes. Plenty to win the election. We thought the variance was smaller than it really was.

In reality, what we found in our analysis is that the clusters of this bimodal distribution are extremely concentrated on either end. We should have used a score of 15, which would have made our universe even larger and given us a better chance of victory in that race. The variance was much larger.
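To show how a score cutoff sets the size of a universe, here is a toy Python sketch. Every number in it is invented: the scores are not Votebuilder data, the function is not part of any Votebuilder API, and I am assuming that a higher score means stronger support for our side.

```python
# Entirely made-up scores on a 0-100 scale, clustered near both ends (bimodal).
# Assumption: a higher score means more likely support for our side.
scores = [2, 3, 5, 8, 10, 12, 18, 22, 27, 35, 88, 90, 93, 95, 97, 98, 99]

def universe_size(scores, cutoff):
    """Count voters whose score is at or above the cutoff."""
    return sum(1 for s in scores if s >= cutoff)

for cutoff in (30, 15):
    print(f"cutoff {cutoff}: universe of {universe_size(scores, cutoff)} voters")
```

How many voters a lower cutoff adds depends entirely on how the scores are distributed between the two cutoffs, which is why knowing the shape of the distribution matters before picking the number.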

Goals: I want you to… • Statistical Reasoning • Algorithmic Thinking • Curiosity
Real quick, these are my goals for the talk.