Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

DATAFY ALL THE THINGS Max Humber

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Creator Data Creationism

Slide 6

Slide 6 text

Data is everywhere. And it’s everything (if you’re creative)! So it makes me sad to see Iris and Titanic in every blog, tutorial, and book on data science and machine learning. In DATAFY ALL THE THINGS I’ll empower you to curate and create your own datasets (so that we can all finally let Iris die). You’ll learn how to parse unstructured text, how to harvest data from interesting websites and public APIs, and how to capture and deal with sensor data. Examples in this talk are written in Python and rely on requests, beautifulsoup, mechanicalsoup, pandas, and some 3.6+ magic!

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

…Who hasn’t stared at an iris plant and gone crazy trying to decide whether it’s an Iris setosa, versicolor, or maybe even virginica? It’s the stuff that keeps you up at night for days at a time. Luckily, the iris dataset makes that super easy. All you have to do is measure the length and width of your particular iris’s petal and sepal, and you’re ready to rock! What’s that, you still can’t decide because the classes overlap? Well, at least now you have data!

Slide 15

Slide 15 text

Iris Bespoke data

Slide 17

Slide 17 text

This presentation…

Slide 18

Slide 18 text

capture curate create

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

pd.DataFrame()

Slide 22

Slide 22 text

import pandas as pd

data = [
    ['conference', 'month', 'attendees'],
    ['ODSC', 'May', 5000],
    ['PyData', 'June', 1500],
    ['PyCon', 'May', 3000],
    ['useR!', 'July', 2000],
    ['Strata', 'August', 2500]
]

# pop the header row off the data and use it as the column names
df = pd.DataFrame(data, columns=data.pop(0))
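
For reference, the frame this builds (data.pop(0) runs before the constructor consumes the list, so the header row becomes the column names and never appears as data):

df
#   conference   month  attendees
# 0       ODSC     May       5000
# 1     PyData    June       1500
# 2      PyCon     May       3000
# 3      useR!    July       2000
# 4     Strata  August       2500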

Slide 25

Slide 25 text

data = {
    'package': ['requests', 'pandas', 'Keras', 'mummify'],
    'installs': [4000000, 9000000, 875000, 1200]
}

df = pd.DataFrame(data)

Slide 27

Slide 27 text

df = pd.DataFrame([
    {'artist': 'Bino', 'plays': 100_000},
    {'artist': 'Drake', 'plays': 1_000},
    {'artist': 'ODESZA', 'plays': 10_000},
    {'artist': 'Brasstracks', 'plays': 100}
])

Slide 29

Slide 29 text

df = pd.DataFrame([
    {'artist': 'Bino', 'plays': 100_000},
    {'artist': 'Drake', 'plays': 1_000},
    {'artist': 'ODESZA', 'plays': 10_000},
    {'artist': 'Brasstracks', 'plays': 100}
])

PEP 515
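
Those underscores are PEP 515 digit grouping, new in Python 3.6. They're purely cosmetic and the parser ignores them:

assert 100_000 == 100000  # underscores only aid the reader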

Slide 31

Slide 31 text

from io import StringIO

csv = '''\
food,fat,carbs,protein
avocado,0.15,0.09,0.02
orange,0.001,0.12,0.009
almond,0.49,0.22,0.21
steak,0.19,0,0.25
peas,0,0.04,0.1
'''

pd.read_csv(csv)                 # fails: read_csv wants a path or file-like object
df = pd.read_csv(StringIO(csv))  # works: StringIO makes the string file-like

Slide 32

Slide 32 text

from io import StringIO

csv = '''\
food,fat,carbs,protein
avocado,0.15,0.09,0.02
orange,0.001,0.12,0.009
almond,0.49,0.22,0.21
steak,0.19,0,0.25
peas,0,0.04,0.1
'''

pd.read_csv(csv)
# ---------------------------------------------------------------------------
# FileNotFoundError                         Traceback (most recent call last)
# in ()
# ----> 1 pd.read_csv(csv)
#
# FileNotFoundError: File b'food,fat,carbs,protein\n...' does not exist

df = pd.read_csv(StringIO(csv))  # the fix: wrap the string in a file-like object

Slide 33

Slide 33 text

from io import StringIO

csv = '''\
food,fat,carbs,protein
avocado,0.15,0.09,0.02
orange,0.001,0.12,0.009
almond,0.49,0.22,0.21
steak,0.19,0,0.25
peas,0,0.04,0.1
'''

df = pd.read_csv(StringIO(csv))

Slide 35

Slide 35 text

pd.DataFrame()

Slide 36

Slide 36 text

pd.DataFrame() faker

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

# pip install Faker
from faker import Faker

fake = Faker()

fake.name()
fake.phone_number()
fake.bs()
fake.profile()
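
Every call returns fresh random output, so a demo like this looks different on each run. If you want reproducible fakes (for tests, or a talk), seed the generator first; a minimal sketch, assuming a recent Faker release (older versions seeded the instance with fake.seed()):

from faker import Faker

Faker.seed(42)  # class-level seeding in recent Faker releases
fake = Faker()
fake.name()     # now deterministic across runs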

Slide 47

Slide 47 text

def create_rows(n=1):
    output = [{
        'created_at': fake.past_datetime(start_date='-365d'),
        'name': fake.name(),
        'occupation': fake.job(),
        'address': fake.street_address(),
        'credit_card': fake.credit_card_number(card_type='visa'),
        'company_bs': fake.bs(),
        'city': fake.city(),
        'ssn': fake.ssn(),
        'paragraph': fake.paragraph()} for _ in range(n)]
    return pd.DataFrame(output)

df = create_rows(10)

Slide 49

Slide 49 text

import pandas as pd
import sqlite3

con = sqlite3.connect('data/fake.db')
cur = con.cursor()

df.to_sql(name='users', con=con, if_exists='append', index=True)
pd.read_sql('select * from users', con)
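
pd.read_sql also accepts bound parameters, which beats string-formatting your own SQL. A small usage sketch against the fake users table above (the city filter is just an example):

pd.read_sql(
    'select name, occupation from users where city = ?',  # sqlite uses ? placeholders
    con,
    params=(fake.city(),)
)
con.close()  # tidy up when finished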

Slide 53

Slide 53 text

pd.DataFrame() faker

Slide 54

Slide 54 text

pd.DataFrame() faker sklearn
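
The next slides build a toy regression set by hand with numpy; scikit-learn can also manufacture one in a single call. A sketch using sklearn.datasets.make_regression (the noise level here is an arbitrary choice of mine):

from sklearn.datasets import make_regression
import pandas as pd

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=1993)
df = pd.DataFrame({'x': X[:, 0], 'y': y})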

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

import numpy as np
import pandas as pd

n = 100
rng = np.random.RandomState(1993)
x = 0.2 * rng.rand(n)
y = 31 * x + 2.1 + rng.randn(n)

df = pd.DataFrame({'x': x, 'y': y})
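
Because we manufactured this data, we can sanity-check it: a straight-line fit should roughly recover the slope (31) and intercept (2.1) we baked in, give or take the noise:

slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # approximately 31 and 2.1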

Slide 58

Slide 58 text

df = pd.DataFrame({'x': x, 'y': y})

import altair as alt

(alt.Chart(df, background='white')
    .mark_circle(color='red', size=50)
    .encode(
        x='x',
        y='y'
    )
)

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

with open('data/clippings.txt', 'r', encoding='utf-8-sig') as f:
    contents = f.read().replace(u'\ufeff', '')

lines = contents.rsplit('==========')  # Kindle separates clippings with ten '='

store = {'author': [], 'title': [], 'quote': []}
for line in lines:
    try:
        meta, quote = line.split(')\n- ', 1)
        title, author = meta.split(' (', 1)
        _, quote = quote.split('\n\n')
        store['author'].append(author.strip())
        store['title'].append(title.strip())
        store['quote'].append(quote.strip())
    except ValueError:
        pass  # skip clippings that don't match the expected shape
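
A later slide reads data/highlights.csv, so presumably the parsed clippings get written out along these lines (the file name is assumed from that slide):

import pandas as pd

highlights = pd.DataFrame(store)
highlights.to_csv('data/highlights.csv', index=False, encoding='utf-8-sig')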

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

import markovify
import pandas as pd

df = pd.read_csv('data/highlights.csv')
text = '\n'.join(df['quote'].values)

model = markovify.NewlineText(text)
model.make_short_sentence(140)
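
make_short_sentence returns None when markovify can't assemble a sentence that passes its originality checks against the source text; raising tries and loosening the overlap cap helps on small corpora:

sentence = model.make_short_sentence(140, tries=100, max_overlap_ratio=0.6)
if sentence:
    print(sentence)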

Slide 82

Slide 82 text

model.make_short_sentence(140)

Early Dates are Interviews; don't waste the opportunity to actually move toward a romantic relationship.

Pick a charity or two and set up autopay.

Everyone always wants money, which means you can implement any well-defined function simply by connecting with people’s experiences.

The more you play, the more varied experiences you have, the more people alive under worse conditions.

Everything can be swept away by the bear to avoid losing your peace of mind.

Make a spreadsheet. The cells of the future.

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

No content

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

No content

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

import requests
from bs4 import BeautifulSoup

book = 'Fluke: Or, I Know Why the Winged Whale Sings'
payload = {'q': book, 'commit': 'Search'}
r = requests.get('https://www.goodreads.com/quotes/search', params=payload)

soup = BeautifulSoup(r.text, 'html.parser')
for s in soup(['script']):
    s.decompose()

soup.find_all(class_='quoteText')

Slide 101

Slide 101 text

s = soup.find_all(class_='quoteText')[5]

Slide 104

Slide 104 text

import re

def get_quotes(book):
    payload = {'q': book, 'commit': 'Search'}
    r = requests.get('https://www.goodreads.com/quotes/search', params=payload)
    soup = BeautifulSoup(r.text, 'html.parser')
    # remove script tags
    for s in soup(['script']):
        s.decompose()
    # parse text: Goodreads wraps each quote in curly quotes, with the
    # attribution after the closing one (these two patterns are reconstructed;
    # the export dropped the originals)
    book = {'quote': [], 'author': [], 'title': []}
    for s in soup.find_all(class_='quoteText'):
        s = s.text.replace('\n', '').strip()
        quote = re.search('“(.*)”', s, re.IGNORECASE).group(1)
        meta = re.search('”(.*)', s, re.IGNORECASE).group(1)
        meta = re.sub(r'[^,.a-zA-Z\s]', '', meta)
        meta = re.sub(r'\s+', ' ', meta).strip()
        try:
            author, title = meta.split(',')
        except ValueError:
            author, title = meta, None
        book['quote'].append(quote)
        book['author'].append(author)
        book['title'].append(title)
    return book

Slide 107

Slide 107 text

books = [
    'Fluke: Or, I Know Why the Winged Whale Sings',
    'Shades of Grey Fforde',
    'Neverwhere Gaiman',
    'The Graveyard Book'
]

all_books = {'quote': [], 'author': [], 'title': []}
for b in books:
    print(f'Getting: {b}')
    b = get_quotes(b)
    all_books['author'].extend(b['author'])
    all_books['title'].extend(b['title'])
    all_books['quote'].extend(b['quote'])

audio = pd.DataFrame(all_books)
audio.to_csv('audio.csv', index=False, encoding='utf-8-sig')

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

No content

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

No content

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

No content

Slide 118

Slide 118 text

from traces import TimeSeries as TTS
from datetime import datetime

d = {}
for i, row in df.iterrows():
    date = pd.Timestamp(row['datetime']).to_pydatetime()
    door = row['door']
    d[date] = door

tts = TTS(d)

Slide 122

Slide 122 text

tts.distribution(
    start=datetime(2018, 4, 1),
    end=datetime(2018, 4, 21)
)

Slide 123

Slide 123 text

tts.distribution(
    start=datetime(2018, 4, 1),
    end=datetime(2018, 4, 21)
)
# Histogram({0: 0.682, 1: 0.318})
# assuming 1 means open: the door was open about 32% of that window

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

No content

Slide 126

Slide 126 text

No content

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

df = pd.read_csv('data/beer.csv')
df['time'] = pd.to_timedelta(df['time'] + ':00')

Slide 129

Slide 129 text

df = pd.melt(df,
    id_vars=['time', 'beer', 'ml', 'abv'],
    value_vars=['Mark', 'Max', 'Adam'],
    var_name='name',
    value_name='quantity'
)

weight = pd.DataFrame({
    'name': ['Max', 'Mark', 'Adam'],
    'weight': [165, 155, 200]
})

df = pd.merge(df, weight, how='left', on='name')

Slide 130

Slide 130 text

# a standard drink contains 17.2 ml of pure alcohol
df['standard_drink'] = (
    df['ml'] * (df['abv'] / 100) * df['quantity']
) / 17.2

df['cumsum_drinks'] = (
    df.groupby(['name'])['standard_drink'].apply(lambda x: x.cumsum())
)

df['hours'] = df['time'] - df['time'].min()
df['hours'] = df['hours'].apply(lambda x: x.seconds / 3600)

Slide 132

Slide 132 text

def ebac(standard_drinks, weight, hours):
    # https://en.wikipedia.org/wiki/Blood_alcohol_content
    BLOOD_BODY_WATER_CONSTANT = 0.806
    SWEDISH_STANDARD = 1.2
    BODY_WATER = 0.58
    META_CONSTANT = 0.015

    def lb_to_kg(weight):
        return weight * 0.4535924

    n = BLOOD_BODY_WATER_CONSTANT * standard_drinks * SWEDISH_STANDARD
    d = BODY_WATER * lb_to_kg(weight)
    bac = (n / d - META_CONSTANT * hours)
    return bac

Slide 133

Slide 133 text

# (ebac as defined on the previous slide)
df['bac'] = df.apply(
    lambda row: ebac(
        row['cumsum_drinks'],
        row['weight'],
        row['hours']
    ),
    axis=1
)
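
One practical tweak: a Widmark-style estimate goes negative once enough hours pass between drinks, so it's worth flooring it at zero:

df['bac'] = df['bac'].clip(lower=0)  # a blood alcohol level can't be negative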

Slide 134

Slide 134 text

No content

Slide 135

Slide 135 text

No content

Slide 136

Slide 136 text

No content

Slide 137

Slide 137 text

No content

Slide 138

Slide 138 text

No content

Slide 139

Slide 139 text

No content

Slide 140

Slide 140 text

No content

Slide 141

Slide 141 text

No content

Slide 142

Slide 142 text

No content

Slide 143

Slide 143 text

No content

Slide 144

Slide 144 text

No content

Slide 145

Slide 145 text

No content

Slide 146

Slide 146 text

No content

Slide 147

Slide 147 text

No content

Slide 148

Slide 148 text

import mechanicalsoup

def fetch_data():
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features': 'lxml'},
        raise_on_404=True,
        user_agent='MyBot/0.1: mysite.example.com/bot_info',
    )
    browser.open('https://bikesharetoronto.com/members/login')
    browser.select_form('form')
    browser['userName'] = BIKESHARE_USERNAME
    browser['password'] = BIKESHARE_PASSWORD
    browser.submit_selected()
    browser.follow_link('trips')
    browser.select_form('form')
    browser['startDate'] = '2017-10-01'
    browser['endDate'] = '2018-04-01'
    browser.submit_selected()
    html = str(browser.get_current_page())
    df = pd.read_html(html)[0]
    return df

df = fetch_data()
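
BIKESHARE_USERNAME and BIKESHARE_PASSWORD are defined off-slide; one sensible way to supply them without hard-coding secrets into a notebook is environment variables:

import os

BIKESHARE_USERNAME = os.environ['BIKESHARE_USERNAME']
BIKESHARE_PASSWORD = os.environ['BIKESHARE_PASSWORD']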

Slide 149

Slide 149 text

No content

Slide 154

Slide 154 text

No content

Slide 155

Slide 155 text

No content

Slide 156

Slide 156 text

No content

Slide 157

Slide 157 text

No content

Slide 158

Slide 158 text

def get_geocode(query):
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    payload = {'address': query + ', Toronto', 'key': GEOCODING_KEY}  # note the separator before 'Toronto'
    r = requests.get(url, params=payload)
    results = r.json()['results'][0]
    return {
        'query': query,
        'place_id': results['place_id'],
        'formatted_address': results['formatted_address'],
        'lat': results['geometry']['location']['lat'],
        'lng': results['geometry']['location']['lng']
    }
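
A quick usage sketch (the station names are hypothetical examples, and the pause keeps the script polite about the API's rate limits):

import time

stations = ['Union Station', 'Bathurst St / Fort York Blvd']  # hypothetical
rows = []
for s in stations:
    rows.append(get_geocode(s))
    time.sleep(0.2)  # stay under the geocoding rate limit

geo = pd.DataFrame(rows)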

Slide 161

Slide 161 text

No content

Slide 162

Slide 162 text

No content

Slide 163

Slide 163 text

No content

Slide 164

Slide 164 text

No content

Slide 165

Slide 165 text

No content

Slide 166

Slide 166 text

No content

Slide 167

Slide 167 text

No content

Slide 168

Slide 168 text

No content

Slide 169

Slide 169 text

import pandas as pd
import numpy as np
import seaborn as sns

df = sns.load_dataset('titanic')
df = df[['survived', 'pclass', 'sex', 'age', 'fare']].copy()
df

Slide 172

Slide 172 text

df.rename(
    columns={
        'survived': 'mummified',
        'pclass': 'class',
        'fare': 'debens'
    },
    inplace=True)

df['debens'] = round(df['debens'] * 10, -1)
df['mummified'] = np.where(df['mummified'] == 0, 1, 0)  # inverse: flip the label

df = pd.get_dummies(df)
df = df.drop('sex_female', axis=1)
df.rename(columns={'sex_male': 'male'}, inplace=True)
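
sklearn appeared on the agenda earlier; a minimal sketch of fitting a model to this disguised frame (the train/test split and the choice of logistic regression are mine, not the slides'):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model_df = df.dropna()  # age has missing values
X = model_df.drop('mummified', axis=1)
y = model_df['mummified']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression()
accuracy = model.fit(X_train, y_train).score(X_test, y_test)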

Slide 177

Slide 177 text

No content

Slide 178

Slide 178 text

No content

Slide 179

Slide 179 text

No content

Slide 180

Slide 180 text

No content

Slide 181

Slide 181 text

arm leg

Slide 182

Slide 182 text

import seaborn as sns

df = sns.load_dataset('iris')

Slide 183

Slide 183 text

No content

Slide 184

Slide 184 text

No content

Slide 185

Slide 185 text

transformers = {
    'setosa': 'autobot',
    'versicolor': 'decepticon',
    'virginica': 'predacon'}

df['species'] = df['species'].map(transformers)

Slide 187

Slide 187 text

df.rename(
    columns={
        'sepal_length': 'leg_length',
        'sepal_width': 'leg_width',
        'petal_length': 'arm_length',
        'petal_width': 'arm_width'
    },
    inplace=True
)

Slide 188

Slide 188 text

(alt.Chart(df)
    .mark_circle().encode(
        x=alt.X(alt.repeat('column'), type='quantitative'),
        y=alt.Y(alt.repeat('row'), type='quantitative'),
        color='species:N')
    .properties(
        width=90,
        height=90)
    .repeat(
        background='white',
        row=['leg_length', 'leg_width', 'arm_length', 'arm_width'],
        column=['leg_length', 'leg_width', 'arm_length', 'arm_width'])
    .interactive()
)

Slide 190

Slide 190 text

No content

Slide 191

Slide 191 text

No content

Slide 192

Slide 192 text

No content

Slide 193

Slide 193 text

No content

Slide 194

Slide 194 text

pip install mummify

"You suck at Git. And logging. But it's not your fault."
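
mummify is Max's own package for painless model logging and git-based versioning; its README-level usage is roughly:

import mummify

accuracy = 0.80  # whatever your model scored
mummify.log(f'Accuracy: {accuracy}')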

Slide 195

Slide 195 text

No content