Scaling your data infrastructure
CHRISTIAN BARRA @ PYCON NOVE
Slide 2
THE AGENDA
START
1 THE DATA SCIENCE WORKFLOW
2 SCALING IS NOT JUST A MATTER OF MACHINES
3 WHEN THE SIZE OF YOUR DATA MATTERS
Slide 3
THE AGENDA
4 CONTAINERIZED DATA SCIENCE
5 CASSINY: PUT ALL THE THINGS TOGETHER
END
Slide 4
THE DATA SCIENCE WORKFLOW
Slide 6
HOW YOU BUILD, ITERATE AND SHARE DEPENDS ON MANY THINGS
Your Users
Your Product
Your Team
Your Company
Your Tech Stack
Your Domain
Slide 7
DATA SCIENCE TOOLBELT
SCIKIT-LEARN
DOCKER
PANDAS
JUPYTER
RAY
Slide 8
SCALING IS NOT JUST A MATTER OF MACHINES
Slide 9
We all use it.
Slide 10
We really care about versioning.
We have Untitled_1.ipynb, Untitled_2.ipynb and Untitled_3.ipynb.
HOMER SIMPSON
CHIEF DATA SCIENTIST
DATABEER INC
Slide 11
Since JSON is a plain text format, they can be version-controlled and shared with colleagues.
EX IPYTHON NOTEBOOK DOCUMENTATION
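Plain-text JSON still diffs badly when cell outputs are committed along with the code. A minimal sketch of stripping outputs before a commit, assuming the nbformat 4 layout (a top-level "cells" list); strip_outputs and the tiny notebook dict are hypothetical, made up for illustration, and are not part of Jupyter:

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    # Deep copy through a JSON round trip so the input stays untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook (illustrative only, assumed nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

cleaned = strip_outputs(notebook)

# Stable key order and indentation keep the diffs between commits small.
text = json.dumps(cleaned, indent=1, sort_keys=True)
```

The source of every cell survives; only the execution results and counters, which change on every run, are dropped.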
Slide 12
THEY GOT IT RIGHT
Slide 13
BUT WE KEEP IMPROVING
Slide 14
90% OF JUPITER IS MADE OF HYDROGEN
Slide 15
THE HARD THING ABOUT STORAGE
Slide 16
PARQUET
PARQUET + OBJECT STORAGE =
YOU CAN QUERY IT USING SQL
PANDAS HAS NATIVE SUPPORT
FORGET ABOUT CSV
Slide 17
WHEN THE SIZE OF YOUR DATA MATTERS
Slide 18
IT’S TOO SLOW
DOESN’T FIT IN YOUR RAM
Slide 19
SCALING FROM DIFFERENT SIDES
CODE OPTIMIZATION APPROACH
A BIGGER MACHINE
USE MULTIPLE CORES
MORE MACHINES
FRAMEWORKS: DASK, RAY, SPARK
PANDAS: READ BY CHUNKS
SCIKIT-LEARN: PARTIAL FIT
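A minimal sketch of "read by chunks" in pandas, for data that does not fit in RAM. The in-memory CSV and its single value column are made up for illustration; in practice the first argument to read_csv would be a path to a large file:

```python
import io

import pandas as pd

# Stand-in for a large CSV file: one hypothetical column with the values 1..10.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 11)))

total = 0
rows = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    rows += len(chunk)

mean = total / rows
```

This works for any statistic that can be accumulated incrementally (sums, counts, minima, maxima); the chunk size is a trade-off between memory use and per-chunk overhead.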
Slide 20
chunks & partial_fit
1 MACHINE
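A minimal sketch of combining the two on one machine: pandas reads the data chunk by chunk and a scikit-learn estimator is updated with partial_fit. The synthetic data, the column names x1, x2 and label, and the choice of SGDClassifier are assumptions for illustration; any estimator that implements partial_fit follows the same pattern:

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data standing in for a file too big for RAM.
rng = np.random.default_rng(0)
n = 600
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
frame = pd.DataFrame({"x1": x1, "x2": x2, "label": (x1 + x2 > 0).astype(int)})
csv_data = io.StringIO(frame.to_csv(index=False))

model = SGDClassifier(random_state=0)

# partial_fit needs the full set of classes up front, because a single
# chunk is not guaranteed to contain every label.
classes = np.array([0, 1])

for chunk in pd.read_csv(csv_data, chunksize=100):
    X = chunk[["x1", "x2"]].to_numpy()
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)

# Scored on the training data only to check that the model learned something.
accuracy = model.score(frame[["x1", "x2"]].to_numpy(), frame["label"].to_numpy())
```

Each call to partial_fit performs one pass over the chunk it is given, so the full dataset never has to be in memory at once.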
Slide 21
Multiple machines.
n MACHINES
Slide 22
I don’t want to use Spark/JVM, what do you have for me?
HAPPY PYTHON USER
TAKEAWAYS
UNIFIED DATA WAREHOUSE
KEEP YOUR CODE RUNNING ON ONE MACHINE
USE DOCKER
TRY RAY
BRING CI/CD TO YOUR DATA SCIENCE WORKFLOW
OBJECT STORAGE IS COOL
DISTRIBUTED COMPUTING IS HARD
I DIDN’T HAVE ANOTHER POINT