Analyzing Data With Python - Sarah Guido

ANALYZING DATA WITH PYTHON
Sarah Guido (@sarah_guido), Reonomy
OSCON 2014

ABOUT ME
- Data scientist at Reonomy
- University of Michigan graduate
- NYC Python organizer
- PyGotham organizer

ABOUT THIS TALK
- Bird’s-eye overview: not a comprehensive explanation of these tools!
- Take data from start to finish
  - Preprocessing: Pandas
  - Analysis: scikit-learn
  - Analysis: NLTK
  - Data pipeline: MRJob
  - Visualization: matplotlib
- What next?

WHY PYTHON?
- So many tools
  - Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability
- Community support
- “Easy” language to learn
- Both a scripting and production-ready language

FROM POINT A TO POINT…X?
- How to find the best tool(s)?
- The 90/10 rule
- Simple is better than complex

WHY I CHOSE THESE TOOLS
- Available resources
  - Documentation, tutorials, books, videos
- Ease of use (with a grain of salt)
- Community support and continuous development
- Widely used

PREPROCESSING
- The importance of data preprocessing
  - AKA wrangling, munging, manipulating, and so on
- Preprocessing is also getting to know your data
  - Missing values? Categorical/continuous? Distribution?

PANDAS
- Data analysis and modeling
- Similar to R and Excel
- Easy-to-use data structures
  - DataFrame
- Data wrangling tools
  - Merging, pivoting, etc.

PANDAS
- Keep everything in Python
- Community support/resources
- Use for preprocessing
  - File I/O, cleaning, manipulation, etc.
- Combinable with other modules
  - NumPy, SciPy, statsmodels, matplotlib

PANDAS
- File I/O
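
A minimal sketch of file I/O in pandas. The CSV content and column names here are made up for illustration, and an in-memory buffer stands in for a real file path so the snippet runs anywhere.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "name,age,city\nAda,36,London\nGrace,,New York\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Write the frame back out; index=False leaves out the row-index column.
out = io.StringIO()
df.to_csv(out, index=False)
```

With a real dataset, a path such as `pd.read_csv("data.csv")` would replace the buffer.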

PANDAS
- Finding missing values
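
One way to find missing values might look like the sketch below. The small DataFrame is invented for illustration; only `isnull`, `sum`, and `any` are pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [36, np.nan, 41],
    "city": ["London", "New York", None],
})

# Boolean mask: True wherever a value is missing.
missing_mask = df.isnull()

# Number of missing values per column.
missing_per_column = df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(missing_per_column)
```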

PANDAS
- Removing missing values
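
A sketch of removing missing values with `dropna`, again on made-up data. The `fillna` line is an alternative to removal, included only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [36, np.nan, 41],
    "city": ["London", "New York", None],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
age_known = df.dropna(subset=["age"])

# Alternative to removal: fill missing ages with the column mean.
filled = df.fillna({"age": df["age"].mean()})
```

Whether to drop or fill depends on how much data is missing and on what the later analysis can tolerate.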

PANDAS
- Pivoting
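
Pivoting reshapes long data into a wide table. A minimal sketch, assuming an invented sales table with `month`, `product`, and `units` columns:

```python
import pandas as pd

# Hypothetical long-format data: one row per (month, product).
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "product": ["A", "B", "A", "B"],
    "units": [10, 4, 7, 9],
})

# One row per month, one column per product.
wide = sales.pivot(index="month", columns="product", values="units")

# pivot_table aggregates, so it also handles duplicate index/column pairs.
totals = sales.pivot_table(index="month", values="units", aggfunc="sum")
print(wide)
```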

PANDAS
- Other things
  - Statistical methods
  - Merge/join like SQL
  - Time series
  - Has some visualization functionality

MACHINE LEARNING
- Application of algorithms that learn from examples
- Representation and generalization
- Useful in everyday life
- Especially useful in data analysis

MACHINE LEARNING
- Supervised learning
  - Classification and regression
- Unsupervised learning
  - Clustering and dimensionality reduction

SCIKIT-LEARN
- Machine learning module
- Open-source
- Built-in datasets
- Good resources for learning

SCIKIT-LEARN
- scikit-learn: your data has to be continuous (numeric)
- Here’s what one observation/label looks like:

SCIKIT-LEARN
- Transform categorical values/labels
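
One way to turn categorical labels into numbers is scikit-learn's `LabelEncoder`. This is a sketch on invented labels, not necessarily the transformation used in the talk.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["cat", "dog", "bird", "dog"]

encoder = LabelEncoder()

# Classes are sorted before numbering: bird=0, cat=1, dog=2.
encoded = encoder.fit_transform(labels)

# The mapping is reversible.
decoded = encoder.inverse_transform(encoded)
print(list(encoded))  # [1, 2, 0, 2]
```

`LabelEncoder` is meant for target labels. For categorical input features, an encoder such as `OneHotEncoder` is usually a better fit because it does not imply an ordering.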

SCIKIT-LEARN
- Classification
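
A minimal classification sketch using one of scikit-learn's built-in datasets. The choice of the iris dataset and a k-nearest-neighbors classifier is an assumption for illustration, and any other scikit-learn classifier follows the same `fit`/`score` pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training split only.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Mean accuracy on examples the model has not seen.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```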

SCIKIT-LEARN
- Other things
  - Very comprehensive set of machine learning algorithms
  - Preprocessing tools
  - Methods for testing the accuracy of your model

NATURAL LANGUAGE PROCESSING
- Concerned with interactions between computers and human languages
- Derive meaning from text
- Many NLP algorithms are based on machine learning

NLTK
- Natural Language ToolKit
- Access to over 50 corpora
  - Corpus: body of text
- NLP tools
  - Stemming, tokenizing, etc.
- Resources for learning

NLTK
- Stopword removal

NLTK
- Stemming
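
Stopword removal followed by stemming can be sketched as below. The tiny hand-written stopword set is an assumption that stands in for NLTK's `stopwords` corpus (which needs a one-time `nltk.download("stopwords")`), and the sentence is invented. `PorterStemmer` is NLTK's own class and needs no extra data.

```python
from nltk.stem import PorterStemmer

# Hand-written stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "of", "and", "was", "is", "to", "in"}

text = "the wedding of the sisters was running"

# Naive whitespace tokenization keeps the sketch dependency-free.
tokens = text.lower().split()

# Stopword removal: keep only words that carry content.
content_words = [t for t in tokens if t not in STOPWORDS]

# Stemming: reduce each word to its root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_words]
print(stems)  # ['wed', 'sister', 'run']
```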

NLTK
- Other things
  - Lemmatizing, tokenization, tagging, parse trees
  - Classification
  - Chunking
  - Sentence structure

PROCESSING LARGE DATA
- Data that takes too long to process on your machine
  - Not “big data” but larger data
- Solution: MapReduce!
  - Processing large datasets with a parallel, distributed algorithm
  - Map step
  - Reduce step

PROCESSING LARGE DATA
- Map step
  - Takes a series of key/value pairs
  - Ex. Word counts: break line into words, return word and count within line
- Reduce step
  - Once for each unique key: iterates through values associated with that key
  - Ex. Word counts: returns word and sum of all counts
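
The two steps can be sketched in plain Python, with no Hadoop or MRJob involved. The function names and input lines are hypothetical. A real framework would run the map calls in parallel and handle the grouping between the two steps itself.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (word, count within this line) pairs."""
    counts = defaultdict(int)
    for word in line.split():
        counts[word] += 1
    return list(counts.items())


def reduce_word(word, counts):
    """Reduce step: called once per unique word with all of its counts."""
    return word, sum(counts)


def word_count(lines):
    # Group mapped values by key, as the framework would between steps.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_line(line):
            grouped[word].append(count)
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


# Hypothetical input lines.
lines = ["the cat saw the dog", "the dog ran"]
totals = word_count(lines)
print(totals)  # {'the': 3, 'cat': 1, 'saw': 1, 'dog': 2, 'ran': 1}
```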

MRJOB
- Write MapReduce jobs in Python
- Test code locally without installing Hadoop
- Lots of thorough documentation
- A few things to know
  - Keep everything in one class
  - MRJob program in a separate file
  - Output to new file if doing something like word counts

MRJOB
- Stemmed file
- Line 1: (‘miss’, 2), (‘taylor’, 1)
- Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
- And so on…

Map ¡ Line 1: (‘miss’, 2), (‘taylor’, 1) ¡ Line 2: (‘taylor’,
1), (‘first’, 1), (‘wed’, 1) ¡ Line 3: (‘first’, 1), (‘wed’, 1) ¡ Line 4: (‘father’, 1) ¡ Line 5: (‘father’, 1) Reduce ¡ (‘miss’, 2) ¡ (‘taylor’, 2) ¡ (‘first’, 2) ¡ (‘wed’, 2) ¡ (‘father’, 2) MRJOB

MRJOB
- Let’s count all words in the Gutenberg file
- Map step

MRJOB
- Reduce (and run) step

MRJOB
- Results
  - Mapped counts reduced
  - Key/val pairs

MRJOB
- Other things
  - Run on Hadoop clusters
  - Can write highly complex jobs
  - Works with Elasticsearch

DATA VISUALIZATION
- The “final step”
- Conveying your results in a meaningful way
- Literally see what’s going on

MATPLOTLIB
- 2D visualization library
- Very VERY widely used
- Wide variety of plots
- Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc.)

MATPLOTLIB
- Remember this?

MATPLOTLIB
- Bar chart of distribution

MATPLOTLIB
- Let’s graph our word count frequencies
  - (Hint: It’s a power law distribution!)

MATPLOTLIB
- High frequency of low numbers, low frequency of high numbers
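
A sketch of graphing word count frequencies as a bar chart. The `word_counts` dictionary is toy data invented for illustration, standing in for real word count output. The plot shows how many words occur once, twice, and so on.

```python
import io
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Toy word counts (an assumption, not results from the talk).
word_counts = {
    "the": 50, "of": 30, "and": 20,
    "miss": 2, "taylor": 2, "first": 2,
    "wed": 1, "father": 1, "sister": 1, "late": 1,
}

# How many distinct words share each count value.
frequency_of_counts = Counter(word_counts.values())
xs = sorted(frequency_of_counts)
ys = [frequency_of_counts[x] for x in xs]

fig, ax = plt.subplots()
ax.bar([str(x) for x in xs], ys)  # string labels give evenly spaced bars
ax.set_xlabel("Times a word appears")
ax.set_ylabel("Number of words")
ax.set_title("Distribution of word counts")

# Render to an in-memory buffer; swap in a filename to write to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Even in this toy data, most words sit at the low counts and only a few reach the high ones.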

MATPLOTLIB
- Other things
  - Many different kinds of graphs
  - Customizable
  - Time series

WHAT NEXT?
- Phew!
- Which tool to choose depends on your needs
- Workflow:
  - Preprocess
  - Analyze
  - Visualize

RESOURCES
- Pandas
  - http://pandas.pydata.org/
- scikit-learn
  - http://scikit-learn.org/
- NLTK
  - http://www.nltk.org/
- MRJob
  - http://mrjob.readthedocs.org/
- matplotlib
  - http://matplotlib.org/

CONTACT ME!
- Twitter
  - @sarah_guido
- LinkedIn
  - https://www.linkedin.com/in/sarahguido
- NYC Python
  - http://www.meetup.com/nycpython/

AND FINALLY…

THE END!
Questions?
