Analyzing Data With Python - Sarah Guido

PyGotham 2014

August 16, 2014

Transcript

  1. ABOUT THIS TALK
     - Bird’s-eye overview: not a comprehensive explanation of these tools!
     - Take data from start to finish
       - Preprocessing: Pandas
       - Analysis: scikit-learn
       - Analysis: nltk
       - Data pipeline: MRjob
       - Visualization: matplotlib
     - What next?
  2. WHY PYTHON?
     - So many tools
       - Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability
     - Community support
     - “Easy” language to learn
     - Both a scripting and production-ready language
  3. FROM POINT A TO POINT…X?
     - How to find the best tool(s)?
     - The 90/10 rule
     - Simple is better than complex
  4. WHY I CHOSE THESE TOOLS
     - Available resources
       - Documentation, tutorials, books, videos
     - Ease of use (with a grain of salt)
     - Community support and continuous development
     - Widely used
  5. PREPROCESSING
     - The importance of data preprocessing
       - AKA wrangling, munging, manipulating, and so on
     - Preprocessing is also getting to know your data
       - Missing values? Categorical/continuous? Distribution?
  6. PANDAS
     - Data analysis and modeling
     - Similar to R and Excel
     - Easy-to-use data structures
       - DataFrame
     - Data wrangling tools
       - Merging, pivoting, etc.
  7. PANDAS
     - Keep everything in Python
     - Community support/resources
     - Use for preprocessing
       - File I/O, cleaning, manipulation, etc.
     - Combinable with other modules
       - NumPy, SciPy, statsmodels, matplotlib
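As a rough illustration of the preprocessing steps named on the Pandas slides (file I/O, checking for missing values, cleaning, merging), here is a minimal sketch. The CSV contents, column names, and the choice of a median fill are all made up for this example and are not from the talk.

```python
import io

import pandas as pd

# Hypothetical CSV data, invented for illustration only.
raw = io.StringIO(
    "id,age,city\n"
    "1,34,NYC\n"
    "2,,Boston\n"
    "3,29,NYC\n"
)
people = pd.read_csv(raw)  # file I/O: parse CSV text into a DataFrame

# Getting to know the data: missing values per column, categorical counts.
missing = people.isna().sum()
city_counts = people["city"].value_counts()

# Cleaning: fill the missing age with the column median (one possible choice).
people["age"] = people["age"].fillna(people["age"].median())

# Merging with a second, also hypothetical, table on the shared "id" column.
scores = pd.DataFrame({"id": [1, 2, 3], "score": [0.7, 0.4, 0.9]})
merged = people.merge(scores, on="id", how="left")
```

Whether to fill, drop, or flag missing values depends on the dataset; the median fill here is only a placeholder for that decision.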
  8. MACHINE LEARNING
     - Application of algorithms that learn from examples
     - Representation and generalization
     - Useful in everyday life
     - Especially useful in data analysis
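To make "learning from examples" and "generalization" concrete, the sketch below fits a scikit-learn classifier on part of a dataset and scores it on examples it has not seen. The dataset (the iris data bundled with scikit-learn), the k-nearest-neighbors model, and the 70/30 split are assumptions chosen for brevity, not choices taken from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Representation: each flower is a row of four numeric features.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the examples so generalization can be measured.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Learn from the training examples only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy on unseen examples is a simple estimate of generalization.
accuracy = model.score(X_test, y_test)
```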
  9. NATURAL LANGUAGE PROCESSING
     - Concerned with interactions between computers and human languages
     - Derive meaning from text
     - Many NLP algorithms are based on machine learning
  10. NLTK
      - Natural Language Toolkit
      - Access to over 50 corpora
        - Corpus: body of text
      - NLP tools
        - Stemming, tokenizing, etc.
      - Resources for learning
  11. PROCESSING LARGE DATA
      - Data that takes too long to process on your machine
        - Not “big data” but larger data
      - Solution: MapReduce!
        - Processing large datasets with a parallel, distributed algorithm
        - Map step
        - Reduce step
  12. PROCESSING LARGE DATA
      - Map step
        - Takes a series of key/value pairs
        - Ex. Word counts: break line into words, return word and count within line
      - Reduce step
        - Once for each unique key: iterates through values associated with that key
        - Ex. Word counts: returns word and sum of all counts
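The map and reduce steps described above can be sketched in plain Python, with no Hadoop or MapReduce framework involved. The function names and the sample lines are hypothetical, and the explicit `shuffle` step stands in for the grouping by key that a real framework performs between the two phases.

```python
from collections import defaultdict


def map_step(line):
    """Break a line into words; return (word, count within line) pairs."""
    counts = defaultdict(int)
    for word in line.lower().split():
        counts[word] += 1
    return list(counts.items())


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, counts):
    """Called once per unique key; return the word and the sum of its counts."""
    return word, sum(counts)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(word, counts) for word, counts in shuffle(mapped).items())


# Made-up, already-stemmed input lines for illustration.
lines = ["miss taylor miss", "taylor first wed", "first wed", "father", "father"]
totals = word_count(lines)
```

Because each line is mapped independently and each key is reduced independently, both phases could be distributed across machines, which is the point of the pattern.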
  13. MRJOB
      - Write MapReduce jobs in Python
      - Test code locally without installing Hadoop
      - Lots of thorough documentation
      - A few things to know
        - Keep everything in one class
        - MRJob program in a separate file
        - Output to new file if doing something like word counts
  14. MRJOB
      - Stemmed file
      - Line 1: (‘miss’, 2), (‘taylor’, 1)
      - Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
      - And so on…
  15. Map ¡ Line 1: (‘miss’, 2), (‘taylor’, 1) ¡ Line 2: (‘taylor’,

    1), (‘first’, 1), (‘wed’, 1) ¡ Line 3: (‘first’, 1), (‘wed’, 1) ¡ Line 4: (‘father’, 1) ¡ Line 5: (‘father’, 1) Reduce ¡ (‘miss’, 2) ¡ (‘taylor’, 2) ¡ (‘first’, 2) ¡ (‘wed’, 2) ¡ (‘father’, 2) MRJOB
  16. DATA VISUALIZATION
      - The “final step”
      - Conveying your results in a meaningful way
      - Literally see what’s going on
  17. MATPLOTLIB
      - 2D visualization library
      - Very VERY widely used
      - Wide variety of plots
      - Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc.)
  18. WHAT NEXT?
      - Phew!
      - Which tool to choose depends on your needs
      - Workflow:
        - Preprocess
        - Analyze
        - Visualize