Data Analysis Project in Python

Outline: So you have some data and you know Python. Now how do you go about organizing a data analysis project? Which software design approach should you use? Which libraries are available in Python? What are the common pitfalls, and how do you avoid them?

Oxana Sachenkova

September 17, 2013

Transcript

  1. Data Analysis

    Deriving valuable insights from data in…
    ★ Finance
    ★ Marketing and Sales
    ★ Natural Language Processing
    ★ Science
      ◦ Biology
      ◦ Physics
      ◦ Chemistry
      ◦ …
  2. Step by step...

    Formulate the question
    Collect data
    Process data
    Descriptive analysis
    Formal analysis
    Visualization

    Tidy data - relational databases in the light of statistics
    http://vita.had.co.nz/papers/tidy-data.pdf
  3. Project structure principles

    ➔ Logically separated functional modules
    ➔ Meaningful names
    ➔ Easy to reconfigure
    ➔ Reuse code and data
    ➔ Structure your results
  4. Agile data analysis

    ➔ work in small iterations with frequent feedback and course correction
    ➔ verify your data on every “release”; turn bugs into test cases
    ➔ use version control
    ➔ collaborate
  5. Step by step...

    Formulate the question
    Collect data
    Process data
    Descriptive analysis
    Formal analysis
    Visualization
  6. Structured code

    Code and Pasta
      Spaghetti
      Ravioli
      Lasagna

    What to do?
      Don’t repeat yourself
      Keep it simple
      Refactoring
      Pair Programming
  7. Structured code

    Be elegant
      PEP8
      Google Coding Style

    Be pythonic
      Use the Force
      Evolution of a Python programmer
  8. I haven’t covered

    How to actually perform data analysis in Python :-)
    How to write tests for data analysis software?
    Test driven development for data analysis
    Code profiling and optimization (vbench)
    Automatically generated documentation
  9. Python libraries for data analysis

    NumPy/SciPy - Have a look at the Example List and John Cook's Distributions in SciPy.
    pandas - Working with tabular data like a boss.
    larry - Labeled array that plays nice with NumPy.
    python-statlib - Several statistical libraries in one.
    statsmodels - Statistical modeling: linear models, GLMs.
    scikits - Machine learning.
    matplotlib - Data visualization.
  10. Data Analysis resources

    Wes McKinney’s tutorials 1 and 2 on Kaggle.
    Hernan Rojas’ tutorial
    Tutorials on financial data and time series using pandas
    Python for Data Analysis (book)
    Interesting IPython Notebooks
  11. More on best practices

    The Hitchhiker's Guide to Python
    Python worst practices
    Quick guide to organizing Computational Biology projects
    Best practices for Scientific Computing
  12. PyLadies Stockholm

    Let us know what you think
    Share your ideas at the next brainstorming session
    Suggest the next meetup topic online
    Tell your friends
    @PyLadiesSthlm
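
Slide 4's advice to verify your data on every “release” and to turn bugs into test cases could look roughly like the sketch below. It uses only the Python standard library. The column names (`sample_id`, `concentration`), the sample values, and the helper names `load_rows` and `verify_rows` are all hypothetical, chosen for illustration rather than taken from the talk.

```python
import csv
import io

# Hypothetical sample data; column names and values are assumptions for illustration.
SAMPLE_CSV = """sample_id,concentration
s1,0.5
s2,1.2
s3,0.9
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"sample_id": row["sample_id"], "concentration": float(row["concentration"])}
        for row in reader
    ]


def verify_rows(rows):
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        if row["sample_id"] in seen:
            problems.append(f"row {i}: duplicate sample_id {row['sample_id']!r}")
        seen.add(row["sample_id"])
        if row["concentration"] < 0:
            problems.append(f"row {i}: negative concentration {row['concentration']}")
    return problems


def test_duplicate_ids_are_reported():
    # Regression test for a hypothetical past bug: duplicated samples skewed a result.
    rows = load_rows("sample_id,concentration\ns1,0.5\ns1,0.7\n")
    assert verify_rows(rows) == ["row 1: duplicate sample_id 's1'"]


# Run the checks each time the data or the code changes.
rows = load_rows(SAMPLE_CSV)
assert verify_rows(rows) == []
test_duplicate_ids_are_reported()
```

Returning a list of problems instead of raising on the first one is a design choice made here so that a single run reports everything wrong with a new data drop. Each bug found later can become one more small test function next to `test_duplicate_ids_are_reported`.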