working on both data analytics and data engineering projects.
• Experimental Economist by training
  – Caltech and University of Michigan.
  – Lots of stats/econometrics.
  – Game theory/mechanism design.
• Programmer
  – Built web-based financial markets at Caltech and Michigan.
  – Building robust analytical data ETLs at TrueCar, with small and big data. Lots and lots of data …
install.
  – Powerful environment in its own right.
  – The foundational environment for other Python data packages:
      • Pandas
      • Matplotlib
  – A very good tool.
  – pip install ipython pyzmq jinja2 tornado
      • pyzmq takes a bit longer to build on a Mac
• Running/launching the iPython notebook
  – ipython notebook
  – Note that ipython is a shell; ipython notebook is a browser-based interface
is interactive. Great for data analysis!
• This may not seem like a big deal at first if you haven't done a lot of data processing work, but it is!
• Imagine the alternative:
  – Edit the program file.
  – Run the program and look at the output text in a text editor.
  – Repeat endlessly.
  – And how do you visualize the data? Output to a file and open it in a browser?
  – The iPython Notebook, along with pandas and matplotlib, provides a powerful combination of tools to iteratively examine, process, and visualize data.
raw iPython notebook is not very readable as it contains a lot of HTML formatting code.
• Hard to read the code on GitHub.
  – Though it is easy to convert an iPython notebook to other formats (HTML, Python code) using 'ipython nbconvert'
• Diffs ('diff' or 'git diff') are a lot less helpful when comparing iPython notebooks.
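One way to get a diff-friendly view of a notebook is to pull out only the code cells. The sketch below assumes the notebook file is JSON in the nbformat 4 layout (a top-level "cells" list whose entries carry "cell_type" and "source"); older notebook files are laid out differently. The helper name code_cells and the sample notebook are made up for illustration.

```python
import json


def code_cells(notebook_json):
    """Return the source text of each code cell in a notebook JSON string.

    Assumes the nbformat 4 layout: a top-level "cells" list.
    """
    nb = json.loads(notebook_json)
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown/raw cells
        src = cell.get("source", "")
        # "source" may be one string or a list of line strings.
        if isinstance(src, list):
            src = "".join(src)
        sources.append(src)
    return sources


# A tiny hand-written notebook, used here only for illustration.
sample = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["x = 1\n", "print(x)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["1\n"]}
            ],
        },
    ],
})

# Only the code survives; the markdown cell and stored outputs are dropped.
print("\n\n".join(code_cells(sample)))
```

Writing this plain-text view to a file next to the notebook gives 'diff' and 'git diff' something readable to compare. For a full conversion, 'ipython nbconvert' remains the tool mentioned above.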
it encourages interactive coding, it is easy to pollute the namespace.
• This makes the code hard to debug because you may have overwritten a variable and forgotten about it.
  – When in doubt, restart the kernel, and run the process through one step at a time from the top.
  – Rename variables after a transformation step.
  – Break your code into separate cells.
  – Leverage methods and classes as appropriate.
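The last few tips can be sketched roughly as follows: each transformation lives in a function, and each result gets a new name, so re-running a cell never silently changes its own input. The function names and the price data are hypothetical, chosen only for illustration.

```python
def drop_missing(prices):
    """Return a new list without None entries; the input is left untouched."""
    return [p for p in prices if p is not None]


def to_cents(prices):
    """Convert dollar amounts to whole cents."""
    return [round(p * 100) for p in prices]


# Hypothetical raw data, as it might be loaded in an early cell.
raw_prices = [19.99, None, 5.5]

# Each step binds a NEW name instead of overwriting raw_prices,
# so any cell can be re-run without corrupting an earlier result.
clean_prices = drop_missing(raw_prices)
cents = to_cents(clean_prices)

print(raw_prices)    # still the original data
print(clean_prices)
print(cents)
```

Temporary variables inside the functions stay local, so they never show up in the notebook's global namespace.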