Why Python is better for Data Science - SP Big Data Meetup

São Paulo Big Data Meetup 2015

Ícaro Medeiros

May 14, 2024
Transcript

  1. WHY PYTHON IS BETTER FOR DATA SCIENCE
     ÍCARO MEDEIROS
     São Paulo Big Data Meetup, São Paulo - SP, 25/11/2015
  2. WHY PYTHON?
     ▸ General purpose
     ▸ Smooth learning curve
     ▸ REPL (IPython!)
     ▸ Programmer productivity
     ▸ Popular and mature
     ▸ Glue language (high level API, low level C/Fortran bindings)
     ▸ Science ecosystem (growing!)
  3. ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD… AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES
     Jeff Hammerbacher
     http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/
  4. PYTHON <3 BIG DATA
     ▸ mrjob: map reduce in Python (https://pythonhosted.org/mrjob)
     ▸ snakebite: pure Python HDFS client (https://github.com/spotify/snakebite)
     ▸ Spark: fast and general engine for large-scale data processing (http://spark.apache.org)
     …
  5. OH, BUT SCALA/JAVA IS FASTER.
     PYTHON IS 2 * FASTER: [WRITING, RUNNING]
     DataFrame operations are optimized and compiled into JVM bytecode
     https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
  6. SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR THE PYTHON COMMUNITY

     #    module        imports    imports/numpy
     1    sys           2437939    5,85
     2    os            2009086    4,82
     3    re            1303009    3,12
     4    numpy          416981    1,00
     5    warnings       371345    0,89
     6    subprocess     344934    0,83
     7    django         282097    0,68
     8    math           281987    0,68
     11   matplotlib     146913    0,35
     13   pylab           77817    0,19
     14   scipy           69092    0,17
     22   pandas          18928    0,05
     24   theano           5482    0,051

     6 OF THE 25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE
     https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
  7. SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION
     https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement

         import numpy as np
         from numpy.linalg import inv, solve

         # Using dot function:
         S = np.dot((np.dot(H, beta) - r).T,
                    np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

         # With the @ operator
         S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

     $S = (H\beta - r)^T (H V H^T)^{-1} (H\beta - r)$

     PEP 0465: PROPOSED FEB/14. SINCE PY 3.5 (SEP/15)
     2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON
     AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES
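
Slide 7's point, that `@` turns nested `dot` calls into readable infix expressions, can be sketched without NumPy. The block below is a minimal illustration that assumes Python 3.5 or later. The `Matrix` class and its `dot` method are hypothetical stand-ins written for this example, not NumPy's API, and the values of `H` and `V` are made up.

```python
class Matrix:
    """Tiny dense matrix, hypothetical and for illustration only (not NumPy)."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    @property
    def T(self):
        # Transpose: columns become rows.
        return Matrix(zip(*self.rows))

    def dot(self, other):
        # Classic row-by-column product.
        cols = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in cols]
             for row in self.rows]
        )

    def __matmul__(self, other):
        # PEP 465 hook: `a @ b` calls a.__matmul__(b) on Python 3.5+.
        return self.dot(other)

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows


# Made-up 2x2 inputs, chosen only to keep the arithmetic easy to check.
H = Matrix([[1, 2], [3, 4]])
V = Matrix([[2, 0], [0, 2]])

nested = H.dot(V).dot(H.T)   # method-call style
infix = H @ V @ H.T          # operator style, evaluated left to right
```

Because `@` is left-associative, `H @ V @ H.T` groups as `(H @ V) @ H.T`, which is the same order of evaluation as the chained `dot` calls.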