Big Data and Machine Learning

Presentation given to the BCS Advanced Computing SIG in February 2017

Malcolm Sherrington

February 16, 2017

Transcript

  1. Hadoop vs Spark vs Storm
     - All: open-source frameworks for real-time BI and big-data analytics; implemented in JVM-based programming languages
     - Hadoop: batch processing; latency in minutes; programmed as MapReduce jobs
     - Spark: batch, graph and ML processing; latency of a few seconds; programmed in Scala/Java
     - Storm: streaming only; sub-second latency; own Java API
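The MapReduce programming model mentioned for Hadoop can be illustrated with a minimal, single-process Python sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are hypothetical, and the sketch only shows the map, shuffle and reduce steps for a word count.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input; a real job would read from distributed storage.
    sample = ["big data and machine learning", "big data needs big clusters"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run one after another purely to show the data flow.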
  2. The usual “suspects”
     - Matlab (Octave)
     - R
     - Python
     - JVM (Java / Scala / Clojure)
     - SAS / SPSS
     - Excel
     - Julia
  3. Which programming language?
     - Was first released in 1991
     - Does not have native arrays
     - The latest version is not 100% compatible with the previous one
     - Is 30-50 times slower than C code
     - Is presently the most popular language among data scientists
     - Is currently the best choice for studying machine learning
  4. Machine Learning?
     - ML solves problems that cannot be solved by numerical means alone.
     - Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning.
     - Supervised machine learning: the program is “trained” on a pre-defined set of “training examples”, which then help it reach an accurate conclusion when given new data.
     - Unsupervised machine learning: the program is given a set of unlabelled data and must find patterns and relationships within it.
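The supervised/unsupervised distinction can be made concrete with a small pure-Python sketch. The two techniques chosen here (a 1-nearest-neighbour classifier and a two-cluster k-means on 1-D data) are illustrative assumptions, not methods named in the slides, and all names and data are hypothetical.

```python
def nearest_neighbour(train, x):
    """Supervised: return the label of the closest labelled training example."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def two_means(data, iterations=10):
    """Unsupervised: split unlabelled 1-D data into two clusters (k-means, k=2).

    Assumes the data contains at least two distinct values.
    """
    low, high = min(data), max(data)  # initial cluster centres
    for _ in range(iterations):
        a = [v for v in data if abs(v - low) <= abs(v - high)]
        b = [v for v in data if abs(v - low) > abs(v - high)]
        low, high = sum(a) / len(a), sum(b) / len(b)
    return sorted(a), sorted(b)


if __name__ == "__main__":
    # Supervised: every training value comes with a label.
    labelled = [(1.0, "small"), (1.2, "small"), (8.0, "large"), (9.1, "large")]
    print(nearest_neighbour(labelled, 7.5))

    # Unsupervised: only the values are given; the groups must be discovered.
    print(two_means([1.0, 1.2, 0.8, 8.0, 9.1, 8.7]))
```

The first function needs the labels to answer at all; the second never sees a label and only reports which values belong together.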
  5. Supervised ML has several major subcategories
     - Regression ML: systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.
     - Classification ML: systems where we seek a yes-or-no prediction, such as “Is this tumour cancerous?”, “Does this product meet specified quality standards?”, and so on.
     - Bayesian ML: systems where we have some prior insight and wish to use the data to establish better predictive models.
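The difference between a “How much?” answer and a yes-or-no answer can be sketched as follows. The regression part is ordinary least squares for a straight line; the yes-or-no part simply thresholds the predicted value, which is a simplification chosen for brevity (real classifiers are usually fitted differently). Function names, data and the threshold are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_value(model, x):
    """Regression: answer 'how much?' with a value on a continuous scale."""
    intercept, slope = model
    return intercept + slope * x


def predict_class(model, x, threshold):
    """Classification (simplified): answer yes or no by thresholding the value."""
    return predict_value(model, x) >= threshold


if __name__ == "__main__":
    model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(predict_value(model, 5))                 # continuous prediction
    print(predict_class(model, 5, threshold=10))   # yes-or-no prediction
```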
  6. Classification Concepts
     - Cost function: used to measure how close the predicted values are to their corresponding real values.
     - Decision trees: algorithms that can be used for classification or regression predictive modeling problems (CART).
     - Overfitting: irrelevant attributes can result in overfitting the training example data.
     - Underfitting: a model that can neither classify the training data nor generalize to new data.
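A cost function and its use for spotting overfitting can be sketched as below. Mean squared error is one common choice of cost function, assumed here for illustration; the `memorising_model` is a deliberately extreme, hypothetical model that memorises its training data, and all numbers are made up.

```python
def mean_squared_error(predicted, actual):
    """Cost function: average squared gap between predictions and real values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def memorising_model(train_x, train_y):
    """Overfits by design: recalls training targets, guesses their mean otherwise."""
    table = dict(zip(train_x, train_y))
    fallback = sum(train_y) / len(train_y)
    return lambda x: table.get(x, fallback)


if __name__ == "__main__":
    train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
    test_x, test_y = [5, 6], [10.1, 11.9]  # data the model has never seen

    model = memorising_model(train_x, train_y)
    train_cost = mean_squared_error([model(x) for x in train_x], train_y)
    test_cost = mean_squared_error([model(x) for x in test_x], test_y)

    # A near-zero training cost with a much larger test cost signals overfitting;
    # a high cost on both sets would instead point to underfitting.
    print(train_cost, test_cost)
```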
  7. Random Forests
     - Random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.
     - They output the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
     - Random decision forests are designed to correct for decision trees' habit of overfitting to their training set.
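The aggregation step described above (mode for classification, mean for regression) can be sketched in a few lines. The sketch covers only how the individual trees' outputs are combined; training the trees on bootstrap samples with random feature subsets is omitted, and the “trees” are hypothetical hand-written rules standing in for fitted decision trees.

```python
from collections import Counter
from statistics import mean


def forest_classify(trees, x):
    """Classification: return the most common class among the trees' votes."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


def forest_regress(trees, x):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean([tree(x) for tree in trees])


if __name__ == "__main__":
    # Hypothetical stand-ins for trained trees: each is a simple threshold rule.
    class_trees = [
        lambda x: "yes" if x > 3 else "no",
        lambda x: "yes" if x > 5 else "no",
        lambda x: "yes" if x > 4 else "no",
    ]
    print(forest_classify(class_trees, 4.5))  # two of three trees vote "yes"

    value_trees = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: 2 * x - 1]
    print(forest_regress(value_trees, 3))
```

Because each tree errs in a different direction, combining them smooths out the individual errors, which is how the ensemble counters single-tree overfitting.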
  8. What do you need for ML in Python
     - Anaconda (Python)
     - Jupyter notebook IDE
     - Git
     - Some “well-known” packages
         - numpy, scipy
         - pandas
         - matplotlib
         - scikit-learn
         - seaborn
  9. Some Useful URLs
     - https://github.com/jdwittenauer/ipython-notebooks
     - https://github.com/donnemartin/data-science-ipython-notebooks
     - https://www.coursera.org/learn/python-machine-learning
     - https://www.udemy.com/machinelearning/
     - https://www.datacamp.com/community/tutorials/machine-learning-python
     - https://www.amazon.co.uk/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130
     - https://github.com/rasbt/python-machine-learning-book
     - https://speakerdeck.com/rasbt/slides-from-machine-learning-with-scikit-learn-at-scipy-2016
  10. Deep Learning
     - Torch is a computational framework with an API written in Lua that supports machine-learning algorithms (PyTorch is its Python counterpart); used by large tech companies such as Facebook and Twitter.
     - Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.
     - TensorFlow™ is an open source software library for numerical computation using data flow graphs. Its flexible architecture allows computation to be deployed to one or more CPUs or GPUs.
     - Caffe is a well-known and widely used machine-vision library that ported Matlab’s implementation of fast convolutional nets to C and C++.
     - MXNet is a machine-learning framework with APIs in languages such as R, Python and Julia, and has been adopted by Amazon Web Services.
  11. Final Thoughts
     - ML approaches are computationally intense.
     - General-purpose and special-purpose hardware are both becoming increasingly important.
     - Distributed systems and/or parallelism are necessary to handle non-trivial problems.
     - Networked systems based on Hadoop will not be sufficient in the future.