real-time BI and BD analytics; implemented in JVM-based programming languages
! Hadoop: batch processing; latency in minutes; programmed as MapReduce jobs
! Spark: batch, graph and ML processing; latency of a few seconds; programmed in Scala/Java
! Storm: streaming only; sub-second latency; its own Java API
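To make the MapReduce programming model mentioned for Hadoop concrete, here is a minimal in-process sketch in plain Python. It is not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for this illustration only, and a real job would run the phases in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce sequentially over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big analytics", "data streams"])
print(counts)
```

Because each map call looks at one line only and each reduce call at one key only, the work can be split across nodes, which is what makes the model suitable for batch processing.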
Does not have native arrays
! The latest version is not 100% compatible with the previous one
! Is 30-50 times slower than C code
! Is presently the most popular language among data scientists
! Is currently the best choice for studying machine learning
by numerical means alone.
! Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning.
! Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then help it reach an accurate conclusion when given new data.
! Unsupervised machine learning: The program is given a set of unlabeled data and must find patterns and relationships within it.
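The difference between the two settings can be sketched in a few lines of plain Python. The two techniques used here (a nearest-neighbour rule for the supervised case, a two-cluster k-means for the unsupervised case) and the toy data are assumptions chosen for illustration, not methods named in the slides.

```python
def nearest_neighbour_predict(training_examples, x):
    """Supervised: return the label of the labelled example closest to x."""
    closest = min(training_examples, key=lambda example: abs(example[0] - x))
    return closest[1]

def two_means(values, iterations=10):
    """Unsupervised: split unlabelled 1-D values into two clusters (k-means, k = 2)."""
    low, high = min(values), max(values)  # initial centroids
    cluster_low, cluster_high = list(values), []
    for _ in range(iterations):
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not cluster_high:  # all values identical: nothing to separate
            break
        low = sum(cluster_low) / len(cluster_low)
        high = sum(cluster_high) / len(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)

# Supervised: every training example carries a label.
labelled = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]
print(nearest_neighbour_predict(labelled, 2.0))

# Unsupervised: the same kind of data without labels; the grouping is discovered.
print(two_means([1.0, 1.5, 2.0, 8.0, 9.0, 8.5]))
```

In the first call the labels come from the training examples; in the second the program only reports that the values fall into two groups and gives them no names.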
where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.
! Classification ML: Systems where we seek a yes-or-no prediction, such as “Is this tumor cancerous?”, “Does this product meet specified quality standards?”, and so on.
! Bayesian ML: Systems where we have some prior insight and wish to use the data to establish better predictive models.
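The three kinds of system can each be illustrated with a very small model. The sketch below assumes a least-squares line for the continuous case, a midpoint threshold for the yes-or-no case, and a Beta prior updated with observed counts for the Bayesian case. All function names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Continuous prediction ("How much?"): least-squares slope and intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_threshold(examples):
    """Classification (yes or no): midpoint between the two class means.

    Assumes both classes are present and "yes" examples have larger values.
    """
    yes = [x for x, label in examples if label]
    no = [x for x, label in examples if not label]
    return (sum(yes) / len(yes) + sum(no) / len(no)) / 2

def posterior_mean(prior_a, prior_b, successes, failures):
    """Bayesian: mean of a Beta(prior_a, prior_b) prior after observing data."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)            # predicted quantity for x = 5

threshold = fit_threshold([(0.2, False), (0.3, False), (0.8, True), (0.9, True)])
print(0.7 >= threshold)                 # yes-or-no answer for a new measurement

print(posterior_mean(2, 2, successes=8, failures=2))  # prior belief refined by data
```

The first model returns a number, the second a yes-or-no answer, and the third starts from a prior belief and moves it towards what the data show.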
the predicted values are to their corresponding real values.
! Decision Trees: Algorithms that can be used for classification or regression predictive modeling problems (CART).
! Overfitting: Irrelevant attributes can result in overfitting the training example data.
! Underfitting: A model that can neither classify the training data nor generalize to new data.
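Overfitting and underfitting show up when the error on the training data is compared with the error on unseen data. The sketch below uses mean squared error as the measure of closeness and three stand-in models: a memoriser (overfits), a constant (underfits) and a fitted line. The models and the toy data are assumptions for illustration; the slides do not prescribe them.

```python
def mse(model, data):
    """Mean squared error: how close predictions are to the real values."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Training data follows y = 2x with some noise; test data follows y = 2x exactly.
train = [(0, 0.5), (1, 1.5), (2, 4.5), (3, 5.5), (4, 8.5)]
test = [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0), (3.5, 7.0)]

def memoriser(x):
    """Overfits: repeats the value of the nearest training point, noise included."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

mean_y = sum(y for _, y in train) / len(train)

def constant(x):
    """Underfits: ignores x and always predicts the training mean."""
    return mean_y

mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))

def line(x):
    """Least-squares line: captures the trend without memorising the noise."""
    return mean_y + slope * (x - mean_x)

for name, model in [("memoriser", memoriser), ("constant", constant), ("line", line)]:
    print(name, round(mse(model, train), 2), round(mse(model, test), 2))
```

The memoriser has zero training error but a clearly higher test error, the constant model is poor on both sets, and the line is reasonable on both.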
method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time.
! They try to output the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
! Random decision forests are designed to correct for decision trees' habit of overfitting to their training set.
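The way a forest combines its trees can be sketched as follows. The "trees" here are hand-written one-rule functions standing in for trained decision trees, and the feature names are hypothetical. Only the aggregation step (mode for classification, mean for regression) is the point of the example.

```python
from statistics import mean, mode

def forest_classify(trees, sample):
    """Classification: return the class predicted by most trees (the mode)."""
    return mode([tree(sample) for tree in trees])

def forest_regress(trees, sample):
    """Regression: return the mean of the individual tree predictions."""
    return mean([tree(sample) for tree in trees])

# Stand-in classification trees, each looking at a different attribute.
classification_trees = [
    lambda s: "spam" if s["links"] > 3 else "ham",
    lambda s: "spam" if s["caps_ratio"] > 0.5 else "ham",
    lambda s: "spam" if s["length"] < 20 else "ham",
]

# Stand-in regression trees, each returning a numeric estimate.
regression_trees = [
    lambda s: 10.0 if s["rooms"] < 3 else 14.0,
    lambda s: 12.0 if s["area"] < 80 else 16.0,
    lambda s: 11.0 if s["age"] > 30 else 15.0,
]

message = {"links": 5, "caps_ratio": 0.2, "length": 10}
print(forest_classify(classification_trees, message))

house = {"rooms": 2, "area": 70, "age": 40}
print(forest_regress(regression_trees, house))
```

A single tree that is wrong for a given sample is outvoted or averaged out by the others, which is how the ensemble reduces the overfitting of individual trees.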
API written in Lua that supports machine-learning algorithms; used by large tech companies such as Facebook and Twitter.
! Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.
! TensorFlow™ is an open source software library for numerical computation using data flow graphs. Its flexible architecture allows computation to be deployed to one or more CPUs or GPUs.
! Caffe is a well-known and widely used machine-vision library that ported Matlab’s implementation of fast convolutional nets to C and C++.
! MXNet is a machine-learning framework with APIs in languages such as R, Python and Julia; it has been adopted by Amazon Web Services.
purpose and specific hardware is becoming increasingly important.
! Distributed systems and/or parallelism are necessary to handle non-trivial problems.
! Networked systems based on Hadoop will not be sufficient in the future.