Is Julia the Future for Big Data Analytics?

Features of Julia
1. Started as an MIT project, became open source in 2012. The ‘alpha’ version is under development.
2. Uses LLVM to produce speeds similar to C and Java
3. Has LISP-like macros and ‘generated’ functions
4. Has genuine shared memory and multi-tasking
5. Easy connectivity to C and Python; R, Java and C++ are also possible
6. Solves the ‘two-language’ problem
7. Built with parallel/distributed processing in mind

LLVM compilation process.

What systems are using LLVM now?
• As a compiler: Clang, Swift, GNU, Haskell, Ruby, LDC, Clasp, Llgo
• As a “bolt-on” module / extension: Python (Numba), Tcl
• As a complete system: Julia, JavaScript (V8), LuaJIT, Rust, SML, Pure

http://benchmarksgame.alioth.debian.org/
https://github.com/JuliaLang/julia/tree/master/test/perf/shootout

Data Science is confusing!

Data volumes by 2020

Popular Big Data Architectures

Hadoop vs Spark vs Storm
• All: open-source frameworks; real-time BI and BD analytics; implemented in JVM-based programming languages
• Hadoop: batch processing; latency in minutes; MapReduce jobs used for programming
• Spark: batch, graph and ML; latency of a few seconds; programmed in Scala/Java
• Storm: streaming only; sub-second latency; own Java API
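The MapReduce programming style mentioned for Hadoop can be sketched in a few lines. This is a single-process Python illustration of the pattern only, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)

# Hypothetical input; in a real cluster each document could sit on a different node.
documents = ["big data big compute", "data science"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real batch framework the map and reduce steps run on many machines and the shuffle moves data across the network, which is one reason latency is measured in minutes.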

Julia’s message passing
1. Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processors in shared or distributed memory.
2. Julia’s implementation of message passing is one-sided:
• the programmer needs to explicitly manage only one processor in a two-processor operation
• these operations typically do not look like message send and message receive but rather resemble higher-level operations like calls to user functions.

Key notions: remote references and remote calls
• A remote reference is an object that can be used from any processor to refer to an object stored on a particular processor.
• A remote call is a request by one processor to call a certain function on certain arguments on another (possibly the same) processor.
• A remote call returns a remote reference.
• Remote calls return immediately: the processor that made the call then proceeds to its next operation while the remote call happens somewhere else.
• You can wait for a remote call to finish by calling wait on its remote reference, and you can obtain the full value of the result using fetch.
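The call-then-fetch pattern described above can be illustrated with a rough Python analogy. This is not Julia's API: it assumes Python's standard `concurrent.futures` module, where `submit` plays the role of a remote call, the returned `Future` plays the role of a remote reference, and `result()` plays the role of fetch. The workers here are threads in a single process rather than separate processors, and `square` is a hypothetical work function.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """Hypothetical work function standing in for the remotely called function."""
    return x * x

with ThreadPoolExecutor(max_workers=2) as pool:
    # Analogue of a remote call: returns a handle immediately,
    # without waiting for the function to finish.
    ref = pool.submit(square, 7)

    # The caller proceeds to its next operation while the call runs elsewhere.
    other = sum(range(10))

    # Analogue of fetch: block until the result is ready and obtain its value.
    value = ref.result()
```

The point of the pattern is that only the calling side is managed explicitly; the worker never issues a matching "receive".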

Machine Learning
• ML solves problems that cannot be solved by numerical means alone.
• Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning.
• Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.
• Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships within them.

Supervised Machine Learning has several major subcategories
• Regression ML: Systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.
• Classification ML: Systems where we seek a yes-or-no prediction, such as “Is this tumor cancerous?”, “Does this product meet specified quality standards?”, and so on.
• Bayesian ML: Systems where we have some prior insight and wish to use the data to establish better predictive models.
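As a minimal sketch of the regression case, the following Python fits a straight line to a handful of training examples by ordinary least squares and then answers a "How much?" question for a new input. The data and the helper name `fit_line` are made up for illustration; real problems would have noisy data and many more features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training examples that lie exactly on y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(train_x, train_y)

# Predict a continuous value for an input not seen during training.
prediction = slope * 5.0 + intercept
```

A classifier would follow the same train-then-predict shape but return a yes-or-no label rather than a number.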

Deep Learning
• Torch/PyTorch is a computational framework with an API written in Lua that supports machine-learning algorithms; used by large tech companies such as Facebook and Twitter.
• Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.
• TensorFlow™ is an open source software library for numerical computation using data flow graphs. A flexible architecture allows deployment of computation to one or more CPUs or GPUs.
• Caffe is a well-known and widely used machine-vision library that ported Matlab’s implementation of fast convolutional nets to C and C++.
• MxNet is a machine-learning framework with APIs in languages such as R, Python and Julia, which has been adopted by Amazon Web Services.

Julia Community Groups
• General
• Computing
• Data Science
• Visualization
• Mathematics
• Scientific Domains
https://julialang.org/community/

MNIST
http://yann.lecun.com/exdb/mnist/

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples.

Summary of the results

Final thoughts
• Analyses using ML approaches are computationally intensive.
• General-purpose and specific hardware are becoming increasingly important.
• Distributed systems and/or parallelism are necessary to handle non-trivial problems.
• Networked systems based on Hadoop will not be sufficient in the future.
• Languages such as Julia enable the Data Scientist to process and analyse large datasets within sensible timescales.

Malcolm Sherrington
