Is Julia the Future for Big Data Analytics

Presentation given at the Infoconf conference in June 2017

Malcolm Sherrington

June 08, 2017

Transcript

  1. Features of Julia
    1. Started as an MIT project; became open source in 2012. The ‘alpha’ version is under development.
    2. Uses LLVM to produce speeds similar to C and Java
    3. Has LISP-like macros and ‘generated’ functions
    4. Has genuine shared memory and multi-tasking
    5. Easy connectivity to C and Python; R, Java and C++ are also possible
    6. Solves the ‘two-language’ problem
    7. Built with parallel/distributed processing in mind
  2. What systems are using LLVM now?
    • As a compiler: Clang, Swift, GNU, Haskell, Ruby, LDC, Clasp, Llgo
    • As a “bolt-on” module / extension: Python (Numba), Tcl
    • As a complete system: Julia, Javascript (V8), LuaJIT, Rust, SML, Pure
  3. Hadoop vs Spark vs Storm
    • All: O/S frameworks; real-time BI and BD analytics; implemented in JVM-based programming languages
    • Hadoop: Batch processing; latency in minutes; MapReduce jobs used for programming
    • Spark: Batch, Graph and ML; latency of a few seconds; programmed in Scala/Java
    • Storm: Only streaming; sub-second latency; own Java API
  4. Julia’s message passing
    1. Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processors in shared or distributed memory.
    2. Julia’s implementation of message passing is one-sided:
      • the programmer needs to explicitly manage only one processor in a two-processor operation
      • these operations typically do not look like message send and message receive but rather resemble higher-level operations like calls to user functions.
  5. Key notions: remote references and remote calls
    • A remote reference is an object that can be used from any processor to refer to an object stored on a particular processor.
    • A remote call is a request by one processor to call a certain function on certain arguments on another (possibly the same) processor.
    • A remote call returns a remote reference.
    • Remote calls return immediately: the processor that made the call can then proceed to its next operation while the remote call happens somewhere else.
    • You can wait for a remote call to finish by calling wait on its remote reference, and you can obtain the full value of the result using fetch.
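The call-then-fetch pattern on this slide can be sketched in a few lines. The slide names Julia's wait and fetch; the sketch below is only an analogy in Python, assuming that `concurrent.futures` is an acceptable stand-in: a thread pool plays the part of the other processor, a `Future` plays the part of the remote reference, and `slow_square` is a hypothetical function invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def slow_square(x):
    """Hypothetical piece of work that runs 'somewhere else'."""
    return x * x


# Assumption: a single-thread pool stands in for a remote worker.
# Julia would use a separate worker process instead.
with ThreadPoolExecutor(max_workers=1) as worker:
    # "Remote call": submit returns immediately with a Future,
    # the analogue of a remote reference.
    ref = worker.submit(slow_square, 7)

    # The caller proceeds to its next operation while the call runs.
    local_result = 1 + 1

    # Analogue of wait: block until the call has finished.
    wait([ref])

    # Analogue of fetch: obtain the full value of the result.
    remote_result = ref.result()

print(local_result, remote_result)
```

Only the caller's side is managed explicitly here, which mirrors the one-sided style described on the previous slide: there is no matching receive written for the worker.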
  6. Machine Learning
    • ML solves problems that cannot be solved by numerical means alone.
    • Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning.
    • Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.
    • Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships within them.
  7. Supervised Machine Learning has several major subcategories
    • Regression ML: Systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.
    • Classification ML: Systems where we seek a yes-or-no prediction, such as “Is this tumour cancerous?”, “Does this product meet specified quality standards?”, and so on.
    • Bayesian ML: Systems where we have some prior insight and wish to use the data to establish better predictive models.
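The difference between the first two subcategories can be shown with a toy sketch. Everything below is an assumption made for illustration: the data are invented, `fit_line` is an ordinary least-squares fit chosen as the simplest regression, and `classify` is a nearest-centroid rule chosen as the simplest yes-or-no classifier. The Bayesian case is not covered.

```python
# Toy data, invented purely for illustration.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [2.0, 4.0, 6.0, 8.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Regression answers "How much?": the prediction is a continuous value.
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5.0 + intercept

# More invented data: one measurement per item and a pass/fail label.
measurements = [1.0, 1.2, 3.8, 4.0]
passed = [False, False, True, True]


def classify(x):
    """Nearest-centroid rule: pick the label whose mean is closest to x."""
    centroids = {}
    for label in (False, True):
        values = [m for m, p in zip(measurements, passed) if p == label]
        centroids[label] = sum(values) / len(values)
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Classification answers a yes-or-no question.
meets_standard = classify(3.5)

print(predicted_score, meets_standard)
```

Both models are "trained" on labelled examples, which is what makes them supervised; only the type of the answer differs.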
  8. Deep Learning
    • Torch/PyTorch is a computational framework with an API written in Lua that supports machine-learning algorithms; it is used by large tech companies such as Facebook and Twitter.
    • Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.
    • TensorFlow™ is an open source software library for numerical computation using data flow graphs. A flexible architecture allows deployment of computation to one or more CPUs or GPUs.
    • Caffe is a well-known and widely used machine-vision library that ported Matlab’s implementation of fast convolutional nets to C and C++.
    • MxNet is a machine-learning framework with APIs in languages such as R, Python and Julia, which has been adopted by Amazon Web Services.
  9. Julia Community Groups
    • General
    • Computing
    • Data Science
    • Visualization
    • Mathematics
    • Scientific Domains
    https://julialang.org/community/
  10. The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples.
  11. Final thoughts
    • Analysis using ML approaches is computationally intense.
    • General-purpose and specialised hardware are becoming increasingly important.
    • Distributed systems and/or parallelism are necessary to handle non-trivial problems.
    • Networked systems based on Hadoop will not be sufficient in the future.
    • Languages such as Julia enable the Data Scientist to process and analyse large datasets within sensible timescales.