Spark in Action - Overview

An overview of Spark

By Dulitha Wijewantha (Chan)

Transcript

  1. Introduction

     Spark aims to solve two main use cases:
     - Iterative jobs (machine learning algorithms)
     - Iterative analytics

     Hadoop is slow when an operation has to be performed multiple times, because each pass runs as a separate MapReduce job.
  2. Introduction

     - Spark works on Resilient Distributed Datasets (RDDs), an abstraction over data objects
     - Spark is implemented in Scala
     - Many distributed operators are available [count, collect, first]
     - 10x faster than Hadoop in iterative machine learning
     - Sub-second latency to scan a 39 GB dataset
  3. Resilient Distributed Dataset

     - Represented by a Scala object
     - Can be created from files in a file system [HDFS], by transforming an existing RDD, etc.
     - Cacheable
     - Tracks its lineage (how it was built), which allows Spark to rebuild a lost RDD
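The lineage idea on this slide can be sketched with a toy model. The block below is pure Python and does not use Spark's API; `ToyRDD`, `from_collection`, and `lose_cache` are hypothetical names made up for this illustration. It assumes only what the slide states: a dataset remembers how it was built, can be cached, and can be rebuilt from its lineage when the cached copy is lost.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class).

    It stores the recipe for building its data, not only the data itself.
    """

    def __init__(self, compute: Callable[[], Iterable], lineage: List[str]):
        self._compute = compute        # recipe that rebuilds the data
        self.lineage = lineage         # record of how this dataset was built
        self._cached: Optional[list] = None
        self._should_cache = False

    @classmethod
    def from_collection(cls, items: Iterable) -> "ToyRDD":
        data = list(items)
        return cls(lambda: data, ["from_collection"])

    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self.collect()),
                      self.lineage + ["map"])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self.collect() if pred(x)),
                      self.lineage + ["filter"])

    def cache(self) -> "ToyRDD":
        """Ask for the result to be kept in memory after the first computation."""
        self._should_cache = True
        return self

    def lose_cache(self) -> None:
        """Simulate losing the in-memory copy, e.g. when a worker fails."""
        self._cached = None

    def collect(self) -> list:
        if self._cached is not None:
            return list(self._cached)
        data = list(self._compute())   # rebuild by replaying the lineage
        if self._should_cache:
            self._cached = data
        return list(data)

    def count(self) -> int:
        return len(self.collect())

    def first(self):
        return self.collect()[0]


numbers = ToyRDD.from_collection(range(1, 11))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

before_loss = evens_squared.collect()   # computed once, then cached
evens_squared.lose_cache()              # the cached copy disappears
after_loss = evens_squared.collect()    # rebuilt from the lineage
```

Here `before_loss` and `after_loss` hold the same values because the second `collect()` replays the recorded steps instead of relying on the lost copy. A real cluster tracks lineage per partition across machines; this sketch keeps everything in one process to show only the principle.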
  4. Shared Variables

     - Broadcast variables: sent to each worker node once
     - Accumulators: the only operation available is add; the value is available at the master node (driver program)
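The two kinds of shared variables can be illustrated with a small single-process simulation. This is a sketch in plain Python, not Spark code: `ToyBroadcast`, `ToyAccumulator`, and `run_task` are hypothetical names, and the "workers" are ordinary function calls. It assumes only the behaviour stated on the slide: a broadcast value is handed to workers as read-only data, and an accumulator supports nothing but add, with the total read at the driver.

```python
class ToyBroadcast:
    """Read-only value that would be shipped to each worker once (hypothetical)."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value


class ToyAccumulator:
    """Add-only counter: tasks may add to it, the driver reads the total."""

    def __init__(self, initial=0):
        self._total = initial

    def add(self, amount):
        self._total += amount

    @property
    def value(self):
        # In this toy model, only driver-side code is meant to read the total.
        return self._total


def run_task(partition, stop_words, skipped):
    """Simulated worker task: drop stop words and count how many were dropped."""
    kept = []
    for word in partition:
        if word in stop_words.value:   # read the broadcast value
            skipped.add(1)             # the only accumulator operation used
        else:
            kept.append(word)
    return kept


# Driver side: create the shared variables, then run one task per partition.
stop_words = ToyBroadcast(frozenset({"the", "a", "of"}))
skipped = ToyAccumulator()
partitions = [["the", "spark", "of"], ["a", "resilient", "dataset"]]

kept_words = [w for part in partitions for w in run_task(part, stop_words, skipped)]
total_skipped = skipped.value          # read back at the driver
```

Restricting tasks to `add` is what makes such a counter easy to combine across workers, since additions can be merged in any order.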