Dataflows: The abstraction that powers the Big Data technology by RAÚL CASTRO FERNÁNDEZ at Big Data Spain 2014

Dataflows are an omnipresent abstraction across big data technologies because of their suitability for representing programs in a way that is easy to parallelize. Dataflow models such as those of Spark or MapReduce are stateless, which makes fault tolerance easier to achieve, a crucial property when running at large scale. However, these stateless dataflow models constrain the programming models they expose, which must be adapted to the stateless nature of the underlying platforms. With the “democratization of data”, different types of users with different skills want answers from their big datasets, but they often lack the skills required to write programs tailored to these specific frameworks: a familiar programming model becomes crucial to opening the value of big data to a broader set of users.

Big Data Spain

November 25, 2014

Transcript

  1. THE ABSTRACTION THAT POWERS THE BIG DATA
    RAÚL CASTRO FERNÁNDEZ
    COMPUTER SCIENCE PHD STUDENT, IMPERIAL COLLEGE

  2. Dataflows: The Abstraction that Powers Big Data
    Raul Castro Fernandez
    Imperial College London
    [email protected]
    @raulcfernandez

  3. “Big Data needs Democratization”

  4. Democratization of Data
    Developers and DBAs are no longer the only ones generating, processing and analyzing data.

  5. Democratization of Data
    Developers and DBAs are no longer the only ones generating, processing and analyzing data.
    Decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers, sales, marketing…

  6. + Everyone has data

  7. + Everyone has data
    + Many have interesting questions

  8. + Everyone has data
    + Many have interesting questions
    - Not everyone knows how to analyze it

  10. Bob
    Local Expert

  12. Bob
    Local Expert
    - Barrier of human communication
    - Barrier of professional relations

  13. Bob
    Local Expert
    - Barrier of human communication
    - Barrier of professional relations
    “The limits of my language mean the limits of my world.”
    Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1922

  14. First step to democratize Big Data:
    to offer a familiar programming interface

  15. Outline
    • Motivation
    • SDG: Stateful Dataflow Graphs
    • Handling distributed state in SDGs
    • Translating Java programs to SDGs
    • Checkpoint-based fault tolerance for SDGs
    • Experimental evaluation

  16. Mutable State in a Recommender System
    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    User-Item matrix (UI):        Item-A  Item-B
                        User-A      4       5
                        User-B      0       5

    Co-Occurrence matrix (CO):    Item-A  Item-B
                        Item-A      1       1
                        Item-B      1       2

  17. Mutable State in a Recommender System
    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);   // update with new ratings
    }

    User-Item matrix (UI):        Item-A  Item-B
                        User-A      4       5
                        User-B      0       5

    Co-Occurrence matrix (CO):    Item-A  Item-B
                        Item-A      1       1
                        Item-B      1       2

  18. Mutable State in a Recommender System
    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);   // update with new ratings
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);   // multiply for recommendation
        return userRec;
    }

    User-Item matrix (UI):        Item-A  Item-B
                        User-A      4       5
                        User-B      0       5

    Co-Occurrence matrix (CO):    Item-A  Item-B
                        Item-A      1       1
                        Item-B      1       2
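
    For intuition, the co-occurrence matrix on this slide follows from the user-item matrix: only User-A rated both items, and Item-B was rated by both users, giving CO = [[1,1],[1,2]]. A minimal sketch of such an update (illustrative only, using plain arrays rather than the Matrix class on the slides, and assuming a rating of 0 means "not rated"):

    // Illustrative only (not SEEP code): recompute item co-occurrence counts
    // from a dense user-item ratings array, treating 0 as "not rated".
    static int[][] coOccurrence(double[][] userItem) {
        int items = userItem[0].length;
        int[][] coOcc = new int[items][items];
        for (double[] ratings : userItem) {
            for (int i = 0; i < items; i++) {
                if (ratings[i] == 0) continue;
                for (int j = 0; j < items; j++) {
                    if (ratings[j] == 0) continue;
                    coOcc[i][j]++;   // this user rated both items i and j
                }
            }
        }
        return coOcc;
    }
    // For the UI matrix above, {{4,5},{0,5}}, this returns {{1,1},{1,2}},
    // i.e. exactly the co-occurrence matrix shown on the slide.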

  19. Challenges When Executing with Big Data
    Matrix userItem = new Matrix();
    Matrix coOcc = new Matrix();
    Big Data problem: matrices become large
    > Mutable state leads to concise algorithms but complicates parallelism and fault tolerance
    > Cannot lose state after failure
    > Need to manage state to support data-parallelism

  20. Using Current Distributed Dataflow Frameworks
    Input data → Output data
    > No mutable state simplifies fault tolerance
    > MapReduce: Map and Reduce tasks
    > Storm: No support for state
    > Spark: Immutable RDDs

  21. Imperative Big Data Processing
    > Programming distributed dataflow graphs requires learning new programming models

  22. Imperative Big Data Processing
    > Programming distributed dataflow graphs requires learning new programming models
    Our Goal:
    Run Java programs with mutable state but with the performance and fault tolerance of distributed dataflow systems

  23. Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows
    Program.java → SDGs: Stateful Dataflow Graphs
    > @Annotations help with translation from Java to SDGs
    > Mutable distributed state in dataflow graphs
    > Checkpoint-based fault tolerance recovers mutable state after failure

  24. Outline
    • Motivation
    • SDG: Stateful Dataflow Graphs
    • Handling distributed state in SDGs
    • Translating Java programs to SDGs
    • Checkpoint-based fault tolerance for SDGs
    • Experimental evaluation

  25. SDG: Data, State and Computation
    > SDGs separate data and state to allow data and pipeline parallelism
    Task Elements (TEs) process data
    State Elements (SEs) represent state
    Dataflows represent data
    > Task Elements have local access to State Elements

  26. Distributed Mutable State
    State Elements support two abstractions for distributed mutable state:
    – Partitioned SEs: task elements always access state by key
    – Partial SEs: task elements can access complete state

  27. Distributed Mutable State: Partitioned SEs
    > Partitioned SEs split into disjoint partitions
    State partitioned according to the partitioning key; dataflow routed according to a hash function, e.g. hash(msg.id)
    Key space: [0-N], split into [0-k] and [(k+1)-N]
    Access by key
    User-Item matrix (UI):        Item-A  Item-B
                        User-A      4       5
                        User-B      0       5
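
    To make the routing concrete, a minimal sketch (illustrative only; partitionOf and numPartitions are hypothetical names, not SEEP's API): the dataflow hashes each message's key and sends the update to the node that owns that slice of the key space, so a task element only ever touches its local partition of the state.

    // Illustrative only: choose the partition that owns a given key.
    static int partitionOf(int userId, int numPartitions) {
        // Non-negative hash, then modulo over the disjoint partitions.
        return (Integer.hashCode(userId) & 0x7fffffff) % numPartitions;
    }
    // An addRating(user, item, rating) message is routed to
    // partitionOf(user, numPartitions), where the update is a local,
    // in-memory write to that node's shard of the user-item matrix.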

  28. Distributed Mutable State: Partial SEs
    > Partial SE gives nodes local state instances
    > Partial SE access by TEs can be local or global
    Local access: data sent to one
    Global access: data sent to all

  29. Merging Distributed Mutable State
    > Reading all partial SE instances results in a set of partial values
    > Requires application-specific merge logic

  30. Merging Distributed Mutable State
    > Reading all partial SE instances results in a set of partial values
    > Requires application-specific merge logic
    Multiple partial values

  31. Merging Distributed Mutable State
    > Reading all partial SE instances results in a set of partial values
    > Requires application-specific merge logic
    Multiple partial values → collect partial values → merge logic

  32. Outline
    • Motivation
    • SDG: Stateful Dataflow Graphs
    • Handling distributed state in SDGs
    • Translating Java programs to SDGs   > @Annotations
    • Checkpoint-based fault tolerance for SDGs
    • Experimental evaluation

  33. From Imperative Code to Execution
    Annotated program → SEEP
    > SEEP: data-parallel processing platform
    • Translation occurs in two stages:
      – Static code analysis
      – Bytecode rewriting

  34. Translation Process
    Program.java → Extract TEs, SEs and accesses → Live variable analysis → TE and SE access code assembly → SEEP runnable
    (static analysis with the SOOT Framework; code assembly with Javassist)
    > Extract state and state access patterns through static code analysis
    > Generation of runnable code using TE and SE connections

  35. Translation Process
    Annotated Program.java → Extract TEs, SEs and accesses → Live variable analysis → TE and SE access code assembly → SEEP runnable
    (static analysis with the SOOT Framework; code assembly with Javassist)
    > Extract state and state access patterns through static code analysis
    > Generation of runnable code using TE and SE connections
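
    As a rough illustration of the bytecode-rewriting stage (a sketch only, not SEEP's actual translation code), Javassist can load an annotated class and splice extra code into a method, which is the kind of rewriting used to redirect state accesses to the runtime; the class and method names below are hypothetical.

    import javassist.ClassPool;
    import javassist.CtClass;
    import javassist.CtMethod;

    public class RewriteSketch {
        public static void main(String[] args) throws Exception {
            ClassPool pool = ClassPool.getDefault();
            CtClass program = pool.get("Program");                 // hypothetical annotated class
            CtMethod addRating = program.getDeclaredMethod("addRating");
            // Splice code at the start of addRating; a real translator would
            // redirect the state access to a runtime state element instead.
            addRating.insertBefore("{ System.out.println($1); }");
            program.writeFile("out");                              // emit the rewritten bytecode
        }
    }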

  36. Partitioned State Annotation
    @Partitioned Matrix userItem = new SeepMatrix();
    Matrix coOcc = new Matrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(coOcc, userItem);
    }

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        Vector userRec = coOcc.multiply(userRow);
        return userRec;
    }

    > @Partitioned field annotation indicates partitioned state, routed by hash(msg.id)

  37. Partial State and Global Annotations
    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    void addRating(int user, int item, int rating) {
        userItem.setElement(user, item, rating);
        updateCoOccurrence(@Global coOcc, userItem);
    }

    > @Partial field annotation indicates partial state
    > @Global annotates a variable to indicate access to all partial instances

  38. Partial and Collection Annotations
    @Partitioned Matrix userItem = new SeepMatrix();
    @Partial Matrix coOcc = new SeepMatrix();

    Vector getRec(int user) {
        Vector userRow = userItem.getRow(user);
        @Partial Vector puRec = @Global coOcc.multiply(userRow);
        Vector userRec = merge(puRec);
        return userRec;
    }

    Vector merge(@Collection Vector[] v) {
        /*…*/
    }

    > @Collection annotation indicates merge logic
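
    The merge body is elided on the slide; as an illustration only (one plausible merge, not necessarily the one used in the talk, and using plain arrays instead of the slide's Vector class), the partial recommendation vectors returned by each partial co-occurrence instance could be combined by element-wise addition:

    // Hypothetical merge logic: combine partial recommendation vectors
    // by element-wise sum. Assumes all partials have the same length.
    static double[] merge(double[][] partials) {
        double[] result = new double[partials[0].length];
        for (double[] partial : partials) {
            for (int i = 0; i < partial.length; i++) {
                result[i] += partial[i];
            }
        }
        return result;
    }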

  39. Outline
    • Motivation
    • SDG: Stateful Dataflow Graphs
    • Handling distributed state in SDGs
    • Translating Java programs to SDGs
    • Checkpoint-based fault tolerance for SDGs   > Failures
    • Experimental evaluation

  40. Challenges of Making SDGs Fault Tolerant
    Physical deployment of SDG
    > Task elements access local in-memory state
    > Node failures may lead to state loss

  41. Challenges of Making SDGs Fault Tolerant
    Physical deployment of SDG (physical nodes, state held in RAM)
    > Task elements access local in-memory state
    > Node failures may lead to state loss

  42. Challenges of Making SDGs Fault Tolerant
    Physical deployment of SDG (physical nodes, state held in RAM)
    > Task elements access local in-memory state
    > Node failures may lead to state loss
    Checkpointing State
    • No updates allowed while state is being checkpointed
    • Checkpointing state should not impact data processing path

  43. Challenges of Making SDGs Fault Tolerant
    Physical deployment of SDG (physical nodes, state held in RAM)
    > Task elements access local in-memory state
    > Node failures may lead to state loss
    Checkpointing State
    • No updates allowed while state is being checkpointed
    • Checkpointing state should not impact data processing path
    State Backup
    • Backups are large and cannot be stored in memory
    • Large writes to disk through the network have high cost

  44. Checkpoint Mechanism for Fault Tolerance
    Asynchronous, lock-free checkpointing
    1. Freeze mutable state for checkpointing
    2. Dirty state supports updates concurrently
    3. Reconcile dirty state
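
    A minimal sketch of the dirty-state idea (illustrative only, for a simple key-value state element rather than SEEP's implementation; a production version needs more care around the freeze/snapshot race):

    // Illustrative dirty-state checkpointing: while a checkpoint of the frozen
    // state is written out, updates go to a separate dirty map, which is
    // reconciled back into the main state once the checkpoint completes.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class CheckpointableState {
        private final Map<Integer, Double> state = new ConcurrentHashMap<>();
        private final Map<Integer, Double> dirty = new ConcurrentHashMap<>();
        private volatile boolean checkpointing = false;

        void put(int key, double value) {
            // While a checkpoint is in flight, updates land in the dirty state.
            (checkpointing ? dirty : state).put(key, value);
        }

        Map<Integer, Double> startCheckpoint() {
            checkpointing = true;                 // 1. freeze the main state
            return new HashMap<>(state);          // snapshot to back up asynchronously
        }

        void finishCheckpoint() {
            state.putAll(dirty);                  // 3. reconcile dirty state
            dirty.clear();
            checkpointing = false;
        }
    }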

  45. Distributed M to N Checkpoint Backup
    M to N distributed backup and parallel recovery

  54. Evaluation of SDG Performance
    How does mutable state impact performance?
    How efficient are translated SDGs?
    What is the throughput/latency trade-off?
    Experimental set-up:
    – Amazon EC2 (c1 and m1 xlarge instances)
    – Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)
    – Sun Java 7, Ubuntu 12.04, Linux kernel 3.10

  55. Processing with Large Mutable State
    > addRating and getRec functions from the recommender algorithm, while changing the read/write ratio
    Combines batch and online processing to serve fresh results over large mutable state
    [Plot: throughput (1000 requests/s) and latency (ms) vs. workload (state read/write ratio, 1:5 to 5:1)]

  56. Efficiency of Translated SDG
    > Batch-oriented, iterative logistic regression
    Translated SDG achieves performance similar to non-mutable dataflow
    [Plot: throughput (GB/s) vs. number of nodes (25 to 100), SDG vs. Spark]

  57. Latency/Throughput Tradeoff
    > Streaming word count query, reporting counts over windows
    SDGs achieve high throughput while maintaining low latency
    [Plot: throughput (1000 requests/s) vs. window size (ms), SDG vs. Naiad-LowLatency]

  58. Latency/Throughput Tradeoff
    > Streaming word count query, reporting counts over windows
    SDGs achieve high throughput while maintaining low latency
    [Plots: throughput (1000 requests/s) vs. window size (ms), SDG vs. Naiad-LowLatency; and SDG vs. Naiad-HighThroughput vs. Streaming Spark]

  59. Latency/Throughput Tradeoff
    > Streaming word count query, reporting counts over windows
    SDGs achieve high throughput while maintaining low latency
    [Plots: throughput (1000 requests/s) vs. window size (ms), comparing SDG, Naiad-LowLatency, Naiad-HighThroughput and Streaming Spark]

  60. Summary
    Running Java programs with the performance of current distributed dataflow frameworks
    SDG: Stateful Dataflow Graphs
    – Abstractions for distributed mutable state
    – Annotations to disambiguate types of distributed state and state access
    – Checkpoint-based fault tolerance mechanism

  61. Summary
    Running Java programs with the performance of current distributed dataflow frameworks
    SDG: Stateful Dataflow Graphs
    – Abstractions for distributed mutable state
    – Annotations to disambiguate types of distributed state and state access
    – Checkpoint-based fault tolerance mechanism
    Thank you! Any Questions?
    @raulcfernandez
    [email protected]
    https://github.com/lsds/Seep/
    https://github.com/raulcf/SEEPng/

  62. BACKUP SLIDES

  63. Scalability on State Size and Throughput
    > Increase state size in a mutated KV store
    Support large state without compromising throughput or latency while staying fault tolerant
    [Plot: throughput (million requests/s) and latency (ms) vs. aggregated memory (GB, 50 to 200)]

  64. Iteration in SDG
    > Local iteration supported by one node
    > Iteration across TEs requires a cycle in the dataflow

  65. Types of Annotations
    • Partition
    • Partial
    • Global
    • Partial
    • Collection
    • Data annotations
      – Batch
      – Stream

  66. Overhead of SDG Fault Tolerance
    The fault tolerance mechanism's impact on performance and latency is small; state size and checkpointing frequency do not affect performance.
    [Plots: latency (ms) vs. state size (GB, no FT to 5); latency (ms) vs. checkpoint frequency (s, 2 to 10 and no FT)]

  67. Fault Tolerance Overhead
    [Plot: throughput (10,000 requests/s) and latency (ms) vs. aggregated memory (MB, 10 to 2000), comparing SDG, Naiad-NoDisk and Naiad-Disk]

  68. Recovery Times
    [Plot: recovery time (s) vs. state size (GB, 1 to 4) for 1-to-1, 2-to-1, 1-to-2 and 2-to-2 recovery]

  69. Stragglers
    [Plot: throughput (1000 requests/s) and number of nodes vs. time (s, 0 to 60)]

  70. Fault Tolerance: Sync. vs. Async.
    [Plot: throughput (1000 requests/s) and latency (s) vs. state size (GB, 1 to 4), sync vs. async checkpointing]

  71. Comparison to State-of-the-Art
    System      | Large State | Mutable State | Low Latency | Iteration
    MapReduce   | n/a         | n/a           | No          | No
    Spark       | n/a         | n/a           | No          | Yes
    Storm       | n/a         | n/a           | Yes         | No
    Naiad       | No          | Yes           | Yes         | Yes
    SDG         | Yes         | Yes           | Yes         | Yes
    SDGs are the first stateful, fault-tolerant model enabling execution of imperative code with explicit state

  72. Characteristics of SDGs
    > Runtime data parallelism (elasticity): adaptation to varying workloads and a mechanism against stragglers
    > Support for cyclic graphs: efficiently represent iterative algorithms
    > Low latency: pipelining tasks decreases latency

  73. Bob
    Local Expert
    Hi, I have a query to run on “Big Data”
    Ok, cool, tell me about it
    I want to know sales per employee on Saturdays
    … well … ok, come in 3 days
    Well, this is actually pretty urgent…
    … 2 days, I'm pretty busy
    2 Days After
    Hi! You have the results?
    Yes, here you have your sales last Saturday
    My sales? I meant all employee sales, and not only last Saturday
    Oops, sorry for that, give me 2 days…

  74. 17th ~ 18th NOV 2014
    MADRID (SPAIN)