Parallel worlds of CRuby's GC

nari
October 02, 2011

Presented at RubyConf 2011.

Transcript

  1. Parallel worlds of CRuby's GC (Powered by Rabbit 0.9.3)

    nari / Narihiro Nakamura / @nari_en, Network Applied Communication Laboratory Ltd.
  2. Ice-cream factory

    I worked on an assembly line. For example, I made many cardboard boxes. I was a professional cardboard box maker :)
  3. Ice-cream factory

    I made 150 boxes per hour (ZOMG).
  4. Working with Java

    I worked in a big company. This work was similar to assembly line work. I made one part of a product. I didn't understand the whole product.
  5. My current work

    Currently, I work at NaCl. matz, shyouhei and takaokouji are my co-workers. shugo is my boss. They are CRuby committers.
  6. When I started Ruby programming

    I felt free. This work wasn't similar to assembly line work. I could make the whole product.
  7. Garbage Collection for me

    GC technology is very interesting to me. GC is a garbage collecting machine. I've been creating it since then. It's very fun!!
  8. I'm a CRuby Committer

    I work on GC.
  9. My RDD history

    LazySweepGC - RubyKaigi2008; LonglifeGC - 2009; LazySweepGC - 2010; ParallelMarkingGC - 2011
  10. My RDD history

    LazySweepGC - RubyKaigi2008; LonglifeGC - 2009; LazySweepGC - 2010; ParallelMarkingGC - 2011
  11. LonglifeGC

    It treats long-life objects as a special case, similar to Generational GC. LonglifeGC was rejected in CRuby 1.9.2 for some reason. :'(
  12. Kiji

    Kiji is an optimized version of REE by Twitter developers. The Twitter team substantially extended LonglifeGC. It's cool!!
  13. My RDD history

    LazySweepGC - RubyKaigi2008; LonglifeGC - 2009; LazySweepGC - 2010; ParallelMarkingGC - 2011
  14. LazySweepGC

    Traditional M&S GC executes mark and sweep atomically. The Ruby application stops during GC (stop-the-world). In lazy sweeping, sweeping is lazy.
  15. LazySweepGC

    Each object allocation sweeps Ruby's heap until it finds an appropriate free object.
  16. Improvements

    This improves the response time of GC, i.e. the worst-case GC time decreases.
  17. LazySweepGC

    You can use LazySweepGC since Ruby 1.9.3.
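
The lazy sweeping idea on the slides above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not CRuby's implementation: the heap is a fixed list of slots, the mark phase is replaced by a set of live slot indexes passed in by hand, and the names `LazySweepHeap`, `mark_phase` and `allocate` are hypothetical.

```python
class LazySweepHeap:
    """Toy heap: a fixed list of slots plus one mark bit per slot (illustrative only)."""

    def __init__(self, size):
        self.slots = [None] * size     # slot contents
        self.marked = [False] * size   # mark bits; unmarked slots are reusable
        self.sweep_pos = 0             # where the lazy sweeper resumes

    def mark_phase(self, live_indexes):
        """Stand-in for marking: flag the live slots and rewind the sweeper."""
        self.marked = [i in live_indexes for i in range(len(self.slots))]
        self.sweep_pos = 0

    def allocate(self, obj):
        """Sweep only until one free slot is found, then stop."""
        while self.sweep_pos < len(self.slots):
            i = self.sweep_pos
            self.sweep_pos += 1
            if self.marked[i]:
                self.marked[i] = False  # live object: clear the mark for the next cycle
                continue
            self.slots[i] = obj         # unmarked slot: reuse it
            return i
        return None  # nothing free; a real VM would run GC or grow the heap here


heap = LazySweepHeap(4)
for name in "abcd":
    heap.allocate(name)          # fills slots 0..3
heap.mark_phase({0, 2})          # pretend only "a" and "c" are still reachable
index = heap.allocate("e")       # sweeps slots 0 and 1 only, reuses slot 1
```

The point of the sketch is that `allocate` pays for a small part of the sweep each time, instead of one long pause that sweeps the whole heap.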
  18. My RDD history

    LazySweepGC - RubyKaigi2008; LonglifeGC - 2009; LazySweepGC - 2010; ParallelMarkingGC - 2011
  19. Today's topics

    Why do we need Parallel Marking? What to consider? How to implement? How much did performance improve?
  20. Today's topics

    Why do we need Parallel Marking? What to consider? How to implement? How much did performance improve?
  21. Current CRuby's GC

    GC operates on only 1 core. In a multi-core environment, the other cores don't help GC.
  22. What is GC?

    GC collects all dead objects.
  23. What is a dead object?

    A dead object is an object that is never referenced by the program. In GC terms, we say that a dead object is unreachable from Roots.
  24. What is Roots?

    Roots is a set of pointers that directly reference objects in the program, e.g. Ruby's local variables, etc.
  25. Please remember that GC collects objects that are unreachable from Roots.
  26. CRuby's GC algorithm summary

    CRuby adopts the Mark & Sweep algorithm. The collector works in separate Mark and Sweep phases.
  27. In the Mark phase

    The collector marks live objects that are reachable from Roots.
  28. In the Sweep phase

    The collector sweeps "dead" objects. "Dead" means unmarked, and unmarked means unreachable from Roots.
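
The two phases can be sketched on a tiny object graph. This is a minimal illustration, not CRuby's code: objects are plain strings, `refs` is a hypothetical table of who references whom, and `mark` and `sweep` are names chosen for the example.

```python
def mark(roots, refs):
    """Mark phase: return the set of objects reachable from the roots.

    `refs` maps an object to the list of objects it references.
    """
    marked = set()
    stack = list(roots)                  # marking stack
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue                     # already visited (handles cycles)
        marked.add(obj)
        stack.extend(refs.get(obj, []))
    return marked


def sweep(heap, marked):
    """Sweep phase: split the heap into survivors and freed (unmarked) objects."""
    survivors = [obj for obj in heap if obj in marked]
    freed = [obj for obj in heap if obj not in marked]
    return survivors, freed


heap = ["a", "b", "c", "d", "e"]
refs = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
roots = ["a"]                            # e.g. a local variable pointing at "a"

survivors, freed = sweep(heap, mark(roots, refs))
```

Here "c" and "d" reference each other but are unreachable from the roots, so they are swept together with "e"; only "a" and "b" survive.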
  29. Characteristics

    A stop-the-world algorithm. Single-threaded execution.
  30. Recently, PCs have multi-core processors

    But GC executes on a single thread. The other cores don't work during GC. What a waste!!
  31. What is Parallel Marking?

    The collector runs several marking processes in parallel by using native threads. We will be happy on multi-core machines.
  32. Why not perform sweeping in parallel?

    Sweeping is much faster than marking. You can see ko1's research: <URL:http://www.atdot.net/~ko1/diary/201011.html#d4>
  33. Why not perform sweeping in parallel?

    So, Mark phase improvement = GC improvement. And we already have lazy sweeping.
  34. Today's topics

    Why do we need Parallel Marking? What to consider? How to implement? How much did performance improve?
  35. We should consider two problems

    Workload balancing. Wait-free algorithm.
  36. This means..

    Tasks are distributed to multiple threads. The task of marking the entire heap is divided into several tasks, each marking a single branch.
  37. We should consider two problems

    Workload balancing. Wait-free algorithm.
  38. What does "wait-free" mean?

    A wait-free program does non-blocking execution. It guarantees per-thread progress.
  39. Amdahl's law

    Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. [cited from `Amdahl's law - Wikipedia']
  40. Amdahl's law is used in parallel computing

    If the parallel portion of the system is X% and the number of processors is Y, how much speedup can we expect?
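
The question on the slide above can be worked out with the standard form of Amdahl's law: with parallel fraction p = X/100 and Y processors, the expected speedup is 1 / ((1 - p) + p / Y). The function name `amdahl_speedup` and the sample numbers below are only illustrative and are not measurements from the talk.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Half of the work is parallel, 2 processors: 1 / (0.5 + 0.25)
half_on_two = amdahl_speedup(0.5, 2)

# 90% of the work is parallel, 8 processors: 1 / (0.1 + 0.1125)
ninety_on_eight = amdahl_speedup(0.9, 8)
```

Even with 8 processors, a 10% serial part keeps the speedup below 5x, which is why the remaining non-parallel parts matter so much.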
  41. The conclusion so far

    We should consider how we can efficiently balance workloads. So, we use Task Stealing. We should eliminate non-parallel parts by using a wait-free algorithm.
  42. Today's topics

    Why do we need Parallel Marking? What to consider? How to implement? How much did performance improve?
  43. Task Stealing

    In Task Stealing, threads steal tasks from each other. Task Stealing is achieved with Arora's Deque.
  44. Arora's Deque

    Deque stands for Double-Ended Queue. In Arora's Deque, the deque contains tasks as elements. It's a wait-free data structure.
  45. In what ways could shift() cause contention problems?

    e.g. multiple threads (workers) may call shift() on the same deque at the same time.
  46. In what ways could shift() cause contention problems?

    e.g. shift() and pop() could be called at the same time when the deque has only one element.
  47. Serialization

    shift() is serialized by using CAS (CAS = Compare And Swap). And this serialization doesn't use a lock. It's wait-free!!
  48. Summary for Arora's Deque

    A simple data structure for Task Stealing. Each worker has a single deque. Stealing (the shift operation) is wait-free!
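
The deque described above can be sketched as follows, in the style of the work-stealing deque by Arora, Blumofe and Plaxton. This is an illustration only, not the code from the talk. Python has no CAS primitive, so the hypothetical `AtomicCell` class emulates one with a lock, which means this emulation is not itself lock-free. The class name `StealDeque` and the fixed capacity are also assumptions.

```python
import threading


class AtomicCell:
    """Emulated atomic word. The lock only stands in for a hardware CAS."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def store(self, value):
        with self._lock:
            self._value = value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


class StealDeque:
    """The owner uses push()/pop() at the bottom; thieves use shift() at the top."""

    def __init__(self, capacity=1024):
        self.slots = [None] * capacity
        self.bottom = 0                    # written only by the owner
        self.age = AtomicCell((0, 0))      # (tag, top), changed through CAS

    def push(self, task):
        if self.bottom == len(self.slots):
            raise OverflowError("deque is full")
        self.slots[self.bottom] = task
        self.bottom += 1

    def shift(self):
        """Steal the oldest task. Returns None if empty or if the CAS lost a race."""
        old_age = self.age.load()
        tag, top = old_age
        if self.bottom <= top:
            return None
        task = self.slots[top]
        if self.age.compare_and_swap(old_age, (tag, top + 1)):
            return task
        return None                        # another thread got there first

    def pop(self):
        """Owner takes the newest task."""
        if self.bottom == 0:
            return None
        self.bottom -= 1
        task = self.slots[self.bottom]
        old_age = self.age.load()
        tag, top = old_age
        if self.bottom > top:
            return task                    # at least two tasks were present: no conflict
        # At most one task was left, so a thief may be racing for it: reset the deque.
        local_bottom = self.bottom
        self.bottom = 0
        new_age = (tag + 1, 0)
        if local_bottom == top and self.age.compare_and_swap(old_age, new_age):
            return task                    # the owner won the last task
        self.age.store(new_age)
        return None                        # a thief took it


deque = StealDeque(capacity=8)
for task in (1, 2, 3):
    deque.push(task)
newest = deque.pop()      # owner side: takes 3
stolen = deque.shift()    # thief side: takes 1
```

The owner and the thieves only collide when one task is left, which is the shift()/pop() case from the contention slide; the CAS on `age` decides who gets that last task.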
  49. Summary

    The marker uses Arora's Deque as a marking stack. A "task" means an object. The granularity of the task is very fine. This is a naive implementation.
  50. Why slow?

    pop(), push() and shift() are called frequently, because the deque has fine-grained tasks. Their overhead is too big.
  51. Good point & Bad point

    The number of calls to the Deque's operations was reduced. The marking speed of the worker is improved. However, coarse-grained tasks decrease parallelism.
  52. Treatment for large Array objects and Hash objects

    Each marker has a special deque to manage them. A marker divides them into fixed-size tasks, e.g. elements 0-9 of the Array, elements 10-19 of the Array...
  53. Treatment for Large Array and Hash

    By doing this, other workers can steal the divided tasks. This improves parallelism.
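
Dividing a large Array into fixed-size tasks can be sketched like this. The function name `split_into_tasks` and the chunk size of 10 (taken from the "0-9, 10-19" example on the slide) are assumptions for illustration; the actual task size used in the implementation is not stated in the slides.

```python
def split_into_tasks(length, chunk_size=10):
    """Return (start, end) index ranges covering 0..length-1, with `end` exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]


# A 25-element Array becomes three tasks: elements 0-9, 10-19 and 20-24.
tasks = split_into_tasks(25)
```

Each range can then be placed on a deque as its own task, so an idle worker can steal part of a large Array instead of waiting for one worker to mark all of it.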
  54. Summary

    The naive implementation was slow. The grain of the task was too fine. Now, a "task" means a branch in Roots. The grain of the task is coarse. It's faster!!
  55. Today's topics

    Why do we need Parallel Marking? What to consider? How to implement? How much did performance improve?
  56. These are my machine specs

    My machine has only 2 cores. Memory: 8GB. OS: Linux.
  57. First benchmark program is make benchmark

    This is the benchmark used in CRuby development.
  58. Why does this seem so slow?

    I think it's affected by Parallel Marking's preparation, e.g. creating marking threads and allocating deques.
  59. Why does this seem so slow?

    In most of the benchmarks, there are few objects to mark. In this case, the cost of Parallel Marking is high.
  60. Next benchmark program is make rdoc

    make rdoc generates the Ruby documentation. This benchmark measures the total execution time and the GC execution time of make rdoc.
  61. make rdoc

    It takes about 80 seconds on my machine. In fact, 30% of that time is spent on GC!! How much did performance improve?
  62. In a many-core environment

    I expect we would get a large improvement, e.g. with 8 cores, 16 cores... But my machine has just 2 cores. I can't see it :(
  63. Best case for Parallel GC

    If there are many objects: in this case, there are also many mark targets. If the objects are long-lived: server-side applications?
  64. Demonstration

    I want to show the performance improvement with Parallel GC. This demonstration is in video game style.
  65. Other characteristics of SUPER NARIO GC

    GC runs at fixed intervals. A lot of objects are generated to increase GC's burden. Burden = Game Level.
  66. Try to compare Original GC and Parallel GC

    Original GC's pause time is long, so the game will be difficult. Parallel GC's pause time is short, so the game will be easy.
  67. Windows OS is not supported

    The Mark Worker uses pthread for its native threads, and it uses some gcc built-in functions. But I'll support Windows eventually.
  68. Increased memory usage

    The size of 1 Deque is roughly 32KB. But multi-core machines generally have plenty of memory. So, I think it's OK :P
  69. Conclusion

    I implemented Parallel Marking GC. GC was improved! I'll report to ruby-core soon.
  70. Conclusion

    But Parallel Marking has some problems. I'll fix these.
  71. Source code

    Parallel Marking GC: <URL:https://github.com/authorNari/ruby/tree/pmark_div_root2>. SUPER NARIO GC: <URL:https://github.com/authorNari/nario/>
  72. Acknowledgments

    The following people helped me make this presentation!! Tor-san!! matz, shugo, yhara, sada, takaokouji, other co-workers!!
  73. Sorry

    It's too difficult for me to understand/answer the question. Could you send the question on Twitter (@nari_en)?