Taming memory: performance-tuning a (Crystal) application [SoundCloud HQ edition]

When developing a game, you need to pay attention to performance. A game needs to run fast and keep a predictable frame rate, because stuttering throws people off.

I’ve had performance issues even in Crystal, a fast, compiled, statically-typed language with a syntax inspired by Ruby. As it turns out, the way a program handles memory can have a huge impact on performance. Luckily, Crystal gives a great deal of control over how this can be done. It’s also possible to use familiar tools with Crystal to debug issues and identify bottlenecks.

In this talk, I’ll share what I’ve learnt about memory and performance tuning, and give an introduction to several powerful tools for identifying performance issues.

Denis Defreyne

November 24, 2015

Transcript

  1. Taming memory: Performance-tuning a (Crystal) application. Denis Defreyne / SoundCloud HQ / November 24, 2015
  2. DISCLAIMER: The contents of this talk aren’t particularly revolutionary.

  5. C

  6. Gosu (Ruby) ✂

  7. LÖVE (Lua) ✂

  10. Rust

  11. Crystal ✂

  12. DISCLAIMER: I don’t know much about game development.

  19. memory, the game memory, the computer thingie

  20. Allocating objects

  21. donkey = Donkey.new(3, "grey")

  22. donkey = Donkey.allocate
      donkey.initialize(3, "grey")

  23. donkey = malloc(6).cast(Donkey)
      donkey.initialize(3, "grey")
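
The two-step construction on the slides above can be tried in current Crystal. This is a minimal sketch, not the speaker's code: the `Donkey` fields and the `build` helper are assumptions made for illustration. Note that a real instance also carries a type id and a pointer to the string, so it typically occupies more than the six bytes the slides use as a simplified model.

```crystal
class Donkey
  getter legs : Int32
  getter color : String

  def initialize(@legs : Int32, @color : String)
  end

  # Hypothetical helper: does by hand what `Donkey.new` does for us.
  def self.build(legs : Int32, color : String) : Donkey
    donkey = Donkey.allocate       # reserve memory for one instance
    donkey.initialize(legs, color) # fill in the instance variables
    donkey
  end
end

a = Donkey.new(3, "grey")
b = Donkey.build(3, "grey")

puts a.legs == b.legs && a.color == b.color
puts instance_sizeof(Donkey) # bytes occupied by one instance on this platform
```

The helper lives inside the class because `initialize` is not meant to be called from arbitrary outside code.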

  24. What is memory?

  26. 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 …

  27. 0 1 2 3 4 5 6 7 [8 to D: six bytes reserved] E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 …

  28. 0 1 2 3 4 5 6 7 [8 to D: 3 G R E Y Ø] E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 …
  29. Freeing memory

  30. donkey = malloc(6).cast(Donkey)
      donkey.initialize(3, "grey")
      free(donkey)
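
The `malloc`/`free` pair on the slide is pseudocode. A rough equivalent, assuming the six-byte layout from the memory diagram (one byte for the number, four for "GREY", one terminating zero), can be written against the C library bindings that ship with Crystal. Memory obtained this way is not tracked by the garbage collector, so the `free` call is our responsibility.

```crystal
# Reserve six bytes by hand: one for the number, four for "GREY"
# and one for the terminating zero byte.
ptr = LibC.malloc(6).as(UInt8*)
raise "out of memory" if ptr.null?

ptr[0] = 3_u8
"GREY".bytes.each_with_index { |byte, i| ptr[i + 1] = byte }
ptr[5] = 0_u8

bytes = Slice.new(ptr, 6).to_a # copy the contents out before freeing
LibC.free(ptr.as(Void*))       # after this line, ptr must not be used again

puts bytes.inspect
```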

  31. Garbage collection ✨

  32. Calling functions

  34. 10 LET S = 0
      15 MAT INPUT V
      20 LET N = NUM
      30 IF N = 0 THEN 99
      40 FOR I = 1 TO N
      45 LET S = S + V(I)
      50 NEXT I
      60 PRINT S/N
      70 GO TO 10
      99 END

  36. Structured programming [makes] extensive use of subroutines […]

  37. return

  38. 9074: eb 4a                   jmp   90c0 <_mysql_init_character_set+0x124>
      9076: 8b 43 f8                mov   -0x8(%rbx),%eax
      9079: 83 f8 01                cmp   $0x1,%eax
      907c: 74 04                   je    9082 <_mysql_init_character_set+0xe6>
      907e: 85 c0                   test  %eax,%eax
      9080: 75 06                   jne   9088 <_mysql_init_character_set+0xec>
      9082: 4c 8b 7b f0             mov   -0x10(%rbx),%r15
      9086: eb 38                   jmp   90c0 <_mysql_init_character_set+0x124>
      9088: 48 8b 4b f0             mov   -0x10(%rbx),%rcx
      908c: 48 8d 35 15 a3 03 00    lea   0x3a315(%rip),%rsi   # 433a8 <_zcfree+0x1fd6>
      9093: bf 51 04 00 00          mov   $0x451,%edi
      9098: 31 d2                   xor   %edx,%edx
      909a: 31 c0                   xor   %eax,%eax
      909c: e8 72 76 02 00          callq 30713 <_my_printf_error>
      90a1: 48 8d 35 56 a3 03 00    lea   0x3a356(%rip),%rsi   # 433fe <_zcfree+0x202c>
      90a8: 4c 8d 3d 61 b9 03 00    lea   0x3b961(%rip),%r15   # 44a10 <_zcfree+0x363e>
      90af: bf 51 04 00 00          mov   $0x451,%edi
      90b4: 31 d2                   xor   %edx,%edx
      90b6: 31 c0                   xor   %eax,%eax
      90b8: 4c 89 f9                mov   %r15,%rcx

  39. 0000000000030713 <_my_printf_error>:
      30713: 55                      push  %rbp
      30714: 48 89 e5                mov   %rsp,%rbp
      30717: 41 57                   push  %r15
      30719: 41 56                   push  %r14
      3071b: 41 54                   push  %r12
      3071d: 53                      push  %rbx
      3071e: 48 81 ec d0 02 00 00    sub   $0x2d0,%rsp
      …
      30932: 4c 3b 75 e8             cmp   -0x18(%rbp),%r14
      30936: 75 0c                   jne   30944 <_my_printf_warning+0xda>
      30938: 48 81 c4 d0 02 00 00    add   $0x2d0,%rsp
      3093f: 5b                      pop   %rbx
      30940: 41 5e                   pop   %r14
      30942: 5d                      pop   %rbp
      30943: c3                      retq
  40-47. [Animated diagram: main calls my_printf_error, and the call pushes a return address onto the STACK (4 byte elements). my_printf_error then calls my_vsnprintf_ex, which pushes a second return address. Each ret pops one return address again.]
  48. Passing arguments

  49-55. [Animated diagram: param 1 and param 2 are pushed onto the stack, followed by the return address; all of them are popped again when the function returns.]

  56-63. [Animated diagram: nested calls. Each call pushes its params and a return address on top of those of the previous call; the returns pop them in reverse order.]

  64. Storing local variables

  65-70. [Animated diagram: local variables are pushed onto the stack after the params and the return address, and are popped again when the function returns.]
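
The stack diagrams above can be mimicked with a toy model. This is only a sketch for illustration: the `call` and `ret` helpers and the slot labels are made up, and real calling conventions pass many arguments in registers rather than on the stack.

```crystal
# A toy call stack: every entry is a labelled slot, newest last.
stack = [] of String

# Hypothetical helper: what a call sets up before the callee starts running.
def call(stack : Array(String), name : String, params : Array(String), locals : Array(String))
  params.each { |param| stack.push("#{name}: param #{param}") }
  stack.push("#{name}: return address")
  locals.each { |local| stack.push("#{name}: local #{local}") }
  params.size + 1 + locals.size # number of slots in this frame
end

# Hypothetical helper: what a return does, dropping the whole frame at once.
def ret(stack : Array(String), frame_size : Int32)
  stack.pop(frame_size)
end

outer = call(stack, "outer", ["a", "b"], ["x"])
inner = call(stack, "inner", ["c"], ["y", "z"])
puts stack.join("\n")

ret(stack, inner) # the innermost frame goes first
ret(stack, outer)
puts stack.empty?
```

Freeing a frame is a single pop, which is why stack storage needs no garbage collector.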

  72. Collecting garbage

  73-78. [Animated diagram: a stack holding param 1, param 2, a return address and local variables 1 to 3, next to objects in memory, two of which are labelled "?". A mark phase is followed by a sweep phase.]

  79. Mark and sweep: the simplest GC technique
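
A toy version of mark and sweep fits in a few lines. This is a sketch under simplifying assumptions: the `Obj` class, the explicit `heap` array and the `roots` array (standing in for the local variables on the stack) are invented for illustration and say nothing about how Crystal's own collector, the Boehm GC, is implemented.

```crystal
# A toy heap object: a name, outgoing references, and a mark bit.
class Obj
  getter name : String
  getter refs = [] of Obj
  property? marked = false

  def initialize(@name : String)
  end
end

# Mark phase: flag everything reachable from the roots.
def mark(roots : Array(Obj)) : Nil
  pending = roots.dup
  while obj = pending.pop?
    next if obj.marked?
    obj.marked = true
    pending.concat(obj.refs)
  end
end

# Sweep phase: keep marked objects, drop the rest, reset the marks.
def sweep(heap : Array(Obj)) : Array(Obj)
  survivors = heap.select(&.marked?)
  survivors.each { |obj| obj.marked = false }
  survivors
end

a, b, c, d = Obj.new("a"), Obj.new("b"), Obj.new("c"), Obj.new("d")
a.refs << b # a -> b
c.refs << d # c -> d, but nothing points at c
d.refs << c # a cycle alone does not keep garbage alive

heap = [a, b, c, d]
roots = [a] # stands in for the local variables on the stack

mark(roots)
heap = sweep(heap)
puts heap.map(&.name).join(", ")
```

Both phases visit many objects, so the cost grows with the size of the heap, which is one reason a program that allocates heavily can spend much of its time collecting.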

  81. 79% of time spent in the garbage collector‽

  82. (‽ interrobang, U+203D)

  83. Stop this GC madness!

  84. Avoiding garbage collection: three techniques

  85. Avoiding garbage collection through explicit reuse

  86. 1_000_000.times do
        point = Point.new(random(width), random(height))
        draw(point)
      end

  87. point = Point.new
      1_000_000.times do
        point.x = random(width)
        point.y = random(height)
        draw(point)
      end

  88. - only works for a single instance
      - mutating state can lead to bugs
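
The second drawback is easy to run into. In this small sketch (the `Point` class and the `kept` array are assumptions for illustration), the reused object is stored while it is still being mutated, so every stored entry ends up showing the last values written.

```crystal
class Point
  property x : Int32
  property y : Int32

  def initialize(@x : Int32 = 0, @y : Int32 = 0)
  end
end

# Explicit reuse: one allocation, mutated on every iteration.
point = Point.new
kept = [] of Point

3.times do |i|
  point.x = i
  point.y = i * 10
  kept << point # bug: stores the same object three times
end

puts kept.map(&.x).inspect
puts kept.all? { |kept_point| kept_point.same?(point) }
```

Explicit reuse is therefore safest when the object never outlives the iteration that filled it in.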
  89. Avoiding garbage collection through memory pooling

  91-93. pool = Pool(Entity).new(1000)
         entity = pool.acquire
         pool.release(entity)

  94. + can reuse multiple objects
      - memory management is more manual
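
The slides do not show how `Pool` is implemented. One possible minimal version, assuming a fixed capacity, a type with a no-argument constructor, and a hypothetical `Entity` class and `available` method added for the example, could look like this:

```crystal
# Hypothetical stand-in for a game entity.
class Entity
  property x = 0
  property y = 0
end

# A minimal fixed-size pool: all objects are created up front and recycled.
class Pool(T)
  @free : Array(T)

  def initialize(capacity : Int32)
    @free = Array(T).new(capacity) { T.new }
  end

  # Hands out a pre-allocated object instead of allocating a new one.
  def acquire : T
    @free.pop? || raise "pool exhausted"
  end

  # Returns an object to the pool so that a later acquire can reuse it.
  def release(object : T) : Nil
    @free.push(object)
  end

  # Number of objects currently available.
  def available : Int32
    @free.size
  end
end

pool = Pool(Entity).new(1000)
entity = pool.acquire
entity.x = 42
puts pool.available
pool.release(entity)
puts pool.available
```

A released object keeps whatever state it had, so in this sketch the caller has to reset it after `acquire`. That is part of the manual bookkeeping the slide warns about.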
  95. Avoiding garbage collection through stack allocation

  96-101. [Animated diagram: params, the return address and local variables are pushed onto the stack during a call and popped again on return.]

  102. class Point
         getter :x, :y

         def initialize(@x, @y)
         end
       end

  103. struct Point
         getter :x, :y

         def initialize(@x, @y)
         end
       end

  104. 1_000_000.times do
         point = Point.new(random(width), random(height))
         draw(point)
       end

  105. + no explicit memory management
       - only usable for local variables
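
The practical difference between the `class` and `struct` versions shows up in assignment. The sketch below uses two hypothetical types, `HeapPoint` and `StackPoint`, with type annotations as current Crystal requires: a class instance is shared by reference, while a struct is copied by value and needs no heap allocation of its own.

```crystal
# Reference type: instances live on the heap and are tracked by the GC.
class HeapPoint
  property x : Int32
  property y : Int32

  def initialize(@x : Int32, @y : Int32)
  end
end

# Value type: instances are stored inline, for example in the stack frame.
struct StackPoint
  property x : Int32
  property y : Int32

  def initialize(@x : Int32, @y : Int32)
  end
end

heap_a = HeapPoint.new(1, 2)
heap_b = heap_a # copies the reference: both names point at one object
heap_b.x = 10

stack_a = StackPoint.new(1, 2)
stack_b = stack_a # copies the value: two independent points
stack_b.x = 10

puts heap_a.x           # changed through heap_b
puts stack_a.x          # unaffected by stack_b
puts sizeof(HeapPoint)  # size of a reference (a pointer)
puts sizeof(StackPoint) # size of the struct's fields
```

Because structs are copied whenever they are assigned or passed around, they work best when small and short-lived, as in the drawing loop on the slide.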
  106. Don’t destroy the CPU cache!

  107-113. [Animated diagram of the memory hierarchy with approximate access times: HD ~ 1 000 000 ns, RAM ~ 500 ns, Cache ~ 1-10 ns.]

  116. If data is available in the cache, we have a cache hit. If it’s not available in the cache, we have a cache miss.

  117. To avoid cache misses, keep similar data together in memory.

  118. position, velocity, rotation, armor, shield

  119-120. position, velocity (for movement), rotation, armor, shield: 14 used bytes / 32 total bytes = 44% efficiency

  123. Store similar data in contiguous arrays.

  124. positions = Pool(Position).new(1000)
       velocities = Pool(Velocity).new(1000)
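
The idea of splitting entities into per-component arrays can be sketched as follows. Plain arrays stand in for the speaker's `Pool`, and the `Position` and `Velocity` fields and the `move` function are assumptions for illustration. Since both types are structs, each array stores its values inline, one after another, and the movement update touches nothing else.

```crystal
struct Position
  getter x : Float32
  getter y : Float32

  def initialize(@x : Float32, @y : Float32)
  end
end

struct Velocity
  getter dx : Float32
  getter dy : Float32

  def initialize(@dx : Float32, @dy : Float32)
  end
end

count = 1000

# Arrays of structs store the values inline, one after another.
positions = Array(Position).new(count) { |i| Position.new(i.to_f32, 0.0_f32) }
velocities = Array(Velocity).new(count) { Velocity.new(1.0_f32, 2.0_f32) }

# The movement update only walks the two arrays it needs; rotation,
# armor and shield would live in arrays of their own.
def move(positions : Array(Position), velocities : Array(Velocity), dt : Float32) : Nil
  positions.size.times do |i|
    pos = positions[i]
    vel = velocities[i]
    positions[i] = Position.new(pos.x + vel.dx * dt, pos.y + vel.dy * dt)
  end
end

move(positions, velocities, 0.5_f32)
puts positions[0].x
puts positions[0].y
```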

  125. Demo

  126. Demo: Stack allocation

  127-128. [Diagram labelled "r"]

  129-130. $\frac{\#\text{inside}}{\#\text{total}} \approx \frac{\pi \times r^2}{(2 \times r)^2}$

  131. $\frac{\#\text{inside}}{\#\text{total}} \approx \frac{\pi}{4}$

  132. $\pi \approx \frac{4 \times \#\text{inside}}{\#\text{total}}$
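
The demo source is not part of the transcript, so the following is only a guess at its shape: a Monte Carlo estimate of π that creates one short-lived `Point` per sample. The `Point` struct, the `estimate_pi` helper and the fixed seed are assumptions. Declaring `Point` as a `struct` keeps the samples off the heap; changing it to `class` would allocate an object per iteration.

```crystal
struct Point
  getter x : Float64
  getter y : Float64

  def initialize(@x : Float64, @y : Float64)
  end

  # True if the point lies within the unit circle around the origin.
  def inside? : Bool
    x * x + y * y <= 1.0
  end
end

# Hypothetical helper: estimates π from `samples` random points.
def estimate_pi(samples : Int32, rng : Random = Random.new(42_u64)) : Float64
  inside = 0
  samples.times do
    # Random point in the square from -1 to 1 on both axes, so r = 1.
    point = Point.new(rng.rand * 2 - 1, rng.rand * 2 - 1)
    inside += 1 if point.inside?
  end
  4.0 * inside / samples
end

puts estimate_pi(1_000_000)
```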

  133. (use the source, luke)

  134. Demo: Cache grinding

  135. (use the source, luke)

  137. slack @denis / mail denis@soundcloud.com. Denis Defreyne. Ask me about anything but potatoes.

  138. Extra slides

  139. lldb, gdb: debugger
       Instruments (Mac OS X): performance analyser and visualiser
       dtrace: dynamic tracing framework