Taming memory: performance-tuning a (Crystal) application [RUG::B edition]

When developing a game, you need to pay attention to performance. After all, a game needs to run fast and keep a predictable frame rate; stuttering will throw people off.

I’ve had performance issues even in Crystal, a fast, compiled, statically-typed language with a syntax inspired by Ruby. As it turns out, the way a program handles memory can have a huge impact on performance. Luckily, Crystal gives a great deal of control over how this can be done. It’s also possible to use familiar tools with Crystal to debug issues and identify bottlenecks.

In this talk, I’ll share what I’ve learnt about memory and performance tuning, and give an introduction to several powerful tools for identifying performance issues.

Denis Defreyne

December 03, 2015

Transcript

  1. Taming memory: Performance-tuning a (Crystal) application. Denis Defreyne / RUG::B / December 3, 2015

  3. DISCLAIMER: The contents of this talk aren’t particularly revolutionary.

  6. Crystal

  7. DISCLAIMER: I don’t know much about game development.

  14. memory, the game; memory, the computer thingie

  15. Allocating objects

  16. donkey = Donkey.new(3, "grey")

  17. donkey = Donkey.allocate
      donkey.initialize(3, "grey")

  18. donkey = malloc(6).cast(Donkey)
      donkey.initialize(3, "grey")

  19. What is memory?

  21. [Diagram: memory as a row of cells with hexadecimal addresses 0, 1, 2, …, 23, …]

  22. [Diagram: the same memory, with the six cells at addresses 8 to D singled out]

  23. [Diagram: the same memory, with the cells at addresses 8 to D holding 3, G, R, E, Y, Ø]

  24. Freeing memory

  25. donkey = malloc(6).cast(Donkey)
      donkey.initialize(3, "grey")
      free(donkey)

  26. Garbage collection ✨

  27. Calling functions

  29. 10 LET S = 0
      15 MAT INPUT V
      20 LET N = NUM
      30 IF N = 0 THEN 99
      40 FOR I = 1 TO N
      45 LET S = S + V(I)
      50 NEXT I
      60 PRINT S/N
      70 GO TO 10
      99 END

  30. (the same listing again)

  31. Structured programming [makes] extensive use of subroutines […]

  32. return

  33. 9074: eb 4a                   jmp 90c0 <_mysql_init_character_set+0x124>
      9076: 8b 43 f8                mov -0x8(%rbx),%eax
      9079: 83 f8 01                cmp $0x1,%eax
      907c: 74 04                   je 9082 <_mysql_init_character_set+0xe6>
      907e: 85 c0                   test %eax,%eax
      9080: 75 06                   jne 9088 <_mysql_init_character_set+0xec>
      9082: 4c 8b 7b f0             mov -0x10(%rbx),%r15
      9086: eb 38                   jmp 90c0 <_mysql_init_character_set+0x124>
      9088: 48 8b 4b f0             mov -0x10(%rbx),%rcx
      908c: 48 8d 35 15 a3 03 00    lea 0x3a315(%rip),%rsi # 433a8 <_zcfree+0x1fd6>
      9093: bf 51 04 00 00          mov $0x451,%edi
      9098: 31 d2                   xor %edx,%edx
      909a: 31 c0                   xor %eax,%eax
      909c: e8 72 76 02 00          callq 30713 <_my_printf_error>
      90a1: 48 8d 35 56 a3 03 00    lea 0x3a356(%rip),%rsi # 433fe <_zcfree+0x202c>
      90a8: 4c 8d 3d 61 b9 03 00    lea 0x3b961(%rip),%r15 # 44a10 <_zcfree+0x363e>
      90af: bf 51 04 00 00          mov $0x451,%edi
      90b4: 31 d2                   xor %edx,%edx
      90b6: 31 c0                   xor %eax,%eax
      90b8: 4c 89 f9                mov %r15,%rcx

  34. 0000000000030713 <_my_printf_error>:
      30713: 55                      push %rbp
      30714: 48 89 e5                mov %rsp,%rbp
      30717: 41 57                   push %r15
      30719: 41 56                   push %r14
      3071b: 41 54                   push %r12
      3071d: 53                      push %rbx
      3071e: 48 81 ec d0 02 00 00    sub $0x2d0,%rsp
      …
      30932: 4c 3b 75 e8             cmp -0x18(%rbp),%r14
      30936: 75 0c                   jne 30944 <_my_printf_warning+0xda>
      30938: 48 81 c4 d0 02 00 00    add $0x2d0,%rsp
      3093f: 5b                      pop %rbx
      30940: 41 5e                   pop %r14
      30942: 5d                      pop %rbp
      30943: c3                      retq

  35. [Diagram: the STACK (4 byte elements) as main calls my_printf_error, which calls my_vsnprintf_ex; each call stores a return address and each ret removes it]

  36. Passing arguments

  37. [Diagram: stack holding param 1, param 2, return address, …]

  38. [Diagram: stack after nested calls, each holding its params and a return address]

  39. Storing local variables

  40. [Diagram: stack holding params, return address, …, and local variables]

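  42. Collecting garbage

  43. [Diagram: a stack frame (param 1, param 2, return address, local variables 1 to 3) next to cells labelled “?”, “?”, “local variable”, “local variable”]

  44. [Diagram: the cells labelled “?”, “?”, “local variable”, “local variable”]

  45. [Diagram: Mark phase, with the cells labelled “?”, “?”, “local variable”, “local variable”]

  46. [Diagram: Sweep phase, with only the two cells labelled “local variable” left]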

  47. Mark and sweep: the simplest GC technique
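The two phases named on the slides can be sketched on a toy heap. This is only an illustration of the general mark-and-sweep idea, not how Crystal's actual collector is implemented; the `Obj` class, the `mark` and `sweep` methods, and the `roots` array (standing in for the local variables on the stack) are all hypothetical names.

```crystal
# Toy heap object: a name plus references to other objects (hypothetical).
class Obj
  getter name : String
  getter refs = [] of Obj
  property? marked = false

  def initialize(@name : String)
  end
end

# Mark phase: flag everything reachable from the given object.
def mark(obj : Obj) : Nil
  return if obj.marked?
  obj.marked = true
  obj.refs.each { |child| mark(child) }
end

# Sweep phase: keep the marked objects, drop the rest, reset the flags.
def sweep(heap : Array(Obj)) : Array(Obj)
  survivors = heap.select(&.marked?)
  survivors.each { |obj| obj.marked = false }
  survivors
end

a = Obj.new("a")
b = Obj.new("b")
c = Obj.new("c") # nothing refers to c, so it is garbage
a.refs << b

heap = [a, b, c]
roots = [a] # stands in for the local variables on the stack

roots.each { |root| mark(root) }
heap = sweep(heap)
puts heap.map(&.name).join(", ") # prints "a, b"
```

Both phases walk objects one by one, which is why a program that allocates many short-lived objects can end up spending much of its time in the collector.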

  49. 79% of time spent in the garbage collector‽

  50. (‽ is the interrobang, U+203D)

  51. Stop this GC madness!

  52. Avoiding garbage collection: three techniques

  53. Avoiding garbage collection through explicit reuse

  54. 1_000_000.times do
        point = Point.new(random(width), random(height))
        draw(point)
      end

  55. point = Point.new
      1_000_000.times do
        point.x = random(width)
        point.y = random(height)
        draw(point)
      end

  56. - only works for a single instance
      - mutating state can lead to bugs

  57. Avoiding garbage collection through memory pooling

  58. pool = Pool(Entity).new(1000)
      entity = pool.acquire
      pool.release(entity)

  59. + can reuse multiple objects
      - memory management is more manual
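The slides use a `Pool(Entity)` without showing its definition. A minimal sketch of what such a pool might look like follows; the internals, the `available` helper, and the `Entity` fields are assumptions for illustration, not the implementation from the talk. Only `acquire`, `release`, and the `Pool(Entity).new(1000)` call shape come from the slides.

```crystal
# Hypothetical entity type; the fields are made up for this sketch.
class Entity
  property x : Float64 = 0.0
  property y : Float64 = 0.0
end

# Hypothetical fixed-size pool: every instance is allocated up front.
class Pool(T)
  @free : Array(T)

  def initialize(capacity : Int32)
    # Requires T to have a constructor that takes no arguments.
    @free = Array(T).new(capacity) { T.new }
  end

  # Hands out a preallocated instance, or nil when the pool is exhausted.
  def acquire : T?
    @free.pop?
  end

  # Gives an instance back to the pool so that it can be reused.
  def release(object : T) : Nil
    @free.push(object)
  end

  # Number of instances currently available.
  def available : Int32
    @free.size
  end
end

pool = Pool(Entity).new(1000)
entity = pool.acquire
if entity
  entity.x = 1.5 # the caller has to reset any state left by the previous user
  pool.release(entity)
end
puts pool.available # prints 1000
```

Returning `nil` from `acquire` when the pool is empty is one possible design choice; raising or growing the pool would be others. The "more manual" downside shows up in the usage: forgetting `release` leaks an instance, and using an entity after releasing it is a bug.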
  60. Avoiding garbage collection through stack allocation

  61. [Diagram: stack holding params, return address, …, and local variables]

  62. class Point
        getter :x, :y
        def initialize(@x, @y)
        end
      end

  63. struct Point
        getter :x, :y
        def initialize(@x, @y)
        end
      end

  64. 1_000_000.times do
        point = Point.new(random(width), random(height))
        draw(point)
      end

  65. + no explicit memory management
      - only usable for local variables
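A runnable variant of the struct version, as a sketch: current Crystal versions need type annotations on the instance variables, so `Int32` coordinates are assumed here. The `draw` method is a hypothetical stand-in that just adds the coordinates, and the built-in `rand` replaces the slides' `random`.

```crystal
# Assumption: Int32 coordinates (the slides do not state the types).
struct Point
  getter x : Int32
  getter y : Int32

  def initialize(@x : Int32, @y : Int32)
  end
end

# Hypothetical stand-in for the talk's draw method.
def draw(point : Point) : Int32
  point.x + point.y
end

width = 640
height = 480
checksum = 0_i64

1_000_000.times do
  # A struct is a value type: creating one does not allocate a
  # garbage-collected heap object the way a class instance does.
  point = Point.new(rand(width), rand(height))
  checksum += draw(point)
end

puts checksum
```

Changing `struct` to `class` is the only edit needed to compare the two variants, which is what makes this technique cheap to try.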
  66. Don’t destroy the CPU cache!

  67. HD: ~ 1 000 000 ns
      RAM: ~ 500 ns
      Cache: ~ 1-10 ns

  70. For maximum speed, keep similar data together in memory.

  71. [Diagram: entity data laid out as position, velocity, rotation, health, AI]

  72. [Diagram: the same layout: position, velocity, rotation, health, AI] 12 used bytes / 32 total bytes = 37.5% efficiency (for movement)

  75. Store similar data in contiguous arrays.

  76. positions = Pool(Position).new(1000)
      velocities = Pool(Velocity).new(1000)
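A sketch of the same idea using plain arrays of structs instead of the talk's `Pool`. The field names, the `Float32` types, and the `move` method are assumptions made for this example. Because `Position` and `Velocity` are structs, each array stores its elements inline, so the movement update only touches position and velocity data.

```crystal
# Hypothetical component types; field names and Float32 are assumptions.
struct Position
  getter x : Float32
  getter y : Float32

  def initialize(@x : Float32, @y : Float32)
  end
end

struct Velocity
  getter dx : Float32
  getter dy : Float32

  def initialize(@dx : Float32, @dy : Float32)
  end
end

# Advances every position by its velocity; reads nothing but these two arrays.
def move(positions : Array(Position), velocities : Array(Velocity), dt : Float32) : Nil
  positions.size.times do |i|
    pos = positions[i]
    vel = velocities[i]
    positions[i] = Position.new(pos.x + vel.dx * dt, pos.y + vel.dy * dt)
  end
end

count = 1000
positions = Array(Position).new(count) { Position.new(0_f32, 0_f32) }
velocities = Array(Velocity).new(count) { Velocity.new(1_f32, 0.5_f32) }

move(positions, velocities, 2_f32)
puts positions[0].x # prints 2.0
puts positions[0].y # prints 1.0
```

Rotation, health, and AI data would live in their own arrays, so they never occupy cache lines during the movement pass.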

  77. Demo

  78. Demo: Stack allocation

  79. (shamelessly copied from Tobi)

  80. [Diagram: a circle of radius r]

  81. #inside / #total ≈ (π × r²) / (2 × r)²

  82. #inside / #total ≈ π / 4

  83. π ≈ 4 × #inside / #total
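The demo source is not part of this transcript, so the following is only a guess at its shape: a Monte Carlo estimate of π based on the formula above, using a unit circle (r = 1) and a `Point` struct so that the loop creates no heap objects. The names `estimate_pi` and `inside_unit_circle?` are made up for this sketch.

```crystal
# Hypothetical point type; a struct, so the loop below allocates nothing on the heap.
struct Point
  getter x : Float64
  getter y : Float64

  def initialize(@x : Float64, @y : Float64)
  end

  # True when the point lies within the circle of radius 1 around the origin.
  def inside_unit_circle? : Bool
    x * x + y * y <= 1.0
  end
end

# Throws random points at the square [-1, 1) × [-1, 1) and counts the hits.
def estimate_pi(samples : Int32) : Float64
  inside = 0
  samples.times do
    point = Point.new(rand * 2 - 1, rand * 2 - 1)
    inside += 1 if point.inside_unit_circle?
  end
  # π ≈ 4 × #inside / #total
  4.0 * inside / samples
end

puts estimate_pi(1_000_000)
```

The result varies from run to run; with a million samples it usually lands within a few thousandths of π.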

  84. (use the source, luke)

  85. Demo: Cache grinding

  86. (use the source, luke)

  88. I’ll question your answers now. Denis Defreyne / @ddfreyne / denis@stoneship.org