Slide 1

Slide 1 text

Dive into CRuby naruse

Slide 2

Slide 2 text

Who is naruse? • twitter: @nalsh • GitHub: @nurse • Ruby committer (M17N, release engineering, etc.) • nkf maintainer • working at Treasure Data, Inc. 2

Slide 3

Slide 3 text

झࢫ • ϢʔεέʔεͷॏཁੑΛ ஌Β͠ΊɺօʹऩूΛ ख఻ͬͯ΋Β͏ • RubyΑΓ௿͍ϨΠϠͷ ཧղΛਂΊΔ Goal • Understand how use case is important, and make people gather real examples • Understand layers lower than Ruby 3

Slide 4

Slide 4 text

Through RubyKaigi 2016, You may want to • write a new application • improve JRuby or mruby or something • improve CRuby 4

Slide 5

Slide 5 text

But... What should I do... 5

Slide 6

Slide 6 text

Motivations • New features • New methods or extend exist one • New platforms • Optimizations • speed • memory • Instruments to help trouble shooting • to ease debugging 6

Slide 7

Slide 7 text

New features • New features • New methods or extend exist one • String#scrub • Regexp#match? • etc. • New platforms • clang • Microsoft Visual C++ 2015 7

Slide 8

Slide 8 text

Why new feature? • Existing methods doesn’t fit my use case • This “use case” must be a real example. • It shall not be an artificial one. • Current method is too complicated, and it must be more simple for my use case 8

Slide 9

Slide 9 text

–Jonathan Swift, Travels into Several Remote Nations of the World (1726) “Necessity is the mother of invention.” but... 9

Slide 10

Slide 10 text

What is the truly necessary one? http://dic.nicovideo.jp/a/ސ٬͕ຊ౰ʹඞཁͩͬͨ΋ͷ 10

Slide 11

Slide 11 text

[Feature #6752] Replacing ill- formed subsequence • Repair invalid bytes in the given string. • Above snippet didn’t work as expected if `str` is already UTF-8. (it simply returned the original string) • Anyway what kind of strings are used? str = "\xe3\x81\x82\x80\x80" str.encode("utf-8", invalid: :replace) 11

Slide 12

Slide 12 text

Case: Web Forms • If strings come from Web Forms are broken, the web application should simply raise error: 400 Bad Request. • Clients should retry with correct request. 12

Slide 13

Slide 13 text

Case: Files • If strings come from files are broken, it’s usually because it doesn't specify correct encoding. • Retry reading with correct encoding. 13

Slide 14

Slide 14 text

Case: Web Crawler • They can’t retry. • They need to handle broken strings easily and secure. • There’s real pages, for example twitter... • mechanism: • wrongly get substring as bytes (though it must do as chars) • insert such invalid bytes into web page template 14

Slide 15

Slide 15 text

How do I design the interface • transcode base • Use Ruby’s encoding conversion engine. • Some people are used to use this way with iconv. • regexp base • Use regular expression engine (Ruby usually uses regexp engine on handling characters) • Better performance and meaning. • naming issue str.encode("utf-8", invalid: :replace) 15

Slide 16

Slide 16 text

Added both APIs • String#encode • tables under enc/trans • CSI (Code Set Independent; they doesn’t convert into Unicode if it doesn’t need) • but in 21st century, people rarely convert encodings… • String#scrub • regexp base 16

Slide 17

Slide 17 text

CRuby files • Ruby class name/module name • CRuby component name (+suffix) • utility functions • etc. 17

Slide 18

Slide 18 text

Ruby class/module 18

Slide 19

Slide 19 text

documents 19

Slide 20

Slide 20 text

encodings 20

Slide 21

Slide 21 text

regular expressions Onigmo (Oniguruma fork) 21

Slide 22

Slide 22 text

Ruby VM (YARV) 22

Slide 23

Slide 23 text

GC 23

Slide 24

Slide 24 text

build a working field of nobu 24

Slide 25

Slide 25 text

platforms NOTE: cygwin has no maintainer 25

Slide 26

Slide 26 text

tests & benchmarks 26

Slide 27

Slide 27 text

Impl: String#scrub • This is a method of `String` class.
 → string.c 27

Slide 28

Slide 28 text

the end of string.c void Init_String(void) { rb_cString = rb_define_class("String", rb_cObject); rb_include_module(rb_cString, rb_mComparable); … rb_define_method(rb_cString, "empty?", rb_str_empty, 0); rb_define_method(rb_cString, "=~", rb_str_match, 1); rb_define_method(rb_cString, "getbyte", rb_str_getbyte, 1); rb_define_method(rb_cString, "setbyte", rb_str_setbyte, 2); … rb_define_method(rb_cString, "scrub", str_scrub, -1); rb_define_method(rb_cString, "scrub!", str_scrub_bang, -1); … } define method in String class named “scrub” whose implementation is `str_scrub` function, and the number of its arguments is -1 (variable arguments). 28

Slide 29

Slide 29 text

/* * call-seq: * str.scrub -> new_str * str.scrub(repl) -> new_str * str.scrub{|bytes|} -> new_str * * If the string is invalid byte sequence then replace invalid bytes * with given replacement character, else returns self. * If block is given, replace invalid bytes with returned value of the block. * * "abc\u3042\x81".scrub #=> "abc\u3042\uFFFD" * "abc\u3042\x81".scrub("*") #=> "abc\u3042*" * "abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc \u3042" */ static VALUE str_scrub(int argc, VALUE *argv, VALUE str) { VALUE repl = argc ? (rb_check_arity(argc, 0, 1), argv[0]) : Qnil; VALUE new = rb_str_scrub(str, repl); return NIL_P(new) ? rb_str_dup(str): new; } 29

Slide 30

Slide 30 text

static VALUE str_scrub(int argc, VALUE *argv, VALUE str) { VALUE repl = argc ? (rb_check_arity(argc, 0, 1), argv[0]) : Qnil; VALUE s = rb_str_scrub(str, repl); return NIL_P(s) ? rb_str_dup(str): s; } • VALUE: a pointer type for Ruby object. • remind that Ruby’s data is always an object.
 (there’s no primitive type) • argv: C array of VALUEs • argc: the number of VALUEs in argv • arity: the number of arguments • rb_check_arity(argc,min,max): raise exception if arity doesn’t match. • Qnil: `nil` in CRuby source code • NIL_P(x): `x.nil?`; Like Lisp CRuby source uses `_p` instead of `?` 30

Slide 31

Slide 31 text

/** * @param str the string to be scrubbed * @param repl the replacement character * @return If given string is invalid, returns a new string. Otherwise, returns Qnil. */ VALUE rb_str_scrub(VALUE str, VALUE repl) { return rb_enc_str_scrub(STR_ENC_GET(str), str, repl); } • STR_ENC_GET(): returns `rb_encoding *` which is C structure to represent Ruby’s Encoding. • There’s some similar relations between C structures and Ruby class like Hash and st_table, Thread class and rb_thread_t and so on. 31

Slide 32

Slide 32 text

VALUE rb_enc_str_scrub(rb_encoding *enc, VALUE str, VALUE repl) { VALUE buf = rb_str_buf_new(RSTRING_LEN(str)); const char *rep = /* replacement char */; long replen = strlen(rep); const char *p = RSTRING_PTR(str); const char *e = RSTRING_END(str); const char *p1 = p; while (p < e) { int ret = rb_enc_precise_mbclen(p, e, enc); if (MBCLEN_NEEDMORE_P(ret)) { break; } else if (MBCLEN_CHARFOUND_P(ret)) { p += MBCLEN_CHARFOUND_LEN(ret); } else if (MBCLEN_INVALID_P(ret)) { ... 32

Slide 33

Slide 33 text

[Feature #7361] Pathname#touch • touch(1) • change file access and modification times • NOTE: This ticket is rejected. new_file = Pathname.new('location/for/new/file') FileUtils.touch(new_file) 33

Slide 34

Slide 34 text

NAME touch - change file access and modification times SYNOPSIS touch [-acm] [-r ref_file|-t time|-d date_time] file… OPTIONS -a: Change the access time of file. -c: Do not create a specified file if it does not exist. -m: Change the modification time of file. http://pubs.opengroup.org/onlinepubs/9699919799/utilities/touch.html 34

Slide 35

Slide 35 text

But… What is this method? • “Behave like touch(1)” • This can’t be a correct answer. • This has two different features: • “change file access and modification times” • “create empty file” • It describes the spec, but doesn’t describe what users truly want to do. 35

Slide 36

Slide 36 text

Conclusion of suggesting a new feature • Collect use cases: • take a memo if you find a case which should be written shorter. (only this can be a good feature request) • Time class and Integer bit manipulation seems to be improved… • Design API for the group of use cases • and find a good name ;-) • Read ʮAPIσβΠϯέʔεελσΟ ――Rubyͷ࣮ྫ͔ ΒֶͿɻ໰୊ʹଈͨ͠σβΠϯͱීวͷߟ͑ํʯ • http://gihyo.jp/book/2016/978-4-7741-7802-8 36

Slide 37

Slide 37 text

New platforms • On a brand new environment, Ruby sometimes doesn’t work. For example clang (a C language family frontend for LLVM), and Microsoft Visual C++ 2015 (VC14). • clang (including llvm-gcc): • It usually behaves like GCC, but different in some points. • Visual C++ 2015 • It introduced the Universal C Runtime, which breaks many compatibility. 37

Slide 38

Slide 38 text

clang (including llvm-gcc) • It optimizes code more aggressive than gcc at some point. It breaks Ruby’s GC (because Ruby uses conservative mark & sweep GC). • The most tough one was r34278, in which llvm optimized out the memory allocation for continuation. • You may think no one uses continuation, but actually uses through enumerator. 38

Slide 39

Slide 39 text

Visual C++ 2015 • It introduced the Universal C Runtime, which breaks many compatibility. • The most tough part is around _pioinfo: • _pioinfo is an internal array of IO structure. It stores underlying OS file HANDLE, file position (cursor), file attributes, text mode, and so on. Ruby handles the structure • to associate socket and fd: CRuby creates fd with dummy file handle and set socket • to implement overlapped I/O for Windows 2000/XP • to emulate fcntl(2) • But VC2015 make it private. • NOTE: overlapped I/O is a Microsoft’s term similar to nonblocking I/O. 39

Slide 40

Slide 40 text

static void set_pioinfo_extra(void) { # define FUNCTION_RET 0xc3 /* ret */ /* get __pioinfo addr with _isatty */ char *p = (char*)get_proc_address("ucrtbase.dll", "_isatty", NULL); char *pend = p; /* _osfile(fh) & FDEV */ int32_t rel; char *rip; /* add rsp, _ */ # define FUNCTION_BEFORE_RET_MARK "\x48\x83\xc4" # define FUNCTION_SKIP_BYTES 1 /* lea rdx,[__pioinfo's addr in RIP-relative 32bit addr] */ # define PIOINFO_MARK "\x48\x8d\x15" if (p) { for (pend += 10; pend < p + 300; pend++) { // find end of function if (memcmp(pend, FUNCTION_BEFORE_RET_MARK, sizeof(FUNCTION_BEFORE_RET_MARK) - 1) == 0 && *(pend + (sizeof(FUNCTION_BEFORE_RET_MARK) - 1) + FUNCTION_SKIP_BYTES) & FUNCTION_RET == FUNCTION_RET) { // search backwards from end of function for (pend -= (sizeof(PIOINFO_MARK) - 1); pend > p; pend--) { if (memcmp(pend, PIOINFO_MARK, sizeof(PIOINFO_MARK) - 1) == 0) { p = pend; goto found; } } break; } } } fprintf(stderr, "unexpected " UCRTBASE "\n"); _exit(1); found: p += sizeof(PIOINFO_MARK) - 1; rel = *(int32_t*)(p); rip = p + sizeof(int32_t); Even if _pioinfo is private, it exists on the memory, and we can get it. The problem is how to reach it. I use the fact _isatty() refers _pioinfo. I can read the function body of _isatty through the function pointer of it, find the instruction whose argument is _pioinfo, and I get the address of _pioinfo. 40

Slide 41

Slide 41 text

after story • Receive a patch to fix the code to support Visual C++ 2015’s DEBUG build. • I felt a deep emotion by the fact there’s a person who use such dirty code, and moreover read the code and contribute a fix. What a great thing OSS is! 41

Slide 42

Slide 42 text

Performance improvements • Speeding up • Optimize Ruby Applications • Optimize CRuby • some parts (hash, string, etc.) • Ruby VM • Profiling for speeding up • stackprof • perf(1) 42

Slide 43

Slide 43 text

Theory of improving performance 1.Take a benchmark • Take a profile of the target application (better) • for example stackprof, derailed_benchmarks. 2.Find a bottle neck (hotspot). 3.Optimize the bottle neck. 4.win! 43

Slide 44

Slide 44 text

benchmark 44

Slide 45

Slide 45 text

% ls benchmark bm_app_answer.rb bm_vm1_gc_short_lived.rb bm_app_aobench.rb bm_vm1_gc_short_with_complex_long.rb bm_app_erb.rb bm_vm1_gc_short_with_long.rb bm_app_factorial.rb bm_vm1_gc_short_with_symbol.rb bm_app_fib.rb bm_vm1_gc_wb_ary.rb bm_app_lc_fizzbuzz.rb bm_vm1_gc_wb_ary_promoted.rb bm_app_mandelbrot.rb bm_vm1_gc_wb_obj.rb bm_app_pentomino.rb bm_vm1_gc_wb_obj_promoted.rb bm_app_raise.rb bm_vm1_ivar.rb bm_app_strconcat.rb bm_vm1_ivar_set.rb bm_app_tak.rb bm_vm1_length.rb bm_app_tarai.rb bm_vm1_lvar_init.rb bm_app_uri.rb bm_vm1_lvar_set.rb bm_array_shift.rb bm_vm1_neq.rb bm_hash_aref_dsym.rb bm_vm1_not.rb bm_hash_aref_dsym_long.rb bm_vm1_rescue.rb bm_hash_aref_fix.rb bm_vm1_simplereturn.rb bm_hash_aref_flo.rb bm_vm1_swap.rb bm_hash_aref_miss.rb bm_vm1_yield.rb bm_hash_aref_str.rb bm_vm2_array.rb bm_hash_aref_sym.rb bm_vm2_bigarray.rb bm_hash_aref_sym_long.rb bm_vm2_bighash.rb bm_hash_flatten.rb bm_vm2_case.rb bm_hash_ident_flo.rb bm_vm2_case_lit.rb bm_hash_ident_num.rb bm_vm2_defined_method.rb bm_hash_ident_obj.rb bm_vm2_dstr.rb bm_hash_ident_str.rb bm_vm2_eval.rb bm_hash_ident_sym.rb bm_vm2_method.rb bm_hash_keys.rb bm_vm2_method_missing.rb bm_hash_shift.rb bm_vm2_method_with_block.rb bm_hash_shift_u16.rb bm_vm2_mutex.rb … … 45

Slide 46

Slide 46 text

Take a profile with New Relic 46

Slide 47

Slide 47 text

Result 47

Slide 48

Slide 48 text

Result… 48

Slide 49

Slide 49 text

bundler rack railties actionpack activesupport actionpack activesupport activerecord 49

Slide 50

Slide 50 text

String#blank? includes IO wait 50

Slide 51

Slide 51 text

No bottle neck… I wish you try with your applications, and confirm there’s no bottle neck. Anyway I talk about String#blank? 51

Slide 52

Slide 52 text

class String BLANK_RE = /\A[[:space:]]*\z/ # A string is blank if it's empty or contains whitespaces only: # # ''.blank? # => true # ' '.blank? # => true # "\t\n\r".blank? # => true # ' blah '.blank? # => false # # Unicode whitespace is supported: # # "\u00a0".blank? # => true # # @return [true, false] def blank? # The regexp that matches blank strings is expensive. For the case of empty # strings we can speed up this method (~3.5x) with an empty? call. The # penalty for the rest of strings is marginal. empty? || BLANK_RE.match?(self) end end String#blank? Check Unicode aware string emptiness 52

Slide 53

Slide 53 text

Regexp#match? • It was used Regexp#===. • Usually Regular Expression match stores $& (MatchData.last_match); it is slow because of creating MatchData object. • → Regexp#match? don’t update $&
 (Ruby 2.4 feature) • Note: Perl has similar optimization 53

Slide 54

Slide 54 text

SSE 4.2 STTNI • String and Text New Instructions • Handle 16 bytes at once. • But all the 16 bytes must be available. If not, it will cause SEGV. • ARM SVE (Scalable Vector Extension), an instruction set extension for HPC, which will be used by Post-K computer. VALUE rb_str_blank(VALUE str) { const unsigned char *p = (const unsigned char *)RSTRING_PTR(str); const unsigned char *e = (const unsigned char *)RSTRING_END(str); intptr_t pe = (intptr_t)e; const __m128i mask = _mm_set_epi8(0,0,0,0,0,0,0,0,0,0,9,10,11,12,13,32); const int masksize = 6; const int mode = _SIDD_CMP_EQUAL_ANY|_SIDD_UBYTE_OPS|_SIDD_MASKED_NEGATIVE_POLARITY; if (RSTRING_LEN(str) == 0) return Qtrue; /* set the edge of a page before the end of string */ if (pe & 0xfff > 0xff1) { pe &= ~0xfff; pe |= 0xff1; } for (; (intptr_t)p < pe; p += sizeof(__m128i)) { int idx, len; ptrdiff_t sz; __m128i m; retry: sz = e - p; len = (int)((sz&INT_MAX) | (sz >> 27)); m = _mm_loadu_si128((__m128i const *)p); /* CF: 1 if there's non spaces * ZF: 1 if reached the end */ if (_mm_cmpestra(mask, masksize, m, len, mode)) { /* CF=0 ZF=0 */ continue; } #if 0 /* GCC 6 wrongly generates cmpestri and cmpestri... */ if (_mm_cmpestrc(mask, masksize, m, len, mode)) { /* CF=0 ZF=1 */ return Qtrue; } idx = _mm_cmpestri(mask, masksize, m, len, mode); #else idx = _mm_cmpestri(mask, masksize, m, len, mode); if (sz < idx) return Qtrue; #endif p += idx; if (!str_blank0(&p, e, FALSE)) return Qfalse; goto retry; } if (!str_blank0(&p, e, TRUE)) return Qfalse; return Qtrue; } 54

Slide 55

Slide 55 text

Improve RubyVM • Anyway if there’s no bottle neck in Ruby layer, it needs to dive into C layer • Get a profile with C level profiler • OProfile • perf(1) 55

Slide 56

Slide 56 text

# Samples: 32K of event 'cycles:u' # Event count (approx.): 19800054829 # # Overhead Command Shared Object Symbol # ........ .............. .................. .................................................... # 66.81% miniruby miniruby [.] vm_exec_core 7.03% miniruby miniruby [.] vm_search_method.isra.97 3.62% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_0params_0locals 2.95% miniruby miniruby [.] vm_call_cfunc 2.18% miniruby miniruby [.] rb_ary_entry 0.95% miniruby miniruby [.] invoke_iseq_block_from_c 0.92% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_1params_1locals 0.91% miniruby miniruby [.] rb_ary_push 0.85% miniruby miniruby [.] vm_call_opt_send 0.82% miniruby miniruby [.] ary_ensure_room_for_push 0.73% miniruby miniruby [.] rb_ary_modify 0.71% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_0params_2locals 0.66% miniruby miniruby [.] vm_call_iseq_setup 0.58% miniruby miniruby [.] rb_ary_rotate_bang 0.56% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_0params_1locals 0.52% miniruby miniruby [.] opt_eq_func 0.41% miniruby miniruby [.] method_entry_get 0.38% miniruby miniruby [.] vm_callee_setup_block_arg 0.37% miniruby miniruby [.] rb_ary_transpose 0.34% miniruby miniruby [.] invoke_block_from_c_splattable.constprop.136 0.34% miniruby miniruby [.] vm_exec 0.33% miniruby miniruby [.] rb_yield perf report of optcarrot 56

Slide 57

Slide 57 text

perf report of fluentd You can confirm you application 57

Slide 58

Slide 58 text

vm_exec_core • The core function of RubyVM (YARV) • bytecode interpreter’s method dispatch routine • very large (30KB~40KB) switch-case • Strict to say, “switch-case” is optimized to jump instructions (Direct Threaded Code) 58

Slide 59

Slide 59 text

vm_exec.c 59

Slide 60

Slide 60 text

static VALUE vm_exec_core(rb_thread_t *th, VALUE initial) { register rb_control_frame_t *reg_cfp; const VALUE *reg_pc; #include "vmtc.inc" if (UNLIKELY(th == 0)) { return (VALUE)insns_address_table; } reg_cfp = th->cfp; reg_pc = reg_cfp->pc; first: INSN_DISPATCH(); /*****************/ #include "vm.inc" /*****************/ END_INSNS_DISPATCH(); /* unreachable */ rb_bug("vm_eval: unreachable"); goto first; } vm_exec_core (some macros are removed) keeps method/block local information program counter, What bytecode it executes now 60

Slide 61

Slide 61 text

How bytecode interpreter behave v1 PUSH = 1; ADD = 2; MINUS = 3 pc = 0 # program counter stack = [] iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence while (true) { switch (iseq[pc++]) { case PUSH: stack.push(iseq[pc++]); break; case ADD: stack.push(stack.pop + stack.pop); break; case MINUS: stack.push(stack.pop - stack.pop); break; /* ... */ } } 61

Slide 62

Slide 62 text

How bytecode interpreter behave v1 PUSH = 1; ADD = 2; MINUS = 3 pc = 0 # program counter stack = [] iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence while (true) { switch (iseq[pc++]) { case PUSH: stack.push(iseq[pc++]); break; case ADD: stack.push(stack.pop + stack.pop); break; case MINUS: stack.push(stack.pop - stack.pop); break; /* ... */ } } instruction body 62

Slide 63

Slide 63 text

How bytecode interpreter behave v1 PUSH = 1; ADD = 2; MINUS = 3 pc = 0 # program counter stack = [] iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence while (true) { switch (iseq[pc++]) { case PUSH: stack.push(iseq[pc++]); break; case ADD: stack.push(stack.pop + stack.pop); break; case MINUS: stack.push(stack.pop - stack.pop); break; /* ... */ } } 1. loop overhead 2. fetch an instruction 3. jump to the instruction body 63

Slide 64

Slide 64 text

How bytecode interpreter behave v1 PUSH = 1; ADD = 2; MINUS = 3 pc = 0 # program counter stack = [] iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence while (true) { switch (iseq[pc++]) { # indirect branch case PUSH: stack.push(iseq[pc++]); break; case ADD: stack.push(stack.pop + stack.pop); break; case MINUS: stack.push(stack.pop - stack.pop); break; /* ... */ } } branch misprediction 64

Slide 65

Slide 65 text

How bytecode interpreter behave v2: threading code pc = 0 # program counter stack = [] iseq = [&&label_PUSH, 1, &&label_PUSH, 2, &&label_ADD] goto iseq[pc++] label_PUSH: stack.push(iseq[pc++]); goto iseq[pc++] # indirect branch label_ADD: stack.push(stack.pop + stack.pop); goto iseq[pc++] # indirect branch label_MINUS: stack.push(stack.pop - stack.pop); goto iseq[pc++] # indirect branch /* ... */ 65

Slide 66

Slide 66 text

How bytecode interpreter behave v2: threading code pc = 0 # program counter stack = [] iseq = [&&label_PUSH, 1, &&label_PUSH, 2, &&label_ADD] goto iseq[pc++] label_PUSH: stack.push(iseq[pc++]); goto iseq[pc++] # indirect branch label_ADD: stack.push(stack.pop + stack.pop); goto iseq[pc++] # indirect branch label_MINUS: stack.push(stack.pop - stack.pop); goto iseq[pc++] # indirect branch /* ... */• No loop overhead • branch prediction friendly 66

Slide 67

Slide 67 text

But some people learned that branch prediction for indirect branches are hard work… Let’s get a profile 67

Slide 68

Slide 68 text

perf stat result of optcarrot on SandyBridge $ perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/ optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 13.027601466360375 checksum: 59662 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/ shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 14796.375448 task-clock (msec) # 0.998 CPUs utilized 185 context-switches # 0.013 K/sec 0 cpu-migrations # 0.000 K/sec 14,404 page-faults # 0.973 K/sec 33,417,895,565 cycles # 2.259 GHz [83.29%] 11,061,511,903 stalled-cycles-frontend # 33.10% frontend cycles idle [83.34%] 7,225,689,220 stalled-cycles-backend # 21.62% backend cycles idle [66.70%] 51,166,479,085 instructions # 1.53 insns per cycle # 0.22 stalled cycles per insn [83.38%] 9,698,878,651 branches # 655.490 M/sec [83.35%] 507,996,700 branch-misses # 5.24% of all branches [83.34%] 14.820478694 seconds time elapsed 68

Slide 69

Slide 69 text

perf stat result of optcarrot on Haswell % perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/ optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] fps: 27.269785989915373 checksum: 59662 Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/ shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes': 7214.329827 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 14849 page-faults:u # 0.002 M/sec 19775865899 cycles:u # 2.741 GHz 50975216372 instructions:u # 2.58 insn per cycle 9662536352 branches:u # 1339.353 M/sec 67884021 branch-misses:u # 0.70% of all branches 7.218712639 seconds time elapsed 69

Slide 70

Slide 70 text

Free lunch by Moore’s law is still alive $ perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/ optcarrot --benchmark optcarrot/examples/Lan_Master.nes ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux] Sandy Bridge: fps: 13.027601466360375 14796.375448 task-clock (msec) # 0.998 CPUs utilized 185 context-switches # 0.013 K/sec 14,404 page-faults # 0.973 K/sec 33,417,895,565 cycles # 2.259 GHz [83.29%] 51,166,479,085 instructions # 1.53 insns per cycle 9,698,878,651 branches # 655.490 M/sec [83.35%] 507,996,700 branch-misses # 5.24% of all branches [83.34%] 14.820478694 seconds time elapsed Haswell: fps: 27.269785989915373 7214.329827 task-clock:u (msec) # 0.999 CPUs utilized 0 context-switches:u # 0.000 K/sec 14849 page-faults:u # 0.002 M/sec 19775865899 cycles:u # 2.741 GHz 50975216372 instructions:u # 2.58 insn per cycle 9662536352 branches:u # 1339.353 M/sec 67884021 branch-misses:u # 0.70% of all branches 7.218712639 seconds time elapsed 70

Slide 71

Slide 71 text

/* -*-c-*- ******************************************************/ /****************************************************************/ /****************************************************************/ /** This file is VM main loop. ---- This file is auto generated by insns2vm.rb DO NOT TOUCH! If you want to fix something, you must edit 'insns.def' */ INSN_ENTRY(nop){ { DEBUG_ENTER_INSN("nop"); ADD_PC(1+0); PREFETCH(GET_PC()); #define CURRENT_INSN_nop 1 #define INSN_IS_SC() 0 #define INSN_LABEL(lab) LABEL_nop_##lab #define LABEL_IS_SC(lab) LABEL_##lab##_##t COLLECT_USAGE_INSN(BIN(nop)); { #line 40 "insns.def" /* none */ #line 32 "vm.inc" #undef CURRENT_INSN_nop #undef INSN_IS_SC #undef INSN_LABEL #undef LABEL_IS_SC END_INSN(nop);}}} INSN_ENTRY(getlocal){ { VALUE val; rb_num_t level = (rb_num_t)GET_OPERAND(2); lindex_t idx = (lindex_t)GET_OPERAND(1); DEBUG_ENTER_INSN("getlocal"); ADD_PC(1+2); PREFETCH(GET_PC()); #define CURRENT_INSN_getlocal 1 #define INSN_IS_SC() 0 #define INSN_LABEL(lab) LABEL_getlocal_##lab #define LABEL_IS_SC(lab) LABEL_##lab##_##t COLLECT_USAGE_INSN(BIN(getlocal)); COLLECT_USAGE_OPERAND(BIN(getlocal), 0, idx); COLLECT_USAGE_OPERAND(BIN(getlocal), 1, level); { #line 60 "insns.def" int i, lev = (int)level; const VALUE *ep = GET_EP(); for (i = 0; i < lev; i++) { ep = GET_PREV_EP(ep); } val = *(ep - idx); #line 65 "vm.inc" CHECK_VM_STACK_OVERFLOW_FOR_INSN(REG_CFP, 1); PUSH(val); #undef CURRENT_INSN_getlocal #undef INSN_IS_SC #undef INSN_LABEL #undef LABEL_IS_SC END_INSN(getlocal);}}} 71

Slide 72

Slide 72 text

/* -*-c-*- ******************************************************/ /****************************************************************/ /****************************************************************/ /** This file is VM main loop. ---- This file is auto generated by insns2vm.rb DO NOT TOUCH! If you want to fix something, you must edit 'insns.def' */ INSN_ENTRY(nop){ { DEBUG_ENTER_INSN("nop"); ADD_PC(1+0); PREFETCH(GET_PC()); #define CURRENT_INSN_nop 1 #define INSN_IS_SC() 0 #define INSN_LABEL(lab) LABEL_nop_##lab #define LABEL_IS_SC(lab) LABEL_##lab##_##t COLLECT_USAGE_INSN(BIN(nop)); { #line 40 "insns.def" /* none */ #line 32 "vm.inc" #undef CURRENT_INSN_nop #undef INSN_IS_SC #undef INSN_LABEL #undef LABEL_IS_SC END_INSN(nop);}}} /* r14: pc (program counter) */ /* r15: cfp (call frame pointer) */ /* NOTE: cop */ INSN_ENTRY(nop){ { ADD_PC(1+0); /* $rdx = $r14 + 8byte */ 209bb0: lea 0x8(%r14),%rdx /* $rax = $r14 */ 209bb4: mov %r14,%rax /* *$r15 = $rdx */ 209bb7: mov %rdx,(%r15) /* $r14 = $rdx */ 209bba: mov %rdx,%r14 END_INSN(nop);}}} /* goto *($rax + 8byte) */ 209bbd: jmpq *0x8(%rax) typedef struct rb_control_frame_struct { const VALUE *pc; /* cfp[0] */ VALUE *sp; /* cfp[1] */ const rb_iseq_t *iseq; /* cfp[2] */ VALUE self; /* cfp[3] */ const VALUE *ep; /* cfp[4] */ const void *block_code; /* cfp[5] */ } rb_control_frame_t; 72

Slide 73

Slide 73 text

opt_plus INSN_ENTRY(opt_plus){ { VALUE val; CALL_CACHE cc = (CALL_CACHE)GET_OPERAND(2); CALL_INFO ci = (CALL_INFO)GET_OPERAND(1); VALUE recv = TOPN(1); VALUE obj = TOPN(0); DEBUG_ENTER_INSN("opt_plus"); ADD_PC(1+2); PREFETCH(GET_PC()); POPN(2); #define CURRENT_INSN_opt_plus 1 #define INSN_IS_SC() 0 #define INSN_LABEL(lab) LABEL_opt_plus_##lab #define LABEL_IS_SC(lab) LABEL_##lab##_##t COLLECT_USAGE_INSN(BIN(opt_plus)); COLLECT_USAGE_OPERAND(BIN(opt_plus), 0, ci); COLLECT_USAGE_OPERAND(BIN(opt_plus), 1, cc); { #line 1377 "insns.def" if (FIXNUM_2_P(recv, obj) && BASIC_OP_UNREDEFINED_P(BOP_PLUS,INTEGER_REDEFINED_OP_FLAG)) { /* fixnum + fixnum */ #ifndef LONG_LONG_VALUE VALUE msb = (VALUE)1 << ((sizeof(VALUE) * CHAR_BIT) - 1); val = recv - 1 + obj; if ((~(recv ^ obj) & (recv ^ val)) & msb) { val = rb_int2big((SIGNED_VALUE)((val>>1) | (recv & msb))); } #else val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj)); #endif } else if (FLONUM_2_P(recv, obj) && BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) { val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj)); } else if (!SPECIAL_CONST_P(recv) && !SPECIAL_CONST_P(obj)) { if (RBASIC_CLASS(recv) == rb_cFloat && RBASIC_CLASS(obj) == rb_cFloat && BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) { val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj)); } else if (RBASIC_CLASS(recv) == rb_cString && RBASIC_CLASS(obj) == rb_cString && BASIC_OP_UNREDEFINED_P(BOP_PLUS, STRING_REDEFINED_OP_FLAG)) { val = rb_str_plus(recv, obj); } else if (RBASIC_CLASS(recv) == rb_cArray && BASIC_OP_UNREDEFINED_P(BOP_PLUS, ARRAY_REDEFINED_OP_FLAG)) { val = rb_ary_plus(recv, obj); } else { goto INSN_LABEL(normal_dispatch); } } else { INSN_LABEL(normal_dispatch): PUSH(recv); PUSH(obj); CALL_SIMPLE_METHOD(recv); } #line 2073 "vm.inc" CHECK_VM_STACK_OVERFLOW_FOR_INSN(REG_CFP, 1); PUSH(val); #undef CURRENT_INSN_opt_plus #undef INSN_IS_SC #undef INSN_LABEL #undef LABEL_IS_SC END_INSN(opt_plus);}}} 73

Slide 74

Slide 74 text

opt_plus • Optimized case is a home study • It’s a bit pattern magic DEFINE_INSN opt_plus (CALL_INFO ci, CALL_CACHE cc) (VALUE recv, VALUE obj) (VALUE val) { if (FIXNUM_2_P(recv, obj) && BASIC_OP_UNREDEFINED_P(BOP_PLUS,INTEGER_REDEFINED_OP_FLAG)) { /* fixnum + fixnum */ #ifndef LONG_LONG_VALUE VALUE msb = (VALUE)1 << ((sizeof(VALUE) * CHAR_BIT) - 1); val = recv - 1 + obj; if ((~(recv ^ obj) & (recv ^ val)) & msb) { val = rb_int2big((SIGNED_VALUE)((val>>1) | (recv & msb))); } #else val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj)); #endif } 74

Slide 75

Slide 75 text

opt_plus | Inst | Resource Delay In Cycles | | | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | | ------------------------------------------------------------------------- | 0 | | | | | | | | | | | mov rax, qword ptr [r15+0x8] | 1 | | | | | | | | | | | mov r12, qword ptr [r14+0x10] | 2 | | | | | | | | | | | lea rbx, ptr [r14+0x18] | 3 | | | 1 | | | | | | | | mov r13, qword ptr [r14+0x8] | 4 | | | | | | | | | 1 | | mov rbp, r15 | 5 | | | | | | | | | | | mov r14, rbx | 6 | | | | | | | | | | | mov rdi, qword ptr [rax-0x10] | 7 | | | | | | | | | | | mov rsi, qword ptr [rax-0x8] | 8 | | | | | | | | | 2 | | lea rdx, ptr [rax-0x10] | 9 | | | | | | | | | 3 | | mov qword ptr [r15], rbx | 10 | | | | | | | | 1 | 2 | | mov qword ptr [r15+0x8], rdx | 11 | | | | | | | | | | | mov rdx, rdi | 12 | | | | | | | | | 3 | | mov qword ptr [rsp+0x10], rsi | 13 | | | | | | | | | | | and rdx, rsi | 14 | | | | | | | | | | | and edx, 0x1 | 15 | | | | | | | | | | | jz 0x12f8 | 16 | | | | 1 | | | | | 3 | CP | mov rdx, qword ptr [rip+0x2fe3a2] | 17 | | | | | | | | | | CP | test byte ptr [rdx+0x668], 0x1 | 18 | | | | | | | | | | CP | jnz 0x12e4 | 19 | | | | | | | | | 5 | | lea rax, ptr [rdi+rsi*1-0x1] | 20 | | | | | | | | | | | mov rdx, rdi | 21 | | | | | | | | | | | xor rsi, rdi | 22 | | | | | | | | | | | not rsi | 23 | | | | | | | | | | | xor rdx, rax | 24 | | | | | | | | | | | test rsi, rdx | 25 | | | | | | | | | | | js 0x438d Resource Conflict on Critical Paths: ----------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------- | Cycles | 0 0 | 0 | 0 0 | 1 0 | 0 | 0 | 0 | 0 | ----------------------------------------------------------------- List Of Delays On Critical Paths ------------------------------- 12 --> 16 1 Cycles Delay On PORT3_AGU Intel(R) Architecture Code Analyzer Version - 2.1 Analyzed File - ./miniruby Binary Format - 64Bit Architecture - HSW Analysis Type - Latency Latency Analysis Report --------------------------- Latency: 16 Cycles N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis The Resource delay is counted since all the sources of the instructions are ready and until the needed resource becomes available https://software.intel.com/en-us/articles/intel-architecture-code-analyzer 75

Slide 76

Slide 76 text

opt_plus | Inst | Resource Delay In Cycles | | | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | | ------------------------------------------------------------------------- | 0 | | | | | | | | | | | mov rax, qword ptr [r15+0x8] | 1 | | | | | | | | | | | mov r12, qword ptr [r14+0x10] | 2 | | | | | | | | | | | lea rbx, ptr [r14+0x18] | 3 | | | 1 | | | | | | | | mov r13, qword ptr [r14+0x8] | 4 | | | | | | | | | 1 | | mov rbp, r15 | 5 | | | | | | | | | | | mov r14, rbx | 6 | | | | | | | | | | | mov rdi, qword ptr [rax-0x10] | 7 | | | | | | | | | | | mov rsi, qword ptr [rax-0x8] | 8 | | | | | | | | | 2 | | lea rdx, ptr [rax-0x10] | 9 | | | | | | | | | 3 | | mov qword ptr [r15], rbx | 10 | | | | | | | | 1 | 2 | | mov qword ptr [r15+0x8], rdx | 11 | | | | | | | | | | | mov rdx, rdi | 12 | | | | | | | | | 3 | | mov qword ptr [rsp+0x10], rsi | 13 | | | | | | | | | | | and rdx, rsi | 14 | | | | | | | | | | | and edx, 0x1 | 15 | | | | | | | | | | | jz 0x12f8 | 16 | | | | 1 | | | | | 3 | CP | mov rdx, qword ptr [rip+0x2fe3a2] | 17 | | | | | | | | | | CP | test byte ptr [rdx+0x668], 0x1 | 18 | | | | | | | | | | CP | jnz 0x12e4 | 19 | | | | | | | | | 5 | | lea rax, ptr [rdi+rsi*1-0x1] | 20 | | | | | | | | | | | mov rdx, rdi | 21 | | | | | | | | | | | xor rsi, rdi | 22 | | | | | | | | | | | not rsi | 23 | | | | | | | | | | | xor rdx, rax | 24 | | | | | | | | | | | test rsi, rdx | 25 | | | | | | | | | | | js 0x438d Resource Conflict on Critical Paths: ----------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------- | Cycles | 0 0 | 0 | 0 0 | 1 0 | 0 | 0 | 0 | 0 | ----------------------------------------------------------------- List Of Delays On Critical Paths ------------------------------- 12 --> 16 1 Cycles Delay On PORT3_AGU It happens delay because there’s too many instructions. • Increment PC • Set PC to call frame • Get value from stack To avoid that • JIT • compilation overhead; it’s trade-off with its hotness. (Just I said there’s no hotspot…) • Register machine • [Feature #12589] • tough work https://software.intel.com/en-us/articles/intel-architecture-code-analyzer 76

Slide 77

Slide 77 text

Debugging / Profiling / Tracing • Program works but something wrong: • unexpected behavior: normal debugging • slow or too much memory • the case you can kill it with SIGTERM • Get stuck • user level: stuck but can kill with SIGKILL • kernel level: cannot kill even with SIGKILL • Crash • Ruby level crash • SEGV (Segmentation Fault) 77

Slide 78

Slide 78 text

Slow or too much memory • perf-top • can check what function is running. • ps(1) • procfs (Linux) • ObjectSpace / objspace • count_objects • GC.stat • frsyuki/sigdump: show thread backtrace and memory usage. This needs to be installed in Gemfile before an incident. 78

Slide 79

Slide 79 text

Get stuck • tool • perf top • pid2line • Reason • infinite loop (?) in kernel • regular expression match’s backtrack explosion 79

Slide 80

Slide 80 text

strace with CRuby futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 91, NULL) = -1 EAGAIN (Resource temporarily unavailable) write(6, "!", 1) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 95, NULL) = 0 futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 95, NULL) = -1 EAGAIN (Resource temporarily unavailable) write(6, "!", 1) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 99, NULL) = 0 futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 99, NULL) = -1 EAGAIN (Resource temporarily unavailable) It sometimes shows like below: 80

Slide 81

Slide 81 text

strace with CRuby futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 91, NULL) = -1 EAGAIN (Resource temporarily unavailable) write(6, "!", 1) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 95, NULL) = 0 futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 95, NULL) = -1 EAGAIN (Resource temporarily unavailable) write(6, "!", 1) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 99, NULL) = 0 futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT , 1}) = 1 futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 99, NULL) = -1 EAGAIN (Resource temporarily unavailable) It sometimes shows like below: This is by timer thread: There’s another reason. It needs to a way to get information
 from a living process. 81

Slide 82

Slide 82 text

procfs(5) • You can get many information (30) kstkeip %lu The current EIP (instruction pointer). PROC - PROCESS INFORMATION PSEUDO-FILESYSTEM It is running at 140308168160605 = 0x7f9c0a85b55d 82

Slide 83

Slide 83 text

procfs • Read /proc//maps and find which library EIP is in. • Convert the address into the library local address considering offsets. • 0x7f9c0a85b55d - 0x7f9c0a84b000 = 0x1055d • See symbol table of the library and find it!!! • (You can also use addr2line) 83

Slide 84

Slide 84 text

pid2line.rb • It does such investigation. • https://gist.github.com/nurse/0619b6af90df140508c2 • Note that recent Ubuntu requires privilege even if the target process is mine. • It is a good tool because it can get information even if you cannot attach the process into gdb. (= get stuck in kernel) • It doesn’t require install instruction. easy to use. • Can run even on RHEL5/CentOS5 84

Slide 85

Slide 85 text

An example # /opt/td-agent/embedded/bin/ruby pid2line.rb 11010 LWPID: 11010 EIP: 7faa4e58a462 -> 142462 Elf vaddr: 142462 -> 142462 match_at /var/cache/omnibus/src/ruby/ruby-2.1.10/regexec.c:2627 LWPID: 11011 EIP: 7faa4d81969d -> ec69d Elf vaddr: ec69d -> ec69d __poll_nocancel :? LWPID: 11014 EIP: 7faa4e237a82 -> ba82 Elf vaddr: ba82 -> ba82 __pthread_cond_timedwait :? LWPID: 11015 EIP: 7faa4e237a82 -> ba82 Elf vaddr: ba82 -> ba82 __pthread_cond_timedwait :? LWPID: 11021 EIP: 7faa4d8242c3 -> f72c3 Elf vaddr: f72c3 -> f72c3 __epoll_wait_nocancel :? LWPID: 11022 EIP: 7faa4d8242c3 -> f72c3 Elf vaddr: f72c3 -> f72c3 __epoll_wait_nocancel :? LWPID: 11023 EIP: 7faa4d81975f -> ec75f Elf vaddr: ec75f -> ec75f __GI_ppoll :? the core of regular expression (Oniguruma/Onigmo) It seems regexp’s backtrack explosion. 85

Slide 86

Slide 86 text

SEGV Segmentation fault • Useful information to identify the issue. • Ruby level or C level backtrace • loaded libraries • core file 86

Slide 87

Slide 87 text

Ruby level backtrace • You can read it. • But in SEGV case it often far from the cause. 87

Slide 88

Slide 88 text

C level backtrace • It shows both public/static function names. • It’s actually difficult to get static function’s name. Ruby fetches it from DWARF by itself. • (it needs to compile `gcc -g`) • Tracing back beyond signal trampoline on OS X was also difficult. • Just before _sigtramp maybe buggy. 88

Slide 89

Slide 89 text

core file • the file named ruby.core (or managed by coredumpctl) • Generated on SEGV or when manually dumped • The memory dump of Ruby at that time • You can easily debug C layer with gdb • You can debug Ruby layer with gdb… 89

Slide 90

Slide 90 text

Inside Ruby Process (Linux) RubyVM Thread Thread OS pthread pthread lwp lwp 1:1 thread 90

Slide 91

Slide 91 text

as C structures ruby_current_vm pthread lwp living_threads Ruby
 Thread Ruby
 Thread Ruby
 Thread Ruby call frame glibc Linux Kernel call frame call frame … Ruby VM and threads are represented as shown. If you can track this by gdb, you can get Ruby level information… call frame: a structure which saves that Ruby executes in the function, local variables, previous call frame, and so on 91

Slide 92

Slide 92 text

.gdbinit a GDB initialization script, which have many utility functions to debug CRuby 92

Slide 93

Slide 93 text

93

Slide 94

Slide 94 text

.gdbinit for CRuby • It is hard because it needs to re- implement Ruby’s C functions with gdb’s tiny scripts… • Slow; rb_ps is enough fast but rb_count_objects, which counts Ruby objects by categories, takes some seconds even if nearly empty process. (it should be written in Python…) • It needs to follow Ruby’s changes. • It can read core file. It means you can get Ruby level backtrace of each Ruby threads even if the process is crashed. (note that you can get core file manually from running process with stopping it short time) 94

Slide 95

Slide 95 text

·ͱΊ • Ϣʔεέʔεʢݱ࣮ͷྫʣ Λऩू͠Α͏ • ݱ࣮ͷΞϓϦέʔγϣϯ ͷϘτϧωοΫΛ஌Ζ͏ Conclusion • Gather use cases (real snippets) • Profile/Benchmark real applications 95