Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Dive into CRuby

NARUSE, Yui
September 10, 2016

Dive into CRuby

RubyKaigi 2016

NARUSE, Yui

September 10, 2016
Tweet

More Decks by NARUSE, Yui

Other Decks in Programming

Transcript

  1. Dive into CRuby
    naruse

    View full-size slide

  2. Who is naruse?
    • twitter: @nalsh
    • GitHub: @nurse
    • Ruby committer (M17N, release engineering, etc.)
    • nkf maintainer
    • working at Treasure Data, Inc.
    2

    View full-size slide

  3. झࢫ
    • ϢʔεέʔεͷॏཁੑΛ
    ஌Β͠ΊɺօʹऩूΛ
    ख఻ͬͯ΋Β͏
    • RubyΑΓ௿͍ϨΠϠͷ
    ཧղΛਂΊΔ
    Goal
    • Understand how use
    case is important, and
    make people gather real
    examples
    • Understand layers lower
    than Ruby
    3

    View full-size slide

  4. Through RubyKaigi 2016,
    You may want to
    • write a new application
    • improve JRuby or mruby or something
    • improve CRuby
    4

    View full-size slide

  5. But...
    What should I do...
    5

    View full-size slide

  6. Motivations
    • New features
    • New methods or extend exist one
    • New platforms
    • Optimizations
    • speed
    • memory
    • Instruments to help trouble shooting
    • to ease debugging
    6

    View full-size slide

  7. New features
    • New features
    • New methods or extend exist one
    • String#scrub
    • Regexp#match?
    • etc.
    • New platforms
    • clang
    • Microsoft Visual C++ 2015
    7

    View full-size slide

  8. Why new feature?
    • Existing methods doesn’t fit my use case
    • This “use case” must be a real example.
    • It shall not be an artificial one.
    • Current method is too complicated, and it must be
    more simple for my use case
    8

    View full-size slide

  9. –Jonathan Swift, Travels into Several Remote Nations of the World (1726)
    “Necessity is the mother of invention.”
    but...
    9

    View full-size slide

  10. What is the truly necessary one?
    http://dic.nicovideo.jp/a/ސ٬͕ຊ౰ʹඞཁͩͬͨ΋ͷ
    10

    View full-size slide

  11. [Feature #6752] Replacing ill-
    formed subsequence
    • Repair invalid bytes in the given string.
    • Above snippet didn’t work as expected if `str` is already
    UTF-8. (it simply returned the original string)
    • Anyway what kind of strings are used?
    str = "\xe3\x81\x82\x80\x80"
    str.encode("utf-8", invalid: :replace)
    11

    View full-size slide

  12. Case: Web Forms
    • If strings come from Web Forms are broken, the web
    application should simply raise error: 400 Bad
    Request.
    • Clients should retry with correct request.
    12

    View full-size slide

  13. Case: Files
    • If strings come from files are broken, it’s usually
    because it doesn't specify correct encoding.
    • Retry reading with correct encoding.
    13

    View full-size slide

  14. Case: Web Crawler
    • They can’t retry.
    • They need to handle broken strings easily and
    secure.
    • There’s real pages, for example twitter...
    • mechanism:
    • wrongly get substring as bytes (though it must do as
    chars)
    • insert such invalid bytes into web page template
    14

    View full-size slide

  15. How do I design the interface
    • transcode base
    • Use Ruby’s encoding conversion engine.
    • Some people are used to use this way with iconv.
    • regexp base
    • Use regular expression engine (Ruby usually uses
    regexp engine on handling characters)
    • Better performance and meaning.
    • naming issue
    str.encode("utf-8", invalid: :replace)
    15

    View full-size slide

  16. Added both APIs
    • String#encode
    • tables under enc/trans
    • CSI (Code Set Independent; they doesn’t convert
    into Unicode if it doesn’t need)
    • but in 21st century, people rarely convert
    encodings…
    • String#scrub
    • regexp base
    16

    View full-size slide

  17. CRuby files
    • Ruby class name/module name
    • CRuby component name (+suffix)
    • utility functions
    • etc.
    17

    View full-size slide

  18. Ruby class/module
    18

    View full-size slide

  19. regular expressions
    Onigmo (Oniguruma fork)
    21

    View full-size slide

  20. Ruby VM (YARV)
    22

    View full-size slide

  21. build
    a working field of nobu
    24

    View full-size slide

  22. platforms
    NOTE: cygwin has no maintainer
    25

    View full-size slide

  23. tests & benchmarks
    26

    View full-size slide

  24. Impl: String#scrub
    • This is a method of `String` class.

    → string.c
    27

    View full-size slide

  25. the end of string.c
    void
    Init_String(void)
    {
    rb_cString = rb_define_class("String", rb_cObject);
    rb_include_module(rb_cString, rb_mComparable);

    rb_define_method(rb_cString, "empty?", rb_str_empty, 0);
    rb_define_method(rb_cString, "=~", rb_str_match, 1);
    rb_define_method(rb_cString, "getbyte", rb_str_getbyte, 1);
    rb_define_method(rb_cString, "setbyte", rb_str_setbyte, 2);

    rb_define_method(rb_cString, "scrub", str_scrub, -1);
    rb_define_method(rb_cString, "scrub!", str_scrub_bang, -1);

    }
    define method in String class named “scrub”
    whose implementation is `str_scrub` function,
    and the number of its arguments is -1 (variable arguments).
    28

    View full-size slide

  26. /*
    * call-seq:
    * str.scrub -> new_str
    * str.scrub(repl) -> new_str
    * str.scrub{|bytes|} -> new_str
    *
    * If the string is invalid byte sequence then replace invalid bytes
    * with given replacement character, else returns self.
    * If block is given, replace invalid bytes with returned value of the block.
    *
    * "abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
    * "abc\u3042\x81".scrub("*") #=> "abc\u3042*"
    * "abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc
    \u3042"
    */
    static VALUE
    str_scrub(int argc, VALUE *argv, VALUE str)
    {
    VALUE repl = argc ? (rb_check_arity(argc, 0, 1), argv[0]) : Qnil;
    VALUE new = rb_str_scrub(str, repl);
    return NIL_P(new) ? rb_str_dup(str): new;
    }
    29

    View full-size slide

  27. static VALUE
    str_scrub(int argc, VALUE *argv, VALUE str)
    {
    VALUE repl = argc ? (rb_check_arity(argc, 0, 1), argv[0]) : Qnil;
    VALUE s = rb_str_scrub(str, repl);
    return NIL_P(s) ? rb_str_dup(str): s;
    }
    • VALUE: a pointer type for Ruby object.
    • remind that Ruby’s data is always an object.

    (there’s no primitive type)
    • argv: C array of VALUEs
    • argc: the number of VALUEs in argv
    • arity: the number of arguments
    • rb_check_arity(argc,min,max): raise exception if arity doesn’t
    match.
    • Qnil: `nil` in CRuby source code
    • NIL_P(x): `x.nil?`; Like Lisp CRuby source uses `_p` instead of `?`
    30

    View full-size slide

  28. /**
    * @param str the string to be scrubbed
    * @param repl the replacement character
    * @return If given string is invalid, returns a new string. Otherwise, returns
    Qnil.
    */
    VALUE
    rb_str_scrub(VALUE str, VALUE repl)
    {
    return rb_enc_str_scrub(STR_ENC_GET(str), str, repl);
    }
    • STR_ENC_GET(): returns `rb_encoding *` which is C
    structure to represent Ruby’s Encoding.
    • There’s some similar relations between C structures
    and Ruby class like Hash and st_table, Thread class
    and rb_thread_t and so on.
    31

    View full-size slide

  29. VALUE
    rb_enc_str_scrub(rb_encoding *enc, VALUE str, VALUE repl)
    {
    VALUE buf = rb_str_buf_new(RSTRING_LEN(str));
    const char *rep = /* replacement char */;
    long replen = strlen(rep);
    const char *p = RSTRING_PTR(str);
    const char *e = RSTRING_END(str);
    const char *p1 = p;
    while (p < e) {
    int ret = rb_enc_precise_mbclen(p, e, enc);
    if (MBCLEN_NEEDMORE_P(ret)) {
    break;
    }
    else if (MBCLEN_CHARFOUND_P(ret)) {
    p += MBCLEN_CHARFOUND_LEN(ret);
    }
    else if (MBCLEN_INVALID_P(ret)) {
    ...
    32

    View full-size slide

  30. [Feature #7361]
    Pathname#touch
    • touch(1)
    • change file access and modification times
    • NOTE: This ticket is rejected.
    new_file = Pathname.new('location/for/new/file')
    FileUtils.touch(new_file)
    33

    View full-size slide

  31. NAME
    touch - change file access and modification times
    SYNOPSIS
    touch [-acm] [-r ref_file|-t time|-d date_time] file…
    OPTIONS
    -a: Change the access time of file.
    -c: Do not create a specified file if it does not exist.
    -m: Change the modification time of file.
    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/touch.html
    34

    View full-size slide

  32. But…
    What is this method?
    • “Behave like touch(1)”
    • This can’t be a correct answer.
    • This has two different features:
    • “change file access and modification times”
    • “create empty file”
    • It describes the spec, but doesn’t describe what users
    truly want to do.
    35

    View full-size slide

  33. Conclusion of suggesting a
    new feature
    • Collect use cases:
    • take a memo if you find a case which should be written
    shorter. (only this can be a good feature request)
    • Time class and Integer bit manipulation seems to be
    improved…
    • Design API for the group of use cases
    • and find a good name ;-)
    • Read ʮAPIσβΠϯέʔεελσΟ ――Rubyͷ࣮ྫ͔
    ΒֶͿɻ໰୊ʹଈͨ͠σβΠϯͱීวͷߟ͑ํʯ
    • http://gihyo.jp/book/2016/978-4-7741-7802-8
    36

    View full-size slide

  34. New platforms
    • On a brand new environment, Ruby sometimes doesn’t
    work. For example clang (a C language family frontend
    for LLVM), and Microsoft Visual C++ 2015 (VC14).
    • clang (including llvm-gcc):
    • It usually behaves like GCC, but different in some
    points.
    • Visual C++ 2015
    • It introduced the Universal C Runtime, which breaks
    many compatibility.
    37

    View full-size slide

  35. clang (including llvm-gcc)
    • It optimizes code more aggressive than gcc at some
    point. It breaks Ruby’s GC (because Ruby uses
    conservative mark & sweep GC).
    • The most tough one was r34278, in which llvm
    optimized out the memory allocation for continuation.
    • You may think no one uses continuation, but actually
    uses through enumerator.
    38

    View full-size slide

  36. Visual C++ 2015
    • It introduced the Universal C Runtime, which breaks many
    compatibility.
    • The most tough part is around _pioinfo:
    • _pioinfo is an internal array of IO structure. It stores
    underlying OS file HANDLE, file position (cursor), file
    attributes, text mode, and so on. Ruby handles the structure
    • to associate socket and fd: CRuby creates fd with dummy
    file handle and set socket
    • to implement overlapped I/O for Windows 2000/XP
    • to emulate fcntl(2)
    • But VC2015 make it private.
    • NOTE: overlapped I/O is a Microsoft’s term similar to
    nonblocking I/O.
    39

    View full-size slide

  37. static void
    set_pioinfo_extra(void)
    {
    # define FUNCTION_RET 0xc3 /* ret */
    /* get __pioinfo addr with _isatty */
    char *p = (char*)get_proc_address("ucrtbase.dll", "_isatty", NULL);
    char *pend = p;
    /* _osfile(fh) & FDEV */
    int32_t rel;
    char *rip;
    /* add rsp, _ */
    # define FUNCTION_BEFORE_RET_MARK "\x48\x83\xc4"
    # define FUNCTION_SKIP_BYTES 1
    /* lea rdx,[__pioinfo's addr in RIP-relative 32bit addr] */
    # define PIOINFO_MARK "\x48\x8d\x15"
    if (p) {
    for (pend += 10; pend < p + 300; pend++) {
    // find end of function
    if (memcmp(pend, FUNCTION_BEFORE_RET_MARK, sizeof(FUNCTION_BEFORE_RET_MARK) - 1) == 0 &&
    *(pend + (sizeof(FUNCTION_BEFORE_RET_MARK) - 1) + FUNCTION_SKIP_BYTES) & FUNCTION_RET == FUNCTION_RET) {
    // search backwards from end of function
    for (pend -= (sizeof(PIOINFO_MARK) - 1); pend > p; pend--) {
    if (memcmp(pend, PIOINFO_MARK, sizeof(PIOINFO_MARK) - 1) == 0) {
    p = pend;
    goto found;
    }
    }
    break;
    }
    }
    }
    fprintf(stderr, "unexpected " UCRTBASE "\n");
    _exit(1);
    found:
    p += sizeof(PIOINFO_MARK) - 1;
    rel = *(int32_t*)(p);
    rip = p + sizeof(int32_t);
    Even if _pioinfo is private, it exists on the
    memory, and we can get it. The problem is
    how to reach it. I use the fact _isatty()
    refers _pioinfo. I can read the function
    body of _isatty through the function
    pointer of it, find the instruction whose
    argument is _pioinfo, and I get the address
    of _pioinfo.
    40

    View full-size slide

  38. after story
    • Receive a patch to fix the code to support Visual C++
    2015’s DEBUG build.
    • I felt a deep emotion by the fact there’s a person who
    use such dirty code, and moreover read the code and
    contribute a fix. What a great thing OSS is!
    41

    View full-size slide

  39. Performance improvements
    • Speeding up
    • Optimize Ruby Applications
    • Optimize CRuby
    • some parts (hash, string, etc.)
    • Ruby VM
    • Profiling for speeding up
    • stackprof
    • perf(1)
    42

    View full-size slide

  40. Theory of improving performance
    1.Take a benchmark
    • Take a profile of the target
    application (better)
    • for example stackprof,
    derailed_benchmarks.
    2.Find a bottle neck (hotspot).
    3.Optimize the bottle neck.
    4.win!
    43

    View full-size slide

  41. % ls benchmark
    bm_app_answer.rb bm_vm1_gc_short_lived.rb
    bm_app_aobench.rb bm_vm1_gc_short_with_complex_long.rb
    bm_app_erb.rb bm_vm1_gc_short_with_long.rb
    bm_app_factorial.rb bm_vm1_gc_short_with_symbol.rb
    bm_app_fib.rb bm_vm1_gc_wb_ary.rb
    bm_app_lc_fizzbuzz.rb bm_vm1_gc_wb_ary_promoted.rb
    bm_app_mandelbrot.rb bm_vm1_gc_wb_obj.rb
    bm_app_pentomino.rb bm_vm1_gc_wb_obj_promoted.rb
    bm_app_raise.rb bm_vm1_ivar.rb
    bm_app_strconcat.rb bm_vm1_ivar_set.rb
    bm_app_tak.rb bm_vm1_length.rb
    bm_app_tarai.rb bm_vm1_lvar_init.rb
    bm_app_uri.rb bm_vm1_lvar_set.rb
    bm_array_shift.rb bm_vm1_neq.rb
    bm_hash_aref_dsym.rb bm_vm1_not.rb
    bm_hash_aref_dsym_long.rb bm_vm1_rescue.rb
    bm_hash_aref_fix.rb bm_vm1_simplereturn.rb
    bm_hash_aref_flo.rb bm_vm1_swap.rb
    bm_hash_aref_miss.rb bm_vm1_yield.rb
    bm_hash_aref_str.rb bm_vm2_array.rb
    bm_hash_aref_sym.rb bm_vm2_bigarray.rb
    bm_hash_aref_sym_long.rb bm_vm2_bighash.rb
    bm_hash_flatten.rb bm_vm2_case.rb
    bm_hash_ident_flo.rb bm_vm2_case_lit.rb
    bm_hash_ident_num.rb bm_vm2_defined_method.rb
    bm_hash_ident_obj.rb bm_vm2_dstr.rb
    bm_hash_ident_str.rb bm_vm2_eval.rb
    bm_hash_ident_sym.rb bm_vm2_method.rb
    bm_hash_keys.rb bm_vm2_method_missing.rb
    bm_hash_shift.rb bm_vm2_method_with_block.rb
    bm_hash_shift_u16.rb bm_vm2_mutex.rb
    … …
    45

    View full-size slide

  42. Take a profile with New Relic
    46

    View full-size slide

  43. bundler
    rack
    railties
    actionpack
    activesupport
    actionpack
    activesupport
    activerecord
    49

    View full-size slide

  44. String#blank?
    includes IO wait
    50

    View full-size slide

  45. No bottle neck…
    I wish you try with your applications,
    and confirm there’s no bottle neck.
    Anyway I talk about String#blank?
    51

    View full-size slide

  46. class String
    BLANK_RE = /\A[[:space:]]*\z/
    # A string is blank if it's empty or contains whitespaces only:
    #
    # ''.blank? # => true
    # ' '.blank? # => true
    # "\t\n\r".blank? # => true
    # ' blah '.blank? # => false
    #
    # Unicode whitespace is supported:
    #
    # "\u00a0".blank? # => true
    #
    # @return [true, false]
    def blank?
    # The regexp that matches blank strings is expensive. For the case of empty
    # strings we can speed up this method (~3.5x) with an empty? call. The
    # penalty for the rest of strings is marginal.
    empty? || BLANK_RE.match?(self)
    end
    end
    String#blank?
    Check Unicode aware string emptiness
    52

    View full-size slide

  47. Regexp#match?
    • It was used Regexp#===.
    • Usually Regular Expression match stores $&
    (MatchData.last_match); it is slow because of creating
    MatchData object.
    • → Regexp#match? don’t update $&

    (Ruby 2.4 feature)
    • Note: Perl has similar optimization
    53

    View full-size slide

  48. SSE 4.2 STTNI
    • String and Text New
    Instructions
    • Handle 16 bytes at once.
    • But all the 16 bytes
    must be available. If
    not, it will cause SEGV.
    • ARM SVE (Scalable
    Vector Extension), an
    instruction set
    extension for HPC,
    which will be used by
    Post-K computer.
    VALUE
    rb_str_blank(VALUE str)
    {
    const unsigned char *p = (const unsigned char *)RSTRING_PTR(str);
    const unsigned char *e = (const unsigned char *)RSTRING_END(str);
    intptr_t pe = (intptr_t)e;
    const __m128i mask = _mm_set_epi8(0,0,0,0,0,0,0,0,0,0,9,10,11,12,13,32);
    const int masksize = 6;
    const int mode = _SIDD_CMP_EQUAL_ANY|_SIDD_UBYTE_OPS|_SIDD_MASKED_NEGATIVE_POLARITY;
    if (RSTRING_LEN(str) == 0) return Qtrue;
    /* set the edge of a page before the end of string */
    if (pe & 0xfff > 0xff1) {
    pe &= ~0xfff;
    pe |= 0xff1;
    }
    for (; (intptr_t)p < pe; p += sizeof(__m128i)) {
    int idx, len;
    ptrdiff_t sz;
    __m128i m;
    retry:
    sz = e - p;
    len = (int)((sz&INT_MAX) | (sz >> 27));
    m = _mm_loadu_si128((__m128i const *)p);
    /* CF: 1 if there's non spaces
    * ZF: 1 if reached the end
    */
    if (_mm_cmpestra(mask, masksize, m, len, mode)) { /* CF=0 ZF=0 */
    continue;
    }
    #if 0 /* GCC 6 wrongly generates cmpestri and cmpestri... */
    if (_mm_cmpestrc(mask, masksize, m, len, mode)) { /* CF=0 ZF=1 */
    return Qtrue;
    }
    idx = _mm_cmpestri(mask, masksize, m, len, mode);
    #else
    idx = _mm_cmpestri(mask, masksize, m, len, mode);
    if (sz < idx) return Qtrue;
    #endif
    p += idx;
    if (!str_blank0(&p, e, FALSE)) return Qfalse;
    goto retry;
    }
    if (!str_blank0(&p, e, TRUE)) return Qfalse;
    return Qtrue;
    } 54

    View full-size slide

  49. Improve RubyVM
    • Anyway if there’s no bottle neck in Ruby layer, it needs
    to dive into C layer
    • Get a profile with C level profiler
    • OProfile
    • perf(1)
    55

    View full-size slide

  50. # Samples: 32K of event 'cycles:u'
    # Event count (approx.): 19800054829
    #
    # Overhead Command Shared Object Symbol
    # ........ .............. .................. ....................................................
    #
    66.81% miniruby miniruby [.] vm_exec_core
    7.03% miniruby miniruby [.] vm_search_method.isra.97
    3.62% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_0params_0locals
    2.95% miniruby miniruby [.] vm_call_cfunc
    2.18% miniruby miniruby [.] rb_ary_entry
    0.95% miniruby miniruby [.] invoke_iseq_block_from_c
    0.92% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_1params_1locals
    0.91% miniruby miniruby [.] rb_ary_push
    0.85% miniruby miniruby [.] vm_call_opt_send
    0.82% miniruby miniruby [.] ary_ensure_room_for_push
    0.73% miniruby miniruby [.] rb_ary_modify
    0.71% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_0params_2locals
    0.66% miniruby miniruby [.] vm_call_iseq_setup
    0.58% miniruby miniruby [.] rb_ary_rotate_bang
    0.56% miniruby miniruby [.] vm_call_iseq_setup_normal_0start_0params_1locals
    0.52% miniruby miniruby [.] opt_eq_func
    0.41% miniruby miniruby [.] method_entry_get
    0.38% miniruby miniruby [.] vm_callee_setup_block_arg
    0.37% miniruby miniruby [.] rb_ary_transpose
    0.34% miniruby miniruby [.] invoke_block_from_c_splattable.constprop.136
    0.34% miniruby miniruby [.] vm_exec
    0.33% miniruby miniruby [.] rb_yield
    perf report of optcarrot
    56

    View full-size slide

  51. perf report of fluentd
    You can confirm you application
    57

    View full-size slide

  52. vm_exec_core
    • The core function of RubyVM (YARV)
    • bytecode interpreter’s method dispatch routine
    • very large (30KB~40KB) switch-case
    • Strict to say, “switch-case” is optimized to jump
    instructions (Direct Threaded Code)
    58

    View full-size slide

  53. static VALUE vm_exec_core(rb_thread_t *th, VALUE initial)
    {
    register rb_control_frame_t *reg_cfp;
    const VALUE *reg_pc;
    #include "vmtc.inc"
    if (UNLIKELY(th == 0)) {
    return (VALUE)insns_address_table;
    }
    reg_cfp = th->cfp;
    reg_pc = reg_cfp->pc;
    first:
    INSN_DISPATCH();
    /*****************/
    #include "vm.inc"
    /*****************/
    END_INSNS_DISPATCH();
    /* unreachable */
    rb_bug("vm_eval: unreachable");
    goto first;
    }
    vm_exec_core (some macros are removed)
    keeps method/block
    local information
    program counter,
    What bytecode it executes now
    60

    View full-size slide

  54. How bytecode interpreter
    behave v1
    PUSH = 1; ADD = 2; MINUS = 3
    pc = 0 # program counter
    stack = []
    iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence
    while (true) {
    switch (iseq[pc++]) {
    case PUSH:
    stack.push(iseq[pc++]);
    break;
    case ADD:
    stack.push(stack.pop + stack.pop);
    break;
    case MINUS:
    stack.push(stack.pop - stack.pop);
    break;
    /* ... */
    }
    }
    61

    View full-size slide

  55. How bytecode interpreter
    behave v1
    PUSH = 1; ADD = 2; MINUS = 3
    pc = 0 # program counter
    stack = []
    iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence
    while (true) {
    switch (iseq[pc++]) {
    case PUSH:
    stack.push(iseq[pc++]);
    break;
    case ADD:
    stack.push(stack.pop + stack.pop);
    break;
    case MINUS:
    stack.push(stack.pop - stack.pop);
    break;
    /* ... */
    }
    }
    instruction body
    62

    View full-size slide

  56. How bytecode interpreter
    behave v1
    PUSH = 1; ADD = 2; MINUS = 3
    pc = 0 # program counter
    stack = []
    iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence
    while (true) {
    switch (iseq[pc++]) {
    case PUSH:
    stack.push(iseq[pc++]);
    break;
    case ADD:
    stack.push(stack.pop + stack.pop);
    break;
    case MINUS:
    stack.push(stack.pop - stack.pop);
    break;
    /* ... */
    }
    } 1. loop overhead
    2. fetch an instruction
    3. jump to the instruction body
    63

    View full-size slide

  57. How bytecode interpreter
    behave v1
    PUSH = 1; ADD = 2; MINUS = 3
    pc = 0 # program counter
    stack = []
    iseq = [PUSH, 1, PUSH, 2, ADD] # instruction sequence
    while (true) {
    switch (iseq[pc++]) { # indirect branch
    case PUSH:
    stack.push(iseq[pc++]);
    break;
    case ADD:
    stack.push(stack.pop + stack.pop);
    break;
    case MINUS:
    stack.push(stack.pop - stack.pop);
    break;
    /* ... */
    }
    }
    branch misprediction
    64

    View full-size slide

  58. How bytecode interpreter
    behave v2: threading code
    pc = 0 # program counter
    stack = []
    iseq = [&&label_PUSH, 1, &&label_PUSH, 2, &&label_ADD]
    goto iseq[pc++]
    label_PUSH:
    stack.push(iseq[pc++]);
    goto iseq[pc++] # indirect branch
    label_ADD:
    stack.push(stack.pop + stack.pop);
    goto iseq[pc++] # indirect branch
    label_MINUS:
    stack.push(stack.pop - stack.pop);
    goto iseq[pc++] # indirect branch
    /* ... */
    65

    View full-size slide

  59. How bytecode interpreter
    behave v2: threading code
    pc = 0 # program counter
    stack = []
    iseq = [&&label_PUSH, 1, &&label_PUSH, 2, &&label_ADD]
    goto iseq[pc++]
    label_PUSH:
    stack.push(iseq[pc++]);
    goto iseq[pc++] # indirect branch
    label_ADD:
    stack.push(stack.pop + stack.pop);
    goto iseq[pc++] # indirect branch
    label_MINUS:
    stack.push(stack.pop - stack.pop);
    goto iseq[pc++] # indirect branch
    /* ... */• No loop overhead
    • branch prediction friendly
    66

    View full-size slide

  60. But some people learned that branch prediction for
    indirect branches are hard work…
    Let’s get a profile
    67

    View full-size slide

  61. perf stat result of optcarrot on SandyBridge
    $ perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/
    optcarrot --benchmark optcarrot/examples/Lan_Master.nes
    ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
    fps: 13.027601466360375
    checksum: 59662
    Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/
    shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':
    14796.375448 task-clock (msec) # 0.998 CPUs utilized
    185 context-switches # 0.013 K/sec
    0 cpu-migrations # 0.000 K/sec
    14,404 page-faults # 0.973 K/sec
    33,417,895,565 cycles # 2.259 GHz [83.29%]
    11,061,511,903 stalled-cycles-frontend # 33.10% frontend cycles idle [83.34%]
    7,225,689,220 stalled-cycles-backend # 21.62% backend cycles idle [66.70%]
    51,166,479,085 instructions # 1.53 insns per cycle
    # 0.22 stalled cycles per insn [83.38%]
    9,698,878,651 branches # 655.490 M/sec [83.35%]
    507,996,700 branch-misses # 5.24% of all branches [83.34%]
    14.820478694 seconds time elapsed
    68

    View full-size slide

  62. perf stat result of optcarrot on Haswell
    % perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/
    optcarrot --benchmark optcarrot/examples/Lan_Master.nes
    ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
    fps: 27.269785989915373
    checksum: 59662
    Performance counter stats for './miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/
    shim optcarrot/bin/optcarrot --benchmark optcarrot/examples/Lan_Master.nes':
    7214.329827 task-clock:u (msec) # 0.999 CPUs utilized
    0 context-switches:u # 0.000 K/sec
    0 cpu-migrations:u # 0.000 K/sec
    14849 page-faults:u # 0.002 M/sec
    19775865899 cycles:u # 2.741 GHz
    50975216372 instructions:u # 2.58 insn per cycle
    9662536352 branches:u # 1339.353 M/sec
    67884021 branch-misses:u # 0.70% of all branches
    7.218712639 seconds time elapsed
    69

    View full-size slide

  63. Free lunch by Moore’s law is still alive
    $ perf stat ./miniruby-gcc611 -v -Ioptcarrot/lib -r./optcarrot/tools/shim optcarrot/bin/
    optcarrot --benchmark optcarrot/examples/Lan_Master.nes
    ruby 2.4.0dev (2016-08-26 trunk 56013) [x86_64-linux]
    Sandy Bridge:
    fps: 13.027601466360375
    14796.375448 task-clock (msec) # 0.998 CPUs utilized
    185 context-switches # 0.013 K/sec
    14,404 page-faults # 0.973 K/sec
    33,417,895,565 cycles # 2.259 GHz [83.29%]
    51,166,479,085 instructions # 1.53 insns per cycle
    9,698,878,651 branches # 655.490 M/sec [83.35%]
    507,996,700 branch-misses # 5.24% of all branches [83.34%]
    14.820478694 seconds time elapsed
    Haswell:
    fps: 27.269785989915373
    7214.329827 task-clock:u (msec) # 0.999 CPUs utilized
    0 context-switches:u # 0.000 K/sec
    14849 page-faults:u # 0.002 M/sec
    19775865899 cycles:u # 2.741 GHz
    50975216372 instructions:u # 2.58 insn per cycle
    9662536352 branches:u # 1339.353 M/sec
    67884021 branch-misses:u # 0.70% of all branches
    7.218712639 seconds time elapsed
    70

    View full-size slide

  64. /* -*-c-*- ******************************************************/
    /****************************************************************/
    /****************************************************************/
    /**
    This file is VM main loop.
    ----
    This file is auto generated by insns2vm.rb
    DO NOT TOUCH!
    If you want to fix something, you must edit 'insns.def'
    */
    INSN_ENTRY(nop){
    {
    DEBUG_ENTER_INSN("nop");
    ADD_PC(1+0);
    PREFETCH(GET_PC());
    #define CURRENT_INSN_nop 1
    #define INSN_IS_SC() 0
    #define INSN_LABEL(lab) LABEL_nop_##lab
    #define LABEL_IS_SC(lab) LABEL_##lab##_##t
    COLLECT_USAGE_INSN(BIN(nop));
    {
    #line 40 "insns.def"
    /* none */
    #line 32 "vm.inc"
    #undef CURRENT_INSN_nop
    #undef INSN_IS_SC
    #undef INSN_LABEL
    #undef LABEL_IS_SC
    END_INSN(nop);}}}
    INSN_ENTRY(getlocal){
    {
    VALUE val;
    rb_num_t level = (rb_num_t)GET_OPERAND(2);
    lindex_t idx = (lindex_t)GET_OPERAND(1);
    DEBUG_ENTER_INSN("getlocal");
    ADD_PC(1+2);
    PREFETCH(GET_PC());
    #define CURRENT_INSN_getlocal 1
    #define INSN_IS_SC() 0
    #define INSN_LABEL(lab) LABEL_getlocal_##lab
    #define LABEL_IS_SC(lab) LABEL_##lab##_##t
    COLLECT_USAGE_INSN(BIN(getlocal));
    COLLECT_USAGE_OPERAND(BIN(getlocal), 0, idx);
    COLLECT_USAGE_OPERAND(BIN(getlocal), 1, level);
    {
    #line 60 "insns.def"
    int i, lev = (int)level;
    const VALUE *ep = GET_EP();
    for (i = 0; i < lev; i++) {
    ep = GET_PREV_EP(ep);
    }
    val = *(ep - idx);
    #line 65 "vm.inc"
    CHECK_VM_STACK_OVERFLOW_FOR_INSN(REG_CFP, 1);
    PUSH(val);
    #undef CURRENT_INSN_getlocal
    #undef INSN_IS_SC
    #undef INSN_LABEL
    #undef LABEL_IS_SC
    END_INSN(getlocal);}}}
    71

    View full-size slide

  65. /* -*-c-*- ******************************************************/
    /****************************************************************/
    /****************************************************************/
    /**
    This file is VM main loop.
    ----
    This file is auto generated by insns2vm.rb
    DO NOT TOUCH!
    If you want to fix something, you must edit 'insns.def'
    */
    INSN_ENTRY(nop){
    {
    DEBUG_ENTER_INSN("nop");
    ADD_PC(1+0);
    PREFETCH(GET_PC());
    #define CURRENT_INSN_nop 1
    #define INSN_IS_SC() 0
    #define INSN_LABEL(lab) LABEL_nop_##lab
    #define LABEL_IS_SC(lab) LABEL_##lab##_##t
    COLLECT_USAGE_INSN(BIN(nop));
    {
    #line 40 "insns.def"
    /* none */
    #line 32 "vm.inc"
    #undef CURRENT_INSN_nop
    #undef INSN_IS_SC
    #undef INSN_LABEL
    #undef LABEL_IS_SC
    END_INSN(nop);}}}
    /* r14: pc (program counter) */
    /* r15: cfp (call frame pointer) */
    /* NOTE: cop */
    INSN_ENTRY(nop){
    {
    ADD_PC(1+0);
    /* $rdx = $r14 + 8byte */
    209bb0: lea 0x8(%r14),%rdx
    /* $rax = $r14 */
    209bb4: mov %r14,%rax
    /* *$r15 = $rdx */
    209bb7: mov %rdx,(%r15)
    /* $r14 = $rdx */
    209bba: mov %rdx,%r14
    END_INSN(nop);}}}
    /* goto *($rax + 8byte) */
    209bbd: jmpq *0x8(%rax)
    typedef struct rb_control_frame_struct {
    const VALUE *pc; /* cfp[0] */
    VALUE *sp; /* cfp[1] */
    const rb_iseq_t *iseq; /* cfp[2] */
    VALUE self; /* cfp[3] */
    const VALUE *ep; /* cfp[4] */
    const void *block_code; /* cfp[5] */
    } rb_control_frame_t;
    72

    View full-size slide

  66. opt_plus
    INSN_ENTRY(opt_plus){
    {
    VALUE val;
    CALL_CACHE cc = (CALL_CACHE)GET_OPERAND(2);
    CALL_INFO ci = (CALL_INFO)GET_OPERAND(1);
    VALUE recv = TOPN(1);
    VALUE obj = TOPN(0);
    DEBUG_ENTER_INSN("opt_plus");
    ADD_PC(1+2);
    PREFETCH(GET_PC());
    POPN(2);
    #define CURRENT_INSN_opt_plus 1
    #define INSN_IS_SC() 0
    #define INSN_LABEL(lab) LABEL_opt_plus_##lab
    #define LABEL_IS_SC(lab) LABEL_##lab##_##t
    COLLECT_USAGE_INSN(BIN(opt_plus));
    COLLECT_USAGE_OPERAND(BIN(opt_plus), 0, ci);
    COLLECT_USAGE_OPERAND(BIN(opt_plus), 1, cc);
    {
    #line 1377 "insns.def"
    if (FIXNUM_2_P(recv, obj) &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS,INTEGER_REDEFINED_OP_FLAG)) {
    /* fixnum + fixnum */
    #ifndef LONG_LONG_VALUE
    VALUE msb = (VALUE)1 << ((sizeof(VALUE) * CHAR_BIT) - 1);
    val = recv - 1 + obj;
    if ((~(recv ^ obj) & (recv ^ val)) & msb) {
    val = rb_int2big((SIGNED_VALUE)((val>>1) | (recv & msb)));
    }
    #else
    val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj));
    #endif
    }
    else if (FLONUM_2_P(recv, obj) &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) {
    val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj));
    }
    else if (!SPECIAL_CONST_P(recv) && !SPECIAL_CONST_P(obj)) {
    if (RBASIC_CLASS(recv) == rb_cFloat && RBASIC_CLASS(obj) == rb_cFloat &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, FLOAT_REDEFINED_OP_FLAG)) {
    val = DBL2NUM(RFLOAT_VALUE(recv) + RFLOAT_VALUE(obj));
    }
    else if (RBASIC_CLASS(recv) == rb_cString && RBASIC_CLASS(obj) == rb_cString &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, STRING_REDEFINED_OP_FLAG)) {
    val = rb_str_plus(recv, obj);
    }
    else if (RBASIC_CLASS(recv) == rb_cArray &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS, ARRAY_REDEFINED_OP_FLAG)) {
    val = rb_ary_plus(recv, obj);
    }
    else {
    goto INSN_LABEL(normal_dispatch);
    }
    }
    else {
    INSN_LABEL(normal_dispatch):
    PUSH(recv);
    PUSH(obj);
    CALL_SIMPLE_METHOD(recv);
    }
    #line 2073 "vm.inc"
    CHECK_VM_STACK_OVERFLOW_FOR_INSN(REG_CFP, 1);
    PUSH(val);
    #undef CURRENT_INSN_opt_plus
    #undef INSN_IS_SC
    #undef INSN_LABEL
    #undef LABEL_IS_SC
    END_INSN(opt_plus);}}}
    73

    View full-size slide

  67. opt_plus
    • Optimized case is a home study
    • It’s a bit pattern magic
    DEFINE_INSN
    opt_plus
    (CALL_INFO ci, CALL_CACHE cc)
    (VALUE recv, VALUE obj)
    (VALUE val)
    {
    if (FIXNUM_2_P(recv, obj) &&
    BASIC_OP_UNREDEFINED_P(BOP_PLUS,INTEGER_REDEFINED_OP_FLAG)) {
    /* fixnum + fixnum */
    #ifndef LONG_LONG_VALUE
    VALUE msb = (VALUE)1 << ((sizeof(VALUE) * CHAR_BIT) - 1);
    val = recv - 1 + obj;
    if ((~(recv ^ obj) & (recv ^ val)) & msb) {
    val = rb_int2big((SIGNED_VALUE)((val>>1) | (recv & msb)));
    }
    #else
    val = LONG2NUM(FIX2LONG(recv) + FIX2LONG(obj));
    #endif
    }
    74

    View full-size slide

  68. opt_plus
    | Inst | Resource Delay In Cycles | |
    | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | |
    -------------------------------------------------------------------------
    | 0 | | | | | | | | | | | mov rax, qword ptr [r15+0x8]
    | 1 | | | | | | | | | | | mov r12, qword ptr [r14+0x10]
    | 2 | | | | | | | | | | | lea rbx, ptr [r14+0x18]
    | 3 | | | 1 | | | | | | | | mov r13, qword ptr [r14+0x8]
    | 4 | | | | | | | | | 1 | | mov rbp, r15
    | 5 | | | | | | | | | | | mov r14, rbx
    | 6 | | | | | | | | | | | mov rdi, qword ptr [rax-0x10]
    | 7 | | | | | | | | | | | mov rsi, qword ptr [rax-0x8]
    | 8 | | | | | | | | | 2 | | lea rdx, ptr [rax-0x10]
    | 9 | | | | | | | | | 3 | | mov qword ptr [r15], rbx
    | 10 | | | | | | | | 1 | 2 | | mov qword ptr [r15+0x8], rdx
    | 11 | | | | | | | | | | | mov rdx, rdi
    | 12 | | | | | | | | | 3 | | mov qword ptr [rsp+0x10], rsi
    | 13 | | | | | | | | | | | and rdx, rsi
    | 14 | | | | | | | | | | | and edx, 0x1
    | 15 | | | | | | | | | | | jz 0x12f8
    | 16 | | | | 1 | | | | | 3 | CP | mov rdx, qword ptr [rip+0x2fe3a2]
    | 17 | | | | | | | | | | CP | test byte ptr [rdx+0x668], 0x1
    | 18 | | | | | | | | | | CP | jnz 0x12e4
    | 19 | | | | | | | | | 5 | | lea rax, ptr [rdi+rsi*1-0x1]
    | 20 | | | | | | | | | | | mov rdx, rdi
    | 21 | | | | | | | | | | | xor rsi, rdi
    | 22 | | | | | | | | | | | not rsi
    | 23 | | | | | | | | | | | xor rdx, rax
    | 24 | | | | | | | | | | | test rsi, rdx
    | 25 | | | | | | | | | | | js 0x438d
    Resource Conflict on Critical Paths:
    -----------------------------------------------------------------
    | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
    -----------------------------------------------------------------
    | Cycles | 0 0 | 0 | 0 0 | 1 0 | 0 | 0 | 0 | 0 |
    -----------------------------------------------------------------
    List Of Delays On Critical Paths
    -------------------------------
    12 --> 16 1 Cycles Delay On PORT3_AGU
    Intel(R) Architecture Code Analyzer
    Version - 2.1
    Analyzed File - ./miniruby
    Binary Format - 64Bit
    Architecture - HSW
    Analysis Type - Latency
    Latency Analysis Report
    ---------------------------
    Latency: 16 Cycles
    N - port number or number of cycles
    resource conflict caused delay, DV -
    Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3), CP
    - on a critical path
    F - Macro Fusion with the previous
    instruction occurred
    * - instruction micro-ops not bound to a
    port
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - Intel(R) AVX to Intel(R) SSE code
    switch, dozens of cycles penalty is
    expected
    ! - instruction not supported, was not
    accounted in Analysis
    The Resource delay is counted since all
    the sources of the instructions are ready
    and until the needed resource becomes
    available
    https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
    75

    View full-size slide

  69. opt_plus
    | Inst | Resource Delay In Cycles | |
    | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | |
    -------------------------------------------------------------------------
    | 0 | | | | | | | | | | | mov rax, qword ptr [r15+0x8]
    | 1 | | | | | | | | | | | mov r12, qword ptr [r14+0x10]
    | 2 | | | | | | | | | | | lea rbx, ptr [r14+0x18]
    | 3 | | | 1 | | | | | | | | mov r13, qword ptr [r14+0x8]
    | 4 | | | | | | | | | 1 | | mov rbp, r15
    | 5 | | | | | | | | | | | mov r14, rbx
    | 6 | | | | | | | | | | | mov rdi, qword ptr [rax-0x10]
    | 7 | | | | | | | | | | | mov rsi, qword ptr [rax-0x8]
    | 8 | | | | | | | | | 2 | | lea rdx, ptr [rax-0x10]
    | 9 | | | | | | | | | 3 | | mov qword ptr [r15], rbx
    | 10 | | | | | | | | 1 | 2 | | mov qword ptr [r15+0x8], rdx
    | 11 | | | | | | | | | | | mov rdx, rdi
    | 12 | | | | | | | | | 3 | | mov qword ptr [rsp+0x10], rsi
    | 13 | | | | | | | | | | | and rdx, rsi
    | 14 | | | | | | | | | | | and edx, 0x1
    | 15 | | | | | | | | | | | jz 0x12f8
    | 16 | | | | 1 | | | | | 3 | CP | mov rdx, qword ptr [rip+0x2fe3a2]
    | 17 | | | | | | | | | | CP | test byte ptr [rdx+0x668], 0x1
    | 18 | | | | | | | | | | CP | jnz 0x12e4
    | 19 | | | | | | | | | 5 | | lea rax, ptr [rdi+rsi*1-0x1]
    | 20 | | | | | | | | | | | mov rdx, rdi
    | 21 | | | | | | | | | | | xor rsi, rdi
    | 22 | | | | | | | | | | | not rsi
    | 23 | | | | | | | | | | | xor rdx, rax
    | 24 | | | | | | | | | | | test rsi, rdx
    | 25 | | | | | | | | | | | js 0x438d
    Resource Conflict on Critical Paths:
    -----------------------------------------------------------------
    | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
    -----------------------------------------------------------------
    | Cycles | 0 0 | 0 | 0 0 | 1 0 | 0 | 0 | 0 | 0 |
    -----------------------------------------------------------------
    List Of Delays On Critical Paths
    -------------------------------
    12 --> 16 1 Cycles Delay On PORT3_AGU
    It happens delay because
    there’s too many
    instructions.
    • Increment PC
    • Set PC to call frame
    • Get value from stack
    To avoid that
    • JIT
    • compilation overhead;
    it’s trade-off with its
    hotness. (Just I said
    there’s no hotspot…)
    • Register machine
    • [Feature #12589]
    • tough work
    https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
    76

    View full-size slide

  70. Debugging / Profiling / Tracing
    • Program works but something wrong:
    • unexpected behavior: normal debugging
    • slow or too much memory
    • the case you can kill it with SIGTERM
    • Get stuck
    • user level: stuck but can kill with SIGKILL
    • kernel level: cannot kill even with SIGKILL
    • Crash
    • Ruby level crash
    • SEGV (Segmentation Fault)
    77

    View full-size slide

  71. Slow or too much memory
    • perf-top
    • can check what function is running.
    • ps(1)
    • procfs (Linux)
    • ObjectSpace / objspace
    • count_objects
    • GC.stat
    • frsyuki/sigdump: show thread backtrace and memory
    usage. This needs to be installed in Gemfile before an
    incident.
    78

    View full-size slide

  72. Get stuck
    • tool
    • perf top
    • pid2line
    • Reason
    • infinite loop (?) in kernel
    • regular expression match’s backtrack explosion
    79

    View full-size slide

  73. strace with CRuby
    futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 91, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    write(6, "!", 1) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 95, NULL) = 0
    futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 95, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    write(6, "!", 1) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 99, NULL) = 0
    futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 99, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    It sometimes shows like below:
    80

    View full-size slide

  74. strace with CRuby
    futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 91, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    write(6, "!", 1) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 95, NULL) = 0
    futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 95, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    write(6, "!", 1) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 0
    futex(0x555e283e4464, FUTEX_WAIT_PRIVATE, 99, NULL) = 0
    futex(0x555e283e449c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4498, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4464, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x555e283e4460, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT
    , 1}) = 1
    futex(0x555e283e4430, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e4498, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x555e283e449c, FUTEX_WAIT_PRIVATE, 99, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    It sometimes shows like below:
    This is by timer thread:
    There’s another reason.
    It needs to a way to get information

    from a living process.
    81

    View full-size slide

  75. procfs(5)
    • You can get many information
    (30) kstkeip %lu
    The current EIP (instruction pointer).
    PROC - PROCESS INFORMATION PSEUDO-FILESYSTEM
    It is running at 140308168160605 = 0x7f9c0a85b55d
    82

    View full-size slide

  76. procfs
    • Read /proc//maps and find which library EIP is in.
    • Convert the address into the library local address considering
    offsets.
    • 0x7f9c0a85b55d - 0x7f9c0a84b000 = 0x1055d
    • See symbol table of the library and find it!!!
    • (You can also use addr2line)
    83

    View full-size slide

  77. pid2line.rb
    • It does such investigation.
    • https://gist.github.com/nurse/0619b6af90df140508c2
    • Note that recent Ubuntu requires privilege even if the target
    process is mine.
    • It is a good tool because it can get information even if you cannot
    attach the process into gdb. (= get stuck in kernel)
    • It doesn’t require install instruction. easy to use.
    • Can run even on RHEL5/CentOS5
    84

    View full-size slide

  78. An example
    # /opt/td-agent/embedded/bin/ruby pid2line.rb 11010
    LWPID: 11010 EIP: 7faa4e58a462 -> 142462
    Elf vaddr: 142462 -> 142462
    match_at
    /var/cache/omnibus/src/ruby/ruby-2.1.10/regexec.c:2627
    LWPID: 11011 EIP: 7faa4d81969d -> ec69d
    Elf vaddr: ec69d -> ec69d
    __poll_nocancel
    :?
    LWPID: 11014 EIP: 7faa4e237a82 -> ba82
    Elf vaddr: ba82 -> ba82
    __pthread_cond_timedwait
    :?
    LWPID: 11015 EIP: 7faa4e237a82 -> ba82
    Elf vaddr: ba82 -> ba82
    __pthread_cond_timedwait
    :?
    LWPID: 11021 EIP: 7faa4d8242c3 -> f72c3
    Elf vaddr: f72c3 -> f72c3
    __epoll_wait_nocancel
    :?
    LWPID: 11022 EIP: 7faa4d8242c3 -> f72c3
    Elf vaddr: f72c3 -> f72c3
    __epoll_wait_nocancel
    :?
    LWPID: 11023 EIP: 7faa4d81975f -> ec75f
    Elf vaddr: ec75f -> ec75f
    __GI_ppoll
    :?
    the core of regular expression
    (Oniguruma/Onigmo)
    It seems regexp’s backtrack explosion.
    85

    View full-size slide

  79. SEGV
    Segmentation fault
    • Useful information to
    identify the issue.
    • Ruby level or C level
    backtrace
    • loaded libraries
    • core file
    86

    View full-size slide

  80. Ruby level
    backtrace
    • You can read it.
    • But in SEGV case it often
    far from the cause.
    87

    View full-size slide

  81. C level backtrace
    • It shows both public/static
    function names.
    • It’s actually difficult to get
    static function’s name.
    Ruby fetches it from
    DWARF by itself.
    • (it needs to compile `gcc
    -g`)
    • Tracing back beyond signal
    trampoline on OS X was
    also difficult.
    • Just before _sigtramp
    maybe buggy.
    88

    View full-size slide

  82. core file
    • the file named ruby.core (or managed by coredumpctl)
    • Generated on SEGV or when manually dumped
    • The memory dump of Ruby at that time
    • You can easily debug C layer with gdb
    • You can debug Ruby layer with gdb…
    89

    View full-size slide

  83. Inside Ruby Process (Linux)
    RubyVM
    Thread Thread
    OS
    pthread pthread
    lwp lwp
    1:1 thread
    90

    View full-size slide

  84. as C structures
    ruby_current_vm
    pthread
    lwp
    living_threads
    Ruby

    Thread
    Ruby

    Thread
    Ruby

    Thread
    Ruby call
    frame
    glibc
    Linux Kernel
    call frame call frame

    Ruby VM and threads are
    represented as shown.
    If you can track this by gdb,
    you can get Ruby level
    information…
    call frame: a structure
    which saves that Ruby executes in
    the function, local variables,
    previous call frame, and so on
    91

    View full-size slide

  85. .gdbinit
    a GDB initialization script, which have many utility functions to debug
    CRuby
    92

    View full-size slide

  86. .gdbinit for CRuby
    • It is hard because it needs to re-
    implement Ruby’s C functions with
    gdb’s tiny scripts…
    • Slow; rb_ps is enough fast but
    rb_count_objects, which counts
    Ruby objects by categories, takes
    some seconds even if nearly
    empty process. (it should be
    written in Python…)
    • It needs to follow Ruby’s changes.
    • It can read core file. It means you
    can get Ruby level backtrace of
    each Ruby threads even if the
    process is crashed. (note that you
    can get core file manually from
    running process with stopping it
    short time)
    94

    View full-size slide

  87. ·ͱΊ
    • Ϣʔεέʔεʢݱ࣮ͷྫʣ
    Λऩू͠Α͏
    • ݱ࣮ͷΞϓϦέʔγϣϯ
    ͷϘτϧωοΫΛ஌Ζ͏
    Conclusion
    • Gather use cases
    (real snippets)
    • Profile/Benchmark
    real applications
    95

    View full-size slide