What Lies Beneath

What really happens when your Java program runs? After the transformation from Java source through bytecode and machine code to microcode, and the various optimizations that take place along the way, the instructions that are actually executed may be very different from what you imagined when you wrote the program. This session shows you tools and techniques for tracing that path. You’ll see what a simple program actually looks like when it really hits the hardware!

Level: Intermediate

mauricen

May 28, 2019

Transcript

  1. jPrime, Sofia, May 2019
    Dmitry Vyazelenko @DVyazelenko
    Maurice Naftalin @mauricenaftalin
    What Lies Beneath

  2. Maurice Naftalin
    Java 5 Java 8
    2013 2014 2015 2017

  3. Dmitry Vyazelenko
    • Disorganiser-in-chief of the unconferences JCrete and JAlba

  4. @JAlbaUnconf − http://jalba.scot

  5. What Lies Beneath
    • What, another “Hello, World” talk?
    • Yes! We have a tiny class Computer. We’ll see what happens when we type
        java Computer.java
    • It’s amazingly complicated, and interesting!

  6. Everybody Knows…
    [Diagram: Java source ➞ javac ➞ Bytecode ➞ template interpreter ➞ Machine code]

  7. Java to Bytecode
    private int add(int value) {
        return value + 254;
    }

    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }

    public static void main(String[] args) {
        System.out.println(new Computer().compute(0xcafebabe));
    }
    value / 0xdeadbeef = 1
    Result: 0xff
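A self-contained version of this class might look like the sketch below. The slides show only the three methods, so the class wrapper and the extra print statements in main are illustrative additions. The prints show why the quotient is 1: both hex literals have their top bit set, so as Java int values they are negative, and the truncated quotient of two negative numbers is positive.

```java
// Computer.java: the three methods from the slide wrapped in a class.
// The class wrapper and the diagnostic prints are illustrative additions.
class Computer {
    private int add(int value) {
        return value + 254;
    }

    public int compute(int value) {
        // Both operands are negative ints, so the quotient is positive.
        return add(value / 0xdeadbeef);
    }

    public static void main(String[] args) {
        System.out.println(0xcafebabe);               // -889275714
        System.out.println(0xdeadbeef);               // -559038737
        System.out.println(0xcafebabe / 0xdeadbeef);  // 1
        System.out.println(Integer.toHexString(new Computer().compute(0xcafebabe))); // ff
    }
}
```

With JDK 11 or later, `java Computer.java` compiles and runs the file in one step.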

  8. Java to Bytecode
    private int add(int value) {
        return value + 254;
    }

    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }

    public static void main(String[] args) {
        System.out.println(new Computer().compute(0xcafebabe));
    }
    javac ➞ Bytecode (shown by javap -verbose):
    Constant pool:
      #1 = ...
      #2 = Integer -559038737 // 0xdeadbeef
      #3 = Methodref #5.#30 // wlb/Computer.add:(I)I
    {
      public int compute(int);
        Code:
          0: aload_0
          1: iload_1
          2: ldc #2 // int -559038737
          4: idiv
          5: invokespecial #3 // Method add:(I)I
          8: ireturn
        LocalVariableTable:
          Start Length Slot Name Signature
              0      9    0 this Lwlb/Computer;
              0      9    1 value I
    }

  9. Java to Bytecode
    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }
    (javap listing as on slide 8, with its three sections labelled: Constant pool, Code, LocalVariableTable)

  10. Executing Bytecode
    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }
    (javap listing as on slide 8)
    Operand stack (bottom to top) after 0: aload_0
        this

  11. Executing Bytecode
    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }
    (javap listing as on slide 8)
    Operand stack (bottom to top) after 1: iload_1
        this, 0xcafebabe

  12. Executing Bytecode
    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }
    (javap listing as on slide 8)
    Operand stack (bottom to top) after 2: ldc #2
        this, 0xcafebabe, 0xdeadbeef

  13. Executing Bytecode
    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }
    (javap listing as on slide 8)
    Operand stack (bottom to top) at 4: idiv
        this, 0xcafebabe, 0xdeadbeef
    idiv pops 0xdeadbeef and 0xcafebabe and pushes their quotient: 1

  14. Executing Bytecode
    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }
    (javap listing as on slide 8)
    Operand stack (bottom to top) at 5: invokespecial #3
        this, 1
    invokespecial pops the receiver and the argument and pushes the result of add: 255
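The stack states on slides 10 to 14 can be replayed with a small toy model. This is only a sketch of the bytecode semantics, not how HotSpot executes it: the class name OperandStackDemo and the helper show are hypothetical, and the receiver is modelled by the string "this" where a real JVM holds an object reference.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the operand stack during compute(0xcafebabe).
// Assumption: the receiver is modelled by the string "this".
final class OperandStackDemo {
    // Stands in for Computer.add, the target of invokespecial #3.
    static int add(int value) {
        return value + 254;
    }

    static void show(String instruction, Deque<Object> stack) {
        // ArrayDeque prints the most recently pushed element first.
        System.out.println(instruction + " -> " + stack);
    }

    static int compute(int value) {
        Deque<Object> stack = new ArrayDeque<>();

        stack.push("this");                       // 0: aload_0
        show("aload_0", stack);
        stack.push(value);                        // 1: iload_1
        show("iload_1", stack);
        stack.push(0xdeadbeef);                   // 2: ldc #2
        show("ldc #2", stack);

        int divisor = (Integer) stack.pop();      // 4: idiv pops the divisor,
        int dividend = (Integer) stack.pop();     //    then the dividend,
        stack.push(dividend / divisor);           //    and pushes the quotient
        show("idiv", stack);

        int argument = (Integer) stack.pop();     // 5: invokespecial pops the argument
        stack.pop();                              //    and the receiver,
        stack.push(add(argument));                //    then pushes add's result
        show("invokespecial", stack);

        return (Integer) stack.pop();             // 8: ireturn
    }

    public static void main(String[] args) {
        System.out.println(compute(0xcafebabe));  // 255
    }
}
```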

  15. Java to Bytecode
    private int add(int value) {
        return value + 254;
    }

    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }

    public static void main(String[] args) {
        System.out.println(new Computer().compute(0xcafebabe));
    }
    javac ➞
    public int compute(int);
      Code:
        0: aload_0
        1: iload_1
        2: ldc #2 // int 0xdeadbeef
        4: idiv
        5: invokespecial #3 // Method add:(I)I
        8: ireturn
    Highlighted instruction: idiv

  16. public int compute(int);
      Code:
        0: aload_0
        1: iload_1
        2: ldc #2 // int 0xdeadbeef
        4: idiv
        5: invokespecial #3 // Method add:(I)I
        8: ireturn
    Template code for idiv, printed by java … -XX:+PrintInterpreter:
    idiv 108 idiv [0x00007f8f8f72ecc0, 0x00007f8f8f72ed00] 64 bytes
      0x00007f8f8f72ecc0: mov (%rsp),%eax
      0x00007f8f8f72ecc3: add $0x8,%rsp
      0x00007f8f8f72ecc7: mov %eax,%ecx
      0x00007f8f8f72ecc9: mov (%rsp),%eax
      0x00007f8f8f72eccc: add $0x8,%rsp
      0x00007f8f8f72ecd0: cmp $0x80000000,%eax
      0x00007f8f8f72ecd6: jne 0x00007f8f8f72ece7
      0x00007f8f8f72ecdc: xor %edx,%edx
      0x00007f8f8f72ecde: cmp $0xffffffff,%ecx
      0x00007f8f8f72ece1: je 0x00007f8f8f72ecea
      0x00007f8f8f72ece7: cltd
      0x00007f8f8f72ece8: idiv %ecx
      0x00007f8f8f72ecea: movzbl 0x1(%r13),%ebx
      0x00007f8f8f72ecef: inc %r13
      0x00007f8f8f72ecf2: movabs $0x7f8fadbe4d80,%r10
      0x00007f8f8f72ecfc: jmpq *(%r10,%rbx,8)
    dividend ➞ eax
    divisor ➞ ecx

  17. (bytecode and idiv template code as on slide 16, from java … -XX:+PrintInterpreter)
    Control flow of the idiv template:
    dividend == MIN_VALUE?
      N: divide eax by ecx
      Y: edx = 0, then divisor == -1?
        N: divide eax by ecx
        Y: skip the division (eax already holds MIN_VALUE)
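The two comparisons in the template guard a single operand pair. In Java, Integer.MIN_VALUE / -1 overflows and yields Integer.MIN_VALUE without throwing, whereas the x86 idiv instruction raises a divide error when the quotient does not fit in the destination register. The sketch below, with the hypothetical class name IdivEdgeCase, shows the Java-level behaviour the template has to preserve.

```java
// Demonstrates the one int division whose mathematical result (2^31)
// does not fit in an int. The class name is illustrative.
final class IdivEdgeCase {
    static int divide(int dividend, int divisor) {
        return dividend / divisor;
    }

    static int remainder(int dividend, int divisor) {
        return dividend % divisor;
    }

    public static void main(String[] args) {
        int min = Integer.MIN_VALUE;
        System.out.println(divide(min, -1) == min);  // true: the quotient wraps around
        System.out.println(remainder(min, -1));      // 0, matching edx = 0 in the template
    }
}
```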

  18. Improving Our Program…?
    private int add(int value) {
        return value + 254;
    }

    public int compute(int value) {
        return add(value / 0xdeadbeef);
    }

    public static void main(String[] args) {
        System.out.println(new Computer().compute(0xcafebabe));
    }

  19. Improving Our Program…?
    private int add(int value) {
        return value + 254;
    }

    public int compute(int value) {
        return (value / 0xdeadbeef) + 254;
    }

    public static void main(String[] args) {
        System.out.println(new Computer().compute(0xcafebabe));
    }

  20. Let’s Benchmark!
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @Fork(3)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class ComputerBenchmark {
        private Computer computer = new Computer();

        // Stop the JIT from inlining the benchmark method into its JMH-generated caller.
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        @Benchmark
        public int compute() {
            return computer.compute(0xcafebabe);
        }
    }

  21. Manual Optimisation
    Benchmark                                   Mode  Cnt    Score    Error  Units
    ComputerBenchmark.compute (-Xint)           avgt    5  178.124 ±  8.136  ns/op
    ComputerBenchmark.compute (-Xint, improved) avgt    5  139.749 ±  9.503  ns/op

  22. Everybody Knows…
    [Diagram: Java source ➞ javac ➞ Bytecode ➞ template interpreter ➞ Unoptimised code; profiler ➞ JIT compiler ➞ Optimised code]

  23. Without Inlining (-XX:-Inline)
    ComputerBenchmark::compute
      ...
      mov $0xcafebabe,%edx
      callq 0x00007f0568fb3c60 ;*invokevirtual Computer.compute
      ...
    Computer::compute
      ...
      imul $0x7aeca299,%r10,%r10
      sar $0x3c,%r10
      mov %r10d,%r10d
      sub %r10d,%edx ;*idiv
      callq 0x00007f0568fb40e0 ;*invokevirtual Computer.add
      ...
    Computer::add
      ...
      add $0xfe,%edx
      ...
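There is no idiv instruction in the compiled Computer::compute: the JIT has replaced division by the constant 0xdeadbeef with a multiplication by a "magic" number followed by a shift. The Java sketch below mirrors the imul, sar and sub instructions from the listing. The instructions hidden behind "..." are not shown on the slide, so the sign term (x >> 31) is an assumption about what edx holds before the sub, and the class and method names are hypothetical.

```java
// Division by the constant 0xdeadbeef (-559038737) without a divide instruction.
// MAGIC and SHIFT are taken from the listing; the sign correction is assumed.
final class MagicDivision {
    static final long MAGIC = 0x7aeca299L;  // operand of imul
    static final int SHIFT = 0x3c;          // operand of sar (60)

    static int divideByDeadbeef(int x) {
        long scaled = ((long) x * MAGIC) >> SHIFT;  // imul, then arithmetic shift right
        int sign = x >> 31;                         // assumed: -1 for negative x, else 0
        return sign - (int) scaled;                 // sub %r10d,%edx
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 0xcafebabe, 0xdeadbeef, 559038737,
                         Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            // Each line should show the same value twice.
            System.out.println(x + ": " + divideByDeadbeef(x) + " vs " + (x / 0xdeadbeef));
        }
    }
}
```

The constant works because 0x7aeca299 is 2^60 / 559038737 rounded up, so multiplying by it and shifting right by 60 approximates division by 559038737 closely enough for every 32-bit input; subtracting from the sign term accounts for the divisor being negative.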

  24. With Inlining
    ComputerBenchmark::compute
    ...
    mov $0xff,%eax
    ...

  25. JIT Optimisation
    Benchmark                                   Mode  Cnt    Score    Error  Units
    ComputerBenchmark.compute (-Xint)           avgt    5  178.124 ±  8.136  ns/op
    ComputerBenchmark.compute (-Xint, modified) avgt    5  139.749 ±  9.503  ns/op
    ComputerBenchmark.compute (-XX:-Inline)     avgt    5    6.697 ±  2.253  ns/op
    ComputerBenchmark.compute                   avgt    5    3.571 ±  0.106  ns/op

  26. JIT Optimisations

  27. Everybody Knows…
    [Diagram: Java source ➞ javac ➞ Bytecode ➞ template interpreter ➞ Unoptimised code; JIT compiler ➞ Optimised code]

  28. And Now the Program Executes…

  29. And Now the Program Executes…

  30. Computers Used to be so Simple!
    [Diagram: Input Device, Central Processor, Output Device, Memory, Slow Devices]

  31. Not So Simple Now
    Intel Core i7-3770K Ivy Bridge Processor

  32. The Memory Hierarchy

  33. Stride Prefetching
    [Diagram: Processor; L1 data cache with 64-byte cache lines; prefetch]
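As a rough illustration of what a 64-byte cache line means for a Java int array, the sketch below works out how many ints share a line and how many lines a sequential read touches. The class name CacheLineMath is hypothetical, and the calculation assumes the first element sits at the start of a cache line, which a JVM does not guarantee.

```java
// Back-of-the-envelope cache-line arithmetic for int arrays.
// Assumption: element 0 is aligned to the start of a cache line.
final class CacheLineMath {
    static final int CACHE_LINE_BYTES = 64;  // line size given on the slide

    static int intsPerLine() {
        return CACHE_LINE_BYTES / Integer.BYTES;
    }

    // Distinct cache lines touched when reading `count` consecutive ints.
    static int linesTouched(int count) {
        return (count + intsPerLine() - 1) / intsPerLine();
    }

    public static void main(String[] args) {
        System.out.println(intsPerLine());       // 16
        System.out.println(linesTouched(1024));  // 64
    }
}
```

A sequential scan therefore needs a new line only once every 16 elements, and the fixed step between those lines is the kind of regular pattern a stride prefetcher can detect.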

  34. Data-Dependent Loads
    [Diagram: Processor; L1 data cache with 64-byte cache lines; 4K pages; Translation Buffer (TLB) mapping page index to page address; virtual address (page index + offset) translated to physical address]
    Prefetching doesn’t work with data-dependent loads!
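The difference between the two access patterns can be sketched in plain Java. In the sequential loop every address follows from the loop index, so it is known before the previous load completes. In the chasing loop the next index is itself a value loaded from memory, which is the situation a linked list's next pointer creates. The class and method names are hypothetical, and this is an illustration of the access patterns only, not a benchmark.

```java
import java.util.Random;

// Two ways to visit every element of an array exactly once.
final class LoadPatterns {
    // Address of each load depends only on i: a regular stride.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Address of each load depends on the value loaded in the previous step.
    static long sumChased(int[] data, int[] next) {
        long sum = 0;
        int i = 0;
        for (int step = 0; step < data.length; step++) {
            sum += data[i];
            i = next[i];  // data-dependent: the next index comes from memory
        }
        return sum;
    }

    // Builds one random cycle through 0..n-1 (Sattolo's algorithm), so that
    // following next[] from index 0 visits every index once.
    static int[] randomCycle(int n, long seed) {
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            next[i] = i;
        }
        Random random = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = random.nextInt(i);  // 0 <= j < i, never i itself
            int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i;
        }
        int[] next = randomCycle(n, 42L);
        // Both traversals visit the same elements, so the sums match.
        System.out.println(sumSequential(data) == sumChased(data, next));  // true
    }
}
```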

  35. Measuring Cache-Friendliness
    // The fields iterator, linkedList, intArray and index, and the method sink,
    // belong to the enclosing benchmark class and are not shown on the slide.
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OperationsPerInvocation(1024)
    public void linkedList() {
        if (iterator == null || !iterator.hasNext()) {
            iterator = linkedList.iterator();
        }
        for (int i = 0; i < 1024 && iterator.hasNext(); i++) {
            sink(iterator.next());
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OperationsPerInvocation(1024)
    public void primitiveArray() {
        if (index == intArray.length) {
            index = 0;
        }
        for (int i = 0; i < 1024 && index < intArray.length; i++) {
            sink(intArray[index++]);
        }
    }

  36. LinkedList
    list length (K)                   1      7     63    511
    performance (ns/op)            7.25   9.03  20.87  29.07
    CPI (clockticks/instruction)   0.32   0.41   0.93   1.33
    events/operation
      cycles                      17.97  22.77  51.32  72.66
      instructions                56.08  55.88  54.96  54.49
      L1-dcache-load-misses        1.18   1.83   1.87   2.65
      L1-dcache-loads             18.94  19.39  18.88  18.22
      L1-dcache-stores            12.00  12.18  11.99  11.15
      LLC-load-misses                 0      0   0.41   1.31
      LLC-loads                       0   0.72   1.33   1.56

  37. Primitive Array
    list length (K)                   1      7     63    511
    performance (ns/op)            3.62   3.65   3.65   3.66
    CPI                            0.30   0.30   0.30   0.31
    events/operation
      cycles                       9.09   9.16   9.10   9.13
      instructions                30.24  30.13  29.94  29.85
      L1-dcache-load-misses        0.00   0.01   0.06   0.06
      L1-dcache-loads             12.00  12.00  11.97  12.14
      L1-dcache-stores             6.00   6.02   6.02   6.04
      LLC-load-misses              0.00   0.00   0.00   0.00
      LLC-loads                    0.00   0.00   0.00   0.00
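The CPI rows in both tables follow directly from the cycles and instructions rows: CPI is cycles per operation divided by instructions per operation. The sketch below recomputes a few cells from the figures in the tables; the class name Cpi and the helper round2 are hypothetical.

```java
// Recomputes CPI (clock cycles per instruction) from the table rows.
final class Cpi {
    static double cpi(double cycles, double instructions) {
        return cycles / instructions;
    }

    // Rounds to two decimal places, as in the tables.
    static double round2(double value) {
        return Math.round(value * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(round2(cpi(17.97, 56.08)));  // LinkedList, 1K:      0.32
        System.out.println(round2(cpi(72.66, 54.49)));  // LinkedList, 511K:    1.33
        System.out.println(round2(cpi(9.13, 29.85)));   // primitive array, 511K: 0.31
    }
}
```

For the LinkedList the instruction count per operation barely changes as the list grows, so the rise in CPI reflects cycles spent waiting on cache misses rather than extra work.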

  38. Poor Unloved LinkedList…

  39. Conclusions
    From javac soup to hardware nuts, there’s a lot of advanced technology here.
    And as everyone knows, to the ignorant:
    “Any sufficiently advanced technology is indistinguishable from magic.”
    Don’t believe in magic!

  40. Digging Deeper…
    • https://groups.google.com/forum/#!forum/mechanical-sympathy
    • http://openjdk.java.net/projects/code-tools/jmh/
    • https://github.com/AdoptOpenJDK/jitwatch
    • https://github.com/AdoptOpenJDK/jitwatch/wiki/Building-hsdis
    • https://github.com/vyazelenko/what-lies-beneath
    • Computer Architecture: A Quantitative Approach (5e), Hennessy & Patterson
