Slide 1

Slide 1 text

memory (m3m0r7) How to implement a RubyVM with PHP? RubyKaigi 2024

Slide 2

Slide 2 text

memory m3m0r7 I am a CTO at Liiga, Inc. I enjoy implementing VMs with PHP; for example, I have already implemented a JVM and a RubyVM. I am the author of some books. My hobby is reading binary files. I have been writing Ruby for 4 months. memory1994 m3m0r7

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Haisai, everyone!

Slide 5

Slide 5 text

Are you enjoying RubyKaigi 2024?

Slide 6

Slide 6 text

By the way,

Slide 7

Slide 7 text

Have you ever made any VM?

Slide 8

Slide 8 text

Are you motivated to learn a new language?

Slide 9

Slide 9 text

I myself have learned various languages, including PHP, Go, and TypeScript

Slide 10

Slide 10 text

Gradually, learning a new language "normally" became too uninspiring

Slide 11

Slide 11 text

The same goes for Ruby

Slide 12

Slide 12 text

I have established some of my own learning methods, so it's hard to get motivated

Slide 13

Slide 13 text

So I thought about it

Slide 14

Slide 14 text

How can I learn a new language in a stimulating way?

Slide 15

Slide 15 text

"Yes, let's make a VM."
 "If I make a VM, I can understand how Ruby feels."

Slide 16

Slide 16 text

I can't go back to normal

Slide 17

Slide 17 text

Table of Contents 1/2
1. How does RubyVM work?
2. How to implement a RubyVM with PHP?
3. What is "CallInfoEntry"?
4. How many instructions are there in the RubyVM instruction set?

Slide 18

Slide 18 text

Table of Contents 2/2
5. How do we understand RubyVM from Ruby's core code?
6. How do local variables work?
7. DEMO

Slide 19

Slide 19 text

Topics I will not cover are...

Slide 20

Slide 20 text

Topics I will not cover are...
- How the lexical analyzer for Ruby works
- How the parser for Ruby works
- How to write PHP and Ruby

Slide 21

Slide 21 text

By the way, do you know
 what is important when implementing a VM?

Slide 22

Slide 22 text

This is very very important

Slide 23

Slide 23 text

This is called "DRY"

Slide 24

Slide 24 text

What is important when implementing a VM?
D R Y: Do Repeat Yaruki ("yaruki" means "motivation" in English)

Slide 25

Slide 25 text

How RubyVM works

Slide 26

Slide 26 text

What is "RubyVM" ?

Slide 27

Slide 27 text

What is "RubyVM" ?
- The RubyVM is also called YARV (Yet Another Ruby VM).
- A YARV binary is a set of instruction sequences (lists of instructions to be executed, called ISeq, a.k.a. Instruction Sequence, in Ruby) plus meta-information about those instruction sequences.

Slide 28

Slide 28 text

What is "RubyVM" ?
- The figure below is the YARV for outputting "Hello World!". You can see the string "Hello World!" in it.

Slide 29

Slide 29 text

I'm talking about YAYARV
 (Yet Another Yet Another RubyVM)

Slide 30

Slide 30 text

Let me explain the structure of YARV

Slide 31

Slide 31 text

Up to Ruby 3.2

Slide 32

Slide 32 text

The header section 
 (36 bytes) The payload section The instruction sequence offsets section (The information is each of instruction sequence offsets; N>0 * 4 bytes) The global object offsets section (The information is each of global object offsets; N>0 * 4 bytes) The extra data (if embedded extra data; N>=0 bytes) A part of instruction sequences A part of global objects The RUBY_PLATFORM name section 
 (string) An information of string / class / fi xed number / bool types and data An information of instruction sequence section (In normally, 44 info * 4 bytes = 176 bytes notice: no ộ A code section (N>0 bytes) ộ A local table section (N>=0 bytes) A call info entry section (N>=0 bytes) ộ The structure of YARV The alignment section 
 (Filled by 0xff to align every 2 bytes) Note: The fi gure is my interpretation of the Ruby's core code

Slide 33

Slide 33 text

From Ruby 3.3

Slide 34

Slide 34 text

The structure of YARV
- The header section (36 bytes)
- The endian section (2 bytes)
- The word size section (2 bytes)
- The alignment section (filled with 0xff to align every 2 bytes)
- The payload section
  - A part of instruction sequences
    - A code section (N>0 bytes)
    - A local table section (N>=0 bytes)
    - A call info entry section (N>=0 bytes)
    - Information about the instruction sequence section (normally 44 entries * 4 bytes = 176 bytes)
  - A part of global objects
    - Information about string / class / fixed number / bool types and data
- The instruction sequence offsets section (one offset per instruction sequence; N>0 * 4 bytes)
- The global object offsets section (one offset per global object; N>0 * 4 bytes)
- The extra data (only if extra data is embedded; N>=0 bytes)
Note: The figure is my interpretation of Ruby's core code.

Slide 35

Slide 35 text

The structure of YARV (the same figure as on the previous slide)
- The endian section (2 bytes)
- The word size section (2 bytes)
Platform name section changed to endian section and word size section. No other changes in the YARV structure between Ruby 3.2 and 3.3.
Note: The figure is my interpretation of Ruby's core code.

Slide 36

Slide 36 text

What is "RubyVM" ?
- In Ruby, the code shown on the left is actually sufficient to output "Hello World!" as a string. However, it becomes more difficult when it comes to YARV.
- This is not limited to RubyVM: the hardest part of implementing any VM is getting it to output "Hello World!".
- It is so interesting that it drives me crazy.

Slide 37

Slide 37 text

How to implement a RubyVM with PHP?

Slide 38

Slide 38 text

You

Slide 39

Slide 39 text

I will explain what ChatGPT cannot explain

Slide 40

Slide 40 text

Note: Of course, you can implement 
 RubyVM in any language!

Slide 41

Slide 41 text

Have you any idea 
 how to implement a RubyVM?

Slide 42

Slide 42 text

Have you any idea how to implement a RubyVM?
- The JVM has an extensive document called the "Java Virtual Machine Specification" (i.e., the JVM Specification) [1].
- While Java documentation is maintained by companies, RubyVM documentation is maintained by the community. Therefore, the maintenance of documentation is inevitably limited compared to that of a company.
- In such a situation, how can we implement RubyVM?
[1]: https://docs.oracle.com/javase/specs/jvms/se8/html/

Slide 43

Slide 43 text

The answer is very very simple

Slide 44

Slide 44 text

Answer: Read Ruby's core code

Slide 45

Slide 45 text

https://github.com/m3m0r7/rubyvm-on-php

Slide 46

Slide 46 text

Note: The explanations in these slides are based on Ruby 3.3.0

Slide 47

Slide 47 text

The flow of implementation

Slide 48

Slide 48 text

The flow of implementation
1. Process the YARV header
2. Process the Instruction Sequence offsets
3. Process the Global Object offsets
4. Process the Instruction Sequence byte-code
5. Execute the byte-code of the 0th Instruction Sequence

Slide 49

Slide 49 text

Processing binary structures in PHP

Slide 50

Slide 50 text

Processing binary structures in PHP
- First, generate a YARV file. Ruby provides "RubyVM::InstructionSequence.compile" to compile Ruby code into instruction sequences.
- For example, "ruby -e" can be used to create a YARV file named HelloWorld.yarv.

Slide 51

Slide 51 text

Processing binary structures in PHP
- "fread", "fseek", and "unpack" are very useful if you want to read binary files in PHP.
- Of course, an implementation without the "unpack" function is also possible by using bitwise operations.
- PHP is unlike the C language: it cannot read binary data directly as a specified type (e.g., an integer type). Therefore, it is necessary to read the bytes once as a string (using "fread") and then convert them to an integer type (using "unpack").
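A minimal sketch of the read-then-convert pattern described above. The helper names readUInt32LE and readUInt32LEBitwise are hypothetical (they are not from the talk), and an in-memory stream stands in for a real YARV file.

```php
<?php
/**
 * Read 4 bytes from the stream as a string, then convert them to an
 * integer with unpack(). "V" means unsigned 32-bit, little endian.
 */
function readUInt32LE($handle): int
{
    $bytes = fread($handle, 4);
    if ($bytes === false || strlen($bytes) !== 4) {
        throw new RuntimeException('Unexpected end of file');
    }
    // unpack() returns an array indexed from 1 when no name is given.
    return unpack('V', $bytes)[1];
}

/** The same conversion with bitwise operations instead of unpack(). */
function readUInt32LEBitwise($handle): int
{
    $bytes = fread($handle, 4);
    if ($bytes === false || strlen($bytes) !== 4) {
        throw new RuntimeException('Unexpected end of file');
    }
    return ord($bytes[0])
        | (ord($bytes[1]) << 8)
        | (ord($bytes[2]) << 16)
        | (ord($bytes[3]) << 24);
}

// Usage: a 4-byte magic followed by one 32-bit integer.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, "YARB\x03\x00\x00\x00");
fseek($handle, 4);                 // skip the 4-byte magic
echo readUInt32LE($handle), "\n";  // 3
```

Both helpers return the same value; the bitwise version only shows that unpack() is a convenience, not a requirement.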

Slide 52

Slide 52 text

Process the header

Slide 53

Slide 53 text

The flow of implementation
1. Process the YARV header
2. Process the Instruction Sequence offsets
3. Process the Global Object offsets
4. Process the Instruction Sequence byte-code
5. Execute the byte-code of the 0th Instruction Sequence

Slide 54

Slide 54 text

The header of YARV
- magic (4 bytes): The specified magic string "YARB" (Yet Another Ruby Binary)
- major version (4 bytes): The major version of the Ruby that compiled the file. In the example, it is "3".
- minor version (4 bytes): The minor version of the Ruby that compiled the file. In the example, it is "2".
- size (4 bytes): The binary payload size
- extra size (4 bytes): The extra binary payload size
- iseq list size (4 bytes): Number of the instruction sequences
- global object list size (4 bytes): Number of the symbols (these symbols are different from Ruby's symbols)
- iseq list offset (4 bytes): Offsets for the instruction sequences
- global object list offset (4 bytes): Offsets for the Ruby symbols
It will look like the figure on the right.

Slide 55

Slide 55 text

Use the "V" format code of unpack (unsigned long, always 32 bit, little endian)
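A minimal sketch of parsing the 36-byte header with the "V" format code, assuming the nine-field layout listed on the header slide. The function name parseHeader and the array keys are names chosen for this sketch; the sample bytes are built with pack() rather than taken from a real YARV file.

```php
<?php
/**
 * Parse the 36-byte YARV header: a 4-byte magic followed by eight
 * unsigned 32-bit little-endian integers. In a real loader the bytes
 * would come from fread($handle, 36).
 */
function parseHeader(string $bytes): array
{
    if (strlen($bytes) < 36) {
        throw new RuntimeException('Header is too short');
    }
    $header = unpack(
        'a4magic/VmajorVersion/VminorVersion/Vsize/VextraSize'
        . '/ViseqListSize/VglobalObjectListSize'
        . '/ViseqListOffset/VglobalObjectListOffset',
        $bytes
    );
    if ($header['magic'] !== 'YARB') {
        throw new RuntimeException('Not a YARV binary');
    }
    return $header;
}

// Usage with a hand-built header (the values are made up for illustration).
$sample = pack('a4V8', 'YARB', 3, 3, 100, 0, 1, 2, 40, 90);
$header = parseHeader($sample);
echo $header['majorVersion'], '.', $header['minorVersion'], "\n"; // 3.3
```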

Slide 56

Slide 56 text

Process both offsets

Slide 57

Slide 57 text

The flow of implementation
1. Process the YARV header
2. Process the Instruction Sequence offsets
3. Process the Global Object offsets
4. Process the Instruction Sequence byte-code
5. Execute the byte-code of the 0th Instruction Sequence

Slide 58

Slide 58 text

For example, get offsets for the instruction sequences:
- Move the cursor to the iseq list offset.
- Loop "iseq list size" times, reading a 4-byte offset each time.
For example, get offsets for the symbols:
- Move the cursor to the global object list offset.
- Loop "global object list size" times, reading a 4-byte offset each time.
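The two loops above have the same shape, so one helper can serve both lists. This is a sketch under the assumption that each offset is an unsigned 32-bit little-endian integer; readOffsets is a hypothetical name, and the sample stream is made up.

```php
<?php
/**
 * Move the cursor to $listOffset and read $listSize offsets,
 * each stored as an unsigned 32-bit little-endian integer.
 */
function readOffsets($handle, int $listOffset, int $listSize): array
{
    if ($listSize === 0) {
        return []; // fread() rejects a zero length, so stop early
    }
    fseek($handle, $listOffset);
    $bytes = fread($handle, 4 * $listSize);
    if ($bytes === false || strlen($bytes) !== 4 * $listSize) {
        throw new RuntimeException('Offset list is truncated');
    }
    // "V*" unpacks every 4-byte group; array_values() re-indexes from 0.
    return array_values(unpack('V*', $bytes));
}

// Usage: 36 bytes of padding stand in for the header, then two offsets.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat("\x00", 36) . pack('V2', 44, 120));
echo implode(', ', readOffsets($handle, 36, 2)), "\n"; // 44, 120
```

Calling it once with the iseq list offset and size, and once with the global object list offset and size, yields both offset tables.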

Slide 59

Slide 59 text

Process the Instruction Sequence byte-code

Slide 60

Slide 60 text

The flow of implementation
1. Process the YARV header
2. Process the Instruction Sequence offsets
3. Process the Global Object offsets
4. Process the Instruction Sequence byte-code
5. Execute the byte-code of the 0th Instruction Sequence

Slide 61

Slide 61 text

Process the Instruction Sequence byte-code
- RubyVM has an implementation called "ibf_(?:load|dump_write)_small_value" for efficient binary handling [1].
- It uses Hamming weights [2] (also called popcount or population count) to handle variable byte lengths. An example implementation written in PHP is shown in the figure on the left.
- In this talk, I will name it "readSmallValue".
[1]: https://github.com/ruby/ruby/blob/2f603bc4/compile.c#L11262-L11273
[2]: https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%9F%E3%83%B3%E3%82%B0%E9%87%8D%E3%81%BF
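A sketch of what readSmallValue might look like. The decoding rule used here is my reading of the linked C code and should be checked against it: the first byte encodes the total length (an odd byte means 1 byte, otherwise the number of trailing zero bits plus one, and a zero byte means 9 bytes). This sketch counts trailing zero bits with a plain loop instead of a popcount-based helper, and it takes a string buffer plus an offset, which is an assumption made for brevity.

```php
<?php
/**
 * Decode one "small value" starting at $offset in $buffer and advance
 * $offset past it. Values that need all 64 bits would overflow PHP's
 * signed integers; that case is not handled in this sketch.
 */
function readSmallValue(string $buffer, int &$offset): int
{
    $first = ord($buffer[$offset]);
    if ($first & 1) {
        $length = 1;
    } elseif ($first === 0) {
        $length = 9;
    } else {
        // Length is the number of trailing zero bits plus one.
        $length = 1;
        for ($c = $first; ($c & 1) === 0; $c >>= 1) {
            $length++;
        }
    }
    // Drop the length marker bits, then append the remaining bytes.
    $value = $first >> $length;
    for ($i = 1; $i < $length; $i++) {
        $value = ($value << 8) | ord($buffer[$offset + $i]);
    }
    $offset += $length;
    return $value;
}

// Usage: a 1-byte value followed by a 2-byte value.
$buffer = "\x0b\x06\x2c";
$offset = 0;
echo readSmallValue($buffer, $offset), "\n"; // 5
echo readSmallValue($buffer, $offset), "\n"; // 300
```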

Slide 62

Slide 62 text

Process the Instruction Sequence byte-code
- Once readSmallValue is implemented as in the previous example, it becomes possible to read the data structure of the 0th Instruction Sequence.
- The data structure of an Instruction Sequence is actually huge. Much of it is meta-information, such as the exception table, keyword arguments, etc., and there are more than 50 pieces of meta-information.
- However, it is not necessary to implement all of the data structures if the only goal is to output "HelloWorld!". Therefore, we omit most of them and use only the 4 pieces of meta-information that are needed.

Slide 63

Slide 63 text

Process the Instruction Sequence byte-code
- For an example implementation of RubyVM, read the Ruby core implementation (https://github.com/ruby/ruby/blob/ruby_3_3/compile.c#L12514).
- Or see my implementation, "RubyVM on PHP" (https://github.com/m3m0r7/rubyvm-on-php/blob/0.3.3.0/src/VM/Core/Runtime/Kernel/Ruby3_3/InstructionSequence/InstructionSequenceProcessor.php#L59).

Slide 64

Slide 64 text

Read an instruction sequence
Move the cursor to the 0th instruction sequence, using the offsets list for instruction sequences, then read:
- type (sv)
- iseq size (sv)
- bytecode offset (sv)
- bytecode size (sv)
※ "sv" is short for "small value" (a variable number of bytes).

Slide 65

Slide 65 text

Execute the byte-code of 0th Instruction Sequence

Slide 66

Slide 66 text

The flow of implementation
1. Process the YARV header
2. Process the Instruction Sequence offsets
3. Process the Global Object offsets
4. Process the Instruction Sequence byte-code
5. Execute the byte-code of the 0th Instruction Sequence

Slide 67

Slide 67 text

Execute the byte-code of 0th Instruction Sequence
- $bytecodeOffset and $iseqSize were obtained on an earlier slide. Using these, we will implement execution of the operation codes while reading the instruction sequence.
- Outputting "HelloWorld!" uses 4 instructions: "putself (18)", "putstring (21)", "opt_send_without_block (51)", and "leave (60)".
- Make an array of opcode and mnemonic pairs as shown in the figure on the left.
The pairs of opcodes and mnemonics are implemented in https://github.com/ruby/ruby/blob/ruby_3_3/yjit/src/cruby_bindings.inc.rs#L669-L872.

Slide 68

Slide 68 text

How do the 4 instructions work?

Slide 69

Slide 69 text

Execute the byte-code of 0th Instruction Sequence
Instructions: putself, putstring (operand: "HelloWorld!"), opt_send_without_block, leave
putself: Push the running context to the stack.
Stack: the running context

Slide 70

Slide 70 text

Execute the byte-code of 0th Instruction Sequence
Instructions: putself, putstring (operand: "HelloWorld!"), opt_send_without_block, leave
putstring: Push the string "HelloWorld!" to the stack.
Stack: the running context, "HelloWorld!"

Slide 71

Slide 71 text

Execute the byte-code of 0th Instruction Sequence
Instructions: putself, putstring (operand: "HelloWorld!"), opt_send_without_block, leave
opt_send_without_block: Pop two values from the stack. Then, call the "puts" method in the running context.
Stack: the running context, "HelloWorld!"

Slide 72

Slide 72 text

Execute the byte-code of 0th Instruction Sequence
Instructions: putself, putstring (operand: "HelloWorld!"), opt_send_without_block, leave
leave: Finish execution. Return the result to the upper context.
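The four steps above can be sketched as a tiny stack machine. This is a simplified illustration, not the talk's implementation: the instruction list is written by hand as [opcode, operand] pairs instead of being decoded from bytecode, the method name "puts" is hard-coded, and the names INSNS and execute are chosen for this sketch. The opcode numbers are the ones listed on the slide and differ between Ruby versions.

```php
<?php
// Opcode and mnemonic pairs for the four instructions.
const INSNS = [
    18 => 'putself',
    21 => 'putstring',
    51 => 'opt_send_without_block',
    60 => 'leave',
];

/** Stands in for the running context. */
class Main
{
    public function puts(string $text): void
    {
        echo $text, "\n";
    }
}

/** Execute a pre-decoded list of [opcode, operand] pairs. */
function execute(array $instructions, Main $self): void
{
    $stack = [];
    foreach ($instructions as [$opcode, $operand]) {
        switch (INSNS[$opcode]) {
            case 'putself':
                $stack[] = $self;            // push the running context
                break;
            case 'putstring':
                $stack[] = $operand;         // push the string operand
                break;
            case 'opt_send_without_block':
                $argument = array_pop($stack);
                $receiver = array_pop($stack);
                $receiver->puts($argument);  // hard-coded method name
                $stack[] = null;             // result of the call
                break;
            case 'leave':
                return;                      // finish execution
        }
    }
}

execute(
    [[18, null], [21, 'HelloWorld!'], [51, null], [60, null]],
    new Main()
); // prints "HelloWorld!"
```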

Slide 73

Slide 73 text

Execute the byte-code of 0th Instruction Sequence
- In addition, the instruction sequence after conversion to YARV can be obtained by running something like "puts RubyVM::InstructionSequence.compile("puts 'HelloWorld!'", "HelloWorld.rb").disasm"; this is similar to the javap command in Java.

Slide 74

Slide 74 text

Execute the byte-code of 0th Instruction Sequence
- Next, implement loadObject as shown in the figure on the left. loadObject is a function that gets the symbol from the offset.
- In addition, we implement the Main class, which implements the "puts" method.
The number here comes from https://github.com/ruby/ruby/blob/ruby_3_3/compile.c#L13303-L13336; it is the index into the functions array. For example, string is "5".

Slide 75

Slide 75 text

- An example of the implementation of the "putself" instruction
- An example of the implementation of the "putstring" instruction
- An example of the implementation of the "opt_send_without_block" instruction
- An example of the implementation of the "leave" instruction

Slide 76

Slide 76 text

Execute the byte-code of 0th Instruction Sequence
- The figure on the left shows the output of "HelloWorld!" when executing the previous implementation.
- The example source code is published in the following gist.
- https://gist.github.com/m3m0r7/226e20c8115caf4a9d43b291861f978b

Slide 77

Slide 77 text

What is "CallInfoEntry"?

Slide 78

Slide 78 text

What is "CallInfoEntry"?
- The CallInfoEntry contains meta-information about the method to be executed, such as the method name, the number of arguments, the names of keyword arguments, and so on.
- By implementing it as shown on the following pages, the method can be called without specifying "puts" directly in the code.

Slide 79

Slide 79 text

- Only these two variables are used. The other variables defined in the figure on the left are required in the actual implementation, but are omitted in this example because they are not required for the output of "HelloWorld!".
- Add this one ($ciSize) below.
- Result

Slide 80

Slide 80 text

Change to

Slide 81

Slide 81 text

Change to Change to

Slide 82

Slide 82 text

What is "CallInfoEntry"?
- This automatically resolves the method name using the CallInfoEntry. It also resolves the number of arguments, so the method can be called even when the number of arguments increases.
- The CallInfoEntry provides a variety of information. If you are interested in more, you can read the Ruby core code or my implementation of "RubyVM on PHP" and try to implement it.
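A sketch of how a call info entry lets the send instruction resolve the method name and argument count at runtime. The CallInfoEntry class here is a hypothetical, trimmed-down stand-in with only two fields, and sendWithoutBlock is a name chosen for this sketch; the real structure carries much more information.

```php
<?php
/** Hypothetical call info entry: method name and argument count only. */
class CallInfoEntry
{
    public string $methodName;
    public int $argc;

    public function __construct(string $methodName, int $argc)
    {
        $this->methodName = $methodName;
        $this->argc = $argc;
    }
}

/** Stands in for the running context. */
class Main
{
    public function puts(string ...$lines): void
    {
        foreach ($lines as $line) {
            echo $line, "\n";
        }
    }
}

/** Pop argc arguments and the receiver, then call the named method. */
function sendWithoutBlock(array &$stack, CallInfoEntry $callInfo): void
{
    // array_splice() with a negative offset removes the last argc items
    // in push order; argc = 0 is guarded because -0 would take everything.
    $arguments = $callInfo->argc > 0
        ? array_splice($stack, -$callInfo->argc)
        : [];
    $receiver = array_pop($stack);
    $stack[] = $receiver->{$callInfo->methodName}(...$arguments);
}

// Usage: the receiver was pushed first, then two arguments.
$stack = [new Main(), 'Hello', 'World!'];
sendWithoutBlock($stack, new CallInfoEntry('puts', 2));
```

Because the name and the argument count come from the entry, the same function works unchanged when a call passes more arguments or targets a different method.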

Slide 83

Slide 83 text

What is "CallInfoEntry"?
- See the following gist for the code, including the changes made earlier.
- https://gist.github.com/m3m0r7/226e20c8115caf4a9d43b291861f978b?permalink_comment_id=4686761#gistcomment-4686761

Slide 84

Slide 84 text

How many instructions are there in the RubyVM instruction set?

Slide 85

Slide 85 text

How many instructions are there in the RubyVM instruction set?
- RubyVM has about 100 instructions. Actually, it has about 200, but half of them are tracing instructions. By the way, the JVM has about 150 instructions (as of SE 13).
- For simple HelloWorld! output, the FizzBuzz algorithm, or the QuickSort algorithm, it is not necessary to implement everything. It is possible to execute them by implementing a few instructions.

Slide 86

Slide 86 text

How many instructions are there in the RubyVM instruction set?
- See below for the instruction set provided by RubyVM.
  - https://github.com/ruby/ruby/blob/ruby_3_3/insns.def
- See below for an example implementation in PHP.
  - https://github.com/m3m0r7/rubyvm-on-php/tree/0.3.3.0/src/VM/Core/Runtime/Executor/Insn/Processor

Slide 87

Slide 87 text

How many instructions are there in the RubyVM instruction set?
- The following 6 instructions have been added as of Ruby 3.3.0.
- In Ruby, when the instruction set grows, the opcodes of other instructions may shift. For example, OPT_SEND_WITHOUT_BLOCK was 51 in Ruby 3.2, but it is 53 in Ruby 3.3. If you want to support multiple versions of RubyVM, you need to take this behavior into account.

OpCode  Mnemonic
33      SPLATKW
45      DEFINEDIVAR
58      OPT_NEWARRAY_SEND
135     TRACE_SPLATKW
147     TRACE_DEFINEDIVAR
160     TRACE_OPT_NEWARRAY_SEND
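One way to cope with shifting opcodes is to keep a separate opcode table per Ruby version and select it from the version fields in the header. The sketch below is an assumption about how this could be organized, not the talk's implementation; OPCODES and mnemonicFor are hypothetical names, and the tables contain only the entries mentioned in the talk.

```php
<?php
// A real table would list every instruction for each version.
const OPCODES = [
    '3.2' => [
        51 => 'opt_send_without_block',
    ],
    '3.3' => [
        33 => 'splatkw',
        45 => 'definedivar',
        53 => 'opt_send_without_block',
        58 => 'opt_newarray_send',
    ],
];

/** Look up a mnemonic using the version read from the YARV header. */
function mnemonicFor(int $major, int $minor, int $opcode): string
{
    $version = "{$major}.{$minor}";
    $mnemonic = OPCODES[$version][$opcode] ?? null;
    if ($mnemonic === null) {
        throw new RuntimeException(
            "Unknown opcode {$opcode} for Ruby {$version}"
        );
    }
    return $mnemonic;
}

echo mnemonicFor(3, 2, 51), "\n"; // opt_send_without_block
echo mnemonicFor(3, 3, 53), "\n"; // opt_send_without_block
```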

Slide 88

Slide 88 text

How do we understand RubyVM from Ruby's core code?

Slide 89

Slide 89 text

How do we understand RubyVM from Ruby's core code?
- It is easy to understand the Ruby implementation by looking at compile.c (https://github.com/ruby/ruby/blob/ruby_3_3/compile.c).
- In particular, it is a good idea to follow the `ibf_load_*` functions.
- However, you will have to follow the code to find out in which order the functions are called, so I have drawn a flow diagram on the next page to give you a rough idea of how to follow it.

Slide 90

Slide 90 text

How do we understand RubyVM from Ruby's core code?
iseq.c
- iseqw_s_load_from_binary ← the function called in Ruby's core code when the RubyVM::InstructionSequence.load_from_binary method is called
compile.c
- rb_iseq_ibf_load
- ibf_load_setup
- ibf_load_iseq
- rb_ibf_load_iseq_complete
- ibf_load_iseq_each
- Functions called by ibf_load_iseq_each: ibf_load_code, ibf_load_small_value, ibf_load_object/id, ibf_load_local_table, and so on...
- Return the read ISeq

Slide 91

Slide 91 text

How do local variables work?

Slide 92

Slide 92 text

How do local variables work?
- Local variables are one of the difficult parts of implementing RubyVM. I myself have often failed to implement them.
- When executing a defined method, arguments must be pre-set in a local table at runtime (rather than pushed onto the stack), and the location where each argument is set must be calculated, which is quite difficult.
- Although the source code seems to require an Environment Pointer (EP), I explain how to do without EP in these slides.

Slide 93

Slide 93 text

local table size: 4, call info argc: 2
- slot[0] to slot[2]: VM_ENV_DATA_SIZE
- slot[3]: var4 (variable in the method)
- slot[4]: var3 (variable in the method)
- slot[5]: var2 (argument)
- slot[6]: var1 (argument)

Slide 94

Slide 94 text

How do local variables work?
- The slot index of "varN" (where N is a natural number, N>0) is determined by "VM_ENV_DATA_SIZE + local table size - N". This means var1 is stored in "slot[6]", as calculated by "VM_ENV_DATA_SIZE(3) + local table size(4) - N(1)".
- The arguments passed to a method must be associated with slot indexes in the reverse order of the defined arguments.
- The opcodes "[gs]etlocal(?:_WC[01]|)" follow this rule when accessing values.

Slide 95

Slide 95 text

How do local variables work?
- The reason for starting from the third position seems to be that the information necessary for RubyVM is embedded there ("VM_ENV_DATA_INDEX_ME_CREF", "VM_ENV_DATA_INDEX_SPECVAL", "VM_ENV_DATA_INDEX_FLAGS").
- If there are arguments, it is necessary to prepopulate the slots with values before calling the method. For example, "var1" needs to be placed in "slot[6]" and "var2" in "slot[5]" in advance. Note that for "var3" and "var4", it is not necessary to prepopulate the values, as "setlocal" is called within the internal instruction sequence. (Implementation hint: https://github.com/m3m0r7/rubyvm-on-php/blob/0.3.3.0/src/VM/Core/Runtime/Executor/CallBlockHelper.php#L121)
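The slot calculation and the prepopulation of arguments can be sketched as follows. The formula follows the worked example (var1 in slot[6] with a local table size of 4); slotIndex and prepopulate are hypothetical names, and representing the slots as a plain PHP array is an assumption of this sketch.

```php
<?php
const VM_ENV_DATA_SIZE = 3;

/** Slot index of the N-th local variable (N starts at 1). */
function slotIndex(int $localTableSize, int $n): int
{
    return VM_ENV_DATA_SIZE + $localTableSize - $n;
}

/**
 * Build the slots for a method call and place the arguments in them
 * before the method body runs. Variables that are not arguments stay
 * null until a setlocal instruction assigns them.
 */
function prepopulate(int $localTableSize, array $arguments): array
{
    $slots = array_fill(0, VM_ENV_DATA_SIZE + $localTableSize, null);
    foreach (array_values($arguments) as $i => $value) {
        $slots[slotIndex($localTableSize, $i + 1)] = $value;
    }
    return $slots;
}

// Usage: local table size 4, two arguments (var1 and var2).
$slots = prepopulate(4, ['first', 'second']);
echo $slots[6], ' ', $slots[5], "\n"; // first second
```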

Slide 96

Slide 96 text

How do local variables work?
- In addition, RubyVM's local variables have the concept of "level", where level = 0 represents the current execution context. Each time the level increases (1, 2, 3, ...), the local variables of one more previous context are referenced (this corresponds to the VM_ENV_PREV_EP macro in the Ruby core code; see the figure on the next page).
- It is easy to understand if you think of level as a position relative to the context that is currently being executed.
- Therefore, it is necessary to be able to trace back through the contexts leading to the one being executed.
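A sketch of resolving "level" by walking back through a chain of contexts instead of using an Environment Pointer. The Context class and the getLocal and setLocal helpers are hypothetical names for this illustration; they are not taken from the talk's code.

```php
<?php
/** Hypothetical execution context: its slots plus a link to the previous context. */
class Context
{
    public array $slots = [];
    public ?Context $previous;

    public function __construct(?Context $previous = null)
    {
        $this->previous = $previous;
    }

    /** Walk back $level contexts; level 0 is the running context. */
    public function at(int $level): Context
    {
        $context = $this;
        for ($i = 0; $i < $level; $i++) {
            if ($context->previous === null) {
                throw new RuntimeException('No previous context');
            }
            $context = $context->previous;
        }
        return $context;
    }
}

/** getlocal: read a slot from the context $level steps back. */
function getLocal(Context $running, int $slot, int $level)
{
    return $running->at($level)->slots[$slot] ?? null;
}

/** setlocal: write a slot in the context $level steps back. */
function setLocal(Context $running, int $slot, int $level, $value): void
{
    $running->at($level)->slots[$slot] = $value;
}

// Usage: the inner context reads and updates a variable one level up.
$outer = new Context();
setLocal($outer, 4, 0, 10);       // like setlocal_WC0
$inner = new Context($outer);
echo getLocal($inner, 4, 1), "\n"; // like getlocal_WC1, prints 10
setLocal($inner, 4, 1, 20);       // like setlocal_WC1
echo getLocal($outer, 4, 0), "\n"; // 20
```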

Slide 97

Slide 97 text

- The assignment to var3 happens in the running context. Therefore, level is 0 and "setlocal_WC0" is executed.
- var3 is defined in the previous context. Therefore, level is 1 relative to the running context and "getlocal_WC1" is executed.
- var3 is defined in the previous context. Therefore, level is 1 relative to the running context and "setlocal_WC1" is executed.
- var1 and var2 are defined in the running context. Therefore, level is 0 and "getlocal_WC0" is executed.
- When a call is made (internally, when an opcode such as send/opt_send_without_block is executed), the execution context changes.

Slide 98

Slide 98 text

DEMO

Slide 99

Slide 99 text

Slide 100

Slide 100 text

Let's try to make your own YAYARV!

Slide 101

Slide 101 text

__END__