Slide 1

Identifying topics that caused actual disk read • Shota Kondo • LINE Corporation • 2023.06.16

Slide 2

Speaker • Shota Kondo • Member of the LINE IMF team • The team is responsible for developing and maintaining the company-wide Kafka platform • Provides a multi-tenant shared Kafka cluster

Slide 3

Today’s topic • “Identifying topics that caused actual disk read” • Why? • Disk reads can impact broker performance

Slide 4

Disk read and Kafka broker

Slide 5

Request handling in the Kafka broker • Network threads receive requests and send responses back to clients after processing • Request handler threads do the actual processing of client requests
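
As context, the sizes of these two thread pools are ordinary broker settings; a minimal server.properties excerpt (the values shown are the Kafka defaults) looks like this:

# Thread pools mentioned above (values shown are the Kafka defaults)
# Network threads: receive requests and send responses
num.network.threads=3
# Request handler (I/O) threads: do the actual request processing
num.io.threads=8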

Slide 6

Disk read in Fetch request handling • Usually consumers read topic data from the latest log segments (already in the page cache) • Sometimes a consumer tries to fetch old data that is not in the page cache

Slide 7

Disk read and its impact • Reading data from an HDD is slower than reading from the page cache (memory) • Blocking in a network thread affects the latency of subsequent requests

Slide 8

To apply solutions for such performance degradation • Several solutions can be considered • Warming up the topic data if it is small enough • Setting a smaller log segment size to prevent inode lock contention in xfs while reading topic data • https://speakerdeck.com/line_developers/investigating-kafka-performance-issue-caused-by-lock-contention-in-xfs • To proceed with either of them, we have to know the topic names • If the disk read metrics included the file name, we could use that
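
For reference, the smaller-segment option can be applied per topic with the standard kafka-configs tool; the broker address, topic name, and segment size below are placeholders, and older Kafka versions take --zookeeper instead of --bootstrap-server:

# Hypothetical example: shrink log segments of one topic to 256 MiB
kafka-configs.sh --bootstrap-server <broker:9092> --alter \
  --entity-type topics --entity-name <topic-name> \
  --add-config segment.bytes=268435456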

Slide 9

We already have disk read metrics though… (but they are only labeled by device: xxx)

Slide 10

What we actually needed is file: /data/kafka/…

Slide 11

Then, how do we collect per-file disk read stats?

Slide 12

Requirements • Collect evidence of actual disk reads for each file • Expose the following information as Prometheus metrics • Read bytes as the value • File name as the label • No performance impact on the Kafka broker
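
Concretely, the goal is a metric shaped roughly like the following line in Prometheus exposition format (the metric name and file path are made up for illustration):

kafka_file_disk_read_bytes_total{file="/data/kafka/<topic-partition>/<segment>.log"} 12345678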

Slide 13

How to capture the disk read? • Hook the kernel function related to disk reads • Obtain the required information in the hook

Slide 14

How to hook the kernel function? • eBPF (extended Berkeley Packet Filter) • The feature is provided by the Linux kernel • It makes it possible to hook kernel events without modifying kernel code • bcc (BPF Compiler Collection) • Toolkit to compile and run eBPF programs from Python/Lua

Slide 15

eBPF and BCC • Example code to hook the read() system call

#!/usr/bin/python
from bcc import BPF

bpf_text = """
int kprobe__sys_read(struct pt_regs *ctx) {
    bpf_trace_printk("read() syscall was invoked\\n");
    return 0;
}
"""

BPF(text=bpf_text).trace_print()
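
As a usage note (the script name is hypothetical): bcc compiles the embedded C program at runtime and attaches the kprobe, so the script only needs bcc installed and root privileges, e.g. sudo python trace_read.py; every read() syscall then produces a trace line.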

Slide 16

What kernel function should be hooked? • If the data resides in the page cache, then it will be returned without any disk read • Need to hook a function that is close to the storage device

Slide 17

generic_make_request() • Kernel function to submit I/O requests for devices • It looks good to hook this function

void generic_make_request(struct bio *bio)

Slide 18

Hook for generic_make_request()

struct event_t {
    SOME_TYPE file;
    unsigned int bytes;
};

int kprobe__generic_make_request(struct pt_regs *ctx, struct bio *bio) {
    /* Extract read file and bytes from argument */
    struct event_t event = {};
    event.file = FILE;
    event.bytes = BYTES;

    /* Pass the data from eBPF program to python script */
    events.perf_submit(ctx, &event, sizeof(event));
    return 0;
}

Slide 19

Does struct bio have file information…?

struct bio {
    sector_t             bi_sector;    /* device address in 512 byte sectors */
    struct bio          *bi_next;      /* request queue link */
    struct block_device *bi_bdev;
    unsigned long        bi_flags;     /* status, command, etc */
    unsigned long        bi_rw;        /* bottom bits READ/WRITE,
                                        * top bits priority */
    unsigned short       bi_vcnt;      /* how many bio_vec's */
    unsigned short       bi_idx;       /* current index into bvl_vec */

    /* Number of segments in this BIO after
     * physical address coalescing is performed. */
    unsigned int         bi_phys_segments;

    unsigned int         bi_size;      /* residual I/O count */
    ...
}

Slide 20

Does struct bio have file information…? • Looks like bi_size can be used as the read bytes • But the read file can’t be extracted directly from this argument • Need to get the read file from somewhere • Another kernel function in an upper layer?

Slide 21

generic_file_aio_read() • Generic filesystem read routine • The iocb argument has a pointer to the file struct

ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
                              unsigned long nr_segs, loff_t pos)

Slide 22

Hook for generic_file_aio_read()

BPF_HASH(inotbl, u64, unsigned long, INO_TABLE_SIZE);

int kprobe__generic_file_aio_read(struct pt_regs *ctx, struct kiocb *iocb,
                                  const struct iovec *iov, unsigned long nr_segs, loff_t pos) {
    u64 pid_tgid = bpf_get_current_pid_tgid();
    unsigned long ino;

    if (iocb->ki_filp->f_path.dentry->d_inode) {
        ino = iocb->ki_filp->f_path.dentry->d_inode->i_ino;
    } else {
        // Set 0 if it's negative cache
        ino = 0;
    }
    inotbl.insert(&pid_tgid, &ino);
    return 0;
}

Slide 23

We have file information now • File information can be obtained from generic_file_aio_read() • Then let’s refer to it in generic_make_request()

Slide 24

Hook for generic_make_request()

int kprobe__generic_make_request(struct pt_regs *ctx, struct bio *bio) {
    // Only account read requests
    if (op_is_write(op_from_rq_bits(bio->bi_rw)))
        return 0;

    u64 pid_tgid = bpf_get_current_pid_tgid();
    unsigned long *pino = inotbl.lookup(&pid_tgid);

    struct event_t event = {};
    if (pino) {
        event.inode = *pino;
    } else {
        event.inode = 0;
    }
    event.bytes = bio->bi_size;

    events.perf_submit(ctx, &event, sizeof(event));
    return 0;
}
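
The events emitted above carry only an inode number, while the requirement asks for a file-name label. One way to bridge that gap (an assumption for illustration, not necessarily what the original tooling does) is to pre-build an inode-to-path map of the Kafka log directory in the user-space script and refresh it periodically as segments roll:

import os

LOG_DIR = "/data/kafka"  # assumed Kafka log directory

def build_inode_map(log_dir=LOG_DIR):
    """Map inode number -> file path for every file under the Kafka log directory."""
    ino_to_path = {}
    for dirpath, _dirnames, filenames in os.walk(log_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                ino_to_path[os.stat(path).st_ino] = path
            except OSError:
                # Segment may have been deleted by retention in the meantime
                pass
    return ino_to_path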

Slide 25

Receive data from eBPF program • Disk read stats are available now, then let’s just expose the metrics

def record_event(cpu, data, size):
    event = b["events"].event(data)
    # Accumulate received data from eBPF program and expose as prometheus metrics

b = BPF(text=bpf_text)
b["events"].open_perf_buffer(record_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
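
As a sketch of what the accumulation step could look like (the metric name, the prometheus_client usage, and the build_inode_map() helper from the previous sketch are assumptions for illustration, not the original script):

from bcc import BPF
from prometheus_client import Counter, start_http_server

# Hypothetical metric: read bytes as the value, file name as the label
DISK_READ_BYTES = Counter("kafka_file_disk_read_bytes_total",
                          "Bytes actually read from disk, per file",
                          ["file"])

ino_to_path = build_inode_map()   # from the earlier sketch
start_http_server(9100)           # arbitrary port for Prometheus to scrape

def record_event(cpu, data, size):
    event = b["events"].event(data)
    filename = ino_to_path.get(event.inode, "unknown")
    DISK_READ_BYTES.labels(file=filename).inc(event.bytes)

b = BPF(text=bpf_text)            # bpf_text contains the two hooks shown earlier
b["events"].open_perf_buffer(record_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

Note that a label per file can become high-cardinality if the log directory holds many segments, so some aggregation (e.g. per topic-partition) may be preferable.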

Slide 26

Finally we get per-file disk read stats!! file: /data/kafka/…

Slide 27

Summary • Disk reads in the network thread can block request processing • Per-file disk read stats help to identify the topics that caused disk reads • eBPF provides a way to observe the system layer • And it’s not so hard