
Identifying topics that caused actual disk read

Shota Kondo
LINE Corporation

※ This deck was presented at the following event:
https://kafka-apache-jp.connpass.com/event/284247/

LINE Developers

June 16, 2023

Transcript

  1. Shota Kondo
    LINE Corporation
    Identifying topics that caused actual disk read
    2023.06.16


  2. Speaker
    • Shota Kondo
    • Member of the LINE IMF team
    • The team is responsible for developing and maintaining the company-wide
    Kafka platform
    • Provides a multi-tenant shared Kafka cluster


  3. Today’s topic
    • “Identifying topics that caused actual disk read”
    • Why?
    • Disk reads may impact broker performance


  4. Disk read and Kafka broker


  5. Request handling in Kafka broker
    • Network threads receive requests and send responses to clients after processing
    • Request handler threads actually process the requests from clients


  6. Disk read in Fetch request handling
    • Usually, consumers read topic data from the latest log segments (already in the page cache)
    • Sometimes a consumer tries to fetch old data that is not in the page cache (sketched below)
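    • For illustration (not part of the talk), a minimal kafka-python sketch of such a consumer,
    rewinding to the oldest retained offsets so the broker has to serve old segments;
    the topic name and broker address are hypothetical:
    #!/usr/bin/python
    from kafka import KafkaConsumer, TopicPartition

    # Hypothetical topic-partition and broker address, for illustration only
    tp = TopicPartition("some-topic", 0)
    consumer = KafkaConsumer(bootstrap_servers="broker:9092", group_id="replay-job")
    consumer.assign([tp])
    consumer.seek_to_beginning(tp)  # jump back to the oldest retained offset

    for record in consumer:
        # Re-reading old records makes the broker read old log segments,
        # which are likely no longer in the page cache
        pass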


  7. Disk read and its impact
    • Reading data from HDD is slower than reading from page cache (memory)
    • Blocking in a network thread affects the latency of subsequent requests


  8. To apply solutions to such performance degradation
    • Several solutions can be considered
    • Warming up topic data if it is small enough (sketched below)
    • Setting a smaller log segment size to prevent inode lock contention in XFS
    while reading topic data
    • https://speakerdeck.com/line_developers/investigating-kafka-performance-issue-caused-by-lock-contention-in-xfs
    • To proceed with them, we have to know the topic names
    • If disk read metrics had the file name, we could use that
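    • A minimal warm-up sketch (an assumption, not the talk's tooling): reading a
    topic-partition's segment files once pulls them into the page cache; the
    directory path below is hypothetical:
    #!/usr/bin/python
    import glob
    import os

    # Hypothetical topic-partition directory under the broker's log.dirs
    TOPIC_DIR = "/data/kafka/some-topic-0"
    CHUNK = 1024 * 1024

    for segment in sorted(glob.glob(os.path.join(TOPIC_DIR, "*.log"))):
        with open(segment, "rb") as f:
            while f.read(CHUNK):  # reading the file populates the page cache
                pass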


  9. We already have disk read metrics though…
    device: xxx


  10. What we actually needed is
    file: /data/kafka/…


  11. Then, how to collect
    per-file disk read stats?


  12. Requirements
    • Collect the evidence of actual disk reads for each file
    • Expose the following information as Prometheus metrics (a metric-shape sketch follows this slide)
    • Read bytes as the value
    • File name as the label
    • No performance impact on the Kafka broker
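    • For illustration (an assumption, not the talk's exporter), the metric shape could look
    like this with prometheus_client; the metric name, label, and port are hypothetical:
    #!/usr/bin/python
    from prometheus_client import Counter, start_http_server

    # Read bytes as the value, file name as the label
    DISK_READ_BYTES = Counter("kafka_log_disk_read_bytes",
                              "Bytes actually read from disk, per log file",
                              ["file"])

    start_http_server(9100)  # endpoint for Prometheus to scrape
    # For each observed read: DISK_READ_BYTES.labels(file=path).inc(read_bytes)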


  13. How to capture the disk read?
    • Hook the kernel function related to disk read
    • Obtain the required information in the hook


  14. How to hook the kernel function?
    • eBPF (extended Berkeley Packet Filter)
    • The feature is provided by the Linux kernel
    • It makes it possible to hook kernel events without modifying kernel code
    • bcc (BPF Compiler Collection)
    • Toolkit to compile and run eBPF programs from Python/Lua


  15. eBPF and BCC
    • Example code to hook read() system call
    #!/usr/bin/python

    from bcc import BPF

    bpf_text="""
    int kprobe__sys_read(struct pt_regs *ctx) {
        bpf_trace_printk("read() syscall was invoked\\n");
        return 0;
    }
    """

    BPF(text=bpf_text).trace_print()
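    • To try it, run the script as root (attaching kprobes requires privileges); trace_print()
    then streams the bpf_trace_printk output from the kernel trace pipe each time read() is
    invoked. On newer kernels the syscall symbol name differs, so the probe name may need adjusting.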



  16. What kernel function should be hooked?
    • If data resides in the page cache, then the data will be returned without a disk read
    • Need to hook a function that is close to the storage device


  17. generic_make_request()
    • Kernel function to submit I/O requests to block devices
    • It looks like a good function to hook
    void generic_make_request(struct bio *bio)


  18. Hook for generic_make_request()
    struct event_t {
        SOME_TYPE file;
        unsigned int bytes;
    };

    int kprobe__generic_make_request(struct pt_regs *ctx,
                                     struct bio *bio) {
        /* Extract read file and bytes from argument */
        struct event_t event = {};
        event.file = FILE;
        event.bytes = BYTES;

        /* Pass the data from eBPF program to python script */
        events.perf_submit(ctx, &event, sizeof(event));
        return 0;
    }



  19. Does struct bio have file informations…?
    struct bio {
        sector_t bi_sector;              /* device address in 512 byte sectors */
        struct bio *bi_next;             /* request queue link */
        struct block_device *bi_bdev;
        unsigned long bi_flags;          /* status, command, etc */
        unsigned long bi_rw;             /* bottom bits READ/WRITE,
                                          * top bits priority
                                          */
        unsigned short bi_vcnt;          /* how many bio_vec's */
        unsigned short bi_idx;           /* current index into bvl_vec */
        /* Number of segments in this BIO after
         * physical address coalescing is performed.
         */
        unsigned int bi_phys_segments;
        unsigned int bi_size;            /* residual I/O count */
        ...
    }


  20. Does struct bio have file informations…?
    • Looks like bi_size can be used as the read bytes
    • But the read file can't be extracted directly from this argument
    • Need to get the read file from somewhere
    • Another kernel function in an upper layer?


  21. generic_file_aio_read()
    • Generic filesystem read routine
    • The argument iocb has a pointer to the file struct
    ssize_t generic_file_aio_read(struct kiocb *iocb,
                                  const struct iovec *iov,
                                  unsigned long nr_segs,
                                  loff_t pos)


  22. Hook for generic_file_aio_read()
    BPF_HASH(inotbl, u64, unsigned long, INO_TABLE_SIZE);

    int kprobe__generic_file_aio_read(struct pt_regs *ctx,
                                      struct kiocb *iocb,
                                      const struct iovec *iov,
                                      unsigned long nr_segs,
                                      loff_t pos) {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        unsigned long ino;
        if (iocb->ki_filp->f_path.dentry->d_inode) {
            ino = iocb->ki_filp->f_path.dentry->d_inode->i_ino;
        } else {
            // Set 0 if it's negative cache
            ino = 0;
        }
        inotbl.insert(&pid_tgid, &ino);
        return 0;
    }



  23. We have file information now
    • File information can be obtained from generic_file_aio_read()
    • Then let's refer to it in generic_make_request()


  24. Hook for generic_make_request()
    int kprobe__generic_make_request(struct pt_regs *ctx,
                                     struct bio *bio) {
        // Only account read requests
        if (op_is_write(op_from_rq_bits(bio->bi_rw))) return 0;

        u64 pid_tgid = bpf_get_current_pid_tgid();
        unsigned long *pino = inotbl.lookup(&pid_tgid);

        struct event_t event = {};
        if (pino) {
            event.inode = *pino;
        } else {
            event.inode = 0;
        }
        event.bytes = bio->bi_size;

        events.perf_submit(ctx, &event, sizeof(event));
        return 0;
    }


  25. Receive data from eBPF program
    • Disk read stats are available now, so let's just expose the metrics
    (a sketch of the elided accumulation step follows this slide)
    def record_event(cpu, data, size):
        event = b["events"].event(data)
        # Accumulate received data from eBPF program and expose as prometheus metrics

    b = BPF(text=bpf_text)
    b["events"].open_perf_buffer(record_event)
    while True:
        try:
            b.perf_buffer_poll()
        except KeyboardInterrupt:
            exit()
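    • A possible sketch of that elided accumulation step (an assumption, not the talk's code):
    resolve each event's inode to a log file name by stat()-ing the files under a hypothetical
    /data/kafka log directory, and feed a prometheus_client counter shaped like the one sketched
    after slide 12; bpf_text is the eBPF program from the previous slides:
    #!/usr/bin/python
    import glob
    import os
    from bcc import BPF
    from prometheus_client import Counter, start_http_server

    LOG_DIR = "/data/kafka"  # hypothetical Kafka log directory
    DISK_READ_BYTES = Counter("kafka_log_disk_read_bytes",
                              "Bytes actually read from disk, per log file",
                              ["file"])

    # Map inode number -> file path for every file under the Kafka log dir,
    # because the eBPF events only carry the inode number
    inode_map = {os.stat(p).st_ino: p
                 for p in glob.glob(os.path.join(LOG_DIR, "*", "*"))}

    def record_event(cpu, data, size):
        event = b["events"].event(data)
        path = inode_map.get(event.inode, "unknown")
        DISK_READ_BYTES.labels(file=path).inc(event.bytes)

    b = BPF(text=bpf_text)  # bpf_text: the hooks from the previous slides
    b["events"].open_perf_buffer(record_event)
    start_http_server(9100)  # expose the metrics for Prometheus to scrape
    while True:
        try:
            b.perf_buffer_poll()
        except KeyboardInterrupt:
            exit()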



  26. Finally we get per-file disk read stats!!
    file: /data/kafka/…


  27. Summary
    • Disk reads in a network thread can block request processing
    • Per-file disk read stats help identify the topics that caused the disk reads
    • eBPF provides a way to observe the system layer
    • And it's not so hard
