
Modern High Performance C# 2023 Edition


Yoshifumi Kawai

August 29, 2023

Transcript

  1. Modern High Performance C#
    2023 Edition
CEDEC 2023 (the largest conference for game developers in Japan)
    2023-08-23 Yoshifumi Kawai / Cysharp, Inc.
    Translated from Japanese to English with ChatGPT (GPT-4)


  2. About Speaker
    Kawai Yoshifumi / @neuecc
    Cysharp, Inc. - CEO/CTO
    Established in September 2018 as a subsidiary of Cygames, Inc. Engages in research and development, open-source software (OSS), and consulting related to C#.
    Microsoft MVP for Developer Technologies (C#) since 2011.
    Received the CEDEC AWARDS 2022 Engineering award.
    Developer of over 50 OSS libraries (UniRx, UniTask, MessagePack C#, etc.), achieving one of the highest numbers of GitHub Stars worldwide in the field of C#.


  3. OSS for high performance
    MessagePack ★4836: Extremely Fast MessagePack Serializer for C# (.NET, Unity).
    MagicOnion ★3308: Unified Realtime/API framework for .NET platform and Unity.
    UniTask ★5901: Provides an efficient allocation-free async/await integration for Unity.
    ZString ★1524: Zero Allocation StringBuilder for .NET and Unity.
    MemoryPack ★2007: Zero encoding extreme performance binary serializer for C# and Unity.
    AlterNats ★271: An alternative high performance NATS client for .NET.
    MessagePipe ★1062: High performance in-memory/distributed messaging pipeline for .NET and Unity.
    Ulid ★629: Fast C# implementation of ULID for .NET and Unity.


  4. OSS for high performance
    I have released libraries that pursue unmatched high performance across various genres. In this session, drawing from those experiences, I will introduce techniques for achieving peak performance in modern C#.


  5. Current State of C#


  6. Continuously Evolving
    2002 C# 1.0 (Java/Delphi-influenced origins)
    2005 C# 2.0: Generics
    2008 C# 3.0: LINQ
    2010 C# 4.0: dynamic
    2012 C# 5.0: async/await
    2015 C# 6.0: Roslyn (self-hosting C# compiler)
    2017 C# 7.0: Tuple, Span
    2019 C# 8.0: null safety, async streams
    2020 C# 9.0: record class, code generators
    2021 C# 10.0: global using, record struct
    2022 C# 11.0: ref fields, static abstract members


  7. Continuously Evolving
    C# periodically underwent major updates during the Anders Hejlsberg era and now adds incremental features every year. While bold new features have become less common, the language has been steadily improving, and many of the additions have a positive impact on performance.


  8. .NET Framework -> Cross platform
    2002 .NET Framework 1.0
    2005 .NET Framework 2.0
    2008 .NET Framework 3.5
    2010 .NET Framework 4
    2012 .NET Framework 4.5
    2016 .NET Core 1.0 (the commencement of full-fledged support for Linux)
    2017 .NET Core 2.0
    2019 .NET Core 3.0
    2020 .NET 5 (integration of multiple runtimes: .NET Framework, Core, Mono, Xamarin)
    The .NET Framework years were the era when the platform was focused only on Windows.


  9. Linux and .NET
    High performance as a server-side programming language.
    It is now commonplace, and quite practical, to choose Linux for server-side deployment. Performance is also proven through benchmark results (Plaintext ranked 1st: C#, .NET, Linux), and is competitive even when compared to C++ or Rust.
    While these benchmarks are not necessarily practical, they do serve as evidence that the language has a sufficient baseline of potential.


  10. gRPC and Performance
    gRPC does not necessarily equal high speed. Performance varies depending on the implementation, and unoptimized implementations can be subpar.
    C# delivers performance on the same level as top-tier languages like Rust, Go, and C++, demonstrating its capabilities.
    [Chart: gRPC implementation performance (2 CPUs), requests/sec (higher is better). Implementations compared: dotnet_grpc, rust_thruster_mt, cpp_grpc_mt, scala_akka, rust_tonic_mt, go_vtgrpc, java_quarkus, swift_grpc, node_grpcjs_st, python_async_grpc, ruby_grpc, erlang_grpcbox, php_grpc]
    https://github.com/LesnyRumcajs/grpc_bench/discussions/354


  11. Memory


  12. MessagePack for C#
    #1 Binary Serializer in .NET
    https://github.com/MessagePack-CSharp/MessagePack-CSharp
    The most supported binary serializer in .NET (with 4836 stars).
    Even if you've never used it directly,
    you've likely used it indirectly... for sure!
    Visual Studio 2022 internal
    SignalR MessagePack Hub
    Blazor Server protocol(BlazorPack)
    2017-03-13 Release
    Overwhelming speed compared to other
    competitors at the time.


  13. Thinking about the fastest serializer
    For example, serializing value = int(999). The ideal fastest code:
    // dest is a Span<byte>
    Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(dest), value);
    The IL it compiles down to:
    ldarg.0
    ldarg.1
    unaligned. 0x01
    stobj !!T
    ret
    In essence, a memory copy.


  14. Thinking about the fastest serializer
    Starting from C# 7.2, there is a need to actively utilize Span<T> for handling contiguous memory regions.
    The Unsafe class provides primitive operations that can be written in IL (Intermediate Language) but not in C#. These features lift C#'s language constraints, making it easier to control raw behavior.
    NOTE: While it's true that you can write something close to Span using pointers, Span allows for more natural handling in C#, which is why it has started appearing frequently not just within methods but also in the method signatures of public APIs. Thanks to this, the pattern of carrying "raw-like" operations across methods, classes, and assemblies has become established, contributing to the overall performance improvement of C# in recent years.
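As a minimal, self-contained sketch of the "ideal fastest code" idea above (the RawWriter name is illustrative, not part of any library):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class RawWriter
{
    // Write the 4 bytes of `value` at the start of `dest` with no
    // alignment assumption -- effectively a single memory copy.
    public static void WriteInt32(Span<byte> dest, int value)
    {
        Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(dest), value);
    }
}
```

Writing 999 into a 4-byte buffer this way stores the value in native endianness, which is exactly the "memory copy" the slide describes.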


  15. MessagePack / JSON
    MessagePack: following the MessagePack specification, write the type identifier at the beginning and the value in BigEndian format.
    // uint16 msgpack code
    Unsafe.WriteUnaligned(ref dest[0], (byte)0xcd);
    // Write value as BigEndian
    var temp = BinaryPrimitives.ReverseEndianness((ushort)value);
    Unsafe.WriteUnaligned(ref dest[1], temp);
    JSON: in the case of existing serializers, performance is improved by reading and writing directly as UTF8 binary rather than as a string.
    Utf8Formatter.TryFormat(value, dest, out var bytesWritten);


  16. MessagePack / JSON
    MessagePack for C# is indeed fast. However, due to the binary specification of MessagePack itself, no matter what you do, it will be slower than the "ideal fastest code."
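The uint16 path above can also be sketched with the safe BinaryPrimitives API; MsgPackSketch is an illustrative name, not the actual MessagePack for C# code:

```csharp
using System;
using System.Buffers.Binary;

static class MsgPackSketch
{
    // MessagePack uint16 format: a 0xcd type code, then the value big-endian.
    public static int WriteUInt16(Span<byte> dest, ushort value)
    {
        dest[0] = 0xcd;
        BinaryPrimitives.WriteUInt16BigEndian(dest.Slice(1), value);
        return 3; // bytes written
    }
}
```

Encoding 999 (0x03E7) produces the three bytes cd 03 e7: one extra type byte plus an endianness swap compared to the raw copy.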


  17. MemoryPack
    Zero encoding extreme fast binary serializer
    https://github.com/Cysharp/MemoryPack/
    Released in 2022-09 with the aim of being the ultimate high-speed serializer.
    Compared to JSON, it offers performance that is several times, and in optimal cases hundreds of times, faster. It is overwhelmingly superior even when compared to MessagePack for C#.
    Binary specification optimized for C#; utilizes the latest design, fully leveraging C# 11.


  18. Zero encoding
    Memory copying to the fullest extent possible
    public struct Point3D
    {
    public int X;
    public int Y;
    public int Z;
    }
    new Point3D { X = 1, Y = 2, Z = 3 }
In C#, it is guaranteed that the values of a struct that does not
    include reference types (where IsReferenceOrContainsReferences is false)
    will be laid out contiguously in memory.
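A sketch of what "zero encoding" means for such a struct, using only safe MemoryMarshal APIs (ZeroEncoding/ToBytes are illustrative names, not MemoryPack's actual implementation):

```csharp
using System;
using System.Runtime.InteropServices;

public struct Point3D
{
    public int X;
    public int Y;
    public int Z;
}

static class ZeroEncoding
{
    // Point3D contains no references, so its 12 bytes can be
    // reinterpreted as raw bytes and copied in one operation.
    public static byte[] ToBytes(Point3D p)
    {
        ReadOnlySpan<Point3D> one = MemoryMarshal.CreateReadOnlySpan(ref p, 1);
        return MemoryMarshal.AsBytes(one).ToArray();
    }
}
```

Serializing { X = 1, Y = 2, Z = 3 } is then just a 12-byte copy, no per-field encoding.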


  19. IsReferenceOrContainsReferences
    Sequentially arranged compact specifications
    [MemoryPackable]
    public partial class Person
    {
    public long Id { get; set; }
    public int Age { get; set; }
    public string? Name { get; set; }
    }
    In the case of reference types, they are
    written sequentially. The specification was
    carefully crafted to balance performance and
    versioning resilience within a simple structure.


  20. T[] where T : unmanaged
    In C#, arrays whose elements are of unmanaged type (non-reference structs) are arranged sequentially in memory.
    new int[] { 1, 2, 3, 4, 5 }
    var srcLength = Unsafe.SizeOf<T>() * value.Length;
    var allocSize = srcLength + 4;
    ref var dest = ref GetSpanReference(allocSize);
    ref var src = ref Unsafe.As<T, byte>(ref MemoryMarshal.GetArrayDataReference(value));
    Unsafe.WriteUnaligned(ref dest, value.Length);
    Unsafe.CopyBlockUnaligned(ref Unsafe.Add(ref dest, 4), ref src, (uint)srcLength);
    Advance(allocSize);
    Serialize == Memory Copy
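The same length-prefixed whole-array copy can be sketched with safe APIs (ArraySerializer is an illustrative name; MemoryPack's real implementation uses the Unsafe-based code above):

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

static class ArraySerializer
{
    // Length-prefixed copy of an unmanaged T[]: a 4-byte element count,
    // then all element bytes in a single block copy.
    public static byte[] Serialize<T>(T[] value) where T : unmanaged
    {
        ReadOnlySpan<byte> src = MemoryMarshal.AsBytes(value.AsSpan());
        var dest = new byte[4 + src.Length];
        BinaryPrimitives.WriteInt32LittleEndian(dest, value.Length);
        src.CopyTo(dest.AsSpan(4));
        return dest;
    }
}
```

For new int[] { 1, 2, 3, 4, 5 } this emits 4 + 20 bytes with exactly one element copy, regardless of array length.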


  21. T[] where T : unmanaged
    Increasingly advantageous for complex types like Vector3[]
    Vector3(float x, float y, float z)[10000]
    Conventional serializers perform Write/Read
    operations for each field, so for 10,000 items, you
    would need to perform 10,000 x 3 operations.
    MemoryPack requires only one copy operation.
    It's only natural that it would be 200 times
    faster, then!


  22. I/O Write


  23. Three Tenets of I/O Application Speedup
    Minimize allocations
    Reduce copies
    Prioritize asynchronous I/O
    // Bad example
    byte[] result = Serialize(value);
    response.Write(result);
This violates all three tenets: a byte[] is allocated on every call, the Write likely copies the data again, and the Write is synchronous.


  24. Isn't I/O All About Streams?
    async Task WriteToStreamAsync(Stream stream)
    {
    // Queue
    while (messages.TryDequeue(out var message))
    {
    await stream.WriteAsync(message.Encode());
    }
    }
    In applications involving I/O, the ultimate output destinations are usually
    either the network (Socket/NetworkStream) or a file (FileStream).
    "Were you able to implement the three
    points with this...?"



  26. Why Streams Are Bad: Reason 1
    async Task WriteToStreamAsync(Stream stream)
    {
    // Queue
    while (messages.TryDequeue(out var message))
    {
    await stream.WriteAsync(message.Encode());
    }
    }
    Frequent fine-grained I/O operations are slow, even if they are
    asynchronous! async/await is not a panacea!


  27. Stream is beautiful……?
    async Task WriteToStreamAsync(Stream stream)
    {
    // So, it would be beneficial to add a buffer using BufferedStream, right?
    using (var buffer = new BufferedStream(stream))
    {
    while (messages.TryDequeue(out var message))
    {
    await buffer.WriteAsync(message.Encode());
    }
    }
    }
The Stream abstraction is excellent in terms of functionality: features can be added freely through the decorator pattern. For instance, wrapping a stream in a GZipStream adds compression, and wrapping it in a CryptoStream adds encryption.
    In this case, since we want to add a buffer, we wrap it in a BufferedStream. This way, WriteAsync won't immediately hit the I/O.



  29. Why Streams Are Bad: Reason 2
    async Task WriteToStreamAsync(Stream stream)
    {
    // So, it would be beneficial to add a buffer using BufferedStream, right?
    using (var buffer = new BufferedStream(stream))
    {
    while (messages.TryDequeue(out var message))
    {
    await buffer.WriteAsync(message.Encode());
    }
    }
    }
    If the Stream is already Buffered, unnecessary allocation
    would contradict the intent of "reducing allocation."
    Being buffered means that in most cases (provided the buffer is not
    overflowing) calls can be synchronous. If a call is synchronous, making
    an asynchronous call is wasteful.


  30. Why Streams Are Bad: Reason 2
public Task WriteAsync(byte[] buffer, int offset, int count);
    public ValueTask WriteAsync(ReadOnlyMemory<byte> buffer);
    Due to historical circumstances, Streams (and Sockets) have APIs that return a Task
    and others that return a ValueTask, with similar parameters and the same name. If you
    use an API that returns a Task, there's a chance you might inadvertently generate
    unnecessary Task allocations. Therefore, always use a ValueTask call.
    Fortunately, in the case of BufferedStream, the allocation itself does not occur because it
    returns Task.CompletedTask for synchronous operations. However, there is a cost
    associated with calling await. Regardless of whether it is ValueTask, waste is waste.
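The overload choice above can be shown concretely; a small sketch (WriteDemo is an illustrative name, demonstrated here against a MemoryStream):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class WriteDemo
{
    // Prefer the ReadOnlyMemory<byte>-based overload: it returns a
    // ValueTask, so synchronous completions avoid a Task allocation.
    public static async ValueTask WriteAllAsync(Stream stream, byte[] data)
    {
        await stream.WriteAsync(data.AsMemory()); // ValueTask-returning overload
        // NOT: await stream.WriteAsync(data, 0, data.Length); // Task-returning overload
    }
}
```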


  31. Let's reduce asynchronous writing to just once
    async Task WriteToStreamAsync(Stream stream)
    {
        var buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            var slice = buffer.AsMemory();
            var totalWritten = 0;
            while (messages.TryDequeue(out var message))
            {
                var written = message.EncodeTo(slice.Span);
                totalWritten += written;
                slice = slice.Slice(written);
            }
            await stream.WriteAsync(buffer.AsMemory(0, totalWritten));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
    The original message.Encode() presumably returned a byte[]. If you switch to EncodeTo and write into one large rented buffer, there is less waste than with the BufferedStream approach.
    (For the purpose of this sample, we assume the buffer will not overflow and skip checks and resizing.)


  32. Stream is Bad
    Focus on synchronous buffers and asynchronous reading and writing.
    The Stream abstraction mixes synchronous and asynchronous behavior, forcing us to always use async calls even when the actual behavior is synchronous, as with BufferedStream.
    Because the concrete type behind a Stream is unknown, each Stream often holds its own buffer, for safety or as working space. (For example, a new GZipStream allocates 8K just by being created, BufferedStream allocates 4K, and MemoryStream also allocates in fine-grained pieces.)


  33. Stream is Dead
    Avoiding Stream: the time when Stream was a first-class citizen for I/O has passed.
    • RandomAccess for file processing (scatter/gather I/O API)
    • IBufferWriter<byte> for writing directly into the destination's internal buffer
    • System.IO.Pipelines for buffer and flow control
    Classes have emerged that handle these processes while avoiding Streams. Avoiding Stream overhead is the first step toward high-performance handling.
    However, since Streams are at the core of .NET, it's impossible to avoid them completely. It's hard to bypass NetworkStream or FileStream entirely, and there are no alternatives to ConsoleStream or SslStream. Try to manage by not touching the streams until the very last read/write.


  34. IBufferWriter<byte>
    Abstracting the synchronous buffer for writing
    public interface IBufferWriter<T>
    {
        void Advance(int count);
        Memory<T> GetMemory(int sizeHint = 0);
        Span<T> GetSpan(int sizeHint = 0);
    }
    void Serialize<T>(IBufferWriter<byte> writer, T value)
    Flow: the IBufferWriter requests a slice of the network buffer, the serializer writes into that slice, and finally the buffer slice is written to the network (await SendAsync()).
    By directly accessing and writing to the root buffer, not only can allocations be eliminated, but also copies between buffers.
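A minimal sketch of the GetSpan/Advance protocol, shown against the built-in ArrayBufferWriter<byte> (BufferWriterDemo is an illustrative name):

```csharp
using System;
using System.Buffers;

static class BufferWriterDemo
{
    // The serializer-side protocol: request a span from the destination's
    // root buffer, write into it, then declare how many bytes were used.
    public static void WriteInt32(IBufferWriter<byte> writer, int value)
    {
        Span<byte> span = writer.GetSpan(4); // at least 4 bytes
        BitConverter.TryWriteBytes(span, value);
        writer.Advance(4);
    }
}
```

Any IBufferWriter<byte> works as the destination (a pipe, a socket sender, a pooled buffer), so the serializer never allocates an intermediate byte[].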


  35. MemoryPackSerializer.Serialize
    public static partial class MemoryPackSerializer
    {
        public static void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
            where TBufferWriter : IBufferWriter<byte>
        public static byte[] Serialize<T>(in T? value)
        public static ValueTask SerializeAsync<T>(Stream stream, T? value)
    }
    The IBufferWriter overload is the most fundamental and can provide the best performance.


  36. Example: Flow of MemoryPack's Serialize
    This MemoryPackWriter is important!


  37. MemoryPackWriter
    Buffer management for writing, or caching of the IBufferWriter's buffer.
    public ref partial struct MemoryPackWriter<TBufferWriter>
        where TBufferWriter : IBufferWriter<byte>
    {
        ref TBufferWriter bufferWriter; // take TBufferWriter in the ctor
        ref byte bufferReference;
        int bufferLength;
        ref byte GetSpanReference(int sizeHint);
        void Advance(int count);
        public MemoryPackWriter(ref TBufferWriter writer)
    }
    public interface System.Buffers.IBufferWriter<T>
    {
        Span<T> GetSpan(int sizeHint = 0);
        void Advance(int count);
    }
    For example, when writing an int: request the maximum required buffer, write, then declare the amount written.
    public void WriteUnmanaged<T1>(scoped in T1 value1)
        where T1 : unmanaged
    {
        var size = Unsafe.SizeOf<T1>();
        ref var spanRef = ref GetSpanReference(size); // request the maximum required buffer
        Unsafe.WriteUnaligned(ref spanRef, value1);
        Advance(size); // declare the amount written
    }


  38. MemoryPackWriter
    Frequent calls to GetSpan/Advance on IBufferWriter<byte> are slow, so reserve plenty of space within MemoryPackWriter to reduce the number of calls to the BufferWriter.
    NOTE: When implementing IBufferWriter<byte>, the buffer returned by GetSpan should not be trimmed down to sizeHint; return the actual buffer you hold internally. Trimming forces frequent calls to GetSpan, which can lead to performance degradation.


  39. Optimize the Write
    Reduce the number of method calls: the fewer, the better. If fixed-size members are consecutive, consolidate the calls to reduce the number of calls to GetSpanReference/Advance.


  40. Complete Serialize
    var writer = new MemoryPackWriter<TBufferWriter>(ref bufferWriter);
    writer.WriteValue(value);
    writer.Flush();
    When you Flush (calling the original IBufferWriter's Advance and synchronously confirming the actually written area), the serialization process is complete.


  41. Other overloads
    The byte[] and Stream overloads internally pass a pooled ReusableLinkedArrayBufferWriter through Serialize:
    var bufferWriter = ReusableLinkedArrayBufferWriterPool.Rent();
    var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufferWriter);
    writer.WriteValue(value);
    writer.Flush();
    await bufferWriter.WriteToAndResetAsync(stream);  // Stream overload
    return bufferWriter.ToArrayAndReset();            // byte[] overload


  42. ReusableLinkedArrayBufferWriter
    public sealed class ReusableLinkedArrayBufferWriter : IBufferWriter<byte>
    {
        List<BufferSegment> buffers;
    }
    struct BufferSegment
    {
        byte[] buffer; // rented via ArrayPool<byte>.Shared.Rent
        int written;
    }
    GetSpan() returns space in the current chunk; chunks are byte[]s rented from ArrayPool<byte>.Shared.
    If you only want the final concatenated array (or are writing to a Stream), the internal buffer can be represented as linked chunks rather than a List<T>-style grow-and-copy, since it does not have to be one contiguous block of memory. This reduces the number of copies.
    NOTE: When the buffer runs out, don't link fixed-size chunks just because they are linked (or out of LOH worries); rent chunks of double the previous size. Otherwise, for large write results the number of linked-list elements grows too large and performance deteriorates.
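A rough sketch of the doubling strategy from the NOTE above (GrowingChunkList is a hypothetical illustration, not MemoryPack's actual type):

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

sealed class GrowingChunkList
{
    readonly List<byte[]> chunks = new();
    int nextSize = 4096;

    // Rent the next chunk from the shared pool; double the requested
    // size each time so the chunk count stays logarithmic in total size.
    public byte[] AddChunk()
    {
        var chunk = ArrayPool<byte>.Shared.Rent(nextSize);
        chunks.Add(chunk);
        nextSize *= 2;
        return chunk;
    }

    // Return all rented chunks so the writer can be reused.
    public void ReturnAll()
    {
        foreach (var c in chunks) ArrayPool<byte>.Shared.Return(c);
        chunks.Clear();
        nextSize = 4096;
    }
}
```

Note that ArrayPool may hand back arrays larger than requested, which is fine here: the written count per segment tracks actual usage.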


  43. ToArray / WriteTo
    Since the final size is known, allocate only the final result and copy into it, or write the chunks to a Stream:
    var result = new byte[a.Length + b.Length + c.Length];
    a.CopyTo(result); b.CopyTo(result); c.CopyTo(result);
    // or
    await stream.WriteAsync(a);
    await stream.WriteAsync(b);
    await stream.WriteAsync(c);
    The finished working arrays are no longer needed, so they are returned to the pool (ArrayPool<byte>.Shared.Return).


  44. Improve LINQ ToArray
    Enumerable.ToArray converts an IEnumerable<T> with an indefinite number of elements into a T[]. Conventionally, when the internal T[] overflowed it was expanded, but couldn't the T[] be built from concatenated chunks, in the same way as above?
    I submitted a PR to dotnet/runtime: https://github.com/dotnet/runtime/pull/90459
    A dramatic 30~60% performance improvement; might be included in .NET 9?
    NOTE: LINQ's ToArray has already been optimized in various ways, estimating the number of elements where possible and, when it can be estimated, allocating an array of the exact size up front. The size estimation is not as simple as an ICollection<T> check; it has more complex branches depending on the method chain, e.g. the size is known for Enumerable.Range, can be bounded for Take, and so on.


  45. Avoid aggressive use of Pools
    with InlineArray (C# 12)
    Since this is to be incorporated into the runtime, extensive use of Pools was avoided. Instead of a reusable linked array, InlineArray from C# 12 was adopted.
    [InlineArray(29)]
    struct ArrayBlock<T>
    {
        private T[] array;
    }
    Roughly speaking, this enables a stack-allocated fixed array of arrays (in other words, T[][]).
    Because a List<T[]> (or something like it) would cause extra allocations, the proposal would have been harder to make. Allocating the T[][] in the stack area eliminated the need to allocate the linked list itself.
    However, InlineArray only allows a fixed size specified at compile time. Therefore, '29' was adopted as the size ....


  46. 29
    Starting from 4 and repeatedly doubling the chunk size, 29 chunks are enough to reach the maximum value (.NET's array size limit is just a little less than int.MaxValue: 2147483591).
    Since ToArray of an IEnumerable<T> always adds one element at a time, each array is guaranteed to be completely filled before the next array is linked, with no gaps. Therefore, it is absolutely impossible for InlineArray(29) to run out of space.
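The arithmetic behind '29' can be checked directly (ChunkMath is an illustrative helper, not runtime code):

```csharp
using System;

static class ChunkMath
{
    // Chunk sizes start at 4 and double: 4, 8, 16, ...
    // Count how many chunks are needed before their total capacity
    // reaches .NET's maximum array length (2147483591).
    public static int ChunksNeeded()
    {
        long capacity = 0;
        long chunk = 4;
        int chunks = 0;
        while (capacity < 2147483591L)
        {
            capacity += chunk;
            chunk *= 2;
            chunks++;
        }
        return chunks;
    }
}
```

After 28 chunks the capacity is 4 * (2^28 - 1) = 1,073,741,820, still short of the limit; the 29th chunk brings it to 2,147,483,644, which covers any possible array.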


  47. I/O Read


  48. No Stream Again
    Performance is determined by synchronous buffers and asynchronous reading and writing. Don't mix I/O and deserialization: calling ReadAsync each time would be too slow. MemoryPackSerializer.Deserialize(Stream) constructs a ReadOnlySequence<byte> first and then feeds it into the deserialization process. We target only synchronous buffers that have already been read.
    public static partial class MemoryPackSerializer
    {
        public static T? Deserialize<T>(ReadOnlySpan<byte> buffer)
        public static int Deserialize<T>(in ReadOnlySequence<byte> buffer, ref T? value)
        public static async ValueTask<T?> DeserializeAsync<T>(Stream stream)
    }
    NOTE: Not mixing I/O and deserialization means that true streaming deserialization with undefined length or minimal buffering is not possible. Instead, MemoryPack provides, as a supplementary mechanism, a deserialization API that buffers in window widths and returns an IAsyncEnumerable<T>.


  49. ReadOnlySequence<T>
    Like a concatenated T[]. By entrusting buffer handling to it in combination with System.IO.Pipelines, it can be treated like a connected T[] that can be sliced at any position.
    ReadOnlySequence<T> is not always fast, however, so it is necessary to find ways to reduce the number of Slice calls.
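A minimal sketch of building a multi-segment ReadOnlySequence<byte> by hand, the way pipelines-style readers do internally (the Segment class is illustrative):

```csharp
using System;
using System.Buffers;

// Each Segment links to the next; RunningIndex is the absolute offset
// of this segment within the whole sequence.
sealed class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(byte[] buffer) => Memory = buffer;

    public Segment Append(byte[] buffer)
    {
        var next = new Segment(buffer) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```

Two separate byte[] chunks then behave as one logical buffer that can be sliced at any position, without ever being copied into a contiguous array.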


  50. Flow of Deserialize
    This MemoryPackReader is important!


  51. MemoryPackReader
    Buffer management for reading; set ReadOnlySequence<byte> as the source.
    public ref partial struct MemoryPackReader
    {
        ReadOnlySequence<byte> bufferSource;
        ref byte bufferReference;
        int bufferLength;
        ref byte GetSpanReference(int sizeHint);
        void Advance(int count);
        public MemoryPackReader(in ReadOnlySequence<byte> source)
        public MemoryPackReader(ReadOnlySpan<byte> buffer)
    }
    public readonly struct ReadOnlySequence<T>
    {
        ReadOnlySpan<T> FirstSpan { get; }
        ReadOnlySequence<T> Slice(long start);
    }
    Similar to the MemoryPackWriter: for example, when reading data like an int, request the maximum required buffer with GetSpanReference, read, then report the amount read with Advance.
    public void ReadUnmanaged<T1>(out T1 value1)
        where T1 : unmanaged
    {
        var size = Unsafe.SizeOf<T1>();
        ref var spanRef = ref GetSpanReference(size);
        value1 = Unsafe.ReadUnaligned<T1>(ref spanRef);
        Advance(size);
    }


  52. MemoryPackReader
    Because frequent calls to Slice on ReadOnlySequence<byte> are slow, secure the entire current block as FirstSpan inside MemoryPackReader and suppress the number of calls into the ReadOnlySequence.
    NOTE: Naturally, a read request can exceed the FirstSpan. Since MemoryPack's deserialization requires a contiguous memory area, in such cases the actual MemoryPack copies to a temporary area borrowed from the pool and assigns it to the ref byte bufferReference.


  53. Reader I/O in Application


  54. Efficient Read is challenging
    while (true)
    {
        var read = await socket.ReceiveAsync(buffer);
        var span = buffer.AsSpan(0, read);
        // ...
    }
    Handling incomplete reads: the amount read here may not fill one message block, and it is not always permitted to read to the end of the stream. If you call ReceiveAsync again and pack more data into the buffer, what happens when it exceeds the buffer? If you keep resizing, it grows without bound; can you guarantee that a moment will come when you can reset it to 0?


  55. A reader that returns ReadOnlySequence
    Concatenating incomplete blocks
    If the size of one message is known (i.e., the Length is
    written in the header as a protocol), it can be converted to
    a command to read at least a certain size (ReadAtLeast).
async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        // do anything
    }
}
If the data comes as a ReadOnlySequence<byte>, you can feed it
into anything that supports it. For example, most modern
serializers support ReadOnlySequence<byte>.
NOTE: Serializers that do not support ReadOnlySequence<byte> are considered legacy
and should be discarded. Of course, MessagePack for C# and MemoryPack support it.
NOTE: System.IO.Pipelines is what takes care of the related plumbing.
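The NOTE above can be sketched directly with System.IO.Pipelines; here a Pipe stands in for the socket (this is only the reading half, under the assumption of .NET 7+ where PipeReader.ReadAtLeastAsync is available):

```csharp
using System;
using System.Buffers;
using System.IO.Pipelines;

// A Pipe stands in for the socket. PipeReader.ReadAtLeastAsync (.NET 7+)
// hands back a ReadOnlySequence<byte> containing at least the requested bytes.
var pipe = new Pipe();
await pipe.Writer.WriteAsync(new byte[] { 1, 2, 3, 4, 5 });

ReadResult result = await pipe.Reader.ReadAtLeastAsync(4);
ReadOnlySequence<byte> buffer = result.Buffer;
Console.WriteLine(buffer.Length); // at least 4; here 5

// Report what was consumed/examined so the next read continues correctly.
pipe.Reader.AdvanceTo(buffer.End);
```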


56. Assuming a protocol where the message type comes
first and dispatch is done based on it: how do you
determine the message type when it is a string? (Text protocols; for
example, Redis and NATS adopt text protocols.)
Determining the type
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    if (code == ServerOpCodes.Msg)
    {
    //…
    }
    }
    }
You can determine it simply by converting to a
string, and converting to an enum makes it easy to use
later. This example is from NATS: by cleverly
padding every opcode to 4 characters, including symbols and
spaces, the protocol ensures it can be determined by
ReadAtLeastAsync(4).
ServerOpCodes GetCode(ReadOnlySequence<byte> buffer)
{
    var span = GetSpan(buffer);
    var str = Encoding.UTF8.GetString(span);
    return str switch
    {
        "INFO" => ServerOpCodes.Info,
        "MSG " => ServerOpCodes.Msg,
        "PING" => ServerOpCodes.Ping,
        "PONG" => ServerOpCodes.Pong,
        "+OK\r" => ServerOpCodes.Ok,
        "-ERR" => ServerOpCodes.Error,
        _ => throw new InvalidOperationException()
    };
}



But turning it into a string (stringization)
incurs a memory allocation on every message.
You should absolutely avoid
this!!!


59. Take 2: Compare ReadOnlySpan<byte>
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    if (code == ServerOpCodes.Msg)
    {
    //…
    }
    }
    }
With C# 11's UTF-8 string literals (u8), you
can get a ReadOnlySpan<byte> as a
constant. If you move the opcodes that match
most frequently to the top
of the if chain, the cost of the
checks is reduced as well. Also,
SequenceEqual on
ReadOnlySpan<byte> (unlike the
LINQ one) compares quite speedily.
ServerOpCodes GetCode(ReadOnlySequence<byte> buffer)
{
    var span = GetSpan(buffer);
    if (span.SequenceEqual("MSG "u8)) return ServerOpCodes.Msg;
    if (span.SequenceEqual("PONG"u8)) return ServerOpCodes.Pong;
    if (span.SequenceEqual("INFO"u8)) return ServerOpCodes.Info;
    if (span.SequenceEqual("PING"u8)) return ServerOpCodes.Ping;
    if (span.SequenceEqual("+OK\r"u8)) return ServerOpCodes.Ok;
    if (span.SequenceEqual("-ERR"u8)) return ServerOpCodes.Error;
    throw new InvalidOperationException();
}



61. Convert the first 4 chars to an int
// msg = ReadOnlySpan<byte>
if (Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(msg)) == 1330007625) // INFO
{
}
internal static class ServerOpCodes
{
    public const int Info = 1330007625; // "INFO"
    public const int Msg = 541545293; // "MSG "
    public const int Ping = 1196312912; // "PING"
    public const int Pong = 1196314448; // "PONG"
    public const int Ok = 223039275; // "+OK\r"
    public const int Error = 1381123373; // "-ERR"
}
If you include the trailing characters
(a space or \r), every NATS OpCode can be
determined with exactly 4 bytes (one int), so you can
precompute a set of int constants in
advance.
Direct int conversion from ReadOnlySpan<byte>
Comparing after stringifying is out of the question,
and since 4 bytes suffice,
comparing as an int is the fastest.
NOTE: Well, the best protocol would be a binary one, where the first byte
represents the type... Text protocols are not good.
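The constants above don't need to be computed by hand; as a sketch, reading the 4 opcode bytes as a little-endian int reproduces them exactly:

```csharp
using System;
using System.Buffers.Binary;

// Read the 4 UTF-8 opcode bytes as one little-endian int. The results match
// the precomputed ServerOpCodes constants on the slide.
int info = BinaryPrimitives.ReadInt32LittleEndian("INFO"u8);
int msg = BinaryPrimitives.ReadInt32LittleEndian("MSG "u8);
Console.WriteLine(info); // 1330007625
Console.WriteLine(msg);  // 541545293
```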


  62. async/await and inlining
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    await DispatchCommandAsync(code, buffer);
    }
    }
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
    {
    }
The part where data is read from the socket (the
actual code is a bit more complex, so we want
to separate it from the processing part).
In this method, the message is parsed in detail and
the actual processing is done (deserializing the
payload, invoking callbacks, and so on).



  64. asynchronous state machine generation
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    await DispatchCommandAsync(code, buffer);
    }
    }
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
    {
    }
Awaiting inside the loop does not create a new asynchronous state
machine per iteration, so you can await there as much as you want.
If a method declared with async actually completes asynchronously, an
asynchronous state machine is allocated for each call, so there is
extra allocation.
If the awaited method is an async method backed by
IValueTaskSource, it can be designed so that no
asynchronous state machine is allocated even when you
await it directly.


  65. Inlining await in the hot path
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    if (code == ServerOpCodes.Msg)
    {
    await DoAnything();
    await DoAnything();
    }
    else
    {
    await DispatchCommandAsync(code, buffer);
    }
    }
    }
    [AsyncMethodBuilderAttribute(typeof(PoolingAsyncValueTaskMethodBuilder))]
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
    { }
Since 90% of loop iterations receive Msg (the rest, such as PING or
ERROR, arrive only rarely), only Msg is inlined to aim for
maximum efficiency.
The other cases are split into a separate method, but marking it with
PoolingAsyncValueTaskMethodBuilder, available from .NET 6, makes the
asynchronous state machine pooled and reusable.
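A minimal sketch of opting a method into the pooled builder (the Dispatcher class and its body are stand-ins; only the attribute placement matters):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

var d = new Dispatcher();
await d.DispatchAsync();
Console.WriteLine(d.Count); // 1

class Dispatcher
{
    public int Count;

    // .NET 6+: the boxed state machine is rented from / returned to a pool
    // instead of freshly allocated on every asynchronous completion.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
    public async ValueTask DispatchAsync()
    {
        await Task.Yield();
        Count++;
    }
}
```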


  66. Optimize for All Types


  67. Source Generator based
    Automatically generate at compile time the
    Serialize and Deserialize code optimized for
    each [MemoryPackable] type.
static abstract members (from C# 11)


  68. IL.Emit vs SourceGenerator
IL.Emit
Dynamic assembly generation using type information at runtime
IL black magic that has been available since the early days of .NET
Not usable in environments where dynamic code generation is not allowed (iOS, WASM,
NativeAOT, etc.)
SourceGenerator
Generates C# code from the AST at compile time
Came into extensive use around .NET 6
Since it emits pure C# code, it can be used in all environments.
Given the diversification of environments where .NET runs, and since there is no startup-speed penalty, it is desirable to
move toward Source Generators as much as possible.
Not being able to use runtime information can make it difficult to generate equivalent code, especially around generics,
but let's overcome that with some ingenuity...


  69. Optimize for All Types
For example, collections could mostly be handled with
just one formatter for IEnumerable<T>, but by writing the
optimal implementation for each concrete collection, you can run the
highest-performance code. The interface-based implementations are only
required when asked to process unknown types.


  70. Fast Enumeration of Array
Normally, bounds checks are inserted when accessing
elements of a C# array. However, the JIT compiler
removes the bounds check when it can prove
that an access never goes out of bounds (for example,
a for loop over .Length).
foreach over an array (or Span<T>) is converted at compile
time to the same IL as a for loop, so the two are
completely identical.
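A sketch of the loop shape the JIT recognizes (the elision itself happens in the generated machine code; the snippet just shows the pattern):

```csharp
using System;

// The i < xs.Length loop condition lets the JIT prove every xs[i] access is
// in range, so the per-element bounds check disappears from the compiled loop.
static int Sum(int[] xs)
{
    var sum = 0;
    for (int i = 0; i < xs.Length; i++)
        sum += xs[i];
    return sum;
}

Console.WriteLine(Sum(new[] { 1, 2, 3, 4 })); // 10
```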


71. Optimize for List<T> / Serialize
public sealed class ListFormatter<T> : MemoryPackFormatter<List<T?>>
{
    public override void Serialize(
        ref MemoryPackWriter writer, scoped ref List<T?>? value)
    {
        if (value == null)
        {
            writer.WriteNullCollectionHeader();
            return;
        }
        var span = CollectionsMarshal.AsSpan(value);
        var formatter = GetFormatter<T?>();
        writer.WriteCollectionHeader(span.Length);
        for (int i = 0; i < span.Length; i++)
        {
            formatter.Serialize(ref writer, ref span[i]);
        }
    }
}
Fastest List<T> iteration
CollectionsMarshal.AsSpan
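A small sketch of the technique (note: the list must not be mutated while the span is alive, since AsSpan bypasses the list's safety checks):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Iterate the List<T>'s backing array directly as a Span<T>,
// skipping the indexer's bounds/version checks.
var list = new List<int> { 1, 2, 3, 4 };
var span = CollectionsMarshal.AsSpan(list);
var sum = 0;
for (int i = 0; i < span.Length; i++)
    sum += span[i];
Console.WriteLine(sum); // 10
```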


72. public override void Deserialize(ref MemoryPackReader reader, scoped ref List<T?>? value)
{
    if (!reader.TryReadCollectionHeader(out var length))
    {
        value = null;
        return;
    }
    value = new List<T?>(length);
    CollectionsMarshal.SetCount(value, length);
    var span = CollectionsMarshal.AsSpan(value);
    var formatter = GetFormatter<T?>();
    for (int i = 0; i < length; i++)
    {
        formatter.Deserialize(ref reader, ref span[i]);
    }
}
Optimize for List<T> / Deserialize
Adding to a List<T> one element at a time is slow. By making it
handleable as a Span<T>, List<T> deserialization becomes
as fast as deserializing an array.
Just calling new List<T>(capacity) leaves the internal size at
0, so CollectionsMarshal.AsSpan would only return a
Span of length 0, which is useless.
By forcibly setting the internal size with
CollectionsMarshal.SetCount, which was added
in .NET 8, you can skip Add and extract the Span.
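A minimal sketch of the SetCount + AsSpan combination (requires .NET 8):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// new List<int>(4) has Count == 0; SetCount forces Count to 4 so that
// AsSpan exposes all four slots, which we then fill without any Add calls.
var list = new List<int>(capacity: 4);
CollectionsMarshal.SetCount(list, 4);
var span = CollectionsMarshal.AsSpan(list);
for (int i = 0; i < span.Length; i++)
    span[i] = i * 10;
Console.WriteLine(string.Join(",", list)); // 0,10,20,30
```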


73. public override void Deserialize(ref MemoryPackReader reader, scoped ref List<T?>? value)
{
    if (!reader.TryReadCollectionHeader(out var length))
    {
        value = null; return;
    }
    value = new List<T?>(length);
    CollectionsMarshal.SetCount(value, length);
    var span = CollectionsMarshal.AsSpan(value);
    if (!RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        var byteCount = length * Unsafe.SizeOf<T>();
        ref var src = ref reader.GetSpanReference(byteCount);
        ref var dest = ref Unsafe.As<T?, byte>(ref MemoryMarshal.GetReference(span)!);
        Unsafe.CopyBlockUnaligned(ref dest, ref src, (uint)byteCount);
        reader.Advance(byteCount);
    }
    else
    {
        var formatter = GetFormatter<T?>();
        for (int i = 0; i < length; i++)
        {
            formatter.Deserialize(ref reader, ref span[i]);
        }
    }
}
Actual code of ListFormatter
MemoryPack's binary specification can handle an
unmanaged-type T[] with nothing but a memory copy.
By extracting the Span, even List<T> can be
deserialized with a memory copy.
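The unmanaged fast path above boils down to a single block copy; a sketch with plain spans (little-endian assumed):

```csharp
using System;
using System.Runtime.InteropServices;

// Eight raw bytes become two ints in one copy - no per-element deserialization.
ReadOnlySpan<byte> src = stackalloc byte[] { 1, 0, 0, 0, 2, 0, 0, 0 };
Span<int> dest = stackalloc int[2];
src.CopyTo(MemoryMarshal.AsBytes(dest));
Console.WriteLine($"{dest[0]},{dest[1]}"); // 1,2 on little-endian
```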


  74. String / UTF8
    SIMD
    FFI(DllImport/LibraryImport)
    Channel
    More async/await


  75. Conclusion


  76. Expanding the possibilities of C#
Languages evolve, and techniques evolve
C# has great potential and continues to stand at the forefront of
the competition among programming languages
Building a strong ecosystem is crucial
In the modern era where open source software (OSS) is central, the
vitality of the ecosystem is decisive.
The evolution of the language/runtime and of OSS are the two wheels that drive progress
forward.
It's no longer the era where we can simply rely on Microsoft or Unity.

