Slide 1

Slide 1 text

Modern High Performance C# 2023 Edition CEDEC 2023 (the largest conference for game developers in Japan) 2023-08-23 Yoshifumi Kawai / Cysharp, Inc. Translated from JP to EN with ChatGPT (GPT-4)

Slide 2

Slide 2 text

About Speaker Kawai Yoshifumi / @neuecc Cysharp, Inc. - CEO/CTO Established in September 2018 as a subsidiary of Cygames, Inc. Engages in research and development, open-source software (OSS), and consulting related to C#. Microsoft MVP for Developer Technologies (C#) since 2011. Received the CEDEC AWARDS 2022 in Engineering. Developer of over 50 OSS libraries (UniRx, UniTask, MessagePack C#, etc.). Achieved one of the highest numbers of GitHub Stars worldwide in the field of C#.

Slide 3

Slide 3 text

OSS for high performance MessagePack ★4836 Extremely Fast MessagePack Serializer for C#(.NET, Unity). MagicOnion ★3308 Unified Realtime/API framework for .NET platform and Unity. UniTask ★5901 Provides an efficient allocation free async/await integration for Unity. ZString ★1524 Zero Allocation StringBuilder for .NET and Unity. MemoryPack ★2007 Zero encoding extreme performance binary serializer for C# and Unity. AlterNats ★271 An alternative high performance NATS client for .NET. MessagePipe ★1062 High performance in-memory/distributed messaging pipeline for .NET and Unity. Ulid ★629 Fast .NET C# Implementation of ULID for .NET and Unity.

Slide 4

Slide 4 text

OSS for high performance I have released libraries that pursue unmatched high performance across various genres. In this session, drawing from those experiences, I will introduce techniques for achieving peak performance in modern C#.

Slide 5

Slide 5 text

Current State of C#

Slide 6

Slide 6 text

Continuously Evolving
2002 C# 1.0 (Java/Delphi lineage)
2005 C# 2.0: Generics
2008 C# 3.0: LINQ
2010 C# 4.0: Dynamic
2012 C# 5.0: async/await
2015 C# 6.0: Roslyn (self-hosting C# compiler)
2017 C# 7.0: Tuple, Span
2019 C# 8.0: null safety, async streams
2020 C# 9.0: record class, code generator
2021 C# 10.0: global using, record struct
2022 C# 11.0: ref field, static abstract members

Slide 7

Slide 7 text

The language underwent periodic major updates during the Anders Hejlsberg era and now adds incremental features every year. While bold new features have become less common, the language has been steadily improving, and many of the additions have a positive impact on performance.

Slide 8

Slide 8 text

.NET Framework -> Cross platform
2002 .NET Framework 1.0 (the era when it was focused only on Windows)
2005 .NET Framework 2.0
2008 .NET Framework 3.5
2010 .NET Framework 4
2012 .NET Framework 4.5
2016 .NET Core 1.0 (the commencement of full-fledged support for Linux)
2017 .NET Core 2.0
2019 .NET Core 3.0
2020 .NET 5 (integration of multiple runtimes: .NET Framework, Core, Mono, Xamarin)

Slide 9

Slide 9 text

Linux and .NET High performance as a server-side programming language. It's now commonplace to choose Linux for server-side deployment as it's quite practical. Performance is also proven through benchmark results (Plaintext ranked 1st, C#, .NET, Linux). The performance is competitive even when compared to C++ or Rust. While these benchmarks are not necessarily practical, they do serve as evidence that there's a sufficient baseline of potential for the language.

Slide 10

Slide 10 text

gRPC and Performance gRPC does not necessarily equal high speed. Performance varies depending on the implementation, and unoptimized implementations can be subpar. C# delivers performance on the same level as top-tier languages like Rust, Go, and C++, demonstrating its capabilities. [Chart: gRPC implementation performance (2 CPUs), Requests/sec (higher is better): dotnet_grpc, rust_thruster_mt, cpp_grpc_mt, scala_akka, rust_tonic_mt, go_vtgrpc, java_quarkus, swift_grpc, node_grpcjs_st, python_async_grpc, ruby_grpc, erlang_grpcbox, php_grpc] https://github.com/LesnyRumcajs/grpc_bench/discussions/354

Slide 11

Slide 11 text

Memory

Slide 12

Slide 12 text

MessagePack for C# #1 Binary Serializer in .NET https://github.com/MessagePack-CSharp/MessagePack-CSharp The most supported binary serializer in .NET (4836 stars). Even if you've never used it directly, you've likely used it indirectly: Visual Studio 2022 internals, the SignalR MessagePack Hub, and the Blazor Server protocol (BlazorPack). Released 2017-03-13 with overwhelming speed compared to its competitors at the time.

Slide 13

Slide 13 text

Thinking about the fastest serializer
For example, serializing value = int(999). Ideal fastest code:
Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(dest), value); // dest: Span<byte>
which compiles down to the IL:
ldarg.0
ldarg.1
unaligned. 0x01
stobj !!T
ret
In essence, a memory copy.

Slide 14

Slide 14 text

Starting from C# 7.2, there's a need to actively utilize Span<T> for handling contiguous memory regions. The Unsafe class provides primitive operations that can be written in IL (Intermediate Language) but not in plain C#. These features remove C#'s language constraints, making it easier to control raw behavior. NOTE: While you can write something close to Span<T> using pointers, Span<T> allows more natural handling in C#, which is why it now appears frequently not just within methods but also in the signatures of public APIs. Thanks to this, the pattern of carrying "raw" operations across methods, classes, and assemblies has been established, contributing to the overall performance improvement of C# in recent years.
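As a minimal, self-contained sketch of the "ideal fastest code" from the slide, the following writes an int into a Span<byte> with Unsafe.WriteUnaligned and reads it back the same way:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class Program
{
    // The slide's ideal write: copy the raw bytes of an int directly into
    // the destination span. No encoding, no bounds logic beyond the span itself.
    static void WriteInt32(Span<byte> dest, int value)
    {
        Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(dest), value);
    }

    static void Main()
    {
        Span<byte> dest = stackalloc byte[4];
        WriteInt32(dest, 999);
        // Read back via the symmetric primitive to confirm the round trip.
        int roundTripped = Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(dest));
        Console.WriteLine(roundTripped); // prints 999
    }
}
```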

Slide 15

Slide 15 text

MessagePack / JSON: in the case of existing serializers.
MessagePack: following the MessagePack specification, write the type identifier at the beginning and the value in BigEndian format.
// uint16 msgpack code
Unsafe.WriteUnaligned(ref dest[0], (byte)0xcd);
// Write value as BigEndian
var temp = BinaryPrimitives.ReverseEndianness((ushort)value);
Unsafe.WriteUnaligned(ref dest[1], temp);
JSON: performance is improved by reading and writing directly as UTF8 binary rather than as a string.
Utf8Formatter.TryFormat(value, dest, out var bytesWritten);
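A runnable sketch of the slide's MessagePack write path for 999 (which is 0x03E7), using BinaryPrimitives.WriteUInt16BigEndian as an equivalent to the ReverseEndianness-then-write pair:

```csharp
using System;
using System.Buffers.Binary;

class Program
{
    static void Main()
    {
        // MessagePack uint16: one byte of type code (0xcd), then the value
        // in big-endian order, per the MessagePack specification.
        Span<byte> dest = stackalloc byte[3];
        dest[0] = 0xcd; // uint16 msgpack code
        BinaryPrimitives.WriteUInt16BigEndian(dest.Slice(1), (ushort)999);
        Console.WriteLine(Convert.ToHexString(dest)); // CD03E7
    }
}
```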

Slide 16

Slide 16 text

MessagePack for C# is indeed fast. However, due to the binary specification of MessagePack itself, no matter what you do, it will be slower than the "ideal fastest code."

Slide 17

Slide 17 text

MemoryPack Zero encoding extreme fast binary serializer https://github.com/Cysharp/MemoryPack/ Released in 2022-09 with the aim of being the ultimate high-speed serializer. Compared to JSON, it offers performance that is several times, and in optimal cases, hundreds of times faster. It is overwhelmingly superior even when compared to MessagePack for C#. Optimized binary specifications for C# Utilizes the latest design, fully leveraging C# 11

Slide 18

Slide 18 text

Zero encoding: memory copying to the fullest extent possible.
public struct Point3D { public int X; public int Y; public int Z; }
new Point3D { X = 1, Y = 2, Z = 3 }
In C#, it is guaranteed that the values of a struct that does not include reference types (i.e., not IsReferenceOrContainsReferences) are laid out contiguously in memory.
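The contiguous layout can be observed directly: a span of reference-free structs can be reinterpreted as raw bytes with MemoryMarshal.AsBytes, which is exactly what makes the one-copy serialization possible. A small sketch:

```csharp
using System;
using System.Runtime.InteropServices;

struct Point3D { public int X; public int Y; public int Z; }

class Program
{
    static void Main()
    {
        // Point3D contains no references, so its fields occupy one
        // contiguous 12-byte block (three 4-byte ints, no padding needed).
        var points = new Point3D[] { new Point3D { X = 1, Y = 2, Z = 3 } };
        Span<byte> raw = MemoryMarshal.AsBytes(points.AsSpan());
        Console.WriteLine(raw.Length); // 12
    }
}
```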

Slide 19

Slide 19 text

Sequentially arranged compact specification
[MemoryPackable]
public partial class Person
{
    public long Id { get; set; }
    public int Age { get; set; }
    public string? Name { get; set; }
}
In the case of reference types, their members are written sequentially. The specification was carefully crafted to balance performance and versioning resilience within a simple structure.

Slide 20

Slide 20 text

T[] where T : unmanaged
In C#, arrays whose elements are of unmanaged type (structs containing no references) are arranged sequentially.
new int[] { 1, 2, 3, 4, 5 }
var srcLength = Unsafe.SizeOf<T>() * value.Length;
var allocSize = srcLength + 4;
ref var dest = ref GetSpanReference(allocSize);
ref var src = ref Unsafe.As<T, byte>(ref MemoryMarshal.GetArrayDataReference(value));
Unsafe.WriteUnaligned(ref dest, value.Length);
Unsafe.CopyBlockUnaligned(ref Unsafe.Add(ref dest, 4), ref src, (uint)srcLength);
Advance(allocSize);
Serialize == Memory Copy
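A self-contained version of the array write above (SerializeUnmanagedArray is an illustrative name; the buffer-writer plumbing is replaced with a plain byte[] result): a 4-byte length prefix followed by one bulk copy of all elements.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class Program
{
    // Simplified sketch of the slide's array serialization:
    // write the length, then memcpy the whole element block at once.
    static byte[] SerializeUnmanagedArray<T>(T[] value) where T : unmanaged
    {
        var srcLength = Unsafe.SizeOf<T>() * value.Length;
        var result = new byte[srcLength + 4];
        ref var dest = ref MemoryMarshal.GetArrayDataReference(result);
        ref var src = ref Unsafe.As<T, byte>(ref MemoryMarshal.GetArrayDataReference(value));
        Unsafe.WriteUnaligned(ref dest, value.Length);
        Unsafe.CopyBlockUnaligned(ref Unsafe.Add(ref dest, 4), ref src, (uint)srcLength);
        return result;
    }

    static void Main()
    {
        var bytes = SerializeUnmanagedArray(new[] { 1, 2, 3, 4, 5 });
        Console.WriteLine(bytes.Length); // 4 (length prefix) + 5 * 4 = 24
    }
}
```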

Slide 21

Slide 21 text

T[] where T : unmanaged Increasingly advantageous for complex types like Vector3[] Vector3(float x, float y, float z)[10000] Conventional serializers perform Write/Read operations for each field, so for 10,000 items, you would need to perform 10,000 x 3 operations. MemoryPack requires only one copy operation. It's only natural that it would be 200 times faster, then!

Slide 22

Slide 22 text

I/O Write

Slide 23

Slide 23 text

Three Tenets of I/O Application Speedup:
1. Minimize allocations
2. Reduce copies
3. Prioritize asynchronous I/O
// Bad example
byte[] result = Serialize(value);  // frequent byte[] allocation
response.Write(result);            // writing likely incurs a copy; synchronous write

Slide 24

Slide 24 text

Isn't I/O All About Streams? async Task WriteToStreamAsync(Stream stream) { // Queue while (messages.TryDequeue(out var message)) { await stream.WriteAsync(message.Encode()); } } In applications involving I/O, the ultimate output destinations are usually either the network (Socket/NetworkStream) or a file (FileStream). "Were you able to implement the three points with this...?"


Slide 26

Slide 26 text

Why Streams Are Bad: Reason 1 async Task WriteToStreamAsync(Stream stream) { // Queue while (messages.TryDequeue(out var message)) { await stream.WriteAsync(message.Encode()); } } Frequent fine-grained I/O operations are slow, even if they are asynchronous! async/await is not a panacea!

Slide 27

Slide 27 text

Stream is beautiful……? async Task WriteToStreamAsync(Stream stream) { // So, it would be beneficial to add a buffer using BufferedStream, right? using (var buffer = new BufferedStream(stream)) { while (messages.TryDequeue(out var message)) { await buffer.WriteAsync(message.Encode()); } } } The exceptional abstraction in terms of the 'functional aspect' of Streams allows for the addition of features freely through the decorator pattern. For instance, by encapsulating a GZipStream, compression can be added, or by encapsulating a CryptoStream, encryption can be added. In this case, since we want to add a buffer, we will encapsulate it in a BufferedStream. This way, even with WriteAsync, it won't immediately write to I/O.


Slide 29

Slide 29 text

Why Streams Are Bad: Reason 2 async Task WriteToStreamAsync(Stream stream) { // So, it would be beneficial to add a buffer using BufferedStream, right? using (var buffer = new BufferedStream(stream)) { while (messages.TryDequeue(out var message)) { await buffer.WriteAsync(message.Encode()); } } } If the Stream is already Buffered, unnecessary allocation would contradict the intent of "reducing allocation." Being buffered means that in most cases (provided the buffer is not overflowing) calls can be synchronous. If a call is synchronous, making an asynchronous call is wasteful.

Slide 30

Slide 30 text

Why Streams Are Bad: Reason 2
public Task WriteAsync(byte[] buffer, int offset, int count);
public ValueTask WriteAsync(ReadOnlyMemory<byte> buffer);
Due to historical circumstances, Streams (and Sockets) have APIs that return a Task and others that return a ValueTask, with similar parameters and the same name. If you use an API that returns a Task, you might inadvertently generate unnecessary Task allocations, so always use the ValueTask overload. Fortunately, in the case of BufferedStream, no allocation occurs because it returns Task.CompletedTask for synchronous completions. However, there is still a cost to calling await. Regardless of whether it is a ValueTask, waste is waste.

Slide 31

Slide 31 text

async Task WriteToStreamAsync(Stream stream)
{
    var buffer = ArrayPool<byte>.Shared.Rent(4096);
    try
    {
        var slice = buffer.AsMemory();
        var totalWritten = 0;
        while (messages.TryDequeue(out var message))
        {
            var written = message.EncodeTo(slice.Span);
            totalWritten += written;
            slice = slice.Slice(written);
        }
        await stream.WriteAsync(buffer.AsMemory(0, totalWritten));
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}
Let's reduce asynchronous writing to just once. The original message.Encode() presumably returned a byte[]; switching to EncodeTo with a large rented buffer replaces the BufferedStream with less waste. (For this sample, we assume the buffer will not overflow, so checks and growth are skipped.)

Slide 32

Slide 32 text

Stream is Bad Focus on synchronous buffers and asynchronous reads and writes. The Stream abstraction mixes synchronous and asynchronous behavior (forcing us to always use async calls even when the actual behavior is synchronous, as with BufferedStream). Since the concrete type behind a Stream is unclear, for safety, or as a working area, each Stream often holds its own buffer. (For example, GZipStream allocates 8K just by being constructed, BufferedStream 4K, and MemoryStream also allocates in fine detail.)

Slide 33

Slide 33 text

Stream is Dead Avoiding Stream The time when Stream was the first-class citizen for I/O has passed • RandomAccess for file processing (scatter/gather I/O API) • IBufferWriter<byte> for writing directly into the target's internal buffer • System.IO.Pipelines for buffer and flow control Classes have emerged that handle these processes while avoiding Streams. Avoiding Stream overhead is the first step to high-performance handling. However, since Streams are at the core of .NET, it's impossible to avoid them completely: it's hard to bypass NetworkStream or FileStream entirely, and there are no alternatives to ConsoleStream or SslStream. Try to manage by not touching the streams until the very last read/write.

Slide 34

Slide 34 text

IBufferWriter<byte>: abstracting the synchronous buffer for writing.
public interface IBufferWriter<T>
{
    void Advance(int count);
    Memory<T> GetMemory(int sizeHint = 0);
    Span<T> GetSpan(int sizeHint = 0);
}
void Serialize<T>(IBufferWriter<byte> writer, T value)
Flow: the serializer requests a slice of the network buffer via IBufferWriter<byte>, writes into that slice, and finally the buffer slice is written to the network (await SendAsync()). By directly accessing and writing to the root buffer, not only allocations but also copies between buffers can be eliminated.
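The GetSpan/Advance protocol can be tried with the built-in System.Buffers.ArrayBufferWriter<byte> (WriteHello here is an illustrative name). A minimal sketch:

```csharp
using System;
using System.Buffers;

class Program
{
    // The IBufferWriter<byte> protocol: request a span from the target's
    // internal buffer, write straight into it, then commit with Advance.
    static void WriteHello(IBufferWriter<byte> writer)
    {
        Span<byte> span = writer.GetSpan(5); // at least 5 bytes are returned
        "hello"u8.CopyTo(span);              // write directly, no intermediate array
        writer.Advance(5);                   // declare how much was actually written
    }

    static void Main()
    {
        var writer = new ArrayBufferWriter<byte>(); // built-in IBufferWriter<byte>
        WriteHello(writer);
        Console.WriteLine(writer.WrittenCount); // 5
    }
}
```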

Slide 35

Slide 35 text

MemoryPackSerializer.Serialize
public static partial class MemoryPackSerializer
{
    public static void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
        where TBufferWriter : IBufferWriter<byte>
    public static byte[] Serialize<T>(in T? value)
    public static ValueTask SerializeAsync<T>(Stream stream, T? value)
}
The IBufferWriter<byte> overload is the most fundamental and can provide the best performance.

Slide 36

Slide 36 text

Example: Flow of MemoryPack's Serialize This MemoryPackWriter is important!

Slide 37

Slide 37 text

MemoryPackWriter Buffer management for writing, or caching of IBufferWriter<byte>'s buffer.
public ref partial struct MemoryPackWriter<TBufferWriter>
    where TBufferWriter : IBufferWriter<byte>
{
    ref TBufferWriter bufferWriter;  // take TBufferWriter in the ctor
    ref byte bufferReference;
    int bufferLength;

    public MemoryPackWriter(ref TBufferWriter writer)
    ref byte GetSpanReference(int sizeHint);  // 2. request the maximum required buffer
    void Advance(int count);                  // 3. declare the amount written
}

// 1. For example, when writing an int:
public void WriteUnmanaged<T1>(scoped in T1 value1)
    where T1 : unmanaged
{
    var size = Unsafe.SizeOf<T1>();
    ref var spanRef = ref GetSpanReference(size);
    Unsafe.WriteUnaligned(ref spanRef, value1);
    Advance(size);
}

public interface System.Buffers.IBufferWriter<T>
{
    Span<T> GetSpan(int sizeHint = 0);
    void Advance(int count);
}

Slide 38

Slide 38 text

MemoryPackWriter Buffer management for writing, or caching of IBufferWriter<byte>'s buffer.
Frequent calls to GetSpan/Advance on IBufferWriter<byte> are slow, so reserve plenty of space within MemoryPackWriter to reduce the number of calls to the BufferWriter. NOTE: When implementing IBufferWriter<byte>, the size of the buffer returned by GetSpan should not be trimmed to sizeHint; return the actual buffer size that you likely hold internally. Trimming it forces frequent calls to GetSpan, which can lead to performance degradation.

Slide 39

Slide 39 text

Optimize the Write
If fixed-size members are consecutive, consolidate the calls to reduce the number of calls to GetSpanReference/Advance. Reduce the number of method calls; the fewer, the better.
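The consolidation idea can be sketched as follows (WriteTwoInts is an illustrative name; the span stands in for one combined GetSpanReference(8)): reserve once, write both fields, advance once.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class Program
{
    // Instead of two reserve/advance pairs for two fixed-size fields,
    // take one 8-byte reservation and write both values into it.
    static void WriteTwoInts(Span<byte> dest, int a, int b)
    {
        ref var spanRef = ref MemoryMarshal.GetReference(dest); // one reservation
        Unsafe.WriteUnaligned(ref spanRef, a);
        Unsafe.WriteUnaligned(ref Unsafe.Add(ref spanRef, 4), b);
        // ...followed by a single Advance(8) in the real writer
    }

    static void Main()
    {
        Span<byte> dest = stackalloc byte[8];
        WriteTwoInts(dest, 1, 2);
        int second = Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(dest.Slice(4)));
        Console.WriteLine(second); // 2
    }
}
```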

Slide 40

Slide 40 text

Complete Serialize
public static partial class MemoryPackSerializer
{
    public static void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
        where TBufferWriter : IBufferWriter<byte>
    public static byte[] Serialize<T>(in T? value)
    public static ValueTask SerializeAsync<T>(Stream stream, T? value)
}
var writer = new MemoryPackWriter<TBufferWriter>(ref bufferWriter);
writer.WriteValue(value);
writer.Flush();
When you Flush (calling the original IBufferWriter's Advance and synchronously confirming the actual written area), the serialization process is complete.

Slide 41

Slide 41 text

Other overloads
public static byte[] Serialize<T>(in T? value)
public static ValueTask SerializeAsync<T>(Stream stream, T? value)
These pass a ReusableLinkedArrayBufferWriter internally through Serialize:
var bufferWriter = ReusableLinkedArrayBufferWriterPool.Rent();
var writer = new MemoryPackWriter(ref bufferWriter);
writer.WriteValue(value);
writer.Flush();
await bufferWriter.WriteToAndResetAsync(stream); // SerializeAsync
return bufferWriter.ToArrayAndReset();           // byte[] Serialize

Slide 42

Slide 42 text

ReusableLinkedArrayBufferWriter
public sealed class ReusableLinkedArrayBufferWriter : IBufferWriter<byte>
{
    List<BufferSegment> buffers;
}
struct BufferSegment
{
    byte[] buffer;  // from ArrayPool<byte>.Shared.Rent
    int written;
}
If you only need the final concatenated array (or are writing to a Stream), the internal buffer can be represented as linked chunks rather than a List<T>-style grow-and-copy, since it doesn't have to be one contiguous block of memory. This reduces the number of copies. NOTE: When the buffer runs out, instead of linking fixed-size chunks, generate (or rent) chunks of double the previous size (which also avoids LOH concerns). Otherwise, for large write results the number of linked-list elements would become too large and performance would deteriorate.

Slide 43

Slide 43 text

ToArray / WriteTo
// ToArray: the final size is known, so allocate the result exactly once and copy.
var result = new byte[a.Length + b.Length + c.Length];
a.CopyTo(result.AsSpan());
b.CopyTo(result.AsSpan(a.Length));
c.CopyTo(result.AsSpan(a.Length + b.Length));
// WriteTo: or write each chunk to a Stream.
await stream.WriteAsync(a);
await stream.WriteAsync(b);
await stream.WriteAsync(c);
As the final size is known, only 'new' the final result and copy into it, or write the chunks to a Stream. The completed working arrays are no longer needed, so they are returned to the pool (ArrayPool<byte>.Shared.Return).

Slide 44

Slide 44 text

Improve LINQ ToArray (Enumerable.ToArray) It converts an IEnumerable<T> with an unknown number of elements into a T[]. Conventionally, when the internal T[] overflowed it was grown and copied; couldn't a T[] instead be built from concatenated chunks, in the same way as above? I submitted a PR to dotnet/runtime: https://github.com/dotnet/runtime/pull/90459 A dramatic 30~60% performance improvement; it might be included in .NET 9. NOTE: LINQ's ToArray has already been optimized in various ways, estimating the number of elements wherever possible, and when it can be estimated it allocates an array of the final size up front. The size estimation is not as simple as checking for ICollection<T>; there are more complex branches depending on the method chain, e.g. the size is determined if the source is Enumerable.Range, and it can still be determined through Take, and so on.

Slide 45

Slide 45 text

Avoid aggressive use of Pools with InlineArray (C# 12) As this is to be incorporated into the runtime itself, extensive use of Pools was avoided; a List<T[]> (or something like it) would also have caused extra allocations, making the proposal harder to accept. Instead of the ReusableLinkedArray approach, InlineArray from C# 12 was adopted. Roughly speaking, it enables a stackalloc-like T[][]: allocating the array of chunk references in the stack area eliminates the allocation of the linked list itself.
[InlineArray(29)]
struct ArrayBlock<T>
{
    private T[] array;
}
However, InlineArray only allows a fixed size specified at compile time. Therefore, I adopted '29' as the size....

Slide 46

Slide 46 text

29 Starting from 4 elements and repeatedly doubling, 29 chunks are enough to reach the maximum .NET array size (2147483591, just a little less than int.MaxValue). Since ToArray over an IEnumerable<T> always adds one element at a time, every chunk is guaranteed to fill completely before the next one is linked, with no gaps. Therefore, it is absolutely impossible for InlineArray(29) to run out.
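The arithmetic behind '29' can be checked directly: with chunk sizes 4, 8, 16, ... (doubling each time), the cumulative capacity after k chunks is 4·(2^k − 1), and k = 29 is the first point where that reaches the maximum array length.

```csharp
using System;

class Program
{
    static void Main()
    {
        // How many doubling chunks (starting at 4) are needed before the
        // total capacity covers the maximum .NET array length?
        const long MaxArrayLength = 2147483591;
        long chunkSize = 4, total = 0;
        int chunks = 0;
        while (total < MaxArrayLength)
        {
            total += chunkSize;
            chunkSize *= 2;
            chunks++;
        }
        Console.WriteLine(chunks); // 29, hence InlineArray(29) can never run out
    }
}
```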

Slide 47

Slide 47 text

I/O Read

Slide 48

Slide 48 text

No Stream Again Performance is determined by synchronous buffers and asynchronous reads and writes. Don't mix I/O and deserialization: calling ReadAsync for each value would be too slow. MemoryPackSerializer.DeserializeAsync(Stream) constructs a ReadOnlySequence<byte> first and then feeds it into the deserialization process.
public static partial class MemoryPackSerializer
{
    public static T? Deserialize<T>(ReadOnlySpan<byte> buffer)
    public static int Deserialize<T>(in ReadOnlySequence<byte> buffer, ref T? value)
    public static async ValueTask<T?> DeserializeAsync<T>(Stream stream)
}
NOTE: Not mixing I/O and deserialization means that true streaming deserialization (of undefined length, with minimal buffering) is not possible. Instead, MemoryPack offers a supplementary mechanism that buffers in window widths and returns an IAsyncEnumerable<T>. We target only synchronous buffers that have already been read.

Slide 49

Slide 49 text

ReadOnlySequence<byte> Like a concatenated T[] By entrusting buffer handling to System.IO.Pipelines, data can be treated like a connected T[] that can be sliced at any position. ReadOnlySequence<T> is not always fast, however, so you need ways to reduce the number of Slice calls.
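A multi-chunk ReadOnlySequence<byte> can be built by hand with a small ReadOnlySequenceSegment<byte> subclass (the Segment class below is illustrative), which makes the "connected T[]" behavior concrete:

```csharp
using System;
using System.Buffers;

// Minimal segment type for chaining byte[] chunks into one sequence.
class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(byte[] buffer) => Memory = buffer;

    public Segment Append(byte[] buffer)
    {
        var next = new Segment(buffer) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}

class Program
{
    static void Main()
    {
        // Two separate arrays exposed as one logically contiguous sequence.
        var first = new Segment(new byte[] { 1, 2, 3 });
        var last = first.Append(new byte[] { 4, 5 });
        var seq = new ReadOnlySequence<byte>(first, 0, last, 2);

        Console.WriteLine(seq.Length);             // 5
        Console.WriteLine(seq.Slice(2, 2).Length); // slicing crosses the chunk boundary
    }
}
```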

Slide 50

Slide 50 text

Flow of Deserialize This MemoryPackReader is important!

Slide 51

Slide 51 text

MemoryPackReader Buffer management for reading. Set a ReadOnlySequence<byte> as the source. Similar to MemoryPackWriter, obtain the necessary buffer with GetSpanReference and proceed with Advance.
public ref partial struct MemoryPackReader
{
    ReadOnlySequence<byte> bufferSource;
    ref byte bufferReference;
    int bufferLength;

    public MemoryPackReader(in ReadOnlySequence<byte> source)
    public MemoryPackReader(ReadOnlySpan<byte> buffer)
    ref byte GetSpanReference(int sizeHint);  // 2. request the maximum required buffer
    void Advance(int count);                  // 3. report the amount read
}

// 1. For example, when reading data like an int:
public void ReadUnmanaged<T1>(out T1 value1)
    where T1 : unmanaged
{
    var size = Unsafe.SizeOf<T1>();
    ref var spanRef = ref GetSpanReference(size);
    value1 = Unsafe.ReadUnaligned<T1>(ref spanRef);
    Advance(size);
}

public readonly struct ReadOnlySequence<T>
{
    ReadOnlySpan<T> FirstSpan { get; }
    ReadOnlySequence<T> Slice(long start);
}

Slide 52

Slide 52 text

MemoryPackReader Buffer management for reading. Due to the slowness of frequent Slice calls on ReadOnlySequence<byte>, secure the entire first block as FirstSpan inside MemoryPackReader and suppress the number of calls into the ReadOnlySequence. NOTE: Naturally, a read request can exceed FirstSpan. Since MemoryPack's deserialization requires a contiguous memory area, in such cases the actual implementation copies into a temporary area borrowed from the pool and assigns it to the ref byte bufferReference.

Slide 53

Slide 53 text

Reader I/O in Application

Slide 54

Slide 54 text

Efficient Read is challenging Handling incomplete reads.
while (true)
{
    var read = await socket.ReceiveAsync(buffer);
    var span = buffer.AsSpan(0, read);
    // ...
}
The amount read here may not fill one message block. If you call ReceiveAsync again and pack more into the buffer, what happens when it exceeds the buffer? If you keep resizing, it grows without bound; can you guarantee a time will come when it can be reset to 0? It is not always permitted to read to the end of the stream.

Slide 55

Slide 55 text

A reader that returns ReadOnlySequence<byte> Concatenating incomplete blocks. If the size of one message is known (i.e., the length is written in the header as part of the protocol), the read can be expressed as a command to read at least a certain size (ReadAtLeast).
async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        // do anything
    }
}
If the data is in the form of ReadOnlySequence<byte>, you can feed it into anything that supports it; most modern serializers basically do. NOTE: Serializers that do not support ReadOnlySequence<byte> are considered legacy and should be discarded. Of course, MessagePack for C# and MemoryPack support it. NOTE: System.IO.Pipelines is what takes care of the related tasks.
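With System.IO.Pipelines (assuming the System.IO.Pipelines package or shared framework is available), the read-at-least pattern looks like this: the writer delivers bytes in fragments, but PipeReader.ReadAtLeastAsync (.NET 7+) only completes once at least 4 bytes can be handed back as a ReadOnlySequence<byte>.

```csharp
using System;
using System.IO.Pipelines;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var pipe = new Pipe();

        // Simulate a socket delivering one message in two fragments.
        var writing = Task.Run(async () =>
        {
            await pipe.Writer.WriteAsync(new byte[] { 1, 2 });    // partial message
            await pipe.Writer.WriteAsync(new byte[] { 3, 4, 5 }); // the rest arrives later
            await pipe.Writer.CompleteAsync();
        });

        // Completes only once at least 4 bytes are buffered.
        ReadResult result = await pipe.Reader.ReadAtLeastAsync(4);
        Console.WriteLine(result.Buffer.Length >= 4); // True

        pipe.Reader.AdvanceTo(result.Buffer.End);
        await writing;
        await pipe.Reader.CompleteAsync();
    }
}
```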

Slide 56

Slide 56 text

Determining type Assume a protocol where the type of the message comes at the top and dispatch is based on it; how do we determine the message type when it is a string (text protocols: for example, Redis and NATS)?
async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        var code = GetCode(buffer);
        if (code == ServerOpCodes.Msg) { /* ... */ }
    }
}
You can determine it simply by converting it to a string, and converting it to an enum makes it easy to use later. This example is from NATS: by cleverly padding every opcode to 4 characters, including symbols and spaces, it is guaranteed to be determinable with ReadAtLeastAsync(4).
ServerOpCodes GetCode(ReadOnlySequence<byte> buffer)
{
    var span = GetSpan(buffer);
    var str = Encoding.UTF8.GetString(span);
    return str switch
    {
        "INFO" => ServerOpCodes.Info,
        "MSG " => ServerOpCodes.Msg,
        "PING" => ServerOpCodes.Ping,
        "PONG" => ServerOpCodes.Pong,
        "+OK\r" => ServerOpCodes.Ok,
        "-ERR" => ServerOpCodes.Error,
        _ => throw new InvalidOperationException()
    };
}


Slide 58

Slide 58 text

Determining type Turning it into a string involves a memory allocation. You should absolutely avoid this!!!

Slide 59

Slide 59 text

Take 2: compare as ReadOnlySpan<byte>

async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        var code = GetCode(buffer);
        if (code == ServerOpCodes.Msg)
        {
            // ...
        }
    }
}

With C# 11 UTF-8 string literals (u8), you can get a ReadOnlySpan<byte> as a constant. If you move the values with the highest match frequency to the top of the if chain, the cost of the checks is also reduced. In addition, ReadOnlySpan<T>.SequenceEqual (unlike LINQ's) is quite fast.

ServerOpCodes GetCode(ReadOnlySequence<byte> buffer)
{
    var span = GetSpan(buffer);
    if (span.SequenceEqual("MSG "u8)) return ServerOpCodes.Msg;
    if (span.SequenceEqual("PONG"u8)) return ServerOpCodes.Pong;
    if (span.SequenceEqual("INFO"u8)) return ServerOpCodes.Info;
    if (span.SequenceEqual("PING"u8)) return ServerOpCodes.Ping;
    if (span.SequenceEqual("+OK\r"u8)) return ServerOpCodes.Ok;
    if (span.SequenceEqual("-ERR"u8)) return ServerOpCodes.Error;
    throw new InvalidOperationException();
}
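As a standalone sketch of the technique on this slide, a u8 literal compared with span SequenceEqual looks like this (IsMsg is a hypothetical helper, not part of the NATS client):

```csharp
using System;

static bool IsMsg(ReadOnlySpan<byte> span)
{
    // "MSG "u8 is a compile-time constant ReadOnlySpan<byte> (C# 11);
    // SequenceEqual on spans is vectorized and allocation-free.
    return span.SequenceEqual("MSG "u8);
}

Console.WriteLine(IsMsg("MSG "u8)); // True
Console.WriteLine(IsMsg("PING"u8)); // False
```

No string (and therefore no allocation) is ever created; the comparison runs directly over the raw bytes.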

Slide 61

Slide 61 text

Convert the first 4 chars to an int

// msg = ReadOnlySpan<byte>
if (Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(msg)) == 1330007625) // INFO
{
}

internal static class ServerOpCodes
{
    public const int Info = 1330007625;  // "INFO"
    public const int Msg = 541545293;    // "MSG "
    public const int Ping = 1196312912;  // "PING"
    public const int Pong = 1196314448;  // "PONG"
    public const int Ok = 223039275;     // "+OK\r"
    public const int Error = 1381123373; // "-ERR"
}

If you include the trailing characters (spaces or \r), every NATS opcode can be determined with exactly 4 bytes (an int), so you can precompute a set of int constants and convert directly from ReadOnlySpan<byte> to int. Comparing by stringifying is out of the question, and since the check fits in just 4 bytes, comparing as an int is the fastest. NOTE: honestly, the best protocol would be a binary one where the first byte represents the type... Text protocols are not good.
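These constants can be derived mechanically from the UTF-8 bytes. A minimal sketch (note the assumption: the constants are the little-endian interpretation of the 4 bytes, which matches all mainstream .NET targets; OpCode is a hypothetical helper):

```csharp
using System;
using System.Buffers.Binary;

static int OpCode(ReadOnlySpan<byte> fourBytes)
{
    // Read the first 4 bytes as a little-endian int, matching the
    // precomputed ServerOpCodes constants on this slide.
    return BinaryPrimitives.ReadInt32LittleEndian(fourBytes);
}

Console.WriteLine(OpCode("INFO"u8)); // 1330007625
Console.WriteLine(OpCode("MSG "u8)); // 541545293
Console.WriteLine(OpCode("PING"u8)); // 1196312912
```

This is also a convenient way to generate the constant table in the first place, rather than computing the values by hand.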

Slide 62

Slide 62 text

async/await and inlining

async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        var code = GetCode(buffer);
        await DispatchCommandAsync(code, buffer);
    }
}

This is the part that reads data from the socket (the actual code is a bit more complex, so we want to separate it from the processing part).

async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
{
}

In this method, messages are parsed in detail and the actual processing is done (such as deserializing the payload and invoking callbacks).

Slide 64

Slide 64 text

Asynchronous state machine generation

async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        var code = GetCode(buffer);
        await DispatchCommandAsync(code, buffer);
    }
}

async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
{
}

Awaiting inside the loop does not generate a new asynchronous state machine per iteration, so within the loop you can await as much as you want. However, every call to a method declared with async that actually completes asynchronously allocates its own asynchronous state machine, which is extra allocation. If the awaited method is instead built on IValueTaskSource, it can be designed so that awaiting it directly generates no asynchronous state machine at all.

Slide 65

Slide 65 text

Inlining await in the hot path

async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        var code = GetCode(buffer);
        if (code == ServerOpCodes.Msg)
        {
            await DoAnything();
            await DoAnything();
        }
        else
        {
            await DispatchCommandAsync(code, buffer);
        }
    }
}

[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
{
}

Since 90% of loop iterations receive Msg (the rest, such as PING or ERROR, arrive only rarely), only the Msg case is inlined to aim for maximum efficiency. The other cases are split into a separate method, but marking it with PoolingAsyncValueTaskMethodBuilder, available since .NET 6, makes its asynchronous state machine poolable and reusable.
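A self-contained sketch of the pooling builder attribute, assuming .NET 6 or later (DispatchAsync here is a hypothetical stand-in for the real dispatch method):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// PoolingAsyncValueTaskMethodBuilder pools the async state machine
// boxes, so repeated asynchronous completions reuse them instead of
// allocating a fresh box per call. The open generic form is used
// because the method returns ValueTask<int>.
[AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
static async ValueTask<int> DispatchAsync(int code)
{
    await Task.Yield(); // genuinely suspends, so a state machine is required
    return code * 2;
}

var result = await DispatchAsync(21);
Console.WriteLine(result); // 42
```

The attribute changes only how the state machine is boxed; callers await the ValueTask exactly as before.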

Slide 66

Slide 66 text

Optimize for All Types

Slide 67

Slide 67 text

Source Generator based: the Serialize and Deserialize code optimized for each [MemoryPackable] type is generated automatically at compile time, using static abstract members from C# 11.

Slide 68

Slide 68 text

IL.Emit vs SourceGenerator

IL.Emit: dynamic assembly generation using type information at runtime. IL black magic that has been available since the early days of .NET. Not usable in environments where dynamic code generation is not allowed (iOS, WASM, NativeAOT, etc.).

SourceGenerator: generates C# code from the AST at compile time. Came into extensive use around .NET 6. Since the output is pure C# code, it can be used in all environments.

Given the diversification of environments where .NET runs, and since Source Generators carry no startup-speed penalty, it is desirable to move toward them as much as possible. The inability to use runtime information can make it difficult to generate equivalent code, especially around generics, but let's overcome this with some ingenuity...

Slide 69

Slide 69 text

Optimize for All Types

For example, collection processing could mostly be handled with just one formatter for IEnumerable<T>, but by creating an optimal implementation for each collection type one by one, you can run the highest-performance code. Implementations against the interfaces are only needed when asked to process unknown types.

Slide 70

Slide 70 text

Fast Enumeration of Array

Normally, bounds checks are inserted when accessing elements of a C# array. However, the JIT compiler can exceptionally remove the bounds check when it can prove the index never exceeds the bounds (for example, a for loop over .Length). A foreach over an array (or Span) is converted at compile time to the same IL as a for loop, so it is exactly equivalent.
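A minimal sketch of the loop shape the JIT recognizes for bounds-check elimination:

```csharp
using System;

static int Sum(int[] values)
{
    var sum = 0;
    // Because the loop condition is i < values.Length, the JIT can
    // prove every values[i] access is in range and elide the
    // per-element bounds check.
    for (int i = 0; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}

Console.WriteLine(Sum(new[] { 1, 2, 3, 4 })); // 10
```

Caching `values.Length` in a separate local, or indexing with an unrelated variable, can defeat this optimization, so the idiomatic `i < array.Length` form is also the fast form.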

Slide 71

Slide 71 text

Optimize for List<T> / Read

public sealed class ListFormatter<T> : MemoryPackFormatter<List<T?>>
{
    public override void Serialize<TBufferWriter>(
        ref MemoryPackWriter<TBufferWriter> writer, scoped ref List<T?>? value)
    {
        if (value == null)
        {
            writer.WriteNullCollectionHeader();
            return;
        }

        var span = CollectionsMarshal.AsSpan(value);
        var formatter = GetFormatter<T?>();
        writer.WriteCollectionHeader(span.Length);
        for (int i = 0; i < span.Length; i++)
        {
            formatter.Serialize(ref writer, ref span[i]);
        }
    }
}

Fastest List<T> iteration: CollectionsMarshal.AsSpan.
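Outside of MemoryPack, the same CollectionsMarshal.AsSpan trick applies to any hot loop over a List<T>. A minimal sketch (SumList is a hypothetical helper):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static long SumList(List<int> list)
{
    // AsSpan exposes the list's backing array directly, skipping the
    // enumerator and List<T>'s per-access version checks.
    // Caveat: do not Add/Remove while holding the span.
    var span = CollectionsMarshal.AsSpan(list);
    long sum = 0;
    for (int i = 0; i < span.Length; i++)
    {
        sum += span[i];
    }
    return sum;
}

Console.WriteLine(SumList(new List<int> { 1, 2, 3 })); // 6
```

The span is valid only until the list's backing array is replaced (for example by growth), which is why this pattern suits short, tight loops like a serializer's inner loop.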

Slide 72

Slide 72 text

Optimize for List<T> / Write

public override void Deserialize(ref MemoryPackReader reader, scoped ref List<T?>? value)
{
    if (!reader.TryReadCollectionHeader(out var length))
    {
        value = null;
        return;
    }

    value = new List<T?>(length);
    CollectionsMarshal.SetCount(value, length);
    var span = CollectionsMarshal.AsSpan(value);

    var formatter = GetFormatter<T?>();
    for (int i = 0; i < length; i++)
    {
        formatter.Deserialize(ref reader, ref span[i]);
    }
}

Adding to a List<T> one element at a time is slow. By making it handleable as a Span<T>, deserialization of List<T> becomes as fast as for an array. Just using new List<T>(capacity) leaves the internal size at 0, so CollectionsMarshal.AsSpan would only return a Span of length 0, which is useless. By forcibly changing the internal size with CollectionsMarshal.SetCount, added in .NET 8, you can skip Add and extract the full Span.
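The SetCount + AsSpan combination described above can be sketched standalone, assuming .NET 8 or later:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Build a List<int> without calling Add per element (.NET 8+):
// SetCount grows the list's internal size up-front, then AsSpan
// exposes the backing array for direct writes.
var list = new List<int>(capacity: 4);
CollectionsMarshal.SetCount(list, 4);
var span = CollectionsMarshal.AsSpan(list);
for (int i = 0; i < span.Length; i++)
{
    span[i] = i * 10;
}

Console.WriteLine(string.Join(",", list)); // 0,10,20,30
```

After SetCount, list.Count is 4 even though nothing was Added, so the span covers the full logical length of the list.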

Slide 73

Slide 73 text

Actual code of ListFormatter<T>

public override void Deserialize(ref MemoryPackReader reader, scoped ref List<T?>? value)
{
    if (!reader.TryReadCollectionHeader(out var length))
    {
        value = null;
        return;
    }

    value = new List<T?>(length);
    CollectionsMarshal.SetCount(value, length);
    var span = CollectionsMarshal.AsSpan(value);

    if (!RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        var byteCount = length * Unsafe.SizeOf<T>();
        ref var src = ref reader.GetSpanReference(byteCount);
        ref var dest = ref Unsafe.As<T?, byte>(ref MemoryMarshal.GetReference(span)!);
        Unsafe.CopyBlockUnaligned(ref dest, ref src, (uint)byteCount);
        reader.Advance(byteCount);
    }
    else
    {
        var formatter = GetFormatter<T?>();
        for (int i = 0; i < length; i++)
        {
            formatter.Deserialize(ref reader, ref span[i]);
        }
    }
}

MemoryPack has a binary specification that can handle an unmanaged-type T[] with only a memory copy. By extracting the Span, even a List<T> can be deserialized with a memory copy.
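The unmanaged fast path boils down to reinterpreting the destination span as bytes and block-copying into it. A standalone sketch of that idea (this is not MemoryPack's actual API, and it assumes the payload bytes are little-endian):

```csharp
using System;
using System.Runtime.InteropServices;

// Source: 3 little-endian ints serialized as raw bytes.
ReadOnlySpan<byte> payload = stackalloc byte[]
{
    1, 0, 0, 0,
    2, 0, 0, 0,
    3, 0, 0, 0,
};

Span<int> dest = stackalloc int[3];
// Reinterpret the int span as a byte span and copy the payload in,
// deserializing all 3 elements with a single memory copy.
payload.CopyTo(MemoryMarshal.AsBytes(dest));

Console.WriteLine(string.Join(",", dest.ToArray())); // 1,2,3
```

MemoryMarshal.AsBytes is the safe-API counterpart of the Unsafe.As + CopyBlockUnaligned combination in the slide; it is only legal for unmanaged element types, which is exactly what IsReferenceOrContainsReferences guards.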

Slide 74

Slide 74 text

String / UTF8
SIMD
FFI (DllImport/LibraryImport)
Channel
More async/await

Slide 75

Slide 75 text

Conclusion

Slide 76

Slide 76 text

Expanding the possibilities of C#

Languages evolve, and techniques evolve. C# has great potential and continues to stand at the forefront among competing programming languages. Building a strong ecosystem is crucial: in the modern era, where open-source software (OSS) is central, the vitality of the ecosystem matters most. The evolution of the language/runtime and the evolution of OSS are two wheels that drive progress forward. It is no longer an era where we can simply rely on Microsoft or Unity.

Slide 77

Slide 77 text

No content