
Modern High Performance C# 2023 Edition


Yoshifumi Kawai

August 29, 2023

Transcript

  1. Modern High Performance C#
    2023 Edition
CEDEC 2023 (the largest conference for game developers in Japan)
    2023-08-23 Yoshifumi Kawai / Cysharp, Inc.
    Translated from Japanese to English with ChatGPT (GPT-4)


  2. About Speaker
    Kawai Yoshifumi / @neuecc
    Cysharp, Inc. - CEO/CTO
    Established in September 2018 as a subsidiary of Cygames, Inc. Engages in research and development, open-source software (OSS), and consulting related to C#.
    Microsoft MVP for Developer Technologies (C#) since 2011.
    Received the CEDEC AWARDS 2022 Engineering award.
    Developer of over 50 OSS libraries (UniRx, UniTask, MessagePack C#, etc.), achieving one of the highest numbers of GitHub Stars worldwide in the field of C#.


  3. OSS for high performance
    MessagePack ★4836: Extremely Fast MessagePack Serializer for C# (.NET, Unity).
    MagicOnion ★3308: Unified Realtime/API framework for .NET platform and Unity.
    UniTask ★5901: Provides an efficient allocation-free async/await integration for Unity.
    ZString ★1524: Zero Allocation StringBuilder for .NET and Unity.
    MemoryPack ★2007: Zero encoding extreme performance binary serializer for C# and Unity.
    AlterNats ★271: An alternative high performance NATS client for .NET.
    MessagePipe ★1062: High performance in-memory/distributed messaging pipeline for .NET and Unity.
    Ulid ★629: Fast C# implementation of ULID for .NET and Unity.


  4. OSS for high performance
    I have released libraries that pursue unmatched high performance across various genres. In this session, drawing from those experiences, I will introduce techniques for achieving peak performance in modern C#.


  5. Current State of C#


  6. Continuously Evolving
    2002 C# 1.0 (Java/Delphi-influenced origins)
    2005 C# 2.0: Generics
    2008 C# 3.0: LINQ
    2010 C# 4.0: dynamic
    2012 C# 5.0: async/await
    2015 C# 6.0: Roslyn (self-hosting C# compiler)
    2017 C# 7.0: Tuple, Span
    2019 C# 8.0: null safety, async streams
    2020 C# 9.0: record class, code generators
    2021 C# 10.0: global using, record struct
    2022 C# 11.0: ref fields, static abstract members


  7. Continuously Evolving
    C# periodically underwent major updates during the Anders Hejlsberg era and now adds incremental features every year. While bold new features have become less common, the language has been steadily improving, and many of the additions have a positive impact on performance.


  8. .NET Framework -> Cross platform
    2002 .NET Framework 1.0
    2005 .NET Framework 2.0
    2008 .NET Framework 3.5
    2010 .NET Framework 4
    2012 .NET Framework 4.5
    2016 .NET Core 1.0 (the commencement of full-fledged support for Linux)
    2017 .NET Core 2.0
    2019 .NET Core 3.0
    2020 .NET 5 (integration of multiple runtimes: .NET Framework, Core, Mono, Xamarin)
    The .NET Framework years were the era when the platform was focused only on Windows.


  9. Linux and .NET
    High performance as a server-side programming language.
    It is now commonplace, and quite practical, to choose Linux for server-side deployment. Performance is also proven through benchmark results (Plaintext ranked 1st: C#, .NET, Linux), and is competitive even when compared to C++ or Rust.
    While these benchmarks are not necessarily practical, they do serve as evidence that the language has a sufficient baseline of potential.


  10. gRPC and Performance
    gRPC does not necessarily equal high speed. Performance varies depending on the implementation, and unoptimized implementations can be subpar.
    C# delivers performance on the same level as top-tier languages like Rust, Go, and C++, demonstrating its capabilities.
    [Chart: gRPC implementation performance (2 CPUs), requests/sec (higher is better). Implementations compared: dotnet_grpc, rust_thruster_mt, cpp_grpc_mt, scala_akka, rust_tonic_mt, go_vtgrpc, java_quarkus, swift_grpc, node_grpcjs_st, python_async_grpc, ruby_grpc, erlang_grpcbox, php_grpc]
    https://github.com/LesnyRumcajs/grpc_bench/discussions/354


  11. Memory


  12. MessagePack for C#
    #1 Binary Serializer in .NET
    https://github.com/MessagePack-CSharp/MessagePack-CSharp
    The most supported binary serializer in .NET (with 4836 stars).
    Even if you've never used it directly,
    you've likely used it indirectly... for sure!
    Visual Studio 2022 internal
    SignalR MessagePack Hub
    Blazor Server protocol(BlazorPack)
    2017-03-13 Release
    Overwhelming speed compared to other
    competitors at the time.


  13. Thinking about the fastest serializer
    For example, serializing value = int(999). The ideal fastest code:
    // dest is a Span<byte>
    Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(dest), value);
    The IL it compiles down to:
    ldarg.0
    ldarg.1
    unaligned. 0x01
    stobj !!T
    ret
    In essence, a memory copy.


  14. Thinking about the fastest serializer
    Starting from C# 7.2, there is a need to actively utilize Span<T> for handling contiguous memory regions.
    The Unsafe class provides primitive operations that can be written in IL (Intermediate Language) but not in C#. These features lift C#'s language constraints, making it easier to control raw behavior.
    NOTE: While it's true that you can write something close to Span using pointers, Span allows for more natural handling in C#, which is why it has started appearing frequently not just within methods but also in the method signatures of public APIs. Thanks to this, the pattern of carrying "raw-like" operations across methods, classes, and assemblies has become established, contributing to the overall performance improvement of C# in recent years.
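As a minimal, self-contained sketch of the "ideal fastest code" idea above (the RawWriter name is illustrative, not part of any library):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class RawWriter
{
    // Write the 4 bytes of `value` at the start of `dest` with no
    // alignment assumption -- effectively a single memory copy.
    public static void WriteInt32(Span<byte> dest, int value)
    {
        Unsafe.WriteUnaligned(ref MemoryMarshal.GetReference(dest), value);
    }
}
```

Writing 999 into a 4-byte buffer this way stores the value in native endianness, which is exactly the "memory copy" the slide describes.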


  15. MessagePack / JSON
    MessagePack: following the MessagePack specification, write the type identifier at the beginning and the value in BigEndian format.
    // uint16 msgpack code
    Unsafe.WriteUnaligned(ref dest[0], (byte)0xcd);
    // Write value as BigEndian
    var temp = BinaryPrimitives.ReverseEndianness((ushort)value);
    Unsafe.WriteUnaligned(ref dest[1], temp);
    JSON: in the case of existing serializers, performance is improved by reading and writing directly as UTF8 binary rather than as a string.
    Utf8Formatter.TryFormat(value, dest, out var bytesWritten);


  16. MessagePack / JSON
    MessagePack for C# is indeed fast. However, due to the binary specification of MessagePack itself, no matter what you do, it will be slower than the "ideal fastest code."
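The uint16 path above can also be sketched with the safe BinaryPrimitives API; MsgPackSketch is an illustrative name, not the actual MessagePack for C# code:

```csharp
using System;
using System.Buffers.Binary;

static class MsgPackSketch
{
    // MessagePack uint16 format: a 0xcd type code, then the value big-endian.
    public static int WriteUInt16(Span<byte> dest, ushort value)
    {
        dest[0] = 0xcd;
        BinaryPrimitives.WriteUInt16BigEndian(dest.Slice(1), value);
        return 3; // bytes written
    }
}
```

Encoding 999 (0x03E7) produces the three bytes cd 03 e7: one extra type byte plus an endianness swap compared to the raw copy.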


  17. MemoryPack
    Zero encoding extreme fast binary serializer
    https://github.com/Cysharp/MemoryPack/
    Released in 2022-09 with the aim of being the ultimate high-speed serializer.
    Compared to JSON, it offers performance that is several times, and in optimal cases hundreds of times, faster. It is overwhelmingly superior even when compared to MessagePack for C#.
    Binary specification optimized for C#; utilizes the latest design, fully leveraging C# 11.


  18. Zero encoding
    Memory copying to the fullest extent possible
    public struct Point3D
    {
    public int X;
    public int Y;
    public int Z;
    }
    new Point3D { X = 1, Y = 2, Z = 3 }
In C#, it is guaranteed that the values of a struct that does not
    include reference types (where IsReferenceOrContainsReferences is false)
    will be laid out contiguously in memory.
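A sketch of what "zero encoding" means for such a struct, using only safe MemoryMarshal APIs (ZeroEncoding/ToBytes are illustrative names, not MemoryPack's actual implementation):

```csharp
using System;
using System.Runtime.InteropServices;

public struct Point3D
{
    public int X;
    public int Y;
    public int Z;
}

static class ZeroEncoding
{
    // Point3D contains no references, so its 12 bytes can be
    // reinterpreted as raw bytes and copied in one operation.
    public static byte[] ToBytes(Point3D p)
    {
        ReadOnlySpan<Point3D> one = MemoryMarshal.CreateReadOnlySpan(ref p, 1);
        return MemoryMarshal.AsBytes(one).ToArray();
    }
}
```

Serializing { X = 1, Y = 2, Z = 3 } is then just a 12-byte copy, no per-field encoding.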


  19. IsReferenceOrContainsReferences
    Sequentially arranged compact specifications
    [MemoryPackable]
    public partial class Person
    {
    public long Id { get; set; }
    public int Age { get; set; }
    public string? Name { get; set; }
    }
    In the case of reference types, they are
    written sequentially. The specification was
    carefully crafted to balance performance and
    versioning resilience within a simple structure.


  20. T[] where T : unmanaged
    In C#, arrays whose elements are of unmanaged type (non-reference structs) are arranged sequentially in memory.
    new int[] { 1, 2, 3, 4, 5 }
    var srcLength = Unsafe.SizeOf<T>() * value.Length;
    var allocSize = srcLength + 4;
    ref var dest = ref GetSpanReference(allocSize);
    ref var src = ref Unsafe.As<T, byte>(ref MemoryMarshal.GetArrayDataReference(value));
    Unsafe.WriteUnaligned(ref dest, value.Length);
    Unsafe.CopyBlockUnaligned(ref Unsafe.Add(ref dest, 4), ref src, (uint)srcLength);
    Advance(allocSize);
    Serialize == Memory Copy
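The same length-prefixed whole-array copy can be sketched with safe APIs (ArraySerializer is an illustrative name; MemoryPack's real implementation uses the Unsafe-based code above):

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

static class ArraySerializer
{
    // Length-prefixed copy of an unmanaged T[]: a 4-byte element count,
    // then all element bytes in a single block copy.
    public static byte[] Serialize<T>(T[] value) where T : unmanaged
    {
        ReadOnlySpan<byte> src = MemoryMarshal.AsBytes(value.AsSpan());
        var dest = new byte[4 + src.Length];
        BinaryPrimitives.WriteInt32LittleEndian(dest, value.Length);
        src.CopyTo(dest.AsSpan(4));
        return dest;
    }
}
```

For new int[] { 1, 2, 3, 4, 5 } this emits 4 + 20 bytes with exactly one element copy, regardless of array length.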


  21. T[] where T : unmanaged
    Increasingly advantageous for complex types like Vector3[]
    Vector3(float x, float y, float z)[10000]
    Conventional serializers perform Write/Read
    operations for each field, so for 10,000 items, you
    would need to perform 10,000 x 3 operations.
    MemoryPack requires only one copy operation.
    It's only natural that it would be 200 times
    faster, then!


  22. I/O Write


  23. Three Tenets of I/O Application Speedup
    Minimize allocations
    Reduce copies
    Prioritize asynchronous I/O
    // Bad example
    byte[] result = Serialize(value);
    response.Write(result);
This violates all three tenets: a byte[] is allocated on every call, the Write likely copies the data again, and the Write is synchronous.


  24. Isn't I/O All About Streams?
    async Task WriteToStreamAsync(Stream stream)
    {
    // Queue
    while (messages.TryDequeue(out var message))
    {
    await stream.WriteAsync(message.Encode());
    }
    }
    In applications involving I/O, the ultimate output destinations are usually
    either the network (Socket/NetworkStream) or a file (FileStream).
    "Were you able to implement the three
    points with this...?"



  26. Why Streams Are Bad: Reason 1
    async Task WriteToStreamAsync(Stream stream)
    {
    // Queue
    while (messages.TryDequeue(out var message))
    {
    await stream.WriteAsync(message.Encode());
    }
    }
    Frequent fine-grained I/O operations are slow, even if they are
    asynchronous! async/await is not a panacea!


  27. Stream is beautiful……?
    async Task WriteToStreamAsync(Stream stream)
    {
    // So, it would be beneficial to add a buffer using BufferedStream, right?
    using (var buffer = new BufferedStream(stream))
    {
    while (messages.TryDequeue(out var message))
    {
    await buffer.WriteAsync(message.Encode());
    }
    }
    }
The Stream abstraction is excellent in terms of functionality: features can be added freely through the decorator pattern. For instance, wrapping a stream in a GZipStream adds compression, and wrapping it in a CryptoStream adds encryption.
    In this case, since we want to add a buffer, we wrap it in a BufferedStream. This way, WriteAsync won't immediately hit the I/O.



  29. Why Streams Are Bad: Reason 2
    async Task WriteToStreamAsync(Stream stream)
    {
    // So, it would be beneficial to add a buffer using BufferedStream, right?
    using (var buffer = new BufferedStream(stream))
    {
    while (messages.TryDequeue(out var message))
    {
    await buffer.WriteAsync(message.Encode());
    }
    }
    }
    If the Stream is already Buffered, unnecessary allocation
    would contradict the intent of "reducing allocation."
    Being buffered means that in most cases (provided the buffer is not
    overflowing) calls can be synchronous. If a call is synchronous, making
    an asynchronous call is wasteful.


  30. Why Streams Are Bad: Reason 2
public Task WriteAsync(byte[] buffer, int offset, int count);
    public ValueTask WriteAsync(ReadOnlyMemory<byte> buffer);
    Due to historical circumstances, Streams (and Sockets) have APIs that return a Task
    and others that return a ValueTask, with similar parameters and the same name. If you
    use an API that returns a Task, there's a chance you might inadvertently generate
    unnecessary Task allocations. Therefore, always use a ValueTask call.
    Fortunately, in the case of BufferedStream, the allocation itself does not occur because it
    returns Task.CompletedTask for synchronous operations. However, there is a cost
    associated with calling await. Regardless of whether it is ValueTask, waste is waste.
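The overload choice above can be shown concretely; a small sketch (WriteDemo is an illustrative name, demonstrated here against a MemoryStream):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class WriteDemo
{
    // Prefer the ReadOnlyMemory<byte>-based overload: it returns a
    // ValueTask, so synchronous completions avoid a Task allocation.
    public static async ValueTask WriteAllAsync(Stream stream, byte[] data)
    {
        await stream.WriteAsync(data.AsMemory()); // ValueTask-returning overload
        // NOT: await stream.WriteAsync(data, 0, data.Length); // Task-returning overload
    }
}
```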


  31. Let's reduce asynchronous writing to just once
    async Task WriteToStreamAsync(Stream stream)
    {
        var buffer = ArrayPool<byte>.Shared.Rent(4096);
        try
        {
            var slice = buffer.AsMemory();
            var totalWritten = 0;
            while (messages.TryDequeue(out var message))
            {
                var written = message.EncodeTo(slice.Span);
                totalWritten += written;
                slice = slice.Slice(written);
            }
            await stream.WriteAsync(buffer.AsMemory(0, totalWritten));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
    The original message.Encode() presumably returned a byte[]. If you switch to EncodeTo and write into one large rented buffer, there is less waste than with the BufferedStream approach.
    (For the purpose of this sample, we assume the buffer will not overflow and skip checks and resizing.)


  32. Stream is Bad
    Focus on synchronous buffers and asynchronous reading and writing.
    The Stream abstraction mixes synchronous and asynchronous behavior, forcing us to always use async calls even when the actual behavior is synchronous, as with BufferedStream.
    Because the concrete type behind a Stream is unknown, each Stream often holds its own buffer, for safety or as working space. (For example, a new GZipStream allocates 8K just by being created, BufferedStream allocates 4K, and MemoryStream also allocates in fine-grained pieces.)


  33. Stream is Dead
    Avoiding Stream: the time when Stream was a first-class citizen for I/O has passed.
    • RandomAccess for file processing (scatter/gather I/O API)
    • IBufferWriter<byte> for writing directly into the destination's internal buffer
    • System.IO.Pipelines for buffer and flow control
    Classes have emerged that handle these processes while avoiding Streams. Avoiding Stream overhead is the first step toward high-performance handling.
    However, since Streams are at the core of .NET, it's impossible to avoid them completely. It's hard to bypass NetworkStream or FileStream entirely, and there are no alternatives to ConsoleStream or SslStream. Try to manage by not touching the streams until the very last read/write.


  34. IBufferWriter<byte>
    Abstracting the synchronous buffer for writing
    public interface IBufferWriter<T>
    {
        void Advance(int count);
        Memory<T> GetMemory(int sizeHint = 0);
        Span<T> GetSpan(int sizeHint = 0);
    }
    void Serialize<T>(IBufferWriter<byte> writer, T value)
    Flow: the IBufferWriter requests a slice of the network buffer, the serializer writes into that slice, and finally the buffer slice is written to the network (await SendAsync()).
    By directly accessing and writing to the root buffer, not only can allocations be eliminated, but also copies between buffers.
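A minimal sketch of the GetSpan/Advance protocol, shown against the built-in ArrayBufferWriter<byte> (BufferWriterDemo is an illustrative name):

```csharp
using System;
using System.Buffers;

static class BufferWriterDemo
{
    // The serializer-side protocol: request a span from the destination's
    // root buffer, write into it, then declare how many bytes were used.
    public static void WriteInt32(IBufferWriter<byte> writer, int value)
    {
        Span<byte> span = writer.GetSpan(4); // at least 4 bytes
        BitConverter.TryWriteBytes(span, value);
        writer.Advance(4);
    }
}
```

Any IBufferWriter<byte> works as the destination (a pipe, a socket sender, a pooled buffer), so the serializer never allocates an intermediate byte[].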


  35. MemoryPackSerializer.Serialize
    public static partial class MemoryPackSerializer
    {
        public static void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
            where TBufferWriter : IBufferWriter<byte>
        public static byte[] Serialize<T>(in T? value)
        public static ValueTask SerializeAsync<T>(Stream stream, T? value)
    }
    The IBufferWriter overload is the most fundamental and can provide the best performance.


  36. Example: Flow of MemoryPack's Serialize
    This MemoryPackWriter is important!


  37. MemoryPackWriter
    Buffer management for writing, or caching of the IBufferWriter's buffer.
    public ref partial struct MemoryPackWriter<TBufferWriter>
        where TBufferWriter : IBufferWriter<byte>
    {
        ref TBufferWriter bufferWriter; // take TBufferWriter in the ctor
        ref byte bufferReference;
        int bufferLength;
        ref byte GetSpanReference(int sizeHint);
        void Advance(int count);
        public MemoryPackWriter(ref TBufferWriter writer)
    }
    public interface System.Buffers.IBufferWriter<T>
    {
        Span<T> GetSpan(int sizeHint = 0);
        void Advance(int count);
    }
    For example, when writing an int: request the maximum required buffer, write, then declare the amount written.
    public void WriteUnmanaged<T1>(scoped in T1 value1)
        where T1 : unmanaged
    {
        var size = Unsafe.SizeOf<T1>();
        ref var spanRef = ref GetSpanReference(size); // request the maximum required buffer
        Unsafe.WriteUnaligned(ref spanRef, value1);
        Advance(size); // declare the amount written
    }


  38. MemoryPackWriter
    Frequent calls to GetSpan/Advance on IBufferWriter<byte> are slow, so reserve plenty of space within MemoryPackWriter to reduce the number of calls to the BufferWriter.
    NOTE: When implementing IBufferWriter<byte>, the buffer returned by GetSpan should not be trimmed down to sizeHint; return the actual buffer you hold internally. Trimming forces frequent calls to GetSpan, which can lead to performance degradation.


  39. Optimize the Write
    Reduce the number of method calls: the fewer, the better. If fixed-size members are consecutive, consolidate the calls to reduce the number of calls to GetSpanReference/Advance.


  40. Complete Serialize
    var writer = new MemoryPackWriter<TBufferWriter>(ref bufferWriter);
    writer.WriteValue(value);
    writer.Flush();
    When you Flush (calling the original IBufferWriter's Advance and synchronously confirming the actually written area), the serialization process is complete.


  41. Other overloads
    The byte[] and Stream overloads internally pass a pooled ReusableLinkedArrayBufferWriter through Serialize:
    var bufferWriter = ReusableLinkedArrayBufferWriterPool.Rent();
    var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufferWriter);
    writer.WriteValue(value);
    writer.Flush();
    await bufferWriter.WriteToAndResetAsync(stream);  // Stream overload
    return bufferWriter.ToArrayAndReset();            // byte[] overload


  42. ReusableLinkedArrayBufferWriter
    public sealed class ReusableLinkedArrayBufferWriter : IBufferWriter<byte>
    {
        List<BufferSegment> buffers;
    }
    struct BufferSegment
    {
        byte[] buffer; // rented via ArrayPool<byte>.Shared.Rent
        int written;
    }
    GetSpan() returns space in the current chunk; chunks are byte[]s rented from ArrayPool<byte>.Shared.
    If you only want the final concatenated array (or are writing to a Stream), the internal buffer can be represented as linked chunks rather than a List<T>-style grow-and-copy, since it does not have to be one contiguous block of memory. This reduces the number of copies.
    NOTE: When the buffer runs out, don't link fixed-size chunks just because they are linked (or out of LOH worries); rent chunks of double the previous size. Otherwise, for large write results the number of linked-list elements grows too large and performance deteriorates.
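A rough sketch of the doubling strategy from the NOTE above (GrowingChunkList is a hypothetical illustration, not MemoryPack's actual type):

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

sealed class GrowingChunkList
{
    readonly List<byte[]> chunks = new();
    int nextSize = 4096;

    // Rent the next chunk from the shared pool; double the requested
    // size each time so the chunk count stays logarithmic in total size.
    public byte[] AddChunk()
    {
        var chunk = ArrayPool<byte>.Shared.Rent(nextSize);
        chunks.Add(chunk);
        nextSize *= 2;
        return chunk;
    }

    // Return all rented chunks so the writer can be reused.
    public void ReturnAll()
    {
        foreach (var c in chunks) ArrayPool<byte>.Shared.Return(c);
        chunks.Clear();
        nextSize = 4096;
    }
}
```

Note that ArrayPool may hand back arrays larger than requested, which is fine here: the written count per segment tracks actual usage.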


  43. ToArray / WriteTo
    Since the final size is known, allocate only the final result and copy into it, or write the chunks to a Stream:
    var result = new byte[a.Length + b.Length + c.Length];
    a.CopyTo(result); b.CopyTo(result); c.CopyTo(result);
    // or
    await stream.WriteAsync(a);
    await stream.WriteAsync(b);
    await stream.WriteAsync(c);
    The finished working arrays are no longer needed, so they are returned to the pool (ArrayPool<byte>.Shared.Return).


  44. Improve LINQ ToArray
    Enumerable.ToArray converts an IEnumerable<T> with an indefinite number of elements into a T[]. Conventionally, when the internal T[] overflowed it was expanded, but couldn't the T[] be built from concatenated chunks, in the same way as above?
    I submitted a PR to dotnet/runtime: https://github.com/dotnet/runtime/pull/90459
    A dramatic 30~60% performance improvement; might be included in .NET 9?
    NOTE: LINQ's ToArray has already been optimized in various ways, estimating the number of elements where possible and, when it can be estimated, allocating an array of the exact size up front. The size estimation is not as simple as an ICollection<T> check; it has more complex branches depending on the method chain, e.g. the size is known for Enumerable.Range, can be bounded for Take, and so on.


  45. Avoid aggressive use of Pools
    with InlineArray (C# 12)
    Since this is to be incorporated into the runtime, extensive use of Pools was avoided. Instead of a reusable linked array, InlineArray from C# 12 was adopted.
    [InlineArray(29)]
    struct ArrayBlock<T>
    {
        private T[] array;
    }
    Roughly speaking, this enables a stack-allocated fixed array of arrays (in other words, T[][]).
    Because a List<T[]> (or something like it) would cause extra allocations, the proposal would have been harder to make. Allocating the T[][] in the stack area eliminated the need to allocate the linked list itself.
    However, InlineArray only allows a fixed size specified at compile time. Therefore, '29' was adopted as the size ....


  46. 29
    Starting from 4 and repeatedly doubling the chunk size, 29 chunks are enough to reach the maximum value (.NET's array size limit is just a little less than int.MaxValue: 2147483591).
    Since ToArray of an IEnumerable<T> always adds one element at a time, each array is guaranteed to be completely filled before the next array is linked, with no gaps. Therefore, it is absolutely impossible for InlineArray(29) to run out of space.
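The arithmetic behind '29' can be checked directly (ChunkMath is an illustrative helper, not runtime code):

```csharp
using System;

static class ChunkMath
{
    // Chunk sizes start at 4 and double: 4, 8, 16, ...
    // Count how many chunks are needed before their total capacity
    // reaches .NET's maximum array length (2147483591).
    public static int ChunksNeeded()
    {
        long capacity = 0;
        long chunk = 4;
        int chunks = 0;
        while (capacity < 2147483591L)
        {
            capacity += chunk;
            chunk *= 2;
            chunks++;
        }
        return chunks;
    }
}
```

After 28 chunks the capacity is 4 * (2^28 - 1) = 1,073,741,820, still short of the limit; the 29th chunk brings it to 2,147,483,644, which covers any possible array.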


  47. I/O Read


  48. No Stream Again
    Performance is determined by synchronous buffers and asynchronous reading and writing. Don't mix I/O and deserialization: calling ReadAsync each time would be too slow. MemoryPackSerializer.Deserialize(Stream) constructs a ReadOnlySequence<byte> first and then feeds it into the deserialization process. We target only synchronous buffers that have already been read.
    public static partial class MemoryPackSerializer
    {
        public static T? Deserialize<T>(ReadOnlySpan<byte> buffer)
        public static int Deserialize<T>(in ReadOnlySequence<byte> buffer, ref T? value)
        public static async ValueTask<T?> DeserializeAsync<T>(Stream stream)
    }
    NOTE: Not mixing I/O and deserialization means that true streaming deserialization with undefined length or minimal buffering is not possible. Instead, MemoryPack provides, as a supplementary mechanism, a deserialization API that buffers in window widths and returns an IAsyncEnumerable<T>.


  49. ReadOnlySequence<T>
    Like a concatenated T[]. By entrusting buffer handling to it in combination with System.IO.Pipelines, it can be treated like a connected T[] that can be sliced at any position.
    ReadOnlySequence<T> is not always fast, however, so it is necessary to find ways to reduce the number of Slice calls.
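A minimal sketch of building a multi-segment ReadOnlySequence<byte> by hand, the way pipelines-style readers do internally (the Segment class is illustrative):

```csharp
using System;
using System.Buffers;

// Each Segment links to the next; RunningIndex is the absolute offset
// of this segment within the whole sequence.
sealed class Segment : ReadOnlySequenceSegment<byte>
{
    public Segment(byte[] buffer) => Memory = buffer;

    public Segment Append(byte[] buffer)
    {
        var next = new Segment(buffer) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}
```

Two separate byte[] chunks then behave as one logical buffer that can be sliced at any position, without ever being copied into a contiguous array.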


  50. Flow of Deserialize
    This MemoryPackReader is important!


  51. MemoryPackReader
    Buffer management for reading; set ReadOnlySequence<byte> as the source.
    public ref partial struct MemoryPackReader
    {
        ReadOnlySequence<byte> bufferSource;
        ref byte bufferReference;
        int bufferLength;
        ref byte GetSpanReference(int sizeHint);
        void Advance(int count);
        public MemoryPackReader(in ReadOnlySequence<byte> source)
        public MemoryPackReader(ReadOnlySpan<byte> buffer)
    }
    public readonly struct ReadOnlySequence<T>
    {
        ReadOnlySpan<T> FirstSpan { get; }
        ReadOnlySequence<T> Slice(long start);
    }
    Similar to the MemoryPackWriter: for example, when reading data like an int, request the maximum required buffer with GetSpanReference, read, then report the amount read with Advance.
    public void ReadUnmanaged<T1>(out T1 value1)
        where T1 : unmanaged
    {
        var size = Unsafe.SizeOf<T1>();
        ref var spanRef = ref GetSpanReference(size);
        value1 = Unsafe.ReadUnaligned<T1>(ref spanRef);
        Advance(size);
    }


  52. MemoryPackReader
    Because frequent calls to Slice on ReadOnlySequence<byte> are slow, secure the entire current block as FirstSpan inside MemoryPackReader and suppress the number of calls into the ReadOnlySequence.
    NOTE: Naturally, a read request can exceed the FirstSpan. Since MemoryPack's deserialization requires a contiguous memory area, in such cases the actual MemoryPack copies to a temporary area borrowed from the pool and assigns it to the ref byte bufferReference.


  53. Reader I/O in Application


  54. Efficient Read is challenging
    while (true)
    {
        var read = await socket.ReceiveAsync(buffer);
        var span = buffer.AsSpan(0, read);
        // ...
    }
    Handling incomplete reads: the amount read here may not fill one message block, and it is not always permitted to read to the end of the stream. If you call ReceiveAsync again and pack more data into the buffer, what happens when it exceeds the buffer? If you keep resizing, it grows without bound; can you guarantee that a moment will come when you can reset it to 0?


  55. A reader that returns ReadOnlySequence
    Concatenating incomplete blocks
    If the size of one message is known (i.e., the Length is
    written in the header as a protocol), it can be converted to
    a command to read at least a certain size (ReadAtLeast).
async Task ReadLoopAsync()
{
    while (true)
    {
        ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
        // do anything
    }
}
If the data comes as a ReadOnlySequence<byte>, you can feed it
into anything that supports it. For example, most modern
serializers support ReadOnlySequence<byte>.
NOTE: Serializers that do not support ReadOnlySequence<byte> are considered legacy
and should be discarded. Of course, MessagePack for C# and MemoryPack support it.
NOTE: System.IO.Pipelines is what takes care of the related plumbing.
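The NOTE above can be sketched directly with System.IO.Pipelines; here a Pipe stands in for the socket (this is only the reading half, under the assumption of .NET 7+ where PipeReader.ReadAtLeastAsync is available):

```csharp
using System;
using System.Buffers;
using System.IO.Pipelines;

// A Pipe stands in for the socket. PipeReader.ReadAtLeastAsync (.NET 7+)
// hands back a ReadOnlySequence<byte> containing at least the requested bytes.
var pipe = new Pipe();
await pipe.Writer.WriteAsync(new byte[] { 1, 2, 3, 4, 5 });

ReadResult result = await pipe.Reader.ReadAtLeastAsync(4);
ReadOnlySequence<byte> buffer = result.Buffer;
Console.WriteLine(buffer.Length); // at least 4; here 5

// Report what was consumed/examined so the next read continues correctly.
pipe.Reader.AdvanceTo(buffer.End);
```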


56. Assuming a protocol where the message type comes
first and dispatch is done based on it: how do you
determine the message type when it is a string? (Text protocols; for
example, Redis and NATS adopt text protocols.)
Determining the type
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    if (code == ServerOpCodes.Msg)
    {
    //…
    }
    }
    }
You can determine it simply by converting to a
string, and converting to an enum makes it easy to use
later. This example is from NATS: by cleverly
padding every opcode to 4 characters, including symbols and
spaces, the protocol ensures it can be determined by
ReadAtLeastAsync(4).
ServerOpCodes GetCode(ReadOnlySequence<byte> buffer)
{
    var span = GetSpan(buffer);
    var str = Encoding.UTF8.GetString(span);
    return str switch
    {
        "INFO" => ServerOpCodes.Info,
        "MSG " => ServerOpCodes.Msg,
        "PING" => ServerOpCodes.Ping,
        "PONG" => ServerOpCodes.Pong,
        "+OK\r" => ServerOpCodes.Ok,
        "-ERR" => ServerOpCodes.Error,
        _ => throw new InvalidOperationException()
    };
}



But turning it into a string (stringization)
incurs a memory allocation on every message.
You should absolutely avoid
this!!!


59. Take 2: Compare ReadOnlySpan<byte>
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    if (code == ServerOpCodes.Msg)
    {
    //…
    }
    }
    }
With C# 11's UTF-8 string literals (u8), you
can get a ReadOnlySpan<byte> as a
constant. If you move the opcodes that match
most frequently to the top
of the if chain, the cost of the
checks is reduced as well. Also,
SequenceEqual on
ReadOnlySpan<byte> (unlike the
LINQ one) compares quite speedily.
ServerOpCodes GetCode(ReadOnlySequence<byte> buffer)
{
    var span = GetSpan(buffer);
    if (span.SequenceEqual("MSG "u8)) return ServerOpCodes.Msg;
    if (span.SequenceEqual("PONG"u8)) return ServerOpCodes.Pong;
    if (span.SequenceEqual("INFO"u8)) return ServerOpCodes.Info;
    if (span.SequenceEqual("PING"u8)) return ServerOpCodes.Ping;
    if (span.SequenceEqual("+OK\r"u8)) return ServerOpCodes.Ok;
    if (span.SequenceEqual("-ERR"u8)) return ServerOpCodes.Error;
    throw new InvalidOperationException();
}



61. Convert the first 4 chars to an int
// msg = ReadOnlySpan<byte>
if (Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(msg)) == 1330007625) // INFO
{
}
internal static class ServerOpCodes
{
    public const int Info = 1330007625; // "INFO"
    public const int Msg = 541545293; // "MSG "
    public const int Ping = 1196312912; // "PING"
    public const int Pong = 1196314448; // "PONG"
    public const int Ok = 223039275; // "+OK\r"
    public const int Error = 1381123373; // "-ERR"
}
If you include the trailing characters
(a space or \r), every NATS OpCode can be
determined with exactly 4 bytes (one int), so you can
precompute a set of int constants in
advance.
Direct int conversion from ReadOnlySpan<byte>
Comparing after stringifying is out of the question,
and since 4 bytes suffice,
comparing as an int is the fastest.
NOTE: Well, the best protocol would be a binary one, where the first byte
represents the type... Text protocols are not good.
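The constants above don't need to be computed by hand; as a sketch, reading the 4 opcode bytes as a little-endian int reproduces them exactly:

```csharp
using System;
using System.Buffers.Binary;

// Read the 4 UTF-8 opcode bytes as one little-endian int. The results match
// the precomputed ServerOpCodes constants on the slide.
int info = BinaryPrimitives.ReadInt32LittleEndian("INFO"u8);
int msg = BinaryPrimitives.ReadInt32LittleEndian("MSG "u8);
Console.WriteLine(info); // 1330007625
Console.WriteLine(msg);  // 541545293
```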


  62. async/await and inlining
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    await DispatchCommandAsync(code, buffer);
    }
    }
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
    {
    }
The part where data is read from the socket (the
actual code is a bit more complex, so we want
to separate it from the processing part).
In this method, the message is parsed in detail and
the actual processing is done (deserializing the
payload, invoking callbacks, and so on).



  64. asynchronous state machine generation
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    await DispatchCommandAsync(code, buffer);
    }
    }
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
    {
    }
Awaiting inside the loop does not create a new asynchronous state
machine per iteration, so you can await there as much as you want.
If a method declared with async actually completes asynchronously, an
asynchronous state machine is allocated for each call, so there is
extra allocation.
If the awaited method is an async method backed by
IValueTaskSource, it can be designed so that no
asynchronous state machine is allocated even when you
await it directly.


  65. Inlining await in the hot path
    async Task ReadLoopAsync()
    {
    while (true)
    {
ReadOnlySequence<byte> buffer = await socketReader.ReadAtLeastAsync(4);
    var code = GetCode(buffer);
    if (code == ServerOpCodes.Msg)
    {
    await DoAnything();
    await DoAnything();
    }
    else
    {
    await DispatchCommandAsync(code, buffer);
    }
    }
    }
    [AsyncMethodBuilderAttribute(typeof(PoolingAsyncValueTaskMethodBuilder))]
async ValueTask DispatchCommandAsync(int code, ReadOnlySequence<byte> buffer)
    { }
Since 90% of loop iterations receive Msg (the rest, such as PING or
ERROR, arrive only rarely), only Msg is inlined to aim for
maximum efficiency.
The other cases are split into a separate method, but marking it with
PoolingAsyncValueTaskMethodBuilder, available from .NET 6, makes the
asynchronous state machine pooled and reusable.
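A minimal sketch of opting a method into the pooled builder (the Dispatcher class and its body are stand-ins; only the attribute placement matters):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

var d = new Dispatcher();
await d.DispatchAsync();
Console.WriteLine(d.Count); // 1

class Dispatcher
{
    public int Count;

    // .NET 6+: the boxed state machine is rented from / returned to a pool
    // instead of freshly allocated on every asynchronous completion.
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
    public async ValueTask DispatchAsync()
    {
        await Task.Yield();
        Count++;
    }
}
```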


  66. Optimize for All Types


  67. Source Generator based
    Automatically generate at compile time the
    Serialize and Deserialize code optimized for
    each [MemoryPackable] type.
static abstract members (from C# 11)


  68. IL.Emit vs SourceGenerator
IL.Emit
Dynamic assembly generation using type information at runtime
IL black magic that has been available since the early days of .NET
Not usable in environments where dynamic code generation is not allowed (iOS, WASM,
NativeAOT, etc.)
SourceGenerator
Generates C# code from the AST at compile time
Came into extensive use around .NET 6
Since it emits pure C# code, it can be used in all environments.
Given the diversification of environments where .NET runs, and since there is no startup-speed penalty, it is desirable to
move toward Source Generators as much as possible.
Not being able to use runtime information can make it difficult to generate equivalent code, especially around generics,
but let's overcome that with some ingenuity...


  69. Optimize for All Types
For example, collections could mostly be handled with
just one formatter for IEnumerable<T>, but by writing the
optimal implementation for each concrete collection, you can run the
highest-performance code. The interface-based implementations are only
required when asked to process unknown types.


  70. Fast Enumeration of Array
Normally, bounds checks are inserted when accessing
elements of a C# array. However, the JIT compiler
removes the bounds check when it can prove
that an access never goes out of bounds (for example,
a for loop over .Length).
foreach over an array (or Span<T>) is converted at compile
time to the same IL as a for loop, so the two are
completely identical.
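A sketch of the loop shape the JIT recognizes (the elision itself happens in the generated machine code; the snippet just shows the pattern):

```csharp
using System;

// The i < xs.Length loop condition lets the JIT prove every xs[i] access is
// in range, so the per-element bounds check disappears from the compiled loop.
static int Sum(int[] xs)
{
    var sum = 0;
    for (int i = 0; i < xs.Length; i++)
        sum += xs[i];
    return sum;
}

Console.WriteLine(Sum(new[] { 1, 2, 3, 4 })); // 10
```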


71. Optimize for List<T> / Serialize
public sealed class ListFormatter<T> : MemoryPackFormatter<List<T?>>
{
    public override void Serialize(
        ref MemoryPackWriter writer, scoped ref List<T?>? value)
    {
        if (value == null)
        {
            writer.WriteNullCollectionHeader();
            return;
        }
        var span = CollectionsMarshal.AsSpan(value);
        var formatter = GetFormatter<T?>();
        writer.WriteCollectionHeader(span.Length);
        for (int i = 0; i < span.Length; i++)
        {
            formatter.Serialize(ref writer, ref span[i]);
        }
    }
}
Fastest List<T> iteration
CollectionsMarshal.AsSpan
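A small sketch of the technique (note: the list must not be mutated while the span is alive, since AsSpan bypasses the list's safety checks):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Iterate the List<T>'s backing array directly as a Span<T>,
// skipping the indexer's bounds/version checks.
var list = new List<int> { 1, 2, 3, 4 };
var span = CollectionsMarshal.AsSpan(list);
var sum = 0;
for (int i = 0; i < span.Length; i++)
    sum += span[i];
Console.WriteLine(sum); // 10
```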


72. public override void Deserialize(ref MemoryPackReader reader, scoped ref List<T?>? value)
{
    if (!reader.TryReadCollectionHeader(out var length))
    {
        value = null;
        return;
    }
    value = new List<T?>(length);
    CollectionsMarshal.SetCount(value, length);
    var span = CollectionsMarshal.AsSpan(value);
    var formatter = GetFormatter<T?>();
    for (int i = 0; i < length; i++)
    {
        formatter.Deserialize(ref reader, ref span[i]);
    }
}
Optimize for List<T> / Deserialize
Adding to a List<T> one element at a time is slow. By making it
handleable as a Span<T>, List<T> deserialization becomes
as fast as deserializing an array.
Just calling new List<T>(capacity) leaves the internal size at
0, so CollectionsMarshal.AsSpan would only return a
Span of length 0, which is useless.
By forcibly setting the internal size with
CollectionsMarshal.SetCount, which was added
in .NET 8, you can skip Add and extract the Span.
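A minimal sketch of the SetCount + AsSpan combination (requires .NET 8):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// new List<int>(4) has Count == 0; SetCount forces Count to 4 so that
// AsSpan exposes all four slots, which we then fill without any Add calls.
var list = new List<int>(capacity: 4);
CollectionsMarshal.SetCount(list, 4);
var span = CollectionsMarshal.AsSpan(list);
for (int i = 0; i < span.Length; i++)
    span[i] = i * 10;
Console.WriteLine(string.Join(",", list)); // 0,10,20,30
```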


73. public override void Deserialize(ref MemoryPackReader reader, scoped ref List<T?>? value)
{
    if (!reader.TryReadCollectionHeader(out var length))
    {
        value = null; return;
    }
    value = new List<T?>(length);
    CollectionsMarshal.SetCount(value, length);
    var span = CollectionsMarshal.AsSpan(value);
    if (!RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        var byteCount = length * Unsafe.SizeOf<T>();
        ref var src = ref reader.GetSpanReference(byteCount);
        ref var dest = ref Unsafe.As<T?, byte>(ref MemoryMarshal.GetReference(span)!);
        Unsafe.CopyBlockUnaligned(ref dest, ref src, (uint)byteCount);
        reader.Advance(byteCount);
    }
    else
    {
        var formatter = GetFormatter<T?>();
        for (int i = 0; i < length; i++)
        {
            formatter.Deserialize(ref reader, ref span[i]);
        }
    }
}
Actual code of ListFormatter
MemoryPack's binary specification can handle an
unmanaged-type T[] with nothing but a memory copy.
By extracting the Span, even List<T> can be
deserialized with a memory copy.
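The unmanaged fast path above boils down to a single block copy; a sketch with plain spans (little-endian assumed):

```csharp
using System;
using System.Runtime.InteropServices;

// Eight raw bytes become two ints in one copy - no per-element deserialization.
ReadOnlySpan<byte> src = stackalloc byte[] { 1, 0, 0, 0, 2, 0, 0, 0 };
Span<int> dest = stackalloc int[2];
src.CopyTo(MemoryMarshal.AsBytes(dest));
Console.WriteLine($"{dest[0]},{dest[1]}"); // 1,2 on little-endian
```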


  74. String / UTF8
    SIMD
    FFI(DllImport/LibraryImport)
    Channel
    More async/await


  75. Conclusion


  76. Expanding the possibilities of C#
Languages evolve, and techniques evolve
C# has great potential and continues to stand at the forefront of
the competition among programming languages
Building a strong ecosystem is crucial
In the modern era where open source software (OSS) is central, the
vitality of the ecosystem is decisive.
The evolution of the language/runtime and of OSS are the two wheels that drive progress
forward.
It's no longer the era where we can simply rely on Microsoft or Unity.

