java.lang.String naturally holds its string data internally. Do you know how that data is actually stored in a String, and what happens when a String is instantiated from a byte array?
In Java 8, java.lang.String stores its string data as a 16-bit char array. https://github.com/openjdk/jdk8u/blob/4fa5109b8deee6539ba1765f27aa6cb641010221/jdk/src/share/classes/java/lang/String.java#L111-L114
When instantiating a String from a byte array, StringCoding.decode() is called. https://github.com/openjdk/jdk8u/blob/4fa5109b8deee6539ba1765f27aa6cb641010221/jdk/src/share/classes/java/lang/String.java#L459-L464
In the case of US_ASCII, sun.nio.cs.US_ASCII.Decoder.decode() is ultimately called, which copies the bytes of the source byte array into a char array one by one. https://github.com/openjdk/jdk8u/blob/master/jdk/src/share/classes/sun/nio/cs/US_ASCII.java#L137-L148
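To make the per-element copy concrete, here is a minimal sketch of that kind of loop. It is not the JDK's code: the class name Java8StyleAsciiDecode and the helper decodeAscii are hypothetical, and the use of U+FFFD for non-ASCII bytes is an assumption made for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Java8StyleAsciiDecode {
    // Hypothetical helper, not the JDK's code: widens each byte into a 16-bit char.
    static char[] decodeAscii(byte[] src, int offset, int length) {
        char[] dst = new char[length];
        for (int i = 0; i < length; i++) {
            byte b = src[offset + i];
            // Assumption for this sketch: bytes outside 7-bit ASCII become U+FFFD.
            dst[i] = (b >= 0) ? (char) b : '\uFFFD';
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.US_ASCII);
        char[] chars = decodeAscii(bytes, 0, bytes.length);
        System.out.println(new String(chars));
    }
}
```

Because the source holds bytes and the destination holds chars, every element has to be converted individually, so a plain block copy of the array is not possible here.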
Internal structure of java.lang.String in Java 9 or later
In Java 9 or later, java.lang.String stores its string data as an 8-bit byte array. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/String.java#L125-L140
When instantiating a String from a byte array in Java 9 or later, StringCoding.decode() is also called. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/String.java#L502-L510
The bytes are then copied from the source byte array using Arrays.copyOfRange(), as both the source and the destination are byte arrays. Arrays.copyOfRange() internally uses System.arraycopy(), a native method that is very fast. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/StringCoding.java#L521-L534
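The copy can be observed from the outside with the public constructor alone. The following is a minimal sketch; the class name CopyOnConstruct and both helper methods are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CopyOnConstruct {
    // Hypothetical helper: the same kind of byte-to-byte range copy described above.
    static byte[] copyRange(byte[] src, int offset, int length) {
        return Arrays.copyOfRange(src, offset, offset + length);
    }

    // Builds a String with the public constructor, then overwrites the source array.
    static String buildThenMutateSource(byte[] src) {
        String s = new String(src, 0, src.length, StandardCharsets.US_ASCII);
        Arrays.fill(src, (byte) 'x'); // does not affect s, because s holds its own copy
        return s;
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(buildThenMutateSource(bytes)); // the String keeps its original content
    }
}
```

The String keeps its content after the source array is overwritten, which is exactly the safety that the copy buys and the cost that a zero-copy approach tries to avoid.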
new String(byte bytes[], int offset, int length, Charset charset) in Java 9 or later copies the byte array. Even though it uses System.arraycopy(), a fast native method, the copy still takes some time.
Is zero-copy String instantiation from a byte array possible? While reading the source code of Apache Fury, which is "a blazingly-fast multi-language serialization framework powered by JIT (just-in-time compilation) and zero-copy", I found that its StringSerializer achieves zero-copy String instantiation. Let's look into the implementation.
The key to this method is to call the non-public constructor new String(byte[] value, byte coder), which sets the internal value to the source byte array without any copy. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/String.java#L3252-L3255 No byte copy occurs.
is similar to BYTES_STRING_ZERO_COPY_CTR. StringSerializer.newBytesStringZeroCopy() only calls a function, BYTES_STRING_ZERO_COPY_CTR, with the source byte array.
BYTES_STRING_ZERO_COPY_CTR is a BiFunction returned from getBytesStringZeroCopyCtr(). https://github.com/apache/fury/blob/3865dcd0982c0ca9de04a8ba9635892b76288769/java/fury-core/src/main/java/org/apache/fury/serializer/StringSerializer.java#L618C1-L621C40 It gets a MethodHandle for the non-public String constructor new String(byte[] value, byte coder). Calling the MethodHandle via a CallSite is faster than calling handle.invokeExact() directly.
• Call the non-public new String(byte[] value, byte coder) using a MethodHandle to avoid the byte array copy.
• Minimize the cost of MethodHandle invocation as much as possible by invoking the MethodHandle via a CallSite created with LambdaMetafactory.metafactory().
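The two steps can be sketched as follows. This is not Fury's code: to stay runnable without extra JVM flags, it targets a hypothetical stand-in class Text with a private (byte[], byte) constructor instead of java.lang.String, and the names ZeroCopyCtorSketch and zeroCopyCtr are assumptions. It assumes Java 9 or later, because it uses MethodHandles.privateLookupIn().

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.BiFunction;

final class ZeroCopyCtorSketch {
    // Hypothetical stand-in for java.lang.String: the private constructor keeps the array as-is.
    static final class Text {
        final byte[] value;
        final byte coder;

        private Text(byte[] value, byte coder) {
            this.value = value; // no copy
            this.coder = coder;
        }
    }

    // Wraps the private constructor in a BiFunction spun by LambdaMetafactory.
    @SuppressWarnings("unchecked")
    static BiFunction<byte[], Byte, Text> zeroCopyCtr() {
        try {
            // A lookup with private access to the class that owns the constructor.
            MethodHandles.Lookup lookup =
                MethodHandles.privateLookupIn(Text.class, MethodHandles.lookup());
            MethodHandle ctor = lookup.findConstructor(
                Text.class, MethodType.methodType(void.class, byte[].class, byte.class));
            CallSite site = LambdaMetafactory.metafactory(
                lookup,
                "apply",                                  // BiFunction's single abstract method
                MethodType.methodType(BiFunction.class),  // factory type: () -> BiFunction
                MethodType.methodType(Object.class, Object.class, Object.class), // erased apply
                ctor,                                     // implementation: the constructor
                MethodType.methodType(Text.class, byte[].class, Byte.class));    // specialized apply
            return (BiFunction<byte[], Byte, Text>) site.getTarget().invokeExact();
        } catch (Throwable t) {
            throw new IllegalStateException("could not create constructor function", t);
        }
    }

    public static void main(String[] args) {
        byte[] data = {104, 105};
        Text text = zeroCopyCtr().apply(data, (byte) 0);
        System.out.println(text.value == data); // the very same array instance is held
    }
}
```

Doing the same against java.lang.String additionally requires reflective access to java.base (for example by opening java.lang to the calling module), which this sketch deliberately avoids. The returned BiFunction would typically be created once and cached in a static final field, so that the lookup and metafactory cost is not paid per call.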
A byte array passed to the non-public new String(byte[] value, byte coder) is shared by multiple objects: the new String object and any other object holding a reference to the byte array. Because the array remains mutable, the string's content could change unexpectedly.
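The hazard can be illustrated with a small sketch. It uses a hypothetical stand-in class Text rather than a real zero-copy String, and the names SharedArrayHazard, wrapZeroCopy and wrapWithCopy are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

final class SharedArrayHazard {
    // Hypothetical stand-in for a zero-copy String: it keeps whatever array it is given.
    static final class Text {
        private final byte[] value;

        Text(byte[] value) {
            this.value = value;
        }

        String render() {
            return new String(value, StandardCharsets.US_ASCII);
        }
    }

    static Text wrapZeroCopy(byte[] source) {
        return new Text(source);         // shares the caller's array
    }

    static Text wrapWithCopy(byte[] source) {
        return new Text(source.clone()); // owns a private copy
    }

    public static void main(String[] args) {
        byte[] buffer = "hello".getBytes(StandardCharsets.US_ASCII);
        Text shared = wrapZeroCopy(buffer);
        Text copied = wrapWithCopy(buffer);
        buffer[0] = 'j'; // e.g. the buffer is reused for the next message
        System.out.println(shared.render()); // content follows the buffer
        System.out.println(copied.render()); // content is unaffected
    }
}
```

Zero copy is therefore only safe when the caller hands the array over and never writes to it again.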
- If you're using Java 8, upgrade to a newer version as soon as possible, in terms of the performance of String instantiation.
- There is a technique to instantiate a String from a byte array with zero copy. It's blazing-fast.