Inside java.lang.String: Understanding and Optimizing Instantiation Performance

Mitsunori Komatsu

December 13, 2024

Transcript

  1. java.lang.String is probably one of the most used classes in Java. Naturally, it holds its string data internally. Do you know how that data is actually stored in a String, and what happens when a String is instantiated from a byte array?
  2. Internal structure of java.lang.String in Java 8 or earlier

    In Java 8, java.lang.String stores its string data as a 16-bit char array. https://github.com/openjdk/jdk8u/blob/4fa5109b8deee6539ba1765f27aa6cb641010221/jdk/src/share/classes/java/lang/String.java#L111-L114
  3. Internal structure of java.lang.String in Java 8 or earlier

    When a String is instantiated from a byte array, StringCoding.decode() is called. https://github.com/openjdk/jdk8u/blob/4fa5109b8deee6539ba1765f27aa6cb641010221/jdk/src/share/classes/java/lang/String.java#L459-L464
  4. Internal structure of java.lang.String in Java 8 or earlier

    For US_ASCII, the call eventually reaches sun.nio.cs.US_ASCII.Decoder.decode(), which copies the bytes of the source byte array into a char array one by one. https://github.com/openjdk/jdk8u/blob/master/jdk/src/share/classes/sun/nio/cs/US_ASCII.java#L137-L148
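The per-byte copy described on this slide can be pictured with a small sketch. It is a simplified illustration, not the JDK source: the class and the helper `decodeAsciiToChars` are hypothetical names, and the use of U+FFFD for non-ASCII bytes is an assumption about typical replacement behavior.

```java
class Java8AsciiDecodeSketch {
    // Hypothetical helper (not JDK code): widen each 8-bit byte into a 16-bit char.
    static char[] decodeAsciiToChars(byte[] src, int offset, int length) {
        char[] dst = new char[length];
        for (int i = 0; i < length; i++) {
            byte b = src[offset + i];
            // Negative values are bytes 0x80-0xFF, which are not ASCII.
            // Assumption: they are replaced with U+FFFD.
            dst[i] = (b >= 0) ? (char) b : '\uFFFD';
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] bytes = {'J', 'a', 'v', 'a'};
        // The char array is what a Java 8 String would hold internally.
        System.out.println(new String(decodeAsciiToChars(bytes, 0, bytes.length)));
    }
}
```

The loop touches every element and doubles the storage (one byte becomes one two-byte char), which is the cost the later slides compare against.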
  5. Internal structure of java.lang.String in Java 9 or later

    In Java 9 or later, java.lang.String stores its string data as an 8-bit byte array. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/String.java#L125-L140
  6. Internal structure of java.lang.String in Java 9 or later

    When a String is instantiated from a byte array, StringCoding.decode() is called here as well. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/String.java#L502-L510
  7. Internal structure of java.lang.String in Java 9 or later

    For US_ASCII, StringCoding.decodeASCII() is called. Because both the source and the destination are byte arrays, it can copy the source with Arrays.copyOfRange(), which internally uses System.arraycopy(), a native method that is very fast. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/StringCoding.java#L521-L534
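As a rough illustration of this byte-to-byte path, the sketch below checks that the input is pure ASCII and then makes a single bulk copy. The class and the helper `decodeAsciiToBytes` are hypothetical names, and returning `null` for non-ASCII input is a simplification for this sketch rather than what the JDK does.

```java
import java.util.Arrays;

class CompactAsciiDecodeSketch {
    // Hypothetical helper (not JDK code): returns the one-byte-per-character
    // array a compact String could hold, or null if a non-ASCII byte is found.
    static byte[] decodeAsciiToBytes(byte[] src, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            if (src[i] < 0) {
                return null; // byte in 0x80-0xFF: the bulk-copy path does not apply
            }
        }
        // One bulk copy instead of a per-element widening loop.
        return Arrays.copyOfRange(src, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] bytes = {'J', 'a', 'v', 'a'};
        byte[] copy = decodeAsciiToBytes(bytes, 0, bytes.length);
        System.out.println(Arrays.equals(bytes, copy) && copy != bytes);
    }
}
```

The result has the same contents as the source but is a distinct array, so one copy still takes place.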
  8. What's "COMPACT_STRINGS"?

    This improvement, introduced in Java 9, is called Compact Strings. The feature is enabled by default, but you can disable it with the JVM option -XX:-CompactStrings. https://openjdk.org/jeps/254
  9. The performance of new String(byte[]) in Java 8, 11, 17 and 21

    JMH benchmark with a 512-byte text
  10. The performance of new String(byte[]) in Java 8, 11, 17 and 21

    More than 3.5 times faster than Java 8
  11. Is there no room left to improve the performance of String instantiation from a byte array?

    new String(byte bytes[], int offset, int length, Charset charset) in Java 9 or later copies the byte array. Even though it uses System.arraycopy(), a fast native method, the copy still takes some time. Is zero-copy String instantiation from a byte array possible? When I read the source code of Apache Fury, which is "a blazingly-fast multi-language serialization framework powered by JIT (just-in-time compilation) and zero-copy", I found that its StringSerializer achieves zero-copy String instantiation. Let's look into the implementation.
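The copy made by the public constructor can be observed from ordinary code. In this small sketch (the class and method names are made up for illustration), the source array is modified after the String is built, and the String is unaffected.

```java
import java.nio.charset.StandardCharsets;

class PublicConstructorCopies {
    // Builds a String from src, then overwrites the first byte of src.
    static String decodeThenMutate(byte[] src) {
        String s = new String(src, 0, src.length, StandardCharsets.US_ASCII);
        src[0] = (byte) 'j'; // mutate the source after construction
        return s;
    }

    public static void main(String[] args) {
        byte[] src = "hello".getBytes(StandardCharsets.US_ASCII);
        // Prints "hello": the String kept its own copy of the bytes.
        System.out.println(decodeThenMutate(src));
    }
}
```

This defensive copy is what keeps String immutable, and it is also the cost that a zero-copy approach tries to remove.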
  12. What's the goal of org.apache.fury.serializer.StringSerializer?

    The goal is to call the non-public constructor new String(byte[] value, byte coder), which sets the internal value to the source byte array without any copy. https://github.com/openjdk/jdk11u/blob/f12fc3feee66def311164b81d9d0170b550d22ad/src/java.base/share/classes/java/lang/String.java#L3252-L3255 No byte copy occurs.
  13. How does StringSerializer achieve zero-copy initialization?

    StringSerializer.newBytesStringZeroCopy() simply calls the function object BYTES_STRING_ZERO_COPY_CTR with the source byte array. (Annotation on the code: "This function is similar to BYTES_STRING_ZERO_COPY_CTR".) https://github.com/apache/fury/blob/3865dcd0982c0ca9de04a8ba9635892b76288769/java/fury-core/src/main/java/org/apache/fury/serializer/StringSerializer.java#L633-L655
  14. How does StringSerializer achieve zero-copy initialization?

    BYTES_STRING_ZERO_COPY_CTR is initialized to a BiFunction returned from getBytesStringZeroCopyCtr(). https://github.com/apache/fury/blob/3865dcd0982c0ca9de04a8ba9635892b76288769/java/fury-core/src/main/java/org/apache/fury/serializer/StringSerializer.java#L618C1-L621C40 That method gets a MethodHandle for the non-public String constructor new String(byte[] value, byte coder). Calling the MethodHandle via a CallSite is faster than calling handle.invokeExact() directly.
  15. How does StringSerializer achieve zero-copy initialization?

    The points are:
    • Call the non-public new String(byte[] value, byte coder) using a MethodHandle to avoid the byte array copy.
    • Minimize the cost of the MethodHandle invocation as much as possible by invoking it via a CallSite created with LambdaMetafactory.metafactory().
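The first point can be sketched as follows, for Java 9 or later. This is a simplified illustration, not Fury's implementation: it uses a plain MethodHandle and leaves out the LambdaMetafactory step, and the class name, method names, and copying fallback are all assumptions made for this sketch. The coder value 0 for Latin-1 is taken from the OpenJDK sources and is only valid while Compact Strings is enabled. On recent JDKs the lookup is expected to succeed only when the JVM is started with `--add-opens java.base/java.lang=ALL-UNNAMED`; otherwise the sketch falls back to the copying constructor.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.nio.charset.StandardCharsets;

class ZeroCopyStringSketch {
    // Assumption: coder value for Latin-1 in java.lang.String (0 in OpenJDK 9+).
    private static final byte LATIN1 = 0;

    // Handle for the non-public String(byte[] value, byte coder), or null if unavailable.
    private static final MethodHandle CTOR = lookupConstructor();

    private static MethodHandle lookupConstructor() {
        try {
            MethodHandles.Lookup lookup =
                MethodHandles.privateLookupIn(String.class, MethodHandles.lookup());
            return lookup.findConstructor(
                String.class, MethodType.methodType(void.class, byte[].class, byte.class));
        } catch (Throwable t) {
            return null; // e.g. java.lang is not opened to this code
        }
    }

    static boolean zeroCopyAvailable() {
        return CTOR != null;
    }

    // Interprets every byte as one Latin-1 character.
    static String fromLatin1(byte[] bytes) {
        if (CTOR != null) {
            try {
                // No copy: the new String references the caller's array directly.
                return (String) CTOR.invokeExact(bytes, LATIN1);
            } catch (Throwable t) {
                // fall through to the copying path
            }
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] src = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(fromLatin1(src) + " (zero copy: " + zeroCopyAvailable() + ")");
    }
}
```

Both paths return a String with the same contents; only the zero-copy path shares the caller's array, so the array must not be modified afterwards.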
  16. Concerns on org.apache.fury.serializer.StringSerializer

    The internal byte array can be changed unexpectedly! A byte array passed to the non-public new String(byte[] value, byte coder) is shared by multiple owners: the new String object and any other object holding a reference to the byte array. Because the array is mutable, the content of the string could change unexpectedly.
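The hazard can be illustrated without touching String internals. In the sketch below, `SharedText` is a hypothetical stand-in for a zero-copy String: like the non-public constructor, it keeps the caller's array instead of copying it. It is an analogy under that assumption, not a reproduction of Fury's behavior.

```java
import java.nio.charset.StandardCharsets;

class SharedArrayHazardSketch {
    // Hypothetical stand-in for a zero-copy String: stores the caller's array as-is.
    static final class SharedText {
        private final byte[] value;

        SharedText(byte[] value) {
            this.value = value; // no defensive copy
        }

        @Override
        public String toString() {
            return new String(value, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] buffer = "hello".getBytes(StandardCharsets.US_ASCII);
        SharedText text = new SharedText(buffer);
        System.out.println(text); // hello
        buffer[0] = (byte) 'j';   // e.g. the buffer is reused for the next message
        System.out.println(text); // jello: the "immutable" text changed
    }
}
```

A zero-copy String is therefore only safe when the byte array is never written again after it has been handed over.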
  17. Wrap up

    - If you're still on Java 8, move to Java 9 or later where possible: String instantiation is considerably faster.
    - There is a technique to instantiate a String from a byte array with zero copy. It's blazing-fast.