
2020-12-10 SeaJUG - Confusing Java Strings

In this example-driven presentation, I’m going to show you a couple of interesting ("inconceivable"? :)) things about Java Strings. I'll also explain them and give you solutions, so that next time you will be prepared when you see those beasts.

Jonatan Ivanov

May 02, 2022

Transcript

  1. Confusing Java Strings
    “I do not think it means what you think it means”
    tw: @jonatan_ivanov gh: jonatan-ivanov www: develotters.com
    github.com/jonatan-ivanov/java-strings-demo
  2. Agenda
    • Shock, Denial, Anger, Bargaining, Depression
    • 𝕒𝕓𝕔
    • 👩❤☕
    • 👩‍💻❤🍵
    • Testing, Acceptance (understanding)
  3. Java String
    • Java uses UTF-16 to encode a Unicode String
    • Unicode: a standard to represent text
    • UTF-16: a way to encode Unicode characters (later)
    That’s why the size of a Java char is 2 bytes (2x8 = 16 bits).
  4. Unicode
    • Code Point: A unique integer value that identifies a character, e.g.: 65 (letter A)
    • Code Unit: A bit sequence used to encode a character (Code Point), e.g.: 0b1000001 (depends on the encoding)
    UTF-16
    • Unicode Code Points are divided into 17 planes
    • Code Points on the first plane are encoded with one 16-bit Code Unit
    • The rest of the Code Points are encoded with two Code Units
  5. Example
    Character: A
    Unicode Code Point: U+0041
    UTF-16 Code Unit(s): 0041 (hex, one 16-bit Code Unit)

    Character: 𝔸
    Unicode Code Point: U+1D538
    UTF-16 Code Unit(s): D835 DD38 (hex, two 16-bit Code Units)
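
To make the split into two Code Units concrete, here is a minimal sketch (not from the slides) that encodes a Code Point to UTF-16 both with `Character.toChars` and by hand. The class and method names are made up for illustration; the arithmetic is the standard UTF-16 surrogate calculation.

```java
class SurrogatePairSketch {
    // Encodes one code point as UTF-16 code units, formatted as uppercase hex.
    static String utf16Hex(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Cast to int: the hex conversion does not accept a char argument.
            hex.append(String.format("%04X", (int) unit));
        }
        return hex.toString();
    }

    // The same split done by hand; only valid for code points above U+FFFF.
    static String manualSurrogatesHex(int codePoint) {
        int offset = codePoint - 0x10000;        // 20 bits remain
        int high = 0xD800 + (offset >> 10);      // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
        return String.format("%04X %04X", high, low);
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(0x0041));             // 0041
        System.out.println(utf16Hex(0x1D538));            // D835 DD38
        System.out.println(manualSurrogatesHex(0x1D538)); // D835 DD38
    }
}
```

Both routes give the `D835 DD38` pair shown on the slide for 𝔸.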
  6. String::length
    public int length()
    “Returns the length of this string. The length is equal to the number of Unicode code units in the string.” (javadoc)
    So if you have one supplementary character that consists of two code units, the length of that single character is two.
    So the char type is not really what we mean by a character.
  7. Code Points vs. Code Units
    Java       U+004A U+0061 U+0076 U+0061      //4
    Java       004A 0061 0076 0061              //4
    我喜欢茶   U+6211 U+559C U+6B22 U+8336      //4
    我喜欢茶   6211 559C 6B22 8336              //4
    𝕒𝕓𝕔        U+1D552 U+1D553 U+1D554          //3
    𝕒𝕓𝕔        D835 DD52 D835 DD53 D835 DD54    //6
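
The counts in this table can be reproduced with a small sketch like the one below. The class and helper names are hypothetical, and the strings are written with Unicode escapes so the result does not depend on the source file encoding.

```java
class LengthVsCodePoints {
    // Number of UTF-16 code units, which is what String::length reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String latin = "Java";
        String chinese = "\u6211\u559C\u6B22\u8336";                   // the four BMP characters from the slide
        String doubleStruck = "\uD835\uDD52\uD835\uDD53\uD835\uDD54"; // U+1D552 U+1D553 U+1D554

        for (String s : new String[] {latin, chinese, doubleStruck}) {
            System.out.println(codePoints(s) + " code points, " + codeUnits(s) + " code units");
        }
    }
}
```

Only the last string differs: three Code Points but six Code Units.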
  8. What can we do about this?
    1. Do not do String manipulation, it’s tricky
    2. If really needed, you can count the Code Points
    String str = "𝕒𝕓𝕔";
    str.codePointCount(0, str.length()) //3
  9. Consequences (example: "𝔸")
    • Represented by two char values (Code Units); length is two
    • toCharArray() returns a char array (char[]) with two elements
    • charAt(0) and charAt(1) each return a lone surrogate (one half of the pair), which is not a valid character on its own
    • If you do any character manipulation
      ◦ you need to consider this case
      ◦ handle these characters appropriately
    Most of the char manipulation code we ever wrote is probably broken :)
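
One way to handle these characters is to walk the String by Code Point instead of by char. The following is a minimal sketch under that assumption, not code from the talk; the class and method names are invented for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointIteration {
    // Lists the code points of s as U+XXXX labels, using String::codePoints.
    static List<String> codePointLabels(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    // True if the char at index is one half of a surrogate pair.
    static boolean isSurrogateHalf(String s, int index) {
        return Character.isSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        String a = "\uD835\uDD38"; // U+1D538, the double-struck A from the slide
        System.out.println(a.length());            // 2
        System.out.println(isSurrogateHalf(a, 0)); // true
        System.out.println(isSurrogateHalf(a, 1)); // true
        System.out.println(codePointLabels(a));    // [U+1D538]
    }
}
```

`String::codePoints` reassembles surrogate pairs, so the stream yields one value for 𝔸 even though `charAt` sees two chars.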
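  10. Let’s reverse a String
    String buggyReverse(String str) {
        String result = "";
        // Bug: copies char by char, so the two halves of every surrogate pair get swapped
        for (int i = str.length() - 1; i >= 0; i--) {
            result += str.charAt(i);
        }
        return result;
    }

    // Fix: StringBuilder::reverse treats surrogate pairs as single characters
    return new StringBuilder(str).reverse().toString();

    𝕒 𝕓 𝕔 -> ?? ?? ?? (output of buggyReverse)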
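
For comparison, a hand-written reverse that works on Code Points rather than chars could look like the sketch below. It is an illustration only (the class and method names are hypothetical), and in practice `StringBuilder::reverse` already covers this case.

```java
class CodePointReverse {
    // Reverses the string one code point at a time, keeping surrogate pairs intact.
    static String reverseByCodePoint(String str) {
        int[] codePoints = str.codePoints().toArray();
        StringBuilder result = new StringBuilder(str.length());
        for (int i = codePoints.length - 1; i >= 0; i--) {
            result.appendCodePoint(codePoints[i]);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String abc = "\uD835\uDD52\uD835\uDD53\uD835\uDD54"; // U+1D552 U+1D553 U+1D554
        String expected = "\uD835\uDD54\uD835\uDD53\uD835\uDD52";
        System.out.println(reverseByCodePoint(abc).equals(expected));
        System.out.println(new StringBuilder(abc).reverse().toString().equals(expected));
    }
}
```

Both approaches only protect surrogate pairs. A sequence made of several Code Points, such as an emoji joined with a Zero Width Joiner, still comes out in reversed order.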
  11. And what about 👩‍💻❤️🍵?
    👩‍💻 is actually two emojis joined with a Zero Width Joiner (ZWJ) character.
    👩‍💻 -> 👩 ZWJ 💻                      //3
    👩‍💻 -> U+1F469 U+200D U+1F4BB         //3
    👩‍💻 -> D83D DC69 200D D83D DCBB       //5
    ❤️ is actually a ❤ plus a variation selector (U+FE0F) that requests the emoji presentation, which is what makes it red.
    ❤️ -> ❤ mod                           //2
    ❤️ -> U+2764 U+FE0F                   //2
    ❤️ -> 2764 FE0F                       //2
  12. And what about 👩‍💻❤️🍵?
    🍵 is “just” a supplementary character.
    🍵 -> U+1F375       //1
    🍵 -> D83C DF75     //2
    👩‍💻(5) + ❤️(2) + 🍵(2) = 9
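
The arithmetic above can be checked with a short sketch. The class and constant names are made up for illustration, and the emoji are spelled out as Unicode escapes using the Code Units listed on the slides.

```java
class EmojiCounts {
    // U+1F469 ZWJ U+1F4BB: woman, Zero Width Joiner, laptop
    static final String WOMAN_TECHNOLOGIST = "\uD83D\uDC69\u200D\uD83D\uDCBB";
    // U+2764 followed by variation selector U+FE0F
    static final String RED_HEART = "\u2764\uFE0F";
    // U+1F375, a single supplementary character
    static final String TEA = "\uD83C\uDF75";
    static final String ALL = WOMAN_TECHNOLOGIST + RED_HEART + TEA;

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(ALL.length());    // 9 code units
        System.out.println(codePoints(ALL)); // 6 code points
    }
}
```

Note that counting Code Points gives 6 (3 + 2 + 1), not the 3 symbols a reader sees, so `codePointCount` alone does not count user-perceived characters here.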