Confusing Java Strings
“I do not think it means what you think it means”
tw: @jonatan_ivanov
gh: jonatan-ivanov
www: develotters.com github.com/jonatan-ivanov/java-strings-demo
Slide 2
Quiz: Length of String in Java?
"Java".length() //4
Java String
● Java uses UTF-16 to encode a Unicode String
● Unicode: a standard to represent text
● UTF-16: a way to encode Unicode characters (later)
That’s why the size of a Java char is 2 bytes (2x8 = 16 bits).
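The 16-bit claim can be checked from code. The following is a minimal sketch; the class name CharSizeDemo and its helper methods are made up for illustration, while Character.SIZE and Character.BYTES are standard constants.

```java
// Minimal sketch: a Java char is one 16-bit UTF-16 Code Unit.
// CharSizeDemo and its helpers are hypothetical names used only for illustration.
class CharSizeDemo {
    // Number of bits used to represent a char value.
    static int charBits() {
        return Character.SIZE;
    }

    // Number of bytes used to represent a char value.
    static int charBytes() {
        return Character.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(charBits());      // 16
        System.out.println(charBytes());     // 2
        System.out.println("Java".length()); // 4 (four Code Units, one per letter)
    }
}
```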
Slide 13
Unicode
● Code Point: A unique integer value that identifies a character
e.g.: 65 (letter A)
● Code Unit: A bit sequence used to encode a character (Code Point)
e.g.: 0b1000001 (depends on the encoding)
UTF-16
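● Unicode Code Points (U+0000 to U+10FFFF) are divided into 17 planes
● Code Points on the first plane (the Basic Multilingual Plane, U+0000 to U+FFFF) are encoded with one 16-bit Code Unit
● The rest of the Code Points (supplementary characters) are encoded with two Code Units (a surrogate pair)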
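The plane rule above can be sketched in a few lines. This is an illustration only: PlaneDemo, plane and codeUnitsNeeded are hypothetical names, and the plane index is derived on the assumption that each plane holds 0x10000 Code Points.

```java
// Minimal sketch of the plane rule; PlaneDemo and its methods are hypothetical names.
class PlaneDemo {
    // Each plane holds 0x10000 Code Points, so the plane index is the Code Point shifted right by 16 bits.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode Code Point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // Character.charCount returns 1 for the first plane and 2 for supplementary Code Points.
    static int codeUnitsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(plane(0x41));              // 0 (first plane)
        System.out.println(codeUnitsNeeded(0x41));    // 1
        System.out.println(plane(0x1D538));           // 1
        System.out.println(codeUnitsNeeded(0x1D538)); // 2
        System.out.println(plane(0x10FFFF));          // 16 (the 17th and last plane)
    }
}
```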
Slide 14
Example
Character: A
Unicode Code Point: U+0041
UTF-16 Code Unit(s): 0041 (hex, one 16-bit Code Unit)
Character: 𝔸
Unicode Code Point: U+1D538
UTF-16 Code Unit(s): D835 DD38 (hex, two 16-bit Code Units)
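To see where D835 DD38 comes from, here is a sketch of the UTF-16 surrogate arithmetic. SurrogatePairDemo, toSurrogatePair and hex are hypothetical names; in real code, Character.toChars does the same job.

```java
// Minimal sketch of UTF-16 surrogate pair encoding; all names here are hypothetical.
class SurrogatePairDemo {
    // Splits a supplementary Code Point into its high and low surrogate.
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary Code Point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    // Formats Code Units as space-separated, four-digit hex values.
    static String hex(char[] units) {
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(Character.toChars(0x41)));    // 0041
        System.out.println(hex(toSurrogatePair(0x1D538)));   // D835 DD38
        System.out.println(hex(Character.toChars(0x1D538))); // D835 DD38
    }
}
```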
Slide 15
String::length
public int length()
Returns the length of this string. The length is equal to the number of
Unicode code units in the string.
(javadoc)
So if you have one supplementary character that consists of two code
units, the length of that single character is two.
So the char type is not really what we mean by a character.
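A small sketch of the gap between char and "character": the same question (is this a letter?) gets different answers depending on whether a char or an int Code Point is passed. CharVsCodePointDemo and its method names are hypothetical; the Character.isLetter overloads are standard.

```java
// Minimal sketch: char-based and Code Point-based APIs disagree on supplementary characters.
// CharVsCodePointDemo and its helpers are hypothetical names.
class CharVsCodePointDemo {
    // U+1D538 (double-struck capital A), written as its two Code Units.
    static final String DOUBLE_STRUCK_A = "\uD835\uDD38";

    // Looks only at the first char, which here is half of a surrogate pair.
    static boolean firstCharIsLetter(String s) {
        return Character.isLetter(s.charAt(0));
    }

    // Looks at the whole Code Point starting at index 0.
    static boolean firstCodePointIsLetter(String s) {
        return Character.isLetter(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(DOUBLE_STRUCK_A.length());                // 2
        System.out.println(firstCharIsLetter(DOUBLE_STRUCK_A));      // false
        System.out.println(firstCodePointIsLetter(DOUBLE_STRUCK_A)); // true
    }
}
```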
What can we do about this?
1. Avoid String manipulation; it’s tricky
2. If really needed, you can count the Code Points
String str = "𝕒𝕓𝕔";
str.codePointCount(0, str.length()) //3
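Beyond counting, the same idea extends to walking a String one Code Point at a time instead of one char at a time. The sketch below assumes a hypothetical helper, splitByCodePoints, in a hypothetical class; codePointAt and Character.charCount are the standard building blocks.

```java
// Minimal sketch: iterate by Code Point rather than by char.
// CodePointIterationDemo and splitByCodePoints are hypothetical names.
class CodePointIterationDemo {
    // Returns one String per Code Point, keeping surrogate pairs together.
    static String[] splitByCodePoints(String str) {
        String[] parts = new String[str.codePointCount(0, str.length())];
        int part = 0;
        for (int i = 0; i < str.length(); ) {
            int codePoint = str.codePointAt(i);
            parts[part++] = new String(Character.toChars(codePoint));
            i += Character.charCount(codePoint); // advance by 1 or 2 Code Units
        }
        return parts;
    }

    public static void main(String[] args) {
        // U+1D552 U+1D553 U+1D554 (double-struck a, b, c), written as Code Units.
        String str = "\uD835\uDD52\uD835\uDD53\uD835\uDD54";
        System.out.println(str.length());                  // 6
        System.out.println(splitByCodePoints(str).length); // 3
    }
}
```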
Slide 19
Consequences (example: "𝔸")
● Represented by two char values (Code Units); length is two
● toCharArray() returns a char array (char[]) with two elements
● charAt(0) and charAt(1) return unpaired surrogates (D835 and DD38), which are not valid characters on their own
● If you do any character manipulation
○ you need to consider this case
○ handle these characters appropriately
Most of the char manipulation code we ever wrote is probably broken :)
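The consequences listed above can be observed directly. This is a sketch with hypothetical names (SurrogateInspectionDemo, describe); the Character and String methods it calls are standard.

```java
// Minimal sketch: inspecting the two halves of a supplementary character.
// SurrogateInspectionDemo and describe are hypothetical names.
class SurrogateInspectionDemo {
    // U+1D538 (double-struck capital A), written as its two Code Units.
    static final String DOUBLE_STRUCK_A = "\uD835\uDD38";

    // Labels a single char as a high surrogate, a low surrogate, or an ordinary char.
    static String describe(char c) {
        if (Character.isHighSurrogate(c)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate";
        }
        return "ordinary char";
    }

    public static void main(String[] args) {
        System.out.println(DOUBLE_STRUCK_A.toCharArray().length);                  // 2
        System.out.println(describe(DOUBLE_STRUCK_A.charAt(0)));                   // high surrogate
        System.out.println(describe(DOUBLE_STRUCK_A.charAt(1)));                   // low surrogate
        System.out.println(String.format("U+%X", DOUBLE_STRUCK_A.codePointAt(0))); // U+1D538
    }
}
```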
Slide 20
Let’s reverse a String
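String buggyReverse(String str) {
    // Bug: reverses char by char, so the two Code Units of every surrogate pair get swapped
    String result = "";
    for (int i = str.length() - 1; i >= 0; i--) {
        result += str.charAt(i);
    }
    return result;
}
String reverse(String str) {
    // StringBuilder::reverse treats surrogate pairs as single characters
    return new StringBuilder(str).reverse().toString();
}
buggyReverse: 𝕒 𝕓 𝕔 -> ?? ?? ??
reverse: 𝕒 𝕓 𝕔 -> 𝕔 𝕓 𝕒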
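For comparison, a reverse that works on Code Points instead of chars might look like the sketch below. The names CodePointReverseDemo and reverseByCodePoints are hypothetical. It only keeps surrogate pairs intact; sequences built from several Code Points would still be reordered.

```java
// Minimal sketch: reversing by Code Point so surrogate pairs stay together.
// CodePointReverseDemo and reverseByCodePoints are hypothetical names.
class CodePointReverseDemo {
    static String reverseByCodePoints(String str) {
        int[] codePoints = str.codePoints().toArray();
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = codePoints.length - 1; i >= 0; i--) {
            sb.appendCodePoint(codePoints[i]); // appends 1 or 2 Code Units as needed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D552 U+1D553 U+1D554 (double-struck a, b, c), written as Code Units.
        String abc = "\uD835\uDD52\uD835\uDD53\uD835\uDD54";
        String cba = "\uD835\uDD54\uD835\uDD53\uD835\uDD52";
        System.out.println(reverseByCodePoints(abc).equals(cba));                           // true
        System.out.println(new StringBuilder(abc).reverse().toString().equals(cba));        // true
    }
}
```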
Slide 21
But what about 👩❤☕?
👩❤☕ -> U+1F469 U+2764 U+2615 //3 Code Points
👩❤☕ -> D83D DC69 2764 2615 //4 Code Units
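The two rows above can be reproduced with a small dump helper. This is a sketch; CodePointDumpDemo, codePoints and codeUnits are hypothetical names, and the String is written with escapes so the example does not depend on source file encoding.

```java
// Minimal sketch: print a String as Code Points and as UTF-16 Code Units.
// CodePointDumpDemo and its methods are hypothetical names.
class CodePointDumpDemo {
    // Formats every Code Point as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Formats every Code Unit (char) as four hex digits, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char unit : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC69\u2764\u2615"; // woman, heart, hot beverage
        System.out.println(codePoints(s)); // U+1F469 U+2764 U+2615
        System.out.println(codeUnits(s));  // D83D DC69 2764 2615
    }
}
```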
Slide 22
And what about 👩‍💻❤️🍵?
👩‍💻 is actually two emojis joined with a Zero Width Joiner (ZWJ) character.
👩‍💻 -> 👩 ZWJ 💻 //3
👩‍💻 -> U+1F469 U+200D U+1F4BB //3 Code Points
👩‍💻 -> D83D DC69 200D D83D DCBB //5 Code Units
❤️ is actually a ❤ plus a variation selector (U+FE0F) that requests the red emoji presentation.
❤️ -> ❤ VS16 //2
❤️ -> U+2764 U+FE0F //2 Code Points
❤️ -> 2764 FE0F //2 Code Units
Slide 23
And what about 👩‍💻❤️🍵?
🍵 is “just” a supplementary character.
🍵 -> U+1F375 //1 Code Point
🍵 -> D83C DF75 //2 Code Units
👩‍💻(5) + ❤️(2) + 🍵(2) = 9 Code Units, so length() returns 9 for three visible emojis
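The sum can be verified with a short sketch. EmojiLengthDemo and its constant and method names are hypothetical; the three emojis are written as escapes (woman + ZWJ + laptop, heart + variation selector, teacup) so the counts do not depend on how an editor stores them.

```java
// Minimal sketch: Code Unit and Code Point counts for three visible emojis.
// EmojiLengthDemo and its members are hypothetical names.
class EmojiLengthDemo {
    static final String TECHNOLOGIST = "\uD83D\uDC69\u200D\uD83D\uDCBB"; // U+1F469 U+200D U+1F4BB
    static final String RED_HEART = "\u2764\uFE0F";                      // U+2764 U+FE0F
    static final String TEA = "\uD83C\uDF75";                            // U+1F375

    // Number of UTF-16 Code Units, which is what String::length reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode Code Points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String all = TECHNOLOGIST + RED_HEART + TEA;
        System.out.println(codeUnits(TECHNOLOGIST)); // 5
        System.out.println(codeUnits(RED_HEART));    // 2
        System.out.println(codeUnits(TEA));          // 2
        System.out.println(codeUnits(all));          // 9
        System.out.println(codePoints(all));         // 6
    }
}
```

Neither count matches the three emojis a reader sees, which is the point of the slide.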
Slide 24
Try to avoid String manipulation as much as you can