Slide 1

Slide 1 text

Confusing Java Strings
“I do not think it means what you think it means”
tw: @jonatan_ivanov
gh: jonatan-ivanov
www: develotters.com
github.com/jonatan-ivanov/java-strings-demo

Slide 2

Slide 2 text

Quiz: Length of String in Java? Java //4 "Java".length()

Slide 3

Slide 3 text

Quiz: Length of String in Java? 我喜欢茶 //4

Slide 4

Slide 4 text

Quiz: Length of String in Java? 𝕒𝕓𝕔 //6

Slide 5

Slide 5 text

Quiz: Length of String in Java? 👩❤☕ //4

Slide 6

Slide 6 text

Quiz: Length of String in Java? 👩‍💻❤️🍵 //9

Slide 7

Slide 7 text

WHAT!?

Slide 8

Slide 8 text

Seriously?

Slide 9

Slide 9 text

“I do not think it means what you think it means”

Slide 10

Slide 10 text

"𝕒𝕓𝕔".length() == 6

Slide 11

Slide 11 text

Agenda
● Shock, Denial, Anger, Bargaining, Depression
● 𝕒𝕓𝕔
● 👩❤☕
● 👩‍💻❤️🍵
● Testing, Acceptance (understanding)

Slide 12

Slide 12 text

Java String
● Java uses UTF-16 to encode a Unicode String
● Unicode: a standard to represent text
● UTF-16: a way to encode Unicode characters (later)
That’s why the size of a Java char is 2 bytes (2 x 8 = 16 bits).
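The 16-bit claim can be checked against the JDK itself. This is a minimal sketch using only standard constants and `String.getBytes`; the class name `CharSizeDemo` and its helper methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original talk.
final class CharSizeDemo {
    // Bits in one Java char, i.e. one UTF-16 code unit.
    static int bitsPerChar() {
        return Character.SIZE;
    }

    // Bytes in one Java char.
    static int bytesPerChar() {
        return Character.BYTES;
    }

    // Bytes needed to encode the text as UTF-16 (big-endian, no byte order mark).
    static int utf16ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(bitsPerChar());           // 16
        System.out.println(bytesPerChar());          // 2
        System.out.println(utf16ByteLength("Java")); // 8
    }
}
```

Four chars at two bytes each give the eight bytes printed on the last line.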

Slide 13

Slide 13 text

Unicode
● Code Point: a unique integer value that identifies a character, e.g. 65 (letter A)
● Code Unit: a bit sequence used to encode a character (Code Point), e.g. 0b1000001 (depends on the encoding)
UTF-16
● Unicode Code Points are divided into 17 planes
● Code Points on the first plane (U+0000 to U+FFFF) are encoded with one 16-bit Code Unit
● The rest of the Code Points (the supplementary characters) are encoded with two Code Units
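The plane and code unit rules can be sketched in a few lines. `Character.charCount` and `Character.MAX_CODE_POINT` are standard JDK members; the class `PlaneDemo` and its method names are assumptions made for this example.

```java
// Hypothetical demo class, not part of the original talk.
final class PlaneDemo {
    // Each plane holds 0x10000 (65,536) code points, so the plane index
    // is whatever sits above the low 16 bits.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    // Number of UTF-16 code units needed for the code point: 1 or 2.
    static int codeUnitsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(plane(0x41));                         // 0
        System.out.println(codeUnitsNeeded(0x41));               // 1
        System.out.println(plane(0x1D538));                      // 1
        System.out.println(codeUnitsNeeded(0x1D538));            // 2
        System.out.println(plane(Character.MAX_CODE_POINT) + 1); // 17 planes in total
    }
}
```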

Slide 14

Slide 14 text

Example
Character: A
Unicode Code Point: U+0041
UTF-16 Code Unit(s): 0041 (hex, one 16-bit Code Unit)
Character: 𝔸
Unicode Code Point: U+1D538
UTF-16 Code Unit(s): D835 DD38 (hex, two 16-bit Code Units)
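Where D835 DD38 comes from can be worked out by hand: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below does that arithmetic; `SurrogateMath`, `encode` and `hex` are illustrative names, and the method assumes a valid, non-surrogate code point. In real code `Character.toChars` does the same job.

```java
// Hypothetical demo class, not part of the original talk.
final class SurrogateMath {
    // Encodes one code point into UTF-16 code units by hand.
    // Assumes a valid, non-surrogate code point; no validation is done.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000) {
            return new char[] {(char) codePoint};       // first plane: one code unit
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Formats code units as four-digit upper-case hex, space separated.
    static String hex(char[] units) {
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x0041)));  // 0041
        System.out.println(hex(encode(0x1D538))); // D835 DD38
    }
}
```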

Slide 15

Slide 15 text

String::length
public int length()
“Returns the length of this string. The length is equal to the number of Unicode code units in the string.” (javadoc)
So a single supplementary character, which consists of two code units, has a length of two.
The char type is a UTF-16 code unit, not what we usually mean by a character.
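A small sketch of what `length()` sees for one supplementary character. The string is written with Unicode escapes so the example does not depend on the source file encoding; the class name `LengthDemo` and the constant name are assumptions for illustration.

```java
// Hypothetical demo class, not part of the original talk.
final class LengthDemo {
    // U+1D538 (double-struck capital A), written as its surrogate pair.
    static final String DOUBLE_STRUCK_A = "\uD835\uDD38";

    public static void main(String[] args) {
        char first = DOUBLE_STRUCK_A.charAt(0);
        char second = DOUBLE_STRUCK_A.charAt(1);

        System.out.println(DOUBLE_STRUCK_A.length());                   // 2
        System.out.println(Character.isHighSurrogate(first));           // true
        System.out.println(Character.isLowSurrogate(second));           // true
        // Only the pair taken together maps back to the code point.
        System.out.println(Character.toCodePoint(first, second) == 0x1D538); // true
    }
}
```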

Slide 16

Slide 16 text

Excuse me; what?

Slide 17

Slide 17 text

Code Points vs. Code Units
Java (Code Points): U+004A U+0061 U+0076 U+0061 //4
Java (Code Units): 004A 0061 0076 0061 //4
我喜欢茶 (Code Points): U+6211 U+559C U+6B22 U+8336 //4
我喜欢茶 (Code Units): 6211 559C 6B22 8336 //4
𝕒𝕓𝕔 (Code Points): U+1D552 U+1D553 U+1D554 //3
𝕒𝕓𝕔 (Code Units): D835 DD52 D835 DD53 D835 DD54 //6
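A listing like this can be reproduced with the `String.codePoints()` and `String.chars()` streams. The sketch below is one way to do it; `CodePointTable` and its two methods are hypothetical helper names.

```java
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original talk.
final class CodePointTable {
    // Code points of the string, formatted like "U+1D552 U+1D553".
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // UTF-16 code units of the string, formatted like "D835 DD52".
    static String codeUnits(String s) {
        return s.chars()
                .mapToObj(unit -> String.format("%04X", unit))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The three double-struck letters a, b, c as surrogate pairs.
        String abc = "\uD835\uDD52\uD835\uDD53\uD835\uDD54";
        System.out.println(codePoints(abc)); // U+1D552 U+1D553 U+1D554
        System.out.println(codeUnits(abc));  // D835 DD52 D835 DD53 D835 DD54
    }
}
```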

Slide 18

Slide 18 text

What can we do about this?
1. Avoid String manipulation; it’s tricky
2. If it is really needed, count the Code Points instead of the Code Units
String str = "𝕒𝕓𝕔";
str.codePointCount(0, str.length()); //3
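Counting is not the only operation that can be done per code point. As a hedged sketch, the helper below truncates a string to a number of code points with `String.offsetByCodePoints`, so a surrogate pair is never cut in half. `CodePointOps`, `countCodePoints` and `truncate` are names invented for this example.

```java
// Hypothetical demo class, not part of the original talk.
final class CodePointOps {
    // Same result as codePointCount(0, length()), via the code point stream.
    static long countCodePoints(String s) {
        return s.codePoints().count();
    }

    // Keeps at most maxCodePoints code points and never splits a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must not be negative");
        }
        int total = s.codePointCount(0, s.length());
        if (maxCodePoints >= total) {
            return s;
        }
        // Translate "n code points from the start" into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String abc = "\uD835\uDD52\uD835\uDD53\uD835\uDD54"; // three double-struck letters
        System.out.println(countCodePoints(abc));      // 3
        System.out.println(truncate(abc, 2).length()); // 4 code units, i.e. two whole letters
    }
}
```

By contrast, `abc.substring(0, 3)` would end on a lone high surrogate.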

Slide 19

Slide 19 text

Consequences (example: "𝔸")
● Represented by two char values (Code Units); length is two
● toCharArray() returns a char array (char[]) with two elements
● charAt(0) and charAt(1) each return a surrogate, which is not a valid character on its own
● If you do any character manipulation
○ you need to consider this case
○ handle these characters appropriately
Most of the char manipulation code we ever wrote is probably broken :)
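One way to handle these characters is to walk the string by code point and advance by `Character.charCount`. The sketch below contrasts that with a per-char loop; the class `CodePointLoop` and both method names are assumptions for illustration.

```java
// Hypothetical demo class, not part of the original talk.
final class CodePointLoop {
    // Counts letters by code point; a surrogate pair is read as one unit.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp); // step 1 or 2 chars
        }
        return letters;
    }

    // Counts letters per char; surrogates are never letters, so
    // supplementary letters are missed.
    static int countLettersPerChar(String s) {
        int letters = 0;
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                letters++;
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        String a = "\uD835\uDD38"; // U+1D538, double-struck capital A
        System.out.println(countLetters(a));        // 1
        System.out.println(countLettersPerChar(a)); // 0
    }
}
```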

Slide 20

Slide 20 text

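Let’s reverse a String
String buggyReverse(String str) {
    String result = "";
    // Copies char by char, so the two halves of every surrogate pair end up swapped
    for (int i = str.length() - 1; i >= 0; i--) {
        result += str.charAt(i);
    }
    return result;
}
Fix: StringBuilder::reverse treats a surrogate pair as a single character
return new StringBuilder(str).reverse().toString();
buggyReverse: 𝕒 𝕓 𝕔 -> ?? ?? ??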
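For comparison, here is a minimal sketch of a reverse that walks backwards by code point instead of by char. It mirrors what `StringBuilder.reverse()` already does for surrogate pairs; `ReverseDemo` and `reverseByCodePoint` are hypothetical names.

```java
// Hypothetical demo class, not part of the original talk.
final class ReverseDemo {
    // Reverses the string one code point at a time, keeping surrogate pairs intact.
    static String reverseByCodePoint(String str) {
        StringBuilder result = new StringBuilder(str.length());
        int i = str.length();
        while (i > 0) {
            int cp = str.codePointBefore(i); // reads one or two chars ending at i
            result.appendCodePoint(cp);
            i -= Character.charCount(cp);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String abc = "\uD835\uDD52\uD835\uDD53\uD835\uDD54"; // three double-struck letters
        String cba = "\uD835\uDD54\uD835\uDD53\uD835\uDD52";
        System.out.println(reverseByCodePoint(abc).equals(cba));                                 // true
        System.out.println(new StringBuilder(abc).reverse().toString().equals(cba));             // true
    }
}
```

Both versions still reorder the code points inside a multi-code-point emoji, so they are only safe for text without joiners or variation selectors.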

Slide 21

Slide 21 text

But what about 👩❤☕?
👩❤☕ -> U+1F469 U+2764 U+2615 //3
👩❤☕ -> D83D DC69 2764 2615 //4

Slide 22

Slide 22 text

And what about 👩‍💻❤️🍵?
👩‍💻 is actually two emojis joined with a Zero Width Joiner (ZWJ) character.
👩‍💻 -> 👩 ZWJ 💻 //3
👩‍💻 -> U+1F469 U+200D U+1F4BB //3
👩‍💻 -> D83D DC69 200D D83D DCBB //5
❤️ is actually a ❤ plus a variation selector (U+FE0F) that makes it red.
❤️ -> ❤ variation selector //2
❤️ -> U+2764 U+FE0F //2
❤️ -> 2764 FE0F //2

Slide 23

Slide 23 text

And what about 👩‍💻❤️🍵?
🍵 is “just” a supplementary character.
🍵 -> U+1F375 //1
🍵 -> D83C DF75 //2
👩‍💻(5) + ❤️(2) + 🍵(2) = 9
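The sum can be verified by building the string from its code points. This is a sketch under the assumption that the six code points listed on the slides are the whole string; `EmojiBreakdown` is an illustrative class name, and the names printed by `Character.getName` depend on the Unicode data shipped with the JDK, so no output is claimed for them.

```java
// Hypothetical demo class, not part of the original talk.
final class EmojiBreakdown {
    // Code points from the slides: woman, ZWJ, laptop, heart, variation selector, teacup.
    static final int[] CODE_POINTS = {0x1F469, 0x200D, 0x1F4BB, 0x2764, 0xFE0F, 0x1F375};

    static String build() {
        return new String(CODE_POINTS, 0, CODE_POINTS.length);
    }

    public static void main(String[] args) {
        String s = build();
        System.out.println(s.length());                      // 9 code units
        System.out.println(s.codePointCount(0, s.length())); // 6 code points

        // Per code point: value, number of code units, and the JDK's name for it.
        s.codePoints().forEach(cp -> System.out.println(
                String.format("U+%04X", cp) + " " + Character.charCount(cp) + " " + Character.getName(cp)));
    }
}
```

Neither 9 nor 6 matches the three symbols a reader sees, which is why counting code points alone does not settle the question for emoji.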

Slide 24

Slide 24 text

Try to avoid String manipulation as much as you can

Slide 25

Slide 25 text

Q&A
tw: @jonatan_ivanov
gh: jonatan-ivanov
www: develotters.com
github.com/jonatan-ivanov/java-strings-demo