Slide 1

Slide 1 text

High Performance Scala
Scala Matsuri 2019
Hiroki Fujino

Slide 2

Slide 2 text

About Me • Hiroki Fujino • Server Side Engineer • Working in Berlin since February

Slide 3

Slide 3 text

Background
Experience with an advertising distribution system
What is required:
• Throughput for a large number of requests
• Efficient memory usage for large data
Average ~10ms response

Slide 4

Slide 4 text

Scala is potentially a High Performance Language
• Runs on JVM
• Static type system
• Rich libraries supporting performance

Slide 5

Slide 5 text

Achieving high performance is not easy

Slide 6

Slide 6 text

Two Critical Performance Issues
• I/O
• Garbage Collection (GC)
※Based on my experience

Slide 7

Slide 7 text

I/O Issue
Multithreading for tasks which include I/O
[Figure: Task1, Task2, Task3 on thread1, thread2, thread3 sharing a CPU Core]
Context Switch, Thread Starvation, Throughput decrease

Slide 8

Slide 8 text

I/O Issue
Multithreading for tasks which include I/O
[Figure: Task1, Task2, Task3 on thread1, thread2, thread3 sharing a CPU Core]
Context Switch, Thread Starvation, Throughput decrease
• Why does thread starvation happen?
• How does blocking affect threads?

Slide 9

Slide 9 text

GC Issue
Increase memory to store lots of objects
[Figure: Heap growing into a larger Heap]
Stop The World by GC
Application stops for a longer time

Slide 10

Slide 10 text

GC Issue
Increase memory to store lots of objects
[Figure: Heap growing into a larger Heap]
Stop The World by GC
Application stops for a longer time
• How to reduce application stop time by GC?

Slide 11

Slide 11 text

It’s necessary to understand the theory
• Why does thread starvation happen?
• How does blocking affect threads?
• How to reduce application stop time by GC?

Slide 12

Slide 12 text

Focus Point
Scala Language
• Basic Module
• Scala Collection
• Concurrency Library
• I/O Library
Theory

Slide 13

Slide 13 text

Focus Point
Scala Language
• Basic Module
• Scala Collection
• Concurrency Library
• I/O Library
×
Theory
• Mechanism of Garbage Collection
• Data structure of Collection
• Relationship between thread and CPU
• Difference between blocking and non-blocking

Slide 14

Slide 14 text

Focus Point
Scala Language
• Basic Module
• Scala Collection
• Concurrency Library
• I/O Library
×
Theory
• Mechanism of Garbage Collection
• Data structure of Collection
• Relationship between thread and CPU
• Difference between blocking and non-blocking
= High Performance

Slide 15

Slide 15 text

Goal
High Performance by:
• Use CPU efficiently for I/O
• Avoid application stop by GC
Prerequisites: Basic knowledge of Scala
Acknowledgement:
• The code in this presentation is sample code
• Object sizes are those on 64-bit HotSpot

Slide 16

Slide 16 text

Agenda 1. I/O 2. Garbage Collection 3. Conclusion

Slide 17

Slide 17 text

Agenda 1. I/O 2. Garbage Collection 3. Conclusion

Slide 18

Slide 18 text

I/O is very expensive
• 100 [ns] Main memory reference
• 250,000 [ns] Read 1MB sequentially from memory
• 10,000,000 [ns] Disk seek
• 10,000,000 [ns] Read 1MB sequentially from network
• 20,000,000 [ns] Read 1MB sequentially from disk
https://gist.github.com/jboner/2841832

Slide 19

Slide 19 text

Concurrency
Concurrency is effective for I/O
A CPU core can work on one task at a time
[Figure: a CPU Core alternating between Task1 and Task2 while the other task waits on I/O]

Slide 20

Slide 20 text

The Goal of Concurrency
Efficient use of CPU
Important things:
• How to use thread pool
• Blocking or Non-blocking

Slide 21

Slide 21 text

Understanding through Code Comparison
1. Asynchronous Call with Blocking Operation
2. Asynchronous Call with Non-blocking Operation

Slide 22

Slide 22 text

Sample Code
[Figure: getUser, getMessages (Disk I/O); getTeam (Network I/O to a MicroService)]
def execute(userId: String, teamId: String)
           (implicit ec: ExecutionContext) = {
  // get User from MySQL
  val userF = database.getUser(userId)
  // get Messages from MySQL
  val messagesF = database.getMessages(userId)
  // get Team from MicroService
  val teamF = teamService.getTeam(teamId)
}
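The three calls in the sample each return a Future, and the sample stops before combining them. A minimal sketch of one way to do that with a for-comprehension follows. The stub `database` and `teamService` objects and the simplified `User`, `Message`, `Team` types are assumptions made for illustration; they are not the talk's actual clients.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ExecuteSketch {
  // Hypothetical, simplified domain types (not from the talk).
  final case class User(id: String)
  final case class Message(userId: String, body: String)
  final case class Team(id: String)
  type Result = (Option[User], List[Message], Option[Team])

  // Hypothetical stubs standing in for the MySQL and microservice clients.
  object database {
    def getUser(id: String)(implicit ec: ExecutionContext): Future[Option[User]] =
      Future(Some(User(id)))
    def getMessages(id: String)(implicit ec: ExecutionContext): Future[List[Message]] =
      Future(List(Message(id, "hello")))
  }
  object teamService {
    def getTeam(id: String)(implicit ec: ExecutionContext): Future[Option[Team]] =
      Future(Some(Team(id)))
  }

  def execute(userId: String, teamId: String)(implicit ec: ExecutionContext): Future[Result] = {
    // Start all three calls first so that they run concurrently.
    val userF = database.getUser(userId)
    val messagesF = database.getMessages(userId)
    val teamF = teamService.getTeam(teamId)
    // Then combine the already-running futures.
    for {
      user <- userF
      messages <- messagesF
      team <- teamF
    } yield (user, messages, team)
  }

  // Blocks only at the very edge of the program, for demonstration.
  def demo(): Result =
    Await.result(execute("u1", "t1")(ExecutionContext.global), 5.seconds)
}
```

Creating the futures before the for-comprehension matters: if the calls were written inside it, each one would start only after the previous one finished.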

Slide 23

Slide 23 text

Understanding through Code Comparison
1. Asynchronous Call with Blocking Operation
2. Asynchronous Call with Non-blocking Operation

Slide 24

Slide 24 text

Database Access (ScalikeJDBC): Blocking
[Figure: getUser, getMessages (Disk I/O)]
def getUser(id: String)
           (implicit s: DBSession, ec: ExecutionContext): Future[Option[User]] =
  Future {
    withSQL {
      select.from(Users as u).where.eq(u.id, id)
    }.map(rs => Users(u.resultName, rs)).single.apply()
  }

Slide 25

Slide 25 text

Http Request (Apache Http Client): Blocking
[Figure: getTeam (Network I/O to a MicroService)]
def getTeam(name: String)
           (implicit ec: ExecutionContext): Future[Option[Team]] =
  Future {
    val client = new DefaultHttpClient()
    val httpResponse = client.execute(new HttpGet(url))
    // transform response into Team
    …
  }

Slide 26

Slide 26 text

Asynchronous Call with Blocking Operation
1. getUser
2. getMessages
3. getTeam
[Figure: thread1, thread2, thread3 each performing I/O]
※The figure is for illustration purposes

Slide 27

Slide 27 text

Profiling is effective for monitoring threads
• Switching threads depends on the operating system
• Difficult to predict thread condition
Profiling Tool: [image]
Recorded Factors:
• CPU
• Thread
• I/O
• Garbage Collection
• Memory

Slide 28

Slide 28 text

Thread Graph
Sample result in multiple executions
[Figure: thread states over time; legend: Running, SocketRead]

Slide 29

Slide 29 text

Asynchronous Call with Blocking Operation
1. getUser
2. getMessages
3. getTeam
[Figure: thread1, thread2, thread3 each performing I/O; the thread is blocked during I/O]
※The figure is for illustration purposes

Slide 30

Slide 30 text

Blocking Issue
Blocking threads leads to thread starvation and decreased throughput
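The starvation effect can be reproduced with a small pool. This is a minimal sketch, assuming `Thread.sleep` as a stand-in for blocking I/O; `StarvationSketch` and its parameters are hypothetical names. With 2 threads and 4 blocking tasks, the last two tasks cannot start until a thread is released, so the total time is roughly two rounds of I/O instead of one.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object StarvationSketch {
  // Runs `tasks` blocking tasks of `sleepMs` each on a pool of `poolSize` threads
  // and returns the elapsed wall-clock time in milliseconds.
  def elapsedMillis(poolSize: Int, tasks: Int, sleepMs: Long): Long = {
    val pool = Executors.newFixedThreadPool(poolSize)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val start = System.nanoTime()
      // Each future occupies one pool thread for the whole sleep (simulated blocking I/O).
      val futures = List.fill(tasks)(Future { Thread.sleep(sleepMs) })
      Await.result(Future.sequence(futures), 10.seconds)
      (System.nanoTime() - start) / 1000000L
    } finally {
      pool.shutdown()
    }
  }
}
```

Calling `StarvationSketch.elapsedMillis(2, 4, 100L)` should take at least about 200 ms, while the same call with a pool of 4 threads should finish in roughly 100 ms; exact numbers depend on the machine.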

Slide 31

Slide 31 text

Does increasing the thread pool size solve this issue?

Slide 32

Slide 32 text

Too many threads lead to problems
• Excessive Memory Usage
• Overhead of Context Switch

Slide 33

Slide 33 text

Threads are costly
A JVM thread consumes about 1MB (default stack size on 64-bit Linux)
  val thread = new Thread()
  implicit val ec = ExecutionContext.fromExecutor(new ForkJoinPool(8))
Each JVM thread is backed by a Kernel Thread
  int ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
[Reference] openjdk-jdk11/src/hotspot/os/linux/vm/os_linux.cpp

Slide 34

Slide 34 text

Cost of Context Switch
[Figure: thread1, thread2, thread3 taking turns on a CPU Core over Time; the gaps between turns add up to the total overhead of context switching]

Slide 35

Slide 35 text

Non-blocking
Efficient use of threads while waiting for I/O

Slide 36

Slide 36 text

Non-blocking
Single thread with non-blocking operation
The thread executes another task during I/O wait
[Figure: thread1 on a CPU Core switching between Task1 and Task2 while the other task's I/O is in flight]
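The single-thread idea can be sketched with the standard library alone. In this sketch, which is an illustration and not the talk's code, a timer thread stands in for the operating system or network layer that signals I/O completion, and `fakeIo` is a hypothetical helper that returns immediately with a Future. One worker thread then handles the results of all three calls, because it is never parked waiting for any of them.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object NonBlockingSketch {
  // Returns the results plus the number of distinct threads that ran the callbacks.
  def demo(): (List[String], Int) = {
    val timer = Executors.newSingleThreadScheduledExecutor() // simulates I/O completion events
    val worker = Executors.newSingleThreadExecutor()          // the only application thread
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(worker)
    val callbackThreads = ConcurrentHashMap.newKeySet[String]()

    // Hypothetical non-blocking call: returns at once, completes later from the timer.
    def fakeIo(name: String, delayMs: Long): Future[String] = {
      val promise = Promise[String]()
      val complete = new Runnable {
        def run(): Unit = { promise.success(name); () }
      }
      timer.schedule(complete, delayMs, TimeUnit.MILLISECONDS)
      promise.future
    }

    try {
      val tasks = List("getUser", "getMessages", "getTeam").map { name =>
        fakeIo(name, 100L).map { result =>
          // Runs on the worker thread once the simulated I/O has completed.
          callbackThreads.add(Thread.currentThread().getName)
          result.toUpperCase
        }
      }
      val results = Await.result(Future.sequence(tasks), 5.seconds)
      (results, callbackThreads.size())
    } finally {
      timer.shutdown()
      worker.shutdown()
    }
  }
}
```

All three simulated calls are in flight at the same time even though the application has one worker thread, which is the property the slide describes.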

Slide 37

Slide 37 text

Understanding through Code Comparison
1. Asynchronous Call with Blocking Operation
2. Asynchronous Call with Non-blocking Operation

Slide 38

Slide 38 text

Non-blocking Library
Database Client:
• ScalikeJDBC-Async
• quill-async
• quill-finagle-mysql
Http:
• akka-http

Slide 39

Slide 39 text

Database Access with Non-blocking (quill-async)
[Figure: getUser, getMessages (Disk I/O), Non-blocking]
def getUser(id: String)
           (implicit ec: ExecutionContext): Future[Option[Users]] = {
  val select = quote {
    query[Users].filter(_.id == lift(id))
  }
  ctx.run(select).map(_.headOption)
}

Slide 40

Slide 40 text

Http Request with Non-blocking (akka-http)
[Figure: getTeam (Network I/O to a MicroService), Non-blocking]
def getTeam(teamName: String)
           (implicit ec: ExecutionContext): Future[Team] = {
  val response = Http().singleRequest(HttpRequest(uri = innerUrl, method = HttpMethods.GET))
  // transform response into Team
  …
}

Slide 41

Slide 41 text

Asynchronous Call with Non-blocking Operation
1. getUser
2. getMessages
3. getTeam
[Figure: thread1 handling all three calls; the thread is not blocked during I/O]
※The figure is for illustration purposes

Slide 42

Slide 42 text

Thread Graph
Sample result in multiple executions
[Figure: thread states over time; legend: Running, Monitor Wait, Thread Park]

Slide 43

Slide 43 text

Blocking vs Non-blocking
[Figure: thread graphs for Blocking and Non-blocking side by side]
Non-blocking enables threads to run efficiently

Slide 44

Slide 44 text

Thread Pool Strategy
• Should not use `ExecutionContext.global` for blocking operations
  import scala.concurrent.ExecutionContext.Implicits.global
• Thread pool size
  - CPU bound => About the number of CPU cores
  - I/O bound with blocking => More than the number of CPU cores, but not too large
• If there is a blocking operation, use another thread pool for it
  implicit val blockingEC = ExecutionContext.fromExecutor(new ForkJoinPool(8))
The next slide is the updated one because some information is wrong here

Slide 45

Slide 45 text

Thread Pool Strategy (Updated Slide)
• Should not use `ExecutionContext.global` for blocking operations
  import scala.concurrent.ExecutionContext.Implicits.global
• *If there is a blocking operation, use another thread pool for it
  implicit val blockingEC = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))
  or
  implicit val blockingEC = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
  Be careful of growing thread pool size
• Thread pool size
  - CPU bound => About the number of CPU cores
  - I/O bound with blocking => More than the number of CPU cores, but not too large
*Updated the previous slide
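A minimal sketch of the "separate pool for blocking" rule, using only the standard library. The pool size of 8, the thread name prefix `blocking-io-`, and `blockingLookup` (with `Thread.sleep` standing in for a JDBC or blocking HTTP call) are assumptions for illustration.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object BlockingPoolSketch {
  private val counter = new AtomicInteger(0)

  // Named daemon threads make the pool easy to spot in a profiler or thread dump.
  private val factory = new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"blocking-io-${counter.incrementAndGet()}")
      t.setDaemon(true)
      t
    }
  }

  // Fixed size keeps the thread count bounded, unlike a cached pool that can keep growing.
  val blockingEC: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8, factory))

  // Hypothetical blocking call, explicitly scheduled on the dedicated pool.
  def blockingLookup(id: String): Future[String] =
    Future {
      Thread.sleep(50) // simulated blocking I/O
      s"$id@${Thread.currentThread().getName}"
    }(blockingEC)

  def demo(): String = {
    import ExecutionContext.Implicits.global
    // The non-blocking transformation stays on the global pool.
    val result = blockingLookup("user-1").map(_.toUpperCase)
    Await.result(result, 5.seconds)
  }
}
```

Passing `blockingEC` explicitly, instead of declaring it implicit, makes it harder to run blocking code on the global pool by accident.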

Slide 46

Slide 46 text

Garbage Collector also uses CPU
[Figure: a CPU Core shared by the Application (Task 1, Task 2) and the Garbage Collector (GC Task)]

Slide 47

Slide 47 text

Garbage Collector also uses CPU
[Figure: a CPU Core shared by the Application (Task 1, Task 2) and the Garbage Collector (GC Task); application tasks pause during the GC Task: Stop The World]

Slide 48

Slide 48 text

Agenda 1. I/O 2. Garbage Collection 3. Conclusion

Slide 49

Slide 49 text

Garbage Collection
[Figure: "Frequency of GC" and "Application stop time" compared for a small Heap and a large Heap]

Slide 50

Slide 50 text

Types of Garbage Collection
• ZGC
• Shenandoah
• CMS
• G1
• Serial
• Parallel
Each GC has features:
• Generational or Regional
• What heap size is assumed

Slide 51

Slide 51 text

Types of Garbage Collection
• ZGC
• Shenandoah
• CMS
• G1 (default GC from Java 9)
• Serial
• Parallel
Each GC has features:
• Generational or Regional
• What heap size is assumed

Slide 52

Slide 52 text

Types of Garbage Collection
• CMS
• G1
• Serial
• Parallel
• ZGC, Shenandoah (new GC in Java 11; require large heap size)
Each GC has features:
• Generational or Regional
• What heap size is assumed

Slide 53

Slide 53 text

Types of Garbage Collection
• ZGC
• Shenandoah
• CMS
• G1
• Serial
• Parallel
Each GC has features:
• Generational or Regional
• What heap size is assumed
In general, the application stops during the garbage collection step (ZGC and Shenandoah may not)

Slide 54

Slide 54 text

Common Consideration for GC Optimization
• Performance
  - The faster an operation completes, the earlier its objects can be collected
  - Avoid inefficient algorithms
• Memory Usage
  - The less memory is used, the less often GC happens
  - Avoid unnecessary memory allocation

Slide 55

Slide 55 text

Top 3 Solutions for GC
1. Use correct collection
2. Avoid Boxing
3. Consider object lifetime in cache
※Based on my experience

Slide 56

Slide 56 text

Top 3 Solutions for GC
1. Use correct collection
2. Avoid Boxing
3. Consider object lifetime in cache

Slide 57

Slide 57 text

Collection affects GC
A large collection object tends to become a bottleneck.
• Uses a large amount of memory
• Objects stay in memory for a long time if calculation is slow
[Figure: Heap containing Normal Objects and a Large Collection Object]

Slide 58

Slide 58 text

Scala Collection
Many collection types in the scala.collection package
• scala.collection.mutable
• scala.collection.immutable
https://docs.scala-lang.org/overviews/collections/overview.html

Slide 59

Slide 59 text

Data Structure of Collection
Each collection has a different data structure
Ex. List: Linked List
Ex. Vector: Bitmapped Vector Trie
[Figure: node diagrams of a linked list and a bitmapped vector trie]

Slide 60

Slide 60 text

Performance Characteristics of Collection
Each collection has different characteristics
Ex. List
  val list = List(1, 2, 3)
  val lookUpElem = list(2)          // Complexity: L
  val listWithZero = 0 :: list      // Complexity: C
  val listWithFour = list :+ 4      // Complexity: L
Ex. Vector
  val vector = Vector(1, 2, 3)
  val vectorWithZero = 0 +: vector  // Complexity: eC
  val lookUpElem = vector(2)        // Complexity: eC
  val vectorWithFour = vector :+ 4  // Complexity: eC
(C: constant time, eC: effectively constant time, L: linear time)
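These characteristics decide how a sequence is best built up element by element. A minimal sketch, with `SeqBuildSketch` as a hypothetical name: a List is grown by constant-time prepends and reversed once at the end, while a Vector can be grown by effectively constant-time appends.

```scala
object SeqBuildSketch {
  // List: prepend is constant time, append is linear.
  // Build in reverse with `::` and reverse once, instead of appending n times.
  def buildList(n: Int): List[Int] = {
    var acc = List.empty[Int]
    var i = 0
    while (i < n) {
      acc = i :: acc
      i += 1
    }
    acc.reverse
  }

  // Vector: append is effectively constant time, so appending directly is fine.
  def buildVector(n: Int): Vector[Int] = {
    var acc = Vector.empty[Int]
    var i = 0
    while (i < n) {
      acc = acc :+ i
      i += 1
    }
    acc
  }
}
```

Appending to a List inside the loop would copy the whole list on every step, which is quadratic work and a large amount of short-lived garbage.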

Slide 61

Slide 61 text

Performance Characteristics of Collection
Official documents:
https://docs.scala-lang.org/overviews/collections/performance-characteristics.html

Slide 62

Slide 62 text

Memory Characteristics of Collection
val seq: Seq[Int] = Seq.fill(10000)(1)           // Seq:    240,032 Bytes
val list: List[Int] = List.fill(10000)(1)        // List:   240,032 Bytes
val array: Array[Int] = Array.fill(10000)(1)     // Array:   40,016 Bytes
val vector: Vector[Int] = Vector.fill(10000)(1)  // Vector:  46,728 Bytes

Slide 63

Slide 63 text

Characteristics of Parallel Collection
Collections that have a parallel counterpart:
• Array
• ArrayBuffer
• Vector
• mutable.HashMap
• immutable.HashMap
• Range
• mutable.HashSet
• immutable.HashSet
• concurrent.TrieMap
List → Vector → ParVector: takes cost of transformation
  scala> (1 to 10000).toList.par
  res2: ParVector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …

Slide 64

Slide 64 text

Micro Benchmark is effective for Performance
JMH https://github.com/ktoso/sbt-jmh
import org.openjdk.jmh.annotations.{Benchmark, Scope, State}
import scala.util.Random

// JMH requires @State on a class that keeps benchmark data in fields
@State(Scope.Benchmark)
class ParallelCollectionBenchmark {
  val vecNumbers = Random.shuffle(Vector.tabulate(10000000)(i => i))
  val listNumbers = Random.shuffle(List.tabulate(10000000)(i => i))

  @Benchmark
  def maxInVec() = {
    vecNumbers.par.max
  }

  @Benchmark
  def maxInList() = {
    listNumbers.par.max
  }
}

Slide 65

Slide 65 text

Micro Benchmark is effective for Performance
JMH https://github.com/ktoso/sbt-jmh
[info] Benchmark                      Mode   Cnt  Score   Error  Units
[info] CollectionBenchmark.maxInList  thrpt   10  1.109 ± 0.436  ops/s
[info] CollectionBenchmark.maxInVec   thrpt   10  9.524 ± 1.339  ops/s

Slide 66

Slide 66 text

Use Correct Collection
• Sequential Access or Random Access
  - List is suitable for sequential access
  - Vector is suitable for random access
• Parallel execution
  - Use a collection that has its own parallel counterpart, so that no conversion is needed
• Not all data needs to be in memory
  - Use Iterator where possible
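The Iterator advice can be illustrated with a small pipeline. This is a sketch with hypothetical names: both functions compute the same sum, but the strict version allocates a full intermediate collection at every step, while the Iterator version passes elements through one at a time.

```scala
object IteratorSketch {
  // Strict: toList, map and filter each materialize a complete collection in memory.
  def strictSum(n: Int): Long =
    (1 to n).toList.map(_.toLong * 2).filter(_ % 3 == 0).sum

  // Lazy: no intermediate collection is built; each element is processed and dropped.
  def lazySum(n: Int): Long =
    Iterator.range(1, n + 1).map(_.toLong * 2).filter(_ % 3 == 0).sum
}
```

For n = 6 the doubled values are 2, 4, 6, 8, 10, 12, of which 6 and 12 are divisible by 3, so both functions return 18.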

Slide 67

Slide 67 text

Top 3 Solutions for GC
1. Use correct collection
2. Avoid Boxing
3. Consider object lifetime in cache

Slide 68

Slide 68 text

Primitive type vs Wrapper type
Primitive Type     Wrapper class type
・boolean          ・Boolean
・char             ・Character
・byte             ・Byte
・short            ・Short
・int              ・Integer
・long             ・Long
・float            ・Float
・double           ・Double

Slide 69

Slide 69 text

Primitive type vs Wrapper type
Primitive Type:     int numInt = 123;                    // 4 bytes
Wrapper class type: Integer numInteger = new Integer(123); // 32 bytes
All wrapper class types extend Object
Wrapper types consume more memory than primitive types

Slide 70

Slide 70 text

Boxing
int numInt = 123;
Integer numInteger = new Integer(numInt); // Boxing

int numInt = 123;
Object obj = numInt; // Auto Boxing

int var1 = 123; // 4bytes
Integer var2 = Integer.valueOf(var1); // 16bytes

Slide 71

Slide 71 text

“Scala hides boxing/unboxing operations from you, which can incur severe performance or space penalties.”
Source: https://twitter.github.io/effectivescala/

Slide 72

Slide 72 text

Boxing: Popular Cases
• Collection
• Generics
• Tuple

Slide 73

Slide 73 text

Collection
val array: Array[Int] = Array.fill(10000)(1)
  Compile → public int[] array();
  Array stores int: 40016 bytes
val list: List[Int] = List.fill(10000)(1)
  Compile → public scala.collection.immutable.List list();
  List stores java.lang.Integer (Boxing): 240032 bytes
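Besides javap, the difference can be observed at runtime. A minimal sketch with the hypothetical name `BoxingCheck`: the array check inspects the real runtime class of the array, while the list check only shows what an element looks like once it is handled as an object reference, which is how a List has to hold it.

```scala
object BoxingCheck {
  val array: Array[Int] = Array.fill(3)(1)
  val list: List[Int] = List.fill(3)(1)

  // Runtime class of the array itself: a primitive int[].
  def arrayClass: String = array.getClass.getSimpleName

  // An element viewed as an object reference is a boxed java.lang.Integer.
  def listElementClass: String = list.head.asInstanceOf[AnyRef].getClass.getName
}
```

`BoxingCheck.arrayClass` yields `int[]` and `BoxingCheck.listElementClass` yields `java.lang.Integer`.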

Slide 74

Slide 74 text

Generics
Generics incur the cost of boxing
case class GenValue[T](value: T)
def genIntValue(v: Int) = GenValue(1)
Compile →
public bytecodes.GenValue genIntValue(int);
  …
  Code:
    stack=3, locals=3, args_size=2
       0: new #17 // class bytecodes/GenValue
       …
       5: invokestatic #23 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;   ← Boxing
       …

Slide 75

Slide 75 text

@specialized
Generates multiple versions of a class to remove boxing overhead
case class SGenValue[@specialized T](value: T)
  SGenValue$mcB$sp.class // byte
  SGenValue$mcC$sp.class // char
  SGenValue$mcD$sp.class // double
  SGenValue$mcF$sp.class // float
  SGenValue$mcI$sp.class // int
  SGenValue$mcJ$sp.class // long
  SGenValue$mcS$sp.class // short
  SGenValue$mcV$sp.class // Unit (void)
  SGenValue$mcZ$sp.class // boolean
  SGenValue.class // AnyRef

Slide 76

Slide 76 text

@specialized
case class SGenValue[@specialized T](value: T)
def genIntValue(v: Long) = SGenValue(1)
Compile →
public bytecodes.SGenValue genIntValue(long);
  …
  Code:
    stack=3, locals=3, args_size=2
       0: new #17 // class bytecodes/SGenValue$mcI$sp   ← class for int
       …
       5: invokespecial #20 // Method bytecodes/SGenValue$mcI$sp."<init>":(I)V   ← No Boxing
       …

Slide 77

Slide 77 text

Comparison of Object Size
Boxing:
  case class GenValue[T](value: T)
  val v = GenValue[Int](1)                                                   // 32 bytes
  val list: List[GenValue[Int]] = (1 to 10000).toList.map(GenValue[Int])     // 560016 bytes
Non Boxing:
  case class SGenValue[@specialized T](value: T)
  val sv = SGenValue[Int](1)                                                 // 24 bytes
  val slist: List[SGenValue[Int]] = (1 to 10000).toList.map(SGenValue[Int])  // 480016 bytes

Slide 78

Slide 78 text

@specialized
case class SValues[@specialized A, @specialized B](a: A, b: B)
  → Generates 82 classes
case class SValues[@specialized(Int, Long, Double) A, @specialized(Char, Byte) B](a: A, b: B)
  → Generates 7 classes

Slide 79

Slide 79 text

Tuple vs Case Class
Boxing (Tuple):
  def tuple3: (String, String, Int)   // 200 bytes
  case class Bar(value: Int)
  def tuple2Boxed: (Int, Bar)         // 56 bytes
Non Boxing (Case Class):
  case class ErrorResponse(
    title: String,
    message: String,
    errorCode: Int
  )
  def errorResponse: ErrorResponse    // 184 bytes
  case class IntBar(value: Int, bar: Bar)
  def intBar: IntBar                  // 40 bytes

Slide 80

Slide 80 text

Boxing
• Be careful of unnecessary boxing
• Check whether boxing happens by using javap:
  javap -v xxx.class
  Ex. Integer.value appears a lot in the output
• Apply @specialized for generics

Slide 81

Slide 81 text

Top 3 Solutions for GC
1. Use correct collection
2. Avoid Boxing
3. Consider object lifetime in cache

Slide 82

Slide 82 text

Cache is effective for performance
[Figure: Cache in memory (key-value) in front of the origin (Database, File)]
• If the cache has the key, respond with the cached value
• If the cache doesn't have the key, get the data from the origin

Slide 83

Slide 83 text

Cache implemented with Map
In general, a Map is used when implementing a cache by yourself
val cache = new mutable.HashMap[UserId, User]
cache.get(id) match {
  case Some(user) => ??? // cache hit
  case None => ??? // get user from database
}
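The `???` branches can be filled in with `getOrElseUpdate`, which returns the cached value on a hit and otherwise evaluates the loader and stores the result. A minimal sketch follows; `loadFromDatabase` and the `loads` counter are hypothetical stand-ins for a real database query, and the map is not thread-safe.

```scala
import scala.collection.mutable

object MapCacheSketch {
  final case class UserId(value: Int)
  final case class User(id: UserId, name: String)

  private val cache = mutable.HashMap.empty[UserId, User]

  // Counts simulated database hits so that cache behaviour can be observed.
  var loads: Int = 0

  // Hypothetical loader standing in for a database query.
  private def loadFromDatabase(id: UserId): User = {
    loads += 1
    User(id, s"user-${id.value}")
  }

  // Cache hit: returns the stored value. Cache miss: loads, stores, then returns.
  def getUser(id: UserId): User =
    cache.getOrElseUpdate(id, loadFromDatabase(id))
}
```

Because the HashMap holds strong references and nothing ever removes an entry, this cache grows without bound.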

Slide 84

Slide 84 text

Cache Dilemma
• If the cache has many keys, performance may improve
• The cache must not lead to OutOfMemory
When should keys be refreshed?
Should all keys be refreshed constantly?

Slide 85

Slide 85 text

Obsolete Object Reference affects GC
[Figure: Cache in memory (key-value); the outside references to each key are set to null]
• Object1: strong reference
• Object2: soft reference
• Object3: weak reference

Slide 86

Slide 86 text

Obsolete Object Reference affects GC
[Figure: Cache in memory (key-value); the outside references to each key are set to null]
• Object1: strong reference → GC does not collect the key
• Object2: soft reference → GC collects the key if it needs memory
• Object3: weak reference → GC collects the key

Slide 87

Slide 87 text

HashMap vs WeakHashMap
• HashMap: key with strong reference
  val hashMap = mutable.HashMap[UserId, User](id -> user)
  id = null
  If id is referenced only by the HashMap, GC doesn't collect the key.
• WeakHashMap: key with weak reference
  val weakMap = mutable.WeakHashMap[UserId, User](id -> user)
  id = null
  If id is referenced only by the WeakHashMap, GC collects the key.

Slide 88

Slide 88 text

HashMap vs WeakHashMap
// Strong Reference
val map = mutable.HashMap[UserId, User](id1 -> suzuki)
// Weak Reference
val weakMap = mutable.WeakHashMap[UserId, User](id2 -> tanaka)

println(map.toString()) // Map(UserId(1) -> User(1,Suzuki))
println(weakMap.toString()) // Map(UserId(2) -> User(2,Tanaka))

id1 = null
id2 = null
System.gc() // request a GC run

println(map.toString()) // Map(UserId(1) -> User(1,Suzuki))
println(weakMap.toString()) // Map()

Slide 89

Slide 89 text

Cache Strategy with Object Lifetime
• WeakHashMap is effective
  Ex. Temporary cache for large objects
• Refresh Key Pattern
  - Set TTL of key
  - Refresh all keys constantly
  - Delete key when GC runs, by Weak Reference
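The "Set TTL of key" pattern can be sketched with a plain Map that stores an expiry time next to each value. `TtlCache` is a hypothetical class written for illustration; the clock is passed in as a function so that expiry can be exercised without sleeping, and the class is not thread-safe.

```scala
import scala.collection.mutable

object TtlCacheSketch {
  // `now` returns the current time in milliseconds, e.g. () => System.currentTimeMillis().
  final class TtlCache[K, V](ttlMillis: Long, now: () => Long) {
    private val entries = mutable.HashMap.empty[K, (V, Long)]

    def put(key: K, value: V): Unit =
      entries.update(key, (value, now() + ttlMillis))

    def get(key: K): Option[V] =
      entries.get(key) match {
        case Some((value, expiresAt)) if now() < expiresAt =>
          Some(value)
        case Some(_) =>
          // Expired: drop the strong reference so that GC can reclaim the value.
          entries.remove(key)
          None
        case None =>
          None
      }

    def size: Int = entries.size
  }
}
```

Expired entries are removed only when they are looked up again, so a production cache would also need a periodic sweep or a size bound; that part is left out of this sketch.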

Slide 90

Slide 90 text

Agenda 1. I/O 2. Garbage Collection 3. Conclusion

Slide 91

Slide 91 text

Use CPU efficiently for I/O
• Thread Pool Size
  - Many threads lead to context switch overhead
  - Not too small, not too large for the number of CPU cores
• Blocking or Non-blocking
  - Non-blocking is effective
  - If non-blocking isn't available, use a dedicated thread pool for I/O

Slide 92

Slide 92 text

No more GC!!
Avoid low throughput by GC
How:
• Performance
• Memory Usage
What:
1. Collection
2. Boxing
3. Object lifetime in Cache

Slide 93

Slide 93 text

Conclusion
High Performance by:
• Use CPU efficiently for I/O
• Avoid low throughput by GC
Don’t guess, measure

Slide 94

Slide 94 text

References
• Scala High Performance Programming
• Java High Performance - O’Reilly
• Benchmarking Scala Collections
• @inline and @specialized - Scala Days Berlin 2016
• http://techlog.mvrck.co.jp/entry/specialize-in-scala/
• https://www.baeldung.com/java-weakhashmap
• https://www.slideshare.net/mogproject/adtech-scala-performance-tuning
• https://www.infoq.com/presentations/JVM-Performance-Tuning-twitter/
• https://gist.github.com/djspiewak/46b543800958cf61af6efa8e072bfd5c