High Performance Scala

fuzyco
June 28, 2019

Transcript

  1. High Performance Scala

    Scala Matsuri 2019, Hiroki Fujino

  2. About Me

    • Hiroki Fujino
    • Server Side Engineer
    • Working in Berlin since February
  3. Background

    Experience with an advertising distribution system (average response time of about 10 ms).
    What is required:
    • Throughput for a large number of requests
    • Efficient memory usage for large data
  4. Scala is potentially a high-performance language

    • Runs on the JVM
    • Static type system
    • Rich libraries supporting performance
  5. Achieving high performance is not easy

  6. Two Critical Performance Issues

    • I/O
    • Garbage Collection (GC)
    ※Based on my experience
  7. I/O Issue

    Multithreading for tasks which include I/O:
    • Thread starvation
    • Context switches
    Both lead to a decrease in throughput.
    [Figure: Task1, Task2, and Task3 run on thread1, thread2, and thread3, which share a CPU core.]
  8. I/O Issue

    Multithreading for tasks which include I/O:
    • Thread starvation: why does thread starvation happen?
    • Context switches: how does blocking affect threads?
    Both lead to a decrease in throughput.
    [Figure: Task1, Task2, and Task3 run on thread1, thread2, and thread3, which share a CPU core.]
  9. GC Issue

    • Increasing memory to store lots of objects
    • The application stops for a longer time (Stop The World by GC)
    [Figure: a small heap compared with a larger heap.]
  10. GC Issue

    • Increasing memory to store lots of objects
    • The application stops for a longer time (Stop The World by GC)
    How can application stop time caused by GC be reduced?
    [Figure: a small heap compared with a larger heap.]
  11. It’s necessary to understand the theory

    • Why does thread starvation happen?
    • How does blocking affect threads?
    • How can application stop time caused by GC be reduced?
  12. Focus Point

    Theory
    Scala Language:
    • Basic Module
    • Scala Collection
    • Concurrency Library
    • I/O Library
  13. Focus Point

    Theory:
    • Mechanism of Garbage Collection
    • Data structure of Collection
    • Relationship between thread and CPU
    • Difference between blocking and non-blocking
    ×
    Scala Language:
    • Basic Module
    • Scala Collection
    • Concurrency Library
    • I/O Library
  14. Focus Point

    Theory:
    • Mechanism of Garbage Collection
    • Data structure of Collection
    • Relationship between thread and CPU
    • Difference between blocking and non-blocking
    ×
    Scala Language:
    • Basic Module
    • Scala Collection
    • Concurrency Library
    • I/O Library
    Result: High Performance
  15. Goal

    High performance by:
    • Using the CPU efficiently for I/O
    • Avoiding application stops caused by GC
    Prerequisites: basic knowledge of Scala
    Notes: the code in this presentation is sample code. Object sizes are for the 64-bit HotSpot JVM.
  16. Agenda  1. I/O 2. Garbage Collection 3. Conclusion

  17. Agenda  1. I/O 2. Garbage Collection 3. Conclusion

  18. I/O is very expensive

    • Main memory reference: 100 ns
    • Read 1 MB sequentially from memory: 250,000 ns
    • Disk seek: 10,000,000 ns
    • Read 1 MB sequentially from network: 10,000,000 ns
    • Read 1 MB sequentially from disk: 20,000,000 ns
    Source: https://gist.github.com/jboner/2841832
  19. Concurrency

    Concurrency is effective for I/O.
    A CPU core can work on only one task at a time.
    [Figure: Task1 and Task2 alternate between waiting on I/O and running on the CPU core.]
  20. The Goal of Concurrency

    Efficient use of the CPU.
    Important things:
    • How to use thread pools
    • Blocking or non-blocking
  21. Understanding through Code Comparison

    1. Asynchronous call with a blocking operation
    2. Asynchronous call with a non-blocking operation
  22. Sample Code

    [Figure: getUser and getMessages use disk I/O; getTeam calls a microservice over network I/O.]
        def execute(userId: String, teamId: String)
                   (implicit ec: ExecutionContext) = {
          // get User from MySQL
          val userF = database.getUser(userId)
          // get Messages from MySQL
          val messagesF = database.getMessages(userId)
          // get Team from the microservice
          val teamF = teamService.getTeam(teamId)
        }
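The sample above starts three futures but does not show how their results are combined. A minimal sketch of one way to do that follows, assuming hypothetical stand-ins for `database.getUser`, `database.getMessages`, and `teamService.getTeam` (the `User` and `Team` case classes and the stub bodies are not from the slides):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ExecuteSketch {
  // Hypothetical types and stubs standing in for the slide's database and teamService.
  final case class User(id: String)
  final case class Team(id: String)

  def getUser(id: String)(implicit ec: ExecutionContext): Future[Option[User]] =
    Future(Some(User(id)))
  def getMessages(userId: String)(implicit ec: ExecutionContext): Future[Seq[String]] =
    Future(Seq(s"hello from $userId"))
  def getTeam(id: String)(implicit ec: ExecutionContext): Future[Option[Team]] =
    Future(Some(Team(id)))

  def execute(userId: String, teamId: String)
             (implicit ec: ExecutionContext): Future[(Option[User], Seq[String], Option[Team])] = {
    // Start all three calls first so they can run concurrently.
    val userF = getUser(userId)
    val messagesF = getMessages(userId)
    val teamF = getTeam(teamId)
    // The for-comprehension only combines the results; it does not start the calls.
    for {
      user <- userF
      messages <- messagesF
      team <- teamF
    } yield (user, messages, team)
  }

  def run(): (Option[User], Seq[String], Option[Team]) = {
    // The stubs do not block, so the global execution context is acceptable here.
    import ExecutionContext.Implicits.global
    Await.result(execute("u1", "t1"), 5.seconds)
  }
}
```

Creating the futures before the for-comprehension matters: if the calls were written inside the `for`, each one would start only after the previous one completed.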
  23. Understanding through Code Comparison

    1. Asynchronous call with a blocking operation
    2. Asynchronous call with a non-blocking operation
  24. Database Access (ScalikeJDBC)

    getUser and getMessages: disk I/O, blocking.
        def getUser(id: String)
                   (implicit s: DBSession, ec: ExecutionContext): Future[Option[User]] =
          Future {
            withSQL {
              select.from(Users as u).where.eq(u.id, id)
            }.map(rs => Users(u.resultName, rs)).single.apply()
          }
  25. Http Request (Apache Http Client)

    getTeam: network I/O to a microservice, blocking.
        def getTeam(name: String)
                   (implicit ec: ExecutionContext): Future[Option[Team]] =
          Future {
            val client = new DefaultHttpClient()
            val httpResponse = client.execute(new HttpGet(url))
            // transform response into Team
            …
          }
  26. Asynchronous Call with Blocking Operation

    [Figure, for illustration purposes: 1. getUser, 2. getMessages, and 3. getTeam run on thread1, thread2, and thread3, each waiting on I/O.]
  27. Profiling is effective for monitoring threads

    Thread switching depends on the operating system, so thread state is difficult to predict.
    Factors recorded by the profiling tool:
    • CPU
    • Thread
    • I/O
    • Garbage Collection
    • Memory
  28. [Figure: thread graph, sample result over multiple executions. Thread states shown: Running, SocketRead.]

  29. Asynchronous Call with Blocking Operation

    The thread is blocked.
    [Figure, for illustration purposes: 1. getUser, 2. getMessages, and 3. getTeam run on thread1, thread2, and thread3, each blocked while waiting on I/O.]
  30. Blocking Issue

    Blocking threads leads to thread starvation and a decrease in throughput.
  31. Does increasing the thread pool size solve this issue?

  32. Too many threads lead to problems

    • Excessive memory usage
    • Overhead of context switches
  33. Threads have a cost

    A JVM thread consumes 1 MB.
        implicit val ec = ExecutionContext.fromExecutor(new ForkJoinPool(8))
        val thread = new Thread()
    Each JVM thread is backed by a kernel thread.
    [Reference] openjdk-jdk11/src/hotspot/os/linux/vm/os_linux.cpp
        int ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
  34. Cost of Context Switch

    [Figure: thread1, thread2, and thread3 alternating on a CPU core over time, showing the total overhead of context switches.]
  35. Non-blocking

    Efficient use of threads while waiting for I/O.

  36. Non-blocking

    Single thread with non-blocking operations: the thread executes another task during the I/O wait.
    [Figure: on thread1, Task1 and Task2 alternate between waiting on I/O and running on the CPU core.]
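The idea of one thread serving several tasks while their I/O is pending can be sketched with the standard library alone. This is only an illustration under assumptions: the `fetch` function is hypothetical and simulates non-blocking I/O with a scheduler that completes a `Promise` after 100 ms, and a single worker thread runs all callbacks.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object NonBlockingSketch {
  def run(): Seq[String] = {
    // One scheduler thread stands in for the I/O layer; one worker thread runs all callbacks.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val worker = Executors.newSingleThreadExecutor()
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(worker)

    // Simulated non-blocking I/O (hypothetical): returns at once, completes 100 ms later.
    def fetch(name: String): Future[String] = {
      val p = Promise[String]()
      scheduler.schedule(
        new Runnable { def run(): Unit = p.success(s"$name done") },
        100, TimeUnit.MILLISECONDS)
      p.future
    }

    // All three waits overlap, although only one worker thread exists.
    val all = Future.sequence(List("Task1", "Task2", "Task3").map(fetch))
    try Await.result(all, 5.seconds)
    finally { scheduler.shutdown(); worker.shutdown() }
  }
}
```

With a blocking `Thread.sleep` inside a `Future` on the same single worker thread, the three tasks would instead run one after another.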
  37. Understanding through Code Comparison

    1. Asynchronous call with a blocking operation
    2. Asynchronous call with a non-blocking operation
  38. Non-blocking Library

    • Database client: ScalikeJDBC-Async, quill-async, quill-finagle-mysql
    • Http: akka-http

  39. Database Access with Non-blocking (quill-async)

    getUser and getMessages: disk I/O, non-blocking.
        def getUser(id: String)
                   (implicit ec: ExecutionContext): Future[Option[Users]] = {
          val select = quote {
            query[Users].filter(_.id == lift(id))
          }
          ctx.run(select).map(_.headOption)
        }
  40. Http Request with Non-blocking (akka-http)

    getTeam: network I/O to a microservice, non-blocking.
        def getTeam(teamName: String)
                   (implicit ec: ExecutionContext): Future[Team] = {
          val response = Http().singleRequest(HttpRequest(uri = innerUrl, method = HttpMethods.GET))
          // transform response into Team
          …
        }
  41. Asynchronous Call with Non-blocking Operation

    The thread is not blocked.
    [Figure, for illustration purposes: 1. getUser, 2. getMessages, and 3. getTeam all run on thread1 while their I/O is in progress.]
  42. [Figure: thread graph, sample result over multiple executions. Thread states shown: Running, Monitor Wait, Thread Park.]
  43. Blocking vs Non-blocking

    Non-blocking enables threads to run efficiently.
    [Figure: thread graphs for blocking and non-blocking side by side.]
  44. Thread Pool Strategy

    • Do not use the global execution context for blocking operations
        import scala.concurrent.ExecutionContext.Implicits.global
    • If there is a blocking operation, use a dedicated thread pool for it
        implicit val blockingEC = ExecutionContext.fromExecutor(new ForkJoinPool(8))
    • Thread pool size
      - CPU bound => about the number of CPU cores
      - I/O bound with blocking => more than the number of CPU cores, but not too large
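A dedicated pool for blocking calls might be wired up as in the sketch below. It assumes a hypothetical `blockingLookup` function that simulates a blocking call with `Thread.sleep`; the pool size of 8 is taken from the slide and is a placeholder rather than a recommendation.

```scala
import java.util.concurrent.ForkJoinPool
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingPoolSketch {
  // Hypothetical blocking call, simulated with Thread.sleep.
  def blockingLookup(id: Int): String = {
    Thread.sleep(50)
    s"user-$id"
  }

  def run(): Seq[String] = {
    // Dedicated pool for blocking work; 8 is a placeholder size.
    val blockingEC = ExecutionContext.fromExecutorService(new ForkJoinPool(8))

    // The blocking call is pinned to blockingEC by passing it explicitly.
    def lookupAsync(id: Int): Future[String] =
      Future(blockingLookup(id))(blockingEC)

    // The global context only combines results here; it never runs the blocking call.
    import ExecutionContext.Implicits.global
    val all = Future.sequence((1 to 4).map(lookupAsync))
    try Await.result(all, 5.seconds)
    finally blockingEC.shutdown()
  }
}
```

Passing the execution context explicitly, rather than as an implicit, makes it harder to run a blocking call on the wrong pool by accident.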
  45. The garbage collector also uses CPU

    [Figure: application tasks (Task 1, Task 2) and garbage collector tasks (GC Task) take turns on the CPU core.]
  46. The garbage collector also uses CPU

    Stop The World.
    [Figure: application tasks (Task 1, Task 2) and garbage collector tasks (GC Task) take turns on the CPU core.]
  47. Agenda  1. I/O 2. Garbage Collection 3. Conclusion

  48. Garbage Collection

    [Figure: frequency of GC and application stop time, compared for a small heap and a large heap.]
  49. Types of Garbage Collection

    • ZGC
    • Shenandoah
    • CMS
    • G1
    • Serial
    • Parallel
    Each GC has its own features:
    • Generational or regional
    • What heap size is assumed
  50. Types of Garbage Collection

    • ZGC
    • Shenandoah
    • CMS
    • G1 (default GC from Java 9)
    • Serial
    • Parallel
    Each GC has its own features:
    • Generational or regional
    • What heap size is assumed
  51. Types of Garbage Collection

    • CMS
    • G1
    • Serial
    • Parallel
    New GCs in Java 11, requiring a large heap size:
    • ZGC
    • Shenandoah
    Each GC has its own features:
    • Generational or regional
    • What heap size is assumed
  52. Types of Garbage Collection

    • ZGC
    • Shenandoah
    • CMS
    • G1
    • Serial
    • Parallel
    Each GC has its own features:
    • Generational or regional
    • What heap size is assumed
    In general, the application stops during garbage collection steps (ZGC and Shenandoah may not).
  53. Common Considerations for GC Optimization

    • Performance
      - The faster an operation completes, the earlier its objects can be collected
      - Avoid inefficient algorithms
    • Memory usage
      - The less memory is used, the less often GC happens
      - Avoid unnecessary memory allocation
  54. Top 3 Solutions for GC

    1. Use the correct collection
    2. Avoid boxing
    3. Consider object lifetime in cache
    ※Based on my experience
  55. Top 3 Solutions for GC

    1. Use the correct collection
    2. Avoid boxing
    3. Consider object lifetime in cache
  56. Collection affects GC

    Large collection objects tend to become a bottleneck:
    • They use a large amount of memory
    • Objects stay in memory for a long time if the calculation is slow
    [Figure: a heap holding normal objects and a large collection object.]
  57. Scala Collection

    Many collection types in the scala.collection package:
    • scala.collection.mutable
    • scala.collection.immutable
    https://docs.scala-lang.org/overviews/collections/overview.html
  58. Data Structure of Collection

    Each collection has a different data structure. Ex.:
    • List: linked list
    • Vector: bitmapped vector trie
    [Figure: a linked list and a bitmapped vector trie.]
  59. Performance Characteristics of Collection

    Each collection has different characteristics. Ex. List and Vector:
        val list = List(1, 2, 3)
        val vector = Vector(1, 2, 3)

        Operation  List                            Complexity  Vector                             Complexity
        Lookup     val lookUpElem = list(2)        L           val lookUpElem = vector(2)         eC
        Prepend    val listWithZero = 0 :: list    C           val vectorWithZero = 0 +: vector   eC
        Append     val listWithFour = list :+ 4    L           val vectorWithFour = vector :+ 4   eC

    (C: constant time, eC: effectively constant time, L: linear time)
  60. Performance Characteristics of Collection

    Official documentation: https://docs.scala-lang.org/overviews/collections/performance-characteristics.html

  61. Memory Characteristics of Collection

        val seq: Seq[Int] = Seq.fill(10000)(1)          // Seq:    240,032 bytes
        val list: List[Int] = List.fill(10000)(1)       // List:   240,032 bytes
        val array: Array[Int] = Array.fill(10000)(1)    // Array:   40,016 bytes
        val vector: Vector[Int] = Vector.fill(10000)(1) // Vector:  46,728 bytes
  62. Characteristics of Parallel Collection

    Collections with a parallel counterpart:
    • Array
    • ArrayBuffer
    • Vector
    • mutable.HashMap
    • immutable.HashMap
    • Range
    • mutable.HashSet
    • immutable.HashSet
    • concurrent.TrieMap
    List is not in this list, so calling par on it incurs the cost of a transformation: List → Vector → ParVector.
        scala> (1 to 10000).toList.par
        res2: ParVector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
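  63. Micro Benchmark is effective for Performance

    JMH: https://github.com/ktoso/sbt-jmh
        import org.openjdk.jmh.annotations._
        import scala.util.Random

        // JMH needs @State on a class whose fields are read by benchmarks
        @State(Scope.Benchmark)
        class ParallelCollectionBenchmark {
          val vecNumbers = Random.shuffle(Vector.tabulate(10000000)(i => i))
          val listNumbers = Random.shuffle(List.tabulate(10000000)(i => i))

          // Vector has its own parallel counterpart (ParVector)
          @Benchmark
          def maxInVec() = {
            vecNumbers.par.max
          }

          // List has to be transformed before it can run in parallel
          @Benchmark
          def maxInList() = {
            listNumbers.par.max
          }
        }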
  64. Micro Benchmark is effective for Performance

    JMH: https://github.com/ktoso/sbt-jmh
        [info] Benchmark                      Mode   Cnt  Score   Error  Units
        [info] CollectionBenchmark.maxInList  thrpt   10  1.109 ± 0.436  ops/s
        [info] CollectionBenchmark.maxInVec   thrpt   10  9.524 ± 1.339  ops/s
  65. Use the Correct Collection

    • Sequential access or random access
      - List is suitable for sequential access
      - Vector is suitable for random access
    • Parallel execution
      - Use a collection which has its own parallel counterpart
    • Not all data is needed in memory
      - Use Iterator where possible
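The last point can be illustrated with a small comparison. This is a sketch only; the function names `strictSum` and `lazySum` are made up for the example.

```scala
object IteratorSketch {
  // Strict version: toList, map, and filter each build a full collection in memory.
  def strictSum(n: Int): Long =
    (1 to n).toList.map(_.toLong * 2).filter(_ % 3 == 0).sum

  // Iterator version: elements flow through one at a time and no
  // intermediate collection is allocated.
  def lazySum(n: Int): Long =
    (1 to n).iterator.map(_.toLong * 2).filter(_ % 3 == 0).sum
}
```

Both functions return the same value; the difference is how many short-lived objects they hand to the garbage collector along the way.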
  66. Top 3 Solutions for GC

    1. Use the correct collection
    2. Avoid boxing
    3. Consider object lifetime in cache
  67. Primitive type vs Wrapper type

    Primitive type:
    • boolean
    • char
    • byte
    • short
    • int
    • long
    • float
    • double
    Wrapper class type:
    • Boolean
    • Character
    • Byte
    • Short
    • Integer
    • Long
    • Float
    • Double
  68. Primitive type vs Wrapper type

    • All wrapper class types extend Object
    • A wrapper type consumes more memory than a primitive type
        int numInt = 123;                       // primitive type: 4 bytes
        Integer numInteger = new Integer(123);  // wrapper class type: 32 bytes
  69. Boxing

        int numInt = 123;
        Integer numInteger = new Integer(numInt); // Boxing

        int numInt = 123;
        Object obj = numInt; // Auto Boxing

        byte var1 = 123; // 4bytes
        Integer var2 = Integer.valueOf(var1); // 16bytes
  70. “Scala hides boxing/unboxing operations from you, which can incur severe performance or space penalties.”

    Source: https://twitter.github.io/effectivescala/
  71. Boxing

    Popular cases:
    • Collection
    • Generics
    • Tuple

  72. Collection

    Array stores int (40016 bytes):
        val array: Array[Int] = Array.fill(10000)(1)
      compiles to
        public int[] array();
    List stores java.lang.Integer, so boxing happens (240032 bytes):
        val list: List[Int] = List.fill(10000)(1)
      compiles to
        public scala.collection.immutable.List<java.lang.Object> list();
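The same difference can also be observed at runtime without javap. The following is a small sketch; the object and value names are made up for the example.

```scala
object BoxingCheck {
  val array: Array[Int] = Array.fill(3)(1)
  val list: List[Int] = List.fill(3)(1)

  // Array[Int] is a JVM int[], so its component type is the primitive int.
  val arrayElementClass: Class[_] = array.getClass.getComponentType

  // List is generic, so each element is held as a boxed java.lang.Integer.
  val listElementClass: Class[_] = (list: List[Any]).head.getClass
}
```

Here `arrayElementClass` equals `classOf[Int]`, while `listElementClass` equals `classOf[java.lang.Integer]`.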
  73. Generics

    Generics incur the cost of boxing.
        case class GenValue[T](value: T)
        def genIntValue(v: Int) = GenValue(1)
    compiles to (boxing):
        public bytecodes.GenValue<java.lang.Object> genIntValue(int);
          …
          Code:
            stack=3, locals=3, args_size=2
               0: new           #17  // class bytecodes/GenValue
               …
               5: invokestatic  #23  // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
               …
  74. @specialized

    Generates multiple versions of a class to remove boxing overhead.
        case class SGenValue[@specialized T](value: T)

        SGenValue$mcB$sp.class // byte
        SGenValue$mcC$sp.class // char
        SGenValue$mcD$sp.class // double
        SGenValue$mcF$sp.class // float
        SGenValue$mcI$sp.class // int
        SGenValue$mcJ$sp.class // long
        SGenValue$mcS$sp.class // short
        SGenValue$mcV$sp.class // Unit (void)
        SGenValue$mcZ$sp.class // boolean
        SGenValue.class        // AnyRef
  75. @specialized

        case class SGenValue[@specialized T](value: T)
        def genIntValue(v: Long) = SGenValue(1)
    compiles to (no boxing; SGenValue$mcI$sp is the class for int):
        public bytecodes.SGenValue<java.lang.Object> genIntValue(long);
          …
          Code:
            stack=3, locals=3, args_size=2
               0: new            #17  // class bytecodes/SGenValue$mcI$sp
               …
               5: invokespecial  #20  // Method bytecodes/SGenValue$mcI$sp."<init>":(I)V
               …
  76. Comparison of Object Size

        case class GenValue[T](value: T)               // boxing
        case class SGenValue[@specialized T](value: T) // no boxing

        val v = GenValue[Int](1)    // 24 bytes
        val sv = SGenValue[Int](1)  // 32 bytes

        val list: List[GenValue[Int]] = (1 to 10000).toList.map(GenValue[Int])     // 480016 bytes
        val slist: List[SGenValue[Int]] = (1 to 10000).toList.map(SGenValue[Int])  // 560016 bytes
  77. @specialized

        case class SValues[@specialized A, @specialized B](a: A, b: B)
    Generates 82 classes (9 × 9 specialized variants plus the generic class).
        case class SValues[@specialized(Int, Long, Double) A, @specialized(Char, Byte) B](a: A, b: B)
    Generates 7 classes (3 × 2 specialized variants plus the generic class).
  78. Tuple vs Case Class

    Boxing (tuple):
        case class Bar(value: Int)
        def tuple2Boxed: (Int, Bar)         // 56 bytes
        def tuple3: (String, String, Int)   // 200 bytes
    No boxing (case class):
        case class IntBar(value: Int, bar: Bar)
        def intBar: IntBar                  // 40 bytes
        case class ErrorResponse(
          title: String,
          message: String,
          errorCode: Int
        )
        def errorResponse: ErrorResponse    // 184 bytes
  79. Boxing

    • Be careful of unnecessary boxing
    • Check whether boxing happens by using javap:
        javap -v xxx.class
      Ex. Integer.valueOf appears many times in the output
    • Apply @specialized for generics
  80. Top 3 Solutions for GC

    1. Use the correct collection
    2. Avoid boxing
    3. Consider object lifetime in cache
  81. Cache is effective for performance

    • If the cache has the key, respond with the cached value
    • If the cache doesn’t have the key, get the data from the origin
    [Figure: an in-memory key-value cache in front of a database and a file.]
  82. Cache implemented with Map

    In general, a Map is used when you implement a cache yourself.
        val cache = new mutable.HashMap[UserId, User]
        cache.get(id) match {
          case Some(user) => ??? // cache hit
          case None => ??? // get user from database
        }
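One way to fill in the two `???` branches is `getOrElseUpdate`, which loads and stores the value only on a miss. The sketch below assumes hypothetical `UserId` and `User` case classes and a stub `loadFromDatabase` that counts its calls; it is not thread-safe.

```scala
import scala.collection.mutable

object MapCacheSketch {
  final case class UserId(value: Int)
  final case class User(id: UserId, name: String)

  // Counts how often the (hypothetical) database is hit.
  var databaseCalls = 0

  // Hypothetical stand-in for the real database lookup.
  def loadFromDatabase(id: UserId): User = {
    databaseCalls += 1
    User(id, s"user-${id.value}")
  }

  private val cache = mutable.HashMap.empty[UserId, User]

  // Cache hit: return the stored value. Cache miss: load, store, and return it.
  def getUser(id: UserId): User =
    cache.getOrElseUpdate(id, loadFromDatabase(id))
}
```

Because the map holds strong references and nothing is ever removed, this cache grows without bound.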
  83. Cache Dilemma

    • If the cache has many keys, performance may improve
    • The cache must not lead to OutOfMemory
    When should keys be refreshed? Should all keys be refreshed constantly?
  84. Obsolete Object Reference affects GC

    [Figure: an in-memory key-value cache whose keys have been set to null elsewhere. Object1 is held by a strong reference, Object2 by a soft reference, and Object3 by a weak reference.]
  85. Obsolete Object Reference affects GC

    In-memory key-value cache whose keys have been set to null elsewhere:
    • Object1, strong reference: GC does not collect the key
    • Object2, soft reference: GC collects the key if it needs memory
    • Object3, weak reference: GC collects the key
  86. HashMap vs WeakHashMap

    • HashMap: key with a strong reference
        val hashMap = mutable.HashMap[UserId, User](id -> user)
        id = null
      If id is referenced only by the HashMap, GC doesn’t collect the key.
    • WeakHashMap: key with a weak reference
        val weakMap = mutable.WeakHashMap[UserId, User](id -> user)
        id = null
      If id is referenced only by the WeakHashMap, GC collects the key.
  87. HashMap vs WeakHashMap

        // Strong Reference
        val map = mutable.HashMap[UserId, User](id1 -> suzuki)
        // Weak Reference
        val weakMap = mutable.WeakHashMap[UserId, User](id2 -> tanaka)

        println(map.toString()) // Map(UserId(1) -> User(1,Suzuki))
        println(weakMap.toString()) // Map(UserId(2) -> User(2,Tanaka))

        id1 = null
        id2 = null
        System.gc() // GC run

        println(map.toString()) // Map(UserId(1) -> User(1,Suzuki))
        println(weakMap.toString()) // Map()
  88. Cache Strategy with Object Lifetime

    • Patterns for refreshing keys
      - Set a TTL on each key
      - Refresh all keys constantly
      - Let GC delete keys through weak references
    • WeakHashMap is effective
      Ex. temporary cache for large objects
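The TTL pattern from the list above could look like the following sketch. `TtlCache` is a hypothetical class written for this example, not a library API; the clock is passed in as a function so that expiry can be exercised without waiting.

```scala
import scala.collection.mutable

object TtlCacheSketch {
  // Hypothetical TTL cache; `now` returns the current time in milliseconds.
  final class TtlCache[K, V](ttlMillis: Long, now: () => Long) {
    // Each entry stores the value together with its expiry time.
    private val entries = mutable.HashMap.empty[K, (V, Long)]

    def put(key: K, value: V): Unit =
      entries.put(key, (value, now() + ttlMillis))

    def get(key: K): Option[V] = entries.get(key) match {
      case Some((value, expiresAt)) if now() < expiresAt => Some(value)
      case Some(_) =>
        // Expired: drop the strong reference so that GC can reclaim the value.
        entries.remove(key)
        None
      case None => None
    }
  }
}
```

In production code the clock would typically be `() => System.currentTimeMillis()`. Note that this sketch only evicts an expired entry when that key is read again, so keys that are never requested again stay in memory; it is also not thread-safe.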
  89. Agenda  1. I/O 2. Garbage Collection 3. Conclusion

  90. Use CPU efficiently for I/O

    • Thread pool size
      - Many threads lead to context switch overhead
      - Not too small and not too large relative to the number of CPU cores
    • Blocking or non-blocking
      - Non-blocking is effective
      - If non-blocking isn’t available, use a dedicated thread pool for I/O
  91. Avoid low throughput by GC

    No more GC!!
    What:
    • Performance
    • Memory usage
    How:
    1. Collection
    2. Boxing
    3. Object lifetime in cache
  92. Conclusion

    High performance by:
    • Using the CPU efficiently for I/O
    • Avoiding low throughput caused by GC
    Don’t guess, measure.
  93. References

    • Scala High Performance Programming
    • Java High Performance - O’Reilly
    • Benchmarking Scala Collections
    • @inline and @specialized - Scala Days Berlin 2016
    • http://techlog.mvrck.co.jp/entry/specialize-in-scala/
    • https://www.baeldung.com/java-weakhashmap
    • https://www.slideshare.net/mogproject/adtech-scala-performance-tuning
    • https://www.infoq.com/presentations/JVM-Performance-Tuning-twitter/