High Performance Scala/high_performance_scala

Ca4df28501e4c9cfbceb91f367afa784?s=47 fuzyco
June 28, 2019

High Performance Scala/high_performance_scala

Ca4df28501e4c9cfbceb91f367afa784?s=128

fuzyco

June 28, 2019
Tweet

Transcript

  1. 2.

    About Me  • Hiroki Fujino • Server Side Engineer

    • Working in Berlin since February
  2. 3.

    Background  Experienced advertising distribution system • Throughput for large

    number of request • Efficient memory usage for large data What is required: Average ~10ms response
  3. 4.

    • Runs on JVM • Static type system • Rich

    libraries supporting performance Scala is potentially High Performance Language 
  4. 7.

    I/O Issue  Throughput decrease Thread Starvation Context Switch Task1

    Task2 Task3 CPU Core Task1 Task2 Task3 Task1 Task2 Task3 Multithreading for tasks which include I/O thread1 thread2 thread3
  5. 8.

    I/O Issue  Why thread starvation happens? How blocking affects

    thread? Throughput decrease Thread Starvation Context Switch Task1 Task2 Task3 Task1 Task2 Task3 Task1 Task2 Task3 CPU Core thread1 thread2 thread3 Multithreading for tasks which include I/O
  6. 9.

    GC Issue  Increase memory to store lots objects Application

    stops for a longer time Stop The World by GC )FBQ )FBQ
  7. 10.

    GC Issue  How to reduce application stop time by

    GC? Application stops for a longer time Stop The World by GC )FBQ )FBQ Increase memory to store lots objects
  8. 11.

     It’s necessary to understand the theory Why thread starvation

    happens? How blocking affects thread? How to reduce application stop time by GC?
  9. 12.

    Focus Point  Theory • Basic Module • Scala Collection

    • Concurrency Library • I/O Library Scala Language
  10. 13.

    Focus Point  Theory • Basic Module • Scala Collection

    • Concurrency Library • I/O Library Scala Language × • Mechanism of Garbage Collection • Data structure of Collection • Relationship between thread and CPU • Difference between blocking and non-blocking
  11. 14.

    Focus Point  Theory • Basic Module • Scala Collection

    • Concurrency Library • I/O Library Scala Language × • Mechanism of Garbage Collection • Data structure of Collection • Relationship between thread and CPU • Difference between blocking and non-blocking High Performance
  12. 15.

    Goal  Prerequisites: Basic knowledge of Scala Acknowledgement: The code

    in this presentation is sample Object size on HotSpot64 •Use CPU efficiently for I/O •Avoid application stop by GC High Performance by:
  13. 18.

    I/O is very expensive  •10[ns] Main Memory Reference •250,000[ns]

    Read 1MB sequentially from memory •10,000,000[ns] Disk seek •10,000,000[ns] Read 1MB sequentially from network •20,000,000[ns] Read 1MB sequentially from disk https://gist.github.com/jboner/2841832
  14. 19.

    Concurrency  Concurrency is effective for I/O I/O CPU Core

    Task1 Task2 Task1 Task2 Task1 Task2 CPU Core can work on one task at a time
  15. 20.

    The Goal of Concurrency  •How to use thread pool

    •Blocking or Non-blocking Efficient use of CPU Important things:
  16. 21.

    1. Asynchronous Call with Blocking Operation 2. Asynchronous Call with

    Non-blocking Operation Understanding through Code Comparison 
  17. 22.

    Sample Code  getUser getMessages getTeam Disk I/O Network I/O

    MicroService def execute(userId: String, teamId: String) (implicit ec: ExecutionContext) = { // get User by MySQL val userF = database.getUser(userId) // get Message by MySQL val messagesF = database.getMessages(userId) // get Team by MicroService val teamF = teamService.getTeam(teamId) }
  18. 23.

    1. Asynchronous Call with Blocking Operation 2. Asynchronous Call with

    Non-blocking Operation  Understanding through Code Comparison
  19. 24.

     Database Access ScalikeJDBC def getUser(id: String) (implicit s: DBSession,

    ec: ExecutionContext): Future[Option[User]] = Future { withSQL { select.from(Users as u).where.eq(u.id, id) }.map(rs => Users(u.resultName, rs)).single.apply() } getUser getMessages Disk I/O Blocking
  20. 25.

     Http Request Apache Http Client def getTeam(name: String) (implicit

    ec: ExecutionContext): Future[Option[Team]] = Future { val client = new DefaultHttpClient() val httpResponse = client.execute(new HttpGet(url)) // transform response into Team … } getTeam Network I/O MicroService Blocking
  21. 26.

     1. getUser 2. getMessages 3. getTeam Asynchronous Call with

    Blocking Operation thread1 thread2 thread3 I/O ※The figure for illustration purposes
  22. 27.

    Profiling is effective for monitoring thread  Switching threads depends

    on operating system Difficult to predict thread condition Profiling Tool: • CPU • Thread • I/O • Garbage Collection • Memory Recorded Factors:
  23. 29.

     Asynchronous Call with Blocking Operation thread1 thread2 thread3 Thread

    is blocked 1. getUser 2. getMessages 3. getTeam I/O ※The figure for illustration purposes
  24. 32.

    Too Many Threads lead to problem  • Excessive Memory

    Usage • Overhead of Context Switch
  25. 33.

    Thread takes cost  [ࢀߟ] openjdk-jdk11/src/hotspot/os/linux/vm/os_linux.cpp Kernel Thread int ret

    = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread); JVM thread consumes 1MB implicit val ec = ExecutionContext.fromExecutor(new ForkJoinPool(8)) val thread = new Thread()
  26. 34.

    Cost of Context Switch  Total overhead of context switch

    thread1 thread2 thread3 Time CPU Core CPU Core
  27. 36.

     I/O CPU Core Task1 Task2 Task1 Task2 Task1 Task2

    Non-blocking Single thread with non-blocking operation Thread executes another task during I/O wait thread1
  28. 37.

    1. Asynchronous Call with Blocking Operation 2. Asynchronous Call with

    Non-blocking Operation  Understanding through Code Comparison
  29. 39.

     Database Access with Non-blocking quill-async def getUser(id: String) (implicit

    ec: ExecutionContext): Future[Option[Users]] = { val select = quote { query[Users].filter(_.id == lift(id)) } ctx.run(select).map(_.headOption) } getUser getMessages Disk I/O Non-blocking
  30. 40.

     Http Request with Non-blocking def getTeam(teamName: String) (implicit ec:

    ExecutionContext): Future[Team] = { val response = Http().singleRequest(HttpRequest(uri = innerUrl, method = HttpMethods.GET)) // transform response into Team … } akka-http getTeam Network I/O MicroService Non-blocking
  31. 41.

     1. getUser 2. getMessages 3. getTeam Asynchronous Call with

    Non-blocking Operation I/O Thread is non-blocked ※The figure for illustration purposes thread1
  32. 44.

    Thread Pool Strategy  • Should not use `execution global`

    in blocking operation import scala.concurrent.ExecutionContext.Implicits.global • If there is blocking operation, use thread pool for it •Thread pool size implicit val blockingEC = ExecutionContext.fromExecutor(new ForkJoinPool(8)) - CPU bound => About number of CPU Core - I/O bound with blocking => More than number of CPU Core, but not too large
  33. 45.

    Garbage Collector also use CPU  CPU Core Application Garbage

    Collector Task 1 Task 2 GC Task Task1 Task2 Task1 GC Task
  34. 46.

    Garbage Collector also use CPU  Stop The World CPU

    Core Application Garbage Collector Task 1 Task 2 GC Task Task1 Task2 Task1 GC Task
  35. 48.

    Garbage Collection  )FBQ )FBQ Frequency of GC Application stop

    time Frequency of GC Application stop time Small heap Large heap
  36. 49.

    Types of Garbage Collection  •ZGC •Shenandoah •CMS •G1 •Serial

    •Parallel Each GC has features: •Generational or Regional •How many heap size is assumed
  37. 50.

    Types of Garbage Collection  Default GC from Java9 •ZGC

    •Shenandoah •CMS •G1 •Serial •Parallel Each GC has features: •Generational or Regional •How many heap size is assumed
  38. 51.

    Types of Garbage Collection  New GC in Java11 Require

    large heap size •CMS •G1 •Serial •Parallel •ZGC •Shenandoah Each GC has features: •Generational or Regional •How many heap size is assumed
  39. 52.

    Types of Garbage Collection  Each GC has features: •Generational

    or Regional •How many heap size is assumed In general, application stops during garbage collection step (ZGC and Shenandoah may be not) •ZGC •Shenandoah •CMS •G1 •Serial •Parallel
  40. 53.

    Common Consideration for GC Optimization  •Performance •Memory Usage -

    The faster operation is completed, the more object is collected in early stage - The less use memory, the less GC happens - Avoid unnecessary memory allocation - Avoid inefficient algorithm
  41. 54.

    Top 3 Solution for GC  1. Use correct collection

    2. Avoid Boxing 3. Consider object lifetime in cache ※Based on my experience
  42. 55.

    Top 3 Solution for GC  1. Use correct collection

    2. Avoid Boxing 3. Consider object lifetime in cache
  43. 56.

    Collection affects GC  )FBQ Large collection object tends to

    bottleneck. •Use large memory •Objects stay in memory for long time if calculation is slow Normal Object Large Collection Object
  44. 57.
  45. 58.

    Data Structure of Collection  Each collection has different data

    structure Ex. List Vector Linked List Bitmapped Vector Trie  ɾ  ɾ  ɾ   ʜ    ʜ  … ʜ ʜ … ʜ
  46. 59.

     Each collection has different characteristics Ex. List Vector L

    eC C L eC eC val lookUpElem = list(2) val listWithZero = 0 :: list val listWithFour = list :+ 4 Performance Characteristics of Collection val list = List(1, 2, 3) val vector = Vector(1, 2, 3) val vectorWithZero = 0 +: vector val lookUpElem = vector(2) val vectorWithFour = vector :+ 4 Complexity Complexity
  47. 61.

     Memory Characteristics of Collection val seq: Seq[Int] = Seq.fill(10000)(1)

    val list: List[Int] = List.fill(10000)(1) val array: Array[Int] = Array.fill(10000)(1) val vector: Vector[Int] = Vector.fill(10000)(1) Seq List Array Vector 240,032 Bytes 240,032 Bytes 40,016 Bytes 46,728 Bytes
  48. 62.

    Characteristics of Parallel Collection  • Array • ArrayBuffer •

    Vector • mutable.HashMap • immutable.HashMap • Range • mutable.HashSet • immutable.HashSet • concurrent.TrieMap List Vector ParVector scala> (1 to 10000).toList.par res2: ParVector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, … Takes cost of transformation
  49. 63.

    Micro Benchmark is effective for Performance  JMH https://github.com/ktoso/sbt-jmh class

    ParallelCollectionBenchmark { val vecNumbers = Random.shuffle(Vector.tabulate(10000000)(i => i)) val listNumbers = Random.shuffle(List.tabulate(10000000)(i => i)) @Benchmark def maxInVec() = { vecNumbers.par.max } @Benchmark def maxInList() = { listNumbers.par.max } }
  50. 64.

    Micro Benchmark is effective for Performance  JMH https://github.com/ktoso/sbt-jmh [info]

    Benchmarkɹɹɹɹɹɹɹɹɹɹ Mode Cnt Score Error Units [info] CollectionBenchmark.maxInList thrpt 10 1.109 ± 0.436 ops/s [info] CollectionBenchmark.maxInVecɹɹthrpt 10 9.524 ± 1.339 ops/s
  51. 65.

    Use Correct Collection  •Sequential Access or Random Access •Parallel

    execution •Not need all data in memory - List is suitable for sequential access - Vector is suitable for random access - Use collection which has pure parallel collection. - Use Iterator as possible
  52. 66.

    Top 3 Solution for GC  1. Use correct collection

    2. Avoid Boxing 3. Consider object lifetime in cache
  53. 67.

    Primitive type vs Wrapper type  Primitive Type Wrapper class

    type ɾboolean ɾchar ɾbyte ɾshort ɾint ɾlong ɾfloat ɾdouble ɾBoolean ɾChar ɾByte ɾShort ɾInt ɾLong ɾFloat ɾDouble
  54. 68.

    Primitive type vs Wrapper type  Primitive Type Wrapper class

    type All wrapper class type extends Object Wrapper type consumes more memory than Primitive Type 4 bytes 32 bytes int numInt = 123; Integer numInteger = new Integer(123);
  55. 69.

    Boxing  Boxing int numInt = 123; Integer numInteger =

    new Integer(numInt); // Boxing int numInt = 123; Object obj = numInt; // Auto Boxing byte var1 = 123; // 4bytes Integer var2 = Integer.valueOf(var1); // 16bytes ʹ
  56. 72.

    Collection  val array: Array[Int] = Array.fill(10000)(1) public int[] array();

    Compile List stores java.lang.Integer Array stores int val list: List[Int] = List.fill(10000)(1) public scala.collection.immutable.List<java.lang.Object> list(); Compile 240032 bytes 40016 bytes Boxing
  57. 73.

    Generics  Generics incurs the costs of boxing case class

    GenValue[T](value: T) def genIntValue(v: Int) = GenValue(1) public bytecodes.GenValue<java.lang.Object> genIntValue(int); … Code: stack=3, locals=3, args_size=2 0: new #17 // class bytecodes/GenValue … 5: invokestatic #23 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/ lang/Integer; … Compile Boxing
  58. 74.

    @specialized  Generates multiple versions of a class to remove

    boxing overhead case class SGenValue[@specialized T](value: T) SGenValue$mcB$sp.class // byte SGenValue$mcC$sp.class // char SGenValue$mcD$sp.class // double SGenValue$mcF$sp.class // float SGenValue$mcI$sp.class // int SGenValue$mcJ$sp.class // long SGenValue$mcS$sp.class // short SGenValue$mcV$sp.class // null SGenValue$mcZ$sp.class // boolean SGenValue.class // AnyRef
  59. 75.

    @specialized  case class SGenValue[@specialized T](value: T) def genIntValue(v: Long)

    = SGenValue(1) public bytecodes.SGenValue<java.lang.Object> genIntValue(long); … Code: stack=3, locals=3, args_size=2 0: new #17 // class bytecodes/SGenValue$mcI$sp … 5: invokespecial #20 // Method bytecodes/SGenValue$mcI$sp."<init>":(I)V … Compile No Boxing class for int
  60. 76.

    Comparison of Object Size  case class GenValue[T](value: T) case

    class SGenValue[@specialized T](value: T) Boxing Non Boxing 24 bytes 32 bytes val v = GenValue[Int](1) val sv = SGenValue[Int](1) val list: List[GenValue[Int]] = (1 to 10000).toList.map(GenValue[Int]) val slist: List[SGenValue[Int]] = (1 to 10000).toList.map(SGenValue[Int]) 480016 bytes 560016 bytes
  61. 77.

    @specialized  case class SValues[@specialized A, @specialized B](a: A, b:

    B) Generates 82 classes case class SValues[@specialized(Int, Long, Double) A, @specialized(Char, Byte) B](a: A, b: B) Generates 7 classes
  62. 78.

    Tuple vs Case Class  case class Bar(value: Int) def

    tuple2Boxed: (Int, Bar) case class ErrorResponse( title: String, message: String, errorCode: Int ) def errorResponse: ErrorResponse case class IntBar(value: Int, bar: Bar) def intBar: IntBar def tuple3: (String, String, Int) Boxing Non Boxing 200 bytes 184 bytes 56 bytes 40 bytes
  63. 79.

    Boxing  •Be careful of unnecessary boxing • Check whether

    boxing happens by using javap: javap -v xxx.class •Apply @speacialized for generics Ex. Integer.value emerge a lot
  64. 80.

    Top 3 Solution for GC  1. Use correct collection

    2. Avoid Boxing 3. Consider object lifetime in cache
  65. 81.

    Cache is effective for performance  Cache in memory (key-value)

    Database File key value If cache don’t has key, get data from origin If cache has key, response cache value
  66. 82.

    Cache implemented with Map  val cache = new mutable.HashMap[UserId,

    User] cache.get(id) match { case Some(user) => ??? // cache hit case None => ??? // get user from database } In general, Map is used when implement cache by yourself
  67. 83.

    Cache Dilemma  When should keys be refreshed? •If cache

    has many keys, performance may improve •Cache must not lead to OutOfMemory Should all keys be refreshed constantly?
  68. 84.

    Obsolete Object Reference affects GC  Cache in memory (key-value)

    null Object1 strong reference null Object2 soft reference null Object3 weak reference
  69. 85.

    Obsolete Object Reference affects GC  Cache in memory (key-value)

    null Object1 strong reference null Object2 soft reference null Object3 weak reference GC does not collect key GC collect key if it needs memory GC collect key
  70. 86.

    HashMap vs WeakHashMap  •HashMap •WeakHashMap Key with strong reference

    Key with weak reference val hashMap = mutable.HashMap[UserId, User](id -> user) id = null val weakMap = mutable.WeakHashMap[UserId, User](id -> user) id = null if id is referenced only by HashMap, GC don’t collect key. if id is referenced only by WeakHashMap, GC collect key.
  71. 87.

     // Strong Reference val map = mutable.HashMap[UserId, User](id1 ->

    suzuki) // Weak Reference val weakMap = mutable.WeakHashMap[UserId, User](id2 -> tanaka) println(map.toString()) // Map(UserId(1) -> User(1,Suzuki)) println(weakMap.toString()) // Map(UserId(2) -> User(2,Tanaka)) id1 = null id2 = null System.gc() // GC run println(map.toString()) // Map(UserId(1) -> User(1,Ichiro,Suzuki)) println(weakMap.toString()) // Map() HashMap vs WeakHashMap
  72. 88.

    Cache Strategy with Object Lifetime  •WeakHashMap is effective Ex.

    Temporary cache for large object •Refresh Key Pattern - Set TTL of key - Refresh all keys constantly - Delete key when GC runs by Weak Reference
  73. 90.

    Use CPU efficiently for I/O  •Thread Pool Size •Blocking

    or Non blocking - Many threads lead to context switch overhead - Not too small, not too large for number of CPU core - Non blocking is effective - If non blocking isn’t enable, use thread pool for I/O
  74. 91.

     No more GC!! •Performance •Memory Usage How 1.Collection 2.Boxing

    3.Object lifetime in Cache What Avoid low throughput by GC
  75. 92.

    Conclusion  •Use CPU efficiently for I/O High Performance by:

    •Avoid low throughput by GC Don’t guess, measure
  76. 93.

    References  • Scala High Performance Programming • Java High

    Performance - O’Reilly • Benchmarking Scala Collections • @inline and @specialized - Scala Days Berlin 2016 • http://techlog.mvrck.co.jp/entry/specialize-in-scala/ • https://www.baeldung.com/java-weakhashmap • https://www.slideshare.net/mogproject/adtech-scala-performance-tuning • https://www.infoq.com/presentations/JVM-Performance-Tuning-twitter/