
High Performance Scala/high_performance_scala

fuzyco
June 28, 2019


Transcript

  1. High Performance Scala
    Scala Matsuri 2019
    Hiroki Fujino

  2. About Me
    • Hiroki Fujino
    • Server Side Engineer
    • Working in Berlin since February

  3. Background
    Experience with an advertising distribution system
    What is required:
    • Throughput for a large number of requests
    • Efficient memory usage for large data
    Average response time of ~10 ms

  4. Scala is potentially a High Performance Language
    • Runs on JVM
    • Static type system
    • Rich libraries supporting performance

  5. Achieving high performance is not easy

  6. Two Critical Performance Issues
    • I/O
    • Garbage Collection (GC)
    ※Based on my experience

  7. I/O Issue
    Multithreading for tasks which include I/O
    [Figure: Task1, Task2 and Task3 run on thread1, thread2 and thread3, which share one CPU core]
    • Thread Starvation
    • Context Switch
    Result: Throughput decrease

  8. I/O Issue
    Multithreading for tasks which include I/O
    [Figure: Task1, Task2 and Task3 run on thread1, thread2 and thread3, which share one CPU core]
    • Thread Starvation
    • Context Switch
    Result: Throughput decrease
    Why does thread starvation happen?
    How does blocking affect threads?

  9. GC Issue
    Increasing memory to store many objects
    [Figure: two boxes labeled "Heap"]
    Stop The World by GC
    Application stops for a longer time

  10. GC Issue
    Increasing memory to store many objects
    [Figure: two boxes labeled "Heap"]
    Stop The World by GC
    Application stops for a longer time
    How can application stop time caused by GC be reduced?


  11. It’s necessary to understand the theory
    Why does thread starvation happen?
    How does blocking affect threads?
    How can application stop time caused by GC be reduced?

  12. Focus Point
    Scala Language
    • Basic Module
    • Scala Collection
    • Concurrency Library
    • I/O Library
    Theory

  13. Focus Point
    Scala Language
    • Basic Module
    • Scala Collection
    • Concurrency Library
    • I/O Library
    ×
    Theory
    • Mechanism of Garbage Collection
    • Data structure of Collection
    • Relationship between thread and CPU
    • Difference between blocking and non-blocking

  14. Focus Point
    Scala Language
    • Basic Module
    • Scala Collection
    • Concurrency Library
    • I/O Library
    ×
    Theory
    • Mechanism of Garbage Collection
    • Data structure of Collection
    • Relationship between thread and CPU
    • Difference between blocking and non-blocking
    High Performance

  15. Goal
    High Performance by:
    • Use CPU efficiently for I/O
    • Avoid application stop by GC
    Prerequisites:
    Basic knowledge of Scala
    Acknowledgement:
    The code in this presentation is sample code
    Object sizes are for 64-bit HotSpot

  16. Agenda
    1. I/O
    2. Garbage Collection
    3. Conclusion

  17. Agenda
    1. I/O
    2. Garbage Collection
    3. Conclusion

  18. I/O is very expensive
    • 100 ns          Main memory reference
    • 250,000 ns      Read 1 MB sequentially from memory
    • 10,000,000 ns   Disk seek
    • 10,000,000 ns   Read 1 MB sequentially from network
    • 20,000,000 ns   Read 1 MB sequentially from disk
    Source: https://gist.github.com/jboner/2841832

  19. Concurrency
    A CPU core can work on only one task at a time
    Concurrency is effective for I/O
    [Figure: one CPU core alternates between Task1 and Task2 while the other task waits for I/O]

  20. The Goal of Concurrency
    Efficient use of CPU
    Important things:
    • How to use thread pool
    • Blocking or Non-blocking

  21. Understanding through Code Comparison
    1. Asynchronous Call with Blocking Operation
    2. Asynchronous Call with Non-blocking Operation

  22. Sample Code
    def execute(userId: String, teamId: String)
               (implicit ec: ExecutionContext) = {
      // get User from MySQL
      val userF = database.getUser(userId)
      // get Messages from MySQL
      val messagesF = database.getMessages(userId)
      // get Team from MicroService
      val teamF = teamService.getTeam(teamId)
    }
    [Figure: getUser and getMessages perform Disk I/O; getTeam performs Network I/O against a MicroService]
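The sample above starts three asynchronous calls. A minimal sketch of how their results could be combined, assuming in-memory stand-ins for `database` and `teamService` (the `User`, `Message` and `Team` types and the stub bodies are hypothetical, not from the talk):

```scala
import scala.concurrent.{ExecutionContext, Future}

object ExecuteSketch {
  // Hypothetical stand-ins for the slide's database and teamService.
  final case class User(id: String)
  final case class Message(body: String)
  final case class Team(id: String)

  def getUser(id: String)(implicit ec: ExecutionContext): Future[Option[User]] =
    Future(Some(User(id)))
  def getMessages(id: String)(implicit ec: ExecutionContext): Future[List[Message]] =
    Future(List(Message(s"hello $id")))
  def getTeam(id: String)(implicit ec: ExecutionContext): Future[Option[Team]] =
    Future(Some(Team(id)))

  def execute(userId: String, teamId: String)(
      implicit ec: ExecutionContext
  ): Future[(Option[User], List[Message], Option[Team])] = {
    // Start all three calls first so they can run concurrently.
    val userF = getUser(userId)
    val messagesF = getMessages(userId)
    val teamF = getTeam(teamId)
    // Combine the results; no thread waits inside the for-comprehension.
    for {
      user <- userF
      messages <- messagesF
      team <- teamF
    } yield (user, messages, team)
  }
}
```

Creating the futures before the for-comprehension matters: if the calls were made inside it, each one would start only after the previous one had completed.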

  23. Understanding through Code Comparison
    1. Asynchronous Call with Blocking Operation
    2. Asynchronous Call with Non-blocking Operation


  24. Database Access
    ScalikeJDBC
    def getUser(id: String)
               (implicit s: DBSession, ec: ExecutionContext): Future[Option[User]] =
      Future {
        withSQL {
          select.from(Users as u).where.eq(u.id, id)
        }.map(rs => Users(u.resultName, rs)).single.apply()
      }
    getUser, getMessages: Disk I/O (Blocking)


  25. Http Request
    Apache Http Client
    def getTeam(name: String)
               (implicit ec: ExecutionContext): Future[Option[Team]] = Future {
      val client = new DefaultHttpClient()
      val httpResponse = client.execute(new HttpGet(url))
      // transform response into Team
    }
    getTeam: Network I/O to MicroService (Blocking)


  26. Asynchronous Call with Blocking Operation
    1. getUser 2. getMessages 3. getTeam
    [Figure: thread1, thread2 and thread3 each run one call and wait for I/O]
    ※The figure is for illustration purposes

  27. Profiling is effective for monitoring threads
    Switching threads depends on the operating system
    Difficult to predict thread condition
    Profiling Tool:
    Recorded Factors:
    • CPU • Thread • I/O • Garbage Collection • Memory


  28. Thread Graph
    Sample result in multiple executions
    [Figure: thread graph; legend: Running, SocketRead]


  29. Asynchronous Call with Blocking Operation
    1. getUser 2. getMessages 3. getTeam
    [Figure: thread1, thread2 and thread3 each run one call and wait for I/O]
    Thread is blocked
    ※The figure is for illustration purposes

  30. Blocking Issue
    Blocking threads leads to thread starvation and decreased throughput


  31. Does increasing the thread pool size solve this issue?

  32. Too Many Threads Lead to Problems
    • Excessive Memory Usage
    • Overhead of Context Switch

  33. Threads have a cost
    implicit val ec = ExecutionContext.fromExecutor(new ForkJoinPool(8))
    val thread = new Thread()
    Kernel Thread
    int ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
    [Reference] openjdk-jdk11/src/hotspot/os/linux/vm/os_linux.cpp
    A JVM thread consumes 1 MB

  34. Cost of Context Switch
    [Figure: timeline of thread1, thread2 and thread3 on CPU cores; the gaps between them add up to the total overhead of context switches]


  35. Non-blocking
    Efficient use of threads while waiting for I/O


  36. Non-blocking
    Single thread with non-blocking operation
    Thread executes another task during I/O wait
    [Figure: thread1 on one CPU core alternates between Task1 and Task2 while the other task waits for I/O]
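The idea that one thread can serve several tasks while their I/O is pending can be sketched with a `Promise`. This is an illustration under stated assumptions, not the talk's code: `fakeIo` is a hypothetical helper that simulates I/O with a timer, and both continuations run on a single worker thread.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}

object NonBlockingSketch {
  private val worker = Executors.newSingleThreadExecutor()
  private val timer = Executors.newSingleThreadScheduledExecutor()
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(worker)

  // Simulated non-blocking I/O: the timer completes the promise later,
  // so the worker thread stays free while the "I/O" is in flight.
  def fakeIo(value: String, delayMs: Long): Future[String] = {
    val p = Promise[String]()
    timer.schedule(
      new Runnable { def run(): Unit = { p.success(value); () } },
      delayMs,
      TimeUnit.MILLISECONDS
    )
    p.future
  }

  // Both continuations run on the single worker thread; the two waits overlap.
  def runBoth(): Future[(String, String)] = {
    val a = fakeIo("task1", 100).map(v => s"$v@${Thread.currentThread.getName}")
    val b = fakeIo("task2", 100).map(v => s"$v@${Thread.currentThread.getName}")
    a.zip(b)
  }

  def shutdown(): Unit = { worker.shutdown(); timer.shutdown() }
}
```

With a blocking `Thread.sleep` inside a `Future` on the same single-thread pool, the second task could not start until the first had finished waiting.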

  37. Understanding through Code Comparison
    1. Asynchronous Call with Blocking Operation
    2. Asynchronous Call with Non-blocking Operation

  38. Non-blocking Library
    Database Client
    • ScalikeJDBC-Async
    • quill-async
    • quill-finagle-mysql
    Http
    • akka-http


  39. Database Access with Non-blocking
    quill-async
    def getUser(id: String)
               (implicit ec: ExecutionContext): Future[Option[Users]] = {
      val select = quote {
        query[Users].filter(_.id == lift(id))
      }
      ctx.run(select).map(_.headOption)
    }
    getUser, getMessages: Disk I/O (Non-blocking)


  40. Http Request with Non-blocking
    akka-http
    def getTeam(teamName: String)
               (implicit ec: ExecutionContext): Future[Team] = {
      val response =
        Http().singleRequest(HttpRequest(uri = innerUrl, method = HttpMethods.GET))
      // transform response into Team
    }
    getTeam: Network I/O to MicroService (Non-blocking)


  41. Asynchronous Call with Non-blocking Operation
    1. getUser 2. getMessages 3. getTeam
    [Figure: thread1 runs all three calls while their I/O is in progress]
    Thread is non-blocked
    ※The figure is for illustration purposes


  42. Thread Graph
    Sample result in multiple executions
    [Figure: thread graph; legend: Running, Monitor Wait, Thread Park]

  43. Blocking vs Non-blocking
    [Figure: Blocking and Non-blocking shown side by side]
    Non-blocking enables threads to run efficiently

  44. Thread Pool Strategy
    • Do not use the global execution context for blocking operations
    import scala.concurrent.ExecutionContext.Implicits.global
    • If there is a blocking operation, use another thread pool for it
    implicit val blockingEC = ExecutionContext.fromExecutor(new ForkJoinPool(8))
    • Thread pool size
    - CPU bound => About the number of CPU cores
    - I/O bound with blocking => More than the number of CPU cores, but not too large
    The next slide is the updated one because some information is wrong here

  45. Thread Pool Strategy (Updated Slide)
    • Do not use the global execution context for blocking operations
    import scala.concurrent.ExecutionContext.Implicits.global
    • *If there is a blocking operation, use another thread pool for it
    implicit val blockingEC = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))
    implicit val blockingEC = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
    Be careful of growing thread pool size
    • Thread pool size
    - CPU bound => About the number of CPU cores
    - I/O bound with blocking => More than the number of CPU cores, but not too large
    *Updated from the previous slide
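A minimal sketch of the "separate pool for blocking work" advice, assuming a fixed pool of 8 threads as on the slide. `blockingLookup` is a hypothetical function, and `Thread.sleep` stands in for a blocking JDBC or HTTP call:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object BlockingPoolSketch {
  // Dedicated, bounded pool for blocking calls; the size 8 follows the slide.
  private val blockingPool = Executors.newFixedThreadPool(8)
  val blockingEC: ExecutionContext = ExecutionContext.fromExecutor(blockingPool)

  // Hypothetical blocking call; the sleep stands in for real blocking I/O.
  def blockingLookup(id: Int): Future[String] =
    Future {
      Thread.sleep(50)
      s"user-$id on ${Thread.currentThread.getName}"
    }(blockingEC) // passed explicitly so the global pool is never used here

  // The pool's threads are non-daemon, so shut it down when finished.
  def shutdown(): Unit = blockingPool.shutdown()
}
```

Passing the context explicitly, rather than importing it as an implicit, makes it harder to run blocking code on the CPU-bound pool by accident.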

  46. Garbage Collector also uses CPU
    [Figure: on one CPU core, the application's Task 1 and Task 2 share time with the garbage collector's GC Task]

  47. Garbage Collector also uses CPU
    [Figure: on one CPU core, the application's Task 1 and Task 2 share time with the garbage collector's GC Task]
    Stop The World

  48. Agenda
    1. I/O
    2. Garbage Collection
    3. Conclusion

  49. Garbage Collection
    [Figure: a small heap and a large heap, each compared by frequency of GC and application stop time]

  50. Types of Garbage Collection
    • Serial
    • Parallel
    • CMS
    • G1
    • ZGC
    • Shenandoah
    Each GC has features:
    • Generational or Regional
    • Assumed heap size

  51. Types of Garbage Collection
    • Serial
    • Parallel
    • CMS
    • G1 (default GC from Java 9)
    • ZGC
    • Shenandoah
    Each GC has features:
    • Generational or Regional
    • Assumed heap size

  52. Types of Garbage Collection
    • Serial
    • Parallel
    • CMS
    • G1
    • ZGC
    • Shenandoah
    ZGC and Shenandoah: new GC in Java 11, require large heap size
    Each GC has features:
    • Generational or Regional
    • Assumed heap size

  53. Types of Garbage Collection
    • Serial
    • Parallel
    • CMS
    • G1
    • ZGC
    • Shenandoah
    Each GC has features:
    • Generational or Regional
    • Assumed heap size
    In general, the application stops during garbage collection steps
    (ZGC and Shenandoah may not)

  54. Common Considerations for GC Optimization
    • Performance
    - The faster an operation completes, the more objects are collected at an early stage
    - Avoid inefficient algorithms
    • Memory Usage
    - The less memory is used, the less often GC happens
    - Avoid unnecessary memory allocation

  55. Top 3 Solutions for GC
    1. Use correct collection
    2. Avoid Boxing
    3. Consider object lifetime in cache
    ※Based on my experience

  56. Top 3 Solutions for GC
    1. Use correct collection
    2. Avoid Boxing
    3. Consider object lifetime in cache

  57. Collection affects GC
    Large collection objects tend to become a bottleneck.
    • They use a lot of memory
    • Objects stay in memory for a long time if calculation is slow
    [Figure: a heap containing normal objects and a large collection object]

  58. Scala Collection
    Many collection types in the scala.collection package
    • scala.collection.mutable
    • scala.collection.immutable
    https://docs.scala-lang.org/overviews/collections/overview.html

  59. Data Structure of Collection
    Each collection has a different data structure
    Ex.
    • List: Linked List
    • Vector: Bitmapped Vector Trie
    [Figure: node diagrams of a linked list and a bitmapped vector trie]


  60. Performance Characteristics of Collection
    Each collection has different characteristics
    Ex. (Complexity: C = constant, eC = effectively constant, L = linear)
    List: val list = List(1, 2, 3)
      val lookUpElem = list(2)            L
      val listWithZero = 0 :: list        C
      val listWithFour = list :+ 4        L
    Vector: val vector = Vector(1, 2, 3)
      val lookUpElem = vector(2)          eC
      val vectorWithZero = 0 +: vector    eC
      val vectorWithFour = vector :+ 4    eC

  61. Performance Characteristics of Collection
    Official documents
    https://docs.scala-lang.org/overviews/collections/performance-characteristics.html


  62. Memory Characteristics of Collection
    val seq: Seq[Int] = Seq.fill(10000)(1)
    val list: List[Int] = List.fill(10000)(1)
    val array: Array[Int] = Array.fill(10000)(1)
    val vector: Vector[Int] = Vector.fill(10000)(1)
    Seq      240,032 bytes
    List     240,032 bytes
    Array     40,016 bytes
    Vector    46,728 bytes

  63. Characteristics of Parallel Collection
    Collections with a direct parallel counterpart:
    • Array
    • ArrayBuffer
    • Vector
    • mutable.HashMap
    • immutable.HashMap
    • Range
    • mutable.HashSet
    • immutable.HashSet
    • concurrent.TrieMap
    Other collections are converted first: List → Vector → ParVector
    scala> (1 to 10000).toList.par
    res2: ParVector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
    The conversion has a cost

  64. Micro Benchmark is effective for Performance
    JMH
    https://github.com/ktoso/sbt-jmh
    import org.openjdk.jmh.annotations.{Benchmark, Scope, State}
    import scala.util.Random

    // JMH requires @State on a class that holds benchmark data in fields
    @State(Scope.Benchmark)
    class ParallelCollectionBenchmark {
      val vecNumbers = Random.shuffle(Vector.tabulate(10000000)(i => i))
      val listNumbers = Random.shuffle(List.tabulate(10000000)(i => i))

      @Benchmark
      def maxInVec() = {
        vecNumbers.par.max
      }

      @Benchmark
      def maxInList() = {
        listNumbers.par.max
      }
    }

  65. Micro Benchmark is effective for Performance
    JMH
    https://github.com/ktoso/sbt-jmh
    [info] Benchmark                      Mode   Cnt  Score   Error  Units
    [info] CollectionBenchmark.maxInList  thrpt   10  1.109 ± 0.436  ops/s
    [info] CollectionBenchmark.maxInVec   thrpt   10  9.524 ± 1.339  ops/s

  66. Use Correct Collection
    • Sequential Access or Random Access
    - List is suitable for sequential access
    - Vector is suitable for random access
    • Parallel execution
    - Use a collection which has a direct parallel counterpart
    • Not all data is needed in memory
    - Use Iterator where possible
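The "use Iterator where possible" point can be illustrated by a small comparison. This is a sketch with made-up function names, not code from the talk: both functions compute the same sum, but the strict version allocates intermediate collections while the iterator version processes one element at a time.

```scala
object IteratorSketch {
  // Strict: toVector, map and filter each build a full intermediate collection.
  def strictSum(n: Int): Long =
    (1 to n).toVector.map(_.toLong * 2).filter(_ % 3 == 0).sum

  // Lazy: elements flow through map and filter one at a time,
  // so no intermediate collection is held in memory.
  def lazySum(n: Int): Long =
    Iterator.range(1, n + 1).map(_.toLong * 2).filter(_ % 3 == 0).sum
}
```

Fewer short-lived intermediate collections means less allocation and therefore less work for the garbage collector.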

  67. Top 3 Solutions for GC
    1. Use correct collection
    2. Avoid Boxing
    3. Consider object lifetime in cache

  68. Primitive type vs Wrapper type
    Primitive Type    Wrapper class type
    ・boolean         ・Boolean
    ・char            ・Character
    ・byte            ・Byte
    ・short           ・Short
    ・int             ・Integer
    ・long            ・Long
    ・float           ・Float
    ・double          ・Double

  69. Primitive type vs Wrapper type
    All wrapper class types extend Object
    A wrapper type consumes more memory than a primitive type
    Primitive Type:      int numInt = 123;                         4 bytes
    Wrapper class type:  Integer numInteger = new Integer(123);    32 bytes

  70. Boxing
    int numInt = 123;
    Integer numInteger = new Integer(numInt); // Boxing

    int numInt = 123;
    Object obj = numInt; // Auto Boxing

    byte var1 = 123; // 4bytes
    Integer var2 = Integer.valueOf(var1); // 16bytes

  71. “Scala hides boxing/unboxing operations from you, which can incur severe performance or space penalties.”
    – https://twitter.github.io/effectivescala/

  72. Boxing
    Popular Cases
    • Collection
    • Generics
    • Tuple

  73. Collection
    Array stores int
    val array: Array[Int] = Array.fill(10000)(1)
    Compile: public int[] array();
    40016 bytes
    List stores java.lang.Integer (Boxing)
    val list: List[Int] = List.fill(10000)(1)
    Compile: public scala.collection.immutable.List list();
    240032 bytes
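The difference between the two collections can also be observed at runtime, without javap. A small sketch (the object and value names are made up for illustration) that inspects the runtime classes:

```scala
object BoxingSketch {
  val array: Array[Int] = Array.fill(4)(1)
  val list: List[Int] = List.fill(4)(1)

  // An Array[Int] is a JVM int[]: its component type is the primitive int.
  val arrayComponent: Class[_] = array.getClass.getComponentType

  // List is generic, so after erasure its elements are stored as objects.
  // Viewing the same list as List[AnyRef] exposes the boxed element class.
  val listElementClass: Class[_] = list.asInstanceOf[List[AnyRef]].head.getClass
}
```

Here `arrayComponent` is the primitive `int` class, while `listElementClass` is `java.lang.Integer`.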

  74. Generics
    Generics incur the cost of boxing
    case class GenValue[T](value: T)
    def genIntValue(v: Int) = GenValue(1)
    Compile:
    public bytecodes.GenValue genIntValue(int);
      Code:
        stack=3, locals=3, args_size=2
        0: new #17 // class bytecodes/GenValue
        5: invokestatic #23 // Method scala/runtime/BoxesRunTime.boxToInteger:(I)Ljava/lang/Integer;
    Boxing

  75. @specialized
    Generates multiple versions of a class to remove boxing overhead
    case class SGenValue[@specialized T](value: T)
    SGenValue$mcB$sp.class // byte
    SGenValue$mcC$sp.class // char
    SGenValue$mcD$sp.class // double
    SGenValue$mcF$sp.class // float
    SGenValue$mcI$sp.class // int
    SGenValue$mcJ$sp.class // long
    SGenValue$mcS$sp.class // short
    SGenValue$mcV$sp.class // Unit (void)
    SGenValue$mcZ$sp.class // boolean
    SGenValue.class // AnyRef

  76. @specialized
    case class SGenValue[@specialized T](value: T)
    def genIntValue(v: Long) = SGenValue(1)
    Compile:
    public bytecodes.SGenValue genIntValue(long);
      Code:
        stack=3, locals=3, args_size=2
        0: new #17 // class bytecodes/SGenValue$mcI$sp (class for int)
        5: invokespecial #20 // Method bytecodes/SGenValue$mcI$sp."<init>":(I)V
    No Boxing

  77. Comparison of Object Size
    Boxing
    case class GenValue[T](value: T)
    val v = GenValue[Int](1)                      32 bytes
    val list: List[GenValue[Int]] =
      (1 to 10000).toList.map(GenValue[Int])      560016 bytes
    Non Boxing
    case class SGenValue[@specialized T](value: T)
    val sv = SGenValue[Int](1)                    24 bytes
    val slist: List[SGenValue[Int]] =
      (1 to 10000).toList.map(SGenValue[Int])     480016 bytes

  78. @specialized
    case class SValues[@specialized A, @specialized B](a: A, b: B)
    => Generates 82 classes
    case class SValues[@specialized(Int, Long, Double) A, @specialized(Char, Byte) B](a: A, b: B)
    => Generates 7 classes

  79. Tuple vs Case Class
    case class Bar(value: Int)
    Boxing (Tuple)
    def tuple2Boxed: (Int, Bar)                   56 bytes
    def tuple3: (String, String, Int)             200 bytes
    Non Boxing (Case Class)
    case class IntBar(value: Int, bar: Bar)
    def intBar: IntBar                            40 bytes
    case class ErrorResponse(
      title: String, message: String, errorCode: Int
    )
    def errorResponse: ErrorResponse              184 bytes

  80. Boxing
    • Be careful of unnecessary boxing
    • Check whether boxing happens by using javap:
    javap -v xxx.class
    Ex. Integer.valueOf appears many times
    • Apply @specialized for generics

  81. Top 3 Solutions for GC
    1. Use correct collection
    2. Avoid Boxing
    3. Consider object lifetime in cache

  82. Cache is effective for performance
    Cache in memory (key-value), backed by a Database or File
    • If the cache has the key, respond with the cached value
    • If the cache does not have the key, get the data from the origin

  83. Cache implemented with Map
    In general, a Map is used when you implement a cache yourself
    val cache = new mutable.HashMap[UserId, User]
    cache.get(id) match {
      case Some(user) => ??? // cache hit
      case None => ??? // get user from database
    }
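A minimal runnable version of the pattern above, assuming a hypothetical `loadFromDatabase` function in place of a real query and a counter that records how often the origin is hit. The sketch is not thread-safe:

```scala
import scala.collection.mutable

object CacheSketch {
  final case class UserId(value: Int)
  final case class User(id: UserId, name: String)

  // Counts how often the origin is queried, for illustration only.
  var databaseCalls = 0

  // Hypothetical origin lookup standing in for a database query.
  private def loadFromDatabase(id: UserId): User = {
    databaseCalls += 1
    User(id, s"user-${id.value}")
  }

  private val cache = mutable.HashMap.empty[UserId, User]

  def getUser(id: UserId): User =
    cache.get(id) match {
      case Some(user) => user // cache hit
      case None => // cache miss: load from the origin, then remember the result
        val user = loadFromDatabase(id)
        cache.update(id, user)
        user
    }
}
```

Because the map holds strong references, every cached `User` stays reachable until it is removed explicitly.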

  84. Cache Dilemma
    • If the cache has many keys, performance may improve
    • The cache must not lead to OutOfMemory
    When should keys be refreshed?
    Should all keys be refreshed constantly?

  85. Obsolete Object Reference affects GC
    Cache in memory (key-value)
    [Figure: three cache entries whose other references are null]
    • Object1: strong reference
    • Object2: soft reference
    • Object3: weak reference

  86. Obsolete Object Reference affects GC
    Cache in memory (key-value)
    [Figure: three cache entries whose other references are null]
    • Object1: strong reference => GC does not collect the key
    • Object2: soft reference => GC collects the key if it needs memory
    • Object3: weak reference => GC collects the key

  87. HashMap vs WeakHashMap
    • HashMap: key with strong reference
    val hashMap = mutable.HashMap[UserId, User](id -> user)
    id = null
    If id is referenced only by the HashMap, GC does not collect the key.
    • WeakHashMap: key with weak reference
    val weakMap = mutable.WeakHashMap[UserId, User](id -> user)
    id = null
    If id is referenced only by the WeakHashMap, GC collects the key.


  88. HashMap vs WeakHashMap
    // Strong Reference
    val map = mutable.HashMap[UserId, User](id1 -> suzuki)
    // Weak Reference
    val weakMap = mutable.WeakHashMap[UserId, User](id2 -> tanaka)
    println(map.toString()) // Map(UserId(1) -> User(1,Suzuki))
    println(weakMap.toString()) // Map(UserId(2) -> User(2,Tanaka))
    id1 = null
    id2 = null
    System.gc() // GC run
    println(map.toString()) // Map(UserId(1) -> User(1,Suzuki))
    println(weakMap.toString()) // Map()

  89. Cache Strategy with Object Lifetime
    • Refresh Key Pattern
    - Set a TTL for each key
    - Refresh all keys constantly
    - Delete keys when GC runs, by using weak references
    • WeakHashMap is effective
    Ex. Temporary cache for large objects
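The "set a TTL for each key" pattern can be sketched as follows. `TtlCache` is a hypothetical class written for illustration, not a library API; the clock is passed in as a function so the behavior is easy to test, and the class is not thread-safe:

```scala
import scala.collection.mutable

// Entries older than ttlMillis are treated as missing and removed on access.
final class TtlCache[K, V](ttlMillis: Long, now: () => Long) {
  // Each value is stored together with its expiry time.
  private val entries = mutable.HashMap.empty[K, (V, Long)]

  def put(key: K, value: V): Unit =
    entries.update(key, (value, now() + ttlMillis))

  def get(key: K): Option[V] =
    entries.get(key) match {
      case Some((value, expiresAt)) if now() < expiresAt => Some(value)
      case Some(_) =>
        entries.remove(key) // expired: drop the strong reference so GC can reclaim it
        None
      case None => None
    }

  def size: Int = entries.size
}
```

In production code the clock would typically be `() => System.currentTimeMillis()`. Expired entries are only removed when they are read, so a periodic sweep would still be needed to bound memory for keys that are never requested again.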

  90. Agenda
    1. I/O
    2. Garbage Collection
    3. Conclusion

  91. Use CPU efficiently for I/O
    • Thread Pool Size
    - Many threads lead to context switch overhead
    - Not too small and not too large relative to the number of CPU cores
    • Blocking or Non-blocking
    - Non-blocking is effective
    - If non-blocking isn’t available, use a separate thread pool for I/O


  92. No more GC!!
    Avoid low throughput by GC
    What
    • Performance
    • Memory Usage
    How
    1. Collection
    2. Boxing
    3. Object lifetime in Cache

  93. Conclusion
    High Performance by:
    • Use CPU efficiently for I/O
    • Avoid low throughput by GC
    Don’t guess, measure

  94. References
    • Scala High Performance Programming
    • Java High Performance - O’Reilly
    • Benchmarking Scala Collections
    • @inline and @specialized - Scala Days Berlin 2016
    • http://techlog.mvrck.co.jp/entry/specialize-in-scala/
    • https://www.baeldung.com/java-weakhashmap
    • https://www.slideshare.net/mogproject/adtech-scala-performance-tuning
    • https://www.infoq.com/presentations/JVM-Performance-Tuning-twitter/
    • https://gist.github.com/djspiewak/46b543800958cf61af6efa8e072bfd5c