Getting Started Performance Troubleshoot

by Chihiro Ito

Slide 1

Slide 1 text

伊藤ちひろ Chihiro Ito パフォーマンスのトラブルシュート入門 Getting started Performance Troubleshooting 1

Slide 2

Slide 2 text

Source: Insert source data here Insert source data here Insert source data here OPTIONAL SECTION MARKER OR TITLE 伊藤ちひろ (Chihiro Ito) OpenJDK Committer (JFR, SA) Application Services Solution Architect Twitter : @chiroito 自己紹介 Proﬁle 2

Slide 3

Slide 3 text

本資料の対象者 Target of this presentation ● 性能問題が起きてる人 People having performance problems ● 性能問題が起きてるけど原因が分からない人 People who are having performance problems but don't know why ● 性能問題が起きてる気がするので勘で対応している人 People who feel like having a performance problem and is dealing with it on a hunch ● 性能問題が起きてるか分からない人 People who don't know if they're having performance problems ● 3

Slide 4

Slide 4 text

性能問題とは何か？ What is performance problem? 4

Slide 5

Slide 5 text

よくある疑問 common questions アプリが遅い気がする？ I feel like the app is slow? 性能が目標値に達してない･･･ My app isn't meeting its performance targets. 5

Slide 6

Slide 6 text

よくある対処方法 Common ways to deal 良く分からないけどこれかな？ Maybe it's this cause. いくつか試したら改善した気がする I tried a few things and it improved. 6

Slide 7

Slide 7 text

あるべき対処方法 The ideal way to deal with a problem アプリからDBへの接続が枯渇した Lack of connection from the app to the DB DBはまだ接続を受け入れられるなアプリからの接続を増やそう。 DB can still accept connections. Get more connections from the app. 7

Slide 8

Slide 8 text

正しいパフォーマンスチューニングの流れ The Right Performance Tuning Flow 1. アプリが何をしているか理解 Understanding what the app is doing 2. アプリの事実を計測 Measuring the facts of the app 3. 計測結果を分析 Analyze the measurement results 4. 根拠のある改善策を適用 Apply evidence-based remedies 8

Slide 9

Slide 9 text

アプリが何をしてるのか Understanding what the app is doing 9

Slide 10

Slide 10 text

アプリが何をしているか - Webアプリの場合 Understanding what the app is doing - For web apps リクエスト Request 10 レスポンス Response アプリケーション Application 異常時はこの期間が延びる In abnormal times, this period is extended. 処理 Processing

Slide 11

Slide 11 text

アプリが何をしているか - Webアプリの詳細 Understanding what the app is doing - More about the web app 11 アプリケーション Application データベース Database API API 異常時はどこかの期間が延びる In abnormal times, some period of time is extended.

Slide 12

Slide 12 text

注意：アプリの下も把握しなければならない Note: You have to ﬁgure out what's going on underneath the app as well ● アプリケーションを動かす基盤を含めて把握が必要 You need to understand the infra that the application works ○ フレームワーク Framework ○ ランタイム（アプリケーションサーバなど） Runtime (include Application Server, etc) ○ オペレーティングシステム Operating System 12 F/W Runtime OS スタック Stack アプリ App

Slide 13

Slide 13 text

アプリの事実を計測 Measuring the facts of the app 13

Slide 14

Slide 14 text

アプリの事実を計測とは Measuring the facts of the app 14 ログ・メトリクス Log, Metric 記録 Record アプリの振る舞い App Behavior

Slide 15

Slide 15 text

どのようにアプリの事実を計測するか How to measure the facts of your app ● 分析に必要な全ての事実を計測 Measure all the facts needed for analysis ○ 何の情報を計測するか What information is being measured ○ いつ情報を計測するか When to measure the information ○ どこへ情報を記録するか Where to record the information 15

Slide 16

Slide 16 text

何の情報を計測するか What information is being measured ● 各処理の視点 Perspectives on each process ○ 処理のどこが遅くなっているかの事実が分かる Being able to see the facts about what part of the process is slowing down ● システム全体の視点 Perspective in whole system ○ なぜ遅くなっているのか原因が分かる Being able to see why it's slowing down 16

Slide 17

Slide 17 text

何を計測するか - 各処理の視点 What information is being measured - Perspectives on each process A. HTTPをリクエストする（クライアント） Request HTTP (client) B. HTTPリクエストを受ける Receiving HTTP requests C. ネットワーク操作（接続、問い合せ、切断など） Network operations (connect, query, disconnect, etc.) D. スレッド操作（処理中、待機、スリープなど） Thread operations (in process, wait, sleep, etc.) E. ファイル操作（読み込み、書み込みなど） File operations (read, write, etc.) 17 A B B C D E リクエスト Request レスポンス Response

Slide 18

Slide 18 text

何を計測するか - システム全体の視点 What information is being measured - Perspective in whole system ● リソースの全体量、使用量、残量や待ち数を記録 Record the overall amount of resources, usage, remaining and number of waits ● バックグラウンドで処理されるものを記録 Record what is processed in the background ● OSやプロセスの情報を記録 Record OS and process information ● 毎秒～10秒毎のように定期的に記録 Record it periodically, like every second or 10 seconds 18 残量 Remain 使用量 Usage リソースプール Resource Pool

Slide 19

Slide 19 text

何を計測するか：具体例 What information is being measured ● HTTPやDBなどの処理期間 Processing time for HTTP, DB, etc ● HTTPなどの処理数とそれらの受付待ち数 Number of HTTP, etc. processes and their waiting lists ● プールの使用量（コネクションやスレッドなど） Pool usage (connections, threads, etc.) ● ランタイムのバックグラウンド処理の統計 Run-time background processing statistics ● CPU使用率、IO時間、OSのメモリ使用量など CPU usage, IO time, OS memory usage, etc 19 各処理の視点 Perspectives on each process システムの視点 Perspective in whole system

Slide 20

Slide 20 text

いつ情報を計測するか When to measure the information ● 問題が起きた期間 The period of time when the problem occurred ● 何も問題が起きていない期間 A period of time in which nothing has gone wrong 20 正常期間 Normal Period 異常期間 Abnormal Period 記録期間 Recording period 正常期間 Normal Period

Slide 21

Slide 21 text

どこへ情報を記録するか Where to record the information ● 何度でも確認できるところ Where you can check it again and again ○ ファイル File ○ ミドルウェア/サービス（Prometheus/Elasticsearch/Datadog） Middleware or Services (Prometheus/Elasticsearch/Datadog) ● ダメ：すぐに消えてしまうようなところ（Shell） No Good : Where it's going to disappear quickly (Shell) 21

Slide 22

Slide 22 text

計測結果を分析 Analyze the measurement results 22

Slide 23

Slide 23 text

計測結果を分析 Analyze the measurement results ● 問題の根拠を見つける Find the reasons for the problem ○ 計測結果を可視化 Visualization of measurement results ○ 正常期間と異常期間を比較する Comparing normal and abnormal periods ○ 情報が足りなかったら計測へ戻る Return to measurement when the acquired information is insuﬃcient 23

Slide 24

Slide 24 text

計測した情報を可視化 Visualization of measurement results ● 各処理の視点 Perspectives on each process 24 ● システム全体の視点 Perspective in whole system DB処理 API呼出時間 time 開始 Start 終了 End 3:24:10 3:24:20 3:24:30 使用中 usage 割当待ち waiting lists リソース状況 Resource Status 3:24:23 3:24:34

Slide 25

Slide 25 text

正常期間と異常期間を比較 Comparing normal and abnormal periods ● 正常期間 Normal period 25 ● 異常期間 Abnormal Period DB API 呼出 3:24:10 3:24:20 3:24:30 使用中 usage 割当待ち waiting lists D B API 呼出 1:01:10 1:01:20 1:00:30 使用中 usage コネクションプール Connection Pool 処理 Processing 3:24:23 3:24:34 1:01:12 1:01:17

Slide 26

Slide 26 text

ここまでのまとめ Summary ● アプリがしてることを内部まで正しく理解する Get what the app is doing right down to the insides ● 必要な情報は全て記録する record all the information we need ● 可視化して正常期間と異常期間を比較する Visualize and compare normal and abnormal periods ● 根拠のある改善を行う Making Evidence-Based Improvements 26

Slide 27

Slide 27 text

Java の場合 Case of Java 27

Slide 28

Slide 28 text

アプリが何をしてるのか：Javaの構成 Understanding what the app is doing - Java stack ● アプリケーション Application ● アプリケーションサーバ（フレームワーク含む） Application Server (Inc. Framework, etc) ● Java 仮想マシン Java Virtual Machine ● オペレーティングシステム Operating System 28 Windows Linux Mac Java VM Quarkus Wildﬂy Spring App App App

Slide 29

Slide 29 text

何を計測するか：Java What information is being measured : Java stack ● HTTPやDBなどの処理期間 Processing time for HTTP, DB, etc ● HTTPなどの処理数とそれらの受付待ち数 Number of HTTP, etc. processes and their waiting lists ● プールの使用量（コネクションやスレッドなど） Pool usage (connections, threads, etc.) ● ガベージコレクション、クラスロード、コンパイル Garbage Collection, Loading class, Compile, etc. ● CPU使用率、IO時間、OSのメモリ使用量など CPU usage, IO time, OS memory usage, etc 29 各処理の視点 Perspectives on each process システムの視点 Perspective in whole system 大抵のFWが提供 Most frameworks provide

Slide 30

Slide 30 text

アプリの事実を計測：Java 標準のツール How do information be measured : Java standard tools ● Uniﬁed Logging ● MBean / Java Management Extention (JMX) ● Jcmd / Jstat ● Eclipse MicroProﬁle Metrics / Health ● JDK Flight Recorder 30

Slide 31

Slide 31 text

Unified Logging ● JVMのイベントをログとして出力するJVMのログフレームワーク A JVM logging framework that logs JVM events ○ GC、クラスロード、JITコンパイルなどの実行時に出力 Output during GC, class loading, JIT compilation, etc when these are processing ● 以前は取りたいログ毎にVM引数が存在 Previously, there were VM arguments for each log you wanted to get 31 java -Xlog:gc=trace:file=gctrace.txt:uptimemillis,pid:filecount=5,filesize=1M App [5ms][34164] MarkStackSize: 4096k MarkStackSizeMax: 16384k [21ms][34164] ConcGCThreads: 2 offset 16 [21ms][34164] ParallelGCThreads: 8 [22ms][34164] Initialize mark stack with 4096 chunks, maximum 16384

Slide 32

Slide 32 text

MBean / JMX ● アプリ/FW/JVMがメトリクスの公開などの管理に使うフレームワーク A framework that apps/FW/JVMs use to manage metrics exposure, etc ○ JVMはGC、クラスロード、JITコンパイルなどの統計を提供 JVM provides statistics on GC, class loads, JIT compilation, etc. ○ フレームワークはリソース量、処理待ち数などを提供 App framework provides such as amount of resources, processing waits, etc. 32 @MXBean public interface Metrics {} public class MetricsImpl implements Metrics {}

Slide 33

Slide 33 text

Eclipse MicroProﬁle Metrics/Health ● FWがアプリケーションやFWのメトリクスを公開 FW exposes application and FW metrics ○ 生存確認、APIの呼び出し回数、処理時間など Health check, number of API calls, processing time, etc ● XXX 33 curl http://xxxxx/metrics/application { "org.acme.microproﬁle.metrics.PrimeNumberChecker.checksTimer" : { "p99": 217.231273, "p999": 217.231273, "min": 0.58961, "mean": 112.15909190834819, (略) "stddev": 108.2721053982776, } }

Slide 34

Slide 34 text

JDK Flight Recorder ● アプリ/FW/JVMのメトリクス・イベントを記録 Record app/FW/JVM metrics and events ○ JVMの主要な分析に必要なほぼ全てのメトリクスを記録 Record nearly all the metrics needed for key JVM analyses ○ アプリ独自のメトリクスも作れる。対応FWもある You can also create app-speciﬁc metrics. Some FW support JFR 34 Recording r = new Recording(); r.start(); r.stop(); r.dump() java -XX:StartFlightRecording App

Slide 35

Slide 35 text

アプリの状況を計測：昔ながらの収集例 How do information be measured : example of old-fashioned collecting 35 ログ Log JMX JStat 収集アプリ Collector app アプリ App 記録 Log 記録 Log

Slide 36

Slide 36 text

アプリの状況を計測：今どきの収集例 How do information be measured : Example of the latest collections 36 ログ Log MicroProﬁle Metrics/Health アプリ App JFR Prometheus

Slide 37

Slide 37 text

計測結果を分析：ツール Analyze the measurement results : Tools ● スプレッドシート（Excelなど） Spread sheet (ex. Excel) ● Prometheus + Grafana ● Elasticsearch + Kibana ● JDK Mission Control 37

Slide 38

Slide 38 text

JDK Mission Control 38

Slide 39

Slide 39 text

Java のまとめ ● Java でも上から下まで情報を記録する必要がある ● Java には様々なツールが用意されていて幅広く記録できる ● Java には分析ツールも含まれている 39

Slide 40

Slide 40 text

ケーススタディ Case study 40

Slide 41

Slide 41 text

ケーススタディ Case Study 41 アプリケーション Application ● 1回目のリクエストだけ遅い、2回目からは早くなる Only the ﬁrst request is slow, the second one is faster. 遅い Slow 速い Fast 1回目のリクエスト First time 2回目のリクエスト First time

Slide 42

Slide 42 text

原因は？ What is the cause? 42 Just In Time コンパイルされてないコードは激遅なんだよ！ Just In Time Any code that isn't compiled is very slow! 起動直後はスレッドプールが準備できてないからだろ。 That's because the thread pool isn't ready right after startup. コネクションプールが用意できてないからに決まってるだろ It's because they don't have a connection pool available.

Slide 43

Slide 43 text

デモ環境 Demo Environment ● DBアクセスと他のAPIへアクセスするアプリケーション Applications that access the DB and other APIs ● SpringBoot 2.3.4 on OpenJDK 14 SpringBoot 2.3.4 on OpenJDK 14 ● 全てのイベントを記録するようにJFRを設定 Set JFR to record all events ● HTTPとJDBCを記録するように独自イベントを作成 Create your own events to record HTTP and JDBC 43

Slide 44

Slide 44 text

デモで実行した処理の流れ The ﬂow of the process performed in the demo 44 JFR開始 Start JFR JFR終了 Stop JFR リクエスト Request 操作 Operations 1回目(1.2s) 2回目(16ms) 3回目(12ms) 4回目(13ms) 5回目(15ms)

Slide 45

Slide 45 text

各リクエストの時間 Processing time for each request 45 1回目のリクエスト First request 2回目以降のリクエスト Second and subsequent requests 処理期間は10-15ms Processing time is 10-15ms 処理期間は800msだが、400ms差がある The processing time is 800ms, but there is a 400ms diﬀerence

Slide 46

Slide 46 text

2～5回目のリクエストの処理の内訳 A breakdown of the processing of the second through ﬁfth requests 46 遅延している処理は記録されていない Delayed processes are not recorded. DBの処理期間は短い DB processing time is short.

Slide 47

Slide 47 text

最初のリクエストの処理の内訳 A breakdown of the processing of the ﬁrst request 47 DBの処理期間は短い DB processing time is short. クラスロードが長期間を占める Class load takes up a long period of time.

Slide 48

Slide 48 text

他のスレッド The other thread. 48 スレッドは処理を受ける前にある Threads are there before it gets processed. JITは処理され続けている JIT continues to be processed. 処理スレッド processing thread 処理前のFWの処理が400ms掛っている It is taking 400ms to process the FW before processing

Slide 49

Slide 49 text

結論 Conclusion ● 処理開始時点にはスレッドは用意ができている At the start of the process, the thread is ready to go ● コネクションプールでは遅延していない No delays in connection pools ● JIT されていなくても処理は速い Even if it's not JIT'd, it's still fast ● 処理に必要なクラスのロードに時間が掛かっている It's taking a long time to load the classes needed for processing 49

Slide 50

Slide 50 text

CONFIDENTIAL Designator linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHat Thank you 50