SRv6-mobile VPP plugin Performance Optimization

Naoyuki Mori IAGS/CPDP/CEE/NnP team 2019/09/19 for 2nd FD.io Users community
event #2

Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Agenda
1. Target workload: VPP* SRv6-mobile plugin
2. Sources & patches for baseline
3. Analysis
4. Performance optimization
5. Results & Summary

Target workload
T.M.GTP4.D (ex. T.M.Tmap)
▪ Converts a GTP-U over IPv4 packet to an IPv6 packet with an SRH inserted
▪ Running on bare-metal VPP

Source code sets used

    $ git reflog
    597f6e865 (HEAD -> ietf105-hackathon, origin/ietf105-hackathon, origin/HEAD) HEAD@{0}: clone: from https://github.com/filvarga/srv6-mobile.git
    $ git log | head
    commit 597f6e86590745f54ad07e8bdc24dd4c5b066274
    Author: matsusato3 <[email protected]>
    Date:   Mon Jun 24 11:41:09 2019 +0900

Motivation & Disclaimer
• For research purposes
• Intended to share the principal ideas and practical methods of IA optimization
• Performance numbers in these slides are for indication only; Intel does not guarantee them

Optimization approach for existing applications
Optimization for Intel® architecture is carried out based on this model. What can be optimized, and to what level, depends on the workload and the programming language used.

Effort vs. performance
[Figure: performance plotted against effort. Labelled levels: theoretical performance, required performance, performance achieved with tools, performance achieved without tools.]

Achieving line rate: how severe is it? (From the program's point of view)

Line rate   Packet arrival interval at 64 bytes/packet   Cycles on a 2 GHz CPU
10 Gbps     67.2 ns                                      134 cycles
40 Gbps     16.8 ns                                      33 cycles

[Figure: pipeline timeline. Packet 1: receive, lookup, action, transmit. Packet 2: receive, lookup, action, transmit. The two are offset by the packet arrival interval.]

Memory access latency at Skylake architecture

Level          Latency [CPU cycles]   Capacity
Registers      0
L1             4~6                    32 kB
L2             14                     1 MB
L3 (LLC)       50~70                  1.375 MB/core
Main memory    170~                   ~1.5 TB/socket

Latencies are for the Skylake† server microarchitecture, in CPU cycles; the main memory value is a reference figure at a CPU clock of 2.50 GHz.

[Figure: two cores, each with registers, execution units, L1 and L2, sharing L3 (LLC) and main memory. Programs and data are placed in main memory; copies of instructions and data are transferred up through the caches; instructions are finally executed in the execution units. Fetching data from the nearest (fastest) cache reduces the time spent waiting for loads.]

In NFV, an L3 miss = 170 cycles of latency => packet drop!

† Development code name

Step 0: Bug fix patches applied as baseline

Fix: only 1 out of the following 3 packets was handled correctly. Performance impact: -1.1%

    @@ -1645,42 +1645,42 @@ sr_policy_rewrite_encaps_v4 (vlib_main_t * vm, vlib_node_runtime_t * node,
               if (PREDICT_TRUE (sl1->is_tmap))
                 {
    -              sr1 = (void*)(ip0+1);
    +              sr1 = (void*)(ip1+1);
                   sr1->segments->as_u32[1] = dst_addr1.as_u32;
                   sr1->segments->as_u8[9] = ((u8*) &teid1)[0];
                   sr1->segments->as_u8[10] = ((u8*) &teid1)[1];
                   sr1->segments->as_u8[11] = ((u8*) &teid1)[2];
                   sr1->segments->as_u8[12] = ((u8*) &teid1)[3];
    -              ip1->src_address.as_u64[0] = sl3->local_prefix.as_u64[0];
    +              ip1->src_address.as_u64[0] = sl1->local_prefix.as_u64[0];
                   ip1->src_address.as_u32[2] = sr_addr1.as_u32;
                   ip1->src_address.as_u16[6] = sr_port1;
                 }

Fix: segmentation fault triggered by the packet trace dump CLI command. No performance impact.

    @@ -1089,7 +1091,7 @@ format_sr_policy_rewrite_trace (u8 * s, va_list * args)
           s = format (s, "SR-policy-rewrite: src %U dst %U\n\tLocal SID Prefix: %U Node: %U",
                       format_ip6_address, &t->src, format_ip6_address, &t->dst,
    -                  format_ip6_address, t->local_prefix, format_ip6_address, t->node);
    +                  format_ip6_address, &t->local_prefix, format_ip6_address, &t->node);
         }
       else
         {

VTune profiling: overview of the vpp process
Traffic is applied to only one worker thread, so filter the profile on that thread.

VPP thread info identified by vppctl commands

    $ dpdk-devbind.py -s
    Network devices using DPDK-compatible driver
    ============================================
    0000:88:00.0 'Ethernet Controller X710 for 10GbE SFP+ 1572' drv=vfio-pci unused=i40e,uio_pci_generic
    0000:af:00.0 'Ethernet Controller XXV710 for 25GbE SFP28 158b' drv=vfio-pci unused=i40e,uio_pci_generic
    0000:af:00.1 'Ethernet Controller XXV710 for 25GbE SFP28 158b' drv=vfio-pci unused=i40e,uio_pci_generic

    vpp# show interface rx-placement
    Thread 1 (vpp_wk_0):
      node dpdk-input:
        TenGigabitEthernet88/0/0 queue 0 (polling)
    Thread 2 (vpp_wk_1):
      node dpdk-input:
        TwentyFiveGigabitEthernetaf/0/0 queue 0 (polling)
    Thread 3 (vpp_wk_2):
      node dpdk-input:
        TwentyFiveGigabitEthernetaf/0/1 queue 0 (polling)

    vpp# show threads
    ID  Name      Type     LWP    Sched Policy (Priority)  lcore  Core  Socket  State
    0   vpp_main           57303  other (0)                32     0     0
    1   vpp_wk_0  workers  351    other (0)                33     5     1
    2   vpp_wk_1  workers  367    other (0)                34     6     1
    3   vpp_wk_2  workers  368    other (0)                35     8     1
    4   vpp_wk_3  workers  369    other (0)                36     9     1

VTune profiling: filtered on the target worker thread
• Most intensive function: sr_policy_rewrite_encaps_v4(), about 10% of the thread's time
• High CPI of 0.56 compared with the thread average of 0.36
• The issue is in the Back End
• CPU frequency should be fixed for profiling: disable Intel® Turbo Boost and Intel® SpeedStep

Look at the Skylake architecture block diagram
[Figure: Skylake core block diagram. Front End: branch prediction unit, 32KB L1 I$, pre-decode, instruction queue, decoders, μop cache, μop queue. Back End: in-order allocate/rename/retire, out-of-order scheduler, load buffer, store buffer, reorder buffer; INT/VEC execution ports 0, 1, 5 and 6 (ALU, shift, MUL, DIV, LEA, FMA, shuffle, JMP); ports 2 and 3 (Load/STA), port 4 (Store Data), port 7 (STA). Memory: 32KB L1 D$, fill buffers, 1MB L2$, memory control.]

What to focus on when optimizing
◆ Items shown when sampling results from the "General Exploration" analysis type are displayed in the "General Exploration" viewpoint

Micro-ops issued?
  No  -> Allocation stall?
           No  -> Front End Bound (core): shortage of micro-operation supply (the front end supplies fewer than 4 micro-operations per cycle)
           Yes -> Back End Bound (core): memory access, execution, dispatch or allocation is the bottleneck
  Yes -> Micro-op ever retires?
           No  -> Bad Speculation: recovery from branch mispredictions is needed and causes delay
           Yes -> Retiring: successful retirement; the algorithmic path length is what consumes the cycles

Micro-ops: micro-operations. Retire: execution completed.

VTune profiling: look into the worker thread
The issue is in stores.

VTune profiling: the SRH generation part is store intensive
Store, store, store… The CPI is not good.

What operations is this function doing?
[Figure: annotated listing showing the register widths used by the stores: 32-bit register, 8-bit register, 32-bit register.]

Look back at the Skylake architecture diagram
[Figure: the Skylake core block diagram again, with port 4 (Store Data) highlighted.]
Only 1 store data port! At most one store can be issued per cycle, so a run of back-to-back stores executes at a CPI of about 1.0.
Strategy: reduce the number of stores.

Optimization 1: changed four 8-bit accesses to one 32-bit access

Reduced 4 load & 4 store operations to 1 load & 1 store operation: Performance +1.0%

    @@ -1640,10 +1640,7 @@ sr_policy_rewrite_encaps_v4 (vlib_main_t * vm, vlib_node_runtime_t * node,
                 {
                   sr0 = (void*)(ip0+1);
                   sr0->segments->as_u32[1] = dst_addr0.as_u32;
    -              sr0->segments->as_u8[9] = ((u8*) &teid0)[0];
    -              sr0->segments->as_u8[10] = ((u8*) &teid0)[1];
    -              sr0->segments->as_u8[11] = ((u8*) &teid0)[2];
    -              sr0->segments->as_u8[12] = ((u8*) &teid0)[3];
    +              *(u32*)&(sr0->segments->as_u8[9]) = teid0;

                   ip0->src_address.as_u64[0] = sl0->local_prefix.as_u64[0];
                   ip0->src_address.as_u32[2] = sr_addr0.as_u32;

Reason: a store-bound issue was found. The four 8-bit stores run at a CPI of 1.0 while the other parts run at a CPI of ~0.4, so this is the bottleneck.
Wow: the address is not 32-bit aligned… but this works without a fault, and it is better than four stores.

VTune profiling: clockticks of the target function reduced a little (-4%)
• Still Back End bound
• Time reduced from 29.7s to 28.4s: 1.04x better

What operations is this function doing, again?
[Figure: the same listing, annotated with a 128-bit register.]
Why not utilize a 128-bit vector register, which matches the IPv6/SR header length?

Extension of the logical register file
[Figure: register file diagram.
General-purpose registers (32/64-bit): EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP are available in 32-bit mode (8 registers); RAX to R15 are available in 64-bit mode (16 registers).
Intel® SSE (128-bit) / Intel® AVX (256-bit) / Intel® AVX-512 (512-bit) registers: XMM0 to XMM7 are available to SSE in 32-bit mode (8 registers); XMM0/YMM0 to XMM15/YMM15 are available to SSE and AVX/AVX2 in 64-bit mode (16 registers); ZMM0 to ZMM31 are available to AVX-512 in 64-bit mode (32 registers).
Green marks registers available only in 64-bit mode; orange marks the Intel® AVX-512 extension. Extension in Skylake†: twice the number of registers.]
† Development code name

Memory access latency at Skylake architecture
[Figure: the memory hierarchy diagram shown again. Latency in CPU cycles: registers 0, L1 4~6, L2 14, L3 (LLC) 50~70, main memory 170~. Capacity: L1 32 kB, L2 1 MB, L3 1.375 MB/core, main memory ~1.5 TB/socket. Latencies are for the Skylake† server microarchitecture; the DDR4 value is a reference figure at a CPU clock of 2.50 GHz. Programs and data are placed in main memory, copies are transferred up through the caches, and instructions are finally executed in the execution units.]
† Development code name

SIMD: Single Instruction, Multiple Data
• Scalar processing: one instruction produces one result
    X + Y = X + Y
• SIMD processing (Intel® AVX/AVX2): one instruction produces multiple results
    X     = x7    x6    x5    x4    x3    x2    x1    x0
    Y     = y7    y6    y5    y4    y3    y2    y1    y0
    X + Y = x7+y7 x6+y6 x5+y5 x4+y4 x3+y3 x2+y2 x1+y1 x0+y0

Optimization 2: changed to 128-bit store accesses only, vectorization with SSE

Reduced 5 store operations to 2 store operations with SIMD: Performance +4.2%

    @@ -1639,45 +1639,89 @@ sr_policy_rewrite_encaps_v4 (vlib_main_t * vm, vlib_node_runtime_t * node,
               if (PREDICT_TRUE (sl0->is_tmap))
                 {
                   sr0 = (void*)(ip0+1);
    -              sr0->segments->as_u32[1] = dst_addr0.as_u32;
    -              *(u32*)&(sr0->segments->as_u8[9]) = teid0;
    -
    -              ip0->src_address.as_u64[0] = sl0->local_prefix.as_u64[0];
    -              ip0->src_address.as_u32[2] = sr_addr0.as_u32;
    -              ip0->src_address.as_u16[6] = sr_port0;
    +              __m128i segl = _mm_cvtsi32_si128(sr0->segments->as_u32[0]);
    +              __m128i dst_addr = _mm_cvtsi32_si128(dst_addr0.as_u32);
    +              dst_addr = _mm_bslli_si128(dst_addr , 4);
    +              __m128i teid = _mm_cvtsi32_si128(teid0);
    +              tmp1 = _mm_bslli_si128(teid, 9);
    +              dst_addr = _mm_or_si128(teid, dst_addr);
    +              segl = _mm_or_si128(dst_addr, segl);
    +              _mm_storeu_si128((void *)&(sr0->segments->as_u32[0]), segl);
    +
    +              __m128i ipv6_h = _mm_cvtsi32_si128(sl0->local_prefix.as_u64[0]);
    +              __m128i src_addr = _mm_cvtsi32_si128(sr_addr0.as_u32);
    +              src_addr = _mm_bslli_si128(src_addr, 8);
    +              __m128i src_port = _mm_cvtsi32_si128(sr_port0);
    +              src_port = _mm_bslli_si128(src_port, 12);
    +              ipv6_h = _mm_or_si128(dst_addr, ipv6_h);
    +              ipv6_h = _mm_or_si128(src_port, ipv6_h);
    +              _mm_storeu_si128((void *)&(ip0->src_address.as_u64[0]), ipv6_h);
                 }

Strategy: prepare 128-bit registers and load the ingredients into them, merge them into one register, then store it in one operation. Reduced 5+3 stores to 2 stores only.

    @@ -59,7 +59,7 @@ set(VPP_LOG2_CACHE_LINE_SIZE ${VPP_LOG2_CACHE_LINE_SIZE}
     # CPU optimizations and multiarch support
     ##############################################################################
     if(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*")
    -  set(CMAKE_C_FLAGS "-march=corei7 -mtune=corei7-avx ${CMAKE_C_FLAGS}")
    +  set(CMAKE_C_FLAGS "-march=skylake-avx512 -mtune=skylake-avx512 ${CMAKE_C_FLAGS}")

Quick hack to avoid the SSE-to-AVX transition penalty. The VPP build system links objects compiled with legacy SSE options and objects compiled with AVX options into one binary unless the source code is written to be multi-arch aware…

How to utilize the vector registers
[Figure: three 128-bit registers being merged into one.]
Prepare three 128-bit registers, load the ingredients into them, then reduce them into one register and store it at once.

VTune profiling: number of stores reduced by the SSE 128-bit store code, 2.0x faster
• Ranked down from the #1 to the #3 bottleneck!
• 29.7s to 14.4s: 2x faster
• Number of instructions: 1.48x fewer
• CPI 0.57 down to 0.43: 1.32x better
• i.e. 1.48 x 1.32 = 1.96x
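The product on this slide follows from the usual decomposition of execution time into instruction count, CPI and clock frequency. The worked form below uses only the numbers quoted on the slide and assumes the clock frequency f was the same in both runs (Turbo Boost disabled, as noted on the profiling slide).

```latex
T = \frac{N_{\text{instr}} \cdot \text{CPI}}{f}
\quad\Longrightarrow\quad
\frac{T_{\text{before}}}{T_{\text{after}}}
  = \frac{N_{\text{before}}}{N_{\text{after}}}
    \cdot \frac{\text{CPI}_{\text{before}}}{\text{CPI}_{\text{after}}}
  \approx 1.48 \times \frac{0.57}{0.43}
  \approx 1.48 \times 1.33
  \approx 1.96
```

For comparison, the measured times give 29.7 s / 14.4 s ≈ 2.06, so the two factors account for nearly all of the observed speedup of the function.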

Optimization result

Step                      Throughput [Mpps]
Original                  6.89
Bug fixed Baseline        6.82
u32 store optimization    6.91
SIMD optimization         7.10

[Figure: bar chart of throughput in Mpps for the four steps above.]

Source packet: ICMP ping over GTP-U over IPv4 (78 bytes) in source code set, 1 flow only.
Maximum throughput measurement, not a Non Drop Rate condition.

Performance results are based on testing as of Aug. 6, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Configuration: Testing by Intel as of Aug. 6, 2019. Intel® Xeon® Scalable Processor Platinum 8180 @ 2.50GHz, DDR4-2666 768 GB RAM, Intel HyperThreading Technology is disabled. Software: GCC 7.4.0, Linux OS: Ubuntu* 18.04, Kernel 4.15.0-52. Intel® Ethernet Adapter XXV710-DA2 installed in NUMA domain 1.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Verified by packet dump
[Figure: three packet dumps. Input: GTP-U. Output: original source code set. Output: optimized with SIMD.]

Summary
• To achieve the highest performance, you need to be aware of the underlying architecture
• Leverage tools for fast and efficient performance improvement
• The VPP framework is already well optimized, so any plugin can benefit from it!

Legal Disclaimer & Optimization Notice

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Performance results are based on testing as of Aug/9th/2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2019, Intel Corporation. All rights reserved. Intel, the Intel logo, Pentium, Xeon, Core, VTune, OpenVINO, Cilk, are trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries.

Optimization approach for existing applications

Optimization result

Step                               Throughput [Mpps]
Original                           6.89
Bug fixed Baseline                 6.82
u32 store optimization             6.91
SIMD optimization (2.5GHz)         7.10
SIMD + Turbo Boost (3.2GHz)        9.08

[Figure: bar chart of throughput in Mpps for the five steps above.]

Source packet: ICMP ping over GTP-U over IPv4 (78 bytes) in source code set, 1 flow only.
Maximum throughput measurement, not a Non Drop Rate condition.

Configuration: Testing by Intel as of Sep. 2, 2019. Intel® Xeon® Scalable Processor Platinum 8180 @ 2.50GHz, DDR4-2666 768 GB RAM, Intel HyperThreading Technology is disabled. Software: GCC 7.4.0, Linux OS: Ubuntu* 18.04, Kernel 4.15.0-52. Intel® Ethernet Adapter XXV710-DA2 installed in NUMA domain 1.
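A quick check of the Turbo Boost figure against the clock ratio, using only the numbers on this slide:

```latex
\frac{9.08\ \text{Mpps}}{7.10\ \text{Mpps}} \approx 1.28,
\qquad
\frac{3.2\ \text{GHz}}{2.5\ \text{GHz}} = 1.28
```

The throughput gain matches the frequency gain almost exactly. That suggests, though the slide does not state it, that this single-worker workload scales with core clock and is limited by the core, not by memory or the NIC.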

Improving instruction throughput (evolution of processor architecture)
[Figure: instruction throughput on the pipeline for different configurations, each showing resources 1 to 3 of a physical core and the logical processors recognized by the OS. Labelled configurations: superscalar (conventional CPU, thread 1), HT (2 logical CPUs, thread 1 + thread 2), MP (2 physical CPUs, thread 1 + thread 2). Instruction counts shown in the figure: 2, 5, 9 and 14 instructions.]

Capture clock ticks per graph node by vppctl

    vpp# show runtime
    …
    Thread 2 vpp_wk_1 (lcore 34)
    Time 10.0, average vectors/node 256.00, last 128 main loops 22.00 per node 256.00
      vector rates in 7.1194e6, out 7.1194e6, drop 0.0000e0, punt 0.0000e0
    Name                             State     Calls    Vectors    Suspends  Clocks   Vectors/Call
    TwentyFiveGigabitEthernetaf/0/   active    278638   71331328   0         8.05e0   256.00
    TwentyFiveGigabitEthernetaf/0/   active    278638   71331328   0         7.36e1   256.00
    dpdk-input                       polling   278638   71331328   0         2.95e1   256.00
    ethernet-input                   active    278638   71331328   0         1.16e1   256.00
    ip4-input-no-checksum            active    278638   71331328   0         1.68e1   256.00
    ip4-load-balance                 active    278638   71331328   0         1.28e1   256.00
    ip4-lookup                       active    278638   71331328   0         2.51e1   256.00
    ip6-load-balance                 active    278638   71331328   0         1.54e1   256.00
    ip6-lookup                       active    278638   71331328   0         7.06e1   256.00
    ip6-rewrite                      active    278638   71331328   0         2.71e1   256.00
    sr-pl-rewrite-encaps-v4          active    278638   71331328   0         5.86e1   256.00
