Practical FPGA Development Seminar vol.21 / FPGA_seminar_21_fixstars_corporation_20230426

Copyright© Fixstars Group
Who I am
Hiroki NISHIMOTO (西本 宏樹)
Engineer, Solutions Division 4

Outline
- What is the AI Engine?
- Introduction to template matching
- Seven steps to an implementation
- Summary

What is the AI Engine?
- A processor developed by AMD
- Power-efficient, with fast arithmetic
- Connected to the PL (FPGA) via AXI4-Stream
→ Combining the FPGA with the AI Engine makes faster computation possible
※ ACAP overview figure, quoted from https://japan.xilinx.com/products/technology/ai-engine.html

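What kind of architecture is it?
- Equipped with a powerful vector arithmetic unit
- Built as an AIE Array in which many AI Engine cores are connected
- Each core processes data and passes it on to the next core
→ An architecture that combines multi-core parallelism with pipelining. Its applications are not limited to AI.
※ Figures quoted from https://japan.xilinx.com/products/technology/ai-engine.html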

How is the system organized?
- The host sends data through memory to the PL, which forwards it to the AIE Array
- The PL → AIE Array transfer uses AXI4-Stream
※ AIE Array figure quoted from https://japan.xilinx.com/products/technology/ai-engine.html

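What is template matching?
Given a patch cut out of an image (the template image), the task is to find where in the original image (the source image) that patch came from.
Applications:
- Depth estimation
- Object tracking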

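Which algorithm do we use?
There are many options; this time we use SAD (Sum of Absolute Differences). A window of the same size as the template is scanned over the source image, and the position where the pixel-wise difference is smallest is selected.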
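
Written out, the SAD search described above can be expressed as follows. The symbols are notation introduced here for illustration, not taken from the slides: $S$ is the source image, $T$ is the $W_T \times H_T$ template, and $(x, y)$ is the top-left corner of the window.

```latex
\mathrm{SAD}(x, y) = \sum_{j=0}^{H_T-1} \sum_{i=0}^{W_T-1}
    \left| S(x+i,\, y+j) - T(i, j) \right|,
\qquad
(x^\ast, y^\ast) = \operatorname*{arg\,min}_{(x,\, y)} \mathrm{SAD}(x, y)
```

A score of 0 means the window is pixel-for-pixel identical to the template.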

Mapping template matching onto the AI Engine
① Load the template image into the AIE
② Load the data to be checked into the AIE
③ Compute the SAD inside the AIE
④ Send the result to the host

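Seven steps to an implementation
1. Implement a reference program
2. Implement the kernel that runs on the AI Engine
3. Implement the AI Engine Graph
4. Simulate the AI Engine
5. Implement the PL that connects the AI Engine to the host
6. Write the host program
7. Verify with HW Emulation or on real hardware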


Implementing the reference program
Flow: load the source and template images → convert to grayscale → compute SAD → done

    for( int hi = 0 ; hi < SRC_H - TMP_H ; ++hi ){
        for( int wi = 0 ; wi < SRC_W - TMP_W ; ++wi ){
            // cut out the window to be checked
            uint8_t *ins_img_gray = cut_img();
            // run the SAD score function
            uint32_t score = sad_part( ins_img_gray, tmp_img_gray, TMP_SIZE );
            // a smaller SAD means a closer match, so keep the minimum
            best_score = min(score, best_score);
        }
    }
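
As a self-contained illustration of the search loop above, here is a minimal sketch. The names `Match`, `sad_at`, and `find_best_match` are hypothetical and not from the deck. It also makes two choices of its own: it indexes the window in place instead of copying it out, and it includes the last valid offset (`<=`).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical result type: best window offset and its SAD score.
struct Match {
    int x;
    int y;
    uint32_t score;
};

// SAD between the template and the window of src whose top-left corner
// is (x, y). Both images are row-major 8-bit grayscale.
uint32_t sad_at(const std::vector<uint8_t>& src, int src_w,
                const std::vector<uint8_t>& tmp, int tmp_w, int tmp_h,
                int x, int y) {
    uint32_t score = 0;
    for (int j = 0; j < tmp_h; ++j) {
        for (int i = 0; i < tmp_w; ++i) {
            const int s = src[(y + j) * src_w + (x + i)];
            const int t = tmp[j * tmp_w + i];
            score += static_cast<uint32_t>(s >= t ? s - t : t - s);
        }
    }
    return score;
}

// Exhaustive search over all window positions, keeping the minimum score.
Match find_best_match(const std::vector<uint8_t>& src, int src_w, int src_h,
                      const std::vector<uint8_t>& tmp, int tmp_w, int tmp_h) {
    assert(tmp_w <= src_w && tmp_h <= src_h);
    Match best{0, 0, std::numeric_limits<uint32_t>::max()};
    for (int y = 0; y <= src_h - tmp_h; ++y) {
        for (int x = 0; x <= src_w - tmp_w; ++x) {
            const uint32_t score = sad_at(src, src_w, tmp, tmp_w, tmp_h, x, y);
            if (score < best.score) best = Match{x, y, score};
        }
    }
    return best;
}
```

Calling `find_best_match` with a template that was copied out of the source image should return the offset it was copied from, with a score of 0.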

Overview of the SAD computation

    uint32_t sad_part(const uint8_t *src, const uint8_t *tmp, const uint32_t TMP_SIZE ){
        uint32_t score = 0;
        for( uint32_t idx = 0 ; idx < TMP_SIZE ; ++idx ){
            const uint8_t comp_src_val = src[ idx ];
            const uint8_t comp_tmp_val = tmp[ idx ];
            if( comp_src_val >= comp_tmp_val )
                score += (comp_src_val - comp_tmp_val);
            else
                score += (comp_tmp_val - comp_src_val);
        }
        return score;
    }

All pixels of the source window and the template image are scanned, and the difference between the two images is accumulated.


Mapping onto the AIE (input side)

    void compute_krnl( input_stream<data_t>* dat, output_stream<uint32_t>* out){
        // read in the template data
        aie::vector<data_t, READ_BYTE_SIZE> tmp_vecs[READ_NUM];
        for ( itr_t i = 0 ; i < READ_NUM ; ++i )
            tmp_vecs[i] = readincr_v<READ_BYTE_SIZE>(dat);
        while(true){
            uint32_t sad_val = 0;
            // read in the window cut from the source image and compute the difference
            for( itr_t bi = 0 ; bi < READ_NUM ; bi++ ){
                aie::vector<data_t, READ_BYTE_SIZE> src_vec = readincr_v<READ_BYTE_SIZE>(dat);
                const uint32_t score = compute_score( src_vec, tmp_vecs[bi]);
                sad_val += score;
            }
            // output after as many pixels as the template has have been processed
            writeincr(out, sad_val);
        }
    }

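Implementing the SAD kernel: making full use of aie::vector
Data flow (figure): window under test and template → greater/less mask → vector of the larger values and vector of the smaller values → vector of absolute differences → sum of absolute differences
The aie::vector functions are used to speed this up.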

Mapping onto the AIE (computing the pixel differences)

    uint32_t compute_score( aie::vector<data_t, READ_BYTE_SIZE> src_vec,
                            aie::vector<data_t, READ_BYTE_SIZE> tmp_vec ){
        // get the greater/less mask
        auto msk_lt = aie::lt( src_vec, tmp_vec );
        // split each pair of corresponding pixels into the larger and the smaller value
        aie::vector<data_t, READ_BYTE_SIZE> grt_vec = aie::select(src_vec, tmp_vec, msk_lt );
        aie::vector<data_t, READ_BYTE_SIZE> les_vec = aie::select(tmp_vec, src_vec, msk_lt );
        uint32_t score = 0;
        // subtract the smaller value from the larger one
        aie::vector<data_t, READ_BYTE_SIZE> sub_vec = aie::sub(grt_vec, les_vec);
        aie::vector<reduce_t, READ_BYTE_SIZE/2> red_vec; // vector for the reduction, 16 bit
        for( itr_t wi = 0; wi < READ_BYTE_SIZE/2; ++wi )
            red_vec[wi] = sub_vec[wi] + sub_vec[wi + READ_BYTE_SIZE / 2];
        // sum the differences with a reduce
        score = aie::reduce_add(red_vec);
        return score;
    }
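
To make the mask / select / subtract / reduce sequence easier to follow, here is a scalar model in plain C++. It is only a sketch of the data flow, not AI Engine code. `LANES` and `compute_score_model` are names introduced here, and the lane count of 32 is an illustrative assumption rather than the deck's `READ_BYTE_SIZE`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t LANES = 32;  // illustrative lane count (assumption)

// Scalar model of the mask / select / subtract / reduce sequence.
uint32_t compute_score_model(const std::array<uint8_t, LANES>& src,
                             const std::array<uint8_t, LANES>& tmp) {
    std::array<uint8_t, LANES> sub{};
    for (std::size_t i = 0; i < LANES; ++i) {
        const bool lt = src[i] < tmp[i];           // comparison mask (src < tmp)
        const uint8_t grt = lt ? tmp[i] : src[i];  // larger value of the pair
        const uint8_t les = lt ? src[i] : tmp[i];  // smaller value of the pair
        sub[i] = static_cast<uint8_t>(grt - les);  // cannot underflow
    }
    // Fold the upper half onto the lower half in 16-bit lanes, then sum.
    std::array<uint16_t, LANES / 2> red{};
    for (std::size_t i = 0; i < LANES / 2; ++i) {
        red[i] = static_cast<uint16_t>(sub[i] + sub[i + LANES / 2]);
    }
    uint32_t score = 0;
    for (const uint16_t v : red) score += v;
    assert(score <= 255u * LANES);  // upper bound for 8-bit inputs
    return score;
}
```

Ordering each pair before subtracting keeps every intermediate value inside the unsigned 8-bit range, so no widening is needed until the reduction.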

Connecting kernels into a flow with the AI Engine Graph

    class compute_graph : public adf::graph {
    private :
        adf::kernel krnl;                 // kernel
    public :
        adf::port<input> dat_in;          // input port
        adf::port<output> res_out;        // output port
        compute_graph(){
            krnl = adf::kernel::create(compute_krnl);
            adf::source(krnl) = "../src/compute_krnl.cpp"; // load the kernel source
            // input → kernel → output
            adf::connect<adf::stream>(dat_in, krnl.in[0]);
            adf::connect<adf::stream>(krnl.out[0], res_out);
            adf::runtime<adf::ratio>(krnl) = 1.0;
        };
    };

Connecting kernels into a flow with the AI Engine Graph

    // stage 0
    for( itr_t ii = 0 ; ii < PARALLEL_NUM / 2 ; ++ii ){
        adf::connect<adf::stream>( reduce_in[ii * 2 ]    , krnl[ii].in[0]);
        adf::connect<adf::stream>( reduce_in[ii * 2 + 1 ], krnl[ii].in[1]);
    }
    // intermediate stages
    uint32_t ofs = 0;
    for( itr_t pi = PARALLEL_NUM / 2; pi > 1; pi = pi >> 1 ){
        for( itr_t ii = 0 ; ii < pi; ++ii)
            adf::connect<adf::stream>( krnl[ ii + ofs ].out[0],
                                       krnl[ ii / 2 + ofs + pi].in[ ii % 2 ]);
        ofs += pi;
    }
    // final stage
    adf::connect<adf::stream>( krnl[ PARALLEL_NUM - 1 - 1 ].out[0] , reduce_out );
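
The index arithmetic of the intermediate-stage loop can be checked off-target. The sketch below reproduces only that arithmetic in plain C++ and lists the kernel-to-kernel connections. `Edge`, `reduction_edges`, and `root_kernel` are hypothetical names, and it assumes `PARALLEL_NUM` is a power of two.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One kernel-to-kernel stream connection in the reduction tree.
struct Edge {
    uint32_t from_kernel;  // producer kernel index (its out[0])
    uint32_t to_kernel;    // consumer kernel index
    uint32_t to_port;      // consumer input port (0 or 1)
};

// Reproduces the index arithmetic of the intermediate-stage loop.
// parallel_num is assumed to be a power of two and at least 2.
std::vector<Edge> reduction_edges(uint32_t parallel_num) {
    assert(parallel_num >= 2 && (parallel_num & (parallel_num - 1)) == 0);
    std::vector<Edge> edges;
    uint32_t ofs = 0;
    for (uint32_t pi = parallel_num / 2; pi > 1; pi >>= 1) {
        for (uint32_t ii = 0; ii < pi; ++ii) {
            edges.push_back(Edge{ii + ofs, ii / 2 + ofs + pi, ii % 2});
        }
        ofs += pi;
    }
    return edges;
}

// Index of the kernel whose output feeds reduce_out.
uint32_t root_kernel(uint32_t parallel_num) { return parallel_num - 2; }
```

For `parallel_num = 8` this lists 0→4, 1→4, 2→5, 3→5, 4→6, 5→6, so seven kernels (indices 0 to 6) form a binary tree and kernel 6 drives `reduce_out`.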


Simulating the AI Engine alone
Just by preparing an input file for each port, the AI Engine part can be simulated on its own (only this part!). Passing the tests here makes it safe to move on to the rest of the implementation.
input_0.txt (each value packs 8 bit × 4 = 32 bits):
    1923201977 3648631153 3015563216 2931403443 3535452630 1352627154 1599623816 1095908160 ……
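
One way such an input file could be generated is sketched below. `pack4` and `to_input_text` are hypothetical helpers. The byte order (first pixel in the least significant byte) and the one-value-per-line layout are assumptions and should be checked against what the simulator and the kernel actually expect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Packs four 8-bit pixels into one 32-bit word.
// Assumption: the first pixel goes to the least significant byte.
uint32_t pack4(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3) {
    return static_cast<uint32_t>(p0) | (static_cast<uint32_t>(p1) << 8) |
           (static_cast<uint32_t>(p2) << 16) | (static_cast<uint32_t>(p3) << 24);
}

// Renders a pixel buffer as decimal 32-bit words, one per line.
std::string to_input_text(const std::vector<uint8_t>& pixels) {
    assert(pixels.size() % 4 == 0);  // whole words only
    std::string text;
    for (std::size_t i = 0; i < pixels.size(); i += 4) {
        text += std::to_string(
            pack4(pixels[i], pixels[i + 1], pixels[i + 2], pixels[i + 3]));
        text += '\n';
    }
    return text;
}
```

Writing the returned string to a file such as input_0.txt then gives one packed word per group of four pixels.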

Implementing the PL that connects the AI Engine to the host

    void mm2s( ap_uint<32>* mem, hls::stream<ap_axiu<32, 0, 0, 0>>& str, int size ) {
        for (int i = 0; i < size; i++) {
            ap_axiu<32, 0, 0, 0> x;
            x.data = mem[i];
            x.keep = -1; // set all byte-enable flags
            str.write(x);
        }
    }

    void s2mm( ap_uint<32>* mem, hls::stream<ap_axiu<32, 0, 0, 0>>& str, int size) {
        for (int i = 0; i < size; i++){
            auto x = str.read();
            mem[i] = x.data;
        }
    }

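What needs to be added to the reference C++ implementation to use the AI Engine?
1. Open the device
2. Load the xclbin onto the device
3. Create the kernels
4. Create the buffer objects
5. Map the host-side buffer pointers into user space
6. Write data into the buffers and sync the contents from host to device buffers
7. Launch the kernels and wait for completion
8. Sync the buffer contents from device to host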

1 & 2. Open the device and load the xclbin

    const int device_index = 0;
    const std::string xclbin_file = argv[1];
    auto device = xrt::device(device_index);         // open the device
    auto uuid = device.load_xclbin(xclbin_file);     // load the xclbin onto the device

These only need to be passed to the provided functions.

3. Create the kernels

    // create PARALLEL_NUM send kernels
    std::vector< xrt::kernel > mm2s_krnls;
    for(itr_t pi = 0; pi < PARALLEL_NUM ; pi++ ){
        mm2s_krnls.push_back( xrt::kernel(device, uuid,
            "mm2s:{mm2s_" + std::to_string(pi + 1) + "}") );
    }
    // create the receive kernel
    auto s2mm_krnl = xrt::kernel(device, uuid, "s2mm:{s2mm_1}");

This is little more than calling the provided functions.

4. Create the buffer objects

    // create PARALLEL_NUM send buffers
    std::vector< xrt::bo > mm2s_bos;
    for(itr_t pi = 0; pi < PARALLEL_NUM ; pi++ ){
        mm2s_bos.push_back( xrt::bo(device, sizeof(uint8_t) * IMG_SIZE / PARALLEL_NUM,
                                    mm2s_krnls[pi].group_id(0)) );
    }
    // create the receive buffer object
    auto s2mm_bo = xrt::bo(device, sizeof(uint32_t) * TRIAL_CNT,
                           s2mm_krnl.group_id(0));

Just specify the desired size in bytes.

5. Map the host-side buffer pointers into user space

    // map the PARALLEL_NUM send buffers
    std::vector<uint8_t *> mm2s_mapd_bufs;
    for(itr_t pi = 0; pi < PARALLEL_NUM ; pi++ ){
        mm2s_mapd_bufs.push_back( mm2s_bos[pi].map<uint8_t *>() );
    }
    // map the receive buffer object
    auto s2mm_mapd_buf = s2mm_bo.map<uint32_t *>();

Just call the map function of each buffer object.

6. Write data into the buffers and sync the contents from host to device buffers

    // write the data into the mapped send buffers
    for ( itr_t pi = 0 ; pi < PARALLEL_NUM ; pi++ ){
        const uint32_t SIZE_PER_UNIT = IMG_SIZE / PARALLEL_NUM;
        const uint32_t OFS = SIZE_PER_UNIT * pi;
        for( int bi = 0; bi < SIZE_PER_UNIT ; bi++ ){
            mm2s_mapd_bufs[pi][bi] = tmp_img_gray[ bi + OFS ];
        }
    }
    for ( itr_t pi = 0 ; pi < PARALLEL_NUM ; pi++ )
        mm2s_bos[pi].sync(XCL_BO_SYNC_BO_TO_DEVICE); // sync

Just write the data and call the sync function.

7. Launch the kernels and wait for completion

    std::vector< xrt::run > mm2s_runs;
    for ( itr_t pi = 0 ; pi < PARALLEL_NUM ; pi++ ){
        mm2s_runs.push_back ( mm2s_krnls[pi]( mm2s_bos[pi], nullptr,
            sizeof(uint8_t) * IMG_SIZE / PARALLEL_NUM / 4) );
    }
    for ( itr_t pi = 0 ; pi < PARALLEL_NUM ; pi++ )
        mm2s_runs[pi].wait(); // wait for the kernel to finish

Just pass the kernel created in step 3 the buffer object and "how many times to run", that is, the loop count of mm2s in 32-bit words.
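
The division by 4 in the launch arguments converts a byte count into the number of 32-bit words that each mm2s instance streams. A small helper, hypothetical and not part of the deck, makes the implicit divisibility requirements explicit:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of 32-bit stream words each mm2s instance has to transfer when an
// image of img_size bytes is split evenly across parallel_num channels.
std::size_t words_per_channel(std::size_t img_size, std::size_t parallel_num) {
    assert(parallel_num > 0);
    assert(img_size % parallel_num == 0);   // even split across channels
    const std::size_t bytes = sizeof(uint8_t) * img_size / parallel_num;
    assert(bytes % sizeof(uint32_t) == 0);  // whole 32-bit words only
    return bytes / sizeof(uint32_t);
}
```

If either assertion fails, integer division would silently drop trailing bytes, so the last pixels of a channel would never reach the AI Engine.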

Adapting the reference program to the AIE
Flow: load the source and template images → convert to grayscale → compute SAD → done

    [AIE setup]
    [send the template image]
    [start s2mm receive]
    for( int hi = 0 ; hi < SRC_H - TMP_H ; ++hi ){
        for( int wi = 0 ; wi < SRC_W - TMP_W ; ++wi ){
            [ mm2s send (one window at a time) ]
        }
    }
    [wait for s2mm receive]

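HW Emulation including the host
Even when the design does not work on real hardware, its behavior can be checked in the waveforms. After synthesis we first ran it from the host, but the results did not match the expected values, so we inspected the waveforms in HW Emulation.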

Verification
- Confirmed that the results match the expected values (figure: source image, template image, output image)
- Speed comparison: CPU: 600 ms, AIE: 7000 ms

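Improvement: reduce the number of host → AIE transfers
Flow: load the source and template images → convert to grayscale → compute SAD → done

    [AIE setup]
    [send the template image]
    for( int hi = 0 ; hi < SRC_H - TMP_H ; ++hi ){
        for( int wi = 0 ; wi < SRC_W - TMP_W ; ++wi ){
            [store the source window to be checked]
        }
    }
    [start s2mm receive]
    [mm2s send (all at once)]
    [wait for s2mm receive]

Result: AIE: 650 ms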

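Summary
- Implemented template matching on the AI Engine
- Explained the seven steps up to running on real hardware
- Debugging when things do not work is hard, but the simulation tooling is rich enough to trace the problem
- Next, we will optimize further, including changes to the kernel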

Who I am
Ryuji NISHIDA (西田 竜之)
Senior Engineer, Solutions Division 4

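Self-introduction
• Ryuji NISHIDA
  ◦ Works on system development using FPGAs
  ◦ Mainly in charge of hardware development
  ◦ Career
    ▪ Semiconductor vendor
      • ASIC development for servers
    ▪ Video and office equipment manufacturer
      • High-image-quality engine LSIs, FPGA development for video equipment
    ▪ Fixstars
      • High-speed financial trading systems using FPGAs
      • Application acceleration with OpenCL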

Agenda
• AI Engine FFT: background and purpose
• FFT implementation with a single kernel
• FFT implementation with multiple kernels
• Reference: implementation of the AMD sample design (XAPP1356)
• Summary

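AI Engine FFT: background and purpose
• Versal AI Engine
  ◦ Useful for compute-intensive applications
  ◦ To get its performance...
    ▪ Algorithms and programming suited to the AI Engine are required
    ▪ An understanding of the AI Engine architecture and characteristics is required
• FFT computation
  ◦ One of the representative compute-intensive applications
    ▪ FFT on the AI Engine is already available through the DSP Library
To deepen our understanding of the AI Engine, we tried implementing the familiar FFT ourselves and share the results here.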

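AI Engine FFT: background and purpose
• FFT computation (figure: 16-point decimation in time)
  ◦ Compute (complex additions and multiplications) ⇒ hold in memory (data reordering) ⇒ … repeated
  ◦ The structure changes depending on how the whole computation is mapped in time and in space
  ◦ Figure annotations: repeated for the number of data points, O(N); repeated for the number of stages, O(log N)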
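
For reference, the compute-then-reorder structure described above corresponds to the textbook iterative radix-2 decimation-in-time FFT. The following plain C++ sketch uses `std::complex<float>` as a stand-in for `cfloat`; it is a generic illustration, not the kernel shown in this deck, and `fft_dit` here is a local function name unrelated to the AI Engine API function of the same name.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cplx = std::complex<float>;

// In-place iterative radix-2 decimation-in-time FFT.
// data.size() is assumed to be a power of two.
void fft_dit(std::vector<cplx>& data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(data[i], data[j]);
    }

    // log2(n) stages; each stage performs n/2 butterflies.
    const float pi = std::acos(-1.0f);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const float angle =
                    -2.0f * pi * static_cast<float>(k) / static_cast<float>(len);
                const cplx w = std::polar(1.0f, angle);  // twiddle factor
                const cplx a = data[start + k];
                const cplx b = data[start + k + len / 2] * w;
                data[start + k] = a + b;
                data[start + k + len / 2] = a - b;
            }
        }
    }
}
```

The outer loop over `len` is the O(log N) stage loop and the two inner loops together visit the N/2 butterflies of a stage, which is the split that can be mapped either in time (one kernel looping over stages) or in space (one kernel per group of stages).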

FFT implementation with a single kernel
• Data type: single-precision complex
  ◦ Use the cfloat type ⇒ complex arithmetic is easy to write
    ▪ Access the real and imaginary parts with .real and .imag
• AI Engine memory capacity
  ◦ Data memory per core: 32 KByte
  ◦ The memories of 4 neighboring cores can be shared: Total 128 KByte
• FFT size: 1024 points
  ◦ Input, output, and intermediate buffers are needed (about 4)
  ◦ 8 Byte (= sizeof(cfloat)) x 1024 points x ~4 buffers ≤ 32 KByte
  ※ Larger sizes may be feasible by using the memory of neighboring cores (not tried)
Reference URL https://docs.xilinx.com/r/ja-JP/am009-versal-ai-engine/AI-%E3%82%A8%E3%83%B3%E3%82%B8%E3%83%B3%E3%81%AE%E3%83%A1%E3%83%A2%E3%83%AA-%E3%83%A2%E3%82%B8%E3%83%A5%E3%83%BC%E3%83%AB
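
The buffer budget above is simple arithmetic, sketched here as compile-time C++. `std::complex<float>` stands in for `cfloat` (both hold two 32-bit floats). The helper names are hypothetical, and the estimate ignores stack and any other data placed in the same memory.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// std::complex<float> stands in for cfloat in this estimate.
constexpr std::size_t kSampleBytes = sizeof(std::complex<float>);
static_assert(kSampleBytes == 8, "expected two 32-bit floats per sample");

constexpr std::size_t kTileMemoryBytes = 32 * 1024;  // per-core data memory

// Bytes needed for buffer_count buffers of `points` complex samples each.
constexpr std::size_t fft_buffer_bytes(std::size_t points,
                                       std::size_t buffer_count) {
    return kSampleBytes * points * buffer_count;
}

// True when the buffers fit in a single core's data memory.
constexpr bool fits_in_one_tile(std::size_t points, std::size_t buffer_count) {
    return fft_buffer_bytes(points, buffer_count) <= kTileMemoryBytes;
}
```

With 1024 points and 4 buffers the result is 32,768 bytes, which fills the 32 KByte exactly; 2048 points with the same number of buffers would need 64 KByte and therefore the memory of neighboring cores.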

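FFT implementation with a single kernel
• Kernel code (shown as an image on the slide)
  ◦ Butterfly computation and reordering, repeated for 10 stages
  ◦ Window access (= transfer through shared memory)
• Graph code (shown as an image on the slide)
  ◦ Connection for Window access from the external I/F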

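FFT implementation with a single kernel
• Describing the compute function (butterfly)
  ◦ AI Engine Intrinsics
    Use the intrinsics for faster, more efficient operation.
    https://www.xilinx.com/htmldocs/xilinx2022_2/aiengine_intrinsics/intrinsics/index.html
    ▪ Use of the vector engine, pipelined operation, etc.
    ▪ Intrinsics for FFT also exist. However, we could not find an FFT operation that supports cfloat, and the specification is difficult and takes time to understand, so they were not applied this time.
  ◦ AI Engine API
    A C++ header library with a higher level of abstraction than the intrinsics.
    https://www.xilinx.com/htmldocs/xilinx2022_2/aiengine_api/aie_api/doc/index.html
    ▪ fft_dit is also available. We tried to apply it but could not get the expected results and gave up. We experimented with the parameters and the input/output data settings, but could not work out the proper usage.
Reference URL https://japan.xilinx.com/developer/articles/aie-kernel-programming-vitis-aie-api.html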

FFT implementation with a single kernel
• FFT 1000 point profile result (run with aiesimulator)
  ◦ The timeline is filled with compute cycles; the computation needs to be made more efficient
  Total 5,720,000 cycle = 5.72 msec @ 1GHz

FFT implementation with multiple kernels
• Splitting into kernels
  ◦ Kernels are connected in series through shared memory
  ◦ A structure that can sustain throughput even when data is fed in continuously
  ◦ Figure annotations: repeated for the number of data points, O(N); kernels placed in series to cover the stages
Reference URL https://docs.xilinx.com/r/ja-JP/am009-versal-ai-engine/%E5%85%B1%E6%9C%89%E3%83%A1%E3%83%A2%E3%83%AA%E3%82%92%E4%BD%BF%E7%94%A8%E3%81%97%E3%81%9F-AI-%E3%82%A8%E3%83%B3%E3%82%B8%E3%83%B3%E3%81%8B%E3%82%89-AI-%E3%82%A8%E3%83%B3%E3%82%B8%E3%83%B3%E3%81%B8%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E9%80%9A%E4%BF%A1

FFT implementation with multiple kernels
• Kernel code (shown as an image on the slide)
  ◦ Butterfly computation and reordering
  ◦ Split into 5 kernels of 2 stages each

FFT implementation with multiple kernels
• Graph code (shown as an image on the slide)
  ◦ Connections for Window access between kernels
  ◦ Created for 5 kernels

FFT implementation with multiple kernels
• Compilation result
  ◦ Graph connections
  ◦ Tile placement
    ▪ No placement constraints were given; placement was automatic

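FFT implementation with multiple kernels
• FFT 1000 point profile result (run with aiesimulator)
  Total 5,714,000 cycle = 5.71 msec @ 1GHz
  ◦ The kernels process the data in turn and synchronize automatically (the kernel code is written independently and the kernels are merely connected in the graph)
  ※ As in the single-kernel run, data was not fed in continuously, so the benefit of splitting into kernels is not visible
  ↓ The throughput improvement needs to be confirmed on real hardware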

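FFT implementation with multiple kernels
• Considering how to parallelize the FFT kernels
  ◦ The structure we tried (figure)
  ◦ An alternative structure (figure)
    ▪ Equivalent throughput, simpler kernels, and more freedom in tile placement (perhaps this would have been the better choice)
    ▪ The AI Engine is a relatively new device, so there is a lot of room to explore the best structure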

Reference: implementation of the AMD sample design (XAPP1356)
• Block-by-Block Configurable Fast Fourier Transform Implementation on AI Engine
  ◦ A structure that gathers the results of four 1024 point FFT kernels through shared memory
Reference URL https://docs.xilinx.com/r/ja-JP/xapp1356-fft-ai-engine/%E8%A4%87%E6%95%B0%E3%81%AE-AI-%E3%82%A8%E3%83%B3%E3%82%B8%E3%83%B3%E3%81%AB%E3%82%88%E3%82%8B-FFT

Reference: implementation of the AMD sample design (XAPP1356)
• Computation
  ◦ Carefully hand-built with extensive use of intrinsics
• Profile result
  ◦ The computation can be seen running efficiently, with no gaps

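Summary
• Through implementing FFT on the AI Engine, we shared the structure and characteristics of the AI Engine
  ◦ How to optimize the compute function
  ◦ Memory organization and how to connect kernels through shared memory
• Showed that multiple kernels can be made to work together with a relatively simple method
• The computation is not yet sufficiently optimized; it became clear that acceleration using the AI Engine API or intrinsics is essential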

What I will talk about in today's LT
How to put ChatGPT, one of the representative Generative AI services, to use in FPGA development
Image source: https://investingnews.com/invest-in-openai-chatgpt/

What is ChatGPT?
ChatGPT (Chat Generative Pre-trained Transformer) is an artificial intelligence chatbot released by OpenAI in November 2022. It is built on OpenAI's GPT-3 family of language models and has been fine-tuned by transfer learning with both supervised learning and reinforcement learning techniques.
Quoted from Wikipedia: https://ja.wikipedia.org/wiki/ChatGPT

① Use at the concept stage (cont.)

SH: I want to find the TOE open source for Intel FPGA. Please let me know the best OSS and its reason.

ChatGPT: The TOE (TCP Offload Engine) open source for Intel FPGA that I would recommend is the TOE project maintained by the Open Programmable Acceleration Engine (OPAE) community. This project provides a full-featured, high-performance TCP/IP stack for FPGA-based NICs (network interface cards) and is optimized for use with Intel FPGAs. The TOE project offers several benefits that make it a great choice for Intel FPGA users:
1. High-performance: The TOE project is designed to offer high-performance TCP/IP offloading capabilities for FPGA-based NICs, allowing for increased throughput and reduced CPU utilization.
2. Open source: The TOE project is open source, which means that users can modify and customize the code to suit their specific needs. This also means that the project is constantly being improved and updated by a community of contributors.
3. Easy to use: The TOE project is designed to be easy to use, with a straightforward API that allows developers to quickly integrate it into their applications.
4. Compatibility: The TOE project is compatible with a wide range of Intel FPGA devices and development boards, making it a versatile option for developers working on FPGA-based networking projects.
Overall, the TOE project is a solid choice for anyone looking to implement TCP/IP offloading on Intel FPGAs. Its high-performance, open-source nature, ease of use, and compatibility with a wide range of Intel FPGA devices make it a top contender in this space.

① Use at the concept stage (cont.)

SH: Where can I find it? Give me the link to download the software.

ChatGPT: You can find the TOE project on the Open Programmable Acceleration Engine (OPAE) GitHub repository. Here's the link to the repository: https://github.com/OPAE/toe You can download the latest version of the TOE project from the repository's releases page: https://github.com/OPAE/toe/releases From there, you can download the source code or pre-built binaries for your platform. The repository also contains documentation, examples, and other resources to help you get started with the project.

① Use at the concept stage (cont.)

SH: Do you have any alternatives?

ChatGPT: Yes, there are alternative TCP Offload Engine (TOE) open source projects available for Intel FPGAs. Here are a few options:
1. DPDK: The Data Plane Development Kit (DPDK) is an open source set of libraries and drivers for fast packet processing on Intel architectures, including FPGAs. DPDK provides a high-performance TCP/IP stack that can be used for TOE offloading.
2. NetCOPE: NetCOPE is an open source framework for FPGA-based network processing that includes a TCP/IP stack optimized for use with Intel FPGAs. NetCOPE offers a range of features, including support for multiple protocols and hardware acceleration.
3. NetFPGA: NetFPGA is an open source platform for networking research and education that includes a TCP/IP stack optimized for FPGA-based NICs. NetFPGA offers a range of features, including support for multiple protocols and hardware acceleration.
These alternative projects offer similar capabilities to the TOE project, but each has its own strengths and weaknesses. Ultimately, the best choice will depend on your specific needs and requirements.

② Use at the specification study stage (cont.)

SH: Which FPGA board do you recommend to implement the TOE project on the OPAE, and why?

ChatGPT: There are several FPGA boards that can be used to implement the TOE project on the OPAE, depending on your specific requirements and budget. Here are a few options that I would recommend:
1. Intel PAC with Intel Arria 10 GX FPGA: This FPGA board is designed specifically for network acceleration and includes an Arria 10 GX FPGA along with 8 GB of DDR4 memory and a 100 Gbps QSFP28 network interface. The board is well-suited for implementing the TOE project, as it provides high-speed networking capabilities and powerful FPGA resources.
2. Terasic DE5a-Net FPGA Development Kit: This development kit includes a Stratix V GX FPGA along with 4 GB of DDR3 memory and multiple high-speed networking interfaces. The board is a good choice for implementing the TOE project, as it provides a large FPGA with plenty of resources and support for multiple networking protocols.
3. Intel FPGA Development Kit: This development kit includes a Cyclone V GX FPGA along with 1 GB of DDR3 memory and multiple high-speed networking interfaces. The board is a more budget-friendly option for implementing the TOE project, while still providing a powerful FPGA and networking capabilities.
Ultimately, the choice of FPGA board will depend on your specific requirements, including performance, cost, and development environment. All of the boards listed above are compatible with the OPAE and provide a good starting point for implementing the TOE project.

② Use at the specification study stage (cont.)

SH: The cost is limited to 10,000 USD, and higher performance is better. Which FPGA board is your best choice?

ChatGPT: With a budget of $10,000 and a requirement for higher performance, I would recommend the Intel PAC with Intel Arria 10 GX FPGA. This board is designed specifically for network acceleration and offers a powerful Arria 10 GX FPGA, along with 8 GB of DDR4 memory and a 100 Gbps QSFP28 network interface. The board is well-suited for implementing the TOE project and can provide the high performance you require. The Intel PAC with Intel Arria 10 GX FPGA has a list price of around $3,000, leaving plenty of room in your budget for additional components, such as power supplies, cables, and development tools. The board is also well-supported by the OPAE community, which can help simplify the development process and provide access to a range of resources and examples. Overall, the Intel PAC with Intel Arria 10 GX FPGA is a great choice for implementing the TOE project on a budget of $10,000 while still providing the high performance you require.

③ Use at the design/implementation stage

SH: Write Verilog code for an asynchronous FIFO with parameterizable width and depth.

ChatGPT: Here's an example of an asynchronous FIFO with parameterizable width and depth in Verilog:

    module fifo # (
        parameter WIDTH = 8,
        parameter DEPTH = 16
    ) (
        input clk,
        input rst,
        input wr_en,
        input rd_en,
        input [WIDTH-1:0] wr_data,
        output [WIDTH-1:0] rd_data,
        output full,
        output empty
    );
        reg [WIDTH-1:0] memory [0:DEPTH-1];
        reg [WIDTH-1:0] rd_data_reg;
        reg [WIDTH-1:0] wr_ptr;
        reg [WIDTH-1:0] rd_ptr;
        reg [DEPTH-1:0] occupancy;

        always @(posedge clk) begin
            if (rst) begin
                wr_ptr <= {WIDTH{1'b0}};
                rd_ptr <= {WIDTH{1'b0}};
                occupancy <= {DEPTH{1'b0}};
                rd_data_reg <= {WIDTH{1'b0}};
            end else begin
                if (wr_en && !full) begin
                    memory[wr_ptr] <= wr_data;
                    wr_ptr <= wr_ptr + 1;
                    occupancy <= occupancy + 1;
                end
                if (rd_en && !empty) begin
                    rd_data_reg <= memory[rd_ptr];
                    rd_ptr <= rd_ptr + 1;
                    occupancy <= occupancy - 1;
                end
            end
        end

        assign rd_data = rd_data_reg;
        assign full = (occupancy == DEPTH);
        assign empty = (occupancy == 0);
    endmodule

③ Use at the design/implementation stage (cont.)

SH: Write Verilog code for an asynchronous FIFO with parameterizable width and depth.

ChatGPT (continued): This module defines an asynchronous FIFO with a configurable WIDTH and DEPTH. The module has input ports for the clock signal (clk), reset signal (rst), write enable (wr_en), and read enable (rd_en). The module also has input ports for write data (wr_data) and output ports for read data (rd_data), full flag (full), and empty flag (empty). The module uses an array (memory) to store data in the FIFO, and read and write pointers (rd_ptr and wr_ptr) to keep track of the current positions in the FIFO. The module also uses an occupancy counter (occupancy) to keep track of how many elements are currently in the FIFO. In the always block, the module checks for a reset signal and updates the read and write pointers and occupancy counter based on the write and read enable signals. The read data is stored in a register (rd_data_reg) to avoid race conditions. The assign statements assign the read data, full flag, and empty flag based on the current occupancy of the FIFO.

There is still a lot that ChatGPT cannot do!

SH: Write the fastest TCP Offload Engine HDL code for Intel FPGA as detailed as possible.

ChatGPT: Designing a TCP Offload Engine (TOE) is a complex task that requires a deep understanding of both the TCP/IP protocol and FPGA architecture. Here are some general steps that may be involved in implementing a TOE on an Intel FPGA, along with some tips for optimizing the HDL code for maximum performance: (snip)

Turning requirements that are highly abstract and carry demanding non-functional requirements into a concrete implementation is the engineer's job!

Takeaway
• Talking with ChatGPT can make manual searching more efficient
• It bridges the gap between non-engineers (PM/PdM) and engineers
• Simple programming tasks can be left to ChatGPT
• The work of FPGA engineers will (probably) not go away
Happy FPGA Life!
