Introducing the Woven Traffic Safety (WTS) Dataset for Pedestrian-Centric, Large-Scale Traffic Safety Video Analysis

Quan Kong (孔全)
Woven by Toyota, Inc.
2024.03.28, 7th Data-Centric AI Study Group (第7回Data Centric AI 勉強会)

Contents
Self-introduction & Woven City
1. Motivation and background
2. WTS + AI City Challenge 2024 @ CVPR24
3. Overview and features of the WTS dataset
4. Dataset creation
5. Tasks & baselines
6. Evaluation results

Self-introduction
❖ Name: Quan Kong (孔全, read "Kou Zen")
❖ Affiliation & career: Osaka University, doctoral program in Information Science (ML + wearable computing / sensing); Hitachi, Ltd., Central Research Laboratory (CV + ML); Woven by Toyota, Inc. (CV + ML), Research Scientist, ML Modeling Sub-Lead
❖ Topics of interest: Video understanding, Representation learning, Multi-Modal learning, Generative learning, Dataset creation
Paper & Project

Why are we building Woven City?
Introduction to Woven City

Starting point: the 2011 Great East Japan Earthquake and the production shift
Locations shown: Susono, Shizuoka; Tohoku region

CES 2018

Three kinds of Mobility

Test courses for automobiles: Shibetsu, Shimoyama
Test course for mobility: Woven City

PURPOSE / "Why we exist", VISION / "Where we want to be", MISSION / "What we do to get there"
Well-being for all. (Mass-producing happiness.)
Building the future fabric of life in a City as a Test Course for Mobility. (In a test-course city, we invent what will be taken for granted in the future.)
Expand mobility. Enhance humanity. Engage society. (Expanding "mobility".)
OVERVIEW

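A real test-course city
A city for proof-of-concept experiments with people's daily lives built in
People who work here, live here, or visit: everyone is an inventor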

A framework that supports invention: hardware, software, accelerating development, creating together

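Woven Test Course
As "Phase 1", building construction is to be completed in 2024, with some demonstrations planned to start in 2025.
<Phase 1> 50,000 m², 360 residents planned
An "unfinished city" that will keep improving and evolving afterwards.
Future plan: 708,000 m² (175 acres)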

Service introduction

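Demonstrations planned for Phase 1 (partial list). *Actual content is subject to change.
Moving even people's hearts through the mobility of people, goods, and information:
・Autonomous driving and mobility services such as e-Palette
・Logistics services that also make use of robots
・Remote communication technology that brings "hearts closer together"
・Hydrogen energy that is easy to carry around
with ENEOS: demonstration of the whole hydrogen supply chain, "making", "transporting", and "using" hydrogen
with Nissin Foods (日清食品): demonstration toward realizing well-being through food (e.g., offering complete-nutrition menus)
with Rinnai: demonstration toward contributing to carbon neutrality using hydrogen cooking appliances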

1. Motivation and background of the Woven Traffic Safety Dataset

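1. Motivation & background - current state of pedestrian accidents

Pedestrian accidents by pedestrian state (all of Japan, annual count, share [%]):
While crossing
・At a crosswalk: 12,402 (33.7)
・Near a crosswalk: 935 (2.5)
・Near a pedestrian overpass: 47 (0.1)
・Other crossing: 7,379 (20.1)
Not crossing
・Playing on the road: 173 (0.5)
・Working on the road: 619 (1.7)
・Standing still on the road: 789 (2.1)
・Lying on the road: 214 (0.6)
・Walking facing traffic: 2,702 (7.3)
・Walking with back to traffic: 3,779 (10.3)
・Other: 7,762 (21.1)

Pedestrian accidents by pedestrian violation (all of Japan, annual count, share [%]):
・Ignoring a traffic signal: 522 (1.4)
・Violation of the designated walking area: 892 (2.4)
・Crossing outside a crosswalk: 1,496 (4.1)
・Diagonal crossing: 402 (1.1)
・Crossing immediately in front of or behind a parked vehicle: 265 (0.7)
・Crossing immediately in front of or behind a moving vehicle: 1,098 (3.0)
・Crossing where crossing is prohibited: 139 (0.4)
・Infant walking alone: 113 (0.3)
・Carelessness at a railway crossing: 32 (0.1)
・Wandering while intoxicated: 246 (0.7)
・Playing on the road: 108 (0.3)
・Working on the road: 255 (0.7)
・Darting out: 1,361 (3.7)
・Other violation: 969 (2.6)
・No violation: 28,698 (78.4)

In Japan there are about 35,000 pedestrian accidents per year, of which about 7,000 involve a pedestrian violation.
Definition of "accident" here:
・Accidents caused by a pedestrian violation
・Others getting caught up in an accident because of a pedestrian violation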
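The share column of the first table can be reproduced from the counts alone. The sketch below is a small arithmetic check, not part of the talk; the English labels are translations of the slide's categories, and the counts sum to 36,801.

```python
# Annual pedestrian-accident counts from the slide's first table.
# Keys are translated labels (an assumption of this sketch, not official names).
counts = {
    "crossing: at crosswalk": 12402,
    "crossing: near crosswalk": 935,
    "crossing: near pedestrian overpass": 47,
    "crossing: other": 7379,
    "not crossing: playing on road": 173,
    "not crossing: working on road": 619,
    "not crossing: standing still": 789,
    "not crossing: lying on road": 214,
    "not crossing: walking facing traffic": 2702,
    "not crossing: walking with back to traffic": 3779,
    "not crossing: other": 7762,
}


def shares(table):
    """Return each category's share of the total in percent, rounded to 0.1."""
    total = sum(table.values())
    return {name: round(100 * n / total, 1) for name, n in table.items()}


if __name__ == "__main__":
    for name, pct in shares(counts).items():
        print(f"{name}: {pct}%")
```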

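1. Motivation & background - traffic safety from a city-planning and pedestrian-centric viewpoint
To reduce accidents caused by pedestrian violations, thinking from the viewpoint of the city and the pedestrian is essential.
Long-term questions:
・As autonomous driving arrives, what traffic rules and infrastructure design should it be matched with?
・As a city, through what kind of interface can we change pedestrian behavior and prevent accidents from recurring?
What is needed: data + approaches + ideas. They do not exist yet → let's build them.
・Baselines, models
・A competition
・Pedestrian-centric traffic video: we created the Woven Traffic Safety Dataset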

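1. Motivation & background - what the dataset looks like
Video content
Pedestrian-centric descriptions covering [location, behavior, attention, context]
↓
Analysis, prediction, retrieval, etc. of the factors that cause accidents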

2. WTS + AI City Challenge @ CVPR24

2. WTS + AI City Challenge @ CVPR24
https://www.aicitychallenge.org/
The WTS dataset is used as the competition dataset for AI City Challenge Track 2.

Challenge Track 2: Traffic Safety Description and Analysis
This task revolves around the long fine-grained video captioning of traffic safety scenarios, especially those involving pedestrian accidents. Leveraging multiple cameras and viewpoints, participants will be challenged to describe the continuous moment before the incidents, as well as the normal scene, captioning all pertinent details regarding the surrounding context, attention, location, and behavior of the pedestrian and vehicle. This task provides a new dataset WTS, featuring staged accidents with stunt drivers and pedestrians in a controlled environment, and offers a unique opportunity for detailed analysis in traffic safety scenarios. The analysis result could be valuable for wide usage across industry and society, e.g., it could lead to the streamlining of the inspection process in insurance cases and contribute to the prevention of pedestrian accidents. More features of the dataset can be referred to the dataset homepage (https://woven-visionai.github.io/wts-dataset-homepage/). The top teams of this task are planned to be invited and offered the opportunity to deploy and test their solutions in Woven City after 2025 Summer.
https://www.aicitychallenge.org/2024-challenge-tracks/

Organization: NVIDIA, Woven by Toyota, Johns Hopkins University, IIT Kanpur, Australian National University, Santa Clara University, University at Albany-SUNY

2. WTS + AI City Challenge @ CVPR24
❏ Requests from 200+ teams
❏ More than 400 accesses
❏ Companies and universities that sent requests: Beijing University of Posts and Telecommunications, New York University, The Hong Kong University of Science and Technology, IIT Kanpur, National Yang Ming Chiao Tung University, Southeast University, KIT TECO, DiDi Technology, GMOz, Korea University, NEC, University of British Columbia, etc.
As of 2024.02.07, 159 teams had applied to take part in the competition.

3. Overview and features of the WTS dataset

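3. Overview and features of the WTS dataset: overview
A large-scale video dataset containing pedestrian-centric traffic videos (accident / normal) together with their descriptions.
It provides multi-view videos captured with city (infrastructure) cameras and vehicle cameras working together.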

3. Overview and features of the WTS dataset: features
① Large Scale & Diversity
② Behaviour Phases Segmentation
③ Long Detail Traffic Description
④ Multi-views / 3D Gaze and environment

3. Overview and features of the WTS dataset: Large Scale & Diversity
Largest dataset in the traffic domain with instance-level information in its video descriptions.
Examples of ISO accident patterns

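3. Overview and features of the WTS dataset: Behaviour Phases
• The time before an accident occurs is divided into 5 phases for analysis.
• For each phase, the pedestrian's behavior and location are put into words so that the factors leading to the accident can be analyzed / predicted.
The time leading up to the accident is segmented into phases on the basis of behavior.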

3. Overview and features of the WTS dataset: Long Detailed Description
[Pedestrian Caption][Action phase]
The pedestrian, a male in his 20s, stood perpendicular to the vehicle and to the left. He was positioned diagonally to the right, in front of the vehicle, at a close distance. Slowly looking around, the pedestrian's line of sight was fixed on the vehicle. He appeared to notice the vehicle and was aware of its presence. In front of him, he planned to continue going straight ahead, despite traveling in a car lane. His speed was slow, matching his cautious actions. As for the environment, the weather was cloudy, and the brightness of the surroundings was dim. The road surface conditions were dry on the level asphalt road, which was classified as a residential road with two-way traffic. Notably, there were no sidewalks or roadside strips on both sides of the road, but there were street lights illuminating the area.
[location][attention][behaviour][context attributes]

3. Overview and features of the WTS dataset: Multi-views & 3D Gaze
[Figure panels: 3D scanned environment; 3D gaze data (left: measured, right: GT); projected 3D location; multi-views in an infrastructure-vehicle cooperative environment]
• The 3D space and 3D gaze are synced for further free-angle analysis in a 3D digital environment
• Multiple views, from infrastructure cameras to vehicle cameras

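3. Overview and features of the WTS dataset: sample video
Vehicle view / Pedestrian view / Surveillance view
ISO34502-37: collision with a pedestrian who has started to cross at a crosswalk while the vehicle is turning left at a signalized intersection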

4. Dataset creation

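4. Dataset creation - annotation flow: challenges
Challenges:
・Writing traffic-safety descriptions requires a high level of expertise.
・Writing descriptions is itself prone to bias and time-consuming.
Approach:
・Decompose each description into structured, element-level items.
・Annotators without expert knowledge only need to watch the video and check the items.
・The checked items are turned into text by an LLM such as GPT.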

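4. Dataset creation - annotation flow: phase segmentation
More than 180 check items covering [environment, location, behavior, attention], etc. were structured.
・pre-recognition: the period before the person starts to become aware of the surroundings (crosswalk, traffic lights, vehicles, etc.).
・recognition: from the start of awareness of the surroundings (crosswalk, traffic lights, vehicles, etc.) until judgement.
・judgement: in principle, from the completion of environment recognition until the start of action.
・action: from the start of movement of any body part (except eyes and ears) until the outcome (e.g., a collision) occurs.
・avoidance: from the moment avoidance becomes possible until avoidance happens or fails.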
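One way to hold the five phases as per-video time segments is sketched below. The field names (`start_s`, `end_s`) and the rule that phases appear in the listed order, each at most once, are assumptions of this sketch and not the dataset's actual schema.

```python
from dataclasses import dataclass

# Phase names as listed on the slide, in temporal order.
PHASES = ("pre-recognition", "recognition", "judgement", "action", "avoidance")


@dataclass
class PhaseSegment:
    phase: str      # one of PHASES
    start_s: float  # segment start in seconds (hypothetical field name)
    end_s: float    # segment end in seconds (hypothetical field name)


def validate(segments):
    """Check that phases are known, in order, and do not overlap in time."""
    last_rank, last_end = -1, float("-inf")
    for seg in segments:
        rank = PHASES.index(seg.phase)  # ValueError for an unknown phase
        if rank <= last_rank:
            raise ValueError(f"phase out of order: {seg.phase}")
        if seg.start_s < last_end or seg.end_s <= seg.start_s:
            raise ValueError(f"bad time span for {seg.phase}")
        last_rank, last_end = rank, seg.end_s
    return True


def phase_at(segments, t):
    """Return the phase that covers time t, or None if no segment does."""
    for seg in segments:
        if seg.start_s <= t < seg.end_s:
            return seg.phase
    return None


if __name__ == "__main__":
    example = [
        PhaseSegment("pre-recognition", 0.0, 2.0),
        PhaseSegment("recognition", 2.0, 3.5),
        PhaseSegment("judgement", 3.5, 4.0),
        PhaseSegment("action", 4.0, 6.0),
    ]
    validate(example)
    print(phase_at(example, 2.5))
```

Per-phase segments like these are what make it possible to attach one caption to each phase rather than to the whole video.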

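4. Dataset creation - annotation flow: item checking
・Targets: pedestrian, vehicle, environment
・Super-categories: location, behavior, attention
・Check items: orientation, distance, direction of movement, etc.
Sample of checked content:
'Victim body orientation': 'opposite direction to the offending vehicle',
'Victim position': 'directly in front of the offending vehicle',
'Relative distance to offending vehicle': '0 m',
'Victim line of sight': 'offending vehicle',
'Victim visual status': 'gazing closely',
'Victim direction of travel': 'forward',
'Awareness of offending vehicle': 'noticed the offending vehicle, but',
'Victim action (general)': 'going straight',
'Victim action (special)': 'darting out',
...
More than 180 check items covering [environment, location, behavior, attention], etc. were structured.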

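4. Dataset creation - annotation flow: item checking
More than 180 check items covering [environment, location, behavior, attention], etc. were structured.
To remove bias when judging a person's orientation, direction judgements are normalized with the vehicle's position as the reference.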
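The slide does not spell out how the vehicle-centred normalization works. One plausible reading is that positions are expressed in the vehicle's own frame and then bucketed into coarse sectors, so every annotator uses the same reference. The sketch below assumes 2D ground-plane coordinates, a heading angle measured counter-clockwise from the +x axis, and eight sectors; all of these are illustrative choices.

```python
import math

# Sector names, counter-clockwise starting from straight ahead of the vehicle.
SECTORS = ("front", "front-left", "left", "rear-left",
           "rear", "rear-right", "right", "front-right")


def relative_sector(vehicle_xy, vehicle_heading_rad, pedestrian_xy):
    """Classify where the pedestrian is, as seen from the vehicle.

    vehicle_heading_rad is the direction the vehicle faces, measured
    counter-clockwise from the +x axis (an assumption of this sketch).
    """
    dx = pedestrian_xy[0] - vehicle_xy[0]
    dy = pedestrian_xy[1] - vehicle_xy[1]
    # Bearing of the pedestrian relative to the vehicle's heading, in degrees.
    rel_deg = math.degrees(math.atan2(dy, dx) - vehicle_heading_rad)
    # Each sector spans 45 degrees and is centred on a multiple of 45.
    index = math.floor((rel_deg + 22.5) / 45.0) % 8
    return SECTORS[index]


if __name__ == "__main__":
    # Vehicle faces +y; a pedestrian located toward +x is on its right.
    print(relative_sector((0.0, 0.0), math.pi / 2, (5.0, 0.0)))
```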

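4. Dataset creation - annotation flow: generating the descriptions
Sample of checked content:
'Victim body orientation': 'opposite direction to the offending vehicle',
'Victim position': 'directly in front of the offending vehicle',
'Relative distance to offending vehicle': '0 m',
'Victim line of sight': 'offending vehicle',
'Victim visual status': 'gazing closely',
'Victim direction of travel': 'forward',
'Awareness of offending vehicle': 'noticed the offending vehicle, but',
'Victim action (general)': 'going straight',
'Victim action (special)': 'darting out',
...
↓ Caption Generation LLM
Pedestrian: Although the pedestrian had noticed the vehicle, he kept going straight at high speed and darted out.
Vehicle: While the pedestrian darted out at high speed, the car began turning left at a low speed of 5 km/h.
More than 180 check items covering [environment, location, behavior, attention], etc. were structured.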
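The step from checked items to a caption can be pictured as assembling a prompt for the caption-generation LLM. The sketch below only builds the prompt text. The wording of the instruction, the function name, and the English item names are hypothetical, and no specific LLM API is assumed.

```python
def build_caption_prompt(subject, checks):
    """Turn checked items into a plain-text prompt for a caption LLM.

    subject: e.g. "pedestrian" or "vehicle".
    checks: mapping from check item to the value the annotator selected.
    The instruction text is illustrative, not the prompt used for WTS.
    """
    lines = [f"- {item.strip()}: {value.strip()}" for item, value in checks.items()]
    header = (
        f"Write a fluent description of the {subject} in this traffic scene.\n"
        "Use only the facts listed below and do not add details.\n"
    )
    return header + "\n".join(lines)


if __name__ == "__main__":
    # Translated excerpt of the slide's sample; trailing spaces mimic the
    # raw annotation export and are stripped by the function.
    sample = {
        "Victim position": "directly in front of the offending vehicle",
        "Victim line of sight ": "offending vehicle",
        "Victim action (special)": "darting out ",
    }
    print(build_caption_prompt("pedestrian", sample))
```

Keeping the prompt a pure function of the checked items means the same annotation always yields the same input to the LLM, which makes regenerated captions easier to compare.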

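4. Dataset creation - annotation flow: creating region information
[Figure: sample of automatically generated mask information]
Annotating the regions of the pedestrians and vehicles involved in an accident is costly:
→ automatic generation and tracking of the relevant pedestrian and vehicle regions, based on visual prompts.
Because the pedestrians and vehicles involved in an accident carry a lot of information, their regions are provided as meta information.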

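4. Dataset creation - annotation flow: creating 3D gaze information
・Frames are sampled from the first-person video at 5 fps. Using SfM, each frame is localized in world coordinates against a pre-built 3D map for localization, which gives the 3D camera pose of the ego-view.
・The 3D map was created with a Matterport camera equipped with LiDAR scanning.
・Using the fixed cameras' poses prepared in advance, the first-person camera pose, and the 2D gaze point from the Tobii glasses, the first-person 2D gaze direction is transformed into the third-person view of each fixed camera to obtain the 3D gaze.
Device: Tobii Pro Glasses 3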
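The geometry of moving a 2D gaze point from the ego-view into a fixed camera's view can be sketched with a plain pinhole model. This is a minimal illustration, not the pipeline used for WTS. It assumes known intrinsics `(fx, fy, cx, cy)`, camera-to-world rotation matrices, camera centres in world coordinates, and a gaze depth along the ray (which in practice might come from intersecting the ray with the 3D map).

```python
import math


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]


def gaze_ray_world(gaze_px, intrinsics, rot_c2w):
    """Unit gaze direction in world coordinates from a 2D gaze pixel."""
    fx, fy, cx, cy = intrinsics
    u, v = gaze_px
    d = [(u - cx) / fx, (v - cy) / fy, 1.0]  # back-projected ray, camera frame
    norm = math.sqrt(sum(x * x for x in d))
    return mat_vec(rot_c2w, [x / norm for x in d])


def project_to_camera(point_w, intrinsics, rot_c2w, center_w):
    """Project a world point into a camera; None if it is behind the camera."""
    fx, fy, cx, cy = intrinsics
    offset = [point_w[i] - center_w[i] for i in range(3)]
    p = mat_vec(transpose(rot_c2w), offset)  # world frame -> camera frame
    if p[2] <= 0:
        return None
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


if __name__ == "__main__":
    K = (100.0, 100.0, 50.0, 50.0)  # hypothetical intrinsics
    I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    ego_center = [0.0, 0.0, 0.0]
    ray = gaze_ray_world((50.0, 50.0), K, I3)
    depth = 2.0  # assumed distance to the gazed point along the ray
    target = [ego_center[i] + depth * ray[i] for i in range(3)]
    print(project_to_camera(target, K, I3, [1.0, 0.0, 0.0]))
```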

5. Tasks & baselines

5. Tasks & baselines
Video captioning model
The pedestrian, a male in his 20s, stood perpendicular to the vehicle and to the left. He was positioned diagonally to the right, in front of the vehicle, at a close distance. Slowly looking around, the pedestrian's line of sight was fixed on the vehicle. He appeared to notice the vehicle and was aware of its presence. In front of him, he planned to continue going straight ahead, despite traveling in a car lane. His speed was slow, matching his cautious actions. As for the environment, the weather was cloudy, and the brightness of the surroundings was dim. The road surface conditions were dry on the level asphalt road, which was classified as a residential road with two-way traffic. Notably, there were no sidewalks or roadside strips on both sides of the road, but there were street lights illuminating the area.
Task: as a first step, we examine video captioning, which takes a traffic-safety video as input and generates a description.
Evaluation: the generated text is compared with the ground truth and scored on how well the content matches.
・We are interested in how far models can understand text that requires domain expertise.
・We are interested in how well models can generate long and detailed descriptions.

5. Tasks & baselines - Baseline 1
Baseline 1: Video-LLaMA
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Adaptation to WTS:
- 3 kinds of prompts were prepared
- The Audio Q-Former is not used
- Uniform sampling of 8 frames
- The LLM part is LLaMA-2-7B
- Video Q-Former = temporal version of the BLIP2 Q-Former
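Uniform frame sampling, as used here with 8 frames (and with 24 frames for Video-ChatGPT), can be done in several ways. The sketch below picks the centre frame of each of `k` equal-length segments, which is one common convention and not necessarily the exact scheme used by either baseline.

```python
def uniform_sample_indices(num_frames, k):
    """Pick k frame indices spread evenly over a clip of num_frames frames.

    The clip is split into k equal segments and the centre of each segment
    is taken. If the clip has fewer than k frames, indices repeat.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


if __name__ == "__main__":
    print(uniform_sample_indices(80, 8))
```

Sampling segment centres rather than segment starts avoids always picking the very first frame, which often carries little motion information.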

5. Tasks & baselines - Baseline 2
Baseline 2: Video-ChatGPT
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Adaptation to WTS:
- 3 kinds of prompts were prepared
- Uniform sampling of 24 frames
- LLM = Vicuna-1.1-7B
- Frame feature = CLIP ViT encoder
- No Q-Former cross-attention structure is used; instead, the video feature is obtained by pooling the frame features
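The slide only says that frame features are pooled. In the Video-ChatGPT paper this is done by average-pooling per-frame patch features over time and over space and concatenating the two results. The sketch below illustrates that idea on nested Python lists; the function names are hypothetical and a real implementation would use tensors.

```python
def mean_rows(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def spatiotemporal_pool(frame_feats):
    """Pool per-frame patch features into a compact video representation.

    frame_feats[t][n] is the D-dimensional feature of patch n in frame t
    (T frames, N patches per frame). Returns N temporal tokens (mean over
    frames for each patch) followed by T spatial tokens (mean over patches
    for each frame), i.e. N + T vectors of size D.
    """
    num_frames = len(frame_feats)
    num_patches = len(frame_feats[0])
    temporal = [
        mean_rows([frame_feats[t][n] for t in range(num_frames)])
        for n in range(num_patches)
    ]
    spatial = [mean_rows(frame_feats[t]) for t in range(num_frames)]
    return temporal + spatial


if __name__ == "__main__":
    # Toy input: T = 2 frames, N = 3 patches, D = 2 dimensions.
    feats = [
        [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
        [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
    ]
    for token in spatiotemporal_pool(feats):
        print(token)
```

With 24 frames the number of tokens passed to the LLM grows only by T, not by T times N, which is the main appeal of pooling over a learned query module.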

5. Tasks & baselines - Baseline 3
Baseline 3: customize the video branch of Video-LLaMA and fine-tune it
[Figure: architecture diagram. Labels: User Query + System prompt ("Describe the traffic scene in the following video from the pedestrian perspective …"); Visual encoder; Spatial extractor; Linear; Position embedding; Video features NxDxB; Spatial token; Video Q-Former (learnable query, self-attention, cross-attention, feed-forward); User Query + System prompt (suffix); LLM (Vicuna-7B); Output caption: "A woman is seeing walking direction along the sidewalk and start crossing the crossroad while a silver car is going straight through the traffic lights …"]
・LLM = Vicuna-1.1-7B, Video encoder = ViT-G/14 with position encoding, Q-Former = Video Q-Former in Video-LLaMA
*Fine-tune = fine-tune the Video Q-Former part

6. Evaluation results

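6. Evaluation settings
・Dataset:
- WTS train (~2000 scenarios) and val (~800 scenarios) are used
- Only one view from the WTS multi-view data is used for train / val
- Because one video contains multiple phase segments, evaluation is done per phase
・Evaluation criteria:
① Text-similarity metrics such as BLEU-4, METEOR, ROUGE-L, and CIDEr
② An LLM-based method that evaluates how well the wording matches
Scores are computed separately on the multi-view data newly collected for WTS and on the BDD data, and the two results are averaged.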
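As an illustration of per-phase scoring followed by averaging over the two data sources, the sketch below implements an F1 variant of ROUGE-L with whitespace tokenization. It is a simplified stand-in for the metrics listed on the slide, not the challenge's official scorer, and the subset names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two captions (lower-cased, whitespace tokens)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean(values):
    return sum(values) / len(values)


def overall_score(pairs_by_subset):
    """Average per-phase scores within each subset, then across subsets.

    pairs_by_subset maps a subset name to a list of
    (generated caption, ground-truth caption) pairs, one pair per phase.
    """
    subset_means = [
        mean([rouge_l_f1(c, r) for c, r in pairs])
        for pairs in pairs_by_subset.values()
    ]
    return mean(subset_means)


if __name__ == "__main__":
    data = {
        "wts_multiview": [("the pedestrian walks slowly", "the pedestrian walks slowly")],
        "bdd": [("a car turns left", "a silver car turns left slowly")],
    }
    print(overall_score(data))
```

Averaging the two subset means, rather than pooling all phases, keeps the larger subset from dominating the final score.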

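6. Evaluation results
・Even LLM-based methods that are strong at zero-shot find it hard to generate specialized descriptions like those in WTS.
・Fine-tuning improves performance, but only to a limited extent.
・Giving instance-level prior knowledge as input improves performance, but only to a limited extent because it is not included in training.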

6. Evaluation results
An evaluation prompt protocol for the LLM is defined so that both a lexical score and a structural score are computed.

6. Evaluation results - examples of generated captions

Related resources: WTS dataset homepage, GitHub for data usage

Thank you
