SHARED TRANSFORMER ENCODER WITH MASK-BASED 3D MODEL ESTIMATION FOR CONTAINER MASS ESTIMATION

Tomoya Matsubara,* Seitaro Otsuki,* Yuiga Wada,* Haruka Matsuo, Takumi Komatsu, Yui Iioka, Komei Sugiura and Hideo Saito

Keio University, Japan
[email protected]

ABSTRACT

For human-safe robot control in human-to-robot handover, the physical properties of containers and fillings should be accurately estimated. In this paper, we propose a Transformer encoder that shares the same architecture and parameters for filling level and type estimation. We also propose a mask-based geometric algorithm that estimates 3D models of containers in order to estimate their capacity and dimensions. We further use these estimations to estimate container mass with a Convolutional Neural Network model. Experiments show that our Transformer model produced encouraging results in both estimations. While challenges remain in our mask-based algorithm and Convolutional Neural Network model, their results revealed several directions for improvement.

Index Terms— Transformer encoder, visual hull, Mask R-CNN, point cloud

1. INTRODUCTION

Accurate estimations of physical properties (e.g., mass and dimensions) are essential in human-to-robot handovers of objects [1, 2, 3]. As robot control often depends on these physical properties, inaccurate estimations can cause unexpected behavior and put users in danger [4]. Under limited prior data and knowledge, however, devising a robust method that makes accurate estimations even for unseen objects is difficult due to the variety of configurations [2, 5].

Existing methods use RGB or RGB-D images of pouring scenes to reconstruct containers' 3D models and estimate their dimensions [6, 7]. However, these methods are only applicable to opaque containers whose 3D models are known. Learning the Grasping Point [8] can be used without 3D models and for transparent containers, but it only estimates their localization in 3D, not their dimensions. As for fillings in containers, filling levels are estimated from RGB or RGB-D images [9, 10, 11], audio recordings [12, 13] of pouring scenes, or both [14]. Among these, a few methods estimate filling levels based on the results of filling type estimations [13], or estimate both filling levels and types within a single architecture [14]. However, architectures that combine and process both estimations for a wider variety of filling types have not yet been investigated.

In this paper, we propose three methods: the first for the simultaneous estimation of filling levels and types, the second for the estimation of container capacities and dimensions, and the third for the estimation of container mass from these estimations. The input data consist of audio-visual recordings of a person pouring a filling into a container or shaking an already-filled container.

2. FILLING LEVEL AND TYPE CLASSIFICATION

We formulate the filling level classification as a 3-class classification among empty, half full, and full, and the filling type classification as a 4-class classification among no content, pasta, rice, and water. To tackle the two estimations, we propose a model composed of a Convolutional Neural Network (CNN) encoder, a Transformer encoder, and two classification heads (see Figure 1). The CNN and Transformer encoders share the same architecture and parameters for both estimations, while a task-specific Multi-Layer Perceptron (MLP) head is used for each task after the Transformer encoder. The model takes as input audio signals pre-processed and transformed into log-scale mel-spectrograms, and outputs estimations of filling level and type. We consider that type estimation does not require the entire mel-spectrogram and assume that even 20% of the audio contains enough information.
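To make the shared-encoder design concrete, the following PyTorch-style sketch pairs a small CNN front end and a Transformer encoder, both shared by the two tasks, with a task-specific MLP head per task (3 filling-level classes, 4 filling-type classes). The layer sizes, channel counts, temporal mean pooling, and the use of PyTorch itself are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class SharedAudioEncoder(nn.Module):
    """CNN + Transformer encoder shared by both tasks, with task-specific MLP heads."""

    def __init__(self, in_ch=1, n_mels=64, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        # CNN encoder applied to the (C, T, N_mel) log-mel spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * n_mels, d_model)
        # Transformer encoder: same architecture and parameters for both tasks.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Task-specific heads: 3 filling levels, 4 filling types.
        self.level_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 3))
        self.type_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 4))

    def forward(self, spec):                     # spec: (B, C, T, N_mel)
        h = self.cnn(spec)                       # (B, 64, T, N_mel)
        h = h.permute(0, 2, 1, 3).flatten(2)     # (B, T, 64 * N_mel): one token per time frame
        h = self.encoder(self.proj(h))           # shared encoding used by both tasks
        pooled = h.mean(dim=1)                   # average over time
        return self.level_head(pooled), self.type_head(pooled)

# Example: a dummy batch of 2 single-channel spectrograms -> logits of shapes (2, 3) and (2, 4).
level_logits, type_logits = SharedAudioEncoder()(torch.zeros(2, 1, 100, 64))
```

Because both heads read the same pooled encoding, gradients from the filling-level and filling-type losses update the same CNN and Transformer parameters.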
Since the manipulation is unlikely to last from the very beginning of the recording until the very end, we discard the mel-spectrogram frames before time index start and after time index end, which are defined by:

• start: randomly chosen from 0 to 40% of T
• end: randomly chosen from 60 to 100% of T

where T is the time length of the entire audio.

We use the cross-entropy loss to evaluate the training and validation loss. To avoid overfitting, we save the model only when the following condition is satisfied:

L > max(0.15, L_l) + max(0.15, L_t)    (1)

where L is initialized to float('inf') of Python [15], and L_l and L_t are the validation losses of filling level and type estimation, respectively. After saving the model, L is updated by L = L_l + L_t. While Equation (1) is always true in the first epoch, in the following epochs it requires the model to make the sum of the two losses smaller than at least the previous sum. If the two losses become sufficiently small, the two 0.15 constants, which are then larger than L_l and L_t, stop the training.

2.1. CNN encoder

The obtained mel-spectrograms x_spec have multi-channel 2D shapes (C, T, N_mel), where C is the number of channels and N_mel is the number of mel filter banks. We reshape x_spec using a CNN encoder.
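As a concrete reading of the cropping rule and of Equation (1), the short NumPy sketch below samples start and end, crops the spectrogram along the time axis, and evaluates the saving criterion. The array shape and the loss values are dummy placeholders; the actual training and validation code of the paper is not reproduced here.

```python
import random
import numpy as np

def random_time_crop(spec):
    """Discard frames before `start` and after `end` of a (C, T, N_mel) spectrogram."""
    T = spec.shape[1]
    start = random.randint(0, int(0.4 * T))   # start: 0 to 40% of T
    end = random.randint(int(0.6 * T), T)     # end: 60% to 100% of T
    return spec[:, start:end, :]

def should_save(L_prev, L_l, L_t):
    """Equation (1): save only when the previous reference L exceeds the floored loss sum."""
    return L_prev > max(0.15, L_l) + max(0.15, L_t)

# Toy usage with a dummy spectrogram and dummy validation losses.
spec = np.zeros((2, 1000, 64))                # (C, T, N_mel); values are placeholders
cropped = random_time_crop(spec)

L = float("inf")                              # first epoch: the condition is always true
L_l, L_t = 0.42, 0.57                         # illustrative validation losses
if should_save(L, L_l, L_t):
    L = L_l + L_t                             # update the reference after saving the model
```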