2. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
3. Unifying Language Learning Paradigms
4. Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers
5. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning
6. CoCa: Contrastive Captioners are Image-Text Foundation Models
7. A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities
8. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
9. Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems
10. Building Machine Translation Systems for the Next Thousand Languages
(Original: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.)
GitHub - facebookresearch/metaseq: Repo for external large-scale work
Meta AI
1. OPT — short for Open Pre-trained Transformer Language Models.
(Original: OPT: Open Pre-trained Transformer Language Models)
By introducing one extra hyperparameter and adding one line of code, our Poly-1 formulation outperforms cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.
(Original: Cross-entropy loss and focal loss are the most common choices when training deep neural networks for classification problems. Generally speaking, however, a good loss function can take on much more flexible forms, and should be tailored for different tasks and datasets. Motivated by how functions can be approximated via Taylor expansion, we propose a simple framework, named PolyLoss, to view and design loss functions as a linear combination of polynomial functions. Our PolyLoss allows the importance of different polynomial bases to be easily adjusted depending on the targeting tasks and datasets, while naturally subsuming the aforementioned cross-entropy loss and focal loss as special cases. Extensive experimental results show that the optimal choice within the PolyLoss is indeed dependent on the task and dataset. Simply by introducing one extra hyperparameter and adding one line of code, our Poly-1 formulation outperforms the cross-entropy loss and focal loss on 2D image classification, instance segmentation, object detection, and 3D object detection tasks, sometimes by a large margin.)
Waymo LLC, Google LLC
Table 1: PolyLoss outperforms cross-entropy and focal loss on various models and tasks. Results are for the simplest Poly-1, which has only a single hyperparameter. On ImageNet (Deng et al., 2009), our PolyLoss improves both pre-training and fine-tuning for the recent EfficientNetV2 (Tan & Le, 2021); on COCO (Lin et al., 2014), PolyLoss improves both 2D detection and segmentation AR for Mask-RCNN (He et al., 2017); on Waymo Open Dataset (WOD) (Sun et al., 2020), PolyLoss improves 3D detection AP for the widely used PointPillars (Lang et al., 2019) and the very recent Range Sparse Net (RSN) (Sun et al., 2021). Details are in Tables 4, 5, 7.
2. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions
Focal Loss for Dense Object Detection (https://arxiv.org/pdf/1708.02002.pdf) proposes a novel loss, the Focal Loss, which adds a factor (1 − p_t)^γ to the standard cross-entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples (p_t > 0.5), putting more focus on hard, misclassified examples. As the authors' experiments demonstrate, focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.
For predictions that are already close to correct, further learning is suppressed, which makes it easier to keep pushing learning on misclassified examples. (https://ai-lab-boatrace.xyz/blog/672/)
However, while focal loss works well for many detection tasks, this paper shows it is not optimal for the imbalanced ImageNet-21K: focal loss was assumed to be strong on imbalanced datasets, but closer inspection shows that this does not hold for every dataset.
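For reference, a minimal PyTorch-style sketch of the focal loss described above (the function name and the softmax/multi-class simplification are mine; detectors usually use a sigmoid/binary variant):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0):
    """Multi-class focal loss: FL = -(1 - p_t)^gamma * log(p_t)."""
    log_pt = F.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    # (1 - p_t)^gamma down-weights well-classified examples (p_t > 0.5)
    return (-((1.0 - pt) ** gamma) * log_pt).mean()

loss = focal_loss(torch.randn(4, 10), torch.tensor([1, 2, 3, 4]))
```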
Taylor expansion is a well-studied topic (https://eman-physics.net/math/taylor.html). When the Taylor series converges and coincides with the original function f, f is said to be Taylor-expandable. (https://ja.wikipedia.org/wiki/%E3%83%86%E3%82%A4%E3%83%A9%E3%83%BC%E5%B1%95%E9%96%8B)
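This is the expansion that motivates PolyLoss: expanding the cross-entropy loss −log(P_t) around P_t = 1 gives a power series in (1 − P_t):

```latex
f(x) = \sum_{j=0}^{\infty} \frac{f^{(j)}(a)}{j!}\,(x-a)^j,
\qquad
-\log(P_t) = \sum_{j=1}^{\infty} \frac{1}{j}\,(1-P_t)^j
           = (1-P_t) + \tfrac{1}{2}(1-P_t)^2 + \tfrac{1}{3}(1-P_t)^3 + \cdots
```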
Figure caption (unified view of cross-entropy loss, focal loss, and PolyLoss): PolyLoss, L_Poly = Σ_{j=1}^∞ α_j (1 − P_t)^j, is a more general framework, where P_t stands for the prediction probability of the target class. Left: PolyLoss is more flexible: it can be steeper (deep red) than cross-entropy loss (black) or flatter (light red) than focal loss (green). Right: polynomial coefficients of different loss functions in the bases of (1 − P_t)^j, where j ∈ Z+. Black dashed lines show the trend of the polynomial coefficients. In the PolyLoss framework, focal loss can only shift the polynomial coefficients horizontally (green arrow), see Equation 2, whereas the proposed PolyLoss framework is more general and also allows vertical adjustment (red arrows) of the coefficient of each polynomial term.
This greatly widens the space of possible loss functions!
• loss functions that handle class imbalance better than Focal Loss can be discovered
• losses that are robust to label noise
• learning the loss function itself
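The "horizontal shift" mentioned in the caption comes from writing focal loss in the same polynomial basis; the γ exponent simply shifts every power of the cross-entropy expansion:

```latex
L_{\mathrm{FL}} = -(1-P_t)^{\gamma}\log(P_t)
               = \sum_{j=1}^{\infty} \frac{1}{j}\,(1-P_t)^{\,j+\gamma}
\quad\text{(a horizontal shift of the cross-entropy coefficients by } \gamma\text{)}
```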
Table caption (loss functions in the PolyLoss framework): dropping higher-order polynomial terms, proposed in prior works, truncates all higher-order (N + 1 → ∞) polynomial terms. We propose Poly-N loss, which perturbs the leading N polynomial coefficients. Poly-1 is the final loss formulation, which further simplifies Poly-N and only requires a simple grid search over one hyperparameter. The differences compared to cross-entropy loss are highlighted in red.
On the optimal polynomial coefficients — this is all it takes.
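Written out, Poly-N perturbs only the leading N coefficients of the cross-entropy expansion, and Poly-1 keeps just the first perturbation:

```latex
L_{\text{Poly-N}} = -\log(P_t) + \sum_{j=1}^{N} \epsilon_j (1-P_t)^{j},
\qquad
L_{\text{Poly-1}} = -\log(P_t) + \epsilon_1 (1-P_t)
```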
Figure caption (training ResNet-50 on ImageNet-1K): (a) Increasing the coefficient of the first polynomial term (ε1 > 0) consistently improves ResNet-50 prediction accuracy; the red dashed line shows the accuracy when using cross-entropy loss. Mean and stdev of three runs are plotted. (b) The first polynomial (1 − P_t) contributes more than half of the cross-entropy gradient in the last 65% of the training steps, which highlights the importance of tuning the first polynomial; the red dashed line shows the crossover.
Poly-1 Loss
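Poly-1 is literally cross-entropy plus one extra term, so the "one extra line of code" looks roughly like this sketch (PyTorch-style; the function name and the default ε1 = 2, which mirrors the EfficientNetV2 experiments below, are my choices, not official code):

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits, labels, epsilon_1=2.0):
    """Poly-1 loss: -log(p_t) + epsilon_1 * (1 - p_t)."""
    pt = F.softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    ce = F.cross_entropy(logits, labels, reduction="none")
    # the "one extra line": perturb the leading polynomial term of cross-entropy
    return (ce + epsilon_1 * (1.0 - pt)).mean()

loss = poly1_cross_entropy(torch.randn(4, 10), torch.tensor([1, 2, 3, 4]))
```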
We set ε1 = 2 for both.
Figure 5: PolyLoss improves EfficientNetV2-L by increasing the prediction confidence P_t.
Figure 4: PolyLoss improves the EfficientNetV2 family on the speed-accuracy Pareto curve. Validation accuracy of EfficientNetV2 models pretrained on ImageNet-21K is plotted. PolyLoss outperforms cross-entropy loss with about a 2x speed-up.
Poly-1 Loss - experimental results (image classification)
Mean and stdev of three runs are plotted. (The loss is replaced with PolyLoss.)
Poly-1 Loss - experimental results (segmentation, object detection)
Table 5: PolyLoss improves detection results on the COCO validation set. Bounding box and instance segmentation mask average precision (AP) and average recall (AR) are reported for a Mask R-CNN model with a ResNet-50 backbone. Mean and stdev of three runs are reported.
→ The optimal value of ε differs, so it is important to adjust the loss function to the dataset and task.
Poly-1 Loss - experimental results (3D object detection)
Table 6: PolyLoss vs. focal loss for 3D detection models. Differences are highlighted in red. The best Poly-1 for PointPillars is ε1 = −1, which is equivalent to dropping the first term. Therefore, for RSN, the first term is dropped and the new leading polynomial (1 − P_t)^(γ+2) is tuned.
Table 7: PolyLoss improves detection results on the Waymo Open Dataset validation set. Two detection models are evaluated: the single-stage PointPillars (Lang et al., 2019) and the two-stage SOTA RSN (Sun et al., 2021). Bird's eye view (BEV) and 3D detection average precision (AP) and average precision with heading (APH) at Level 1 (L1) and Level 2 (L2) difficulties are reported. The IoU threshold is set to 0.7 for vehicle detection and 0.5 for pedestrian detection.
RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection
PointPillars: Fast Encoders for Object Detection from Point Clouds
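For detectors trained with focal loss, Poly-1 perturbs the leading focal-loss polynomial instead. A minimal sketch of that variant (softmax simplification and names are mine; real detection heads typically use sigmoid focal loss):

```python
import torch
import torch.nn.functional as F

def poly1_focal_loss(logits, labels, epsilon_1=-1.0, gamma=2.0):
    """Poly-1 on top of focal loss: FL + epsilon_1 * (1 - p_t)^(gamma + 1)."""
    log_pt = F.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    focal = -((1.0 - pt) ** gamma) * log_pt
    # epsilon_1 = -1 cancels the leading (1 - p_t)^(gamma + 1) term of focal loss,
    # which is the setting reported as best for PointPillars above.
    return (focal + epsilon_1 * (1.0 - pt) ** (gamma + 1.0)).mean()

loss = poly1_focal_loss(torch.randn(4, 10), torch.tensor([1, 2, 3, 4]))
```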
By scaling the model up to 20B parameters, it achieves SOTA performance on 50 well-established supervised NLP tasks spanning language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding, and information retrieval. In in-context learning it also outperforms 175B GPT-3 on zero-shot SuperGLUE and triples the performance of T5-XXL on one-shot summarization. Flax-based T5X checkpoints for the 20B model are released at https://github.com/google-research/google-research/tree/master/ul2.
(Original: Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. We release Flax-based T5X model checkpoints for the 20B model at \url{https://github.com/google-research/google-research/tree/master/ul2}.)
Google Research
3. Unifying Language Learning Paradigms
Figure caption (the span-corruption objective used in the T5 baseline model): In this example, we process the sentence "Thank you for inviting me to your party last week." The words "for", "inviting" and "last" (marked with an ×) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example. Since "for" and "inviting" occur consecutively, they are replaced by a single sentinel <X>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token <Z>.
Denoiser
• A variant of masked language modeling.
• Used in T5.
  ◦ Original paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (https://arxiv.org/abs/1910.10683)
  ◦ Explanation (Japanese video): 【深層学習】T5 - 入出力をテキストにする Transformer の新利用法【ディープラーニングの世界vol.37】 (https://youtu.be/-x08lNz3Qfo?t=934)
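A toy illustration of the span-corruption denoiser described in the caption (my own minimal sketch, not the T5/UL2 implementation; sentinel names and the word-level tokenization are simplifications):

```python
def span_corrupt(tokens, corrupt_positions):
    """Replace consecutive corrupted tokens with sentinel tokens <X>, <Y>, ...;
    the target lists the dropped spans followed by a final sentinel."""
    sentinels = iter(["<X>", "<Y>", "<Z>", "<W>"])
    inputs, targets, i = [], [], 0
    while i < len(tokens):
        if i in corrupt_positions:
            sentinel = next(sentinels)
            inputs.append(sentinel)
            targets.append(sentinel)
            # absorb the whole consecutive corrupted span under one sentinel
            while i < len(tokens) and i in corrupt_positions:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(next(sentinels))  # final sentinel closes the target sequence
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, corrupt_positions={2, 3, 8})  # "for", "inviting", "last"
print(inp)  # Thank you <X> me to your party <Y> week .
print(tgt)  # <X> for inviting <Y> last <Z>
```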
Table notes: […] denotes a leaderboard submission. (♯) denotes the best published result we could find on the leaderboard. (e) denotes that the SOTA used an ensembled approach. Because we evaluate finetuning and in-context trade-offs for SuperGLUE, SuperGLUE scores have their own dedicated section below.
Evaluation
Table caption: We compare with GPT-3, GLaM and PaLM (Chowdhery et al., 2022). We also include models that are relatively compute-matched with UL20B, such as T5-XXL with LM adaptation (Lester et al., 2021), GPT-3 13B and GLaM-8B dense. Notably, UL20B outperforms GPT-3 175B and all other models in a similar compute class on average score.
Table 9: Results on the SuperGLUE dev set. We compare with T5-11B (Raffel et al., 2019), ST-MoE-32B (Zoph et al., 2022) and PaLM-8B, PaLM-62B and PaLM-540B (Chowdhery et al., 2022). Scores reported are the peak validation scores per task.
Evaluation
Text Summarization with Pretrained Encoders (https://arxiv.org/abs/1908.08345)
Table 4: ROUGE F1 results on the XSum test set. Results for comparison systems are taken from the authors' respective papers or obtained on our data by running publicly released software.
Models could generally perform well at only one or the other. However, we found an additional distributional property (a skewed, Zipfian distribution over classes) that allows the two capabilities to co-exist in the same model. It is also notable that training data that elicited few-shot learning in transformers could not elicit few-shot learning in recurrent models. In short, few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.
(Original: Large transformer-based language models are able to perform few-shot learning (also known as in-context learning), without having been explicitly trained for it. We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, as these characteristics might lead to a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We also hypothesized that these distributional properties could lead to emergent few-shot learning in domains outside of language. Inspired by this idea, we ran a series of experiments on a standard image-based few-shot dataset. We discovered that a number of data properties did indeed promote the emergence of few-shot learning in transformer models. All of these properties are present in natural language -- burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data influenced whether models were biased towards either few-shot learning vs. memorizing information in their weights; models could generally perform well at only one or the other. However, we discovered that an additional distributional property could allow the two capabilities to co-exist in the same model -- a skewed, Zipfian distribution over classes -- which occurs in language as well. Notably, training data that could elicit few-shot learning in transformers were unable to elicit few-shot learning in recurrent models. In sum, we find that few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own.)
DeepMind
4. Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers
Figure caption: We compare architectures while holding fixed the number of layers, hidden layer size, and number of parameters. Only a transformer is able to attain few-shot learning; the vanilla RNN and LSTM never perform above chance. One run was performed for each set of hyperparameters in a hyperparameter sweep.
A phenomenon specific to Transformers
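To make the "skewed, Zipfian distribution over classes" concrete, here is a small illustrative sketch (my own, not the paper's data pipeline; the class count and exponent are arbitrary) of sampling training classes with power-law frequencies:

```python
import numpy as np

def zipfian_class_sampler(num_classes=1000, exponent=1.0, size=10, rng=None):
    """Sample class indices with probability proportional to 1 / rank^exponent:
    a few 'common' classes dominate while most classes sit in the long tail."""
    if rng is None:
        rng = np.random.default_rng(0)
    ranks = np.arange(1, num_classes + 1)
    probs = 1.0 / ranks**exponent
    probs /= probs.sum()
    return rng.choice(num_classes, size=size, p=probs)

print(zipfian_class_sampler())  # mostly low-rank (frequent) classes, occasional rare ones
```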
5. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning (https://arxiv.org/abs/2205.01491v1)
Deep learning has achieved decent performance in computer vision, which requires a large volume of images; however, collecting images is expensive and difficult in many scenarios. To alleviate this issue, many image augmentation algorithms have been proposed as effective and efficient strategies. Understanding current algorithms is essential for finding suitable methods for a given task or developing novel techniques. This paper performs a comprehensive survey of image augmentation for deep learning with a novel, informative taxonomy. To convey the basic idea of why image augmentation is needed, it introduces the challenges in computer vision tasks and the vicinity distribution. The algorithms are then split into three categories: model-free, model-based, and optimizing policy-based. The model-free category employs image processing methods, while the model-based one leverages trainable image generation models; in contrast, the optimizing policy-based approach aims to find optimal operations or their combinations. The survey further discusses the current trend of common applications along with two more active topics: leveraging different ways of understanding image augmentation, such as group and kernel theory, and deploying image augmentation for unsupervised learning. Based on this analysis, the authors believe the survey gives a better understanding, helpful for choosing suitable methods or designing novel algorithms for practical applications.
(Original: Deep learning has been achieving decent performance in computer vision requiring a large volume of images, however, collecting images is expensive and difficult in many scenarios. To alleviate this issue, many image augmentation algorithms have been proposed as effective and efficient strategies. Understanding current algorithms is essential to find suitable methods or develop novel techniques for given tasks. In this paper, we perform a comprehensive survey on image augmentation for deep learning with a novel informative taxonomy. To get the basic idea why we need image augmentation, we introduce the challenges in computer vision tasks and vicinity distribution. Then, the algorithms are split into three categories; model-free, model-based, and optimizing policy-based. The model-free category employs image processing methods while the model-based method leverages trainable image generation models. In contrast, the optimizing policy-based approach aims to find the optimal operations or their combinations. Furthermore, we discuss the current trend of common applications with two more active topics, leveraging different ways to understand image augmentation, such as group and kernel theory, and deploying image augmentation for unsupervised learning. Based on the analysis, we believe that our survey gives a better understanding helpful to choose suitable methods or design novel algorithms for practical applications.)
In addition to the captioning loss, a contrastive loss between unimodal image and text embeddings is applied. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), cross-modal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and state-of-the-art 91.0% top-1 accuracy with a finetuned encoder.
(Original: Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.)
Google Research
6. CoCa: Contrastive Captioners are Image-Text Foundation Models
Figure caption (overview of CoCa as an image-text foundation model): the pretrained CoCa can be used for downstream tasks including visual recognition, vision-language alignment, image captioning and multimodal understanding with zero-shot transfer, frozen-feature evaluation or end-to-end finetuning.
Figure 2: Detailed illustration of the CoCa architecture and training objectives. λ_Con and λ_Cap are loss-weighting hyperparameters.
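A minimal sketch of how the two objectives combine (my own simplification, not the official implementation; the names, temperature and default loss weights are illustrative):

```python
import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_targets,
              lambda_con=1.0, lambda_cap=2.0, temperature=0.07):
    """Contrastive loss on unimodal image/text embeddings plus an
    autoregressive captioning loss on the multimodal decoder outputs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric InfoNCE: image-to-text and text-to-image
    con = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    # captioning loss: next-token cross-entropy over the vocabulary
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return lambda_con * con + lambda_cap * cap

loss = coca_loss(torch.randn(4, 512), torch.randn(4, 512),
                 torch.randn(4, 16, 32000), torch.randint(0, 32000, (4, 16)))
```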
Figure caption (the proposed architecture for learning multi-scale features with cross-attention, CrossViT): the architecture consists of a stack of K multi-scale transformer encoders. Each multi-scale transformer encoder uses two branches to process image tokens of different patch sizes (Ps and Pl, Ps < Pl) and fuses the tokens at the end with an efficient module based on cross-attention of the CLS tokens. The design uses different numbers of regular transformer encoders in the two branches (i.e. N and M) to balance computational costs.
Cross-attention
Figure 3: Multi-scale fusion. (a) All-attention fusion, where all tokens are bundled together without considering any characteristics of the tokens. (b) Class token fusion, where only the CLS tokens are fused, since a CLS token can be considered the global representation of one branch. (c) Pairwise fusion, where tokens at corresponding spatial locations are fused together and the CLS tokens are fused separately. (d) Cross-attention, where the CLS token from one branch and the patch tokens from the other branch are fused together.
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification (https://arxiv.org/abs/2103.14899v2)
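A sketch of variant (d), CLS cross-attention fusion (a simplification under my own names; the actual CrossViT also projects between branch dimensions and alternates the roles of the two branches):

```python
import torch
import torch.nn as nn

class ClsCrossAttentionFusion(nn.Module):
    """The CLS token of one branch attends (as the only query)
    to the patch tokens of the other branch."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # cls_a: [B, 1, dim] CLS of branch A; tokens_b: [B, Nb, dim] patches of branch B
        fused_cls, _ = self.attn(query=cls_a, key=tokens_b, value=tokens_b)
        return cls_a + fused_cls  # residual connection back into branch A

fusion = ClsCrossAttentionFusion(dim=192)
out = fusion(torch.randn(2, 1, 192), torch.randn(2, 196, 192))
print(out.shape)  # torch.Size([2, 1, 192])
```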
Figure caption (the Perceiver, an architecture based on attentional principles): it scales to high-dimensional inputs such as images, videos, audio, point clouds, and multimodal combinations without making domain-specific assumptions. The Perceiver uses a cross-attention module to project a high-dimensional input byte array to a fixed-dimensional latent bottleneck (the number of input indices M is much larger than the number of latent indices N) before processing it with a deep stack of Transformer-style self-attention blocks in the latent space. The Perceiver iteratively attends to the input byte array by alternating cross-attention and latent self-attention blocks.
Figure 2: We train the Perceiver architecture on images from ImageNet (Deng et al., 2009) (left), video and audio from AudioSet (Gemmeke et al., 2017) (considered both multi- and uni-modally) (center), and 3D point clouds from ModelNet40 (Wu et al., 2015) (right). Essentially no architectural changes are required to use the model on a diverse range of input data.
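A minimal sketch of the latent-bottleneck idea (my own simplification, not the DeepMind implementation; a single cross-attention step followed by latent self-attention, with all sizes illustrative):

```python
import torch
import torch.nn as nn

class LatentBottleneck(nn.Module):
    """N learned latents cross-attend to M >> N input elements,
    then self-attention runs only in the cheap latent space."""
    def __init__(self, dim=256, num_latents=64, num_heads=8, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers=depth)

    def forward(self, inputs):                               # inputs: [B, M, dim] byte array
        latents = self.latents.expand(inputs.size(0), -1, -1)  # [B, N, dim]
        latents, _ = self.cross_attn(query=latents, key=inputs, value=inputs)
        return self.self_attn(latents)                       # [B, N, dim]

model = LatentBottleneck()
print(model(torch.randn(2, 4096, 256)).shape)  # torch.Size([2, 64, 256])
```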
7. A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities (https://arxiv.org/abs/2205.06743v1)
Few-shot learning (FSL) has attracted attention as an effective learning method and shows great potential. Despite recent creative work on FSL tasks, rapidly learning valid information from just a few or even zero samples remains a serious challenge. The authors therefore extensively investigated 200+ of the latest FSL papers published over the past three years, aiming to present a timely and comprehensive overview of the most recent advances in FSL along with impartial comparisons of the strengths and weaknesses of existing work. To avoid conceptual confusion, they first elaborate and compare a set of similar concepts, including few-shot learning, transfer learning, and meta-learning. They further propose a novel taxonomy that classifies existing work by the level of abstraction of knowledge, in accordance with the challenges of FSL. To enrich the survey, each subsection provides in-depth analysis and insightful discussion of recent advances on these topics. Moreover, taking computer vision as an example, they highlight important applications of FSL, covering various research hotspots. Finally, the survey concludes with unique insights into technology evolution trends and potential future research opportunities, in the hope of providing guidance for follow-up research.
(Original: Few-shot learning (FSL) has emerged as an effective learning method and shows great potential. Despite the recent creative works in tackling FSL tasks, learning valid information rapidly from just a few or even zero samples still remains a serious challenge. In this context, we extensively investigated 200+ latest papers on FSL published in the past three years, aiming to present a timely and comprehensive overview of the most recent advances in FSL along with impartial comparisons of the strengths and weaknesses of the existing works. For the sake of avoiding conceptual confusion, we first elaborate and compare a set of similar concepts including few-shot learning, transfer learning, and meta-learning. Furthermore, we propose a novel taxonomy to classify the existing work according to the level of abstraction of knowledge in accordance with the challenges of FSL. To enrich this survey, in each subsection we provide in-depth analysis and insightful discussion about recent advances on these topics. Moreover, taking computer vision as an example, we highlight the important application of FSL, covering various research hotspots. Finally, we conclude the survey with unique insights into the technology evolution trends together with potential future research opportunities in the hope of providing guidance to follow-up research.)
(Original: The development of the transformer-based text-to-image models are impeded by its slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.)
https://github.com/THUDM/CogView2
Tsinghua University, BAAI
8. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Figure caption: high-resolution images are produced via the direct super-resolution module. In each snapshot during the iterative super-resolution, the tokens of the same color are generated at the same time. All the local windows work in parallel.
Hierarchical generation
New, deeper forms of neurocompositional computing create AI systems that are more robust, accurate, and comprehensible.
(Original: What explains the dramatic progress from 20th-century to 21st-century AI, and how can the remaining limitations of current AI be overcome? The widely accepted narrative attributes this progress to massive increases in the quantity of computational and data resources available to support statistical learning in deep artificial neural networks. We show that an additional crucial factor is the development of a new type of computation. Neurocompositional computing adopts two principles that must be simultaneously respected to enable human-level cognition: the principles of Compositionality and Continuity. These have seemed irreconcilable until the recent mathematical discovery that compositionality can be realized not only through discrete methods of symbolic computing, but also through novel forms of continuous neural computing. The revolutionary recent progress in AI has resulted from the use of limited forms of neurocompositional computing. New, deeper forms of neurocompositional computing create AI systems that are more robust, accurate, and comprehensible.)
Johns Hopkins University, Microsoft Research, Northwestern University
9. Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems
10. Building Machine Translation Systems for the Next Thousand Languages (https://arxiv.org/abs/2205.03983v2)
This paper shares findings from an effort to build practical machine translation (MT) systems capable of translating across more than one thousand languages. Results are presented in three research domains: (i) building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages plus monolingual data for an additional 1000+ languages; and (iii) studying the limitations of evaluation metrics for these languages and qualitatively analyzing the outputs of the MT models. The authors hope the work provides useful insights to practitioners working on MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
(Original: In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.)
Google Research
Leveraging "high-resource languages" for the "long-tail language" translation task.