Upgrade to Pro — share decks privately, control downloads, hide ads and more …

言語だけじゃない!Qwen VLモデルの実力 The Power of Qwen VL:Bey...

Avatar for Sawa Sawa
May 15, 2025

言語だけじゃない!Qwen VLモデルの実力 The Power of Qwen VL:Beyond Language

画像認識や画像生成の能力を持つQwen VLモデルの実力について、日本語のベンチマークにより計測されたリーダーボードで確認した。
ありがたいことに、第三者機関により、中立にVLMのベンチマーク評価がされてます。 VLMのリーダーボードは、LLMのリーダーボードと比較すると、まだ少ないです。私は、個人的に日本語能力の高いモデルに興味があるので、日本語能力を測るベンチマークで計測したHeron VLM Leaderboardについて、具体的に紹介。Heron VLMリーダーボードは、VLモデルの評価を、2つのデータセットにより行なっている。Turing社のJapanese Heron BenchとLlava Benchの2つ。それぞれのベンチマークについて解説した上で、Qwen VLの評価結果について解説を行なった。結論として、Qwen VLモデルはOSSのVLモデルの中で、ダントツでトップの性能。
今回は、MeltingHackさんとAliEaterさんのコラボイベントのため、英語と日本語で発表。スライドは英語で作成。

Avatar for Sawa

Sawa

May 15, 2025
Tweet

Other Decks in Research

Transcript

  1. The Power of Qwen VL: Beyond Language 言語だけじゃない!Qwen VLモデルの実力 AliEaters

    (Alibaba Cloud Developers) Meetup #32 祝!MeltingHackさんとの初コラボ 2025/5/14 Chika 1
  2. Who am I? I previously worked on researching and using

    the API rerated AI which API from Alibaba Cloud’s AI Research Lab ‘Damo’. I was a frequent Alibaba Cloud user at the time. Although I no longer use it in my work, I’m so impressed by Qwen, like LLM and VLM. I also give occasional lightning talks on various topics for the Alibaba Cloud community called ”AliEater”. 3
  3. Do you know VLM? LLM stands for Large Language Model.

    VLM stands for Vision and Language Model. VLM means we can leverage both visual and language ability. Vision Language 4
  4. What is the nice thing of VLM? The key advantage

    of a VLM is that, in addition to the language capabilities of an LLM, it also brings in visual understanding. By using a VLM, AI can combine image and language recognition In this photo, there is a large building … VLM 5 AI can analyze objects in an image and describe them in text, also generate an image from a text prompt.
  5. Problem is ‘Which one is the best’? ⚫OpenVLM Leaderboard https://opencompass-open-vlm-leaderboard.hf.space/?ref=codesphere.ghost.io

    ⚫Heron VLM Leaderboard for Japanese benchmark. → Today’s Talk Japanease version https://wandb.ai/vision-language-leaderboard/heron-leaderboard/reports/Heron-VLM- powered-by-nejumi-WandB--Vmlldzo3ODM4ODYw English version *initial version https://wandb.ai/vision-language-leaderboard/heron-leaderboard/reports/Heron-VLM- Leaderboard-powered-by-Nejumi-WandB--Vmlldzo4MjY3OTc5 →issue!! The English version data is outdated, so I recommend checking the Japanese version for the latest information. 6
  6. Heron VLM Leaderboard The Heron VLM leaderboard evaluates vision-and-language using

    two datasets. ⚫Japanese Heron Bench (Turing Corp.) Link Link2 ⚫Llava Bench Japanese (in the wild) Link Link2 ※Appendix Link 9
  7. Japanese Heron Bench ⚫Images: 21 photographs related to Japan. ⚫Question

    types: Each image is paired with three categories of questions—Conversation, Detail, Complex. ⚫Dataset size: A total of 102 questions across all images. ⚫Subcategories: Anime, Art, Culture, Food, Scenery, Landmarks, Transportation. 10
  8. Japanese Heron Bench Example for Question types: Conversation, Detail, Complex

    11 Conversation:イラストの少女の名前はなんで すか?(What is the name of the girl in the illustration?) Detail:このイラストについて詳しく説明してください。 (Please explain this illustration in more detail.) Complex:この映像の中で明らかに人間ではない のはどれでしょうか?(Which of the objects in this photo are clearly not human?)
  9. Heron Bench output 12 some example questions and output from

    the Heron benchmark for each of the three categories. (Conversation, Detail, Complex)
  10. Llava Bench In-the-Wild Japanese 13 ⚫Images: 24 photographs. ⚫Question types:

    Each image is paired with three categories of questions—Conversation, Detail, Complex. ⚫Dataset size: A total of 60 questions across all images.
  11. Llava Bench In-the-Wild Japanese The LLaVA benchmark was NOT originally

    developed specifically for Japanese, but later a Japanese version of the dataset was released, allowing VLM Japanese abilities to be evaluated. Some images generated by AI that don’t exist in reality, as well as anime-style images, are also used. The images used are NOT particularly representative of Japan. 14
  12. Llava Bench In-the-Wild Japanese 15 some example questions and output

    from the Llava benchmark for each of the three categories. (Conversation, Detail, Complex)
  13. The Power of Qwen VL How powerful is the Qwen

    VL? * As of 6/May/2025 Alibaba’s Qwen2-VL- 72B-Instruct ranks 7th, and Qwen/QVQ-72B- Preview comes in at 11th.These rankings include both open-source and closed-source commercial models. Qwen is the only one open-source that made it into the top 10. 16
  14. LLaVA-Bench vs Helon-Bench a Qwen The graph plotting the LLaVA

    benchmark on the x-axis against the Heron benchmark on the y-axis. Qwen2-VL-72B- Instruct is highlighted by the red box, and it stands as the top open-source. LLaVA benchmark Heron benchmark 17
  15. Comparison of VLMs on the LLaVA benchmark (Sample) Comparing VLMs

    on the LLaVA benchmark. In this case, the task was to identify which fruit appears in an image. Qwen VL was rated more highly than another one from AllenAI. What fruit is this? 18
  16. Comparison of VLMs on the Heron benchmark (Sample) Comparing VMLs

    on the Heron benchmark. This one is more challenging than the previous fruit- identification task, yet Qwen VL still achieves a higher score than the other one. Please explain why this work is being evaluated, including the techniques. 19
  17. Comparison of the benchmarks and category performance a 20 Comparison

    of the benchmarks and category performance for Qwen2- VL-72B-Instruct versus QVQ-72B-Preview. The chart shows that Qwen2- VL outperforms the QVQ across every category, with the largest difference in Heron Conv and LLaVA Conv—both of which measure conversational capabilities.
  18. How can you use? The easiest way is to ‘Use

    Qwen Chat’ ! URL:https://chat.qwen.ai 21 Your Name
  19. Simple Demo I tried the exact same question as in

    the Llava benchmark: "What fruit is this?" Qwen answered it correctly. 23
  20. Most recent Qwen VL The most recent Qwen VL is

    the Qwen2.5-VL- 72B-Instruct. This one is available from Hugging Face. 24 https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct