言語だけじゃない！Qwen VLモデルの実力 The Power of Qwen VL:Beyond Language

The Power of Qwen VL: Beyond Language 言語だけじゃない！Qwen VLモデルの実力 AliEaters
(Alibaba Cloud Developers) Meetup #32 祝！MeltingHackさんとの初コラボ 2025/5/14 Chika 1

First collaboration event between AliEater＆MeltingHack !! Link:https://lu.ma/dhpur0zu Link:https://alibabacloud.connpass.com/ event/352519/ 2

Who am I? I previously worked on researching and using
the API rerated AI which API from Alibaba Cloud’s AI Research Lab ‘Damo’. I was a frequent Alibaba Cloud user at the time. Although I no longer use it in my work, I’m so impressed by Qwen, like LLM and VLM. I also give occasional lightning talks on various topics for the Alibaba Cloud community called ”AliEater”. 3

Do you know VLM? LLM stands for Large Language Model.
VLM stands for Vision and Language Model. VLM means we can leverage both visual and language ability. Vision Language 4

What is the nice thing of VLM? The key advantage
of a VLM is that, in addition to the language capabilities of an LLM, it also brings in visual understanding. By using a VLM, AI can combine image and language recognition In this photo, there is a large building … VLM 5 AI can analyze objects in an image and describe them in text, also generate an image from a text prompt.

Problem is ‘Which one is the best’? ⚫OpenVLM Leaderboard https://opencompass-open-vlm-leaderboard.hf.space/?ref=codesphere.ghost.io
⚫Heron VLM Leaderboard for Japanese benchmark. → Today’s Talk Japanease version https://wandb.ai/vision-language-leaderboard/heron-leaderboard/reports/Heron-VLM- powered-by-nejumi-WandB--Vmlldzo3ODM4ODYw English version *initial version https://wandb.ai/vision-language-leaderboard/heron-leaderboard/reports/Heron-VLM- Leaderboard-powered-by-Nejumi-WandB--Vmlldzo4MjY3OTc5 →issue!! The English version data is outdated, so I recommend checking the Japanese version for the latest information. 6

Heron VLM Leaderboard That’s an amazing VLM leaderboard! 7

Nejumi LLM Leaderboard There is also an amazing LLM leaderboard!
I love those pictures!!! 8

Heron VLM Leaderboard The Heron VLM leaderboard evaluates vision-and-language using
two datasets. ⚫Japanese Heron Bench (Turing Corp.) Link Link2 ⚫Llava Bench Japanese (in the wild) Link Link2 ※Appendix Link 9

Japanese Heron Bench ⚫Images: 21 photographs related to Japan. ⚫Question
types: Each image is paired with three categories of questions—Conversation, Detail, Complex. ⚫Dataset size: A total of 102 questions across all images. ⚫Subcategories: Anime, Art, Culture, Food, Scenery, Landmarks, Transportation. 10

Japanese Heron Bench Example for Question types: Conversation, Detail, Complex
11 Conversation:イラストの少女の名前はなんですか？(What is the name of the girl in the illustration?) Detail:このイラストについて詳しく説明してください。 (Please explain this illustration in more detail.) Complex:この映像の中で明らかに人間ではないのはどれでしょうか？(Which of the objects in this photo are clearly not human?)

Heron Bench output 12 some example questions and output from
the Heron benchmark for each of the three categories. (Conversation, Detail, Complex)

Llava Bench In-the-Wild Japanese 13 ⚫Images: 24 photographs. ⚫Question types:
Each image is paired with three categories of questions—Conversation, Detail, Complex. ⚫Dataset size: A total of 60 questions across all images.

Llava Bench In-the-Wild Japanese The LLaVA benchmark was NOT originally
developed specifically for Japanese, but later a Japanese version of the dataset was released, allowing VLM Japanese abilities to be evaluated. Some images generated by AI that don’t exist in reality, as well as anime-style images, are also used. The images used are NOT particularly representative of Japan. 14

Llava Bench In-the-Wild Japanese 15 some example questions and output
from the Llava benchmark for each of the three categories. (Conversation, Detail, Complex)

The Power of Qwen VL How powerful is the Qwen
VL? * As of 6/May/2025 Alibaba’s Qwen2-VL- 72B-Instruct ranks 7th, and Qwen/QVQ-72B- Preview comes in at 11th.These rankings include both open-source and closed-source commercial models. Qwen is the only one open-source that made it into the top 10. 16

LLaVA-Bench vs Helon-Bench a Qwen The graph plotting the LLaVA
benchmark on the x-axis against the Heron benchmark on the y-axis. Qwen2-VL-72B- Instruct is highlighted by the red box, and it stands as the top open-source. LLaVA benchmark Heron benchmark 17

Comparison of VLMs on the LLaVA benchmark （Sample） Comparing VLMs
on the LLaVA benchmark. In this case, the task was to identify which fruit appears in an image. Qwen VL was rated more highly than another one from AllenAI. What fruit is this? 18

Comparison of VLMs on the Heron benchmark （Sample） Comparing VMLs
on the Heron benchmark. This one is more challenging than the previous fruit- identification task, yet Qwen VL still achieves a higher score than the other one. Please explain why this work is being evaluated, including the techniques. 19

Comparison of the benchmarks and category performance a 20 Comparison
of the benchmarks and category performance for Qwen2- VL-72B-Instruct versus QVQ-72B-Preview. The chart shows that Qwen2- VL outperforms the QVQ across every category, with the largest difference in Heron Conv and LLaVA Conv—both of which measure conversational capabilities.

How can you use? The easiest way is to ‘Use
Qwen Chat’ ! URL：https://chat.qwen.ai 21 Your Name

How can we use? Select Qwen2.5-VL-32B from the model selection
menu. 22

Simple Demo I tried the exact same question as in
the Llava benchmark: "What fruit is this?" Qwen answered it correctly. 23

Most recent Qwen VL The most recent Qwen VL is
the Qwen2.5-VL- 72B-Instruct. This one is available from Hugging Face. 24 https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct

Thank you for listening!! ! 25

言語だけじゃない！Qwen VLモデルの実力 The Power of Qwen VL:Bey...

言語だけじゃない！Qwen VLモデルの実力 The Power of Qwen VL:Beyond Language

Sawa

Other Decks in Research

Featured

Transcript

The Power of Qwen VL: Beyond Language 言語だけじゃない！Qwen VLモデルの実力 AliEaters

First collaboration event between AliEater＆MeltingHack !! Link:https://lu.ma/dhpur0zu Link:https://alibabacloud.connpass.com/ event/352519/ 2

Who am I? I previously worked on researching and using

Do you know VLM? LLM stands for Large Language Model.

What is the nice thing of VLM? The key advantage

Problem is ‘Which one is the best’? ⚫OpenVLM Leaderboard https://opencompass-open-vlm-leaderboard.hf.space/?ref=codesphere.ghost.io

Heron VLM Leaderboard That’s an amazing VLM leaderboard! 7

Nejumi LLM Leaderboard There is also an amazing LLM leaderboard!

Heron VLM Leaderboard The Heron VLM leaderboard evaluates vision-and-language using

Japanese Heron Bench ⚫Images: 21 photographs related to Japan. ⚫Question

Japanese Heron Bench Example for Question types: Conversation, Detail, Complex

Heron Bench output 12 some example questions and output from

Llava Bench In-the-Wild Japanese 13 ⚫Images: 24 photographs. ⚫Question types:

Llava Bench In-the-Wild Japanese The LLaVA benchmark was NOT originally

Llava Bench In-the-Wild Japanese 15 some example questions and output

The Power of Qwen VL How powerful is the Qwen

LLaVA-Bench vs Helon-Bench a Qwen The graph plotting the LLaVA

Comparison of VLMs on the LLaVA benchmark （Sample） Comparing VLMs

Comparison of VLMs on the Heron benchmark （Sample） Comparing VMLs

Comparison of the benchmarks and category performance a 20 Comparison

How can you use? The easiest way is to ‘Use

How can we use? Select Qwen2.5-VL-32B from the model selection

Simple Demo I tried the exact same question as in

Most recent Qwen VL The most recent Qwen VL is

Thank you for listening!! ! 25