In this section, we present quantitative per-model results for VideoGameQA-Bench. The first two tables report binary classification outcomes (true/false positives and negatives, with entries that produced errors tallied separately); the last two report accuracy on 100-example evaluation sets.
Model Name | Total Entries | TP | TN | FP | FN | Errors | Overall Acc (%)
---|---|---|---|---|---|---|---
gpt-4o-2024-08-06 | 999 | 417 | 411 | 89 | 82 | 0 | 82.9
gpt-4.1-2025-04-14 | 1000 | 374 | 439 | 61 | 126 | 0 | 81.3
gpt-4.1-mini-2025-04-14 | 999 | 468 | 300 | 199 | 32 | 0 | 76.9
o4-mini-2025-04-16 | 1000 | 331 | 433 | 67 | 169 | 0 | 76.4
gemini-2.5-pro-preview-03-25 | 999 | 418 | 336 | 164 | 81 | 0 | 75.5
o3-2025-04-16 | 1000 | 253 | 484 | 16 | 247 | 0 | 73.7
claude-3-5-sonnet-20241022 | 1000 | 238 | 463 | 37 | 261 | 1 | 70.1
qwen/qwen2.5-vl-72b-instruct | 1000 | 254 | 446 | 52 | 246 | 2 | 70.0
gemini-2.0-flash | 1000 | 259 | 422 | 78 | 241 | 0 | 68.1
gemini-2.5-flash-preview-04-17 | 999 | 215 | 448 | 52 | 284 | 0 | 66.4
claude-3-7-sonnet-20250219 | 1000 | 177 | 474 | 26 | 323 | 0 | 65.1
mistralai/mistral-small-3.1-24b-instruct | 1000 | 230 | 367 | 133 | 270 | 0 | 59.7
gpt-4.1-nano-2025-04-14 | 1000 | 413 | 157 | 343 | 87 | 0 | 57.0
meta-llama/llama-4-scout | 1000 | 74 | 484 | 16 | 425 | 1 | 55.8
meta-llama/llama-4-maverick | 1000 | 44 | 488 | 11 | 456 | 1 | 53.2
google/gemma-3-27b-it | 1000 | 460 | 7 | 446 | 0 | 87 | 46.7
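The Overall Acc column in these tables is consistent with (TP + TN) / Total Entries, with errored entries (no usable prediction) counting against the model. A minimal sketch of that computation, assuming this convention:

```python
def overall_accuracy(tp: int, tn: int, total: int) -> float:
    """Accuracy in percent: correct predictions over all entries.

    Entries that produced errors are included in `total`, so they
    count as incorrect.
    """
    return round(100 * (tp + tn) / total, 1)

# Reproducing the gpt-4o-2024-08-06 row from the table above:
print(overall_accuracy(tp=417, tn=411, total=999))  # -> 82.9
```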
Model Name | Total Entries | TP | TN | FP | FN | Errors | Overall Acc (%)
---|---|---|---|---|---|---|---
gemini-2.5-pro-preview-03-25 | 1000 | 334 | 447 | 53 | 166 | 0 | 78.1
o3-2025-04-16 | 995 | 298 | 470 | 27 | 200 | 0 | 77.2
gpt-4.1-2025-04-14 | 990 | 411 | 347 | 149 | 83 | 0 | 76.6
o4-mini-2025-04-16 | 960 | 330 | 370 | 115 | 143 | 2 | 72.9
gpt-4.1-mini-2025-04-14 | 995 | 346 | 372 | 124 | 153 | 0 | 72.2
claude-3-7-sonnet-20250219 | 997 | 250 | 419 | 79 | 245 | 4 | 67.1
gemini-2.5-flash-preview-04-17 | 1000 | 426 | 221 | 279 | 74 | 0 | 64.7
claude-3-5-sonnet-20241022 | 972 | 266 | 346 | 70 | 150 | 140 | 63.0
mistralai/mistral-small-3.1-24b-instruct | 999 | 238 | 376 | 112 | 238 | 35 | 61.5
meta-llama/llama-4-scout-17b-16e-instruct | 1000 | 117 | 469 | 25 | 349 | 40 | 58.6
gpt-4o-2024-08-06 | 995 | 356 | 214 | 53 | 90 | 282 | 57.3
meta-llama/llama-4-maverick-17b-128e-instruct | 1000 | 82 | 484 | 6 | 375 | 53 | 56.6
qwen2.5-vl-72b-instruct | 869 | 99 | 380 | 2 | 388 | 0 | 55.1
gemini-2.0-flash | 1000 | 477 | 68 | 432 | 23 | 0 | 54.5
google/gemma-3-27b-it | 999 | 498 | 15 | 484 | 1 | 1 | 51.4
gpt-4.1-nano-2025-04-14 | 983 | 466 | 25 | 468 | 24 | 0 | 49.9
Model Name | Accuracy
---|---
gpt-4o-2024-08-06 | 54.0% (54/100)
o3-2025-04-16 | 53.0% (53/100)
gpt-4.1-2025-04-14 | 51.0% (51/100)
gpt-4.1-mini-2025-04-14 | 46.0% (46/100)
o4-mini-2025-04-16 | 38.0% (38/100)
claude-3-7-sonnet-20250219 | 33.0% (33/100)
gemini-2.5-pro-preview-03-25 | 33.0% (33/100)
claude-3-5-sonnet-20241022 | 29.0% (29/100)
gemini-2.5-flash-preview-04-17 | 24.0% (24/100)
gemini-2.0-flash | 20.0% (20/100)
qwen/qwen2.5-vl-72b-instruct | 19.0% (19/100)
qwen/qwen-vl-max | 18.0% (18/100)
gpt-4.1-nano-2025-04-14 | 16.0% (16/100)
google/gemma-3-27b-it | 10.0% (10/100)
mistralai/mistral-small-3.1-24b-instruct | 9.0% (9/100)
meta-llama/llama-4-scout | 8.0% (8/100)
meta-llama/llama-4-maverick | 7.0% (7/100)
Model Name | Accuracy
---|---
gpt-4o-2024-08-06 | 52.0% (52/100)
gpt-4.1-2025-04-14 | 51.0% (51/100)
o3-2025-04-16 | 45.0% (45/100)
gemini-2.5-pro-preview-03-25 | 36.0% (36/100)
o4-mini-2025-04-16 | 28.0% (28/100)
gpt-4.1-mini-2025-04-14 | 26.0% (26/100)
claude-3-5-sonnet-20241022 | 26.0% (26/100)
gemini-2.0-flash | 26.0% (26/100)
gemini-2.5-flash-preview-04-17 | 23.0% (23/100)
claude-3-7-sonnet-20250219 | 22.0% (22/100)
qwen2.5-vl-72b-instruct | 17.0% (17/100)
meta-llama/llama-4-maverick-17b-128e-instruct | 15.0% (15/100)
gpt-4.1-nano-2025-04-14 | 14.0% (14/100)
mistralai/mistral-small-3.1-24b-instruct | 14.0% (14/100)
qwen-vl-max | 12.0% (12/100)
google/gemma-3-27b-it | 9.0% (9/100)
meta-llama/llama-4-scout-17b-16e-instruct | 5.0% (5/100)