Artifacts for VideoGameQA-Bench

In this section, we present model outputs and qualitative results for VideoGameQA-Bench.


Image-based Glitch Detection

| Model Name | Total Entries | TP | TN | FP | FN | Errors | Overall Acc (%) | Details |
|---|---|---|---|---|---|---|---|---|
| gpt-4o-2024-08-06 | 999 | 417 | 411 | 89 | 82 | 0 | 82.9 | View |
| gpt-4.1-2025-04-14 | 1000 | 374 | 439 | 61 | 126 | 0 | 81.3 | View |
| gpt-4.1-mini-2025-04-14 | 999 | 468 | 300 | 199 | 32 | 0 | 76.9 | View |
| o4-mini-2025-04-16 | 1000 | 331 | 433 | 67 | 169 | 0 | 76.4 | View |
| gemini-2.5-pro-preview-03-25 | 999 | 418 | 336 | 164 | 81 | 0 | 75.5 | View |
| o3-2025-04-16 | 1000 | 253 | 484 | 16 | 247 | 0 | 73.7 | View |
| claude-3-5-sonnet-20241022 | 1000 | 238 | 463 | 37 | 261 | 1 | 70.1 | View |
| qwen/qwen2.5-vl-72b-instruct | 1000 | 254 | 446 | 52 | 246 | 2 | 70.0 | View |
| gemini-2.0-flash | 1000 | 259 | 422 | 78 | 241 | 0 | 68.1 | View |
| gemini-2.5-flash-preview-04-17 | 999 | 215 | 448 | 52 | 284 | 0 | 66.4 | View |
| claude-3-7-sonnet-20250219 | 1000 | 177 | 474 | 26 | 323 | 0 | 65.1 | View |
| mistralai/mistral-small-3.1-24b-instruct | 1000 | 230 | 367 | 133 | 270 | 0 | 59.7 | View |
| gpt-4.1-nano-2025-04-14 | 1000 | 413 | 157 | 343 | 87 | 0 | 57.0 | View |
| meta-llama/llama-4-scout | 1000 | 74 | 484 | 16 | 425 | 1 | 55.8 | View |
| meta-llama/llama-4-maverick | 1000 | 44 | 488 | 11 | 456 | 1 | 53.2 | View |
| google/gemma-3-27b-it | 1000 | 460 | 7 | 446 | 0 | 87 | 46.7 | View |
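The Overall Acc column is consistent with accuracy computed over all entries, i.e. correct predictions (TP + TN) divided by the total number of entries, with entries that produced errors counting against the model. A minimal sketch of that calculation (the function name and the error-handling assumption are ours, not from the benchmark):

```python
def overall_accuracy(tp, tn, fp, fn, errors):
    """Overall accuracy in percent: correct predictions (TP + TN)
    over the total number of entries. Entries that errored out are
    assumed to count as incorrect (they stay in the denominator)."""
    total = tp + tn + fp + fn + errors
    return 100.0 * (tp + tn) / total

# gpt-4o-2024-08-06 on image-based glitch detection:
# (417 + 411) / 999 matches the reported 82.9%
print(round(overall_accuracy(417, 411, 89, 82, 0), 1))
```

The same formula reproduces rows with nonzero Errors, e.g. claude-3-5-sonnet-20241022 at (238 + 463) / 1000 = 70.1%.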

Video-based Glitch Detection

| Model Name | Total Entries | TP | TN | FP | FN | Errors | Overall Acc (%) | Details |
|---|---|---|---|---|---|---|---|---|
| gemini-2.5-pro-preview-03-25 | 1000 | 334 | 447 | 53 | 166 | 0 | 78.1 | View |
| o3-2025-04-16 | 995 | 298 | 470 | 27 | 200 | 0 | 77.2 | View |
| gpt-4.1-2025-04-14 | 990 | 411 | 347 | 149 | 83 | 0 | 76.6 | View |
| o4-mini-2025-04-16 | 960 | 330 | 370 | 115 | 143 | 2 | 72.9 | View |
| gpt-4.1-mini-2025-04-14 | 995 | 346 | 372 | 124 | 153 | 0 | 72.2 | View |
| claude-3-7-sonnet-20250219 | 997 | 250 | 419 | 79 | 245 | 4 | 67.1 | View |
| gemini-2.5-flash-preview-04-17 | 1000 | 426 | 221 | 279 | 74 | 0 | 64.7 | View |
| claude-3-5-sonnet-20241022 | 972 | 266 | 346 | 70 | 150 | 140 | 63.0 | View |
| mistralai/mistral-small-3.1-24b-instruct | 999 | 238 | 376 | 112 | 238 | 35 | 61.5 | View |
| meta-llama/llama-4-scout-17b-16e-instruct | 1000 | 117 | 469 | 25 | 349 | 40 | 58.6 | View |
| gpt-4o-2024-08-06 | 995 | 356 | 214 | 53 | 90 | 282 | 57.3 | View |
| meta-llama/llama-4-maverick-17b-128e-instruct | 1000 | 82 | 484 | 6 | 375 | 53 | 56.6 | View |
| qwen2.5-vl-72b-instruct | 869 | 99 | 380 | 2 | 388 | 0 | 55.1 | View |
| gemini-2.0-flash | 1000 | 477 | 68 | 432 | 23 | 0 | 54.5 | View |
| google/gemma-3-27b-it | 999 | 498 | 15 | 484 | 1 | 1 | 51.4 | View |
| gpt-4.1-nano-2025-04-14 | 983 | 466 | 25 | 468 | 24 | 0 | 49.9 | View |

Image-based Bug Report Generation

| Model Name | Accuracy | Link |
|---|---|---|
| gpt-4o-2024-08-06 | 54.0% (54/100) | View |
| o3-2025-04-16 | 53.0% (53/100) | View |
| gpt-4.1-2025-04-14 | 51.0% (51/100) | View |
| gpt-4.1-mini-2025-04-14 | 46.0% (46/100) | View |
| o4-mini-2025-04-16 | 38.0% (38/100) | View |
| claude-3-7-sonnet-20250219 | 33.0% (33/100) | View |
| gemini-2.5-pro-preview-03-25 | 33.0% (33/100) | View |
| claude-3-5-sonnet-20241022 | 29.0% (29/100) | View |
| gemini-2.5-flash-preview-04-17 | 24.0% (24/100) | View |
| gemini-2.0-flash | 20.0% (20/100) | View |
| qwen/qwen2.5-vl-72b-instruct | 19.0% (19/100) | View |
| qwen/qwen-vl-max | 18.0% (18/100) | View |
| gpt-4.1-nano-2025-04-14 | 16.0% (16/100) | View |
| google/gemma-3-27b-it | 10.0% (10/100) | View |
| mistralai/mistral-small-3.1-24b-instruct | 9.0% (9/100) | View |
| meta-llama/llama-4-scout | 8.0% (8/100) | View |
| meta-llama/llama-4-maverick | 7.0% (7/100) | View |

Video-based Bug Report Generation

| Model Name | Accuracy | Link |
|---|---|---|
| gpt-4o-2024-08-06 | 52.0% (52/100) | View |
| gpt-4.1-2025-04-14 | 51.0% (51/100) | View |
| o3-2025-04-16 | 45.0% (45/100) | View |
| gemini-2.5-pro-preview-03-25 | 36.0% (36/100) | View |
| o4-mini-2025-04-16 | 28.0% (28/100) | View |
| gpt-4.1-mini-2025-04-14 | 26.0% (26/100) | View |
| claude-3-5-sonnet-20241022 | 26.0% (26/100) | View |
| gemini-2.0-flash | 26.0% (26/100) | View |
| gemini-2.5-flash-preview-04-17 | 23.0% (23/100) | View |
| claude-3-7-sonnet-20250219 | 22.0% (22/100) | View |
| qwen2.5-vl-72b-instruct | 17.0% (17/100) | View |
| meta-llama/llama-4-maverick-17b-128e-instruct | 15.0% (15/100) | View |
| gpt-4.1-nano-2025-04-14 | 14.0% (14/100) | View |
| mistralai/mistral-small-3.1-24b-instruct | 14.0% (14/100) | View |
| qwen-vl-max | 12.0% (12/100) | View |
| google/gemma-3-27b-it | 9.0% (9/100) | View |
| meta-llama/llama-4-scout-17b-16e-instruct | 5.0% (5/100) | View |