Artifacts for VideoGameQA-Bench

In this section, we present model outputs and qualitative results for VideoGameQA-Bench.


Image-based Glitch Detection

| Model Name | Total Entries | TP | TN | FP | FN | Errors | Overall Acc (%) | Details |
|---|---|---|---|---|---|---|---|---|
| gpt-4o-2024-08-06 | 999 | 417 | 411 | 89 | 82 | 0 | 82.9 | View |
| gpt-4.1-2025-04-14 | 1000 | 374 | 439 | 61 | 126 | 0 | 81.3 | View |
| gpt-4.1-mini-2025-04-14 | 999 | 468 | 300 | 199 | 32 | 0 | 76.9 | View |
| o4-mini-2025-04-16 | 1000 | 331 | 433 | 67 | 169 | 0 | 76.4 | View |
| gemini-2.5-pro-preview-03-25 | 999 | 418 | 336 | 164 | 81 | 0 | 75.5 | View |
| o3-2025-04-16 | 1000 | 253 | 484 | 16 | 247 | 0 | 73.7 | View |
| claude-3-5-sonnet-20241022 | 1000 | 238 | 463 | 37 | 261 | 1 | 70.1 | View |
| qwen/qwen2.5-vl-72b-instruct | 1000 | 254 | 446 | 52 | 246 | 2 | 70.0 | View |
| gemini-2.0-flash | 1000 | 259 | 422 | 78 | 241 | 0 | 68.1 | View |
| gemini-2.5-flash-preview-04-17 | 999 | 215 | 448 | 52 | 284 | 0 | 66.4 | View |
| claude-3-7-sonnet-20250219 | 1000 | 177 | 474 | 26 | 323 | 0 | 65.1 | View |
| mistralai/mistral-small-3.1-24b-instruct | 1000 | 230 | 367 | 133 | 270 | 0 | 59.7 | View |
| gpt-4.1-nano-2025-04-14 | 1000 | 413 | 157 | 343 | 87 | 0 | 57.0 | View |
| meta-llama/llama-4-scout | 1000 | 74 | 484 | 16 | 425 | 1 | 55.8 | View |
| meta-llama/llama-4-maverick | 1000 | 44 | 488 | 11 | 456 | 1 | 53.2 | View |
| google/gemma-3-27b-it | 1000 | 460 | 7 | 446 | 0 | 87 | 46.7 | View |
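The Overall Acc column is consistent with accuracy computed over all entries, i.e. correct predictions (TP + TN) divided by the total number of entries, with entries that produced errors counting against the model. A minimal sketch of that calculation (the function name and the error-handling assumption are ours, not from the benchmark):

```python
def overall_accuracy(tp, tn, fp, fn, errors):
    """Overall accuracy in percent: correct predictions (TP + TN)
    over the total number of entries. Entries that errored out are
    assumed to count as incorrect (they stay in the denominator)."""
    total = tp + tn + fp + fn + errors
    return 100.0 * (tp + tn) / total

# gpt-4o-2024-08-06 on image-based glitch detection:
# (417 + 411) / 999 matches the reported 82.9%
print(round(overall_accuracy(417, 411, 89, 82, 0), 1))
```

The same formula reproduces rows with nonzero Errors, e.g. claude-3-5-sonnet-20241022 at (238 + 463) / 1000 = 70.1%.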

Video-based Glitch Detection

| Model Name | Total Entries | TP | TN | FP | FN | Errors | Overall Acc (%) | Details |
|---|---|---|---|---|---|---|---|---|
| gemini-2.5-pro-preview-03-25 | 1000 | 334 | 447 | 53 | 166 | 0 | 78.1 | View |
| o3-2025-04-16 | 995 | 298 | 470 | 27 | 200 | 0 | 77.2 | View |
| gpt-4.1-2025-04-14 | 990 | 411 | 347 | 149 | 83 | 0 | 76.6 | View |
| o4-mini-2025-04-16 | 960 | 330 | 370 | 115 | 143 | 2 | 72.9 | View |
| gpt-4.1-mini-2025-04-14 | 995 | 346 | 372 | 124 | 153 | 0 | 72.2 | View |
| claude-3-7-sonnet-20250219 | 997 | 250 | 419 | 79 | 245 | 4 | 67.1 | View |
| gemini-2.5-flash-preview-04-17 | 1000 | 426 | 221 | 279 | 74 | 0 | 64.7 | View |
| claude-3-5-sonnet-20241022 | 972 | 266 | 346 | 70 | 150 | 140 | 63.0 | View |
| mistralai/mistral-small-3.1-24b-instruct | 999 | 238 | 376 | 112 | 238 | 35 | 61.5 | View |
| meta-llama/llama-4-scout-17b-16e-instruct | 1000 | 117 | 469 | 25 | 349 | 40 | 58.6 | View |
| gpt-4o-2024-08-06 | 995 | 356 | 214 | 53 | 90 | 282 | 57.3 | View |
| meta-llama/llama-4-maverick-17b-128e-instruct | 1000 | 82 | 484 | 6 | 375 | 53 | 56.6 | View |
| qwen2.5-vl-72b-instruct | 869 | 99 | 380 | 2 | 388 | 0 | 55.1 | View |
| gemini-2.0-flash | 1000 | 477 | 68 | 432 | 23 | 0 | 54.5 | View |
| google/gemma-3-27b-it | 999 | 498 | 15 | 484 | 1 | 1 | 51.4 | View |
| gpt-4.1-nano-2025-04-14 | 983 | 466 | 25 | 468 | 24 | 0 | 49.9 | View |

Image-based Bug Report Generation

| Model Name | Accuracy | Link |
|---|---|---|
| gpt-4o-2024-08-06 | 54.0% (54/100) | View |
| o3-2025-04-16 | 53.0% (53/100) | View |
| gpt-4.1-2025-04-14 | 51.0% (51/100) | View |
| gpt-4.1-mini-2025-04-14 | 46.0% (46/100) | View |
| o4-mini-2025-04-16 | 38.0% (38/100) | View |
| claude-3-7-sonnet-20250219 | 33.0% (33/100) | View |
| gemini-2.5-pro-preview-03-25 | 33.0% (33/100) | View |
| claude-3-5-sonnet-20241022 | 29.0% (29/100) | View |
| gemini-2.5-flash-preview-04-17 | 24.0% (24/100) | View |
| gemini-2.0-flash | 20.0% (20/100) | View |
| qwen/qwen2.5-vl-72b-instruct | 19.0% (19/100) | View |
| qwen/qwen-vl-max | 18.0% (18/100) | View |
| gpt-4.1-nano-2025-04-14 | 16.0% (16/100) | View |
| google/gemma-3-27b-it | 10.0% (10/100) | View |
| mistralai/mistral-small-3.1-24b-instruct | 9.0% (9/100) | View |
| meta-llama/llama-4-scout | 8.0% (8/100) | View |
| meta-llama/llama-4-maverick | 7.0% (7/100) | View |

Video-based Bug Report Generation

| Model Name | Accuracy | Link |
|---|---|---|
| gpt-4o-2024-08-06 | 52.0% (52/100) | View |
| gpt-4.1-2025-04-14 | 51.0% (51/100) | View |
| o3-2025-04-16 | 45.0% (45/100) | View |
| gemini-2.5-pro-preview-03-25 | 36.0% (36/100) | View |
| o4-mini-2025-04-16 | 28.0% (28/100) | View |
| gpt-4.1-mini-2025-04-14 | 26.0% (26/100) | View |
| claude-3-5-sonnet-20241022 | 26.0% (26/100) | View |
| gemini-2.0-flash | 26.0% (26/100) | View |
| gemini-2.5-flash-preview-04-17 | 23.0% (23/100) | View |
| claude-3-7-sonnet-20250219 | 22.0% (22/100) | View |
| qwen2.5-vl-72b-instruct | 17.0% (17/100) | View |
| meta-llama/llama-4-maverick-17b-128e-instruct | 15.0% (15/100) | View |
| gpt-4.1-nano-2025-04-14 | 14.0% (14/100) | View |
| mistralai/mistral-small-3.1-24b-instruct | 14.0% (14/100) | View |
| qwen-vl-max | 12.0% (12/100) | View |
| google/gemma-3-27b-it | 9.0% (9/100) | View |
| meta-llama/llama-4-scout-17b-16e-instruct | 5.0% (5/100) | View |