VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games.

Introduction

The global video game industry continues to expand rapidly, with its market value projected to reach $257 billion by 2028. Alongside this substantial growth, the process of developing high-quality video games remains inherently complex and demanding. A critical challenge within game development is to ensure visual quality and consistency through a rigorous visual testing and quality assurance (QA) process. Automation of visual QA tasks remains particularly challenging, and currently, most visual QA relies heavily on manual inspection, making the process time-consuming, costly, labor-intensive, and prone to human error.


The visual QA process for video games can generally be abstracted into three main types of tasks:


  • Verifying scene integrity by comparing the visual representation of scenes against intended configurations and known reference states, such as an oracle or previously rendered versions of the same scenes.
  • Detecting glitches through open-ended exploration—these glitches are unintended gameplay or visual artifacts without specific reference points, requiring testers to rely on common sense and general knowledge for detection.
  • Systematically reporting and documenting all identified glitches, ensuring developers receive clear and actionable information to address problems effectively during game development.

Recent advancements in vision-language models (VLMs) present promising opportunities to automate and significantly enhance the efficiency of video game QA. However, progress in applying VLMs to game QA has been limited by the lack of standardized benchmarks. Current multimodal benchmarks tend to focus heavily on complex mathematical or textual reasoning tasks, overlooking the visual comprehension tasks fundamental to video game QA. Similarly, existing game-specific benchmarks often cover only narrow aspects of QA, and thus inadequately evaluate and track VLM performance across diverse QA scenarios.

    Our Contributions

    In this paper, we introduce VideoGameQA-Bench, a benchmark designed to fill the gap in evaluating VLMs for video game QA. Our key findings and contributions are as follows:


    1. We present VideoGameQA-Bench, featuring 9 distinct tasks and a large set of questions grounded in real-world video game development scenarios, such as visual unit testing, visual regression testing, UI validation, video needle-in-a-haystack, and glitch detection.
    2. While VLMs show promising performance on various multimodal benchmarks and can function as OCR systems, they perform poorly at perceiving the fine details required for accurate scene understanding and at parsing complex UI elements.
    3. Frontier VLMs show good performance on the glitch detection task using images (up to 82.8%) and videos (up to 78.1%); however, all models struggle with glitches related to body configuration, intricate object clipping, and common-sense reasoning.
    4. Visual regression testing remains one of the most challenging tasks for VLMs.
    5. Localizing specific glitch moments in videos remains a challenge: models struggle both to detect the glitch and to accurately pinpoint when it occurs.
    6. Frontier VLMs can generate useful bug reports for up to 50% of real-world glitches, providing accurate and descriptive summaries of the glitches.

    VideoGameQA-Bench

    We designed VideoGameQA-Bench's tasks by simulating realistic QA scenarios encountered during actual video game development. To keep the benchmark relevant for future QA automation, we also included tasks that go beyond current software engineering practice while remaining highly relevant to game QA. The table below provides an overview of the contents of each task in the benchmark. In summary, VideoGameQA-Bench contains 2,236 image-based samples and 1,200 video-based samples sourced from more than 800 games and 9 synthetic game scenes.

    Image-based Tasks

    Visual Unit Testing

    Visual unit tests verify visual attributes including presence, placement, positioning, colors, conditions, and other relevant properties of various image elements.

    How many of Spider-Man's and Black Cat's eye areas, including those covered by their masks, are visible in the image?


    Provide your answer in the following JSON format:


    {
        "spiderman_eyes_visible": 0,
        "black_cat_eyes_visible": 0
    }

    Sample from the Visual Unit Testing task.
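
    One way to run a visual unit test like the sample above is to attach the screenshot and the question to a single multimodal request and parse the JSON reply. The sketch below is a minimal illustration using the OpenAI Python client with a hypothetical screenshot path and the "gpt-4o" model; VideoGameQA-Bench does not prescribe a specific harness.

    import base64
    import json
    from openai import OpenAI

    PROMPT = (
        "How many of Spider-Man's and Black Cat's eye areas, including those "
        "covered by their masks, are visible in the image?\n\n"
        "Provide your answer in the following JSON format:\n"
        '{"spiderman_eyes_visible": 0, "black_cat_eyes_visible": 0}'
    )

    def run_visual_unit_test(image_path: str) -> dict:
        # Encode the screenshot as a base64 data URL so it can be sent inline.
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        # Assumes the model returns bare JSON; the parsed counts can then be
        # compared against the ground truth stored with the benchmark sample.
        return json.loads(response.choices[0].message.content)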

    UI Unit Testing

    UI (visual) unit tests verify in-game UI elements such as menus, subtitles, heads-up displays (HUDs), and interface components like graphs and charts. We simulate the UI unit testing task by asking the vision-language model questions about game screenshots.

    Read the dashboard and fill the JSON values below:


    {
        "tire_pressure": {
            "front_left": 0,
            "front_right": 0,
            "rear_left": 0,
            "rear_right": 0
        },
        "brake_temps": {
            "front_left": 0,
            "front_right": 0,
            "rear_left": 0,
            "rear_right": 0
        },
        "break_bias": 0,
        "break_sl": 0,
        "settings": {
            "map": 0,
            "mix": 0,
            "tc1": 0,
            "tc2": 0
        },
        "gear": 0,
        "rpm": 0,
        "speed_mph": 0
    }
                    

    Sample from the UI Unit Testing task.
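
    Structured outputs such as the dashboard JSON above can be scored field by field against ground truth. The sketch below flattens both nested objects and reports the fraction of exactly matching fields; this is an illustrative metric, not necessarily the benchmark's official scoring rule.

    def flatten(d: dict, prefix: str = "") -> dict:
        """Flatten a nested dict into dotted keys, e.g. "tire_pressure.front_left"."""
        out = {}
        for key, value in d.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                out.update(flatten(value, path))
            else:
                out[path] = value
        return out

    def field_accuracy(prediction: dict, ground_truth: dict) -> float:
        """Fraction of ground-truth fields the model reproduced exactly."""
        pred, gt = flatten(prediction), flatten(ground_truth)
        correct = sum(1 for key, value in gt.items() if pred.get(key) == value)
        return correct / len(gt)

    # Hypothetical example: one of two fields matches, so the score is 0.5.
    # field_accuracy({"gear": 3, "rpm": 7200}, {"gear": 3, "rpm": 7250})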

    Visual Regression Testing

    Visual regression tests check for unintended visual changes after a change to the game. A simple pixel-by-pixel comparison of two screenshots is not sufficient, as some variations (e.g., because of character customization or weather conditions in the game) may be acceptable. Visual regressions may occur in cinematic parts of the game, such as cutscenes that have a defined sequence flow. We simulate this task by asking the VLM to compare whether two screenshots are similar, taking into account the specified acceptable or unacceptable variations.

    Image Version 1
    Image Version 2

    Sample from the Visual Regression Testing task.
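
    In practice, this means sending both screenshot versions in one request together with a description of which variations are acceptable. The sketch below shows one way to build such a two-image comparison with the OpenAI Python client; the prompt text is illustrative rather than the benchmark's exact wording.

    import base64
    from openai import OpenAI

    REGRESSION_PROMPT = (
        "You are comparing two screenshots of the same game scene taken before and "
        "after a code change. Differences in character customization or weather are "
        "acceptable; missing objects, broken UI, or altered cutscene framing are not. "
        'Answer in JSON: {"images_match": true or false, "reasoning": "..."}'
    )

    def as_image_part(path: str) -> dict:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    def compare_versions(before_path: str, after_path: str) -> str:
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": [{"type": "text", "text": REGRESSION_PROMPT},
                                   as_image_part(before_path),
                                   as_image_part(after_path)]}],
        )
        return response.choices[0].message.content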

    Glitch Detection

    Glitch detection is the process of identifying unintended visual errors, such as rendering issues, clipping, or physics and logic bugs that manifest visually. We simulate this task by presenting the VLM with both glitch and glitch-free images and asking whether each contains a glitch.

    You are a helpful assistant analyzing video game images and screenshots for glitches. You will be given a screenshot from a video game, and your job is to analyze the screenshot and determine whether it contains a glitch.


    Provide your answer in the following format:


    {
        "reasoning": "Brief explanation of why you think a glitch is present",
        "glitch_detected": true or false,
        "description_of_glitch": "Description of the glitch if detected else empty string"
    }

    Sample from the Glitch Detection task.
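
    Because the task is a binary decision over both glitch and glitch-free images, scoring reduces to parsing the "glitch_detected" flag and computing accuracy. The sketch below shows one robust way to do this; the tolerance for Markdown-fenced or chatty replies is an assumption about model behavior, not part of the benchmark definition.

    import json
    import re

    def parse_glitch_reply(reply: str) -> bool | None:
        """Extract the glitch_detected flag from a model reply.

        Models sometimes wrap the JSON in Markdown fences or add extra prose,
        so grab the first {...} span before parsing. Returns None if the reply
        cannot be parsed."""
        match = re.search(r"\{.*\}", reply, re.DOTALL)
        if not match:
            return None
        try:
            return bool(json.loads(match.group(0))["glitch_detected"])
        except (json.JSONDecodeError, KeyError):
            return None

    def accuracy(replies: list[str], labels: list[bool]) -> float:
        """Fraction of images classified correctly; unparseable replies count as wrong."""
        correct = sum(parse_glitch_reply(r) == label for r, label in zip(replies, labels))
        return correct / len(labels)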

    Parametric Clipping Detection

    Given the common occurrence of clipping in games, our benchmark includes a dedicated task to evaluate a model's ability to detect such glitches. In this task, images feature an object (e.g., a cube, sphere, or character) positioned at varying distances from a human character — from far apart to fully overlapping/clipping. The VLM is asked whether it detects clipping across each of these distances.

    Sample from the Parametric Clipping Detection task, where an object approaches a human character from a distance until it clips and intersects with the character.
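
    Since each image in this task is generated at a known object-to-character distance, per-distance detection rates can be aggregated into a simple curve. The sketch below assumes hypothetical per-image records with "distance" and "glitch_detected" fields; the field names and grouping are illustrative only.

    from collections import defaultdict

    def clipping_detection_curve(results: list[dict]) -> dict[float, float]:
        """Detection rate as a function of object-to-character distance.

        `results` holds per-image records such as
        {"distance": 0.2, "glitch_detected": True}, where distance 0 means the
        object fully overlaps the character (hypothetical convention)."""
        grouped = defaultdict(list)
        for record in results:
            grouped[record["distance"]].append(record["glitch_detected"])
        return {d: sum(flags) / len(flags) for d, flags in sorted(grouped.items())}

    A well-behaved model should report almost no clipping at large distances and consistently flag it once the object intersects the character.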

    Bug Report Generation

    In addition to testing and detection tasks, a potential application of VLMs is to assist QA engineers with writing reports for detected bugs. We simulate this task by asking the VLM to write a description of a glitch image that can be used in a bug report.

    You are a video game quality assurance assistant who analyzes screenshots to identify and document visual glitches. When presented with an image, carefully examine it for any graphical bugs or rendering issues.


    Provide your analysis in the following JSON format:


    {
        "reasoning": "Brief explanation of what you observe and why it appears to be a glitch",
        "bug_report_title": "A clear, concise title summarizing the issue",
        "bug_report_description": "Detailed description of the visual bug, including its appearance and potential impact on
        gameplay",
        "affected_item": "The specific game element (character, object, environment, UI) affected by the glitch"
    }

    Sample from the Bug Report Generation task.

    Video-based Tasks

    Glitch Detection (Video)

    Glitch detection in videos can be used to verify autonomous gameplay sessions from bots. Detecting glitches in videos is significantly more complex, as it requires analyzing motion patterns and may involve identifying transient glitches that appear only briefly in a few frames. We simulate this task by asking the vision-language model whether it detects a glitch in a video.

    You are a helpful assistant analyzing video game clips for glitches. You will be given a short video clip from a video game, and your task is to analyze the video and determine whether it contains a glitch.


    Provide your answer in the following format:


    {
        "reasoning": "Brief explanation of why you think a glitch is present",
        "glitch_detected": true or false,
        "description_of_glitch": "Description of the glitch if detected else empty string"
    }

    Sample from the Video-based Glitch Detection task.
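
    Many VLM APIs accept video only as a sequence of frames, so a common approach is to sample frames uniformly from the clip and attach them to a single request. The sketch below uses OpenCV for frame extraction; note that uniform sampling can miss transient glitches that appear in only a few frames, which is part of what makes the video tasks harder.

    import base64
    import cv2  # pip install opencv-python

    def sample_frames(video_path: str, num_frames: int = 16) -> list[str]:
        """Uniformly sample frames from a clip as base64-encoded JPEGs,
        ready to be attached to a multimodal chat message."""
        capture = cv2.VideoCapture(video_path)
        total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
        frames = []
        for idx in indices:
            capture.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = capture.read()
            if not ok:
                continue
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode())
        capture.release()
        return frames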

    Needle-in-a-Haystack (NIAH)

    Needle-in-a-Haystack (NIAH) is a more challenging long-context retrieval version of the glitch detection task. We simulate this task by asking the vision-language model whether it detects a glitch in a video, and in which frame the glitch occurs for the first time.

    You are a specialized video game quality assurance analyst trained to detect visual anomalies in gameplay footage. Your task is to analyze the provided video clip to identify any bugs, glitches, visual artifacts, or unexpected behaviors.


    What to Look For


    • Visual artifacts (texture issues, flickering, clipping)
    • Animation problems (jerky movements, T-poses)
    • Rendering glitches (missing textures, lighting errors)
    • Gameplay anomalies (collision failures, object teleportation)

    Response Format


    After your thorough analysis, provide your findings in this exact JSON format:


    {
        "reasoning": "Brief explanation of what you observed in the video and why it appears to be a glitch",
        "glitch_detected": true or false,
        "timestamp": 0
    }

    Sample from the Needle-in-a-Haystack task.
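
    Scoring this task requires checking not only that a glitch was reported but also that the reported moment is close to the annotated one. The sketch below shows one plausible scoring rule with a configurable tolerance window; the exact criterion used by the benchmark may differ, so treat this as an illustration only.

    def niah_correct(pred_detected: bool, pred_timestamp: float,
                     true_start: float, true_end: float,
                     tolerance: float = 1.0) -> bool:
        """Count a prediction as correct only if a glitch was reported and the
        predicted timestamp falls within the annotated glitch interval, padded
        by a tolerance in seconds (hypothetical rule)."""
        if not pred_detected:
            return False
        return (true_start - tolerance) <= pred_timestamp <= (true_end + tolerance)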

    Bug Report Generation (Video)

    In this task, the vision-language model is asked to provide a description of a glitch video that can be used in a bug report.

    You are a video game quality assurance assistant who analyzes video clips to identify and document visual glitches or strange behaviors. When presented with a video clip, carefully examine it for any graphical bugs, rendering issues, physics anomalies, or unexpected events.


    Provide your analysis in the following JSON format:


    {
        "reasoning": "Brief explanation of what you observe and why it appears to be a glitch",
        "bug_report_title": "A clear, concise title summarizing the issue",
        "bug_report_description": "Detailed description of the visual bug, including its appearance and potential impact on
        gameplay",
        "affected_item": "The specific game element (character, object, environment, UI) affected by the glitch"
    }

    Sample from the Video-based Bug Report Generation task.

    Leaderboard

    We evaluated a total of 11 proprietary and 5 open-weight models on VideoGameQA-Bench. Our evaluation includes both standard models and those designed for extended reasoning.

    Accuracy (%) scores of models on VideoGameQA-Bench.
    Tasks: Visual unit testing (VU); UI unit testing (UI); Visual regression testing (VR); Image-based glitch detection (IGD); Parametric clipping detection (PCD); Image-based bug-report generation (IBR); Video-based glitch detection (VGD); Needle-in-a-haystack (NIAH); Video-based bug-report generation (VBR).
    Scores marked with † were computed with the NIAH score set to 0 (shown as n/a in the NIAH column). Total is the mean of the image-task and video-task averages.
    Model                     VU     UI     VR     IGD    PCD    IBR    VGD    NIAH   VBR    Img.   Vid.   Total
    # Samples                 100    100    250    1,000  686    100    100    1,000  100    2,236  1,200  3,436
    GPT-4o                    43.0   28.0   28.8   81.3   87.8   51.0   75.8   19.0   51.0   53.3   48.6   51.0
    GPT-4o Mini               42.0   30.0   20.4   76.8   66.9   46.0   71.8   10.0   26.0   47.0   35.9   41.5
    GPT-4o Nano               9.0    14.0   19.2   57.0   66.9   16.0   49.1   4.0    14.0   30.4   22.4   26.4
    GPT-4                     39.0   23.0   31.6   82.8   82.5   54.0   57.0   1.0    52.0   52.2   36.7   44.4
    o4-mini                   50.0   35.0   45.2   76.4   65.0   38.0   70.0   18.0   28.0   51.6   38.7   45.1
    o3                        43.0   28.0   39.6   73.7   80.5   53.0   76.8   13.0   45.0   53.0   44.9   48.9
    Gemini-2.5-Pro            53.0   40.0   30.8   75.4   72.2   33.0   78.1   34.0   36.0   50.7   49.4   50.0
    Gemini-2.5-Flash          47.0   24.0   26.4   66.3   72.2   24.0   64.7   35.0   23.0   43.3   40.9   42.1
    Gemini-2.0-Flash          44.0   28.0   12.0   68.1   78.0   20.0   54.5   36.0   26.0   41.7   38.8   40.3
    Sonnet 3.7                23.0   22.0   24.0   65.1   76.4   29.0   66.9   31.0   22.0   39.9   40.0   39.9
    Sonnet 3.5                23.0   29.0   14.0   70.1   72.9   33.0   61.2   27.0   26.0   40.3   38.1   39.2
    Llama-4.0-Scout           32.0   23.0   13.6   55.8   71.6   8.0    58.6   n/a    5.0    34.0   21.2†  27.6†
    Llama-4.0-Maverick        21.0   22.0   18.4   53.2   65.7   7.0    56.6   n/a    15.0   31.2   23.9†  27.5†
    Gemma (27B)               12.0   12.0   12.8   46.7   69.7   10.0   51.3   n/a    9.0    27.2   20.1†  23.6†
    Mistral-Small-3.1 (24B)   15.0   17.0   25.6   59.7   62.5   9.0    61.4   n/a    14.0   31.5   25.1†  28.3†
    Qwen-2.5-VL (72B)         38.0   27.0   21.2   70.0   76.0   19.0   47.9   n/a    17.0   41.9   21.6†  31.7†
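
    The aggregate columns follow directly from the per-task scores: Img. and Vid. are unweighted means over the image and video tasks (which reproduces the table values), and Total is the mean of those two averages, as stated in the note above. The snippet below reproduces the GPT-4o row.

    def aggregate(image_scores: list[float], video_scores: list[float]) -> tuple[float, float, float]:
        """Unweighted per-task means and their overall average, as used in the table."""
        img = sum(image_scores) / len(image_scores)
        vid = sum(video_scores) / len(video_scores)
        return round(img, 1), round(vid, 1), round((img + vid) / 2, 1)

    # GPT-4o: [VU, UI, VR, IGD, PCD, IBR] and [VGD, NIAH, VBR]
    print(aggregate([43.0, 28.0, 28.8, 81.3, 87.8, 51.0], [75.8, 19.0, 51.0]))
    # -> (53.3, 48.6, 51.0)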