VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games.

Introduction

The global video game industry continues to expand rapidly, with its market value projected to reach $257 billion by 2028. Alongside this substantial growth, the process of developing high-quality video games remains inherently complex and demanding. A critical challenge within game development is to ensure visual quality and consistency through a rigorous visual testing and quality assurance (QA) process. Automation of visual QA tasks remains particularly challenging, and currently, most visual QA relies heavily on manual inspection, making the process time-consuming, costly, labor-intensive, and prone to human error.


The visual QA process for video games can generally be abstracted into three main types of tasks:


  • Verifying scene integrity by comparing the visual representation of scenes against intended configurations and known reference states, such as an oracle or previously rendered versions of the same scenes.
  • Detecting glitches through open-ended exploration—these glitches are unintended gameplay or visual artifacts without specific reference points, requiring testers to rely on common sense and general knowledge for detection.
  • Systematically reporting and documenting all identified glitches, ensuring developers receive clear and actionable information to address problems effectively during game development.

Recent advancements in vision-language models (VLMs) present promising opportunities to automate and significantly enhance the efficiency of video game QA. However, progress in applying VLMs to game QA has been limited by the lack of standardized benchmarks. Current multimodal benchmarks tend to focus heavily on complex mathematical or textual reasoning tasks, overlooking the visual comprehension tasks fundamental to video game QA. Similarly, existing game-specific benchmarks often cover only narrow aspects of QA, and thus inadequately evaluate and track VLM performance across diverse QA scenarios.

    Our Contributions

    In this paper, we introduce VideoGameQA-Bench, a benchmark designed to fill the gap in evaluating VLMs for video game QA. Our key findings and contributions are as follows:


    1. We present VideoGameQA-Bench, featuring 9 distinct tasks and a large set of questions grounded in real-world video game development scenarios, such as visual unit testing, visual regression testing, UI validation, video needle-in-a-haystack, and glitch detection.
    2. While VLMs show promising performance on various multimodal benchmarks and can function as OCR systems, they perform poorly at perceiving the fine details required for accurate scene understanding and at parsing complex UI elements.
    3. Frontier VLMs show good performance on the glitch detection task using images (up to 82.8%) and videos (up to 78.1%); however, all models struggle with glitches related to body configuration, intricate object clipping, and common-sense reasoning.
    4. Visual regression testing remains one of the most challenging tasks for VLMs.
    5. Localizing specific glitch moments in videos remains a challenge: models struggle both to detect the glitch and to accurately pinpoint when it occurs.
    6. Frontier VLMs can generate useful bug reports for up to 50% of real-world glitches, providing accurate and descriptive summaries of the glitches.

    VideoGameQA-Bench

    We designed VideoGameQA-Bench's tasks by simulating realistic QA scenarios encountered during actual video game development. To keep the benchmark relevant for future QA automation, we also included tasks that go beyond current software engineering practice while remaining highly relevant to game QA. The table below provides an overview of the contents of each task in the benchmark. In summary, VideoGameQA-Bench contains 2,236 image-based samples and 1,200 video-based samples sourced from more than 800 games and 9 synthetic game scenes.

    Image-based Tasks

    Visual Unit Testing

    Visual unit tests verify visual attributes including presence, placement, positioning, colors, conditions, and other relevant properties of various image elements.

    How many of Spider-Man's and Black Cat's eye areas, including those covered by their masks, are visible in the image?


    Provide your answer in the following JSON format:


    {
        "spiderman_eyes_visible": 0,
        "black_cat_eyes_visible": 0
    }

    Sample from the Visual Unit Testing task.
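
    One way to run a visual unit test like the sample above is to attach the screenshot and the question to a single multimodal request and parse the JSON reply. The sketch below is a minimal illustration using the OpenAI Python client with a hypothetical screenshot path and the "gpt-4o" model; VideoGameQA-Bench does not prescribe a specific harness.

    import base64
    import json
    from openai import OpenAI

    PROMPT = (
        "How many of Spider-Man's and Black Cat's eye areas, including those "
        "covered by their masks, are visible in the image?\n\n"
        "Provide your answer in the following JSON format:\n"
        '{"spiderman_eyes_visible": 0, "black_cat_eyes_visible": 0}'
    )

    def run_visual_unit_test(image_path: str) -> dict:
        # Encode the screenshot as a base64 data URL so it can be sent inline.
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        # Assumes the model returns bare JSON; the parsed counts can then be
        # compared against the ground truth stored with the benchmark sample.
        return json.loads(response.choices[0].message.content)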

    UI Unit Testing

    UI (visual) unit tests verify in-game UI elements such as menus, subtitles, heads-up displays (HUDs), and interface components like graphs and charts. We simulate the UI unit testing task by asking the vision-language model questions about game screenshots.

    Read the dashboard and fill the JSON values below:


    {
        "tire_pressure": {
            "front_left": 0,
            "front_right": 0,
            "rear_left": 0,
            "rear_right": 0
        },
        "brake_temps": {
            "front_left": 0,
            "front_right": 0,
            "rear_left": 0,
            "rear_right": 0
        },
        "break_bias": 0,
        "break_sl": 0,
        "settings": {
            "map": 0,
            "mix": 0,
            "tc1": 0,
            "tc2": 0
        },
        "gear": 0,
        "rpm": 0,
        "speed_mph": 0
    }
                    

    Sample from the UI Unit Testing task.
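
    Structured outputs such as the dashboard JSON above can be scored field by field against ground truth. The sketch below flattens both nested objects and reports the fraction of exactly matching fields; this is an illustrative metric, not necessarily the benchmark's official scoring rule.

    def flatten(d: dict, prefix: str = "") -> dict:
        """Flatten a nested dict into dotted keys, e.g. "tire_pressure.front_left"."""
        out = {}
        for key, value in d.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                out.update(flatten(value, path))
            else:
                out[path] = value
        return out

    def field_accuracy(prediction: dict, ground_truth: dict) -> float:
        """Fraction of ground-truth fields the model reproduced exactly."""
        pred, gt = flatten(prediction), flatten(ground_truth)
        correct = sum(1 for key, value in gt.items() if pred.get(key) == value)
        return correct / len(gt)

    # Hypothetical example: one of two fields matches, so the score is 0.5.
    # field_accuracy({"gear": 3, "rpm": 7200}, {"gear": 3, "rpm": 7250})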

    Visual Regression Testing

    Visual regression tests check for unintended visual changes after a change to the game. A simple pixel-by-pixel comparison of two screenshots is not sufficient, as some variations (e.g., because of character customization or weather conditions in the game) may be acceptable. Visual regressions may occur in cinematic parts of the game, such as cutscenes that have a defined sequence flow. We simulate this task by asking the VLM to compare whether two screenshots are similar, taking into account the specified acceptable or unacceptable variations.

    Image Version 1
    Image Version 2

    Sample from the Visual Regression Testing task.
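
    In practice, this means sending both screenshot versions in one request together with a description of which variations are acceptable. The sketch below shows one way to build such a two-image comparison with the OpenAI Python client; the prompt text is illustrative rather than the benchmark's exact wording.

    import base64
    from openai import OpenAI

    REGRESSION_PROMPT = (
        "You are comparing two screenshots of the same game scene taken before and "
        "after a code change. Differences in character customization or weather are "
        "acceptable; missing objects, broken UI, or altered cutscene framing are not. "
        'Answer in JSON: {"images_match": true or false, "reasoning": "..."}'
    )

    def as_image_part(path: str) -> dict:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    def compare_versions(before_path: str, after_path: str) -> str:
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": [{"type": "text", "text": REGRESSION_PROMPT},
                                   as_image_part(before_path),
                                   as_image_part(after_path)]}],
        )
        return response.choices[0].message.content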

    Glitch Detection

    Glitch detection is the process of identifying unintended visual errors, such as rendering issues, clipping, or physics and logic bugs that manifest visually. We simulate this task by presenting the VLM with both glitch and glitch-free images and asking whether each contains a glitch.

    You are a helpful assistant analyzing video game images and screenshots for glitches. You will be given a screenshot from a video game, and your job is to analyze the screenshot and determine whether it contains a glitch.


    Provide your answer in the following format:


    {
        "reasoning": "Brief explanation of why you think a glitch is present",
        "glitch_detected": true or false,
        "description_of_glitch": "Description of the glitch if detected else empty string"
    }

    Sample from the Glitch Detection task.
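
    Because the task is a binary decision over both glitch and glitch-free images, scoring reduces to parsing the "glitch_detected" flag and computing accuracy. The sketch below shows one robust way to do this; the tolerance for Markdown-fenced or chatty replies is an assumption about model behavior, not part of the benchmark definition.

    import json
    import re

    def parse_glitch_reply(reply: str) -> bool | None:
        """Extract the glitch_detected flag from a model reply.

        Models sometimes wrap the JSON in Markdown fences or add extra prose,
        so grab the first {...} span before parsing. Returns None if the reply
        cannot be parsed."""
        match = re.search(r"\{.*\}", reply, re.DOTALL)
        if not match:
            return None
        try:
            return bool(json.loads(match.group(0))["glitch_detected"])
        except (json.JSONDecodeError, KeyError):
            return None

    def accuracy(replies: list[str], labels: list[bool]) -> float:
        """Fraction of images classified correctly; unparseable replies count as wrong."""
        correct = sum(parse_glitch_reply(r) == label for r, label in zip(replies, labels))
        return correct / len(labels)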

    Parametric Clipping Detection

    Given the common occurrence of clipping in games, our benchmark includes a dedicated task to evaluate a model's ability to detect such glitches. In this task, images feature an object (e.g., a cube, sphere, or character) positioned at varying distances from a human character — from far apart to fully overlapping/clipping. The VLM is asked whether it detects clipping across each of these distances.

    Sample from the Parametric Clipping Detection task, where an object approaches a human character from a distance until it clips and intersects with the character.
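
    Since each image in this task is generated at a known object-to-character distance, per-distance detection rates can be aggregated into a simple curve. The sketch below assumes hypothetical per-image records with "distance" and "glitch_detected" fields; the field names and grouping are illustrative only.

    from collections import defaultdict

    def clipping_detection_curve(results: list[dict]) -> dict[float, float]:
        """Detection rate as a function of object-to-character distance.

        `results` holds per-image records such as
        {"distance": 0.2, "glitch_detected": True}, where distance 0 means the
        object fully overlaps the character (hypothetical convention)."""
        grouped = defaultdict(list)
        for record in results:
            grouped[record["distance"]].append(record["glitch_detected"])
        return {d: sum(flags) / len(flags) for d, flags in sorted(grouped.items())}

    A well-behaved model should report almost no clipping at large distances and consistently flag it once the object intersects the character.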

    Bug Report Generation

    In addition to testing and detection tasks, a potential application of VLMs is to assist QA engineers with writing reports for detected bugs. We simulate this task by asking the VLM to write a description of a glitch image that can be used in a bug report.

    You are a video game quality assurance assistant who analyzes screenshots to identify and document visual glitches. When presented with an image, carefully examine it for any graphical bugs or rendering issues.


    Provide your analysis in the following JSON format:


    {
        "reasoning": "Brief explanation of what you observe and why it appears to be a glitch",
        "bug_report_title": "A clear, concise title summarizing the issue",
        "bug_report_description": "Detailed description of the visual bug, including its appearance and potential impact on
        gameplay",
        "affected_item": "The specific game element (character, object, environment, UI) affected by the glitch"
    }

    Sample from the Bug Report Generation task.

    Video-based Tasks

    Glitch Detection (Video)

    Glitch detection in videos can be used to verify autonomous gameplay sessions from bots. Detecting glitches in videos is significantly more complex, as it requires analyzing motion patterns and may involve identifying transient glitches that appear only briefly in a few frames. We simulate this task by asking the vision-language model whether it detects a glitch in a video.

    You are a helpful assistant analyzing video game clips for glitches. You will be given a short video clip from a video game, and your task is to analyze the video and determine whether it contains a glitch.


    Provide your answer in the following format:


    {
        "reasoning": "Brief explanation of why you think a glitch is present",
        "glitch_detected": true or false,
        "description_of_glitch": "Description of the glitch if detected else empty string"
    }

    Sample from the Video-based Glitch Detection task.
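
    Many VLM APIs accept video only as a sequence of frames, so a common approach is to sample frames uniformly from the clip and attach them to a single request. The sketch below uses OpenCV for frame extraction; note that uniform sampling can miss transient glitches that appear in only a few frames, which is part of what makes the video tasks harder.

    import base64
    import cv2  # pip install opencv-python

    def sample_frames(video_path: str, num_frames: int = 16) -> list[str]:
        """Uniformly sample frames from a clip as base64-encoded JPEGs,
        ready to be attached to a multimodal chat message."""
        capture = cv2.VideoCapture(video_path)
        total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
        frames = []
        for idx in indices:
            capture.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = capture.read()
            if not ok:
                continue
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode())
        capture.release()
        return frames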

    Needle-in-a-Haystack (NIAH)

    Needle-in-a-Haystack (NIAH) is a more challenging long-context retrieval version of the glitch detection task. We simulate this task by asking the vision-language model whether it detects a glitch in a video, and in which frame the glitch occurs for the first time.

    You are a specialized video game quality assurance analyst trained to detect visual anomalies in gameplay footage. Your task is to analyze the provided video clip to identify any bugs, glitches, visual artifacts, or unexpected behaviors.


    What to Look For


    • Visual artifacts (texture issues, flickering, clipping)
    • Animation problems (jerky movements, T-poses)
    • Rendering glitches (missing textures, lighting errors)
    • Gameplay anomalies (collision failures, object teleportation)

    Response Format


    After your thorough analysis, provide your findings in this exact JSON format:


    {
        "reasoning": "Brief explanation of what you observed in the video and why it appears to be a glitch",
        "glitch_detected": true or false,
        "timestamp": 0
    }

    Sample from the Needle-in-a-Haystack task.
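
    Scoring this task requires checking not only that a glitch was reported but also that the reported moment is close to the annotated one. The sketch below shows one plausible scoring rule with a configurable tolerance window; the exact criterion used by the benchmark may differ, so treat this as an illustration only.

    def niah_correct(pred_detected: bool, pred_timestamp: float,
                     true_start: float, true_end: float,
                     tolerance: float = 1.0) -> bool:
        """Count a prediction as correct only if a glitch was reported and the
        predicted timestamp falls within the annotated glitch interval, padded
        by a tolerance in seconds (hypothetical rule)."""
        if not pred_detected:
            return False
        return (true_start - tolerance) <= pred_timestamp <= (true_end + tolerance)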

    Bug Report Generation (Video)

    In this task, the vision-language model is asked to provide a description of a glitch video that can be used in a bug report.

    You are a video game quality assurance assistant who analyzes video clips to identify and document visual glitches or strange behaviors. When presented with a video clip, carefully examine it for any graphical bugs, rendering issues, physics anomalies, or unexpected events.


    Provide your analysis in the following JSON format:


    {
        "reasoning": "Brief explanation of what you observe and why it appears to be a glitch",
        "bug_report_title": "A clear, concise title summarizing the issue",
        "bug_report_description": "Detailed description of the visual bug, including its appearance and potential impact on
        gameplay",
        "affected_item": "The specific game element (character, object, environment, UI) affected by the glitch"
    }

    Sample from the Video-based Bug Report Generation task.

    Leaderboard

    We evaluated a total of 11 proprietary and 5 open-weight models on VideoGameQA-Bench. Our evaluation includes both standard models and those designed for extended reasoning.

    Accuracy (%) scores of models on VideoGameQA-Bench.
    Tasks: Visual unit testing (VU); UI unit testing (UI); Visual regression testing (VR); Image-based glitch detection (IGD); Parametric clipping detection (PCD); Image-based bug-report generation (IBR); Video-based glitch detection (VGD); Needle-in-a-haystack (NIAH); Video-based bug-report generation (VBR).
    Scores marked with † were computed with the NIAH score set to 0 (shown as n/a in the NIAH column). Total is the mean of the image-task and video-task averages.
    Model                     VU     UI     VR     IGD    PCD    IBR    VGD    NIAH   VBR    Img.   Vid.   Total
    # Samples                 100    100    250    1,000  686    100    100    1,000  100    2,236  1,200  3,436
    GPT-4o                    43.0   28.0   28.8   81.3   87.8   51.0   75.8   19.0   51.0   53.3   48.6   51.0
    GPT-4o Mini               42.0   30.0   20.4   76.8   66.9   46.0   71.8   10.0   26.0   47.0   35.9   41.5
    GPT-4o Nano               9.0    14.0   19.2   57.0   66.9   16.0   49.1   4.0    14.0   30.4   22.4   26.4
    GPT-4                     39.0   23.0   31.6   82.8   82.5   54.0   57.0   1.0    52.0   52.2   36.7   44.4
    o4-mini                   50.0   35.0   45.2   76.4   65.0   38.0   70.0   18.0   28.0   51.6   38.7   45.1
    o3                        43.0   28.0   39.6   73.7   80.5   53.0   76.8   13.0   45.0   53.0   44.9   48.9
    Gemini-2.5-Pro            53.0   40.0   30.8   75.4   72.2   33.0   78.1   34.0   36.0   50.7   49.4   50.0
    Gemini-2.5-Flash          47.0   24.0   26.4   66.3   72.2   24.0   64.7   35.0   23.0   43.3   40.9   42.1
    Gemini-2.0-Flash          44.0   28.0   12.0   68.1   78.0   20.0   54.5   36.0   26.0   41.7   38.8   40.3
    Sonnet 3.7                23.0   22.0   24.0   65.1   76.4   29.0   66.9   31.0   22.0   39.9   40.0   39.9
    Sonnet 3.5                23.0   29.0   14.0   70.1   72.9   33.0   61.2   27.0   26.0   40.3   38.1   39.2
    Llama-4.0-Scout           32.0   23.0   13.6   55.8   71.6   8.0    58.6   n/a    5.0    34.0   21.2†  27.6†
    Llama-4.0-Maverick        21.0   22.0   18.4   53.2   65.7   7.0    56.6   n/a    15.0   31.2   23.9†  27.5†
    Gemma (27B)               12.0   12.0   12.8   46.7   69.7   10.0   51.3   n/a    9.0    27.2   20.1†  23.6†
    Mistral-Small-3.1 (24B)   15.0   17.0   25.6   59.7   62.5   9.0    61.4   n/a    14.0   31.5   25.1†  28.3†
    Qwen-2.5-VL (72B)         38.0   27.0   21.2   70.0   76.0   19.0   47.9   n/a    17.0   41.9   21.6†  31.7†
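
    The aggregate columns follow directly from the per-task scores: Img. and Vid. are unweighted means over the image and video tasks (which reproduces the table values), and Total is the mean of those two averages, as stated in the note above. The snippet below reproduces the GPT-4o row.

    def aggregate(image_scores: list[float], video_scores: list[float]) -> tuple[float, float, float]:
        """Unweighted per-task means and their overall average, as used in the table."""
        img = sum(image_scores) / len(image_scores)
        vid = sum(video_scores) / len(video_scores)
        return round(img, 1), round(vid, 1), round((img + vid) / 2, 1)

    # GPT-4o: [VU, UI, VR, IGD, PCD, IBR] and [VGD, NIAH, VBR]
    print(aggregate([43.0, 28.0, 28.8, 81.3, 87.8, 51.0], [75.8, 19.0, 51.0]))
    # -> (53.3, 48.6, 51.0)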