Vibe Jam 2026: A Fun Exploration of AI-Assisted Game Development

Vibe Jam 2026 is a game jam where developers have one month to develop and submit a game, with the requirement that it be lightweight and that 90% of the code be generated by AI.

I have never been a game developer and had limited prior knowledge of game development frameworks, but this sparked my interest. I wanted to test a simple question:

How far can I get with AI-generated code, with limited time and limited prior experience?

As background, I spent around 10–15 hours on development. I did not have much time to spare in April, so this was very much a lightweight experiment. For those interested, feel free to check out the game here:

Play Bug Detective — a desktop detective puzzle with four Cursor-themed mini-games.

Gameplay recording

Screen recording from the build I submitted (same clip as on Substack — desk, mini-games, and final call-out).

Takeaways

It is not too difficult to create a simple game with an AI assistant. As you will see here, getting the basics working is generally smooth: a simple 3D world, a mascot character that walks around, and zoom-in transitions into different mini-games.

But scaling it beyond that is far from easy. Below are some of my observations on limitations.

1. Getting the 3D character right was surprisingly hard

Editing the Cursor 3D character to match the exact details I wanted was very challenging. Since I am not familiar with developer tools for creating and refining 3D characters, I relied first on descriptions, then iterated through prompts to ask the model to adjust the style.

Even when I provided the exact image and a very detailed description, the models struggled to adjust things like the geometry, how wide the character’s eyes should be, or how dark the color should be.

I spent at least three hours just iterating on the mascot to get it looking the way I wanted.

I tried Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. They all consistently struggled. I might give a slight edge to Gemini 3.1 Pro, but the difference was not huge.

There is a clear gap between truly interpreting an image well and translating that interpretation into code that accurately reflects it. I see this gap at work with slides as well: LLMs can often generate the right text, colors, and shapes, but the proportions and spacing are frequently off.

2. Creativity was not really there

I had a lot of brainstorming sessions with the LLM because I wanted to come up with creative game mechanics. It was helpful as a thought partner, but nothing it suggested felt strikingly creative.

In contrast, when I work with LLMs on agentic chatbot design or similar projects, I often feel like they give me good insights. Perhaps that is because my vision is clearer, or because those projects are usually building on top of an existing framework with a different goal.

That was not really the case with game design. The suggestions felt more generic.

3. Game mechanics were surprisingly difficult to get right

One of the mini-games is a simple “spot the difference” game between two pictures, where the player simply says yes or no after identifying a difference.

In the initial versions, the model designed shapes that were far too obvious and looked quite hideous.

In the running game, where the character needs to keep running and jumping without falling, the first few iterations had obvious issues. For example, the character could simply continue walking through holes where it was supposed to fall and fail the game.
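
To make the hole bug concrete, here is a minimal sketch of the kind of ground check that was missing. It is not the game's actual code: the names (Gap, has_ground, step) and all the numbers are assumptions for illustration, and jumping is left out. The idea is that the character only snaps back to the floor when there is floor under it, so walking into a hole leads to a fall and a fail.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    start: float  # x where the hole begins (hypothetical units)
    end: float    # x where the hole ends


def has_ground(x, gaps):
    """True if there is floor under horizontal position x."""
    return not any(gap.start < x < gap.end for gap in gaps)


def step(x, y, vy, gaps, speed=5.0, dt=1 / 60,
         gravity=-30.0, floor_y=0.0, kill_y=-5.0):
    """Advance one frame and return (x, y, vy, fell)."""
    was_on_or_above_floor = y >= floor_y
    x += speed * dt
    vy += gravity * dt
    y += vy * dt
    # Land only if we came from above AND there is floor here.
    # Skipping the has_ground test reproduces the "walks over holes" bug.
    if was_on_or_above_floor and y <= floor_y and has_ground(x, gaps):
        y, vy = floor_y, 0.0
    # Dropping below the kill line counts as failing the run.
    return x, y, vy, y < kill_y
```

Calling step once per frame and ending the run when the last value is True is enough to make holes matter.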

The model also failed to calculate the relationship between the character’s maximum running speed and how fast the ground was supposed to disappear. As a result, either the player could never survive beyond 30 seconds, or, at the other extreme, the player could run forever without losing by simply holding the forward (→) key.
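
The speed relationship can be worked out on paper before any playtesting. The sketch below assumes a simplified model that may not match the real game: the ground collapses from behind the player, the collapse front starts at some speed and accelerates at a constant rate, and the player runs at a fixed top speed with a head start. All parameter names are hypothetical. It returns the longest a player could possibly survive, which is the number to tune.

```python
import math


def max_survival_time(player_speed, collapse_start_speed,
                      collapse_accel, head_start):
    """Upper bound on survival time when running flat out.

    Player position:  head_start + player_speed * t
    Collapse front:   collapse_start_speed * t + 0.5 * collapse_accel * t**2
    The run ends when the front reaches the player.
    """
    if collapse_accel <= 0:
        # Constant-speed collapse: either it never catches up ...
        if collapse_start_speed <= player_speed:
            return math.inf  # the "run forever" case
        # ... or it closes the head start at a fixed rate.
        return head_start / (collapse_start_speed - player_speed)
    # Accelerating collapse: positive root of
    # 0.5*a*t^2 - (player_speed - collapse_start_speed)*t - head_start = 0
    lead_rate = player_speed - collapse_start_speed
    root = math.sqrt(lead_rate ** 2 + 2 * collapse_accel * head_start)
    return (lead_rate + root) / collapse_accel
```

Under these assumptions, max_survival_time(6.0, 4.0, 0.1, 10.0) comes out at roughly 44 seconds. An infinite result signals the "hold forward and never lose" case, and a very small result signals the "nobody survives" case.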

Small glitches were also fairly common: the character getting stuck, the character walking through objects, and other similar issues.

4. Long-running agents helped, but only to a point

I tested long-running agents and also tried setting up a Ralph loop overnight to see how effective they could be at making enhancements and fixes.

I wrote down specs and requirements describing what success looked like, and specified that the game needed to score 90 out of 100 against my written criteria. The results were limited.
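
For readers who want to try something similar, a written rubric like this can be turned into a simple numeric gate. The sketch below is an illustration only: the criteria names and weights are made up, not the ones I used. It assumes each criterion is rated between 0 and 1 (by a person or a model) and that the weights add up to 100.

```python
# Hypothetical criteria and weights; the weights sum to 100.
RUBRIC = {
    "no_clipping_or_stuck_states": 30,
    "fair_difficulty": 30,
    "visual_consistency": 20,
    "responsive_controls": 20,
}


def total_score(ratings):
    """Weighted score out of 100 from per-criterion ratings in [0, 1].

    Missing criteria count as 0, and out-of-range ratings are clamped.
    """
    score = 0.0
    for name, weight in RUBRIC.items():
        rating = min(max(ratings.get(name, 0.0), 0.0), 1.0)
        score += weight * rating
    return score


def meets_bar(ratings, threshold=90.0):
    """True once the build reaches the target score."""
    return total_score(ratings) >= threshold
```

The gate is only as good as the ratings that feed it, which is where the evaluation problem described next comes in.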

My hypothesis is that the gap comes from the LLM’s limited ability to evaluate video. This kind of evaluation is mostly nonexistent with current coding agents, which typically evaluate screenshots from Playwright instead.

Coincidentally, I was in an early test group for the recently launched Cursor Agent SDK. I strung together a workflow where I programmatically downloaded recorded demo artifacts from the Cursor cloud agent, used the Gemini API to analyze the recorded video, and fed that assessment back into the coding agent for improvement.
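
The shape of that workflow can be sketched as a small loop. This is an outline under assumptions, not the code I ran: the three callables are hypothetical placeholders for the real steps (downloading the recording, asking a video-capable model for an assessment, and sending the notes back to the coding agent), and no actual Cursor or Gemini API calls are shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assessment:
    score: float  # 0-100 against the written criteria
    notes: str    # what the video reviewer thinks should change


def improvement_loop(
    fetch_recording: Callable[[], bytes],            # placeholder: download demo video
    assess_video: Callable[[bytes], Assessment],     # placeholder: video model review
    request_fix: Callable[[str], None],              # placeholder: prompt coding agent
    target: float = 90.0,
    max_rounds: int = 5,
) -> List[Assessment]:
    """Record, assess, and feed back until the target score or the round cap."""
    history: List[Assessment] = []
    for _ in range(max_rounds):
        video = fetch_recording()
        assessment = assess_video(video)
        history.append(assessment)
        if assessment.score >= target:
            break  # good enough, stop iterating
        request_fix(assessment.notes)
    return history
```

Passing the steps in as functions keeps the loop testable with fakes, and the round cap stops an overnight run from iterating indefinitely when the score plateaus.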

That workflow was more effective at producing useful assessments. But because of the earlier issues around geometry, visual interpretation, and game mechanics, it still required many iterations to improve the game.

Final thoughts

Overall, it was a fun experiment. More importantly, it showed me a lot about the current capability gaps of AI coding tools.

Creating a lightweight game with AI assistance is very doable. But getting the game to feel polished, visually coherent, and mechanically smooth is still hard — especially if you are relying heavily on AI and do not already have strong game development experience.

I also encourage you to try creating a small game yourself, just to have a bit of fun. I would be keen to hear feedback on your experience.

Also published on Substack.