I asked an AI for a thousand pictures. Checking them was the work

June 12, 2026

For the color-by-number game I needed a lot of pictures. Not ten, not fifty, around a thousand small pixel-art subjects across twenty themes. So I wired up the PixelLab API, fed it a list of names, and let it run. Sixty-four by sixty-four pixels, transparent background, one subject each.

The generating was the easy part. It ran overnight. The actual work, the part that took real attention, was checking what came back.

The AI picks a meaning, and sometimes it picks the wrong one

A word is not always one thing, and the model does not ask which one you meant. It just commits, confidently, to a guess.

I asked for a crane. I wanted the bird. I got a construction crane, the yellow machine. I asked for a turkey, meaning the animal, and got the flag of Turkey. I asked for python and got the programming language logo, not the snake. One subject called udon came back as a ninja, which I still cannot fully explain.

None of these were broken images. Every one was a clean, well-drawn, perfectly nice picture. They were just pictures of the wrong thing. A grep cannot catch this. A file-exists check cannot catch this. The image is valid. Only a human who knows what the word was supposed to mean can see that it is wrong. Across the thousand, nine came back as the wrong subject entirely.

You cannot review a thousand files one at a time

The naive way to check is to open each PNG, look, close it, next. At a thousand pictures that is not a review, it is a second job, and you start rubber-stamping by picture two hundred.

So I built the review instead of doing it. A small script walks every generated picture, and instead of dumping them in a folder, it renders labelled montage sheets: a grid of thumbnails, each one captioned with the name it was generated from, forty-eight to a page. Now the check is a different task. You are not opening files, you are scanning a contact sheet with your eyes, and a flag sitting under the caption "turkey" jumps out in half a second. The same pass also flags the boring stuff (titles that are too long, a leading "the"), so I fix copy and content in one sitting.
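
The bookkeeping half of that pass can be sketched roughly like this. It is not the actual script, just a minimal illustration: the function names and the 24-character title limit are assumptions made up for the example, and the thumbnail rendering itself (which would need an imaging library) is left out. It only shows how names get linted and grouped forty-eight to a sheet.

```python
import re

SHEET_SIZE = 48      # thumbnails per montage sheet, as described above
MAX_TITLE_LEN = 24   # assumed limit for illustration; pick your own


def lint_title(name, max_len=MAX_TITLE_LEN):
    """Return a list of copy problems found in one subject name."""
    problems = []
    if len(name) > max_len:
        problems.append(f"title longer than {max_len} characters")
    # "the" followed by whitespace, so "theatre" is not flagged
    if re.match(r"the\s+", name, flags=re.IGNORECASE):
        problems.append('leading "the"')
    return problems


def paginate(names, sheet_size=SHEET_SIZE):
    """Group (name, problems) pairs into sheets of at most sheet_size."""
    entries = [(name, lint_title(name)) for name in names]
    return [
        entries[i:i + sheet_size]
        for i in range(0, len(entries), sheet_size)
    ]


if __name__ == "__main__":
    names = [f"subject {i}" for i in range(1000)]
    sheets = paginate(names)
    print(len(sheets), "sheets, last one holds", len(sheets[-1]))
    # prints: 21 sheets, last one holds 40
```

At forty-eight to a page, a thousand pictures comes out to twenty-one sheets, which is the whole point: twenty-one things to look at instead of a thousand.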

Catching the nine bad ones went from impractical to an afternoon's work, because the tool turned a thousand individual decisions into one long visual scan.

Generation scales. Judgment does not

This is the thing I keep relearning with generative tools. The model will happily give you a thousand of anything. What it will not give you is the certainty that any particular one is right, and that certainty is the only thing that matters when the output goes in front of users.

So the cost moves. It is no longer in making the asset; that is now cheap and fast. It is in reviewing the asset, and review does not get cheaper just because generation did. If you are going to generate at volume, build the thing that lets you judge at volume too. Otherwise you have not saved the work, you have just moved it somewhere you were not looking.