Sora is garbage - Deterritoria

Generative AI is a travesty.

Got fooled by some beginner’s luck the other day when I made this clip (added some audio):

It took a bit of tweaking. I basically asked for overweight Americans walking into a Japanese shrine like it was a WWE event. First iterations were lame, but then I added to the prompt that it should be filmed from an annoyed person’s cell phone and got this. Fooled my mom. Engendered some trepidation in the friends and family circle.

I wondered what I could create with a dedicated session. I took some lines from a script I’m writing and made them prosier:

Viewed from someone looking out a window on the 5th floor: On a busy Tokyo street, a slightly disheveled middle-aged Japanese person with dwarfism, clearly exhibiting signs of schizophrenia or similar mental health issue, yells at no one in particular as he walks down the sidewalk. His posture is very upright. He sways his head right to left and chops his right hand in time with his rhythmic gait. Passersby turn or try to get out of his way, but he pays them no mind, just occasionally yelling in short bursts.

Immediately got denied:

Ok, let’s try something not so triggering.

In a cemetery in Tokyo, in the Fall. Many Yellow-leaved Gingko trees, and red leaved Japanese maple trees. Amidst the grave stones, there’s a small clearing – a nice patch of grass with some benches. Someone plays with their dog. Two people sit on a bench eating lunch. Past the bench, a WHITE CAT jumps from one gravestone to the next.

Looks like a late 90s James Bond Shooter game. First I added some additional notes to the original prompt: A very wide shot. No camera movement. The foreground should have out of focus greenery and gravesites. The cat should leap in the background.

Never mind all the weird movement and the cat-dog hybrid. At this point it is clear that Sora really likes to move the camera, even after you repeatedly tell it not to. I tried to edit the original prompt again, with different additional notes: A very wide shot that takes in the expanse. The perspective is static, as if shot from a tripod. It’s a cloudy day, but a shock of sunlight shoots through. The foreground is out of focus greenery and gravesites. The leaping cat is in the background.

Now there’s just no cat at all. And the damn camera is still moving. I kept the last prompt and added more additional notes, including some all caps so it would know I was getting mad.

More additional notes: DO NOT PUSH IN ON THE IMAGE. THERE SHOULD BE A CAT VISIBLE IN THE BACKGROUND. DON’T MAKE ALL THE GRAVES SO UNIFORM AND MAKE THEM LOOK MORE JAPANESE .

These are parallel moves at best. At least we are finally static. But where the fuck is the cat?

I realized at this point that there is a “Remix” function, where you can give feedback on the already generated video. This felt more akin to the notes/new version/notes cycle in post-production, where you build off something rough and keep refining it. Surely this would work better. I thought, OK, let’s build off this and simplify. I clicked Remix and typed:

remove the people. add a bench in the foreground. the burial site should look more japanese. in the background, A white cat jumps from one grave stone to the next.

All that effort to get to a good wide shot wasted with this fresh hell. Also, one thing that should be abundantly clear is the white supremacist-western orientation of Sora (and I imagine AI in general). The template is clearly a western cemetery, no matter how many times I clarify it should be Japanese. AND WHERE. THE FUCK. IS THE CAT?

REMIX: this should be a much wider shot. make sure a cat is visible. we should be looking at the grave sites head on, not at a diagonal angle. there should be out of focus greenery and gravestones in the foreground. there should be a large clearing in the middle ground, a circular plot of grass.

At least there is a cat. But I’m feeling very ignored. Let’s try again:

make this a much wider static shot and put some out-of-focus elements in the foreground. this land is too flat, add more bumpiness. there should a circular plot of grass in the middle of the foreground, but everything else should be stone or trees

Ok. The bird’s a nice touch. But this was a failure. Nothing like what I wanted. I figured I’d try something more detail-oriented.

A knife rests on a table. A hand comes in and grabs it.

We are clearly in early days here. Let’s try it without the action. Just a nice still life:

A wide field of view: A table sits again a wall. On top of the table, we can see a knife to one side

What is with the double-sided knife? You call that a wide shot? And stop fucking moving! After several more tries, I finally got something decent with this prompt: filmed from a locked off tripod: a wide shot of a table in a room. the table is against the far wall. there is a knife on it.

Let’s fucking go!!! Finally a shot that looks like it could be from a movie I would watch. This I could work with. But I need to move the table to the side, so I can see the hand come in and grab the knife.

slightly higher angle. and move the table so it is on the left side of frame.

Stubborn as a mule. I kept trying to move the table to the side, but Sora wouldn’t budge. Ok, let’s just try some action:

someone comes in and grabs the knife

Shame on you, Sora. And shame on me. I’ll keep fucking around though. A couple of takeaways from this experimentation: it’s definitely better to make at least two versions, since there’s generally one that’s better than the other. Also, it seems to do a better job of heeding prompts when you opt for higher resolution.

I’ll leave you with Ben Affleck talking about Generative AI because he is 100 percent right: