Narrative coherence via a 'video understanding' prior: supplement an image generation model with a video understanding model that acts as a likelihood over the animation sequence.
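
One way to read the idea above is as reranking: the image generation model proposes several candidate frame sequences, and the video understanding model scores each sequence as a whole, with the highest-scoring one kept. The sketch below is illustrative only and makes several assumptions not in the note: frames are flat lists of floats, `toy_video_log_likelihood` is a hypothetical stand-in that simply rewards smooth frame-to-frame change, and `select_most_coherent` is a hypothetical helper name. A real video understanding model would replace the toy scorer.

```python
from typing import Callable, List, Sequence

# Assumption: a "frame" is a flat list of pixel values, a stand-in for an image.
Frame = List[float]
Animation = Sequence[Frame]


def toy_video_log_likelihood(frames: Animation) -> float:
    """Hypothetical stand-in for a video understanding model.

    Returns a higher (less negative) score when consecutive frames
    differ less, as a crude proxy for temporal coherence.
    """
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        # Penalize the squared difference between neighbouring frames.
        total -= sum((a - b) ** 2 for a, b in zip(prev, cur))
    return total


def select_most_coherent(
    candidates: Sequence[Animation],
    score: Callable[[Animation], float],
) -> Animation:
    """Pick the candidate sequence with the highest sequence-level score."""
    return max(candidates, key=score)


# Two candidate animations, as if sampled from an image generation model.
smooth = [[0.0], [0.1], [0.2]]
jumpy = [[0.0], [0.9], [0.1]]

best = select_most_coherent([jumpy, smooth], toy_video_log_likelihood)
```

Here `best` is the `smooth` sequence, since its frame-to-frame differences are smaller. The scorer is applied to the whole sequence rather than to individual frames, which is the point of using a video-level model: per-frame quality is left to the image generator, while coherence across frames is judged separately.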