

That’s fair. I actually don’t think we disagree that much - I just have trouble conveying what I’m trying to say. Whenever someone talks about ‘shallow statistical predictions’, I think of older techniques like Statistical Machine Translation, which had trouble even with things like word order. LLMs handle text at a higher level of abstraction (which I described as a form of textual understanding), and hence handle things like word order better, but they are still inherently statistical predictors. The model stores information about how words interact and relate to one another, but it does not ‘understand’ what the words actually (physically?) represent beyond these interactions, nor does it ‘understand’ what it is doing. Still, those interactions are modeled well enough to give a convincing imitation of doing so.
It just so happens that many video codecs are based on image formats, so ffmpeg already has much of the complex machinery needed to implement these image formats as well - internally it can just treat an image as a single frame of video, with specialized formats for that case.
ImageMagick (and other tools) also work, but why use multiple pieces of software if what you already have is adequate? ImageMagick is also software, and can also have vulnerabilities.
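For what it's worth, here is a rough sketch of what that looks like in practice, wrapped in Python so it can be scripted. The helper names (`build_convert_cmd`, `convert_image`) and the file names are made up for illustration, and it assumes an `ffmpeg` binary is on your PATH. It only relies on the standard `-i`, `-y` and `-frames:v` options; the output format is picked from the file extension.

```python
import shutil
import subprocess


def build_convert_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that reads `src` and writes one frame to `dst`.

    Hypothetical helper for illustration, not part of ffmpeg itself.
    """
    # -y overwrites dst if it already exists.
    # -frames:v 1 stops after a single video frame, which is all a still
    # image contains. The output format is inferred from dst's extension.
    return ["ffmpeg", "-y", "-i", src, "-frames:v", "1", dst]


def convert_image(src: str, dst: str) -> None:
    """Run the conversion; raises if ffmpeg is missing or exits non-zero."""
    if shutil.which("ffmpeg") is None:
        raise FileNotFoundError("ffmpeg not found on PATH")
    subprocess.run(build_convert_cmd(src, dst), check=True)
```

Something like `convert_image("photo.png", "photo.jpg")` would then go through the same decode/encode path ffmpeg uses for video, just with a single frame.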