Currently, storage space is significantly cheaper than all the CPU power needed to generate the images from a text description. Also, what if you actually wanted to view the background of the object? And where's the advantage, besides a storage efficiency gain of 40% at best? After all, people take pictures to actually capture the moment. Otherwise they would just record voice memos all the time.