They fuck up hands because hands are usually moving when pictures are taken and have far more possible configurations than any other body part.
So when these image AIs refer back to all the pictures of hands they’ve been fed and build an ‘average approximation’ of what a hand looks like, they pull in the motion blur from one sample, a raised middle finger from another, extra fingers from pictures of people holding hands, and so on, and mash them together even when the result doesn’t fit the picture being created.
The AI doesn’t know what a hand is. It is just mixing together samples from its perfect recollection.
In other words, they require exponentially more input because the AI doesn’t know what it is looking at.
It uses its perfect recollection of that input to build a ‘model’ of what a face should look like, stores that model like a collage of all the samples, and then uses it to reproduce a face.
It’s perfect recollection with an extra step.
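The ‘average approximation’ idea can be shown with a toy sketch. This is just an analogy for the argument above, not how image generators are actually implemented: each sample here is a crisp row of ‘pixels’ with a finger raised in a different spot, and the pixel-wise average is a faint smear that matches none of them.

```python
# Toy illustration: five "photos" of a hand, each a row of 8 pixels
# with the raised finger in a different position.
samples = []
for pose in range(5):
    row = [0.0] * 8
    row[pose] = 1.0            # the raised finger sits somewhere new each time
    samples.append(row)

# Pixel-wise average across all samples.
average = [sum(col) / len(samples) for col in zip(*samples)]

# Every position a finger ever occupied is now a 0.2 ghost --
# a blurry composite that is identical to no single sample.
print(average)            # [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0]
print(average in samples) # False
```

The point of the sketch: each input is a valid pose on its own, but blending them produces something that was never a valid pose at all.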