Generative text-to-image models, such as Stable Diffusion, have demonstrated a remarkable ability to generate diverse, high-quality images. However, they are surprisingly inept at rendering human hands, which are often anatomically incorrect or reside in the "uncanny valley". This paper proposes HandCraft, a method for restoring such malformed hands. It automatically constructs masks and depth images for hands as conditioning signals using a parametric hand model, allowing a diffusion-based image editor to fix the hand's anatomy and adjust its pose while seamlessly integrating the changes into the original image, preserving pose, color, and style. Our plug-and-play hand restoration solution is compatible with existing diffusion models and requires no training or fine-tuning, which eases adoption. We also contribute the MalHand datasets, which contain generated images with a wide variety of malformed hands in several styles for training and benchmarking, and we demonstrate through qualitative and quantitative evaluation that HandCraft not only restores anatomical correctness but also maintains the integrity of the overall image.
HandCraft flowchart. The framework corrects malformed hands in images in three stages. (1) Hand detection. A hand detector is employed to detect the bounding box of the hand, and a body pose estimator predicts the hand landmarks using the whole-body pose as a prior. (2) Control image generation. The extracted body pose and a parametric hand template are passed to a control image generator to obtain a control image I_c and a template mask M_t. The final control mask M is the union of the bounding-box mask M_d and the template mask M_t. (3) Hand restoration. The final output image with the corrected hand is generated by ControlNet, conditioned on the input image, a text prompt, the control mask, and the control image.
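To make stages (2) and (3) concrete, the following is a minimal sketch assuming the bounding-box mask M_d, template mask M_t, and rendered depth control image I_c from the earlier stages are already available. The checkpoints, function names, and the use of diffusers' ControlNet inpainting pipeline are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: mask union (stage 2) and ControlNet-based
# hand restoration (stage 3). Checkpoint names and the diffusers pipeline
# are assumptions, not HandCraft's released code.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline

def union_mask(box_mask: np.ndarray, template_mask: np.ndarray) -> np.ndarray:
    """Final control mask M = union of M_d and M_t (pixel-wise OR of binary masks)."""
    return np.logical_or(box_mask > 0, template_mask > 0).astype(np.uint8) * 255

# Depth-conditioned ControlNet on top of a frozen Stable Diffusion model;
# no training or fine-tuning is needed, matching the plug-and-play claim.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

def restore_hand(image: Image.Image, control_image: Image.Image,
                 mask: Image.Image, prompt: str) -> Image.Image:
    """Repaint only the masked hand region, guided by the depth control image I_c."""
    return pipe(
        prompt=prompt,
        image=image,                  # original image containing the malformed hand
        mask_image=mask,              # control mask M: restricts edits to the hand
        control_image=control_image,  # I_c: depth render of the parametric hand
        num_inference_steps=30,
    ).images[0]
```

Because the inpainting mask confines edits to the hand region, the rest of the image is untouched, which is how the pipeline preserves the original pose, color, and style.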