StyleGAN-Human: A Data-Centric Odyssey of Human Generation

Abstract

Unconditional human image generation is an important task in vision and graphics, which enables various applications in the creative industry. Existing studies in this field mainly focus on "network engineering" such as designing new components and objective functions. This work takes a data-centric perspective and investigates multiple critical aspects in "data engineering", which we believe would complement the current practice. To facilitate a comprehensive study, we collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures. Equipped with this large dataset, we rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment. Extensive experiments reveal several valuable observations w.r.t. these aspects: 1) Large-scale data, more than 40K images, are needed to train a high-fidelity unconditional human generation model with vanilla StyleGAN. 2) A balanced training set helps improve the generation quality with rare face poses compared to the long-tailed counterpart, whereas simply balancing the clothing texture distribution does not effectively bring an improvement. 3) Human GAN models with body centers for alignment outperform models trained using face centers or pelvis points as alignment anchors. In addition, a model zoo and human editing applications are demonstrated to facilitate future research in the community.

Model Zoo

Structure	1024x512	512x256
StyleGAN1	stylegan_human_v1_1024.pkl	to be released
StyleGAN2	stylegan_human_v2_1024.pkl	stylegan_human_v2_512.pkl
StyleGAN3	to be released	stylegan_human_v3_512.pkl

Downstream Applications

Interpolating

Start Frame

End Frame
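
The interpolation demo blends a start frame into an end frame. The page does not say which interpolation scheme is used, so the following is only a minimal sketch that assumes plain linear interpolation between two latent codes. The helper names `lerp` and `interpolation_path` and the three-dimensional toy codes are hypothetical; in practice the codes would be generator latents and each interpolated code would be passed through the generator to render a frame.

```python
def lerp(a, b, t):
    """Linearly interpolate between two equal-length latent vectors."""
    if len(a) != len(b):
        raise ValueError("latent codes must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_path(w_start, w_end, num_frames):
    """Return num_frames latent codes from w_start to w_end, inclusive."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [
        lerp(w_start, w_end, i / (num_frames - 1))
        for i in range(num_frames)
    ]


# Toy latent codes (assumption: real codes are much higher-dimensional).
w_start = [0.0, 1.0, -2.0]
w_end = [1.0, 3.0, 2.0]
frames = interpolation_path(w_start, w_end, 5)
```

The first and last entries of `frames` reproduce the start and end codes, and the entries in between move in equal steps along the straight line connecting them.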

Style-Mixing

Use low-level styles from reference to control coarse features (e.g. poses) in source images.

Use mid-level styles from reference to control middle features (e.g. clothing type and identity appearance) in source images.

Use high-level styles from reference to control fine features (e.g. clothing colors) in source images.
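
The three style-mixing modes above can be sketched as swapping a range of per-layer style codes from the reference into the source. This is an illustration under assumptions, not the authors' code: it assumes "low-level" refers to the earliest (lowest-resolution) generator layers, and the layer count `NUM_LAYERS` and the `COARSE`, `MIDDLE`, and `FINE` index ranges are hypothetical values chosen for the example. Each layer's style is represented by a placeholder tuple rather than a real vector.

```python
def style_mix(source_ws, reference_ws, layers):
    """Copy the reference's per-layer styles into the source for `layers`."""
    if len(source_ws) != len(reference_ws):
        raise ValueError("source and reference need the same number of layers")
    chosen = set(layers)
    return [
        reference_ws[i] if i in chosen else source_ws[i]
        for i in range(len(source_ws))
    ]


# Assumed layer layout; the real split depends on the generator used.
NUM_LAYERS = 18
COARSE = range(0, 4)           # e.g. poses
MIDDLE = range(4, 8)           # e.g. clothing types
FINE = range(8, NUM_LAYERS)    # e.g. clothing colors

# Placeholder per-layer styles so the swap is easy to inspect.
source = [("src", i) for i in range(NUM_LAYERS)]
reference = [("ref", i) for i in range(NUM_LAYERS)]

mixed = style_mix(source, reference, COARSE)
```

With `COARSE` selected, only the first four layers come from the reference, so the source would keep its mid-level and fine styles while taking the reference's coarse ones.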

Attribute Editing with Generated Images

Change Upper length (StyleSpace)

Change Upper length (InterFaceGAN)

Change Bottom length (InterFaceGAN)

Change Bottom length (StyleSpace)
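
For the InterFaceGAN-style edits, the underlying idea is to move a latent code along the normal of a linear boundary that separates an attribute (for example short versus long upper garments). The sketch below shows only that step. The direction vector, the step size `alpha`, and the function names are hypothetical; a real direction would have to be learned from labeled latent codes, which is not shown here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def edit_latent(w, direction, alpha):
    """Shift latent code w by alpha along the unit attribute direction."""
    n = normalize(direction)
    return [x + alpha * d for x, d in zip(w, n)]


# Toy values (assumption: real codes and directions are high-dimensional).
w = [1.0, 2.0, 3.0]
direction = [3.0, 0.0, 4.0]   # hypothetical "upper length" direction

longer = edit_latent(w, direction, 5.0)    # push the attribute one way
shorter = edit_latent(w, direction, -5.0)  # push it the other way
```

Positive and negative values of `alpha` move the code to opposite sides of the boundary, which is what produces the "change length" sweeps in demos of this kind.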

Attribute Editing with an Inverted Real Image

From left to right: Real image | Inverted image | InterFaceGAN | StyleSpace | SeFa

InsetGAN

Starting with FFHQ faces and SHHQ bodies, we iteratively optimize the latent codes of both the face and the body until we obtain coherent full-body images.
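
The iterative optimization of the two latent codes can be pictured as alternating updates: adjust the face code with the body fixed, then the body code with the face fixed. The toy below is a stand-in only. It replaces the real objective, which would be evaluated on generated images, with a made-up quadratic loss that keeps each code near its starting point while penalizing the mismatch between the two. The loss, `seam_weight`, `lr`, and the one-dimensional codes are all assumptions made for illustration.

```python
def alternating_optimize(face, body, steps=200, lr=0.1, seam_weight=1.0):
    """Alternate gradient steps on a toy loss:
    |face - face0|^2 + |body - body0|^2 + seam_weight * |face - body|^2
    """
    face0, body0 = list(face), list(body)
    face, body = list(face), list(body)
    for _ in range(steps):
        # Update the face code while the body code is held fixed.
        face = [
            f - lr * (2.0 * (f - f0) + 2.0 * seam_weight * (f - b))
            for f, f0, b in zip(face, face0, body)
        ]
        # Update the body code while the new face code is held fixed.
        body = [
            b - lr * (2.0 * (b - b0) + 2.0 * seam_weight * (b - f))
            for b, b0, f in zip(body, body0, face)
        ]
    return face, body


# Toy one-dimensional "latent codes" that start far apart.
face, body = alternating_optimize([0.0], [3.0])
```

In this toy setting the two codes meet partway: each stays close to its initial value while the gap between them shrinks, loosely mirroring how a face and a body are pulled toward a coherent composite.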

Related Works

Text2Human proposes a text-driven controllable human image synthesis framework.

Talk-to-Edit proposes a StyleGAN-based method and a multi-modal dataset for dialog-based facial editing.

DeepFashion-MultiModal is a large-scale and high-quality human dataset with rich multi-modal annotations.

AvatarCLIP proposes a zero-shot text-driven framework for 3D avatar generation and animation.

EVA3D proposes a compositional framework to generate 3D humans from 2D image collections.

BibTeX

@article{fu2022styleganhuman,
      title   = {StyleGAN-Human: A Data-Centric Odyssey of Human Generation},
      author  = {Fu, Jianglin and Li, Shikai and Jiang, Yuming and Lin, Kwan-Yee and Qian, Chen and Loy, Chen-Change and Wu, Wayne and Liu, Ziwei},
      journal = {arXiv preprint},
      volume  = {arXiv:2204.11823},
      year    = {2022}
}