HyperHuman, Explained
This post is the first in a series on our blog giving a technical overview of the technologies and engineering behind HyperHuman, our production-ready text-to-3D avatar tool. A less technical, more introductory series will follow in the coming weeks.
Generating a 3D avatar model from descriptive text and putting it into production use in game development, cinematography, and so on is an appealing idea for professionals and power users: instead of spending hundreds of hours crafting a model from scratch, one can simply tell the computer what they want, say, a high school basketball player, and the computer can magically give them a 3D model (linked to one of our users’ work with HyperHuman) that is good for production use in under 10 minutes.
This appealing idea requires a lot of work behind the scenes, solving a multitude of problems to provide an easy-to-use, customizable, real 3D avatar model. In this post, we dive deep into the challenges and how we solve them.
The main work behind ChatAvatar is published in ACM Transactions on Graphics as DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance (preprint also available at https://arxiv.org/abs/2304.03117).
The Problems: Accessibility and Realism
Efforts to generate believable 3D human characters began more than a decade ago, when the approach was heavy and expensive: to generate realistic 3D human characters, you would typically need a large light stage, a time-consuming face scanning process, and a team of experts to run it. The approach was only used when the budget was high, such as in the production of a Hollywood movie. A well-known example is David Fincher’s 2010 film The Social Network, in which Armie Hammer plays both Winklevoss twins: with the help of light stage scans, his face was digitally transferred onto the body double who played the second twin.
This long, expensive, and painful process has long been deemed a must for creating realistic 3D human characters: you need a scan of a real person to recreate one digitally. It is far from accessible outside the film industry.
More recent technologies and mathematical frameworks, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), have been used for facial synthesis in recent years. While they do not necessarily need a scan of a real person, they all carry problems of their own. For example, some of the methods use rendering-based schemes that are difficult to integrate seamlessly into the existing CG production pipeline, making them unsuitable for production use. The approaches also suffer from a lack of diversity in identities and facial traits, largely because of the limited 3D or paired training data and the limited flexibility of the generation process. This, too, limits how they can be used in a production environment.
How HyperHuman Generates Models
We address these problems with the help of large language and vision models: by taking advantage of them, together with the dynamic, physically-based facial assets we have accumulated over the years here at Deemos, we generate a 3D avatar model from descriptive text (prompts).
The pipeline behind HyperHuman has three organically integrated parts: geometry generation, texture generation, and animation empowerment. We will briefly introduce each of them in this section.
Geometry Generation
This part of the pipeline lays the groundwork: it generates the geometry model from the text prompt. You can think of the geometry model as the skeleton of the resulting 3D human character. It determines things like how high the forehead is, whether the character has thin or thick lips, and so on. The subsequent texture step then puts skin onto the geometry model so that the model has a real human look. We generate the geometry in two steps: coarse geometry generation and detail carving.
An analogy: imagine you are a sculptor making a bust of a person, and you have a collection of pre-carved, semi-finished busts in various shapes. The first thing you do is pick the semi-finished bust that roughly looks the most like the person. Then you take out your carving knife and add the final touches to make the bust look more like the person you are sculpting. This is exactly how these two steps work.
The coarse geometry generation step involves selecting an optimal coarse geometry from a diverse candidate pool pre-sampled from a parametric model, for example, the ICT-FaceKit. The selection is based on matching scores calculated with the CLIP model, which compares the relative similarity between the text prompt and the rendered geometry images. Once the optimal coarse geometry is selected, fine-grained detail carving is performed on top of it to add vertex displacements and detailed normal maps in tangent space, resulting in a highly detailed geometry that closely matches the input prompt. This process enables the generation of neutral facial geometry with fine-grained facial traits and topology structure, which can be further customized and animated to create personalized 3D facial assets.
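To make the selection step more concrete, below is a minimal sketch of CLIP-based candidate ranking, assuming the coarse geometry candidates have already been rendered to images. It uses the publicly available CLIP model from the Hugging Face transformers library; the prompt, the candidate count, and the blank placeholder images are hypothetical, so treat this as an illustration of the idea rather than the exact HyperHuman implementation.

```python
# Minimal sketch: rank pre-sampled coarse geometry candidates against a
# text prompt with CLIP, then keep the best-matching one for detail carving.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a high school basketball player"

# Placeholder renders of coarse geometry candidates pre-sampled from a
# parametric model such as ICT-FaceKit; in practice these would be actual
# renders of each candidate, not blank images.
candidate_images = [Image.new("RGB", (224, 224)) for _ in range(32)]

inputs = processor(text=[prompt], images=candidate_images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the prompt-to-render similarity for each candidate;
# the highest-scoring candidate is selected as the coarse geometry.
scores = outputs.logits_per_image.squeeze(-1)
best_candidate = int(scores.argmax())
print(f"Selected coarse geometry candidate #{best_candidate}")
```

The property this relies on is that CLIP embeds text and images into a shared space, so the prompt-to-render similarity can serve directly as the matching score for each candidate.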
Texture Generation
The next task is to generate the skin that will be put onto the geometry model. Just as an artist might receive a textual prompt describing a scene or a character and then use their skills to paint a vivid and accurate representation, the Texture Generation step takes textual prompts and uses advanced algorithms and models to generate high-quality, detailed textures and facial features for 3D models. This process involves understanding the nuances of the description and translating it into visually compelling and realistic textures, much like an artist bringing a written description to life through a carefully crafted painting.
To do that, HyperHuman utilizes a dual-path optimization scheme that combines a generic LDM (Latent Diffusion Model) and a texture LDM. The generic LDM is capable of generating plausible textures in image space according to general text prompts, while the texture LDM ensures that the generated textures follow the UV specifications and are consistent with the underlying 3D facial geometry. The system also includes a texture translation and augmentation module that generates detailed physically-based textures, including diffuse maps, specular maps, and normal maps in high resolution, suitable for photo-realistic rendering.
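To illustrate the dual-path idea, here is a conceptual sketch of how two diffusion priors can jointly optimize a single UV texture. The denoiser and rendering functions are hypothetical placeholders (real score functions would add noise and query each LDM's denoising network), and the surrogate loss follows a generic score-distillation-style recipe, not the exact optimization used in HyperHuman.

```python
# Conceptual sketch: one shared UV texture is updated by gradients from
# two diffusion priors, a generic image-space LDM and a UV-space texture LDM.
import torch

def generic_ldm_score(image_latents, prompt_embedding):
    # Placeholder: a real system would add noise to the rendered image-space
    # latents and return the guidance signal from a generic LDM's denoiser.
    return torch.zeros_like(image_latents)

def texture_ldm_score(uv_latents, prompt_embedding):
    # Placeholder: guidance signal from a texture LDM that operates directly
    # in UV space, keeping the result a valid facial texture map.
    return torch.zeros_like(uv_latents)

def render_to_image_space(uv_latents):
    # Placeholder for a differentiable render of the UV texture onto the
    # selected facial geometry under a sampled camera view.
    return uv_latents

prompt_embedding = torch.randn(77, 768)                 # stand-in text embedding
uv_texture = torch.randn(1, 4, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([uv_texture], lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    image_latents = render_to_image_space(uv_texture)
    # Path 1: the generic LDM judges the rendered, image-space result.
    g_generic = generic_ldm_score(image_latents, prompt_embedding)
    # Path 2: the texture LDM judges the texture directly in UV space.
    g_texture = texture_ldm_score(uv_texture, prompt_embedding)
    # Score-distillation-style surrogate loss: its gradient with respect to
    # the texture is the sum of the two guidance signals.
    loss = (image_latents * g_generic).sum() + (uv_texture * g_texture).sum()
    loss.backward()
    optimizer.step()
```

The point of the two paths is a division of labor: gradients from the generic LDM flow back through the differentiable render and keep the result plausible for the prompt, while gradients from the texture LDM act directly in UV space and keep the result usable as a facial texture map.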
Animation Empowerment
With the geometry generation and texture generation steps, we have a realistic but static (neutral) 3D model. We also want the generated 3D facial assets to be animatable with personalized expressions. Because our assets share a consistent geometric topology, they directly support traditional blendshape-based animation; one can also use existing facial performance capture techniques to obtain expression blendshape parameters from images and videos and drive the generated facial asset with them.
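As a rough illustration of what blendshape-based animation looks like on a topology-consistent asset, here is a minimal sketch using NumPy. The vertex count, blendshape names, and random placeholder data are purely illustrative; in practice the neutral mesh and expression targets come from the generated asset, and the per-frame weights come from a facial performance capture system.

```python
# Minimal sketch of blendshape-based animation on a generated neutral mesh.
import numpy as np

num_vertices = 26719                                   # illustrative topology size
neutral = np.random.rand(num_vertices, 3)              # generated neutral geometry
blendshapes = {
    "jawOpen":    np.random.rand(num_vertices, 3),     # per-expression target shapes
    "mouthSmile": np.random.rand(num_vertices, 3),
}

def animate(neutral, blendshapes, weights):
    """Blend the neutral mesh toward expression targets.

    animated = neutral + sum_i w_i * (blendshape_i - neutral),
    where the weights w_i would normally be tracked from images or video
    by a facial performance capture system.
    """
    animated = neutral.copy()
    for name, w in weights.items():
        animated += w * (blendshapes[name] - neutral)
    return animated

frame = animate(neutral, blendshapes, {"jawOpen": 0.3, "mouthSmile": 0.7})
print(frame.shape)  # (num_vertices, 3): one posed frame of the animation
```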
In this article, we took a deep dive into how we designed the system underlying HyperHuman, from the problems we faced to how we solve them, and showed how we built a tool that can create a 3D avatar model from a user’s text prompt. We hope this information is helpful to you.
If you want to try HyperHuman now, you can go to https://hyperhuman.deemos.com/ and register a free account with 20 free tokens to explore what HyperHuman can do. We have also launched subscription plans for more features and flexible options.