CMU Creates Language2Pose Model that Generates Animations From Text

In June Synced published an article on Microsoft's ObjGAN and its impressive performance on generating images from text. Now, just weeks later, Carnegie Mellon University researchers have made another leap in the field with their Joint Language-to-Pose (JL2P) model, which generates animations from text input via a joint multimodal space comprising language and poses.

The language-to-pose process generates a sequence of poses, each represented by the positions of body joints (shoulders, wrists, knees, etc.), and the task presents three main challenges.
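
To make the output format concrete, the sketch below shows one common way such a pose sequence can be stored: a frames-by-joints-by-coordinates array. The joint list, frame rate, and array shapes here are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative subset of joints; the skeleton used in the paper may differ.
JOINTS = ["root", "shoulder_l", "shoulder_r", "elbow_l", "elbow_r",
          "wrist_l", "wrist_r", "knee_l", "knee_r"]

def make_pose_sequence(num_frames: int, num_joints: int = len(JOINTS)) -> np.ndarray:
    """Return a (num_frames, num_joints, 3) array: one xyz position per joint per frame."""
    return np.zeros((num_frames, num_joints, 3), dtype=np.float32)

seq = make_pose_sequence(num_frames=120)            # e.g. ~4 seconds at 30 fps
seq[10, JOINTS.index("wrist_r")] = [0.4, 1.2, 0.1]  # place one joint at one frame
print(seq.shape)                                     # (120, 9, 3)
```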

To tackle these challenges, the CMU researchers trained the JL2P model to learn a joint embedding of language and pose, optimizing it with a progressive training curriculum in which the model first processes shorter, easier sequences such as basic leg motions before moving on to longer, more difficult sequences such as running. They evaluated the model both with an objective metric, average positional error, and through a subjective human user study.
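
To give a rough sense of what a joint language-pose embedding with a progressive curriculum can look like in code, here is a minimal PyTorch-style sketch. Everything in it (module names, dimensions, losses, and the length schedule) is an assumption made for illustration; it is not the paper's implementation.

```python
# Hedged sketch of a joint language-pose embedding trained with a
# progressive (curriculum) schedule over sequence length.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """Map a tokenized sentence to a vector in the joint embedding space."""
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, words)
        _, h = self.rnn(self.embed(token_ids))
        return h[-1]                              # (batch, hidden_dim)

class PoseEncoder(nn.Module):
    """Map a pose sequence to the same joint embedding space."""
    def __init__(self, pose_dim=63, hidden_dim=256):  # e.g. 21 joints * xyz
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden_dim, batch_first=True)

    def forward(self, poses):                     # (batch, frames, pose_dim)
        _, h = self.rnn(poses)
        return h[-1]                              # (batch, hidden_dim)

class PoseDecoder(nn.Module):
    """Decode a pose sequence from a point in the joint embedding space."""
    def __init__(self, hidden_dim=256, pose_dim=63):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, target_poses):           # teacher-forced decoding
        out, _ = self.rnn(target_poses, z.unsqueeze(0))
        return self.out(out)

def training_step(sent_enc, pose_enc, dec, token_ids, poses, crop_len):
    """One curriculum step: train on sequences truncated to crop_len frames."""
    poses = poses[:, :crop_len]
    z_text = sent_enc(token_ids)
    z_pose = pose_enc(poses)
    embed_loss = F.mse_loss(z_text, z_pose)       # pull the two modalities together
    recon = dec(z_text, poses)                    # reconstruct poses from the text embedding
    recon_loss = F.mse_loss(recon, poses)
    return embed_loss + recon_loss

# Progressive curriculum: short, easy clips first, then full-length sequences.
for crop_len in (8, 16, 32, 64):
    pass  # placeholder: run training epochs at this sequence length
```

The key design point is that sentences and pose sequences are encoded into the same vector space, so a decoder conditioned on the text embedding can forecast the corresponding motion.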

In the user study, human annotators were shown six generated animations and asked to select the two that most accurately depicted a given sentence. The researchers also compared the JL2P model's average positional error under different optimizations and against previous models.
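
For readers unfamiliar with the objective metric, the sketch below shows an average-positional-error-style computation; the exact formulation used in the paper may differ.

```python
# Illustrative metric: mean Euclidean distance between predicted and
# ground-truth joint positions, averaged over frames and joints.
import numpy as np

def average_positional_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (frames, joints, 3) arrays of joint positions."""
    assert pred.shape == gt.shape
    per_joint_dist = np.linalg.norm(pred - gt, axis=-1)   # (frames, joints)
    return float(per_joint_dist.mean())

# Toy usage with random sequences of 120 frames and 21 joints.
rng = np.random.default_rng(0)
print(average_positional_error(rng.normal(size=(120, 21, 3)),
                               rng.normal(size=(120, 21, 3))))
```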

The JL2P model showed an improvement of 9 to 15 percent over previous models. In the user study, JL2P scored a 75 percent preference rate, only 10 percent below ground truth. On both the objective metric and the human evaluation, the JL2P model generated more accurate animations with more plausible visual representations than other data-driven approaches.

The paper Language2Pose: Natural Language Grounded Pose Forecasting is on arXiv.