An interactive virtual humanoid agent is a crucial interface to the physical world. A relatively complete humanoid agent must first have a face and a body, then possess both verbal and non-verbal abilities (such as eye contact, facial expression, lip motion, gesture, and manipulation), and finally be capable of real-time duplex communication, e.g., actively interrupting a conversation. Most prior systems consider only a subset of these elements, leaving a gap to a realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech and full-body movements for talking, responding, idling, and manipulation. The system is a multimodal model integrating audio and visual inputs, extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours of audio, around 130,000 hours of video, and about 20,000 alignment samples to build the model. The final model demonstrates capabilities that were difficult to achieve in previous systems, such as generalized object manipulation. This work is a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research toward scaling up.
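As a rough illustration only (the paper's actual architecture is not reproduced here), the sketch below shows one common way such an audio-visual extension of a pretrained LLM can be wired: per-frame audio and video features are projected into the LLM's token space and interleaved along time, and separate heads decode the next audio and video tokens. All module names, dimensions, and the Hugging Face-style `inputs_embeds` call are hypothetical placeholders, not the paper's design.

import torch
import torch.nn as nn

class AudioVisualAgent(nn.Module):
    """Hypothetical audio-visual wrapper around a pretrained LLM backbone."""

    def __init__(self, llm: nn.Module, d_model: int = 4096,
                 audio_dim: int = 128, video_dim: int = 1024,
                 n_audio_codes: int = 1024, n_video_codes: int = 8192):
        super().__init__()
        self.llm = llm  # pretrained decoder-only LLM (assumed HF-style API)
        # Project per-frame audio/video features into the LLM token space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Separate heads predict the next audio and video tokens each step.
        self.audio_head = nn.Linear(d_model, n_audio_codes)
        self.video_head = nn.Linear(d_model, n_video_codes)

    def forward(self, audio_feats, video_feats):
        a = self.audio_proj(audio_feats)   # (B, T, d_model)
        v = self.video_proj(video_feats)   # (B, T, d_model)
        # Interleave modalities along time: [a_1, v_1, a_2, v_2, ...]
        tokens = torch.stack([a, v], dim=2).flatten(1, 2)  # (B, 2T, d_model)
        h = self.llm(inputs_embeds=tokens).last_hidden_state
        return self.audio_head(h), self.video_head(h)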
To protect personal privacy, the heads of the characters in all demonstrated results are covered with animated Memoji, and the audio is modified to disguise their voices.
To specify the initial frame for the agent, we first select an image of an empty scene. We then directly insert an image of the agent in a specific pose, along with images of the objects to be interacted with, into the empty scene to obtain the initial frame. Following Magic Insert, the insertion process can be refined using image generation models.
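As a minimal sketch of the naive paste-based compositing step (all file paths and coordinates below are hypothetical, and the generative refinement step referenced above is omitted):

# Naive initial-frame compositing: paste RGBA cut-outs of the agent and
# objects onto an empty-scene image. A generative model (cf. Magic Insert)
# would then refine the composite.
from PIL import Image

scene = Image.open("empty_scene.png").convert("RGBA")
agent = Image.open("agent_pose.png").convert("RGBA")     # RGBA cut-out
objects = [("towel.png", (820, 540)), ("bottle.png", (960, 520))]

scene.alpha_composite(agent, dest=(400, 300))            # place the agent
for path, xy in objects:
    scene.alpha_composite(Image.open(path).convert("RGBA"), dest=xy)

scene.convert("RGB").save("initial_frame.png")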
For all demos, the topics of conversation with the agent, the scenarios, and the objects the agent interacts with are out-of-distribution (OOD); that is, the test cases never appear in the corresponding customized agent training dataset. The model may have been exposed to functionally similar scenarios and objects during the pre-training phase, and thereby knows how to interact with them.
Video frames occasionally exhibit pauses and abrupt changes. This issue arises because model inference sometimes fails to achieve real-time output, forcing us to align the human and agent timelines. This problem can be addressed in the future by optimizing the inference architecture.
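To illustrate what such forced alignment can mean in practice (a hypothetical playback strategy, not necessarily the one used in the system): when a generated frame misses its wall-clock deadline, the player can hold the last frame until the next one arrives, which manifests exactly as the pauses and jumps described above.

import time

def play_aligned(frame_source, display, fps: float = 25.0):
    """Hold the last frame whenever inference misses its deadline.

    frame_source.poll() is assumed to return a new frame, or None when
    the model has not produced one in time; display(frame) stands in
    for the actual renderer.
    """
    period = 1.0 / fps
    last_frame = None
    next_deadline = time.monotonic()
    while True:
        frame = frame_source.poll()
        if frame is not None:
            last_frame = frame        # fresh frame arrived on time
        if last_frame is not None:
            display(last_frame)       # otherwise re-show the stale frame
        next_deadline += period
        time.sleep(max(0.0, next_deadline - time.monotonic()))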
Demo1: the two parties discuss methods of preparing a Vitamin C drink.
Demo2: the human instructs the agent to draw some simple patterns.
Demo3: the agent acts as a salesperson selling fruit.
Demo4: the agent functions as a tour guide introducing Beijing.
Demo5: the agent counsels a human interlocutor who is experiencing anxiety due to the upcoming exams.
Demo6: the agent provides the human interlocutor with an introduction to artificial intelligence and its learning pathway.
Demo7: the agent perceives and describes the layout of the environment, then proposes specific renovation ideas based on the human interlocutor's decor theme.
Demo8: the agent demonstrates the proper use of sunscreen lotion.
Demo9: the agent demonstrates the process of folding a towel.
Demo10: the agent cuts the paper strip according to the human interlocutor's instructions.
Demo12: the agent's idling behavior (video sound removed) sampled given an initial frame and empty inputs.
Demo13: the agent's idling behavior (video sound removed) sampled given an initial frame and empty inputs.
Demo14: the agent's idling behavior (video sound removed) sampled given an initial frame and empty inputs.
Demo15: we can enhance the model's reasoning capability through Chain-of-Thought (CoT) (see Section 5.2 of the paper for details). For instance, in Demo15, the agent can systematically manage the teaching progress of classical poetry, and accurately recognize facial expressions (which the Memoji overlay may not reproduce accurately) and gestures.
Demo16: through Reinforcement Learning from Human Feedback (RLHF), the model can, to some extent, "know what it doesn't know" and learn how to decline requests. Given the extensive variety of interactable objects, coupled with limited data and computational resources, we collect cases of non-interactable objects during the alignment phase to teach the model to refuse such interactions, thereby avoiding significant artifacts. For example, in Demo16, the agent refuses interaction requests for non-operable objects.
@article{ao2024bodyofher,
  author  = {Ao, Tenglong},
  title   = {Body of Her: A Preliminary Study on End-to-End Humanoid Agent},
  journal = {arXiv},
  year    = {2024},
}
The preliminary exploration described in this paper has become feasible with relatively modest human and financial resources precisely because the deep learning community has matured and lowered the barriers in areas such as multimodal data collection and filtering, foundation models, GPU cluster cloud services, and distributed training frameworks.
Tenglong Ao: implementation of the system and writing of the technical report.
Zeyi Zhang: reviewing the report and providing detailed feedback.
Heyuan Yao: reviewing the report and providing detailed feedback.