An interactive virtual humanoid agent is a crucial interface to the physical world. A relatively complete humanoid agent must first have a face and a body, then possess both verbal and non-verbal abilities (such as eye contact, facial expression, lip motion, gesture, and manipulation), and finally be capable of real-time duplex communication, e.g., actively interrupting a conversation. Most prior systems consider only a subset of these elements, leaving a gap to a realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech and full-body movements for talking, responding, idling, and manipulation. The system is a multimodal model integrating audio and visual inputs, extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours of audio, around 130,000 hours of video, and about 20,000 alignment samples to build the model. The final model demonstrates capabilities that were difficult to achieve in previous systems, such as generalized object manipulation. This work is a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research toward scaling up.
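As a rough illustration only (the paper's actual architecture is not reproduced here), the sketch below shows one common way such an audio-visual extension of a pretrained LLM can be wired: per-frame audio and video features are projected into the LLM's token space and interleaved along time, and separate heads decode the next audio and video tokens. All module names, dimensions, and the Hugging Face-style `inputs_embeds` call are hypothetical placeholders, not the paper's design.

import torch
import torch.nn as nn

class AudioVisualAgent(nn.Module):
    """Hypothetical audio-visual wrapper around a pretrained LLM backbone."""

    def __init__(self, llm: nn.Module, d_model: int = 4096,
                 audio_dim: int = 128, video_dim: int = 1024,
                 n_audio_codes: int = 1024, n_video_codes: int = 8192):
        super().__init__()
        self.llm = llm  # pretrained decoder-only LLM (assumed HF-style API)
        # Project per-frame audio/video features into the LLM token space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Separate heads predict the next audio and video tokens each step.
        self.audio_head = nn.Linear(d_model, n_audio_codes)
        self.video_head = nn.Linear(d_model, n_video_codes)

    def forward(self, audio_feats, video_feats):
        a = self.audio_proj(audio_feats)   # (B, T, d_model)
        v = self.video_proj(video_feats)   # (B, T, d_model)
        # Interleave modalities along time: [a_1, v_1, a_2, v_2, ...]
        tokens = torch.stack([a, v], dim=2).flatten(1, 2)  # (B, 2T, d_model)
        h = self.llm(inputs_embeds=tokens).last_hidden_state
        return self.audio_head(h), self.video_head(h)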
To protect personal privacy, the heads of the characters in all demonstrated results are covered with animated Memoji, and the audio is modified to disguise their voices.
To specify the initial frame for the agent, we first select an image of an empty scene. We then directly insert an image of the agent in a specific pose, along with images of the objects to be interacted with, into the empty scene to obtain the initial frame. Following Magic Insert, the insertion process can be refined using image generation models.
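As a minimal sketch of the naive paste-based compositing step (all file paths and coordinates below are hypothetical, and the generative refinement step referenced above is omitted):

# Naive initial-frame compositing: paste RGBA cut-outs of the agent and
# objects onto an empty-scene image. A generative model (cf. Magic Insert)
# would then refine the composite.
from PIL import Image

scene = Image.open("empty_scene.png").convert("RGBA")
agent = Image.open("agent_pose.png").convert("RGBA")     # RGBA cut-out
objects = [("towel.png", (820, 540)), ("bottle.png", (960, 520))]

scene.alpha_composite(agent, dest=(400, 300))            # place the agent
for path, xy in objects:
    scene.alpha_composite(Image.open(path).convert("RGBA"), dest=xy)

scene.convert("RGB").save("initial_frame.png")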
For all demos, the topics of conversation with the agent, the scenarios, and the objects the agent interacts with are out-of-distribution (OOD); that is, the test cases never appear in the corresponding customized agent training dataset. The model may have been exposed to functionally similar scenarios and objects during the pre-training phase, and thereby knows how to interact with them.
Video frames occasionally exhibit pauses and abrupt changes. This issue arises because model inference sometimes fails to achieve real-time output, forcing us to align the human and agent timelines. This problem can be addressed in the future by optimizing the inference architecture.
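To illustrate what such forced alignment can mean in practice (a hypothetical playback strategy, not necessarily the one used in the system): when a generated frame misses its wall-clock deadline, the player can hold the last frame until the next one arrives, which manifests exactly as the pauses and jumps described above.

import time

def play_aligned(frame_source, display, fps: float = 25.0):
    """Hold the last frame whenever inference misses its deadline.

    frame_source.poll() is assumed to return a new frame, or None when
    the model has not produced one in time; display(frame) stands in
    for the actual renderer.
    """
    period = 1.0 / fps
    last_frame = None
    next_deadline = time.monotonic()
    while True:
        frame = frame_source.poll()
        if frame is not None:
            last_frame = frame        # fresh frame arrived on time
        if last_frame is not None:
            display(last_frame)       # otherwise re-show the stale frame
        next_deadline += period
        time.sleep(max(0.0, next_deadline - time.monotonic()))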
Demo1: the two parties discuss methods of preparing a Vitamin C drink.
Demo2: the human instructs the agent to draw some simple patterns.
Demo3: the agent acts as a salesperson selling fruit.
Demo4: the agent functions as a tour guide introducing Beijing.
Demo5: the agent counsels a human interlocutor who is experiencing anxiety due to the upcoming exams.
Demo6: the agent provides the human interlocutor with an introduction to artificial intelligence and its learning pathway.
Demo7: the agent perceives and describes the layout of the environment, then proposes specific renovation ideas based on the human interlocutor's decor theme.
Demo8: the agent demonstrates the proper use of sunscreen lotion.
Demo9: the agent demonstrates the process of folding a towel.
Demo10: the agent cuts the paper strip according to the human interlocutor's instructions.
Demo12: the agent's idling behavior (video sound removed) sampled given an initial frame and empty inputs.
Demo13: the agent's idling behavior (video sound removed) sampled given an initial frame and empty inputs.
Demo14: the agent's idling behavior (video sound removed) sampled given an initial frame and empty inputs.
Demo15: we can enhance the model's reasoning capability through Chain-of-Thought (CoT) (see Section 5.2 of the paper for details). For instance, in Demo15, the agent can systematically manage the teaching progress of classical poetry, and accurately recognize facial expressions (which the Memoji overlay may not reproduce accurately) and gestures.
Demo16: through Reinforcement Learning from Human Feedback (RLHF), the model can, to some extent, "know what it doesn't know" and learn how to decline requests. Given the extensive variety of interactable objects, coupled with limited data and computational resources, we collect cases of non-interactable objects during the alignment phase to teach the model to refuse such interactions, thereby avoiding significant artifacts. For example, in Demo16, the agent refuses interaction requests for non-operable objects.
@article{ao2024bodyofher,
  author  = {Ao, Tenglong},
  title   = {Body of Her: A Preliminary Study on End-to-End Humanoid Agent},
  journal = {arXiv},
  year    = {2024},
}
The preliminary exploration described in this paper has become feasible with relatively modest human and financial resources precisely because the deep learning community has matured and lowered the barriers in areas such as multimodal data collection and filtering, foundation models, GPU cluster cloud services, and distributed training frameworks.
Tenglong Ao: implementation of the system and writing of the technical report.
Zeyi Zhang: reviewing the report and providing detailed feedback.
Heyuan Yao: reviewing the report and providing detailed feedback.