AnyBipe: An End-to-End Framework for Training and Deploying Bipedal Robots Guided by Large Language Models


*Indicates Equal Contribution

1 All authors are with the Machine Vision and Autonomous System Laboratory, Department of Automation, School of Electrical Information and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China, with the Key Laboratory of System Control and Information Processing, Ministry of Education of China, and with the Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai 200240, China.

2 Wentao He is with the University of Michigan - Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, Shanghai, China.

Abstract

Training and deploying reinforcement learning (RL) policies for robots, especially in accomplishing specific tasks, presents substantial challenges. Recent advancements have explored diverse reward function designs, training techniques, simulation-to-reality (sim-to-real) transfers, and performance analysis methodologies, yet these still require significant human intervention. This paper introduces an end-to-end framework for training and deploying RL policies, guided by Large Language Models (LLMs), and evaluates its effectiveness on bipedal robots. The framework consists of three interconnected modules: an LLM-guided reward function design module, an RL training module leveraging prior work, and a sim-to-real homomorphic evaluation module. This design significantly reduces the need for human input by utilizing only essential simulation and deployment platforms, with the option to incorporate human-engineered strategies and historical data. We detail the construction of these modules, their advantages over traditional approaches, and demonstrate the framework's capability to autonomously develop and refine control strategies for bipedal robot locomotion, showcasing its potential to operate independently of human intervention.

Introduction

With the integration of advanced algorithms, enhanced physical simulations, and improved computational power, robotics has made significant strides. These innovations enable robots to perform tasks ranging from industrial automation to personal assistance with unprecedented efficiency and autonomy. As industrial robotics matures, research increasingly focuses on humanoid robots, particularly in replicating human-like characteristics and enabling robots to perform traditionally human tasks. Bipedal robots, which emulate human lower-body movements, are central to achieving human-like mobility in robots.

Control strategies for bipedal robots typically leverage either traditional control methods or reinforcement learning (RL). Traditional approaches rely on problem abstraction, modeling, and detailed planning, while RL employs reward functions to iteratively guide robots toward task completion. Through repeated interactions with the environment, RL enables robots to refine control strategies and acquire essential skills, particularly excelling in trial-and-error learning in simulated environments, where robots adapt to complex terrains and disturbances.

Despite these advancements, training and deploying RL algorithms remains challenging. Effective reward design requires careful consideration of task-specific goals and the incorporation of safety constraints for real-world applications. This complexity demands significant engineering effort in training, testing, and iterative refinement. Although reward shaping and safe RL offer potential solutions, they often rely on prior experience, complicating the reward design process. Furthermore, bridging the gap between simulations and real-world conditions—the "Sim-to-Real" challenge—remains difficult. Techniques such as domain randomization, which randomizes physical parameters to enhance agent robustness, and observation design, which facilitates task transfers across varied terrains, remain essential but still require real-world testing and human feedback. Ultimately, precise evaluation metrics are crucial for guiding and refining RL algorithm performance.

The integration of large language models (LLMs) into robotics represents a transformative advancement. Known for their capabilities in code generation, problem-solving, and task planning, LLMs are increasingly being applied to complex robotics applications. For instance, they play a pivotal role in embodied intelligence by enabling the dynamic creation of action tasks. Recent developments have further enhanced the utility of LLMs in improving reward function design, advancing Sim-to-Real transfer, and refining performance verification—key areas that reduce the need for extensive real-world testing and human intervention. However, a comprehensive framework that automatically implements all trained models in real-world settings remains lacking. To address this issue and adapt to these innovations, we propose a novel framework that leverages LLMs to optimize the entire training-to-deployment process. This framework minimizes human engineering involvement, facilitating the autonomous training and deployment of RL algorithms, and enabling both the development of new models and the enhancement of existing ones.

AnyBipe Components


Overview. Our framework is organized into three interconnected modules. After receiving all prerequisites and requirements, the framework generates reward functions via the LLM, trains them in simulation, and evaluates them in both Gazebo and reality, providing important feedback. The whole procedure requires minimal human labor.

We introduce AnyBipe, the first end-to-end framework for the training, simulation, and deployment of bipedal robots, guided by LLMs. This framework, now publicly available on GitHub, consists of three interconnected modules and embodies a closed-loop process for continual refinement and deployment. Our main contributions are listed as follows:

  1. Minimized Human Intervention: The proposed framework operates autonomously, requiring only minimal initial user input to oversee the entire process from training to deployment. No additional human intervention is needed during the workflow.
  2. LLM-Guided Reward Function Design: Leveraging large language models, the framework generates suitable reward functions from predefined prompts. Through iterative refinement, it allows users to design customized RL reward functions from the ground up.
  3. Incorporation of Prior Knowledge in Training: The framework enables the integration of pre-trained models and conventional control algorithms, which enhances RL training stability and facilitates the migration of traditional control implementations into the proposed system.
  4. Real-World Feedback Integration via Homomorphic Evaluation: This module converts real-world sensor and actuator feedback into formats compatible with simulation, enabling LLMs to bridge the gap between the training environment and real-world deployment. As a result, it allows for adaptive adjustments to the reward function based on actual feedback.

Experimental trials on bipedal robots traversing both flat and complex terrains have shown that AnyBipe can significantly enhance real-world deployment performance. Compared to manually designed reward functions, those generated by AnyBipe lead to faster and more stable training outcomes. Moreover, the integration of RL strategies with traditional control algorithms has proven to stabilize training and prevent convergence to high-reward but low-usability solutions. Valuable real-world feedback has further refined reward functions, thereby improving Sim-to-Real performance.

The paper first details the design principles and implementation of the framework's modules, then describes the experimental setup and results, and explores the implications of these findings. We conclude with a summary of our contributions and propose future research directions in autonomous robotic systems.

Related Works

Reinforcement Learning for Bipedal Robots. Reinforcement learning (RL) has achieved significant success in enabling legged robots to learn locomotion and motion control through data-driven methods, allowing them to adapt to diverse environmental challenges. Although RL has traditionally been applied to quadruped robots, recent studies have extended these techniques to bipedal robots, such as Cassie. Research has introduced RL-based locomotion strategies, training in simulation environments like MuJoCo and Isaac Gym. Additional approaches explore imitation learning, motion planning, and robust RL strategies, enabling bipedal robots to perform tasks like running, stair climbing, and complex maneuvers. Building on these advancements, our work utilizes the Isaac Gym environment, proposing a supervised RL framework to mitigate risks associated with suboptimal training outcomes.

Large Language Model Guided Robotics. Large language models (LLMs) have demonstrated considerable capabilities in task understanding, semantic planning, and code generation, making them valuable tools for robotics applications. LLMs automate environmental analysis, design reward functions, and map tasks to actions. However, challenges such as data scarcity, real-time performance, and real-world integration remain. Additionally, LLM-driven reward shaping typically depends on human feedback or manual refinement. Our framework addresses these limitations by leveraging LLMs from code generation through deployment, using environmental features and safety constraints as priors. It uniquely incorporates homomorphic feedback from real-world applications, reducing the need for in-process human intervention.

Sim-to-real Training and Deploying Techniques. The gap between simulated environments and real-world conditions, known as the “reality gap,” presents significant challenges for deploying RL strategies in robotics. Techniques such as domain randomization and system identification are widely used to address this issue. Researchers have proposed sim-to-real solutions for bipedal robots to handle tasks such as turning and walking. Recent work has also integrated LLMs to enhance environmental modeling and reward function design, making simulations more reflective of real-world complexity. However, most approaches still rely on separate training in simulation and real-world evaluation, often using human feedback to assess sim-to-real effectiveness. Our work extends these techniques by introducing an evaluation loop that continuously monitors sim-to-real performance during deployment.

Methods

In this section, we detail the AnyBipe framework, composed of three modules designed to automate reward design, simulation training, and deployment feedback, thereby minimizing human intervention. The framework integrates the robot's URDF model, a basic task description, an RL training platform, and a ROS-based system for communication and control, along with an SDK for sensor and actuator data. Optional elements include a manual reward function template, a teacher model for strategy implementation, and custom environmental observations. AnyBipe generates reward functions and trains deployable policies guided by the teacher model. The best strategies, determined by success criteria and simulation tests, are deployed via ROS. After validating in the ROS Gazebo simulation, successful policies undergo real-world testing. Evaluation across various environments and the selected best policy guide iterative improvements in reward generation. The procedural steps are encapsulated in Algorithm 1.

Algorithm 1: AnyBipe Framework Process
Pre-requisites: URDF model O, training environment 𝒯, deployment environment ℛ, robot state tracker st
Require: Environment description 𝒟(𝒯), prompt set p, training environment estimator ℰtrain, homomorphic estimator mapping function ℱ, safety evaluation criterion SA, RL algorithm RL, LLM model LLM, feedback prompt compiler COMPILE
Optional: Human-engineered reward function 𝑅ref, reference policy πref, additional prompts padd, custom environment observation obsc
Hyperparameters: Iteration N, number of reward candidates K, best sample percentage cbs, teacher model coefficient β, environment estimator coefficient ce, observation coefficient cobs
Input: Task description 𝒟
𝑅ref ← if pre-defined then reference reward else None
For i ← 1 to N do:
/* Module 1: Reward Function Generation */
pin ← p + padd + pfeedback
𝑅 ← LLM(𝒟, 𝒟(𝒯), pin, 𝑅ref)
/* Module 2: Teacher-Guided RL Training */
Π, Obs ← RL(𝒯, O, 𝑅, πref, β)
/* Module 3: Deployment, Evaluation, and Feedback */
nbs ← ⌈cbs · K⌉
pfeedback ← None
Criterion ← ce · ℰtrain(Π) + Σcobs · Obs
𝑅bs, πbs ← argmax_{nbs} Criterion(𝑅, Π)
𝑅̂bs ← ℱ(𝑅bs)
For all π, 𝑅̂ in πbs, 𝑅̂bs:
πreal ← (𝒯 → ℛ)(π)
ℰsim ← EVALgazebo(𝑅̂(st(O)), πreal)
pfeedback += COMPILE(p, ℰsim)
If SA(ℰsim) is true:
ℰreal ← EVALreal(𝑅̂(st(O)), πreal)
pfeedback += COMPILE(p, ℰreal)
𝑅ref, πref ← argmax Criterion(𝑅bs, πbs)
pfeedback += COMPILE(p, πref)
Output: Best policy π, best deployment πreal, and best reward function 𝑅
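
To make the control flow concrete, the following is a minimal Python sketch of the outer loop in Algorithm 1. All helper functions (generate_rewards, train_policies, training_criterion, evaluate_gazebo, evaluate_real, passes_safety, compile_feedback) are hypothetical placeholders standing in for the LLM call, the RL trainer, and the two evaluation stages; this illustrates the loop structure, not the released implementation.

# Hedged sketch of the AnyBipe outer loop (Algorithm 1); helper functions are hypothetical.
def anybipe_loop(task, env_desc, prompts, n_iter=5, k_samples=16, c_bs=0.15):
    reward_ref, policy_ref, feedback = None, None, ""
    best = None
    for _ in range(n_iter):
        # Module 1: LLM-guided generation of K reward-function candidates.
        rewards = generate_rewards(task, env_desc, prompts + feedback, reward_ref, k=k_samples)
        # Module 2: teacher-guided RL training for every candidate -> (policy, obs_log).
        results = [train_policies(r, policy_ref) for r in rewards]
        # Module 3: keep the top c_bs fraction by the combined training criterion.
        n_bs = max(1, round(c_bs * k_samples))
        ranked = sorted(zip(rewards, results),
                        key=lambda t: training_criterion(*t[1]), reverse=True)[:n_bs]
        feedback = ""
        for reward, (policy, _) in ranked:
            sim_eval = evaluate_gazebo(policy, reward)       # homomorphic Gazebo check
            feedback += compile_feedback(sim_eval)
            if passes_safety(sim_eval):                      # only safe policies reach hardware
                feedback += compile_feedback(evaluate_real(policy, reward))
        reward_ref, (policy_ref, _) = ranked[0]              # best candidate seeds the next round
        best = (reward_ref, policy_ref)
    return best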

Module 1: LLM Guided Reward Function Design

We enhance the reward function design using the Eureka framework, which enables LLMs to autonomously improve and iterate reward functions with predefined environmental and task descriptions \(\mathcal{D}(\mathcal{T})\). However, the initially generated code often has usability issues that require multiple iterations before viable training code is produced. Furthermore, Eureka often overlooks discrepancies between training \(\mathcal{T}\) and real environments \(\mathcal{R}\), resulting in computationally expensive but minimally effective reward functions that may induce undefined behaviors. The framework also lacks comprehensive safety considerations for tasks such as bipedal movement, despite attempts to integrate safety through Reward-Aware Physical Priors (RAPP) and LLM-led Domain Randomization.

To address these issues, we developed a robust context establishment mechanism that tackles under-designed reward functions and missing safety constraints. Our approach classifies prompts into two categories: General and Task-Specific. For general tasks, we provide coding tips, function templates, and predefined templates that facilitate code compilation, training feedback, and testing feedback. We also integrate reference tables from Isaac Gym for precise measurements such as motor torque, torque limits, and foot height, which are crucial for maintaining realistic task parameters. These tables prevent the customization of non-existent observations and enhance the utilization of environmental variables in \(\mathcal{D}(\mathcal{T})\), ensuring that LLMs consider actionable constraints during reward function design. Our experimental results confirm that LLMs can seamlessly incorporate these safety restrictions and environmental variables, designing highly effective reward functions and demonstrating strong context-tracking and instruction-following capabilities.

Examples of safety restriction prompts (left) against LLM-generated reward functions (right)

Furthermore, the Task-Specific module allows users to define custom prompts for specific tasks, facilitating the rapid generation of viable code and standardization of reward calculations. Users have the flexibility to use trainable artificial reward functions as templates or employ various computational paradigms to enhance the accuracy and applicability of reward assessments.
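
As a concrete illustration of what the General prompts (coding tips, reference tables) and Task-Specific templates steer the LLM toward, below is a hedged, Isaac Gym-style sketch of a generated reward term combining velocity tracking with a torque-limit safety penalty. The tensor names (base_lin_vel, commands, torques, torque_limits) follow common Legged Gym conventions and are assumptions, not the exact variables exposed by our environment.

import torch

# Hedged sketch of an LLM-generated reward term; variable names are assumed conventions.
def compute_reward(base_lin_vel, commands, torques, torque_limits,
                   sigma_v=0.25, torque_penalty_scale=1e-4):
    # Velocity tracking: exponential kernel on the planar tracking error.
    vel_error = torch.sum((commands[:, :2] - base_lin_vel[:, :2]) ** 2, dim=1)
    r_track = torch.exp(-vel_error / sigma_v)
    # Safety: penalize torques that approach the limits listed in the reference table.
    torque_excess = (torch.abs(torques) - 0.8 * torque_limits).clamp(min=0.0)
    r_safety = -torque_penalty_scale * torch.sum(torque_excess ** 2, dim=1)
    return r_track + r_safety, {"tracking_lin_vel": r_track, "torque_limit": r_safety}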

To close the improvement loop, we introduce a comprehensive reward function evaluation scheme. This scheme not only tracks changes in rewards and observations throughout training but also integrates a homomorphic evaluation model to closely assess real-world robot performance. This model ensures a high correlation between real-world outcomes and the theoretical reward functions, enabling LLMs to identify the most impactful components of the reward functions. Details of this model are given in Section 3.3.

Module 2: RL Training Adopting Reference Policy as Teacher

In this section, we detail the adaptations applied to guide RL training towards desired actions, building on the framework provided by Legged Gym and employing the Proximal Policy Optimization (PPO) algorithm as our foundation. We assume the existence of a baseline policy \( \pi_{\text{ref}} \), which could be derived from traditional control methods or previous RL techniques.

To enhance the PPO algorithm, we modify the objective function as follows:

\[ \begin{aligned} L^{\text{CLIP}}(\theta) = & \hat{\mathbb{E}}_t [\min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t) \\ &+\beta \operatorname{KL}[\pi_{\text{ref}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t)]], \end{aligned} \]

where \( r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \) denotes the probability ratio, \( \hat{A}_t \) is an estimator of the advantage at time \( t \), \( \epsilon \) is a small positive number, and \( \beta \) is a coefficient that weights the divergence between the reference policy and the PPO policy.

This integration enables control over the similarity between the trained policy and the reference policy. However, abstracting the reference policy \( \pi_{\text{ref}} \) into the same probabilistic framework as the PPO policy presents challenges. Despite this, given the deterministic nature of actions \( a_t \) for a specific state \( s \) and previous action \( a_{t-1} \), and assuming sufficient environmental observations, we can approximate the distribution \( \pi_{\text{ref}} \) as a Dirac distribution, and the differences between \( \pi \) and \( \pi_{\text{ref}} \) can be described as follows:

\[ \begin{aligned} \hat{\mathbb{E}}_t\left[\operatorname{KL}( \pi_{\text{ref}}, \pi_{\theta})\right] \approx \frac{1}{N}\sum_{i=1}^N \left[\log(\sqrt{2\pi \sigma_{\theta,i}^2}) + \frac{(a_{ref} - \mu_{\theta,i})^2}{2\sigma_{\theta,i}^2}\right]. \end{aligned} \]

The approximation uses the integral properties of the Dirac delta function and the definition of KL divergence; the full derivation, omitted in the paper due to space constraints, is given in the proof below. With adequate observations, this approximation becomes reliable, and it effectively prevents reinforcement learning from converging to a degenerate solution or a poor local optimum.
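
A minimal PyTorch sketch of this modified surrogate, written as a loss to be minimized so that the β-weighted term pulls the policy toward the teacher. It assumes a Gaussian policy head (mu, sigma) and a deterministic reference action a_ref from \( \pi_{\text{ref}} \); it illustrates the objective above rather than reproducing the framework's training code.

import torch

def teacher_guided_ppo_loss(log_prob, old_log_prob, advantages,
                            mu, sigma, a_ref, clip_eps=0.2, beta=5.0):
    # Standard clipped PPO surrogate, negated so it can be minimized.
    ratio = torch.exp(log_prob - old_log_prob)
    surr = torch.min(ratio * advantages,
                     torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)
    ppo_loss = -surr.mean()
    # KL(pi_ref || pi_theta) with pi_ref treated as a Dirac at a_ref:
    # log(sqrt(2*pi*sigma^2)) + (a_ref - mu)^2 / (2*sigma^2), averaged over the batch.
    var = sigma ** 2
    kl_term = (torch.log(torch.sqrt(2.0 * torch.pi * var))
               + (a_ref - mu) ** 2 / (2.0 * var)).sum(dim=-1).mean()
    return ppo_loss + beta * kl_term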

In the AnyBipe framework, we have established a template for deploying existing policies as teacher functions, demonstrating the integration of models deployed with PyTorch or ONNX, as well as traditional control algorithms implemented in C++, into our framework. This allows users to transfer previous work into the current framework, or to introduce pre-trained policies for simple tasks to accelerate convergence on complex tasks.
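
As one example of the teacher template, a pre-trained ONNX policy could be wrapped as follows so that it supplies the reference action a_ref for the KL term above. onnxruntime is a real library, but the file name, input layout, and single-output assumption here are illustrative guesses rather than the framework's exact interface.

import numpy as np
import onnxruntime as ort

class OnnxTeacherPolicy:
    """Wraps an exported policy so it can serve as the teacher pi_ref during training."""

    def __init__(self, model_path: str):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def act(self, obs: np.ndarray) -> np.ndarray:
        # Deterministic reference action for the current observation batch.
        return self.session.run(None, {self.input_name: obs.astype(np.float32)})[0]

# Usage sketch (hypothetical file name): the returned array feeds the beta-weighted KL term.
# teacher = OnnxTeacherPolicy("reference_policy.onnx")
# a_ref = teacher.act(observations)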


Proof of this section (section 3.2 in the paper)

For the KL divergence between the normal distribution \( N(\mu_{\theta}, \sigma_{\theta}^2) \) with PDF \( q(x) = \frac{1}{\sqrt{2\pi\sigma_{\theta}^2}}\exp\left(-\frac{(x-\mu_{\theta})^2}{2\sigma_{\theta}^2}\right) \) and the Dirac distribution with PDF \( p(x) = \delta(x-a_{ref}) \), the KL divergence can be written as

\[ \begin{aligned} \operatorname{KL}(\pi_{\text{ref}}\mid \pi_{\theta}) &= \int_{-\infty}^{\infty}p(x)\log\frac{p(x)}{q(x)}dx\\ & = \int_{-\infty}^{\infty}\delta(x)\log \delta(x)dx-\int_{-\infty}^{\infty}\delta(x-a_{ref})\log q(x)dx\\ & = 0 - \log q(a_{ref})\\ & = \log(\sqrt{2\pi \sigma_{\theta}^2}) + \frac{(a_{ref} - \mu_{\theta})^2}{2\sigma_{\theta}^2} \end{aligned} \]

We first prove that \( \int_{-\infty}^{\infty}\delta(x)\log \delta(x)dx=0 \). Let \( u = \log\delta(x) \) and \( v = \int \delta(x)dx = \mathbb{1}(x) \), where \( \mathbb{1} \) is the step function. Then

\[ \begin{aligned} m(x):=\int\delta(x)\log \delta(x)dx&= \int u\, dv\\ &= uv - \int v\,du \\ &= \left\{\begin{aligned} & 0\cdot u - \int 0 \cdot du =0, &x < 0,\\ & \log\delta(0), & x=0,\\ & u - \int 1\cdot du=0, & x> 0. \end{aligned}\right.\\ \end{aligned} \]

Therefore, we can approximately say that \( \int_{-\infty}^{\infty}\delta(x)\log \delta(x)dx = m(\infty) - m(-\infty)=0 \).

Then we show why we do not compute the KL divergence in the usual direction, \( \operatorname{KL}(\pi_{\theta}\mid \pi_{\text{ref}}) \). This is because the term

\[ \int_{-\infty}^{\infty}q(x)\log\delta(x-a_{ref})dx \]

is \( -\infty \), which cannot be evaluated numerically. Since backpropagation functions normally under the reverse form, we choose it instead; it approximately measures the difference between the two distributions.

Also, in practice, we tend to ignore the term that controls the magnitude of \( \sigma_{\theta} \), and use the following term directly instead:

\[ L_{dist} =\sum_{i=1}^N\frac{(a_{ref} - \mu_{\theta,i})^2}{2\sigma_{\theta,i}^2} \]

Module 3: Deployment, Evaluation, and Feedback

In this section, we discuss optimal strategy selection, state estimator implementations, and methods to align evaluation with reward functions, leading to refined strategy generation by LLMs.

To filter out ineffective reward functions, we define essential training metrics—such as command compliance and survival duration—as basic measurements \( \mathfrak{E}_{train} \), and retain only the samples that meet these benchmarks. For specific tasks such as bipedal stair climbing or standing, customization of observational metrics \( Obs_c \) and coefficients \( c_{obs} \) is enabled in configuration files. We then select the top 15% of strategies for real-world deployment, prioritizing those that maximize the combined metric:

\[ \mathbf{R}_{bs}, \Pi_{bs} = \text{argmax}_{n_{bs}} \; c_{e} \cdot \mathfrak{E}_{train}(\Pi) + \sum c_{obs} \cdot \operatorname{Obs}_c. \]
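
A small sketch of this selection step, assuming each trained candidate stores its basic training metric \( \mathfrak{E}_{train} \) and the averages of its custom observations; the dictionary keys and the 0.15 cutoff mirror \( c_e \), \( c_{obs} \), and the 15% threshold described above, but the data layout is an assumption.

def select_best_strategies(candidates, c_e=1.0, c_obs=None, top_fraction=0.15):
    """Rank (reward, policy) candidates by c_e * E_train + sum(c_obs * Obs_c)."""
    c_obs = c_obs or {}

    def criterion(cand):
        score = c_e * cand["train_metric"]            # e.g. command compliance, survival
        for name, coeff in c_obs.items():             # task-specific observation terms
            score += coeff * cand["observations"].get(name, 0.0)
        return score

    n_bs = max(1, round(top_fraction * len(candidates)))
    return sorted(candidates, key=criterion, reverse=True)[:n_bs]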

Strategies are deployed in a ROS-based robotic framework \( \mathcal{R} \) using ONNX. Since the training environment \( \mathcal{T} \) and the real environment \( \mathcal{R} \) are not isomorphic, we define a homomorphism \( \mathcal{F}: \mathcal{T} \to \mathcal{R} \), ensuring the real-world evaluation metric \( \hat{\mathbf{R}} = \mathcal{F}(\mathbf{R}) \) mirrors the reward function. Our automated script aligns reward structures with observed real-world data and informs users of any mismatches.
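
The mapping \( \mathcal{F} \) can be pictured as re-evaluating each Isaac Gym reward term on quantities observed through ROS. The sketch below is a hypothetical illustration: each simulation term is paired with a counterpart computed from logged sensor data, and terms without an observable counterpart are reported as mismatches, as the automated script does.

import numpy as np

def homomorphic_map(sim_reward_terms, real_term_fns, sensor_log):
    """Map simulation reward terms to real-world counterparts, flagging mismatches.

    sim_reward_terms: {name: accumulated value in Isaac Gym}
    real_term_fns:    {name: callable(sensor_log) -> value computed from ROS data}
    """
    mapped, mismatches = {}, []
    for name in sim_reward_terms:
        if name in real_term_fns:
            mapped[name] = real_term_fns[name](sensor_log)
        else:
            mismatches.append(name)      # no observable counterpart on the robot
    return mapped, mismatches

# Example counterpart (hypothetical log keys): linear-velocity tracking recomputed
# from the commanded and estimated base velocities.
real_terms = {
    "tracking_lin_vel": lambda log: float(np.mean(
        np.exp(-np.sum((log["cmd_vel"] - log["base_vel"]) ** 2, axis=1) / 0.25))),
}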

We also develop a basic state estimator for comprehensive robotic assessments, including metrics like step width and leg height. Models are first validated in Gazebo to confirm operational stability, with safety checks aligned with safety regulations proposed earlier. Successful models undergo real-world testing, with optimal models selected based on:

\[ \begin{aligned} &R_{best},\pi_{best} = \\&\text{argmax}\left[ [\mathcal{F}(\mathbf{R}_{bs})_{gazebo} + \mathcal{F}(\mathbf{R}_{bs})_{real}](\mathcal{T} \to \mathcal{R})(\Pi_{bs})\right]. \end{aligned} \]
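
The Gazebo gate and the final selection can be summarized in a few lines; a hedged sketch, assuming each deployed candidate records an IMU tilt trace, a survival time, and homomorphic scores from Gazebo and reality (the thresholds and key names are illustrative, not the framework's configuration).

def passes_safety(eval_log, max_tilt_rad=0.5, min_survival_s=25.0):
    """Hypothetical safety gate: bounded IMU tilt and sufficient survival time in Gazebo."""
    return (max(eval_log["imu_tilt"]) < max_tilt_rad
            and eval_log["survival_time"] >= min_survival_s)

def select_final(candidates):
    """Pick the candidate maximizing the summed homomorphic scores from Gazebo and reality."""
    return max(candidates,
               key=lambda c: c["gazebo_score"] + c.get("real_score", float("-inf")))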

AnyBipe enables an autonomous cycle from reward function generation to training, deployment, and optimization, facilitating user-driven training of bipedal robot RL algorithms without pre-existing strategies.

Homomorphic reward function conversion procedure


Example Feedback Visualized

Experiments

Our experiments were conducted on a six-degree-of-freedom bipedal robot from Limx Dynamics, shown in Fig. 1. GPT-4o was selected as the LLM base, with training performed on an NVIDIA RTX 3090 Ti. We explored the robot's locomotion on both flat and complex terrains over multiple experimental rounds. This section highlights key experiments on individual modules and overall training effectiveness. Due to space limitations, only critical results are presented here, with detailed data and charts available on our GitHub page.


Figure 1: Limx robot and DOF definitions.

Table 1 outlines key reward functions that maintain the same structure across both human-engineered and AnyBipe-generated rewards, differing only in scale, allowing direct performance comparison.

Table 1: Examples of important rewards
Reward Name | Expression
Survival | \( R_{surv}= \int_0^{t_{\text{term},i}} dt \)
Tracking Linear Velocity | \( R_{vel} = \exp\left(- \frac{\|v-v_{ref}\|^2}{\sigma_{l}^2}\right) \)
Tracking Angular Velocity | \( R_{angl}=\exp\left(- \frac{\|\omega-\omega_{ref}\|^2}{\sigma_{a}^2}\right) \)
Success | \( R_{succ} = R_{surv} \cdot (c_{l} R_{vel} + R_{angl}) \)
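
For reference, the Table 1 terms can be written out directly. A short NumPy sketch, assuming per-step logs of base velocities, commands, and an alive mask, with the kernel widths \( \sigma_l \), \( \sigma_a \) and the weight \( c_l \) left as hypothetical defaults.

import numpy as np

def table1_rewards(v, v_ref, w, w_ref, dt, alive_mask,
                   sigma_l=0.25, sigma_a=0.25, c_l=1.0):
    """Survival, linear/angular velocity tracking, and the combined success reward.

    v, v_ref, w, w_ref: per-step arrays of shape (T, dims); alive_mask: shape (T,).
    """
    r_surv = float(np.sum(alive_mask) * dt)                           # integral of dt until termination
    r_vel = np.exp(-np.sum((v - v_ref) ** 2, axis=1) / sigma_l ** 2)  # linear-velocity tracking
    r_angl = np.exp(-np.sum((w - w_ref) ** 2, axis=1) / sigma_a ** 2) # angular-velocity tracking
    r_succ = r_surv * (c_l * float(np.mean(r_vel)) + float(np.mean(r_angl)))
    return r_surv, r_vel, r_angl, r_succ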

Module Analysis

We evaluated each module by testing prompt design and various augmentations using gpt-3.5-turbo, gpt-4, and gpt-4o with \( N=16 \) samples per batch. Success rates—defined as the proportion of samples processed successfully in the first iteration—were compared across four configurations: basic prompts, prompts with code references, safety regulations, and both. Figure 2 shows that adding code references and safety regulations improves LLM performance in generating usable reward functions. As gpt-4 and gpt-4o show similar performance with complete prompts, gpt-4o is recommended as the primary model.


Figure 2: Success rate for LLM generated reward functions.

We validated the effect of incorporating safety regulations into reward functions by comparing models with and without safety prompts in Isaac Gym and Gazebo environments. Figure 3 shows the operational postures and IMU spatial angles of safe models (c)(d) versus unsafe models (a)(b). Results indicate that, even without real-world feedback, safety prompts effectively constrain robot behavior, supporting practical deployment.


Figure 3: Model behavior with and without safety regulation prompts.


We evaluated teacher-guided models under artificial reward functions, training each for 5,000 iterations on complex terrain. The original model used only an artificial reward function, while the teacher-guided model was instructed by an operational ONNX model with \( \beta = 5.0 \). Performance was measured using reward success and terrain level. Results showed that the teacher-guided model had more stable training, with faster and less volatile reward growth.


Figure 4: Reward success and terrain level for teacher-guided and original RL training with human-engineered rewards.

Homomorphic Evaluation

For the homomorphic evaluation module, Table 2 shows several evaluation indicators before and after conversion, along with the evaluation results under the best model; these objectively reflect the differences between simulation and reality. The columns give the reward function name, the reward in Isaac Gym, the homomorphic measurement in Gazebo and in reality, and the mapped tracking result in reality (over 30 seconds of tracking).

Table 2: Examples of homomorphic evaluation
Name | Isaac Gym | Gazebo | Reality | Mapping (real / target)
Track lin vel | 56.92 | 21.93 | 20.41 | 0.86 (1.0) m/s
Track ang vel | 36.73 | 15.60 | 14.43 | 0.06 (0.10) rad/s
Feet distance | -0.31 | -0.00 | -0.00 | \(>0.1\) m (\(>0.1\) m)
Standing still | -50.35 | -6.16 | -11.20 | 5.8 (30) s
Survival time | 0.86 | 0.30 | 0.30 | 30 (30) s

Framework Analysis

The experimental setup involved first establishing a baseline using a manually designed reward function. We then trained our model from scratch in flat and complex terrain environments, simulating the typical situation of a user applying the AnyBipe framework for the first time. The robot was trained to track speed commands, avoid falls, and walk on various terrains using a simplified reward function with basic components such as velocity tracking, action rate, and balance judgment, without any scale-related information. Over five rounds (\( N=16 \) samples each), the best model from each round was tested in Gazebo and real-world scenarios. The basic training framework occupies 7 GB of GPU memory, and the corresponding training time is 79 hours. Figure 5 shows the deployment results: initial iterations had issues such as excessive movement (Iteration 0) and unnatural joint postures (Iterations 1 and 2), but these were corrected by Iteration 5, achieving a natural gait.


Figure 5: Deployment results for complex terrain locomotion tasks.

To demonstrate the effectiveness of the LLM's reward function improvements during training, we evaluated its performance using reward success and terrain level in tasks such as velocity tracking, safe walking, and navigating complex terrains. Comparing five iterations against manually designed reward functions in Figure 6, we found that the LLM-generated reward function outperformed the manual version after just two iterations. Each subsequent iteration further enhanced training speed and performance, all achieved without human intervention.


Figure 6: Reward success, terrain level for different iterations in complex locomotion training, compared with human-engineered rewards.


To verify that the final trained policy not only performs well in the lab but also adapts to real-world terrains, we conducted walking tests across five different surfaces: the experimental site, carpet, hard ground, grass, and stairs. The experiments demonstrated that the model trained and deployed by AnyBipe possesses the ability to walk on various terrains. In contrast, the manually designed reward function used in the experiment failed to achieve Sim-to-real transfer, indicating that AnyBipe can independently resolve the Sim-to-real problem.


Figure 7: Experiments conducted on different terrains, adopting AnyBipe best policy.

Conclusion

We have presented AnyBipe, an end-to-end framework for training and deploying bipedal robots, which utilizes a state-of-the-art LLM to design reward functions for specific tasks. The framework provides interfaces that allow users to supply reward references and integrate pre-existing models to assist in training. Additionally, it incorporates feedback from both simulated and real-world test results, enabling the execution of training-to-deployment tasks entirely without human supervision. We validated the effectiveness of each module, as well as the system's ability to guide the robot in learning locomotion in both simple and complex environments, continuously improving the model by either designing new reward functions from scratch or refining existing ones. Furthermore, the framework exhibits potential for transfer to other robotic task planning scenarios. Our future work will focus on improving the current framework in three key areas: first, extending it to a broader range of robotic applications to verify its generalizability; second, testing its effectiveness across more tasks beyond locomotion; and third, enhancing the model evaluation process by incorporating image capture and vision-language models (VLMs) to achieve more comprehensive state estimation.
