
Update the reinforcement learning code so that training runs properly on a pure-CPU machine. Standing is now stable; parameters are still being tuned for clearly visible forward movement.

corvin_zhang, 1 week ago
Parent
Commit
2fece5a316

+ 38 - 2
docs/guguji_biped_rl_guide.md

@@ -128,6 +128,21 @@ ros2 launch guguji_ros2 gazebo.launch.py gui:=false pause:=true
 
 Don't chase speed right from the start, or the biped will easily learn a "lunge-and-topple" forward charge.
 
+### Stage C.5: Gentle forward transition
+
+Goals:
+
+- Keep the current balance ability from degrading noticeably
+- Start building a preference for a slight forward velocity
+- Provide a curriculum bridge into the later walking training
+
+Suggested starting point:
+
+- `configs/forward_transition_ppo.yaml`
+- `task.target_forward_velocity = 0.05`
+
+This stage is best trained as a continuation of the already-trained balance model rather than from scratch.
+
 ## 5. How to interpret the reward function

 The current code already implements a reward structure suitable for initial debugging:
@@ -217,8 +232,9 @@ ros2 launch guguji_ros2 gazebo.launch.py gui:=false pause:=true
 1. Launch the Gazebo simulation and confirm the robot appears correctly.
 2. Run `check_env.py` to confirm reset / step / reward all run through.
 3. Train `balance_ppo.yaml` first; don't rush into walking training.
-4. Watch `/joint_states` and `/tf` to check whether the robot falls over frequently.
-5. After adjusting rewards and joint ranges, switch to `walk_ppo.yaml`.
+4. Once a basic balance model is trained, use `forward_transition_ppo.yaml` for the gentle forward transition.
+5. Watch `/joint_states` and `/tf` to check whether the robot still falls frequently and whether a slight forward tendency is emerging.
+6. After adjusting rewards and joint ranges, switch to `walk_ppo.yaml`.
 
 ## 10. How to run continuous control in ROS 2 after training
 
@@ -238,6 +254,26 @@ python3 scripts/run_policy.py \
   --deterministic
 ```
 
+## 10.1 How to continue gentle forward training from the balance model
+
+The training script already supports using an existing PPO model's parameters as curriculum-learning initialization.
+
+Example command:
+
+```bash
+cd /home/corvin/Project/guguji_simulation/guguji_rl
+source .venv/bin/activate
+python3 scripts/train.py \
+  --config configs/forward_transition_ppo.yaml \
+  --init-model outputs/<your_balance_run_dir>/final_model.zip
+```
+
+The key points of doing this:
+
+- Keep reusing the "stand steady" ability that has already been learned
+- Avoid regressing to a "falls after one step" state during the gentle forward stage
+- Let the walking training start from a more sensible initial policy
+
 ## 11. What you are most likely to change next

 You will most likely focus your future changes on these points:

+ 15 - 2
guguji_rl/README.md

@@ -26,8 +26,9 @@
 1. Start the Gazebo simulation first
 2. Then run `check_env.py` to check the ROS 2 / Gazebo training interface
 3. Run `balance_ppo.yaml` first to train standing balance
-4. Then run `walk_ppo.yaml` to train forward walking
-5. After training a model, first replay it with `evaluate.py`, then use `run_policy.py` for continuous inference control in the ROS 2 system
+4. Then run `forward_transition_ppo.yaml` for gentle forward-transition training
+5. Finally run `walk_ppo.yaml` to train proper forward walking
+6. After training a model, first replay it with `evaluate.py`, then use `run_policy.py` for continuous inference control in the ROS 2 system
 
 ## Install dependencies
 
@@ -66,3 +67,15 @@ python3 scripts/run_policy.py \
   --model outputs/<your_run_dir>/final_model.zip \
   --deterministic
 ```
+
+## Curriculum-learning training example
+
+If you already have a trained balance model, it is recommended to continue gentle forward training directly on top of it:
+
+```bash
+cd /home/corvin/Project/guguji_simulation/guguji_rl
+source .venv/bin/activate
+python3 scripts/train.py \
+  --config configs/forward_transition_ppo.yaml \
+  --init-model outputs/<your_balance_run_dir>/final_model.zip
+```
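For context on how `--init-model` works under the hood: `scripts/train.py` (see its diff at the bottom of this commit) builds a fresh PPO model from the current config, then copies in the saved weights via `set_parameters`. Below is a minimal standalone sketch of that mechanism, assuming stable-baselines3 >= 2.3 and using `CartPole-v1` as a stand-in for the Gazebo env; file names are illustrative.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

# Stand-in for the finished balance run's final_model.zip.
balance = PPO("MlpPolicy", env, n_steps=128, batch_size=64)
balance.save("balance_final.zip")

# Fresh model built from the transition config's hyperparameters...
transition = PPO("MlpPolicy", env, learning_rate=2e-4, n_steps=128, batch_size=64)
# ...which inherits only the learned weights; exact_match=False tolerates
# minor mismatches between the saved and current parameter dicts.
transition.set_parameters("balance_final.zip", exact_match=False)
transition.learn(total_timesteps=2048)
```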

+ 29 - 11
guguji_rl/configs/balance_ppo.yaml

@@ -14,6 +14,18 @@ robot:
     - right_ankle_pitch_joint
     - right_ankle_joint
   command_topic_prefix: /guguji/command
+  # The first round does small residual control around a safe standing pose, so a random policy can't knock the robot over in one step.
+  nominal_joint_positions:
+    left_hip_pitch_joint: 0.0
+    left_knee_pitch_joint: 0.0
+    left_ankle_pitch_joint: 0.0
+    left_ankle_joint: 0.0
+    right_hip_pitch_joint: 0.0
+    right_knee_pitch_joint: 0.0
+    right_ankle_pitch_joint: 0.0
+    right_ankle_joint: 0.0
+  action_scale: 0.12
+  action_smoothing: 0.85
 
 ros:
   joint_state_topic: /joint_states
@@ -26,30 +38,35 @@ sim:
   # RL training works better with service_step: each step is more controllable and easier to reproduce.
   step_mode: service_step
   control_dt: 0.05
-  service_step_iterations: 50
+  # Advance fewer physics steps per action to limit the damage of early random actions.
+  service_step_iterations: 15
   reset_settle_seconds: 1.0
+  # After reset, hold the nominal standing pose for a few steps so the robot returns to a more stable initial state.
+  reset_hold_steps: 8
   action_publish_delay: 0.01
   post_step_wait_seconds: 0.01
-  # During training, launch Gazebo paused and let the training program step the simulation
-  launch_hint: ros2 launch guguji_ros2 gazebo.launch.py gui:=false pause:=true
+  # The GUI launch hint is kept here so you can watch what the robot does during training
+  launch_hint: ros2 launch guguji_ros2 gazebo.launch.py pause:=true
 
 task:
   target_forward_velocity: 0.0
   target_base_height: null
-  max_roll_rad: 0.60
-  max_pitch_rad: 0.60
-  min_base_height: 0.14
+  # Relax the fall criteria for the first round so a slight wobble doesn't immediately end the episode.
+  max_roll_rad: 1.00
+  max_pitch_rad: 1.00
+  min_base_height: 0.24
+  termination_grace_steps: 8
 
 rewards:
-  alive_bonus: 1.0
+  alive_bonus: 2.0
   velocity_tracking_scale: 1.5
   velocity_tracking_sigma: 0.30
   upright_scale: 2.0
   height_scale: 1.0
-  action_rate_penalty_scale: 0.03
+  action_rate_penalty_scale: 0.01
   joint_limit_penalty_scale: 0.05
   lateral_velocity_penalty_scale: 0.10
-  fall_penalty: -15.0
+  fall_penalty: -8.0
 
 training:
   algorithm: ppo
@@ -58,8 +75,9 @@ training:
   seed: 42
   device: auto
   learning_rate: 0.0003
-  n_steps: 1024
-  batch_size: 256
+  # Training with a single Gazebo environment is slow; shorten the rollout first for faster log feedback.
+  n_steps: 128
+  batch_size: 64
   gamma: 0.99
   gae_lambda: 0.95
   clip_range: 0.2
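To make the new `nominal_joint_positions` / `action_scale` / `action_smoothing` knobs concrete, here is a minimal sketch of the residual-action mapping they feed, mirroring `_action_to_joint_targets` in the env diff further down this commit; the array sizes and joint limits are illustrative.

```python
import numpy as np

nominal = np.zeros(8)                         # nominal_joint_positions
half_range = np.full(8, 1.0)                  # half joint range from the URDF
lower, upper = -1.0, 1.0                      # illustrative joint limits
action_scale, action_smoothing = 0.12, 0.85   # values from balance_ppo.yaml
current_target = nominal.copy()

def action_to_targets(action: np.ndarray) -> np.ndarray:
    global current_target
    a = np.clip(action, -1.0, 1.0)
    # Small residual around the nominal pose instead of the full joint range.
    desired = np.clip(nominal + a * half_range * action_scale, lower, upper)
    # Exponential smoothing keeps consecutive targets close together.
    current_target = action_smoothing * current_target + (1.0 - action_smoothing) * desired
    return current_target
```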

+ 74 - 0
guguji_rl/configs/balance_ppo_smoke.yaml

@@ -0,0 +1,74 @@
+experiment:
+  name: balance_ppo_smoke
+
+robot:
+  model_name: guguji
+  urdf_path: guguji_ros2_ws/src/guguji_ros2/urdf/guguji.urdf
+  joint_names:
+    - left_hip_pitch_joint
+    - left_knee_pitch_joint
+    - left_ankle_pitch_joint
+    - left_ankle_joint
+    - right_hip_pitch_joint
+    - right_knee_pitch_joint
+    - right_ankle_pitch_joint
+    - right_ankle_joint
+  command_topic_prefix: /guguji/command
+
+ros:
+  joint_state_topic: /joint_states
+  tf_topic: /tf
+  clock_topic: /clock
+  world_control_service: /world/default/control
+
+sim:
+  world_name: default
+  # The smoke-test config also uses service_step to keep the training process reproducible.
+  step_mode: service_step
+  control_dt: 0.05
+  service_step_iterations: 50
+  reset_settle_seconds: 1.0
+  action_publish_delay: 0.01
+  post_step_wait_seconds: 0.01
+  launch_hint: ros2 launch guguji_ros2 gazebo.launch.py gui:=false pause:=true
+
+task:
+  target_forward_velocity: 0.0
+  target_base_height: null
+  max_roll_rad: 0.60
+  max_pitch_rad: 0.60
+  min_base_height: 0.14
+
+rewards:
+  alive_bonus: 1.0
+  velocity_tracking_scale: 1.5
+  velocity_tracking_sigma: 0.30
+  upright_scale: 2.0
+  height_scale: 1.0
+  action_rate_penalty_scale: 0.03
+  joint_limit_penalty_scale: 0.05
+  lateral_velocity_penalty_scale: 0.10
+  fall_penalty: -15.0
+
+training:
+  algorithm: ppo
+  # This config is dedicated to first-round training diagnostics, making PPO print statistics sooner.
+  total_timesteps: 2048
+  max_episode_steps: 400
+  seed: 42
+  device: cpu
+  learning_rate: 0.0003
+  n_steps: 128
+  batch_size: 64
+  gamma: 0.99
+  gae_lambda: 0.95
+  clip_range: 0.2
+  ent_coef: 0.0
+  vf_coef: 0.5
+  policy_net_arch: [256, 256]
+  checkpoint_freq: 1024
+  output_root: guguji_rl/outputs
+
+evaluation:
+  episodes: 2
+  deterministic: true

+ 93 - 0
guguji_rl/configs/forward_transition_ppo.yaml

@@ -0,0 +1,93 @@
+experiment:
+  name: forward_transition_ppo
+
+robot:
+  model_name: guguji
+  urdf_path: guguji_ros2_ws/src/guguji_ros2/urdf/guguji.urdf
+  joint_names:
+    - left_hip_pitch_joint
+    - left_knee_pitch_joint
+    - left_ankle_pitch_joint
+    - left_ankle_joint
+    - right_hip_pitch_joint
+    - right_knee_pitch_joint
+    - right_ankle_pitch_joint
+    - right_ankle_joint
+  command_topic_prefix: /guguji/command
+  # The transition stage keeps the stable balance pose, only allowing slightly more aggressive residual actions on top of it.
+  nominal_joint_positions:
+    left_hip_pitch_joint: 0.0
+    left_knee_pitch_joint: 0.0
+    left_ankle_pitch_joint: 0.0
+    left_ankle_joint: 0.0
+    right_hip_pitch_joint: 0.0
+    right_knee_pitch_joint: 0.0
+    right_ankle_pitch_joint: 0.0
+    right_ankle_joint: 0.0
+  action_scale: 0.16
+  action_smoothing: 0.80
+
+ros:
+  joint_state_topic: /joint_states
+  tf_topic: /tf
+  clock_topic: /clock
+  world_control_service: /world/default/control
+
+sim:
+  world_name: default
+  # Keep using service_step so every step of the curriculum stage stays controllable.
+  step_mode: service_step
+  control_dt: 0.05
+  service_step_iterations: 18
+  reset_settle_seconds: 1.0
+  reset_hold_steps: 8
+  action_publish_delay: 0.01
+  post_step_wait_seconds: 0.01
+  launch_hint: ros2 launch guguji_ros2 gazebo.launch.py pause:=true
+
+task:
+  # Start from a very small forward target so the robot first learns to "want to go forward" rather than immediately chasing fast walking.
+  target_forward_velocity: 0.05
+  target_base_height: null
+  max_roll_rad: 0.90
+  max_pitch_rad: 0.90
+  min_base_height: 0.23
+  termination_grace_steps: 8
+
+rewards:
+  # The transition stage still puts survival first, but emphasizes forward-velocity tracking more than the pure balance stage.
+  alive_bonus: 1.8
+  velocity_tracking_scale: 2.4
+  velocity_tracking_sigma: 0.18
+  upright_scale: 2.0
+  height_scale: 1.0
+  action_rate_penalty_scale: 0.012
+  joint_limit_penalty_scale: 0.05
+  lateral_velocity_penalty_scale: 0.12
+  fall_penalty: -10.0
+
+training:
+  algorithm: ppo
+  # Extend this transition run to 20000 steps to consolidate the gentle forward ability.
+  total_timesteps: 20000
+  max_episode_steps: 400
+  seed: 42
+  device: auto
+  # In the curriculum stage, continue from the balance model instead of restarting from scratch.
+  init_model_path: null
+  learning_rate: 0.0002
+  n_steps: 128
+  batch_size: 64
+  gamma: 0.99
+  gae_lambda: 0.95
+  clip_range: 0.2
+  ent_coef: 0.0
+  vf_coef: 0.5
+  policy_net_arch: [256, 256]
+  # Keep denser checkpoints so you can pick a better-performing intermediate model midway.
+  checkpoint_freq: 5000
+  output_root: guguji_rl/outputs
+
+evaluation:
+  episodes: 3
+  deterministic: true
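For a feel of how gentle this config's forward preference actually is, evaluating its Gaussian tracking term (target 0.05 m/s, sigma 0.18, scale 2.4) shows that standing still forfeits only a few percent of the tracking reward; a quick standalone check:

```python
import numpy as np

target, sigma, scale = 0.05, 0.18, 2.4   # forward_transition_ppo.yaml values
for v in (0.0, 0.05, 0.10):
    reward = scale * np.exp(-((v - target) ** 2) / (2.0 * sigma ** 2))
    print(f"forward velocity {v:.2f} m/s -> tracking reward {reward:.3f}")
# 0.00 -> 2.309, 0.05 -> 2.400, 0.10 -> 2.309
```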

+ 67 - 23
guguji_rl/configs/walk_ppo.yaml

@@ -14,6 +14,21 @@ robot:
     - right_ankle_pitch_joint
     - right_ankle_joint
   command_topic_prefix: /guguji/command
+  # The first walking version still uses the "nominal pose + residual action" scheme,
+  # so it can directly reuse the stable-posture ability learned in the transition stage.
+  nominal_joint_positions:
+    left_hip_pitch_joint: 0.0
+    left_knee_pitch_joint: 0.0
+    left_ankle_pitch_joint: 0.0
+    left_ankle_joint: 0.0
+    right_hip_pitch_joint: 0.0
+    right_knee_pitch_joint: 0.0
+    right_ankle_pitch_joint: 0.0
+    right_ankle_joint: 0.0
+  # With reset fixed, enlarge the action space slightly so the policy can more easily take real steps.
+  action_scale: 0.26
+  # Slightly less smoothing than the previous round, giving alternating leg swings more room.
+  action_smoothing: 0.62
 
 ros:
   joint_state_topic: /joint_states
@@ -23,50 +38,79 @@ ros:
 
 sim:
   world_name: default
-  # Walking training should also use service_step, for stable step / reward alignment
+  # Walking training keeps using service_step, so one action corresponds to one reward
   step_mode: service_step
   control_dt: 0.05
-  service_step_iterations: 50
-  reset_settle_seconds: 1.2
+  # Still slightly higher than the previous walking version, to amplify the actual displacement of each stepping action.
+  service_step_iterations: 22
+  reset_settle_seconds: 1.0
+  reset_hold_steps: 6
+  # Reset hard-restores the model to its spawn pose; these values define that pose.
+  spawn_x: 0.0
+  spawn_y: 0.0
+  spawn_z: 0.35
+  spawn_roll: 0.0
+  spawn_pitch: 0.0
+  spawn_yaw: 0.0
   action_publish_delay: 0.01
   post_step_wait_seconds: 0.01
-  # During training, launch Gazebo paused and let the training program step the simulation.
-  launch_hint: ros2 launch guguji_ros2 gazebo.launch.py gui:=false pause:=true
+  # If you want to watch live, keep the GUI on as it is now and use pause:=true
+  launch_hint: ros2 launch guguji_ros2 gazebo.launch.py pause:=true
 
 task:
-  target_forward_velocity: 0.25
+  # Nudge the target velocity up again to further shrink the viable space of "shuffling in place".
+  target_forward_velocity: 0.18
   target_base_height: null
-  max_roll_rad: 0.65
-  max_pitch_rad: 0.65
-  min_base_height: 0.12
+  # Allow slightly larger body sway, leaving room for dynamic balance while stepping.
+  max_roll_rad: 0.90
+  max_pitch_rad: 0.90
+  min_base_height: 0.21
+  termination_grace_steps: 12
 
 rewards:
-  alive_bonus: 1.0
-  velocity_tracking_scale: 3.0
-  velocity_tracking_sigma: 0.25
-  upright_scale: 1.5
-  height_scale: 0.8
-  action_rate_penalty_scale: 0.04
+  # Beyond the forward reward, this version adds lightweight gait priors:
+  # 1. hip_alternation encourages the left and right hips to swing alternately
+  # 2. knee_flexion encourages moderate knee bending, helping leg lift
+  alive_bonus: 0.8
+  velocity_tracking_scale: 3.8
+  velocity_tracking_sigma: 0.10
+  forward_progress_scale: 4.2
+  hip_alternation_scale: 0.8
+  hip_target_separation: 0.35
+  hip_antiphase_sigma: 0.18
+  knee_flexion_scale: 0.6
+  knee_target: 0.22
+  knee_flexion_sigma: 0.12
+  upright_scale: 1.6
+  height_scale: 0.9
+  action_rate_penalty_scale: 0.008
   joint_limit_penalty_scale: 0.05
-  lateral_velocity_penalty_scale: 0.15
-  fall_penalty: -20.0
+  lateral_velocity_penalty_scale: 0.10
+  backward_velocity_penalty_scale: 2.0
+  stall_penalty_scale: 3.0
+  stall_velocity_threshold: 0.06
+  fall_penalty: -15.0
 
 training:
   algorithm: ppo
-  total_timesteps: 500000
+  # Run another 10000-step validation to see whether this more aggressive forward reward produces visible displacement.
+  total_timesteps: 10000
   max_episode_steps: 500
   seed: 42
   device: auto
-  learning_rate: 0.0003
-  n_steps: 1024
-  batch_size: 256
+  # Continue training directly from the walking model produced right after the reset fix.
+  init_model_path: outputs/walk_ppo_20260411_180823/final_model.zip
+  # Significantly lower the learning rate and lengthen the rollout, to avoid the regression caused by the previous overly aggressive updates.
+  learning_rate: 0.00008
+  n_steps: 256
+  batch_size: 128
   gamma: 0.99
   gae_lambda: 0.95
-  clip_range: 0.2
+  clip_range: 0.15
   ent_coef: 0.0
   vf_coef: 0.5
   policy_net_arch: [256, 256]
-  checkpoint_freq: 50000
+  checkpoint_freq: 5000
   output_root: guguji_rl/outputs
 
 evaluation:
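A quick worked example of the new stall penalty in this config (stall_velocity_threshold 0.06 m/s, stall_penalty_scale 3.0), using the same formula the rewards.py diff below adds:

```python
threshold, scale = 0.06, 3.0   # walk_ppo.yaml values
for v in (0.0, 0.03, 0.06, 0.10):
    stall = max(threshold - max(v, 0.0), 0.0)   # only forward speed counts
    print(f"forward velocity {v:.2f} m/s -> stall penalty {-scale * stall:.2f}")
# 0.00 -> -0.18, 0.03 -> -0.09, 0.06 and above -> -0.00
```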

+ 17 - 0
guguji_rl/guguji_rl/config.py

@@ -31,6 +31,12 @@ DEFAULT_CONFIG: dict[str, Any] = {
         'reset_settle_seconds': 1.0,
         'action_publish_delay': 0.01,
         'post_step_wait_seconds': 0.01,
+        'spawn_x': 0.0,
+        'spawn_y': 0.0,
+        'spawn_z': 0.35,
+        'spawn_roll': 0.0,
+        'spawn_pitch': 0.0,
+        'spawn_yaw': 0.0,
     },
     'task': {
         'target_forward_velocity': 0.0,
@@ -43,11 +49,21 @@ DEFAULT_CONFIG: dict[str, Any] = {
         'alive_bonus': 1.0,
         'velocity_tracking_scale': 1.0,
         'velocity_tracking_sigma': 0.3,
+        'forward_progress_scale': 0.0,
+        'hip_alternation_scale': 0.0,
+        'hip_target_separation': 0.3,
+        'hip_antiphase_sigma': 0.2,
+        'knee_flexion_scale': 0.0,
+        'knee_target': 0.2,
+        'knee_flexion_sigma': 0.12,
         'upright_scale': 1.0,
         'height_scale': 1.0,
         'action_rate_penalty_scale': 0.02,
         'joint_limit_penalty_scale': 0.02,
         'lateral_velocity_penalty_scale': 0.05,
+        'backward_velocity_penalty_scale': 0.0,
+        'stall_penalty_scale': 0.0,
+        'stall_velocity_threshold': 0.0,
         'fall_penalty': -10.0,
     },
     'training': {
@@ -56,6 +72,7 @@ DEFAULT_CONFIG: dict[str, Any] = {
         'max_episode_steps': 400,
         'seed': 42,
         'device': 'auto',
+        'init_model_path': None,
         'learning_rate': 3e-4,
         'n_steps': 1024,
         'batch_size': 256,

+ 69 - 13
guguji_rl/guguji_rl/envs/gazebo_biped_env.py

@@ -26,6 +26,7 @@ class GazeboBipedEnv(gym.Env):
         self.sim_config = config['sim']
         self.task_config = config['task']
         self.training_config = config['training']
+        self.configured_target_base_height = self.task_config['target_base_height']
 
         urdf_path = resolve_project_path(config, self.robot_config['urdf_path'])
         self.joint_limits: list[JointLimit] = parse_joint_limits(
@@ -38,6 +39,20 @@ class GazeboBipedEnv(gym.Env):
         self.joint_mid = np.array([joint.midpoint for joint in self.joint_limits], dtype=np.float32)
         self.joint_half_range = np.array([joint.half_range for joint in self.joint_limits], dtype=np.float32)
 
+        # The first balance round should not map policy actions straight onto the full joint-limit range,
+        # since a randomly initialized policy can often topple the robot in a single step.
+        # Use "nominal pose + small residual actions" so training starts from gentler control.
+        nominal_joint_config = self.robot_config.get('nominal_joint_positions', {})
+        self.nominal_joint_targets = np.array(
+            [float(nominal_joint_config.get(joint_name, 0.0)) for joint_name in self.joint_names],
+            dtype=np.float32,
+        )
+        self.action_scale = float(self.robot_config.get('action_scale', 1.0))
+        self.action_smoothing = float(self.robot_config.get('action_smoothing', 0.0))
+        self.action_smoothing = float(np.clip(self.action_smoothing, 0.0, 0.99))
+        self.reset_hold_steps = int(self.sim_config.get('reset_hold_steps', 1))
+        self.termination_grace_steps = int(self.task_config.get('termination_grace_steps', 0))
+
         self.interface = GugujiRos2Interface(
             joint_names=self.joint_names,
             command_topic_prefix=self.robot_config['command_topic_prefix'],
@@ -46,6 +61,14 @@ class GazeboBipedEnv(gym.Env):
             clock_topic=self.ros_config['clock_topic'],
             world_control_service=self.ros_config['world_control_service'],
             model_name=self.robot_config['model_name'],
+            world_name=self.sim_config['world_name'],
+            model_file_path=str(urdf_path),
+            spawn_x=float(self.sim_config.get('spawn_x', 0.0)),
+            spawn_y=float(self.sim_config.get('spawn_y', 0.0)),
+            spawn_z=float(self.sim_config.get('spawn_z', 0.35)),
+            spawn_roll=float(self.sim_config.get('spawn_roll', 0.0)),
+            spawn_pitch=float(self.sim_config.get('spawn_pitch', 0.0)),
+            spawn_yaw=float(self.sim_config.get('spawn_yaw', 0.0)),
         )
         self.interface.wait_for_world_control_service()
         self.reward_calculator = BipedRewardCalculator(config['rewards'])
@@ -68,16 +91,31 @@ class GazeboBipedEnv(gym.Env):
         self.previous_action = np.zeros(len(self.joint_names), dtype=np.float32)
         self.previous_snapshot: RobotStateSnapshot | None = None
         self.current_snapshot: RobotStateSnapshot | None = None
+        self.current_joint_target = self.nominal_joint_targets.copy()
         self.step_count = 0
-        self.target_base_height = self.task_config['target_base_height']
+        self.target_base_height = self.configured_target_base_height
 
     def _normalized_joint_position(self, joint_position: np.ndarray) -> np.ndarray:
         return (joint_position - self.joint_mid) / self.joint_half_range
 
     def _action_to_joint_targets(self, action: np.ndarray) -> np.ndarray:
         clipped_action = np.clip(action, -1.0, 1.0)
-        # Scale the [-1, 1] policy output into the URDF joint-limit range.
-        joint_targets = self.joint_mid + clipped_action * self.joint_half_range
+        # Scale the [-1, 1] policy output into small residual actions around the nominal pose,
+        # so PPO's early random actions can't fling the joints far away in one step.
+        desired_targets = (
+            self.nominal_joint_targets
+            + clipped_action * self.joint_half_range * self.action_scale
+        )
+        desired_targets = np.clip(desired_targets, self.joint_lower, self.joint_upper)
+
+        if self.action_smoothing > 0.0:
+            joint_targets = (
+                self.action_smoothing * self.current_joint_target
+                + (1.0 - self.action_smoothing) * desired_targets
+            )
+        else:
+            joint_targets = desired_targets
+
         return np.clip(joint_targets, self.joint_lower, self.joint_upper).astype(np.float32)
 
     def _get_target_base_height(self, snapshot: RobotStateSnapshot) -> float:
@@ -114,9 +152,20 @@ class GazeboBipedEnv(gym.Env):
         )
         return observation
 
-    def _publish_zero_action(self) -> None:
-        zero_target = {joint_name: 0.0 for joint_name in self.joint_names}
-        self.interface.publish_joint_targets(zero_target)
+    def _publish_joint_targets(self, joint_targets: np.ndarray) -> None:
+        self.interface.publish_joint_targets(joint_targets)
+
+    def _hold_nominal_pose_after_reset(self) -> None:
+        """reset 后先保持几步名义站姿,减少第一拍就摔倒的概率。"""
+        if self.sim_config['step_mode'] == 'service_step':
+            for _ in range(max(self.reset_hold_steps, 1)):
+                self._publish_joint_targets(self.current_joint_target)
+                time.sleep(float(self.sim_config['action_publish_delay']))
+                self.interface.step_world(int(self.sim_config['service_step_iterations']))
+                time.sleep(float(self.sim_config['post_step_wait_seconds']))
+        else:
+            self._publish_joint_targets(self.current_joint_target)
+            time.sleep(float(self.sim_config['reset_settle_seconds']))
 
     def _advance_simulation(self) -> None:
         if self.sim_config['step_mode'] == 'service_step':
@@ -129,6 +178,10 @@ class GazeboBipedEnv(gym.Env):
             time.sleep(float(self.sim_config['control_dt']))
 
     def _terminated(self, snapshot: RobotStateSnapshot) -> bool:
+        # Allow a brief transition in the first steps after reset, so initial numeric jitter doesn't end the episode too early.
+        if self.step_count <= self.termination_grace_steps:
+            return False
+
         roll, pitch, _ = quaternion_xyzw_to_euler(snapshot.base_quaternion)
         if abs(roll) > float(self.task_config['max_roll_rad']):
             return True
@@ -141,18 +194,20 @@ class GazeboBipedEnv(gym.Env):
     def reset(self, *, seed: int | None = None, options: dict[str, Any] | None = None):
         super().reset(seed=seed)
         pause_after_reset = self.sim_config['step_mode'] == 'service_step'
-        self.interface.reset_world(pause_after_reset=pause_after_reset)
-        self._publish_zero_action()
-        time.sleep(float(self.sim_config['reset_settle_seconds']))
-
-        if pause_after_reset:
-            self.interface.step_world(int(self.sim_config['service_step_iterations']))
+        # Use a hard reset that restores the model to its spawn pose, since Gazebo's plain reset is unreliable for dynamically spawned models.
+        self.interface.respawn_model(pause_after_reset=pause_after_reset)
+        self.current_joint_target = self.nominal_joint_targets.copy()
+        self._hold_nominal_pose_after_reset()
 
         snapshot = self.interface.wait_for_snapshot()
+        # If target_base_height is not hard-coded in the config, re-sample it at the start of each episode.
+        # The height reward then always references the stable post-reset pose, rather than being polluted by the previous episode's state.
+        self.target_base_height = self.configured_target_base_height
         self._get_target_base_height(snapshot)
         self.current_snapshot = snapshot
         self.previous_snapshot = snapshot
         self.previous_action = np.zeros(len(self.joint_names), dtype=np.float32)
+        self.current_joint_target = self.nominal_joint_targets.copy()
         self.step_count = 0
 
         observation = self._build_observation(snapshot, snapshot, self.previous_action)
@@ -170,7 +225,8 @@ class GazeboBipedEnv(gym.Env):
         self.step_count += 1
         action = np.asarray(action, dtype=np.float32)
         joint_targets = self._action_to_joint_targets(action)
-        self.interface.publish_joint_targets(joint_targets)
+        self.current_joint_target = joint_targets.copy()
+        self._publish_joint_targets(joint_targets)
         self._advance_simulation()
 
         self.previous_snapshot = self.current_snapshot

+ 45 - 0
guguji_rl/guguji_rl/rewards.py

@@ -40,6 +40,13 @@ class BipedRewardCalculator:
         sigma = max(float(self.reward_config['velocity_tracking_sigma']), 1e-6)
         velocity_error = forward_velocity - context.target_forward_velocity
         velocity_tracking = np.exp(-(velocity_error * velocity_error) / (2.0 * sigma * sigma))
+        # Only reward genuinely forward velocity, so the policy can't game the reward by swaying sideways or moving backward.
+        positive_forward_velocity = max(forward_velocity, 0.0)
+        backward_velocity = max(-forward_velocity, 0.0)
+
+        # If forward velocity stays too low, apply a stall penalty to force the policy to actually take steps.
+        stall_velocity_threshold = float(self.reward_config.get('stall_velocity_threshold', 0.0))
+        stall_penalty = max(stall_velocity_threshold - positive_forward_velocity, 0.0)
 
         action_rate_penalty = float(np.mean(np.square(context.action - context.previous_action)))
 
@@ -49,14 +56,52 @@ class BipedRewardCalculator:
             joint_limit_penalty += max(normalized - 0.9, 0.0)
         joint_limit_penalty /= max(len(context.joint_limits), 1)
 
+        joint_name_to_index = {
+            joint_limit.name: index
+            for index, joint_limit in enumerate(context.joint_limits)
+        }
+        hip_alternation_reward = 0.0
+        knee_flexion_reward = 0.0
+
+        left_hip_index = joint_name_to_index.get('left_hip_pitch_joint')
+        right_hip_index = joint_name_to_index.get('right_hip_pitch_joint')
+        if left_hip_index is not None and right_hip_index is not None:
+            left_hip = float(context.current.joint_position[left_hip_index])
+            right_hip = float(context.current.joint_position[right_hip_index])
+            hip_separation = abs(left_hip - right_hip)
+            hip_target_separation = max(float(self.reward_config.get('hip_target_separation', 0.3)), 1e-6)
+            hip_antiphase_sigma = max(float(self.reward_config.get('hip_antiphase_sigma', 0.2)), 1e-6)
+            # Encourage the hips to swing in opposite directions, with a swing amplitude that is not too small.
+            hip_separation_reward = min(hip_separation / hip_target_separation, 1.0)
+            hip_antiphase_reward = np.exp(-((left_hip + right_hip) ** 2) / (2.0 * hip_antiphase_sigma ** 2))
+            hip_alternation_reward = float(hip_separation_reward * hip_antiphase_reward)
+
+        left_knee_index = joint_name_to_index.get('left_knee_pitch_joint')
+        right_knee_index = joint_name_to_index.get('right_knee_pitch_joint')
+        if left_knee_index is not None and right_knee_index is not None:
+            left_knee = abs(float(context.current.joint_position[left_knee_index]))
+            right_knee = abs(float(context.current.joint_position[right_knee_index]))
+            knee_target = float(self.reward_config.get('knee_target', 0.2))
+            knee_sigma = max(float(self.reward_config.get('knee_flexion_sigma', 0.12)), 1e-6)
+            average_knee_flexion = 0.5 * (left_knee + right_knee)
+            # Encourage moderate knee flexion so the policy learns to lift its legs instead of dragging them stiffly.
+            knee_flexion_reward = float(
+                np.exp(-((average_knee_flexion - knee_target) ** 2) / (2.0 * knee_sigma ** 2))
+            )
+
         reward_terms = {
             'alive_bonus': float(self.reward_config['alive_bonus']),
             'velocity_tracking': float(self.reward_config['velocity_tracking_scale']) * float(velocity_tracking),
+            'forward_progress': float(self.reward_config.get('forward_progress_scale', 0.0)) * positive_forward_velocity,
+            'hip_alternation': float(self.reward_config.get('hip_alternation_scale', 0.0)) * hip_alternation_reward,
+            'knee_flexion': float(self.reward_config.get('knee_flexion_scale', 0.0)) * knee_flexion_reward,
             'upright': float(self.reward_config['upright_scale']) * float(upright_reward),
             'height': float(self.reward_config['height_scale']) * float(height_reward),
             'action_rate_penalty': -float(self.reward_config['action_rate_penalty_scale']) * action_rate_penalty,
             'joint_limit_penalty': -float(self.reward_config['joint_limit_penalty_scale']) * joint_limit_penalty,
             'lateral_velocity_penalty': -float(self.reward_config['lateral_velocity_penalty_scale']) * abs(lateral_velocity),
+            'backward_velocity_penalty': -float(self.reward_config.get('backward_velocity_penalty_scale', 0.0)) * backward_velocity,
+            'stall_penalty': -float(self.reward_config.get('stall_penalty_scale', 0.0)) * stall_penalty,
         }
 
         total_reward = sum(reward_terms.values())
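As a sanity check of the hip-alternation shaping above, plugging walk_ppo.yaml's numbers (hip_target_separation 0.35, hip_antiphase_sigma 0.18) into the same formula confirms that antiphase hip swings score near 1 while in-phase swings score near 0:

```python
import numpy as np

sep_target, sigma = 0.35, 0.18   # walk_ppo.yaml values

def hip_alternation(left_hip: float, right_hip: float) -> float:
    separation = min(abs(left_hip - right_hip) / sep_target, 1.0)
    antiphase = np.exp(-((left_hip + right_hip) ** 2) / (2.0 * sigma ** 2))
    return float(separation * antiphase)

print(hip_alternation(0.2, -0.2))   # ~1.00: opposite swing, full reward
print(hip_alternation(0.2, 0.2))    # 0.00: same-direction swing, no reward
```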

+ 156 - 27
guguji_rl/guguji_rl/ros2_interface.py

@@ -2,7 +2,10 @@ from __future__ import annotations
 
 import threading
 import time
+import math
+import subprocess
 from dataclasses import dataclass
+from pathlib import Path
 from typing import Iterable
 
 import numpy as np
@@ -73,13 +76,43 @@ class _GugujiInterfaceNode(Node):
         with self._lock:
             self._latest_clock = message
 
+    def _extract_base_pose(self, tf_message: TFMessage) -> tuple[np.ndarray, np.ndarray] | None:
+        """从 TF 消息中提取机器人整体模型在世界坐标系下的位姿。"""
+        # Gazebo PosePublisher 转成 /tf 后,整机模型通常对应 child_frame_id == model_name,
+        # 例如这里会出现 child_frame_id == "guguji"、parent == "default" 的变换。
+        for transform in tf_message.transforms:
+            if transform.child_frame_id == self.model_name:
+                base_position = np.array(
+                    [
+                        transform.transform.translation.x,
+                        transform.transform.translation.y,
+                        transform.transform.translation.z,
+                    ],
+                    dtype=np.float32,
+                )
+                base_quaternion = np.array(
+                    [
+                        transform.transform.rotation.x,
+                        transform.transform.rotation.y,
+                        transform.transform.rotation.z,
+                        transform.transform.rotation.w,
+                    ],
+                    dtype=np.float32,
+                )
+                return base_position, base_quaternion
+
+        return None
+
     def snapshot(self) -> RobotStateSnapshot | None:
         with self._lock:
             joint_state = self._latest_joint_state
             tf_message = self._latest_tf
             clock_message = self._latest_clock
 
-        if joint_state is None:
+        # The RL environment relies on joint_state, a real base pose, and the sim clock all being present.
+        # Returning early here with only joint_state would leave the base pose at its default all-zero value,
+        # making the environment believe the robot has fallen over from the very start.
+        if joint_state is None or tf_message is None or clock_message is None:
             return None
 
         joint_map = {name: index for index, name in enumerate(joint_state.name)}
@@ -95,34 +128,14 @@ class _GugujiInterfaceNode(Node):
             if source_index < len(joint_state.velocity):
                 joint_velocity[output_index] = joint_state.velocity[source_index]
 
-        base_position = np.zeros(3, dtype=np.float32)
-        base_quaternion = np.array([0.0, 0.0, 0.0, 1.0], dtype=np.float32)
-
-        if tf_message is not None:
-            for transform in tf_message.transforms:
-                if transform.child_frame_id == self.model_name:
-                    base_position = np.array(
-                        [
-                            transform.transform.translation.x,
-                            transform.transform.translation.y,
-                            transform.transform.translation.z,
-                        ],
-                        dtype=np.float32,
-                    )
-                    base_quaternion = np.array(
-                        [
-                            transform.transform.rotation.x,
-                            transform.transform.rotation.y,
-                            transform.transform.rotation.z,
-                            transform.transform.rotation.w,
-                        ],
-                        dtype=np.float32,
-                    )
-                    break
+        base_pose = self._extract_base_pose(tf_message)
+        if base_pose is None:
+            return None
+
+        base_position, base_quaternion = base_pose
 
         sim_time = 0.0
-        if clock_message is not None:
-            sim_time = clock_message.clock.sec + clock_message.clock.nanosec * 1e-9
+        sim_time = clock_message.clock.sec + clock_message.clock.nanosec * 1e-9
 
         return RobotStateSnapshot(
             sim_time=sim_time,
@@ -146,10 +159,27 @@ class GugujiRos2Interface:
         clock_topic: str,
         world_control_service: str,
         model_name: str,
+        world_name: str,
+        model_file_path: str,
+        spawn_x: float,
+        spawn_y: float,
+        spawn_z: float,
+        spawn_roll: float,
+        spawn_pitch: float,
+        spawn_yaw: float,
     ) -> None:
         self._context = Context()
         # The training program uses its own context so it doesn't interfere with external ROS 2 processes.
         rclpy.init(args=None, context=self._context)
+        self._world_name = world_name
+        self._model_name = model_name
+        self._model_file_path = str(Path(model_file_path).resolve())
+        self._spawn_x = float(spawn_x)
+        self._spawn_y = float(spawn_y)
+        self._spawn_z = float(spawn_z)
+        self._spawn_roll = float(spawn_roll)
+        self._spawn_pitch = float(spawn_pitch)
+        self._spawn_yaw = float(spawn_yaw)
 
         self._node = _GugujiInterfaceNode(
             context=self._context,
@@ -168,6 +198,105 @@ class GugujiRos2Interface:
         self._spin_thread = threading.Thread(target=self._executor.spin, daemon=True)
         self._spin_thread.start()
 
+    def _run_ign_service(
+        self,
+        *,
+        service_name: str,
+        request_type: str,
+        response_type: str,
+        request_payload: str,
+        timeout_ms: int = 5000,
+        allow_false_response: bool = False,
+    ) -> bool:
+        """调用 Gazebo 原生 transport service。"""
+        completed = subprocess.run(
+            [
+                'ign',
+                'service',
+                '-s',
+                service_name,
+                '--reqtype',
+                request_type,
+                '--reptype',
+                response_type,
+                '--timeout',
+                str(timeout_ms),
+                '--req',
+                request_payload,
+            ],
+            check=False,
+            capture_output=True,
+            text=True,
+        )
+        if completed.returncode != 0:
+            raise RuntimeError(
+                f'Gazebo service call failed: {service_name}\n'
+                f'stdout: {completed.stdout}\n'
+                f'stderr: {completed.stderr}'
+            )
+
+        output = completed.stdout.strip().lower()
+        if 'data: true' in output:
+            return True
+        if allow_false_response and 'data: false' in output:
+            return False
+        raise RuntimeError(
+            f'Gazebo service returned an unexpected result: {service_name}\n'
+            f'stdout: {completed.stdout}\n'
+            f'stderr: {completed.stderr}'
+        )
+
+    def _spawn_quaternion_xyzw(self) -> tuple[float, float, float, float]:
+        """把配置中的 RPY 初始姿态转换成 Gazebo 需要的四元数。"""
+        half_roll = self._spawn_roll * 0.5
+        half_pitch = self._spawn_pitch * 0.5
+        half_yaw = self._spawn_yaw * 0.5
+
+        cr = math.cos(half_roll)
+        sr = math.sin(half_roll)
+        cp = math.cos(half_pitch)
+        sp = math.sin(half_pitch)
+        cy = math.cos(half_yaw)
+        sy = math.sin(half_yaw)
+
+        x = sr * cp * cy - cr * sp * sy
+        y = cr * sp * cy + sr * cp * sy
+        z = cr * cp * sy - sr * sp * cy
+        w = cr * cp * cy + sr * sp * sy
+        return x, y, z, w
+
+    def respawn_model(self, *, pause_after_reset: bool) -> None:
+        """通过 model_only reset + set_pose 做真正的模型复位。"""
+        # Fortress 对动态生成模型执行 world reset 时,往往不会可靠恢复起始位姿;
+        # 但如果直接 remove/create,又会触发当前 JointStatePublisher 插件崩溃。
+        # 这里改用更稳妥的方案:先 reset 模型状态,再把整机姿态强制放回初始位置。
+        self.control_world(
+            pause=True,
+            reset_model_only=True,
+            timeout=5.0,
+        )
+        time.sleep(0.1)
+
+        qx, qy, qz, qw = self._spawn_quaternion_xyzw()
+        set_pose_payload = (
+            f'name: "{self._model_name}" '
+            'position: { '
+            f'x: {self._spawn_x} y: {self._spawn_y} z: {self._spawn_z} '
+            '} '
+            'orientation: { '
+            f'x: {qx} y: {qy} z: {qz} w: {qw} '
+            '}'
+        )
+        self._run_ign_service(
+            service_name=f'/world/{self._world_name}/set_pose',
+            request_type='ignition.msgs.Pose',
+            response_type='ignition.msgs.Boolean',
+            request_payload=set_pose_payload,
+            timeout_ms=5000,
+        )
+        time.sleep(0.2)
+        self.control_world(pause=pause_after_reset, timeout=5.0)
+
     def wait_for_world_control_service(self, timeout: float = 10.0) -> None:
         deadline = time.time() + timeout
         while time.time() < deadline:
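As a sanity check of `_spawn_quaternion_xyzw` above: the same ZYX roll-pitch-yaw half-angle formula reproduces the expected quaternion for a pure 90° yaw (a standalone sketch, not code from this commit):

```python
import math

def rpy_to_quaternion_xyzw(roll: float, pitch: float, yaw: float):
    # Same half-angle products as _spawn_quaternion_xyzw.
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (sr * cp * cy - cr * sp * sy,
            cr * sp * cy + sr * cp * sy,
            cr * cp * sy - sr * sp * cy,
            cr * cp * cy + sr * sp * sy)

print(rpy_to_quaternion_xyzw(0.0, 0.0, math.pi / 2))
# (0.0, 0.0, 0.7071..., 0.7071...): a pure yaw rotates about z only
```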

+ 1 - 0
guguji_rl/pyproject.toml

@@ -15,6 +15,7 @@ dependencies = [
   "rich>=13.7",
   "stable-baselines3>=2.3",
   "tensorboard>=2.14",
+  "tqdm>=4.66",
 ]
 
 [tool.setuptools.packages.find]

+ 1 - 0
guguji_rl/requirements.txt

@@ -7,3 +7,4 @@ PyYAML>=6.0
 rich>=13.7
 stable-baselines3>=2.3
 tensorboard>=2.14
+tqdm>=4.66

+ 15 - 0
guguji_rl/scripts/train.py

@@ -40,6 +40,11 @@ def parse_args() -> argparse.Namespace:
         default=None,
         help='Optionally override total_timesteps from the config file',
     )
+    parser.add_argument(
+        '--init-model',
+        default=None,
+        help='Optionally point to an existing PPO model for continued training or curriculum-learning initialization',
+    )
     return parser.parse_args()
 
 
@@ -72,6 +77,8 @@ def main() -> int:
         config['training']['device'] = args.device
     if args.total_timesteps is not None:
         config['training']['total_timesteps'] = args.total_timesteps
+    if args.init_model is not None:
+        config['training']['init_model_path'] = str(resolve_input_path(args.init_model))
 
     # Resolve the training device in one place, so switching CPU / GPU only requires editing the YAML.
     config['training']['device'] = resolve_device(config['training']['device'])
@@ -116,6 +123,14 @@ def main() -> int:
         policy_kwargs=policy_kwargs,
     )
 
+    init_model_path = config['training'].get('init_model_path')
+    if init_model_path:
+        resolved_init_model_path = resolve_input_path(str(init_model_path))
+        # Rather than loading the whole PPO object, pour the old model's parameters into the new model.
+        # Benefit: the hyperparameters still come from the current config file; only the previously learned policy weights are reused.
+        model.set_parameters(str(resolved_init_model_path), exact_match=False, device=config['training']['device'])
+        print(f"已加载课程初始化模型: {resolved_init_model_path}")
+
     model.learn(
         total_timesteps=int(config['training']['total_timesteps']),
         callback=checkpoint_callback,