
Add a finer-grained curriculum stage to the walking config so the target velocity ramps up step by step from 0.18 -> 0.22 -> 0.26, update the training parameters accordingly, and enable GPU training

corvin_zhang · 1 week ago
Commit 6b961ef0e2

+ 24 - 0
docs/guguji_biped_rl_guide.md

@@ -235,6 +235,7 @@ ros2 launch guguji_ros2 gazebo.launch.py gui:=false pause:=true
 4. Once a basic balance model is trained, use `forward_transition_ppo.yaml` for a gentle forward transition.
 5. Watch `/joint_states` and `/tf` to check whether the robot keeps falling over and whether a slight forward trend is emerging.
 6. After adjusting rewards and joint ranges, switch to `walk_ppo.yaml`.
+7. `walk_ppo.yaml` now supports automatic curriculum learning and raises the target velocity stage by stage along `0.18 -> 0.22 -> 0.26 m/s`.
 
 ## 11. How to quickly tell whether this round of training actually walks further forward
 
@@ -258,6 +259,29 @@ python3 scripts/evaluate_forward_progress.py \
 
 This way, when you later compare different walking configs, you no longer have to eyeball whether each model walks further; the terminal prints a quantified result directly.
 
+## 12. Why the walking phase is better served by a three-stage velocity curriculum
+
+If you push `task.target_forward_velocity` to a high value right away, a biped robot usually fails in one of two ways:
+
+- it flails its legs in place, but the body never moves forward steadily
+- it lunges straight forward chasing the target speed, and the episode ends very quickly
+
+So the current `configs/walk_ppo.yaml` has a three-stage curriculum built in:
+
+- `0.18 m/s`
+- `0.22 m/s`
+- `0.26 m/s`
+
+For training, just run once:
+
+```bash
+cd /home/corvin/Project/guguji_simulation/guguji_rl
+source .venv/bin/activate
+python3 scripts/train.py --config configs/walk_ppo.yaml
+```
+
+`train.py` chains these three stages automatically and saves each stage's model into its own stage subdirectory under the output directory.
+
 ## 10. How to keep controlling the robot in ROS 2 after training
 
 After training finishes, the recommended workflow has two steps:
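
A quick way to check that later stages really do walk further is to run the same forward-progress evaluation over each stage model that `train.py` saves. A minimal sketch, reusing `evaluate_forward_progress` the way `scripts/train.py` does — its exact import path is not shown in this diff and is an assumption here, as is the `outputs/walk_ppo_<timestamp>` placeholder:

```python
from pathlib import Path

from guguji_rl.config import load_config
# Assumed import path; scripts/train.py uses this helper, but the diff does
# not show which module it comes from.
from guguji_rl.evaluation import evaluate_forward_progress

run_dir = Path('outputs/walk_ppo_<timestamp>')  # replace with your actual run directory
config = load_config(run_dir / 'resolved_config.yaml')

# Stage directories follow the NN_<name> pattern from scripts/train.py,
# e.g. 01_walk_v018, 02_walk_v022, 03_walk_v026.
for stage_dir in sorted(run_dir.glob('[0-9][0-9]_*')):
    model_path = stage_dir / 'final_model.zip'
    if not model_path.exists():
        continue
    summary = evaluate_forward_progress(
        config=config,
        model_path=model_path,
        episodes=3,
        max_steps=500,
    )
    print(stage_dir.name, summary)
```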

+ 16 - 0
guguji_rl/README.md

@@ -28,6 +28,7 @@
 3. First run `balance_ppo.yaml` to train standing balance
 4. Then run `forward_transition_ppo.yaml` for gentle forward-transition training
 5. Finally run `walk_ppo.yaml` to train proper forward walking
+6. `walk_ppo.yaml` now ships with built-in walking curriculum stages at `0.18 -> 0.22 -> 0.26`
 6. After a model is trained, first check forward progress with `evaluate_forward_progress.py`, then use `run_policy.py` for continuous inference control in the ROS 2 system
 7. After each training run, `train.py` automatically prints one `delta_x / mean_vx` forward-progress evaluation
 
@@ -100,3 +101,18 @@ python3 scripts/train.py \
   --config configs/forward_transition_ppo.yaml \
   --init-model outputs/<your_balance_run_dir>/final_model.zip
 ```
+
+For `walk_ppo.yaml`, you no longer need to launch training three separate times by hand.
+Just run it once:
+
+```bash
+cd /home/corvin/Project/guguji_simulation/guguji_rl
+source .venv/bin/activate
+python3 scripts/train.py --config configs/walk_ppo.yaml
+```
+
+`train.py` will continue training through the three curriculum stages below in order, saving each stage's model to its own subdirectory:
+
+- `0.18 m/s`
+- `0.22 m/s`
+- `0.26 m/s`

+ 67 - 45
guguji_rl/configs/walk_ppo.yaml

@@ -14,31 +14,35 @@ robot:
     - right_ankle_pitch_joint
     - right_ankle_joint
   command_topic_prefix: /guguji/command
-  # The first walking version still used the "nominal stance + residual action" scheme,
-  # so it could directly reuse the stable-posture skills learned in the transition stage
+  # This round changes the nominal stance to a "slightly bent knees + re-centered ankles" ready pose,
+  # which makes it easier for the legs to start stepping instead of standing bolt upright and only wobbling in place
   nominal_joint_positions:
-    left_hip_pitch_joint: 0.0
-    left_knee_pitch_joint: 0.0
-    left_ankle_pitch_joint: 0.0
+    left_hip_pitch_joint: 0.04
+    left_knee_pitch_joint: 0.18
+    left_ankle_pitch_joint: -0.10
     left_ankle_joint: 0.0
-    right_hip_pitch_joint: 0.0
-    right_knee_pitch_joint: 0.0
-    right_ankle_pitch_joint: 0.0
+    right_hip_pitch_joint: 0.04
+    right_knee_pitch_joint: 0.18
+    right_ankle_pitch_joint: -0.10
     right_ankle_joint: 0.0
-  # This round switched to "reference gait + residual action":
-  # first give the legs a clear alternating swing rhythm, then let the policy learn to stabilize it and turn the gait into forward motion
+  # This round keeps "reference gait + residual action",
+  # but shrinks the residual action further so the reference gait drives the stepping and PPO mostly fine-tunes
   reference_gait:
     enabled: true
-    period: 0.80
-    # Make the swing phase shorter and the stance phase longer, closer to the rhythm of a real forward walk.
-    stance_ratio: 0.62
-    hip_pitch_amplitude: 0.28
-    knee_pitch_amplitude: 0.36
-    knee_pitch_bias: 0.08
-    ankle_pitch_amplitude: 0.18
-  # The reference gait already swings the legs actively, so the residual action range can be tightened a bit.
-  action_scale: 0.18
-  action_smoothing: 0.72
+    period: 0.72
+    # A slightly longer stance keeps foot placement stable; a slightly faster swing keeps a crisp stepping rhythm.
+    stance_ratio: 0.60
+    hip_pitch_amplitude: 0.34
+    hip_pitch_bias: 0.04
+    knee_pitch_amplitude: 0.46
+    knee_pitch_bias: 0.12
+    swing_knee_scale: 1.10
+    ankle_pitch_amplitude: 0.22
+    ankle_pitch_bias: -0.05
+    push_off_ankle_scale: 0.22
+  # Shrink the residual action further, so even if old policy weights carry a bias they are unlikely to knock the robot straight over.
+  action_scale: 0.08
+  action_smoothing: 0.82
 
 ros:
   joint_state_topic: /joint_states
@@ -51,7 +55,7 @@ sim:
   # Walking training keeps using service_step so that one action corresponds to one reward step.
   step_mode: service_step
   control_dt: 0.05
-  # Still slightly higher than the previous walking version, to amplify the real displacement of each step
+  # This version dials back how long one action acts in Gazebo a little, prioritizing stability
   service_step_iterations: 22
   reset_settle_seconds: 1.0
   reset_hold_steps: 6
@@ -68,49 +72,67 @@ sim:
   launch_hint: ros2 launch guguji_ros2 gazebo.launch.py pause:=true
 
 task:
-  # With the reference gait wired in, keep pushing the target velocity up, forcing the policy to turn the leg swings into forward displacement.
-  target_forward_velocity: 0.22
+  # This keeps the final-stage target velocity; actual training is driven by the curriculum_stages below,
+  # which go 0.18 -> 0.22 first and then up to 0.26, so walking never starts at full intensity.
+  target_forward_velocity: 0.26
   target_base_height: null
   # Allow slightly larger body sway, leaving room for dynamic balance while stepping.
   max_roll_rad: 0.90
   max_pitch_rad: 0.90
   min_base_height: 0.21
-  termination_grace_steps: 12
+  termination_grace_steps: 18
 
 rewards:
-  # The reference gait already takes care of "start stepping"; these rewards focused on "move forward once stepping"
+  # This round the reward function leans more toward "actually moving forward", while keeping the basic stability constraints
   alive_bonus: 0.6
-  velocity_tracking_scale: 4.2
-  velocity_tracking_sigma: 0.09
-  forward_progress_scale: 5.0
-  hip_alternation_scale: 0.4
-  hip_target_separation: 0.35
+  velocity_tracking_scale: 4.8
+  velocity_tracking_sigma: 0.10
+  forward_progress_scale: 6.0
+  hip_alternation_scale: 0.5
+  hip_target_separation: 0.36
   hip_antiphase_sigma: 0.18
-  knee_flexion_scale: 0.3
-  knee_target: 0.22
-  knee_flexion_sigma: 0.12
+  knee_flexion_scale: 0.35
+  knee_target: 0.28
+  knee_flexion_sigma: 0.15
   upright_scale: 1.6
   height_scale: 0.9
-  action_rate_penalty_scale: 0.006
+  action_rate_penalty_scale: 0.004
   joint_limit_penalty_scale: 0.05
-  lateral_velocity_penalty_scale: 0.10
-  backward_velocity_penalty_scale: 2.5
-  stall_penalty_scale: 4.0
-  stall_velocity_threshold: 0.08
+  lateral_velocity_penalty_scale: 0.08
+  backward_velocity_penalty_scale: 2.8
+  stall_penalty_scale: 4.6
+  stall_velocity_threshold: 0.10
   fall_penalty: -15.0
 
 training:
   algorithm: ppo
-  # Keep running the 10000-step check to see whether this more aggressive forward reward produces clear displacement.
+  # total_timesteps here is only the fallback for no-curriculum mode;
+  # whenever curriculum_stages is non-empty, train.py trains stage by stage using each stage's own timesteps.
   total_timesteps: 10000
   max_episode_steps: 500
   seed: 42
-  device: auto
-  # Since the control structure changed a lot after wiring in the reference gait, keep initializing from the latest walking model,
-  # but push the learning rate lower so the policy adapts to the new rhythm by fine-tuning.
-  init_model_path: outputs/walk_ppo_20260412_181640/final_model.zip
-  # Cut the learning rate sharply and lengthen the rollout, to avoid the regression caused by the previous overly aggressive updates.
-  learning_rate: 0.00008
+  device: cuda
+  # Continue from the latest walking model so the basic leg-swing skills already learned are not thrown away.
+  init_model_path: outputs/walk_ppo_20260412_183147/final_model.zip
+  # This curriculum step should not explore too noisily, or the stepping rhythm the old policy just learned gets scattered.
+  initial_log_std: -2.2
+  # Walking becomes a three-stage curriculum:
+  # first let the robot adapt to stable low-speed walking at 0.18, then climb to 0.22 and 0.26.
+  curriculum_stages:
+    - name: walk_v018
+      target_forward_velocity: 0.18
+      total_timesteps: 3000
+      initial_log_std: -2.3
+    - name: walk_v022
+      target_forward_velocity: 0.22
+      total_timesteps: 3000
+      initial_log_std: -2.25
+    - name: walk_v026
+      target_forward_velocity: 0.26
+      total_timesteps: 4000
+      initial_log_std: -2.2
+  # Tighten the learning rate a bit more, to avoid wrecking stability in the first rounds under the new gait logic.
+  learning_rate: 0.00006
   n_steps: 256
   batch_size: 128
   gamma: 0.99
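
Before committing to a long run, it can help to print the schedule this config encodes. A minimal sketch using plain PyYAML (the relative path assumes you run it from the repository root; `load_config` from `guguji_rl.config` would also work and fills in defaults):

```python
import yaml

with open('guguji_rl/configs/walk_ppo.yaml') as handle:
    config = yaml.safe_load(handle)

total = 0
for index, stage in enumerate(config['training'].get('curriculum_stages') or [], start=1):
    total += int(stage['total_timesteps'])
    print(
        f"stage {index}: {stage['name']}  "
        f"v={stage['target_forward_velocity']} m/s  "
        f"timesteps={stage['total_timesteps']}  "
        f"log_std={stage.get('initial_log_std')}"
    )
# 3000 + 3000 + 4000 = 10000, matching the no-curriculum fallback total_timesteps.
print('total timesteps across stages:', total)
```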

+ 23 - 12
guguji_rl/guguji_rl/config.py

@@ -6,6 +6,9 @@ from typing import Any
 
 import yaml
 
+MODULE_ROOT = Path(__file__).resolve().parents[1]
+PROJECT_ROOT = MODULE_ROOT.parent
+
 
 DEFAULT_CONFIG: dict[str, Any] = {
     'experiment': {
@@ -21,9 +24,13 @@ DEFAULT_CONFIG: dict[str, Any] = {
             'period': 0.9,
             'stance_ratio': 0.62,
             'hip_pitch_amplitude': 0.0,
+            'hip_pitch_bias': 0.0,
             'knee_pitch_amplitude': 0.0,
             'knee_pitch_bias': 0.0,
+            'swing_knee_scale': 1.0,
             'ankle_pitch_amplitude': 0.0,
+            'ankle_pitch_bias': 0.0,
+            'push_off_ankle_scale': 0.0,
         },
     },
     'ros': {
@@ -75,16 +82,18 @@ DEFAULT_CONFIG: dict[str, Any] = {
         'stall_velocity_threshold': 0.0,
         'fall_penalty': -10.0,
     },
-    'training': {
-        'algorithm': 'ppo',
-        'total_timesteps': 200000,
-        'max_episode_steps': 400,
-        'seed': 42,
-        'device': 'auto',
-        'init_model_path': None,
-        'learning_rate': 3e-4,
-        'n_steps': 1024,
-        'batch_size': 256,
+        'training': {
+            'algorithm': 'ppo',
+            'total_timesteps': 200000,
+            'max_episode_steps': 400,
+            'seed': 42,
+            'device': 'auto',
+            'init_model_path': None,
+            'initial_log_std': None,
+            'curriculum_stages': [],
+            'learning_rate': 3e-4,
+            'n_steps': 1024,
+            'batch_size': 256,
         'gamma': 0.99,
         'gae_lambda': 0.95,
         'clip_range': 0.2,
@@ -122,8 +131,10 @@ def load_config(config_path: str | Path) -> dict[str, Any]:
     config = _deep_update(copy.deepcopy(DEFAULT_CONFIG), user_config)
     config['meta'] = {
         'config_path': str(config_path),
-        'project_root': str(config_path.parents[2]),
-        'rl_root': str(config_path.parents[1]),
+        # Stop assuming the config file always sits at a fixed level inside the repo,
+        # so external ad-hoc configs and exported experiment configs can be used for training directly.
+        'project_root': str(PROJECT_ROOT),
+        'rl_root': str(MODULE_ROOT),
     }
     return config
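
`load_config` relies on `_deep_update`, which sits outside this hunk. A sketch of a typical recursive merge consistent with how it is used here (not the repo's verbatim code):

```python
from typing import Any


def _deep_update(base: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `update` into `base`; nested dicts merge key by key."""
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _deep_update(base[key], value)
        else:
            # Scalars and lists replace the default outright.
            base[key] = value
    return base
```

Under this merge behavior, a user-supplied `curriculum_stages` list replaces the empty default wholesale rather than being merged element by element, which is why `[]` is a safe default.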
 

+ 24 - 15
guguji_rl/guguji_rl/envs/gazebo_biped_env.py

@@ -135,43 +135,52 @@ class GazeboBipedEnv(gym.Env):
             return offsets
 
         hip_amplitude = float(self.reference_gait_config.get('hip_pitch_amplitude', 0.0))
+        hip_bias = float(self.reference_gait_config.get('hip_pitch_bias', 0.0))
         knee_amplitude = float(self.reference_gait_config.get('knee_pitch_amplitude', 0.0))
         knee_bias = float(self.reference_gait_config.get('knee_pitch_bias', 0.0))
+        swing_knee_scale = float(self.reference_gait_config.get('swing_knee_scale', 1.0))
         ankle_amplitude = float(self.reference_gait_config.get('ankle_pitch_amplitude', 0.0))
+        ankle_bias = float(self.reference_gait_config.get('ankle_pitch_bias', 0.0))
+        push_off_ankle_scale = float(self.reference_gait_config.get('push_off_ankle_scale', 0.0))
         stance_ratio = float(self.reference_gait_config.get('stance_ratio', 0.62))
         stance_ratio = float(np.clip(stance_ratio, 0.05, 0.95))
 
-        def gait_profile(phase: float) -> tuple[float, float]:
-            """返回该相位下的髋关节目标形状和膝关节摆动形状。"""
+        def gait_profile(phase: float) -> tuple[float, float, float]:
+            """返回该相位下的髋、膝、踝参考轨迹形状。"""
             phase = phase % 1.0
             if phase < stance_ratio:
                 stance_progress = phase / stance_ratio
-                # Stance phase: the leg sweeps slowly backward, building a push-off tendency.
-                hip_profile = -1.0 + 2.0 * stance_progress
-                knee_profile = 0.0
+                # Stance phase: the foot should land ahead of the body and then gradually sweep back past it;
+                # that is what actually carries the body forward, instead of marching in place.
+                hip_profile = 1.0 - 2.0 * stance_progress
+                knee_profile = 0.18 * math.sin(math.pi * stance_progress)
+                push_off_progress = max((stance_progress - 0.60) / 0.40, 0.0)
+                ankle_profile = -0.70 * hip_profile + push_off_ankle_scale * push_off_progress
             else:
                 swing_progress = (phase - stance_ratio) / (1.0 - stance_ratio)
-                # Swing phase: the leg swings back to the front faster, with the knee lifting for the step.
-                hip_profile = 1.0 - 2.0 * swing_progress
-                knee_profile = math.sin(math.pi * swing_progress)
-            return hip_profile, knee_profile
+                # Swing phase: the leg swings quickly from behind the body back to the front while the knee lifts clearly,
+                # which makes the stepping motion much easier to see in Gazebo.
+                hip_profile = -1.0 + 2.0 * swing_progress
+                knee_profile = swing_knee_scale * math.sin(math.pi * swing_progress)
+                ankle_profile = -0.45 * hip_profile - 0.25 * math.sin(math.pi * swing_progress)
+            return hip_profile, knee_profile, ankle_profile
 
-        left_hip_profile, left_knee_profile = gait_profile(self.gait_phase)
-        right_hip_profile, right_knee_profile = gait_profile(self.gait_phase + 0.5)
+        left_hip_profile, left_knee_profile, left_ankle_profile = gait_profile(self.gait_phase)
+        right_hip_profile, right_knee_profile, right_ankle_profile = gait_profile(self.gait_phase + 0.5)
 
         for index, joint_name in enumerate(self.joint_names):
             if joint_name == 'left_hip_pitch_joint':
-                offsets[index] = hip_amplitude * left_hip_profile
+                offsets[index] = hip_bias + hip_amplitude * left_hip_profile
             elif joint_name == 'right_hip_pitch_joint':
-                offsets[index] = hip_amplitude * right_hip_profile
+                offsets[index] = hip_bias + hip_amplitude * right_hip_profile
             elif joint_name == 'left_knee_pitch_joint':
                 offsets[index] = knee_bias + knee_amplitude * left_knee_profile
             elif joint_name == 'right_knee_pitch_joint':
                 offsets[index] = knee_bias + knee_amplitude * right_knee_profile
             elif joint_name == 'left_ankle_pitch_joint':
-                offsets[index] = -ankle_amplitude * left_hip_profile
+                offsets[index] = ankle_bias + ankle_amplitude * left_ankle_profile
             elif joint_name == 'right_ankle_pitch_joint':
-                offsets[index] = -ankle_amplitude * right_hip_profile
+                offsets[index] = ankle_bias + ankle_amplitude * right_ankle_profile
 
         return offsets
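
To see what stepping shape the new profile produces without launching Gazebo, the gait math above can be sampled standalone. A minimal sketch — the formulas are copied from the `+` lines above, and the constants are the values from `configs/walk_ppo.yaml`; it is not part of the commit:

```python
import math

STANCE_RATIO = 0.60          # walk_ppo.yaml: reference_gait.stance_ratio
SWING_KNEE_SCALE = 1.10      # walk_ppo.yaml: reference_gait.swing_knee_scale
PUSH_OFF_ANKLE_SCALE = 0.22  # walk_ppo.yaml: reference_gait.push_off_ankle_scale


def gait_profile(phase: float) -> tuple[float, float, float]:
    phase = phase % 1.0
    if phase < STANCE_RATIO:
        p = phase / STANCE_RATIO
        hip = 1.0 - 2.0 * p                    # foot lands ahead, sweeps behind
        knee = 0.18 * math.sin(math.pi * p)
        push_off = max((p - 0.60) / 0.40, 0.0)
        ankle = -0.70 * hip + PUSH_OFF_ANKLE_SCALE * push_off
    else:
        p = (phase - STANCE_RATIO) / (1.0 - STANCE_RATIO)
        hip = -1.0 + 2.0 * p                   # fast forward return swing
        knee = SWING_KNEE_SCALE * math.sin(math.pi * p)
        ankle = -0.45 * hip - 0.25 * math.sin(math.pi * p)
    return hip, knee, ankle


# The right leg runs the same profile offset by half a period (phase + 0.5).
for i in range(10):
    hip, knee, ankle = gait_profile(i / 10)
    print(f'phase={i / 10:.1f}  hip={hip:+.2f}  knee={knee:+.2f}  ankle={ankle:+.2f}')
```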
 

+ 22 - 7
guguji_rl/guguji_rl/ros2_interface.py

@@ -287,13 +287,28 @@ class GugujiRos2Interface:
             f'x: {qx} y: {qy} z: {qz} w: {qw} '
             '}'
         )
-        self._run_ign_service(
-            service_name=f'/world/{self._world_name}/set_pose',
-            request_type='ignition.msgs.Pose',
-            response_type='ignition.msgs.Boolean',
-            request_payload=set_pose_payload,
-            timeout_ms=5000,
-        )
+        last_error: RuntimeError | None = None
+        for attempt_index in range(3):
+            try:
+                # Gazebo Fortress occasionally lets the set_pose service respond slowly during long training runs,
+                # so use a more generous timeout and retry briefly on failure, so a whole run isn't killed by one transient hiccup.
+                self._run_ign_service(
+                    service_name=f'/world/{self._world_name}/set_pose',
+                    request_type='ignition.msgs.Pose',
+                    response_type='ignition.msgs.Boolean',
+                    request_payload=set_pose_payload,
+                    timeout_ms=12000,
+                )
+                last_error = None
+                break
+            except RuntimeError as error:
+                last_error = error
+                self.control_world(pause=True, timeout=5.0)
+                time.sleep(0.2 * (attempt_index + 1))
+
+        if last_error is not None:
+            raise last_error
+
         time.sleep(0.2)
         self.control_world(pause=pause_after_reset, timeout=5.0)
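
The same retry shape could be factored into a small helper if other `ign` services turn out to be flaky too. A sketch of the pattern in isolation (a hypothetical helper, not part of this commit):

```python
import time
from collections.abc import Callable


def call_with_retries(call: Callable[[], None], attempts: int = 3,
                      base_delay: float = 0.2) -> None:
    """Run `call`, retrying with linear backoff; re-raise the final failure."""
    last_error: RuntimeError | None = None
    for attempt_index in range(attempts):
        try:
            call()
            return                                        # success: stop retrying
        except RuntimeError as error:
            last_error = error
            time.sleep(base_delay * (attempt_index + 1))  # linear backoff
    if last_error is not None:
        raise last_error
```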
 

+ 143 - 47
guguji_rl/scripts/train.py

@@ -1,9 +1,11 @@
 from __future__ import annotations
 
 import argparse
+import copy
 import sys
 from datetime import datetime
 from pathlib import Path
+from typing import Any
 
 SCRIPT_ROOT = Path(__file__).resolve().parents[1]
 if str(SCRIPT_ROOT) not in sys.path:
@@ -61,6 +63,62 @@ def resolve_input_path(path_str: str) -> Path:
     return SCRIPT_ROOT / path
 
 
+def maybe_override_policy_log_std(model: object, initial_log_std: float | None) -> None:
+    """可选地缩小 PPO 的初始探索方差,适合课程学习后的精修阶段。"""
+    if initial_log_std is None:
+        return
+    policy = getattr(model, 'policy', None)
+    if policy is None or not hasattr(policy, 'log_std'):
+        raise RuntimeError('The current policy object does not support setting log_std directly.')
+
+    # Set the log standard deviation of every action dimension to one uniform value,
+    # which lowers exploration noise when continuing training on an existing gait and cuts down pointless kicking.
+    policy.log_std.data.fill_(float(initial_log_std))
+    print(f'Set initial policy log_std to: {float(initial_log_std):.3f}')
+
+
+def sanitize_stage_name(stage_name: str) -> str:
+    sanitized = ''.join(
+        character if character.isalnum() or character in {'-', '_'} else '_'
+        for character in stage_name.strip()
+    )
+    return sanitized.strip('_') or 'stage'
+
+
+def build_curriculum_stage_configs(config: dict[str, Any]) -> list[tuple[str | None, dict[str, Any]]]:
+    """把课程学习阶段展开成一组可直接训练的独立配置。"""
+    raw_stages = config['training'].get('curriculum_stages') or []
+    if not raw_stages:
+        single_stage_config = copy.deepcopy(config)
+        single_stage_config['training'].pop('curriculum_stages', None)
+        return [(None, single_stage_config)]
+
+    stage_configs: list[tuple[str | None, dict[str, Any]]] = []
+    for stage_index, raw_stage in enumerate(raw_stages, start=1):
+        if not isinstance(raw_stage, dict):
+            raise RuntimeError('Every stage in training.curriculum_stages must be a dict.')
+
+        stage_config = copy.deepcopy(config)
+        stage_config['training'].pop('curriculum_stages', None)
+
+        raw_name = str(raw_stage.get('name') or f'stage_{stage_index}')
+        stage_name = f'{stage_index:02d}_{sanitize_stage_name(raw_name)}'
+
+        # A curriculum stage currently controls "target forward velocity + this stage's timesteps + exploration variance".
+        # This lets walking ramp up from slow to fast stage by stage, instead of pushing the target velocity up all at once.
+        if 'target_forward_velocity' in raw_stage:
+            stage_config['task']['target_forward_velocity'] = float(raw_stage['target_forward_velocity'])
+        if 'total_timesteps' in raw_stage:
+            stage_config['training']['total_timesteps'] = int(raw_stage['total_timesteps'])
+        if 'initial_log_std' in raw_stage:
+            stage_config['training']['initial_log_std'] = float(raw_stage['initial_log_std'])
+
+        stage_config['experiment']['name'] = f"{config['experiment']['name']}_{stage_name}"
+        stage_configs.append((stage_name, stage_config))
+
+    return stage_configs
+
+
 def main() -> int:
     args = parse_args()
 
@@ -95,65 +153,103 @@ def main() -> int:
     run_dir.mkdir(parents=True, exist_ok=True)
     # Save a copy of the fully expanded config so the experiment can be reproduced later.
     save_yaml(config, run_dir / 'resolved_config.yaml')
-
-    env = Monitor(GazeboBipedEnv(config))
-    checkpoint_callback = CheckpointCallback(
-        save_freq=max(int(config['training']['checkpoint_freq']), 1),
-        save_path=str(run_dir / 'checkpoints'),
-        name_prefix='guguji_ppo',
-    )
-
-    policy_kwargs = {
-        'net_arch': list(config['training']['policy_net_arch']),
-    }
+    stage_configs = build_curriculum_stage_configs(config)
 
     print(f"训练设备: {config['training']['device']}")
     print(f"输出目录: {run_dir}")
+    model = None
+    final_model_path = run_dir / 'final_model'
+    final_stage_config = config
 
-    # Get the training loop working end to end with MLP + PPO first; you can scale the network up step by step later.
-    model = PPO(
-        policy='MlpPolicy',
-        env=env,
-        verbose=1,
-        seed=int(config['training']['seed']),
-        learning_rate=float(config['training']['learning_rate']),
-        n_steps=int(config['training']['n_steps']),
-        batch_size=int(config['training']['batch_size']),
-        gamma=float(config['training']['gamma']),
-        gae_lambda=float(config['training']['gae_lambda']),
-        clip_range=float(config['training']['clip_range']),
-        ent_coef=float(config['training']['ent_coef']),
-        vf_coef=float(config['training']['vf_coef']),
-        device=config['training']['device'],
-        tensorboard_log=str(run_dir / 'tensorboard'),
-        policy_kwargs=policy_kwargs,
-    )
+    for stage_index, (stage_name, stage_config) in enumerate(stage_configs, start=1):
+        stage_dir = run_dir if stage_name is None else run_dir / stage_name
+        stage_dir.mkdir(parents=True, exist_ok=True)
+        # Save the effective config for each stage separately; this makes it much easier to review experiments later.
+        save_yaml(stage_config, stage_dir / 'resolved_config.yaml')
 
-    init_model_path = config['training'].get('init_model_path')
-    if init_model_path:
-        resolved_init_model_path = resolve_input_path(str(init_model_path))
-        # Instead of loading the whole PPO object here, pour the old model's parameters into the new model.
-        # The benefit: we still use the hyperparameters from the current config file, and only reuse the previously learned policy weights.
-        model.set_parameters(str(resolved_init_model_path), exact_match=False, device=config['training']['device'])
-        print(f"已加载课程初始化模型: {resolved_init_model_path}")
-
-    model.learn(
-        total_timesteps=int(config['training']['total_timesteps']),
-        callback=checkpoint_callback,
-        progress_bar=True,
-    )
-    final_model_path = run_dir / 'final_model'
-    model.save(final_model_path)
-    env.close()
+        if stage_name is not None:
+            print(
+                f'Starting curriculum stage {stage_index}/{len(stage_configs)}: {stage_name} '
+                f'(target_forward_velocity={stage_config["task"]["target_forward_velocity"]:.2f}, '
+                f'timesteps={int(stage_config["training"]["total_timesteps"])})'
+            )
+
+        env = Monitor(GazeboBipedEnv(stage_config))
+        checkpoint_callback = CheckpointCallback(
+            save_freq=max(int(stage_config['training']['checkpoint_freq']), 1),
+            save_path=str(stage_dir / 'checkpoints'),
+            name_prefix='guguji_ppo',
+        )
+
+        try:
+            if model is None:
+                policy_kwargs = {
+                    'net_arch': list(stage_config['training']['policy_net_arch']),
+                }
+                # Get the training loop working end to end with MLP + PPO first; you can scale the network up step by step later.
+                model = PPO(
+                    policy='MlpPolicy',
+                    env=env,
+                    verbose=1,
+                    seed=int(stage_config['training']['seed']),
+                    learning_rate=float(stage_config['training']['learning_rate']),
+                    n_steps=int(stage_config['training']['n_steps']),
+                    batch_size=int(stage_config['training']['batch_size']),
+                    gamma=float(stage_config['training']['gamma']),
+                    gae_lambda=float(stage_config['training']['gae_lambda']),
+                    clip_range=float(stage_config['training']['clip_range']),
+                    ent_coef=float(stage_config['training']['ent_coef']),
+                    vf_coef=float(stage_config['training']['vf_coef']),
+                    device=stage_config['training']['device'],
+                    tensorboard_log=str(run_dir / 'tensorboard'),
+                    policy_kwargs=policy_kwargs,
+                )
+
+                init_model_path = stage_config['training'].get('init_model_path')
+                if init_model_path:
+                    resolved_init_model_path = resolve_input_path(str(init_model_path))
+                    # Instead of loading the whole PPO object here, pour the old model's parameters into the new model.
+                    # The benefit: we still use the hyperparameters from the current config file, and only reuse the previously learned policy weights.
+                    model.set_parameters(
+                        str(resolved_init_model_path),
+                        exact_match=False,
+                        device=stage_config['training']['device'],
+                    )
+                    print(f"已加载课程初始化模型: {resolved_init_model_path}")
+            else:
+                model.set_env(env)
+
+            maybe_override_policy_log_std(model, stage_config['training'].get('initial_log_std'))
+            model.learn(
+                total_timesteps=int(stage_config['training']['total_timesteps']),
+                callback=checkpoint_callback,
+                progress_bar=True,
+                reset_num_timesteps=(stage_index == 1),
+            )
+
+            stage_model_path = stage_dir / 'final_model'
+            model.save(stage_model_path)
+            final_model_path = stage_model_path
+            final_stage_config = stage_config
+
+            if stage_name is not None:
+                print(f'Curriculum stage finished, model saved to: {stage_model_path.with_suffix(".zip")}')
+        finally:
+            env.close()
+
+    if final_model_path != run_dir / 'final_model' and model is not None:
+        # In curriculum mode, also save one final model at the run root, so there is a uniform path to reference.
+        model.save(run_dir / 'final_model')
+        final_model_path = run_dir / 'final_model'
 
     print(f'Training finished, model saved to: {run_dir / "final_model.zip"}')
 
-    evaluation_config = config['evaluation']
+    evaluation_config = final_stage_config['evaluation']
     if bool(evaluation_config.get('auto_forward_progress', True)) and not args.skip_auto_eval:
         try:
             # Run a forward-progress evaluation automatically after each training run, so you can quickly check delta_x / mean_vx.
             summary = evaluate_forward_progress(
-                config=config,
+                config=final_stage_config,
                 model_path=final_model_path,
                 episodes=int(evaluation_config['forward_progress_episodes']),
                 max_steps=int(evaluation_config['forward_progress_max_steps']),
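
For reference, the `log_std` override that `maybe_override_policy_log_std` performs can be checked standalone against stable-baselines3 on a toy environment. A sketch; `Pendulum-v1` is only a stand-in continuous-action env, not part of this project:

```python
import gymnasium as gym
from stable_baselines3 import PPO

model = PPO('MlpPolicy', gym.make('Pendulum-v1'), verbose=0)
print(model.policy.log_std.data)       # stable-baselines3 initializes log_std to 0.0
model.policy.log_std.data.fill_(-2.2)  # exploration std becomes exp(-2.2) ~= 0.11
print(model.policy.log_std.data)
```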