Continuous-Discrete Reinforcement Learning for Hybrid Control in Robotics (paper) [NOTES]
Continuous–Discrete Reinforcement Learning for Hybrid Control in Robotics
Meta
Why these notes exist
Why this paper exists
Why hybrid action spaces are annoying
Why robots refuse to be simple
####### Why I am doing this to Markdown
0. Abstract (Rewritten in My Own Words)
0.1 The problem, roughly
Hybrid control problems are those where the action space contains both discrete decisions (e.g. mode switches, on/off, grasp/not grasp) and continuous controls (e.g. torques, velocities).
0.2 Why standard RL breaks
Most RL algorithms assume:
- either fully discrete actions (DQN-style)
- or fully continuous actions (policy gradients, actor–critic)
Hybrid ≠ either.
0.3 What this paper proposes
A framework to:
- explicitly model discrete + continuous actions
- avoid exponential blowups
- keep training stable
- work on real robotic tasks
1. Background
1.1 Reinforcement Learning recap
1.1.1 Markov Decision Processes (MDPs)
An MDP is defined as:
\[ (\mathcal{S}, \mathcal{A}, P, R, \gamma) \]
Where:
- \( \mathcal{S} \): state space
- \( \mathcal{A} \): action space
- \( P \): transition dynamics
- \( R \): reward function
- \( \gamma \): discount factor
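As a tiny sketch (my own container, not from the paper), the tuple maps directly onto a data structure:

```python
from typing import Callable, NamedTuple


class MDP(NamedTuple):
    states: set            # S: state space (finite here, for illustration)
    actions: set           # A: action space -- note: a single, homogeneous set
    transition: Callable   # P(s' | s, a)
    reward: Callable       # R(s, a)
    gamma: float           # discount factor
```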
1.1.1.1 Important note
This already quietly assumes a single, homogeneous action space: all discrete or all continuous.
1.2 Continuous control
1.2.1 Policy gradients
A policy \( \pi_\theta(a \mid s) \) outputs parameters of a continuous distribution.
1.2.2 Common algorithms
- DDPG
- TD3
- SAC
All assume: \[ a \in \mathbb{R}^n \]
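For reference, a minimal sketch of what such a policy head looks like: it outputs the parameters of a diagonal Gaussian over \( a \in \mathbb{R}^n \). Layer shapes and the state-independent std are my simplifications, not from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(state_dim, action_dim)           # mean of a in R^n
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # state-independent log-std

    def forward(self, state):
        dist = Normal(self.mean(state), self.log_std.exp())
        return dist.rsample()   # reparameterized sample (differentiable)
```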
1.3 Discrete control
1.3.1 Q-learning
Action selection via: \[ a^\ast = \arg\max_{a} Q(s, a) \]
1.3.2 DQN and friends
- Works well for discrete action sets
- Breaks when actions are continuous (the \(\arg\max\) becomes a continuous optimization problem at every step)
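The core selection step, as a sketch (`q_net` is an assumed helper mapping a state to one Q-value per discrete action, not something from the paper):

```python
import torch


def greedy_action(q_net, state):
    # enumerate Q(s, a) for every discrete a and take the best one;
    # this enumeration is exactly what has no continuous counterpart
    with torch.no_grad():
        q_values = q_net(state)            # shape: (num_actions,)
    return int(torch.argmax(q_values))
```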
2. Hybrid Action Spaces
2.1 Definition
A hybrid action is:
\[ a = (a_d, a_c) \]
Where:
- \( a_d \in \mathcal{A}_d \) (discrete)
- \( a_c \in \mathcal{A}_c \subset \mathbb{R}^n \) (continuous)
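A concrete sketch of such a space using gymnasium's composite spaces (the mode count and bounds are made up for illustration):

```python
import numpy as np
from gymnasium import spaces

# a = (a_d, a_c): a discrete mode plus a continuous command vector
hybrid_action_space = spaces.Tuple((
    spaces.Discrete(3),                                             # a_d: e.g. reach / grasp / release
    spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32),   # a_c: e.g. joint velocity command
))

a_d, a_c = hybrid_action_space.sample()
```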
2.2 Naive approaches (and why they suck)
2.2.1 Discretize everything
- Curse of dimensionality
- Loses precision
- Not scalable
2.2.2 Enumerate discrete, learn continuous per mode
- Separate policy per discrete action
- Sample inefficiency
- Switching instability
3. Core Idea of the Paper
3.1 Factorized policy
The policy is decomposed as:
\[ \pi(a_d, a_c \mid s) = \pi_d(a_d \mid s)\,\pi_c(a_c \mid s, a_d) \]
This is the key move.
3.2 Interpretation
- First decide what to do (discrete)
- Then decide how to do it (continuous)
Very human. Very robotic. Very sane.
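A minimal sketch of what the factorization means computationally: the joint log-probability is just the sum of the two heads' log-probabilities. The distribution choices (categorical + diagonal Gaussian) are my assumption, not necessarily the paper's exact parameterization.

```python
import torch
from torch.distributions import Categorical, Normal


def joint_log_prob(discrete_logits, cont_mean, cont_std, a_d, a_c):
    # log pi(a_d, a_c | s) = log pi_d(a_d | s) + log pi_c(a_c | s, a_d)
    log_pd = Categorical(logits=discrete_logits).log_prob(a_d)
    log_pc = Normal(cont_mean, cont_std).log_prob(a_c).sum(dim=-1)  # diagonal Gaussian
    return log_pd + log_pc
```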
4. Architecture
4.1 Actor structure
4.1.1 Discrete actor
Outputs categorical distribution over modes.
4.1.2 Continuous actor
Conditioned on:
- state
- chosen discrete action
4.2 Critic structure
4.2.1 Q-function
\[ Q(s, a_d, a_c) \]
4.2.2 Why this matters
The critic evaluates the joint action, so the coupling between the chosen mode and its continuous parameters is preserved.
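A sketch of such a critic (the architecture is my guess; the discrete action enters as a one-hot so the network can model the coupling):

```python
import torch
import torch.nn as nn


class HybridCritic(nn.Module):
    def __init__(self, state_dim, num_modes, cont_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_modes + cont_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, a_d_onehot, a_c):
        # Q(s, a_d, a_c): evaluates the joint action, not the parts in isolation
        return self.net(torch.cat([state, a_d_onehot, a_c], dim=-1))
```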
5. Training
5.1 Objective
Maximize expected return:
\[ \mathbb{E}_{\pi_d, \pi_c}\left[\sum_t \gamma^t r_t\right] \]
5.2 Gradient estimation
5.2.1 Discrete gradients
- REINFORCE-style
- Or Gumbel-softmax tricks
5.2.2 Continuous gradients
- Standard reparameterization
- Backprop through critic
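A rough sketch of how the two paths can be combined in one actor loss, using a Gumbel-softmax relaxation for the discrete head and a reparameterized Gaussian for the continuous head. This is my reconstruction under those assumptions, not the paper's exact update; `critic` is assumed to take `(state, one_hot_mode, a_c)`.

```python
import torch.nn.functional as F
from torch.distributions import Normal


def actor_loss(critic, state, discrete_logits, cont_mean, cont_std, tau=1.0):
    # discrete path: Gumbel-softmax gives a differentiable (relaxed) one-hot mode
    a_d_relaxed = F.gumbel_softmax(discrete_logits, tau=tau, hard=True)
    # continuous path: rsample keeps gradients flowing back from the critic
    a_c = Normal(cont_mean, cont_std).rsample()
    # ascend the critic's estimate of the joint action value
    return -critic(state, a_d_relaxed, a_c).mean()
```

In the full factorized setup, `cont_mean` and `cont_std` would themselves be produced conditioned on the sampled mode.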
5.3 Stability tricks
- Target networks
- Entropy regularization
- Careful learning rates
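Sketches of the first two tricks listed above (Polyak-averaged target networks and an entropy bonus on the discrete head); the coefficients are placeholders, not the paper's values.

```python
import torch
from torch.distributions import Categorical


@torch.no_grad()
def soft_update(target_net, online_net, rho=0.995):
    # target <- rho * target + (1 - rho) * online, applied after each gradient step
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(rho).add_((1.0 - rho) * o)


def entropy_bonus(discrete_logits, alpha=0.01):
    # rewards uncertainty over modes; subtract this term from the actor loss
    return alpha * Categorical(logits=discrete_logits).entropy().mean()
```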
6. Experiments
6.1 Simulated tasks
6.1.1 Pushing
6.1.2 Grasping
6.1.3 Mode switching
6.2 Real robot experiments
6.2.1 Why this is impressive
Real robots:
- are noisy
- break easily
- hate unstable policies
7. Results
7.1 Performance
- Faster convergence
- Higher success rates
- Better sample efficiency
7.2 Comparison
Outperforms:
- naive discretization
- separate-policy baselines
8. Limitations
8.1 Scaling discrete actions
Large discrete sets still hard.
8.2 Credit assignment
Discrete decisions early → delayed effects.
9. My Thoughts
9.1 Why this matters
Hybrid action spaces are everywhere:
- robotics
- games
- systems control
- missile interception (yes, you)
9.2 Connection to my work
This maps directly to:
- mode switching
- guidance laws
- continuous accelerations
10. Questions I Still Have
10.1 Exploration
How do we explore discrete modes efficiently?
10.2 Hierarchy
Is this secretly hierarchical RL?
11. Random Scratchpad
11.1 TODO
- re-derive gradients
- implement toy version
- test on missile sim
11.2 Garbage thoughts
Is a hybrid action space just a polite way of saying the world is messy?
Probably.
Appendix A: Equations Dump
\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\left(\nabla_\theta \log \pi_d(a_d \mid s) + \nabla_\theta \log \pi_c(a_c \mid s, a_d)\right) Q(s, a_d, a_c)\right] \]
Appendix B: Code Skeleton
```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class HybridActor(nn.Module):
    def __init__(self, state_dim, num_modes, cont_dim):
        super().__init__()
        self.discrete_head = nn.Linear(state_dim, num_modes)                # logits over modes
        self.continuous_head = nn.Linear(state_dim + num_modes, cont_dim)   # a_c conditioned on (s, a_d)

    def forward(self, state):
        discrete_logits = self.discrete_head(state)
        discrete_action = Categorical(logits=discrete_logits).sample()      # a_d ~ pi_d(. | s)
        one_hot = nn.functional.one_hot(discrete_action, discrete_logits.shape[-1]).float()
        continuous_action = self.continuous_head(torch.cat([state, one_hot], dim=-1))
        return discrete_action, continuous_action
```
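A quick smoke test of the skeleton above (dimensions are arbitrary):

```python
actor = HybridActor(state_dim=8, num_modes=3, cont_dim=4)
a_d, a_c = actor(torch.randn(2, 8))    # batch of 2 states
print(a_d.shape, a_c.shape)            # torch.Size([2]) torch.Size([2, 4])
```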