Policy Gradients: REINFORCE from Scratch with NumPy
In the DQN post, we trained a neural network to estimate Q-values and then picked the best action with argmax. That works when the action space is discrete — push left or push right. But what if you n
sesenai.hashnode.dev20 min read