
roosephu/needle


Introduction

This is my personal implementation of several reinforcement learning algorithms, some of which are cutting edge, including

  1. Deep Q-Network (DQN)
  2. Deep Deterministic Policy Gradient (DDPG)
  3. Asynchronous Advantage Actor-Critic (A3C)
  4. REINFORCE
  5. Truncated Natural Policy Gradient (TNPG) (I may have cited the wrong paper, since it doesn't use conjugate gradient to solve the equations?)
  6. Trust Region Policy Optimization (TRPO) with Generalized Advantage Estimation (GAE) (see the short GAE sketch below this list)
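
For reference, the GAE estimator used in item 6 is just an exponentially weighted sum of TD residuals. Below is a minimal NumPy sketch; the function name and default hyperparameters are mine, not this repo's API.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation (illustrative sketch).

    rewards: array of length T
    values:  array of length T + 1 (state values plus a bootstrap value)
    Returns advantages of length T.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running  # discounted sum of deltas
        advantages[t] = running
    return advantages
```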

Some optimizations are used: Double DQN is implemented instead of the traditional DQN target, and prioritized sampling is currently under development.
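
To make the Double DQN point concrete: the only change from vanilla DQN is how the bootstrap target is built (the online network selects the action, the target network evaluates it). A rough NumPy sketch, with illustrative names rather than this repo's API:

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN targets (sketch).

    q_online_next, q_target_next: arrays of shape (batch, num_actions)
    dones: float array, 1.0 for terminal transitions
    Vanilla DQN would instead take max(q_target_next) directly.
    """
    best_actions = np.argmax(q_online_next, axis=1)            # online net picks actions
    next_q = q_target_next[np.arange(len(rewards)), best_actions]  # target net evaluates them
    return rewards + gamma * (1.0 - dones) * next_q
```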

The library is inspired by the paper Benchmarking Deep Reinforcement Learning for Continuous Control, whose home page is here. If you find duplicated code, that's my fault; I promise, however, that I wrote every line of code myself.

A lot of the code is ad hoc and needs refactoring. Issues and discussions are always appreciated.

Tests

I developed the library on an ancient MacBook Air (mid-2013, i5 with 4 GB RAM) without a GPU, so you should have no trouble running these toy experiments.

Only a few examples are available at the moment, due to a number of remaining bugs; DDPG, however, should succeed. All code depends on OpenAI Gym and TensorFlow, so please install both before running any experiments.

Example commands:

python main.py --mode train --agent DDPG      --env MountainCarContinuous-v0
python main.py --mode train --agent REINFORCE --env CartPole-v0               --batch_size 10 --iterations 8000 --learning_rate 0.1
python main.py --mode train --agent A2C       --env CartPole-v0               --replay_buffer_size 200 --batch_size 200
python main.py --mode train --agent A2C       --env Copy-v0                   --replay_buffer_size 200 --batch_size 200 --iterations 6000
python main.py --mode train --agent TNPG      --env Copy-v0                   --batch_size 10 --iterations 8000
python main.py --mode train --agent TRPO      --env Copy-v0                   --batch_size 10 --iterations 8000
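
If you just want to sanity-check your Gym and TensorFlow installation before running the commands above, a bare-bones random-agent loop is enough. This is a generic sketch of the kind of interaction loop these agents drive, not the library's own entry point, and it assumes the classic `reset()`/`step()` Gym API that this repo targets:

```python
import gym

# Generic rollout loop with a random agent (illustrative only).
env = gym.make("CartPole-v0")
for episode in range(10):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample()        # an agent would choose the action here
        obs, reward, done, _ = env.step(action)   # classic 4-tuple step signature
        total_reward += reward
    print("episode", episode, "return", total_reward)
```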

Notes

An experimental A2C (synchronous advantage actor-critic) implementation runs on CartPole-v0. Note that A2C uses an LSTM by default.

A2C on `Copy-v0` succeeds with probability roughly 0.7 after 4k-6k steps; otherwise it gets stuck in a local minimum where, for certain characters, the agent always moves left. I find that using a small learning rate for the actor helps it reach the global minimum.
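
One simple way to realize "small learning rate for the actor" is to give the actor and critic separate optimizers. The TF1-style sketch below uses tiny placeholder networks and made-up learning rates purely for illustration; none of the variable names or models are this repo's:

```python
import tensorflow as tf

# Sketch: a smaller learning rate for the actor than for the critic.
states = tf.placeholder(tf.float32, [None, 4])
returns = tf.placeholder(tf.float32, [None])
actions = tf.placeholder(tf.int32, [None])

with tf.variable_scope("actor"):
    logits = tf.layers.dense(states, 2)                      # toy policy head
with tf.variable_scope("critic"):
    values = tf.squeeze(tf.layers.dense(states, 1), axis=1)  # toy value head

advantages = tf.stop_gradient(returns - values)              # critic not updated by actor loss
log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(labels=actions, logits=logits)
actor_loss = -tf.reduce_mean(log_probs * advantages)
critic_loss = tf.reduce_mean(tf.square(returns - values))

actor_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="actor")
critic_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="critic")

# A smaller actor step helps avoid the bad local minimum described above.
train_actor = tf.train.AdamOptimizer(1e-4).minimize(actor_loss, var_list=actor_vars)
train_critic = tf.train.AdamOptimizer(1e-3).minimize(critic_loss, var_list=critic_vars)
```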

TNPG sometimes solves `Copy-v0` in ~1k steps. More experiments are needed.

Question: can we combine TNPG and A3C with an LSTM? The actor and critic networks share many weights; how should we apply a suitable gradient to them?

5 independent runs of TNPG (batch size 10, delta_KL = 0.001):

[Figure: ./doc/TNPG.png]
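
For context on the delta_KL hyperparameter: in the rllab-style natural policy gradient step, the policy gradient g is turned into a natural gradient by approximately solving F x = g with a few conjugate gradient iterations, and the update is scaled so the quadratic KL estimate equals delta_KL. The NumPy sketch below is the generic recipe, not necessarily how this repo implements it (see the note above about conjugate gradient):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the
    Fisher-vector product F v without forming F explicitly."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, with x = 0
    p = g.copy()          # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def natural_gradient_step(fvp, policy_gradient, delta_kl=1e-3):
    """Scale the natural gradient so that 0.5 * step^T F step = delta_kl."""
    x = conjugate_gradient(fvp, policy_gradient)
    step_size = np.sqrt(2.0 * delta_kl / (x @ fvp(x)))
    return step_size * x
```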
