I'm trying to make a deep Q-learning agent from https://keon.io/deep-q-learning
My environment looks like this: https://imgur.com/a/OnbiCtV
As you can see, my agent is a circle, and there is one gray track with orange lines (reward gates). The bolder line is the active gate. The orange line from the circle points in the agent's direction of travel.
The agent has constant velocity and can turn left/right by 10 degrees or do nothing.
The next image shows the agent's sensors: https://imgur.com/a/Qj7Kesi
They rotate with the agent.
The state consists of the distance from the agent to the active gate plus the lengths of the seven sensors, so 1+7 = 8 values in total, which is the input dimension of the Q-network.
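To make the state representation concrete, here is a minimal sketch of how such a state vector could be assembled. The function name and arguments are illustrative, not from my actual code:

```python
import numpy as np

def build_state(agent_pos, gate_pos, sensor_lengths):
    """Build the 1+7 = 8 dimensional state vector: distance to the
    active gate followed by the seven sensor lengths.
    Argument names are illustrative placeholders for environment data."""
    gate_dist = np.linalg.norm(np.asarray(gate_pos, dtype=np.float64)
                               - np.asarray(agent_pos, dtype=np.float64))
    return np.array([gate_dist, *sensor_lengths], dtype=np.float32)
```

One thing worth trying: normalize each component (e.g., divide sensor lengths by their maximum range and the gate distance by the track size) so all inputs land in a similar range, which usually helps the network train.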
Actions are turn left, turn right and do nothing.
The reward function returns 25 when the agent crosses a reward gate, 125 when it crosses the last gate, and -5 if it hits the track border. If none of these happen, the reward function compares the distance from the agent to the active gate in the current state and the next state:
If current state distance > next state distance: return 0.1, else return -0.1.
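The reward scheme above can be sketched as follows; the flags and distance arguments are hypothetical names standing in for whatever the environment reports:

```python
def compute_reward(hit_border, hit_gate, is_last_gate, dist_now, dist_next):
    """Reward scheme described above. Argument names are illustrative:
    collision/gate flags plus gate distances for the current and next state."""
    if hit_border:
        return -5            # crashed into the track border
    if hit_gate:
        return 125 if is_last_gate else 25   # crossed a (final) reward gate
    # Dense shaping: small bonus for moving toward the active gate.
    return 0.1 if dist_now > dist_next else -0.1
```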
Also, the DQNAgent has negative, positive, and neutral memories. If the reward is -5, the tuple (state, action, reward, next_state, done) goes to the negative memory; if the reward is >= 25, to the positive memory; otherwise to the neutral memory.
That is because when I form a minibatch for training, I take 20 random samples from the neutral memory, 6 from the positive, and 6 from the negative.
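A minimal sketch of that three-way replay buffer and the 20/6/6 stratified sampling (class name and buffer size are my own placeholders, not from the actual code):

```python
import random
from collections import deque

class ReplayBuffers:
    """Replay memory split by reward sign, with 20/6/6 stratified sampling."""
    def __init__(self, maxlen=10000):
        self.neutral = deque(maxlen=maxlen)
        self.positive = deque(maxlen=maxlen)
        self.negative = deque(maxlen=maxlen)

    def remember(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if reward == -5:
            self.negative.append(transition)   # hit the border
        elif reward >= 25:
            self.positive.append(transition)   # crossed a gate
        else:
            self.neutral.append(transition)    # shaping reward only

    def sample(self):
        # Take up to 20 neutral, 6 positive, 6 negative transitions.
        batch = random.sample(self.neutral, min(20, len(self.neutral)))
        batch += random.sample(self.positive, min(6, len(self.positive)))
        batch += random.sample(self.negative, min(6, len(self.negative)))
        random.shuffle(batch)
        return batch
```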
Every time the agent hits the track border, or when it is stuck for more than 30 seconds, I run training (replay) and the agent starts from the beginning. This is my model:
```python
model = Sequential()
model.add(Dense(64, input_dim=self.state_size, activation='relu',
                kernel_initializer=VarianceScaling(scale=2.0)))
model.add(Dense(32, activation='relu',
                kernel_initializer=VarianceScaling(scale=2.0)))
model.add(Dense(self.action_size, activation='linear'))
model.compile(loss=self._huber_loss, optimizer=Adam(lr=self.learning_rate))
return model
```
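For reference, `self._huber_loss` in the keon.io tutorial is the standard Huber loss (quadratic for small errors, linear for large ones). Here is a NumPy sketch of what it computes, separate from the Keras-backend version in my agent:

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: 0.5*e^2 where |e| <= delta, else delta*(|e| - 0.5*delta).
    NumPy sketch of the loss the Keras model is compiled with."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * np.square(error)
    linear = delta * (np.abs(error) - 0.5 * delta)
    return float(np.mean(np.where(small, squared, linear)))
```

The point of using it over plain MSE is that large TD errors don't blow up the gradients.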
I tried different kinds of models, different numbers of neurons per layer, other activation and loss functions, dropout, and batch normalization; this model works best so far.
I also tried different reward values.
Also, I tried static sensors (which do not rotate with the agent): https://imgur.com/a/8eDtQIF (green lines in the image).
Sometimes my agent manages to cross a few gates before hitting the border. Rarely it manages to traverse half of the track, and once, with these settings, it completed two laps before getting stuck.
More often, it just rotates in place.
I think the problem lies in the state representation or the reward function.
Any suggestions would be appreciated