OK, following my previous question I was pointed toward reinforcement learning.

So far, what I have understood from various websites is the following:

- there is a Q(s,a) function involved
- I can assume my neural network ~ Q(s,a)
- my simulation has a state (N input variables)
- my actor can perform M possible actions (M output variables)
- at each step of the simulation my actor performs only the action corresponding to `max(outputs)`
- in my case the actions are a 1%, 2% or 3% increase or decrease of the propellers' thrust force
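
To make the action-selection step concrete, this is roughly what I mean by "perform the action corresponding to `max(outputs)`". It is only a sketch: `GreedyPolicy`, `ArgMax`, `ApplyAction` and the `ThrustDeltas` table are names I made up for this post (they are not part of the library), and I assume a single propeller, so M = 6.

```csharp
using System;

public static class GreedyPolicy
{
    // Hypothetical action table: thrust change in percent, one entry per network output.
    public static readonly double[] ThrustDeltas = { -3, -2, -1, 1, 2, 3 };

    // Index of the largest output, i.e. the action with the highest estimated Q-value.
    public static int ArgMax(double[] outputs)
    {
        if (outputs == null || outputs.Length == 0)
            throw new ArgumentException("outputs must not be empty");
        int best = 0;
        for (int i = 1; i < outputs.Length; i++)
            if (outputs[i] > outputs[best]) best = i;
        return best;
    }

    // Apply the chosen percentage change to the current thrust.
    public static double ApplyAction(double thrust, int action)
    {
        return thrust * (1.0 + ThrustDeltas[action] / 100.0);
    }
}
```

For example, with outputs `{ 0.1, 0.7, 0.3 }` the selected index is 1, which in this table means a 2% decrease of the thrust.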

**From this website I found that at some point I have to**:

- Estimate outputs Q[t] (or so called q-values)
- Estimate outputs over next state Q[t+1]
- Let the backpropagation algorithm perform error correction only on the action performed on next state.
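
If I read the website correctly, these three points describe the usual one-step Q-learning target. Below is a minimal sketch of how I picture it. `QLearningTarget.Build` and its parameter names are my own invention, and I am assuming that the output to be corrected is the one for the action taken at step t (this is exactly the part I may be misreading).

```csharp
using System;
using System.Linq;

public static class QLearningTarget
{
    // qNow:   network outputs for the state the action was taken in (Q[t])
    // qNext:  network outputs for the state reached afterwards (Q[t+1])
    // action: index of the action that was actually executed at step t
    public static double[] Build(double[] qNow, double[] qNext,
                                 int action, double reward, double discount)
    {
        // Copy Q[t] so that every other output has zero error during backpropagation.
        double[] target = (double[])qNow.Clone();
        // Only the executed action gets a new target: reward + discounted best next value.
        target[action] = reward + discount * qNext.Max();
        return target;
    }
}
```

With `qNow = { 1, 2, 3 }`, `qNext = { 0.5, 4, 1 }`, action 0, reward 10 and discount 0.9, only the first entry of the target changes (to 10 + 0.9 * 4 = 13.6); the other two stay at 2 and 3.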

**The last 3 points are not clear to me at all. In fact, I don't yet have the next state, so what I do instead is**:

- Estimate previous outputs Q[t-1]
- Estimate current outputs Q[t]
- Let backpropagation fix the error for max q value only
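
Written as a formula, my shifted variant amounts to the following. The symbols are my own notation: $s_{t-1}$ and $s_t$ are the previous and current states, $R$ is the reward, $\gamma$ is the discount factor, and $y$ is the target vector handed to backpropagation.

$$
a^{*} = \arg\max_{a} Q(s_t, a), \qquad
y_a =
\begin{cases}
R + \gamma \, Q(s_t, a^{*}) & \text{if } a = a^{*} \\
Q(s_{t-1}, a) & \text{otherwise}
\end{cases}
$$

The network is then trained on the pair $(s_{t-1}, y)$, so only the output with index $a^{*}$ has a non-zero error.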

For the code I just use this library, which is simple enough to let me understand what happens inside:

Initializing the neural network (with N input neurons, N+M hidden neurons and M output neurons) is as simple as:

```
Network network = new NeuralNetwork( N, N+M, M);
```

Then, as far as I understand, I need an arbitrary reward function:

```
public double R()
{
    double distance = (currentPosition - targetPosition).VectorMagnitude();
    if (distance < 100)
        return 100 - distance; // the nearer to the target, the greater the reward
    return -1; // too far
}
```
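
To show what values this reward produces, here is the same rule as a standalone function of the distance. `RewardSketch.Reward` is just an illustrative helper for this post and is not what runs in my simulation.

```csharp
public static class RewardSketch
{
    // Same rule as R(), but taking the distance as a parameter.
    public static double Reward(double distance)
    {
        if (distance < 100)
            return 100 - distance; // between 0 (exclusive) and 100
        return -1;                 // flat penalty when too far
    }
}
```

So a distance of 30 gives a reward of 70, a distance of 99.5 gives 0.5, and any distance of 100 or more gives -1 regardless of how far away the actor is.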

Then what I do at each step is:

```
// init step
var previousInputs = ReadInputs();
UpdateInputs();
var currentInputs = ReadInputs();

// estimate previous outputs Q[t-1]
previousOutputs = network.Query(previousInputs);

// estimate current outputs Q[t]
currentOutputs = network.Query(currentInputs);

// find the max value among the current outputs and its index
int maxIndex = 0;
double maxValue = double.MinValue;
SelectMax(currentOutputs, out maxValue, out maxIndex);

// apply the modified max value to PREVIOUS outputs
previousOutputs[maxIndex] = R() + discountValue * currentOutputs[maxIndex];

// let backpropagation fix the error for the max Q-value only
network.Train(previousInputs, previousOutputs);

// advance the simulation by 1 step and see what happens
RunPhysicsSimulationStep(1 / 200.0);
DrawEverything();
```

But it doesn't seem to work very well. I let the simulation run for over an hour without success. I am probably reading the algorithm the wrong way.