I recently finished David Silver's course on RL (on YouTube) and thought about trying it out in a simple application in the Unity game engine: I've built a simple labyrinth with a ball, and I want to teach the ball to get from point A to point B while avoiding obstacles and fire (a place where you get burnt, so a big negative reward).

The problem I encountered while designing the whole thing (programming-wise) is: what is the correct (or at least a good) way of representing the position in 2D space? It is continuous, so I thought about representing it as a feature vector [up, down, left, right, posX, posY], where each direction is a binary flag for whether I am pressing the button to move that way (actions, if you want), and posX/posY are floats in [0, 1] representing the normalized position from one corner of the plane the map sits on. That would be accompanied by a weight vector W adjusted using gradient descent.
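To make the representation concrete, here is a minimal sketch of the feature vector and linear value estimate I have in mind (the function and variable names are my own, just for illustration):

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def make_features(action, pos_x, pos_y):
    # [up, down, left, right, posX, posY]: binary action flags
    # followed by the normalized position in [0, 1].
    x = np.zeros(6)
    x[ACTIONS.index(action)] = 1.0
    x[4] = pos_x
    x[5] = pos_y
    return x

# Linear value estimate: Q(s, a) = W . x(s, a),
# with W learned via gradient descent.
W = np.random.randn(6) * 0.01

x = make_features("up", 0.3, 0.7)
q_value = W @ x
```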

The question is: will this work? I am asking for two reasons. First, I am not so sure about posX and posY, since they can be 0, and if I multiply them by the weight vector, how could the resulting value be anything but 0? Second, I am not sure whether the actions should be part of the features. It makes sense to me, but I could easily be very wrong since I am a beginner.

Thanks a lot guys in advance. If you have any more questions or think the problem is not described deeply enough just ask in the comments and I'll edit the question. :)

PS: I could just code it the way I think is right, but I also want to get a grasp of designing applications on paper before coding them (project management).

Shouldn't the action be the output, not the input? If you also learn a bias vector, you mitigate the pos == (0, 0) problem. – BlindKungFuMaster – 2016-09-09T12:42:56.280

@BlindKungFuMaster That's the second reason there. To explain my thought process: the vector is intended as input to Q(s, a), a function with a parameter vector W that will be learned. The exact action is then extracted by finding the max over a in Q(s, a), where s comes from the actual position of the ball and a is found in some sort of loop.
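In code, that "some sort of loop" over actions might look like this (again just a sketch with made-up names, assuming the six-element feature vector from the question):

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def features(action, pos_x, pos_y):
    # Binary action flags plus normalized position.
    x = np.zeros(6)
    x[ACTIONS.index(action)] = 1.0
    x[4], x[5] = pos_x, pos_y
    return x

def greedy_action(W, pos_x, pos_y):
    # Evaluate Q(s, a) = W . x(s, a) for every action
    # and pick the argmax.
    q_values = [W @ features(a, pos_x, pos_y) for a in ACTIONS]
    return ACTIONS[int(np.argmax(q_values))]

# Example: with this W, the "up" flag dominates.
W = np.array([1.0, 0.0, 0.0, 0.0, 0.5, 0.5])
best = greedy_action(W, 0.2, 0.9)
```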

What do you mean by the bias vector? Is it that W vector I mentioned or something else? Thanks. – Dominik – 2016-09-09T13:06:52.963

Ah, ok, so Q predicts the expected reward. A bias vector would be added to the product of the input and W. So B would contain additional parameters for Q. – BlindKungFuMaster – 2016-09-09T15:17:43.453

So in the end it would be something like Q(Sa, B) = W·Sa + B, where B is just some value so that the Q of state (0, 0) won't always be 0. Everything is shifted by B, which won't matter because, as W gets learnt, Q will converge to the actual values anyway; W will just end up different than if I had done it without B. Do I understand correctly? – Dominik – 2016-09-09T16:56:48.520
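A tiny numerical example of what I mean (the particular numbers for W and B are made up):

```python
import numpy as np

W = np.array([0.1, -0.2, 0.05, 0.0, 0.8, 0.8])
B = 0.5  # learned bias (a scalar here)

# State (0, 0) with action "up" -> features [1, 0, 0, 0, 0, 0];
# the position features contribute nothing.
sa = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

q_without_bias = W @ sa       # only the action flag contributes
q_with_bias = W @ sa + B      # shifted by B, so the estimate at
                              # the corner isn't pinned near zero
```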

Yes, though I would suspect that you'll need more layers anyway, which would look like Q(Sa, B) = W2·f(W1·Sa + B1) + B2, unless your labyrinth is extremely simple. – BlindKungFuMaster – 2016-09-09T17:09:38.590
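Written out, that two-layer version Q(Sa) = W2·f(W1·Sa + B1) + B2 could be sketched like this (f taken to be ReLU and the hidden size of 16 are just example choices, not something fixed by the discussion):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(16, 6))  # hidden layer, 16 units
B1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=16)
B2 = 0.0

def q(sa):
    # Q(Sa) = W2 . f(W1 . Sa + B1) + B2
    return W2 @ relu(W1 @ sa + B1) + B2

sa = np.array([1.0, 0.0, 0.0, 0.0, 0.3, 0.7])
value = q(sa)
```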