I have a function that generates possible actions, so that I would have a Q-table in the form of a nested dictionary, with states (added as they occur) as keys whose values are also dictionaries, with possible actions as keys and Q-values as values. Is this possible? How would it affect learning? What other method can I use?

(Disclaimer: I provided this suggestion to OP here as an answer to a question on Data Science Stack Exchange)

Yes, this is possible. Assuming you have settled on all your other decisions for Q learning (such as rewards, discount factor, time horizon, etc.), it will have no *logical* impact on learning compared to any other approach to table building. The structure of your table has no relevance to how Q learning converges; it is an implementation detail.

This choice of structure may have a performance impact in terms of how fast the code runs - the design works best when it significantly reduces memory overhead compared to using a tensor that over-specifies all possible states and actions. If all parts of the state vector could take all values in any combination, and all states had the same number of allowed actions, then a tensor model for the Q table would likely be more efficient than a hash.
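
As a minimal sketch of the nested-dictionary layout described in the question, the snippet below creates a state's entry on first visit. The names `get_action_values` and `possible_actions`, and the toy rule for which actions are legal, are assumptions made for illustration; they are not taken from OP's code.

```python
from typing import Callable, Dict, Hashable, Iterable

# Q-table layout: {state: {action: q_value}}
QTable = Dict[Hashable, Dict[Hashable, float]]


def get_action_values(
    q_table: QTable,
    state: Hashable,
    possible_actions: Callable[[Hashable], Iterable[Hashable]],
    initial_value: float = 0.0,
) -> Dict[Hashable, float]:
    """Return the {action: q_value} dict for `state`, creating it on first visit."""
    if state not in q_table:
        q_table[state] = {a: initial_value for a in possible_actions(state)}
    return q_table[state]


# Hypothetical action generator: the set of legal actions depends on the state.
def possible_actions(state):
    x, y = state
    return ["left", "right"] if y == 0 else ["left", "right", "up"]


q_table: QTable = {}
get_action_values(q_table, (0, 0), possible_actions)
get_action_values(q_table, (0, 1), possible_actions)
print(q_table)
```

States must be hashable (for example tuples rather than lists) to be used as dictionary keys. Only states that are actually visited take up memory, which is where this layout can beat a dense tensor.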

I want to update a Q-value, but the next state is one that was never there before and has to be added to my nested dictionary with all possible actions having initial Q-values of zero; how do I update the Q-value, now that all of the actions in this next state have Q-values of zero?

I assume you are referring to the update rule from single step Q learning:

$$Q(s,a) \leftarrow Q(s,a) + \alpha(r + \gamma \text{max}_{a'} Q(s',a') - Q(s,a))$$

What do you do when you first visit $(s,a)$, and want to calculate $\text{max}_{a'} Q(s',a')$ for the above update, yet all of $Q(s',a') = 0$, because you literally just created them?

What you do is use the zero value in your update. There is no difference whether you create entries on demand or start with a large table of zeroes. The value of zero is your best estimate of the action value, because you have no data. Over time, as the state and action pair are visited multiple times, perhaps across multiple episodes, the values from experience will back up over time steps due to the way that the update formula makes a link between states $s$ and $s'$.
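
To make this concrete, here is a rough sketch of the single-step update applied to a nested-dictionary table where the next state is created on demand. The function name `q_update`, the `actions_for` callback, the state labels and the values of $\alpha$ and $\gamma$ are all assumptions chosen for illustration.

```python
def q_update(q_table, state, action, reward, next_state, actions_for,
             alpha=0.1, gamma=0.9, initial_value=0.0):
    """Apply one single-step Q-learning update, adding unseen states on demand."""
    for s in (state, next_state):
        if s not in q_table:
            q_table[s] = {a: initial_value for a in actions_for(s)}
    # For a brand-new next state every entry equals initial_value, so the max
    # is just initial_value. default=0.0 covers states with no actions
    # (e.g. terminal states), which is an assumption of this sketch.
    best_next = max(q_table[next_state].values(), default=0.0)
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Hypothetical action generator: every state allows the same two actions.
actions_for = lambda s: ["a", "b"]

q = {}
# First visit: "s1" is new, so max Q(s1, .) = 0 and only the reward contributes.
first = q_update(q, "s0", "a", 1.0, "s1", actions_for, alpha=0.5, gamma=0.9)
# Later, a transition into "s0" picks up the value learned there.
second = q_update(q, "s_prev", "a", 0.0, "s0", actions_for, alpha=0.5, gamma=0.9)
print(first, second)
```

With these numbers the first update gives $0 + 0.5(1 + 0.9 \cdot 0 - 0) = 0.5$, and the second gives $0 + 0.5(0 + 0.9 \cdot 0.5 - 0) = 0.225$, which shows the value backing up from $s'$ to $s$ even though both entries started at zero.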

Actually you can use any arbitrary value other than zero. If you have some method or information from outside of your reinforcement learning routine, then you could use that. Also, sometimes it helps exploration if you use *optimistic* starting values - i.e. some value which is likely higher than the true optimal value. There are limits to that approach, but it's a quick and easy trick to try and sometimes it helps explore and discover the best policy more reliably.
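
A toy illustration of why optimistic starting values can help exploration: the sketch below uses a single state with two actions and a purely greedy agent, so there is no next-state term in the update. The rewards, step size and function name `run_greedy` are made up for the example, and it is not meant to represent OP's problem.

```python
def run_greedy(initial_value, steps=20, alpha=0.5):
    """Greedy agent on a one-state problem with two deterministic rewards."""
    rewards = {"a": 1.0, "b": 2.0}  # hypothetical rewards; "b" is the better action
    q = {action: initial_value for action in rewards}
    for _ in range(steps):
        # Purely greedy selection; ties go to the first key in the dict.
        action = max(q, key=q.get)
        q[action] += alpha * (rewards[action] - q[action])
    return max(q, key=q.get)


zero_start = run_greedy(0.0)        # tries "a" first and never looks at "b"
optimistic_start = run_greedy(5.0)  # both estimates fall, so both get tried
print(zero_start, optimistic_start)
```

With a zero start the agent settles on the first action it tries, because its estimate immediately rises above the untried one. With a start of 5.0, every action looks disappointing at first, so the agent tries both before settling on the better one. In practice you would normally combine this with some explicit exploration such as $\epsilon$-greedy.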

When using optimistic starting values, what if instead of starting with zeros, I decide to start with the reward for each action vector? The reward function for an action vector like [x, y, z] is given as a linear function ax+by+cz. Would this move it to convergence faster? – EArwa – 2019-07-19T08:00:34.257

@EArwa: It would depend on the problem, and you'd have to test it and see. There are two ways that it could become faster: (1) the guesses from your linear model are close to the real optimal values; (2) the results are optimistic in a useful way (higher than the true values but not too high) and so encourage exploration. The same two things can cause it to be slower if the guesses are too far away from the real values, as it could take a while for the agent to un-learn the assumptions that you have input. The only way to know is to test it and compare it to an all-zeroes start to see whether it is better. – Neil Slater – 2019-07-19T08:14:24.427

Okay, thank you. Let me see how it would affect the solution. There are cases where the immediate reward (cost) is small but it becomes large in the long run. – EArwa – 2019-07-19T08:22:45.840

@EArwa: Q-learning - and all RL algorithms - are designed to solve that. It's what the value function measures. So you don't *need* to help it along. If you can run very fast simulations it may be best to just let Q learning do its job rather than worry about that detail. However, it *might* help speed convergence. – Neil Slater – 2019-07-19T08:37:37.550