An article released by OpenAI gives an overview of how OpenAI Five works. The article contains the following paragraph:
Our agent is trained to maximize the exponentially decayed sum of future rewards, weighted by an exponential decay factor called γ. During the latest training run of OpenAI Five, we annealed γ from 0.998 (valuing future rewards with a half-life of 46 seconds) to 0.9997 (valuing future rewards with a half-life of five minutes).
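The half-lives quoted above follow directly from γ: a reward t steps in the future is weighted by γ^t, so the half-life in steps is ln(0.5)/ln(γ). A short sketch that checks the quoted numbers, assuming the agent takes roughly 7.5 actions per second (my assumption for the step rate, not stated in the quoted paragraph):

```python
import math

def reward_half_life_seconds(gamma, steps_per_second=7.5):
    """Seconds after which a future reward's weight gamma**t
    has decayed to one half. steps_per_second is an assumed
    action rate, used only to convert steps to wall-clock time."""
    steps = math.log(0.5) / math.log(gamma)
    return steps / steps_per_second

print(reward_half_life_seconds(0.998))        # roughly 46 seconds
print(reward_half_life_seconds(0.9997) / 60)  # roughly 5 minutes
```

This reproduces the "46 seconds" and "five minutes" figures in the quote, which suggests the half-lives are just arithmetic consequences of the chosen γ values rather than separately measured quantities.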
Does annealing in this context mean the network found through training that γ was better at 0.9997? How would this be determined?
My limited understanding of the topic led me to the following assumption about how γ was annealed: different versions of the network were trained for a given amount of time, each with a different value of γ. Those versions then played against each other, or their TrueSkill scores were compared, to determine the ideal value of γ.
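The procedure I am imagining could be sketched as follows. This is purely an illustration of my assumption, not of what OpenAI actually did; `train_agent` and `trueskill_score` are hypothetical stand-ins for a real training run and a real self-play rating step:

```python
import random

random.seed(0)

def train_agent(gamma, steps=1000):
    """Stand-in for a full training run with a fixed gamma;
    returns a dummy 'policy' tagged with its gamma."""
    return {"gamma": gamma}

def trueskill_score(policy):
    """Stand-in for rating a trained policy, e.g. from self-play
    matches scored with TrueSkill. Here just a random number."""
    return random.random()

candidate_gammas = [0.99, 0.998, 0.9997]
policies = [train_agent(g) for g in candidate_gammas]
scores = [trueskill_score(p) for p in policies]
best_gamma = candidate_gammas[scores.index(max(scores))]
```

In other words, a hyperparameter search over γ, with match outcomes or ratings as the selection criterion. Is that what "annealing" means here, or does it instead mean gradually changing γ on a fixed schedule within a single training run?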