
I'm trying to implement the REINFORCE algorithm (Monte Carlo policy gradient) in order to optimize a portfolio of 94 stocks on a daily basis (I have suitable historical data for this). The idea is the following: on each day, the input to a neural network comprises:

- historical daily returns (daily momenta) for previous 20 days for each of the 94 stocks
- the current vector of portfolio weights (94 weights)

Therefore states are represented by 1974-dimensional vectors. The neural network is supposed to return a 94-dimensional action vector which is again a vector of (ideal) portfolio weights to invest in. Negative weights (short positions) are allowed and portfolio weights should sum to one. Since the action space is continuous I'm trying to tackle it via the Reinforce algorithm. Rewards are given by portfolio daily returns minus trading costs. Here's a code snippet:

```
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import MultivariateNormal

class Policy(nn.Module):
    def __init__(self, s_size=1974, h_size=400, a_size=94):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)
        self.state_size = s_size
        self.action_size = a_size

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        means = self.forward(state).cpu()
        # near-deterministic Gaussian policy: tiny diagonal covariance
        m = MultivariateNormal(means, torch.diag(torch.Tensor(np.repeat(1e-8, 94))))
        action = m.sample()
        # renormalize so the portfolio weights sum to 1
        action[0] = action[0] / sum(action[0])
        return action[0], m.log_prob(action)
```
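For reference, the reward described above (daily portfolio return minus trading costs) could be sketched as below; the proportional cost rate `cost_rate` and the helper name `compute_reward` are my own illustrative assumptions, not part of the actual environment:

```python
import numpy as np

def compute_reward(prev_weights, new_weights, daily_returns, cost_rate=1e-3):
    # hypothetical reward: portfolio return minus proportional trading costs
    portfolio_return = float(np.dot(new_weights, daily_returns))
    # turnover as the L1 distance between consecutive weight vectors
    turnover = float(np.abs(new_weights - prev_weights).sum())
    return portfolio_return - cost_rate * turnover
```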

Notice that, in order to ensure that the portfolio weights (the entries of the action tensor) sum to 1, I divide by their sum. Also notice that I'm sampling from a multivariate normal distribution with extremely small diagonal terms, since I'd like the net to behave as deterministically as possible. (I should probably use something like DDPG, but I wanted to try out simpler solutions to start with.)
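A side note on the tiny covariance: with variance 1e-8 per dimension, the Gaussian density at the sampled point is enormous, so `log_prob` returns large positive values (roughly -94/2 · log(2π·1e-8), on the order of +700), which directly scales the magnitude of the policy-gradient terms. A quick standalone check:

```python
import numpy as np
import torch
from torch.distributions import MultivariateNormal

means = torch.zeros(1, 94)
m = MultivariateNormal(means, torch.diag(torch.Tensor(np.repeat(1e-8, 94))))
action = m.sample()
print(m.log_prob(action).item())  # large and positive, on the order of +700
```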

The training part looks like this:

```
from collections import deque
import numpy as np
import torch
import torch.optim as optim

optimizer = optim.Adam(policy.parameters(), lr=1e-3)

def reinforce(n_episodes=10000, max_t=10000, gamma=1.0, print_every=1):
    scores_deque = deque(maxlen=100)
    scores = []
    for i_episode in range(1, n_episodes + 1):
        saved_log_probs = []
        rewards = []
        state = env.reset()
        # roll out one episode under the current policy
        for t in range(max_t):
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
            state, reward, done, _ = env.step(action.detach().flatten().numpy())
            rewards.append(reward)
            if done:
                break
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))
        # discounted episode return
        discounts = [gamma**i for i in range(len(rewards) + 1)]
        R = sum(a * b for a, b in zip(discounts, rewards))
        # REINFORCE loss: -log pi(a|s) * R, summed over the episode
        policy_loss = []
        for log_prob in saved_log_probs:
            policy_loss.append(-log_prob * R)
        policy_loss = torch.cat(policy_loss).sum()
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        if i_episode % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            print(scores[-1])
    return scores, scores_deque

scores, scores_deque = reinforce()
```
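One thing that is not a bug, for what it's worth: the `len(rewards)+1` in the discount list looks like an off-by-one, but `zip` truncates to the shorter sequence, so `R` is the usual discounted return:

```python
gamma = 0.99
rewards = [1.0, 2.0, 3.0]
discounts = [gamma**i for i in range(len(rewards) + 1)]   # one extra entry
R = sum(a * b for a, b in zip(discounts, rewards))        # zip drops the extra
expected = sum(gamma**t * r for t, r in enumerate(rewards))
assert abs(R - expected) < 1e-12
```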

Unfortunately, training does not converge even after fiddling with the learning rate, so my question is the following: is there anything blatantly wrong with my approach here, and if so, how should I tackle it?

Unfortunately it did not help with the learning error - the average reward still doesn't seem to be converging. – BGa – 2019-10-22T20:54:59.690

Hmm, I would honestly try without the current portfolio weights as input. If they are being passed into the input nodes in the same way as your returns features, it will be hard to train a net that can handle such different features naively. I know it seems like the net should have this info, but really you just want the best position sizing, and if it works it should produce the same position sizing regardless (save for the fees involved in repositioning, which can be handled in the reward mechanism). I'll give this another look when I get home and see if I can help somehow. – nickw – 2019-10-22T21:12:46.117