1

Say we've previously used a neural network or some other classifier C with $N$ training samples $I:=\{I_1,...I_N\}$ (that has a sequence or context, but is ignored by C) the, belonging to $K$ classes. Assume, for some reason (probably some training problem or declaring classes), C is confused and doesn't perform well. The way we assign a class using C to each test data $I$ is: $class(I):= arg max _{ {1 \leq j \leq K} } p_j(I)$, where $p_j(I)$ is the probability estimate of $I$ corresponding to the $j$-th class, given by C.

Now, on top of this previous classifier C, I'd like to use a Hidden Markov Model (HMM) to "correct" the mistakes made by the previous context-free classifier C, by taking into account the contextual/sequential information not used by C.

Hence let in my HMM, the hidden state $Z_i$ denote the true class of the $i$-th sample $I_i$, and $X_i$ be the predicted class by C. My question is: how could we use the probabilistic information $cl(I):= arg max _{ {1 \leq j \leq K} } p_j(I)$ to train this HMM? I understand that the *confusion matrix of C* can be used to define the emission prob. of the HMM, but how do we define the transition and start/prior prob.? I'm tempted to define the start/prior prob. vector as $\pi:=(p_1(x_1), ..., p_K(x_1))$. But I may be wrong. **This is my main question**.

**A follow up question:** One can define an HMM in the above way (using confusion matrix and the prob. information from C); call the resulting parameter set $\Theta_0$. But after doing so, is it advisable to estimate the parameters to best fit the data $I$ used for C, while initializing a parameter set with the values mentioned in the previous paragraph?

Why can you not make the original classifier aware of the context? E.g. use a CNN with a window of time, or an RNN? – kbrose – 2018-04-20T14:16:37.293

@kbrose: I'm kind of new to the subject, but I've been instructed to use neighborhood information of the samples; samples are retail products in supermarkets, and the contexts are not time, but the for a product P(t,s), the neighborhood consisting of all P(t+/- 1, s+/-1). – Sus_Q – 2018-04-20T14:22:30.790

Ok, so a CNN with filter widths of 3? – kbrose – 2018-04-20T23:10:30.530