## Dealing with long sequence labeling

I am dealing with a problem in which I have to label inputs (in sequence format) with 5 distinct classes. The input looks like:

X = {x_1, x_2, ..., x_500}

and the output should be something like:

Y = {y_1, y_2, ..., y_500}

The problem is that the labels are heavily imbalanced: a typical output sequence consists almost entirely of first-class labels, with only a few positions (5 or 6 per sequence) belonging to the other classes.

As a result, the classifier learns to assign everything to the first class, which still yields a high accuracy score even though the model is useless.

Edit: The loss function is cross-entropy, and the architecture is a BiLSTM applied after an embedding layer. More precisely:

InputLayer -> Embedding -> BidirectionalLSTM -> NN -> Softmax
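For reference, the pipeline above could be sketched in Keras roughly as follows. This is a minimal illustration, not the asker's actual code; the vocabulary size, embedding dimension, and LSTM width are placeholder guesses.

```python
# Sketch of: InputLayer -> Embedding -> BidirectionalLSTM -> NN -> Softmax
# Hyperparameters (vocab size, dimensions) are illustrative, not from the question.
from tensorflow.keras import layers, models

SEQ_LEN = 500      # positions per sequence (from the question)
NUM_CLASSES = 5    # distinct output classes (from the question)
VOCAB_SIZE = 100   # number of distinct characters (assumed)

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),                        # character indices
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
print(model.output_shape)  # one class distribution per character position
```

The key detail for per-character labeling is `return_sequences=True`, so the LSTM emits an output at every position rather than only at the end.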

The input is a sequence of character indices, and each character is labeled with one of 5 classes. The problem is that most characters belong to the first class.

PS: I hope this provides enough information.

If correctly identifying the minority classes is more important, tune your loss function accordingly. The other way is to insert synthetic data by oversampling sequences where the majority class is underrepresented. Welcome to the site! – Emre – 2018-06-07T17:36:27.040

@Emre Very nice advice, thank you. I thought about that, but how would I actually do it? Can you recommend any resources for further reading? – MesiA – 2018-06-07T17:51:19.880

You give us very little information about your current approach. What classifier/model are you using? What is your loss function? Exactly how many samples do you have from each class? The more information you can provide, the more likely that we can provide a useful answer. Also, I suggest you do some research on "class imbalance"; there's lots written on the subject. – D.W. – 2018-06-07T23:05:02.910

Many thanks @D.W. for your response. I looked up "class imbalance" and "skewed classes" and found some useful information. I'll try those approaches out and report back here if any of them works. As for the problem itself: it's a BiLSTM designed to predict a label for each sample in a sequence. Each sample belongs to one of 5 distinct classes, but the first class is very common (490/496 in a single sequence) while the other 4 classes share the rest (6/496). Thus finding a decent weighting for a weighted version of the loss seems like an answer. Correct me if I'm wrong. – MesiA – 2018-06-08T01:57:39.127
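One common recipe for the class weighting mentioned in this comment is inverse class frequency, w_c = N / (K · n_c), the same heuristic scikit-learn's `class_weight="balanced"` uses. A plain-NumPy sketch using the counts quoted above (how the 6 minority labels split across the 4 rare classes is an assumption):

```python
import numpy as np

# Label counts per 496-position sequence, as quoted in the comment above.
# The split of the 6 minority labels across 4 classes is an assumption.
counts = np.array([490, 2, 2, 1, 1])

# Inverse-frequency weights: w_c = N / (K * n_c).
weights = counts.sum() / (len(counts) * counts)
print(weights.round(2))  # majority class is down-weighted, rare classes up-weighted
```

With these counts the majority class gets a weight of about 0.2 while the rarest classes get 99.2, so a mistake on a rare label costs roughly 500 times more than one on the majority label.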

Please don't leave extra information in the comments. Instead, edit the question so it includes all relevant material, and reads well for someone who encounters this page for the first time. People shouldn't need to read the comments to understand what you are asking. Also, I notice that you still haven't answered some of the questions (in particular, the loss function you are currently using). Thank you! – D.W. – 2018-06-08T05:23:53.767

First of all, don't use classification accuracy as a metric. Use precision, recall, or F-score instead; they are better suited to imbalanced multiclass datasets.

Secondly, if you want to enrich your dataset with synthetic points of the minority class, a common way is to use the SMOTE algorithm.

Another way of creating more synthetic samples of your classes would be to use Generative Adversarial Networks (GANs). You will train a Generator and a Discriminator network based on the samples that you have. See this for further info.

Thanks for your response. I did some more research: the first part of your answer is correct (accuracy vs. precision/recall/F1), but the second and third parts (SMOTE/GAN) are not applicable. – MesiA – 2018-06-08T16:52:14.613

That's because it is textual data with dependencies between samples: consecutive words are related, each word belongs to one of 5 classes, and the problem is treated as a series of characters (character embedding is employed). So the oversampling you propose is not applicable: generating samples would require learning words from context, and otherwise produces nonsense sentences. Please correct me if I'm wrong. – MesiA – 2018-06-08T19:41:18.163

Based on your comments you only have 6 examples from the non-majority class. There's no way you are going to train a classifier with only 6 examples in your training set. No amount of augmentation, synthetic samples, etc. is going to deal with that problem. You need more data.

You are using the cross-entropy loss, which normally handles class imbalance reasonably well. There are standard methods for dealing with class imbalance, but with enough data you would likely be fine as-is, given the cross-entropy loss. With so little minority-class data, however, there is probably no hope. You need more training data from the non-majority classes.
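For completeness, one of the standard methods alluded to here, a class-weighted cross-entropy, can be sketched in plain NumPy (all numbers illustrative; this is a sketch of the technique, not the asker's code):

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy where each position is scaled by the weight of
    its true class, so mistakes on rare classes cost more."""
    per_pos = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    w = class_weights[labels]
    return (w * per_pos).sum() / w.sum()

# Toy example: 3 classes, class 2 is rare and gets a large weight.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.8, 0.10, 0.10],
                  [0.6, 0.30, 0.10]])   # model is weak on the rare class
labels = np.array([0, 0, 2])
class_weights = np.array([0.5, 1.0, 10.0])  # illustrative, not tuned

print(weighted_cross_entropy(probs, labels, class_weights))
```

The weighting shifts the gradient signal toward rare-class errors, but as the answer notes, no reweighting can substitute for actually having enough minority-class examples.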

What is meant by "sample" is a position inside a learning example, where each learning example is an (X, Y) pair consisting of 500 such positions. So what we have is:

(X, Y)_i = ({x_1, ..., x_500}, {y_1, ..., y_500})_i, where i = 1, ..., 5000 – MesiA – 2018-06-12T09:14:04.643