## applying word2vec on small text files


I'm totally new to word2vec so please bear with me. I have a set of text files, each containing 1000-3000 tweets. I have chosen a common keyword ("kw1") and want to find semantically related terms for "kw1" using word2vec. For example, if the keyword is "apple" I would expect to see related terms such as "ipad", "os", "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file, since word2vec would be trained on each file individually (e.g., for 5 input files, run word2vec 5 times, once per file).

My goal is to find sets of related terms for each input file given the common keyword ("kw1"), which would be used for some other purposes.

My questions/doubts are:

• Does it make sense to use word2vec for a task like this? Is it technically sound given the small size of each input file? These are the commands I use for training and querying:

    time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50
    ./distance vectors.bin

• In my results I see many noisy terms (stopwords) when I use the 'distance' tool to get terms related to "kw1". So I removed stopwords and other noisy terms such as user mentions. But I haven't seen anywhere that word2vec requires cleaned input data, so is this preprocessing needed?

• How do you choose the right parameters? I see that the results (from running the distance tool) vary greatly when I change parameters such as '-window' and '-iter'. Which technique should I use to find good values for the parameters? (Manual trial and error is not possible for me, as I'll be scaling up the dataset.)
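The preprocessing I did for the second point looks roughly like this, a minimal pure-Python sketch. The stopword list here is illustrative only (NLTK's list, or keeping stopwords entirely, may work better, as discussed in the comments):

```python
# Hedged sketch of the cleaning step: strip user mentions, URLs and a
# small illustrative stopword list before training.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    tokens = re.findall(r"[a-z0-9']+", text)   # keep simple word tokens
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tweet("The new iPad is out! @user http://t.co/x"))
# -> ['new', 'ipad', 'out']
```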


Word2vec isn't a good choice for a dataset of this size. From the research I have seen, it unleashes its power only if you feed it at least a couple of million words; 3k tweets are not enough for convincing word similarities.

but when I was using the distance tool to find the most similar words to a given word, the version with stopwords removed gave me more sensible words than the version with them. Can you guess what this means? – samsamara – 2016-04-15T01:34:22.610

Probably you are using too narrow a context: if your model looks, say, two words back and forward, you will have up to 2 stopwords in each context window, and that can hurt the results. If you broaden the context (which makes the model bigger and training longer), the with-stopwords model should give you better results, I assume. – chewpakabra – 2016-04-15T08:33:00.683

thanks for the input, makes more sense now. Also, since word2vec processes the input sentence by sentence, what would happen if I shuffled the sentences in the input document? That should totally change the output vectors, right? And again, given that it processes sentence by sentence, how does word2vec differ from doc2vec? thanks again. – samsamara – 2016-04-15T09:31:13.723

If you mix sentences, you basically corrupt the context information for the last/first words of each sentence (since they will pick up the neighbouring sentence's words as their nearest context, which is not always valid). The change wouldn't be total, but it would be significant, I assume. On the difference between w2v and d2v, I like the tutorial here: http://rare-technologies.com/doc2vec-tutorial/ . Basically, you are learning an abstract vector representation of either a word or a label assigned to a sentence/document, so the core difference is the modelled object. – chewpakabra – 2016-04-15T09:40:44.430

thanks, I'll have a look there. And regarding the first point: even though word2vec processes the input sentence by sentence, the first word of sentence 2 will still see the last 3 words of sentence 1 if the window size is 3, won't it? I thought it treated each sentence as a separate entity, but I guess I'm wrong. – samsamara – 2016-04-15T13:49:53.890

do we need to remove stopwords as a data pre-processing step? – samsamara – 2016-01-11T12:38:52.210

No, in the word2vec approach you don't need to do that, since the algorithm itself relies on a broad context to find similarities between words, so stopwords (most of which are prepositions, pronouns and the like) are an important asset for the algorithm. – chewpakabra – 2016-01-11T17:09:41.637