I'm totally new to word2vec so pls bear it with me. I have a set of text files each containing a set of tweets, between 1000-3000. I have chosen a common keyword ("kw1") and wants to find semantically relevant terms for "kw1" using word2vec. For example if the keyword is "apple" I would expect to see related terms such as "ipad" "os" "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file as word2vec would be trained on individual files (eg., 5 input files, run word2vec 5 times on each file).
My goal is to find sets of related terms for each input file given the common keyword ("kw1"), which would be used for some other purposes.
My questions/doubts are:
- Does it make sense to use word2vec for a task like this? is it technically right to use considering the small size of an input file?
I have downloaded the code from code.google.com: https://code.google.com/p/word2vec/ and have just given it a dry run as follows:
time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50 ./distance vectors.bin
From my results I saw I'm getting many noisy terms (stopwords) when I'm using the 'distance' tool to get related terms to "kw1". So I did remove stopwords and other noisy terms such as user mentions. But I haven't seen anywhere that word2vec requires cleaned input data...?
How do you choose right parameters? I see the results (from running the distance tool) varies greatly when i change parameters such as '-window', '-iter'. Which technique should I use to find the correct values for the parameters. (manual trial and error is not possible for me as I'll be scaling up the dataset).