I am looking to compile a sentiment corpus of news articles in multiple languages (~100k per language, for a machine learning experiment), with each article labeled positive, neutral, or negative. I have searched high and low but could not find an existing dataset like this. I already have the news articles in each language.
My question to the community is how would you achieve this as accurately as possible?
I first looked at Mechanical Turk, where you can hire people to label each article manually. That may be the best way forward, but it is expensive.
- Sentiment Idea
My current idea is to run each news article through a few of these libraries (for example AFINN, then TextBlob, then VADER), and only accept into the corpus those articles that score positive, negative, or neutral unanimously across all three libraries. Does that seem like a fairly strong and reasonable verification process?
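To make the unanimous-vote filter concrete, here is a minimal sketch. The scorers are passed in as plain callables standing in for AFINN, TextBlob, and VADER (e.g. `Afinn().score`, `lambda t: TextBlob(t).sentiment.polarity`, and `lambda t: SentimentIntensityAnalyzer().polarity_scores(t)["compound"]`), and the ±0.05 cutoff is an assumption borrowed from VADER's suggested compound-score threshold, not something the other two libraries prescribe:

```python
from typing import Callable, List, Optional

def to_label(score: float, threshold: float = 0.05) -> str:
    # Map a raw sentiment score to a discrete label. The +/-0.05 cutoff
    # mirrors VADER's suggested compound-score threshold; treat it as a
    # tunable assumption, since AFINN and TextBlob use different scales.
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

def unanimous_label(text: str,
                    scorers: List[Callable[[str], float]]) -> Optional[str]:
    """Return a label only if every scorer agrees; otherwise None."""
    labels = {to_label(scorer(text)) for scorer in scorers}
    return labels.pop() if len(labels) == 1 else None
```

An article would then enter the corpus only when `unanimous_label` returns something other than `None`. One caveat worth noting: because all three scorers would need to be normalized to comparable ranges before a shared threshold makes sense, the agreement rate you see will depend heavily on how you do that normalization.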
- Language Idea
The next issue pertains to language itself. The three-library pipeline above can be run on English with no issue. However, these libraries do not uniformly support many other languages (Spanish, German, Chinese, Arabic, French, Portuguese, etc.). I was thinking of doing what VADER suggests: take the news stories in non-English languages, send them through the Google Translation API to get them into English, and then run them through the existing three-library pipeline. I do realize there will be a loss of semantics for many articles. However, my hope is that enough articles will translate well enough that some pass through the three-library pipeline.
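The translate-then-filter flow could be sketched like this. Both callables here are hypothetical stand-ins: `translate` would wrap a real Google Cloud Translation client call in practice, and `label` would be the three-library unanimous check (returning `None` on disagreement):

```python
from typing import Callable, Iterable, List, Optional, Tuple

def build_corpus(articles: Iterable[str],
                 translate: Callable[[str], str],
                 label: Callable[[str], Optional[str]]) -> List[Tuple[str, str]]:
    """Translate each article to English, run the English-only labeler,
    and keep only articles that get a unanimous verdict. Articles where
    the labelers disagree are silently dropped, which is where the
    expected 100k -> 10k shrinkage would come from."""
    corpus = []
    for text in articles:
        verdict = label(translate(text))
        if verdict is not None:
            corpus.append((text, verdict))
    return corpus
```

Keeping the original (untranslated) text in the output pairs means the final corpus stays multilingual even though the labeling happened in English.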
I am aware that translating articles and sending them through this triple-agreement sentiment pipeline may take a 100k corpus and yield only 10k results. I am fine with that; I can easily acquire more data. Accuracy is my primary concern, then price.
What would you do that might be a more accurate way of building a sentiment corpus of news articles? Is there an existing best practice for assembling a corpus like this?