For some time, I worked on designing metrics for recognizing well-written and meaningful software requirements. I then decided to work with Stack Overflow question posts, because they form a large corpus and also carry votes, which indicate human-perceived quality¹.
Now I am not sure which data science/NLP terms properly describe a methodology for deciding whether a text is meaningful or not.
I've been looking into entropy and semiotics, and also into research specifically on the Stack Overflow corpus, but I have failed to find seminal papers that actually deal with defining and measuring the meaningfulness of text.
What is meaningfulness?
- According to Wiktionary and Wikipedia, meaningfulness is "the state or measure of being meaningful", meaningful is "having meaning, significant", and meaning is "the information or concepts that a sender intends to convey, or does convey, in communication with a receiver".
- I've also asked a question on Linguistics SE to find out whether there are recent results from academic research on this, and more up-to-date definitions.
The Stack Overflow folks have presumably already implemented some practical entropy detection algorithms to filter out low-quality posts (we know this from the system filter that blocks a post until you have typed in more content). However, for example, the following quite meaningless question produces no warning from the system; that is, its entropy, i.e. randomness, appears low enough, although the text is in fact very random and meaningless.
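To make the entropy point concrete, here is a minimal sketch of a character-level Shannon entropy measure. I don't know what Stack Overflow's filter actually computes; character-level entropy is purely my assumption, and `char_entropy` and `shuffle_words` are hypothetical helper names. The sketch shows why such a measure cannot detect meaninglessness: it depends only on character frequencies, so shuffling the words of a sensible sentence leaves the score essentially unchanged.

```python
import math
import random
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the empirical character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def shuffle_words(text: str, seed: int = 0) -> str:
    """Return the same words in a random order (fixed seed for reproducibility)."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


if __name__ == "__main__":
    sensible = "how do I sort a list of dictionaries by a key in python"
    scrambled = shuffle_words(sensible)
    # Both texts contain exactly the same characters, so the scores agree
    # up to floating-point rounding even though one text is nonsense.
    print(sensible, char_entropy(sensible))
    print(scrambled, char_entropy(scrambled))
```

Any measure of this kind only separates keyboard mashing from text with natural-looking character statistics; it says nothing about whether the words form a meaningful question.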
I do understand that there is no absolute "meaning", because it depends on the context of the message receiver (semiotics!), but then it should be possible to put a message (an SO question) into the context of all previously received messages (previously posted SO questions).
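One way to sketch "putting a message into the context of previous messages" is to score a new question by its cross-entropy under a language model built from earlier questions. The example below uses a unigram word model with add-one smoothing, which is my own simplifying assumption and not an established meaningfulness metric; `tokenize`, `build_unigram_model`, and `cross_entropy` are hypothetical names, and the tiny corpus is made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_unigram_model(corpus: list[str]) -> Counter:
    """Count word occurrences over all previously seen messages."""
    counts = Counter()
    for doc in corpus:
        counts.update(tokenize(doc))
    return counts


def cross_entropy(message: str, counts: Counter) -> float:
    """Average bits per word of `message` under the add-one smoothed unigram model."""
    tokens = tokenize(message)
    if not tokens:
        return float("inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    bits = sum(-math.log2((counts[t] + 1) / (total + vocab)) for t in tokens)
    return bits / len(tokens)


if __name__ == "__main__":
    previous = [
        "how do I sort a list in python",
        "how do I parse json in python",
    ]
    model = build_unigram_model(previous)
    # Lower score: the message looks more like what the receiver has seen before.
    print(cross_entropy("how do I sort json in python", model))
    print(cross_entropy("purple monkey dishwasher", model))
```

A score like this measures familiarity relative to the corpus rather than meaning as such: a novel but perfectly meaningful question would also score high, and a unigram model ignores word order entirely, so it is at best a starting point.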
¹ SO content features code snippets in addition to natural language, and what they mean and what they contribute could be a research subject of its own. For simplicity, I'll probably just exclude code from the text analysis, though it is certainly also worth looking at the correlation between code snippets and votes.