How to determine the complexity of an English sentence?

10

4

I am working on an app to help people learn English as a second language. I have validated that example sentences help in learning a language by providing extra context. I did that by conducting a small study in a classroom of 60 students.

I have mined over a hundred thousand sentences from Wikipedia for various English words (including Barron's 800 words and the 1000 most common English words).

The entire dataset is available at https://buildmyvocab.in

To maintain the quality of the content, I filtered out sentences longer than 160 characters, since they might be difficult to understand.

As a next step, I want to automate sorting this content by ease of understanding. I am a non-native English speaker myself, so I want to know what features I can use to separate easy sentences from difficult ones.

Also, do you think this is possible?

BuildMyVocab

Posted 2017-06-03T20:12:19.593

Reputation: 103

Answers

8

Yes. There are various readability metrics, such as the Gunning fog index. The textacy library for Python has a nice list of them, with implementations.

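>>> # ts is assumed to be a textacy TextStats object built from a parsed document (setup not shown)
>>> ts.flesch_kincaid_grade_level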
10.853709110179697
>>> ts.readability_stats
{'automated_readability_index': 12.801546064781363,
 'coleman_liau_index': 9.905629258346586,
 'flesch_kincaid_grade_level': 10.853709110179697,
 'flesch_readability_ease': 62.51222198133965,
 'gulpease_index': 55.10492845786963,
 'gunning_fog_index': 13.69506833036245,
 'lix': 45.76390294037353,
 'smog_index': 11.683781121521076,
 'wiener_sachtextformel': 5.401029023140788}
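
If you want to see what one of these metrics does under the hood, here is a minimal sketch of the Flesch-Kincaid grade level applied to single sentences, using only the standard library. The syllable counter is a rough heuristic of my own (counting vowel groups), not what textacy does, and the function names and sample sentences are just for illustration. Scores for single short sentences can be negative; only the relative order matters for sorting.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final 'e' as silent (e.g. "make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(sentence):
    """Flesch-Kincaid grade level, treating the input as exactly one sentence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Standard formula: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    # With a single sentence, words / sentences is simply len(words).
    return 0.39 * len(words) + 11.8 * syllables / len(words) - 15.59


# Hypothetical sample data, easiest sentence should come first after sorting.
sentences = [
    "Notwithstanding considerable institutional opposition, the committee ratified the controversial amendment.",
    "The cat sat on the mat.",
]
ranked = sorted(sentences, key=flesch_kincaid_grade)

for s in ranked:
    print(round(flesch_kincaid_grade(s), 2), s)
```

Since the formula only looks at sentence length and syllable counts, it says nothing about how rare the words are, so you may want to combine it with a word-frequency feature.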

GrimSqueaker

Posted 2017-06-03T20:12:19.593

Reputation: 366

You can also look at entropy or the percentage of unique words, but the metrics above are more relevant. – GrimSqueaker – 2020-04-28T12:54:51.590
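
For reference, the two extra features mentioned in the comment could be computed roughly as follows. This is only a sketch under my own assumptions: words are split with a simple regex, "entropy" is taken to mean the Shannon entropy of the word distribution within the sentence, and the name `lexical_features` is made up for the example.

```python
import math
import re
from collections import Counter


def lexical_features(sentence):
    """Return the unique-word ratio and the word-level Shannon entropy (in bits)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return {"unique_ratio": 0.0, "entropy": 0.0}
    counts = Counter(words)
    total = len(words)
    # Share of distinct words among all words in the sentence.
    unique_ratio = len(counts) / total
    # Shannon entropy of the empirical word distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"unique_ratio": unique_ratio, "entropy": entropy}


print(lexical_features("the cat saw the cat"))
```

For short sentences almost every word is unique, so both values mostly track sentence length, which fits the remark that the readability metrics are more relevant.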