How to determine the complexity of an English sentence?



I am working on an app to help people learn English as a second language. I have validated that example sentences help in learning a language by providing extra context. I did that by conducting a small study in a classroom of 60 students.

I have mined over a hundred thousand sentences from Wikipedia for various English words (including Barron's 800 words and the 1000 most common English words).

Entire data is available at

In order to maintain the quality of content, I filtered out sentences which were longer than 160 characters since they might be difficult to understand.

As a next step, I want to automate sorting this content by ease of understanding. I am a non-native English speaker myself, so I want to know which features I can use to separate easy sentences from difficult ones.

Also, do you think this is possible?


Posted 2017-06-03T20:12:19.593

Reputation: 103



Yes. There are various readability metrics, such as the Gunning fog index. The textacy library for Python has a nice list of them, with implementations.

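>>> import textacy
>>> # Assumption: an older textacy release (pre-0.7 API); `doc` is a textacy.Doc built from your text
>>> ts = textacy.TextStats(doc)
>>> ts.flesch_kincaid_grade_level
10.853709110179697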
>>> ts.readability_stats
{'automated_readability_index': 12.801546064781363,
 'coleman_liau_index': 9.905629258346586,
 'flesch_kincaid_grade_level': 10.853709110179697,
 'flesch_readability_ease': 62.51222198133965,
 'gulpease_index': 55.10492845786963,
 'gunning_fog_index': 13.69506833036245,
 'lix': 45.76390294037353,
 'smog_index': 11.683781121521076,
 'wiener_sachtextformel': 5.401029023140788}
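
If you would rather not depend on a particular textacy version, the same idea can be sketched in plain Python. The snippet below is only an illustration, not textacy's implementation: it scores each sentence with the standard Flesch-Kincaid grade formula and sorts by that score. The helper names (`count_syllables`, `flesch_kincaid_grade`, `sort_by_difficulty`) and the sample sentences are made up for this example, and the syllable counter is a crude vowel-group heuristic, so scores will differ somewhat from a library's.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing 'e' is usually silent ("before"), except in "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(sentence):
    """Flesch-Kincaid grade level, treating the input as a single sentence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59

def sort_by_difficulty(sentences):
    """Return the sentences ordered from easiest to hardest score."""
    return sorted(sentences, key=flesch_kincaid_grade)

# Hypothetical sample data, not taken from the mined corpus.
sentences = [
    "The epistemological ramifications of quantum indeterminacy remain controversial.",
    "The cat sat on the mat.",
    "She reads a book every night before bed.",
]

for s in sort_by_difficulty(sentences):
    print(round(flesch_kincaid_grade(s), 2), s)
```

Since these formulas were designed for longer passages, scores on single sentences are noisy. It is probably worth combining them with other features, such as word frequency, before trusting the ordering.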


Posted 2017-06-03T20:12:19.593

Reputation: 366

You can also look at entropy or percent of unique words, but the above metrics are more relevant. – GrimSqueaker – 2020-04-28T12:54:51.590
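
For reference, the two measures mentioned in this comment can be computed roughly as follows. This is a minimal sketch with made-up function names (`tokenize`, `unique_word_ratio`, `word_entropy`) and a naive regex tokenizer, which is an assumption of this example rather than anything from the thread.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def unique_word_ratio(text):
    """Fraction of tokens that are distinct (type-token ratio)."""
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0

def word_entropy(text):
    """Shannon entropy (in bits) of the word distribution in the text."""
    words = tokenize(text)
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(unique_word_ratio("the cat and the dog"))
print(word_entropy("a a b b"))
```

In a single short sentence almost every word is unique, so both values sit near their maximum and separate sentences poorly. They are more informative on longer passages.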