How to get the number of syllables in a word?

9

2

I have already gone through this post which uses nltk's cmudict for counting the number of syllables in a word:

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] 

However, for words outside the CMU dictionary, such as names (for example, Rohit), the lookup raises a KeyError instead of giving a result.

So, is there any other/better way to count syllables for a word?
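For readers unfamiliar with the format, here is a minimal sketch of why counting phones that end in a digit works. The `sample` dict is hand-written in cmudict's format (the exact phones are illustrative, not loaded from nltk), and `vowel_phones` is a hypothetical helper name. Vowel phones carry a stress digit (0, 1 or 2) and consonants do not, so each digit-suffixed phone corresponds to one syllable.

```python
# Hand-written sample in cmudict's format: word -> list of pronunciations,
# each pronunciation a list of phones. Illustrative only, not loaded from nltk.
sample = {
    "hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]],
    "syllable": [["S", "IH1", "L", "AH0", "B", "AH0", "L"]],
}

def vowel_phones(pron):
    """Keep only the phones that end in a stress digit (the vowels)."""
    return [phone for phone in pron if phone[-1].isdigit()]

for word, prons in sample.items():
    for pron in prons:
        vowels = vowel_phones(pron)
        print(word, vowels, len(vowels))

# A word that is not a key simply has no pronunciation to count.
print("rohit" in sample)  # False
```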

Dawny33

Posted 2017-09-28T06:04:25.437

Reputation: 7 606

1

Well, http://www.wordcalc.com/ can handle "Rohit", so seems like it is possible. I don't know how it is doing it though . . . and it is not perfect.

– Neil Slater – 2017-09-28T07:20:40.783

wordcalc.com gave "syllable" a count of 1 (I'd call it 3). I think it may be using the hyphenation rules from your linked question. It seems that these coincide with pronounced syllables a lot of the time, but not 100%. – Neil Slater – 2017-09-28T07:27:35.253

Answers

11

You can try another Python library called Pyphen. It's easy to use and supports a lot of languages.

import pyphen
dic = pyphen.Pyphen(lang='en')
print(dic.inserted('Rohit'))
# prints: Ro-hit
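To turn the hyphenated string into a number, one option is to count the pieces between hyphens. This is only a sketch: `count_hyphenated` is a hypothetical helper, not part of Pyphen, and it assumes the input is a string in the form that `inserted` returns. Hyphenation points are not guaranteed to match spoken syllables, so treat the result as an estimate.

```python
def count_hyphenated(hyphenated, sep="-"):
    """Count the chunks in a hyphenated string such as 'Ro-hit'."""
    parts = [p for p in hyphenated.split(sep) if p]
    return max(1, len(parts))

print(count_hyphenated("Ro-hit"))  # 2

# With Pyphen installed, the input would come from the call shown above:
#   dic = pyphen.Pyphen(lang='en')
#   count_hyphenated(dic.inserted('Rohit'))
```

Words that already contain a literal hyphen would be over-counted by this approach, so it may be worth splitting those before hyphenating.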

Tasos

Posted 2017-09-28T06:04:25.437

Reputation: 3 340

This is pretty useful but it gives a lot of wrong results. For instance it counts 'readier' as 2 syllables instead of 3, 'karate' as 1 instead of 3, 'insouciance' as 3 instead of 4, 'Siberia' as 1 instead of 4, and many more. – Hayze – 2019-03-06T20:38:36.887

I recommend against using Pyphen. It's only 54% accurate, based on cmudict, and it's easy to do much better. See my answer for details. – hauntsaninja – 2021-02-13T04:04:56.973

4

I was facing the exact same issue. This is what I did: catch the KeyError raised when the word is not found in the CMU dictionary, as below:

from nltk.corpus import cmudict
d = cmudict.dict()

def nsyl(word):
    try:
        return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
    except KeyError:
        # word not found in cmudict: use the heuristic counter,
        # wrapped in a list to match the return type above
        return [syllables(word)]

The fallback is the syllables function below:

def syllables(word):
    #referred from stackoverflow.com/questions/14541303/count-the-number-of-syllables-in-a-word
    count = 0
    vowels = 'aeiouy'
    word = word.lower()
    if word and word[0] in vowels:  # guard against an empty string
        count +=1
    for index in range(1,len(word)):
        if word[index] in vowels and word[index-1] not in vowels:
            count +=1
    if word.endswith('e'):
        count -= 1
    if word.endswith('le'):
        count += 1
    if count == 0:
        count += 1
    return count
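The same lookup-then-fallback idea can also be written without exceptions. The sketch below is an assumption-laden variant, not the code above: `count_with_fallback` and `vowel_group_count` are hypothetical names, the dictionary is passed in as a plain dict in cmudict's format, and the fallback is a deliberately crude vowel-run counter that could be swapped for any other heuristic. It always returns a list, so callers see one return type either way.

```python
import re

def vowel_group_count(word):
    """Crude fallback: count runs of consecutive vowels, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_with_fallback(word, pron_dict, fallback=vowel_group_count):
    """Return one syllable count per known pronunciation, else a guess."""
    prons = pron_dict.get(word.lower())
    if prons:
        return [sum(1 for phone in p if phone[-1].isdigit()) for p in prons]
    return [fallback(word)]

# Hand-written sample entry in cmudict's format (illustrative only).
sample = {"hello": [["HH", "AH0", "L", "OW1"], ["HH", "EH0", "L", "OW1"]]}

print(count_with_fallback("hello", sample))  # [2, 2]
print(count_with_fallback("Rohit", sample))  # [2]
```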

shantanuSpark

Posted 2017-09-28T06:04:25.437

Reputation: 41

0

Below is how I did it:

from nltk.corpus import brown, cmudict, stopwords

def countsyllables(pron):
    # pron is a single pronunciation (a list of phones);
    # vowel phones end in a stress digit, one per syllable
    return len([phone for phone in pron if phone[-1].isdigit()])

cmudict_dict = cmudict.dict()
sw = set(stopwords.words('english'))
bwns = [w.lower() for w in brown.words() if w.lower() not in sw]
missingw = []
syllablecnt = []
for w in bwns:
    try:
        # cmudict may list several pronunciations; use the first one
        syllablecnt.append(countsyllables(cmudict_dict[w][0]))
    except KeyError:
        missingw.append(w)
# approximate syllable count for the Brown corpus; words missing from cmudict are skipped
sum(syllablecnt)

sakeesh

Posted 2017-09-28T06:04:25.437

Reputation: 101

0

Like you, I wasn't thrilled with the quality of syllable counting functions I could find online, so here's my take:

import re

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

We avoid looping in pure Python; at the same time, these regexes should be easy to understand.
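To see how the three counts combine, here is a small trace using only a subset of the patterns above (one alternative from each group). The names `TRAILING_E`, `CONSONANT_LE` and `explain` are made up for this sketch; the full function uses the complete regexes.

```python
import re

# Subset of the patterns above, one alternative each, to keep the trace short.
VOWEL_RUNS = re.compile(r"[aeiouy]+", flags=re.I)
TRAILING_E = re.compile(r"[^aeiou]e[sd]?$", flags=re.I)          # subtracts
CONSONANT_LE = re.compile(r"[^aeioulr][lr]e[sd]?$", flags=re.I)  # adds back

def explain(word):
    """Return (vowel runs, subtracted matches, added matches, total)."""
    runs = VOWEL_RUNS.findall(word)
    minus = TRAILING_E.findall(word)
    plus = CONSONANT_LE.findall(word)
    total = max(1, len(runs) - len(minus) + len(plus))
    return runs, minus, plus, total

for w in ("smite", "table"):
    print(w, explain(w))
# smite (['i', 'e'], ['te'], [], 1)
# table (['a', 'e'], ['le'], ['ble'], 2)
```

In "smite" the silent final e is subtracted, while in "table" the same subtraction is cancelled because a consonant followed by "le" does form its own syllable.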

This performs better than the various snippets floating around online that I've found (including Pyphen and Syllapy's fallback). It gets over 90% of cmudict correct (and I find its mistakes quite understandable).

import nltk

cd = nltk.corpus.cmudict.dict()
sum(
    1 for word, pron in cd.items()
    if count_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
) / len(cd)
# 0.9073751569397757

For comparison, Pyphen is at 53.8% and the syllables function in the other answer is at 83.7%.

Here is how to list the common words it gets wrong:

from collections import Counter
for word, _ in Counter(nltk.corpus.brown.words()).most_common(1000):
    word = word.lower()
    if word in cd and count_syllables(word) not in (sum(1 for p in x if p[-1].isdigit()) for x in cd[word]):
        print(word)

hauntsaninja

Posted 2017-09-28T06:04:25.437

Reputation: 101