There is a nice write-up about language modeling a billion words. Below are some excerpts:
Word-level models have an important advantage over character-level models.
Take the following sequence as an example (a quote from Robert A. Heinlein):
Progress isn't made by early risers. It's made by lazy men trying to
find easier ways to do something.
After tokenization, the word-level model might view this sequence as containing 22 tokens, while the character-level model views it as containing 102 tokens. The longer sequence makes the character model's task harder, as it must account for dependencies between more tokens across more time-steps. Another issue is that character language models must learn spelling in addition to syntax, semantics, and so on. In any case, word language models will typically have lower error than character models.
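As a rough illustration (not the write-up's exact tokenizer), we can compare the two granularities on the Heinlein quote. Simple whitespace splitting yields 19 word tokens; a real tokenizer that also splits off punctuation is how the count can reach 22.

```python
# Compare word-level and character-level token counts for the quote.
quote = ("Progress isn't made by early risers. It's made by lazy men "
         "trying to find easier ways to do something.")

# Naive whitespace tokenization; real tokenizers also separate
# punctuation marks into their own tokens.
word_tokens = quote.split()
char_tokens = list(quote)

print(len(word_tokens))   # 19 whitespace-delimited tokens
print(len(char_tokens))   # 102 characters
```

The character model therefore has to propagate information across roughly five times as many time-steps to capture the same dependencies.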
The main advantage of character over word language models is their much smaller vocabulary. For example, the GBW dataset contains approximately 800 distinct characters compared to 800,000 words (after pruning low-frequency tokens). In practice this means character models require less memory and have faster inference than their word counterparts. Another advantage is that they do not require tokenization as a preprocessing step.
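To see where the memory savings come from, consider the embedding (or softmax output) table, whose size scales with vocabulary size. The embedding dimension of 256 below is an illustrative assumption, not a figure from the write-up; only the 800 vs. 800,000 vocabulary sizes come from the text.

```python
# Back-of-the-envelope embedding-table sizes for the two vocabularies.
embedding_dim = 256          # assumed for illustration
char_vocab = 800             # ~800 characters in GBW
word_vocab = 800_000         # ~800,000 words after pruning

char_params = char_vocab * embedding_dim   # 204,800 parameters
word_params = word_vocab * embedding_dim   # 204,800,000 parameters

print(word_params // char_params)  # 1000x larger table for the word model
```

Whatever embedding dimension is chosen, the word model's input and output tables are about 1000 times larger, which is the dominant cost driving the memory and inference-speed gap.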