## How to create word clouds?

146

91

Word clouds are rather useless fancy and visually appealing plots, where words are plotted with different sizes according to their frequency in a corpus. Many applications exist out there (Wordle, Tagxedo, etc.) that can give an example. I am interested in the algorithm that achieves the closest possible packing of words or other irregular shapes.

There is a method for defining the convex hull of an object (in the Computational geometry package), but I think one needs here the boundary that closes the least area. If this is calculated, perhaps the packing method of graph layout can be exploited by assuming that points on the hull of a word correspond to graph vertices... but this is just speculation. So far I could only list and style the words (that was the easy part):

tally = Tally@
Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}],
Except@LetterCharacter], _?(StringLength@# > 4 &)];
tally = Cases[tally, _?(Last@# > 10 &)];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};

words = Style[First@#, FontFamily -> "Cardinal", FontWeight -> Bold,
FontColor ->
Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, .8}]],
FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;

Framed[Grid@Partition[words, 10, 10, {1, 1}, {}],
FrameStyle -> {Gray, Thick}, RoundingRadius -> 10, ImageMargins -> 5]


Some possible specifications of the algorithm:

• According to this link (shared by cormullion) identifying the closest boundary of each word is not enough as words can appear inside other glyphs with holes, like P, A, etc. Thus indeed intersection of words must be tested.
• According to Szabolcs, the code might be able to resize words to fit them better
• Many applications are able to arrange the cloud to fill up a user-specified shape (e.g. ellipse, apple, Che Guevara, etc.) instead of being casually positioned along the ever-increasing spiral.
• It would be nice to allow individual words to have different rotations.
• As usually, a fully vectorized version is preferred over image-processing methods (if the former is faster).
• Also it would be nice to have post-rendering effects, like clickable words, mouseover effects, etc.

One way to convert strings to vector graphics is:

First@ImportString[
ExportString[
Style["SomeText", Italic, FontFamily -> "Times", FontSize -> 36],
"PDF"], "PDF", "TextMode" -> "Outlines"]


Some related questions for those who want to do further research:

@Phonon A bit more info would be useful about how/where it fails. The process should work for any words. – István Zachar – 2014-10-05T13:17:09.657

@IstvánZachar Sure, I created a list like this: namphy = {"Richard Feynman", "Albert Einstein", "Niels", "Bohr", "Erwin Schrodinger", "Alice", "Bob"}; and made a call to it in tally = Tally@ Cases[StringSplit[namphy, Except@LetterCharacter], _?(StringLength@# > 4 \[And] # =!= "Alice" &)]; the rest unchanged. It gives tons of errors... Thanks for your help in advance. – Ellie – 2014-10-05T13:21:09.890

@Phonon You cannot just replace part of code without understanding what it does. You need a weighted list of words, and Tally here would simply assing 1 for each name in namphy, as you have listed each only once. Try tally = Thread[{namphy, Range[70, 10, -10]}] and run with rest of Heike's code. – István Zachar – 2014-10-06T07:23:44.140

5

I like the question (+1) and I do not think "word clouds are rather useless". As data visualization bordering with arts it is pretty interesting and functional. It'd be great if community can come up with something similar to http://www.wordle.net/ implemented in Mathematica.

– Vitaliy Kaurov – 2012-02-26T02:29:39.807

1Yes I think this is a great question as well! I've wanted to this before in Mathematica--but unfortunately I was unable to make headway then, and thus I cannot help now. I hope someone can though. – JOwen – 2012-02-26T03:48:41.457

Is the size of the words pre-determined? I suspect some word cloud implementations size words to fit instead of fixing the sizes first and then packing the words together. – Szabolcs – 2012-02-26T07:02:17.037

It should be included into Wolfram|Alpha standard capacities: "word cloud of little->1, Rabbit->0.9, etc...", or, on Wolfram|Alpha Pro, "import document" followed by "word cloud of the most used 50 words, using a triangle template". – P. Fonseca – 2012-02-26T09:01:49.167

2

A related discussion here: http://stackoverflow.com/a/1478314/933113

– cormullion – 2012-02-26T09:28:31.630

@Szabolcs, I think in all cases I've checked on the internet it was pre-determined by their frequency, but of course as it would make sense to fine-tune sizes during layout, one can extend the capabilities of the code. – István Zachar – 2012-02-26T12:05:26.373

119

Here's what I came up with

How I did it

First we need a list of words. Here, I've taken the original list ordered by size.

tally = Tally@
Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}],
Except@LetterCharacter], _?(StringLength@# > 4 \[And] # =!=
"Alice" &)];
tally = Cases[tally, _?(Last@# > 10 &)];
tally = Reverse@SortBy[tally, Last];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};

words = Style[First@#, FontFamily -> "Cracked", FontWeight -> Bold,
FontColor ->
Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]],
FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;


The words are rasterised and cropped to make sure the bounding box is as tight as possible.

wordsimg = ImageCrop[Image[Graphics[Text[#]]]] & /@ words;


To produce the image the words are added one by one using a Fold loop where the next word is placed as close to the centre of the existing image as possible. This is done by applying a max filter to the binarized version of the original image thus turning forbidden pixels white and looking for the black point that is closest to the centre of the image.

iteration[img1_, w_, fun_: (Norm[#1 - #2] &)] :=
Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
dimw = ImageDimensions[w];
padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];

imdil = MaxFilter[Binarize[ColorNegate[padded1], 0.01],
Reverse@Floor[dimw/2 + 2]];

minpos = Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0],
Reverse[centre], DistanceFunction -> fun][[1]];
diff = ImageDimensions[imdil] - dimw;
padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];

ImagePad[#, (-Min[#] {1, 1 }) & /@ BorderDimensions[#]] &@

Fold[iteration, wordsimg[[1]], Rest[wordsimg]]


You can play around with the distance function. For example for a distance function

fun = Norm[{1, 1/2} (#2 - #1)] &


you get an ellipsoidal shape:

Fold[iteration[##, fun]&, wordsimg[[1]], Rest[wordsimg]]


## Updated version

The previous code places new words in the image by approximating them with rectangles. This works fine for horizontally or vertically oriented words, but not so well for rotated words or more general shapes. Luckily, the code can be easily modified to deal with this by replacing the MaxFilter with a ImageCorrelate:

iteration2[img1_, w_, fun_: ( Norm[#1 - #2] &)] :=
Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
dimw = ImageDimensions[w];
padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
imdil = Binarize[ImageCorrelate[Binarize[ColorNegate[padded1], 0.05],
Dilation[Binarize[ColorNegate[w], .05], 1]]];
minpos =
Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0],
Reverse[centre], DistanceFunction -> fun][[1]];
Sow[minpos - centre]; (* for creating vector plot *)
diff = ImageDimensions[imdil] - dimw;
padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
ImagePad[#, (-Min[#] {1, 1}) & /@ BorderDimensions[#]] &@


To test this code we use a list of rotated words. Note that I'm using ImagePad instead of ImageCrop to crop the images. This is because ImageCrop seems to clip the words sometimes.

words = Style[First@#, FontFamily -> "Times",
FontColor ->
Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]],
FontSize -> (Last@Rescale[#, range, {12, 150}])] & /@ tally;

wordsimg = ImagePad[#, -3 -
BorderDimensions[#]] & /@ (Image[
Graphics[Text[Framed[#, FrameMargins -> 2]]]] & /@ words);

wordsimgRot = ImageRotate[#, RandomReal[2 Pi],
Background -> White] & /@ wordsimg;


The iteration loop is as before:

Fold[iteration2, wordsimgRot[[1]], Rest[wordsimgRot]]


which produces

## Second update

To create a vector graphics of the previous result, we need to save the positions of the words in the image, for example by adding Sow[minpos - centre] to the definition of iteration2 somewhere towards the end of the code and using Reap to reap the results. We also need to keep the rotation angles of the words, so we'll replace wordsimgRot with

angles = RandomReal[2 Pi, Length[wordsimg]];

wordsimgRot = ImageRotate[##, Background -> White] & @@@
Transpose[{wordsimg, angles}];


As mentioned before, we use Reap to create the position list

poslist = Reap[img = Fold[iteration2, wordsimgRot[[1]],
Rest[wordsimgRot]];][[2, 1]]


The vector graphics can then be created with

Graphics[MapThread[Text[#1, Offset[#2, {0, 0}], {0, 0}, {Cos[#3], Sin[#3]}] &,
{words, Prepend[poslist, {0, 0}], angles}]]


It is a great job.... But how is it possible to control the size of the image without losing the quality?? – Morry – 2014-01-05T15:40:13.753

1@Morry You can use this variation of the last line of code: scalingFactor = 2; Graphics[MapThread[ Style[Text[#1, Offset[#2*scalingFactor, {0, 0}], {0, 0}, {Cos[#3], Sin[#3]}], Magnification -> scalingFactor] &, {words, Prepend[poslist, {0, 0}], angles}], ImageSize -> 1000] – shrx – 2014-01-06T12:41:25.110

It is better to replace wordsimg with wordsimg = ImageCrop /@ Rasterize /@ words; It produces less errors. – Gleno – 2012-07-03T04:38:01.957

@Gleno What errors are you experiencing? I used something like what you are suggesting in a previous version but I noticed that ImageCrop would sometimes crop part of the text as well which is why I chose to frame the text first. – Heike – 2012-07-03T08:00:09.960

@Heike same as you describe. This chopps incorrectly on windows/8.0.1 => longword = Style["somelongwordforsomereason", FontFamily -> "Times", FontColor -> Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]], FontSize -> 100]; ImagePad[#, -3 - BorderDimensions[#]] & /@ (Image[ Graphics[Text[Framed[#, FrameMargins -> 2]]]] & /@ {longword}) but the small change I propose chops correctly. Note that I put up 100px fontsize, because I wanted a larger graph. :) – Gleno – 2012-07-05T02:53:52.770

@Gleno Hmm, I'll look into it. Thanks for the input anyway. – Heike – 2012-07-05T09:06:54.647

I really like this one. I think it may be slightly less "arty" but more directly informative than other methods. – Mr.Wizard – 2012-02-26T21:20:10.757

Excellent work! (not sure anout your choice of font, but that doesn't make the solution any less impressive :) – cormullion – 2012-02-26T22:08:52.567

This is an excellent solution and even though it does more computation, it is faster and scales better than mine. This is partly because it uses the image processing functions better, allowing working with a smaller image and taking better advantage of optimized built-in function, and partly because it uses ImageCorrelate, which can be reduced to an $n \log n$ complexity FFT. I have some ideas on how to speed up mine though (I should exploit the fact that mine is computing much less, and that it is easier to parallelize). I would accept this solution! – Szabolcs – 2012-02-27T09:02:56.503

@Szabolcs: Let's just wait a bit for others to join the party. The solutions of both of you are amazing though. – István Zachar – 2012-02-27T19:59:27.800

90

## A preview

Before I show any code, here's a preview of what is possible with some tweaking:

## First try

Here's a go at implementing Wordle's layout algorithm, described at cormullion's link.

First, let's generate the word data (this is pretty arbitrary):

punctuation = ",/.<>?;':\"()-_!&"

(* boring words: *)
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his",
"he", "it", "with", "is", "for", "as", "had", "you", "not", "be",
"her", "on", "at", "by", "which", "have", "or", "from", "this",
"him", "but", "all", "she", "they", "were", "my", "are", "me",
"one", "their", "so", "an", "said", "them", "we", "who", "would",
"been", "will", "no", "when", "there", "if", "more", "out", "up",
"into", "do", "any", "your", "what", "has", "man", "could",
"other", "than", "our", "some", "very", "time", "upon", "about",
"may", "its", "only", "now", "like", "little", "then", "can",
"should", "made", "did", "us", "such", "a", "great", "before",
"must", "two", "these", "see", "know", "over", "much", "down",
"after", "first", "mr", "good", "men"};

text = Select[
StringSplit@
StringReplace[ExampleData[{"Text", "AliceInWonderland"}],
Alternatives @@ Characters[punctuation] -> " "],
StringLength[#] > 2 &
];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];


Now that we have the data, let's take a word tally and generate words at sized proportional to their frequency:

words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];

styledwords =
Style[#1, FontSize -> #2, FontFamily -> "Times"] & @@@ words


Let's rasterize and binarize these (the binarization is to ease overlap detection):

images = Binarize@Rasterize[#, "Image"] & /@ styledwords;


This counts black pixels in an image:

count[img_] := ImageLevels[Binarize[img]][[1, 2]]


Now run this:

canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
Monitor[
Do[
x = 0;
w = images[[i]];
cc = count[canvas];
centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
compose :=
res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]],
Scaled[centre + x/100 {Cos[x], Sin[x]}]];
compose;
While[count[res] - cc - count[w] < 0 && x < 80,
x += 1;
compose;
];
canvas = res,

{i, 1, Length[images]}
],
canvas
]

canvas // ImageCrop


Some explanations:

This code takes the word images one by one and tries to place them at a random position. If it does not fit at that position, it tries other positions, moving on an outward winding Archimedean spiral. The step size of moving on this spiral should be chosen so that the points are distributed with an approximately constant density in the plane, and are not on gathered on a few thin lines. I used this code to verify the point distribution for a given step size (1 was good enough for a first try):

Manipulate[Graphics[Point@Table[x/100 {Cos[x], Sin[x]}, {x, 0, 100, s}]],
{s, 0.1, 1.5}]


Testing for image overlap (i.e. whether the word fits) is done by composing the image onto a canvas which has all the previously placed words, and verifying that no black pixels will collide (i.e. the total pixel count will not change after placing the image). There are probably faster ways to do this.

## Second try: converting all this to vector graphics

I used images but his should only be an aid for calculating positions. The images should be dilated prior to arranging them (to have more "air" between them), their positions should be recorded, and the recorded positions should be used to arrange the vector versions of words.

Here's a basic implementation:

images = ColorNegate@Dilation[ColorNegate[#], 2] & /@ images

canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
positions = {};
Monitor[
Do[
x = 0;
w = images[[i]];
cc = count[canvas];
centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
compose :=
res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]],
pos = Scaled[centre + x/100 {Cos[x], Sin[x]}]];
compose;
While[count[res] - cc - count[w] < 0 && x < 80,
x += 1;
compose;
];
canvas = res;
AppendTo[positions, pos],

{i, 1, Length[images]}
],
canvas
]

Rasterize[
ImageSize -> 1000], "Image"] // ImageCrop


There's a lot of refinement needed though (I didn't have time to tune the parameters).

Can you make this work without SetAlphaChannel? – Mr.Wizard – 2012-02-26T15:00:43.243

@Mr.Wizard Yes, but no time now ... the task it to compose more word images onto the canvas with the white parts being transparent and the black parts being black. Since I'm working with binary images only, we could use ImageMultiply for this. But ImageMultiply does not allow arbitrary positioning of the second image, so this'll need to be done by hand, e.g. by padding the image to the size of the canvas (some extra work) – Szabolcs – 2012-02-26T15:15:13.567

Why is little a boring word? – s0rce – 2012-02-27T01:44:31.137

2@s0rce I just used the list of 100 most common words from Project Gutenberg, I didn't edit the list manually. – Szabolcs – 2012-02-27T08:03:15.487

Is it possible to implement this words cloud into a blog such like GitHub page, the inclined words can be clicked and searching by google cse. – HyperGroups – 2013-08-30T04:56:09.900

40

Now that two of our resident Mathematica geniuses (genii?) have produced such awesome examples, there's not much room left for anyone else... :) But that didn't stop me - and I'm here to make you guys look good. I had an idea...

I decided not to make a cloud, but a tale - or rather, a tail. I've pinched Szabolcs's code to get the words and frequencies:

punctuation = ",/.<>?;':\"()-_!&";
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his",
"he", "it", "with", "is", "for", "as", "had", "you", "not", "be",
"her", "on", "at", "by", "which", "have", "or", "from", "this",
"him", "but", "all", "she", "they", "were", "my", "are", "me",
"one", "their", "so", "an", "said", "them", "we", "who", "would",
"been", "will", "no", "when", "there", "if", "more", "out", "up",
"into", "do", "any", "your", "what", "has", "man", "could", "other",
"than", "our", "some", "very", "time", "upon", "about", "may",
"its", "only", "now", "like", "little", "then", "can", "should",
"made", "did", "us", "such", "a", "great", "before", "must", "two",
"these", "see", "know", "over", "much", "down", "after", "first",
"mr", "good", "men"}; text =
Select[StringSplit@
StringReplace[ExampleData[{"Text", "AliceInWonderland"}],
Alternatives @@ Characters[punctuation] -> " "],
StringLength[#] > 2 &];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];
words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];


Now, I can work out how big the words would be:

textToOutline[string_, size_] :=
First[ImportString[
ExportString[
Style[string, FontFamily -> "Times", FontWeight -> "Bold",
FontSize -> size], "PDF"], "TextMode" -> "Outlines"]];

width[t_] := t[[2]][[2]][[1]]; (* the best way? - surely not! *)
height[t_] := t[[2]][[2]][[2]];


and make the graphics:

wordGraphics =  textToOutline[First[#], Last[#]] & /@ words;
baselines =   Reverse[Accumulate [Reverse[height /@ wordGraphics]]]; (* !? *)
wiggle = 0; fudgeFactor = 180;
Graphics[
Tooltip[Text[#1, {Sin[++wiggle/4], #2/fudgeFactor}], #3] &,
{wordGraphics,  baselines, words}],
Background -> LightGreen]


With the help of a fudge factor or two, it looks like this:

There's a number of problems with this code - I don't see how to do some things in Mathematica yet - such as find out how far I've got in a Map operation, or how the coordinates work (what is that fudge factor doing?), or, more importantly, why the first word doesn't look right. But I like the idea of reading the tale.

31

There is now a built-in version of an algorithm in v10.1: WordCloud. I wonder whether any of your nice algorithms introduced here had any influence on the built-in function...

Individual words can be styled, annotated, rotated, etc., so I must assume that there is a polygon-intersection checking algorithm running under the hood. Would be useful to know more about that. You can finetune the density of packing and the function can also fit the cloud into different shapes.

1

is there any way to outline the text, as in this question? FontVariations is not supported in the options

– user50473 – 2019-01-23T13:25:44.440