I saw this blog post with text analyses of various articles by different authors – very fascinating stuff with lots of cool linguistic statistics. However, I felt that the pieces compared were…a bit similar. It makes sense to compare oneself with one’s peers; it’s what we all do instinctively and intuitively. Still, I wanted to get a broader spectrum of comparisons, so I got statistics on a blog entry of my own, a DDJ article, two CNN texts and one linguistic article. Oh yeah. Let’s bring on the stats. For completeness’ sake I’ll include the four examples from the original blog as well.
First: this is the text statistics tool; go analyze some other articles if you want to - it’s great fun!
Anyway, here are the results:
Wi-Fi Protected Setup: this is my latest blog post, and deals with a new Wi-Fi configuration standard. Technical mumbo jumbo with WLAN jargon and so on.
Total Word Count: 913
Total Unique Words: 391
Number of Sentences: 48
Average Words per Sentence: 19.04
Hard Words: 71 (7.78%) (what’s this?)
Lexical Density: 42.83% (what’s this?)
Fog Index: 10.72 (what’s this?)
Agile Testing Strategies: more technical mumbo jumbo, but this time from DDJ.
Total Word Count: 1513
Total Unique Words: 539
Number of Sentences: 83
Average Words per Sentence: 18.24
Hard Words: 154 (10.18%) (what’s this?)
Lexical Density: 35.62% (what’s this?)
Fog Index: 11.36 (what’s this?)
Doctor denies saying that Castro in serious condition: the first CNN article on their page.
Total Word Count: 304
Total Unique Words: 148
Number of Sentences: 19
Average Words per Sentence: 16.04
Hard Words: 27 (8.88%) (what’s this?)
Lexical Density: 48.68% (what’s this?)
Fog Index: 9.95 (what’s this?)
Another CNN article; I took another one because of the low Fog Index on the first one. I wanted to have more data to be sure.
Total Word Count: 707
Total Unique Words: 312
Number of Sentences: 37
Average Words per Sentence: 19.14
Hard Words: 54 (7.64%) (what’s this?)
Lexical Density: 44.13% (what’s this?)
Fog Index: 10.70 (what’s this?)
Finally, a linguistic article as well, for comparison.
Total Word Count: 649
Total Unique Words: 307
Number of Sentences: 22
Average Words per Sentence: 29.54
Hard Words: 62 (9.55%) (what’s this?)
Lexical Density: 47.30% (what’s this?)
Fog Index: 15.62 (what’s this?)
And here are the results of the four different articles analyzed in the original post:
Total Word Count: 1575
Total Unique Words: 637
Number of Sentences: 71
Average Words per Sentence: 22.24
Hard Words: 122 (7.75%) (what’s this?)
Lexical Density: 40.44% (what’s this?)
Fog Index: 11.97 (what’s this?)
Total Word Count: 657
Total Unique Words: 360
Number of Sentences: 34
Average Words per Sentence: 19.34
Hard Words: 51 (7.76%) (what’s this?)
Lexical Density: 54.79% (what’s this?)
Fog Index: 10.83 (what’s this?)
Total Word Count: 983
Total Unique Words: 409
Number of Sentences: 117
Average Words per Sentence: 8.43
Hard Words: 82 (8.34%) (what’s this?)
Lexical Density: 41.61% (what’s this?)
Fog Index: 6.70 (what’s this?)
Total Word Count: 737
Total Unique Words: 399
Number of Sentences: 26
Average Words per Sentence: 28.34
Hard Words: 66 (8.96%) (what’s this?)
Lexical Density: 54.14% (what’s this?)
Fog Index: 14.92 (what’s this?)
Now, I’m sure you’re wondering what all these terms mean. I sure did, at least. And so did m3mnoch in the original blog entry. So here’s a link to a definition of lexical density, and here’s a link that explains the term Fog Index. (Oh, I’m such an anal bastard… Directly after I wrote that I felt like correcting myself. “No, the link does not explain the term at all – it just points to a webpage that explains the term.” In my own particular idiom I’ll leave my mistake here for all the world to see.)
Essentially, lexical density shows how varied your text is, and the Fog Index is the hypothetical reading level (measured in years of required education) that a reader needs in order to understand the text.
For reference, the New York Times has an average Fog Index of 11-12, Time magazine about 11. Typically, technical documentation has a Fog Index between 10 and 15, and professional prose almost never exceeds 18.
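Both measures can be recomputed from the raw counts the tool reports. The sketch below assumes the standard Gunning formula, 0.4 × (average words per sentence + percentage of hard words), and assumes that the tool’s “lexical density” is simply unique words divided by total words. Neither is confirmed by the tool itself, but both appear to match the figures listed above. The function names are made up for illustration.

```python
def fog_index(total_words, sentences, hard_words):
    """Gunning fog index from raw counts (assumed to be the tool's formula)."""
    avg_sentence_length = total_words / sentences
    hard_word_percentage = 100 * hard_words / total_words
    return 0.4 * (avg_sentence_length + hard_word_percentage)


def lexical_density(unique_words, total_words):
    """Unique words as a percentage of all words (inferred from the numbers)."""
    return 100 * unique_words / total_words


# Counts taken from the Wi-Fi Protected Setup results above.
print(f"Fog Index: {fog_index(913, 48, 71):.2f}")
print(f"Lexical Density: {lexical_density(391, 913):.2f}%")
```

For the Wi-Fi Protected Setup counts this gives 10.72 and 42.83%, the same values the tool reported.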
Looking at the results above, I note a few things:
- The percentage of unique words in an article varies quite a bit, but rarely to an extreme. Depending on what your definition of extreme is, of course. I won’t try to make any more comments on this since the sample set isn’t big enough (and the articles aren’t long enough) to say anything conclusive.
- The average words per sentence also differs quite a bit; for example, the linguistic article and peckham’s piece both have long sentences at close to 30 words per sentence, while one author had an average length of about 8 words per sentence. The most common length seems to be around 19 though. This is probably just a matter of writing style, but I think it’s an indication that ~20 is about average for most writers, while ~30 and above points to either a deeper command of the language (which leads to more complex sentences), a more complicated text (which demands more complexity)…or pretentiousness.
- The percentage of hard words is highest in the DDJ article (10.18%) and the linguistic article (9.55%), but stays within 7% to 9% in the others, with a few (such as the last results above) closer to 9%. I will hazard a guess that the number of hard words is closely linked with the author’s vocabulary and the jargon of the genre; however, it’s impossible to say for sure, since each article deals with a specific topic. If your article discusses an anthology, you will repeat the word a few times. (Of course, this also applies to the number of unique words in a text.)
- Then we have the lexical density and the Fog Index. First of all, they are not linked at all. I would really have suspected that a more varied text would also feature a higher Fog Index, but that – and the opposite – is apparently not the case. Again, there’s too little data to say anything for sure, but I’ll point out the high Fog Index on the linguistic article and the last blog post.
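One way to put a number on the “not linked” impression is a Pearson correlation between lexical density and Fog Index over the nine results listed above. This is only a sketch: the `pearson` helper is written out by hand for illustration, and nine short articles are far too few to prove anything either way.

```python
from math import sqrt

# (lexical density %, Fog Index) for the nine articles, in the order listed above.
RESULTS = [
    (42.83, 10.72), (35.62, 11.36), (48.68, 9.95),
    (44.13, 10.70), (47.30, 15.62), (40.44, 11.97),
    (54.79, 10.83), (41.61, 6.70), (54.14, 14.92),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


density, fog = zip(*RESULTS)
r = pearson(density, fog)
print(f"Pearson r = {r:.2f}")
```

A value near 0 would back up the impression that the two measures move independently, while a value near 1 or -1 would contradict it. With a sample this small, the result is suggestive at best.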
What I wanted to see was a clear difference between technical writing, news articles and linguistic articles. However, it’s really not that obvious: nothing seems to be totally out of place, and judging from the definitions, the Fog Index is fairly normal in all texts. Normal, but there is still a bias toward complexity in a few cases, namely the linguistic text and peckham’s blog entry. I have to admit that I haven’t looked at that one yet; maybe it will turn out to be a brilliant review of Ulysses or something. Either way, if I must draw any conclusion, I think that the conformity in the articles arises from necessity; this level of writing is essentially what’s required to not appear too simple-minded but still appeal to the general public.
I’m pretty pleased with my results: they’re not that bad for a non-native English speaker. Thank Bob that the tool doesn’t measure how well-written a text is, as well.