Blog Readability Test

December 6th, 2007

Have you finished high school? If not, chances are you won’t be able to read this blog.

High School

Yes, it’s yet another ridiculous online test. I’m a sucker for these, and I can’t help myself whenever there’s a nice button saying “test yourself”; this time I didn’t have to fill in anything other than my blog’s URL, though. There is speculation that the algorithm is a variant of the Flesch-Kincaid Grade Level Formula, and given the little I know of Flesch-Kincaid (NOTHING!) I’m inclined to agree. It’s your usual run-of-the-mill combination of sentence length and syllables and whatnot to generate an estimated lowest necessary level of education to understand the text; apparently I’m lucid enough for everyone past high school.
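For the curious, here is roughly what a Flesch-Kincaid style calculation looks like. This is only a sketch of the standard published Grade Level formula; whether the test site actually uses it (or some variant) is pure speculation, and the function name and the idea of feeding it ready-made counts are my own assumptions. Counting syllables properly is the hard part, so the sketch simply takes the counts as input.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level computed from raw counts.

    Assumption: the caller has already counted words, sentences and
    syllables; this sketch does no text parsing of its own.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    # Longer sentences and longer words both push the grade level up.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    print(round(flesch_kincaid_grade(words=100, sentences=5, syllables=150), 2))
```

The result is meant to be read as a U.S. school grade, so the made-up counts above land somewhere in high school territory.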

“High school? Aren’t you ashamed of yourself? I thought this was a blog about technical issues for nerds and geeks and weirdos, but even wee high schoolers can read it!”

Well… I was a bit miffed at first until I tested a number of other sites:

  • www.penny-arcade.com – Elementary School
  • www.abc.com – Elementary School
  • www.crankygeeks.com – Junior High
  • www.tomshardware.com – Junior High
  • www.gamespot.com – Junior High
  • www.imdb.com – Junior High
  • www.xkcd.com – Junior High
  • www.gameproducer.net – Junior High
  • anders-ivarsson.blogspot.com – Junior High

(Did you know that my Indian name is Can’t be Arsed to Make Proper Links?)

Of course, I found a bunch of sites that also got the High School grade. I can honestly say that some of these surprised me. Quite a bit.

  • www.gamefaqs.com
  • www.gamasutra.com
  • www.gamedev.net
  • www.tigsource.com
  • www.slashdot.org
  • www.engadget.com

GameFAQs? What the hell? Anyway, in my search I did find a couple of sites that received a rating above High School. They’re hard to find, but here are two:

  • www.microsoft.com
  • www.gamepolitics.com

Can you find any others?



To Allot a Lot

March 16th, 2007

English grammar contains many tricky pitfalls. Should we say “the media are” or “the media is”? Is “majority” plural or singular? (The answer is of course: it depends. Media is by definition the plural form of medium, but there are countless examples of “the media is.” If enough people choose to interpret media as singular, that’s what it’ll be. And majority can be either plural or singular depending on the situation.)

However, some things shouldn’t be hard.

Lately I’ve seen many examples of people writing “alot.” I think this has turned into my new pet peeve.

“This matters alot to me.”

“He did alot of damage.”

“Her rectum has seen alot of penises.”

I won’t give you any links to the offending sites since I don’t want to be an anal bastard. (Yes, I’m terribly proud of this pun, coming right after the last example above. I’m giggling right now, in fact.) But I really don’t see how hard it can be to turn that nonexistent word into two separate – correct – words. And I also don’t see how people reason when they write alot. Are they confusing it with allot? That’s a verb, for Bog’s sake!

Now, there might be some mitigating circumstances. For one thing, as a non-native English speaker I might have missed something; maybe this error comes from how one learns the language as a kid, or maybe it’s caused by schools offering strange grammatical rules to apply in strange places. Since I learned English after my native tongue I might be immune to those particular traps. Who knows! But it still irritates me. As this page notes: “just remind yourself that just as you wouldn’t write ‘alittle’ you shouldn’t write ‘alot.’”

Speaking of that site (Common Errors in English), there are other cool mistakes listed there. “Awe, shucks,” “full of pith and vinegar” and similarly mutilated idiomatic expressions seem to be very common – and I find that quite interesting since I haven’t seen many of these errors. Again, this is probably due to my non-native-ness: I mostly read English literature, watch English shows, read semi-literate English articles and so on. I’m probably protected from everyday English, in other words. Even if one can complain alot (ah-hah) about the linguistic quality in books, TV, movies or articles, they are probably far more grammatically correct than what you get when speaking to Bob the janitor while you’re waiting in line for the bathroom.

To all people who get the urge to enlighten me about how hypocritical I am since I complain about other people’s grammar but make mistakes myself: please send your complaints to shut_the_fuck_up@karjasoft.com



Technical Articles vs News Articles vs Linguistic Articles

January 17th, 2007

I saw this blog post with text analyses of various articles by different authors – very fascinating stuff with lots of cool linguistic statistics. However, I felt that the pieces compared were…a bit similar. It makes sense to compare oneself with one’s peers; it’s what we all do instinctively and intuitively. Still, I wanted to get a broader spectrum of comparisons, so I got statistics on a blog entry of my own, a DDJ article, two CNN texts and one linguistic article. Oh yeah. Let’s bring on the stats. For completeness’ sake I’ll include the four examples from the original blog as well.

First: this is the text statistics tool; go analyze some other articles if you want to - it’s great fun!

Anyway, here are the results:

Wi-Fi Protected Setup: this is my latest blog post, and deals with a new Wi-Fi configuration standard. Technical mumbo jumbo with WLAN jargon and so on.

Total Word Count:  913
Total Unique Words:  391
Number of Sentences:  48
Average Words per Sentence:  19.04
Hard Words:  71 (7.78%)
Lexical Density:  42.83%
Fog Index:  10.72

Agile Testing Strategies: more technical mumbo jumbo, but this time from DDJ.

Total Word Count:  1513
Total Unique Words:  539
Number of Sentences:  83
Average Words per Sentence:  18.24
Hard Words:  154 (10.18%)
Lexical Density:  35.62%
Fog Index:  11.36

Doctor denies saying that Castro in serious condition: the first CNN article on their page.

Total Word Count:  304
Total Unique Words:  148
Number of Sentences:  19
Average Words per Sentence:  16.04
Hard Words:  27 (8.88%)
Lexical Density:  48.68%
Fog Index:  9.95

Another CNN article; I took another one because of the low Fog Index on the first one. I wanted to have more data to be sure.

Total Word Count:  707
Total Unique Words:  312
Number of Sentences:  37
Average Words per Sentence:  19.14
Hard Words:  54 (7.64%)
Lexical Density:  44.13%
Fog Index:  10.70

Finally, a linguistic article as well, for comparison.

Total Word Count:  649
Total Unique Words:  307
Number of Sentences:  22
Average Words per Sentence:  29.54
Hard Words:  62 (9.55%)
Lexical Density:  47.30%
Fog Index:  15.62

And here are the results of the four different articles analyzed in the original post:

raph:

Total Word Count: 1575
Total Unique Words: 637
Number of Sentences: 71
Average Words per Sentence: 22.24
Hard Words: 122 (7.75%)
Lexical Density: 40.44%
Fog Index: 11.97

tycho:

Total Word Count: 657
Total Unique Words: 360
Number of Sentences: 34
Average Words per Sentence: 19.34
Hard Words: 51 (7.76%)
Lexical Density: 54.79%
Fog Index: 10.83

m3mnoch:

Total Word Count: 983
Total Unique Words: 409
Number of Sentences: 117
Average Words per Sentence: 8.43
Hard Words: 82 (8.34%)
Lexical Density: 41.61%
Fog Index: 6.70

peckham:

Total Word Count: 737
Total Unique Words: 399
Number of Sentences: 26
Average Words per Sentence: 28.34
Hard Words: 66 (8.96%)
Lexical Density: 54.14%
Fog Index: 14.92

Now, I’m sure you’re wondering what all these terms mean. I sure did at least. And so did m3mnoch in the original blog entry. So here’s a link to a definition of lexical density, and here’s a link that explains the term Fog Index. (Oh, I’m such an anal bastard… Directly after I wrote that I felt like correcting myself. “No, the link does not explain the term at all – it just points to a webpage that explains the term.” In my own particular idiom I’ll leave my mistake here for all the world to see.)

Essentially, the lexical density shows how varied your text is, and the Fog Index is the hypothetical reading level (measured in years of required education) that the reader has to be at in order to understand the text.
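If you want to play with the numbers yourself, here is a small sketch of how the two figures seem to be derived. The Fog Index part is the standard Gunning formula (0.4 times the sum of average sentence length and percentage of hard words). The lexical density part is an assumption on my side: dividing unique words by total words happens to reproduce the percentages listed above, but I haven’t seen the tool’s source. The function names are made up for this example.

```python
def lexical_density(unique_words: int, total_words: int) -> float:
    """Unique words as a percentage of all words.

    Assumption: this is how the tool defines lexical density; it matches
    the figures quoted in this post, nothing more.
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * unique_words / total_words


def fog_index(avg_words_per_sentence: float, hard_word_percent: float) -> float:
    """Gunning Fog Index: 0.4 * (average sentence length + % hard words)."""
    return 0.4 * (avg_words_per_sentence + hard_word_percent)


if __name__ == "__main__":
    # Figures from the Wi-Fi Protected Setup post above.
    print(round(lexical_density(391, 913), 2))
    print(round(fog_index(19.04, 7.78), 2))
```

Fed with the rounded figures from the lists above, this lands within a few hundredths of the tool’s Fog Index values; the small differences presumably come from the tool working with unrounded numbers.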

For reference, the New York Times has an average Fog Index of 11-12, Time magazine about 11. Typically, technical documentation has a Fog Index between 10 and 15, and professional prose almost never exceeds 18.

Looking at the results above, I note a few things:

  • The percentage of unique words in an article varies quite a bit, but it rarely deviates extremely. Depending on what your definition of extreme is, of course. I won’t try to make any more comments on this since the sample set isn’t big enough (and the articles aren’t long enough) to say anything conclusive.
  • The average number of words per sentence also differs by quite a bit; for example, the linguistic article and peckham’s piece both have long sentences at close to 30 words per sentence, while one author averaged only about 8 words per sentence. The most common length seems to be around 19 though. This is probably just a choice of writing style, but I think it’s an indication that ~20 seems to be about average for common people while ~30 and above points at either a deeper understanding of the language (which leads to more complex sentences), a more complicated text (which demands more complexity)…or pretentiousness.
  • The percentage of hard words is high in the DDJ article and in the linguistic article, but remains within the span 7% – 9% in the other ones. A few – such as the last results above – are closer to 9%. I will hazard a guess that the number of hard words is closely linked with the author’s vocabulary and the jargon of the genre; however, it’s impossible to say for sure since an article deals with a specific topic. If your article discusses an anthology, you will repeat the word a few times. (Of course, this also applies to the number of unique words in a text.)
  • Then we have the lexical density and the Fog Index. First of all, they are not linked at all. I would really have suspected that a more varied text would also feature a higher Fog Index, but that – and the opposite – is apparently not the case. Again, there’s too little data to say anything for sure, but I’ll point out the high Fog Index on the linguistic article and the last blog post.

What I wanted to see was a clear difference between technical writing, news articles and linguistic articles. However, it’s really not that obvious: nothing seems to be totally out of place, and judging from the definitions, the Fog Index is fairly normal in all texts. Normal, but there is still a bias toward complexity in a few cases – the linguistic text and peckham’s blog entry. I have to admit that I haven’t looked at that one yet; maybe it will turn out to be a brilliant review of Ulysses or something. Either way, if I must draw any conclusion I think that the conformity in the articles arises from necessity; this level of writing is essentially what’s required to not appear too retarded but still appeal to the general public.

I’m pretty pleased with my results: they’re not that bad for a non-native English speaker. Thank Bob that the tool doesn’t measure how well-written a text is, as well.



Literacy, Verbosity and Katakres

December 5th, 2006

Today I browsed through a language column of a Swedish newspaper, and to my surprise I found not only one, but two articles that might be worth mentioning.

The first one dealt with words. The number of words in languages in fact, and in Swedish first and foremost. The title of the article was (very loosely translated) “Sometimes English is Less Rich in Words than Swedish.” The word literacy was brought up as an example: in English it’s a very useful and multifaceted word that can be used in many different ways. For instance, it can be used to describe the ability to read or write, or it can concern literature. Or many other aspects.

In Swedish we don’t have that luxury. We don’t have a word to describe the ability to read and write. We call it – again loosely translated – “the ability to read and write.” Quite pragmatic; quite Swedish. Likewise, we have various ways of expressing the different meanings of literacy. The article’s main point was that we should utilize and be happy about our language’s many terms and expressions, but I can’t help but feel that this is a special case; an isolated event that the author attempts to use to instill the idea that Swedish isn’t a language with few words. “Sometimes” in the title of the article is a useless word: it’s like saying “sometimes it’s better to be poor than rich”; it’s a special case that will not hold true for the vast – the extremely vast – majority of cases. It’s not worth making a big deal of.

But then I started writing this post, and was struck by how I couldn’t find a good translation for the article’s title. “Sometimes English is Poorer in Words than Swedish” might have sounded better than the one I chose, but the Swedish title used a word that positively reflected that a language has many usable words. I wanted to show the negation of that, so I went for “less rich” instead. What I really wanted was a translation for the Swedish word, but I couldn’t think of one. Verbose was the best I could come up with, but to my understanding that would have implied that English was a language that required more words to describe things – the complete opposite of what I meant. And “less verbose” really wouldn’t have fit the bill at all. This might be my poor English vocabulary rearing its ugly head, but it might also be the case of a deficiency in expressing this particular thing in English.

Which brings me to the second article; one which describes the difference between the Swedish words kontamination and katakres. The first one is pretty straightforward: contamination. In linguistics this refers to a mix-up of common, neutral words, and even if contamination isn’t the correct English word it fits rather well despite that. The second word is more problematic. I can’t seem to find an English counterpart to it at all.

In fact, when I tried to look up a translation online it automatically gave me the suggestion “were you looking for catarrh?” Nooo… That’s not really what I wanted.

The word katakres comes from the Greek word for misuse, and in Swedish it means a specific mix-up: when proverbs or idioms get mixed up. An example would be:

You can’t see the forest for all the birds in the hand.

Yeah, I know. I suck. If I knew more English proverbs I might have come up with something decent.

Again, katakres might have a perfectly good English translation, but that’s beside the point. The point is that (to my knowledge) Swedish possesses a word for a concept that English doesn’t. If I were prone to state the obvious, I would mention that languages aren’t one-to-one mapped, and that they are more like Venn diagrams. And that would inevitably lead to a conclusion where I stated that speaking of a word-rich or word-poor language is pretty irrelevant since they can be used in different ways, with focus on different concepts and ideas.

It’s a good thing that I don’t state obvious things, though. And it’s also a good thing that I found the katakres article, ’cause I’ve learned a new word today. I hope someone else found it as fascinating as I did.



Comma Splice

November 30th, 2006

In vocal communication I tend to use rough and coarse language; that makes it rather fascinating that I’m anal about my written language. Except for prepositions. I don’t give a crap about ’em. And spelling. I never use spell checkers; if I don’t know how to spell a word, I’m prepared to live with the shame. In general, I suppose that I’m not really anal about my written language per se – I’m anal about my grammar.

One thing that constantly fascinates me is the comma splice errors that pop up everywhere. Literally everywhere: newspapers, literature, web pages, reports… I’ve seen one in a web comic just now. I saw one in a Pratchett book yesterday. It doesn’t matter whether it’s in Swedish or English or German – people seem to have a fetish for abusing the poor old comma in strange and horrible ways. It’s like people have collectively decided that the comma is the weird fat kid who should get beaten up every recess. You know, that kid who spends all his time at the library; people are stupid and mean and he doesn’t understand them, but books are kind and helpful and comforting. Oh, the horrors poor comma-boy has seen! He has seen the hearts of children, and they are black as the night.

Anyway, I mentioned the term comma splice, but I didn’t know about it until five minutes ago. I had planned to write a post about this horrible comma misuse, and decided to check the comma Wikipedia page before I embarked on this glorious task; that’s where I found that some kind soul had placed a link to the aforementioned comma splice definition page. Okay, I think I’ve mentioned comma splice about as many times as I can without explaining what it is: a comma splice is when two independent clauses are joined by nothing but a comma, without a binding word in between them.

Here’s an example:

John was tired, he wanted to go home.

I assume that you immediately see what the problem is with the sentence above, but just in case I’d better elaborate a bit. “John was tired” is an independent clause. It’s a perfectly correct sentence on its own; it has a subject, a predicate and an adjective as well. “He wanted to go home” is also an independent clause; it can stand on its own. These two can never be joined by just a comma! All that’s needed to correct this is to add a single conjunction:

John was tired and he wanted to go home.

Even a semicolon would make things better:

John was tired; he wanted to go home.

I get so frustrated when people make this mistake. I really don’t see why. Spelling mistakes I can understand. Not knowing the difference between an adverb and an adjective is fine. Not caring the least about grammar is also fine. But this jumps out at you (well, at me at least): it’s so fundamentally incorrect! It’s not even colloquial – it’s just plain wrong. In every language known to man. And by man I mean me. Which means just a measly few languages. But still.

For the observant: yes, I just wrote an incomplete sentence. See, I’m not a grammar nazi – it’s just this comma splice that annoys me to no end.