I thought Arcanicity was an album by The Police, but it turns out that it’s a measure of the complexity of the words that high-tech vendors use in their marketing. As I mentioned in my last post, I’ve produced an Arcanicity Index: a readability index that aims to estimate the amount of domain-specific or technical knowledge a reader needs to possess in order to understand a piece of text. Well, here’s a little bit more detail…
I’ve been posting recently about my desire to rate the readability of high-tech vendors’ marketing copy. There are many existing readability indexes, each of which looks to calculate the ease with which the structure of the sentences and words can be comprehended, but none that I’ve found tries to gauge how technically difficult the concepts contained within the text are to understand.
I’m interested in this because, as part of project REPAMATron, I want to be able to compare various vendors’ marketing copy. It also gives me an insight into which audience stratum (IT Technical, IT Business or Business) a vendor is targeting. Anyway, I’ve settled on a calculation that seems to hit all of the design criteria I set for a readability index that takes into account the level of domain knowledge required to interpret text.
And here it is…

Arcanicity = 2.4366 x (1 + NAS) x (1 + UAS) x PAS

Let me explain how I got to this first draft and exactly what that calculation means.
But first, let me recap the issue I am looking to address. If, during the course of my analysis, I were to find the following text on a vendor’s web site, product PDF or PowerPoint presentation…
ACME’s IJK-compliant WhooshBang product provides developers with an integrated IDE for developing RTY, JKL and ERT applications. ASDF support is provided through a plugin which also serves as a POL generator. Developers that previously used NMB or XCV will find the environment totally familiar to them. Security is enhanced through the addition of RTY scripts which, when used in conjunction with the POL generator, ensure unauthorised access is not allowed. Finally, upgrading from previous versions of WhooshBang is simple. Just install the new version and old UIO repositories will be upgraded automatically.
…I might come to the conclusion that the text is pretty much impenetrable. Because whilst it’s relatively simple to interpret the physical structure of the text, unless the reader understands those acronyms, the meaning of the text remains hidden. And before you rush out to buy a WhooshBang, I’m sorry to disappoint you, but it’s fictional.
But just how difficult is it to understand that text? Is it aimed solely at an audience with the domain knowledge to be able to understand it? Does the vendor, ACME in this case, realise that they are excluding readers from understanding the content of their web site? And is the text any more or less difficult to understand than text used by ACME’s major competitors? Well these are the questions I am looking to my Arcanicity Index to answer.
How might we calculate the degree of arcane content in a text like this? Well, let’s look at some general text statistics from the text above and see if we can identify some metrics we could use.
General Text Statistics
These statistics give a good indication of the complexity of the structure of the text itself. These are the metrics used by traditional readability indexes such as Flesch-Kincaid, Gunning-Fog, Coleman-Liau, etc.
| Metric | Value |
|---|---|
| Average characters per word | 5.49 |
| Average syllables per word | 1.94 |
| Average words per sentence | 15.67 |
| Words with 3 or more syllables | 28 |
| Percentage of words with 3 or more syllables | 29.79 |
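For reference, the simplest of these structural metrics can be computed with a few lines of Python. This is only a sketch: the tokenisation here is deliberately naive (splitting on punctuation and whitespace), so the figures it produces may differ slightly from those quoted above.

```python
import re

def text_stats(text):
    """Average characters per word and words per sentence for a text.

    Splitting sentences on [.!?] and words on letter runs is a crude
    approximation; a real implementation would use a proper tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_chars_per_word = sum(len(w) for w in words) / len(words)
    avg_words_per_sentence = len(words) / len(sentences)
    return avg_chars_per_word, avg_words_per_sentence
```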
General Readability Index Statistics
And here are some of the general text readability indexes that look to estimate the amount of education someone would need in order to be able to comprehend that text. These indexes concentrate only on the physical structure of the paragraph/sentences/syllables and words.
| Index | Score |
|---|---|
| Automated Readability Index | 11.90 |
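The Automated Readability Index is a good illustration of the point: it is built purely from character, word and sentence counts, so it says nothing about the domain knowledge the words themselves demand. A sketch of the standard formula:

```python
def automated_readability_index(chars, words, sentences):
    """Automated Readability Index (ARI).

    Estimates the US school grade level needed to comprehend a text,
    using only raw character, word and sentence counts.
    """
    return 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
```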
Acronym Density Statistics
So what if we were to look at the acronyms and abbreviations contained within the text. What if we tried to compute the density of these arcane words? Would that help us to determine whether the content was aimed at someone with specific technical knowledge? Well here are the key acronym statistics for the text above.
| Statistic | Value |
|---|---|
| Acronyms (unique) | ERT, IJK, UIO, JKL, POL, NMB, XCV, IDE, RTY |
| Acronyms (non-unique) | ERT, IJK, UIO, JKL, POL, POL, NMB, XCV, IDE, RTY, RTY |
| Acronym count (unique) | 9 |
| Acronym count (non-unique) | 11 |
| Acronyms per sentence (unique) – UAS | 1.5 |
| Acronyms per sentence (non-unique) – NAS | 1.833 |
| Peak acronyms per sentence – PAS | 5 |
Which Measurements do we need to calculate the Index?
In designing an algorithm that could gauge how complex a text is to understand from a domain knowledge perspective, I’ve concluded that I need three key metrics in my calculation.
- NAS – NON-UNIQUE ACRONYMS PER SENTENCE: the density of the total number of acronyms per sentence (the total acronyms in the text divided by the number of sentences in the text)
- UAS – UNIQUE ACRONYMS PER SENTENCE: the density of the unique acronyms per sentence (the count of unique acronyms in the text divided by the number of sentences in the text)
- PAS – PEAK UNIQUE ACRONYMS IN A SINGLE SENTENCE: the peak number of unique acronyms found in a single sentence
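A minimal sketch of how these three metrics might be extracted. Note the assumptions: any run of two or more capital letters is treated as an acronym (cruder than ideal, since it would also catch names like ACME), and sentences are split naively on punctuation.

```python
import re

ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")

def acronym_stats(text):
    """Return (NAS, UAS, PAS) for a text.

    NAS: non-unique acronyms per sentence
    UAS: unique acronyms per sentence
    PAS: peak unique acronyms in any single sentence
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    per_sentence = [ACRONYM_RE.findall(s) for s in sentences]
    all_acronyms = [a for hits in per_sentence for a in hits]
    nas = len(all_acronyms) / len(sentences)
    uas = len(set(all_acronyms)) / len(sentences)
    pas = max(len(set(hits)) for hits in per_sentence)
    return nas, uas, pas
```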
The combination of these three metrics gives me a good indication of the number, average density and peak density of the acronyms within a specific text. The question then is how to accommodate these in some form of index. To allow the index to be easily interpreted I want it to max out at 100, so a score of 100 would mean that total domain-specific knowledge is required to comprehend the text. However, if I reserve the 100 score for text that is completely full of acronyms (i.e. all 100 words of a 100-word text are different acronyms) then scores for less acronym-dense text would be far too small to measure. The scores would be disproportionately skewed towards zero.
I needed a way to allow me to reserve a score of 100 for text that is aimed at experts only, but also allow for less acronym-dense text to receive a representative score too. I therefore needed to find a density point at which the text is already in effect “complex enough” and this point would become 100. I’m looking for a point where as more acronyms are added, the sentence doesn’t really get more complex to understand as it already requires a qualified reader. This would mean that it would be possible to score higher than 100 – much higher in fact, but 100 would be the notional maximum. Any score higher than 100 means that the density of acronyms is greater but the meaning is equally obscure.
So what would a score of 100 mean?
I felt that there must be a limit to the non-unique acronyms per sentence density (NAS above) across a text extract where increasing that density doesn’t really exclude any non-qualified readers, as the text is already at expert level. Likewise there must be a density of unique acronyms per sentence (UAS above) beyond which no further readers are excluded. Lastly, the peak number of unique acronyms in any one sentence (PAS above) would give a good indication of a text that was meant for expert eyes.
So I then did a bit of web crawling and looked at the typical count and density of acronyms on a number of infrastructure software vendors’ web sites (my area of focus and expertise). I rated the ‘arcanicity’ (the degree of arcane content) of those sites arbitrarily myself: I read the marketing copy, counted the sentences and acronyms, worked out the densities and counts for NAS, UAS and PAS, and, looking at my results across a number of sites, ended up with the following measurements.
- NAS: 2.8 non-unique acronyms per sentence
- UAS: 1.7 unique acronyms per sentence
- PAS: 4.0 peak unique acronyms in a single sentence
Why 2.8, 1.7 and 4.0?
Well those numbers were the point at which I felt the text was already obscure enough. Or put another way, I felt that a sentence is for ‘expert eyes only’ if it contains an average of 2.8 acronyms per sentence (when looking at the total count of acronyms – not just unique acronyms) AND has 1.7 unique acronyms per sentence AND features one sentence that has 4 unique acronyms within it. This therefore is maximum arcanicity. I’ve capped the peak unique acronyms at 4 because again I felt that if a single sentence contained 4 unique acronyms, it was already expecting a degree of expertise from the reader in order for them to be able to understand it. Would a 5th acronym make it any less impenetrable to someone without domain knowledge? Probably not.
Creating the Arcanicity Index Equation
Now I needed to put this into an equation that would yield a score of 100 if the text contained 2.8 non-unique acronyms per sentence, 1.7 unique acronyms per sentence and a peak of 4 unique acronyms in one sentence. Given that we want the maximum score to be 100, working back from there gives us something like:

100 = (1 + NAS) x (1 + UAS) x PAS x MAGIC NUMBER

100 = 3.8 x 2.7 x 4 x 2.4366
Note that I’ve added 1.0 to each of the density figures to ensure that they are always 1.0 or above. Otherwise density figures of less than 1 acronym per sentence would serve to lower the index and would create an artificial pivot point in the scoring around a density of 1.0. The 2.4366 is the magic number that, when used with the maximum values of NAS, UAS and PAS, creates a maximum score of 100. Or mathematically…

MAGIC NUMBER = 100 / (3.8 x 2.7 x 4) = 2.4366
And so this becomes…
Arcanicity = 2.4366 x (1 + NAS) x (1 + UAS) x PAS
As I mentioned above it’s possible to score greater than 100 but I’d contend that the sentence isn’t excluding any further readers at that point, as it is already arcane enough.
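Putting it all together, here is a minimal sketch of the scoring as described (the function and constant names are mine):

```python
MAGIC_NUMBER = 2.4366   # 100 / ((1 + 2.8) * (1 + 1.7) * 4.0)
PAS_CAP = 4             # a 5th unique acronym adds no further obscurity

def arcanicity(nas, uas, pas):
    """Arcanicity Index from the three acronym-density metrics.

    nas: non-unique acronyms per sentence
    uas: unique acronyms per sentence
    pas: peak unique acronyms in a single sentence (capped at PAS_CAP)
    """
    return MAGIC_NUMBER * (1 + nas) * (1 + uas) * min(pas, PAS_CAP)
```

Feeding in the ACME statistics (NAS = 11/6 ≈ 1.833, UAS = 1.5, PAS = 5 capped to 4) reproduces the 69.04 score for the sample text.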
So how does ACME’s marketing copy above score?
Well the Goodall Arcanicity Index is 69.04, or…
(1 + 1.833) x (1 + 1.50) x 2.4366 x 4 = 69.04
Whether that is good or bad depends on who the text was aimed at. Without even seeing the text, a score of 69.04 tells me that the text relies on significant domain knowledge in order for the reader to understand it, and I would hope that the vendor that wrote the text was aiming at the IT Technical stratum.
Is that it?
No. The magic number and other ratios I’ve landed upon in this first draft will need to be tested against more marketing copy to ensure that they reflect my own perceived level of obscurity. The index will need to be fed very long and very short passages to see what minimum and maximum bounds I would need to put on the number of sentences in the input text. Currently I can see no value in processing fewer than 3 sentences – and 5 would probably be a better minimum.
I expect that very long passages that are highly arcane in some sections but less so in others will produce an artificially low result, so I may need to adjust the equation to counter that in future drafts. I would also need a more rigorous, scientific approach to testing than my “finger in the air – yep, it looks good to me” approach, especially if this Arcanicity Index is to be used in the wild. But for now it’s a start and I’m quite pleased with the way it looks.
Over the next couple of weeks I’ll post some Arcanicity Index calculations for the ESB vendors that I’m using to tune project REPAMATron. And if you think my example text above for the fictional ACME corp was artificially complex, well I’ll show you some real-world examples that make that look like a Janet and John book.
Lastly, Why Arcanicity?
Well a number of reasons really.
- I like the word – it’s got a nice rhythm to it
- Google only returned 111 hits for it, which means it isn’t a real word and it definitely won’t have been used before. I suspect the real term for the degree to which something is arcane or not is acranality, ancanity or probably arcaneness – but I’m not sure.
- I like the humour used in the names of other readability indexes, such as Gunning-FOG and SMOG, to illustrate that something was clouding the view of the meaning of the words. So I thought I’d inject a little, hopefully obvious, ironic humour myself. I mean, using a made-up word with an obscure meaning and 5 syllables to represent an index designed to determine how arcane a piece of text is – well, it just feels ‘right’.