The Gold-Bug Decoded

In Edgar Allan Poe's short story The Gold-Bug, the following encoded message occurs:

"53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;;]8*;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;4956*2(5*-4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161;:188;‡?;"

Legrand, the story's protagonist, explains the code as follows: "the solution is by no means so difficult as you might be led to imagine from the first hasty inspection of the characters. These characters, as anyone might readily guess, form a cipher -- that is to say, they convey a meaning".

Although the art of secret messages is old, some of its vocabulary is modern. The word cryptograph (as opposed to cryptogram) is usually traced to Poe's short story The Gold-Bug from 1843.

Poe is perhaps best known as the inventor of the detective genre, but he is also known for his uncanny stories. I believe that Poe's oeuvre contains a view of analytic 'procedure' which, in the age of biased machine learning algorithms, appears more important than ever.

In this fictitious example Poe uses a simple substitution cipher, in which the characters are paired up: one acts as the secret, one as the value. Within each pair, one character can be substituted for the other. In The Gold-Bug the 'values' are unknown, and one of the main characters gets on top of the problem by counting the strange signs and arranging the counts in a table.

Poe used his understanding of communication (language) when he dealt with cryptography. To me, Poe's procedure seems to have strong similarities to heuristics in algorithms. Sometimes you cannot rely on pure abstraction and logic; you must make assumptions. The better the assumptions, the higher the probability of an accurate or good outcome.

"Now, in English, the letter which most frequently occurs is e. Afterwards, the succession runs thus: a o i d h n r s t u y c f g l m w b k p q x z. E however predominates so remarkably that an individual sentence of any length is rarely seen, in which it is not the prevailing character."

"Let us assume 8, then, as e. Now, of all the words in the language, 'the' is the most usual; let us see, therefore, whether there are not repetitions of any three characters in the same order of collocation, the last of them being 8. If we discover repetitions of such letters, so arranged, they will most probably represent the word 'the'. On inspection, we find no less than seven such arrangements, the characters being ;48. We may, therefore, assume that the semicolon represents t, that 4 represents h, and that 8 represents e -- the last being now well confirmed. Thus a great step has been taken."
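Legrand's counting can be sketched in a few lines of JavaScript. This is only a minimal sketch, not code from the story: the helper names countSymbols and countTrigramsEndingIn are made up for illustration, and the cryptogram is the one quoted from The Gold-Bug, joined into a single line.

```javascript
// The cryptogram from The Gold-Bug, joined into one line.
const cryptogram =
  "53‡‡†305))6*;4826)4‡.)4‡);806*;48†8" +
  "¶60))85;;]8*;:‡*8†83(88)5*†;46(;88*96" +
  "*?;8)*‡(;485);5*†2:*‡(;4956*2(5*-4)8" +
  "¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡" +
  "1;48†85;4)485†528806*81(‡9;48;(88;4" +
  "(‡?34;48)4‡;161;:188;‡?;";

// Hypothetical helper: count each symbol, most frequent first.
function countSymbols(text) {
  const counts = new Map();
  for (const symbol of text) {
    counts.set(symbol, (counts.get(symbol) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical helper: count every three-symbol sequence ending in `last`.
function countTrigramsEndingIn(text, last) {
  const counts = new Map();
  for (let i = 0; i + 3 <= text.length; i++) {
    const trigram = text.slice(i, i + 3);
    if (trigram[2] === last) {
      counts.set(trigram, (counts.get(trigram) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const symbolTable = countSymbols(cryptogram);
const mostFrequent = symbolTable[0][0]; // Legrand's candidate for "e"
const trigramTable = countTrigramsEndingIn(cryptogram, mostFrequent);

console.log(symbolTable.slice(0, 3)); // the three most frequent symbols
console.log(trigramTable[0]);         // the most frequent trigram and its count
```

On this text the most frequent symbol is 8, and the most frequent three-symbol sequence ending in it is ;48, which occurs seven times, in line with Legrand's count.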

A JavaScript implementation

Now let's reproduce this by creating an encoder and a decoder in JavaScript. We will use the following conversion table.

A: "5", B: "2", C: "-", D: "†", E: "8", F: "1", G: "3", H: "4", I: "6", J: ",", K: "7", L: "0", M: "9", N: "*", O: "‡", P: ".", Q: "$", R: "(", S: ")", T: ";", U: "?", V: "¶", W: "]", X: "¢", Y: ":", Z: "["

Given a conversion table, this kind of algorithm, a simple substitution cipher (a more general relative of Caesar's cipher), is quite easy to understand and implement. To encode, we simply substitute each character of the message with its corresponding value in the table. Decoding is the same procedure in the other direction.

Basically, we loop through the message and swap each character for its counterpart in the table. But we have a problem. The result is correct, yet not very satisfying: it is unformatted, with no spaces between the words.

This is, by the way, a good example of an area where the human mind outshines the computer. The computer will encode and decode vastly faster. But it can't understand text, so it is quite tricky for it to separate the words, something we humans can do quite fast.

This is our expected output:

AGOODGLASSINTHEBISHOPSHOSTELINTHEDE VILSSEATTWENTYONEDEGREESANDTHIRTEENMI NUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENTH LIMBEASTSIDESHOOTFROMTHELEFTEYE OFTHEDEATHSHEADABEELINEFROMTHETREE THROUGHTHESHOTFIFTYFEETOUT

Is there some way to manage this too, and let the computer separate the words?

I will present a procedure that is correct in theory but useless in practice, unless you have billions of years to wait for the result (and a whole planet acting as RAM).

My 'hypothesis' was to search the string, which lacks spaces, for words by comparing runs of consecutive characters with actual words in an English dictionary. This I managed. The problem, though, is that the algorithm generates more words than the string actually contains.

I thought that if I combined the elements of the array (in this case approximately 200), joined them, and looked for combined strings with the same length as the string being evaluated, we would find only a few matching combinations, perhaps even only one. But I did not consider the 5.092×10^46 average Gregorian years this would take.
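The text does not say how the figure of 5.092×10^46 years was obtained. One reading that reproduces it, and this is an assumption rather than something stated here, is that every subset of 200 candidate words (2^200 subsets) is tested at a rate of one subset per microsecond. A quick sketch of that arithmetic:

```javascript
// Assumption: every subset of 200 candidate words is tested once,
// at a hypothetical rate of one subset per microsecond.
const candidateWords = 200;
const subsets = 2 ** candidateWords; // about 1.607e60 subsets
const subsetsPerSecond = 1e6;        // hypothetical testing rate

// Average Gregorian year: 365.2425 days.
const secondsPerGregorianYear = 365.2425 * 24 * 60 * 60;

const years = subsets / subsetsPerSecond / secondsPerGregorianYear;
console.log(years.toExponential(3)); // prints 5.092e+46
```

If the order of the words is taken into account as well, the number of combinations is larger still.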

I will present what I managed step by step and why it failed.

First,

const goldBugKey = {
  A: "5", B: "2", C: "-", D: "†",
  E: "8", F: "1", G: "3", H: "4",
  I: "6", J: ",", K: "7", L: "0",
  M: "9", N: "*", O: "‡", P: ".",
  Q: "$", R: "(", S: ")", T: ";",
  U: "?", V: "¶", W: "]", X: "¢",
  Y: ":", Z: "["
};

But this only covers one direction. We also need the pairs in the opposite order. To obtain them, we swap the key and the value of every pair.

// Swap keys and values so each cipher symbol maps back to its letter.
const keyReversed = Object.fromEntries(
  Object.entries(goldBugKey).map(([letter, symbol]) => [symbol, letter])
);

This is the encoded message the reader is confronted with in The Gold-Bug:

const codeFromTheGoldBug = `53‡‡†305))6
*;4826)4‡.)4‡);806*;48†8
¶60))85;;]8*;:‡*8†83(88)5*†;46(;88*96
*?;8)*‡(;485);5*†2:*‡(;4956*2(5*-4)8
¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡
1;48†85;4)485†528806*81(‡9;48;(88;4
(‡?34;48)4‡;161;:188;‡?;`;

This translates to the following sentence (we need both to test the encoder and the decoder):

const sentenceFromTheGoldBug = `A good glass
in the bishop's hostel in the devil's seat
twenty-one degrees and thirteen minutes
northeast and by north main branch seventh
limb east side shoot from the left eye of
the death's-head a bee line from the tree
through the shot fifty feet out.`;

To avoid problems with mismatching lower- and uppercase letters, we arbitrarily choose to uppercase all letters. We also trim the string to avoid confusion with whitespace at both ends. Finally, we want to handle each character as a separate element, so we split the string into an array.

Next follows the most important step: the conversion. We substitute each character with the corresponding character in the conversion table.

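// Uppercase, trim, and split the sentence into single characters.
const senToEncode = Array.from(
  sentenceFromTheGoldBug.toUpperCase().trim()
);

// Replace each letter with its symbol; anything not in the table becomes a space.
// join("") avoids inserting commas, which would clash with the symbol for J.
const encodedMsg = senToEncode
  .map((char) => goldBugKey[char] || " ")
  .join("");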

Then we do the same in the other direction, if decoding is our task.

const codeToDecodeArr = Array.from(codeFromTheGoldBug);

// Replace each symbol with its letter; anything not in the table
// (such as the line breaks in the template literal) becomes a newline.
const decodedMsg = codeToDecodeArr
  .map((char) => keyReversed[char] || "\n")
  .join("");

I have used the npm package 'an-array-of-english-words', which contains 275 000 words. My assumption is that no individual encoded word is longer than 27 characters, the length of the longest word in the works of Shakespeare. (According to Wikipedia the word is "Honorificabilitudinitatibus", meaning "the state of being able to achieve honours".)
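The lookup code reads from an object called wordsArrAsObj, which is not shown being built. Here is a minimal sketch of one way to build it, assuming the package exports a plain array of lowercase words. A short stand-in array is used so that the snippet runs without the package installed.

```javascript
// Stand-in for the package's export (assumed: a plain array of lowercase words).
// With the package installed this would be something like:
//   const words = require("an-array-of-english-words");
const words = ["a", "good", "glass", "in", "the", "bishops", "hostel"];

// Use a prototype-less object so that strings such as "constructor"
// are not mistaken for dictionary words via inherited properties.
const wordsArrAsObj = Object.create(null);
for (const word of words) {
  wordsArrAsObj[word] = true;
}

console.log(Boolean(wordsArrAsObj["glass"]));       // true
console.log(Boolean(wordsArrAsObj["constructor"])); // false
```

An object (or a Set) gives a constant-time lookup per candidate, which matters when every position in the message is tested against up to 27 candidate lengths.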

// Assumes wordsArrAsObj maps every (lowercase) dictionary word to a truthy value.
const MAX_WORD_LENGTH = 27;
// The word list is lowercase, so lowercase the decoded message before matching.
const charsInMsg = decodedMsg.toLowerCase().split("");
const relevantGuesses = [];

for (let i = 0; i < charsInMsg.length; i++) {
  let tempString = "";

  // Grow a candidate from position i, one character at a time.
  for (let len = 0; len < MAX_WORD_LENGTH && i + len < charsInMsg.length; len++) {
    tempString += charsInMsg[i + len];

    if (wordsArrAsObj[tempString]) {
      relevantGuesses.push(tempString);
    }
  }
}
console.log(relevantGuesses);

Given the sentence provided in The Gold-Bug this would be the output:

 **a** - ag - ago - agood - go - goo - **good** - o - oo - o - od -
 **glass** - la - las - lass - lassi - a - as - ass - si - sin - i -
 **in** - nth - the - he - bi - bis - bish - bishop - **bishops** - i -
 is - ish - sh - shop - shops - ho - hop - hops - o - op - ops - sh -
 ho - hos - host - **hostel** - o - os - st - te - tel - el - elint -
 li - lin - lint - i - in - nth - **the** - he - ed - de - dev - devil
 - **devils** - evil - evils - i - sea - **seat** - ea - eat - a - at -
 att - **twenty** - we - wen - went - en - y - yo - yon - o - on -
 **one** - ne - ned - ed - de - deg - degree - **degrees** - gree -
 grees - re - ree - rees - ee - es - san - sand - a - an - **and** -
 thir - **thirteen** - hi - i - te - tee - teen - ee - een - en - mi -
 minute - **minutes** - i - in - nu - nut - u - ut - ute - utes - te -
 tes - es - snort - no - nor - north - **northeast** - o - or - ort -
 the - he - heast - ea - eas - east - a - as - st - stand - standby -
 ta - tan - a - an - **and** - **by** - y - no - nor - **north** - o -
 or - ort - hm - ma - **main** - a - ai - ain - i - in - bra - **bran**
 - ran - a - an - seven - **seventh** - eve - even - event - vent - en
 - nth - li - **limb** - i - be - beast - beasts - ea - eas - **east**
 - easts - a - as - st - si - **side** - sides - sideshoot - i - id -
 ide - ides - de - es - sh - shoo - **shoot** - ho - hoo - hoot - o -
 oo - oot - o - fro - **from** - rom - o - om - **the** - he - hele -
 el - left - lefte - ef - eft - te - eye - y - ye - o - of - oft - the
 - he - ed - de - death - deaths - ea - eat - eath - a - at - sh - she
 - shea - he - head - ea - a - ad - da - dab - a - ab - be - bee -
 beeline - ee - eel - el - li - lin - line - i - in - ne - nef - ef -
 fro - from - rom - o - om - the - he - het - et - **tree** - re - ree
 - ee - et - eth - thro - **through** - rough - rought - o - ou - ought
 - u - ug - ugh - **the** - he - hes - es - sh - **shot** - ho - hot -
 o - **fifty** - i - if - y - fe - fee - **feet** - ee - et - to - tout
 - o - ou - **out** - u - ut -

As you can see, all the words are included, but the .length hypothesis turned out to be quite problematic. We can't know which words are to be included. It would be possible to guess, and to use the length of the string to check which combinations of words together add up to this number. But we can't know whether a given combination is the right one as long as we (the algorithm) do not understand the sentence.
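To illustrate why the length check alone cannot single out one sentence, here is a small sketch that counts how many ways a string can be split entirely into dictionary words. It uses dynamic programming, which is a different technique from the brute-force combination described above. The function name countSegmentations and the tiny dictionary (taken from the guesses listed above) are only for illustration.

```javascript
// Hypothetical helper: number of ways to split `text` into words from `dictionary`.
function countSegmentations(text, dictionary, maxWordLength = 27) {
  // ways[i] = number of ways to split the first i characters.
  const ways = new Array(text.length + 1).fill(0);
  ways[0] = 1; // the empty prefix can be split in exactly one way

  for (let end = 1; end <= text.length; end++) {
    const firstStart = Math.max(0, end - maxWordLength);
    for (let start = firstStart; start < end; start++) {
      if (dictionary.has(text.slice(start, end))) {
        ways[end] += ways[start];
      }
    }
  }
  return ways[text.length];
}

// A few of the guesses found at the start of the message.
const tinyDictionary = new Set([
  "a", "ag", "ago", "go", "goo", "good", "o", "oo", "od", "glass",
]);

console.log(countSegmentations("agoodglass", tinyDictionary)); // 4
```

Even this ten-character prefix has four valid splits with only ten dictionary entries (a good glass, ago od glass, a go od glass, ag o od glass). Every one of them has the same total length, so length cannot tell them apart.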