Projects:2016s1-141 Cracking the Voynich manuscript code

Revision as of 12:15, 26 October 2016 by A1674940 (talk | contribs)

Topic

Cracking the Voynich manuscript code

Supervisors

Prof. Derek Abbott

Dr. Brian Ng

Team members

Ruihang Feng

Yaxin Hu

Project Introduction

Background

The Voynich manuscript was created in the first half of the fifteenth century (probably between 1404 and 1438) [1]. No one today knows what it says or who wrote it. The book is written in an unknown alphabet. In 1912, a book collector named Wilfrid Voynich found it in an Italian Jesuit college. Because the text cannot be read, the book is divided into six sections according to the style and subject of its illustrations:

a) Herbal: Each page shows one or more plants, which is a format typical of European herbals [2].

b) Astronomical: Circular diagrams with suns, moons and stars suggest that this part deals with astronomy or astrology [2].

c) Biological: The drawings, mostly of naked women, suggest that this part is a biological section [2].

d) Cosmological: Circular diagrams of an obscure nature give this section its cosmological label [2].

e) Pharmaceutical: Drawings of isolated plant parts and objects resembling apothecary jars suggest that this section is pharmaceutical [2].

f) Recipes: This part consists of full pages of text in short paragraphs [2].

Motivation

Using statistical methods to investigate the language and linguistics of an unknown book is an ambitious undertaking. Any features, relationships or patterns found in the Voynich manuscript could be used in attempts to decode its unknown text and language, and may contribute significant progress towards decoding part of the book. The outcomes can also be applied to further work on linguistic analysis and language decryption, such as information decoding, search engines and data mining, and to specific applications such as Google, Turnitin, Google Translate, Yahoo and Grammarly.

Project Aim

The aim of this project is to search the text and determine, using statistical methods, whether there are any features that can be used to decode the Voynich manuscript. This requires a linguistic investigation of the unknown text. A further aim is to make an initial attempt at cracking the digits of the Voynich manuscript and to determine which letters may stand for digits. Fully decoding the Voynich manuscript is not required, since that cannot be done in a one-year project.

Proposed method

As shown in the Appendix section A.2, the proposed methods of this project are divided into three phases.

Phase 1: Text investigation

There are two parts in this phase: words and digits.

For the word research, Matlab will be used as the main tool. Team members will look for regularities in three aspects:

  • The total number of words in the Voynich manuscript.
  • The characters and words which may stand for digits from some paragraphs of the manuscript.
  • The frequency of special characters and words.

For the digit investigation, team members will collect different known ways of writing digits and compare them with the words in the Voynich manuscript. For example, Roman numerals are shown in Appendix section A.3. The word shown in Appendix section A.4 is extracted from the Voynich manuscript, and its form is clearly like ‘*##’. By the comparison method described above, this word may mean ‘seven’ (VII) in Roman numerals.

Phase 2: Illustrations investigation

An illustration extracted from the Voynich manuscript is shown in Appendix section A.5.

In this phase, illustrations will be analysed using Matlab. There are three aspects to be completed:

  • The number of different elements in the illustrations.
  • The characters which may stand for digits.
  • Match the characters and digits.


Phase 3: Marginal symbols research

A page containing marginal symbols is shown in Appendix section A.6.

This phase also requires programming in Matlab. There are four major aspects:

  • Ordering and quantitative features of the marginal symbols of each page.
  • Search for the characters which may stand for digits.
  • The differences between marginal symbols in each page.
  • Match the characters and digits and make inference about the relationship between characters and digits.

Phase 1

Characterisation of the Voynich manuscript

Figure 1 shows the letter frequency in the Voynich manuscript. There are 24 letters in the Voynich manuscript. As the figure shows, o, e, h and y are the four most frequent letters, and S, z, v and x are the four least frequent letters. The blue line is the trend across all the letters.
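As an illustration of how such a letter-frequency table can be produced, here is a minimal sketch. The project used Matlab; Python is used here only as a stand-in, and the function name `letter_frequencies` and the sample string are hypothetical. The sketch assumes the manuscript is available as a plain-text transliteration in which upper and lower case are distinct letters (the text above lists S separately).

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, relative frequency) pairs, most frequent first.

    Case is preserved on purpose: this sketch assumes that 'S' and 's'
    are different letters of the transliteration alphabet.
    """
    letters = [ch for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return [(letter, count / total) for letter, count in counts.most_common()]


# Hypothetical toy sample, not real manuscript data.
sample = "qokeedy okaiin chedy"
for letter, freq in letter_frequencies(sample):
    print(letter, round(freq, 3))
```

Sorting the result by frequency gives the descending curve that a trend line like the one in Figure 1 would follow.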

Six languages are used for comparison of the letter frequency: English, Latin, French, German, Greek and Spanish.

Figure 2 shows the letter frequency of English. There are 26 letters in total. The most frequent letters are e, t, a and o, and the least frequent letters are z, q, j and x. Figure 3 shows the letter frequency of Latin. There are 23 letters in total. The most frequent letters are i, e, a and u, and the least frequent letters are z, y, x and h.

Figure 4 shows the letter frequency of French. There are 38 letters in total. The most frequent letters are e, s, a and i, and the least frequent letters are ï, ë, œ and ô. Figure 5 shows the letter frequency of German. There are 30 letters in total. The most frequent letters are e, n, s and r, and the least frequent letters are q, x, y and j.

Figure 6 shows the letter frequency of Greek. There are 24 letters in total. The most frequent letters are A, E, O and I, and the least frequent letters are Ψ, Z, Ξ and B. Figure 7 shows the letter frequency of Spanish. There are 33 letters in total. The most frequent letters are e, a, o and s, and the least frequent letters are k, ü, w and ú.

Using Matlab, correlations were computed between the letter-frequency trend of the Voynich manuscript and those of English, Latin, French, German, Greek and Spanish. The correlation with the Voynich manuscript is 98.04% for English, 98.66% for Latin, 94.55% for French, 94.81% for German, 98.34% for Greek and 96.09% for Spanish.
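The text does not state how these correlations were calculated. One plausible reading, sketched below under that assumption, is a Pearson correlation between the two frequency curves after each has been sorted from most to least frequent. Because the alphabets differ in size, the sketch truncates both curves to the shorter length, which is also an assumption. The function names are hypothetical, and Python stands in for the Matlab used in the project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def profile_correlation(freqs_a, freqs_b):
    """Correlate two frequency curves by rank, ignoring which letter is which.

    Assumption: both curves are sorted in descending order and cut to the
    length of the shorter alphabet before correlating.
    """
    a = sorted(freqs_a, reverse=True)
    b = sorted(freqs_b, reverse=True)
    n = min(len(a), len(b))
    return pearson(a[:n], b[:n])


# Hypothetical toy frequencies, not measured values.
print(profile_correlation([0.4, 0.3, 0.2, 0.1], [0.5, 0.25, 0.15, 0.07, 0.03]))
```

If the correlation was indeed taken over sorted curves, high values are expected for almost any pair of languages, since both curves decrease from left to right. This is consistent with the caution below that the correlation alone is not strong evidence.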

Comparing the number of letters in the Voynich manuscript with English, Latin, French, German, Greek and Spanish, the closest language is Greek, because both have 24 letters. The letter frequency is also similar for the Voynich manuscript and Greek, and the correlation between them is high (98.34%, second only to Latin at 98.66%). Greek can therefore be considered a possible language of the Voynich manuscript. However, this is not strong evidence that the Voynich manuscript is written in Greek. In conclusion, no specific evidence proves that the Voynich manuscript is written in one of these six languages; Greek is only one possible candidate.

Figure 8 shows the word frequency in the Voynich manuscript. There are 37104 words in the whole manuscript, of which 8486 are unique. Of these unique words, 2472 appear more than once and 6014 appear only once. 515 words appear more than 10 times, and these words account for 65.66% of the total words in the Voynich manuscript.
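The counts in this paragraph can be reproduced from a word list with a few lines of code. The sketch below is an illustration only: Python stands in for the Matlab used in the project, the function name `word_statistics` and its dictionary keys are hypothetical, and the input is assumed to be the manuscript transliteration already split into words.

```python
from collections import Counter


def word_statistics(words, threshold=10):
    """Summarise a word list: totals, unique words, and frequent-word coverage."""
    counts = Counter(words)
    total = len(words)
    unique = len(counts)
    appear_once = sum(1 for c in counts.values() if c == 1)
    # Words that appear strictly more than `threshold` times.
    frequent = {w: c for w, c in counts.items() if c > threshold}
    return {
        "total_words": total,
        "unique_words": unique,
        "appear_once": appear_once,
        "appear_more_than_once": unique - appear_once,
        "frequent_words": len(frequent),
        # Share of all word tokens covered by the frequent words.
        "frequent_coverage": sum(frequent.values()) / total if total else 0.0,
    }


# Hypothetical toy input, not real manuscript data.
toy = ["daiin"] * 3 + ["chedy"] * 2 + ["qokeedy"]
print(word_statistics(toy, threshold=2))
```

For the manuscript, `threshold=10` corresponds to the "more than 10 times" criterion used above.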

In Figure 9, the 50 most frequent words are taken for a comparison with English.

Comparing word frequency in the Voynich manuscript and in English, the correlation between the trends of the two curves is 93.65%, which suggests that a relationship between the Voynich manuscript and English may exist. However, this correlation alone is not strong evidence of any relationship between the Voynich manuscript and English.

Statistical Comparison of Letters and Words

This section gives a brief statistical comparison between the Voynich manuscript and three books in English, French and German. Three quantities were compared: the percentage of unique words among total words, the word length, and the percentage of words appearing more than once among total unique words.
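The three quantities can be written down compactly. The following sketch is a hedged illustration rather than the project's Matlab code: the function name and key names are hypothetical, and "word length" is assumed to mean the mean number of characters per word.

```python
def comparison_metrics(words):
    """Compute the three comparison quantities for a non-empty list of words."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    unique = len(counts)
    repeated = sum(1 for c in counts.values() if c > 1)
    return {
        # Percentage of unique words among total words.
        "unique_over_total_pct": 100.0 * unique / len(words),
        # Assumed definition: mean characters per word token.
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Percentage of unique words that appear more than once.
        "repeated_over_unique_pct": 100.0 * repeated / unique,
    }


# Hypothetical toy input.
print(comparison_metrics("the cat and the dog".split()))
```

With the counts reported above for the manuscript (37104 total words, 8486 unique words, 2472 words appearing more than once), the first quantity works out to about 22.9% and the third to about 29.1%.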

Figure 11 shows the percentage of unique words among total words. There is a significant difference between the Voynich manuscript and English (47.9%) or French (27.7%). However, there is no significant difference between the Voynich manuscript and German (13.6%).

Figure 12 shows the word length in the Voynich manuscript, English, French and German. There is a small difference in word length between the Voynich manuscript and English (6.7%) or French (6.0%), and no significant difference between the Voynich manuscript and German (0.1%).

Figure 13 shows the percentage of words appearing more than once among total unique words. There is a large difference between the Voynich manuscript and English (41.0%), French (38.9%) and German (22.8%). However, the difference from German is the smallest of the three.

In all three statistical comparisons German is the closest to the Voynich manuscript, so German can be considered a possible language of the Voynich manuscript.

Specific Pattern Words

This section gives a brief analysis of the current results for specific pattern words. The first numeral system considered is Roman numerals. In Roman numerals, VII stands for 7 and VIII stands for 8. These two numerals have obvious patterns that are easy to search for in the Voynich manuscript. Words that follow the VII pattern and the VIII pattern have been found. The next step is to search for all possible numerical words in Roman numerals from I to XX, and for several numerals with obvious patterns such as XX, XXX, C, CC and CCC.
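To make the idea of a numeral "pattern" concrete, the sketch below converts integers to Roman numerals and reduces each numeral to its repetition shape, so that VII becomes ABB and VIII becomes ABBB. This is an illustration under assumptions: the shape encoding and the function names `to_roman` and `shape` are hypothetical, and Python stands in for the Matlab used in the project.

```python
def to_roman(n):
    """Convert a positive integer (1 to 3999) to a Roman numeral string."""
    table = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in table:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def shape(word):
    """Replace each distinct character by A, B, C, ... in order of first use."""
    mapping = {}
    out = []
    for ch in word:
        if ch not in mapping:
            mapping[ch] = chr(ord("A") + len(mapping))
        out.append(mapping[ch])
    return "".join(out)


# Print the repetition shape of every numeral from I to XX.
for n in range(1, 21):
    numeral = to_roman(n)
    print(n, numeral, shape(numeral))
```

Under this encoding a candidate string such as aii has the same shape as VII. Note that XII also has shape ABB, so the shape alone does not distinguish 7 from 12.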

Figure 14 shows part of the VII pattern words. The numbers below each word are the locations of that word; for example, 534 means that the 534th word in the Voynich manuscript contains aii. Because a large number of locations were found for several words, the figure cannot show them all.

From the locations, 562 aii, 201 kee, 77 oee, 72 tee, 51 oii, 30 qoo, 27 dee, 18 qee, 10 see, 4 lee, 3 yee and 2 ree were found. From this result, *ii, *ee and *oo are three patterns that may be numerical words for VII.

Figure 15 shows part of the VIII pattern words. From the locations, 44 aiii, 25 oeee, 22 keee, 11 oiii, 7 deee, 6 qeee, 5 teee, 3 seee, 2 leee, 2 reee and 1 yeee were found. From this result, *iii and *eee are two patterns that may be numerical words for VIII.

Taken together, *ii (*iii) and *ee (*eee) are two patterns that may be numerical words for VII (VIII). Comparing all possible VII words and VIII words, e, i and o can be considered possible numerical characters.
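A search of this kind can be sketched with regular expressions. The version below is an assumption about how the patterns might be defined, not the project's Matlab code: a VII pattern is taken to be one letter followed by exactly two copies of a different letter, and a VIII pattern by exactly three copies. Locations are 1-based word positions, matching the "534th word" convention above. The names `VII_PATTERN`, `VIII_PATTERN` and `find_pattern` are hypothetical.

```python
import re
from collections import defaultdict

# One character, then exactly two copies of a different character (like VII).
VII_PATTERN = re.compile(r"(.)(?!\1)(.)\2(?!\2)")
# One character, then exactly three copies of a different character (like VIII).
VIII_PATTERN = re.compile(r"(.)(?!\1)(.)\2\2(?!\2)")


def find_pattern(words, pattern):
    """Map each matched substring to the 1-based positions of words containing it.

    Only the first match in each word is recorded.
    """
    locations = defaultdict(list)
    for position, word in enumerate(words, start=1):
        match = pattern.search(word)
        if match:
            locations[match.group(0)].append(position)
    return dict(locations)


# Hypothetical toy word list in an EVA-like transliteration.
words = ["qokeedy", "okaiin", "daiiin", "chedy", "otaiin"]
print(find_pattern(words, VII_PATTERN))
print(find_pattern(words, VIII_PATTERN))
```

With these definitions a word containing aiii is counted only under the VIII pattern and not also under aii. Whether the counts reported above treat the two cases as exclusive is not stated in the text.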

For comparison, triple letters in other languages, such as English, were examined using a list of triple-letter words in English.

Many of these English words contain a triple ‘l’ or a triple ‘s’. Comparing this with the results obtained from the VIII pattern words above, ‘i’ and ‘e’ appear most often as triple letters in the Voynich manuscript, while ‘l’ and ‘s’ appear most often as triple letters in English, so there may be some relationship between ‘i’ and ‘e’ in the Voynich manuscript and ‘l’ and ‘s’ in English.

Furthermore, there are also triple-letter words in other languages, such as German and Russian.

German:

Schneeeule

Teeei

A triple ‘e’ appears inside these German words.

Russian:

Длинношеее

Короткошеее

змееед

доооновский

зоообъединение

In these Russian words, a triple ‘o’ appears inside the word, and a triple ‘e’ appears inside or at the end of the word.
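Triple letters like those in the German and Russian examples above can be detected with a single back-referencing regular expression. The sketch below is an illustration only, with Python standing in for the Matlab used in the project; the names `TRIPLE` and `triple_letters` are hypothetical. Words are lower-cased first, which is an assumption made so that a capitalised initial letter does not hide a match.

```python
import re

# A word character immediately followed by two more copies of itself.
TRIPLE = re.compile(r"(\w)\1\1")


def triple_letters(word):
    """Return the letters that occur three times in a row in the word."""
    return [match.group(1) for match in TRIPLE.finditer(word.lower())]


# The example words listed above.
examples = [
    "Schneeeule", "Teeei",
    "Длинношеее", "Короткошеее", "змееед", "доооновский", "зоообъединение",
]
for word in examples:
    print(word, triple_letters(word))
```

The same function can be applied to every word of a transliterated text to tally which letters occur as triples.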

As discussed above, the statistical comparison suggests that German has the closest relationship to the Voynich manuscript, and ‘e’ is a letter that can appear three times in a row in German. There may therefore be a relationship between ‘i’ and ‘e’ in the Voynich manuscript and ‘e’ in German, which needs further investigation to see whether it offers any breakthrough in the text investigation.

Phase 2

Phase 3

Conclusion

Results and Analysis

Future work

References

[1] Schmeh, Klaus (January–February 2011). "The Voynich Manuscript: The Book Nobody Can Read". Skeptical Inquirer. Retrieved 2013-09-05.

[2] Shailor, Barbara A. "Beinecke MS 408". Yale University, Beinecke Rare Book and Manuscript Library, General Collection of Rare Books and Manuscripts, Medieval and Renaissance Manuscripts. Accessed 24 June 2013.