Projects:2016s1-141 Cracking the Voynich manuscript code

Revision as of 15:45, 31 October 2016 by A1672395 (talk | contribs)

Topic

Cracking the Voynich manuscript code

Supervisors

Prof. Derek Abbott

Dr. Brian Ng

Team members

Ruihang Feng a1674940

Yaxin Hu     a1672395

Project Introduction

Background

The Voynich manuscript was created in the first half of the fifteenth century (probably between 1404 and 1438) [1]. No one today knows what it says or who wrote it, and it is written in an unknown alphabet. In 1912, a book collector named Wilfrid Voynich found it in an Italian Jesuit college. Because the text cannot be read, the book is conventionally divided into six sections according to the style and subject of its illustrations:

a) Herbal

Each page shows one or more plants, following the format of European herbals [2].

b) Astronomical

Circular diagrams with suns, moons and stars suggest that this section concerns astronomy or astrology [2].

c) Biological

Drawings consisting mostly of naked women suggest that this is a biological section [2].

d) Cosmological

Circular diagrams of an obscure nature have led this section to be called cosmological [2].

e) Pharmaceutical

Drawings of isolated plant parts and objects resembling apothecary jars suggest that this section concerns pharmacy [2].

f) Recipes

This section consists of full pages of text in short paragraphs [2].

Motivation

This project applies statistical methods to investigate the language and linguistics of an unreadable book, which is an ambitious undertaking. Any features, relationships or patterns found in the Voynich manuscript could help to decode its unknown text in an unknown language, and may contribute significant progress towards decoding part of the book. The outcomes can also support further work in linguistics and language decryption, such as information decoding, search engines and data mining, and in specific applications such as Google, Turnitin, Google Translate, Yahoo and Grammarly.

Project Aim

The aim of this project is to search the text using statistical methods and determine whether it has any features that could be used to decode the Voynich manuscript. This requires investigating the language and linguistics of the unknown text. A further aim is to identify the initial digits of the Voynich manuscript and determine which letters may stand for digits. Fully decoding the Voynich manuscript is not an aim, since that is not feasible in a one-year project.

Significance

There are many conjectures about the Voynich manuscript. Because of the manuscript’s long history, many historians believe that its mysterious alphabet is related to ancient civilizations [3]. If the manuscript can be cracked, it will help historians to explore the culture of ancient societies.

In addition, the statistical methods used in this project are also useful in other fields, such as engineering, finance and architecture. Moreover, text comparison is widely used in tools such as Turnitin, Google Translate, Grammarly and Bing.

Technical background

The major technique applied in this project is data mining, an effective way to search for regularities in large amounts of data. The two data mining methods used here are statistics and comparison. Statistics is used to count how often particular words occur. Comparison is used to find relations between two languages.

In the field of linguistics, the European Voynich Alphabet (EVA) is a representative alphabet for digital transcription of the Voynich manuscript [4]. The Japanese linguist Takahashi transcribed the whole Voynich manuscript using EVA [5].

Therefore, the main data for this project will be extracted from Takahashi’s transcription.

Moreover, other resources will be considered, such as the written forms of some representative ancient languages.


Knowledge gaps

Because of the large amount of data in the Voynich manuscript, the project requires skilled data processing and software programming; however, no one in the project team has dealt with so much data before. Hence the members must develop their data processing and programming skills.

The project also requires specific knowledge of statistics, so members must become adept at sorting and analysing data.

Technical challenges

The technical challenges of this project fall into two areas.

First, it is very difficult to infer which language the author used. The language of the manuscript does not belong to any known language [6] and may even be extinct. Moreover, because of the manuscript’s long history, some important information cannot be found, such as reliable information about the author, so the language cannot be inferred from the author’s nationality. To address this problem, members must study many different languages as references and compare them with the language of the manuscript.

Second, references on cracking the Voynich manuscript are limited. Because of the unknown language and mysterious illustrations, it is difficult to crack the whole manuscript. Although researchers claim to have cracked a few words, no one can guarantee that the results are right, and linguists recognise no decipherment of the Voynich manuscript as correct. It is therefore hard to find reliable references, so members must search a variety of sources and select those that are sufficiently accurate.

Proposed method

Figure 1. Proposed method

As shown in Figure 1, the proposed method is divided into three phases.

Phase 1: Text investigation

This phase has two parts: words and digits.

For the word research, Matlab will be the essential tool. Team members will look for regularities in three aspects:

  • The total number of words in the Voynich manuscript.
  • The characters and words which may stand for digits from some paragraphs of the manuscript.
  • The frequency of special characters and words.

For the digit investigation, team members will collect different known ways of writing digits and compare them with the words in the Voynich manuscript. For example, Roman numerals are shown in Figure 2.

Figure 2. Roman numerals
Figure 3. *##

The word shown in Figure 3 is extracted from the Voynich manuscript. Its form is ‘*##’: one character followed by two identical characters. By the comparison method described above, this word may correspond to the Roman numeral VII, that is, ‘seven’.

Phase 2: Illustrations investigation

An illustration extracted from the Voynich manuscript is shown in Figure 4.

Figure 4. Illustration

In this phase, illustrations will be analysed using Matlab. Three tasks need to be completed:

  • Count the different elements in the illustrations.
  • Identify the characters which may stand for digits.
  • Match the characters to digits.

Phase 3: Marginal symbols research

A page containing marginal symbols is shown in Figure 5.

Figure 5. Marginal symbols

This phase also requires proficiency in Matlab programming. It covers four major tasks:

  • Record the ordering and quantitative features of the marginal symbols on each page.
  • Search for the characters which may stand for digits.
  • Compare the marginal symbols across pages.
  • Match the characters to digits and infer the relationship between them.

Phase 1: Characterisation and text investigation

Characterisation of the Voynich manuscript

Figure 6 shows the letter frequency in the Voynich manuscript. There are 24 letters in the Voynich manuscript. As the figure shows, o, e, h and y are the four most frequent letters, and S, z, v and x are the four least frequent. The blue line shows the trend across all the letters.

Figure 6. letter frequency in Voynich manuscript
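The counting behind a letter frequency chart like Figure 6 can be sketched as follows. The project itself used Matlab; Python is used here only for illustration. The function name and the sample string are hypothetical, and the sketch assumes the transcription is available as plain text in which upper-case and lower-case characters are distinct letters (the list above includes both s and S).

```python
from collections import Counter

def letter_frequencies(text):
    """Return (letter, relative frequency) pairs, most frequent first."""
    # Keep alphabetic characters only; spaces and punctuation are ignored.
    # Case is preserved, so 's' and 'S' are counted as different letters.
    letters = [ch for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return [(ch, n / total) for ch, n in counts.most_common()]

# Hypothetical EVA-style sample, not taken from the actual transcription.
sample = "daiin chedy qokeedy shedy"
for letter, freq in letter_frequencies(sample)[:4]:
    print(f"{letter}: {freq:.3f}")
```

Running the same function over a reference text in each comparison language would give the data for the other frequency charts.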

Six languages are used for comparison of letter frequency: English, Latin, French, German, Greek and Spanish.

Figure 7 shows the letter frequency of English. There are 26 letters in total. The most frequent letters are e, t, a and o, and the least frequent are z, q, j and x.

Figure 7. letter frequency of English

Figure 8 shows the letter frequency of Latin. There are 23 letters in total. The most frequent letters are i, e, a and u, and the least frequent are z, y, x and h.

Figure 8. letter frequency of Latin

Figure 9 shows the letter frequency of French. There are 38 letters in total. The most frequent letters are e, s, a and i, and the least frequent are ï, ë, œ and ô.

Figure 9. letter frequency of French

Figure 10 shows the letter frequency of German. There are 30 letters in total. The most frequent letters are e, n, s and r, and the least frequent are q, x, y and j.

Figure 10. letter frequency of German

Figure 11 shows the letter frequency of Greek. There are 24 letters in total. The most frequent letters are A, E, O and I, and the least frequent are Ψ, Z, Ξ and B.

Figure 11. letter frequency of Greek

Figure 12 shows the letter frequency of Spanish. There are 33 letters in total. The most frequent letters are e, a, o and s, and the least frequent are k, ü, w and ú.


Figure 12. letter frequency of Spanish

Using Matlab, the correlation between the letter frequency trend of the Voynich manuscript and that of each of the six languages was calculated:

  • English: 98.04%
  • Latin: 98.66%
  • French: 94.55%
  • German: 94.81%
  • Greek: 98.34%
  • Spanish: 96.09%
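The exact calculation behind these correlations is not described, so the following is only a sketch of one plausible reading: each frequency list is sorted in decreasing order (the "trend" curve), the longer list is truncated to the length of the shorter, and the Pearson correlation coefficient of the two curves is taken. The sorting and truncation steps, the function names and the sample numbers are all assumptions, and Python stands in for the Matlab used in the project.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def trend_correlation(freqs_a, freqs_b):
    """Correlate two frequency lists by rank rather than by letter identity."""
    a = sorted(freqs_a, reverse=True)
    b = sorted(freqs_b, reverse=True)
    n = min(len(a), len(b))  # assumption: truncate to the shorter alphabet
    return pearson(a[:n], b[:n])

# Made-up frequencies, for illustration only.
text_a = [0.30, 0.25, 0.20, 0.15, 0.10]
text_b = [0.10, 0.40, 0.05, 0.25, 0.20]
print(f"{trend_correlation(text_a, text_b):.4f}")
```

Under this reading both curves are decreasing by construction, so the coefficient tends to be high for any pair of texts; differences of a few percentage points between languages should be interpreted with that in mind.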

Comparing the number of letters in the Voynich manuscript with English, Latin, French, German, Greek and Spanish, the closest match is Greek, because both have 24 letters. The letter frequency of the Voynich manuscript is also similar to that of Greek, and the correlation between them is high (98.34%), although the correlation with Latin is marginally higher (98.66%). Greek can therefore be considered a possible language of the Voynich manuscript. However, this is not strong evidence that the manuscript is written in Greek. In conclusion, there is no specific evidence that the Voynich manuscript is written in any of these six languages; Greek is only one possibility.

Figure 13 shows the word frequency in the Voynich manuscript. There are 37104 words in the whole manuscript, of which 8486 are unique. Of the unique words, 2472 appear more than once and 6014 appear only once. 515 words appear more than 10 times, and these account for 65.66% of all the words in the Voynich manuscript.
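Word statistics of this kind can be derived from a single frequency table. The sketch below is a Python illustration rather than the Matlab code used in the project; it assumes the transcription has already been split into a list of words, and the function name, dictionary keys and sample words are hypothetical.

```python
from collections import Counter

def vocabulary_stats(words, threshold=10):
    """Summarise a word list: totals, unique words, and frequent words."""
    counts = Counter(words)
    total = len(words)
    unique = len(counts)
    once = sum(1 for n in counts.values() if n == 1)
    # Words that appear more than `threshold` times.
    frequent = {w: n for w, n in counts.items() if n > threshold}
    return {
        "total": total,
        "unique": unique,
        "once": once,
        "repeated": unique - once,
        "frequent": len(frequent),
        # Share of all word occurrences covered by the frequent words.
        "frequent_share": sum(frequent.values()) / total,
    }

# Tiny made-up word list; a low threshold keeps the example readable.
words = "daiin ol daiin chedy ol daiin".split()
print(vocabulary_stats(words, threshold=2))
```

Note that "repeated" plus "once" always equals "unique", which matches the figures above (2472 + 6014 = 8486).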

Figure 13. word frequency in the Voynich manuscript

Figures 14 and 15 compare the 50 most frequent words in the Voynich manuscript with the 50 most frequent words in English.

Figure 14. 50 most frequent words in the Voynich manuscript
Figure 15. 50 most frequent words in English

Comparing word frequency in the Voynich manuscript and in English, the correlation between the trends of the two curves is 93.65%, which suggests that there may be a relationship between the Voynich manuscript and English. However, this correlation alone is not strong evidence of any such relationship.

Statistical Comparison of Letters and Words

This section gives a brief statistical comparison between the Voynich manuscript and three books in English, French and German. Three measures were compared: the ratio of unique words to total words, the word length, and the ratio of words that appear more than once to total unique words.

Figure 16 shows the ratio of unique words to total words. There is a significant difference between the Voynich manuscript and the English book (47.9%) or the French book (27.7%). However, the difference between the Voynich manuscript and the German book is the smallest (13.6%).


Figure 16. percentage of unique words/total words

Figure 17 shows the word length in the Voynich manuscript, English, French and German. There is a small difference in word length between the Voynich manuscript and English (6.7%) or French (6.0%), and almost no difference between the Voynich manuscript and German (0.1%).

Figure 17. word length

Figure 18 shows the ratio of words that appear more than once to total unique words. There is a large difference between the Voynich manuscript and English (41.0%), French (38.9%) and German (22.8%), although the difference for the German book is the smallest of the three.

Figure 18. percentage of words that appear more than once

Across these statistical comparisons, German is consistently the closest, so it can be considered a possible language of the Voynich manuscript.
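The three measures used in this comparison can be computed from a word list as sketched below, in Python rather than the project's Matlab. The report does not state how the percentage differences were defined, so the `relative_difference` helper is an assumption, as are the function names and the two tiny sample texts.

```python
from collections import Counter

def text_metrics(words):
    """Compute the three comparison measures for one text."""
    counts = Counter(words)
    unique = len(counts)
    return {
        # Unique words divided by total words.
        "unique_ratio": unique / len(words),
        # Mean number of characters per word occurrence.
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Unique words that appear more than once, divided by unique words.
        "repeated_ratio": sum(1 for n in counts.values() if n > 1) / unique,
    }

def relative_difference(a, b):
    """|a - b| relative to a (assumed definition, not taken from the report)."""
    return abs(a - b) / a

# Made-up samples standing in for the manuscript and a reference book.
voynich_like = "ol ol dy daiin".split()
reference = "the cat the dog sat".split()
m_v, m_r = text_metrics(voynich_like), text_metrics(reference)
for name in m_v:
    print(name, f"{relative_difference(m_v[name], m_r[name]):.1%}")
```

A smaller relative difference on all three measures would correspond to the "closest" language in the sense used above.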

Specific Pattern Words

This section gives a brief analysis of the current results for words with specific patterns. The first numeral system considered is Roman numerals. In Roman numerals, VII stands for 7 and VIII stands for 8. These two numerals have obvious patterns that are easy to search for in the Voynich manuscript. Words following the VII pattern and the VIII pattern have been found. The next step is to search for all possible numeral words corresponding to the Roman numerals from I to XX, and to several numerals with obvious patterns such as XX, XXX, C, CC and CCC.
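One way to turn Roman numerals into searchable patterns is to reduce each numeral to its "shape", replacing each distinct character by A, B, C and so on in order of first appearance. This is a hypothetical sketch in Python, not the project's Matlab procedure; the `shape` encoding and both function names are assumptions introduced for illustration. Under this encoding VII and a word such as aii share the shape ABB.

```python
def to_roman(n):
    """Convert an integer from 1 to 3999 to a Roman numeral."""
    values = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
              (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
              (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in values:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def shape(word):
    """Replace each distinct character by A, B, C... in order of first appearance."""
    mapping = {}
    out = []
    for ch in word:
        if ch not in mapping:
            mapping[ch] = chr(ord("A") + len(mapping))
        out.append(mapping[ch])
    return "".join(out)

# Numerals I to XX plus the extra patterned numerals XXX, C, CC and CCC.
for n in list(range(1, 21)) + [30, 100, 200, 300]:
    numeral = to_roman(n)
    print(n, numeral, shape(numeral))
```

Comparing `shape(word)` for each manuscript word against these numeral shapes would give a candidate list for each number, although short shapes such as A or AB will match very many words.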

Figure 19 shows part of the VII pattern words. The numbers below each word are the locations of that word; for example, 534 means that the 534th word in the Voynich manuscript contains aii. Because some words occur at a large number of locations, the figure cannot show them all.

Figure 19. part of VII pattern words

From the locations, 562 aii, 201 kee, 77 oee, 72 tee, 51 oii, 30 qoo, 27 dee, 18 qee, 10 see, 4 lee, 3 yee and 2 ree were found. These results suggest that *ii, *ee and *oo are three patterns that may be numeral words for VII.
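A search of this kind can be sketched with a regular expression that matches one character followed by exactly two copies of a different character, recording the 1-based position of each word that contains a match. This is an illustrative Python sketch, not the Matlab code used in the project; the pattern, the function name and the sample words are assumptions.

```python
import re
from collections import defaultdict

# One character followed by exactly two copies of a different character,
# e.g. "aii" or "kee"; the final lookahead rejects a third copy ("aiii").
VII_PATTERN = re.compile(r"(.)(?!\1)(.)\2(?!\2)")

def find_pattern_locations(words):
    """Map each matched fragment to the 1-based positions of words containing it."""
    locations = defaultdict(list)
    for position, word in enumerate(words, start=1):
        for match in VII_PATTERN.finditer(word):
            locations[match.group(0)].append(position)
    return dict(locations)

# Hypothetical EVA-style words, not taken from the transcription.
words = ["daiin", "okeey", "daiiin", "chol", "qokeedy"]
for fragment, positions in find_pattern_locations(words).items():
    print(fragment, len(positions), positions)
```

Replacing `\2(?!\2)` with `\2\2(?!\2)` gives the corresponding search for the VIII pattern. Matches are non-overlapping, so two fragments that share a character within one word are reported only once.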

Figure 20 shows part of the VIII pattern words. From the locations, 44 aiii, 25 oeee, 22 keee, 72 tee, 11 oiii, 7 deee, 6 qeee, 5 teee, 3 seee, 2 leee, 2 reee and 1 yeee were found. These results suggest that *iii and *eee are two patterns that may be numeral words for VIII.

Figure 20. part of VIII pattern words

Taken together, these results suggest that *ii (*iii) and *ee (*eee) are two patterns that may be numeral words for VII (VIII). Comparing all the possible VII words and VIII words, e, i and o can be considered possible numeral characters.

For comparison, triple letters also occur in other languages. A list of triple-letter words in English is given below:

Figure 21. triple letter words in English

In English, many of these words contain a triple ‘l’ or a triple ‘s’. Since ‘i’ and ‘e’ are the most common triple letters in the Voynich manuscript (from the VIII pattern words above) and ‘l’ and ‘s’ are the most common triple letters in English, there may be some relationship between ‘i’ and ‘e’ in the Voynich manuscript and ‘l’ and ‘s’ in English.

Triple-letter words also exist in other languages, such as German and Russian.

German:

  • Schneeeule
  • Teeei

In German, a triple ‘e’ appears inside words.

Russian:

  • Длинношеее
  • Короткошеее
  • змееед
  • доооновский
  • зоообъединение

In Russian, a triple ‘o’ appears inside words, and a triple ‘e’ appears inside or at the end of words.
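Tallying which letters occur tripled in a word list takes only a short regular expression. The sketch below uses Python for illustration (the project used Matlab); the function name is hypothetical, and the word lists are the German examples and three of the Russian examples given above.

```python
import re
from collections import Counter

# Any word character immediately repeated twice more, e.g. "eee".
TRIPLE = re.compile(r"(\w)\1\1")

def tripled_letters(words):
    """Count which letter is tripled, ignoring case, across a word list."""
    tally = Counter()
    for word in words:
        for match in TRIPLE.finditer(word.casefold()):
            tally[match.group(1)] += 1
    return tally

german = ["Schneeeule", "Teeei"]
russian = ["Длинношеее", "змееед", "зоообъединение"]
print(tripled_letters(german))
print(tripled_letters(russian))
```

Applied to the manuscript word list and to a dictionary for each comparison language, the same tally would show whether the tripled letters are concentrated on a few characters, as they appear to be for ‘i’ and ‘e’ in the Voynich manuscript.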

As discussed above, German shows the closest statistical relationship to the Voynich manuscript, and ‘e’ is a letter that appears three times in a row in German. There may therefore be a relationship between ‘i’ and ‘e’ in the Voynich manuscript and ‘e’ in German, which needs further investigation to see whether it offers a breakthrough in the text investigation.

Phase 2: Illustration investigation

Searching initial numbers and possible numerical words inside images

The first part of this section finds all the initial numbers inside the images of the whole Voynich manuscript. Part of the list of initial numbers is shown below:

Figure 22. part of the initial numbers

In order to compare and map between the initial numbers and the Voynich manuscript, all possible words that may stand for numbers were collected. Part of the list of possible words is shown below:

Figure 23. part of the possible words

Mapping all initial numbers and numerical words

Comparing the initial numbers and the possible words reveals some potential relationships between them: ‘s’ and ‘2’ appear on the same page in 54 pairs, ‘o’ and ‘1’ in 24 pairs, and ‘ol’ and ‘10’ in 14 pairs. To make the comparison simpler, the initial numbers and possible words were mapped to each other to show whether there is any relationship between them. The mapping pairs for the letters 'o' and 'r' are listed below:

Figure 24. mapping pairs for letter 'o'
Figure 25. mapping pairs for letter 'r'

To make likely relationships easier to find, the most frequent pairs for each word were selected and collected in a new list, shown below:

Figure 26. most frequent pairs for each pair

It can easily be seen that ‘o’ and ‘1’ appear together 24 times. Furthermore, many other pairs appear together: ‘ol’ and ‘10’ (14 times), ‘ol’ and ‘13’ (12 times), ‘ol’ and ‘12’ (11 times), ‘or’ and ‘10’ (19 times), ‘or’ and ‘12’ (13 times), ‘or’ and ‘13’ (12 times), and ‘os’ and ‘19’ (11 times). Since each of these words begins with ‘o’ and each of these numbers begins with the digit 1, there is a potential relationship between ‘o’ in the Voynich manuscript and the number ‘1’.

Furthermore, ‘r’ and ‘1’ appear together 48 times, ‘r’ and ‘2’ 26 times, ‘r’ and ‘3’ 21 times, ‘s’ and ‘2’ 54 times, ‘s’ and ‘1’ 46 times, ‘s’ and ‘3’ 41 times, ‘s’ and ‘5’ 32 times, ‘y’ and ‘2’ 36 times, ‘y’ and ‘1’ 30 times, ‘y’ and ‘3’ 29 times, and ‘y’ and ‘5’ 20 times. There may be potential relationships among them, which need further investigation.
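Pair counts of this kind can be produced by tallying, for every page, each combination of a candidate word and a number found on that page. The sketch below is a Python illustration, not the project's Matlab code. It assumes each pair is counted once per page; whether the report counted pages or individual occurrences is not stated. The function name and the page data are hypothetical.

```python
from collections import Counter
from itertools import product

def count_page_pairs(pages):
    """Count how many pages contain both a given word and a given number."""
    pairs = Counter()
    for labels, numbers in pages:
        # Sets ensure each (word, number) pair counts once per page.
        for label, number in product(set(labels), set(numbers)):
            pairs[(label, number)] += 1
    return pairs

# Hypothetical page data: (candidate words, numbers read from the illustration).
pages = [
    (["o", "ol", "s"], [1, 10]),
    (["o", "s"], [1, 2]),
    (["ol", "r"], [10, 13]),
]
pairs = count_page_pairs(pages)
print(pairs[("o", 1)], pairs[("ol", 10)], pairs[("s", 2)])
```

Raw counts favour words and numbers that are common overall, so a frequent word such as ‘s’ will pair often with every frequent number; normalising by how often each word appears is one way to compare pairs more fairly.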

For ease of comparison, all the possible pairs are listed below:

Figure 27. possible pairs

Phase 3: Marginal symbols research

As outlined in the proposed method, this phase is divided into three parts: statistics for the marginal stars of each page, digits mining and conclusion. This phase was completed by Ruihang Feng.

Statistics for Marginal stars of each page

There are 15 pages with marginal stars in the Voynich manuscript; an example is shown in Figure 5. The results of this part are shown in Figure 28.

Figure 28. Statistics for marginal stars of each page

Figure 28 shows that there are two kinds of marginal stars in the Voynich manuscript: white stars and coloured stars. Figure 28 also gives detailed information about the number of stars, their arrangement and their location in the text.

Digits mining

In this part, the number of marginal stars on each page is counted first. Then the letters which may stand for digits are extracted. An example (page f58r) is shown in Figure 29.

Figure 29. Digits mining

On this page there are 3 white stars (according to Figure 28), and the single letters which may stand for digits are m, o, r and s. All 25 pages are counted in this way.

As a result, these 25 pages involve 16 different digits: 1, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17 and 19. Some of them are the total number of stars on a page; others are the number of white stars or the number of coloured stars on a page. The detailed information is shown in Figure 28.

Figure 30. Digits analysis

The results of this part are shown in Figure 30. The first column lists the 16 digits, with the number of pages involving each digit in brackets (for example, ‘3 pages’ next to the digit ‘5’ means that 3 pages involve ‘5’). The red marks indicate the three letters with the highest occurrence frequency. The second column lists the pages which involve each digit, and the last column lists the letters which may stand for digits.

Conclusion

Based on the two parts above, the conclusion of the phase ‘marginal symbol research’ is shown in Figure 31.

Figure 31. Conclusion

The letters in the first column are extracted according to the red marks in Figure 30. The fourth column gives the occurrence frequency of each letter. For example, the occurrence frequency of y=5 is 3/18 = 16.67%, where ‘3’ is the number of pages which involve ‘y=5’ (according to Figure 30) and ‘18’ is the number of pages which involve ‘y’.

As a result, according to the figures above, there are potential relationships between:
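The occurrence frequency in this worked example is a simple ratio: pages that contain both the letter and the digit, divided by pages that contain the letter. The following Python sketch (the project used Matlab) shows that calculation. The data layout, with each page represented as a set of candidate letters and a set of star counts, is an assumption, and the pages below are synthetic, arranged only to reproduce the 3/18 example.

```python
def occurrence_frequency(pages, letter, digit):
    """Fraction of the pages containing `letter` that also involve `digit`."""
    digits_on_letter_pages = [digits for letters, digits in pages if letter in letters]
    if not digits_on_letter_pages:
        return 0.0  # the letter never appears, so no frequency is defined
    matching = sum(1 for digits in digits_on_letter_pages if digit in digits)
    return matching / len(digits_on_letter_pages)

# Synthetic layout reproducing the worked example: 18 pages contain 'y',
# and 3 of them involve the digit 5. The page contents themselves are made up.
pages = [({"y", "o"}, {5, 3})] * 3 + [({"y"}, {7})] * 15 + [({"o"}, {1})] * 2
print(f"{occurrence_frequency(pages, 'y', 5):.2%}")  # 16.67%
```

Because the denominator is the number of pages containing the letter, letters that appear on only a few pages can reach high frequencies from very few matches, which is worth keeping in mind when reading the table.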

  • ‘y’ and ‘6’
  • ‘y’ and ‘7’
  • ‘l’ and ‘7’
  • ‘l’ and ‘5’
  • ‘l’ and ‘8’
  • ‘r’ and ‘5’
  • ‘s’ and ‘6’
  • ‘o’ and ‘6’
  • ‘o’ and ‘1’
  • ‘ar’ and ‘13’
  • ‘ar’ and ‘15’
  • ‘al’ and ‘13’
  • ‘al’ and ‘10’
  • ‘al’ and ‘15’
  • ‘or’ and ‘13’
  • ‘ol’ and ‘13’
  • ‘am’ and ‘12’
  • ‘am’ and ‘19’
  • ‘dy’ and ‘14’
  • ‘dy’ and ‘19’
  • ‘om’ and ‘16’

Conclusion

Results and Analysis

This project is divided into three phases: text investigation, illustration research and marginal symbol investigation. Most of the work in this project was carried out by computer.

In addition, the goals of this project involve three parts:

  • Use statistical methods in Matlab to search for linguistic regularities in the Voynich manuscript.
  • Search for regularities in the illustrations from the perspective of digits.
  • Investigate regularities in the marginal symbols from the perspective of digits.

Over the past two semesters, all the phases have been finished. From the text investigation in Phase 1, we can infer that the language used in the Voynich manuscript may be a branch of German.

The most likely digits:

  • ‘o’ and ‘1’
  • ‘ol’ and ‘13’
  • ‘or’ and ‘13’

Some other possible digits:

  • ‘s’ and ‘2’, ‘6’
  • ‘y’ and ‘2’, ‘6’
  • ‘a’ and ‘1’
  • ‘r’ and ‘1’

References

[1] Schmeh, Klaus (January–February 2011). "The Voynich Manuscript: The Book Nobody Can Read". Skeptical Inquirer. Retrieved 2013-09-05.

[2] Shailor, Barbara A., Beinecke MS 408, Yale University, Beinecke Rare Book and Manuscript Library, General Collection of Rare Books and Manuscripts, Medieval and Renaissance Manuscripts, accessed 24 June 2013.

[3] Stojko, John, Letters to God’s Eye: The Voynich Manuscript for the first time deciphered and translated into English. New York: Vantage Press, 1978.

[4] Joachim Dathe, The EVA-Transcription [Online]. Available: https://voynich2arabic.wordpress.com/eva-transcription/

[5] Vladimir Sazonov, Voynich Manuscript [Online]. Available: http://voynich.naobum.de/

[6] Reed Johnson (2013, July 9), The Unread: The Mystery Of The Voynich Manuscript [Online]. Available: http://www.newyorker.com/books/page-turner/the-unread-the-mystery-of-the-voynich-manuscript