Difference between revisions of "Projects:2014S1-44 Cracking the Voynich Manuscript Code"

Revision as of 08:32, 30 October 2014


The Voynich Manuscript is a mysterious 15th century manuscript written in an unknown alphabet. Its nature has attracted linguists and cryptographers over the last hundred years, each attempting to "crack the Voynich code". However, despite a great deal of research into the text and illustrations of the manuscript, no one has conclusively deciphered a single word.

Fortunately, the whole manuscript has been converted into an electronic format, with each character mapped to a convenient ASCII character.

This project expands upon past research into the linguistic features of the manuscript. Computational analysis techniques such as Word Recurrence Intervals and N-gram relationships, along with supervised learning algorithms such as Support Vector Machines (SVM) and Multiple Discriminant Analysis (MDA), are applied to an electronic transcription of the text. The team evaluates the use of these classification methods, and also develops new ways to identify grammar and syntax in the Voynich language.


Project information

Specific tasks

Background

The Voynich Manuscript is a mysterious book written in an unknown alphabet. So little is known about its nature and origins that it is named after Wilfrid Michael Voynich, the collector who purchased it in 1912 from a castle in Italy. Radiocarbon dating (at the University of Arizona) places the manuscript in the early 15th century[1], and it appears to be a herbal (or medicinal) manual from this time period in Europe. However, despite a great deal of research into the text and illustrations of the manuscript, no one has ever conclusively deciphered a single word. The mysterious nature of the manuscript has attracted linguists and cryptographers over the course of the last hundred years, each attempting to "crack the Voynich code".

The manuscript itself is made up of several folios, numbered from f1 to f116. Each folio consists of two pages (labelled r and v), with the exception of ten foldouts of up to six pages. Although the page numbering was clearly added after the manuscript was written, there are gaps in the numbering which indicate missing folios, and indications that some folios have been reordered long after their completion. These oddities have led scholars to believe that the manuscript may have had several owners. Certain pages also contain a "key-like" sequence of characters, which is why many leading cryptographers of the time (and indeed, today) believe it to be a key cipher. [2]

The first major work into an electronic transcription of the manuscript was performed by famous cryptologist William Friedman, and a team called the ‘First Study Group’ (FSG). The FSG developed an early method with which they were able to convert between the characters in the manuscript and computer-readable text. Since this first work, several other research groups have developed alternate transcriptions of the text, each of them attempting to correct the errors of previous work and create a definitive data set. [2]

One of these transcriptions was developed by Captain Prescott Currier, who was the first to identify, based on handwriting and word usage, the possibility that the manuscript had two or more authors (he noted up to six distinct handwriting styles). [3] [4] Currier identified two main ‘languages’ (which have since been referred to as Currier A and Currier B) and assigned most of the pages in the manuscript to one of these two categories. This division of languages has since been supported by several experiments using computational cluster analysis. [5] [6]

Between 1996 and 1998, several academics began work on what they hoped to be a complete database of Voynich transcriptions. This database was known as the “Interlinear Archive of Electronic Transcriptions in EVA” (referred to as the Interlinear Archive), and came to include several partial transcriptions such as those developed by Currier and Friedman, along with a complete transcription by Takeshi Takahashi. [4] All the transcriptions in this file were converted to a newly developed alphabet, called EVA, that attempted to correct the errors of previous alphabets (which often oversimplified the characters in the manuscript and ignored rare characters) by using ligature analysis on the handwriting. [7]

The authorship of the manuscript is heavily disputed. Some popular candidates are Roger Bacon, John Dee, and Leonardo da Vinci.[8]

See the Voynich Data Table (Appendix A) for more information about the sectioning, language, and page order of the Voynich Manuscript as it currently exists within the Yale Beinecke Library.

Technical Background

This project is heavily dependent on data mining techniques, as it involves the analysis and extraction of large quantities of data from different sources. Data mining can be used to analyse and predict behaviours, find hidden patterns and meanings or compare different data sets for commonalities or correlation. Authorship detection (stylometry) involves a subset of data mining techniques that help determine the authenticity of works and the possible authors of undocumented texts. The techniques involved in this project are also used in applications such as search engine development, code analysis, language processing, and plagiarism detection.

Part of the complexity involved in the VMS decoding process is the difficulty of transcription. The VMS has been transcribed many times by many different scholars, each attempting to create a universally acceptable transcription, but each of them has had to make limiting decisions about the structure of the manuscript. Due to the nature of the writing, two characters observed by one scholar may be interpreted by another as a single combined character, and the spacing between words in the manuscript is notoriously vague. This ambiguity reduces the effectiveness of standard cryptographic and linguistic techniques for observing the relationships between characters and between separate word tokens.

Previous Studies

The fame of the Voynich Manuscript has led to a large amount of research into its origin, although this has been of varying quality. In the past, notable code-breakers (including the NSA) have attempted to crack the manuscript, but recent work has been done primarily by a small group of academics and amateurs, who have focused on data mining various electronic transcriptions. [2] In this section, we will detail a few experiments which relate to our own analysis.

Figure: Mary D'Imperio's 1978 paper (image file: Elegant enigma cover.png)

Prescott Currier’s work and that of many other code-breaking groups was collected and curated by Mary D’Imperio, who worked on the manuscript for several years during her time as an NSA researcher. [3] D’Imperio’s paper, titled “The Voynich Manuscript: An Elegant Enigma”, which collected and analysed the research up to that point, has become the most cited reference work on the Voynich Manuscript since it was first published in 1978. [2] In this document, D'Imperio states that most research attempting to match the herbal drawings to real plants has produced disappointingly vague results, but also claims that some plants have been indisputably identified as European. D'Imperio further highlighted the need for strict adherence to the experimental method when dealing with computational analysis of the manuscript, claiming that this was necessary to avoid vague and meaningless conclusions.[3]

Reddy and Knight's 2011 paper titled "What We Know About the Voynich Manuscript" provides a useful summary of linguistic analysis of the manuscript. In particular, Reddy and Knight used an unsupervised classification algorithm (based on Hidden Markov Models) to separate vowels and consonants. They found that, in this area, the text is similar to "abjad" languages (such as modern Hebrew) which do not have vowels in the conventional sense of the term.[6] Jorge Stolfi, on the other hand, ran an experiment on the word length distribution of the Voynich and of various other languages and concluded that, in this area, the text is similar to East Asian languages such as Chinese and Vietnamese. [9] We used the experiments of these researchers as part of the basis for our comparison corpus, but it is clear from the two examples above that different classification methods produce vastly different conclusions about the nature of the language in the Voynich.
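A word length comparison of the kind described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Stolfi's actual procedure: the function name `word_length_distribution`, the `by_type` switch, and the sample tokens (written in EVA style, not quoted from the manuscript) are all assumptions made for the example.

```python
from collections import Counter


def word_length_distribution(text, by_type=True):
    """Return {length: proportion} for whitespace-separated words.

    by_type=True counts each distinct word once (the vocabulary);
    by_type=False counts every occurrence (the running text).
    """
    words = text.lower().split()
    if by_type:
        words = set(words)
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


# Invented EVA-style tokens, used only to exercise the function.
sample = "daiin chol daiin shol chol qokeedy"
dist_types = word_length_distribution(sample)
dist_tokens = word_length_distribution(sample, by_type=False)
```

Running the same function over a transcription and over each comparison text gives distributions that can be tabulated or plotted side by side. Whether distinct words or running words are counted changes the shape of the curve, so the choice should be stated with any result.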

Another researcher, Rene Zandbergen, recently developed an experiment to determine whether the language used on a page is related to the basic illustration type.[5] He concluded that, although the illustration types can be separated by supervised learning algorithms which look only at the words on the page, this doesn't necessarily mean that the text relates to the illustrations themselves. He drew attention to the fact that pages in the Currier A language appeared (to his algorithm) as separate from those in the Currier B language, even when the pages had similar illustrations.

This is the first year that a project attempting to decode the manuscript has been run at Adelaide University, but similar projects have looked at text classification in previous years. Notable examples include ‘Cipher Cracking’ and ‘Authorship Detection: Who wrote the letters to the Hebrews?’. Both of these projects developed techniques which may be applicable to the VMS, including textual comparison features such as Common N-grams and Word Recurrence Intervals (WRI) along with the use of supervised machine learning algorithms such as Support Vector Machines (SVM).
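As a rough illustration of the Word Recurrence Interval idea, the sketch below records, for each word, the gaps (in word positions) between its successive occurrences. The definition of the interval used here is an assumption, and the function names `recurrence_intervals` and `mean_wri_by_rank` are hypothetical; the earlier projects may have computed or scaled the statistic differently.

```python
from collections import defaultdict
from statistics import mean


def recurrence_intervals(words):
    """Map each repeated word to the gaps between its successive occurrences."""
    last_seen = {}
    intervals = defaultdict(list)
    for position, word in enumerate(words):
        if word in last_seen:
            intervals[word].append(position - last_seen[word])
        last_seen[word] = position
    return dict(intervals)


def mean_wri_by_rank(words):
    """Return (word, mean interval) pairs, shortest mean interval first."""
    intervals = recurrence_intervals(words)
    means = {w: mean(gaps) for w, gaps in intervals.items()}
    return sorted(means.items(), key=lambda item: (item[1], item[0]))


# Toy token stream; single letters stand in for words.
tokens = "a b a c b a d a".split()
wri = recurrence_intervals(tokens)
ranked = mean_wri_by_rank(tokens)
```

Because the statistic depends only on token positions and not on what the tokens mean, the same code can be run unchanged on a Voynich transcription and on texts in known languages.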

Project Objectives

As this is the first year in which an Honours Project has investigated the Voynich Manuscript, the long-term objectives have been flexible and subject to change. Furthermore, given the large body of work already in existence on the manuscript, it was unlikely that even a partial decoding could be produced by our team within a year. Instead, the primary focus of this project was to research and understand some features of the manuscript, and to analyse our data effectively. We aimed to develop ideas and code which could contribute to the overall project outcome and aid future students with their research. The broad goals of our project included:

  • Developing possible methods and statistics that could be used to compare an unknown language with a known language or data set
  • Comparing the linguistic characteristics and features of the Voynich Manuscript with relevant languages and authors
  • Theorising as to whether the language contained within the manuscript is real, a code or a hoax
  • Developing a code base, documentation, and clear analysis to aid future projects

That being said, we had certain concrete milestones that we hoped to achieve this year, and certain research areas which we considered to be most worthwhile. These included:

  • Word Recurrence Interval as a language independent statistic
  • Identification of words which may relate to the illustrations in the manuscript
  • SVM and MDA text classification
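For the second milestone, one simple starting point is to tag each page with illustration descriptors and then look for words that occur only on pages sharing a given descriptor. The sketch below assumes a hypothetical `pages` mapping from a page label to a pair of (descriptor set, word list); the labels, tags and tokens are invented for illustration and are not taken from the manuscript.

```python
from collections import Counter


def words_unique_to_descriptor(pages, descriptor):
    """Return {word: count} for words found on pages tagged with
    `descriptor` and on no other page.

    `pages` maps a page label to (set_of_descriptors, list_of_words).
    """
    inside, outside = Counter(), Counter()
    for tags, words in pages.values():
        (inside if descriptor in tags else outside).update(words)
    return {w: n for w, n in inside.items() if w not in outside}


# Hypothetical page data: labels, tags and tokens are invented.
pages = {
    "p1": ({"plant"}, ["chol", "daiin", "chol"]),
    "p2": ({"plant", "water"}, ["chol", "shedy"]),
    "p3": ({"water"}, ["shedy", "daiin", "qokal"]),
}
plant_only = words_unique_to_descriptor(pages, "plant")
water_only = words_unique_to_descriptor(pages, "water")
```

Strict uniqueness is a harsh criterion on a small corpus, so a natural extension is to compare the relative frequency of each word inside and outside the tagged pages instead of requiring a zero count outside.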

Approach and Stages

In the early planning stages, we looked at our goals (concrete and flexible) and decided to divide our work into five stages. Each stage depended on work from the previous stages and allowed continuous development and improvement as our understanding of the manuscript progressed. Early stages involved basic characterisation of the manuscript, and later stages involved text classification algorithms and research-based strategies. The stages themselves were split evenly between team members (detailed in the Project Management section) and were designed to take advantage of the unique skills of each team member. The five stages are listed in more detail within the body of this report.


  • Phase 1: Characterize the text. Write scripts that count its features. How many words? How long is the alphabet? Word frequencies? Probability of one letter following another. Probability of two-letter pairs (2-grams) and n-letter groups (n-grams). Compare these in a table with known languages, obtained by running the same code on the Universal Declaration of Human Rights. Don't forget to take a short paragraph of English, count everything manually, and then run your code on it to cross-check that it is counting correctly. You must always validate your code or you will lose marks.
  • Phase 2: Write a general descriptor for each picture in the book, e.g. water, woman, tree, flower, vegetable, leaf, dancing etc. Associate each descriptor with the appropriate page. Write some code to find which words on a page are unique to those pages with those descriptors. Which words also suddenly increase in frequency on those pages with shared descriptors? Tabulate the results.
  • Phase 3: Investigate the use of Word Recurrence Interval (WRI) versus rank plots. Plot WRI curves of the Voynich versus other languages from the Declaration of Human Rights.
  • Phase 4: Think up some other ideas to try out.
  • Phase 5: As WRI is a language-independent metric, you can select classification features based on WRI. Then you can run an SVM and an MDA classifier to compare the Voynich against other languages in the Declaration of Human Rights. Then you can run it against the works of specific authors of interest such as Roger Bacon, John Dee, and Edward Kelley.
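The Phase 1 counts, together with the validation step the list insists on, can be sketched as follows. This is a minimal sketch under stated assumptions: words are split on whitespace, letter pairs are counted within words only, and the names `characterise` and `next_letter_probability` are hypothetical. The validation phrase is short enough to count by hand.

```python
from collections import Counter


def characterise(text):
    """Basic counts for a whitespace-delimited text."""
    words = text.split()
    letters = [c for w in words for c in w]
    # Adjacent letter pairs, counted inside each word only.
    bigrams = Counter(a + b for w in words for a, b in zip(w, w[1:]))
    return {
        "word_count": len(words),
        "alphabet_size": len(set(letters)),
        "word_freq": Counter(words),
        "letter_freq": Counter(letters),
        "bigram_freq": bigrams,
    }


def next_letter_probability(bigram_freq, first, second):
    """Estimate P(second | first) from within-word bigram counts."""
    total = sum(n for pair, n in bigram_freq.items() if pair[0] == first)
    return bigram_freq[first + second] / total if total else 0.0


# Validation on a phrase small enough to count by hand:
# 7 words, 9 distinct letters, "at" occurs 3 times, "t" is followed
# by "h" twice and by "a" once.
stats = characterise("the cat sat on the tan mat")
p_h_after_t = next_letter_probability(stats["bigram_freq"], "t", "h")
```

Once the hand count and the script agree on a small sample, the same functions can be run on a transcription file and on each comparison text to fill in the table.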

Deliverables and Progress

  • Proposal seminar
  • Progress report
  • Final seminar
  • Final report
  • Poster
  • Project exhibition 'expo'
  • YouTube video
  • Weekly progress

Follow this link for current progress

End results

This project will familiarize you with techniques in information theory, probability, statistics, encryption, decryption, signal classification, and data mining. It will also improve your software skills. The new software tools you develop may lead to new IP in the areas of data mining and automatic text language identification, and may also make you rich/famous. These skills are useful for jobs in computer security, communications, digital forensics, internet search companies, and language processing software companies.

Team

Group members

Supervisors

Resources

  • Standard PC
    • MATLAB
    • Python
  • Reference books
  • Takahashi EVT file
  • English and foreign texts
  • [1] ARIZ (no reference text provided)
  • [2] Zandbergen1 (no reference text provided)
  • [3] DImperio1978 (no reference text provided)
  • [4] Stolfi1 (no reference text provided)
  • [5] Zandbergen2 (no reference text provided)
  • [6] Knight2011 (no reference text provided)
  • [7] Zandbergen3 (no reference text provided)
  • [8] AUTH (no reference text provided)
  • [9] Stolfi2 (no reference text provided)