Projects:2014S1-44 Cracking the Voynich Manuscript Code
Revision as of 09:43, 30 October 2014
The Voynich Manuscript is a mysterious 15th-century manuscript written in an unknown alphabet. Its mysterious nature has attracted linguists and cryptographers over the last hundred years, each attempting to "crack the Voynich code". However, despite a great deal of research into the text and illustrations of the manuscript, no one has ever conclusively deciphered a single word.
Fortunately, the whole book has been transcribed into an electronic format in which each manuscript character is represented by a convenient ASCII character.
This project expands upon past research into the linguistic features of the manuscript. Computational analysis techniques such as Word Recurrence Intervals and N-gram relationships, along with supervised learning algorithms such as Support Vector Machines (SVM) and Multiple Discriminant Analysis (MDA), are applied to an electronic transcription of the text. The team evaluates the use of these classification methods, and also develops new ways to identify grammar and syntax in the Voynich language.
For a report on our work decoding the manuscript, follow this link (https://www.eleceng.adelaide.edu.au/personal/dabbott/wiki/index.php/Semester_B_Final_Report_2014_-_Cracking_the_Voynich_code). Otherwise, this work is summarised below.
Project information
Background
The Voynich Manuscript is a mysterious book written in an unknown alphabet. So little is known about the nature and origins of the manuscript that it has been named after Wilfrid Michael Voynich, the collector who purchased it in 1912 from a castle in Italy. Radiocarbon dating at the University of Arizona has placed the manuscript in the early 15th century[1], and it appears to be a herbal (or medicinal) manual from this period in Europe. However, despite a great deal of research into the text and illustrations of the manuscript, no one has ever conclusively deciphered a single word. The mysterious nature of the manuscript has attracted linguists and cryptographers over the course of the last hundred years, each attempting to "crack the Voynich code".
The manuscript itself is made up of folios numbered from f1 to f116. Each folio consists of two pages (labelled r and v), with the exception of ten foldouts of up to six pages. Although the page numbering was clearly added after the manuscript was written, there are gaps in the numbering which indicate missing folios, and indications that some folios were reordered long after their completion. These oddities have led scholars to believe that the manuscript may have had several owners. Certain pages also contain a "key-like" sequence of characters, which is why many leading cryptographers of the time (and indeed, today) believe it to be a key cipher.[2]
Between 1996 and 1998, several academics began work on what they hoped to be a complete database of Voynich transcriptions. This database was known as the “Interlinear Archive of Electronic Transcriptions in EVA” (referred to as the Interlinear Archive), and came to include several partial transcriptions such as those developed by Currier and Friedman, along with a complete transcription by Takeshi Takahashi. [3] All the transcriptions in this file were converted to a newly developed alphabet, called EVA, that attempted to correct the errors of previous alphabets (which often oversimplified the characters in the manuscript and ignored rare characters) by using ligature analysis on the handwriting. [4]
The authorship of the manuscript is heavily disputed. Some popular candidates are Roger Bacon, John Dee, and Leonardo da Vinci.[5]
Technical Background
This project is heavily dependent on data mining techniques, as it involves the analysis and extraction of large quantities of data from different sources. Data mining can be used to analyse and predict behaviours, find hidden patterns and meanings or compare different data sets for commonalities or correlation. Authorship detection (stylometry) involves a subset of data mining techniques that help determine the authenticity of works and the possible authors of undocumented texts. The techniques involved in this project are also used in applications such as search engine development, code analysis, language processing, and plagiarism detection.
Part of the complexity involved in decoding the Voynich Manuscript (VMS) is the difficulty of transcription. The VMS has been transcribed many times by many different scholars, each attempting to create a universally acceptable transcription, but each has had to make limiting decisions about the structure of the manuscript. Due to the nature of the writing, two characters observed by one scholar may be interpreted by another as a single combined character, and the spacing between words in the manuscript is notoriously vague. This ambiguity reduces the effectiveness of standard cryptographic and linguistic techniques that rely on the relationships between characters and between separate word tokens.
Project Objectives
As this is the first year in which an Honours Project has investigated the Voynich Manuscript, the long-term objectives have been flexible and subject to change. Furthermore, given the large body of work already in existence on the manuscript, it was unlikely that even a partial decoding could be produced by our team within a year. Instead, the primary focus of this project was to research and understand some features of the manuscript, and to analyse our data effectively. We aimed to develop ideas and code which could contribute to the overall project outcome and aid future students with their research. The broad goals of our project included:
- Developing possible methods and statistics that could be used to compare an unknown language with a known language or data set
- Comparing the linguistic characteristics and features of the Voynich Manuscript with relevant languages and authors
- Theorising as to whether the language contained within the manuscript is real, a code or a hoax
- Developing a code base, documentation, and clear analysis to aid future projects
That being said, we had certain concrete milestones that we hoped to achieve this year, and certain research areas which we considered to be most worthwhile. These included:
- Word Recurrence Interval as a language independent statistic
- Identification of words which may relate to the illustrations in the manuscript
- SVM and MDA text classification
Approach and Stages
In the early planning stages, we looked at our goals (concrete and flexible) and decided to divide our work into five stages. Each stage depended on areas from the previous stages and allowed continuous development and improvement as our understanding of the manuscript progressed. Early stages involved basic characterisation of the manuscript, and later stages involved text classification algorithms and research-based strategies. The stages themselves were split evenly between team members (detailed in the Project Management section) and were designed to take advantage of the unique skills of each team member. The five stages are listed below:
Phase 1
Compile basic textual information about the manuscript, including:
- The number of word types in the manuscript
- The number of word tokens per page
- An ordered list of the characters used and their frequencies
- An ordered list of the words used and their frequencies
- The probability of bigrams, trigrams, and other collocations
To draw relevant conclusions from this data, the same tests will also be performed on texts in known languages, including the UN Universal Declaration of Human Rights.
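The Phase 1 statistics can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `basic_statistics`, the input layout (a dictionary from page label to whitespace-separated text) and the sample strings are all assumptions made for the example, and bigrams are counted within a page only.

```python
from collections import Counter


def basic_statistics(pages):
    """Compute simple Phase 1 style statistics.

    pages: dict mapping a page label (e.g. 'f1r') to its transcribed text.
    Assumes word tokens are separated by whitespace, which is itself a
    transcription decision.
    """
    tokens_per_page = {}
    word_freq = Counter()
    char_freq = Counter()
    bigrams = Counter()

    for label, text in pages.items():
        tokens = text.split()
        tokens_per_page[label] = len(tokens)
        word_freq.update(tokens)
        char_freq.update(ch for tok in tokens for ch in tok)
        # Word bigrams are counted within a page, not across page breaks.
        bigrams.update(zip(tokens, tokens[1:]))

    total_bigrams = sum(bigrams.values())
    bigram_prob = {bg: n / total_bigrams for bg, n in bigrams.items()}

    return {
        "word_types": len(word_freq),
        "tokens_per_page": tokens_per_page,
        "char_freq": char_freq.most_common(),   # ordered, most frequent first
        "word_freq": word_freq.most_common(),   # ordered, most frequent first
        "bigram_prob": bigram_prob,
    }


# Placeholder input: illustrative strings only, not a real transcription.
sample_pages = {
    "f1r": "daiin chol daiin shol",
    "f1v": "chol daiin chol",
}
stats = basic_statistics(sample_pages)
print(stats["word_types"], stats["tokens_per_page"])
```

Running the same function over a known-language text gives directly comparable numbers, which is the point of the comparison described above.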
Phase 2
Look through the manuscript and find pages with similar illustrations, and then find words that seem relevant to those illustrations. This will include the words unique to pages with a given illustration type as well as words which suddenly increase in frequency (‘burst’) on those pages.
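One possible way to express this search in Python is shown below. It is a sketch under stated assumptions: the illustration labels are supplied by hand, the function name `section_words` is hypothetical, and "burst" is interpreted here as a word whose relative frequency on the target pages is at least `burst_ratio` times its relative frequency elsewhere. The threshold value is arbitrary and the page contents are placeholders.

```python
from collections import Counter


def section_words(pages, labels, target, burst_ratio=3.0):
    """Return (unique_words, bursty_words) for pages tagged with `target`.

    pages:  dict of page label -> whitespace-separated text
    labels: dict of page label -> illustration type (assigned manually)
    """
    inside, outside = Counter(), Counter()
    for page, text in pages.items():
        bucket = inside if labels[page] == target else outside
        bucket.update(text.split())

    n_in = sum(inside.values())
    n_out = sum(outside.values())

    # Words that never appear outside the target pages.
    unique = sorted(w for w in inside if w not in outside)
    # Words seen everywhere, but much more often on the target pages.
    bursty = sorted(
        w for w in inside
        if w in outside
        and inside[w] / n_in >= burst_ratio * (outside[w] / n_out)
    )
    return unique, bursty


# Placeholder pages and labels, for illustration only.
pages = {
    "f1r": "daiin chol chol otol",
    "f2r": "chol chol shol daiin",
    "f67r": "otol otol otol daiin ar",
    "f68r": "otol ar daiin daiin",
}
labels = {"f1r": "herbal", "f2r": "herbal",
          "f67r": "astronomical", "f68r": "astronomical"}

print(section_words(pages, labels, "astronomical"))
```

Comparing relative rather than raw frequencies keeps long and short sections on an equal footing, although other definitions of a burst are equally defensible.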
Phase 3
Experiment with the Word Recurrence Interval (WRI) analytical method and apply it to the Voynich Manuscript.
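As a rough illustration, a recurrence interval can be taken as the number of token positions between successive occurrences of the same word. The sketch below uses that reading, which is an assumption rather than the definition used in the project, and the function names and the `min_intervals` parameter are hypothetical. It summarises each word by the mean and standard deviation of its intervals.

```python
from collections import defaultdict
from statistics import mean, pstdev


def recurrence_intervals(tokens):
    """Map each repeated word to the gaps between its successive positions."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return {
        word: [b - a for a, b in zip(pos, pos[1:])]
        for word, pos in positions.items()
        if len(pos) > 1  # a word seen once has no interval
    }


def interval_summary(tokens, min_intervals=2):
    """Mean and population standard deviation of intervals per word.

    Words with fewer than `min_intervals` intervals are skipped, since a
    spread computed from a single gap is not informative.
    """
    return {
        word: (mean(gaps), pstdev(gaps))
        for word, gaps in recurrence_intervals(tokens).items()
        if len(gaps) >= min_intervals
    }


# Placeholder token stream, for illustration only.
tokens = "daiin chol daiin shol chol daiin daiin".split()
print(recurrence_intervals(tokens))
print(interval_summary(tokens))
```

Because the calculation only needs token positions, it can be applied unchanged to the Voynich transcription and to texts in known languages.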
Phase 4
Investigate current theories further or look into areas of interest which develop during the course of the project.
Phase 5
Run supervised learning algorithms such as Support Vector Machines (SVM) and Multiple Discriminant Analysis (MDA) to compare the Voynich text against translations of the Universal Declaration of Human Rights in other languages.
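Before any classifier can be run, each text has to be reduced to a numeric feature vector that does not depend on the alphabet used. The sketch below shows one such feature, the distribution of word lengths, and compares texts with a plain nearest-profile rule. This is deliberately a simple stand-in and is not an SVM or MDA; the feature choice, the function names and the `MAX_LEN` cap are assumptions made for illustration, and the reference strings are placeholders rather than real translations.

```python
import math
from collections import Counter

MAX_LEN = 10  # words longer than this are counted in the last bin


def word_length_profile(text):
    """Relative frequency of word lengths 1..MAX_LEN in a text."""
    lengths = Counter(min(len(tok), MAX_LEN) for tok in text.split())
    total = sum(lengths.values())
    if total == 0:
        return [0.0] * MAX_LEN
    return [lengths[n] / total for n in range(1, MAX_LEN + 1)]


def closest_reference(sample, references):
    """Name of the reference text whose profile is nearest to the sample.

    Uses Euclidean distance between profiles; a trained SVM or MDA model
    would replace this step in a fuller analysis.
    """
    target = word_length_profile(sample)
    return min(
        references,
        key=lambda name: math.dist(target, word_length_profile(references[name])),
    )


# Placeholder reference texts, for illustration only.
references = {
    "short_words": "a bb a cc bb a",
    "long_words": "aaaaaa bbbbbbb cccccc",
}
print(closest_reference("dd e ff e", references))
```

The same profiles could be passed as training vectors to an SVM or MDA implementation, with one labelled vector per reference language.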
Deliverables and Progress
- Proposal seminar
- Progress report
- Final seminar
- Final Report (Online)
- Poster
- Project exhibition 'expo'
- YouTube video
- Weekly Progress (Follow this link for current progress)
End results
Team
Group members
Supervisors
Resources
- Standard PC
- MATLAB
- Python
- Reference books
- Takahashi EVT file
- English and foreign texts
- [1] No reference text was provided (ref name: ARIZ)
- [2] No reference text was provided (ref name: Zandbergen1)
- [3] No reference text was provided (ref name: Stolfi1)
- [4] No reference text was provided (ref name: Zandbergen3)
- [5] No reference text was provided (ref name: AUTH)