With Job’s updated Voynichese and OCR project, I thought a good place to start would be an informal review of the digital transcription file of the Voynich Manuscript. I am referring to Stolfi’s “Interlinear” txt file from 5 December 1998, based on work by Landini, Grove, Friedman, Takahashi, D’Imperio, Currier, Reeds, Gillogly and Guy. It’s what everyone uses, including academic researchers. The quality of any statistical investigation rests on the quality of the transcription, which I think can be improved.
This post has also been informed by some papers and blog posts on similar subjects.
What is good?
- It is free to access and download (once you find it)
- It is in a universally accepted file format (.txt).
- It has plenty of metadata about the authors, sources and dates.
- It uses EVA so it can be converted to any other transcription.
What is bad?
- The data is based on outdated sources like low-quality photocopies. We have much better sources now. For example, Zandbergen has a UV shot of f17r where the entire top marginalia is readable (not that I have access). Update: The 2014 scans show some previously hidden parts too.
- It’s inconsistent. There are multiple readings of many passages accumulated over the years. Some of these ambiguities are non-issues when using modern sources.
- No recognition of original page order. This can be significant in explaining some statistical patterns and the shift from Currier language A to B.
- Far too many extraneous comments:
  - It has an entire changelog in it that nobody really cares about, and names that are never explained. Very “closed community”.
  - The explanations that are there are poorly ordered, assume too much prior knowledge about the manuscript, and are drowned out by the changelogs.
  - Plenty of these comments could go on another website, such as descriptions of illustrations. Why include them? People can just look at an image of the page now, and the computer programs aren’t going to care.
- Semantics are poorly implemented:
  - Metacodes and sections are unconventional and need explaining. They seem intuitive to the people who made them, but the sheer number of codes to remember is mind-boggling.
  - Section coding is too strict. For example, there is no distinction between pages with pure text and those with text and stars, which may originally have been separate.
  - I was going to say that different types of text (e.g. titles, starred paragraphs, marginalia and so on) aren’t distinguished, but I can’t tell if this is the case because there are so many letter codes and brackets to figure out. I gave up trying to review the semantics because it was so damn confusing.
EVA is pretty good and appropriate for many reasons. The only thing I would change is the outdated metadata, but that’s not part of the transcription alphabet per se. There are possible steganographic elements that could be included, like Nick Pelling’s suggestions about flourishes on the n characters, but that would get messy. Lastly, I’m not sure how it deals with those huge weirdos that cross many glyphs.
Based on these observations, a modern incarnation of a digital transcription would fulfill the following requirements:
- Easy to find.
- Free to access, download and share.
- Universally accepted file format with conventional section delineation and semantics.
- These semantics are intuitive and include everything they do now, plus original page order, looser section tagging, and possibly types of text.
- Brief metadata about dates and authors with links to elsewhere.
- Explanations only as needed to understand the particular file. All other descriptions and changelogs go elsewhere.
- A single text based on updated sources.
- A transcription that can describe everything and can be converted into anything.
The free and easy access and download would require an official, centralized site with Voynich Manuscript resources and contributor profiles. But more on that in another review…
Original page order
To log the original page order, the contributors in this field would have to come up with a notation they could use with incomplete data. Obviously the f##r/v notation won’t work until we have every page sorted. As an example, I am using something like [book number]-q[quire number]-f[folio number in quire]r/v, using the data from Nick Pelling’s presentation. So if we knew the first folio of quire 7, we would call it A-q7-f1r regardless of how many folios were in quires 1-6. But I digress.
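A minimal Python sketch of how such ids could be built, parsed and sorted. The names `OriginalId`, `format_oid` and `parse_oid` are hypothetical, and the assumption that the book is written as capital letters (as in "A-q7-f1r") is mine, not an agreed convention.

```python
import re
from typing import NamedTuple


class OriginalId(NamedTuple):
    """One folio side in the proposed original-order notation (hypothetical type)."""
    book: str    # e.g. "A"; assumed here to be one or more capital letters
    quire: int   # quire number within the book
    folio: int   # folio number within the quire, not within the manuscript
    side: str    # "r" (recto) or "v" (verso)


# Matches strings such as "A-q7-f1r".
_OID_PATTERN = re.compile(r"^([A-Z]+)-q(\d+)-f(\d+)([rv])$")


def format_oid(book: str, quire: int, folio: int, side: str) -> str:
    """Build an id like 'A-q7-f1r' from its parts."""
    if side not in ("r", "v"):
        raise ValueError(f"side must be 'r' or 'v', got {side!r}")
    return f"{book}-q{quire}-f{folio}{side}"


def parse_oid(text: str) -> OriginalId:
    """Split an id like 'A-q7-f1r' back into its parts."""
    match = _OID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid original-order id: {text!r}")
    book, quire, folio, side = match.groups()
    return OriginalId(book, int(quire), int(folio), side)


if __name__ == "__main__":
    print(format_oid("A", 7, 1, "r"))
    # Sorting on the parsed tuple orders quires numerically, so q10 follows q7.
    print(sorted(["A-q7-f1v", "A-q10-f1r", "A-q7-f1r"], key=parse_oid))
```

Because the folio number is counted within its quire, an id stays valid even when the contents of earlier quires are still unknown.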
The requirements of the file naturally suggest Dublin Core and XML. They’re standard, many are familiar with them, and many programs and databases already deal with them. XML is flexible, human-readable and naturally captures the structure of data without the ad hoc mess in the current file. Theoretically we can specify an XML schema and automatically convert the txt file into that system. But if we also need to update the text itself with new sources, we might as well take care of the underlying foundation too, and start completely fresh.
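For what the automatic conversion might look like, here is a rough Python sketch. It assumes a heavily simplified input in which each text line starts with a locus such as `<f1r.P.1;H>` (folio, unit, line number, transcriber code) followed by EVA words separated by dots. That shape, the `convert` function and the flat `<folio>`/`<line>` output are assumptions for illustration only; the real file has many more conventions that a proper converter would need to handle.

```python
import re
import xml.etree.ElementTree as ET

# Assumed, simplified locus format: <folio.unit.line;transcriber> then the text.
LOCUS = re.compile(r"^<(f\d+[rv]\d*)\.(\w+)\.(\d+);(\w)>\s+(\S+)$")


def convert(lines, transcriber="H"):
    """Build a <body> element from locus-prefixed lines of one transcriber.

    Lines that do not match the assumed format (comments, other
    transcribers, blank lines) are skipped rather than guessed at.
    """
    body = ET.Element("body")
    folios = {}  # folio id -> <folio> element, in order of first appearance
    for raw in lines:
        match = LOCUS.match(raw.strip())
        if match is None or match.group(4) != transcriber:
            continue
        folio_id, _unit, number, _who, text = match.groups()
        folio = folios.get(folio_id)
        if folio is None:
            folio = ET.SubElement(body, "folio", id=folio_id)
            folios[folio_id] = folio
        # Simplification: ids ignore the unit, so they are only unique if
        # line numbers do not restart within a folio.
        line = ET.SubElement(folio, "line", id=f"{folio_id}-l{number}")
        line.text = text.replace(".", " ")
    return body


if __name__ == "__main__":
    sample = [
        "# a comment line",
        "<f1r.P.1;H>  daiin.okaiin.or",
        "<f1r.P.1;C>  daiin.okain.or",
        "<f1r.P.2;H>  sory.ckhar",
    ]
    print(ET.tostring(convert(sample), encoding="unicode"))
```

Picking a single transcriber sidesteps the multiple-readings problem rather than solving it, which is part of why starting fresh from better sources looks attractive.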
“No one sews a patch of unshrunk cloth on an old garment, for the patch will pull away from the garment, making the tear worse. Neither do people pour new wine into old wineskins. If they do, the skins will burst; the wine will run out and the wineskins will be ruined. No, they pour new wine into new wineskins, and both are preserved.” – Jesus, in Matthew 9:16-17 (NIV).
Okay, that’s a strange quote to use, but you get the idea. No band-aids: I propose a fundamental restructuring. Below is a sample of the type of XML file I would make. Note that Stolfi’s txt file takes 1873 lines to get through all the damn preliminaries. My file includes all the same semantics but uses only 72 lines by moving extraneous content elsewhere and using known standards.
<?xml version="1.1" encoding="UTF-8"?>
<manuscript xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head>
    <dc:title>Voynich Manuscript</dc:title>
    <dc:identifier>MS408</dc:identifier>
    <dc:contributor>Aaron Aaronson</dc:contributor>
    <dc:contributor>Someone Else</dc:contributor>
    <dc:creator>Master Troll</dc:creator>
    <dc:date>2014-07-05</dc:date>
    <dc:description>The entire Voynich Manuscript transcribed as a digital file.</dc:description>
    <dc:type>XML</dc:type>
    <dc:language>Hell if I know</dc:language>
    <dc:rights>Public domain</dc:rights>
    <dc:source>Beinecke MS408 SIDs, UV scans, pre-chemical photocopies, and more.</dc:source>
    <dc:subject>Plants. Possibly. Maybe.</dc:subject>
    <dc:publisher>Self-published</dc:publisher>
    <dc:relation>www.centralvoynichwebsite.com (just an example name)</dc:relation>
    <dc:coverage>int</dc:coverage>
    <!--
    ABOUT THIS DOCUMENT
    - This document is in XML. For a full explanation, see here: www.example.com
    - To contact or read about the contributors, go to the profiles at www.example.com
    - For a full changelog, see here: www.example.com
    ABOUT THE TRANSCRIPTION
    - For an introduction to the Voynich Manuscript, see here: www.example.com
    - For a guide to the terminology and structure of the Voynich Manuscript, see here: www.example.com
    - Contents use the EVA transcription. For a full explanation, see here: www.example.com
    ABOUT THE TAGS AND ATTRIBUTES
    - <manuscript> is the root element that encloses the whole document.
    - <head> encloses metadata.
    - <body> encloses the main text of the Voynich Manuscript.
    - <quire> - Encloses a quire. Attributes:
      - id - Value is given by "q" followed by the number of the quire.
    - <folio> - Encloses a folio. Attributes:
      - id - Value is the traditional name for that folio.
      - oid - The id of that folio in the original page order.
      - content - The visual contents of the folio. Possible values include "text", "whole plant", "plant part", "jar", "star", "nymph", "zodiac", "circle" and "water". Multiple values can be included as a comma-separated list.
        This replaces the section identifications which were subjective and overlapped.
      - curlang - The Currier language for the folio. Possible values include "none", "A" or "B". Default value is "none". Can only include one value.
      - curhand - The Currier hand for the folio, given as a numeral.
      - bifolio - The id of the other folio in the bifolio. Default value is "none".
    - <paragraph> - Encloses a clearly delineated set of lines. Attributes:
      - id - Value is the folio id followed by "-p" followed by the number of the paragraph.
      - type - What type of bullet points are used for the paragraph. Possible values include "none" (regular), "starred" (star bullet points) or "lettered" (letter bullet points). Default value is "none".
    - <line> - Encloses a line of text. Attributes:
      - id - Value is the paragraph id followed by "-l" followed by the number of the line.
      - type - The shape of the line. Possible values include "none" (horizontal), "circular" (going around the circumference of a circle), "radial" (going from the centre to the edge of a circle) and "spiral" (going in a spiral shape). Default value is "none".
    - <title> - Encloses a piece of text that is right aligned.
    - <label> - Encloses a piece of text that is not in a line and attached to something. Attributes:
      - id - Value is the folio id followed by "-la" followed by the number of the label.
      - referent - Value is the id of the object that the label is attached to. Alternative values include "none" and "illustration". Default value is "illustration".
    - <marginalia> - Encloses text outside the main paragraph structure.
    - <latin> - Encloses text in the Latin alphabet.
    - <key> - Encloses a key sequence as if it was a line.
    -->
  </head>
  <body>
    <quire id="q1">
      <folio id="f1r" oid="A-q1-f1r" content="text" curlang="A" curhand="1" bifolio="f8">
        <paragraph id="f1r-p1">
          <line id="f1r-p1-l1">py chyr nkal ar ntaiir shol shor cthrer y kor sholty</line>
          <line id="f1r-p1-l2">sory ckhar ory kair chtaiir shar air cthar cthar dan</line>
          <line id="f1r-p1-l3">syaiir shoky or ykaiin shod cthoary cthes taraiin sy</line>
          <line id="f1r-p1-l4">daiin oteey oteorroloty ctaor daiin okaiin or okan</line>
          <line id="f1r-p1-l5">sairy chear cthaiin cphar cffaiin</line>
          <title>ydaraishy</title>
          <!-- And so on -->
        </paragraph>
      </folio>
    </quire>
  </body>
</manuscript>
Below is an example of how it would handle marginalia:
<folio id="f17r" content="text">
  <marginalia>
    <line><latin>mallior allor lcz uinlanima</latin> oteeeol aim</line>
  </marginalia>
</folio>
Below is an example of how it would handle the circular sections:
<folio id="f57r" content="text, circle">
  <label id="f57r-la1" referent="f57r-l1">dairol</label>
  <line id="f57r-l1" type="circular">v ra l y soeos u s ar okees o d rocpcheer and so on.</line>
</folio>
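To give a feel for how a program might consume a file shaped like these samples, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It counts words per Currier language by reading the `curlang` attribute and the `<line>` elements. The embedded `SAMPLE` is a trimmed copy of the f1r example, and `words_by_language` is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of the f1r sample, without the metadata head.
SAMPLE = """
<manuscript>
  <body>
    <quire id="q1">
      <folio id="f1r" oid="A-q1-f1r" content="text" curlang="A" curhand="1">
        <paragraph id="f1r-p1">
          <line id="f1r-p1-l1">py chyr nkal ar ntaiir shol shor cthrer y kor sholty</line>
          <line id="f1r-p1-l4">daiin oteey oteorroloty ctaor daiin okaiin or okan</line>
        </paragraph>
      </folio>
    </quire>
  </body>
</manuscript>
"""


def words_by_language(root, curlang):
    """Count EVA words on every folio tagged with the given Currier language."""
    counts = Counter()
    for folio in root.iter("folio"):
        # The proposal gives curlang a default of "none" when it is absent.
        if folio.get("curlang", "none") != curlang:
            continue
        for line in folio.iter("line"):
            counts.update((line.text or "").split())
    return counts


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for word, count in words_by_language(root, "A").most_common(3):
        print(word, count)
```

The point is that the selection logic is two attribute lookups; no custom parser for letter codes and brackets is needed.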
Notes: Some folio details might be incomplete or inaccurate in these samples because they are just for demonstration purposes.
The advantages of the id attributes are numerous. They make it extremely easy for humans and computers to locate and correlate things on multiple structural levels, they are similar to current notations, and they can be automatically generated. Most notably, any further information (e.g. x and y location data, paint colors, or future discoveries) can use XML with the same object ids so that programs can automatically match it up or merge it with the basic contents here. Volunteers contributing other data can make their own XML sheets without having to navigate through this entire file to add them, and in the long term this won’t lead to unnecessary bloat.
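As a rough illustration of that matching, the Python sketch below merges a separate sheet of annotations into the base transcription by id. The `<annotations>`/`<item ref="...">` layout, the `x`, `y` and `ink` attributes and the `merge_by_id` function are all invented for this example; a real contributor sheet would follow whatever format the community agrees on.

```python
import xml.etree.ElementTree as ET

# Base transcription fragment, following the f57r sample.
BASE = """
<body>
  <folio id="f57r" content="text, circle">
    <label id="f57r-la1" referent="f57r-l1">dairol</label>
    <line id="f57r-l1" type="circular">v ra l y soeos</line>
  </folio>
</body>
"""

# Hypothetical contributor sheet: extra data keyed by the same ids.
EXTRA = """
<annotations>
  <item ref="f57r-la1" x="120" y="340"/>
  <item ref="f57r-l1" ink="brown"/>
  <item ref="f99r-l1" ink="red"/>
</annotations>
"""


def merge_by_id(base_root, extra_root):
    """Copy attributes from each <item> onto the base element with that id.

    Returns the list of refs that had no matching element, so that a
    contributor can see which of their ids are wrong or out of date.
    """
    index = {el.get("id"): el for el in base_root.iter() if el.get("id")}
    unmatched = []
    for item in extra_root.iter("item"):
        ref = item.get("ref")
        target = index.get(ref)
        if target is None:
            unmatched.append(ref)
            continue
        for name, value in item.attrib.items():
            if name != "ref":
                target.set(name, value)
    return unmatched


if __name__ == "__main__":
    base = ET.fromstring(BASE)
    missing = merge_by_id(base, ET.fromstring(EXTRA))
    print(ET.tostring(base, encoding="unicode"))
    print("unmatched:", missing)
```

The base file is never edited by hand here; the extra sheet can live elsewhere and be merged only by the programs that need it.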
So how would we achieve this? My idea is below.
- Get discussion and consensus from the Voynichology community about the appropriate data and format for a new transcription file.
- Gather knowledgeable volunteers and the latest and greatest image sources.
- Gather data about folios and original page order.
- Delegate tasks to transcribe the manuscript in EVA from scratch, and/or help Job develop his OCR software to automate it. Reach consensus for a single text. Many hands make light work.
It’s not the 90s anymore. Gentlemen, we can rebuild the transcription file. We have the technology. We can make it harder, better, faster, stronger 😀
Comments on my review and proposal are invited below.