The Erasmus Project

Resources for developers of Bible software

How text is processed

There are three steps to get from printed material to an electronic form that we make available here.

  1. Scanning: Creating an electronic image of the work.
  2. Optical Character Recognition: Converting an electronic image into text.
  3. Instrumentation: Adding markup tags to make the text easy for software to process.
We will now discuss each of these steps in more detail.

Scanning
Scanning printed works involves using some form of scanner to create an electronic image of the work, one page at a time. This can involve a flatbed scanner, a hand-held scanner, or a digital camera. Also possible, but rarely done, is typing in the text manually. Note that older works often have yellowed pages, dirt, stains, and sometimes handwritten notes on the pages. In addition, text and images can be faded or smudged.

We sometimes scan material ourselves, but most often rely on other people who have done the scanning and put the images online.

Optical Character Recognition (OCR)
This step uses software that processes an electronic image, detects the characters on it, and creates a textual representation of the page. There are many OCR solutions available, ranging from free to commercial. Some can process multiple languages, though most do not. Some can also compensate, to a degree, for faded or smudged original pages, and most can deal with columnar layouts. However, illustrations within a page tend to be converted into runs of "garbage" characters, as you might imagine. The garbage has to be removed manually, and the illustrations have to be processed in a separate step.
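
As an illustration of this step, a minimal OCR pass over a single page image might look like the sketch below, which uses the open-source Tesseract engine through the pytesseract wrapper. The file name and language code are placeholders, and our actual tooling may differ:

    # A minimal sketch of running OCR on one scanned page.
    # Assumes Tesseract plus the pytesseract and Pillow packages are
    # installed; the file name and language code are illustrative only.
    from PIL import Image
    import pytesseract

    def ocr_page(image_path: str, languages: str = "eng") -> str:
        """Convert a single page image into plain text."""
        page = Image.open(image_path)
        # '--psm 1' asks Tesseract for automatic page segmentation,
        # which handles many columnar layouts on its own.
        return pytesseract.image_to_string(page, lang=languages, config="--psm 1")

    print(ocr_page("page-042.png"))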

No OCR is 100% accurate, so the resulting text usually contains errors. For instance, it can be difficult to distinguish between the uppercase letter O and the numerical digit 0, especially given the wide range of fonts used in printed material. Faded letters can be misinterpreted: a lowercase "h" with a faded ascender might be read as a lowercase "n". Diacritics add further confusion, for example between acute accents, grave accents, and apostrophes (and, in Greek, breathing marks). Even under the best circumstances, the resulting text has to be proofread and hand-corrected.
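
To give a feel for what a correction pass looks like, here is a hypothetical helper for the O-versus-0 confusion just described. The patterns are our illustration and deliberately conservative; in practice, every such change is still checked against the page image:

    import re

    # A digit 0 sandwiched between letters (e.g., "G0D") is almost
    # certainly a misread letter O; a letter O between digits (e.g.,
    # "1O3") is almost certainly a misread zero. Illustrative patterns.
    ZERO_FOR_O = re.compile(r"(?<=[A-Za-z])0(?=[A-Za-z])")
    O_FOR_ZERO = re.compile(r"(?<=\d)[Oo](?=\d)")

    def fix_common_confusions(line: str) -> str:
        line = ZERO_FOR_O.sub("O", line)
        line = O_FOR_ZERO.sub("0", line)
        return line

    print(fix_common_confusions("In G0D we trust, see Ps. 1O3."))
    # -> "In GOD we trust, see Ps. 103."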

It should also be noted that italic and bold text come through as plain text, since most OCR software produces plain-text files. Italic and bold tags must therefore be inserted manually, and this work is often deferred to a subsequent update of a work. Worse is the handling of non-Latin glyphs such as ancient Greek, Hebrew, Arabic, and Persian. Most OCR processing tries to interpret such characters as Latin glyphs, producing nonsense words in place of the originals. These must be corrected manually, which can take a lot of time, so we may release a file with the nonsense words still in it and then release corrections later.
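
One way to locate such nonsense words during proofing, sketched below, is to flag every token that matches no entry in a word list for the surrounding language, so a human can compare it with the page image. The word-list path and the helper are illustrative assumptions, not a description of our actual tooling:

    # A hypothetical filter for locating probable OCR nonsense
    # (e.g., Greek misread as Latin letters). It only flags tokens
    # for human review; it never rewrites them automatically.
    import re

    def load_wordlist(path: str) -> set:
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f}

    def flag_suspect_tokens(text: str, known_words: set) -> list:
        tokens = re.findall(r"[A-Za-z]{3,}", text)
        return [t for t in tokens if t.lower() not in known_words]

    words = load_wordlist("english-words.txt")   # illustrative path
    for token in flag_suspect_tokens("the word Xoyos appeared here", words):
        print("check against page image:", token)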

Instrumentation
The final step is to add markup tags that delimit foreign words (especially Greek, Hebrew, and transliterations of them), mark Bible references, handle paragraphing and footnotes, and so on. This is labor-intensive manual work, although where possible we sometimes write software to perform limited transformations on the text, such as marking Bible references.
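
As an example of the software-assisted part, a simplistic tagger for one common reference style (book chapter:verse) might look like the following sketch. The <ref> tag and the pattern are our illustration, not a fixed standard, and anything such a tool tags still gets proofed by hand:

    import re

    # Matches simple references such as "John 3:16" or "1 Cor. 13:4".
    # Real-world references (ranges, lists, unusual abbreviations)
    # need far more patterns than this illustrative one.
    BIBLE_REF = re.compile(r"\b((?:[1-3]\s)?[A-Z][a-z]+\.?\s\d{1,3}:\d{1,3})")

    def tag_references(text: str) -> str:
        return BIBLE_REF.sub(r"<ref>\1</ref>", text)

    print(tag_references("Compare John 3:16 with 1 Cor. 13:4."))
    # -> "Compare <ref>John 3:16</ref> with <ref>1 Cor. 13:4</ref>."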

Image processing
Images must be processed separately from the text. Depending on how the scanning was done, the images may be slightly warped. Photographic images in old works may also have deteriorated over time; there isn't much we can do about that. However, we can sometimes make minor corrections to scanning issues with drawn illustrations.

Another issue has to do with images on yellowed pages. We use software to de-yellow drawings (see the sketch at the end of this section), but this is problematic with colored images, so sometimes we cannot do anything about the coloration.

Each image is stored in a separate file, and the text is then instrumented to reference the image in the appropriate places. Pictures and drawings tend to be inserted between paragraphs, while words in ancient glyphs tend to be in-line, as they are in the original. We try to convert ancient glyphs to Unicode text, but that is not always practical (or even possible). Where we can do so, it is usually for Greek, Hebrew, and Anglo-Saxon words.
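
For black-and-white line drawings, de-yellowing can be as simple as the thresholding sketch below, written with the Pillow imaging library. The threshold value is a per-image judgment call, and the method discards color entirely, which is one reason colored images are harder:

    # A sketch of de-yellowing a scanned line drawing: convert to
    # grayscale, then force light pixels (the yellowed paper) to white
    # and dark pixels (the ink) to black. The threshold of 160 is an
    # illustrative value that is tuned per image in practice.
    from PIL import Image

    def deyellow_drawing(src: str, dst: str, threshold: int = 160) -> None:
        gray = Image.open(src).convert("L")
        bw = gray.point(lambda p: 255 if p > threshold else 0)
        bw.save(dst)

    deyellow_drawing("plate-07-scan.png", "plate-07-clean.png")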

Sources of errors
Errors can inadvertently be made in any of the above steps. Scanning can miss pages. OCR often misreads the electronic copy (we call these "scan errors"). Sometimes we don't catch every error when we proofread. When others have done the scanning and OCR, they often correct scan errors imperfectly, or not at all. Sometimes they don't scan or OCR front matter such as prefaces and title pages, or back matter such as appendices; we must then find a printed copy or online scans and process the missing material ourselves. Even when we find a good text version, the text may have been post-processed with software that introduced errors. Most commonly, though, some textual transformation has been performed that makes the text version differ from the printed version: the articles of an encyclopedia may have been rearranged, for instance. So we check the text against a print copy to make sure nothing is missing or out of order.
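
When checking order, diffing the sequence of entry headings in our text against the order in the printed copy localizes problems quickly. A sketch, assuming both lists have already been prepared (one extracted from our text file, one keyed in from the print table of contents); the entries shown are placeholders:

    import difflib

    # Hypothetical check that encyclopedia articles appear in the
    # same order as in the printed copy.
    printed_order = ["Aaron", "Abaddon", "Abana", "Abarim"]
    text_order = ["Aaron", "Abana", "Abaddon", "Abarim"]

    for line in difflib.unified_diff(printed_order, text_order,
                                     fromfile="printed", tofile="text",
                                     lineterm=""):
        print(line)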

Some instrumentation is done via software, which can itself introduce errors. Instrumentation done manually is subject to human error, ranging from inadvertent deletions to incorrect tagging. We proof all such changes to reduce the likelihood that errors find their way into the texts we release, but one should expect that somewhere along the line, something was missed.

The final source of errors is when someone intentionally modifies the text that they put online. Fortunately, this is rare, but we have seen it happen. Therefore, we try to get text or scans only from reputable sources, and we compare against printed originals whenever we can. We also include the provenance of our materials for transparency's sake. It should be noted, however, that we sometimes have to use multiple sources (text from one place, prefaces from another, for instance), augment them with our own scanning and OCR, and/or adjust things manually. As a result, what we provide is sometimes a reconstruction of the original from several sources, and in extreme cases providing provenance is of no practical use. We can only promise that we have done all we can to provide files that represent the original work as faithfully as possible, and we continue to review material we've already released to correct any differences that come to light.