In a first step, the books were cut open and altogether 43,383 pages were scanned. The resulting images were loaded into the programme FEP (Functional Extension Parser), developed by our colleagues from DEA – Digitisation and Electronic Archiving, which performs automated structure recognition: it recognizes and marks paragraphs, headers, captions, footnotes, etc. Automated text recognition (OCR) was then done with ABBYY FineReader 11.
In the next step, our staff looked over all the pages to correct the structure recognition, i.e. they checked whether titles, headings, paragraphs and so on had been marked correctly.
We then exported the raw text together with the structure information (tags) as simple XML.
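To give an idea of what such an export could look like, here is a minimal sketch; the element names are hypothetical, as the actual tag set is defined by the FEP output:

```xml
<page n="1">
  <header>Kapitel I</header>
  <p>Das ist der erste Absatz.</p>
  <footnote n="1">Eine Anmerkung.</footnote>
</page>
```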
Now the main work of creating the corpus begins. Various corrections have to be made, e.g. of wrongly recognized text in the many volumes set in Gothic script (e.g. "Waffer" instead of "Wasser", where the long s was misread as an f).
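One way such systematic OCR confusions can be fixed is with a lookup table of known misreadings. This is only a sketch under the assumption of a hand-curated correction list; the project's actual correction workflow may differ:

```python
import re

# Assumed correction list for long-s/f confusions in Gothic script;
# in practice such entries would come from proofreading.
CORRECTIONS = {
    "Waffer": "Wasser",
    "Wiefe": "Wiese",
}

def correct_ocr(text: str) -> str:
    # Replace whole words only, so substrings inside other words are untouched.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b")
    return pattern.sub(lambda m: CORRECTIONS[m.group(1)], text)

print(correct_ocr("Das Waffer auf der Wiefe"))  # → Das Wasser auf der Wiese
```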
The texts can now be processed by the so-called pipeline. A program automatically recognizes sentence boundaries (SBD, Sentence Boundary Disambiguation) and segments the sentences into individual words (tokenization); at this point the text is also verticalized (one word per line).
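The two steps can be sketched as follows. This is a deliberately naive illustration with regular expressions, not the pipeline's actual SBD and tokenizer, which would handle abbreviations, ordinals and the like:

```python
import re

def sentences(text: str) -> list[str]:
    # Naive sentence boundary detection: split after ., ! or ? plus whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence: str) -> list[str]:
    # Separate punctuation marks from word tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

def verticalize(text: str) -> str:
    # One token per line; a blank line marks each sentence boundary.
    lines = []
    for sent in sentences(text):
        lines.extend(tokenize(sent))
        lines.append("")
    return "\n".join(lines).rstrip()

print(verticalize("Das ist ein Satz. Hier ist noch einer!"))
```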
Finally, the words are enriched with so-called lemmata (base forms). This is important for searching the text: the word forms "is", "was", "been", "are" and so on are assigned the lemma "be", so a search for the verb "be" finds all of its forms. Likewise, the plural word form "Gipfelwände" gets the base form "Gipfelwand" (singular), so that all forms, singular and plural, appear in a search.
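The effect of lemma-based search can be illustrated with a toy lookup table; real lemmatization is of course done by a tagger, not a hand-written dictionary:

```python
# Toy lemma table for illustration only.
LEMMATA = {
    "is": "be", "was": "be", "been": "be", "are": "be",
    "Gipfelwände": "Gipfelwand",
}

def lemma(word: str) -> str:
    # Unknown words fall back to their surface form.
    return LEMMATA.get(word, word)

def search_lemma(query: str, tokens: list[str]) -> list[str]:
    # Return every word form whose lemma matches the query.
    return [t for t in tokens if lemma(t) == query]

print(search_lemma("be", ["He", "is", "here", "and", "was", "there"]))
# → ['is', 'was']
```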
At the same time, word classes are assigned to the words (POS tags, part of speech) in order to be able to search for adjectives, nouns, verbs, etc. We used the free TreeTagger by Helmut Schmid with the STTS tagset (Stuttgart-Tübingen tagset), which is very common for German-language texts.
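TreeTagger emits vertical output with one token per line plus its tag and lemma. A small sketch of how such annotated rows can then be queried by STTS tag; the example rows are invented, and the exact column layout is an assumption:

```python
# Invented sample rows in (token, STTS tag, lemma) form.
tagged = [
    ("Die", "ART", "die"),
    ("steilen", "ADJA", "steil"),
    ("Gipfelwände", "NN", "Gipfelwand"),
    ("ragen", "VVFIN", "ragen"),
]

def find_by_tag(rows: list[tuple[str, str, str]], prefix: str) -> list[str]:
    # STTS adjective tags start with "ADJ" (ADJA attributive, ADJD predicative).
    return [tok for tok, tag, lem in rows if tag.startswith(prefix)]

print(find_by_tag(tagged, "ADJ"))  # → ['steilen']
print(find_by_tag(tagged, "NN"))  # → ['Gipfelwände']
```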