Author: Elif Derin Can, PhD Student, FSMV University History Department
Transkribus is a comprehensive artificial intelligence (AI)-assisted platform for handwritten text recognition (HTR), automatic transcription, and thematic tagging of historical documents. The project started in 2013 with European Union funding under the name tranScriptorium and became a cooperative in 2019. The European cooperative READ-COOP SCE is responsible for sustaining and updating the Transkribus platform. Today, the cooperative has 135 members, both individuals and institutions, from 35 different countries, and the platform has more than one hundred thousand users. As a result, an AI infrastructure that would be too costly for individual academics to build is developed sustainably with the support of an international community and its institutions.
Besides automatic transcription, the platform also supports digitizing documents, training AI models, collecting and processing data, and publishing studies. On Transkribus you can therefore automatically transcribe manuscript and printed texts, train transcription models for different kinds of documents, scan documents, tag texts by structure and content, and export the resulting data in different formats (such as TEI, TXT, PDF, and Word). Notably, the platform is web-based, so it can be accessed online at any time, which facilitates collaborative work.
In June 2023, Süphan Kırmızıaltın (NYU Abu Dhabi), Fatma Aladağ (Universität Leipzig), and Elif Derin Can (FSMV University) made the first automatic transcription model for printed Ottoman Turkish available as open access on the Transkribus platform. Detailed information about this HTR model and other Ottoman Turkish digitization efforts can be found on the Digital Ottoman Corpora website. In this article, we show the basics of using this HTR model for printed texts.
A Practical Walkthrough for Ottoman Turkish in Transkribus:
Start by registering on the platform. Membership is free of charge, and you receive 500 credits when you sign up. Credit usage differs for printed and manuscript documents: 1 credit transcribes 1 page with a manuscript model, whereas 1 credit transcribes 6 pages with a printed model. Layout analysis carried out before automatic transcription, and other work you do on the platform to create your own model, does not require any credits.
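The credit arithmetic above can be sketched in a few lines of Python. This is only an illustration: the rates are the ones quoted in this article and may change on the platform, the function names are hypothetical, and rounding up to whole credits is an assumption rather than documented billing behavior.

```python
import math

# Rates as quoted in this article; the platform may change them.
STARTING_CREDITS = 500
PRINTED_PAGES_PER_CREDIT = 6
MANUSCRIPT_PAGES_PER_CREDIT = 1


def credits_needed(pages: int, pages_per_credit: int) -> int:
    """Estimate credits for a page count.

    Rounding up to whole credits is an assumption made for this sketch.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return math.ceil(pages / pages_per_credit)


def pages_affordable(credits: int, pages_per_credit: int) -> int:
    """Number of pages a given credit balance covers at the given rate."""
    return credits * pages_per_credit


# A 100-page printed book versus a 100-page manuscript.
printed_cost = credits_needed(100, PRINTED_PAGES_PER_CREDIT)
manuscript_cost = credits_needed(100, MANUSCRIPT_PAGES_PER_CREDIT)
print(printed_cost, manuscript_cost)
print(pages_affordable(STARTING_CREDITS, PRINTED_PAGES_PER_CREDIT))
```

Under these assumptions, the starting balance covers about 3,000 printed pages but only 500 manuscript pages, which is worth keeping in mind when planning a larger project.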
Once you have registered by entering your information in the Register section, you can create your own collection by clicking Collections at the top right.
Transkribus automatically assigns an ID number to each collection you create. Even so, because your collections will multiply as your work progresses, giving each one a distinctive name when you create it will make your work more manageable.
Once you've named the collection, create it by clicking the Create button. You can then upload files of any size, either as images or as a document. Inside the collection, add your documents with the Upload Document or Upload Files buttons. When uploading, choose between Image (image files with a .jpg extension) and PDF, according to your file type. Then add the selected document or file from your computer to the collection by clicking the Submit button.
After uploading your documents to the collection, you have the option to perform automatic transcription in a single step. However, especially for documents with complex page layouts, such as multi-column newspapers, it is more efficient to begin with a Layout Analysis so that reading zones and lines are accurately determined before you proceed to automatic transcription. To run either analysis, click the "T" icon in the lower-left corner of the collection you want to work on.
The "T" icon takes you to the page where you can select the type of analysis. If you want to transcribe directly, click on Text Recognition. If you want it to first determine the page and line layout, click on Layout and select the appropriate model. Several different models are available in this section, depending on your document's text layout. For straightforward page layouts, choose the Universal Lines model. For more complex, multi-column, or mixed layout documents, opt for the Mixed Line Orientation model. After selecting the model, click the Start Recognition button in the upper-right corner of the page.
If you've performed a Layout Analysis first, check the reading order of lines and columns, that the detected lines are intact, and that each one covers the full line of text. You can then proceed to transcription by clicking the "T" icon and selecting Text Recognition once more. When you click the Text Recognition button, the Transkribus platform provides a list of publicly available HTR models that you can use. Select the OttomanTurkish_Print_1 model from this list and complete the process by clicking Start Recognition again.
You can monitor the progress of all your work on the platform from the Jobs tab in the upper-right corner of the homepage. Layout Analysis is completed quickly, while Text Recognition may take a bit longer depending on the host computer's performance and your internet connection. When the job status reads Finished, you can return to your document and review your automatic transcription.
After the automatic transcription process is finished, you can review the transcription and make necessary corrections, bearing in mind that the model has an accuracy of 92.8%. As shown in the image below, you can manually correct or add parts that may be faded in the original document.
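To see how close the automatic output is to your corrected text on your own pages, you can compute a character error rate (CER) yourself. The sketch below is a generic edit-distance calculation, not the platform's own evaluation tool, and it assumes that accuracy is measured at the character level, which this article does not specify. The function names and sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance divided by the length of the corrected (reference) text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Illustrative strings: automatic output versus manually corrected line.
automatic = "Osmanli matbuati"
corrected = "Osmanlı matbuatı"
cer = character_error_rate(automatic, corrected)
print(f"CER: {cer:.3f}, character accuracy: {1 - cer:.1%}")
```

If the figure you obtain on a sample of your own pages is well below the reported one, that is a sign your documents differ from the material the model was trained on.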
As mentioned in the introduction, there is currently only one HTR model for printed Ottoman Turkish available for general use. If your documents differ in content from the material this model was built on, the accuracy rate may be lower. In such cases, you can use the model as a starting point for your own documents: after making the necessary corrections to the transcription, you can retrain the model on your documents and their transcription features. For documents that differ significantly from what the printed model covers, such as manuscripts, you can create your own model from scratch. This is a somewhat more time-consuming process, because you enter the transcription manually after running the Layout Analysis on your documents, but it will streamline and speed up your work in the long run.
After completing your transcription, you can tag thematic data such as persons, events, places, and dates for content analysis within the text. You can export your transcriptions in various formats and analyze the data according to your research topic and methodology. You can also use these outputs as a dataset for NLP and other text-mining methods. Another notable feature of the platform is the Read & Search section, where you can digitally publish your documents. Here you can explore other digital edition projects prepared with Transkribus and publish your own digital editions.
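As a very small example of using an exported transcription as a text-mining dataset, the following sketch counts word frequencies in plain text. It assumes you have exported your document as TXT; the file name in the comment is hypothetical, and the tokenization (splitting on whitespace and trimming surrounding punctuation) is a deliberately simple choice, not a recommended NLP pipeline for Ottoman Turkish.

```python
import string
from collections import Counter

# ASCII punctuation plus a few Arabic-script and typographic marks.
PUNCTUATION = string.punctuation + "،؛؟«»“”‘’"


def word_frequencies(text: str) -> Counter:
    """Count words after splitting on whitespace and trimming punctuation.

    Case is left untouched, because naive lowercasing mishandles the
    Turkish dotted and dotless i.
    """
    tokens = (word.strip(PUNCTUATION) for word in text.split())
    return Counter(token for token in tokens if token)


# With a real export you might read the text like this (hypothetical file name):
#   from pathlib import Path
#   text = Path("my_transcription.txt").read_text(encoding="utf-8")
sample = "Dersaadet, Dersaadet ve Selanik."
for word, count in word_frequencies(sample).most_common(3):
    print(word, count)
```

The same counts can be restricted to tagged persons or places if you export a format that preserves your tags, such as TEI, and parse it accordingly.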