Overview:
The IESK-arDB is an offline handwriting database. It contains 285 pages of 14th-century historical manuscripts, more than 6000 handwritten word images, and 8000 segmented character images. The word database vocabulary covers most Arabic parts of speech (nouns, verbs), country/city names, security terms, and words used for writing bank amounts.
Data Acquisition:
Manuscript page images were collected from multiple Islamic works that are thought to have been written in the 14th century. The main sources are the book of Al-FRO, written by IBN MUFLIH, and the book of FAWAID FIGHIYAH (the writer is unknown). The handwritten word samples were collected from 22 writers from various Arab countries and from other countries where the Arabic script is the writing medium. Writers were asked to follow the Naskh style as closely as they could, for two reasons. First, Naskh is the most commonly used writing style. Second, compared with other writing styles, Naskh emphasizes most of the letters' structural peculiarities.
Ground Truthing:
Manuscript page images are ground-truthed by creating a UTF-8 text file for each page image. Each line in the text file corresponds exactly to a line in the page image. For a better view, we advise setting the font to Segoe UI. Each word is fully described by a ground-truth XML file that contains segmentation information along with other important entries.
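Because each line of a page's text file corresponds to one line of the page image, the page-level ground truth can be read with a few lines of Python. The following is a minimal sketch, not part of the database distribution: the function names and the file name in the usage note are hypothetical, and the only assumption about the format is the one stated above (a UTF-8 text file, one transcription line per image line). The XML word files are not covered here because their schema is not described in this section.

```python
def load_page_ground_truth(text_path):
    """Return the transcription lines of one page, in page order.

    text_path is a hypothetical path to a page's UTF-8 ground-truth file.
    """
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte-order
    # mark if an editor added one. Text mode splits only on \n, \r, \r\n,
    # so the line count matches the file's own line breaks.
    with open(text_path, encoding="utf-8-sig", newline=None) as handle:
        return [line.rstrip("\n") for line in handle]


def index_lines(lines):
    """Map 1-based line numbers on the page image to their transcriptions."""
    return {number: line for number, line in enumerate(lines, start=1)}
```

For example, index_lines(load_page_ground_truth("page_001.txt"))[3] would give the transcription of the third text line on that page, where "page_001.txt" stands in for whatever file name the database actually uses.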