The database contains 285 manuscript pages, 6000 word images, and 8000 segmented character images.

Manuscript page images are saved in their original colors in PNG format, whereas word images are saved in three different versions (gray-scale, binary, and thinned).

Most of the manuscript collections deal with Islamic jurisprudence, while the handwritten words cover most Arabic parts of speech as well as some city names and security terms.


Database statistics:


Frequency analysis shows that the letter distribution in IESK-arDB follows almost the same frequency pattern as the letter distribution of the large digital corpora used in Intellyze, which contain about 1,297,259 words or 5,122,132 letters. A normalized chi-square test shows that the letter frequencies in both sources follow nearly the same distribution, with a goodness-of-fit value of X = 0.98.
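As an illustration of this kind of comparison, the following minimal Python sketch computes relative letter frequencies for two word lists and a Pearson chi-square statistic between them. The function names, the toy word lists, and the five-letter alphabet are assumptions made for the example; the exact normalization behind the reported value is not specified here, so the sketch does not attempt to reproduce X = 0.98.

```python
from collections import Counter


def letter_frequencies(words, alphabet):
    """Return the relative frequency of each letter of `alphabet` in `words`."""
    counts = Counter(ch for word in words for ch in word if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no letters from the alphabet were found")
    return {ch: counts[ch] / total for ch in alphabet}


def chi_square_distance(observed, expected):
    """Pearson chi-square statistic computed on relative frequencies.

    Letters with zero expected frequency are skipped to avoid division by zero.
    """
    return sum(
        (observed[ch] - expected[ch]) ** 2 / expected[ch]
        for ch in expected
        if expected[ch] > 0
    )


# Hypothetical toy data; a real comparison would use the database labels
# and a large reference corpus instead.
alphabet = "كتابي"
sample_words = ["كتاب", "باب"]
reference_words = ["كتاب", "باب", "بيت"]

sample_freq = letter_frequencies(sample_words, alphabet)
reference_freq = letter_frequencies(reference_words, alphabet)
print(chi_square_distance(sample_freq, reference_freq))
```

A statistic close to zero indicates that the two letter distributions are similar; how this maps onto a goodness-of-fit value depends on the normalization chosen.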

Letter frequencies in IESK-arDB compared with letter frequencies in large digital corpora.

The frequency distribution of Arabic letters in IESK-arDB, sorted in alphabetical order.

IESK-arDB: A database for off-line Arabic handwriting


