Characteristics:
■ The database contains 285 manuscript pages, 6000 word images, and 8000 segmented character images.
■ Manuscript page images are saved in their original colors in PNG format, whereas word images are saved in three different versions (gray-scale, binary, and thinned).
■ The theme of most manuscript collections is Islamic jurisprudence. The handwritten words cover most Arabic parts of speech, in addition to some city names and security terms.
Database statistics:
Frequency analysis indicates that the letter distribution in IESK-arDB follows almost the same frequency pattern as the letter distribution of the large digital corpus used in Intellyze, which contains about 1,297,259 words or 5,122,132 letters. A normalized chi-square test shows that the letter frequencies in both sources follow nearly the same distribution, with a goodness-of-fit value of X = 0.98.
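
The comparison described above can be illustrated with a minimal sketch. It is not the procedure used for IESK-arDB: the exact normalization behind the reported value is not stated, so the sketch assumes a plain Pearson chi-square statistic against the reference proportions, optionally divided by the sample size. The function names (letter_counts, chi_square_statistic, normalized_chi_square) and the toy counts are hypothetical.

```python
from collections import Counter


def letter_counts(words):
    """Count letters over an iterable of words, ignoring whitespace."""
    counts = Counter()
    for word in words:
        counts.update(ch for ch in word if not ch.isspace())
    return counts


def chi_square_statistic(observed, reference):
    """Pearson chi-square of observed letter counts against the
    proportions of a reference corpus.

    Letters that never occur in the reference are skipped, because
    their expected count would be zero.
    """
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    stat = 0.0
    for letter, ref_count in reference.items():
        if ref_count <= 0:
            continue
        # Expected count if the sample followed the reference proportions.
        expected = n_obs * ref_count / n_ref
        diff = observed.get(letter, 0) - expected
        stat += diff * diff / expected
    return stat


def normalized_chi_square(observed, reference):
    """Assumed normalization: chi-square divided by the sample size."""
    n_obs = sum(observed.values())
    return chi_square_statistic(observed, reference) / n_obs


# Toy counts standing in for the database and the reference corpus.
sample = Counter({"a": 3, "b": 1})
corpus = Counter({"a": 1, "b": 1})
toy_stat = chi_square_statistic(sample, corpus)  # 1.0
```

A statistic of zero means the sample reproduces the reference proportions exactly, and larger values indicate a larger mismatch. For real data, the counts would come from the database transcriptions and from the reference corpus rather than from the toy values shown here.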