KANNADA HANDWRITTEN WORD DATASET FOR OCR VIA SYLLABLE COMPOSITION AND CORPUS AUGMENTATION
DOI:
https://doi.org/10.52152/801318
Keywords:
Kannada handwritten dataset; Char74K; IAM dataset; Optical Character Recognition (OCR); Handwritten Text Recognition (HTR); syllables
Abstract
Handwritten Text Recognition (HTR) systems require extensive labeled datasets to achieve high accuracy, especially for word-level recognition. While English has rich resources such as the IAM dataset, low-resource Indian languages like Kannada lack publicly available handwritten word-level datasets. In this paper, we present an approach for synthesizing a Kannada handwritten word image dataset by combining character-level handwritten images from the Char74K dataset with a real-world Kannada text corpus sourced from Kaggle. Our pipeline segments each Kannada word into syllables, maps each syllable to a corresponding handwritten character image, and stitches the images together to generate realistic word-level samples. This method narrows the gap in training data availability for Kannada and makes it possible to train word-level OCR systems in the same way as those available for English. The resulting dataset contains more than 500 synthetically generated word images with accurate Unicode labels, and the approach scales to thousands of words. This work contributes a reproducible methodology and a resource for the OCR research community.
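The segment, map, and stitch steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `glyph_bank` dictionary (syllable string to grayscale image), the bottom alignment, and the fixed pixel gap are all assumptions made for illustration. Images are plain nested lists so the example runs without any imaging library; a real pipeline would more likely load Char74K images with an image library. The segmentation rule is a simplification that assumes the input is a single Kannada word with no zero-width joiners.

```python
VIRAMA = "\u0ccd"  # Kannada sign virama (halant)


def is_dependent_sign(ch):
    """True for Kannada code points that attach to the preceding base character."""
    cp = ord(ch)
    return (
        # candrabindu, anusvara, visarga, nukta, and the two length marks
        cp in (0x0C81, 0x0C82, 0x0C83, 0x0CBC, 0x0CD5, 0x0CD6)
        # dependent vowel signs and the virama
        or 0x0CBE <= cp <= 0x0CCD
    )


def split_syllables(word):
    """Split a Kannada word into syllable-like clusters (simplified rule)."""
    syllables = []
    for ch in word:
        # A character joins the current cluster if it is a dependent sign,
        # or if the cluster ends in a virama (conjunct consonant follows).
        joins_previous = bool(syllables) and (
            is_dependent_sign(ch) or syllables[-1].endswith(VIRAMA)
        )
        if joins_previous:
            syllables[-1] += ch
        else:
            syllables.append(ch)
    return syllables


def stitch(glyphs, gap=2, background=255):
    """Place glyph images side by side, bottom-aligned, `gap` blank columns apart."""
    height = max(len(g) for g in glyphs)
    rows = [[] for _ in range(height)]
    for i, glyph in enumerate(glyphs):
        width = len(glyph[0])
        # Pad shorter glyphs with background rows on top (assumed alignment).
        pad_rows = [[background] * width for _ in range(height - len(glyph))]
        padded = pad_rows + [list(r) for r in glyph]
        for y in range(height):
            if i > 0:
                rows[y].extend([background] * gap)
            rows[y].extend(padded[y])
    return rows


def synthesize_word(word, glyph_bank):
    """Return (word_image, unicode_label) for one corpus word.

    `glyph_bank` is a hypothetical mapping from syllable string to image.
    """
    syllables = split_syllables(word)
    missing = [s for s in syllables if s not in glyph_bank]
    if missing:
        raise KeyError(f"no glyph image for syllables: {missing}")
    return stitch([glyph_bank[s] for s in syllables]), word
```

For example, the word ಕನ್ನಡ (code points U+0C95 U+0CA8 U+0CCD U+0CA8 U+0CA1) splits into three clusters under this rule, with the conjunct ನ್ನ kept together. Words containing a syllable that has no image in the bank raise an error here; skipping such words instead would be an equally reasonable choice when filtering a corpus.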
License
Copyright (c) 2025 Lex localis - Journal of Local Self-Government

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.