KANNADA HANDWRITTEN WORD DATASET FOR OCR VIA SYLLABLE COMPOSITION AND CORPUS AUGMENTATION

Authors

  • Dadapeer
  • Yeresime Suresh

DOI:

https://doi.org/10.52152/801318

Keywords:

Kannada handwritten dataset; Char74K; IAM dataset; Optical Character Recognition (OCR); Handwritten Text Recognition (HTR); syllables

Abstract

Handwritten Text Recognition (HTR) systems require large labeled datasets to achieve high accuracy, especially for word-level recognition. While English has rich resources such as the IAM dataset, low-resource Indian languages like Kannada lack publicly available handwritten word-level datasets. In this paper, we present a novel approach for synthesizing a Kannada handwritten word image dataset by combining character-level handwritten images from the Char74K dataset with a real-world Kannada text corpus sourced from Kaggle. Our pipeline segments Kannada words into syllables, maps each syllable to a corresponding character image, and stitches the images together to generate realistic word-level samples. This method narrows the gap in training data availability for Kannada and enables consistent training of word-level OCR systems similar to those available for English. The resulting dataset contains more than 500 synthetically generated word images with accurate Unicode labels, and the approach scales to thousands of words. This work contributes a reproducible methodology and a valuable resource for the OCR research community.
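
The segment-and-stitch idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the segmentation rule (attach combining marks to the preceding character, and attach a consonant that follows a virama) is a common simplification for Kannada orthographic syllables. The function names `split_syllables` and `stitch` are hypothetical. How syllables are looked up in Char74K is not described here, so glyph images are passed in as plain grids of grayscale pixel values.

```python
import unicodedata

VIRAMA = "\u0ccd"  # Kannada sign virama; it joins consonants into one cluster


def split_syllables(word: str) -> list[str]:
    """Split a single Kannada word into orthographic syllables (assumed rule)."""
    syllables: list[str] = []
    for ch in word:
        # Vowel signs, virama, anusvara and visarga all have a Unicode
        # general category starting with "M" (combining marks).
        is_mark = unicodedata.category(ch).startswith("M")
        follows_virama = bool(syllables) and syllables[-1].endswith(VIRAMA)
        if syllables and (is_mark or follows_virama):
            syllables[-1] += ch  # extend the current syllable
        else:
            syllables.append(ch)  # start a new syllable
    return syllables


def stitch(glyphs: list[list[list[int]]], background: int = 255) -> list[list[int]]:
    """Place glyph images side by side, padding shorter glyphs at the bottom.

    Each glyph is a non-empty list of equal-length pixel rows.
    """
    height = max(len(g) for g in glyphs)
    rows: list[list[int]] = [[] for _ in range(height)]
    for g in glyphs:
        width = len(g[0])
        for y in range(height):
            rows[y].extend(g[y] if y < len(g) else [background] * width)
    return rows


if __name__ == "__main__":
    # "ಕನ್ನಡ" (Kannada) splits into three syllables: ಕ, ನ್ನ, ಡ
    print(split_syllables("\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"))
```

In a real pipeline each syllable returned by `split_syllables` would be mapped to a handwritten sample image before stitching, and the original word string would be kept as the Unicode label for the stitched image.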

Published

2025-08-12

Section

Article

How to Cite

KANNADA HANDWRITTEN WORD DATASET FOR OCR VIA SYLLABLE COMPOSITION AND CORPUS AUGMENTATION. (2025). Lex Localis - Journal of Local Self-Government, 23(S5), 847-856. https://doi.org/10.52152/801318