KANNADA HANDWRITTEN WORD DATASET FOR OCR VIA SYLLABLE COMPOSITION AND CORPUS AUGMENTATION

Dadapeer .; Yeresime Suresh

doi:10.52152/801318

Authors

Dadapeer .
Yeresime Suresh

DOI:

https://doi.org/10.52152/801318

Keywords:

Kannada handwritten dataset; Char74K; IAM dataset; Optical Character Recognition (OCR); Handwritten Text Recognition (HTR); syllables

Abstract

Handwritten Text Recognition (HTR) systems require large labeled datasets to achieve high accuracy, especially for word-level recognition. While English benefits from rich resources such as the IAM dataset, low-resource Indian languages such as Kannada lack publicly available handwritten word-level datasets. In this paper, we present a novel approach for synthesizing a Kannada handwritten word image dataset by combining character-level handwritten images from the Char74K dataset with a real-world Kannada text corpus sourced from Kaggle. Our pipeline segments each Kannada word into syllables, maps each syllable to a corresponding handwritten character image, and stitches the images together to generate realistic word-level samples. This method narrows the gap in training data availability for Kannada and allows word-level OCR systems to be trained in the same way as those for English. The resulting dataset contains more than 500 synthetically generated word images with accurate Unicode labels, and the approach scales to thousands of words. This work contributes a reproducible methodology and a resource for the OCR research community.
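The three pipeline stages named above (syllable segmentation, syllable-to-image mapping, and stitching) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_syllables`, `stitch`, and `compose_word`, the `glyph_bank` dictionary keyed by syllable string, and the representation of an image as rows of grayscale pixel values are all assumptions made for illustration. The segmentation rule is also simplified: a new syllable starts at each independent vowel or consonant unless the preceding character is a virama, in which case the consonant joins the current conjunct.

```python
from typing import Dict, List, Optional, Tuple

# Illustrative sketch only: names and data layout are assumptions,
# not the method described by the paper's authors.

VIRAMA = "\u0CCD"            # Kannada sign virama, joins consonants into conjuncts
Glyph = List[List[int]]      # rows of grayscale pixels, 255 = white background


def is_base(ch: str) -> bool:
    """True for Kannada independent vowels and consonants (simplified ranges)."""
    return "\u0C85" <= ch <= "\u0CB9" or ch == "\u0CDE"


def split_syllables(word: str) -> List[str]:
    """Split a Kannada word into syllable-like units (simplified rule)."""
    syllables: List[str] = []
    for ch in word:
        after_virama = bool(syllables) and syllables[-1].endswith(VIRAMA)
        if not syllables or (is_base(ch) and not after_virama):
            syllables.append(ch)   # a base character opens a new syllable
        else:
            syllables[-1] += ch    # vowel signs, virama, conjunct consonants attach
    return syllables


def stitch(glyphs: List[Glyph], gap: int = 2, background: int = 255) -> Glyph:
    """Place glyphs side by side, vertically centred on a common height."""
    height = max(len(g) for g in glyphs)
    rows: Glyph = [[] for _ in range(height)]
    for i, glyph in enumerate(glyphs):
        width = len(glyph[0])
        pad_top = (height - len(glyph)) // 2
        for y in range(height):
            if i > 0:
                rows[y].extend([background] * gap)     # spacing between glyphs
            src = y - pad_top
            if 0 <= src < len(glyph):
                rows[y].extend(glyph[src])
            else:
                rows[y].extend([background] * width)   # pad shorter glyphs
    return rows


def compose_word(word: str, glyph_bank: Dict[str, Glyph]) -> Optional[Tuple[Glyph, str]]:
    """Return (word image, Unicode label), or None if any syllable has no glyph."""
    syllables = split_syllables(word)
    if any(s not in glyph_bank for s in syllables):
        return None
    return stitch([glyph_bank[s] for s in syllables]), word
```

Under these assumptions, each word drawn from the corpus would be passed to `compose_word` together with a bank of handwritten samples; words containing a syllable with no available image are skipped rather than approximated, which keeps every generated image consistent with its Unicode label.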
