Summary & Purpose

The BSU Bangla Dataset is an offline handwriting dataset of Bangla, one of the major scripts in the world. Its fundamental objective is to foster research on offline Bangla handwritten text recognition. The easy availability and simple structure of the dataset are intended to help the research community develop and test such recognizers. The dataset is an anonymous and voluntary contribution of many people, and acquisition is still ongoing. A strong handwritten text recognizer will help to digitally store archived handwritten literature and documents, and will contribute to digital automation in many ways, such as digital character conversion, meaning translation, content-based image retrieval, keyword spotting, signboard translation, text-to-speech conversion, scene image analysis, and postal sorting.

Date of Publication or Submission

6-18-2018

Resource Type

Article

DOI

https://doi.org/10.18122/saipl/1/boisestate

Use Restrictions

Users are free to share, copy, distribute and use the dataset; to create or produce works from the dataset; and to adapt, modify, transform and build upon the dataset, as long as the user attributes any public use of the dataset, or works produced from the dataset, by referencing the author(s) and DOI link. For any use or redistribution of the dataset, or works produced from it, the user must make clear to others the license of the dataset and keep intact any notices on the original dataset. If users publicly use any adapted version of this dataset, or works produced from an adapted dataset, they must also offer that adapted dataset under this license. If users redistribute the dataset, or an adapted version of it, they may use technological measures that restrict the work (such as DRM) as long as they also redistribute a version without such measures. This dataset may not be used for commercial purposes. If interested in commercial licensing, contact (208) 426-5765.

Funding Citation

This research is supported by a Graduate Assistantship, awarded to Nishatul Majid, funded by the Graduate College at Boise State University through the Department of Electrical and Computer Engineering.

Single Dataset or Series?

Series

Data Format

txt, jpeg

Data Attributes

This is an open access dataset. Participants of different ages and professions provided two types of handwritten documents, namely type ‘a’ and type ‘b’. Type ‘a’ is an essay consisting of 104 words (364 characters). The essay was carefully prepared using mostly common and frequently used Bangla words. It contains all the Bangla basic characters (except for “nio”, which rarely appears in its basic form), all vowel diacritics and 32 high-frequency consonant conjuncts. Type ‘b’ is a page of isolated characters consisting of all 50 basic characters, 10 numbers, all 11 vowel diacritics (with the consonant “ma”) and 10 high-frequency conjuncts. The target content was provided in machine-printed form. Contributors copied both the essay and the isolated units onto blank paper. The pages were digitized using cellphone cameras. The images were cropped and basic skew correction was applied. No color alteration, resizing or filtering was applied to the images. Although participation was anonymous, each contributor's gender, age, profession and handedness (left or right) were recorded and tagged along with their writing samples. The type of pen, pencil, paper and photographing device was not specified. The essay images were tagged with ground truth information. Each character, word and line is tagged with a bounding box represented by xmin, ymin, height and width values. The same was done for the images of the isolated characters and numbers. The tagged metadata were saved in plain text (*.txt) format.
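As an illustration of how a bounding box given by xmin, ymin, height and width values might be used, the following is a minimal Python sketch. It assumes that the values are in pixels and that the origin is the top-left corner of the image; neither is stated in this record, and the layout of the annotation text files is not described here, so the sketch does not attempt to parse them. The names `BoundingBox` and `crop` are hypothetical and are not part of the dataset.

```python
from typing import List, NamedTuple, Sequence, Tuple


class BoundingBox(NamedTuple):
    """A box described by the four values named in the dataset description.

    Assumption: pixel units, origin at the top-left corner of the image.
    """
    xmin: int
    ymin: int
    height: int
    width: int

    def corners(self) -> Tuple[int, int, int, int]:
        """Return (left, top, right, bottom); right and bottom are exclusive."""
        return (self.xmin, self.ymin,
                self.xmin + self.width, self.ymin + self.height)


def crop(pixels: Sequence[Sequence[int]], box: BoundingBox) -> List[List[int]]:
    """Cut the region covered by `box` out of a row-major grid of pixel values."""
    left, top, right, bottom = box.corners()
    return [list(row[left:right]) for row in pixels[top:bottom]]
```

The (left, top, right, bottom) tuple returned by `corners` follows the corner convention that many image libraries use for cropping, which is why the conversion is kept separate from the cropping step.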

Privacy and Confidentiality Statement

Boise State is explicitly compliant with federal and state laws surrounding data privacy including the protection of personal financial information through the Gramm-Leach-Bliley Act, personal medical information through HIPAA, HITECH and other regulations. All human subject data (e.g., surveys) has been collected and managed only by personnel with adequate human subject protection certification.

Disclaimer of Warranty

BOISE STATE UNIVERSITY MAKES NO REPRESENTATIONS ABOUT THE SUITABILITY OF THE INFORMATION CONTAINED IN OR PROVIDED AS PART OF THE SYSTEM FOR ANY PURPOSE. ALL SUCH INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. BOISE STATE UNIVERSITY HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS WITH REGARD TO THIS INFORMATION, INCLUDING ALL WARRANTIES AND CONDITIONS OF MERCHANTABILITY, WHETHER EXPRESS, IMPLIED OR STATUTORY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT.

IN NO EVENT SHALL BOISE STATE UNIVERSITY BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF INFORMATION AVAILABLE FROM THE SYSTEM.

THE INFORMATION PROVIDED BY THE SYSTEM COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN. COMPANY AND/OR ITS RESPECTIVE SUPPLIERS MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED HEREIN AT ANY TIME, WITH OR WITHOUT NOTICE TO YOU.

BOISE STATE UNIVERSITY DOES NOT MAKE ANY ASSURANCES WITH REGARD TO THE ACCURACY OF THE RESULTS OR OUTPUT THAT DERIVES FROM USE OF THE SYSTEM.

Comments

New Release Note 09/05/2019: This release contains 150 pages of handwritten essays and 250 pages of handwritten isolated-character documents. Each page has been acquired both with a handheld cell-phone camera and with a desktop scanner. The same document appears under the same file name in both the camera and the scanned datasets.
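Because the camera and scanned versions of a document share a file name, the two acquisitions can be paired by name. The following is a minimal Python sketch of that idea. It assumes the camera and scanned images have been placed in two separate directories; the directory layout is not described in this record, and the function name `pair_by_name` is hypothetical.

```python
from pathlib import Path
from typing import Dict, Tuple


def pair_by_name(camera_dir: str, scanned_dir: str) -> Dict[str, Tuple[Path, Path]]:
    """Map each shared file name to its (camera, scanned) pair of paths.

    Files that exist in only one of the two directories are left out.
    """
    camera = {p.name: p for p in Path(camera_dir).iterdir() if p.is_file()}
    scanned = {p.name: p for p in Path(scanned_dir).iterdir() if p.is_file()}
    shared = sorted(camera.keys() & scanned.keys())
    return {name: (camera[name], scanned[name]) for name in shared}
```

A call such as `pair_by_name("camera", "scanned")`, with both directory names being placeholders, would return one entry per document that is present in both sets.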
