Boise State Bangla Handwriting Dataset

Summary & Purpose

The BSU Bangla Dataset is an offline handwriting dataset of Bangla, one of the major scripts in the world. The fundamental objective of this dataset is to foster research on offline Bangla handwritten text recognition. Its easy availability and simple structure are intended to help the research community develop and test such recognizers. The dataset is an anonymous, voluntary contribution from many people, and acquisition is still ongoing. A strong handwritten text recognizer will help to digitally store archived handwritten literature and documents, and will contribute to digital automation in many ways, such as digital character conversion, translation, content-based image retrieval, keyword spotting, signboard translation, text-to-speech conversion, scene image analysis and postal sorting.

Author Identifier

Nishatul Majid: https://orcid.org/0000-0001-5445-5252
Elisa Barney-Smith: https://orcid.org/0000-0003-2039-3844

Date of Publication or Submission

6-18-2018

Resource Type

Article

DOI

https://doi.org/10.18122/saipl/1/boisestate

Use Restrictions

Users are free to share, copy, distribute and use the dataset; to create or produce works from the dataset; to adapt, modify, transform and build upon the dataset as long as the user attributes any public use of the dataset, or works produced from the dataset, referencing the author(s) and DOI link. For any use or redistribution of the dataset, or works produced from it, the user must make clear to others the license of the dataset and keep intact any notices on the original dataset. If users publicly use any adapted version of this dataset, or works produced from an adapted dataset, they must also offer that adapted dataset under this license. If users redistribute the dataset, or an adapted version of it, then users may use technological measures that restrict the work (such as DRM) as long as users also redistribute a version without such measures. This dataset may not be used for commercial purposes. If interested in commercial licensing, contact (208) 426-5765.

Funding Citation

This research is supported by a Graduate Assistantship, awarded to Nishatul Majid, funded by the Graduate College at Boise State University through the Department of Electrical and Computer Engineering.

Single Dataset or Series?

Series

Data Format

txt, jpeg, tif

File Size

12.2 GB

Data Attributes

This is an open access dataset. For the Camera and Scan folders, participants of different ages and professions provided two types of handwritten documents, type ‘a’ and type ‘b’. The only difference between the Camera and Scan folders is how the content was captured.

Type ‘a’ is an essay consisting of 104 words/364 characters. The essay was carefully prepared using mostly common, frequently used Bangla words. It contains all the Bangla basic characters (except for “nio”, which rarely appears in its basic form), all vowel diacritics and 32 high-frequency consonant conjuncts. Type ‘b’ is a page of isolated characters consisting of all 50 basic characters, 10 numerals, all 11 vowel diacritics (with the consonant “ma”) and 10 high-frequency conjuncts. The Conjunct folder holds a collection of words that contain the most used conjuncts in the Bangla language.

The target content was provided in machine-printed form. Contributors copied both the essay and the isolated units onto blank paper. These were digitized using cellphone cameras. The images were cropped and basic skew correction was applied. No color alteration, resizing or filtering was applied to the images.

Although participation was anonymous, each contributor's gender, age, profession and handedness (left/right) were recorded and tagged with their writing samples. The type of pen, pencil, paper and photographing device was not specified.

The essay images were tagged with ground truth information. All characters, words and lines are tagged with a bounding box represented by xmin, ymin, height and width values. The same was done for the images of the isolated characters and numerals. The tagged metadata were saved in plain text (*.txt) format.
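As a minimal sketch of how the bounding-box values described above might be used, the Python snippet below converts one (xmin, ymin, height, width) record into corner coordinates suitable for cropping. The names `Box` and `to_corners` are hypothetical, and the snippet assumes pixel units with the origin at the top-left corner of the image. The layout of the ground truth text files is not described here, so parsing them is deliberately left out.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box, with fields in the order the dataset description lists them."""
    xmin: int
    ymin: int
    height: int  # note: height is listed before width
    width: int


def to_corners(box: Box) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) corner coordinates for the box.

    Assumes a top-left origin, so `right` and `bottom` are obtained by
    adding the width and height to xmin and ymin respectively.
    """
    left = box.xmin
    top = box.ymin
    right = box.xmin + box.width
    bottom = box.ymin + box.height
    return (left, top, right, bottom)
```

For example, `to_corners(Box(xmin=10, ymin=20, height=30, width=100))` gives `(10, 20, 110, 50)`. Keeping the fields named guards against swapping height and width, which is easy to do because many tools expect width first.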

Time Period

2018

Update Frequency

Other

Privacy and Confidentiality Statement

Boise State is explicitly compliant with federal and state laws surrounding data privacy including the protection of personal financial information through the Gramm-Leach-Bliley Act, personal medical information through HIPAA, HITECH and other regulations. All human subject data (e.g., surveys) has been collected and managed only by personnel with adequate human subject protection certification.

Disclaimer of Warranty

BOISE STATE UNIVERSITY MAKES NO REPRESENTATIONS ABOUT THE SUITABILITY OF THE INFORMATION CONTAINED IN OR PROVIDED AS PART OF THE SYSTEM FOR ANY PURPOSE. ALL SUCH INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. BOISE STATE UNIVERSITY HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS WITH REGARD TO THIS INFORMATION, INCLUDING ALL WARRANTIES AND CONDITIONS OF MERCHANTABILITY, WHETHER EXPRESS, IMPLIED OR STATUTORY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT.

IN NO EVENT SHALL BOISE STATE UNIVERSITY BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF INFORMATION AVAILABLE FROM THE SYSTEM.

THE INFORMATION PROVIDED BY THE SYSTEM COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN. COMPANY AND/OR ITS RESPECTIVE SUPPLIERS MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED HEREIN AT ANY TIME, WITH OR WITHOUT NOTICE TO YOU.

BOISE STATE UNIVERSITY DOES NOT MAKE ANY ASSURANCES WITH REGARD TO THE ACCURACY OF THE RESULTS OR OUTPUT THAT DERIVES FROM USE OF THE SYSTEM.

Comments

Update Note 09/05/2019: This release contains 150 pages of handwritten essays and 250 pages of handwritten isolated-character documents. Each page has been acquired both with a handheld cell-phone camera and with a desktop scanner. The same document appears in both the camera and scanned datasets under the same name.

Update Note 02/25/2020: This release adds a folder named Conjunct, which has 70 pages containing words with the most used conjuncts in Bangla. These are all scanned at 300 dpi. We also added 100 more essay images to the previous dataset, in both cell-phone camera and scanner-acquired versions, with ground truth metadata.
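Because the same document appears under the same name in both the camera and scanned sets, the two versions of a page can be matched by file name. The Python sketch below shows one way this might be done. The function name `pair_by_name` is hypothetical, and the snippet assumes that the two folders are passed in by the caller, that matching images share a file stem even if their extensions differ (for example jpeg versus tif), and that images sit directly inside each folder. None of these details are confirmed by the dataset description.

```python
from pathlib import Path
from typing import Dict, Tuple

# Image extensions assumed from the listed data formats (jpeg, tif).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff"}


def _images_by_stem(folder: Path) -> Dict[str, Path]:
    """Index the image files directly inside `folder` by file stem."""
    return {
        p.stem: p
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    }


def pair_by_name(camera_dir: Path, scan_dir: Path) -> Dict[str, Tuple[Path, Path]]:
    """Map each stem found in both folders to its (camera, scan) image paths."""
    camera = _images_by_stem(camera_dir)
    scan = _images_by_stem(scan_dir)
    shared = sorted(camera.keys() & scan.keys())
    return {stem: (camera[stem], scan[stem]) for stem in shared}
```

Pages that exist in only one folder are skipped rather than reported as errors, and non-image files such as the ground truth text files are ignored by the extension filter.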

Due to the size of this dataset's component files, if you would prefer to access this data via Globus, please email ScholarWorks@boisestate.edu.
