The noisy Bangla handwritten digit dataset*

Saikat Basu, Manohar Karki, Robert DiBiano, Supratik Mukhopadhyay and Rajgopal Kannan, Louisiana State University
Sangram Ganguly, Bay Area Environmental Research Institute/NASA Ames Research Center
Shreekant Gayaka, Applied Materials
Ramakrishna R. Nemani, NASA Advanced Supercomputing Division, NASA Ames Research Center

The noisy Bangla dataset is created from the offline Bangla dataset of handwritten digits by adding:
(1) additive white Gaussian noise,
(2) motion blur, and
(3) a combination of additive white Gaussian noise and reduced contrast.

The datasets are available here:

Sample images from the noisy Bangla dataset:

[Sample images: noisy Bangla with Additive White Gaussian Noise (AWGN); noisy Bangla with motion blur; noisy Bangla with reduced contrast and AWGN]

Dataset description:

The datasets are encoded as MATLAB .mat files that can be read using the standard load command in MATLAB. Each of the three datasets contains a total of 193890 training samples and 3999 test samples. Each sample is a 28x28 image linearized into a vector of size 1x784, so the training and test sets are 2-D matrices of size 193890x784 and 3999x784 respectively. The training and test labels are one-hot 1x10 vectors: a single 1 indexes the particular Bangla numeral (0 to 9) and all other entries are 0.

The MAT files contain the following variables:

train_x 193890x784 double (193890 training samples, each a 28x28 image linearized into a 1x784 vector)
train_y 193890x10 double (1x10 one-hot label vectors for the 193890 training samples)
test_x 3999x784 double (3999 test samples, each a 28x28 image linearized into a 1x784 vector)
test_y 3999x10 double (1x10 one-hot label vectors for the 3999 test samples)
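Outside MATLAB, the same variables can be read with scipy. Below is a minimal Python sketch; it first writes a tiny synthetic .mat file with the same variable names so the example is self-contained (the real files hold 193890 training and 3999 test samples).

```python
import io
import numpy as np
from scipy.io import loadmat, savemat

# Synthetic stand-in for one of the dataset files (same variable names).
buf = io.BytesIO()
savemat(buf, {
    "train_x": np.random.default_rng(0).random((5, 784)),  # 5 fake images
    "train_y": np.eye(10)[[3, 1, 4, 1, 5]],                # one-hot labels
})
buf.seek(0)

data = loadmat(buf)
img = data["train_x"][0].reshape(28, 28)   # 1x784 vector back to 28x28 image
label = int(np.argmax(data["train_y"][0])) # one-hot vector -> digit index
print(img.shape, label)                    # (28, 28) 3
```

To load a real file, pass its path to loadmat instead of the in-memory buffer.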

Pre-processing for the Bangla Dataset

The numerals in the Bangla dataset are first thresholded using a local adaptive mean filter. The thresholded images are then complemented, and the largest connected component is extracted. The center of mass of this component and the corresponding bounding box of the numeral are then used to center the image. The centered image is padded with 10 pixels on all sides, and finally the images are resized to 28x28 pixels.
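The pipeline above can be sketched roughly in Python with scipy.ndimage. This is an illustrative approximation, not the authors' implementation: the 15-pixel mean window, the recentering via bounding-box crop, and the synthetic demo input are all assumptions.

```python
import numpy as np
from scipy import ndimage

def preprocess(gray):
    # Local adaptive mean threshold, complemented so dark ink is foreground
    # (the -1 tolerance avoids ties on perfectly flat background).
    local_mean = ndimage.uniform_filter(gray.astype(float), size=15)
    ink = gray < local_mean - 1
    # Keep only the largest connected component.
    labels, n = ndimage.label(ink)
    if n > 1:
        sizes = ndimage.sum(ink, labels, index=range(1, n + 1))
        ink = labels == (1 + int(np.argmax(sizes)))
    # Crop to the component's bounding box (this recenters the digit).
    ys, xs = np.nonzero(ink)
    digit = ink[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(float)
    # Pad 10 pixels on all sides, then resize to 28x28.
    digit = np.pad(digit, 10)
    zoom = (28 / digit.shape[0], 28 / digit.shape[1])
    return ndimage.zoom(digit, zoom, order=1)[:28, :28]

# Synthetic demo: a dark rectangular "stroke" on a bright page.
page = np.full((64, 64), 200, dtype=np.uint8)
page[20:40, 25:35] = 30
out = preprocess(page)
print(out.shape)   # (28, 28)
```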

Data Augmentation

Following the procedure defined in Bhattacharya et al., we create a synthetic dataset by applying rotation and blurring to the original Bangla dataset. For the rotation transformation, each sample is rotated by two random angles, one drawn from the range 5° to 10° and another from the range -10° to -5°. All the original and rotated training samples are then blurred with a Gaussian blurring kernel with mean μ = 0.75 and standard deviation σ = 0.33. The original images, together with the rotated and blurred images, form the final training dataset; these images are then corrupted with noise to form the noisy Bangla datasets.
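A minimal sketch of this augmentation step, assuming scipy: each image yields one rotation from [5°, 10°] and one from [-10°, -5°], and all three versions are blurred. The paper's blurring kernel is approximated here by gaussian_filter with σ = 0.33.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(42)

def augment(img):
    # Two random rotation angles, one positive and one negative.
    a_pos = rng.uniform(5, 10)
    a_neg = rng.uniform(-10, -5)
    rotated = [img,
               ndimage.rotate(img, a_pos, reshape=False, order=1),
               ndimage.rotate(img, a_neg, reshape=False, order=1)]
    # Gaussian blur applied to the original and both rotated copies.
    return [ndimage.gaussian_filter(r, sigma=0.33) for r in rotated]

digit = np.zeros((28, 28)); digit[4:24, 12:16] = 1.0  # stand-in stroke
samples = augment(digit)
print(len(samples), samples[0].shape)   # 3 (28, 28)
```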

Additive White Gaussian Noise (AWGN)

The AWGN dataset is created by adding white Gaussian noise with a signal-to-noise ratio of 9.5 to each image. This emulates significant background clutter.
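A sketch of adding white Gaussian noise at a fixed signal-to-noise ratio, assuming the SNR is specified in dB (as in MATLAB's awgn function):

```python
import numpy as np

def add_awgn(img, snr_db=9.5, seed=0):
    rng = np.random.default_rng(seed)
    # Noise power chosen so that 10*log10(P_signal / P_noise) = snr_db.
    p_signal = np.mean(img.astype(float) ** 2)
    sigma = np.sqrt(p_signal / 10 ** (snr_db / 10))
    return img + rng.normal(0.0, sigma, img.shape)

clean = np.zeros((28, 28)); clean[8:20, 10:18] = 1.0  # stand-in digit
noisy = add_awgn(clean)
print(noisy.shape)   # (28, 28)
```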

Motion Blur

The motion blur filter emulates linear motion of a camera by τ pixels at an angle of θ degrees; for purely horizontal or vertical motion the filter reduces to a vector. We use a τ value of 5 pixels and a θ value of 15 degrees in the counterclockwise direction.
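Such a filter (cf. MATLAB's fspecial('motion')) distributes unit weight along a line segment of length τ at angle θ and convolves it with the image. The nearest-pixel rasterization below is a rough approximation for illustration.

```python
import numpy as np
from scipy import ndimage

def motion_blur_kernel(tau=5, theta_deg=15):
    # Rasterize a centered line of length tau at angle theta into a kernel.
    k = np.zeros((tau, tau))
    c = (tau - 1) / 2
    t = np.linspace(-c, c, 8 * tau)        # dense samples along the line
    th = np.deg2rad(theta_deg)
    ys = np.clip(np.round(c - t * np.sin(th)).astype(int), 0, tau - 1)
    xs = np.clip(np.round(c + t * np.cos(th)).astype(int), 0, tau - 1)
    k[ys, xs] = 1.0
    return k / k.sum()                     # normalize to unit weight

impulse = np.zeros((28, 28)); impulse[14, 14] = 1.0
blurred = ndimage.convolve(impulse, motion_blur_kernel())
print(blurred.shape)   # (28, 28)
```

Convolving an impulse with the kernel, as above, visualizes the blur streak directly.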

Reduced Contrast and AWGN

The contrast range is scaled down to half, and white Gaussian noise with a signal-to-noise ratio of 12 is then added. This emulates background clutter combined with a significant change in lighting conditions.
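A sketch combining the two corruptions: halve the contrast range (here by scaling deviations from the mean intensity, one plausible reading) and then add white Gaussian noise at 12 dB SNR. Both choices are illustrative assumptions.

```python
import numpy as np

def reduce_contrast_awgn(img, snr_db=12.0, seed=0):
    # Halve the contrast by scaling deviations from the mean intensity.
    low = img.mean() + 0.5 * (img - img.mean())
    # Then add white Gaussian noise at the given SNR (assumed in dB).
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(np.mean(low ** 2) / 10 ** (snr_db / 10))
    return low + rng.normal(0.0, sigma, img.shape)

clean = np.zeros((28, 28)); clean[8:20, 10:18] = 1.0  # stand-in digit
out = reduce_contrast_awgn(clean)
print(out.shape)   # (28, 28)
```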

* To use this dataset, please cite the following paper:

Basu, S., Karki, M., Ganguly, S., DiBiano, R., Mukhopadhyay, S., Gayaka, S., ... & Nemani, R. (2017). Learning sparse feature representations using probabilistic quadtrees and deep belief nets. Neural Processing Letters, 45(3), 855-867.