----------------------------------------------------------------
  Block-Based Ground Truth Dataset
    for ICDAR2003 SceneTrialTrain Dataset

  ICDAR2003-SceneTrialTrain-GT4 (Rev.20090417)
----------------------------------------------------------------

1. About This Dataset

This dataset contains 4-class Ground Truth data of the natural
scene images with text from the ICDAR 2003 Robust Reading
Competition. The original image data can be found at:

  http://algoval.essex.ac.uk/icdar/Datasets.html

This dataset is intended to be used for evaluations of block-based
text detection algorithms. Please refer to the following paper for
details.

  [HG2008IJDAR]
  Hideaki Goto, "Redefining the DCT-based feature for scene text
  detection -- Analysis and comparison of spatial frequency-based
  features," IJDAR, Vol.11, No.1, pp.1-8 (2008).

The original package of this dataset can be found on our website:

  http://www.imglab.org/db/

----------------------------------------------------------------
Copyright (C) 2007-2009 Hideaki Goto
All Rights Reserved.

You may use, copy, modify, merge, and distribute this dataset
without restriction and free of charge, subject to the following
conditions:

 * The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the dataset.

 * Modification(s) made to the dataset and the reason(s) of the
   modification(s) must be clearly explained in a document, and
   the document shall be included in all copies or substantial
   portions of the dataset.

 * Use and distribution shall meet the conditions of the dataset
   of ICDAR 2003 Robust Reading Competition.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.
THE AUTHOR(S) OR COPYRIGHT HOLDER(S) WILL NOT BE RESPONSIBLE FOR
ANY DAMAGE CAUSED BY THIS DATASET.
----------------------------------------------------------------

2. Description

The dataset is based on the Ground Truth data developed and used
in Assoc. Prof. Goto's research group at Tohoku University.
The basic block size is 16x16 (pixels). Each block was manually
classified into one of the following four classes.

  value   class
      0   text block
    127   intermediate block
    200   large text block
    255   non-text block

The criteria used are as follows.

 a) A block containing no character stroke at all should be
    classified as "non-text block."

 b) We ignore very small characters shorter than 6 pixels, since
    such characters are basically too small for character
    recognition.

 c) We classify characters taller than twice the block size
    (8x8 or 16x16) as "large characters." Blocks containing only
    such large characters should be classified as "large text
    blocks," since they are unlikely to contain a sufficient
    number of strokes.

 d) A block containing only a portion (a few pixels) of a
    character stroke should be classified as "intermediate
    block," since it is often very hard even for humans to tell
    whether it is a text block or not.

 e) All other blocks should be classified as "text blocks."

3. Files & Directories

  README                 : This file.
  FILES-ALL              : List of all image files with the
                           extension .JPG.
  FILES-ALL-BASE         : List of the base names of image/GT
                           files.
  FILES-HG2008IJDAR-BASE : List of the base names of image/GT
                           files used in [HG2008IJDAR].
  FILES-Large-BASE       : List of the large images scaled down
                           to fit into 720x480 pixels (or 480x720
                           pixels for portrait).
  jpeg-orig              : Subdirectory containing the symbolic
                           links to the original image files in
                           ICDAR2003 SceneTrialTrain Dataset.
                           The symbolic links work if the
                           database is expanded in the same
                           directory.
  image-pgm              : Subdirectory containing grey-converted
                           images.
  image-pgm-8            : Subdirectory containing half-resolution
                           images.
  GT4-16                 : Subdirectory containing the Ground
                           Truth data at 16x16 pixels.
  GT4-8                  : Subdirectory containing the Ground
                           Truth data at 8x8 pixels, automatically
                           generated from the above 16x16-pixel
                           Ground Truth by scaling down.

4. Changes

  Rev.20090417  First release.

5. Credits

This Ground Truth dataset was created by the following people.

  Tohoku University, Japan
    Hideaki Goto
    Makoto Tanaka

Contact:
  Assoc. Prof. Hideaki Goto
  Cyberscience Center, Tohoku University,
  Sendai 980-8578, JAPAN
  E-mail: hgot_(at)_isc.tohoku.ac.jp (remove underscores)
  WWW   : http://www.sc.isc.tohoku.ac.jp/~hgot/
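
As an illustration of the class values listed in Section 2, the
following minimal Python sketch maps Ground Truth values to class
names and counts the blocks per class. It is not part of the
original package. The function names, the in-memory layout (a 2-D
list holding one value per block), and the dropping of partial
blocks at the image edges are all assumptions made for the
example; this README does not specify them.

```python
from collections import Counter

# Class values as listed in Section 2 of this README.
GT4_CLASSES = {
    0: "text block",
    127: "intermediate block",
    200: "large text block",
    255: "non-text block",
}


def block_grid_size(width, height, block=16):
    """Return (columns, rows) of whole blocks in an image.

    Assumption: partial blocks at the right/bottom edges are dropped.
    The README does not say how the dataset handles them.
    """
    return width // block, height // block


def count_classes(gt_rows):
    """Count blocks per class in a 2-D list of Ground Truth values.

    Assumption: gt_rows holds one value per block. Any value outside
    the four documented ones raises ValueError.
    """
    counts = Counter()
    for row in gt_rows:
        for value in row:
            if value not in GT4_CLASSES:
                raise ValueError(f"unexpected GT value: {value}")
            counts[GT4_CLASSES[value]] += 1
    return counts


if __name__ == "__main__":
    # A 720x480 image (the size named for FILES-Large-BASE) at 16x16 blocks.
    print(block_grid_size(720, 480))
    # A tiny hand-made grid, not real dataset content.
    print(count_classes([[0, 255], [127, 200], [255, 255]]))
```

Raising an error on unknown values, rather than silently mapping
them to a default class, makes it easier to notice if a Ground
Truth file is read with the wrong format or scaling.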