http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/
Chars74k dataset
In this dataset, symbols used in both English and Kannada are available.
In the English language, Latin script (excluding accents) and Hindu-Arabic numerals are used. For simplicity we call this the "English" character set. Our dataset consists of:
- 62 classes (0-9, A-Z, a-z)
- 7705 characters obtained from natural images
- 3410 hand drawn characters using a tablet PC
- 62992 synthesised characters from computer fonts
This gives a total of over 74K images (which explains the name of the dataset).
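The classes map onto digits, uppercase letters, and lowercase letters in that order. Below is a minimal sketch of that mapping; the Sample001..Sample062 directory numbering is an assumption based on the common download layout, so verify it against your copy:

```python
# Chars74k English class layout: classes 1..62 cover 0-9, A-Z, a-z in order.
# The Sample001..Sample062 directory numbering is an assumption; check your download.
import string

CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 chars

def class_label(sample_index):
    """Map a 1-based SampleNNN directory index to its character label."""
    return CLASSES[sample_index - 1]

print(class_label(1), class_label(11), class_label(37))  # -> 0 A a
```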
http://openresearch.baidu.com/activitybulletin/618.jhtml
A text-recognition code sample.
http://prir.ustb.edu.cn/TexStar/MOMV-text-detection/
This page introduces Multi-Orientation Scene Text Detection and the USTB-SV1K Dataset, and provides a multi-orientation, multi-view natural-image text database:
USTB-SV1K
Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks, yet most current research efforts focus only on horizontal or near-horizontal scene text. In our paper, we first present a unified distance metric learning framework for adaptive hierarchical clustering, which can simultaneously learn similarity weights (to adaptively combine different feature similarities) and the clustering threshold (to automatically determine the number of clusters). We then propose an effective multi-orientation scene text detection system, which constructs text candidates by grouping characters based on this adaptive clustering. Our text candidate construction method consists of several sequential coarse-to-fine grouping steps: morphology-based grouping via single-link clustering, orientation-based grouping via divisive hierarchical clustering, and projection-based grouping also via divisive clustering. The effectiveness of our proposed system is evaluated on several public scene text databases, e.g., the ICDAR Robust Reading Competition datasets (2011 and 2013) and MSRA-TD500. Specifically, on the multi-orientation text dataset MSRA-TD500, the f-measure of our system is 70%, considerably better than the 60% of a recent state-of-the-art method.
We also construct and release a practical challenging multi-orientation scene text dataset (USTB-SV1K), which is available at http://prir.ustb.edu.cn/TexStar/MOMV-text-detection/.
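The grouping steps above hinge on cutting a hierarchical clustering at a threshold. Here is a minimal sketch of the single-link grouping idea, assuming a simple Euclidean candidate descriptor and a hand-picked threshold rather than the jointly learned feature weights and threshold the paper describes:

```python
# A minimal sketch of the coarse grouping idea, not the authors' code:
# single-link agglomerative clustering of character candidates, cut at a
# fixed distance threshold. The (x, y, height) descriptor and the threshold
# value are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def group_characters(features, threshold):
    """Cluster candidates whose chained pairwise distance stays below
    `threshold`, using single-link hierarchical clustering."""
    tree = linkage(pdist(features), method="single")  # single-link merge tree
    return fcluster(tree, t=threshold, criterion="distance")

candidates = np.array([[10, 20, 30], [14, 21, 31], [200, 50, 28]], dtype=float)
print(group_characters(candidates, threshold=25.0))  # e.g. [1 1 2]
```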
Dataset description
Each image is annotated with a list of words; each word is labeled with a bounding box given by the coordinates of its top-left point, its width, height, and inclination angle, along with the ground-truth word, similar to MSRA-TD500. We collect 1000 street view (patch) images (500 for training and 500 for testing) from 6 USA cities, i.e., New York, Boston, Los Angeles, Washington DC, San Francisco, and Seattle. The set from each city includes about 160~180 images, about half of which are for training and the rest for testing. There are three main challenges for detection and recognition on this dataset (see samples in Figure 2). First, in many cases text appears in multiple orientations and views: about 75%, 10%, and 15% of images contain (near) horizontal, multi-orientation, and multi-view (often with skewed distortions) text, respectively. Second, this dataset includes a lot of small or blurred text (about 28%). Third, about one fourth of the text regions are specific street and business names, or parts of words, and cannot be found in a common dictionary. Overall, our dataset, USTB-SV1K, presents a generally open and challenging setting for natural (street view) scene text detection and recognition.
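A hedged sketch of decoding one such annotation into box corners, assuming an MSRA-TD500-style record of x, y, width, height, and angle, with the angle in radians and applied about the box center; the exact USTB-SV1K field layout should be checked against the download:

```python
# Turn a (top-left x/y, width, height, inclination angle) annotation into
# the four corners of the rotated box. Radians and center-rotation are
# assumptions carried over from MSRA-TD500's .gt convention.
import math

def box_corners(x, y, w, h, angle):
    """Return the four corners of an axis-aligned (x, y, w, h) box after
    rotating it by `angle` about its center."""
    cx, cy = x + w / 2.0, y + h / 2.0
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                           (w / 2, h / 2), (-w / 2, h / 2))]

print(box_corners(100, 40, 80, 30, 0.2))
```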
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3734103/
Rotation-Invariant Features for Multi-Oriented Text Detection in Natural Images
This paper is very informative and worth reading carefully.
Below are the download links for MSRA-TD500:
http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500)#Version_1.0
http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500)
This is the address of the NEOCR dataset, though it is currently inaccessible:
http://www.iapr-tc11.org/mediawiki/index.php/NEOCR:_Natural_Environment_OCR_Dataset
The following covers the ICDAR datasets. I have downloaded the ICDAR 2003 and ICDAR 2013 datasets. ICDAR 2005/2007/2009 did not release new datasets for robust reading and text locating and still used the dataset from ICDAR 2003. As for ICDAR 2011, the dataset download links are unavailable, and I have not yet found an alternative place to download the ICDAR 2011 dataset.
Although widely used in the community, the ICDAR datasets have two major drawbacks. First, most of the text lines (or single characters) in the ICDAR datasets are horizontal; in real scenarios, however, text may appear in any orientation. Second, all the text lines and characters in these datasets are in English, so they cannot be used to assess detection systems designed for multilingual scripts.
Source: <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3734103/#pone.0070173-Nagy1>
Below is the download link for the ICDAR 2003 robust reading dataset:
http://algoval.essex.ac.uk/icdar/Datasets.html
Robust Reading Datasets
These datasets were collected and tagged by the ICDAR 2003 Robust Reading Dataset Collection Team (pictured clockwise from left: Shirley Wong, Simon Lucas, Alex Panaretos, Luis Sosa Velazquez, Robert Young, Anthony Tang).
The datasets are organized into Sample, Trial, and Competition datasets.
Sample datasets are provided to give you a quick impression of the data, and also to allow function testing of your software. That is, you can run tests on the sample data to check that your software works with the data, but the results won't mean much.
Trial datasets serve two purposes. First, use them to get results for your ICDAR 2003 papers; for this purpose they are partitioned into two sets, TrialTrain and TrialTest: use TrialTrain to train or tune your algorithms, then quote results on TrialTest. Second, for the competitions, you should train/tune your system on the entire Trial set.
Competition datasets will be used to measure the performance of your algorithms for the competitions. These will be kept private until the ICDAR 2003 conference, when they will be made public.
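To connect these splits with the numbers reported on such benchmarks (e.g., the 70% f-measure quoted earlier), here is a minimal sketch of a box-level precision/recall/f evaluation. The one-to-one matching at an IoU threshold of 0.5 is an illustrative assumption, not the official ICDAR protocol, which uses its own area-overlap match scores:

```python
# Match detections to ground-truth boxes by IoU and report precision,
# recall, and f-measure. The 0.5 threshold and greedy one-to-one matching
# are assumptions for illustration only.
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def evaluate(detections, ground_truth, thresh=0.5):
    matched, hits = set(), 0
    for d in detections:
        for i, g in enumerate(ground_truth):
            if i not in matched and iou(d, g) >= thresh:
                matched.add(i)
                hits += 1
                break
    p = hits / len(detections) if detections else 0.0
    r = hits / len(ground_truth) if ground_truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```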
Robust Reading and Text Locating
Each dataset is provided as a zip file, and contains a set of JPEG scene images, and three XML tag files: locations.xml, words.xml and segmentation.xml.
locations.xml
is for the Text Locating problem, and contains the path to each image and the set of rectangles for each image.
words.xml
is for the Robust Reading competition - this tags each image with the bounding rectangles of each word in the image together with the text in each rectangle (a parsing sketch follows this list).
segmentation.xml
- like words.xml, except that each word is also given its segmentation points - just in case this information is useful to your algorithm (e.g. may be used to speed up EM).
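A minimal sketch of reading words.xml with Python's standard library, assuming the commonly seen ICDAR 2003 layout (an <image> element with an <imageName> child and <taggedRectangle> entries carrying x/y/width/height attributes, with the word in a <tag> child); verify the names against your copy of the file:

```python
# Parse words.xml into {image_name: [(x, y, w, h, word), ...]}.
# Element and attribute names are assumptions based on the common
# ICDAR 2003 file layout; check them against the actual XML.
import xml.etree.ElementTree as ET

def load_words(path):
    root = ET.parse(path).getroot()
    annotations = {}
    for image in root.iter("image"):
        name = image.findtext("imageName")
        boxes = []
        for rect in image.iter("taggedRectangle"):
            x, y = float(rect.get("x")), float(rect.get("y"))
            w, h = float(rect.get("width")), float(rect.get("height"))
            boxes.append((x, y, w, h, rect.findtext("tag", default="")))
        annotations[name] = boxes
    return annotations
```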
Sample (20 images, 7.3 MB)
TrialTrain (258 images, 43.3 MB)
TrialTest (251 images, 69.6 MB)
ICDAR 2005/2007/2009 did not release new datasets for robust reading and text locating; they still used the dataset from ICDAR 2003. The websites are as follows:
http://algoval.essex.ac.uk:8080/icdar2005/index.jsp?page=intro.html
http://www.informatik.uni-trier.de/~ley/db/conf/icdar/icdar2007.html
http://www.cvc.uab.es/icdar2009/
ICDAR 2011 website: a new dataset was released, but the page cannot be opened. However, ICDAR 2013 provides an updated version of the ICDAR 2011 dataset.
http://robustreading.opendfki.de/wiki/SceneText#TrainingDataset
Training Dataset
Training data is available at the following links:
Training Data Text Localization: http://www.dfki.uni-kl.de/~shahab/robustreading/train-textloc.zip
Training Data Word Recognition: http://www.dfki.uni-kl.de/~shahab/robustreading/train-wordrec.zip
Test Dataset
Test data is available at the following links:
Test Data Text Localization: http://www.dfki.uni-kl.de/~shahab/robustreading/test-textloc.zip
Test Data Word Recognition: http://www.dfki.uni-kl.de/~shahab/robustreading/test-wordrec.zip
Test Data Text Localization + Ground Truth: http://www.dfki.uni-kl.de/~shahab/robustreading/test-textloc-gt.zip
Test Data Word Recognition + Ground Truth: http://www.dfki.uni-kl.de/~shahab/robustreading/test-wordrec-gt.zip
Source: <http://robustreading.opendfki.de/wiki/SceneText#TrainingDataset>