A set of classes has been defined for each book of the HBA 1.0 dataset. We have ground-truthed the HBA 1.0 dataset by annotating each foreground pixel. The ground truth of each selected foreground pixel has been defined by means of a label indicating the content type or the content class of the analyzed historical document image. Different labels for the selected foreground pixels with different fonts have been also assigned for evaluating the participating methods to separate various text fonts.
The foreground pixels have been retrieved from the analyzed historical document image. The foreground pixel selection step has been performed using a standard parameter-free binarization method, the Otsu’s method, to retrieve only those pixels representing information of the foreground (text, graphics, …). Indeed, the major goal of the HBA competition consists of classifying the selected foreground pixels and not locating accurately the foreground pixels. Using a global thresholding approach, Otsu’s method has provided an adequate and fast mean of binarization to retrieve only the foreground pixels.
The following figures illustrate few examples of the defined ground truth of the 11 books of the HBA 1.0 dataset. Each selected foreground pixel is marked by a color that symbolizes the corresponding content type. The ground truth information is currently available at the pixel level. The ground truth of the HBA 1.0 dataset contains more than 7,58 billion annotated pixels.
For additional details about the HBA 1.0 dataset and its ground truth, see the two links below:
HBA 1.0: A Pixel-based Annotated Dataset for Historical Book Analysis
HBA : un jeu d’images annotées pour l’analyse de la structure de mise en page d’ouvrages anciens