ICDAR 2017 Competition SmartDoc-reconstruction

This repository contains the procedure and the tools required to generate new elements in the dataset.

Tools used

  • pdftk: PDF Toolkit
  • convert: ImageMagick
  • ffmpeg: FFmpeg video converter
  • python: Python 2.7+
  • OpenCV: Open Source Computer Vision Library, v2.4.8+ (v3.x not supported)
  • create_reference.py: custom Python script provided with this package

Global procedure

  1. Select source documents, ideally from a digital source in PDF format. For the public dataset, we carefully selected documents whose source was not available online, which had no copyright issues, and which contained no personal information.

  2. Either rasterize the source PDF file or scan the document using a high-quality device.

# To burst the pages of a PDF file into separate pages
$ pdftk input.pdf burst

# To convert a single-page PDF file to a 300 DPI PNG file
$ convert -density 300 input.pdf -flatten output.png

# Optionally, to keep the image size manageable, remove the alpha channel, and normalize the color space and depth
$ convert ground-truth-orig.png -scale 3600\> -background white -alpha remove -alpha off -colorspace sRGB -depth 8 -type TrueColor ground-truth-checked.png
  3. Finalize the ground truth image: make sure the PNG is cropped to the right area, using any quality-preserving image editing software. Convert the color space to sRGB if needed; a Pillow-based sketch of this conversion follows.
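
If ImageMagick is not available, the same flattening and color conversion can be approximated in Python with Pillow. This is a minimal sketch only (Pillow is not one of the listed tools, and cropping is left to an image editor):

    # Minimal sketch using Pillow (an assumption, not a listed tool):
    # flatten any alpha channel onto white and force 8-bit RGB output.
    from PIL import Image

    img = Image.open("ground-truth-orig.png")
    if img.mode in ("RGBA", "LA", "P"):
        img = img.convert("RGBA")
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.split()[-1])  # alpha band as mask
        img = background
    else:
        img = img.convert("RGB")  # 3 channels, 8 bits per channel
    img.save("ground-truth-checked.png")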

  4. Video acquisition: record a video to simulate video capture of the document. The video should be short, focus on important details, and begin with the document almost fully in frame.

  5. Ideally, capture a couple of pictures in the same conditions to compare acquisition scenarios (timing the process can be useful).

  6. Remove the sound from the videos.

$ ffmpeg -n -stats -i input-with-sound.mp4  -an input.mp4
  7. Use the ground truth to generate task data: identify the reference frame and provide the coordinates of the object to reconstruct within this frame, along with the target image shape. Press SPACE until a valid detection is displayed, then press q to save and quit.
$ python create_reference.py -d \
    /path/to/dataset/sample01/ground-truth.png \
    /path/to/dataset/sample01/input.mp4 \
    /path/to/dataset/sample01

This will generate the following files under /path/to/dataset/sample01:

    icdar_smartdoc17_reconstruction/
    ├── sample01/
    │   ├── ground-truth.png ## Training set only!
    │   ├── input.mp4
    │   ├── reference_frame_NN_dewarped.png
    │   ├── reference_frame_NN_extracted.png
    │   ├── reference_frame_NN_extracted_viz.png
    │   └── task_data.json
    ├── sample02/
    │   └── …
    ├── …
    └── sampleMM/
       └── …

File descriptions and formats

  • ground-truth.png
    • Description: Ideal image your method should produce. Included in training/demo dataset only.
    • Format: PNG image with 3 channels (RGB, no alpha), “Truecolor” (no indexed colors), 8 bits per channel, sRGB color space, no embedded ICC profile. Embedded ICC profiles will be ignored, and values will be assumed to be encoded as sRGB even in the absence of a specific file header.
  • input.mp4
    • Description: Video stream which should be processed by your method to produce an image as close as possible to ground-truth.png.
    • Format: No audio stream, one video stream: MP4 container, H.264 encoding, yuv420p pixel format, variable frame rate. Frame size may differ from one video to another, but we target the native video recording resolution of smartphones, which is usually Full HD (1080p).
  • reference_frame_NN_dewarped.png
    • Description: Image of the same shape as the ground truth image: participants should use either the shape of this image or the shape provided in task_data.json to determine the exact shape of the image they must generate. Other shapes will result in a failure to evaluate the result. This dewarped image is generated by “undoing” (“unwarping”) the perspective transform the document suffered in the reference frame, back-projecting the relevant image area into the target image shape (a sketch of this back-projection appears after the notes at the end of this document). The “NN” value in the name indicates that this frame was the NN-th frame of the video (0-indexed). It usually means it was the first exploitable frame we found when generating the task. For most videos this will be “00”, but you should not assume so. Linear interpolation was used.
    • Format: Same as ground_truth.png
  • reference_frame_NN_extracted.png
    • Description: The exact frame from the input video which was “unwarped” to produce the “dewarped” version (a sketch of re-extracting this frame by index follows the task_data.json example below).
    • Format: Same as ground_truth.png
  • reference_frame_NN_extracted_viz.png
    • Description: Same as reference_frame_NN_extracted.png, but with an extra visualization of the outline of the object to track drawn over the image.
    • Format: Same as ground_truth.png
  • task_data.json
    • Description: An easy-to-parse file which contains a summary of important coordinates and shapes: the image to produce (target_image_shape), the input video frames (input_video_shape), and the object to track (object_coord_in_ref_frame), along with the id of the frame used as a reference (reference_frame_id).
    • Format: JSON file similar to the example below.

Example of task_data.json file

    {
      "input_video_shape": {
        "x_len": 1920, 
        "y_len": 1080
      }, 
      "target_image_shape": {
        "x_len": 3508, 
        "y_len": 2480
      }, 
      "object_coord_in_ref_frame": {
        "top_right": {
          "y": -22.679962158203125, 
          "x": 1535.1053466796875
        }, 
        "bottom_left": {
          "y": 830.49786376953125, 
          "x": 568.02178955078125
        }, 
        "bottom_right": {
          "y": 985.6279296875, 
          "x": 1526.2147216796875
        }, 
        "top_left": {
          "y": 177.77229309082031, 
          "x": 546.0078125
        }
      }, 
      "reference_frame_id": 0
    }
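
As an illustration of the frame indexing, here is a minimal sketch (assuming the Python and OpenCV versions listed under “Tools used”) that re-extracts the reference frame from input.mp4 by sequential decoding:

    # Minimal sketch: re-extract frame `reference_frame_id` from the video
    # by reading frames sequentially (frames are 0-indexed, see the notes).
    import cv2

    reference_frame_id = 0  # value taken from task_data.json

    cap = cv2.VideoCapture("input.mp4")
    frame = None
    for _ in range(reference_frame_id + 1):
        ok, frame = cap.read()
        if not ok:
            raise IOError("video ended before frame %d" % reference_frame_id)
    cap.release()

    cv2.imwrite("reference_frame_%02d_extracted.png" % reference_frame_id, frame)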

Notes

  • Point coordinates are floating-point x and y values in pixels.
  • The decimal separator is the dot (.), and the decimal part may be absent.
  • The coordinates are expressed in a frame of reference where the origin is at the top left of the image, the x axis is horizontal (positive toward the right), and the y axis is vertical (positive toward the bottom).
  • Coordinates may fall outside the frame area when a small part of the document is out of frame.
  • The target shape is expressed as integers in pixels (x_len is the width, y_len the height).
  • Frames are 0-indexed (the first frame of the video has id 0).
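
Putting the pieces together, here is a minimal sketch of the back-projection described for reference_frame_NN_dewarped.png. It assumes NumPy is installed and that the extracted frame is available; the exact corner convention used by create_reference.py may differ slightly:

    # Minimal sketch: rebuild a dewarped image from the extracted frame and
    # the coordinates in task_data.json (file names below are assumptions).
    import json

    import cv2
    import numpy as np

    with open("task_data.json") as f:
        task = json.load(f)

    quad = task["object_coord_in_ref_frame"]
    w = task["target_image_shape"]["x_len"]
    h = task["target_image_shape"]["y_len"]

    # Source corners in the reference frame and destination corners in the
    # target image, both as (x, y) with the origin at the top left.
    src = np.float32([[quad[c]["x"], quad[c]["y"]]
                      for c in ("top_left", "top_right",
                                "bottom_right", "bottom_left")])
    dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])

    frame = cv2.imread("reference_frame_00_extracted.png")
    M = cv2.getPerspectiveTransform(src, dst)
    dewarped = cv2.warpPerspective(frame, M, (w, h),
                                   flags=cv2.INTER_LINEAR)  # linear, as above
    cv2.imwrite("reference_frame_00_dewarped.png", dewarped)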