Document OCR
Before you begin
To use the features in this section, you need an active Spojit account with an Enterprise plan. If you don't have an account, you can check out the pricing and register here. If you already have an account, you can log in here.
The Document OCR (Optical Character Recognition) service reads text and tables from documents (such as PDFs). The extracted data can then be used to make an API request or to create an Excel or CSV file.
Important
The Document OCR service may occasionally read stray marks, noise, or graphics as text, or have difficulty reading text in certain fonts. Adjusting the sections or the page segmentation mode and applying functions to the output can help reduce these issues. Always test with multiple documents before deploying to a production environment to confirm that the chosen functions produce the required result.
Page Segmentation Mode
The page segmentation mode (PSM) improves the accuracy of the extracted text by telling the OCR what type of text layout to expect in your section. You can select the mode on each simple or table section.
Mode | Description |
---|---|
0 | Orientation and script detection (OSD) only. |
1 | Automatic page segmentation with OSD. |
2 | Automatic page segmentation, but no OSD, or OCR. |
3 | Fully automatic page segmentation, but no OSD. |
4 | Assume a single column of text of variable sizes. |
5 | Assume a single uniform block of vertically aligned text. |
6 | Assume a single uniform block of text. |
7 | Treat the image as a single text line. |
8 | Treat the image as a single word. |
9 | Treat the image as a single word in a circle. |
10 | Treat the image as a single character. |
11 | Sparse text. Find as much text as possible in no particular order. |
12 | Sparse text with OSD. |
13 | Raw line. |
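As a rough aid for picking a starting mode, the table above can be captured as a lookup. The sketch below is illustrative only: the `PageSegMode` enum, its member names, and the `suggest_mode` helper are hypothetical and not part of Spojit, and the mapping from layout to mode is an assumption drawn from the descriptions in the table.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Mode numbers from the table above (member names are hypothetical)."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERT_TEXT = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10
    SPARSE_TEXT = 11
    SPARSE_TEXT_OSD = 12
    RAW_LINE = 13


# Assumed starting points per layout; always verify against real documents.
_SUGGESTIONS = {
    "line": PageSegMode.SINGLE_LINE,
    "word": PageSegMode.SINGLE_WORD,
    "character": PageSegMode.SINGLE_CHAR,
    "column": PageSegMode.SINGLE_COLUMN,
    "block": PageSegMode.SINGLE_BLOCK,
    "scattered": PageSegMode.SPARSE_TEXT,
}


def suggest_mode(layout: str) -> PageSegMode:
    """Return a starting mode for a layout, falling back to fully automatic."""
    return _SUGGESTIONS.get(layout.lower(), PageSegMode.AUTO)
```

For instance, a section drawn around one table column might start from `suggest_mode("column")` (mode 4), and a section around a single field from `suggest_mode("line")` (mode 7). Treat these as first guesses to test, not as guaranteed best choices.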
The following example configuration shows you how to configure the Document OCR service to read a field and a table in a PDF and create an output that could be used for other services.
Input Data Configuration
The input data configuration is used to select the source of the document to read and its file type. Currently only PDF is supported, but more options will be available soon.
Option | Description | Default | Required |
---|---|---|---|
File type | Select "PDF" to read a PDF file. | pdf | TRUE |
Raw Data | The raw data of the PDF file that will be read in the service. | - | TRUE |
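The two options above could be assembled in code roughly as follows. This is a sketch under assumptions: the source does not say how Raw Data is encoded or what the field keys are called, so the `file_type` and `raw_data` keys, the base64 encoding, and the `build_input_config` helper are all hypothetical.

```python
import base64

# Only PDF is supported according to the text above.
SUPPORTED_FILE_TYPES = {"pdf"}


def build_input_config(raw_pdf: bytes, file_type: str = "pdf") -> dict:
    """Build a hypothetical input configuration for the Document OCR service."""
    if file_type not in SUPPORTED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {file_type!r}")
    if not raw_pdf:
        raise ValueError("raw data is required")
    # PDF files normally start with the "%PDF-" header; reject anything else early.
    if not raw_pdf.startswith(b"%PDF-"):
        raise ValueError("raw data does not look like a PDF")
    return {
        "file_type": file_type,
        # Assumption: binary data is transported as base64 text.
        "raw_data": base64.b64encode(raw_pdf).decode("ascii"),
    }
```

Checking both required options up front means a missing or non-PDF file fails before any OCR work is attempted.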
OCR Configuration
The OCR configuration is used to upload a test document file and create selections to be read into an output.
Upload Document
Select your document file and upload it:
Define Tables
If your document contains tables, define them and the column names that are associated with them in this step.
Start by adding a table:
And then any columns that you wish to be in that table:
This step should look like the following after the table is added:
Configure OCR
Next, configure the tables and fields on the OCR document itself. Select "Configure OCR" to get started:
First you will see the preview of the uploaded document:
Adding a simple section
Hold the ALT key and select a section:
Select "Simple Section" as the type, add a name for it, and then click create:
Adding a tabular section
Hold the ALT key and select a section for a column (as you did for the simple section). Select "Tabular data" as the type, choose the corresponding column, and then click "create":
Finally, click "Done" to finish configuring the OCR.
Output Data
The example above produces the following output from the OCR. Note that the empty lines within the table also appear in the output. Use the available functions to remove empty lines or to pick specific fields.
{
  "Table 1": [
    {
      "Column 1": "1"
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": "2"
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": "3"
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": "4"
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": "5"
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": ""
    },
    {
      "Column 1": "6"
    },
    {
      "Column 1": ""
    }
  ],
  "My test section": [
    "QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen",
    ""
  ]
}
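The empty entries can also be removed once the output has left the service. The following sketch assumes the output has been parsed into a Python dictionary shaped like the example above. It drops table rows whose cells are all empty, drops blank lines from simple sections, and renders a table as CSV text. The `clean_output` and `table_to_csv` helpers are hypothetical and are not Spojit functions.

```python
import csv
import io


def clean_output(ocr_output: dict) -> dict:
    """Drop blank table rows and blank simple-section lines from OCR output."""
    cleaned = {}
    for name, entries in ocr_output.items():
        kept = []
        for entry in entries:
            if isinstance(entry, dict):
                # Table row: keep it if at least one cell has text.
                if any(str(value).strip() for value in entry.values()):
                    kept.append(entry)
            elif str(entry).strip():
                # Simple section line: keep it if it is not blank.
                kept.append(entry)
        cleaned[name] = kept
    return cleaned


def table_to_csv(rows: list) -> str:
    """Render a list of row dictionaries as CSV text with a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Shortened version of the example output above.
sample = {
    "Table 1": [{"Column 1": "1"}, {"Column 1": ""}, {"Column 1": "2"}],
    "My test section": [
        "QualityHosting AG - Uferweg 40-42 - D-63571 Gelnhausen",
        "",
    ],
}
cleaned = clean_output(sample)
print(cleaned["Table 1"])
print(table_to_csv(cleaned["Table 1"]))
```

A row is kept as soon as any of its cells contains text, so a multi-column table with partially filled rows is not trimmed by accident.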