OCR on-line services

OCR ON-LINE SERVICES COMPARISON

There are many on-line OCR services and most of them are free of charge. However, they raise privacy and security concerns unless they are used only for documents of near-zero value that contain no personal information. Otherwise, an off-line solution should be considered.

What follows is a simple guide that lists some of the existing services, briefly describes them and presents some comparisons among them. The current revision is dated May 26th, 2019 and updates might be available in the future. Use the history button to see the revisions and updates.

Please feel free to share your experiences or to suggest an on-line OCR service not tested yet. It would help to write a more comprehensive test guide. Send your feedback to roberto.foglietta@gmail.com, thank you in advance.

SAMPLE IMAGE AND METHODOLOGY

The image used is a German text in a clean, regularly spaced font, with some image editing corrections applied: rotation, contrast amplification, grey scale conversion and PNG encoding. Thus the input file was preprocessed so that the OCR could perform at its best.
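As a rough illustration of two of the corrections mentioned above, grey scale conversion and contrast amplification, here is a minimal Python sketch working on a tiny grid of pixels. The guide does not say which image editor or settings were used, so the function names, the BT.601 luma weights and the low/high thresholds are all assumptions made for this example.

```python
def to_grey(rgb):
    """Convert one (r, g, b) pixel to a grey level (assumed BT.601 luma weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def stretch_contrast(grey_rows, low, high):
    """Linearly map the range [low, high] onto [0, 255], clipping what falls outside."""
    scale = 255 / (high - low)
    return [
        [min(255, max(0, round((value - low) * scale))) for value in row]
        for row in grey_rows
    ]


# Hypothetical 2x2 scan: light paper pixels and darker ink pixels.
pixels = [
    [(250, 250, 250), (90, 90, 90)],
    [(200, 200, 200), (60, 60, 60)],
]
grey = [[to_grey(p) for p in row] for row in pixels]
stretched = stretch_contrast(grey, low=80, high=220)
print(stretched)
```

The narrower the [low, high] range, the closer the result gets to a pure black and white image.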

German was chosen because, among the European languages written with an extended Roman alphabet, it is the most widespread, and its character set includes the umlauts, the ß (Eszett), etc. The idea behind this choice is that OCR accuracy should be tested with an extended alphabet because it contains many more slightly different character variants.

Note: using English or any other language written with the regular Roman alphabet might give results different from those reported here. The font type would obviously affect the tests as well.

The original text was 29 lines and 1630 characters long, not counting spaces and newline characters. It reasonably filled an A4 page. The regular Roman alphabet characters were 1497, which means that the other 133 (8%) were variants or punctuation signs, as found with a frequency analysis by an on-line tool. The number of distinct characters used in the document, punctuation signs included, was 42, as found with an on-line text analysis tool.
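The figures above came from on-line tools, but the same kind of count can be sketched in a few lines of Python. This is only an illustration: the function name and the sample sentence are made up, and treating "regular Roman alphabet" as the unaccented ASCII letters is an assumption, since the guide does not define the category precisely.

```python
import string
from collections import Counter


def text_stats(text):
    """Count characters the way the guide describes, ignoring spaces and newlines."""
    chars = [c for c in text if not c.isspace()]
    # Assumption: "regular" means the unaccented ASCII letters a-z and A-Z.
    regular = sum(1 for c in chars if c in string.ascii_letters)
    return {
        "total": len(chars),
        "regular": regular,
        "others": len(chars) - regular,
        "distinct": len(Counter(chars)),
    }


# Hypothetical sample, not taken from the tested document.
stats = text_stats("Sehr geehrte Kundin, größte Mühe!")
print(stats)

# Share of non-regular characters reported in the guide: 133 out of 1630.
share = 100 * 133 / 1630
print(f"{share:.1f}%")
```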

OCR RESULTS COMPARISON

The de01x-onlineocr2.txt file is the plain ASCII version of de01x-onlineocr.txt and it has been compared with the other two files below. Each result line gives the number of errors found in the first and in the second file, respectively:

de01x-onlineocr2.txt VS de01x-newocr.txt
errors: 2 2
de01x-onlineocr2.txt VS de01x-ocrspace.txt
errors: 2 11 + 1 word misplaced

From these comparisons the de01x-corrected.txt file has been created, with this result:

de01x-corrected.txt VS de01x-onlineocr2.txt
errors: 0 2

Then de01x-corrected.txt has been compared with the other files, with these results:

de01x-corrected.txt VS de01x-ocrconvert.txt
errors: 0 1 + minus/dash mistaken
de01x-corrected.txt VS de01x-freeolocr.txt
errors: 0 39
de01x-corrected.txt VS de01x-convertio.txt
errors: 0 9 + 1 word misplaced
de01x-corrected.txt VS de01x-lightpdf.txt
errors: 0 9 + 1 word misplaced
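The guide does not say how the errors in the comparisons above were counted. One possible way to get a comparable figure is to count the characters that differ between a reference text and an OCR output with Python's difflib; the function name and the sample strings below are illustrative assumptions, not the method actually used.

```python
import difflib


def count_char_differences(reference, candidate):
    """Count the characters that differ between two texts.

    Every non-matching stretch contributes the length of its longer side,
    so one substituted character counts as one error.
    """
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors += max(i2 - i1, j2 - j1)
    return errors


# Hypothetical reference line and OCR output with two wrong characters.
reference = "Sehr geehrte Kundin,"
ocr_output = "Sehr qeehrte Kundin."
print(count_char_differences(reference, ocr_output))
```

With real files, the two strings would simply be read from, for example, de01x-corrected.txt and one of the OCR outputs. A misplaced word would show up as a deletion plus an insertion, so it is better reported separately, as done above.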

After all, the OCR engines behind these eight services are most probably just four: three services showed almost the same set of mistakes {ec:9-11, word misplaced}, two are unique in their mistakes ({ec:39}, {minus/dash}), and another three showed almost the same set of mistakes {ec:2}, different from the others. Because the three best performers reasonably rely on the same OCR engine, their ranking might vary under different conditions.
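The grouping by mistake signature described above can be sketched in Python using the seven result files listed in the comparisons. The thresholds of the bucket function are hypothetical, chosen only to separate the error counts reported here, and similar signatures are of course only a hint, not a proof, that two services share an engine.

```python
from collections import defaultdict

# Error counts and extra notes as reported in the comparisons above.
results = {
    "onlineocr": (2, None),
    "newocr": (2, None),
    "ocrspace": (11, "word misplaced"),
    "ocrconvert": (1, "minus/dash"),
    "freeolocr": (39, None),
    "convertio": (9, "word misplaced"),
    "lightpdf": (9, "word misplaced"),
}


def bucket(error_count):
    """Coarse error class; the thresholds are assumptions for this example."""
    if error_count <= 2:
        return "low"
    if error_count <= 11:
        return "medium"
    return "high"


# Services with the same (error class, note) signature end up in the same group.
groups = defaultdict(list)
for service, (errors, note) in results.items():
    groups[(bucket(errors), note)].append(service)

for signature, services in groups.items():
    print(signature, services)
```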

CONCLUSIONS

Thus we can conclude that "OCR Convert" is the easiest and most accurate on-line OCR service for single-page text recognition. However, because it rendered the minus character (-) as a dash (—), it broke the URL and the phone number in the original text, while "Online OCR" did not.

"OCR Convert" and "Online OCR" could be considered on a par as the best in this test. However, "New OCR" showed the same number of mistakes (2) but with less impact (w instead of W and ” instead of "), so it should be considered one of the most accurate and the one that offers a more sophisticated but reasonably easy-to-use on-line user interface.

"OCR Convert", "Online OCR" and "New OCR" offered the best on-line OCR service among those listed below. These three proved to be far ahead of the rest of the pack, so which one of them you use is just a matter of taste.

OCR COMPARISON, STAGE 2

The second test was performed with the same German text but without any image editing, so the image shows a 1.18° rotation with respect to the straight line of the text. This test is closer to a real use case than the previous one, which was used to select the best OCR engines in optimal conditions.

de01x-ocrconvert.txt VS de02x-ocrconvert.txt
errors: 1 0
de01x-newocr.txt VS de02x-newocr.txt
errors: 2 1

The prefix de01x- refers to the ASCII text obtained from the near-B/W plain image, while the prefix de02x- refers to the ASCII text obtained from the image straight from the scanner.

CONCLUSIONS

After all, the quality of text recognition turned out to be better when the contrast is not stretched to near B/W, even though a tiny rotation was present.

test-samples-in-two-pieces.png

The image above shows the differences between the edited text image (bolder and plain) and the original text image (lighter and rotated) straight from the scanner.

Remember that this document was written on June 26, 2019 and things might have changed in the meantime.

BEST OCR ON-LINE SERVICES TESTED

The following three on-line OCR services are available for free without registration and were the top performers in the tests above.

Online OCR

Gratis version: support for PNG up to 15 MB; main languages supported; very easy to use; clean on-line user interface; the downloaded text file is not plain ASCII encoded.

OCR Convert

Gratis version: many languages supported; easy to use in 3 steps.

Note: it tends to confuse the minus (-) with the dash (—), but such a mistake could easily be corrected with a single "search and replace" operation over the whole document.
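The "search and replace" correction mentioned in the note can be done in any text editor; in Python it is a single call. The sketch below is only an illustration: the function name, the phone number and the URL are made up. Note that it also replaces any dash that legitimately belongs to the text, so it suits documents where dashes are not expected.

```python
EM_DASH = "\u2014"  # the dash character that replaced the minus sign


def dashes_to_minus(text):
    """Replace every em dash with the ASCII minus sign."""
    return text.replace(EM_DASH, "-")


# Hypothetical OCR output with a broken phone number and a broken URL.
ocr_line = "Tel. 030\u20141234567, www.beispiel\u2014seite.de"
fixed = dashes_to_minus(ocr_line)
print(fixed)
```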

New OCR

Gratis version: many languages supported; image rotation supported; page analysis and multi-column layout supported as options; image preview with text area selection, which is an unusual feature for this kind of service but is limited to single-page documents; on-line editing of the OCR text available; Google Translate supported; API and pay-per-page professional service offered; quite simple and clean on-line user interface.

OTHER OCR ON-LINE SERVICES

Reviews of other OCR on-line services tested

OCR SPACE

Gratis version: works smoothly with a 5 MB limit; the on-line interface could be made easier to use; support for PNG images; interesting optional features such as:

  • Process a file directly from a URL
  • Detect orientation and auto-rotate image if needed
  • Do receipt scanning and/or table recognition
  • Auto-enlarge content (recommended for low DPI)
  • Create Searchable PDF

Moreover, and notably, they also offer:

  • Free OCR API for those who need to use it as a daily service
  • Privacy safe, service and API do not store data
  • API and pay-per-page professional service offered

OCR SPACE FOR CHROME

This is the "OCR Space" extension for Chrome; it captures and scans the selected text in the browser window.

Convertio OCR

Gratis version: Chrome extension support; supports many image and document formats; supports various ways to retrieve the document to process, such as Google Drive, Dropbox, URL and file upload; allows uploading up to 10 files in the gratis version; supports many languages and also a secondary language for dual-language OCR scanning, e.g. Italian documents with English technical terms; allows saving the result directly to Google Drive or Dropbox; offers API and CLI access; professional version available as a paid service; very slow compared to the others; the on-line interface could be improved for easier use.

LIGHT PDF TO OCR

Gratis version: support for the most used languages; support for PNG images; professional version available as a paid service. Provider claims: extracts text from PDF and images (JPG, BMP, TIFF, GIF) and converts it into editable Word, Excel and text output formats; very easy to use.

SODAPDF OCR

Gratis version: does not support PNG/JPG as far as I tested it; moreover, it showed some problems with the PDF format as well; Google Drive and Dropbox support.

PNG to PDF as online service

Gratis version: Chrome extension support; some problems with the PNG format.

Unfortunately, SodaPDF offers a set of tools to manage documents and images that has serious problems with format acceptance, which is a nasty shortcoming.

OCR vote: not tested because of input format problems; discarded for the moment.

FREE ONLINE OCR

Gratis version: very easy to use in 3 steps; automatic language detection; quite slow compared to the others; text download could be improved.

I2OCR

http://www.i2ocr.com

Gratis version: easy to use in 3 steps; 100+ languages supported; very slow compared to the others; multi-column document analysis (not tested); simple anti-robot check.

PNG TO PDF

This tool has been tested as a reference service to convert a PNG image (96.8 KB) into an uncompressed PDF (951 KB) and into a compressed PDF (161 KB).

Gratis version: it works smoothly and it is very easy to use.

pngtopdf-comparison.png

For a quick comparison, the image above shows the result of the Linux convert command, which generates a compressed PDF (89.4 KB) from the original PNG image (96.8 KB), and the two PDFs generated by pdfcandy, all as displayed on the screen at 200%. Using the OCR Space Chrome extension on the image above returns the following text (on one single line when its copy function is used), which has 2 mistaken characters in the last sentence (near-B/W images are harder to recognise):

PDF BY CONVERT ON THE SCREEN AT 200% Sehr geehrte Kundin, sehr geehrter Kunde, PDF BY CANDY UNCOMPRESSED ON THE SCREEN AT 200% Sehr geehrte Kundin, sehr geehrter Kunde, PDF BY CANDY COMPRESSED ON THE SCREEN AT 200% Sehr geehrte Kundin, sehr geehrter Kunde, ORIGINAL TAKEN BY THE PNG IMAGE BUT RESIZED AS THE FIRST ONE SIZE Sehr qeehrte Kundin. sehr aeehrter Kunde,
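To put the file sizes quoted in this section side by side, the short Python sketch below computes each size relative to the original PNG. The figures are the ones reported above; the dictionary labels are just descriptive names chosen for this example.

```python
# File sizes in KB as reported in this section.
sizes_kb = {
    "original PNG": 96.8,
    "PDF by convert (compressed)": 89.4,
    "PDF by pdfcandy (compressed)": 161,
    "PDF by pdfcandy (uncompressed)": 951,
}

# Size of each file relative to the original PNG, rounded to two decimals.
ratios = {
    name: round(size / sizes_kb["original PNG"], 2)
    for name, size in sizes_kb.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

In other words, the PDF produced by convert is slightly smaller than the source image, while the uncompressed PDF is almost ten times larger.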

YET TO TEST

The following on-line OCR services have not been tested yet.

ABBYY FINEREADER

Gratis version: requires registration.

OCR vote: not tested because registration required.

CVISIONTECH

Gratis version: OCR from PDF did not work but from JPG it apparently did; it requires registration in order to download the output file.

OCR vote: not tested because registration required.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.