GSoC/2022/StatusReports/QuocHungTran

New digiKam Plugin to Process Optical Character Recognition (OCR)

The goal of this project is to implement a new generic DPlugin to process images in batches with Tesseract, an open-source OCR engine. Although Tesseract can sometimes be painful to integrate and modify, only a few free and powerful OCR alternatives are currently available. Tesseract is compatible with many programming languages and frameworks through wrappers. It can be used with its existing layout analysis to recognize text within a large document, or in conjunction with an external text detector to recognize text from an image of a single text line.
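The two usage modes described above correspond to Tesseract's page segmentation modes (PSMs). As a minimal sketch, assuming the `tesseract` command-line tool rather than whatever interface the plugin itself uses, the helper below builds the argument list for each case. The function name, file names, and defaults are hypothetical; only the `-l` and `--psm` flags and the mode numbers 3 and 7 come from Tesseract.

```python
from pathlib import Path

# Page segmentation modes as listed by the Tesseract command-line help.
PSM_AUTO = 3         # fully automatic page segmentation, no orientation/script detection
PSM_SINGLE_LINE = 7  # treat the image as a single text line


def build_tesseract_command(image, output_base, lang="eng", psm=PSM_AUTO):
    """Return the argument list for one `tesseract` CLI invocation.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return [
        "tesseract",
        str(Path(image)),
        str(output_base),
        "-l", lang,
        "--psm", str(psm),
    ]


# Whole document: let Tesseract run its own layout analysis.
page_cmd = build_tesseract_command("scan.png", "scan_text")

# Cropped line from an external text detector: skip layout analysis.
line_cmd = build_tesseract_command("line_crop.png", "line_text", psm=PSM_SINGLE_LINE)

print(page_cmd)
print(line_cmd)
```

With Tesseract installed, either list could be passed to `subprocess.run(cmd, check=True)`, which would write the recognized text to `<output_base>.txt`.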

With the OCR plugin in digiKam, users will be able to select optional parameters to improve the quality of the detected text recorded in image metadata. The output text will be saved in XML files, recorded in the Exif of JFIF, or, if the user asks for it, stored in a text file at a location of their choice. Furthermore, digiKam users will be able to review the text and correct any OCR errors (spell checking).

Mentors: Gilles Caulier, Maik Qualmann, Thanh Trung Dinh

Important Links

Project Proposal

Digikam Plugin to Process Optical Character Recognition(OCR)

GitLab development branch

gsoc22-ocr-test

Contacts

Email: [email protected]

Github: quochungtran

Invent KDE: quochungtran

LinkedIn: https://www.linkedin.com/in/tran-quoc-hung-6362821b3/

Project goals

27/06/2022:

Research preprocessing methods that improve the quality of the OCR output. The results can be used to build an image preprocessing engine that improves the quality of the plugin.
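This goal does not name specific preprocessing methods. As one illustration only, binarization is a common OCR preprocessing step, and the sketch below implements Otsu's global threshold in plain Python on a flat list of 8-bit gray values. The function names and the sample pixel values are hypothetical, and nothing here is taken from the plugin's code.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of the gray levels of those pixels

    for level in range(256):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]


# Hypothetical sample: four dark "ink" pixels and four bright "paper" pixels.
sample = [20, 25, 30, 22, 200, 210, 205, 215]
threshold = otsu_threshold(sample)
print(threshold, binarize(sample, threshold))
```

A real pipeline would more likely rely on an image library for this step; the pure-Python version is only meant to make the idea concrete.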

30/08/2022:

A new optional generic Tesseract-based DPlugin is available in digiKam and Showfoto to run OCR automatically. Recognized text can be stored in a text file and in XMP metadata for users to review and correct.
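The combination of automatic OCR over several images and storage of each result in a text file can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: `recognize` is a hypothetical callable standing in for the OCR engine, `fake_recognize` returns placeholder text instead of calling Tesseract, and XMP storage is left out.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_text(image_path, text, output_dir):
    """Write recognized text to <output_dir>/<image stem>.txt and return the path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / (Path(image_path).stem + ".txt")
    target.write_text(text, encoding="utf-8")
    return target


def run_batch(image_paths, recognize, output_dir, workers=4):
    """Run `recognize` on every image in a thread pool and store each result.

    `recognize` is a hypothetical callable (image path -> text) standing in
    for the real OCR engine call.
    """
    def process(image_path):
        return save_text(image_path, recognize(image_path), output_dir)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with image_paths.
        return list(pool.map(process, image_paths))


def fake_recognize(image_path):
    """Placeholder OCR: returns a fixed string instead of calling Tesseract."""
    return f"text recognized in {Path(image_path).name}"


with tempfile.TemporaryDirectory() as tmp:
    written = run_batch(["a.png", "b.jpg"], fake_recognize, tmp)
    print([path.name for path in written])
```

Passing the OCR step in as a callable keeps the batching and storage logic independent of how Tesseract is actually invoked.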

Links to Blogs and other writing

My blog for GSoC

My entire blog:

https://quochungtran.github.io/

June 13 to June 27 (Week 1 - 2) - Tesseract Page Segmentation Modes (PSMs) Explained and their relations

https://quochungtran.github.io/junk/2022/06/27/week1-2.html

June 27 to July 10 (Week 3 - 4) - Preprocessing for improving the quality of the output

https://quochungtran.github.io/junk/2022/07/25/week3-4.html

July 11 to July 25 (Week 5 - 6) - Preprocessing for improving the quality of the output

https://quochungtran.github.io/junk/2022/07/25/weed5-6.html

July 26 to August 8 (Week 7 - 8) - OCR batch processing based on internal-multi threading

https://quochungtran.github.io/junk/2022/08/08/weed7-8.html

August 9 to August 16 (Week 9) - Storing OCR result

https://quochungtran.github.io/junk/2022/08/14/weed9.html

August 17 to September 4 (Week 10 - 12) - Code refactoring and demo application

https://quochungtran.github.io/junk/2022/08/30/lasteweeks.html

Conclusion

During the coding period of GSoC 2022, we successfully implemented a tool that converts images of documents to text using Tesseract, an open-source Optical Character Recognition engine. We were also able to improve OCR accuracy by embedding preprocessing methods. Here are some pictures of the final result: