OCR of Japanese Manga
OCR, or Optical Character Recognition, is a technology that recognizes text within a digital image, such as a scanned document, and converts it into machine-readable text that can be edited in a word processor. For example, a clearly printed paper document can be digitised by an OCR program with near-perfect accuracy, whereas retyping it by hand can introduce many errors.
While OCR is nearly perfect for languages that use the Latin alphabet, OCR of Japanese, both handwritten and printed, is prone to errors. These errors have three main causes. First, Japanese uses over 3000 characters, counting both kanji and kana. Second, some kanji have very similar shapes, which makes them hard to tell apart. Third, Japanese is written without spaces between words, which makes it difficult for OCR programs to separate words.
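The short Python sketch below illustrates two of these points. It is only an illustration and not part of any OCR program we used: the example sentences, the helper name `script_of` and the choice of look-alike pairs are our own assumptions. It shows that splitting on whitespace finds word boundaries in an English sentence but not in a Japanese one, and that visually similar kanji are nevertheless distinct characters with different code points.

```python
# Illustrative sketch only; script_of and the sample strings are hypothetical.

def script_of(ch):
    """Classify a character by its Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"

english = "I read manga"
japanese = "私は漫画を読みます"  # the same sentence, written without spaces

# Whitespace splitting separates the English words...
print(english.split())
# ...but returns the Japanese sentence as one unsegmented chunk.
print(japanese.split())

# Each character can still be classified by script.
print([(ch, script_of(ch)) for ch in japanese])

# Look-alike kanji are different characters with different code points.
lookalikes = [("未", "末"), ("土", "士")]
for a, b in lookalikes:
    print(a, hex(ord(a)), "vs", b, hex(ord(b)))
```

Real word segmentation for Japanese needs a dictionary-based or statistical tokenizer; the point here is only that the spaces an OCR program could rely on in English are absent.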
Manga adds further difficulty to the OCR of Japanese text. Not only does manga share the problems of plain Japanese text, the manga format itself is also problematic. Firstly, a manga does not consist of text alone. Most of the page is made up of images, and it is up to the OCR program to detect where the text is located amidst those images. Secondly, manga text is normally written vertically in speech balloons, which are read from right to left. This changes the order in which a traditional OCR program needs to read the text.
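To make the reading order concrete, here is a minimal Python sketch. It assumes that the characters of one speech balloon have already been detected and that each one comes with a column position and a row position; the function name `read_vertical_text` and the input format are hypothetical and not taken from any of the programs discussed here. Columns are read from right to left, and each column from top to bottom.

```python
from collections import defaultdict

# Illustrative sketch only; the input format (char, column_x, row_y) is an assumption.

def read_vertical_text(chars):
    """Join characters in vertical Japanese reading order.

    chars: list of (character, column_x, row_y) tuples, where a larger
    column_x is further right and a larger row_y is further down.
    """
    columns = defaultdict(list)
    for ch, x, y in chars:
        columns[x].append((y, ch))

    ordered = []
    # Rightmost column first, then each column from top to bottom.
    for x in sorted(columns, reverse=True):
        ordered.extend(ch for _, ch in sorted(columns[x]))
    return "".join(ordered)

# "こんにちは" laid out in two columns, given in arbitrary detection order.
balloon = [
    ("は", 1, 1), ("こ", 2, 0), ("に", 2, 2),
    ("ち", 1, 0), ("ん", 2, 1),
]
print(read_vertical_text(balloon))
```

A program that instead reads left to right, line by line, would return the same characters in the wrong order, which is why the text direction has to be configured explicitly. In real scans the column positions are not exact integers, so characters would first have to be grouped into columns with some tolerance.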
Which programs did we use?
In our quest to OCR a manga we came across several of the programs and techniques available today. Below we zoom in on the three approaches that helped us the most.
Capture2Text eliminates one of the main difficulties when working with manga: the recognition of speech balloons. The program requires the user to highlight the part of an image that contains the desired text (in our manga, the speech balloon). Capture2Text then reads the highlighted text, displays it in a window and automatically copies it to the clipboard, so that it can easily be pasted into a text editor. While the program makes the occasional mistake and is a little tedious to use, it is the most accurate program we encountered when using uncleaned scans of a manga. This is an important benefit, as most other programs need extremely clear scans to recognize characters correctly and to tell them apart.
Tesseract is an open-source OCR engine. The tool we used, Manga-Translator-TesseractOCR by Kocarus (found on GitHub), uses it to OCR an entire image and then translates the result into English, or any other desired language, through Google Translate. It requires Python 2.7. We tested it with newer versions of Python, but encountered several issues and had to reinstall version 2.7. Besides Python, it requires several libraries to run smoothly. At first we installed these manually (using "pip install"), but we soon switched to the PyCharm Community Edition IDE, which automated the installation of these libraries. At first, the tool didn't work as intended. A certain amount of knowledge was required to troubleshoot the OCR, but we managed to get it up and running.
Although the tool scans the whole manga page at once, it is very prone to mistakes. It has difficulty recognizing text balloons and it detects characters where there are none. The higher the quality of the image, the better the program performs. But even then, in our experience, Tesseract makes a lot more mistakes than Capture2Text.
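For readers who want to try Tesseract directly, the sketch below shows one way to call its command-line program from Python for vertical Japanese text. It is not the code of Manga-Translator-TesseractOCR. It assumes Python 3 (not the Python 2.7 that tool needs), a Tesseract installation on the PATH, and the "jpn_vert" language data; the function names and the file name `page_001.png` are hypothetical. Page segmentation mode 5 tells Tesseract to assume a single uniform block of vertically aligned text.

```python
import shutil
import subprocess

# Illustrative sketch only; function names and the image path are hypothetical.

def build_tesseract_command(image_path, lang="jpn_vert", psm=5):
    """Build a Tesseract command that prints the recognized text to stdout.

    lang "jpn_vert" selects the vertical Japanese language data (assumed to
    be installed); psm 5 assumes one uniform block of vertical text.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]

def ocr_image(image_path):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout

# Show the command that would be run for a (hypothetical) scanned page.
print(build_tesseract_command("page_001.png"))
```

Calling `ocr_image` on an image cropped to a single speech balloon, rather than on a whole page, matches the assumption behind mode 5 more closely and may reduce the number of characters detected where there are none.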
Ultimately, for some manga we typed the characters by hand, without an OCR program. We did this for one of our manga, for which we were not able to find scans of sufficient quality (even for Capture2Text). Because this manga was important to our further research, we made the effort of typing out the text.
We chose Barakamon and Love Com as the manga to OCR. The first pages were processed with Tesseract, followed by a few pages with Capture2Text. Both approaches seemed inefficient, however, so the remaining pages were typed manually from the manga.