LSTM models recognize random characters instead of asterisk (*) #4458

@saikrishnagopal1227

Current Behavior

When Tesseract OCR extracts text from an image containing asterisks (*), the output does not preserve them. Each asterisk is replaced with seemingly random characters or repeated letters.
The attached screenshot shows the problem: the expected asterisks are missing and the extracted text contains unexpected character sequences in their place.

Expected Behavior

The OCR output should preserve every character from the original image, including asterisks (*). Each asterisk in the image should appear in the extracted text at its correct position, with no other characters substituted for it.

Suggested Fix

No response

tesseract -v

5.4.1 and 5.5.1 (the issue occurs in both versions)

Operating System

No response

Other Operating System

Linux 20.04.6

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

Run OCR on the attached image to reproduce the issue:

dxz00000001.tif
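
The report does not include the exact command used, so the following is only a minimal reproduction sketch. It assumes the `tesseract` executable is on `PATH`, the `eng` language data is installed, and the attachment has been saved locally as `dxz00000001.tif`. The helper names (`build_tesseract_command`, `run_ocr`, `count_asterisks`) are hypothetical, and the `--oem 1` / `--psm 3` values are assumptions (LSTM engine, default page segmentation), not settings confirmed by the reporter.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", oem=1, psm=3):
    """Build a tesseract CLI invocation that writes recognized text to stdout.

    Assumed settings: --oem 1 selects the LSTM engine, --psm 3 is the
    default fully automatic page segmentation.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def count_asterisks(text):
    """Return how many asterisk characters appear in the OCR output."""
    return text.count("*")


def run_ocr(image_path, **options):
    """Run tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_ocr("dxz00000001.tif")` and passing the result to `count_asterisks` gives a quick check of whether any asterisks survive recognition. Repeating the call with a different `psm` value may help show whether the substitution depends on page segmentation.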