Parse TIFF data to hOCR format from subfolders

Hi, I am working on a project to OCR TIFF images to hOCR (HTML) format, reading the images from subfolders and writing the output to a corresponding subfolder under another path. I need to tweak the code and improve it with Beautiful Soup. The code I am currently using to extract plain text from a multi-page image is given below.

from PIL import Image
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"

image = Image.open(r"C:\Users\multipage.tiff")

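config = "--oem 3 --psm 6"

# OCR each frame (page) of the multi-page TIFF and collect the text
txt = ''
for frame in range(image.n_frames):
    image.seek(frame)
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'

print(txt)
# Write the combined text of all pages to a single file
with open(r"C:\Users\multipage_output.txt", mode='w', encoding='utf-8') as f:
    f.write(txt)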
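
What I am aiming for is roughly the sketch below, which I have not tested on real data. It walks an input folder recursively, runs `pytesseract.image_to_pdf_or_hocr` with `extension="hocr"` on every frame, and writes one `.hocr` file per page into the same relative subfolder under an output root. The function names `output_path_for` and `convert_tree`, the `_pageN.hocr` naming, and the one-file-per-page layout are my own assumptions, not anything pytesseract prescribes. It also assumes `tesseract_cmd` is already configured as in the code above.

```python
from pathlib import Path


def output_path_for(tiff_path, input_root, output_root, page):
    """Mirror the input subfolder layout under output_root (hypothetical naming scheme)."""
    relative = Path(tiff_path).relative_to(input_root)
    return Path(output_root) / relative.parent / f"{relative.stem}_page{page}.hocr"


def convert_tree(input_root, output_root, config="--oem 3 --psm 6", lang="eng"):
    """OCR every .tif/.tiff under input_root and write hOCR files under output_root."""
    # Imported here so the path helper above works without the OCR libraries installed.
    from PIL import Image
    import pytesseract

    for tiff_path in sorted(Path(input_root).rglob("*.tif*")):
        with Image.open(tiff_path) as image:
            # Single-frame images may not expose n_frames, so fall back to 1.
            for frame in range(getattr(image, "n_frames", 1)):
                image.seek(frame)
                # image_to_pdf_or_hocr returns the hOCR document as bytes.
                hocr = pytesseract.image_to_pdf_or_hocr(
                    image, extension="hocr", config=config, lang=lang
                )
                target = output_path_for(tiff_path, input_root, output_root, frame + 1)
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(hocr)
```

I would call it as `convert_tree(input_folder, output_folder)` with my own two folder paths. Since each page comes back as a complete hOCR (HTML) document, I assume Beautiful Soup could then be used to parse or merge the per-page files, but I have not worked that part out yet.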

I would appreciate your assistance.

Thanks!
Joe
