Hi, I am working on a project to OCR TIFF images into hOCR (HTML) format, reading the images from subfolders and writing the output to the corresponding subfolder under another path. I need to tweak the code and improve it with Beautiful Soup. The code I am currently using to extract plain text from a multipage image is given below.
from PIL import Image
import pytesseract

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
image = Image.open(r"C:\Users\multipage.tiff")
config = "--oem 3 --psm 6"
txt = ''
# OCR every frame (page) of the multipage TIFF and collect the text
for frame in range(image.n_frames):
    image.seek(frame)
    txt += pytesseract.image_to_string(image, config=config, lang='eng') + '\n'
print(txt)
# Save the combined text of all pages
with open(r"C:\Users\multipage_output.txt", mode='w') as f:
    f.write(txt)
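What I am aiming for is roughly the sketch below. It is untested against my real data and rests on a few assumptions: the function names (`hocr_path_for`, `tiff_to_hocr_pages`, `convert_tree`) and the `_pN` file naming are made up for illustration, each TIFF frame is written to its own hOCR file, and `pytesseract.image_to_pdf_or_hocr` with `extension="hocr"` is used in place of `image_to_string`. Merging the per-page files into one document (for example with Beautiful Soup) is not covered here.

```python
from pathlib import Path


def hocr_path_for(tiff_path: Path, input_root: Path, output_root: Path, page: int) -> Path:
    """Map input_root/sub/scan.tiff to output_root/sub/scan_p<page>.hocr.

    The "_p<page>" naming is an assumption made for this sketch.
    """
    relative = tiff_path.relative_to(input_root)
    target = output_root / relative
    return target.with_name(f"{target.stem}_p{page}.hocr")


def tiff_to_hocr_pages(tiff_path: Path, config: str = "--oem 3 --psm 6") -> list:
    """Return one hOCR document (as bytes) per frame of a multipage TIFF."""
    # Imported lazily so the path helper above can be used without the OCR stack.
    from PIL import Image, ImageSequence
    import pytesseract

    pages = []
    with Image.open(tiff_path) as image:
        for frame in ImageSequence.Iterator(image):
            # extension="hocr" makes Tesseract emit hOCR instead of PDF
            pages.append(
                pytesseract.image_to_pdf_or_hocr(
                    frame, extension="hocr", config=config, lang="eng"
                )
            )
    return pages


def convert_tree(input_root: Path, output_root: Path) -> None:
    """OCR every TIFF under input_root, mirroring the subfolder layout in output_root."""
    for tiff_path in sorted(input_root.rglob("*.tif*")):
        for number, hocr in enumerate(tiff_to_hocr_pages(tiff_path), start=1):
            out_path = hocr_path_for(tiff_path, input_root, output_root, number)
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_bytes(hocr)


# Example call (both folder paths are placeholders, not my real ones):
# convert_tree(Path(r"C:\Users\input"), Path(r"C:\Users\output"))
```

On Windows, `pytesseract.pytesseract.tesseract_cmd` would still need to be set as in my code above before calling `convert_tree`.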
I would appreciate your assistance.
Thanks!
Joe