How to extract mobile numbers from multiple images in a folder using Python


  1. I need to extract mobile numbers from images in Python. I have 500 images, so I can't open them one by one.
  2. All the images are in one folder. The program should remove duplicates and strip any alphabetic characters.
  3. All the mobile numbers need to be stored in a single text file after removing duplicates.
  4. I need to add the country codes and mobile number formats used around the world; currently I have only added the Indian mobile number format.
  5. PROBLEM: This code was working a year ago, but now it isn't. I also don't know how to specify the folder path in Python. Can someone help me get the output I want?

SOURCE CODE: Extracting phone numbers from multiple images using Python | by Ankit Gupta | Medium

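import os, pyperclip, re, send2trash
from PIL import Image
from pytesseract import image_to_string

# The images are expected in an "Input" folder next to this script.
# To read from another folder, set input_path to that folder's full path.
path = os.path.dirname(os.path.realpath(__file__))
input_path = os.path.join(path, 'Input')

all_text = []

for root, dirs, filenames in os.walk(input_path):
    for filename in filenames:
        file_path = os.path.join(root, filename)
        # Run OCR on the image and keep the recognized text
        with Image.open(file_path) as img:
            all_text.append(image_to_string(img))
        # Deleting the files scanned
        send2trash.send2trash(file_path)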

# +91 95959 59595 or 07557575575 or 0-99555 55999
phone_regex = re.compile(r'''(
    (\+91|0)?                                   # country code
    (\s|-|\.)?                                  # separator
    (\d{5})                                     # 5 digits
    (\s|-|\.)?                                  # separator
    (\d{5})                                     # 5 digits
)''', re.VERBOSE)

text = '\n'.join(all_text)

matches = []

for groups in phone_regex.findall(text):
    # groups[3] and groups[5] are the two 5-digit halves of the number
    phone_num = ''.join([groups[3], groups[5]])
    matches.append(phone_num)

if len(matches) > 0:
    # dict.fromkeys drops duplicates while keeping first-seen order
    distinct_matches = list(dict.fromkeys(matches))

    if len(matches) != len(distinct_matches):
        print(str(len(matches) - len(distinct_matches)) + ' Duplicates Removed')
    else:
        print('No duplicates found!')

    pyperclip.copy('\n'.join(distinct_matches))
    print('Copied to clipboard')
else:
    print('No Phone no. was found!!')

What error are you getting?
If it was working before but stopped, it could be a dependency issue. I’d check your Python environment. Pytesseract, for one, requires the Tesseract-OCR engine to be installed; the folder containing the executable needs to be on your PATH variable, or you need to explicitly tell pytesseract where it is. Tesseract-OCR manual & download links
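As a rough sketch of that check: the snippet below looks for the tesseract executable on PATH, falls back to a few candidate locations, and then sets `pytesseract.pytesseract.tesseract_cmd`. The fallback paths and both helper names are only illustrative assumptions; adjust them to wherever Tesseract is installed on your machine.

```python
import os
import shutil

# Hypothetical fallback locations (assumptions); edit to match your install.
FALLBACK_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
]


def find_tesseract(fallbacks=FALLBACK_PATHS):
    """Return the path to the tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")  # searches the PATH variable
    if on_path:
        return on_path
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the executable, or fail with a clear message."""
    import pytesseract  # imported lazily so find_tesseract works without it

    exe = find_tesseract()
    if exe is None:
        raise RuntimeError(
            "Tesseract not found: install it or add its folder to PATH"
        )
    pytesseract.pytesseract.tesseract_cmd = exe
    return exe
```

Calling `configure_pytesseract()` once before the first `image_to_string` call should be enough. If it raises, the OCR engine itself is missing rather than the Python package.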

To address your goals:

  • Since all the phone numbers are being stored in a list, you could just convert it to a set to remove duplicates.
  • For addressing various phone number formats, I would personally use one wider regex that can match all the various instances. Then just remove any non-digits from the results. If you google "phone number regex" you’ll find many useful examples for phone number validation that would likely work here.
  • You can write that list of matches to a text file using the Python file object's write/writelines methods. Maybe just replace the bit where you copy them to the clipboard.
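The three suggestions above could be combined roughly like this. It is a minimal sketch, not a validator: the loose pattern, the 7 to 15 digit bounds (15 is the E.164 maximum, 7 is an assumed lower limit), the function names and the output filename are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Loose pattern (an assumption): optional "+", then digits mixed with
# spaces, dots, dashes or parentheses. Newlines are excluded on purpose
# so that numbers on separate lines are not merged into one match.
LOOSE_PHONE = re.compile(r'\+?\d[\d ().-]{5,18}\d')


def extract_numbers(text, min_digits=7, max_digits=15):
    """Return unique digit-only numbers in first-seen order."""
    found = []
    for match in LOOSE_PHONE.finditer(text):
        digits = re.sub(r'\D', '', match.group())  # drop all non-digits
        if min_digits <= len(digits) <= max_digits:
            found.append(digits)
    # dict.fromkeys removes duplicates and, unlike set(), keeps the order
    return list(dict.fromkeys(found))


def save_numbers(numbers, out_path='phone_numbers.txt'):
    """Write one number per line to a text file."""
    Path(out_path).write_text('\n'.join(numbers) + '\n', encoding='utf-8')
```

In the original script this would replace the clipboard step, for example `save_numbers(extract_numbers(text))`. Two limitations to keep in mind: the same number written with and without its country code is not treated as a duplicate, and two numbers separated only by a space on one line may be merged and rejected. For stricter per-country validation, a dedicated library such as the `phonenumbers` package may be a better fit than a hand-written regex.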

For security purposes, I would use VoIP numbers. It is probably harder to hack into a phone through them. I think it is still possible, but it would be much more difficult, so people who hack phones to retrieve data are more likely to target a regular number than a VoIP number. I would advise you to learn more about this, on Mightycall for example, since that is where I read about it, but it’s up to you to decide.