How to Extract a Grammatically Correct Word from a String

Hi. I’m working on a moderation system for a Discord bot, and I’m trying to add another stage that extracts all English words from a string. For example, an input of “aaaatestdd221” should output “test”. I’m using spaCy, a natural language processing library for Python. Here’s my current code:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("text input to process")

words_in_input: list[str] = []
for token in doc:
    # Only consider tokens made up entirely of alphabetic characters
    if token.is_alpha:
        token_text = token.text.lower()
        # Go through every string stored in the vocabulary
        for word in nlp.vocab.strings:
            # If the vocabulary string occurs anywhere in the token, append it to the list
            if word in token_text:
                words_in_input.append(word.lower())

But this includes garbage words as well. For example, an input of “test” outputs “e es est s st t te tes test”. The only item I care about is “test”. (I know that NLP might be overkill for this scenario, but I don’t know a better way of going about it.)

Is there a way to limit spaCy to only include words from its vocabulary that are full, grammatically correct words?
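
For reference, here is a rough sketch of one possible direction that does not rely on spaCy at all: enumerate the substrings of the input, keep the ones found in a word list, and drop any match that sits entirely inside a longer match. Everything here is an assumption for illustration: extract_words, ENGLISH_WORDS, and the min_len cutoff are hypothetical names, and the tiny word set stands in for a real English word list.

```python
def extract_words(text, dictionary, min_len=3):
    """Return dictionary words found in text, dropping matches nested in longer ones."""
    lowered = text.lower()

    # Collect the (start, end) span of every substring that is a known word.
    matches = []
    for start in range(len(lowered)):
        for end in range(start + min_len, len(lowered) + 1):
            if lowered[start:end] in dictionary:
                matches.append((start, end))

    # Keep only spans that are not fully contained in a different matched span.
    kept = [
        (s, e)
        for (s, e) in matches
        if not any(
            s2 <= s and e <= e2 and (s2, e2) != (s, e) for (s2, e2) in matches
        )
    ]
    return [lowered[s:e] for (s, e) in kept]


# Hypothetical stand-in for a real English word list (assumption, not spaCy's vocabulary).
ENGLISH_WORDS = {"test", "tes", "est", "es", "st", "te", "e", "s", "t"}

print(extract_words("aaaatestdd221", ENGLISH_WORDS))  # ['test']
print(extract_words("test", ENGLISH_WORDS))           # ['test']
```

This only removes matches nested inside a longer match. Two real words that merely overlap would both still be reported, and the result is only as good as the word list passed in.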
