Hi. I’m working on a moderation system for a Discord bot, and I’m trying to add another stage that extracts all words (English only) from a string. For example, an input of “aaaatestdd221” should output “test”. I’m using spaCy, a natural language processing library for Python. Here’s my current code:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("text input to process")
words_in_input: list = []
for token in doc:
    # If the token in the input is alphabetical
    if token.is_alpha:
        # Start going through all words in the vocabulary
        for word in nlp.vocab.strings:
            # If the word is in the token, append it to the list
            if word in str(token).lower():
                words_in_input.append(word.lower())
But this includes garbage words as well. For example, an input of “test” outputs “e es est s st t te tes test”. The only item I care about is “test”. (I know that NLP might be overkill for this scenario, but I don’t know a better way of going about it.)
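In case it helps, here is a minimal sketch that reproduces the behavior without loading the spaCy model. The names SAMPLE_VOCAB and substrings_in_vocab are made up for this example, and the small set is only a stand-in for nlp.vocab.strings, on the assumption that the real vocabulary also contains short fragments like these:

```python
from typing import Iterable, List

# Hypothetical stand-in for nlp.vocab.strings. The assumption is that the
# real vocabulary holds short fragments alongside full words.
SAMPLE_VOCAB = {"e", "es", "est", "s", "st", "t", "te", "tes", "test", "word"}


def substrings_in_vocab(token: str, vocab: Iterable[str]) -> List[str]:
    """Return every vocab entry that occurs anywhere inside the token."""
    lowered = token.lower()
    # Same substring test as in my loop: `entry in lowered` matches
    # fragments of the token, not just the whole token.
    return sorted(entry for entry in vocab if entry in lowered)


print(" ".join(substrings_in_vocab("test", SAMPLE_VOCAB)))
# prints: e es est s st t te tes test
```

So as far as I can tell, the noise comes from the substring check matching every vocabulary entry that appears anywhere inside the token.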
Is there a way to limit spaCy to only include words from its vocabulary that are full, grammatically correct words?