Linguistic measures on corporate filings

Hi,

I have an Excel file that lists the txt file name and year for each filing. I am trying to run a textual analysis of corporate disclosures (a set of txt files, one per firm-year), but I am having difficulty generating the following two measures. Can anyone provide some guidance? Thank you!

  1. The number of words in sentences that include at least one 4-word phrase that is shared by at least 75% of all firms in a given fiscal year.

  2. The number of words in sentences that include at least one 8-word phrase that is identical to a phrase used in the prior year’s 10-K.
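
For concreteness, one possible reading of the two measures can be sketched in Python as below. This is only a sketch under assumptions not stated above: sentences are split naively on ., ! or ? followed by whitespace, words are lowercased alphanumeric tokens, a "phrase" is a run of consecutive words within a single sentence, "shared by at least 75% of all firms" means the phrase appears at least once in the filings of 75% of the firms in that year, and measure 2 compares a filing with the same firm's filing from the prior year. All function names and the `texts_by_firm` dictionary are hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # Abbreviations such as "Inc." will be split incorrectly.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Lowercased alphanumeric tokens; internal hyphens/apostrophes are kept.
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", sentence.lower())

def ngrams(tokens, n):
    # Set of all n-word phrases (as tuples) in one tokenized sentence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def doc_ngrams(text, n):
    # Set of all n-word phrases in a document, never crossing sentence breaks.
    grams = set()
    for sentence in split_sentences(text):
        grams |= ngrams(tokenize(sentence), n)
    return grams

def words_in_flagged_sentences(text, phrases, n):
    # Total word count of sentences containing at least one phrase in `phrases`.
    total = 0
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if not ngrams(tokens, n).isdisjoint(phrases):
            total += len(tokens)
    return total

def common_phrases(texts_by_firm, n=4, share=0.75):
    # Phrases that occur in the filings of at least `share` of the firms.
    # Each firm counts once per phrase because doc_ngrams returns a set.
    counts = Counter()
    for text in texts_by_firm.values():
        counts.update(doc_ngrams(text, n))
    cutoff = share * len(texts_by_firm)
    return {gram for gram, count in counts.items() if count >= cutoff}

def measure_1(texts_by_firm, n=4, share=0.75):
    # Measure 1, computed for all firms in ONE fiscal year.
    phrases = common_phrases(texts_by_firm, n, share)
    return {firm: words_in_flagged_sentences(text, phrases, n)
            for firm, text in texts_by_firm.items()}

def measure_2(current_text, prior_text, n=8):
    # Measure 2 for one firm: current filing vs. its prior-year filing.
    return words_in_flagged_sentences(current_text, doc_ngrams(prior_text, n), n)
```

Here `texts_by_firm` would map a firm identifier to the text of its filing for a single fiscal year, built from the file names and years in the Excel list, so `measure_1` would be called once per year and `measure_2` once per firm-year that has a prior-year filing. Holding every 4-word phrase for a full year of 10-Ks in memory may be heavy, so a real run might need to prune rare phrases or process the counts in batches.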