How to vectorize and speed up a double for-loop over a pandas dataframe for text similarity scoring

I have the following dataframe:

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)

I want to identify similar names in the name column when those names belong to the same cluster number, and assign a unique id to each group of similar names. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate both with a unique id, say 1. The next cluster is number 2, and five entries from the name column belong to it: Dog, Big Dog, Cat, Fish and Dry Fish. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. Finally, for Fish and Dry Fish the unique id will be, say, 4. And so on.

I wrote the following code for the logic above:

# pip install thefuzz
from thefuzz import fuzz

df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0

i = 1                 # next unique id to assign
is_i_used = False     # whether the current id has been assigned to any row
for index, row in df_test.iterrows():
    index_ = index

    # scan forward through rows of the same cluster that have no id yet
    while (index_ < len(df_test)
           and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number']
           and df_test.loc[index_, 'id'] == 0):
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1

    if is_i_used:
        i += 1
        is_i_used = False

The code generates the expected result:

          name  cluster_number  id
0        Beach               1   1
1  South Beach               1   1
2      Big Dog               2   2
3          Cat               2   3
4          Dog               2   2
5     Dry Fish               2   4
6         Fish               2   4
7          Ant               3   5
8         Bird               3   6
9         Dear               4   7

The computation runs for 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the maximum cluster size is about 200 rows. I am trying to understand how to vectorize the code.
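
One direction I am considering (a rough sketch only, not working code) is to process each cluster independently via groupby, score all name pairs within a cluster in one call, and then group names as connected components of the "score > 50" graph. Something like the sketch below, assuming rapidfuzz and scipy are available; it does not reproduce my loop exactly, but it shows the idea:

# Rough sketch only: per-cluster pairwise scoring with rapidfuzz.process.cdist
# (assumed installed) and grouping via scipy connected components.
# Note: rapidfuzz scores may differ slightly from thefuzz scores.
import numpy as np
from rapidfuzz import process as rf_process, fuzz as rf_fuzz
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def label_cluster(names, threshold=50):
    # score every pair of names in the cluster in one call
    scores = rf_process.cdist(names, names, scorer=rf_fuzz.ratio, workers=-1)
    # names are "similar" if their score exceeds the threshold;
    # connected components of that graph become the groups
    adj = csr_matrix(scores > threshold)
    _, labels = connected_components(adj, directed=False)
    return labels

# assumes df_test has a default RangeIndex (as after reset_index above)
next_id = 0
ids = np.zeros(len(df_test), dtype=int)
for _, grp in df_test.groupby('cluster_number', sort=False):
    labels = label_cluster(grp['name'].tolist())
    ids[grp.index] = labels + next_id + 1   # ids are 1-based, like in my loop
    next_id += labels.max() + 1
df_test['id'] = ids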

The thefuzz module also has a process function that allows scoring one name against many at once:

from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))
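
With a Series as choices this returns a list of (match, score, index) tuples, something like the following (scores shown are only illustrative; the default scorer is fuzz.WRatio):

[('Beach', 100, 5), ('South Beach', 90, 0), ...]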

But I don't see how it can help speed up the code.
