Removing rows from pandas dataframe efficiently?

bharatkandrigi16 · July 27, 2021, 9:22pm

I have to use data from two pandas dataframes but I’m having trouble figuring out how to remove data efficiently from the datasets. The df_books dataframe contains roughly 300k entries which includes book details (isbn, title, and author), while the df_ratings dataframe contains 1.1 million entries including user rating details (user, isbn, rating).

Format of the data to be cleaned:

# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

 df_books,df_ratings = df_books.assign(keep='yes'),df_ratings.assign(keep='yes')

The project specifies that:

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with less than 200 ratings and books with less than 100 ratings.

Thus, in order to remove certain rows of data, I need to group the data (by isbn #) that I find in the df_ratings dataframe based upon the specified condition:

df_groupby_isbn = df_ratings.groupby('isbn').count()[lambda x: x['rating']<100]
df_groupby_user = df_ratings.groupby('user').count()[lambda x: x['rating']<200]

My issue is that I can’t seem to figure out how to drop rows efficiently from the df_ratings dataframe based on the condition above.

I’ve tried to directly call the .drop() method using df_groupby_isbn and it ended up doing nothing:

df_ratings  = df_ratings.drop(df_groupby_isbn,axis=1,inplace=True)
print(len(df_ratings))

I’ve also looked into vectorized methods (couldn’t figure it out) and used a for-loop to check for the rows which fell under the condition, but it was incredibly slow and the process did not finish.

Example dataframe:

import pandas as pd
df_ratings = pd.DataFrame({'user':[276725,276726,276727],'isbn':['034545104X','0155061224','0446520802'],'rating':[7.0,5.0,3.0],'keep':['yes','yes','yes']})

How can I loop through the data frame to check (and remove) a row if it contains a matching column value under either of the conditions above?

AndyG · July 27, 2021, 9:46pm

Hello @bharatkandrigi16 Since is a FCC assignment I cannot give you the complete answer.
This process is called data cleaning, and there are many ways of doing it, not only with the drop method.

This is a condition this should give you the hint you need. You can create a loop and then in the iteration, specify the variables and remove the accorded values.
You shouldnt remove the complete row at all.
If it were up to me, would begin somewhat like this:
for row in df.iterrows():
Then iterate with the needed conditions.
Hope this helps.

bharatkandrigi16 · July 31, 2021, 9:48pm

This was definitely a step in the right step. I just have one problem. I have two nested for-loops that I am using to check each row based on a variable df_groupbooks_by_isbn that checks the data frame for the rows with an isbn label that fall under the specified condition and appends to a new dataframe. The runtime is extremely slow, processing 7000 rows every 3 minutes. What method would you suggest to improve the runtime of the following code:

df_groupby_isbn['isbn'] = df_groupby_isbn.index
df_groupbooks_by_isbn = df_books.groupby(df_groupby_user['isbn'])
print(len(df_groupbooks_by_isbn))

new_df = pd.DataFrame(columns=['user','isbn','rating'])
count=0
for index, row in df_ratings.iterrows():
  print(count,len(new_df))
  count+=1
  for isbn in df_groupbooks_by_isbn['isbn']:
    for val in isbn[1]:
      if val==row['isbn']:
        new_row = pd.Series({'user':row.user,'isbn':row.isbn,'rating':row.rating})
        new_df.append(new_row,ignore_index=True)

AndyG · July 31, 2021, 10:13pm

Hello @bharatkandrigi16 .

well this can certainly be improved sometimes, and sometimes not too much. I will depend of your datasets also.
As far as I know you have two options:

Make data cleaning, (I haven’t see the dataset), you can use drop(), dropna(), isnull(), fillna() methods to better manage your variable “isbn”. And therefore the processing time will be better.
Try a different code, as you suggest, well that is not easy as it seems, you must be careful that your new code give you the same results.
Its occur to me that in this part:

you can replace two redundant for loops by a def () function and a while loop, then you will only need to count the values in which val==row[‘isbn’]

But I cannot guarantee you that the speed of processing will improve. If it were up to me I would see if I could make data cleaning first.
Hope this helps.

system · January 30, 2022, 10:13am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Data Cleaning not working with KNN book recommendation Python	2	203	August 1, 2023
Requesting help for a faster alternative of pd.groupby Python	3	453	January 7, 2022
Hi , check out the intro tutorials on Data Cleansing and Visualization	1	559	January 17, 2021
Machine Learning with Python Projects - Book Recommendation Engine using KNN Python	5	734	January 30, 2024
KNN Book Recommender Python	3	1861	June 1, 2021

Removing rows from pandas dataframe efficiently?

Related topics