Find duplicate rows in a Pandas DataFrame


When cleaning or manipulating data, we often need to identify repeated records. In this post, we will learn how to find duplicate rows in a Pandas DataFrame using the built-in DataFrame.duplicated() function.

Pandas duplicated() function


The DataFrame.duplicated() function finds duplicate rows based on the given columns, or on all columns by default.

Syntax

DataFrame.duplicated(subset=None, keep='first')

Parameters

  • subset: A single column label or a list of column labels to compare. If not provided, all columns are used.
  • keep: Controls which occurrences are marked as duplicates.
    • 'first': Marks every occurrence as a duplicate except the first. This is the default value of the keep argument.
    • 'last': Marks every occurrence as a duplicate except the last.
    • False: Marks all occurrences as duplicates.

Returns:

It returns a boolean Series with one entry per row, where True marks a row identified as a duplicate.
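
Because the return value is a plain boolean Series, it can be inspected or counted before it is used as a filter. The following is a minimal sketch; the small frame and the names df and mask are illustrative assumptions, not part of the examples in this post.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"Name": ["Max", "Rack", "Max"], "Marks": [100, 90, 100]})

# One boolean per row, aligned with the DataFrame index
mask = df.duplicated()

print(mask.tolist())     # [False, False, True]
print(int(mask.sum()))   # number of rows flagged as duplicates: 1
print(bool(mask.any()))  # True when at least one duplicate exists
```

Passing this mask inside square brackets, as in df[mask], is what selects the duplicate rows in the examples that follow.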

1. Find duplicate rows across all columns, except the first occurrence


To find the duplicate rows across all columns of the DataFrame, we call duplicated() without the subset and keep parameters. The default value of keep is 'first', so every duplicate row is selected except its first occurrence.

Program Example

# Python 3 program to find duplicate rows in a Pandas DataFrame

import pandas as pd

Stu_Data = [
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Max', 100, 'Math'),
    ('David', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Rack', 100, 'Math'),
]

original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

print('original dataframe:\n', original_df)

# select all duplicate rows except the first occurrence
df_dup_rows = original_df[original_df.duplicated()]

print("\nDataFrame duplicate rows:\n ", df_dup_rows)

Output

original dataframe:
     Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math

DataFrame duplicate rows:
      Name  Marks Subjects
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
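
The same mask can be inverted with the ~ operator to select the rows that are not flagged, which leaves one copy of each row. The sketch below uses a shortened version of the student data; the names df and unique_rows are assumptions made for illustration, and drop_duplicates() is used only to check the result.

```python
import pandas as pd

# Shortened version of the student data, for illustration only
stu_data = [
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Max', 100, 'Math'),
]
df = pd.DataFrame(stu_data, columns=['Name', 'Marks', 'Subjects'])

# ~ inverts the boolean mask, so only rows not flagged as duplicates remain
unique_rows = df[~df.duplicated()]

print(unique_rows.index.tolist())                # [0, 1, 2]
print(unique_rows.equals(df.drop_duplicates()))  # True
```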

2. Find duplicate rows across all columns, except the last occurrence


To find the duplicate rows across all columns while treating the last occurrence as the original, we call duplicated() with keep='last'. It selects every duplicate row except its last occurrence.

Program Example

import pandas as pd

Stu_Data = [
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Max', 100, 'Math'),
    ('David', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Rack', 100, 'Math'),
]

original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

print('original dataframe:\n', original_df)

# select all duplicate rows except the last occurrence
df_dup_rows = original_df[original_df.duplicated(keep='last')]

print("\nDataFrame duplicate rows:\n ", df_dup_rows)

Output

original dataframe:
     Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math

DataFrame duplicate rows:
      Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
4  David    100     Math

3. Find all duplicate rows in a DataFrame


If we want to find every duplicate row, including the first and last occurrences, we pass keep=False to the duplicated() method. Rows that appear only once, such as 'Tom' in the example below, are not selected.

Program Example

# Python program to find all duplicate rows in a DataFrame
import pandas as pd

Stu_Data = [
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Max', 100, 'Math'),
    ('David', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('Tom', 100, 'Math'),
]

original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

# select every duplicate row of the DataFrame, including first and last occurrences
df_dup_rows = original_df[original_df.duplicated(keep=False)]

print("\nDataFrame duplicate rows:\n ", df_dup_rows)

Output

DataFrame duplicate rows:
      Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
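
With keep=False the matching rows stay in their original positions, so copies of the same record can be far apart. One possible way to review them, sketched here with a shortened data set, is to sort the result and count the size of each group. The names all_dups, grouped_view and counts are illustrative assumptions.

```python
import pandas as pd

# Shortened version of the student data, for illustration only
stu_data = [
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('Tom', 100, 'Math'),
]
df = pd.DataFrame(stu_data, columns=['Name', 'Marks', 'Subjects'])

# Every row that has at least one identical copy elsewhere
all_dups = df[df.duplicated(keep=False)]

# Sorting places identical rows next to each other
grouped_view = all_dups.sort_values(['Name', 'Marks', 'Subjects'])
print(grouped_view)

# Number of copies per name among the duplicated rows
counts = all_dups.groupby('Name').size()
print(counts.to_dict())  # {'Max': 2, 'Rack': 2}
```

'David' and 'Tom' appear only once in this shortened data, so neither is included in the result.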

4. Find duplicates based on selected columns


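We can restrict the comparison to selected columns by passing their names as a list to the subset parameter, for example duplicated(subset=['Name', 'Marks']). Rows are then treated as duplicates when they match in those columns, regardless of the values in the other columns.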

Program Example

import pandas as pd

Stu_Data = [
    ('Max', 100, 'Math'),
    ('Rack', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Max', 100, 'Math'),
    ('David', 100, 'Math'),
    ('David', 100, 'Math'),
    ('Rack', 100, 'Math'),
]

original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

print('original dataframe:\n', original_df)

# select rows that repeat in 'Name' and 'Marks', except the first occurrence
df_dup_rows = original_df[original_df.duplicated(subset=['Name', 'Marks'])]

print("\nDataFrame duplicate rows:\n ", df_dup_rows)

Output

original dataframe:
     Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math

DataFrame duplicate rows:
      Name  Marks Subjects
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
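
In the data above every row has the same subject, so the subset result matches the all-columns result. The sketch below uses hypothetical data in which the Subjects column differs, to show where the two checks diverge. The names full_mask and subset_mask are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: 'Max' appears with two different subjects
stu_data = [
    ('Max', 100, 'Math'),
    ('Max', 100, 'Physics'),
    ('Rack', 100, 'Math'),
    ('Max', 100, 'Math'),
]
df = pd.DataFrame(stu_data, columns=['Name', 'Marks', 'Subjects'])

# Compare all columns: only an exact repeat of a full row is flagged
full_mask = df.duplicated()

# Compare only 'Name' and 'Marks': the 'Subjects' column is ignored
subset_mask = df.duplicated(subset=['Name', 'Marks'])

print(full_mask.tolist())    # [False, False, False, True]
print(subset_mask.tolist())  # [False, True, False, True]
```

Row 1 is flagged only by the subset check, because it matches row 0 in Name and Marks but not in Subjects.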

Summary

In this post, we have learned several ways to find duplicate rows in a Pandas DataFrame using the built-in duplicated() function: across all columns with each of the keep options, and on selected columns with subset.