When manipulating data, we sometimes need to filter out repeated records, and the first step is to find the duplicate rows in a pandas DataFrame. In this post, we are going to learn how to find duplicate rows in a pandas DataFrame using the built-in DataFrame.duplicated() function.
Pandas duplicated() function
The DataFrame.duplicated() function is used to find duplicate rows based on the given columns, or on all columns by default.
Syntax
DataFrame.duplicated(subset=None, keep='first')
Parameters
- subset: A single column label or a list of column labels to compare. If it is not provided, all columns are used.
- keep: Controls which occurrences are marked as duplicates. The accepted values are:
- 'first': Marks every repeated row as a duplicate except its first occurrence. This is the default value of the keep argument.
- 'last': Marks every repeated row as a duplicate except its last occurrence.
- False: Marks all repeated rows as duplicates, including the first and last occurrences.
Returns
It returns a boolean Series with one value per row: True where the row is marked as a duplicate, False otherwise.
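To make the return value concrete, here is a minimal sketch that prints the boolean mask for each keep option. The small DataFrame named df and its values are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical
df = pd.DataFrame({'Name': ['Max', 'Rack', 'Max'],
                   'Marks': [100, 90, 100]})

# One boolean per row; True marks a row treated as a duplicate
mask_first = df.duplicated()            # keep='first' is the default
mask_last = df.duplicated(keep='last')
mask_all = df.duplicated(keep=False)

print(mask_first.tolist())  # [False, False, True]
print(mask_last.tolist())   # [True, False, False]
print(mask_all.tolist())    # [True, False, True]
```

Passing any of these masks inside df[...] selects the rows where the mask is True, which is the pattern used in the examples in this post.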
1. Find duplicate rows across all columns except the first occurrence
To find the duplicate rows across all columns of the DataFrame, we call the duplicated() function without the subset and keep parameters. The default value of the keep parameter is 'first', which means every repeated row is selected except its first occurrence.
Program Example
# Python 3 program to find duplicate rows in a pandas DataFrame
import pandas as pd
Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math')
            ]
original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])
print('Original DataFrame:')
print(original_df)
# Select all duplicate rows except the first occurrence of each
df_dup_rows = original_df[original_df.duplicated()]
print('Duplicate rows:')
print(df_dup_rows)
Output
Original DataFrame:
    Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
Duplicate rows:
    Name  Marks Subjects
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
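Because duplicated() returns a boolean Series, the same mask can also answer quick questions such as "how many duplicate rows are there?" without selecting the rows. The following is a small sketch on the same student data; the variable names df, mask, num_duplicates, and has_duplicates are chosen here for illustration.

```python
import pandas as pd

Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math')]
df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

# Boolean mask: True for each repeated row after its first occurrence
mask = df.duplicated()

# True counts as 1, so summing the mask counts the duplicate rows
num_duplicates = int(mask.sum())
# any() reports whether at least one duplicate exists
has_duplicates = bool(mask.any())

print(num_duplicates)  # 4
print(has_duplicates)  # True
```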
2. Find duplicate rows across all columns except the last occurrence
To find the duplicate rows across all columns of the DataFrame while treating the last occurrence as the original, we call the duplicated() function with keep='last'. It selects every repeated row except its last occurrence.
Program Example
import pandas as pd
Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math')
            ]
original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])
print('Original DataFrame:')
print(original_df)
# Select all duplicate rows except the last occurrence of each
df_dup_rows = original_df[original_df.duplicated(keep='last')]
print('Duplicate rows:')
print(df_dup_rows)
Output
Original DataFrame:
    Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
Duplicate rows:
    Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
4  David    100     Math
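The keep='last' mask can also be inverted with the ~ operator to keep only the last occurrence of each row. The sketch below assumes the same student data and compares the result with DataFrame.drop_duplicates(keep='last'), which is expected to select the same rows; the names df, mask, and last_only are illustrative.

```python
import pandas as pd

Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math')]
df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

# True for every repeated row except its last occurrence
mask = df.duplicated(keep='last')

# Inverting the mask keeps only the last occurrence of each row
last_only = df[~mask]
print(last_only.index.tolist())  # [3, 5, 6]

# drop_duplicates(keep='last') selects the same rows
print(last_only.equals(df.drop_duplicates(keep='last')))  # True
```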
3. Find all duplicate rows in a DataFrame
If we want to find all duplicates, including the first and last occurrences, we pass keep=False to the duplicated() method. In the example below, the row for Tom appears only once, so it is not part of the result.
Program Example
# Python program to find all duplicate rows in a DataFrame
import pandas as pd
Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('Tom', 100, 'Math')
            ]
original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])
# Select every row that has at least one identical copy
df_dup_rows = original_df[original_df.duplicated(keep=False)]
print('Duplicate rows:')
print(df_dup_rows)
Output
Duplicate rows:
    Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
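With keep=False the copies of each row are spread across the result in their original order. One optional way to read them more easily is to sort the selected rows so that identical rows sit next to each other. This is a sketch only: sorting by the Name column with sort_values() is a choice made here for illustration, and the names df, dup_all, and dup_sorted are hypothetical.

```python
import pandas as pd

Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('Tom', 100, 'Math')]
df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])

# Every row that has at least one identical copy (Tom is excluded)
dup_all = df[df.duplicated(keep=False)]

# Sort so that the copies of each row appear together;
# mergesort is a stable sort, so ties keep their original order
dup_sorted = dup_all.sort_values(by='Name', kind='mergesort')
print(dup_sorted)
```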
4. Find duplicates based on selected columns
We can find duplicate rows by comparing only some of the columns. To do so, we pass the column names as a list to the subset parameter, for example duplicated(subset=['Name', 'Marks']).
Program Example
import pandas as pd
Stu_Data = [('Max', 100, 'Math'),
            ('Rack', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Max', 100, 'Math'),
            ('David', 100, 'Math'),
            ('David', 100, 'Math'),
            ('Rack', 100, 'Math')
            ]
original_df = pd.DataFrame(Stu_Data, columns=['Name', 'Marks', 'Subjects'])
print('Original DataFrame:')
print(original_df)
# Select rows that repeat in the Name and Marks columns, except the first occurrence
df_dup_rows = original_df[original_df.duplicated(subset=['Name', 'Marks'])]
print('Duplicate rows:')
print(df_dup_rows)
Output
Original DataFrame:
    Name  Marks Subjects
0    Max    100     Math
1   Rack    100     Math
2  David    100     Math
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
Duplicate rows:
    Name  Marks Subjects
3    Max    100     Math
4  David    100     Math
5  David    100     Math
6   Rack    100     Math
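In the sample data above, every row has the same Subjects value, so comparing only Name and Marks gives the same result as comparing all columns. The sketch below uses a small made-up DataFrame (the df name and its values are assumptions for illustration) in which the subset parameter changes the outcome.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 differ only in the Subjects column
df = pd.DataFrame([('Max', 100, 'Math'),
                   ('Max', 100, 'Science'),
                   ('Rack', 90, 'Math'),
                   ('Max', 100, 'Math')],
                  columns=['Name', 'Marks', 'Subjects'])

# Comparing all columns: only row 3 repeats an earlier row exactly
full_mask = df.duplicated()
# Comparing Name and Marks only: rows 1 and 3 repeat row 0
subset_mask = df.duplicated(subset=['Name', 'Marks'])
# subset also accepts a single column label
name_mask = df.duplicated(subset='Name')

print(full_mask.tolist())    # [False, False, False, True]
print(subset_mask.tolist())  # [False, True, False, True]
print(name_mask.tolist())    # [False, True, False, True]
```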
Summary
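In this post, we have learned several ways to find duplicate rows in a pandas DataFrame using the built-in duplicated() function: across all columns with keep='first', keep='last', or keep=False, and across selected columns with the subset parameter.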