Remove rows with duplicate indices in Pandas

In this post, we will learn how to remove rows with duplicate indices in Pandas, keeping either the first or the last occurrence of each duplicated index label. The examples use the Pandas library, so install it first with the pip command "pip install pandas" and import it into your code with "import pandas as pd". Examples 5 and 6 also use NumPy, which is installed along with Pandas as a dependency.

1. Remove rows with duplicate indices using index.duplicated()


In this example, we call dfobj.index.duplicated(keep='first') on the dataframe's index. It returns a boolean array that marks as True every index label that has already appeared earlier in the index. The ~ (not) operator inverts each value, so dfobj[~dfobj.index.duplicated(keep='first')] selects the first occurrence of each index label and drops the remaining duplicates.

import pandas as pd
  
Student_dict = {
    'Name': ['Jack', 'Rack', 'Max','Kom'],
    'Marks':[100,100, 100,100],
    'Fee':[100,200,300,400],
    'Subject': ['Math', 'Math', 'Music','Phy']
}
  
 
 
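# Build a dataframe whose index contains duplicate labels
dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

# Keep only the first row for each index label
df3 = dfobj[~dfobj.index.duplicated(keep='first')]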

print('dataframe after remove duplicate index:\n',df3)

Output

dataframe after remove duplicate index:
    Name  Marks  Fee Subject
1  Jack    100  100    Math
2   Max    100  300   Music
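
To make the role of the keep argument more concrete, here is a minimal sketch of the boolean masks that index.duplicated() produces for the same index labels used above. The variable names (idx, first_mask, last_mask, all_mask) are only illustrative.

```python
import pandas as pd

# Same index labels as in the example above
idx = pd.Index([1, 1, 2, 2])

# True marks a label that is treated as a duplicate
first_mask = idx.duplicated(keep="first")  # later occurrences are marked
last_mask = idx.duplicated(keep="last")    # earlier occurrences are marked
all_mask = idx.duplicated(keep=False)      # every repeated label is marked

print(first_mask)  # [False  True False  True]
print(last_mask)   # [ True False  True False]
print(all_mask)    # [ True  True  True  True]
```

Inverting one of these masks with ~ and passing it to dfobj[...] keeps exactly the rows marked False.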

2. Remove rows with duplicate indices using reset_index() and drop_duplicates()


The pandas reset_index() method moves the index into a regular column (named 'index' when the index has no name), and drop_duplicates() removes duplicate rows based on the columns given in subset. Calling set_index('index') afterwards restores the index. The keep parameter lets us choose which duplicate to keep in the dataframe. When called directly on an index, drop_duplicates() returns a new Index object with the duplicates removed.

Syntax

index.drop_duplicates(keep='last')

Parameters

  • The keep parameter takes any of these values; the default is 'first'.
  • 'first': Remove all duplicates except the first occurrence.
  • 'last': Remove all duplicates except the last occurrence.
  • False: Remove every entry that has a duplicate, keeping none of them.

import pandas as pd
  
Student_dict = {
    'Name': ['Jack', 'Rack', 'Max','Kom'],
    'Marks':[100,100, 100,100],
    'Fee':[100,200,300,400],
    'Subject': ['Math', 'Math', 'Music','Phy']
}  
 
 
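# Build a dataframe whose index contains duplicate labels
dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

# Move the index into an 'index' column, drop repeated labels, then restore the index
result = dfobj.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')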
 

print('dataframe after remove duplicate index:\n',result)

Output

dataframe after remove duplicate index:
        Name  Marks  Fee Subject
index                          
1      Jack    100  100    Math
2       Max    100  300   Music
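
As a rough sketch of how the other keep values behave with this approach, the example below uses a smaller, made-up dataframe. It also chains rename_axis(None), an optional extra step not used in the example above, to remove the 'index' name that set_index() leaves on the restored index.

```python
import pandas as pd

# Small illustrative dataframe: label 1 is duplicated, label 2 is not
df = pd.DataFrame({"Name": ["Jack", "Rack", "Max"]}, index=[1, 1, 2])

# The unnamed index becomes a column called 'index'
flat = df.reset_index()

# keep='last' keeps the last row for each label
keep_last = (
    flat.drop_duplicates(subset="index", keep="last")
    .set_index("index")
    .rename_axis(None)  # drop the 'index' name from the restored index
)

# keep=False drops every row whose label occurs more than once
keep_none = (
    flat.drop_duplicates(subset="index", keep=False)
    .set_index("index")
    .rename_axis(None)
)

print(keep_last)
print(keep_none)
```

With keep=False only the row labelled 2 survives, because label 1 appears twice.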

3. Remove rows with duplicate indices using query()


In this example, we refer to the dataframe's index by the name index inside a query string. The expression dfobj.query("~index.duplicated(keep='first')") evaluates index.duplicated(), which returns a boolean array showing whether each index label is a duplicate of an earlier one. The ~ (not) operator inverts each value, so the query returns the subset of the original dataframe without the duplicates.

  • To keep the last occurrence instead, change the argument in the query string to keep='last'.

import pandas as pd
  
Student_dict = {
    'Name': ['Jack', 'Rack', 'Max','Kom'],
    'Marks':[100,100, 100,100],
    'Fee':[100,200,300,400],
    'Subject': ['Math', 'Math', 'Music','Phy']
}
  
 
 
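# Build a dataframe whose index contains duplicate labels
dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

# Inside the query string, 'index' refers to the dataframe's index
result = dfobj.query("~index.duplicated(keep='first')")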
 

print('dataframe after remove duplicate index:\n',result)

Output

dataframe after remove duplicate index:
    Name  Marks  Fee Subject
1  Jack    100  100    Math
2   Max    100  300   Music

4. Remove rows with duplicate indices using groupby()


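In this example, we use groupby() with level=0, which groups the rows by their index labels. For a MultiIndex, level selects the particular level or levels to group on. The last() function then returns the last value of each column within each group. Note that last() and first() skip missing values by default, so on data containing NaN the result for a group can combine values from different rows. The grouped result is also sorted by index label.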

  • To keep the first occurrence of each index label instead, change the statement to groupby(level=0).first().
  • Another way to keep the first occurrence with groupby is to group by the index object itself: dfobj = dfobj.groupby(dfobj.index).first()

import pandas as pd
  
Student_dict = {
    'Name': ['Jack', 'Rack', 'Max','Kom'],
    'Marks':[100,100, 100,100],
    'Fee':[100,200,300,400],
    'Subject': ['Math', 'Math', 'Music','Phy']
}
  
 
 
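# Build a dataframe whose index contains duplicate labels
dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

# Group rows by index label (level 0) and keep the last value in each group
result = dfobj.groupby(level=0).last()

print('dataframe after remove duplicate index:\n',result)


# Another way: group by the index object itself instead of by level
dfobj = dfobj.groupby(dfobj.index).last()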


print('dataframe after remove duplicate index:\n',dfobj)

Output (each of the two print statements produces the same result)

dataframe after remove duplicate index:
    Name  Marks  Fee Subject
1  Rack    100  200    Math
2   Kom    100  400     Phy
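
One point worth checking on your own data is how groupby().last() treats missing values. The sketch below uses a made-up two-row dataframe with a NaN in the Fee column and compares the groupby result with the boolean-mask approach from the first example. The names df, by_group, and by_mask are only illustrative.

```python
import numpy as np
import pandas as pd

# Two rows share the label 1; the last row has a missing Fee
df = pd.DataFrame(
    {"Name": ["Jack", "Rack"], "Fee": [100.0, np.nan]},
    index=[1, 1],
)

# last() takes the last non-missing value of each column within the group
by_group = df.groupby(level=0).last()

# The boolean mask keeps the last row exactly as it is, including the NaN
by_mask = df[~df.index.duplicated(keep="last")]

print(by_group)  # Name is 'Rack', Fee is 100.0 (taken from the first row)
print(by_mask)   # Name is 'Rack', Fee is NaN
```

If whole rows need to be preserved unchanged, the mask-based approach is the safer choice when the data may contain missing values.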

5. Remove rows with duplicate indices using np.unique(), keeping the first


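In this example, we keep the first occurrence of each index label and remove the later duplicates. Called with return_index=True, np.unique() returns the sorted unique index labels together with the position of the first occurrence of each one, and dataframe.iloc[] selects the rows at those positions. The resulting rows are therefore ordered by sorted index label.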

import pandas as pd
import numpy as np
  
Student_dict = {
    'Name': ['Jack', 'Rack', 'Max','Kom'],
    'Marks':[100,100, 100,100],
    'Fee':[100,200,300,400],
    'Subject': ['Math', 'Math', 'Music','Phy']
}
  
 
 
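# Build a dataframe whose index contains duplicate labels
dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

# Element [1] of the result holds the position of the first occurrence of each label
idx = np.unique(dfobj.index.values, return_index=True)[1]

# Select the rows at those positions
dfobj = dfobj.iloc[idx]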

 

print('dataframe after remove duplicate index:\n',dfobj)

Output

dataframe after remove duplicate index:
    Name  Marks  Fee Subject
1  Jack    100  100    Math
2   Max    100  300   Music
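
Because np.unique() sorts the labels, the rows can come back in a different order when the index is not already sorted. The following sketch, using a made-up dataframe with an unsorted index, shows one possible way to keep the original row order by sorting the returned positions with np.sort(). That extra step is an addition for illustration and is not part of the example above.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe whose index labels are not sorted
df = pd.DataFrame({"Name": ["Max", "Jack", "Kom", "Rack"]}, index=[2, 1, 2, 1])

# Positions of the first occurrence of each label, ordered by sorted label
_, first_pos = np.unique(df.index.values, return_index=True)

# Rows ordered by label: 1, then 2
sorted_by_label = df.iloc[first_pos]

# Sorting the positions restores the original row order: 2, then 1
original_order = df.iloc[np.sort(first_pos)]

print(sorted_by_label)
print(original_order)
```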

6. Remove rows with duplicate indices using np.unique(), keeping the last


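In this example, we keep the last occurrence of each index label. Because np.unique() reports the position of the first occurrence of each label, we first reverse the dataframe with dfobj[::-1], so that the last occurrence of each label comes first. We then select the rows at those positions with the iloc[] indexer of the dataframe.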

import pandas as pd
import numpy as np
  
Student_dict = {
    'Name': ['Jack', 'Rack', 'Max','Kom'],
    'Marks':[100,100, 100,100],
    'Fee':[100,200,300,400],
    'Subject': ['Math', 'Math', 'Music','Phy']
}
  
 
 
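# Build a dataframe whose index contains duplicate labels
dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

# Reverse the rows so the last occurrence of each label comes first
dfobj = dfobj[::-1]
# Positions of the first occurrence in the reversed dataframe, i.e. the last in the original
dfobj = dfobj.iloc[np.unique(dfobj.index.values, return_index=True)[1]]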

print('dataframe after remove duplicate index:\n',dfobj)

Output

dataframe after remove duplicate index:
    Name  Marks  Fee Subject
1  Rack    100  200    Math
2   Kom    100  400     Phy
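
As an alternative sketch, the same result can be obtained without reversing the dataframe itself, by reversing only the array of index labels and converting the positions back to the original row order. The variable names (values, rev_pos, last_pos) are illustrative, and this is just one possible variation on the example above.

```python
import numpy as np
import pandas as pd

Student_dict = {
    'Name': ['Jack', 'Rack', 'Max', 'Kom'],
    'Marks': [100, 100, 100, 100],
    'Fee': [100, 200, 300, 400],
    'Subject': ['Math', 'Math', 'Music', 'Phy']
}

dfobj = pd.DataFrame(Student_dict, index=[1, 1, 2, 2])

values = dfobj.index.values

# First occurrence of each label in the reversed label array
_, rev_pos = np.unique(values[::-1], return_index=True)

# Convert positions in the reversed array to positions in the original array
last_pos = len(values) - 1 - rev_pos

# Sort the positions so the rows keep their original relative order
result = dfobj.iloc[np.sort(last_pos)]

print('dataframe after remove duplicate index:\n', result)
```

The selected rows match the output above: Rack for label 1 and Kom for label 2.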