CodeFari: MongoDB Query: Remove duplicate records from collection except one

Problem: DBA work is critical and sensitive; we work on Non Relational or Relational Database and the data consistency is a common factor for both. For the data consistency, sometimes we need to remove the duplicate records except one.

Solution: some solutions are given below to remove duplicates of records in MongoDB. First, we will generate the scenario after that, and we will fix the issue with some command.

Suppose we have the Employee collection like below.

{

"_id" : ObjectId("5e5f5cf8cad2677f9f839323"),

"Name" : "Dilip",

"EmpId" : 1.0,

"Department" : "IT"

}

{

"_id" : ObjectId("5e5f5d20cad2677f9f839327"),

"Name" : "Dilip",

"EmpId" : 1.0,

"Department" : "IT"

}

{

"_id" : ObjectId("5e5f5d27cad2677f9f839328"),

"Name" : "Dilip",

"EmpId" : 1.0,

"Department" : "IT"

}

{

"_id" : ObjectId("5e5f5d3bcad2677f9f839329"),

"Name" : "Anil",

"EmpId" : 2.0,

"Department" : "HR"

}

{

"_id" : ObjectId("5e5f5d42cad2677f9f83932a"),

"Name" : "Anil",

"EmpId" : 2.0,

"Department" : "HR"

}

We can see the duplicate documents for employee "Dilip" and "Anil". Now what we will do we delete the document except each one of them.

Using Group aggregate

var duplicatesIds = [];

db.Employee.aggregate([

{

$group: {

_id: {

EmpId: "$EmpId"

dups: {

"$addToSet": "$_id"

count: {

"$sum": 1

}

}, {

$match: {

count: {

"$gt": 1

}

], {

allowDiskUse: true

}).forEach(function (doc) {

doc.dups.shift();

doc.dups.forEach(function (dupId) {

duplicatesIds.push(dupId);

})

});

printjson(duplicatesIds);

db.Employee.remove({_id:{$in:duplicatesIds}})

db.Employee.find();

Now we will do an analysis of the above-written query.

1- var duplicatesIds = []: This is an array declaration where this query will push the duplicate IDs.

2-{$group:{_id:{EmpId:"$EmpId"},dups:{"$addToSet":"$_id"} ,count:{"$sum":1}}}: Here we are grouping the records on behalf of EmpId, and using $addToSet command, we can create an array "dups", and count:{"$sum":1} is counting the duplicate records.

3- {$match:{count:{"$gt":1}}}: Here we are filtering the records that have a count greater than 1. As the above group pipeline, we are counting the duplicate records on behalf of EmpId.

4- ForEach: we are iterating records one by one here which are grouped EmpId, here we will find the array of duplicate records, for example

"dups" : [

ObjectId("5e5f5d20cad2677f9f839327"),

ObjectId("5e5f5d27cad2677f9f839328"),

ObjectId("5e5f5cf8cad2677f9f839323")

5- doc.dups.shift():Here we are removing one record which will not be deleted, and It means we will delete the duplicates except one document.

6- doc.dups.forEach(function (dupId): here again, we are iterating the array to push (duplicatesIds.push(dupId)) it records (duplicatesIds)on the above-declared array.

7- db.Employee.find(): to fetch the records.

Now finally execute the above MongoDB query, and you will find the following records.

{

"_id" : ObjectId("5e5f5d20cad2677f9f839327"),

"Name" : "Dilip",

"EmpId" : 1.0,

"Department" : "IT"

}

{

"_id" : ObjectId("5e5f5d42cad2677f9f83932a"),

"Name" : "Anil",

"EmpId" : 2.0,

"Department" : "HR"

}

It means all duplicate records have been removed except one.

MongoDB Query: Remove duplicate records from collection except one

Using Group aggregate

Now we will do an analysis of the above-written query.

2 comments:

Related Posts

What is the Use of isNaN Function in JavaScript? A Comprehensive Explanation for Effective Input Validation