MongoDB Query: Remove duplicate records from collection except one



Problem: DBA work is critical and sensitive; we work on Non Relational or Relational Database and the data consistency is a common factor for both. For the data consistency, sometimes we need to remove the duplicate records except one.

Solution: some solutions are given below to remove duplicates of records in MongoDB. First, we will generate the scenario after that, and we will fix the issue with some command.

Suppose we have the Employee collection like below.


{
    "_id" : ObjectId("5e5f5cf8cad2677f9f839323"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d20cad2677f9f839327"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d27cad2677f9f839328"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d3bcad2677f9f839329"),
    "Name" : "Anil",
    "EmpId" : 2.0,
    "Department" : "HR"
}
{
    "_id" : ObjectId("5e5f5d42cad2677f9f83932a"),
    "Name" : "Anil",
    "EmpId" : 2.0,
    "Department" : "HR"
}


We can see the duplicate documents for employee "Dilip" and "Anil". Now what we will do we delete the document except each one of them.

Using Group aggregate


var duplicatesIds = [];
db.Employee.aggregate([
        {
            $group: {
                _id: {
                    EmpId: "$EmpId"
                },
                dups: {
                    "$addToSet": "$_id"
                },
                count: {
                    "$sum": 1
                }
            }
        }, {
            $match: {
                count: {
                    "$gt": 1
                }
            }
        }
    ], {
    allowDiskUse: true
}).forEach(function (doc) {
    doc.dups.shift();
    doc.dups.forEach(function (dupId) {
        duplicatesIds.push(dupId);
    })
});
printjson(duplicatesIds); 

db.Employee.remove({_id:{$in:duplicatesIds}}) 
db.Employee.find();


Now we will do an analysis of the above-written query.

1- var duplicatesIds = []: This is an array declaration where this query will push the duplicate IDs.

2-{$group:{_id:{EmpId:"$EmpId"},dups:{"$addToSet":"$_id"} ,count:{"$sum":1}}}: Here we are grouping the records on behalf of EmpId, and using $addToSet command, we can create an array "dups", and count:{"$sum":1} is counting the duplicate records.

3- {$match:{count:{"$gt":1}}}: Here we are filtering the records that have a count greater than 1. As the above group pipeline, we are counting the duplicate records on behalf of EmpId.

4- ForEach: we are iterating records one by one here which are grouped EmpId, here we will find the array of duplicate records, for example 
"dups" : [
        ObjectId("5e5f5d20cad2677f9f839327"),
        ObjectId("5e5f5d27cad2677f9f839328"),
        ObjectId("5e5f5cf8cad2677f9f839323")
    ].

5- doc.dups.shift():Here we are removing one record which will not be deleted, and It means we will delete the duplicates except one document.

6- doc.dups.forEach(function (dupId): here again, we are iterating the array to push (duplicatesIds.push(dupId)) it records (duplicatesIds)on the above-declared array.

7- db.Employee.find(): to fetch the records.
Now finally execute the above MongoDB query, and you will find the following records.


{
    "_id" : ObjectId("5e5f5d20cad2677f9f839327"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d42cad2677f9f83932a"),
    "Name" : "Anil",
    "EmpId" : 2.0,
    "Department" : "HR"
}


It means all duplicate records have been removed except one.

2 comments:

  1. Click on the https://www.viewmantra.com get expert advise on mobile purchasing

    ReplyDelete

Please do not enter any spam link in the comment box.

NUnit : Introduction

There are several unit test tools in the market, but one of them is NUnit, the most popular tool for doing unit testing. So in this tutorial...