MongoDB Query: Remove duplicate records from collection except one


Problem: DBA work is critical and sensitive; we work on Non Relational or Relational Database and the data consistency is a common factor for both. For the data consistency, sometimes we need to remove the duplicate records except one.

Solution: some solutions are given below to remove duplicates of records in MongoDB. First, we will generate the scenario after that, and we will fix the issue with some command.

Suppose we have the Employee collection like below.


{
    "_id" : ObjectId("5e5f5cf8cad2677f9f839323"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d20cad2677f9f839327"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d27cad2677f9f839328"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d3bcad2677f9f839329"),
    "Name" : "Anil",
    "EmpId" : 2.0,
    "Department" : "HR"
}
{
    "_id" : ObjectId("5e5f5d42cad2677f9f83932a"),
    "Name" : "Anil",
    "EmpId" : 2.0,
    "Department" : "HR"
}


We can see the duplicate documents for employee "Dilip" and "Anil". Now what we will do we delete the document except each one of them.

Using Group aggregate


var duplicatesIds = [];
db.Employee.aggregate([
        {
            $group: {
                _id: {
                    EmpId: "$EmpId"
                },
                dups: {
                    "$addToSet": "$_id"
                },
                count: {
                    "$sum": 1
                }
            }
        }, {
            $match: {
                count: {
                    "$gt": 1
                }
            }
        }
    ], {
    allowDiskUse: true
}).forEach(function (doc) {
    doc.dups.shift();
    doc.dups.forEach(function (dupId) {
        duplicatesIds.push(dupId);
    })
});
printjson(duplicatesIds); 

db.Employee.remove({_id:{$in:duplicatesIds}}) 
db.Employee.find();


Now we will do an analysis of the above-written query.

1- var duplicatesIds = []: This is an array declaration where this query will push the duplicate IDs.

2-{$group:{_id:{EmpId:"$EmpId"},dups:{"$addToSet":"$_id"} ,count:{"$sum":1}}}: Here we are grouping the records on behalf of EmpId, and using $addToSet command, we can create an array "dups", and count:{"$sum":1} is counting the duplicate records.

3- {$match:{count:{"$gt":1}}}: Here we are filtering the records that have a count greater than 1. As the above group pipeline, we are counting the duplicate records on behalf of EmpId.

4- ForEach: we are iterating records one by one here which are grouped EmpId, here we will find the array of duplicate records, for example 
"dups" : [
        ObjectId("5e5f5d20cad2677f9f839327"),
        ObjectId("5e5f5d27cad2677f9f839328"),
        ObjectId("5e5f5cf8cad2677f9f839323")
    ].

5- doc.dups.shift():Here we are removing one record which will not be deleted, and It means we will delete the duplicates except one document.

6- doc.dups.forEach(function (dupId): here again, we are iterating the array to push (duplicatesIds.push(dupId)) it records (duplicatesIds)on the above-declared array.

7- db.Employee.find(): to fetch the records.
Now finally execute the above MongoDB query, and you will find the following records.


{
    "_id" : ObjectId("5e5f5d20cad2677f9f839327"),
    "Name" : "Dilip",
    "EmpId" : 1.0,
    "Department" : "IT"
}
{
    "_id" : ObjectId("5e5f5d42cad2677f9f83932a"),
    "Name" : "Anil",
    "EmpId" : 2.0,
    "Department" : "HR"
}


It means all duplicate records have been removed except one.

2 comments:

  1. Click on the https://www.viewmantra.com get expert advise on mobile purchasing

    ReplyDelete

PostgreSQL: DROP TABLE is not working

Problem: I want to drop my table, but I am not able to drop the table while I tried the query "DROP TABLE MyTable CASCADE;". ...