Problem:
DBA work is critical and sensitive; we work on Non Relational or Relational
Database and the data consistency is a common factor for both. For the data
consistency, sometimes we need to remove the duplicate records except one.
Solution:
some solutions are given below to remove duplicates of records in MongoDB.
First, we will generate the scenario after that, and we will fix the issue with
some command.
Suppose
we have the Employee collection like below.
{
"_id" :
ObjectId("5e5f5cf8cad2677f9f839323"),
"Name" : "Dilip",
"EmpId" : 1.0,
"Department" : "IT"
}
{
"_id" :
ObjectId("5e5f5d20cad2677f9f839327"),
"Name" : "Dilip",
"EmpId" : 1.0,
"Department" : "IT"
}
{
"_id" :
ObjectId("5e5f5d27cad2677f9f839328"),
"Name" : "Dilip",
"EmpId" : 1.0,
"Department" : "IT"
}
{
"_id" :
ObjectId("5e5f5d3bcad2677f9f839329"),
"Name" : "Anil",
"EmpId" : 2.0,
"Department" : "HR"
}
{
"_id" :
ObjectId("5e5f5d42cad2677f9f83932a"),
"Name" : "Anil",
"EmpId" : 2.0,
"Department" : "HR"
}
|
We can see the duplicate documents for employee
"Dilip" and "Anil". Now what we will do we delete the
document except each one of them.
Using Group aggregate
var
duplicatesIds = [];
db.Employee.aggregate([
{
$group: {
_id: {
EmpId: "$EmpId"
},
dups: {
"$addToSet":
"$_id"
},
count: {
"$sum": 1
}
}
}, {
$match: {
count: {
"$gt": 1
}
}
}
], {
allowDiskUse: true
}).forEach(function
(doc) {
doc.dups.shift();
doc.dups.forEach(function (dupId) {
duplicatesIds.push(dupId);
})
});
printjson(duplicatesIds);
db.Employee.remove({_id:{$in:duplicatesIds}})
db.Employee.find();
|
Now we will do an analysis of the above-written query.
1-
var duplicatesIds = []: This is an array declaration where
this query will push the duplicate IDs.
2-{$group:{_id:{EmpId:"$EmpId"},dups:{"$addToSet":"$_id"}
,count:{"$sum":1}}}: Here we are grouping the
records on behalf of EmpId, and using $addToSet command, we can create an array
"dups", and count:{"$sum":1} is counting the duplicate
records.
3-
{$match:{count:{"$gt":1}}}: Here we are filtering the
records that have a count greater than 1. As the above group pipeline, we are
counting the duplicate records on behalf of EmpId.
4-
ForEach: we are iterating records one by one here which are
grouped EmpId, here we will find the array of duplicate records, for example
"dups"
: [
ObjectId("5e5f5d20cad2677f9f839327"),
ObjectId("5e5f5d27cad2677f9f839328"),
ObjectId("5e5f5cf8cad2677f9f839323")
].
5-
doc.dups.shift():Here we are removing one record which will
not be deleted, and It means we will delete the duplicates except one document.
6-
doc.dups.forEach(function (dupId): here again, we are
iterating the array to push (duplicatesIds.push(dupId)) it records
(duplicatesIds)on the above-declared array.
7-
db.Employee.find(): to fetch the records.
Now finally execute the above MongoDB query, and you
will find the following records.
{
"_id" :
ObjectId("5e5f5d20cad2677f9f839327"),
"Name" : "Dilip",
"EmpId" : 1.0,
"Department" : "IT"
}
{
"_id" :
ObjectId("5e5f5d42cad2677f9f83932a"),
"Name" : "Anil",
"EmpId" : 2.0,
"Department" : "HR"
}
|
It means all duplicate records have been removed except
one.
Nice description
ReplyDeleteClick on the https://www.viewmantra.com get expert advise on mobile purchasing
ReplyDelete