Mark custom object record as duplicate via Apex for large (100,000+) data sets
Problem statement
Let's say I have a data load that inserts 100,000 unique records of a custom object - call it MyObject__c. Each record has a populated lookup to a Contact record. I want to iterate over the entire list of 100,000 inserted records, find all of the MyObject records that look up to the same Contact, and set a checkbox on all but one of them to indicate they are 'duplicates'.
Example: there are 5 records of MyObject (out of 100,000) that all look up to the Contact MyContact (SFID 123456789). After processing the entire list of 100,000 records, four of the five records would have a 'duplicate reference' checkbox checked, with the fifth one left unchecked. Basically a uniqueness check on the lookup.
What would be the best way to accomplish something like this?
My current thought process is to build an Apex batch job: the batch runs a query to get all MyObject records that need to be processed (100,000) and splits them into batches. I can get the individual ContactId value each record looks up to, use that to compare against other MyObject records to see if the Contact SFID lookups match, and bulk-process the DML update to the records. Easy.
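Roughly, the skeleton I have in mind looks like this (the class name and the DataLoadId__c filter field are just placeholders):

global class MarkDuplicatesBatch implements Database.Batchable<SObject> {
    private final Id loadId;
    global MarkDuplicatesBatch(Id loadId) { this.loadId = loadId; }

    global Database.QueryLocator start(Database.BatchableContext bc) {
        // A QueryLocator can feed far more than 50,000 rows into a batch job.
        return Database.getQueryLocator(
            [SELECT Id, ContactId__c FROM MyObject__c WHERE DataLoadId__c = :loadId]);
    }

    global void execute(Database.BatchableContext bc, List<MyObject__c> scope) {
        // Each ~200-record chunk lands here; this is where I'd compare
        // ContactId__c values, flag duplicates, and update the records.
    }

    global void finish(Database.BatchableContext bc) {}
}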
...But I'm getting hung up on how to get the data for all 100,000 MyObject records to compare against within each batch. Batch Apex splits up the processing, but within each 200-record batch I still need access to all 100,000 MyObject records being processed, so I can read the ContactId each one points to and find the duplicates.
That is to say, if I am processing record 187 of Batch 1, I still need to compare it against all 100,000 other MyObject records to find the duplicates. I might be overthinking this, but I can't find a way to avoid governor limits in this case: a SOQL statement like [SELECT ContactId FROM MyObject WHERE DataLoadId = myId] would return more than 50,000 rows (since I have 100,000 records that would match). SOSL has similar limitations on returned rows, as do aggregate queries.
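For example, even the aggregate version of that check (sketched below, with myId as a stand-in for the real data-load identifier) runs into the same wall, because every row scanned by the GROUP BY counts toward the limit:

// Anonymous-Apex sketch; DataLoadId__c and myId mirror the query above.
Id myId = null; // stand-in for the actual data-load identifier
AggregateResult[] groups = [
    SELECT ContactId__c, COUNT(Id) cnt
    FROM MyObject__c
    WHERE DataLoadId__c = :myId
    GROUP BY ContactId__c
    HAVING COUNT(Id) > 1
];
// With 100,000 matching rows this fails just like the plain SELECT,
// since aggregate queries count each row they process against the limit.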
How the heck would I do a comparison against such a large set of data? I feel like the answer is much simpler than I am making it out to be in my head...
Solution
Solution 1: If you don't have any use for the duplicate records, you can let the database identify duplicates for you. Store the Contact reference in a unique external ID field (a text field holding the Contact Id, marked Unique and External ID; a lookup field itself can't be marked unique). While uploading, the Salesforce database will spot the duplicates and allow only one record per Contact to be inserted, so you end up with a 1-to-1 relationship between myObject__c records and Contacts.
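To see the mechanics, here is a rough anonymous-Apex sketch of the same database behavior the upload relies on. ContactKey__c is an assumed unique external-ID text field on MyObject__c and the Id value is made up; with allOrNone = false the database accepts the first row and rejects the second:

// ContactKey__c: an assumed Text field marked Unique + External ID.
List<MyObject__c> recs = new List<MyObject__c>{
    new MyObject__c(ContactKey__c = '0035g00000AbCdE'),
    new MyObject__c(ContactKey__c = '0035g00000AbCdE') // same Contact
};
// allOrNone = false: failures are reported per row, not thrown.
for (Database.SaveResult sr : Database.insert(recs, false)) {
    if (!sr.isSuccess()) {
        // The second row is rejected with StatusCode.DUPLICATE_VALUE.
        System.debug(sr.getErrors()[0].getStatusCode());
    }
}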
Solution 2: If you do need to keep the duplicate records, introduce one more object, say myObject2__c, with a single 'contact id' field: a text field marked External ID and Unique.
Now write a before insert trigger on your first object, myObject__c. The trigger should in turn try to insert a record into myObject2__c; if that insertion fails, set the 'isDuplicate' checkbox to true, otherwise do nothing. The trigger would look something like this:
trigger MarkDuplicate on MyObject__c (before insert) {
    try {
        // The unique field on MyObject2__c makes this insert fail
        // when a record for the same Contact already exists.
        insert new MyObject2__c(ContactId__c = Trigger.new[0].ContactId__c);
    } catch (DmlException ex) {
        Trigger.new[0].IsDuplicate__c = true;
    }
}
This way, by using the database's power to identify duplicates, you end up with exactly the data you wanted.
I assume you would be able to bulkify the trigger code; the given sample explains the concept only. One possible bulkified version is sketched below.
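Here is how that bulkification might look (my own sketch, not the answer's original code, assuming the same ContactId__c and IsDuplicate__c field names). Database.insert with allOrNone = false reports failures per row, and the SaveResult list preserves input order, so failed tracking inserts map straight back to Trigger.new by index:

trigger MarkDuplicate on MyObject__c (before insert) {
    // Build one tracking record per incoming record, in the same order.
    List<MyObject2__c> trackers = new List<MyObject2__c>();
    for (MyObject__c rec : Trigger.new) {
        trackers.add(new MyObject2__c(ContactId__c = rec.ContactId__c));
    }
    // allOrNone = false: rows violating the unique constraint fail
    // individually instead of throwing for the whole chunk.
    Database.SaveResult[] results = Database.insert(trackers, false);
    for (Integer i = 0; i < results.size(); i++) {
        if (!results[i].isSuccess()) {
            // A failed insert means this Contact was already seen.
            Trigger.new[i].IsDuplicate__c = true;
        }
    }
}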