I had a requirement to compare two files today and remove the entries from the larger list that matched the entries in the smaller list – think a poor man’s mailing list management.
Thanks to a post on Stack Overflow, I was able to very quickly remove the entries with the following.
grep -Fiv -f potentialduplicates.txt < fulllist.txt > noduplicates.txt
The flags are as follows –
-F treat the patterns as fixed strings, not regexps (fast)
-i case-insensitive matching
-v invert the results, i.e. print the non-matching lines
-f read the patterns from the given file
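To sanity-check the command, here's a minimal sketch with throwaway sample data (the addresses are invented for illustration):

printf 'alice@example.com\nbob@example.com\ncarol@example.com\n' > fulllist.txt
printf 'BOB@example.com\n' > potentialduplicates.txt
# lines matching any pattern in potentialduplicates.txt are dropped, ignoring case
grep -Fiv -f potentialduplicates.txt < fulllist.txt > noduplicates.txt
cat noduplicates.txt   # alice@example.com and carol@example.com remain

One caveat worth knowing: a blank line in the pattern file matches every line, so with -v it would empty the output entirely.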
This worked really well and the end user was pleased. I did convert all the entries to lower case first in Excel using =LOWER(A1) and dragging the formula down the list. Copying the new list to a text file meant I had some clean lists to process.
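If you don't have Excel to hand, the same lower-casing can be done in the shell – a quick sketch, assuming the file names from above:

# fold both lists to lower case before comparing
tr '[:upper:]' '[:lower:]' < fulllist.txt > fulllist-lower.txt
tr '[:upper:]' '[:lower:]' < potentialduplicates.txt > potentialduplicates-lower.txt

Strictly speaking, -i already makes the grep match case-insensitive, so the conversion mainly keeps the final output list tidy.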
Unfortunately I didn’t have a copy of grep on my Windows 7 machine, so I just uploaded the files to a Linux server to do the processing – quicker than obtaining grep for Windows.
As it turns out, I could have used findstr, which comes with Windows. The same output can be obtained with
findstr /g:potentialduplicates.txt fulllist.txt >noduplicates.txt
Comments
I attempted to do this using the findstr method you indicated, but when run it only shows the duplicates in the new file, not the new list with the duplicates removed?
You need to add /v to get the list of unique entries:
findstr /v /g:potentialduplicates.txt fulllist.txt >noduplicates.txt
This then worked for me, apart from the fact that it leaves out any word containing the characters ‘IS’. Looking for another tool now.
If your intention is to remove the dupes, and not simply produce a file containing them, then findstr is not suitable.
It’s used for searching for strings, so the best it can do is give you the duplicates, not remove them.
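For what it's worth, the ‘IS’ problem above comes from findstr treating each search string as a substring match anywhere in the line. The /l (literal) and /x (whole-line match) switches should avoid that, so only exact, full-line matches get dropped – a sketch, not something I've battle-tested:

findstr /v /x /l /g:potentialduplicates.txt fulllist.txt >noduplicates.txt

Add /i as well if the lists haven't already been folded to lower case.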