I had a requirement to compare two files today and remove the entries from the larger list that matched the entries in the smaller list – think a poor man’s mailing list management.
Thanks to the post at StackOverflow, I was able to very quickly remove the entries with the following.
grep -Fiv -f potentialduplicates.txt %lt; fulllist.txt > noduplicates.txt
The flags are as follows –
-F no regexps (fast)
-i case-insensitive
-v invert results
-f
This worked really well and the end user was pleased. I did convert all the entries into all lower case first in excel using =lowercase(a1) and dragging down the list. Copying the new list to a text file meant I had some clean lists to process.
Unfortunately I didn’t have a copy of grep on my Windows7 machine, so I just uploaded it to a linux server to do the processing – quicker than obtaining grep.
As it turns out, I could have used findstr which comes with Windows. The same output can be obtained with
findstr /g:potentialduplicates.txt fulllist.txt >noduplicates.txt