I had a requirement to compare two files today and remove the entries from the larger list that matched the entries in the smaller list – think a poor man’s mailing list management.
Thanks to a post on Stack Overflow, I was able to very quickly remove the entries with the following.
grep -Fiv -f potentialduplicates.txt < fulllist.txt > noduplicates.txt
The flags are as follows –
-F treat the patterns as fixed strings, not regexps (fast)
-i case-insensitive matching
-v invert the results, i.e. print the non-matching lines
-f read the patterns from the given file
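To sanity-check the command, here's a minimal sketch with throwaway sample data (the addresses are invented for illustration):

printf 'alice@example.com\nbob@example.com\ncarol@example.com\n' > fulllist.txt
printf 'BOB@example.com\n' > potentialduplicates.txt
# lines matching any pattern in potentialduplicates.txt are dropped, ignoring case
grep -Fiv -f potentialduplicates.txt < fulllist.txt > noduplicates.txt
cat noduplicates.txt   # alice@example.com and carol@example.com remain

One caveat worth knowing: a blank line in the pattern file matches every line, so with -v it would empty the output entirely.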
This worked really well and the end user was pleased. I did convert all the entries to lower case first in Excel using =LOWER(A1) and dragging the formula down the list. Copying the new list to a text file meant I had some clean lists to process.
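If you don't have Excel to hand, the same lower-casing can be done in the shell – a quick sketch, assuming the file names from above:

# fold both lists to lower case before comparing
tr '[:upper:]' '[:lower:]' < fulllist.txt > fulllist-lower.txt
tr '[:upper:]' '[:lower:]' < potentialduplicates.txt > potentialduplicates-lower.txt

Strictly speaking, -i already makes the grep match case-insensitive, so the conversion mainly keeps the final output list tidy.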
Unfortunately I didn’t have a copy of grep on my Windows 7 machine, so I just uploaded the files to a Linux server to do the processing – quicker than obtaining grep for Windows.
As it turns out, I could have used findstr, which comes with Windows. The same output can be obtained with
findstr /g:potentialduplicates.txt fulllist.txt >noduplicates.txt
Comments
I attempted to do this using the findstr method you indicated, but when run it only shows the duplicates in the new file, not the new list with the duplicates removed?
You need to add /v to get the list of unique entries:
findstr /v /g:potentialduplicates.txt fulllist.txt >noduplicates.txt
This then worked for me, apart from the fact that it leaves out any word containing the characters ‘IS’. Looking for another tool now.
If your intention is to remove the dupes, and not simply produce a file containing them, then findstr is not suitable.
It’s used for searching for strings, so the best it can do is give you the duplicates, not remove them.
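For what it's worth, the ‘IS’ problem above comes from findstr treating each search string as a substring match anywhere in the line. The /l (literal) and /x (whole-line match) switches should avoid that, so only exact, full-line matches get dropped – a sketch, not something I've battle-tested:

findstr /v /x /l /g:potentialduplicates.txt fulllist.txt >noduplicates.txt

Add /i as well if the lists haven't already been folded to lower case.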