Recognize, find and delete duplicate words in a txt or doc file

I received a few text documents with thousands of words in them (one word per line). I'm sure some words are duplicated, and I need to delete the duplicates so that only a single copy of each word remains. I copied and pasted all the words into an MS Word document, and now I need to find the duplicates and delete the extra ones. Finding and deleting them one by one is tedious and slow, and I could easily miss some. I need software or a method that does this inside MS Word in one go: something that searches all the words and gives me a list of results, so I can tell it to keep one copy of each word and delete the rest. I use MS Word 2019 on Windows 10 x64. Is there a macro or a simple way to do this? I googled it and found an old macro, but it didn't work in MS Word 2019 and was also complicated. I'm looking for an easier way, or a program with an easy UI. Free or trial software would be appreciated.

2 Answers

If you have Excel, you could instead copy your list into a spreadsheet (if the words are on separate lines, each one should paste into its own row in a single column). You can then use Excel's Remove Duplicates feature (on the Data tab).

1

You can use PowerShell to do this. To open PowerShell, press Win+R, type powershell, and press Enter. The basic idea is to create an empty array first, then go through the words one by one and add a word to the array only if the array does not already contain it.

You said each word is on its own line, so this is simple to achieve with the following code:

[array]$words = Get-Content "path\to\file\files.txt"
$uniquewords = @()
foreach ($word in $words) {
    if ($uniquewords -notcontains $word) { $uniquewords += $word }
}
$uniquewords | Out-File "path\to\file\files.txt"

Update as per comment:

An array is a data structure that is designed to store a collection of items. The items can be the same type or different types.

Microsoft Docs: Arrays

An [array] ([System.Array]) is a type of PowerShell object that holds a collection of items. Arrays can be easily traversed and manipulated with PowerShell commands.

Use [array] | Get-Member -Static to list the static methods and properties available on [array].

To make a variable an [array], put [array] before its name.

In the first line, Get-Content reads the content of the file located at "path\to\file\files.txt" and assigns the result to a variable named words. The dollar sign $ indicates that the string following it names a variable. The variable is an [array] because [array] is put before it.

Get-Content returns each line as a separate string, so each line would be an element in the $words [array].
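To see this for yourself, here is a small sketch that writes three lines to a temporary file and reads them back. The temporary file and the sample words are made up for the demonstration and are not part of your real list.

```powershell
# Create a throwaway file with one word per line (sample data, for illustration only)
$tempFile = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tempFile -Value 'apple', 'banana', 'apple'

# [array] keeps the result an array even if the file has only one line
[array]$words = Get-Content $tempFile

$words.Count    # number of lines read: 3
$words[0]       # first line: apple
$words[2]       # third line: apple (the duplicate is still there at this point)

# Clean up the throwaway file
Remove-Item $tempFile
```

Array positions start at 0, so $words[0] is the first line of the file.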

The second command creates an empty [array] named $uniquewords.

In the third line, foreach ($word in $words) means "for each item in the array named words" (every item, one by one, in order).

For example:

$array=@('one','two','three','four','five')

The above line creates an [array] named $array with five elements. Each word is an element, and the elements are [string]s because of the quotes that enclose them. The elements are separated by commas.

Try this command:

foreach ($arra in $array) {$arra}

This will output:

one
two
three
four
five

In a foreach statement, the part in () names the collection to loop over and the variable that holds the current item; the part in {} is a script block (the commands to run for each item).

The script block of the foreach statement is:

 if ($uniquewords -notcontains $word) {$uniquewords += $word}

This is an if statement: the part in () is a condition, and the part in {} is a script block that runs only when the condition is true.

-notcontains is an operator that means the collection before it does not contain the value after it (exactly what its name says). += is an operator that adds the thing after it to the thing before it.

The if statement therefore means: if $uniquewords doesn't contain the word, add the word to $uniquewords.
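The two operators can be tried on their own in a PowerShell window. This is only an illustration with made-up words. One thing worth knowing is that -notcontains ignores upper and lower case by default, so "Dog" and "dog" count as the same word (the case-sensitive variant is -cnotcontains).

```powershell
# A small sample array (made-up words)
$uniquewords = @('cat', 'dog')

$uniquewords -notcontains 'bird'    # True: 'bird' is not in the array yet
$uniquewords -notcontains 'dog'     # False: 'dog' is already there
$uniquewords -notcontains 'DOG'     # False: the comparison ignores case by default
$uniquewords -cnotcontains 'DOG'    # True: -cnotcontains is case-sensitive

# += appends the element on the right to the array on the left
$uniquewords += 'bird'
$uniquewords.Count                  # 3
```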

The final line outputs the content of $uniquewords to the file.

The foreach statement ensures every word is processed.
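For a few thousand words the loop above is fast enough. With much larger lists it slows down, because -notcontains searches the whole array and += copies it each time. The following is a sketch of one possible alternative, not the method used above: it keeps track of the words already seen in a .NET HashSet. It assumes PowerShell 5 or later (for the ::new syntax), and the sample words are made up.

```powershell
# Sample input; in practice this would come from Get-Content
$words = 'one', 'two', 'One', 'three', 'two'

# A set that ignores case, so that 'One' and 'one' count as the same word
$comparer = [System.StringComparer]::OrdinalIgnoreCase
$seen = [System.Collections.Generic.HashSet[string]]::new($comparer)

# Add() returns True only the first time a word is added to the set,
# so each word is emitted once, in its original order
$uniquewords = foreach ($word in $words) {
    if ($seen.Add($word)) { $word }
}

$uniquewords    # one, two, three
```

The first spelling of each word is the one that is kept, which is also what the original loop does.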

As for the path, replace "path\to\file\files.txt" with the full path of your file.

For example, if the file is named textfile.txt and stored on the Desktop, then it is in %userprofile%\Desktop. If your username is Username, its full path is C:\Users\Username\Desktop\textfile.txt

In cmd, you can use %userprofile%\desktop\textfile.txt to indicate the full path for any username.

In PowerShell, use this instead:

$Desktop = [Environment]::GetFolderPath('Desktop')
"${Desktop}\textfile.txt"
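Before running the script, it can help to check that the path really points at your file. A minimal sketch, assuming Windows and a file called textfile.txt on the Desktop (the file name is only an example):

```powershell
# Folder of the current user's Desktop
$Desktop = [Environment]::GetFolderPath('Desktop')

# Join-Path puts the separator between the folder and the file name
$file = Join-Path $Desktop 'textfile.txt'
$file

# Test-Path returns True if the file exists at that location, False otherwise
Test-Path $file
```

If Test-Path prints False, the file name or folder is wrong and Get-Content would fail with the same path.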

If you would rather not type the path at all, find the file in Explorer, hold Shift and right-click it, then click "Copy as path" in the context menu. The copied path already includes the quotes, so you can paste it in place of "path\to\file\files.txt".

As another example, if the file is named textfile.txt and stored in C:\somefolder\, use this:

[array]$words = Get-Content "C:\somefolder\textfile.txt"
# ...... (the middle lines stay the same)
$uniquewords | Out-File "C:\somefolder\textfile.txt"

I am sorry I cannot make it any simpler...
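Since the script overwrites the original file, it may be worth trying it on a copy first. The sketch below runs the same loop on a small sample file in the temporary folder and writes the result to a second file, so nothing is lost if something goes wrong. The file names and sample words are made up. For a real run, point $inputFile at your list and leave out the Set-Content line.

```powershell
# Sample input and output files in the temp folder (hypothetical names)
$folder     = [System.IO.Path]::GetTempPath()
$inputFile  = Join-Path $folder 'words-sample.txt'
$outputFile = Join-Path $folder 'words-sample-unique.txt'

# Sample data only: skip this line when using your own file
Set-Content -Path $inputFile -Value 'red', 'green', 'red', 'blue', 'green'

# Same logic as the script above
[array]$words = Get-Content $inputFile
$uniquewords = @()
foreach ($word in $words) {
    if ($uniquewords -notcontains $word) { $uniquewords += $word }
}

# Write to a separate file so the input file stays untouched
$uniquewords | Out-File $outputFile

# Report how many duplicate lines were dropped
"Removed $($words.Count - $uniquewords.Count) duplicate lines"
```

With the sample data this reports 2 removed lines, and the output file contains red, green and blue, one per line.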
