List and count unique words from a Word document [closed]

I want to take a Microsoft Word document and produce a spreadsheet of all the words contained in the document and the number of times each word appears.

e.g.,

cat 23
said 15
jumped 12
dog 7

Is this a no-brainer problem that can be accomplished in a simple, straightforward way using built-in functions and features of Word or Excel?

If not, is this functionality readily available in off-the-shelf tools (in which case, please advise what I should inquire about on the Software Recs site), or would custom programming be required?

1

2 Answers

Apart from VBA, one can develop such an application using API of OpenOffice to read the contents of the Word document; process it and export the results as a CSV file to open in a spreadsheet application.

However it's actually just a few line of codes if you're familiar with any programming language. For example in Python you can easily do it like that:

Here we define a simple function which counts words given a list

def countWords(a_list): words = {} for i in range(len(a_list)): item = a_list[i] count = a_list.count(item) words[item] = count return sorted(words.items(), key = lambda item: item[1], reverse=True)

The rest is to manipulate the content of the document.First paste it:

content = """This is the content of the word document. Just copy paste it.
It can be very very very very long and it can contain punctuation
(they will be ignored) and numbers like 123 and 4567 (they will be counted)."""

Here we remove the punctuation, EOL, parentheses etc. and then generate a word list for our function:

import re
cleanContent = re.sub('[^a-zA-Z0-9]',' ', content)
wordList = cleanContent.lower().split()

Then we run our function and store its result (word-count pairs) in another list and print the results:

result = countWords(wordList)
for words in result: print(words)

So the result is:

('very', 4)
('and', 3)
('it', 3)
('be', 3)
('they', 2)
('will', 2)
('can', 2)
('the', 2)
('ignored', 1)
('just', 1)
('is', 1)
('numbers', 1)
('punctuation', 1)
('long', 1)
('content', 1)
('document', 1)
('123', 1)
('4567', 1)
('copy', 1)
('paste', 1)
('word', 1)
('like', 1)
('this', 1)
('of', 1)
('contain', 1)
('counted', 1)

You can remove parentheses and comma using search/replace if you want.

All you need to do download Python 3, install it, open IDLE (comes with Python), replace the content of your word document and run the commands one at a time and in the given order.

1

Use VBA.  A macro (subroutine) to do exactly what you request is on this page:

Sub WordFrequency() Const maxwords = 9000 'Maximum unique words allowed Dim SingleWord As String 'Raw word pulled from doc Dim Words(maxwords) As String 'Array to hold unique words Dim Freq(maxwords) As Integer 'Frequency counter for unique words Dim WordNum As Integer 'Number of unique words Dim ByFreq As Boolean 'Flag for sorting order Dim ttlwds As Long 'Total words in the document Dim Excludes As String 'Words to be excluded Dim Found As Boolean 'Temporary flag Dim j, k, l, Temp As Integer 'Temporary variables Dim ans As String 'How user wants to sort results Dim tword As String ' ' Set up excluded words Excludes = "[the][a][of][is][to][for][by][be][and][are]" ' Find out how to sort ByFreq = True ans = InputBox("Sort by WORD or by FREQ?", "Sort order", "WORD") If ans = "" Then End If UCase(ans) = "WORD" Then ByFreq = False End If Selection.HomeKey Unit:=wdStory System.Cursor = wdCursorWait WordNum = 0 ttlwds = ActiveDocument.Words.Count ' Control the repeat For Each aword In ActiveDocument.Words SingleWord = Trim(LCase(aword)) 'Out of range? If SingleWord < "a" Or SingleWord > "z" Then SingleWord = "" End If 'On exclude list? If InStr(Excludes, "[" & SingleWord & "]") Then SingleWord = "" End If If Len(SingleWord) > 0 Then Found = False For j = 1 To WordNum If Words(j) = SingleWord Then Freq(j) = Freq(j) + 1 Found = True Exit For End If Next j If Not Found Then WordNum = WordNum + 1 Words(WordNum) = SingleWord Freq(WordNum) = 1 End If If WordNum > maxwords - 1 Then j = MsgBox("Too many words.", vbOKOnly) Exit For End If End If ttlwds = ttlwds - 1 StatusBar = "Remaining: " & ttlwds & ", Unique: " & WordNum Next aword ' Now sort it into word order For j = 1 To WordNum - 1 k = j For l = j + 1 To WordNum If (Not ByFreq And Words(l) < Words(k)) _ Or (ByFreq And Freq(l) > Freq(k)) Then k = l Next l If k <> j Then tword = Words(j) Words(j) = Words(k) Words(k) = tword Temp = Freq(j) Freq(j) = Freq(k) Freq(k) = Temp End If StatusBar = "Sorting: " & WordNum - j Next j ' Now write out the results tmpName = ActiveDocument.AttachedTemplate.FullName Documents.Add Template:=tmpName, NewTemplate:=False Selection.ParagraphFormat.TabStops.ClearAll With Selection For j = 1 To WordNum .TypeText Text:=Trim(Str(Freq(j))) _ & vbTab & Words(j) & vbCrLf Next j End With System.Cursor = wdCursorNormal j = MsgBox("There were " & Trim(Str(WordNum)) & _ " different words ", vbOKOnly, "Finished")
End Sub
1

You Might Also Like