How do I count the total number of words of all files in a directory (and its subdirectories)?

I think wc could do this if it had a recursive option, but I am not sure. I want a grand total of the words in all files under a directory and its subdirectories, not just a per-file word count.

Note: I am doing this on a Mac.

OK, I just tried this command:

find enwiki/ -type f | xargs wc -w > output.txt

The resulting output file has 6425104 lines, which suggests roughly that many files. But the total word count at the end was only 381609. Did the grand total perhaps exceed the maximum value allowed in bash? I'm not sure whether that happened or whether I used wc incorrectly.

3 Answers

Use find to locate all files, concatenate them with cat, and count the words in the combined stream with a single wc:

find . -type f -exec cat {} + | wc -w

The issue with your command is that xargs calls wc multiple times, once per batch of files, when there are many thousands of files to process. Each call prints its own total, so the number at the end of your output is only the total for the last batch. In the command above, cat may also be called multiple times on batches of files, but all of its output is sent to a single invocation of wc.
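To see the batching effect on a small scale, here is a minimal sketch. It assumes a throwaway directory created with mktemp and made-up file names, and it uses xargs -n 2 only to force small batches (in the original case the batches form on their own because of the argument-length limit).

```shell
# Build a small throwaway tree (hypothetical file names) to experiment on.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf 'one two three\n' > "$demo/a.txt"
printf 'four five\n' > "$demo/b.txt"
printf 'six seven eight nine\n' > "$demo/sub/c.txt"

# Force xargs to run wc on batches of at most two files. Every batch that
# holds two or more files prints its own "total" line, so no single line
# is a grand total.
find "$demo" -type f -print0 | xargs -0 -n 2 wc -w

# One wc reading the concatenated stream yields a single number.
# tr strips the padding spaces that some wc implementations print.
grand_total=$(find "$demo" -type f -exec cat {} + | wc -w | tr -d ' ')
echo "grand total: $grand_total"

rm -rf "$demo"
```

The -print0 / -0 pair is there so that file names containing spaces survive the trip through xargs.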


If your wc has the --files0-from option (GNU wc does; the wc that ships with macOS does not), you can do this:

find . -type f -print0 | wc -w --files0-from=-

Explanation:

I found this solution by first reading the wc(1) man page to see what options were available for scanning multiple files. I found this:

--files0-from=F read input from the files specified by NUL-terminated names in file F; If F is - then read names from standard input

From using find before, I knew that it could generate the desired list of files and, with the -print0 option, output them as a list of NUL-terminated names.

Putting that together resulted in the command above. The find command searches the current directory (.) and all subdirectories for regular files (-type f) and prints their path names to standard output, each name followed by a NUL character instead of the usual newline (-print0). That result is piped (|) into the standard input of wc, which reads the list of names from the specified file (--files0-from=-, where - means standard input) and prints the number of words (-w) found in each file, followed by the total of all words found.

If all you are interested in is the grand total, append this to the command above:

| tail -1
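Since --files0-from is not available everywhere, one possible way to combine the two approaches is sketched below. The function name count_words and the demo files are made up for illustration; the probe simply checks whether wc accepts the option and otherwise falls back to concatenating with cat.

```shell
# count_words is a hypothetical helper name, not part of any tool.
count_words() {
  # Probe for the option using an empty list of file names.
  if wc --files0-from=/dev/null >/dev/null 2>&1; then
    # Single wc process: the last line is the total (or the only file's
    # count); awk keeps just the number.
    find "$1" -type f -print0 | wc -w --files0-from=- | tail -n 1 | awk '{ print $1 }'
  else
    # Fallback: send every file through one wc via cat.
    find "$1" -type f -exec cat {} + | wc -w | awk '{ print $1 }'
  fi
}

# Try it on a throwaway tree whose names contain spaces.
demo=$(mktemp -d)
mkdir -p "$demo/sub dir"
printf 'one two three\n' > "$demo/a.txt"
printf 'four five\n' > "$demo/sub dir/b c.txt"
total=$(count_words "$demo")
echo "$total"
rm -rf "$demo"
```

For the directory in the question, this would be called as count_words enwiki/.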

Try:

find . -type f -exec wc -w {} \; | awk -f sum -

where sum is an awk (or nawk/gawk) program file containing the two lines below. The first rule runs for every line output by the command on the left side of the pipe symbol (|), and the END rule prints the accumulated sum:

{ s += $1 }
END { print "word sum = ", s }

Note: file permissions matter. You may see Permission denied errors for files or directories the user running the command cannot read, and those files are left out of the sum. Otherwise, the command gives the total you are looking for across all files readable by that user.
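The same idea can be tried without a separate program file by passing the awk program inline. This is only a sketch on a throwaway directory with made-up file names; it relies on -exec ... \; running wc once per file, so no "total" lines get mixed into the sum.

```shell
# Throwaway files (hypothetical names) to sum over.
demo=$(mktemp -d)
printf 'alpha beta\n' > "$demo/x.txt"
printf 'gamma delta epsilon\n' > "$demo/y.txt"

# One wc per file: each output line is "<count> <path>", and awk adds up
# the first field of every line.
result=$(find "$demo" -type f -exec wc -w {} \; \
  | awk '{ s += $1 } END { print "word sum =", s }')
echo "$result"

rm -rf "$demo"
```

Starting one wc process per file is slow on millions of files, so the cat-based command is likely the better fit for a tree of that size.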
