I'm trying to use grep to extract attributes from a large collection of XML files. I've tried using grep -E -m 1 -o -Z "<tag>(.*)</tag>" /home/somepath/*.xml || printf "NULL", but for some reason it doesn't print NULL for a file when the regex doesn't match. The endgame here is to build a rudimentary SQL database of these files, using the information from the tags to populate columns. This is my first foray into DBs, so maybe I'm going about it all wrong?
2 Answers
Since you want one result per file, you'll have to run one grep per file, something like:
$ find /home/somepath -type f -name '*.xml' -print | \
> while read path; do \
> grep -E -H -m 1 -o -Z "<tag>(.*)</tag>" "$path" || echo -e "$path\x00NULL"; \
> done

Breaking it down:
$ find /home/somepath -type f -name '*.xml' -print | \

This generates the list of files to search and pipes them into the while loop. All this needs to do is print one path per line, so there are lots of ways to do it.
> while read path; do \

This reads each line into the path shell variable and loops until read returns false, which it does at end-of-file, i.e. once find has printed all the paths it's going to.
> grep -E -H -m 1 -o -Z "<tag>(.*)</tag>" "$path" || echo -e "$path\x00NULL"; \

This searches the current file (in $path). If the pattern isn't found in the file, grep returns false (i.e. exits with a non-zero status), so the echo is executed. The -e tells echo to interpret escapes, so it prints the current path, an ASCII NUL, and the literal string NULL. That emulates grep's own output, which is the current path (forced by -H, since grep wouldn't normally print the path when searching a single file), an ASCII NUL (because of -Z), and the matched text.
> done

This closes out the while loop.
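One caveat with the loop above: a plain while read will mangle paths containing spaces or backslashes. A sketch of a more robust variant, assuming GNU find/grep and bash (the extract_tags function name and the directory argument are mine, not from the answer), using find -print0 with read -d '' so every filename survives intact:

```shell
#!/usr/bin/env bash
# Sketch: search every .xml file under the directory given as $1 and
# print one NUL-delimited record per file -- either grep's
# "path<NUL>match" output, or "path<NUL>NULL" when the tag is absent.
extract_tags() {
    find "$1" -type f -name '*.xml' -print0 |
    while IFS= read -r -d '' path; do
        # -H forces the filename, -Z makes the separator a NUL byte,
        # -m 1 -o keeps only the first matched tag.
        grep -E -H -m 1 -o -Z "<tag>(.*)</tag>" "$path" ||
            printf '%s\0NULL\n' "$path"
    done
}
```

The -print0/read -d '' pairing is the standard way to pass arbitrary filenames through a pipe, since NUL is the one byte that can never appear in a path.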
Try this way:
grep -E -m 1 -o -Z "<tag>(.*)</tag>" /home/somepath/*.xml 2>&- || echo "NULL"