How can I merge log files, i.e. files that are sorted by time but that also contain multi-line entries, where only the first line of each entry has a timestamp and the remaining lines do not?
log1
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
log2
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
Expected result
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
If it weren't for the non-timestamp lines starting with a digit, a simple sort -nm log1 log2 would do.
Is there an easy way on a unix/linux cmd line to get the job done?
Edit: As these log files are often gigabytes in size, merging should be done without re-sorting the (already sorted) log files and without loading the files completely into memory.
13 Answers
Tricky. While it is possible using date and bash arrays, this really is the kind of thing that would benefit from a real programming language. In Perl for example:
$ perl -ne '$d=$1 if /(.+?),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
Here's the same thing uncondensed into a commented script:
#!/usr/bin/env perl
## Read each input line, saving it
## as $_. This while loop is equivalent
## to perl -ne
while (<>) {
    ## If this line has a comma
    if (/(.+?),/) {
        ## Save everything up to the 1st
        ## comma as $date
        $date=$1;
    }
    ## Add the current line to the %k hash.
    ## The hash's keys are the dates and the
    ## contents are the lines.
    $k{$date}.=$_;
}
## Get the sorted list of hash keys
@dates=sort(keys(%k));
## Now that we have them sorted,
## print each set of lines.
foreach $date (@dates) {
    print "$k{$date}";
}
Note that this assumes that all date lines, and only the date lines, contain a comma. If that's not the case, you can use this instead:
perl -ne '$d=$1 if /^(\d+:\d+:\d+\.\d+),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*
The approach above needs to keep the entire contents of the files in memory. If that is a problem, here's one that doesn't:
$ perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log* | sort -n | perl -lne 's/\0/\n/g; printf'
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
This one simply joins all lines between successive timestamps into a single line by replacing newlines with \0 (if \0 can occur in your log files, use any sequence of characters you know will never be there). The result is passed to sort -n, and a second perl invocation then turns the \0s back into newlines.
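The record-joining trick above can be sketched in Python as well (a hypothetical helper, not part of the answer): start a new record at every timestamp line and glue continuation lines onto it with NUL, so each multi-line entry becomes one line that a plain line-based sort or merge can handle.

```python
import re

# Matches timestamp lines such as "01:02:03.6497,2224,0022 foo"
TS = re.compile(r'^\d+:\d+:\d+\.\d+')

def join_records(lines):
    """Yield one NUL-joined string per log entry (hypothetical helper).

    Continuation lines (those not starting with a timestamp) are
    appended to the current record with '\\0' instead of '\\n', so a
    line-oriented sort sees each entry as a single sortable line.
    """
    record = None
    for line in lines:
        line = line.rstrip('\n')
        if TS.match(line):
            if record is not None:
                yield record
            record = line
        elif record is not None:
            record += '\0' + line
    if record is not None:
        yield record

log1 = ["01:02:03.6497,2224,0022 foo\n", "foo1\n", "2foo\n"]
print(list(join_records(log1)))  # one entry, inner newlines replaced by NUL
```

Reversing the transformation afterwards is just replacing `'\0'` with `'\n'` again, exactly as the `perl -lne 's/\0/\n/g'` stage does.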
As very correctly pointed out by the OP, all of the above solutions re-sort the data and don't take advantage of the fact that the already-sorted files can simply be merged. Here's one that does, but which, unlike the others, will only work on two files:
$ sort -m <(perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log1) \
          <(perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log2) |
    perl -lne 's/[\0\r]/\n/g; printf'

And if you save the perl command as an alias, you can get:
$ alias a="perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/'"
$ sort -m <(a log1) <(a log2) | perl -lne 's/[\0\r]/\n/g; printf'

One way to do it (thanks @terdon for the newline-replace idea):
- Concatenate all multi-line entries to single lines by replacing those newlines with e.g. NUL in each input file
- Do a sort -m on the replaced files
- Replace NUL back to newlines
Example
As the multiline concatenation is used more than once, let's alias it away:
alias a="awk '{ if (match(\$0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\\./, _)) \
  { if (NR == 1) printf \"%s\", \$0; else printf \"\\n%s\", \$0 } \
  else printf \"\\0%s\", \$0 } END { print \"\" }'"

Here's the merge command, using the above alias:
sort -m <(a log1) <(a log2) | tr '\0' '\n'

As shell script
In order to use it like this
merge-logs log1 log2

I put it into a shell script:
x=""
for f in "$@";
do x="$x <(awk '{ if (match(\$0, /^[0-9]{2}:[0-9]{2}:[0-9]{2}\\./, _)) { if (NR == 1) printf \"%s\", \$0; else printf \"\\n%s\", \$0 } else printf \"\\0%s\", \$0 } END { print \"\" }' $f)"
done
eval "sort -m $x | tr '\0' '\n'"

Not sure if I can support a variable number of log files without resorting to evil eval.
If using Java is an option for you, try log-merger:
java -jar log-merger-0.0.3-jar-with-dependencies.jar -f 1 -tf "HH:MM:ss.SSS" -d "," -i log1,log2
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar