I have a text file of approximately 25 GB. I want to delete duplicate rows based on the value in the second column. When duplicates are found for a value, I want to keep only the row with the highest value in the fourth column and delete the rest. The file is in CSV format and is already sorted.
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482422,45,0.18,-1
2,10482422,45,0.4,-1
2,10482423,45,0.15,-1
2,10482423,45,0.43,-1
2,10482424,45,0.18,-1
2,10482424,45,0.49,-1
2,10482425,45,0.21,-1
2,10482425,45,0.52,-1
2,10482426,45,0.27,-1
2,10482426,45,0.64,-1
2,10482427,45,0.09,-1
2,10482427,45,0.34,-1
2,10482427,45,0.73,-1

In the above example, I just want one maximum surge value for each Cell_id, keeping that row and deleting the other duplicate rows.
Expected output is:
2,10482422,45,0.4,-1
2,10482423,45,0.43,-1
2,10482424,45,0.49,-1
2,10482425,45,0.52,-1
2,10482426,45,0.64,-1
2,10482427,45,0.73,-1

3 Answers
Since the input appears to be grouped/sorted by the 2nd column already, this should be quite simple and doesn't require keeping and sorting the entire data set in memory, only two records at a time.[1]
I first thought of an Awk solution but found it too clumsy to deal with arrays and non-blank field delimiters. Then I decided on a short-ish Python program:
#!/usr/bin/python3
import sys
DELIMITER = ','
def remove_duplicates(records):
    prev = None
    for r in records:
        # parse the fields so the comparison on column 4 is numeric
        r = (int(r[0]), int(r[1]), int(r[2]), float(r[3]), int(r[4]))
        if prev is None:
            prev = r
        elif r[1] != prev[1]:
            # new Cell_id: emit the record kept for the previous one
            yield prev
            prev = r
        elif r[3] > prev[3]:
            # same Cell_id, larger Storm_Surge: keep this record instead
            prev = r
    if prev is not None:
        yield prev
def main():
    for r in remove_duplicates(
        l.rstrip('\n').split(DELIMITER) for l in sys.stdin
    ):
        print(*r, sep=',')
if __name__ == '__main__':
    main()

On my system it has a throughput of ~250,000 records or 5 MB per CPU second.
Usage
python3 remove-duplicates.py < input.txt > output.txt

The program can't deal with column headers, so you need to strip them off:
tail -n +2 < input.txt | python3 remove-duplicates.py > output.txt

If you want to add them back to the result:
{ read -r header && printf '%s\n' "$header" && python3 remove-duplicates.py; } < input.txt > output.txt

[1] This is one major advantage over waltinator's and steeldriver's approaches for data sets that don't fit into main memory.
If you'd sorted them in decreasing order of the 4th field, you could simply have taken the first occurrence of each 2nd-field value using an associative array or hash, e.g. awk -F, '!seen[$2]++' file or perl -F, -ne 'print $_ unless $seen{$F[1]}++'
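For example, a minimal sketch of that idea as a pipeline, assuming GNU sort (which spills to temporary files rather than holding the whole 25 GB in memory) and the input.txt/output.txt names used above, with the header stripped off first:

# sort by Cell_id, then by Storm_Surge in decreasing order, then keep only
# the first (i.e. largest) row seen for each Cell_id
tail -n +2 input.txt | sort -t, -k2,2n -k4,4rn | awk -F, '!seen[$2]++' > output.txt

The seen array still needs one entry per distinct Cell_id, and the external sort is considerably more work than a single pass over already-grouped data, so this is mainly attractive if the input weren't sorted to begin with.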
With the values in increasing order, it's a little trickier to do it in an efficient single pass - you can do so (with a little bit of setup) by printing the previous line each time the key value changes:
awk -F, '
NR==1 {print; next}               # print the header line
NR==2 {key=$2; lastval=$0; next}  # initialize the comparison
$2 != key {
    # print the last (largest) value of the previous key group
    print lastval; key = $2
}
{lastval = $0}                    # save the current line
END {print lastval}               # clean up
' file
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.4,-1
2,10482423,45,0.43,-1
2,10482424,45,0.49,-1
2,10482425,45,0.52,-1
2,10482426,45,0.64,-1
2,10482427,45,0.73,-1

If you don't have too many unique Cell_ids, you could keep track of the already-seen ones in a Perl associative array. If you do have too many (and my Perl script runs out of memory), write a C program to keep the unique ones in a bit field. Here's the Perl.
#!/usr/bin/perl -w
use strict;
my %seen = (); # key=Cell_ID, value=1
my @cols=(); # for splitting input
while( <> ) {                               # read input (STDIN or the files named on the command line)
    @cols = split ',', $_;
    next if ( defined $seen{$cols[1]} );    # skip if we already saw this Cell_Id
    $seen{$cols[1]} = 1;
    print;
}

Here's my test:
walt@bat:~(0)$ cat u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482422,45,0.18,-1
2,10482422,45,0.4,-1
2,10482423,45,0.15,-1
2,10482423,45,0.43,-1
2,10482424,45,0.18,-1
2,10482424,45,0.49,-1
2,10482425,45,0.21,-1
2,10482425,45,0.52,-1
2,10482426,45,0.27,-1
2,10482426,45,0.64,-1
2,10482427,45,0.09,-1
2,10482427,45,0.34,-1
2,10482427,45,0.73,-1
walt@bat:~(0)$ perl ./unique.pl u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482423,45,0.15,-1
2,10482424,45,0.18,-1
2,10482425,45,0.21,-1
2,10482426,45,0.27,-1
2,10482427,45,0.09,-1
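As the test shows, the script keeps the first row it sees for each Cell_id, which with this sort order is the smallest Storm_Surge rather than the largest. One possible workaround, assuming GNU tac is available, is to feed the script the file in reverse and flip the result back afterwards:

# reversed, the largest Storm_Surge of each Cell_id comes first and is the row kept;
# the final tac restores the original order, with the header back on top
tac u.dat | perl ./unique.pl | tac

Memory use is unchanged, since %seen still needs one entry per distinct Cell_id.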