I am a noob in Linux. I have a file like this:
col1 col2 col3
ID1234567-DNA_A01 chr1_10203040_T/C gene 0
ID1234568-DNA_A02 chr1_10203050_T/A gene 0
ID1234569-DNA_A03 chr1_10203060_A/G gene 0
ID1234570-DNA_A04 chr1_10203070_C/T gene 0
I want to use only the first column and divide each line into 4 columns:
#CHROM POS REF ALT
1 10203040 T C
1 10203050 T A
1 10203060 A G
1 10203070 C T
I tried this:
awk 'BEGIN{OFS="\t";FS="\t"; print"#CHROM","POS","REF","ALT"} | cut -d' ' -f2- {print substr($1,4,1),substr($1,6}' old_file > new_file
I know I did it wrong, but any suggestions would be helpful! Thanks
3 Answers
Maybe you can try it like this:
cut -d " " -f 2 test.txt | awk -F '[_/]' 'BEGIN {printf "#CHROM\tPOS\tREF\tALT\n"} NR>1 {printf "%s\t%s\t%s\t%s\n", $1, $2, $3, $4}'
Here test.txt is the name of your file, and NR>1 skips its header line. If you want to redirect the output to a file, just add > new_file.txt at the end of the command.
I'd go with:
awk 'NR>1 {print $2}' file \
| awk -F'[_/]' 'BEGIN{OFS="\t"; print "#CHROM","POS","REF","ALT"}{$1=$1}1'
- First awk: output the second field only.
- Second awk: choose [_/] as the field separator, print the new header and the fields. $1=$1 triggers reorganisation of the fields, which is necessary as we change the output field separator to \t.
- You may add | column -t to make the columns line up.
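To see why the $1=$1 assignment matters, here is a small sketch on a single value taken from the question. The variable names line, before and after are just for illustration. Without the assignment awk prints the record exactly as it was read, so OFS never gets used; with it, awk rebuilds the record from its fields and joins them with OFS.

```shell
line='chr1_10203040_T/C'

# No field is assigned, so $0 is printed unchanged and OFS is ignored.
before=$(printf '%s\n' "$line" | awk -F'[_/]' 'BEGIN{OFS="\t"} 1')

# Assigning to a field makes awk rebuild $0, joining the fields with OFS.
after=$(printf '%s\n' "$line" | awk -F'[_/]' 'BEGIN{OFS="\t"} {$1=$1} 1')

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

The first result still contains the underscores and the slash, the second one is tab-separated.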
We could do it in one go with a single awk call, but then you need to use split, which is more complicated I think.
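For reference, a one-pass version with split might look roughly like the sketch below. The function name split_variants and the two-row sample file are made up for the demo, and I am assuming the input is whitespace-separated with one header line. It also strips the leading "chr", since the question asks for 1 rather than chr1.

```shell
#!/bin/sh
# split_variants FILE: hypothetical helper, not from the original answers.
split_variants() {
    awk '
        BEGIN { OFS = "\t"; print "#CHROM", "POS", "REF", "ALT" }
        NR > 1 {
            # Break the second field on "_" or "/" into part[1..n].
            n = split($2, part, "[_/]")
            if (n != 4) next              # skip rows that do not look like chr_pos_ref/alt
            sub(/^chr/, "", part[1])      # chr1 -> 1
            print part[1], part[2], part[3], part[4]
        }
    ' "$1"
}

# Demo on a small sample shaped like the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
col1 col2 col3
ID1234567-DNA_A01 chr1_10203040_T/C gene 0
ID1234568-DNA_A02 chr1_10203050_T/A gene 0
EOF
split_variants "$sample"
rm -f "$sample"
```

Passing the separator as the string "[_/]" rather than a regex literal avoids having to escape the slash.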
Output:
#CHROM POS REF ALT
chr1 10203040 T C
chr1 10203050 T A
chr1 10203060 A G
chr1 10203070 C T
If you have GNU awk (gawk), then, notwithstanding the advice here, you could consider capturing the parts you want using a regular expression rather than a string split:
$ gawk '
BEGIN{OFS="\t"; print "#CHROM","POS","REF","ALT"}
match($2,/chr([0-9])_([0-9]+)_([ACGT])[/]([ACGT])/,a) {print a[1],a[2],a[3],a[4]}
' old_file
#CHROM POS REF ALT
1 10203040 T C
1 10203050 T A
1 10203060 A G
1 10203070 C T
(Other awk implementations have the match function, but the GNU version extends that with a capture group array.)
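If gawk is not available, something similar can be approximated with the plain match function, which only sets RSTART and RLENGTH. The sketch below is one possible way, not a drop-in equivalent: the function name extract_variants and the sample file are invented for the demo, and I am assuming numeric chromosome names (chr1, chr12, ...) and a whitespace-separated input.

```shell
#!/bin/sh
# extract_variants FILE: hypothetical helper using only RSTART/RLENGTH.
extract_variants() {
    awk '
        BEGIN { OFS = "\t"; print "#CHROM", "POS", "REF", "ALT" }
        match($2, /chr[0-9]+_[0-9]+_[ACGT]\/[ACGT]/) {
            # Take the matched text minus the 3-character "chr" prefix.
            hit = substr($2, RSTART + 3, RLENGTH - 3)
            # Turn the remaining "_" and "/" delimiters into tabs.
            gsub("[_/]", OFS, hit)
            print hit
        }
    ' "$1"
}

# Demo on a small sample shaped like the data in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
col1 col2 col3
ID1234567-DNA_A01 chr1_10203040_T/C gene 0
ID1234568-DNA_A02 chr1_10203050_T/A gene 0
EOF
extract_variants "$sample"
rm -f "$sample"
```

The header line never matches the pattern, so it is skipped without needing NR>1.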