1

I have input data like this:

chr17   41243232    41243373    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23
chr13   32954112    32954208    BRCA2_ex24

I need to check for rows whose $2 and $3 values are duplicated. If they are, the rows should be merged into one line, with the $4 values printed comma-separated.

Output:

chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

Is there an AWK solution to process this kind of data easily? I would appreciate an explained solution. Input and output are tab-separated. NOTE: whenever the second and third fields are equal, the first field is equal too.

My try was:

awk -v OFS="\t" '{i=$2 FS $1 FS $3 FS $4} {a[i]=!a[i]?$4:a[i] "," $4} END {for (l in a) {print l,a[l]}}' infile

Thank you for any ideas.

2
  • 2
    What if the first fields differ while the second and third are equal? Did you try anything yourself by any chance? Commented Sep 27, 2016 at 12:16
  • @JamesBrown Thank you, good question. The first field is always the same whenever the second and third are. Yes, I was trying to use arrays in awk, but I am still learning. I will edit my post. Commented Sep 27, 2016 at 12:19

4 Answers

2
$ cat script.awk
{
    a[$2 OFS $3] = $1                  # store $1, last instance
    b[$2 OFS $3] = b[$2 OFS $3] $4 "," # append the $4s (same key as a[])
} 
END {
    for (i in a) {                     # iteration order is unspecified in awk
        sub(/,$/, "", b[i])            # remove trailing ","
        print a[i], i, b[i]            # print
    }
}

Run:

$ awk -f script.awk infile
chr17 41243471 41243644 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24

1 Comment

This looks like it works great. You always have a solution, James, thank you.
1

Just replace the first assignment in your attempt (the one that includes $4 in the key) with

 i=$1 FS $2 FS $3

And possibly filter the output through sed to replace spaces with tabs:

...  | sed 's/ /    /g'

      space---^   ^--- TAB

Output:

chr13   32954112        32954208        BRCA2_ex23,BRCA2_ex24
chr17   41243639        41243811        BRCA1_ex11
chr17   41243232        41243373        BRCA1_ex11,BRCA1_ex12
chr17   41243471        41243644        BRCA1_ex11
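
Put together, the two steps above might look like the sketch below. It is one possible reading of this answer, not a verbatim copy: the wrapper name merge_then_tabify is hypothetical, the empty-string test replaces the !a[i] test from the question, and the TAB is produced with printf so that it survives copy and paste.

```shell
# merge_then_tabify: hypothetical wrapper name for the pipeline described above.
merge_then_tabify() {
    tab=$(printf '\t')                   # a literal TAB for the sed replacement
    awk '{ i = $1 FS $2 FS $3 }          # key without $4, so duplicates collide
         { a[i] = (a[i] == "") ? $4 : (a[i] "," $4) }
         END { for (l in a) print l, a[l] }' "$@" |
    sed "s/ /$tab/g"                     # turn every space into a TAB
}

# Usage with the file name from the question:
#   merge_then_tabify infile
```

The order of the output lines depends on the awk implementation, as with any for (l in a) loop.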


1

If Perl is okay:

$ cat ip.txt 
chr17   41243232    41243373    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23
chr13   32954112    32954208    BRCA2_ex24

$ perl -ale '$k = join "\t",@F[0..2]; $h{$k} .= $h{$k} ? ",$F[3]" : $F[3]; END{ print "$_\t$h{$_}" foreach (keys %h) }' ip.txt 
chr17   41243639    41243811    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24
  • -ale splits each input line on whitespace and saves the fields to the @F array, strips the newline from each input line, and adds a newline to each print statement
  • $k = join "\t",@F[0..2] is the key to work with: the first 3 fields joined by tabs
  • $h{$k} .= $h{$k} ? ",$F[3]" : $F[3] appends the value to the hash entry, prefixing a comma only if the entry is already non-empty
  • END{ print "$_\t$h{$_}" foreach (keys %h) } runs after all lines are processed and prints each key and value separated by a tab. The order of the keys is not predictable

Alternate way by extracting key, value using regex:

$ perl -nle '($k,$v)=/^(.*?)\s+(\S+)$/; $h{$k} .= $h{$k} ? ",$v" : $v; END{print "$_\t$h{$_}" foreach (keys %h) }' ip.txt 
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24
chr17   41243639    41243811    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11


1
$ cat tst.awk
{
    curr = $2 FS $3

    if (curr == prev) {
        buf = buf "," $NF
    }
    else {
        if (NR>1) {
            print buf
        }
        buf = $0
    }

    prev = curr
}
END { print buf }

$ awk -f tst.awk file
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

The differences between this and @JamesBrown's solution are:

  1. This only stores 1 output line in memory at a time while James' stores the whole file.
  2. This will print the lines in the order they appear in the input while James' will print them in "random" (hash) order.
  3. This relies on rows with the same key ($2 and $3) being contiguous in the input, as in your sample, while James' will work for input in any order.
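
If both properties are wanted (input order preserved, duplicates allowed anywhere in the file), one option is to remember the order in which keys are first seen. This is only a sketch under the assumption of tab-separated input; like James' script it keeps all merged lines in memory, and the name merge_keep_order is made up for illustration.

```shell
# merge_keep_order: hypothetical helper name, not from either answer.
merge_keep_order() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3
        if (!(key in names)) {
            order[++n] = key             # remember the first-seen position
            names[key] = $4
        } else {
            names[key] = names[key] "," $4
        }
    }
    END {
        for (j = 1; j <= n; j++)         # replay keys in input order
            print order[j], names[order[j]]
    }' "$@"
}

# Usage: the duplicate of the first row appears after an unrelated row.
printf 'chr17\t41243232\t41243373\tBRCA1_ex11\nchr13\t32954112\t32954208\tBRCA2_ex23\nchr17\t41243232\t41243373\tBRCA1_ex12\n' |
    merge_keep_order
```

The trade-off is memory: the streaming script above holds one line at a time, while this one holds every distinct key until END.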
