Check and count duplicates lines

Question

I have a input data like:

chr17   41243232    41243373    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23
chr13   32954112    32954208    BRCA2_ex24

And I need to check for duplicates rows $2 and $3 row, if is duplicated, I need to merged into one line and $4 column print as comma separated.

Output:

chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

Is there any AWK solution to easy process this kind a data? I would appreciate explained solution. Input and output are tab-separated formats. NOTE: First, second and third fields are allays equal.

My try was:

awk -v OFS="\t" '{i=$2 FS $1 FS $3 FS $4} {a[i]=!a[i]?$4:a[i] "," $4} END {for (l in a) {print l,a[l]}}' infile

Thank you for any ideas.

What if the first fields differ while the second and third are equal? Did you try anything yourself by any chance? — James Brown
– James Brown, Commented Sep 27, 2016 at 12:16
@JamesBrown thank you good question. First field is allays the same like second and third. Yes I was trying to arrays in awk - but still self learning. I will edit my post.. — Paul
– Paul, Commented Sep 27, 2016 at 12:19

James Brown · Accepted Answer · 2016-09-27 12:24:13Z

2

$ cat script.awk
{
    a[$2 OFS $3] = $1                  # store $1, last instance
    b[$2 OFS $3] = b[$2 FS $3] $4 ","  # append the $4s
} 
END {
    for (i in a) {                     # order is awk default
        sub(/,$/, "", b[i])            # remove trailing ","
        print a[i], i, b[i]            # print
    }
}

Run:

$ awk -f script.awk infile
chr17 41243471 41243644 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24

answered Sep 27, 2016 at 12:24

James Brown

37.7k8 gold badges52 silver badges64 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Paul Over a year ago

Looks this is work great. You allays have solution James, thank you.

CiaPan · Accepted Answer · 2016-09-27 12:36:29Z

1

Just replace the first assignment with

 i=$1 FS $2 FS $3

And possibly filter the output through sed to replace spaces with tabs:

...  | sed 's/ /    /g'

      space---^   ^--- TAB

Output:

chr13   32954112        32954208        BRCA2_ex23,BRCA2_ex24
chr17   41243639        41243811        BRCA1_ex11
chr17   41243232        41243373        BRCA1_ex11,BRCA1_ex12
chr17   41243471        41243644        BRCA1_ex11

answered Sep 27, 2016 at 12:36

CiaPan

9,5802 gold badges23 silver badges38 bronze badges

Comments

Sundeep · Accepted Answer · 2016-09-27 13:10:02Z

if perl is okay:

$ cat ip.txt 
chr17   41243232    41243373    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23
chr13   32954112    32954208    BRCA2_ex24

$ perl -ale '$k = join "\t",@F[0..2]; $h{$k} .= $h{$k} ? ",$F[3]" : $F[3]; END{ print "$_\t$h{$_}" foreach (keys %h) }' ip.txt 
chr17   41243639    41243811    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

-ale split input line on spaces and save to @F array, strip newlines from input line and add newline for print statements
$k = join "\t",@F[0..2] key to work with - first 3 elements joined by tab
$h{$k} .= $h{$k} ? ",$F[3]" : $F[3] append values to hash variable, add , depending on existing value is empty or not
END{ print "$_\t$h{$_}" foreach (keys %h) } after all lines are processed, print key and value separated by tab. Order of keys is random

Alternate way by extracting key, value using regex:

$ perl -nle '($k,$v)=/^(.*?)\s+(\S+)$/; $h{$k} .= $h{$k} ? ",$v" : $v; END{print "$_\t$h{$_}" foreach (keys %h) }' ip.txt 
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24
chr17   41243639    41243811    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11

Community · Accepted Answer · 2017-05-23 12:00:53Z

$ cat tst.awk
{
    curr = $2 FS $3

    if (curr == prev) {
        buf = buf "," $NF
    }
    else {
        if (NR>1) {
            print buf
        }
        buf = $0
    }

    prev = curr
}
END { print buf }

$ awk -f tst.awk file
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

The differences between this and @JamesBrown's solution are:

This only stores 1 output line in memory at a time while James' stores the whole file.
This will print the lines in the order they appear in the input while James' will print them in "random" (hash) order.
This relies on the input with the key ($2 & $3) values contiguous as you show in your sample while James' will work for input in any order.

Collectives™ on Stack Overflow

Check and count duplicates lines

4 Answers 4

1 Comment

Comments

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

Comments

Comments

Comments

Related