I have a dataset of images labeled/classified by characteristics, where an image can have more than one label. I want to count how many of each identifier I have. A toy dataset is created below, with different colors being the labels.

bballdave025@MY-MACHINE /home/bballdave025/toy
$ touch elephant_grey.jpg && touch zebra_white_black.jpg && touch rubik-s_cube_-_1977-first-prod_by_ErnoRubik-_red_orange_yellow_white_blue_green.jpg && touch Radio_Hotel.Washington_Heights.NYC-USA_green_yellow_blue_red_orange_reddish-purple_teal_grey-brown.jpg && touch Big_Bird__yellow_orange_red.jpg

Let's make it more easily visible. The files in the initially labeled dataset are shown below. (The | awk -F'/' '{print $NF}' is just meant to take off the ./ or path/to/where/the/jpegs/are/ that would otherwise be before the filename.)

$ find . -type f | awk -F'/' '{print $NF}'
Big_Bird__yellow_orange_red.jpg
elephant_grey.jpg
Radio_Hotel.Washington_Heights.NYC-USA_green_yellow_blue_red_orange_reddish-purple_teal_grey-brown.jpg
rubik-s_cube_-_1977-first-prod_by_ErnoRubik-_red_orange_yellow_white_blue_green.jpg
zebra_white_black.jpg

Those are the filenames for labeled versions of the images. The corresponding originals are below:

$ find ../toy_orig_bak/ -type f | awk -F'/' '{print $NF}'
Big_Bird_.jpg
elephant.jpg
Radio_Hotel.Washington_Heights.NYC-USA.jpg
rubik-s_cube_-_1977-first-prod_by_ErnoRubik-.jpg
zebra.jpg

This is to show that the color labels are inserted between the filename and the dot extension. They are separated from each other and from the original filename by a (delimiting) _ character. (There are rules for the label names and for the filenames[1].) The only allowed color strings at this initial point are any of {black, white, grey, red, orange, yellow, green, blue, reddish-purple, teal, grey-brown}.

I further want to show that other labels may be added, as long as they're part of my controlled vocabulary, something which can be changed only by me. Imagine a file named rainbow.jpg gets put in with the original filenames (touch ../toy_orig_bak/rainbow.jpg, for those of you following along for reproducibility). I decide that I want to add indigo and violet to my controlled vocabulary list, so I can create the labeled filename,

$ touch rainbow_red_orange_yellow_green_blue_indigo_violet.jpg

Desired Output

Again, I want a count of each of the labels. For the dataset I've set up (including that last labeled picture of a rainbow), the correct output would be

      1 black
      3 blue
      3 green
      1 grey
      1 grey-brown
      1 indigo
      4 orange
      4 red
      1 reddish-purple
      1 teal
      1 violet
      2 white
      4 yellow

(The counts were performed somewhat manually, due to my grep confusion.)


Attempts and a note on the details of the solution I want

Research below

My first thought (although I did worry about delimiter consumption) was to look at the surrounding delimiters: '_' before and '_' or '.' after.

Here's my first grep attempt

find . -type f -iname "*.jpg" | \
 \
    grep -o "[_]\(black\|white\|grey\|red\|orange\|yellow\|green\|blue\|"\
"reddish-purple\|teal\|grey-brown\|indigo\|violet\)[_.]" | \
 \
        tr -d [_.] | sort | uniq -c

and its output

      3 blue
      1 green
      1 grey
      1 orange
      3 red
      1 teal
      1 violet
      1 white
      3 yellow

Which is not the same as before. Here's the comparison.

   Before              |   Now
-----------------------|---------------------
      1 black          |
      3 blue           |      3 blue
      3 green          |      1 green
      1 grey           |      1 grey
      1 grey-brown     |
      1 indigo         |
      4 orange         |      1 orange
      4 red            |      3 red
      1 reddish-purple |
      1 teal           |      1 teal
      1 violet         |      1 violet
      2 white          |      1 white
      4 yellow         |      3 yellow
                       |

I know this is happening because the regex engine consumes the second delimiter[2].

Here is the crux of my main question: (I do want to solve my count problem, and I'll talk about some solutions I've researched and considered myself, but) the detail I want to know is about truly regular expressions and consuming the delimiter.

I want to get a count of each identifier string, and I'm wondering if I can do it with this approach and (POSIX) Basic Regular Expressions (BRE; see Note 2 and the reddit thread, archived as a gist), specifically with grep.

Any of sed, awk, IFS with read, etc. are welcome, too. I'm sure someone has a way to solve this problem with Perl (dermis and feline can be divorced by manifold methods), and I'd be glad to get that one, too.

Basically, I am absolutely okay with other solutions to the task of getting a count of each identifier string. However, if it's true that there's no way of stepping back the engine with a Basic Regular Expression engine (that's truly regular), I want to know. I've thought of zero-width matches, lookaheads, and look-behinds, but I don't know how these play out in POSIX Basic Regular Expressions or in mathematically/grammatically regular language parsers.


One thing I realize I wasn't taking into account

The point of the rules (see note [1]) was to allow the regex to take advantage of the fact that we should be able to assure ourselves that we're only getting the part of the classified filename with labels, as we only allow one of a finite set of strings preceded by an underscore and followed by either an underscore or a dot, with the dot only happening before the file extension. (I guess we can't be absolutely certain, as the original, pre-labeled filename could have one of the labels immediately preceding the dot - something like a_sunburn_that_is_bright_red.jpg, but that's something for which I check and correct by adding a specific non-label string before the dot and extension.)

My regex, imagining that it could get past the delimiter being consumed, would still allow the following example problems

the_new_red_car_-_1989_red_black_silver.jpg
  • would return {red, red, silver} as is,
  • {red, red, black, silver} if working without consuming the 2nd '_',
  • whereas {red, black, silver} is desired
parrot_at_blue_gold_banquet_-_a_black_tie_affair_yellow_red_green.jpg
  • would return {blue, black, yellow, green} as is,
  • {blue, gold, black, yellow, red, green} if not consuming the 2nd '_',
  • whereas {yellow, red, green} is desired

Extra points for answers and discussions that take that into account. ; )
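One way to get the "desired" sets above without any regex is to walk the filename from the right and stop at the first token that is not a label. The following is only a minimal sketch in POSIX sh: the helper name trailing_labels is hypothetical, and 'silver' is added to the vocabulary purely so that the first example filename works.

```shell
#!/bin/sh
# Assumption: 'silver' is added to the vocabulary only for this example.
vocab='black white grey red orange yellow green blue reddish-purple teal grey-brown indigo violet silver'

# Hypothetical helper: print the unbroken run of vocabulary labels that ends
# a filename, one label per line. The ( ) body keeps its variables local.
trailing_labels() (
    stem=${1%.*}                    # drop the extension after the last dot
    run=
    while :; do
        last=${stem##*_}            # text after the last underscore
        [ "$last" != "$stem" ] || break     # no underscore left
        [ -n "$last" ] || break             # empty token, e.g. from '__'
        case " $vocab " in
            *" $last "*) ;;                 # token is a known label
            *) break ;;                     # first non-label ends the run
        esac
        run="$last${run:+ $run}"    # prepend, preserving filename order
        stem=${stem%_*}             # remove the token just examined
    done
    # Labels contain only [A-Za-z0-9-], so unquoted splitting is safe here.
    [ -z "$run" ] || printf '%s\n' $run
)

trailing_labels 'the_new_red_car_-_1989_red_black_silver.jpg'
```

This still cannot tell a label from an original filename that happens to end in a label (the a_sunburn_that_is_bright_red.jpg case), so the non-label marker string mentioned earlier remains necessary.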


Research and ideas

There are a few discussions on different StackExchange sites, like this one, that one, another one, but I think the Unix & Linux discussion here (archived) is the best one. I think that one of the approaches in this answer from @terdon ♦ or in the answer with hashes – from @Sobrique – might be useful.

I keep thinking that some version of ^.*\([_][<any-of-the-labels>]\)\+[.]jpg$ might be key to the situation, but I haven't been able to put together that solution today. If you know how it can help, you're welcome to give an answer using it; I'm going to wait for a fresh brain tomorrow morning.

Edit: @ilkkachu successfully used this idea.


Why am I doing this? I'm training a CNN to recognize different occurrences (not colors) in pictures of old and often handwritten books. I want to make sure the classes are balanced as I want. Also, I'll compare this with another method that doesn't look at the delimiter to make sure I don't have any problems like a '_yllow' (instead of '_yellow'), or a '_whiteorange' (instead of '_white_orange'). Most of the labels are put on through a Java program I've put together, but I've given a little leeway for people to change the filenames themselves in case of multiple labels for one file. Having given that permission, I have the responsibility of verifying legal labeled filenames.


Notes

[1] The rules for the identifying/classifying labels are:

The identifiers can be any of a finite set of strings which can contain only characters in [A-Za-z0-9-] but not underscores.


The bare filenames (without dot and extension) can consist of any ASCII characters except: 1) non-printable/control characters; 2) spaces or tabs; OR 3) any of [!"#$%&/)(][}{*?] (struck out; see the next paragraph for the real 3)). (Note that this means the bare filenames CAN have an underscore, '_', or even several of them.)


Edit: I had my no-no list of characters as is now crossed (struck) out above when @ilkkachu gave the accepted answer. One option of that answer makes excellent use of the '@' which was then not in the excluded character group, but which I actually don't allow in my filenames. There are other omissions in the original character group. As I actually want it, the above paragraph should be amended with the following.

3) any of '[] ~@#$%^&|/)(}{[*?><;:"`'"'"']' Edit: Now this compiles as a BRE. (This was the simplest and most-readable BRE I could come up with.)

That beautifully crazy character group means that any of

{ [, ], (space), ~, @, #, $, %, ^, &, |, \, /, ), (, }, {, [, *, ?, >, <, ;, :, ", `, ' }

is not allowed – and neither is any tab (\t, ...), nor any non-printing/control characters. Some of these are already standard on the no-no list for filenames on different OSs, but I give my complete set (when I'm in charge of creating the filenames).


[2] Here is what I mean by the delimiter being consumed. I'll do my best to illustrate an example with our (Basic) Reg(ular)Ex(pression),

"[_]\(black\|white\|grey\|red\|orange\|yellow\|green\|blue\|"\
"reddish-purple\|teal\|grey-brown\|indigo\|violet\)[_.]"

Here goes.

This missing of some of the color strings is happening because the regex engine consumes the second delimiter.

For example, using O to denote part of a miss (non-match) and X to denote part of a hit (match), with YYYYY denoting a complete match for the whole regex pattern, we get the following behavior.

Engine goes along looking for '_'

engine is here
       |
       v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOO

Matches 
      [_]
          with '_'
engine is at
        |
        v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOX

Matches
 \(...\|red\|...\)
                  with 'red'    
  engine is at
           |
           v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOXXXX

Matches
          [_.]
               with '_'
   engine is at
            |
            v
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOXXXXX

We have a whole match! 

rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
OOOOOOOYYYYY

Given the  -o  flag, the engine outputs

'_red_'

The  'tr -d [_.]' takes off the surrounding underscores, 
and our output line becomes 

'red'

The problem now is that the engine cannot go back to find the '_' before 'orange', or at least it can't do so using any process I know about from my admittedly imperfect knowledge of Basic Regular Expressions. As far as a REGULAR expression engine, using a REGULAR grammar and a REGULAR language parser knows, the whole universe in which it's searching now consists of

orange_yellow_green_blue_indigo_violet.jpg

(I don't know if this statement is correct from a mathematical/formal-language point of view, and I'd be interested to know.)

And the process continues as from the first, beginning with Engine goes along looking for '_'

orange_yellow_green_blue_indigo_violet.jpg
OOOOOOXXXXXXXX

Match!

orange_yellow_green_blue_indigo_violet.jpg
OOOOOOYYYYYYYY

Engine spits out  '_yellow_'  which is 'tr -d [_.]'-ed

Engine cannot go back, so its search universe is now

green_blue_indigo_violet.jpg

and we continue with

green_blue_indigo_violet.jpg
OOOOOXXXXXX

Match!

green_blue_indigo_violet.jpg
OOOOOYYYYYYOOOOOOYYYYYYYY

That last match being on the '.' from [_.]
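The walk-through above can be reproduced on a single filename. This is a minimal sketch, assuming a grep that supports the -o option (GNU and BSD grep do; POSIX does not require it) and using -E so that the alternation needs no backslashes. The variable names are illustrative only. It also shows, for comparison, what happens when the trailing delimiter is simply left out of the pattern.

```shell
#!/bin/sh
# Illustrative names; only the filename and the colour list come from the question.
labels='red|orange|yellow|green|blue|indigo|violet'
name='rainbow_red_orange_yellow_green_blue_indigo_violet.jpg'

# Pattern with a delimiter on both sides: each match includes the trailing
# '_', so the search resumes after it and the next label has no leading '_'.
both_sides=$(printf '%s\n' "$name" | grep -E -o "_($labels)[_.]" | tr -d '_.')

# Pattern with the leading delimiter only: nothing after the label is
# included in the match, so every label is found.
leading_only=$(printf '%s\n' "$name" | grep -E -o "_($labels)" | tr -d '_')

printf 'both sides:   %s\n' "$(printf '%s' "$both_sides" | tr '\n' ' ')"
printf 'leading only: %s\n' "$(printf '%s' "$leading_only" | tr '\n' ' ')"
```

For this filename the first form finds red, yellow, blue and violet, matching the steps traced in this note, while the second finds all seven labels.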

[3] More formally, I want to know if it can be done with a real regular expression, i.e. one which can define a regular language and whose language is a context-free language, cf. Wikipedia's Regex article (archived). I think this is the same as a POSIX regular expression, but I'm not sure.

Refs. [A] (archived), [B] (archived), [C] (archived),


Dang it, I know there's a missing ending parenthesis up there in the text, somewhere, because I noticed it and went up to fix it. When I got up into the text, I couldn't remember the context of the parenthesis, so it's still there, just mocking me. I found it, and I bolded it! I'll probably take the bold formatting and this note down, soon, but I'm sharing my happiness right now.

4 Answers

4

Don't use grep for this. Use something like awk or perl. For example, the following perl script finds and counts the tags, sorts them by count, and prints them in a format similar to uniq -c but using a tab to separate the count from the tag:

#!/usr/bin/perl

use strict;
use warnings;

my %allowed; # hash to hold the allowed tags
my %tags;    # hash to hold a counter for each found tag

# array containing the allowed tags
my @a = qw(black white grey red orange yellow green blue
           reddish-purple teal grey-brown indigo violet);

# convert to a hash for quick lookup
map { $allowed{$_} = 1 } @a;

# main loop, read stdin and increment counter for each tag found
while(<<>>) {
  chomp;
  # find all matches by splitting on _ or . and store
  # in local array @t
  my @t = split /[_.]/;

  # increment count (in %tags hash) for each match if in allowed list
  map { $tags{$_}++ if exists $allowed{$_} } @t ;
};

# sort the %tags hash by value (highest count first) and print
foreach my $k (sort { $tags{$b} <=> $tags{$a} } keys %tags) {
  printf "%5i\t%s\n", $tags{$k}, $k
};

Save it as, e.g. count-tags.pl and make it executable with chmod +x count-tags.pl and run it like so:

$ find . | ./count-tags.pl 
    4   red
    4   yellow
    4   orange
    3   green
    3   blue
    2   white
    1   grey-brown
    1   teal
    1   reddish-purple
    1   grey
    1   violet
    1   black
    1   indigo

PS: to re-iterate and emphasise what @ilkkachu said in his answer, don't use a delimiter that might be in the filename, that's just making things more difficult for yourself. e.g. my first attempt at this tried to avoid listing all the allowed tags, but that incorrectly made false tags out of filename components like Heights, Hotel, 1977-first-prod, ErnoRubik-, and Bird.

If you really must use a delimiter that's in the filename, use another delimiter to separate the filename from the tags - e.g. @ as suggested by ilkkachu or __ (two underscores) or anything else that you can guarantee will not be in either the filename or any of the tags.

In other words, make the filenames easy to process.
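For anyone without Perl at hand, the same split-and-look-up idea can be sketched with POSIX awk. This is an illustrative equivalent, not the script above: the function name count_labels is hypothetical, and the output is sorted by label rather than by count.

```shell
#!/bin/sh
# Hypothetical helper: read filenames on stdin, print "count label" lines.
count_labels() {
    awk -F'[_.]' -v vocab='black white grey red orange yellow green blue reddish-purple teal grey-brown indigo violet' '
        BEGIN {
            # turn the space-separated vocabulary into a lookup table
            n = split(vocab, v, " ")
            for (i = 1; i <= n; i++) allowed[v[i]] = 1
        }
        {
            # every field between "_" or "." delimiters is a candidate tag
            for (i = 1; i <= NF; i++)
                if ($i in allowed) count[$i]++
        }
        END {
            for (k in count) printf "%d %s\n", count[k], k
        }
    ' | LC_ALL=C sort -k 2
}

find . -type f -name '*.jpg' | count_labels
```

Like the Perl version, this counts any field that equals an allowed tag, including one that appears inside the original filename.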

4
  • m/_(.*?)[_.]/g has the same problem as their initial grep: the pattern matches against and consumes the trailing separator, which then can't be used as the initial separator of the next tag. E.g. from somefile_red_orange_yellow_white.jpg, it only matches _red_ and _yellow_. With Perl you could perhaps use split() instead? Commented Jun 5 at 16:58
  • yeah, split will work better. Commented Jun 6 at 8:01
  • Thanks. I always like to have the (well, a) Perl solution. I considered making this the accepted answer, but I appreciated the notes on how regexes work with really Regular expressions. Commented Jun 7 at 14:59
  • I think this is the solution that most people having a similar problem will use. I really appreciate that you solved the problem as I gave it, but also gave me pointers on how to better make filenames in the future. Commented Jun 7 at 15:03
3

Some thoughts:

Using the underscore to delimit the filename proper and the color tags is a bit awkward when the filename itself can contain underscores. It would make it easier if you could use some other character there. That is, instead of Big_Bird__yellow_orange_red.jpg, having e.g. an @ in between, like so: Big_Bird_@yellow_orange_red.jpg

If you had that, you could just start with the list of filenames, remove everything up to the @ and everything starting from the ., and then split on underscores.

% ls
Big_Bird_@yellow_orange_red.jpg
Radio_Hotel.Washington_Heights.NYC-USA@green_yellow_blue_red_orange_reddish-purple_teal_grey-brown.jpg
elephant@grey.jpg
rainbow@red_orange_yellow_green_blue_indigo_violet.jpg
rubik-s_cube_-_1977-first-prod_by_ErnoRubik-@red_orange_yellow_white_blue_green.jpg
zebra@white_black.jpg

% printf "%s\n" * | sed -e 's/^.*@//; s/\..*$//' | tr _ '\n'  | sort |uniq -c
   1 black
   3 blue
   3 green
   1 grey
   1 grey-brown
   1 indigo
   4 orange
   4 red
   1 reddish-purple
   1 teal
   1 violet
   2 white
   4 yellow

This would include any mistyped tags in the output.


If you keep with an underscore, you can still do it almost the way you did, by listing every tag in the RE. Just match on the underscore before the tag only, and use the fact that grep doesn't do overlapping matches to your advantage.

% ls
Big_Bird__yellow_orange_red.jpg
Radio_Hotel.Washington_Heights.NYC-USA_green_yellow_blue_red_orange_reddish-purple_teal_grey-brown.jpg
elephant_grey.jpg
rainbow_red_orange_yellow_green_blue_indigo_violet.jpg
rubik-s_cube_-_1977-first-prod_by_ErnoRubik-_red_orange_yellow_white_blue_green.jpg
zebra_white_black.jpg

% printf "%s\n" * | grep -E -o "_(black|white|grey|red|orange|yellow|green|blue|reddish-purple|teal|grey-brown|indigo|violet)" | tr -d _ | sort | uniq -c
   1 black
   3 blue
   3 green
   1 grey
   1 grey-brown
   1 indigo
   4 orange
   4 red
   1 reddish-purple
   1 teal
   1 violet
   2 white
   4 yellow

Or use (_(black|...))+\.jpg$ to lock to the end of the filename too, at the cost of needing some extra post-processing.
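A rough sketch of that end-locked variant, including the extra post-processing, might look like the following. The function name end_locked_labels is made up for illustration, the parrot filename is borrowed from the question, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
labels='black|white|grey|red|orange|yellow|green|blue|reddish-purple|teal|grey-brown|indigo|violet'

# Hypothetical helper: filenames on stdin, trailing labels on stdout (one per line).
end_locked_labels() {
    grep -E -o "(_($labels))+\.jpg\$" |   # keep only the run of tags that ends the name
        sed -e 's/\.jpg$//' -e 's/^_//' | # drop the extension and the first underscore
        tr '_' '\n'                       # one label per line
}

printf '%s\n' \
    'Big_Bird__yellow_orange_red.jpg' \
    'parrot_at_blue_gold_banquet_-_a_black_tie_affair_yellow_red_green.jpg' |
    end_locked_labels | sort | uniq -c
```

Because the match must run all the way to .jpg, the blue and black inside the parrot filename are not counted; only yellow, red and green are.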

Though it works here, I have the faint impression that some regex engines might prefer an earlier branch of an alternation, even if a later one is longer. That would mean you'd need to put grey-brown before grey.

As you say, you'd still need to do something to deal with files where the filename proper contains something that looks like a color tag. An explicit list of tags would also silently ignore mistyped tags, and if it's locked to the end, it would also ignore valid tags preceding an invalid tag.

Neither of the above does anything about filenames with the same tag twice.


As sidenotes:

  • \| isn't actually part of POSIX BRE and there's no substitute for it. Even if it's widely supported on GNU systems, e.g. the sed on macOS doesn't support it. Similarly \+ isn't BRE, but it can be replaced with \{1,\}. Just use grep -E if you want ERE features. Not having to sprinkle backslashes everywhere also makes the regex prettier.

  • tr takes just a list of characters, it doesn't need brackets. tr -d [_.] would delete brackets as well as the underscore and dot, while tr -d ._ would do to just delete the latter. (and it doesn't look like a glob, so it doesn't need quoting in e.g. zsh where the shell errors on a non-matching glob)
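The tr point is easy to check on a made-up string that contains brackets. This is just a small demonstration; the sample value and variable names are invented for it.

```shell
#!/bin/sh
# Invented sample string containing brackets, underscores and a dot.
sample='[_red_.]'

# The brackets are part of the set, so they are deleted too.
with_brackets=$(printf '%s' "$sample" | tr -d '[_.]')

# Only the dot and the underscore are deleted.
plain_set=$(printf '%s' "$sample" | tr -d '._')

printf '%s\n' "$with_brackets" "$plain_set"
```

The first line printed is red and the second is [red].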

6
  • Thank you! I appreciate the solution and the comments about BRE. It was more theoretical, but it was something I was wondering. A system I had during grad school research was stripped very bare, and it only had the option for BRE. I don't work with any such systems now, but I have a theoretical bent and often like the puzzle aspect. Commented Jun 3 at 13:45
  • The solution using '@' to delimit filename from tags is doubtless more efficient. It would require keeping track of original filenames--something already done with the ../toy_orig_bak/ directory--but wouldn't be too difficult. My true reservation is that I "shudder" at using @ in a filename. This isn't extremely logical, but I like to keep my filenames extremely parsable, avoiding even possibly-special characters. : ) I find it a satisfying challenge to parse filenames of others who don't avoid special characters and spaces in filenames, but if it is I who makes the file, I keep it easy. Commented Jun 3 at 14:08
  • I'm accepting your answer due to the variety of solutions, but I especially appreciate the (_(black|...))+\.jpg$ to lock the end of the filename. Compute cycles aren't a limiting factor, and I like the clarity of it. I also appreciated the sidenotes; I always appreciate deepening my knowledge of bash tools. Commented Jun 3 at 14:18
  • @bballdave025, ah, oops, I had missed the fine print about forbidden chars. :) Anyway, the point was mostly that any sort of separator that doesn't happen to appear in the original filenames would make it easier. Something like @ or % should be relatively safe in that they aren't special for (most) shells. Or :, but it's forbidden on Windows. But you could use even something like ---, since the combination likely doesn't appear naturally, but it doesn't contain any "bad" characters. Well, I guess the tags could be stored in a separate file to avoid messing with filenames at all... Commented Jun 3 at 20:58
  • It's a great point, whether it be @, %, or something else. I really like the triple dash. I appreciate that you give me tips for better setup, but also answer the original question. Commented Jun 7 at 15:02
2

To answer your question about BRE (and truly regular expressions), they match as early as possible and maximally, so you don't actually need to search for the delimiters at all.

Rather, searching for contiguous runs of non-underscores will match exactly whole labels; you just need to help it along a little by getting rid of the other junk.

Firstly, trim off all the stuff that's not labels; directory paths, the dotted suffix, and the original basename before the first underscore:

sed '
      s@.*/@@;
      s!^[^_]*_!!;
      s#\.[^._]*$##;
    ' < list > list-stage2

Then split the tags into separate lines; either

grep -o '[^_][^_]*' < list-stage2 > list-stage3

or

tr _ \\n < list-stage2 > list-stage3

Then remove labels with embedded dots (which you probably don't want):

grep -vF . < list-stage3 > list-stage4

Then get the summary counts:

sort < list-stage4 | uniq -c

Apply further formatting adjustments as needed.

Combine into a pipeline to avoid leaving temp files scattered everywhere.
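Put together, the stages could look something like this. It is only a sketch: the wrapper name count_tags is hypothetical, the grep -o variant of the splitting step is used (so empty tokens from double underscores are skipped), and LC_ALL=C is added just to make the sort order predictable.

```shell
#!/bin/sh
# Hypothetical wrapper: filenames (with or without directories) on stdin.
count_tags() {
    sed -e 's@.*/@@' -e 's!^[^_]*_!!' -e 's#\.[^._]*$##' |  # trim path, first name part, extension
        grep -o '[^_][^_]*' |                              # one token per line
        grep -vF . |                                       # drop tokens containing a dot
        LC_ALL=C sort | uniq -c
}

find . -type f | count_tags
```

Note that only the part of the basename before the first underscore is trimmed, so later pieces of a multi-underscore basename (Bird, cube, by, ...) still show up in the counts and need filtering against the allowed labels.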

PS: GNU find can do find ... -printf '%f\n' which prints the basename of each file without its directory path.

4
  • I also really like this approach. I was a little put off by the temporary files, but you mentioned that this feature ought to be fixed with a pipeline. I especially appreciate the -printf '%f\n', something I hadn't known before. I honestly had trouble deciding whether to put the checkmark on your answer or that which @ilkkachu wrote, especially since you both paid attention to the BRE. You even talked about truly regular expressions. Your solution does get the job done. Thanks! Commented Jun 3 at 13:58
  • Ultimately, I'm not sure if I can strip away all of the non-label stuff, though I do think that somehow keeping a list of your stripped-away stuff and the remaining labels, then doing a final trim based on contiguous allowed labels might be golden. I'd have to do some code profiling (timing) to see if that might process faster. Compute resources aren't a limiting factor in my problem, but I didn't mention that in my question. Commented Jun 3 at 14:00
  • I'll tag @Charles_Duffy in these comments, as the -printf '%f\n' is a possible workaround to the find_hack(){ (( $# || exit )); /usr/bin/find $(printf '%q ' "$@") | awk -F'/' '{print $NF}'; } problem. This '%f' does exactly what I was trying to do in that specific case (get only the basename without directory path), but it doesn't solve the general issue of modifying find by passing exact parameters and adding a piped function. I'll be investigating the general issue, but only after I complete the project that led me to both this question and the '%q ' issue. Commented Jun 3 at 14:27
  • (The '%q ' issue being from this SO post.) Commented Jun 3 at 14:31
2

Regex with negative-lookahead might be a good way to go as well:

ls -1|grep -oP '[^_.@]+(?=.*\.)(?!.*@)'|sort|uniq -c

OR

printf "%s\n" *|grep -oP '[^_.@]+(?=.*\.)(?!.*@)'|sort|uniq -c

OR

find . -type f| grep -oP '[^_.@]+(?=.*\.)(?!.*@)'|sort|uniq -c

All three are the same; they differ only in whether ls, printf, or find produces the input. The pattern matches one or more characters that are not ., _ or @, with a negative lookahead requiring that the line contains no @ from this point to the end, and a positive lookahead requiring at least one . (dot) before the end (so that the file extension is not listed alongside the tags).

EDIT: I've just realized this probably doesn't qualify as a "BRE" (basic regular expression) based solution, since lookarounds need grep -P.

1
  • I like that, @zolo . Welcome, and thanks for helping me out! I wanted to know what limitations there were for BRE, and you've answered that question nicely as concerns look-arounds. Commented Jun 16 at 15:06
