2
\$\begingroup\$

Code: https://github.com/Loki-Astari/Puzzle/tree/master/wc

Challenge: https://codingchallenges.fyi/challenges/challenge-wc/

wc.cpp

#include <cctype>
#include <cstddef>
#include <fstream>
#include <vector>
#include <string>
#include <iostream>

/*
 * Command line options.
 * If no options are specified then any is true and we print all values.
 */
struct Options
{
    bool    any         = true;
    bool    lines       = false;
    bool    words       = false;
    bool    chars       = false;
    bool    bytes       = false;
};

/*
 * Collect values from a file.
 */
struct Result
{
    std::size_t     lines   = 0;
    std::size_t     words   = 0;
    std::size_t     chars   = 0;
    std::size_t     bytes   = 0;
};

Result getData(std::istream& file)
{
    Result      result;
    int         c        = file.get();
    bool        newLine  = true;
    bool        inWord   = !std::isspace(c);

    for (; c != std::char_traits<char>::eof(); c = file.get()) {

        // A line must have at least one character on it.
        // The new line character counts as a character for this purpose.
        if (newLine == true) {
            newLine = false;
            result.lines    += 1;
        }
        if (c == '\n') {
            newLine = true;
        }

        // Words are "white space" separated.
        // Increment the counter when we hit a space when inside a word.
        bool isSpace = std::isspace(c);

        if (inWord && isSpace) {
            inWord = false;
            result.words    += 1;
        }
        else if (!inWord && !isSpace) {
            inWord = true;
        }

        // Ignore extra characters in multi byte character;
        if ((c & 0xC0) != 0x80) {
            result.chars    += 1;
        }

        // Increment for each char read from the stream
        result.bytes    += 1;
    }

    // We are in a word that has not been counted.
    if (inWord) {
        result.words += 1;
    }

    return result;
}

void display(std::istream& file, std::string const& fileName, Options const& options)
{
    Result data = getData(file);
    std::cout << "\t";
    if (options.any || options.lines) {
        std::cout << data.lines << "\t";
    }
    if (options.any || options.words) {
        std::cout << data.words << "\t";
    }
    if (options.any || options.chars) {
        std::cout << data.chars << "\t";
    }
    if (options.any || options.bytes) {
        std::cout << data.bytes << "\t";
    }
    std::cout << fileName << "\n";
}

int main(int argc, char* argv[])
{
    Options                     options;
    std::vector<std::string>    files;

    int loop = 1;
    for (; loop < argc; ++loop) {
        /*
         * If this is not a flag then we have reached the files.
         */
        if (argv[loop][0] != '-') {
            break;
        }

        /* Allow old style unix flags */
        for (int flag = 1; argv[loop][flag]; ++flag) {
            if (argv[loop][flag] == 'l') {
                options.any     = false;
                options.lines   = true;
            }
            else if (argv[loop][flag] == 'w') {
                options.any     = false;
                options.words   = true;
            }
            else if (argv[loop][flag] == 'm') {
                options.any     = false;
                options.chars   = true;
            }
            else if (argv[loop][flag] == 'c') {
                options.any     = false;
                options.bytes   = true;
            }
            else {
                std::cerr << "Usage: wc [-lwmc] <files>*\n";
                return 1;
            }
        }
    }

    /* Any remaining command line values are files */
    for (; loop < argc; ++loop) {
        files.emplace_back(argv[loop]);
    }

    /* If no files are explicitly set then use std::cin */
    if (files.size() == 0) {
        display(std::cin, "", options);
    }
    /* Loop over all the specified files */
    for (auto fileName: files) {
        std::ifstream   file(fileName);
        if (!file) {
            std::cout << "Unknown file: " << fileName << "\n";
        }
        else {
            display(file, fileName, options);
        }
    }
}
\$\endgroup\$

4 Answers 4

1
\$\begingroup\$

UX

The code works really well, and I only have a few minor suggestions.

If I mistakenly use an unsupported option, I get an expected error message:

ccwc -b
Usage: wc [-lwmc] <files>*

It would be nice to have an option to get a more verbose description of what the code does and what all the options mean. The wc on my system has a --help option which does just that.
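As a rough sketch of what that could look like: the two helpers below, `helpText()` and `isHelpRequest()`, are hypothetical names that do not exist in the posted code. The flag letters are taken from the posted usage string, and the description wording is only illustrative.

```cpp
#include <string>

// Hypothetical helper: builds the text a --help flag could print.
// The flag letters match the posted code; the wording is illustrative.
std::string helpText()
{
    return
        "Usage: wc [-lwmc] <files>*\n"
        "Count lines, words, characters and bytes in each file,\n"
        "or in standard input when no file is given.\n"
        "  -l      print the line count\n"
        "  -w      print the word count\n"
        "  -m      print the character count (input assumed to be UTF-8)\n"
        "  -c      print the byte count\n"
        "  --help  print this message and exit\n";
}

// Hypothetical check, meant to run before the single-letter flag loop
// so that "--help" is not rejected as an unknown flag.
bool isHelpRequest(std::string const& arg)
{
    return arg == "--help";
}
```

In main, the check would have to come before the loop over single-letter flags, since that loop currently treats the second '-' as an unsupported option.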

Simpler

This line:

if (newLine == true) {

is simpler as:

if (newLine) {

Layout

In the getData function, whitespace before operators is a bit inconsistent:

newLine = false;
result.lines    += 1;

I think this is more consistent (single space):

newLine = false;
result.lines += 1;

Documentation

The comments in the code are very helpful. It would be nice to add a block comment at the top to summarize the purpose of the code, mentioning that it is a version of wc and what options are supported.

\$\endgroup\$
1
\$\begingroup\$

Error reporting

To properly replicate the functionality of wc, the error in this loop should be printed to std::cerr rather than std::cout.

    /* Loop over all the specified files */
    for (auto fileName: files) {
        std::ifstream   file(fileName);
        if (!file) {
            std::cout << "Unknown file: " << fileName << "\n";
        }
        else {
            display(file, fileName, options);
        }
    }

As I can see on my test machine:

% wc foo
wc: foo: open: No such file or directory
% wc foo 2> /dev/null 
% 

Fortunately, it's about the simplest fix imaginable: all of three characters.

Since there are multiple reasons your program might fail to open a file, you should either:

  • Investigate further and identify the actual source of the error, or...
  • Use a more generic error message like "failure to open file {filename}".
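One possible way to follow the first option is sketched below, assuming C++17 and a hypothetical helper named `openFailureReason()`. It only distinguishes a missing path and a directory. Anything else, such as missing permissions, falls back to the generic message, so treat it as a starting point rather than a complete diagnosis.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: best-effort reason why a path could not be opened.
// Meant to be called only after std::ifstream has already failed.
std::string openFailureReason(std::string const& fileName)
{
    std::error_code ec;
    std::filesystem::file_status status = std::filesystem::status(fileName, ec);

    // A missing path is reported through the returned status type.
    if (status.type() == std::filesystem::file_type::not_found) {
        return "No such file or directory";
    }
    // Any other failure to query the path: use the system's message.
    if (ec) {
        return ec.message();
    }
    if (std::filesystem::is_directory(status)) {
        return "Is a directory";
    }
    // The path exists but could not be opened (permissions, etc.).
    return "failure to open file";
}
```

The failing branch of the loop could then write something like `std::cerr << "wc: " << fileName << ": " << openFailureReason(fileName) << "\n";`.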

Strings

Consider std::string_view vs. const std::string&.

From Stack Overflow: How exactly is std::string_view faster than const std::string&?

\$\endgroup\$
2
  • \$\begingroup\$ Also opening a file can fail for other reasons than a wrong file name (e.g. missing permissions). \$\endgroup\$ Commented 13 hours ago
  • 1
    \$\begingroup\$ Switching to string_view makes opening files harder as you can't open a std::fstream with a string_view. \$\endgroup\$ Commented 11 hours ago
1
\$\begingroup\$

Overflow

Only for the pedantic.

A text file's length is not limited to what fits in a size_t. size_t only has to be large enough to hold the size of any object in memory, so result.bytes += 1; risks overflow.

Of course other statistics may overflow, yet certainly .bytes is the first one at risk.

I could see using a wider type here.
Testing for overflow on every increment, so that the user can be alerted, looks a tad expensive for catching such an extreme condition.

Or just leave it as is - What is the chance an extreme overflow will cause an expensive issue?
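For reference, a per-increment test could look like the sketch below. `checkedIncrement()` is a hypothetical helper, and std::uintmax_t is just one choice of wider type. The extra comparison on every byte is the cost mentioned above.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: increments the counter unless that would wrap.
// Returns false (and leaves the counter unchanged) when it is already at
// the maximum value, so the caller can warn the user.
bool checkedIncrement(std::uintmax_t& counter)
{
    if (counter == std::numeric_limits<std::uintmax_t>::max()) {
        return false;
    }
    counter += 1;
    return true;
}
```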

Alternate code

Minor simplification:

        if (newLine == true) {
            newLine = false;
            result.lines    += 1;
        }
        if (c == '\n') {
            newLine = true;
        }

to

        result.lines += newLine;
        newLine = (c == '\n');

Line count

There is a bit of a holy war about whether a line count should include a final line that lacks a '\n' at the end of the file.

As I see this code, it counts final text without a '\n' as a line.

I suspect the classic WC does not - it only counts '\n'.
If you do go this way, consider a name change from .lines to .lf_count to avoid having to explain lines.

IMO, I like counting lines as OP has done here, yet backward compatibility with existing WC is likely more important. If WC reports the line-feed count and does not include a last "partial" line, I'd like it to offer some mechanism to report the existence of that partial line too. OTOH, if line counts do include a last line without a '\n', I'd like a similar mechanism to report when that occurs.

\$\endgroup\$
13
  • \$\begingroup\$ I assumed wc was smart about line counting, but I just checked and it's not. So I will revert the code to behave like standard wc: echo -n "A" | wc -l => 0 \$\endgroup\$ Commented 10 hours ago
  • \$\begingroup\$ Changed the type for counting to std::uintmax_t; this matches what is returned by std::filesystem::file_size(). \$\endgroup\$ Commented 10 hours ago
  • \$\begingroup\$ The -portable type to use for a file length is off_t. This will be signed 64-bit on all modern systems; some might need compiler flags to enable large-file support. \$\endgroup\$ Commented 9 hours ago
  • \$\begingroup\$ Or if you want to use a type guaranteed to be in the Standard Library, std::fpos_t or std::streampos should also work for any file size the OS supports. (This might not be true for some old systems that did Large File Support in a weird non-standard way.) Or std::int_least64_t is guaranteed to work, for the conceivable future. \$\endgroup\$ Commented 9 hours ago
  • \$\begingroup\$ @Davislor "The -portable type to use for a file length is off_t." --> off_t is not a standard C++ type. It is a POSIX one. So I guess it depends on how portable the goal is, even if WC originated in *nix. \$\endgroup\$ Commented 8 hours ago
0
\$\begingroup\$

Handling the first character of the file separately seems unnecessary, i.e. this

int         c        = file.get();
bool        newLine  = true;
bool        inWord   = !std::isspace(c);

for (; c != std::char_traits<char>::eof(); c = file.get()) {

can be replaced by

bool        newLine  = true;
bool        inWord   = false;

for (int c = file.get(); c != std::char_traits<char>::eof(); c = file.get()) {

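That simplifies the logic a bit and makes the scope of c local to the loop. It also changes the result for empty input: in the original, std::isspace(EOF) is false, so inWord starts out true and an empty file is reported as containing one word.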


It is nice that your program distinguishes between “characters” and “bytes”, but it was not immediately obvious to me that the file is assumed to be UTF-8 encoded. If you add online help (as suggested in another answer), then this should be mentioned. Also this comment

// Ignore extra characters in multi byte character;

is perhaps more clearly expressed as

// Ignore UTF-8 continuation bytes for the character count:

For the character count you might consider ignoring a UTF-8 byte order mark (BOM) EF BB BF at the beginning of the file. Note that the sample file “test.txt” from the challenge site starts with a BOM.
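A minimal sketch of that idea is shown below. It assumes the data is already in a std::string buffer (the posted code reads from a stream one byte at a time, so it would need to track the first three bytes instead), and `countUtf8Chars()` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters in a buffer, ignoring a
// byte order mark (EF BB BF) if the buffer starts with one.
std::size_t countUtf8Chars(std::string const& data)
{
    std::size_t start = 0;
    if (data.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        start = 3;      // Skip the BOM.
    }

    std::size_t chars = 0;
    for (std::size_t i = start; i < data.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(data[i]);

        // Ignore UTF-8 continuation bytes (10xxxxxx) for the character count.
        if ((byte & 0xC0) != 0x80) {
            chars += 1;
        }
    }
    return chars;
}
```

Only the character count changes. The byte count would still include the three BOM bytes.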

\$\endgroup\$
