More and more data are being produced by applications, and there are many available tools to capture, transform, stream, process and display these data. But, from time to time, integrating existing software modules or applications to analyze the data is not an option. One such case is when the execution environment is a constrained device, i.e. a device with little memory, little processing power, and little disk space [1].
I recently faced this situation in a project: I had to capture and transform log files generated by an embedded application, and to provide the resulting filtered and aggregated data to another embedded application running on the same device.
The only available specification for the log file format was the following one:
- every line of the file conforms to the same structure
- the structure is defined as below:
Y/M/D-H/MN.S, service, parameter1 = [value1, value2], parameterN = parameterN value, ...
And, of course, some real log files were available. Below is an extract:
2020/11/30-07:51:38, BOOT, Type = UpTime, TimeUp = 3710
2020/11/30-07:51:38, MEMINFO, MemTotal = 95919, MemFree = 59158, Buffers = 3787, Cached = 10252
2020/11/30-07:51:38, SENSORS, Fan_HDD = 0
2020/11/30-07:50:28, NETWORK, Type = NetworkDelay, ClientNetworkDelays = [1,0,0,0,0,0,0,0,0,0,0,0], MeanClientNetworkDelay = 0.000000e+00
Creating in-memory data structures resulting from the analysis of a log line adhering to the above format was the main functional requirement.
As said above, there was one additional, non-functional, requirement: the associated code should run in a constrained device.
And I added a third requirement: the code should be easily modifiable. As the log file format was not specified in a formal way, I was quite sure that I'd find some unanticipated line formats.
A fourth, still non-functional, requirement was to write the source code in C (or in C++), without using other libraries than the standard ones, already present in the device.
I chose a solution based on a technique developed more than 50 years ago: parsing.
A parser is usually divided into two parts: a lexical analyzer and a syntax analyzer. The lexical analyzer is in charge of transforming the input data into a list of symbols, or tokens. The syntax analyzer then checks that the list of tokens is acceptable according to the specified format, and produces a parse tree that depicts the structure of the input data.
Let's look at how this applies to our problem.
But before going on, we'll make a simplification: we'll ignore the timestamp found at the beginning of every line. Doing so simplifies the processing described further below. And removing the timestamp is a very simple operation, as it has a fixed length.
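As an illustration, here is a minimal sketch of how the fixed-length prefix could be skipped; the constant and function names are mine, not taken from the original code, and the length assumes a timestamp followed by ", " as in the extract above:

#include <string.h>

/* "2020/11/30-07:51:38, " : the timestamp and its trailing ", " occupy 21 characters. */
#define TIMESTAMP_PREFIX_LEN 21

/* Return a pointer to the part of the line that follows the timestamp,
   or the unmodified line if it is too short to contain one. */
static const char *skip_timestamp(const char *line) {
    if (strlen(line) <= TIMESTAMP_PREFIX_LEN) {
        return line;
    }
    return line + TIMESTAMP_PREFIX_LEN;
}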
When looking at the line below
NETWORK, Type = NetworkDelay, ClientNetworkDelays = [1,0,0,0,0,0,0,0,0,0,0,0], MeanClientNetworkDelay = 0.000000e+00
(remember, the timestamp was removed), we can see that it contains six types of symbols:
, (comma)
= (equal sign)
[ (opening square bracket)
] (closing square bracket)
eol (end of line: not a visible character)
string: a sequence of characters that contains none of the above five specific characters, nor any space or tab character
According to this list, a number is read as a string (containing a number).
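To make the pseudocode shown later a little more concrete, here is one possible C representation of these tokens; the type and field names are mine, not taken from the original code:

typedef enum {
    TOKEN_COMMA,           /* ','         */
    TOKEN_EQUAL,           /* '='         */
    TOKEN_OPENING_BRACKET, /* '['         */
    TOKEN_CLOSING_BRACKET, /* ']'         */
    TOKEN_EOL,             /* end of line */
    TOKEN_STRING           /* any other sequence of characters */
} token_type_t;

#define MAX_TOKEN_VALUE_LEN 64

typedef struct {
    token_type_t type;
    char value[MAX_TOKEN_VALUE_LEN]; /* meaningful for TOKEN_STRING only */
    int line_number;                 /* used for error reporting, see below */
} token_t;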
The lexical analyzer can be implemented in a very straightforward way, as a simple Finite State Machine (FSM), called for every line:
For each transition, the condition is written above the horizontal line, while the output tokens, if any, are written under the line. For the sake of simplification, the processing of space and tab characters is omitted.
As an example, the lexical analysis of the above line produces this list of tokens:
type: string - value: "NETWORK"
type: comma - value: none
type: string - value: "Type"
type: equal sign - value: none
type: string - value: "NetworkDelay"
etc.
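As an illustration, here is a minimal sketch of an FSM-based get_token() function, built on the token_t type sketched above; it keeps a static cursor into the current line, which is an assumption of mine, not necessarily the way the original code is organized:

#include <stddef.h>
#include <string.h>

static const char *cursor;      /* current position in the line being analyzed */
static int current_line_number; /* attached to every token, for error reporting */

/* Prepare the lexical analyzer for a new line (timestamp already removed). */
void lexer_set_line(const char *line, int line_number) {
    cursor = line;
    current_line_number = line_number;
}

token_t get_token(void) {
    token_t token;
    token.value[0] = '\0';
    token.line_number = current_line_number;

    /* Initial state: skip space and tab characters between tokens. */
    while (*cursor == ' ' || *cursor == '\t') {
        cursor++;
    }

    /* Single-character tokens and end of line. */
    switch (*cursor) {
    case '\0':
    case '\n': token.type = TOKEN_EOL;             return token;
    case ',':  token.type = TOKEN_COMMA;           cursor++; return token;
    case '=':  token.type = TOKEN_EQUAL;           cursor++; return token;
    case '[':  token.type = TOKEN_OPENING_BRACKET; cursor++; return token;
    case ']':  token.type = TOKEN_CLOSING_BRACKET; cursor++; return token;
    default:   break;
    }

    /* "Inside a string" state: accumulate characters until a separator is found. */
    size_t len = 0;
    while (*cursor != '\0' && *cursor != '\n' &&
           strchr(",=[] \t", *cursor) == NULL &&
           len < MAX_TOKEN_VALUE_LEN - 1) {
        token.value[len++] = *cursor++;
    }
    token.value[len] = '\0';
    token.type = TOKEN_STRING;
    return token;
}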
Syntactic analysis is a little bit more complex. We must first define what a grammar is. A grammar is a set of four elements:
- a set of terminal symbols - a terminal symbol is one of the possible tokens produced by the lexical analyzer
- a set of nonterminal symbols - a nonterminal symbol stands for a sequence of one or more terminal and/or nonterminal symbols, as defined by the production rules
- a set of production rules - the production rules describe all the accepted combinations of symbols
- a start symbol - one of the nonterminal symbols
The grammar defines the set of sentences that form the language associated with the grammar. The various possible sentences are derived from the start symbol.
Let's take again the example of the log line below:
NETWORK, Type = NetworkDelay, ClientNetworkDelays = [1,0,0,0,0,0,0,0,0,0,0,0], MeanClientNetworkDelay = 0.000000e+00
The string token of value NETWORK is the name of a service, according to the log file format specification. Then we have a sequence of parameters, two consecutive parameters being separated by a comma token. Each parameter is made of a string token containing the name of the parameter, an equal sign token and a value. Let's try to write this in a more formal way:
line = service ',' parameter+
parameter = parameterName '=' parameterValue parameterTerminator
parameterTerminator = ',' | eol
parameterName = STRING
These four lines are the first production rules of the grammar that will define the log file format in a formal way. The syntax we use to write them is an Extended Backus-Naur Form (EBNF). Yes, we have to define the syntax that we use to define the syntax of the log line 🙂
- on the left-hand side of the equal sign, there is a nonterminal symbol
- on the right-hand side of the equal sign, there is a sequence of one or more terminal and/or nonterminal symbols
- a terminal symbol that is one of the five specific ones is written between single quotes
- the plus sign means that the previous symbol is present one or more times
- the vertical bar sign is used for an alternative
- a production rule is written on one line only
The second production rule contains a nonterminal symbol, parameterValue, that we must define. According to the above example, a parameter value may be either a single value, or a sequence of several values, separated by a comma, the sequence being framed by square brackets:
parameterValue = '[' valueList ']' | value
valueList = value (',' value)*
We have two additional syntactic forms for our EBNF:
- parentheses are used to group symbols
- the star sign means that the previous symbol (or group of symbols) is present zero or more times
At this point, we have the whole set of production rules:
line = service ',' parameter+
parameter = parameterName '=' parameterValue parameterTerminator
parameterTerminator = ',' | eol
parameterValue = '[' valueList ']' | value
valueList = value (',' value)*
service = STRING
parameterName = STRING
value = STRING
The start symbol of our grammar is line.
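As a quick check, let's derive the example line from the start symbol (only the first parameter is fully expanded; the other two follow the same pattern):

line
→ service ',' parameter+
→ "NETWORK" ',' parameter parameter parameter
→ "NETWORK" ',' parameterName '=' parameterValue parameterTerminator parameter parameter
→ "NETWORK" ',' "Type" '=' value ',' parameter parameter
→ "NETWORK" ',' "Type" '=' "NetworkDelay" ',' parameter parameter

The second parameter uses the '[' valueList ']' alternative of parameterValue, and the last one ends with eol instead of ','.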
Now, how do we implement the syntactic analyzer? There are several ways to do it, mainly depending on some properties of the grammar. Additionally, there exist parser generators (see the References section). In our case, the grammar is very simple and was written in a way that allows a very simple implementation, based on what is called top-down parsing (again, check the references for more information). The idea is to write one function for every nonterminal symbol; this function calls the functions of other nonterminal symbols and/or extracts tokens, according to the associated production rule.
For instance, let's look at the first production rule:
line = service ',' parameter+
Replacing it with function calls, we get, in a pseudo language:
void parse_line(void) {
    /* line = service ',' parameter+ */
    get_service();
    get_comma();
    bool last_one;
    do {
        last_one = get_parameter();  /* true after the last parameter of the line */
    } while (!last_one);
}
In the get_service() function, we extract the value of the string token and store it into some global structure that will contain the result of the log line analysis at the end of the parsing:
void get_service(void) {
    token_t serviceToken = get_token();
    store_service_info(serviceToken);
}
Let's write the get_parameter() and the get_parameterValue() functions now:
bool get_parameter(void) {
    /* parameter = parameterName '=' parameterValue parameterTerminator */
    token_t parameterNameToken = get_token();
    get_equal();
    parameterValue_t parameterValue = get_parameterValue();
    token_t terminator = get_parameterTerminator();
    store_parameter_info(parameterNameToken, parameterValue);
    /* eol terminates the last parameter of the line */
    return (terminator == eol);
}
parameterValue_t get_parameterValue(void) {
    /* parameterValue = '[' valueList ']' | value */
    parameterValue_t parameterValue;
    if (is_next_token('[')) {
        get_opening_bracket();
        parameterValue = get_value_list();
        get_closing_bracket();
    } else {
        parameterValue = get_value();
    }
    return parameterValue;
}
I leave it to the reader to write the pseudocode for the other production rules.
In the above pseudocode, we consider that the input line is correct, i.e. its content is a sentence of the language generated by the grammar. Of course, in the real world, we will get errors.
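For reference, here is one possible sketch of two of them, following the same pattern; the append_value() helper is hypothetical:

/* parameterTerminator = ',' | eol */
token_t get_parameterTerminator(void) {
    token_t terminator = get_token();
    /* error handling, discussed below, omitted: the token should be ',' or eol */
    return terminator;
}

/* valueList = value (',' value)* */
parameterValue_t get_value_list(void) {
    parameterValue_t valueList = get_value();
    while (is_next_token(',')) {
        get_comma();
        append_value(&valueList, get_value()); /* hypothetical helper that accumulates values */
    }
    return valueList;
}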
An error may have two origins:
- the input line does not conform to the syntax because there is a bug in the log generation code
- the input line does not conform to the syntax because our grammar is not correct, i.e. it does not accurately describe all possible sentences that can be generated by the log generation code
In any case, we need to react at two levels:
- we should display enough information to locate the line that contains the error
- we should go on processing the file
The second point is really easy to implement (that is not the case in a compiler!): we can ignore the line containing the error. Of course, the above pseudocode must be enhanced accordingly.
In addition to this, we have to enrich the output of the lexical analyzer by adding the line number to every token. In case of error, we display at least this information.
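As an illustration, here is a minimal sketch of what this error handling could look like, building on the token_t type sketched earlier; the syntax_error() and skip_current_line() helpers are mine, not part of the original code:

#include <stdio.h>

/* Discard tokens until the end of the current line is reached. */
static void skip_current_line(void) {
    token_t token;
    do {
        token = get_token();
    } while (token.type != TOKEN_EOL);
}

/* Report the error with its line number, then give up on the current line. */
static void syntax_error(token_t unexpectedToken, const char *expected) {
    fprintf(stderr, "line %d: expected %s\n", unexpectedToken.line_number, expected);
    if (unexpectedToken.type != TOKEN_EOL) {
        skip_current_line(); /* ignore the rest of the line */
    }
}

/* get_comma(), enhanced with error handling. */
void get_comma(void) {
    token_t token = get_token();
    if (token.type != TOKEN_COMMA) {
        syntax_error(token, "','");
    }
}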
I had to slightly modify the grammar twice after parsing some log files: the one-line format specification does not describe two specific cases I found while checking the files... I really appreciated the direct mapping between the production rules and the code, which let me adapt the code very rapidly and, perhaps more importantly, very reliably.
The resulting code is quite short: around 290 lines for the lexical analyzer and around 490 lines for the syntactic analyzer (including header files and comments). And, as already mentioned, it is reliable.
But using such a technique does not come for free: as for any abstraction mechanism, the developer must have a certain level of experience in the field [2].
There is an abundant literature about parsing. One of the reference books is Compilers: Principles, Techniques, and Tools, from Aho, Ullman, Lam and Sethi (also known as The Dragon Book).
You can also check the Stanford University course CS143.
For parser generation, look at Lex and Yacc, the original Unix tools to generate lexical analyzers and parsers. Bison is a more recent parser generator, upward compatible with Yacc.
[1] "little" is not precisely defined. RAM usually starts at 8 KB, up to 512 MB. Clock frequency usually starts at a few MHz, up to a few hundreds of MHz.
[2] One important point is to ensure that the production rules have some specific properties. Looking at the pseudocode provided here, the reader can easily understand why.
