Version 4 (modified by peter, 4 weeks ago)

Add some more sections.

The data parser class

Introduction

Cloudy has to read many data files. In the past, a custom parser was developed whenever another data file was added. In practice this often meant copying code from elsewhere and modifying it to fit the current needs. Over the years this has led to a large amount of duplicated code and a maintenance nightmare. Moreover, the code was usually very verbose and contained many repeated error checks. Despite this, the checks were usually incomplete, so errors could still slip through. The repeated error checks also obfuscated the control flow in the parser, which could lead to other coding errors.

To alleviate all these problems, we have created a new parser class specifically designed for data files, called DataParser. It is essentially a tokenizer that lets you parse the data file line by line, and each line token by token. It includes many basic checks that can be performed without understanding what the token means. It also has a mechanism for efficiently generating error and warning messages resulting from further validity checks performed by the user on the token. Below we give a more detailed description of this class.

Basic assumptions

The data parser will read the file on a line-by-line basis. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that a line whose comment starts after the first column, but which contains only whitespace before the hash symbol, will be considered a blank line. Such a line may be treated either as a comment or as an end-of-file (EOF) marker, depending on how the file was opened (see next section for more details).
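The comment rules above can be illustrated with a small standalone sketch. It does not use DataParser itself: the helper readDataLines() is hypothetical, and it assumes that blank lines are simply passed through, whereas the real parser treats them according to the EOF style chosen when the file is opened.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of DataParser: return the data lines of a
// stream with full comment lines skipped and trailing comments removed.
std::vector<std::string> readDataLines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '#')
            continue; // hash in the first column: skip the whole line
        // find() returns npos when there is no hash, so substr() keeps everything
        lines.push_back(line.substr(0, line.find('#')));
    }
    return lines;
}
```

Feeding this helper a std::istringstream that holds a comment line followed by two data lines returns only the two data lines, each without its trailing comment.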

The parser assumes that all data items (which we will call tokens) are separated by whitespace. So it is assumed you can retrieve the next token from the line by skipping leading whitespace and then copying non-whitespace characters until the next whitespace character is found. Our definition of whitespace matches that of the isspace() routine provided by the system. This includes the normal space character and the horizontal tab (\t) which are both commonly used as separators. Other common separators (such as the comma) are not supported.
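As a rough illustration of this tokenization rule, the sketch below splits a line on whitespace using isspace(). The function splitTokens() is a hypothetical helper written for this page; DataParser reads tokens one at a time rather than building a list.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of DataParser: split a line into tokens
// separated by whitespace as defined by isspace().
std::vector<std::string> splitTokens(const std::string& line)
{
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < line.size()) {
        // skip leading whitespace
        while (pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos])))
            ++pos;
        // copy non-whitespace characters until the next whitespace character
        const std::size_t start = pos;
        while (pos < line.size() && !std::isspace(static_cast<unsigned char>(line[pos])))
            ++pos;
        if (pos > start)
            tokens.push_back(line.substr(start, pos - start));
    }
    return tokens;
}
```

Because the comma is not a separator, the text "1,2 3" yields the two tokens "1,2" and "3".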

The carriage return character (\r) is automatically stripped from the line in case an MSDOS file is read on a *NIX system. Actually, anything after \r until the end of the line, marked by a newline (\n), will be deleted as well. For a properly formatted MSDOS file there should be no text between \r and \n.
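The effect of this rule can be sketched in one line of standard C++. The helper stripCarriageReturn() is hypothetical and assumes the newline has already been removed by the line reader.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of DataParser: drop the first '\r' and
// everything after it. find() returns npos when there is no '\r', in which
// case substr() returns the whole line.
std::string stripCarriageReturn(const std::string& line)
{
    return line.substr(0, line.find('\r'));
}
```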

The parser tries to enforce strict checking of the contents of the file. This includes checks that all the text that was parsed is actually used in the conversion to a data item. As an example, if you are reading a long variable, and the next token contains the text "2.05", it will store the value 2 in the variable. But it will subsequently detect that the remainder ".05" was not used in the conversion and abort (complaining that it found trailing junk). This has two advantages: you can detect storing data in the wrong type of variable (as in this example where a realnum or double should have been used instead of a long) and you can detect corruption in files (which may lead to text looking like "2.54e-025e-02", which would also be detected).
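The trailing-junk check can be approximated with the standard strtol() and strtod() routines, which report where the conversion stopped. This is only a sketch of the idea: the helpers parseLongStrict() and parseDoubleStrict() are hypothetical, they return false instead of aborting, and DataParser uses its own conversion code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: convert a token to long and fail if any part of the
// token was not used in the conversion.
bool parseLongStrict(const std::string& token, long& value)
{
    errno = 0;
    char* end = nullptr;
    value = std::strtol(token.c_str(), &end, 10);
    // success only if something was converted and nothing is left over
    return errno == 0 && end != token.c_str() && *end == '\0';
}

// Hypothetical helper: the same check for a double.
bool parseDoubleStrict(const std::string& token, double& value)
{
    errno = 0;
    char* end = nullptr;
    value = std::strtod(token.c_str(), &end);
    return errno == 0 && end != token.c_str() && *end == '\0';
}
```

With these helpers, "2.05" is rejected as a long because ".05" is left over, and "2.54e-025e-02" is rejected as a double because "e-02" is left over.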

When parsing text tokens, they are always considered case sensitive. This is done for performance reasons.

Opening a data file

You obviously need to open the data file before you can read it. This is usually done in the constructor, as shown below:

DataParser d( "sample.dat", ES_NONE );

Opening the file is done using open_data(), so the search path applies, just like it would have done if you had called open_data() directly. So in our example the file "sample.dat" will be searched along the search path. If it cannot be found, the code will abort, just like open_data() would have done. There is also a mandatory second parameter in the constructor, which indicates the style for including EOF markers in the file. There are 3 possible choices:

ES_NONE: there are no special in-file EOF markers.
ES_STARS_ONLY: a field of stars (***) is an in-file EOF marker.
ES_STARS_AND_BLANKS: both a blank line and a field of stars are in-file EOF markers.

An in-file EOF marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 stars starting in the first column. A blank line is a line containing only whitespace plus optionally a comment that starts after the first column. In the ES_NONE case, a field of stars will have no special meaning. In the ES_NONE and ES_STARS_ONLY cases, a blank line will be considered a comment and will be automatically skipped.
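The rules for the three styles can be summarized in a standalone sketch. The enum EofStyle and the functions below are hypothetical stand-ins written for this page; they are not the Cloudy definitions of ES_NONE, ES_STARS_ONLY, and ES_STARS_AND_BLANKS. The sketch also assumes that a completely empty line counts as a blank line.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-ins for the three EOF styles.
enum class EofStyle { None, StarsOnly, StarsAndBlanks };

// At least 3 stars starting in the first column.
bool isFieldOfStars(const std::string& line)
{
    return line.compare(0, 3, "***") == 0;
}

// Only whitespace, optionally followed by a comment that does not start
// in the first column.
bool isBlankLine(const std::string& line)
{
    if (!line.empty() && line[0] == '#')
        return false; // a regular comment line, not a blank line
    const std::string data = line.substr(0, line.find('#'));
    return std::all_of(data.begin(), data.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}

// Decide whether a line acts as an in-file EOF marker for the given style.
bool isEofMarker(const std::string& line, EofStyle style)
{
    switch (style) {
    case EofStyle::None:           return false;
    case EofStyle::StarsOnly:      return isFieldOfStars(line);
    case EofStyle::StarsAndBlanks: return isFieldOfStars(line) || isBlankLine(line);
    }
    return false;
}
```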

The DataParser class also has an open() method for opening the file. It can be used as follows:

DataParser d;
d.open( "sample.dat", ES_NONE );

The parameters have exactly the same meaning as before. Normally this is not needed, but it can be handy if you need to parse multiple files. You can then reuse the variable, as in the example below:

DataParser d;
d.open( "sample.dat", ES_NONE );
// ... parse data ...
d.open( "sample2.dat", ES_NONE );

Note that there is no need to close the file. The second call to open() will automatically close the first file. The same happens when the variable d goes out of scope. The destructor will close the file.

In rare cases you may need to parse a file that could potentially be absent. This can be done as follows:

DataParser d( "sample.dat", ES_NONE, AS_TRY );
if( !d.isOpen() )
{
    // ... do some error handling ...
}
else
{
    // ... parse data ...
}

The parameter AS_TRY tells open_data() not to abort if opening the file fails (and not even print any error messages). The method isOpen() can then be used to test if opening the file was successful and react accordingly. The open() method obviously also allows this optional third parameter to be added.

In even rarer cases, you may need to close the file before the variable d goes out of scope. The only plausible use case I can think of is deleting a temporary file after parsing it. This can be done as follows:

d.close();
remove( "tempfile.dat" );

Reading lines of data

After you have opened the file, you can start reading it. Parsing is done line by line, so you first need to read in a line:

d.open( "sample.dat", ES_NONE );
d.getline();

The getline() method returns a boolean which indicates if there are more lines to read. This makes it very convenient to parse a file using the following loop:

while( d.getline() )
{
    // ... parse one line ...
}

The getline() method will automatically skip comment lines and also automatically strip any comment at the end of the line. This means you need not (and should not) worry about comments in data files while parsing.

If you are using in-file EOF markers, you will need some extra code:

while( d.getline() )
{
    if( d.lgEOFMarker() )
        break;
    // ... parse one line ...
}

The reason for this is that in-file EOF markers are not automatically handled by getline(), so you need to add code yourself to break out of the loop. This allows additional checks to be done, as in the Stout files, where the mandatory presence of a field of stars is enforced. This would not be possible if getline() took over the task of lgEOFMarker() itself.
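The kind of additional check mentioned here can be sketched without DataParser. The hypothetical function readUntilStars() below collects lines until it meets a field of stars and reports whether the marker was actually present, so the caller can treat a missing marker as an error. How the Stout reader implements this check is not shown on this page, so this is only an illustration of the idea.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read lines into 'data' until a field of stars is
// found. Returns true if the marker was present, false if the stream ended
// without one.
bool readUntilStars(std::istream& in, std::vector<std::string>& data)
{
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 3, "***") == 0)
            return true; // marker found: the rest of the file is ignored
        data.push_back(line);
    }
    return false; // end of stream reached without a marker
}
```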

Checking the magic number

For most data files, the first task after opening the file is to check whether the magic numbers are OK. Since this is such a common task, a special method has been created to do this:

d.open( "sample.dat", ES_NONE );
d.getline();
static const long yr=2007, mon=11, day=18; 
d.checkMagic( yr, mon, day );

The checkMagic() method will check the numbers in the file against those supplied as arguments. Four versions of checkMagic() exist, with one, two, three, or four parameters, all of type long. Internally the code will read one, two, three, or four tokens of type long from the line and then compare them to the parameters that were supplied. If any one of those doesn't match (or if reading failed), an informative error message will be printed and the code will abort.

Note that you need to call getline() first. For versatility, the checkMagic() method does not assume that the data are on a new line (i.e. it does not call getline() itself), nor does it assume that there are no more data after the magic numbers (i.e., it does not call checkEOL() after parsing the magic numbers -- checkEOL() is discussed below). But it does assume that all magic numbers are on the same line.
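The comparison performed by checkMagic() can be pictured with the following standalone sketch. The function magicMatches() is hypothetical: it returns false instead of printing a message and aborting, and it uses standard stream extraction, so it does not perform the strict trailing-junk check that the real parser does. Like checkMagic(), it ignores anything that follows the magic numbers on the line.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read as many long tokens from the line as there are
// expected values and compare them one by one.
bool magicMatches(const std::string& line, const std::vector<long>& expected)
{
    std::istringstream in(line);
    for (long want : expected) {
        long got = 0;
        if (!(in >> got) || got != want)
            return false; // reading failed or the number does not match
    }
    return true;
}
```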

Reading the data

The basic method for reading data from a line is getToken(). This can be used to read data of any type, as long as operator>>() is defined for that type (either by the system or by you). For the commonly used types double, realnum, and most integer types, specializations have been created that scan the numbers more efficiently than the standard system routines can. This is transparent for the user. Below is an example where we read two integers and a double from every line in a file:

while( d.getline() )
{
    long ilo, ihi;
    d.getToken(ilo);
    d.getToken(ihi);
    double Aul;
    d.getToken(Aul);
}

On each call the method getToken() will skip leading whitespace and then read the number until it hits another whitespace character. If the text is not fully used in the conversion, an error will be generated and the code will abort. This method is not limited to numeric tokens. It can also read, for example, a character or a string:

char c;
d.getToken(c);
string s;
d.getToken(s);

Note that there is a separate method for reading quoted strings that are allowed to contain whitespace. That will be discussed later.

If a line contains a fixed number of data items that you want to read in one call, you can use:

double temps[10];
d.getToken(temps,10);

Sometimes you do not know in advance how many data items are on a line. You can handle that using the following method:

vector<double> temps;
double token;
while( d.getTokenOptional(token) )
{
    temps.emplace_back(token);
}

The routine getTokenOptional() will return true as long as reading another number was successful. It will not generate an error if reading the number fails, and return false instead. If reading the number fails, the token will be set to a default constructed instance of the relevant type (zero for integer and floating point numbers, an empty string for strings, etc.).
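The behavior described above resembles the following sketch built on standard stream extraction. The template readOptional() is a hypothetical stand-in, not the DataParser implementation, and it does not include the parser's strict checks.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: try to read one token of type T. On failure, reset
// the token to a default-constructed value and return false.
template<class T>
bool readOptional(std::istream& in, T& token)
{
    if (in >> token)
        return true;
    token = T();
    return false;
}
```

Calling readOptional() in a loop on a stream holding "1.5 2.5 3.5" yields three values, after which the call returns false and leaves the token set to zero.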

For certain data items, separate methods have been created. The first allows you to read quoted strings:

string text;
d.getQuote(text);

This method will read a string between double quotes, which allows you to read strings with whitespace embedded. If the first or the second double quote cannot be found, the code will abort. There is also an optional version of this method:

string text;
bool success = d.getQuoteOptional(text);

This method will return true if reading was successful, and false otherwise. In the latter case, the variable text will be set to an empty string.
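A quoted-string reader along these lines could look like the sketch below. The function readQuoted() is hypothetical and works on a plain string with an explicit position. It assumes that the search simply moves forward to the next double quote, which may differ from what DataParser does with text preceding the quote.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: extract the text between the next pair of double
// quotes at or after 'pos'. On success, 'pos' is moved past the closing
// quote. On failure, 'text' is left empty and 'pos' is unchanged.
bool readQuoted(const std::string& line, std::size_t& pos, std::string& text)
{
    text.clear();
    const std::size_t open = line.find('"', pos);
    if (open == std::string::npos)
        return false; // no opening quote
    const std::size_t close = line.find('"', open + 1);
    if (close == std::string::npos)
        return false; // no closing quote
    text = line.substr(open + 1, close - open - 1);
    pos = close + 1;
    return true;
}
```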

Skipping parts of the line

The basic philosophy of the data parser is that tokens are read from the line in consecutive order. This can result in a performance penalty if the majority of those tokens are not needed. To alleviate this problem, it is possible to skip over part of the line. Two methods are available to achieve this. The first allows you to skip to a specified column number. Note that the column numbering follows the usual C style, i.e., the first column has number 0. The syntax is very simple:

long index;
d.getToken(index); // read first integer on the line
d.skipTo(56);      // skip to column 56
realnum Aul;
d.getToken(Aul);   // read next token...

Note that skipping backwards is not allowed (this wouldn't make any sense). Skipping beyond the end of the line will raise the EOL flag and all subsequent attempts to read a token will fail. This method is obviously only useful for data files that are strictly aligned on columns.
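The bookkeeping behind a column skip can be sketched as follows. The Cursor struct and skipToColumn() are hypothetical. The sketch returns false for a backward skip (what DataParser does in that case is not described here) and assumes that a target column at or past the end of the line raises the EOL flag.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical parse position within a line.
struct Cursor {
    std::size_t pos = 0; // zero-based column of the next character to read
    bool eol = false;    // set once the position has run past the line
};

// Hypothetical helper: move the cursor forward to the given column.
bool skipToColumn(const std::string& line, Cursor& cur, std::size_t column)
{
    if (column < cur.pos)
        return false; // skipping backwards is not allowed
    if (column >= line.size()) {
        cur.pos = line.size();
        cur.eol = true; // nothing left to read on this line
    } else {
        cur.pos = column;
    }
    return true;
}
```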

The second method is to skip after a specified string:

d.getToken(index);  // read first integer on the line
d.skipAfter("Aul:"); // skip after string "Aul:"
realnum Aul;
d.getToken(Aul);    // read next token...

The parser will search for the first instance of the string "Aul:" on the line and position itself immediately after that text. If the string cannot be found, the code will abort. The search is case sensitive, as is the case for all actions of the data parser. This method can be called multiple times. Each time it will search for the first instance of the requested string after the current position, so even if the same string occurs multiple times on a line, each call will skip to the next instance of that string.
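The repeated-search behavior can be illustrated with std::string::find(). The function skipPast() is a hypothetical stand-in that returns false instead of aborting when the string is not found.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: find the first occurrence of 'key' at or after 'pos'
// (case sensitive) and move 'pos' to just after it.
bool skipPast(const std::string& line, std::size_t& pos, const std::string& key)
{
    const std::size_t hit = line.find(key, pos);
    if (hit == std::string::npos)
        return false; // not found: position is left unchanged
    pos = hit + key.size();
    return true;
}
```

On the line "3 Aul: 1.5 Aul: 2.5", two successive calls with the key "Aul:" land after the first and then the second occurrence, and a third call fails.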

If the data items you are not interested in are at the end of the line, you can simply stop parsing and continue with the next line. The parser will not detect that the line was not fully parsed unless you explicitly check that yourself (see below).

Generating error messages and warnings

Miscellaneous other methods