Version 6 (modified by peter, 3 weeks ago)

rename EOF markers -> EOD markers, some other clarifications

The data parser class

Introduction

Cloudy has to read many data files. In the past, a custom parser would be developed whenever another data file was added. In practice this would often involve copying code from elsewhere and then modifying it to fit the current needs. Over the years this has led to a large amount of copied code, which is a maintenance nightmare. Moreover, the code was usually very verbose, containing many repeated error checks. Despite this, the checks were usually not complete, meaning that errors could still slip through. Also, the many repeated error checks obfuscated the control flow in the parser, which could lead to other coding errors. To alleviate all these problems, we have created a new parser class specifically designed for data files, called DataParser. It is essentially a tokenizer, allowing you to parse the data file line by line, and then each line token by token. It includes many basic checks that can be performed without understanding what the token means. It also has a mechanism for efficiently generating error and warning messages resulting from further validity checks performed by the user on the token. Below we will give a more detailed description of this class.

Basic assumptions

The data parser will read the file on a line-by-line basis. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that if a line has a comment starting after the first column, but contains only whitespace leading up to the hash symbol, it will be considered a blank line. Such a line may be considered either a comment or an end-of-data (EOD) marker, depending on how the file was opened (see next section for more details).

The parser assumes that all data items (which we will call tokens) are separated by whitespace. So it is assumed you can retrieve the next token from the line by skipping leading whitespace and then copying non-whitespace characters until the next whitespace character is found. Our definition of whitespace matches that of the isspace() routine provided by the system. This includes the normal space character and the horizontal tab (\t) which are both commonly used as separators. Other common separators (such as the comma) are not supported.
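The tokenization rule described above can be illustrated with a few lines of standard C++. This is only a sketch of the rule, not the DataParser implementation; splitTokens() is a hypothetical helper that does not exist in Cloudy.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of DataParser): split a line into tokens,
// using isspace() as the definition of whitespace.
std::vector<std::string> splitTokens( const std::string& line )
{
    std::vector<std::string> tokens;
    std::size_t p = 0;
    while( p < line.size() )
    {
        // skip leading whitespace
        while( p < line.size() && std::isspace( static_cast<unsigned char>(line[p]) ) )
            ++p;
        // copy non-whitespace characters until the next whitespace character
        std::size_t start = p;
        while( p < line.size() && !std::isspace( static_cast<unsigned char>(line[p]) ) )
            ++p;
        if( p > start )
            tokens.push_back( line.substr( start, p - start ) );
    }
    return tokens;
}
```

With this rule the text "Fe,9" is a single token, since the comma is not treated as a separator.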

The carriage return character (\r) is automatically stripped from the line in case an MSDOS file is read on a *NIX system. Actually, anything after \r until the end of the line, marked by a newline (\n), will be deleted as well. For a properly formatted MSDOS file there should be no text between \r and \n.
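The effect of this rule on a single line (with the trailing newline already removed) can be sketched as follows. The helper stripCarriageReturn() is a hypothetical name used for illustration only.

```cpp
#include <string>

// Hypothetical helper (not part of DataParser): delete the first '\r'
// and everything that follows it on the line.
void stripCarriageReturn( std::string& line )
{
    std::size_t p = line.find( '\r' );
    if( p != std::string::npos )
        line.erase( p );
}
```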

The parser tries to enforce strict checking of the contents of the file. This includes checks that all the text that was parsed is actually used in the conversion to a data item. As an example, if you are reading a long variable, and the next token contains the text "2.05", it will store the value 2 in the variable. But it will subsequently detect that the remainder ".05" was not used in the conversion and abort (complaining that it found trailing junk). This has two advantages: you can detect storing data in the wrong type of variable (as in this example where a realnum or double should have been used instead of a long) and you can detect corruption in files (which may lead to text looking like "2.54e-025e-02", which would also be detected).
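The trailing-junk check can be illustrated with the standard conversion routines, which report where the conversion stopped. The sketch below is an assumption about how such a check could look, not the code used by DataParser (which has its own specialized scanners); tokenToLong() and tokenToDouble() are hypothetical names, and they return false where the parser would abort.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: convert a token to long. Returns false unless
// every character of the token was used in the conversion.
bool tokenToLong( const std::string& token, long& value )
{
    if( token.empty() )
        return false;
    char* end = nullptr;
    errno = 0;
    value = std::strtol( token.c_str(), &end, 10 );
    // *end != '\0' means trailing junk such as ".05" was left over
    return errno == 0 && *end == '\0';
}

// Hypothetical helper: the same check for floating point tokens.
bool tokenToDouble( const std::string& token, double& value )
{
    if( token.empty() )
        return false;
    char* end = nullptr;
    value = std::strtod( token.c_str(), &end );
    return end != token.c_str() && *end == '\0';
}
```

Here tokenToLong() rejects "2.05" after storing 2, and tokenToDouble() rejects "2.54e-025e-02" because the text "e-02" is left over after the first number.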

Text tokens are always treated as case sensitive when they are parsed. This is done for performance reasons.

Opening a data file

You obviously need to open the data file before you can read it. This is usually done in the constructor, as shown below:

DataParser d( "sample.dat", ES_NONE );

Opening the file is done using open_data(), so the search path applies, just like it would have done if you had called open_data() directly. So in our example the file "sample.dat" will be searched along the search path. If it cannot be found, the code will abort, just like open_data() would have done. There is also a mandatory second parameter in the constructor, which indicates the style for including EOD markers in the file. There are 3 possible choices:

ES_NONE: there are no special in-file EOD markers.
ES_STARS_ONLY: a field of stars (***) is an in-file EOD marker.
ES_STARS_AND_BLANKS: both a blank line and a field of stars are in-file EOD markers.

An in-file EOD marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 contiguous stars starting in the first column. A blank line contains either zero or more whitespace characters and nothing else, or one or more whitespace characters followed by a comment. In the ES_NONE case, a field of stars will have no special meaning and will be considered a token. In the ES_NONE and ES_STARS_ONLY cases, a blank line will be considered a comment and will be automatically skipped.
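The rules above can be summarized as a small classification function. This is a sketch of the rules as stated in the text, not the DataParser source. The enums Style and LineKind and the functions isFieldOfStars(), isBlank(), and classify() are hypothetical stand-ins; Style mirrors the three ES_* choices.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-ins for ES_NONE, ES_STARS_ONLY, ES_STARS_AND_BLANKS.
enum class Style { None, StarsOnly, StarsAndBlanks };
enum class LineKind { Data, Comment, EODMarker };

// At least 3 contiguous stars starting in the first column.
bool isFieldOfStars( const std::string& line )
{
    return line.compare( 0, 3, "***" ) == 0;
}

// Only whitespace, or one or more whitespace characters followed by a comment.
bool isBlank( const std::string& line )
{
    std::size_t p = 0;
    while( p < line.size() && std::isspace( static_cast<unsigned char>(line[p]) ) )
        ++p;
    if( p == line.size() )
        return true;
    return p > 0 && line[p] == '#';
}

LineKind classify( const std::string& line, Style style )
{
    // a hash symbol in the first column is always a comment
    if( !line.empty() && line[0] == '#' )
        return LineKind::Comment;
    if( isBlank( line ) )
        return style == Style::StarsAndBlanks ? LineKind::EODMarker : LineKind::Comment;
    if( style != Style::None && isFieldOfStars( line ) )
        return LineKind::EODMarker;
    return LineKind::Data;
}
```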

The DataParser class also has an open() method for opening the file. It can be used as follows:

DataParser d;
d.open( "sample.dat", ES_NONE );

The parameters have exactly the same meaning as before. Normally this is not needed, but it can be handy if you need to parse multiple files. You can then reuse the variable, as in the example below:

DataParser d;
d.open( "sample.dat", ES_NONE );
// ... parse data ...
d.open( "sample2.dat", ES_NONE );

Note that there is no need to close the file. The second call to open() will automatically close the first file. The same happens when the variable d goes out of scope. The destructor will close the file.

In rare cases you may need to parse a file that could potentially be absent. This can be done as follows:

DataParser d( "sample.dat", ES_NONE, AS_TRY );
if( !d.isOpen() )
{
    // ... file is absent, do some error handling ...
}
else
{
    // ... parse data ...
}

The parameter AS_TRY tells open_data() not to abort if opening the file fails (and not even print any error messages). The method isOpen() can then be used to test if opening the file was successful and react accordingly. The open() method obviously also allows this optional third parameter to be added.

In even rarer cases, you may need to close the file before the variable d goes out of scope. The only plausible use case I can think of is deleting a temporary file after parsing it. This can be done with the call:

d.close();
remove( "tempfile.dat" );

Reading lines of data

After you have opened the file, you can start reading it. Parsing is done line by line, so you first need to read in a line:

d.open( "sample.dat", ES_NONE );
d.getline();

The getline() method returns a boolean which indicates if there are more lines to read. This makes it very convenient to parse a file using the following loop:

while( d.getline() )
{
    // ... parse one line ...
}

The getline() method will automatically skip comment lines and also automatically strip any comment at the end of the line. This means you need not (and should not) worry about comments in data files while parsing.

If you are using in-file EOD markers, you will need some extra code:

while( d.getline() )
{
    if( d.lgEODMarker() )
        break;
    // ... parse one line ...
}

The reason for this is that in-file EOD markers are not automatically handled by getline(), so you need to add code yourself to break out of the loop. This allows additional checks to be done, like in the Stout files where the mandatory presence of a field of stars is enforced. This could not be done if getline() took over the task of lgEODMarker() itself.

Checking the magic number

For most data files, the first task after opening the file is to check whether the magic numbers are OK. Since this is such a common task, a special method has been created to do this:

d.open( "sample.dat", ES_NONE );
d.getline();
static const long yr=2007, mon=11, day=18;
d.checkMagic( yr, mon, day );

The checkMagic() method will check the numbers in the file against those supplied as arguments. Four versions of checkMagic() exist, with one, two, three, or four parameters, all of type long. Internally the code will read one, two, three, or four tokens of type long from the line and then compare them to the parameters that were supplied. If any one of those doesn't match (or if reading failed), an informative error message will be printed and the code will abort.

Note that you need to call getline() first. For versatility, the checkMagic() method does not assume that the data are on a new line (i.e. it does not call getline() itself), nor does it assume that there are no more data after the magic numbers (i.e., it does not call checkEOL() after parsing the magic numbers -- checkEOL() is discussed below). But it does assume that all magic numbers are on the same line.
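The behavior of checkMagic() described above can be mimicked with standard streams. The sketch below is not the Cloudy implementation: checkMagicSketch() is a hypothetical name, it operates on a std::istringstream holding the current line, and it throws std::runtime_error where the real method would print a message and abort.

```cpp
#include <initializer_list>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for checkMagic(): read one long per expected
// value from the line and compare it with that value.
void checkMagicSketch( std::istringstream& line, std::initializer_list<long> expected )
{
    for( long want : expected )
    {
        long found;
        if( !( line >> found ) )
            throw std::runtime_error( "failed to read magic number" );
        if( found != want )
            throw std::runtime_error( "magic number mismatch: expected " +
                std::to_string( want ) + ", found " + std::to_string( found ) );
    }
    // Deliberately no end-of-line check: more data may follow on the line.
}
```

As in the real method, any text following the magic numbers is left on the line for the caller to parse or to reject.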

Reading the data

The basic method for reading data from a line is getToken(). This can be used to read data of any type, as long as operator>>() is defined for that type (either by the system or by you). For the commonly used types double, realnum, and most integer types, specializations have been created that scan the numbers more efficiently than the standard system routines can. This is transparent for the user. Below is an example where we read two integers and a double from every line in a file:

while( d.getline() )
{
    long ilo, ihi;
    d.getToken( ilo );
    d.getToken( ihi );
    double Aul;
    d.getToken( Aul );
}

On each call the method getToken() will skip leading whitespace and then read the token until it hits another whitespace character. If the text is not fully used in the conversion, an error will be generated and the code will abort. This method is not limited to numeric tokens. It can e.g. also read characters or a string:

char c;
d.getToken( c );
string s;
d.getToken( s );

Note that there is a separate method for reading quoted strings that are allowed to contain whitespace. That will be discussed later. Also note that parsing C-style strings is not supported!

If a line contains a fixed number of data items that you want to read in one call, you can use:

double temps[10];
d.getToken( temps, 10 );

Sometimes you do not know in advance how many data items are on a line. You can handle that using the following method:

vector<double> temps;
double token;
while( d.getTokenOptional( token ) )
{
    temps.emplace_back( token );
}

The routine getTokenOptional() will return true as long as reading another number was successful. If reading the number fails, it will not generate an error but will return false instead. In that case the token will be set to a default constructed instance of the relevant type (zero for integer and floating point numbers, an empty string for strings, etc.).
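The contract of getTokenOptional() (return false and reset the token on failure) can be sketched with a standard stream. This is an illustration only: readOptional() and readRemaining() are hypothetical names, and the sketch does not perform the strict trailing-junk checks that the data parser does.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for getTokenOptional(): returns true if a token
// could be read, otherwise resets the token to a default constructed value.
template<class T>
bool readOptional( std::istringstream& line, T& token )
{
    if( line >> token )
        return true;
    token = T();
    return false;
}

// Collect all remaining tokens of type T on the line.
template<class T>
std::vector<T> readRemaining( std::istringstream& line )
{
    std::vector<T> items;
    T token;
    while( readOptional( line, token ) )
        items.push_back( token );
    return items;
}
```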

For certain data items, a separate, specialized method has been created. The first allows you to read quoted strings:

string text;
d.getQuote( text );

This method will read a string between double quotes, which allows you to read strings with whitespace embedded. If the first or the second double quote cannot be found, the code will abort. There is also an optional version of this method:

string text;
bool success = d.getQuoteOptional( text );

This method will return true if reading was successful, and false otherwise. In the latter case, the variable "text" will be set to an empty string.
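The logic of the optional version can be sketched on a plain string. The function readQuoteOptional() below is hypothetical, as is its interface with an explicit position argument. The sketch also assumes that only whitespace may precede the opening quote, which the description above does not state explicitly.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for getQuoteOptional(): read the text between the
// next pair of double quotes, starting the search at position pos.
// On failure, text is set to an empty string and pos is left unchanged.
bool readQuoteOptional( const std::string& line, std::size_t& pos, std::string& text )
{
    text.clear();
    std::size_t p = pos;
    // assumption: only whitespace may precede the opening quote
    while( p < line.size() && std::isspace( static_cast<unsigned char>(line[p]) ) )
        ++p;
    if( p == line.size() || line[p] != '"' )
        return false;
    std::size_t close = line.find( '"', p + 1 );
    if( close == std::string::npos )
        return false;
    text = line.substr( p + 1, close - p - 1 );
    pos = close + 1; // continue parsing right after the closing quote
    return true;
}
```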

The second specialized method is for reading line IDs. This routine works very similarly to the getLineID() method that is part of the input parser. It reads a data item of type LineID, consisting of a species label and a wavelength. The species label must start in the first column of the line, and can be either a quoted string (in which case the label may be of arbitrary length) or the first 4 columns of the line, which are then used verbatim as the label. The next item on the line must be the wavelength in the usual Cloudy format. There must be whitespace between the label and the wavelength. If a 4-column line label is used, the wavelength must therefore start in the 6th column or later (unless the label ends in whitespace, as is e.g. the case for CO, in which case the 5th column is acceptable). It is not permitted to read other tokens from a line before the line ID is read. The code will abort if this is attempted. This is an example of the use of this method:

d.getline();
LineID id;
d.getLineID( id ); // must be the first token read from this line

This method is currently only used for reading line list files. There are two differences with the method in the input parser. First, it does allow other tokens to be read after the line label was parsed. This is for possible future extensions of the code. The parser for line list files does enforce that there are no other data items on the line by calling checkEOL(). The second difference is that the input parser allows the wavelength to be an expression, while the data parser does not. In practice it is highly unlikely that this difference will ever be of any significance. The calling sequence is also different, to be in line with the other methods for reading tokens in the data parser.

Skipping parts of the line

The basic philosophy of the data parser is that tokens are read from the line in consecutive order. This can result in a performance penalty if the majority of those tokens are not needed. To alleviate this problem, it is possible to skip over part of the line. Two methods are available to achieve this. The first allows you to skip to a specified column number. Note that the column numbering follows the usual C style, i.e., the first column has number 0. The syntax is very simple:

long index;
d.getToken( index ); // read first integer on the line
d.skipTo( 56 );      // skip to column 56
realnum Aul;
d.getToken( Aul );   // read next token...

Note that skipping backwards is not allowed (this wouldn't make any sense). Skipping beyond the end of the line will raise the EOL flag and all subsequent attempts to read a token will fail. This method is obviously only useful for data files that are strictly aligned on columns.

The second method is to skip after a specified string:

d.getToken( index );   // read first integer on the line
d.skipAfter( "Aul:" ); // skip after string "Aul:"
realnum Aul;
d.getToken( Aul );     // read next token...

The parser will search for the first instance of the string "Aul:" on the line and position itself immediately after that text. If the string cannot be found, the code will abort. The search is case sensitive, as is the case for all actions of the data parser. This method can be called multiple times. Each time it will search for the first instance of the requested string after the current position, so even if the same string occurs multiple times on a line, each call will skip to the next instance of that string.
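The semantics of both skipping methods can be sketched with a simple cursor on a string. This is not the DataParser code: the LineCursor struct and the free functions skipTo() and skipAfter() are hypothetical, and they throw std::runtime_error where the real methods would abort.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical cursor: a line, the current position, and an EOL flag.
struct LineCursor
{
    std::string line;
    std::size_t pos;
    bool eol;
    explicit LineCursor( const std::string& s ) : line( s ), pos( 0 ), eol( false ) {}
};

// Skip to a 0-based column; skipping backwards is an error.
void skipTo( LineCursor& c, std::size_t col )
{
    if( col < c.pos )
        throw std::runtime_error( "skipping backwards is not allowed" );
    if( col >= c.line.size() )
    {
        // beyond the end of the line: raise the EOL flag
        c.pos = c.line.size();
        c.eol = true;
    }
    else
        c.pos = col;
}

// Skip to just after the first instance of key at or after the current position.
void skipAfter( LineCursor& c, const std::string& key )
{
    std::size_t p = c.line.find( key, c.pos ); // case-sensitive search
    if( p == std::string::npos )
        throw std::runtime_error( "string not found: " + key );
    c.pos = p + key.size();
}
```

Because each search starts at the current position, calling skipAfter() twice with the same string moves the cursor past the second instance.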

If you are not interested in the data items at the end of the line, you can simply stop parsing and continue with the next line. The parser will not detect that the line was not fully parsed unless you explicitly check that yourself (see checkEOL() below).

Generating error messages and warnings

The data parser does all sorts of checks on the tokens that are being read. These can of course only be generic checks that apply to all tokens. Specific checks (e.g., a temperature in a data file must be greater than zero) must be done by the user. This can be done using the following method:

double temp;
d.getToken( temp );
if( temp <= 0. )
    d.errorAbort( "invalid temperature" );

If the test fails, the code will generate an error message containing the name of the data file and the location in that file, as well as the error message supplied in the call. The next line will show the line being parsed with an arrow underneath pointing at the location where the error occurred:

 stout/fe/fe_9/fe_9.coll:2:41: error: invalid temperature
  TEMP 1.62e+04 4.05e+04 8.10e+04 -1.62e+05 4.05e+05 8.10e+05 1.62e+06 4.05e+06 8.10e+06 1.62e+07
                                           ^
 [Stop in errorAbort at parser.cpp:1330, something went wrong]

The format is <filename>:<lineno>:<colno>: error: <error message>, where the line numbering starts at line 1 and the column numbering at column 0. This format is identical to what the GNU compiler uses. Since the parser indicates the location of the error, it is important to do the checks before other tokens are parsed. Otherwise the message would point to the wrong location and confuse the user.
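The message layout can be reproduced with a few lines of standard C++. The function formatError() below is a hypothetical illustration of the format, not the routine used by the parser. It omits the indentation and the final "[Stop in ...]" line seen in the sample output above.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: build <filename>:<lineno>:<colno>: error: <message>,
// followed by the offending line and a caret under column colno (0-based).
std::string formatError( const std::string& filename, long lineno, std::size_t colno,
                         const std::string& text, const std::string& message )
{
    std::ostringstream oss;
    oss << filename << ':' << lineno << ':' << colno << ": error: " << message << '\n'
        << text << '\n'
        << std::string( colno, ' ' ) << '^' << '\n';
    return oss.str();
}
```

The caret marks the current parsing position, which is why the check has to be done before the next token is read.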

There is also a method for generating warning messages. It works exactly the same way as the errorAbort() method above, except that the code will continue after generating the warning:

double temp;
d.getToken( temp );
if( temp < phycon.TEMP_LIMIT_LOW )
    d.warning( "temperature below Cloudy limit" );

Miscellaneous other methods

The method lgEOL() returns true if the end-of-line has been reached after skipping whitespace, and false otherwise. So if this routine returns false, there are tokens left on the line to be parsed. If only whitespace remains after reading the last token, this method will return true.

The checkEOL() method calls lgEOL() and aborts if it returns false. This can be used to enforce that no other tokens remain after parsing the line. This can be useful to detect corruption in a file, so using this method is encouraged where appropriate:

d.getline();
double x[10];
d.getToken( x, 10 ); // there should be exactly 10 numbers on this line
d.checkEOL();        // abort if tokens remain on the line

The method lgEOF() returns true if the end-of-file has been reached and false otherwise. This relates only to the EOF as defined by the system, and not to the in-file EOD markers discussed above. For the latter use the lgEODMarker() method shown above.

Sometimes data files are parsed twice, typically to first count the number of items in a file, then to allocate memory for storing the data, and finally to read the data file a second time and store the data items in the allocated memory. This is wasteful as reading data files can be expensive (especially the conversion to floating point numbers). Usually a better solution can be found by using C++ containers such as vector, which allow the arrays to be dynamically resized under the hood while parsing the data. However, for backward compatibility, the method rewind() is offered that allows data files to be parsed multiple times:

DataParser d( "data.dat", ES_NONE );
d.getline();
d.checkMagic( 20170923 );
d.getline();
// read more data...
d.rewind();
d.getline(); // this reads the line with the magic number, no need to check it twice...
d.getline();
// read more data...

But really, try to avoid using this method. Consider it deprecated.
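The single-pass alternative recommended above can be sketched as follows. The example uses a plain std::istream rather than DataParser, and readAll() is a hypothetical name. The point is only that a vector grows while the data are being read, so no separate counting pass is needed.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: read all whitespace-separated numbers in one pass.
// The vector is resized under the hood as items are appended.
std::vector<double> readAll( std::istream& in )
{
    std::vector<double> data;
    double x;
    while( in >> x )
        data.push_back( x );
    return data;
}
```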

The method setline() allows you to manually set the contents of a line. This is not useful when parsing data files. This method is intended to be used in unit testing.