Changes between Version 4 and Version 5 of DataParser


Timestamp:
2017-10-27T15:51:16Z
Author:
peter
Comment:

Add remaining sections

  • DataParser

    v4 v5  
    77== Basic assumptions ==
    88
    9 The data parser will read the file on a line-by-line basis. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that if a line has a comment starting after the first column, but contains only whitespace leading up to the hash symbol, will be considered a blank line. Such a line may be considered either a comment or an end-of-file (EOF) marker, depending on how the file was opened (see next section for more details).
     9The data parser will read the file on a line-by-line basis. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that if a line has a comment starting after the first column, but contains only whitespace leading up to the hash symbol, it will be considered a blank line. Such a line may be considered either a comment or an end-of-file (EOF) marker, depending on how the file was opened (see next section for more details).
    1010
    1111The parser assumes that all data items (which we will call tokens) are separated by whitespace. So it is assumed you can retrieve the next token from the line by skipping leading whitespace and then copying non-whitespace characters until the next whitespace character is found. Our definition of whitespace matches that of the isspace() routine provided by the system. This includes the normal space character and the horizontal tab (\t) which are both commonly used as separators. Other common separators (such as the comma) are not supported.
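The comment and whitespace rules above can be illustrated with a minimal sketch. The helper splitTokens() is hypothetical and not part of the data parser; it assumes that a hash symbol always starts a comment (quoted strings are not considered here) and uses isspace() to define whitespace:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of DataParser: strip a trailing comment
// and split the remainder of the line into whitespace-separated tokens.
std::vector<std::string> splitTokens( const std::string& line )
{
    // everything from the first hash symbol to the end of the line is a comment
    const std::string data = line.substr( 0, line.find( '#' ) );
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while( i < data.size() )
    {
        // skip leading whitespace, as defined by isspace()
        while( i < data.size() && std::isspace( static_cast<unsigned char>( data[i] ) ) )
            ++i;
        const std::size_t start = i;
        // copy non-whitespace characters until the next whitespace character
        while( i < data.size() && !std::isspace( static_cast<unsigned char>( data[i] ) ) )
            ++i;
        if( i > start )
            tokens.push_back( data.substr( start, i - start ) );
    }
    return tokens;
}
```

Since the comma is not a separator, a line such as "1,2" yields a single token in this sketch.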
     
    2929ES_STARS_AND_BLANKS: both a blank line and a field of stars are in-file EOF markers.
    3030}}}
    31 An in-file EOF marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 stars starting in the first column. A blank line is a line containing only whitespace plus optionally a comment that starts ''after'' the first column. In the {{{ES_NONE}}} case, a field of stars will have no special meaning. In the {{{ES_NONE, ES_STARS_ONLY}}} cases, a blank line will be considered a comment and will be automatically skipped.
     31An in-file EOF marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 stars starting in the first column. A blank line is a line containing only whitespace plus optionally a comment that starts ''after'' the first column. In the {{{ES_NONE}}} case, a field of stars will have no special meaning and will be considered a token. In the {{{ES_NONE, ES_STARS_ONLY}}} cases, a blank line will be considered a comment and will be automatically skipped.
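The definitions of a field of stars and a blank line can be sketched as follows. All names here (LineKind, Mode, classifyLine, isEOFMarker) are hypothetical and only mirror the text; the enum Mode stands in for the ES_NONE, ES_STARS_ONLY and ES_STARS_AND_BLANKS settings, and it is assumed that a field of stars is an EOF marker in the ES_STARS_ONLY case:

```cpp
#include <cctype>
#include <string>

// Hypothetical types, not taken from the DataParser source.
enum class LineKind { Comment, Blank, Stars, Data };
enum class Mode { None, StarsOnly, StarsAndBlanks };

LineKind classifyLine( const std::string& line )
{
    // a hash symbol in the first column makes the whole line a comment
    if( !line.empty() && line[0] == '#' )
        return LineKind::Comment;
    // a field of stars: at least 3 stars starting in the first column
    if( line.compare( 0, 3, "***" ) == 0 )
        return LineKind::Stars;
    // a blank line: only whitespace, optionally followed by a comment
    for( char c : line )
    {
        if( c == '#' )
            break;
        if( !std::isspace( static_cast<unsigned char>( c ) ) )
            return LineKind::Data;
    }
    return LineKind::Blank;
}

// Decide whether a line of the given kind ends the parsed part of the file.
bool isEOFMarker( LineKind kind, Mode mode )
{
    if( kind == LineKind::Stars )
        return mode != Mode::None;
    if( kind == LineKind::Blank )
        return mode == Mode::StarsAndBlanks;
    return false;
}
```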
    3232
    3333The !DataParser class also has an open() method for opening the file. It can be used as follows:
     
    5050if( !d.isOpen() )
    5151{
    52     // ... do some error handling ...
     52    // ... file is absent, do some error handling ...
    5353}
    5454else
     
    112112{
    113113    long ilo, ihi;
    114     d.getToken(ilo);
    115     d.getToken(ihi);
     114    d.getToken( ilo );
     115    d.getToken( ihi );
    116116    double Aul;
    117     d.getToken(Aul);
     117    d.getToken( Aul );
    118118}
    119119}}}
     
    121121{{{
    122122char c;
    123 d.getToken(c);
     123d.getToken( c );
    124124string s;
    125 d.getToken(s);
     125d.getToken( s );
    126126}}}
    127127Note that there is a separate method for reading quoted strings that are allowed to contain whitespace. That will be discussed later.
     
    130130{{{
    131131double temps[10];
    132 d.getToken(temps,10);
     132d.getToken( temps, 10 );
    133133}}}
    134134
     
    137137vector<double> temps;
    138138double token;
    139 while( d.getTokenOptional(token) )
    140 {
    141     temps.emplace_back(token);
     139while( d.getTokenOptional( token ) )
     140{
     141    temps.emplace_back( token );
    142142}
    143143}}}
    144144The routine getTokenOptional() will return true as long as reading another number was successful. If reading the number fails, it will not generate an error but will return false instead, and the token will be set to a default-constructed instance of the relevant type (zero for integer and floating point numbers, an empty string for strings, etc.).
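The "return false and reset the token" behavior described above can be mimicked with standard streams. This is only a sketch under the assumption that a std::istream stands in for the current line; readOptional() and readAllDoubles() are hypothetical names, not methods of the data parser:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for getTokenOptional(): try to read one token and,
// on failure, reset it to a default-constructed value instead of aborting.
template<class T>
bool readOptional( std::istream& in, T& token )
{
    if( in >> token )
        return true;
    token = T(); // zero for numbers, empty string for strings
    return false;
}

// Read numbers from a line until the first token that is not a number.
std::vector<double> readAllDoubles( const std::string& line )
{
    std::istringstream in( line );
    std::vector<double> temps;
    double token;
    while( readOptional( in, token ) )
        temps.emplace_back( token );
    return temps;
}
```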
    145145
    146 For certain data items, a separate method has been created. The first allows you to read quoted strings:
     146For certain data items, a separate, specialized method has been created. The first allows you to read quoted strings:
    147147{{{
    148148string text;
    149 d.getQuote(text);
     149d.getQuote( text );
    150150}}}
    151151This method will read a string between double quotes, which allows you to read strings with whitespace embedded. If the first or the second double quote cannot be found, the code will abort. There is also an optional version of this method:
    152152{{{
    153153string text;
    154 bool success = d.getQuoteOptional(text);
    155 }}}
    156 This method will return true if reading was successful, and false otherwise. In the latter case, the variable text will be set to an empty string.
     154bool success = d.getQuoteOptional( text );
     155}}}
     156This method will return true if reading was successful, and false otherwise. In the latter case, the variable "text" will be set to an empty string.
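The logic of the optional variant can be sketched with std::string::find(). The function readQuoteOptional() below is hypothetical; it assumes the line is available as a std::string with a current position, and it simply searches for the next pair of double quotes (how the real method treats text before the opening quote is not specified here):

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of an optional quoted-string read. On success the
// text between the quotes is stored and pos moves past the closing quote.
bool readQuoteOptional( const std::string& line, std::size_t& pos, std::string& text )
{
    const std::size_t npos = std::string::npos;
    const std::size_t first = line.find( '"', pos );
    const std::size_t second = ( first == npos ) ? npos : line.find( '"', first + 1 );
    if( second == npos )
    {
        // first or second double quote is missing
        text.clear();
        return false;
    }
    text = line.substr( first + 1, second - first - 1 );
    pos = second + 1;
    return true;
}
```

A non-optional variant would abort with an error message where this sketch returns false.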
     157
     158The second specialized method is for reading line IDs. This routine works very similarly to the getLineID() method that is part of the input parser. It reads a data item of type LineID, consisting of a species label and a wavelength. The species label must start in the first column of the line, and can be either a quoted string (in which case the label may be of arbitrary length) or the first 4 columns of the line, which are then used verbatim as the label. The next item on the line must be the wavelength in the usual Cloudy format. There must be whitespace between the label and the wavelength. If a 4-column line label is used, the wavelength must therefore start in the 6th column or later (unless the label ends in whitespace, as is e.g. the case for CO, in which case the 5th column is acceptable). It is not permitted to read other tokens from a line before the line ID is read. The code will abort if this is attempted. This is an example of the use of this method:
     159{{{
     160d.getline();
     161LineID id;
     162d.getLineID( id ); // must be the first token read from this line
     163}}}
     164This method is currently only used for reading line list files. There are two differences with the method in the input parser. First, it does allow other tokens to be read after the line label was parsed. This is for possible future extensions of the code. The parser for line list files ''does'' enforce that there are no other data items on the line by calling checkEOL(). The second difference is that the input parser allows the wavelength to be an expression, while the data parser does not. In practice it is highly unlikely that this difference will ever be of any significance. The calling sequence is also different, to be in line with the other methods for reading tokens in the data parser.
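The label and column rules for line IDs can be sketched as follows. The struct LineIDSketch and the function parseLineID() are hypothetical and are not the real LineID type or method; the wavelength is read as a plain number with strtod(), so any unit suffix of the Cloudy wavelength format is ignored in this sketch:

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the LineID type.
struct LineIDSketch
{
    std::string label;
    double wavelength;
};

bool parseLineID( const std::string& line, LineIDSketch& id )
{
    const bool quoted = !line.empty() && line[0] == '"';
    std::size_t pos;
    if( quoted )
    {
        // quoted label of arbitrary length, starting in the first column
        const std::size_t close = line.find( '"', 1 );
        if( close == std::string::npos )
            return false;
        id.label = line.substr( 1, close - 1 );
        pos = close + 1;
    }
    else
    {
        // the first 4 columns are used verbatim as the label
        if( line.size() < 4 )
            return false;
        id.label = line.substr( 0, 4 );
        pos = 4;
    }
    // whitespace is required between the label and the wavelength; a 4-column
    // label that itself ends in whitespace already provides it
    const bool labelEndsInSpace =
        !quoted && std::isspace( static_cast<unsigned char>( line[3] ) );
    const bool gap =
        pos < line.size() && std::isspace( static_cast<unsigned char>( line[pos] ) );
    if( !labelEndsInSpace && !gap )
        return false;
    const char* start = line.c_str() + pos;
    char* end = nullptr;
    id.wavelength = std::strtod( start, &end );
    return end != start; // false if no number could be read
}
```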
    157165
    158166== Skipping parts of the line ==
     
    161169{{{
    162170long index;
    163 d.getToken(index); // read first integer on the line
    164 d.skipTo(56);      // skip to column 56
     171d.getToken( index ); // read first integer on the line
     172d.skipTo( 56 );      // skip to column 56
    165173realnum Aul;
    166 d.getToken(Aul);   // read next token...
     174d.getToken( Aul );   // read next token...
    167175}}}
    168176Note that skipping backwards is not allowed (this wouldn't make any sense). Skipping beyond the end of the line will raise the EOL flag and all subsequent attempts to read a token will fail. This method is obviously only useful for data files that are strictly aligned on columns.
     
    170178The second method is to skip after a specified string:
    171179{{{
    172 d.getToken(index);  // read first integer on the line
    173 d.skipAfter("Aul:"); // skip after string "Aul"
     180d.getToken( index );   // read first integer on the line
     181d.skipAfter( "Aul:" ); // skip after string "Aul:"
    174182realnum Aul;
    175 d.getToken(Aul);    // read next token...
     183d.getToken( Aul );     // read next token...
    176184}}}
    177185The parser will search for the first instance of the string "Aul:" on the line and position itself immediately after that text. If the string cannot be found, the code will abort. The search is case sensitive, as is the case for all actions of the data parser. This method can be called multiple times. Each time it will search for the first instance of the requested string after the current position, so even if the same string occurs multiple times on a line, each call will skip to the next instance of that string.
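The repeated-search behavior can be sketched with std::string::find(). The function skipAfterSketch() is hypothetical; it assumes the line and the current position are available as a std::string and an index, and it returns false where the real method would abort:

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: move pos to just after the first instance of s that
// is found at or after the current position. The search is case sensitive.
bool skipAfterSketch( const std::string& line, std::size_t& pos, const std::string& s )
{
    const std::size_t found = line.find( s, pos );
    if( found == std::string::npos )
        return false;
    pos = found + s.size();
    return true;
}
```

Because each search starts at the current position, calling the function twice on a line that contains the string twice lands after the second instance.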
    178186
    179 If the data items you are not interested in are at the end of the line, you can simply stop parsing and continue with the next line. The parser will not detect that the line was not fully parsed unless you explicitly check that yourself (see below).
     187If the data items you are not interested in are at the end of the line, you can simply stop parsing and continue with the next line. The parser will not detect that the line was not fully parsed unless you explicitly check that yourself (see checkEOL() below).
    180188
    181189== Generating error messages and warnings ==
    182190
     191The data parser does all sorts of checks on the tokens that are being read. These can of course only be generic checks that apply to all tokens. Specific checks (e.g., a temperature in a data file must be greater than zero) must be done by the user. This can be done using the following method:
     192{{{
     193double temp;
     194d.getToken( temp );
     195if( temp <= 0. )
     196    d.errorAbort( "invalid temperature" );
     197}}}
     198If the test fails, the code will generate an error message containing the name of the data file and the location in that file, as well as the error message supplied in the call. The next line will show the line being parsed with an arrow underneath pointing at the location where the error occurred:
     199{{{
     200 stout/fe/fe_9/fe_9.coll:2:41: error: invalid temperature
     201  TEMP 1.62e+04 4.05e+04 8.10e+04 -1.62e+05 4.05e+05 8.10e+05 1.62e+06 4.05e+06 8.10e+06 1.62e+07
     202                                           ^
     203 [Stop in errorAbort at parser.cpp:1330, something went wrong]
     204}}}
     205The format is <filename>:<lineno>:<colno>: error: <error message>, where the line numbering starts at line 1 and the column numbering at column 0. This format is identical to what the GNU compiler uses. Since the parser indicates the location of the error, it is important to do the checks before other tokens are parsed. Otherwise the message would point to the wrong location and confuse the user.
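The message layout can be sketched as follows. The function formatError() is hypothetical and only reproduces the <filename>:<lineno>:<colno> format described above; it assumes 1-based line numbers and 0-based column numbers, so the arrow is preceded by exactly colno spaces:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical sketch of the error layout: a header line, the offending
// line, and an arrow underneath pointing at column colno (0-based).
std::string formatError( const std::string& file, long lineno, std::size_t colno,
                         const std::string& lineText, const std::string& msg )
{
    std::ostringstream oss;
    oss << file << ":" << lineno << ":" << colno << ": error: " << msg << "\n";
    oss << lineText << "\n";
    oss << std::string( colno, ' ' ) << "^\n";
    return oss.str();
}
```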
     206
     207There is also a method for generating warning messages. It works exactly the same way as the errorAbort() method above, except that the code will continue after generating the warning:
     208{{{
     209double temp;
     210d.getToken( temp );
     211if( temp < phycon.TEMP_LIMIT_LOW )
     212    d.warning( "temperature below Cloudy limit" );
     213}}}
     214
    183215== Miscellaneous other methods ==
     216
     217The method lgEOL() returns true if the end-of-line has been reached ''after skipping whitespace'', and false otherwise. So if this routine returns false, there are tokens left on the line to be parsed. If only whitespace remains after reading the last token, this method will return true.
     218
     219The checkEOL() method calls lgEOL() and aborts if it returns false. This can be used to enforce that no other tokens remain after parsing the line. This can be useful to detect corruption in a file, so using this method is encouraged where appropriate:
     220{{{
     221d.getline();
     222double x[10];
     223d.getToken( x, 10 ); // there should be exactly 10 numbers on this line
     224d.checkEOL();        // abort if tokens remain on the line
     225}}}
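The test performed by lgEOL() can be sketched as below. The function atEOL() is hypothetical; it assumes the line is held as a std::string with any trailing comment already removed, and that pos is the position just after the last token read:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: true if only whitespace remains from pos onward.
bool atEOL( const std::string& line, std::size_t pos )
{
    while( pos < line.size() && std::isspace( static_cast<unsigned char>( line[pos] ) ) )
        ++pos;
    return pos >= line.size();
}
```

A check in the style of checkEOL() would then abort with an error message whenever this function returns false.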
     226
     227The method lgEOF() returns true if the end-of-file has been reached and false otherwise. This relates only to the EOF as defined by the system, and not to the in-file EOF markers discussed above. For the latter use the lgEOFMarker() method shown above.
     228
     229Sometimes data files are parsed twice, typically to first count the number of items in a file, then to allocate memory for storing the data, and finally to read the data file a second time and store the data items in the allocated memory. This is wasteful as reading data files can be expensive (especially the conversion to floating point numbers). Usually a better solution can be found by using C++ containers such as vector, which allow the arrays to be dynamically resized under the hood while parsing the data. However, for backward compatibility, the method rewind() is offered that allows data files to be parsed multiple times:
     230{{{
     231DataParser d( "data.dat", ES_NONE );
     232d.getline();
     233d.checkMagic( 20170923 );
     234d.getline();
     235// read more data...
     236d.rewind();
     237d.getline(); // this reads the line with the magic number, no need to check it twice...
     238d.getline();
     239// read more data...
     240}}}
     241But really, try to avoid using this method. Consider it deprecated.
     242
     243The method setline() allows you to manually set the contents of a line. This is not useful when parsing data files. This method is intended to be used in unit testing.