Changes between Version 5 and Version 6 of DataParser


Timestamp: 2017-10-30T15:42:39Z (3 weeks ago)
Author: peter
Comment: rename EOF markers -> EOD markers, some other clarifications

  • DataParser

    v5 v6  
    77== Basic assumptions ==
    88
    9 The data parser will read the file on a line-by-line basis. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that if a line has a comment starting after the first column, but contains only whitespace leading up to the hash symbol, it will be considered a blank line. Such a line may be considered either a comment or an end-of-file (EOF) marker, depending on how the file was opened (see next section for more details).
     9The data parser will read the file on a line-by-line basis. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that if a line has a comment starting after the first column, but contains only whitespace leading up to the hash symbol, it will be considered a blank line. Such a line may be considered either a comment or an end-of-data (EOD) marker, depending on how the file was opened (see next section for more details).
    1010
    1111The parser assumes that all data items (which we will call tokens) are separated by whitespace. So it is assumed you can retrieve the next token from the line by skipping leading whitespace and then copying non-whitespace characters until the next whitespace character is found. Our definition of whitespace matches that of the isspace() routine provided by the system. This includes the normal space character and the horizontal tab (\t) which are both commonly used as separators. Other common separators (such as the comma) are not supported.
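The comment and token rules above can be illustrated with a small sketch. This is not the DataParser implementation; `split_tokens` is a hypothetical helper that assumes a hash symbol always starts a comment and that tokens are delimited by whatever isspace() reports as whitespace.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of DataParser: strip a trailing comment and
// split the rest of the line into whitespace-separated tokens.
std::vector<std::string> split_tokens(const std::string& line)
{
    // Everything from the first hash symbol to the end of the line is a comment.
    // find() returns npos when there is no hash, so substr() keeps the whole line.
    const std::string data = line.substr(0, line.find('#'));

    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < data.size())
    {
        // Skip leading whitespace, as defined by isspace().
        while (i < data.size() && std::isspace(static_cast<unsigned char>(data[i])))
            ++i;
        // Copy non-whitespace characters until the next whitespace character.
        const std::size_t start = i;
        while (i < data.size() && !std::isspace(static_cast<unsigned char>(data[i])))
            ++i;
        if (i > start)
            tokens.push_back(data.substr(start, i - start));
    }
    return tokens;
}
```

With this sketch, a line such as `1.5<TAB>2  Fe # comment` yields three tokens, while `1,2` yields a single token because the comma is not treated as a separator.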
     
    2323DataParser d( "sample.dat", ES_NONE );
    2424}}}
    25 Opening the file is done using open_data(), so the search path applies, just like it would have done if you had called open_data() directly. So in our example the file "sample.dat" will be searched along the search path. If it cannot be found, the code will abort, just like open_data() would have done. There is also a mandatory second parameter in the constructor, which indicates the style for including EOF markers in the file. There are 3 possible choices:
    26 {{{
    27 ES_NONE: there are no special in-file EOF markers.
    28 ES_STARS_ONLY: a field of stars (***) is an in-file EOF marker.
    29 ES_STARS_AND_BLANKS: both a blank line and a field of stars are in-file EOF markers.
    30 }}}
    31 An in-file EOF marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 stars starting in the first column. A blank line is a line containing only whitespace plus optionally a comment that starts ''after'' the first column. In the {{{ES_NONE}}} case, a field of stars will have no special meaning and will be considered a token. In the {{{ES_NONE, ES_STARS_ONLY}}} cases, a blank line will be considered a comment and will be automatically skipped.
     25Opening the file is done using open_data(), so the search path applies, just like it would have done if you had called open_data() directly. So in our example the file "sample.dat" will be searched along the search path. If it cannot be found, the code will abort, just like open_data() would have done. There is also a mandatory second parameter in the constructor, which indicates the style for including EOD markers in the file. There are 3 possible choices:
     26{{{
     27ES_NONE: there are no special in-file EOD markers.
     28ES_STARS_ONLY: a field of stars (***) is an in-file EOD marker.
     29ES_STARS_AND_BLANKS: both a blank line and a field of stars are in-file EOD markers.
     30}}}
     31An in-file EOD marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 contiguous stars starting in the first column. A blank line contains either zero or more whitespace characters and nothing else, or one or more whitespace characters followed by a comment. In the {{{ES_NONE}}} case, a field of stars will have no special meaning and will be considered a token. In the {{{ES_NONE}}} and {{{ES_STARS_ONLY}}} cases, a blank line will be considered a comment and will be automatically skipped.
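The v6 definitions of a field of stars and a blank line can be written down as predicates. The following is a minimal sketch under stated assumptions: the enum type name `EodStyleSketch` and the three functions are hypothetical and only mirror the wording above; only the enumerator names come from the page.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical type name; the enumerators follow the names used on this page.
enum EodStyleSketch { ES_NONE, ES_STARS_ONLY, ES_STARS_AND_BLANKS };

// At least 3 contiguous stars starting in the first column.
bool is_field_of_stars(const std::string& line)
{
    return line.compare(0, 3, "***") == 0;
}

// Zero or more whitespace characters and nothing else, or one or more
// whitespace characters followed by a comment.
bool is_blank_line(const std::string& line)
{
    std::size_t i = 0;
    while (i < line.size() && std::isspace(static_cast<unsigned char>(line[i])))
        ++i;
    if (i == line.size())
        return true;                 // only whitespace, possibly empty
    return i > 0 && line[i] == '#';  // whitespace first, then a comment
}

// Whether a line ends the data section for the given style.
bool is_eod_marker(const std::string& line, EodStyleSketch style)
{
    switch (style)
    {
    case ES_STARS_ONLY:
        return is_field_of_stars(line);
    case ES_STARS_AND_BLANKS:
        return is_field_of_stars(line) || is_blank_line(line);
    default:
        return false;                // ES_NONE: no in-file EOD markers
    }
}
```

Note that a line with the hash symbol in the first column is not blank under this definition, so it is never an EOD marker.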
    3232
    3333The !DataParser class also has an open() method for opening the file. It can be used as follows:
     
    8181The getline() method will automatically skip comment lines and also automatically strip any comment at the end of the line. This means you need not (and should not) worry about comments in data files while parsing.
    8282
    83 If you are using in-file EOF markers, you will need some extra code:
     83If you are using in-file EOD markers, you will need some extra code:
    8484{{{
    8585while( d.getline() )
    8686{
    87     if( d.lgEOFMarker() )
     87    if( d.lgEODMarker() )
    8888        break;
    8989    // ... parse one line ...
    9090}
    9191}}}
    92 The reason for this is that in-file EOF markers are not automatically handled by getline(), so you need to add code yourself to break out of the loop. This allows additional checks to be done, like in the Stout files where the mandatory presence of a field of stars is enforced. This could not be done if getline() would take over the task of lgEOFMarker() itself...
     92The reason for this is that in-file EOD markers are not automatically handled by getline(), so you need to add code yourself to break out of the loop. This allows additional checks to be done, like in the Stout files where the mandatory presence of a field of stars is enforced. This would not be possible if getline() took over the task of lgEODMarker() itself.
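To see why leaving the check to the caller is useful, consider a sketch that enforces a mandatory field of stars. It does not use DataParser at all: `read_until_stars` is a hypothetical stand-in built on std::getline(), and it does no comment handling.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the getline()/lgEODMarker() loop shown above.
// Returns true if a field of stars ended the data, false if the stream
// ran out first, i.e. the mandatory marker is missing.
bool read_until_stars(std::istream& in, std::vector<std::string>& data)
{
    std::string line;
    while (std::getline(in, line))
    {
        if (line.compare(0, 3, "***") == 0)
            return true;       // EOD marker found: stop parsing here
        data.push_back(line);  // ... parse one line ...
    }
    return false;              // system EOF reached before any marker
}
```

A caller can feed it a `std::istringstream` and report an error when it returns false. That distinction would be lost if the loop ended silently on the marker.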
    9393
    9494== Checking the magic number ==
     
    118118}
    119119}}}
    120 On each call the method getToken() will skip leading whitespace and then read the number until it hits another whitespace character. If the text is not fully used in the conversion, an error will be generated and the code will abort. This method is not limited to numeric tokens. It can e.g. also read characters or a string:
     120On each call the method getToken() will skip leading whitespace and then read the token until it hits another whitespace character. If the text is not fully used in the conversion, an error will be generated and the code will abort. This method is not limited to numeric tokens. It can e.g. also read characters or a string:
    121121{{{
    122122char c;
     
    125125d.getToken( s );
    126126}}}
    127 Note that there is a separate method for reading quoted strings that are allowed to contain whitespace. That will be discussed later.
     127Note that there is a separate method for reading quoted strings that are allowed to contain whitespace. That will be discussed later. Also note that parsing C-style strings is not supported!
    128128
    129129If a line contains a fixed number of data items that you want to read in one call, you can use:
     
    185185The parser will search for the first instance of the string "Aul" on the line and position itself immediately after that text. If the string cannot be found, the code will abort. The search is case sensitive, as is the case for all actions of the data parser. This method can be called multiple times. Each time it will search for the first instance of the requested string after the current position, so even if the same string occurs multiple times on a line, each call will skip to the next instance of that string.
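The positioning behaviour described above can be mimicked with std::string::find(). This is only a sketch: `LineCursor` and `skip_after` are hypothetical names, and the sketch throws an exception where the real parser is described as aborting.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical cursor over a single line; not the DataParser class.
struct LineCursor
{
    std::string line;
    std::size_t pos;

    explicit LineCursor(const std::string& l) : line(l), pos(0) {}

    // Case-sensitive search for the first instance of text at or after the
    // current position; on success, move to just past the match.
    void skip_after(const std::string& text)
    {
        const std::size_t found = line.find(text, pos);
        if (found == std::string::npos)
            throw std::runtime_error("string not found: " + text);
        pos = found + text.size();
    }
};
```

Calling `skip_after("Aul")` twice on the line `Aul 1.0 Aul 2.0` leaves the cursor after the first and then the second instance; a third call fails because no further instance follows.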
    186186
    187 If the data items you are not interested in are at the end of the line, you can simply stop parsing and continue with the next line. The parser will not detect that the line was not fully parsed unless you explicitly check that yourself (see checkEOL() below).
     187If you are not interested in the data items at the end of the line, you can simply stop parsing and continue with the next line. The parser will not detect that the line was not fully parsed unless you explicitly check that yourself (see checkEOL() below).
    188188
    189189== Generating error messages and warnings ==
     
    225225}}}
    226226
    227 The method lgEOF() returns true if the end-of-file has been reached and false otherwise. This relates only to the EOF as defined by the system, and not to the in-file EOF markers discussed above. For the latter use the lgEOFMarker() method shown above.
     227The method lgEOF() returns true if the end-of-file has been reached and false otherwise. This relates only to the EOF as defined by the system, and not to the in-file EOD markers discussed above. For the latter use the lgEODMarker() method shown above.
    228228
    229229Sometimes data files are parsed twice, typically to first count the number of items in a file, then to allocate memory for storing the data, and finally to read the data file a second time and store the data items in the allocated memory. This is wasteful as reading data files can be expensive (especially the conversion to floating point numbers). Usually a better solution can be found by using C++ containers such as vector, which allow the arrays to be dynamically resized under the hood while parsing the data. However, for backward compatibility, the method rewind() is offered that allows data files to be parsed multiple times: