Using the Standard Library

In this example, we will develop a simple parser for Comma Separated Value (CSV) files. The parser will follow these rules:

  • Each record will occupy one line, and newline indicates a new record
  • Fields in the record are separated by commas, unless they are within a quoted string
  • Strings can be quoted using single (') or double quotes ("), in which case they can contain commas as part of the string
  • A quote that is immediately repeated ('' or "") is a literal quote character and part of the string, rather than a delimiter of the string
  • If a string is quoted, then spaces outside of the string are ignored

This is a very basic implementation, and omits the usual requirement that quoted strings can contain newlines.

In this example, much of the manipulation will be using string objects as containers of individual characters.
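As a small illustration of treating a string as a container of characters, the following sketch walks a string with an iterator and appends selected characters to another string with push_back. The function name digits_only is a hypothetical helper chosen for this illustration; it is not part of csv_parser.cpp.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of csv_parser.cpp): copies only the digit
// characters of text, treating the string as a container of char.
std::string digits_only(const std::string& text)
{
    std::string result;
    for (std::string::const_iterator it = text.begin(); it != text.end(); ++it)
    {
        // isdigit expects a value representable as unsigned char
        if (std::isdigit(static_cast<unsigned char>(*it)))
        {
            result.push_back(*it); // append one character, as with a vector
        }
    }
    return result;
}
```

The parser developed in this example uses the same two operations: dereferencing an iterator to read a character, and push_back to build up a new string.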

Start by creating a folder for the chapter called Chapter_08 in the folder for this book. In that folder, create a file called csv_parser.cpp. Since the application will use console output and file input, add the following lines at the top of the file:

    #include <iostream>
    #include <fstream>

    using namespace std;

The application will also take a command line parameter that is the CSV file to parse, so add the following code at the bottom of the file:

    void usage()
    {
        cout << "usage: csv_parser file" << endl;
        cout << "where file is the path to a csv file" << endl;
    }

    int main(int argc, const char* argv[])
    {
        if (argc <= 1)
        {
            usage();
            return 1;
        }
        return 0;
    }

The application will read a file line by line into a vector of string objects, so add <vector> to the list of include files. To make the coding easier, define the following above the usage function:

    using namespace std;
    using vec_str = vector<string>;

The main function will read the file in line by line and the simplest way to do this is to use the getline function, so add the <string> header file to the include file list. Add the following lines to the end of the main function:

    ifstream stm;
    stm.open(argv[1], ios_base::in);
    if (!stm.is_open())
    {
        usage();
        cout << "cannot open " << argv[1] << endl;
        return 1;
    }

    vec_str lines;
    for (string line; getline(stm, line); )
    {
        if (line.empty()) continue;
        lines.push_back(move(line));
    }
    stm.close();

The first few lines open the file using an ifstream object. If the file cannot be opened (for example, because it cannot be found), the open operation fails, and this is tested by calling is_open. Next, a vector of string objects is declared and filled with lines read from the file. The getline function has two parameters: the first is the open file stream object and the second is a string to contain the character data. This function returns the stream object, which has a bool conversion operator, and hence the for statement will loop until this stream object indicates that it can read no more data. When getline attempts to read beyond the end of the file, the stream's end-of-file and fail flags are set, and the fail flag causes the bool conversion operator to return false.

If the getline function reads a blank line, the resulting empty string cannot be parsed, so there is a test for this, and such blank lines are not stored. Each legitimate line is pushed into the vector. Since the contents of this string variable will not be used after this operation, we can use move semantics, and this is made explicit by calling the move function.
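To experiment with this loop without a file on disk, the same logic can be applied to any input stream. The following is a minimal sketch; the function name read_lines is hypothetical and is not part of csv_parser.cpp, and it takes an istream reference so that an istringstream can stand in for the file.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for experimenting: reads the non-blank lines from
// any input stream, using the same loop shape as the main function.
std::vector<std::string> read_lines(std::istream& in)
{
    std::vector<std::string> lines;
    // getline returns the stream; its bool conversion is false once a
    // read fails, which happens when there is nothing left to extract.
    for (std::string line; std::getline(in, line); )
    {
        if (line.empty()) continue;       // blank lines are not stored
        lines.push_back(std::move(line)); // getline overwrites line next time round
    }
    return lines;
}
```

Passing an istringstream initialized with "a,b\n\nc,d\n" should yield two strings, and afterwards both the eof and fail member functions of the stream return true, because the final getline call found no more characters to extract.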

This code will now compile and run (although it will produce no output). You can use it on any CSV file that meets the criteria given previously, but as a test file we have used the following file:

    George Washington,1789,1797
    "John Adams, Federalist",1797,1801
    "Thomas Jefferson, Democratic Republican",1801,1809
    "James Madison, Democratic Republican",1809,1817
    "James Monroe, Democratic Republican",1817,1825
    "John Quincy Adams, Democratic Republican",1825,1829
    "Andrew Jackson, Democratic",1829,1837
    "Martin Van Buren, Democratic",1837,1841
    "William Henry Harrison, Whig",1841,1841
    "John Tyler, Whig",1841,1841
    John Tyler,1841,1845

These are US presidents up to 1845. The first field is the name of the president and their affiliation; when the president has no affiliation, it is omitted (Washington and Tyler). The name is followed by the start and end years of the term of office.

Next, we want to parse the data in the vector and split the items into individual fields according to the rules given previously (fields separated by commas, but quotation marks are respected). To do this, we will represent each line as a list of fields, with each field being a string. Add an include for <list> near the top of the file. At the top of the file, where the using declarations are made, add the following:

    using namespace std;
    using vec_str = vector<string>;
    using list_str = list<string>;
    using vec_list = vector<list_str>;

Now, at the bottom of the main function, add:

    vec_list parsed;
    for (string& line : lines)
    {
        parsed.push_back(parse_line(line));
    }

The first line creates the vector of list objects, and the for loop iterates through each line calling a function called parse_line that parses a string and returns a list of string objects. The return value of the function will be a temporary object and hence an rvalue, so this means that the version of push_back with move semantics will be called.
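One way to observe which overload of push_back is chosen is to use a type that counts its own copy and move constructions. The following is a sketch only; the Tracker type and the functions around it are hypothetical names introduced for this illustration and are not part of csv_parser.cpp.

```cpp
#include <vector>

// Hypothetical type used only to observe which constructor runs.
struct Tracker
{
    static int copies;
    static int moves;
    Tracker() = default;
    Tracker(const Tracker&) { ++copies; }
    Tracker(Tracker&&) noexcept { ++moves; }
};
int Tracker::copies = 0;
int Tracker::moves = 0;

struct Counts { int copies; int moves; };

// Returns a temporary, just as parse_line returns a temporary list.
Tracker make_tracker() { return Tracker(); }

// Pushes the temporary returned by a function (an rvalue).
Counts push_temporary()
{
    Tracker::copies = Tracker::moves = 0;
    std::vector<Tracker> v;
    v.reserve(1);                // avoid reallocation during push_back
    v.push_back(make_tracker()); // push_back(Tracker&&) is selected
    return { Tracker::copies, Tracker::moves };
}

// Pushes a named object (an lvalue).
Counts push_named()
{
    Tracker named;
    Tracker::copies = Tracker::moves = 0;
    std::vector<Tracker> v;
    v.reserve(1);
    v.push_back(named);          // push_back(const Tracker&) is selected
    return { Tracker::copies, Tracker::moves };
}
```

Calling push_temporary should report no copies and at least one move, whereas push_named should report exactly one copy and no moves. The exact number of moves for the temporary can depend on the language standard in use, because older compilers are permitted, but not required, to elide the move out of make_tracker.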

Above the usage function, add the start of the parse_line function:

    list_str parse_line(const string& line)
    {
        list_str data;
        string::const_iterator it = line.begin();

        return data;
    }

The function will treat the string as a container of characters and hence it will iterate through the line parameter with a const_iterator. The parsing will be carried out in a do loop, so add the following:

    list_str data;
    string::const_iterator it = line.begin();
    string item;
    bool bQuote = false;
    bool bDQuote = false;
    do
    {
        ++it;
    } while (it != line.end());
    data.push_back(move(item));
    return data;

The Boolean variables will be explained in a moment. The do loop increments the iterator, and when it reaches the end value, the loop finishes. The item variable will hold the parsed data (at this point it is empty) and the last line will put the value into the list; this is so that any unsaved data is stored in the list before the function finishes. Since the item variable is about to be destroyed, the call to move ensures that its contents are moved into the list rather than copied. Without this call, the string copy constructor will be called when putting the item into the list.

Next, you need to do the parsing of the data. To do this, add a switch to test for the three cases: a comma (to indicate the end of a field), and a quote or a double quote to indicate a quoted string. The idea is to read each field and build its value up character by character, using the item variable.

    do
    {
        switch (*it)
        {
        case '\'':
            break;

        case '"':
            break;

        case ',':
            break;

        default:
            item.push_back(*it);
            break;
        }

        ++it;
    } while (it != line.end());

The default action is simple: it copies the character into the temporary string. If the character is a single quote, we have two options. Either the quote is within a string that is double-quoted, in which case we want the quote to be stored in item, or the quote is a delimiter, in which case we store whether it is the opening or closing quote by setting the bQuote value. For the case of a single quote, add the following:

    case '\'':
        if (bDQuote) item.push_back(*it);
        else
        {
            bQuote = !bQuote;
            if (bQuote) item.clear();
        }
        break;

This is simple enough. If this is in a double-quoted string (bDQuote is set), then we store the quote. If not, then we flip the bQuote bool so that if this is the first quote, we register that the string is quoted, otherwise we register that it is the end of a string. If we are at the start of a quoted string, we clear the item variable to ignore any spaces between the previous comma (if there is one) and the quote. However, this code does not take into account the use of two quote marks next to each other, which means that the quote is a literal and part of the string. Change the code to add a check for this situation:

    if (bDQuote) item.push_back(*it);
    else
    {
        if ((it + 1) != line.end() && *(it + 1) == '\'')
        {
            item.push_back(*it);
            ++it;
        }
        else
        {
            bQuote = !bQuote;
            if (bQuote) item.clear();
        }
    }

The if statement first checks that the next position, it + 1, is not the end of the line; if it is, short-circuiting ensures that the rest of the expression is not evaluated, so the end iterator is never dereferenced. Otherwise, we peek at the next character to see whether it is also a single quote; if it is, we add one quote to the item variable and increment the iterator so that both quotes are consumed in the loop.
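The peek can be isolated into a small function to make the role of short-circuit evaluation easier to see. This is a sketch only: next_is is a hypothetical helper that does not appear in csv_parser.cpp, and it assumes that the iterator passed in refers to a valid character (that is, it is not itself the end iterator).

```cpp
#include <string>

// Hypothetical helper: returns true when there is a character after the
// one at `it` and that character equals ch. Assumes it != end.
bool next_is(std::string::const_iterator it,
             std::string::const_iterator end,
             char ch)
{
    // The left operand of && is evaluated first. If it + 1 is the end
    // iterator, the result is already false and the dereference on the
    // right is never attempted.
    return (it + 1) != end && *(it + 1) == ch;
}
```

For the string a'' (an a followed by two single quotes), the function returns true when called on the first quote and false when called on the second, because in the latter case there is no next character to examine.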

The code for the double quote is similar, but switches over the Boolean variables and tests for double quotes:

    case '"':
        if (bQuote) item.push_back(*it);
        else
        {
            if ((it + 1) != line.end() && *(it + 1) == '"')
            {
                item.push_back(*it);
                ++it;
            }
            else
            {
                bDQuote = !bDQuote;
                if (bDQuote) item.clear();
            }
        }
        break;

Finally, we need code to test for a comma. Again, we have two situations: either this is a comma in a quoted string, in which case we need to store the character, or it's the end of a field, in which case we need to finish the parsing for this field. The code is quite simple:

    case ',':
        if (bQuote || bDQuote) item.push_back(*it);
        else
        {
            data.push_back(move(item));
            item.clear(); // a moved-from string is valid but unspecified; reset it
        }
        break;

The if statement tests to see if we are in a quoted string (in which case either bQuote or bDQuote will be true), and if so, the character is stored. If this is the end of the field, we push the string into the list, but we use move so that the character data is moved across rather than copied. After the move, the string object is left in a valid but unspecified state (in practice, it is usually empty).
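Putting the three cases together, the complete function might look like the following sketch. It is assembled from the fragments in this section and written with explicit std:: qualification so that it can be compiled on its own. It assumes that the line is not empty (main filters out blank lines before calling it), and it explicitly clears item after each move, which is a defensive addition rather than something the rules require.

```cpp
#include <list>
#include <string>
#include <utility>

using list_str = std::list<std::string>;

// Sketch of the assembled parser. Assumes line is not empty.
list_str parse_line(const std::string& line)
{
    list_str data;
    std::string item;
    bool bQuote = false;  // currently inside a single-quoted string
    bool bDQuote = false; // currently inside a double-quoted string
    std::string::const_iterator it = line.begin();
    do
    {
        switch (*it)
        {
        case '\'':
            if (bDQuote) item.push_back(*it);
            else if ((it + 1) != line.end() && *(it + 1) == '\'')
            {
                item.push_back(*it); // doubled quote: store one literal quote
                ++it;                // and skip the second
            }
            else
            {
                bQuote = !bQuote;
                if (bQuote) item.clear(); // drop spaces before the opening quote
            }
            break;

        case '"':
            if (bQuote) item.push_back(*it);
            else if ((it + 1) != line.end() && *(it + 1) == '"')
            {
                item.push_back(*it);
                ++it;
            }
            else
            {
                bDQuote = !bDQuote;
                if (bDQuote) item.clear();
            }
            break;

        case ',':
            if (bQuote || bDQuote) item.push_back(*it);
            else
            {
                data.push_back(std::move(item)); // end of field
                item.clear();                    // reset the moved-from string
            }
            break;

        default:
            item.push_back(*it);
            break;
        }
        ++it;
    } while (it != line.end());
    data.push_back(std::move(item)); // store the final field
    return data;
}
```

With this sketch, the test line for John Adams should produce three fields, the first of which is the whole quoted string including its comma, and a field written as 'it''s' should produce the text it's.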

This code will compile and run. However, there is still no output, so before we redress that, review the code that you have written. At the end of the main function you will have a vector in which each item is a list object representing one row of the CSV file, and each item in the list is a field. You have now parsed the file and can use this data accordingly. So that you can see that the data has been parsed, add the following lines to the bottom of the main function:

    int count = 0;
    for (const list_str& row : parsed)
    {
        cout << ++count << "> ";
        for (const string& field : row)
        {
            cout << field << " ";
        }
        cout << endl;
    }

You can now compile the code (use the /EHsc switch) and run the application passing the name of a CSV file.
