Creating the project

Create a folder for the project and create a C++ file called email_parser.cpp. Since this application will read files and process strings, add includes for the appropriate libraries and add code to take the name of a file from the command line:

    #include <iostream>
    #include <fstream>
    #include <string>

    using namespace std;

    void usage()
    {
        cout << "usage: email_parser file" << "\n";
        cout << "where file is the path to a file" << "\n";
    }

    int main(int argc, char *argv[])
    {
        // the file name is the only command-line parameter
        if (argc <= 1)
        {
            usage();
            return 1;
        }

        ifstream stm;
        stm.open(argv[1], ios_base::in);
        if (!stm.is_open())
        {
            usage();
            cout << "cannot open " << argv[1] << "\n";
            return 1;
        }

        return 0;
    }

A header will have a name and a body. The body could be a single string, or one or more subitems. Create a class to represent the body of a header, and for the time being, treat this as a single line. Add the following class above the usage function:

    class header_body
    {
        string body;
    public:
        header_body() = default;
        header_body(const string& b) : body(b) {}
        string get_body() const { return body; }
    };

This class simply wraps a string; later on we will add code to separate out the subitems in the body data member. Now create a class to represent the email. Add the following code after the header_body class:

    class email
    {
        using iter = vector<pair<string, header_body>>::iterator;
        vector<pair<string, header_body>> headers;
        string body;

    public:
        email() : body("") {}

        // accessors
        string get_body() const { return body; }
        string get_headers() const;
        iter begin() { return headers.begin(); }
        iter end() { return headers.end(); }

        // two stage construction
        void parse(istream& fin);
    private:
        void process_headers(const vector<string>& lines);
    };

The headers data member holds the headers as name/value pairs. The items are stored in a vector rather than a map because, as an email is passed from mail server to mail server, each server may add headers with names that already exist in the email, so header names can be duplicated. We could use a multimap, but then we would lose the ordering of the headers, since a multimap stores its items sorted by key to aid searching.

A vector keeps the items in the order that they are inserted in the container, and since we will parse the email serially, this means that the headers data member will have the header items in the same order as in the email. Add an appropriate include so that you can use the vector class.
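The difference in ordering can be seen in a small sketch that is separate from the email_parser project. The helper names names_in_vector_order and names_in_multimap_order are made up for this illustration, and plain strings stand in for the header_body class:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using namespace std;

// Hypothetical helper: returns the header names in the order the vector holds them.
vector<string> names_in_vector_order(const vector<pair<string, string>>& items)
{
    vector<string> names;
    for (const auto& item : items) names.push_back(item.first);
    return names;
}

// Hypothetical helper: copies the same items into a multimap and returns
// the names in the order the multimap iterates over them (sorted by key).
vector<string> names_in_multimap_order(const vector<pair<string, string>>& items)
{
    multimap<string, string> sorted(items.begin(), items.end());
    vector<string> names;
    for (const auto& item : sorted) names.push_back(item.first);
    return names;
}
```

If the names Received, Received, From, and To are inserted in that order, the vector version returns them unchanged, whereas the multimap version returns From, Received, Received, To, so the position of each header in the email is lost.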

There are accessors for the body and the headers as a single string. In addition, there are accessors that return iterators from the headers data member, so that external code can iterate through the headers data member (a complete implementation of this class would have accessors that allow you to search for a header by name, but for the purpose of this example, only iteration is permitted).
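Because the class exposes begin and end methods, external code can use a range-based for loop on the object. The following is a minimal sketch of this idea; header_list is a hypothetical stand-in for the email class that stores plain strings, and count_headers is an illustrative function that is not part of the project:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using namespace std;

// Hypothetical stand-in for the email class: only the iterator accessors are shown.
class header_list
{
    using iter = vector<pair<string, string>>::iterator;
    vector<pair<string, string>> headers;
public:
    void add(const string& name, const string& value)
    {
        headers.push_back(make_pair(name, value));
    }
    iter begin() { return headers.begin(); }
    iter end() { return headers.end(); }
};

// Counts the headers that have the given name, using only begin() and end().
size_t count_headers(header_list& list, const string& name)
{
    size_t count = 0;
    for (const auto& h : list) // the range-based for calls list.begin() and list.end()
    {
        if (h.first == name) ++count;
    }
    return count;
}
```

The list is passed by non-const reference because, as in the email class, the begin and end methods in this sketch are not const.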

The class supports two-stage construction, where most of the work is carried out by passing an input stream to the parse method. The parse method reads the email line by line, collects the header lines in a vector object, and calls a private function, process_headers, to interpret these lines as headers.

The get_headers method is simple: it just iterates through the headers and puts one header on each line in the format name: value. Replace the declaration of get_headers in the class with the inline function:

    string get_headers() const
    {
        string all = "";
        // use a reference to avoid copying each name/body pair
        for (const auto& a : headers)
        {
            all += a.first + ": " + a.second.get_body();
            all += "\n";
        }
        return all;
    }

Next, you need to read in the email from a file and extract the body and the headers. The main function already has the code to open a file, so create an email object and pass the ifstream object for the file to the parse method. Now print out the parsed email using the accessors. Add the following to the end of the main function:

        email eml;
        eml.parse(stm);

        cout << eml.get_headers();
        cout << "\n";
        cout << eml.get_body() << "\n";

        return 0;
    }

After the email class declaration, add the definition for the parse function:

    void email::parse(istream& fin)
    {
        string line;
        vector<string> headerLines;
        while (getline(fin, line))
        {
            if (line.empty())
            {
                // end of headers
                break;
            }
            headerLines.push_back(line);
        }

        process_headers(headerLines);

        while (getline(fin, line))
        {
            // a blank line marks the end of a paragraph
            if (line.empty()) body.append("\n");
            else body.append(line);
        }
    }

This method is simple: it repeatedly calls the getline function in the <string> library to read a string until a newline is detected. In the first half of the method, the strings are stored in a vector and then passed to the process_headers method. If the string read in is empty, it means a blank line has been read--in which case, all of the headers have been read. In the second half of the method, the body of the email is read in.

The getline function will have stripped the newlines used to format the email to 78-character line lengths, so the loop merely appends the lines as one string. If a blank line is read in, it indicates the end of a paragraph, and so a newline is added to the body string.
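Since the parse method takes an istream reference, the same line-reading logic can be tried out on a string held in memory. The following sketch illustrates the first half of the method in isolation; read_header_lines and read_header_lines_from_text are hypothetical helpers that are not part of the email class:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: reads lines up to, but not including, the first blank line.
// The blank line itself is consumed, so the stream is left at the start of the body.
vector<string> read_header_lines(istream& in)
{
    vector<string> lines;
    string line;
    while (getline(in, line)) // getline removes the newline from each line
    {
        if (line.empty()) break; // a blank line separates the headers from the body
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical convenience wrapper: treats a string as the input stream.
vector<string> read_header_lines_from_text(const string& text)
{
    istringstream in(text);
    return read_header_lines(in);
}
```

For the text "Subject: Hi\nTo: me\n\nbody text\n", the helper returns the two lines Subject: Hi and To: me, and the next call to getline on the same stream reads body text.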

After the parse method, add the process_headers method:

    void email::process_headers(const vector<string>& lines)
    {
        string header = "";
        string body = "";
        for (const string& line : lines)
        {
            // a line that starts with whitespace continues the previous header
            if (isspace(line[0])) body.append(line);
            else
            {
                // a new header starts, so store the one collected so far
                if (!header.empty())
                {
                    headers.push_back(make_pair(header, body));
                    header.clear();
                    body.clear();
                }

                // split the line into the name and the body at the colon
                size_t pos = line.find(':');
                header = line.substr(0, pos);
                pos++;
                while (isspace(line[pos])) pos++;
                body = line.substr(pos);
            }
        }

        // store the last header
        if (!header.empty())
        {
            headers.push_back(make_pair(header, body));
        }
    }

This code iterates through each line in the collection, and when it has a complete header it splits the string into the name/body pair on the colon. Within the loop, the first line tests to see if the first character is whitespace; if not, then the header variable is checked to see if it has a value; and if so, the name/body pair is stored in the class headers data member before clearing the header and body variables.

The following code acts upon the line read from the collection. This code assumes that this is the start of the header line, so the string is searched for the colon and split at this point. The name of the header is before the colon and the body of the header (trimmed of leading whitespace) is after the colon. Since we do not know if the header body will be folded onto the next line, the name/body pair is not stored; instead, the for loop is allowed to repeat another time so that the first character of the next line can be tested to see if it is whitespace, and if so, it is appended to the body. This action of holding the name/body pair until the next iteration of the for loop means that the last header will not be stored in the loop, and hence there is a test at the end of the method to see if the header variable is empty, and if not, the name/body pair is stored.
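The split on the colon can be looked at on its own. The following is a sketch only; split_header is a hypothetical free function that is not part of the email class, and unlike process_headers it adds a bounds check and treats a line without a colon as a name with an empty body, which is an assumption made for this illustration:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

using namespace std;

// Hypothetical helper: splits a header line of the form "name: body" at the first colon.
pair<string, string> split_header(const string& line)
{
    size_t pos = line.find(':');
    // assumption for this sketch: no colon means the whole line is the name
    if (pos == string::npos) return make_pair(line, string());

    string name = line.substr(0, pos);
    ++pos; // step over the colon
    // skip the whitespace between the colon and the body
    while (pos < line.size() && isspace(static_cast<unsigned char>(line[pos]))) ++pos;
    return make_pair(name, line.substr(pos));
}
```

With this helper, the line Subject: Saying Hello gives the name Subject and the body Saying Hello. The cast to unsigned char is there because passing a negative char value to isspace is undefined behavior.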

You can now compile the code (remember to use the /EHsc switch) to test that there are no typos. To test the code, you should save an email from your email client as a file and then run the email_parser application with the path to this file. The following is one of the example email messages given in the Internet Message Format RFC 5322, which you can put into a text file to test the code:

    Received: from x.y.test
       by example.net
       via TCP
       with ESMTP
       id ABC12345
       for <[email protected]>; 21 Nov 1997 10:05:43 -0600
    Received: from node.example by x.y.test; 21 Nov 1997 10:01:22 -0600
    From: John Doe <[email protected]>
    To: Mary Smith <[email protected]>
    Subject: Saying Hello
    Date: Fri, 21 Nov 1997 09:55:06 -0600
    Message-ID: <[email protected]>

    This is a message just to say hello.
    So, "Hello".

You can test the application with an email message to show that the parsing has taken into account header formatting, including folding whitespace.
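To see what folding whitespace means for a header such as the first Received header, the folded lines can be joined without splitting them into names and bodies. The following is a rough sketch; unfold is a hypothetical function that is not part of the project, and like process_headers it keeps the leading whitespace of each continuation line:

```cpp
#include <cctype>
#include <string>
#include <vector>

using namespace std;

// Hypothetical helper: joins each line that starts with whitespace onto the line before it.
vector<string> unfold(const vector<string>& lines)
{
    vector<string> logical;
    for (const string& line : lines)
    {
        bool continuation =
            !line.empty() && isspace(static_cast<unsigned char>(line[0]));
        // a continuation line is appended to the previous logical line, whitespace included
        if (continuation && !logical.empty()) logical.back() += line;
        else logical.push_back(line);
    }
    return logical;
}
```

For example, the two lines Received: from x.y.test and a following line of three spaces and by example.net are returned as the single line Received: from x.y.test   by example.net.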
