How it works...

Verifying the format of e-mail addresses may look like a trivial problem, but in practice it is hard to find a simple regular expression that covers all the possible valid e-mail formats. In this recipe, we will not try to find that ultimate regular expression, but rather apply one that is good enough for most cases. The regular expression we will use for this purpose is this:

    ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$

The following table explains the structure of the regular expression:

Part              Description
^                 Start of the string
[A-Z0-9._%+-]+    At least one character in the range A-Z or 0-9, or one of ., _, %, +, or -; this represents the local part of the e-mail address
@                 The @ character
[A-Z0-9.-]+       At least one character in the range A-Z or 0-9, or one of . or -; this represents the hostname of the domain part
\.                A dot that separates the domain hostname and label
[A-Z]{2,}         The DNS label of the domain: at least two letters (a DNS label can have at most 63 characters, but the pattern does not enforce this upper bound)
$                 End of the string
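
The dot that separates the hostname from the last label has to be written as \. in the pattern, because an unescaped . matches any character. The following is a minimal sketch of the difference; the helper matches() and the two pattern variables are hypothetical names introduced only for this illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` matches `pattern` in its entirety.
bool matches(std::string const & pattern, std::string const & text)
{
    return std::regex_match(text, std::regex{pattern});
}

// Dot escaped: a literal '.' is required before the last label.
std::string const escaped =
    R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)";

// Dot not escaped: '.' stands for any single character.
std::string const unescaped =
    R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,}$)";
```

With these definitions, matches(escaped, "JOHN@DOMAINCOM") is false, while matches(unescaped, "JOHN@DOMAINCOM") is true, because the bare dot is free to consume one of the letters of the domain.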

 

Bear in mind that, in practice, a domain name is composed of a hostname followed by a dot-separated list of DNS labels. Examples include localhost, gmail.com, or yahoo.co.uk. The regular expression we are using does not match domains without DNS labels, such as localhost (even though an address such as root@localhost is a valid e-mail). The domain name can also be an IP address specified in brackets, such as [192.168.100.11] (as in john.doe@[192.168.100.11]). E-mail addresses containing such domains will not match the regular expression defined above. Even though these rather rare formats will not be matched, the regular expression covers most e-mail formats.
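
The limitations described above can be checked directly. This is only a sketch: matches_recipe_pattern() is a hypothetical helper name, and the addresses are written in uppercase because the pattern, as given, accepts only uppercase letters:

```cpp
#include <regex>
#include <string>

// Hypothetical helper that applies the recipe's pattern to a whole string.
bool matches_recipe_pattern(std::string const & email)
{
    auto rx = std::regex{R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"};
    return std::regex_match(email, rx);
}
```

Under these assumptions, JOHN.DOE@GMAIL.COM and JOHN.DOE@YAHOO.CO.UK match, whereas ROOT@LOCALHOST (no dot-prefixed label) and JOHN.DOE@[192.168.100.11] (brackets are not in the allowed character set) do not.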

The regular expression in the example in this chapter is provided for didactic purposes only and is not intended to be used as is in production code. As explained earlier, this sample does not cover all possible e-mail formats.

We began by including the necessary headers, <regex> for regular expressions and <string> for strings. The is_valid_email_format() function shown in the following snippet (that basically contains the samples from the How to do it... section) takes a string representing an e-mail address and returns a Boolean indicating whether the e-mail has a valid format or not. We first construct an std::regex object to encapsulate the regular expression indicated with the raw string literal. Using raw string literals is helpful because it avoids escaping the backslashes that regular expressions also use for escape characters. The function then calls std::regex_match(), passing the input text and the regular expression:

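    bool is_valid_email_format(std::string const & email)
    {
        // Raw string literal: the backslash in \. needs no extra escaping
        auto pattern {R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s};

        auto rx = std::regex{pattern};

        // true only if the entire string matches the pattern
        return std::regex_match(email, rx);
    }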

The std::regex_match() function tries to match the regular expression against the entire string. If successful, it returns true; otherwise, it returns false:

    auto ltest = [](std::string const & email)
    {
        std::cout << std::setw(30) << std::left
                  << email << " : "
                  << (is_valid_email_format(email) ?
                      "valid format" : "invalid format")
                  << std::endl;
    };

    ltest("JOHN.DOE@DOMAIN.COM"s);         // valid format
    ltest("JOHNDOE@DOMAIL.CO.UK"s);        // valid format
    ltest("JOHNDOE@DOMAIL.INFO"s);         // valid format
    ltest("J.O.H.N_D.O.E@DOMAIN.INFO"s);   // valid format
    ltest("ROOT@LOCALHOST"s);              // invalid format
    ltest("john.doe@domain.com"s);         // invalid format

In this simple test, the only e-mails that do not match the regular expression are ROOT@LOCALHOST and john.doe@domain.com. The first contains a domain name without a dot-prefixed DNS label, a case that the regular expression does not cover. The second contains only lowercase letters, whereas the regular expression accepts only uppercase letters, A to Z, in both the local part and the domain name.
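
To illustrate what matching against the entire string means, the following sketch contrasts std::regex_match() with std::regex_search(), which succeeds if any part of the input matches. The unanchored pattern and the two function names are assumptions made for this illustration and are not part of the recipe:

```cpp
#include <regex>
#include <string>

// Deliberately unanchored pattern (hypothetical, for illustration only).
std::regex const word_at_word{R"([A-Z]+@[A-Z]+)"};

// True only when the whole of `text` matches the pattern.
bool whole_string_matches(std::string const & text)
{
    return std::regex_match(text, word_at_word);
}

// True when some substring of `text` matches the pattern.
bool some_substring_matches(std::string const & text)
{
    return std::regex_search(text, word_at_word);
}
```

For the input "TO: ROOT@LOCALHOST", whole_string_matches() returns false because of the leading "TO: ", while some_substring_matches() returns true.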

Instead of complicating the regular expression with additional valid characters (such as [A-Za-z0-9._%+-]), we can specify that the match can ignore the case. This can be done with an additional parameter to the constructor of the std::basic_regex class. The available constants for this purpose are defined in the std::regex_constants namespace. The following slight change to is_valid_email_format() will make it ignore the case and allow e-mails with both lowercase and uppercase letters to correctly match the regular expression:

    bool is_valid_email_format(std::string const & email)
    {
        auto rx = std::regex{
            R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s,
            std::regex_constants::icase};   // ignore character case

        return std::regex_match(email, rx);
    }
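
As a rough comparison of the two options mentioned above, the sketch below places them side by side. The function names is_valid_explicit() and is_valid_icase() are hypothetical. Both functions also keep the regex object in a function-local static so that it is built only once, which is a design choice of this sketch rather than something the recipe prescribes:

```cpp
#include <regex>
#include <string>

// Option 1: spell out both cases in every character class.
bool is_valid_explicit(std::string const & email)
{
    static std::regex const rx{
        R"(^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$)"};
    return std::regex_match(email, rx);
}

// Option 2: keep the uppercase-only pattern and pass the icase flag.
bool is_valid_icase(std::string const & email)
{
    static std::regex const rx{
        R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)",
        std::regex_constants::icase};
    return std::regex_match(email, rx);
}
```

For addresses such as john.doe@domain.com or John.Doe@Domain.Com, both functions are expected to return true; the second keeps the pattern shorter and easier to read.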

This is_valid_email_format() function is pretty simple, and if the regular expression were provided as a parameter along with the text to match, it could be used for matching anything. However, it would be nice to be able to handle not only multi-byte strings (std::string) but also wide strings (std::wstring) with a single function. This can be achieved by creating a function template where the character type is provided as a template parameter:

    template <typename CharT>
    using tstring = std::basic_string<CharT, std::char_traits<CharT>,
                                      std::allocator<CharT>>;

    template <typename CharT>
    bool is_valid_format(tstring<CharT> const & pattern,
                         tstring<CharT> const & text)
    {
        auto rx = std::basic_regex<CharT>{
            pattern, std::regex_constants::icase };

        return std::regex_match(text, rx);
    }

We start by creating an alias template for std::basic_string in order to simplify its use. The new is_valid_format() function is a function template very similar to our implementation of is_valid_email_format(). However, we now use std::basic_regex<CharT> instead of the typedef std::regex, which is std::basic_regex<char>, and the pattern is provided as the first argument. The function template can be reused for implementing other validations, such as whether a license plate has a particular format. Here, we use it to implement a new function for wide strings called is_valid_email_format_w():

    bool is_valid_email_format_w(std::wstring const & text)
    {
        return is_valid_format(
            LR"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s,
            text);
    }

    auto ltest2 = [](auto const & email)
    {
        std::wcout << std::setw(30) << std::left
                   << email << L" : "
                   << (is_valid_email_format_w(email) ? L"valid" : L"invalid")
                   << std::endl;
    };

    ltest2(L"JOHN.DOE@DOMAIN.COM"s);        // valid
    ltest2(L"JOHNDOE@DOMAIL.CO.UK"s);       // valid
    ltest2(L"JOHNDOE@DOMAIL.INFO"s);        // valid
    ltest2(L"J.O.H.N_D.O.E@DOMAIN.INFO"s);  // valid
    ltest2(L"ROOT@LOCALHOST"s);             // invalid
    ltest2(L"john.doe@domain.com"s);        // valid

Of all the examples shown above, the only one that does not match is ROOT@LOCALHOST, as expected.
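
As an illustration of reusing the function template for another validation, here is a sketch of a license plate check for both narrow and wide strings. The plate format (three letters, a dash, two letters, a space, and three or four digits) and the two function names are assumptions made up for this example; the template is repeated so that the snippet is self-contained:

```cpp
#include <regex>
#include <string>

using namespace std::string_literals;

template <typename CharT>
using tstring = std::basic_string<CharT, std::char_traits<CharT>,
                                  std::allocator<CharT>>;

template <typename CharT>
bool is_valid_format(tstring<CharT> const & pattern,
                     tstring<CharT> const & text)
{
    auto rx = std::basic_regex<CharT>{
        pattern, std::regex_constants::icase };
    return std::regex_match(text, rx);
}

// Hypothetical plate format, for example "ABC-DE 123" or "ABC-DE 1234".
bool is_valid_license_plate(std::string const & text)
{
    return is_valid_format(R"(^[A-Z]{3}-[A-Z]{2} \d{3,4}$)"s, text);
}

// The same check for wide strings; CharT is deduced as wchar_t.
bool is_valid_license_plate_w(std::wstring const & text)
{
    return is_valid_format(LR"(^[A-Z]{3}-[A-Z]{2} \d{3,4}$)"s, text);
}
```

Because the template always passes the icase flag, a lowercase plate such as abc-de 123 is accepted as well.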

The std::regex_match() function has, in fact, several overloads, and some of them have a parameter that is a reference to an std::match_results object to store the result of the match. If there is no match, then std::match_results is empty and its size is 0. Otherwise, if there is a match, the std::match_results object is not empty and its size is 1 plus the number of capture groups (marked subexpressions) in the regular expression.
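
The size rule can be observed with a small sketch. The pattern below, with two capture groups, and the function names match_size() and second_group() are hypothetical and serve only this illustration:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical pattern with two capture groups.
std::regex const two_groups{R"(^(\w+)@(\w+)$)"};

// Size of the match_results object after attempting a full match.
std::size_t match_size(std::string const & text)
{
    std::smatch result;
    std::regex_match(text, result, two_groups);
    return result.size();
}

// Text captured by the second group, or an empty string if no match.
std::string second_group(std::string const & text)
{
    std::smatch result;
    if (std::regex_match(text, result, two_groups))
        return result[2].str();
    return {};
}
```

Here, match_size("root@localhost") yields 3 (the whole match plus two groups), and match_size("no at sign") yields 0.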

The following version of the function uses the mentioned overloads and returns the matched subexpressions in an std::smatch object. Note that the regular expression is changed, as three capture groups are defined: one for the local part, one for the hostname part of the domain, and one for the DNS label. If the match is successful, then the std::smatch object will contain four submatch objects: the first for the entire string, the second for the first capture group (the local part), the third for the second capture group (the hostname), and the fourth for the third and last capture group (the DNS label). The result is returned in a tuple, where the first item indicates success or failure:

    std::tuple<bool, std::string, std::string, std::string>
    is_valid_email_format_with_result(std::string const & email)
    {
        auto rx = std::regex{
            R"(^([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,})$)"s,
            std::regex_constants::icase };
        auto result = std::smatch{};
        auto success = std::regex_match(email, result, rx);

        // result[0] is the entire match; result[1..3] are the capture groups
        return std::make_tuple(
            success,
            success ? result[1].str() : ""s,
            success ? result[2].str() : ""s,
            success ? result[3].str() : ""s);
    }

Following the preceding code, we use C++17 structured bindings to unpack the content of the tuple into named variables:

    auto ltest3 = [](std::string const & email)
    {
        auto [valid, localpart, hostname, dnslabel] =
            is_valid_email_format_with_result(email);

        std::cout << std::setw(30) << std::left
                  << email << " : "
                  << std::setw(10) << (valid ? "valid" : "invalid")
                  << "local=" << localpart
                  << ";domain=" << hostname
                  << ";dns=" << dnslabel
                  << std::endl;
    };

    ltest3("JOHN.DOE@DOMAIN.COM"s);
    ltest3("JOHNDOE@DOMAIL.CO.UK"s);
    ltest3("JOHNDOE@DOMAIL.INFO"s);
    ltest3("J.O.H.N_D.O.E@DOMAIN.INFO"s);
    ltest3("ROOT@LOCALHOST"s);
    ltest3("john.doe@domain.com"s);

The output of the program will be as follows:

    JOHN.DOE@DOMAIN.COM            : valid     local=JOHN.DOE;domain=DOMAIN;dns=COM
    JOHNDOE@DOMAIL.CO.UK           : valid     local=JOHNDOE;domain=DOMAIL.CO;dns=UK
    JOHNDOE@DOMAIL.INFO            : valid     local=JOHNDOE;domain=DOMAIL;dns=INFO
    J.O.H.N_D.O.E@DOMAIN.INFO      : valid     local=J.O.H.N_D.O.E;domain=DOMAIN;dns=INFO
    ROOT@LOCALHOST                 : invalid   local=;domain=;dns=
    john.doe@domain.com            : valid     local=john.doe;domain=domain;dns=com