6.13. Roman Numerals

Problem

You want to match Roman numerals such as IV, XIII, and MVIII.

Solution

Roman numerals without validation:

^[MDCLXVI]+$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Modern Roman numerals, strict:

^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Modern Roman numerals, flexible:

^(?=[MDCLXVI])M*(C[MD]|D?C*)(X[CL]|L?X*)(I[XV]|V?I*)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Simple Roman numerals:

^(?=[MDCLXVI])M*D?C{0,4}L?X{0,4}V?I{0,4}$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Roman numerals are written using the letters M, D, C, L, X, V, and I, representing the values 1,000, 500, 100, 50, 10, 5, and 1, respectively. The first regex matches any string composed of these letters, without checking whether the letters appear in the order or quantity necessary to form a proper Roman numeral.
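As a rough illustration of what "without validation" means in practice, here is a minimal Python sketch. Python is one of the flavors listed for this regex; the function name looks_roman is made up for this example. It shows that the first regex accepts any run of the seven letters, including sequences that are not proper numerals.

```python
import re

# First regex from this recipe: any run of Roman numeral letters.
ROMAN_LETTERS = re.compile(r"^[MDCLXVI]+$", re.IGNORECASE)

def looks_roman(text):
    """Return True if text consists only of Roman numeral letters.

    The helper name is hypothetical; it is not part of the recipe.
    """
    return ROMAN_LETTERS.match(text) is not None

for candidate in ["XIII", "mviii", "IIMM", "VX", "XIV2"]:
    print(candidate, looks_roman(candidate))
```

IIMM and VX pass this check even though neither is a well-formed numeral, which is the gap the stricter regexes in this recipe close.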

In modern times (meaning during the past few hundred years), Roman numerals have generally been written following a strict set of rules. These rules yield exactly one Roman numeral per number. For example, 4 is always written as IV, never as IIII. The second regex in the solution matches only Roman numerals that follow these modern rules.

Each nonzero digit of the decimal number is written out separately in the Roman numeral. 1999 is written as MCMXCIX, where M is 1000, CM is 900, XC is 90, and IX is 9. We don’t write MIM or IMM.

The thousands are easy: one M per thousand, easily matched with M*.

There are 10 variations for the hundreds, which we match using two alternatives. C[MD] matches CM and CD, which represent 900 and 400. D?C{0,3} matches DCCC, DCC, DC, D, CCC, CC, C, and the empty string, representing 800, 700, 600, 500, 300, 200, 100, and nothing. This gives us all of the 10 digits for the hundreds.
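One way to convince yourself that the two alternatives cover exactly ten forms is to enumerate them. The following Python sketch is an illustration only (the helper hundreds_forms is a made-up name). It tries every string built from C, D, and M up to a given length and keeps those that the hundreds sub-pattern matches in full.

```python
import re
from itertools import product

# Hundreds part of the strict regex, taken on its own.
HUNDREDS = re.compile(r"C[MD]|D?C{0,3}")

def hundreds_forms(max_len=4):
    """List every string over C, D, M (up to max_len letters) that the
    hundreds sub-pattern matches in full. Hypothetical helper."""
    forms = []
    for length in range(max_len + 1):
        for letters in product("CDM", repeat=length):
            text = "".join(letters)
            if HUNDREDS.fullmatch(text):
                forms.append(text)
    return forms

print(hundreds_forms())
```

The list contains the empty string plus C, CC, CCC, CD, D, DC, DCC, DCCC, and CM, which is one form for each hundreds digit from 0 to 9.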

We match the tens with X[CL]|L?X{0,3} and the units with I[XV]|V?I{0,3}. These use the same syntax, but with different letters.

All four parts of the regex allow everything to be optional, because each of the digits could be zero. The Romans did not have a symbol, or even a word, to represent zero. Thus, zero is unwritten in Roman numerals. While each part of the regex should indeed be optional, they’re not all optional at the same time. We have to make sure our regex does not allow zero-length matches. To do this, we put the lookahead (?=[MDCLXVI]) at the start of the regex. This lookahead, as Recipe 2.16 explains, makes sure that there’s at least one letter in the regex match. The lookahead does not consume the letter that it matches, so that letter can be matched again by the remainder of the regex.
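The effect of the lookahead can be checked with a short Python sketch. The names STRICT, NO_LOOKAHEAD, and is_strict_roman are chosen for this illustration, and the second pattern is the strict regex with the lookahead removed, for comparison only.

```python
import re

# Strict regex from this recipe.
STRICT = re.compile(
    r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$",
    re.IGNORECASE,
)

# Same pattern without the lookahead, to show what the lookahead prevents.
NO_LOOKAHEAD = re.compile(
    r"^M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$",
    re.IGNORECASE,
)

def is_strict_roman(text):
    """Return True if text is a numeral under the modern strict rules."""
    return STRICT.match(text) is not None

for candidate in ["MCMXCIX", "MIM", "IIII", ""]:
    print(repr(candidate), is_strict_roman(candidate))

# Without the lookahead, the empty string is accepted.
print(NO_LOOKAHEAD.match("") is not None)
```

Only MCMXCIX passes the strict check. The final line shows that the version without the lookahead accepts the empty string, because every remaining part of the pattern is optional.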

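The third regex is a bit more flexible. Because it uses * where the strict regex uses {0,3}, it places no limit on how often C, X, or I may repeat. It therefore accepts numerals such as IIII, while still accepting IV.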

The fourth regex only allows numerals written without using subtraction, and therefore all the letters must be in descending order. 4 must be written as IIII rather than IV. The Romans themselves usually wrote numbers this way.
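To compare the strict, flexible, and simple regexes side by side, a small Python sketch such as the following can help. The dictionary keys and the helper accepted_by are names invented for this example; the patterns themselves are copied from the solution.

```python
import re

# The three validating regexes from this recipe, keyed by informal labels.
PATTERNS = {
    "strict": r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$",
    "flexible": r"^(?=[MDCLXVI])M*(C[MD]|D?C*)(X[CL]|L?X*)(I[XV]|V?I*)$",
    "simple": r"^(?=[MDCLXVI])M*D?C{0,4}L?X{0,4}V?I{0,4}$",
}
COMPILED = {name: re.compile(p, re.IGNORECASE) for name, p in PATTERNS.items()}

def accepted_by(text):
    """Return the labels of the regexes that accept text (hypothetical helper)."""
    return [name for name, rx in COMPILED.items() if rx.match(text)]

for candidate in ["XIII", "IV", "IIII", "IIIII"]:
    print(candidate, accepted_by(candidate))
```

XIII is accepted by all three. IV is rejected only by the simple regex, IIII only by the strict one, and IIIII is accepted by the flexible regex alone, since it is the only one that does not cap the number of repeated letters.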

Tip

All regular expressions are wrapped between anchors (Recipe 2.5) to make sure we check whether the whole input is a Roman numeral, as opposed to a Roman numeral occurring in a larger string. You can replace ^ and $ with \b word boundaries if you want to find Roman numerals in a larger body of text.
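The following Python sketch shows one way the word-boundary variant might be used; the helper name find_roman_numerals and the sample sentence are made up for this example. One caveat matters here: once the anchors are replaced with \b, the lookahead alone no longer rules out zero-length matches. At the start of an ordinary word such as "chapter", the word boundary and the lookahead both succeed, every optional part matches nothing, and the closing \b is satisfied at the same position. The sketch simply discards such empty matches.

```python
import re

# Strict regex from this recipe with ^ and $ swapped for \b.
STRICT_WORD = re.compile(
    r"\b(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})\b",
    re.IGNORECASE,
)

def find_roman_numerals(text):
    """Return the non-empty strict Roman numerals found in text."""
    # Skip zero-length matches, which can occur at the start of ordinary
    # words that begin with one of the seven letters.
    return [m.group(0) for m in STRICT_WORD.finditer(text) if m.group(0)]

sample = "Louis XVI was executed in MDCCXCIII, as chapter XIV explains."
print(find_roman_numerals(sample))
```

With case insensitivity turned on, ordinary words that happen to be valid numerals are also reported: the pronoun "I" and the word "mix" (1009) both match. Dropping the case-insensitive option or filtering the results afterward may be worthwhile, depending on the text.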

Convert Roman Numerals to Decimal

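This Perl function uses the “strict” regular expression from this recipe to check whether the input is a valid Roman numeral. If it is, it uses the regex [MDLV]|C[MD]?|X[CL]?|I[XV]? to iterate over the numeral one token at a time, where a token is either a single letter or a subtractive pair such as CM or IV, adding up the token values. If the input is not a valid Roman numeral, the function returns 0: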

sub roman2decimal {
    my $roman = shift;
    if ($roman =~
        m/^(?=[MDCLXVI])
          (M*)               # 1000
          (C[MD]|D?C{0,3})   # 100
          (X[CL]|L?X{0,3})   # 10
          (I[XV]|V?I{0,3})   # 1
          $/ix)
    {
        # Roman numeral found
        my %r2d = ('I' =>    1, 'IV' =>   4, 'V' =>   5, 'IX' =>   9,
                   'X' =>   10, 'XL' =>  40, 'L' =>  50, 'XC' =>  90,
                   'C' =>  100, 'CD' => 400, 'D' => 500, 'CM' => 900,
                   'M' => 1000);
        my $decimal = 0;
        while ($roman =~ m/[MDLV]|C[MD]?|X[CL]?|I[XV]?/ig) {
            $decimal += $r2d{uc($&)};
        }
        return $decimal;
    } else {
        # Not a Roman numeral
        return 0;
    }
}
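
For readers working in Python, the same approach might be sketched as follows. This is a port written for illustration, not code from the recipe: the names STRICT, TOKEN, VALUES, and roman_to_decimal are assumptions, and like the Perl function it returns 0 for input that is not a strict Roman numeral.

```python
import re

# Strict validation regex and token regex, both taken from this recipe.
STRICT = re.compile(
    r"^(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$",
    re.IGNORECASE,
)
TOKEN = re.compile(r"[MDLV]|C[MD]?|X[CL]?|I[XV]?", re.IGNORECASE)

# Value of each single letter and each subtractive pair.
VALUES = {
    "I": 1, "IV": 4, "V": 5, "IX": 9,
    "X": 10, "XL": 40, "L": 50, "XC": 90,
    "C": 100, "CD": 400, "D": 500, "CM": 900,
    "M": 1000,
}

def roman_to_decimal(roman):
    """Convert a strict Roman numeral to an integer, or return 0 if invalid."""
    if not STRICT.match(roman):
        return 0  # Not a Roman numeral
    # Add up the value of each token (single letter or subtractive pair).
    return sum(VALUES[m.group(0).upper()] for m in TOKEN.finditer(roman))

print(roman_to_decimal("MCMXCIX"))
```

Because the Romans had no numeral for zero, 0 cannot be the value of any valid input, so it is safe as a "not a numeral" result here. Returning None or raising an exception would be reasonable alternatives.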

See Also

All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.

The source code snippet in this recipe uses the technique for iterating over regex matches discussed in Recipe 3.11.
