Atom and Atom modifiers

In this section, we expand on our knowledge of regular expressions by discussing the atom. An atom is a single expression, such as a character or a dot, an expression that has been grouped using parentheses, or, as we will see in a later section, a character class. We will also introduce atom modifiers. The idea is that you can take any atom and then modify it using a modifier. Now, let's go back to our RegexLearning notebook and continue from where we left off in the last section.

Imagine that you have a string representing a date in year-month-day format, with the parts separated by dashes, and you wish to verify that this date is in the 1900s or the 2000s. So, let's say that we have a date of 1969-07-20, and we wish to check its century:

We crafted a regular expression based on an atom of 19|20, followed by a series of dots (each dot matches any single character) with dashes separating them for the remainder of the string, and then checked the Bool value. In this case, we got True, as 1969 belongs to the 1900s.
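The notebook's own code is not reproduced here, so the following is only a minimal sketch in Python using the standard `re` module. The pattern string and the helper name `matches` are assumptions reconstructed from the description above, and `re.search` stands in for an unanchored match that returns a Boolean.

```python
import re

# Assumed pattern, reconstructed from the description above: the 19|20 atom
# in parentheses, then dots (any character) separated by dashes.
DATE_PATTERN = r"(19|20)..-..-.."


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(DATE_PATTERN, "1969-07-20"))  # True
```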

Let's test it for a different date, say, 18AB-40-99. By a stretch of the imagination, this is a date that could potentially enter a system. Let's see whether it is valid under the same regular expression:

This came out to be False. As you can see, this regular expression only validates the first two digits with the 19|20 atom; the dots accept any character. This isn't a perfect regular expression because it doesn't validate the entire date, but we'll build on this particular expression as we progress. For now, let's hold on to this example; it will get better. This idea, however, lends itself to the concept of atom modifiers. With atom modifiers, we can specify how many times an atom should be repeated inside an expression.
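To make the limitation concrete, here is a hedged Python sketch (standard `re` module; the pattern string and the `matches` helper are assumptions, not the notebook's code). The second input, 19AB-40-99, is a hypothetical string added for illustration: it is just as nonsensical as the first, but it passes because only the leading two digits are checked.

```python
import re

# Assumed pattern: only the leading 19|20 atom constrains the input;
# every dot accepts any character at all.
DATE_PATTERN = r"(19|20)..-..-.."


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(DATE_PATTERN, "18AB-40-99"))  # False: does not start with 19 or 20
print(matches(DATE_PATTERN, "19AB-40-99"))  # True: the dots accept "AB", "40", "99"
```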

Let's change the next block in our notebook to Markdown. We can modify any atom by placing a symbol after the atom. For example, how do you spell the word "color"? Let's see the following screenshot:

As you can see, in American English we spell it as color, whereas other parts of the world spell it colour. However, we know for a fact that coluor is the wrong spelling.

Using the ? modifier, we can specify that the u is optional, and that either spelling, with the u or without it, is legal as long as that u appears after the o. So, let's define our regular expression:

We have the American spelling, color, on the left and the one with u on the right. As you can see, there is a question mark that appears after the atom u, meaning that the u is now optional. Hence, the result comes out to be True. Let's take the same expression and modify it a bit:

So, we have replaced color with colour on the left. This expression still comes out to True. Now, let's misspell the word and check the Bool value:

So, instead of ou, we have uo now, and it results in False because the u didn't appear in the correct place.
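The three spelling checks can be sketched in Python as follows. This is an illustration under assumptions: the pattern `colou?r` is inferred from the description of a question mark placed after the atom u, and the `matches` helper is a hypothetical name.

```python
import re

# Assumed pattern: the ? after u makes that single atom optional.
COLOR_PATTERN = r"colou?r"


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(COLOR_PATTERN, "color"))   # True: the u is absent
print(matches(COLOR_PATTERN, "colour"))  # True: the u follows the o
print(matches(COLOR_PATTERN, "coluor"))  # False: the u is in the wrong place
```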

Now, let's look at our next two modifiers. Imagine that you need to match a string containing any number of 1s followed by any number of 2s. The * character means that an atom may appear zero or more times. So, let's see an example:

So, will 1122 match a pattern of any number of 1s followed by any number of 2s? The result is True. Well, let's say we just have an empty string. Will that also meet the pattern of any number of 1s followed by any number of 2s?

Yes, it does, because any number could also mean 0.

What if we leave the 2s out, for example, 111? See the following:

That too results in True.

What about something that has nothing to do with 1s or 2s at all?
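The checks in this walkthrough can be sketched in Python as below. The pattern `1*2*` follows the text; the `matches` helper and the unrelated sample string "hello" are assumptions chosen for illustration, and `re.search` is used because the text describes a match that can succeed anywhere in the string.

```python
import re

# Assumed pattern: zero or more 1s followed by zero or more 2s.
STAR_PATTERN = r"1*2*"


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# "hello" is a hypothetical input with no 1s or 2s in it.
for sample in ["1122", "", "111", "hello"]:
    print(repr(sample), matches(STAR_PATTERN, sample))
```

Because both atoms may match zero times, the pattern can match the empty string, and an empty match is available in every input.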

Yes, all of these are True. In fact, you'll notice all of these expressions that we've gone through are True. It's not possible for us to write a string that fails on this particular regular expression. Now, let's change this up, and say that the 1s and 2s in this string have to appear at least once, but they can still appear any number of times. Well, we just change those * characters to + signs. Let's see a few examples:

The first case came out True because we have at least one of each character. But if we leave out one character or the other, as you can see in the second input box, it returns False. At a minimum, we must have one of each in order for the expression to be True, as seen in the third input box.
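A hedged Python sketch of the + behaviour follows. The pattern `1+2+` comes from the text; the specific inputs other than 1122 and the `matches` helper are assumptions, since the original input boxes are not shown.

```python
import re

# Assumed pattern: one or more 1s followed by one or more 2s.
PLUS_PATTERN = r"1+2+"


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(PLUS_PATTERN, "1122"))  # True: at least one of each
print(matches(PLUS_PATTERN, "111"))   # False: no 2 present
print(matches(PLUS_PATTERN, "22"))    # False: no 1 present
print(matches(PLUS_PATTERN, "12"))    # True: the minimum of one each
```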

Finally, we have something known as the custom modifier, which allows you to specify your own minimum and maximum number of repetitions. So, now for a convoluted example. Let's say that we want the number 1, followed by the number 2 appearing between 3 and 5 times, followed by the number 3. If your string is 123, the following results:

In order to create a custom modifier, we pass in low and high values in curly braces after the atom. So, this says that the 2 must appear between 3 and 5 times; the 2 only appears once in our original string, so this fails. Hence, we get False.

If we repeat this with just two 2s, we see that it also results in False, as seen in the following screenshot:

But, once we enter three 2s into our expression, we will see that it will give us True:

Since this custom modifier says at most 5, entering six 2s is going to fail:
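The four cases above can be sketched in Python as follows. The pattern `12{3,5}3` is inferred from the description of low and high values in curly braces; the `matches` helper is a hypothetical name, and the notebook's actual code may differ.

```python
import re

# Assumed pattern: a 1, then between 3 and 5 copies of 2, then a 3.
CUSTOM_PATTERN = r"12{3,5}3"


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(matches(CUSTOM_PATTERN, "123"))       # False: one 2 is below the minimum
print(matches(CUSTOM_PATTERN, "1223"))      # False: two 2s are still too few
print(matches(CUSTOM_PATTERN, "12223"))     # True: three 2s meet the minimum
print(matches(CUSTOM_PATTERN, "12222223"))  # False: six 2s exceed the maximum
```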

So, those were the four modifiers that we covered in this section. We've covered the ?, which means that an atom can appear 0 or 1 times. We've covered the *, which means 0 or more times. We've covered the + sign, which means 1 or more times; and then we have the custom modifier, which is based on our own criteria.

Any atom that's been modified so that its minimum number of required matches is 0, such as with the ?, the *, or a custom modifier with a minimum of 0, can always match, even against nothing. You need to be mindful of this: if all of the atoms in your expression are modified in this manner, your expression will always be True, and then it is not useful to you.
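As a rough illustration of this caveat, the Python sketch below tries a few patterns in which every atom has a minimum of 0. The patterns `u?` and `1{0,3}` and the sample strings are hypothetical examples, not taken from the notebook, and `matches` is an assumed helper built on `re.search`.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# Hypothetical patterns in which every atom may match zero times.
always_true_patterns = ["u?", "1*2*", "1{0,3}"]
# Hypothetical inputs, including the empty string.
samples = ["", "hello", "1122", "colour"]

# For each pattern, check whether it matched every sample.
results = {p: all(matches(p, s) for s in samples) for p in always_true_patterns}
print(results)  # {'u?': True, '1*2*': True, '1{0,3}': True}
```

Requiring at least one atom with a nonzero minimum, for example by switching a * to a +, is one way to make such a pattern able to fail.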

In the next section, we're going to introduce character classes.
