Strings in Python are the way we work with text. Words, sentences, paragraphs, and even entire files are read into and manipulated via strings. Because so much of our work revolves around text, it’s no surprise that strings are one of the most common data types.
You should remember two important things about Python strings: (1) they’re immutable, and (2) in Python 3, they contain Unicode characters, which are typically encoded in UTF-8 when written to files or sent over a network. (See the sidebars on each of these subjects.)
There’s no such thing as a “character” type in Python. We can talk about a “one-character string,” but that just means a string whose length is 1.
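Here is a minimal sketch of these points, using a sample string of my own choosing. It shows that assigning to an index fails, that a new string has to be built instead, that indexing returns a string of length 1, and that encoding into UTF-8 produces bytes.

```python
s = 'python'

# Strings are immutable: assigning to an index raises TypeError.
try:
    s[0] = 'P'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Instead, build a new string from pieces of the old one.
capitalized = 'P' + s[1:]

# There is no character type: indexing returns a str whose length is 1.
first = s[0]
print(type(first), len(first))

# Encoding a str gives bytes; a non-ASCII character needs more than one byte.
e_acute = '\u00e9'                 # the single character "e with acute accent"
encoded = e_acute.encode('utf-8')
print(len(e_acute), len(encoded))
```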
Python’s strings are interesting and useful, not only because they allow us to work with text, but also because they’re a Python sequence. This means that we can iterate over them (character by character), retrieve their elements via numeric indexes, and search in them with the in operator.
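The three sequence behaviors just mentioned can be sketched as follows. The sample word and the variable names are only for illustration.

```python
word = 'python'

# Iterate over the string, one character at a time.
letters = [letter for letter in word]

# Retrieve elements by numeric index; negative indexes count from the end.
first, last = word[0], word[-1]

# A slice returns a new string containing part of the original.
middle = word[1:4]

# The in operator searches for a substring.
has_th = 'th' in word

print(letters, first, last, middle, has_th)
```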
This chapter includes exercises designed to help you work with strings in a variety of ways. The more familiar you are with Python’s string manipulation techniques, the easier it will be to work with text.
Pig Latin (http://mng.bz/YrON) is a common children’s “secret” language in English-speaking countries. (It’s normally secret among children who forget that their parents were once children themselves.) The rules for translating words from English into Pig Latin are quite simple:
If the word begins with a vowel (a, e, i, o, or u), add “way” to the end of the word. So “air” becomes “airway” and “eat” becomes “eatway.”
If the word begins with any other letter, then we take the first letter, put it on the end of the word, and then add “ay.” Thus, “python” becomes “ythonpay” and “computer” becomes “omputercay.”
(And yes, I recognize that the rules can be made more sophisticated. Let’s keep it simple for the purposes of this exercise.)
For this exercise, write a Python function (pig_latin) that takes a string as input, assumed to be an English word. The function should return the translation of this word into Pig Latin. You may assume that the word contains no capital letters or punctuation.
This exercise isn’t meant to help you translate documents into Pig Latin for your job. (If that is your job, then I really have to question your career choices.) However, it demonstrates some of the powerful techniques that you should know when working with sequences, including searches, iteration, and slices. It’s hard to imagine a Python program that doesn’t include any of these techniques.
This has long been one of my favorite exercises to give students in my introductory programming classes. It was inspired by Brian Harvey, whose excellent series Computer Science Logo Style (http://mng.bz/gyNl) has long been one of my favorites for beginning programmers.
The first thing to consider for this solution is how we’ll check whether word[0], the first letter in word, is a vowel. I’ve often seen people start to use a loop, as in:
starts_with_vowel = False
for vowel in 'aeiou':
    if word[0] == vowel:
        starts_with_vowel = True
        break
Although that code works, it’s already starting to look a bit clumsy and convoluted.
Another solution that I commonly see is this:
if (word[0] == 'a' or
    word[0] == 'e' or
    word[0] == 'i' or
    word[0] == 'o' or
    word[0] == 'u'):
    starts_with_vowel = True
As I like to say to my students, “Unfortunately, this code works.” Why do I dislike this code so much? Not only is it longer than necessary, but it’s highly repetitive. The don’t repeat yourself (DRY) rule should always be at the back of your mind when writing code.
Moreover, Python programs tend to be short. If you find yourself repeating yourself and writing an unusually long expression or condition, you’ve likely missed a more Pythonic way of doing things.
We can take advantage of the fact that Python sees a string as a sequence, and use the built-in in operator to search for word[0] in a string containing the vowels:
if word[0] in 'aeiou':
That single line has the combined advantage of being readable, short, accurate, and fairly efficient. True, the time needed to search through a string--or any other Python sequence--rises along with the length of the sequence. But such linear time, sometimes expressed as O(n), is often good enough, especially when the strings through which we’ll be searching are fairly short.
Tip: The in operator works on all sequences (strings, lists, and tuples) and many other Python collections. On a sequence, it effectively runs a for loop over the elements. Using in on a dict also works, but it checks only the keys, ignoring the values.
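To make the tip concrete, here is a small sketch of in applied to several collection types. The dict of names and ages is hypothetical, invented only for this illustration.

```python
vowels = 'aeiou'

in_string = 'e' in vowels        # single-character membership
substring = 'ei' in vowels       # on strings, in also matches substrings
in_list = 3 in [1, 2, 3]
in_tuple = 3 in (1, 2, 4)

# Hypothetical dict, made up for this illustration.
ages = {'alice': 30, 'bob': 25}
key_found = 'alice' in ages          # in looks at the keys
value_found = 30 in ages             # values are ignored
in_values = 30 in ages.values()      # search the values explicitly

print(in_string, substring, in_list, in_tuple)
print(key_found, value_found, in_values)
```

If you need to look for a value rather than a key, searching in the result of dict.values is one way to do it.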
Once we’ve determined whether the word begins with a vowel, we can apply the appropriate Pig Latin rule.
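def pig_latin(word):
    # Words that start with a vowel just get "way" added to the end
    if word[0] in 'aeiou':
        return f'{word}way'

    # Otherwise, move the first letter to the end and add "ay"
    return f'{word[1:]}{word[0]}ay'

print(pig_latin('python'))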
You can work through a version of this code in the Python Tutor at http://mng.bz/XP5M.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
It’s hard to exaggerate just how often you’ll need to work with strings in Python. Moreover, Python is often used in text analysis and manipulation. Here are some ways that you can extend the exercise to push yourself further:
Handle capitalized words: If a word is capitalized (i.e., the first letter is capitalized, but the rest of the word isn’t), then the Pig Latin translation should be similarly capitalized.
Handle punctuation: If a word ends with punctuation, then that punctuation should be shifted to the end of the translated word.
Consider an alternative version of Pig Latin: We don’t check to see if the first letter is a vowel, but, rather, we check to see if the word contains two different vowels. If it does, we don’t move the first letter to the end. Because the word “wine” contains two different vowels (“i” and “e”), we’ll add “way” to the end of it, giving us “wineway.” By contrast, the word “wind” contains only one vowel, so we would move the first letter to the end and add “ay,” rendering it “indway.” How would you check for two different vowels in the word? (Hint: sets can come in handy here.)
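If you get stuck on the hint about sets, here is one possible sketch of the check, not the only way to do it. The function name has_two_different_vowels is my own invention for this illustration.

```python
def has_two_different_vowels(word):
    # Hypothetical helper: turning the word into a set removes repeated
    # letters, and & keeps only the letters that are also vowels.
    vowels_in_word = set(word) & set('aeiou')
    return len(vowels_in_word) >= 2

print(has_two_different_vowels('wine'))
print(has_two_different_vowels('wind'))
```

Because a set holds each element only once, a word such as “noon” counts as having a single vowel, no matter how many times that vowel appears.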
Now that you’ve successfully written a translator for a single English word, let’s make things more difficult: translate a series of English words into Pig Latin. Write a function called pl_sentence that takes a string containing several words, separated by spaces. (To make things easier, we won’t actually ask for a real sentence. More specifically, there will be no capital letters or punctuation.)
pl_sentence('this is a test translation')
histay isway away estay ranslationtay
Print the output on a single line, rather than with each word on a separate line.
This exercise might seem, at least superficially, like the previous one. But here, the emphasis is not on the Pig Latin translation. Rather, it’s on the ways we typically use loops in Python, and how loops go together with breaking strings apart and putting them back together again. It’s also common to want to take a sequence of strings and print them out on a single line. There are a few ways to do this, and I want you to consider the advantages and disadvantages of each one.
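As a starting point for that comparison, here is a sketch of three common ways to get a list of words onto one line. The list of words is simply sample data.

```python
words = ['histay', 'isway', 'away']

# Option 1: print each word with end=' ' instead of a newline.
# This leaves a trailing space, and we need an extra print() to end the line.
for word in words:
    print(word, end=' ')
print()

# Option 2: unpack the list, so that print separates the items with spaces.
print(*words)

# Option 3: build one string with str.join, which we can print, return, or store.
line = ' '.join(words)
print(line)
```

The first two options only display the words, whereas the third produces a string that a function can return to its caller.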
The core of the solution is nearly identical to the one in the previous section, in which we translated a single word into Pig Latin. Once again, we’re getting a text string as input from the user. The difference is that, in this case, rather than treating the string as a single word, we’re treating it as a sentence--meaning that we need to separate it into individual words. We can do that with str.split (http://mng.bz/aR4z). str.split can take an argument, which determines which string should be used as the separator between fields.
It’s often the case that you want to use any and all whitespace characters, regardless of how many there are, to split the fields. In such a case, don’t pass an argument at all; Python will then treat any number of spaces, tabs, and newlines as a single separation character. The difference can be significant:
s = 'abc  def  ghi'    ❶
s.split(' ')           ❷
s.split()              ❸
❶ Two spaces separate each pair of words
❷ Returns ['abc', '', 'def', '', 'ghi']
❸ Returns ['abc', 'def', 'ghi']
Note: If you don’t pass any arguments to str.split, it’s effectively the same as passing None. You can pass any string to str.split, not just a single-character string. This means that if you want to split on ::, you can do that. However, you can’t split on more than one separator at once, such as treating both , and :: as field separators. To do that, you’ll need to use regular expressions and the re.split function in the Python standard library, described here: http://mng.bz/K2RK.
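The following sketch contrasts splitting on one multicharacter separator with splitting on two different separators. The sample record and the regular expression are my own, chosen only to illustrate the idea.

```python
import re

# Hypothetical input that mixes two separators, , and ::
record = 'a,b::c,d'

# str.split accepts one separator, which may be longer than one character.
split_on_colons = record.split('::')

# re.split accepts a pattern; | means "either of these alternatives".
split_on_both = re.split(r',|::', record)

# Passing None to str.split behaves like passing no argument at all.
same_as_default = 'abc  def'.split(None) == 'abc  def'.split()

print(split_on_colons)
print(split_on_both)
print(same_as_default)
```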
Thus, we can take the user’s input and break it into words--again, assuming that there are no punctuation characters--and then translate each individual word into Pig Latin. Whereas the one-word version of our program could simply print its output right away, this one needs to store the accumulated output and then print it all at once. It’s certainly possible to use a string for that, and to invoke += on the string with each iteration. But as a general rule, it’s not a good idea to build strings in that way. Rather, you should add elements to a list using list.append (http://mng.bz/Mdlm) and then invoke str.join to turn the list’s elements into a long string.
That’s because strings are immutable, and += on a string forces Python to create a new string. If we’re adding to a string many times, then each time will trigger the creation of a new object whose contents will be larger than the previous iteration. By contrast, lists are mutable, and adding to them with list.append is relatively inexpensive, in both memory and computation.
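Here is a sketch of the two approaches side by side, using sample data of my own. Both produce the same result; the difference lies in how many intermediate strings get created along the way.

```python
words = ['this', 'is', 'a', 'test']

# Approach 1: grow a string with +=. Each iteration builds a new string object.
result = ''
for word in words:
    result += word.upper() + ' '
result = result.rstrip()          # remove the trailing space we added

# Approach 2: append to a list, then join once at the end.
parts = []
for word in words:
    parts.append(word.upper())
joined = ' '.join(parts)

print(result)
print(joined)
```

With only four words, you won’t notice any difference between the two. Treat the advice as a rule of thumb that matters more as the number of iterations grows.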
def pl_sentence(sentence):
    output = []                      # translated words accumulate here
    for word in sentence.split():    # split on any run of whitespace
        if word[0] in 'aeiou':
            output.append(f'{word}way')
        else:
            output.append(f'{word[1:]}{word[0]}ay')

    return ' '.join(output)          # one string, words separated by spaces

print(pl_sentence('this is a test'))
You can work through a version of this code in the Python Tutor at http://mng.bz/yydE.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
Splitting, joining, and manipulating strings are common actions in Python. Here are some additional activities you can try to push yourself even further:
Take a text file, creating (and printing) a nonsensical sentence from the nth word on each of the first 10 lines, where n is the line number.
Write a function that transposes a list of strings, in which each string contains multiple words separated by whitespace. Specifically, it should perform in such a way that if you were to pass the list ['abc def ghi', 'jkl mno pqr', 'stu vwx yz'] to the function, it would return ['abc jkl stu', 'def mno vwx', 'ghi pqr yz'].
Read through an Apache logfile. If there is a 404 error--you can just search for ' 404 ', if you want--display the IP address, which should be the first element.
When they hear that Python’s strings are immutable, many people wonder how the language can be used for text processing. After all, if you can’t modify strings, then how can you do any serious work with them?
Moreover, there are times when a simple for loop, as we used with the Pig Latin examples, won’t work. If we’re modifying each word only once, then that’s fine, but if we’re potentially modifying it several times, we have to make sure that each modification won’t affect future modifications.
This exercise is meant to help you practice thinking in this way. Here, you’ll implement a translator from English into another secret children’s language, Ubbi Dubbi (http://mng.bz/90zl). (This was popularized on the wonderful American children’s program Zoom, which was on television when I was growing up.) The rules of Ubbi Dubbi are even simpler than those of Pig Latin, although programming a translator is more complex and requires a bit more thinking.
In Ubbi Dubbi, every vowel (a, e, i, o, or u) is prefaced with ub. Thus milk becomes mubilk (m-ub-ilk) and program becomes prubogrubam (prub-ogrub-am). In theory, you only put an ub before every vowel sound, rather than before each vowel. Given that this is a book about Python and not linguistics, I hope that you’ll forgive this slight difference in definition.
Ubbi Dubbi is enormously fun to speak, and it’s somewhat magical if and when you can begin to understand someone else speaking it. Even if you don’t understand it, Ubbi Dubbi sounds extremely funny. See some YouTube videos on the subject, such as http://mng.bz/aRMY, if you need convincing.
For this exercise, you’ll write a function (called ubbi_dubbi) that takes a single word (string) as an argument. It returns a string, the word’s translation into Ubbi Dubbi. So if the function is called with octopus, the function will return the string uboctubopubus. And if the user passes the argument elephant, you’ll output ubelubephubant.
As with the original Pig Latin translator, you can ignore capital letters, punctuation, and corner cases, such as multiple vowels combining to create a new sound. When you do have two vowels next to one another, preface each of them with ub. Thus, soap will become suboubap, despite the fact that oa combines into a single vowel sound.
Much like the “Pig Latin sentence” exercise, this brings to the forefront the various ways we often need to scan through strings for particular patterns, or translate from one Python data structure or pattern to another, and how iterations can play a central role in doing so.
The task here is to ask the user for a word, and then to translate that word into Ubbi Dubbi. This is a slightly different task than we had with Pig Latin, because we need to operate on a letter-by-letter basis. We can’t simply analyze the word and produce output based on the entire word. Moreover, we have to avoid getting ourselves into an infinite loop, in which we try to add ub before the u in ub.
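To see why repeated modifications can interfere with one another, consider this sketch of a tempting but broken approach, in which we call str.replace once per vowel on the whole word. The variable names are mine; the point is that the final replacement, for u, also rewrites the u in every ub we already inserted.

```python
word = 'soap'

# Naive attempt: replace each vowel in turn, across the whole word.
naive = word
for vowel in 'aeiou':
    naive = naive.replace(vowel, f'ub{vowel}')

# Single pass: look at each original letter exactly once.
single_pass = ''.join(f'ub{letter}' if letter in 'aeiou' else letter
                      for letter in word)

print(naive)          # sububoububap
print(single_pass)    # suboubap
```

The single-pass version never revisits text that it has already produced, so the inserted ub prefixes are left alone.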
The solution is to iterate over each character in word, adding it to a list, output. If the current character is a vowel, then we add ub before the letter. Otherwise, we just add the letter. At the end of the program, we join and then print the letters together. This time, we don’t join the letters together with a space character (' '), but rather with an empty string (''). This means that the resulting string will consist of the letters joined together with nothing between them--or, as we often call such collections, a word.
def ubbi_dubbi(word):
    output = []
    for letter in word:
        if letter in 'aeiou':
            output.append(f'ub{letter}')    ❶
        else:
            output.append(letter)

    return ''.join(output)

print(ubbi_dubbi('python'))
❶ Why append to a list, and not to a string? To avoid allocating too much memory. For short strings, it’s not a big deal. But for long loops and large strings, it’s a bad idea.
You can work through this code in the Python Tutor at http://mng.bz/eQJZ.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
It’s common to want to replace one value with another in strings. Python has a few different ways to do this. You can use str.replace (http://mng.bz/WPe0) or str.translate (http://mng.bz/8pyP), two string methods that translate strings and sets of characters, respectively. But sometimes, there’s no choice but to iterate over a string, look for the pattern we want, and then append the modified version to a list that we grow over time:
Handle capitalized words: If a word is capitalized (i.e., the first letter is capitalized, but the rest of the word isn’t), then the Ubbi Dubbi translation should be similarly capitalized.
Remove author names: In academia, it’s common to remove the authors’ names from a paper submitted for peer review. Given a string containing an article and a separate list of strings containing authors’ names, replace all names in the article with _ characters.
URL-encode characters: In URLs, we often replace special and nonprintable characters with a % followed by the character’s ASCII value in hexadecimal. For example, if a URL is to include a space character (ASCII 32, aka 0x20), we replace it with %20. Given a string, URL-encode any character that isn’t a letter or number. For the purposes of this exercise, we’ll assume that all characters are indeed in ASCII (i.e., one byte long), and not multibyte UTF-8 characters. It might help to know about the ord (http://mng.bz/EdnJ) and hex (http://mng.bz/nPxg) functions.
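As a rough sketch of the building blocks named in this section (str.replace, str.translate, ord, and hex), here is how each behaves on a small sample string. The sample text and the character mapping are invented for this illustration, and the format specifier at the end is just one way to produce two uppercase hexadecimal digits.

```python
text = 'hello world'

# str.replace swaps one substring for another.
replaced = text.replace('world', 'there')

# str.translate maps individual characters, using a table built by str.maketrans.
table = str.maketrans('lo', '01')      # l becomes 0, o becomes 1
translated = text.translate(table)

# ord gives a character's numeric code; hex shows a number in hexadecimal.
code = ord(' ')
as_hex = hex(code)

# A format specifier can produce the two hex digits used after the % sign.
encoded_space = f'%{code:02X}'

print(replaced, translated, code, as_hex, encoded_space)
```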
If strings are immutable, then does this mean we’re stuck with them forever, precisely as they are? Kind of--we can’t change the strings themselves, but we can create new strings based on them, using a combination of built-in functions and string methods. Knowing how to work around strings’ immutability and piece together functionality that effectively changes strings, even though they’re immutable, is a useful skill to have.
In this exercise, you’ll explore this idea by writing a function, strsort, that takes a single string as its input and returns a string. The returned string should contain the same characters as the input, except that its characters should be sorted in order, from the lowest Unicode value to the highest Unicode value. For example, the result of invoking strsort('cba') will be the string abc.
The solution’s implementation of strsort takes advantage of the fact that Python strings are sequences. Normally, we think of this as relevant in a for loop, in that we can iterate over the characters in a string. However, we don’t need to restrict ourselves to such situations.
For example, we can use the built-in sorted (http://mng.bz/pBEG) function, which takes an iterable--which means not only a sequence, but anything over which we can iterate, such as a set or a file--and returns its elements in sorted order. Invoking sorted on our string will thus do the job, in that it will sort the characters in Unicode order. However, it returns a list, rather than a string.
To turn our list into a string, we use the str.join method (http://mng.bz/gyYl). We use an empty string ('') as the glue we’ll use to join the elements, thus returning a new string whose characters are the same as the input string, but in sorted order.
def strsort(a_string):
    # sorted returns a list of characters; join glues them back into a string
    return ''.join(sorted(a_string))

print(strsort('cbjeaf'))
You can work through this code in the Python Tutor at http://mng.bz/pBd0.
Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.
This exercise is designed to give you additional reminders that strings are sequences and can thus be put wherever other sequences (lists and tuples) can be used. We don’t often think in terms of sorting a string, but there’s no difference between running sorted on a string, a list, or a tuple. The elements (in the case of a string, the characters) are returned in sorted order.
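A quick sketch shows that sorted treats all three sequence types alike and always hands back a list. The sample values are mine.

```python
from_string = sorted('cba')
from_list = sorted(['c', 'b', 'a'])
from_tuple = sorted(('c', 'b', 'a'))

# Sorting follows Unicode order, so uppercase letters come before lowercase.
mixed_case = ''.join(sorted('bAa'))

print(from_string, from_list, from_tuple, mixed_case)
```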
However, sorted (http://mng.bz/pBEG) returns a list, and we wanted to get a string. We thus needed to turn the resulting list back into a string--something that str.join is designed to do. str.split (http://mng.bz/aR4z) and str.join (http://mng.bz/gyYl) are two methods with which you should become intimately familiar because they’re so useful and help in so many cases.
Consider a few other variations of, and extensions to, this exercise, which also use str.split and str.join, as well as sorted:
Given the string “Tom Dick Harry,” break it into individual words, and then sort those words alphabetically. Once they’re sorted, print them with commas (,) between the names.
Note that for the second and third challenges, you may well want to read up on the key parameter and the types of values you can pass to it. A good introduction, with examples, is here: http://mng.bz/D28E.
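To give a taste of the key parameter, here is a short sketch using a made-up list of words. The value passed to key is a function that sorted calls on each element; the elements are then ordered by what that function returns.

```python
# Hypothetical sample data, chosen so that the orderings differ.
words = ['banana', 'Cherry', 'apple']

default_order = sorted(words)                       # Unicode order
ignoring_case = sorted(words, key=str.lower)        # compare lowercase versions
by_last_letter = sorted(words, key=lambda w: w[-1]) # compare final letters

print(default_order)
print(ignoring_case)
print(by_last_letter)
```

Any function that takes one argument can serve as a key, including built-ins such as len, methods such as str.lower, and small lambda expressions.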
Python programmers are constantly dealing with text. Whether it’s because we’re reading from files, displaying things on the screen, or just using dicts, strings are a data type with which we’re likely familiar from other languages.
At the same time, strings in Python are unusual, in that they’re also sequences--and thus, thinking in Python requires that you consider their sequence-like qualities. This means searching (using in), sorting (using sorted), and using slices. It also means thinking about how you can turn strings into lists (using str.split) and turn sequences back into strings (using str.join). While these might seem like simple tasks, they crop up on a regular basis in production Python code. The fact that these data structures and methods are written in C, and have been around for many years, means they’re also highly efficient--and not worth reinventing.