2 Strings

Strings in Python are the way we work with text. Words, sentences, paragraphs, and even entire files are read into and manipulated via strings. Because so much of our work revolves around text, it’s no surprise that strings are one of the most common data types.

You should remember two important things about Python strings: (1) they’re immutable, and (2) in Python 3, they contain Unicode characters, which are typically encoded in UTF-8 when stored or transmitted. (See the sidebars on each of these subjects.)

There’s no such thing as a “character” type in Python. We can talk about a “one-character string,” but that just means a string whose length is 1.

Python’s strings are interesting and useful, not only because they allow us to work with text, but also because they’re a Python sequence. This means that we can iterate over them (character by character), retrieve their elements via numeric indexes, and search in them with the in operator.
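The three sequence behaviors just mentioned can be seen in a few lines. This is only an illustrative sketch; the variable names and the sample value are arbitrary choices and don’t come from the exercises.

```python
s = 'hello'

# Iterate over the string, one character at a time
letters = [one_char for one_char in s]

# Retrieve elements via numeric indexes (negative indexes count from the end)
first = s[0]
last = s[-1]

# Search with the in operator
has_e = 'e' in s
has_z = 'z' in s

print(letters)        # ['h', 'e', 'l', 'l', 'o']
print(first, last)    # h o
print(has_e, has_z)   # True False
```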

This chapter includes exercises designed to help you work with strings in a variety of ways. The more familiar you are with Python’s string manipulation techniques, the easier it will be to work with text.

Useful references

Table 2.1 What you need to know

| Concept | What is it? | Example | To learn more |
|---|---|---|---|
| in | Operator for searching in a sequence | 'a' in 'abcd' | http://mng.bz/yy2G |
| Slice | Retrieves a subset of elements from a sequence | 'abcdefg'[1:7:2]  # returns 'bdf' | http://mng.bz/MdW7 |
| str.split | Breaks strings apart, returning a list | 'abc def ghi'.split()  # returns ['abc', 'def', 'ghi'] | http://mng.bz/aR4z |
| str.join | Combines strings to create a new one | '*'.join(['abc', 'def', 'ghi'])  # returns 'abc*def*ghi' | http://mng.bz/gyYl |
| list.append | Adds an element to a list | mylist.append('hello') | http://mng.bz/aR7z |
| sorted | Returns a sorted list, based on an input sequence | sorted([10, 30, 20])  # returns [10, 20, 30] | http://mng.bz/pBEG |
| Iterating over files | Opens a file and iterates over its lines one at a time | for one_line in open(filename): | http://mng.bz/OMAn |

Exercise 5 Pig Latin

Pig Latin (http://mng.bz/YrON) is a common children’s “secret” language in English-speaking countries. (It’s normally secret among children who forget that their parents were once children themselves.) The rules for translating words from English into Pig Latin are quite simple:

  • If the word begins with a vowel (a, e, i, o, or u), add “way” to the end of the word. So “air” becomes “airway” and “eat” becomes “eatway.”

  • If the word begins with any other letter, then we take the first letter, put it on the end of the word, and then add “ay.” Thus, “python” becomes “ythonpay” and “computer” becomes “omputercay.”

(And yes, I recognize that the rules can be made more sophisticated. Let’s keep it simple for the purposes of this exercise.)

For this exercise, write a Python function (pig_latin) that takes a string as input, assumed to be an English word. The function should return the translation of this word into Pig Latin. You may assume that the word contains no capital letters or punctuation.

This exercise isn’t meant to help you translate documents into Pig Latin for your job. (If that is your job, then I really have to question your career choices.) However, it demonstrates some of the powerful techniques that you should know when working with sequences, including searches, iteration, and slices. It’s hard to imagine a Python program that doesn’t include any of these techniques.

Working it out

This has long been one of my favorite exercises to give students in my introductory programming classes. It was inspired by Brian Harvey, whose excellent series Computer Science Logo Style (http://mng.bz/gyNl) has long been one of my favorites for beginning programmers.

The first thing to consider for this solution is how we’ll check to make sure that word[0], the first letter in word, is a vowel. I’ve often seen people start to use a loop, as in

starts_with_vowel = False
for vowel in 'aeiou':
    if word[0] == vowel:
        starts_with_vowel = True
        break

Although that code works, it’s already starting to look a bit clumsy and convoluted.

Another solution that I commonly see is this:

if (word[0] == 'a' or word[0] == 'e' or
        word[0] == 'i' or word[0] == 'o' or word[0] == 'u'):
    starts_with_vowel = True

As I like to say to my students, “Unfortunately, this code works.” Why do I dislike this code so much? Not only is it longer than necessary, but it’s highly repetitive. The don’t repeat yourself (DRY) rule should always be at the back of your mind when writing code.

Moreover, Python programs tend to be short. If you find yourself repeating yourself and writing an unusually long expression or condition, you’ve likely missed a more Pythonic way of doing things.

We can take advantage of the fact that Python sees a string as a sequence, and use the built-in in operator to search for word[0] in a string containing the vowels:

if word[0] in 'aeiou':

That single line has the combined advantage of being readable, short, accurate, and fairly efficient. True, the time needed to search through a string--or any other Python sequence--rises along with the length of the sequence. But such linear time, sometimes expressed as O(n), is often good enough, especially when the strings through which we’ll be searching are fairly short.

Tip The in operator works on all sequences (strings, lists, and tuples) and many other Python collections. It effectively runs a for loop on the elements. Thus, using in on a dict will work but will only search through the keys, ignoring the values.
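Here is a brief sketch of how in behaves on a few collection types. The sample values, including the dict named prices, are made up for illustration. One difference is worth knowing: on a string, in also matches multi-character substrings, whereas on a list or tuple it compares against whole elements.

```python
# On a string, in matches single characters and longer substrings
print('a' in 'aeiou')                        # True
print('ei' in 'aeiou')                       # True

# On a list or tuple, in compares against each element
print(20 in [10, 20, 30])                    # True
print('ei' in ['a', 'e', 'i', 'o', 'u'])     # False

# On a dict, in searches only the keys
prices = {'apple': 3, 'pear': 5}             # hypothetical sample data
print('apple' in prices)                     # True
print(3 in prices)                           # False
print(3 in prices.values())                  # True
```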

Once we’ve determined whether the word begins with a vowel, we can apply the appropriate Pig Latin rule.

Slices

All of Python’s sequences--strings, lists, and tuples--support slicing. The idea is that if I say

s = 'abcdefgh'
print(s[2:6])    # prints 'cdef'

I’ll get all of the characters from s, starting at index 2 and until (but not including) index 6, meaning the string cdef. A slice can also indicate the step size:

s = 'abcdefgh'
print(s[2:6:2])    # prints 'ce'

This code will print the string ce, since we start at index 2 (c), move forward two indexes to e, and then reach the end.

Slices are Python’s way of retrieving a subset of elements from a sequence. You can even omit the starting and/or ending index to indicate that you want to start from the sequence’s first element or end at its last element. For example, we can get every other character from our string with

s = 'abcdefgh'
print(s[::2])     # prints 'aceg'

Solution

def pig_latin(word):
    if word[0] in 'aeiou':
        return f'{word}way'
 
    return f'{word[1:]}{word[0]}ay'
 
 
print(pig_latin('python'))

You can work through a version of this code in the Python Tutor at http://mng.bz/XP5M.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

It’s hard to exaggerate just how often you’ll need to work with strings in Python. Moreover, Python is often used in text analysis and manipulation. Here are some ways that you can extend the exercise to push yourself further:

  • Handle capitalized words--If a word is capitalized (i.e., the first letter is capitalized, but the rest of the word isn’t), then the Pig Latin translation should be similarly capitalized.

  • Handle punctuation--If a word ends with punctuation, then that punctuation should be shifted to the end of the translated word.

  • Consider an alternative version of Pig Latin--We don’t check to see if the first letter is a vowel, but, rather, we check to see if the word contains two different vowels. If it does, we don’t move the first letter to the end. Because the word “wine” contains two different vowels (“i” and “e”), we’ll add “way” to the end of it, giving us “wineway.” By contrast, the word “wind” contains only one vowel, so we would move the first letter to the end and add “ay,” rendering it “indway.” How would you check for two different vowels in the word? (Hint: sets can come in handy here.)
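For the last variation, one possible way to follow the hint is to intersect the word’s letters with a set of vowels. The helper name count_distinct_vowels is hypothetical, and this sketch covers only the check, not the full translator.

```python
def count_distinct_vowels(word):
    """Return how many different vowels appear in word."""
    # set(word) keeps one copy of each letter; & keeps only the shared members
    return len(set(word) & set('aeiou'))


print(count_distinct_vowels('wine'))   # 2 ('i' and 'e')
print(count_distinct_vowels('wind'))   # 1 ('i' only)
```

Because a set discards duplicates, a word such as “banana” still counts as having a single vowel, which is what the rule calls for.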

Immutable?

One of the most important concepts in Python is the distinction between mutable and immutable data structures. The basic idea is simple: if a data structure is immutable, then it can’t be changed--ever.

For example, you might define a string and then try to change it:

s = 'abcd'
s[0] = '!'    # raises TypeError

This code won’t work: Python raises a TypeError, telling you that you’re not allowed to modify a string.

Many data structures in Python are immutable, including such basics as integers and Boolean values. But strings are where people get tripped up most often, partly because we use strings so often, and partly because many other languages have mutable strings.

Why would Python do such a thing? There are a number of reasons, chief among which is that it makes the implementation more efficient. But it also has to do with the fact that strings are the most common type used as dict keys. If strings were mutable, they wouldn’t be allowed as dict keys--or we’d have to allow for mutable keys in dicts, which would create a whole host of other issues.

Because immutable data can’t be changed, we can make a number of assumptions about it. If we pass an immutable type to a function, then the function won’t modify it. If we share immutable data across threads, then we don’t have to worry about locking it, because it can’t be changed. And if we invoke a method on an immutable type, then we get a new object back--because we can’t modify immutable data.

Learning to work with immutable strings takes some time, but the trade-offs are generally worthwhile. If you find yourself needing a mutable string type, then you might want to look at StringIO (http://mng.bz/045x), which provides file-like access to a mutable, in-memory type.
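The following minimal sketch contrasts the two ideas from this sidebar: string methods hand back new objects, while io.StringIO from the standard library gives you an in-memory buffer that you can keep writing to. The variable names and values are illustrative only.

```python
from io import StringIO

s = 'abcd'
upper = s.upper()      # str methods return a new string; s itself is unchanged
print(s, upper)        # abcd ABCD

# StringIO is a file-like, in-memory buffer that accepts repeated writes
buffer = StringIO()
buffer.write('abcd')
buffer.write('efgh')
result = buffer.getvalue()
print(result)          # abcdefgh
```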

Many newcomers to Python think that immutable is just another word for constant, but it isn’t. Constants, which many programming languages offer, permanently connect a name with a value. In Python, there’s no such thing as a constant; you can always reassign a name to point to a new value. But you can’t modify a string or a tuple, no matter how hard you try; for example

s = 'abcd'
s[0] = '!'    # Not allowed, since strings are immutable
t = s         # The variables s and t now refer to the same string.
s = '!bcd'    # The variable s now refers to the new string, but t continues to refer to the old string, unchanged.

Exercise 6 Pig Latin sentence

Now that you’ve successfully written a translator for a single English word, let’s make things more difficult: translate a series of English words into Pig Latin. Write a function called pl_sentence that takes a string containing several words, separated by spaces. (To make things easier, we won’t actually ask for a real sentence. More specifically, there will be no capital letters or punctuation.)

So, if someone were to call

pl_sentence('this is a test translation')

the output would be

histay isway away esttay ranslationtay

Print the output on a single line, rather than with each word on a separate line.

This exercise might seem, at least superficially, like the previous one. But here, the emphasis is not on the Pig Latin translation. Rather, it’s on the ways we typically use loops in Python, and how loops go together with breaking strings apart and putting them back together again. It’s also common to want to take a sequence of strings and print them out on a single line. There are a few ways to do this, and I want you to consider the advantages and disadvantages of each one.

Working it out

The core of the solution is nearly identical to the one in the previous section, in which we translated a single word into Pig Latin. Once again, we’re getting a text string as input from the user. The difference is that, in this case, rather than treating the string as a single word, we’re treating it as a sentence--meaning that we need to separate it into individual words. We can do that with str.split (http://mng.bz/aR4z). str.split can take an argument, which determines which string should be used as the separator between fields.

It’s often the case that you want to use any and all whitespace characters, regardless of how many there are, to split the fields. In such a case, don’t pass an argument at all; Python will then treat any number of spaces, tabs, and newlines as a single separation character. The difference can be significant:

s = 'abc  def  ghi'     # Two spaces separating the words
s.split(' ')            # Returns ['abc', '', 'def', '', 'ghi']
s.split()               # Returns ['abc', 'def', 'ghi']

Note If you don’t pass any arguments to str.split, it’s effectively the same as passing None. You can pass any string to str.split, not just a single-character string. This means that if you want to split on ::, you can do that. However, you can’t split on more than one thing, saying that both , and :: are field separators. To do that, you’ll need to use regular expressions and the re.split function in the Python standard library, described here: http://mng.bz/K2RK.
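As a small sketch of the re.split approach mentioned in the note, the pattern below treats both :: and a comma as field separators. The input string and the variable names are hypothetical.

```python
import re

record = 'abc,def::ghi'    # hypothetical input mixing two separators

# The pattern matches either :: or a comma
fields = re.split(r'::|,', record)
print(fields)                          # ['abc', 'def', 'ghi']

# str.split accepts a multi-character separator, but only one at a time
print('abc::def::ghi'.split('::'))     # ['abc', 'def', 'ghi']
```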

Thus, we can take the user’s input and break it into words--again, assuming that there are no punctuation characters--and then translate each individual word into Pig Latin. Whereas the one-word version of our program could simply print its output right away, this one needs to store the accumulated output and then print it all at once. It’s certainly possible to use a string for that, and to invoke += on the string with each iteration. But as a general rule, it’s not a good idea to build strings in that way. Rather, you should add elements to a list using list.append (http://mng.bz/Mdlm) and then invoke str.join to turn the list’s elements into a long string.

That’s because strings are immutable, and += on a string forces Python to create a new string. If we’re adding to a string many times, then each time will trigger the creation of a new object whose contents will be larger than the previous iteration. By contrast, lists are mutable, and adding to them with list.append is relatively inexpensive, in both memory and computation.
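To make the comparison concrete, here is a sketch of both approaches side by side, using a short list of already-translated sample words. Both produce the same text; the difference lies in how many intermediate strings are created along the way. The variable names are illustrative.

```python
words = ['histay', 'isway', 'away']    # sample, already-translated words

# Approach 1: repeated += (each step produces a new, longer string)
output_str = ''
for word in words:
    output_str += word + ' '
output_str = output_str.strip()        # remove the trailing space

# Approach 2: collect the pieces in a list, then join once at the end
output_list = []
for word in words:
    output_list.append(word)
joined = ' '.join(output_list)

print(output_str == joined)            # True
print(joined)                          # histay isway away
```

Notice that the join version also avoids the trailing-space cleanup, since str.join puts the separator only between elements.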

Solution

def pl_sentence(sentence):
    output = []
    for word in sentence.split():
        if word[0] in 'aeiou':
            output.append(f'{word}way')
        else:
            output.append(f'{word[1:]}{word[0]}ay')
 
    return ' '.join(output)
 
print(pl_sentence('this is a test'))

You can work through a version of this code in the Python Tutor at http://mng.bz/yydE.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

Splitting, joining, and manipulating strings are common actions in Python. Here are some additional activities you can try to push yourself even further:

  • Take a text file, creating (and printing) a nonsensical sentence from the nth word on each of the first 10 lines, where n is the line number.

  • Write a function that transposes a list of strings, in which each string contains multiple words separated by whitespace. Specifically, it should perform in such a way that if you were to pass the list ['abc def ghi', 'jkl mno pqr', 'stu vwx yz'] to the function, it would return ['abc jkl stu', 'def mno vwx', 'ghi pqr yz'].

  • Read through an Apache logfile. If there is a 404 error--you can just search for ' 404 ', if you want--display the IP address, which should be the first element.

Exercise 7 Ubbi Dubbi

When they hear that Python’s strings are immutable, many people wonder how the language can be used for text processing. After all, if you can’t modify strings, then how can you do any serious work with them?

Moreover, there are times when a simple for loop, as we used with the Pig Latin examples, won’t work. If we’re modifying each word only once, then that’s fine, but if we’re potentially modifying it several times, we have to make sure that each modification won’t affect future modifications.

This exercise is meant to help you practice thinking in this way. Here, you’ll implement a translator from English into another secret children’s language, Ubbi Dubbi (http://mng.bz/90zl). (This was popularized on the wonderful American children’s program Zoom, which was on television when I was growing up.) The rules of Ubbi Dubbi are even simpler than those of Pig Latin, although programming a translator is more complex and requires a bit more thinking.

In Ubbi Dubbi, every vowel (a, e, i, o, or u) is prefaced with ub. Thus milk becomes mubilk (m-ub-ilk) and program becomes prubogrubam (prub-ogrub-am). In theory, you only put an ub before every vowel sound, rather than before each vowel. Given that this is a book about Python and not linguistics, I hope that you’ll forgive this slight difference in definition.

Ubbi Dubbi is enormously fun to speak, and it’s somewhat magical if and when you can begin to understand someone else speaking it. Even if you don’t understand it, Ubbi Dubbi sounds extremely funny. See some YouTube videos on the subject, such as http://mng.bz/aRMY, if you need convincing.

For this exercise, you’ll write a function (called ubbi_dubbi) that takes a single word (string) as an argument. It returns a string, the word’s translation into Ubbi Dubbi. So if the function is called with octopus, the function will return the string uboctubopubus. And if the user passes the argument elephant, you’ll output ubelubephubant.

As with the original Pig Latin translator, you can ignore capital letters, punctuation, and corner cases, such as multiple vowels combining to create a new sound. When you do have two vowels next to one another, preface each of them with ub. Thus, soap will become suboubap, despite the fact that oa combines to a single vowel sound.

Much like the “Pig Latin sentence” exercise, this brings to the forefront the various ways we often need to scan through strings for particular patterns, or translate from one Python data structure or pattern to another, and how iterations can play a central role in doing so.

Working it out

The task here is to ask the user for a word, and then to translate that word into Ubbi Dubbi. This is a slightly different task than we had with Pig Latin, because we need to operate on a letter-by-letter basis. We can’t simply analyze the word and produce output based on the entire word. Moreover, we have to avoid getting ourselves into an infinite loop, in which we try to add ub before the u in ub.

The solution is to iterate over each character in word, adding it to a list, output. If the current character is a vowel, then we add ub before the letter. Otherwise, we just add the letter. At the end of the program, we join and then print the letters together. This time, we don’t join the letters together with a space character (' '), but rather with an empty string (''). This means that the resulting string will consist of the letters joined together with nothing between them--or, as we often call such collections, a word.

Solution

def ubbi_dubbi(word):
    output = []
    for letter in word:
        if letter in 'aeiou':
            output.append(f'ub{letter}')    
        else:
            output.append(letter)
 
    return ''.join(output)
 
print(ubbi_dubbi('python'))

Why append to a list, and not to a string? To avoid allocating too much memory. For short strings, it’s not a big deal. But for long loops and large strings, it’s a bad idea.

You can work through this code in the Python Tutor at http://mng.bz/eQJZ.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

It’s common to want to replace one value with another in strings. Python has a few different ways to do this. You can use str.replace (http://mng.bz/WPe0) or str.translate (http://mng.bz/8pyP), two string methods that translate strings and sets of characters, respectively. But sometimes, there’s no choice but to iterate over a string, look for the pattern we want, and then append the modified version to a list that we grow over time:

  • Handle capitalized words--If a word is capitalized (i.e., the first letter is capitalized, but the rest of the word isn’t), then the Ubbi Dubbi translation should be similarly capitalized.

  • Remove author names--In academia, it’s common to remove the authors’ names from a paper submitted for peer review. Given a string containing an article and a separate list of strings containing authors’ names, replace all names in the article with _ characters.

  • URL-encode characters--In URLs, we often replace special and nonprintable characters with a % followed by the character’s ASCII value in hexadecimal. For example, if a URL is to include a space character (ASCII 32, aka 0x20), we replace it with %20. Given a string, URL-encode any character that isn’t a letter or number. For the purposes of this exercise, we’ll assume that all characters are indeed in ASCII (i.e., one byte long), and not multibyte UTF-8 characters. It might help to know about the ord (http://mng.bz/EdnJ) and hex (http://mng.bz/nPxg) functions.
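Before tackling these variations, it may help to see the built-in tools named in this section in action. The following is a quick sketch with made-up sample text; it shows str.replace, str.translate (with a table built by str.maketrans), and the ord and hex functions, without solving any of the exercises.

```python
text = 'hello world'

# str.replace swaps every occurrence of one substring for another
replaced = text.replace('o', '0')
print(replaced)                        # hell0 w0rld

# str.translate maps individual characters, using a table from str.maketrans
table = str.maketrans('el', '31')      # e -> 3, l -> 1
translated = text.translate(table)
print(translated)                      # h311o wor1d

# ord gives a character's code point; hex shows an integer in hexadecimal
space_code = ord(' ')
print(space_code, hex(space_code))     # 32 0x20
```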

Exercise 8 Sorting a string

If strings are immutable, then does this mean we’re stuck with them forever, precisely as they are? Kind of--we can’t change the strings themselves, but we can create new strings based on them, using a combination of built-in functions and string methods. Knowing how to work around strings’ immutability and piece together functionality that effectively changes strings, even though they’re immutable, is a useful skill to have.

In this exercise, you’ll explore this idea by writing a function, strsort, that takes a single string as its input and returns a string. The returned string should contain the same characters as the input, except that its characters should be sorted in order, from the lowest Unicode value to the highest Unicode value. For example, the result of invoking strsort('cba') will be the string abc.

Working it out

The solution’s implementation of strsort takes advantage of the fact that Python strings are sequences. Normally, we think of this as relevant in a for loop, in that we can iterate over the characters in a string. However, we don’t need to restrict ourselves to such situations.

For example, we can use the built-in sorted (http://mng.bz/pBEG) function, which takes an iterable--which means not only a sequence, but anything over which we can iterate, such as a set or a file--and returns its elements in sorted order. Invoking sorted on our string will thus do the job, in that it will sort the characters in Unicode order. However, it returns a list, rather than a string.

To turn our list into a string, we use the str.join method (http://mng.bz/gyYl). We use an empty string ('') as the glue we’ll use to join the elements, thus returning a new string whose characters are the same as the input string, but in sorted order.
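The two steps can be tried separately, as in this short sketch. The sample strings are arbitrary; the second one is there to show what “Unicode order” means in practice, since uppercase ASCII letters have lower code points than lowercase ones.

```python
letters = sorted('cba')
print(letters)                 # ['a', 'b', 'c']
print(''.join(letters))        # abc

# Unicode order places uppercase ASCII letters before lowercase ones
mixed = ''.join(sorted('bCa'))
print(mixed)                   # Cab
```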

Unicode

What is Unicode? The idea is a simple one, but the implementation can be extremely difficult and is confusing to many developers.

The idea behind Unicode is that we should be able to use computers to represent any character used in any language from any time. This is a very important goal, in that it means we won’t have problems creating documents in which we want to show Russian, Chinese, and English on the same page. Before Unicode, mixing character sets from a number of languages was difficult or impossible.

Unicode assigns each character a unique number. But those numbers can (as you might imagine) get very big. Thus, we have to take the Unicode character number (known as a code point) and translate it into a format that can be stored and transmitted as bytes. Python and many other languages use what’s known as UTF-8, which is a variable-length encoding, meaning that different characters might require different numbers of bytes. Characters that exist in ASCII are encoded into UTF-8 with the same number they use in ASCII, in one byte. French, Spanish, Hebrew, Arabic, Greek, and Russian all use two bytes for their non-ASCII characters. And Chinese, as well as your children’s emojis, are three bytes or more.

How much does this affect us? Both a lot and a little. On the one hand, it’s convenient to be able to work with different languages so easily. On the other hand, it’s easy to forget that there’s a difference between bytes and characters, and that you sometimes (e.g., when working with files on disk) need to translate from bytes to characters, or vice versa.
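The difference between characters and bytes is easy to check for yourself. In this sketch, the sample characters are my own picks (written as escape sequences so that the code doesn’t depend on your editor’s encoding): a Latin letter, an accented e, a Cyrillic letter, a Chinese character, and an emoji.

```python
samples = ['a', '\u00e9', '\u044f', '\u4e2d', '\U0001F600']

for one_char in samples:
    encoded = one_char.encode('utf-8')     # str -> bytes
    print(f'U+{ord(one_char):04X}: {len(one_char)} character, '
          f'{len(encoded)} byte(s)')

# Decoding goes the other way, from bytes back to str
data = 'caf\u00e9'.encode('utf-8')
print(len(data))                             # 5 (bytes, for 4 characters)
print(data.decode('utf-8') == 'caf\u00e9')   # True
```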

For further details about characters versus strings, and the way Python stores characters in our strings, I recommend this talk by Ned Batchelder, from PyCon 2012: http://mng.bz/NKdD.

Solution

def strsort(a_string):
    return ''.join(sorted(a_string))
 
print(strsort('cbjeaf'))

You can work through this code in the Python Tutor at http://mng.bz/pBd0.

Screencast solution

Watch this short video walkthrough of the solution: https://livebook.manning.com/video/python-workout.

Beyond the exercise

This exercise is designed to give you additional reminders that strings are sequences and can thus be put wherever other sequences (lists and tuples) can be used. We don’t often think in terms of sorting a string, but there’s no difference between running sorted on a string, a list, or a tuple. The elements (in the case of a string, the characters) are returned in sorted order.

However, sorted (http://mng.bz/pBEG) returns a list, and we wanted to get a string. We thus needed to turn the resulting list back into a string--something that str.join is designed to do. str.split (http://mng.bz/aR4z) and str.join (http://mng.bz/gyYl) are two methods with which you should become intimately familiar because they’re so useful and help in so many cases.

Consider a few other variations of, and extensions to, this exercise, which also use str.split and str.join, as well as sorted:

  • Given the string “Tom Dick Harry,” break it into individual words, and then sort those words alphabetically. Once they’re sorted, print them with commas (,) between the names.

  • Which is the last word, alphabetically, in a text file?

  • Which is the longest word in a text file?

Note that for the second and third challenges, you may well want to read up on the key parameter and the types of values you can pass to it. A good introduction, with examples, is here: http://mng.bz/D28E.
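As a taste of what the key parameter does, here is a small sketch on a made-up list of words rather than on a file. Whatever function you pass as key is called on each element, and the results (not the elements themselves) are compared.

```python
words = ['pear', 'fig', 'banana', 'kiwi']    # hypothetical sample words

# key=len sorts by length instead of alphabetically
by_length = sorted(words, key=len)
print(by_length)                 # ['fig', 'pear', 'kiwi', 'banana']

# max accepts key as well
longest = max(words, key=len)
print(longest)                   # banana

# key=str.lower gives a case-insensitive sort
names = sorted(['bob', 'Alice', 'carol'], key=str.lower)
print(names)                     # ['Alice', 'bob', 'carol']
```

Since Python’s sort is stable, “pear” stays ahead of “kiwi” in the length-based result: they have the same length, so they keep their original relative order.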

Summary

Python programmers are constantly dealing with text. Whether it’s because we’re reading from files, displaying things on the screen, or just using dicts, strings are a data type with which we’re likely familiar from other languages.

At the same time, strings in Python are unusual, in that they’re also sequences--and thus, thinking in Python requires that you consider their sequence-like qualities. This means searching (using in), sorting (using sorted), and using slices. It also means thinking about how you can turn strings into lists (using str.split) and turn sequences back into strings (using str.join). While these might seem like simple tasks, they crop up on a regular basis in production Python code. The fact that these data structures and methods are written in C, and have been around for many years, means they’re also highly efficient--and not worth reinventing.
