One of the most common and important functions of the Python language is to process and manipulate large amounts of text when implementing scripts, parsing XML/HTML, and interfacing with databases. For that reason, Python includes extremely dynamic and powerful string manipulation methods.
The phrases in this chapter are intended to give you a quick start into manipulating strings using the Python language. Although this chapter is not comprehensive, it tries to cover both the most commonly used functionality such as string comparisons, searching, and formatting, as well as some of the more powerful and dynamic functionality such as using strings as executable code, interpolating variables in strings, and evaluating strings as Python expressions.
Comparing strings in Python is best accomplished using a simple comparison operator. For example, to determine whether a string matches another string exactly, you would use the equality operator, ==. You can also use other comparison operators such as >= or < to determine a sort order for several strings.
Python provides several methods for string objects that help when comparing. The most commonly used are the upper() and lower() methods, which return a new string that is all uppercase or all lowercase, respectively.
Another useful method is capitalize(), which returns a new string with the first letter capitalized and the remaining letters lowercased. There is also a swapcase() method that returns a new string with the case of each character inverted.
cmpStr = "abc"
upperStr = "ABC"
lowerStr = "abc"

print "Case Sensitive Compare"
if cmpStr == lowerStr:
    print lowerStr + " Matches " + cmpStr
if cmpStr == upperStr:
    print upperStr + " Matches " + cmpStr

print "\nCase In-Sensitive Compare"
if cmpStr.upper() == lowerStr.upper():
    print lowerStr + " Matches " + cmpStr
if cmpStr.upper() == upperStr.upper():
    print upperStr + " Matches " + cmpStr
comp_str.py
Case Sensitive Compare
abc Matches abc

Case In-Sensitive Compare
abc Matches abc
ABC Matches abc
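The comparison operators and case methods described above can also be sketched in a short, self-contained form. The sample names and title below are hypothetical and are not part of comp_str.py, and the sketch uses print() calls so that it should also run on Python 3, unlike the chapter's Python 2 listings.

```python
# Hypothetical sample data, not taken from comp_str.py.
names = ["banana", "apple", "Cherry"]

# Ordering operators compare character codes, so uppercase sorts first.
upper_first = "Cherry" < "apple"

# Default sort is case sensitive.
case_sensitive = sorted(names)

# Normalizing each item with lower() gives a case-insensitive sort.
case_insensitive = sorted(names, key=lambda s: s.lower())

title = "python phrasebook"
capitalized = title.capitalize()  # first letter upper, the rest lower
swapped = capitalized.swapcase()  # case of every letter inverted

print(upper_first)
print(case_sensitive)
print(case_insensitive)
print(capitalized)
print(swapped)
```

Normalizing both sides with lower() or upper() before comparing is the same idea the listing uses for its case-insensitive match.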
Strings can be joined together using a simple add operation, by formatting the strings together, or by using the join() method. Using either the + or += operation is the simplest method to implement and start off with. The two strings are simply appended to each other.
Formatting strings together is accomplished by defining a new string with string format codes, %s, and then supplying additional strings as parameters to fill in each string format code. This can be extremely useful, especially when the strings need to be joined in a complex format.
The fastest way to join a list of strings is to call the join(wordList) method on a separator string. The items in the list are concatenated in order, with the separator string inserted between each adjacent pair. The join method can be a little tricky at first because it is called on the separator rather than on the list. The separator is placed only between items, never before the first item or after the last. This becomes extremely useful if you want to add spaces between the words in the list because you simply define a string as a single space and then call the join method from that string:
word1 = "A"
word2 = "few"
word3 = "good"
word4 = "words"
wordList = ["A", "few", "more", "good", "words"]

#Simple join
print "Words:" + word1 + word2 + word3 + word4
print "List: " + ' '.join(wordList)

#Formatted string
sentence = ("First: %s %s %s %s." %
    (word1,word2,word3,word4))
print sentence

#Joining a list of words
sentence = "Second:"
for word in wordList:
    sentence += " " + word
sentence += "."
print sentence
join_str.py
Words:Afewgoodwords
List: A few more good words
First: A few good words.
Second: A few more good words.
Output from join_str.py code
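As a small illustration of where join places the separator, the following sketch joins the same kind of word list with different separators. The variable names are chosen for illustration only, and print() calls are used so the sketch should also run on Python 3.

```python
words = ["A", "few", "more", "good", "words"]

# The separator string goes between items only.
spaced = " ".join(words)
comma_separated = ",".join(words)

# A single-item list comes back without any separator at all.
single = "-".join(["alone"])

# An empty separator simply concatenates the items.
packed = "".join(words)

print(spaced)
print(comma_separated)
print(single)
print(packed)
```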
The split(separator) and splitlines(keepends) methods are provided by Python to split strings into substrings. The split method searches a string, splits it on each occurrence of the separator string, and returns the pieces as a list of strings. If no separator is specified, the split method splits the string on each run of whitespace characters (space, tab, newline, and so on).
The splitlines method splits the string at each line boundary into a list of strings. This can be extremely useful when you are parsing a large amount of text. The splitlines method accepts one optional Boolean argument that determines whether the newline characters are kept at the end of each line.
sentence = "A Simple Sentence."
paragraph = "This is a simple paragraph.\n\
It is made up of multiple\n\
lines of text."
entry = "Name:Brad Dayley:Occupation:Software Engineer"

print sentence.split()
print entry.split(':')
print paragraph.splitlines(1)
split_str.py
['A', 'Simple', 'Sentence.']
['Name', 'Brad Dayley', 'Occupation', 'Software Engineer']
['This is a simple paragraph.\n', 'It is made up of multiple\n', 'lines of text.']
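The difference between splitting with no separator and splitting on an explicit separator is easy to miss, so here is a brief sketch. The sample strings are made up for illustration, and print() calls are used so the sketch should also run on Python 3.

```python
text = "one  two\tthree\nfour"  # two spaces, a tab, and a newline

# No separator: any run of whitespace counts as one split point.
on_whitespace = text.split()

# Explicit single space: every space splits, so the double space
# yields an empty string, and the tab and newline are left alone.
on_space = text.split(" ")

lines = "first\nsecond\n"
without_ends = lines.splitlines()   # newline characters dropped
with_ends = lines.splitlines(True)  # newline characters kept

print(on_whitespace)
print(on_space)
print(without_ends)
print(with_ends)
```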
Example .
print searchStr.find("Red")
print searchStr.rfind("Blue")
print searchStr.index("Blue")
print searchStr.index("Blue",8)
The two most common ways to search for a substring contained inside another string are the find(sub[, start[, end]]) and index(sub[, start[, end]]) methods.
Both methods return the index of the first occurrence of the substring and differ only in how they report failure. If the index method does not find the substring, a ValueError exception is raised. If the find method fails to find the substring, -1 is returned. The find and index methods accept a search string as the first argument. The area of the string that is searched can be limited by specifying the optional start and/or end index. Only characters within those indexes will be searched.
Python also provides the rfind and rindex methods. These methods work in a similar manner to the find and index methods; however, they look for the right-most occurrence of the substring.
searchStr = "Red Blue Violet Green Blue Yellow Black"

print searchStr.find("Red")
print searchStr.rfind("Blue")
print searchStr.find("Blue")
print searchStr.find("Teal")
print searchStr.index("Blue")
print searchStr.index("Blue",20)
print searchStr.rindex("Blue")
print searchStr.rindex("Blue",1,18)
search_str.py
0
22
4
-1
4
22
22
4
Output from search_str.py code
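The listing does not show what happens when index fails, so the following sketch contrasts the two failure modes. The locate helper is hypothetical and exists only for this illustration; print() calls are used so the sketch should also run on Python 3.

```python
colors = "Red Blue Violet Green Blue"

def locate(text, sub):
    """Hypothetical helper: like index(), but returns None on failure."""
    try:
        return text.index(sub)
    except ValueError:
        # index() raises ValueError when the substring is absent.
        return None

missing_find = colors.find("Teal")       # find() signals failure with -1
missing_index = locate(colors, "Teal")   # index() raised, helper gave None
first_blue = locate(colors, "Blue")
second_blue = colors.find("Blue", 5)     # start searching at index 5
limited = colors.find("Blue", 5, 20)     # no "Blue" inside this range

# Because a match at position 0 is falsy and -1 is truthy, testing the
# result of find() directly is misleading; the in operator is clearer.
has_red = "Red" in colors

print(missing_find)
print(missing_index)
print(first_blue)
print(second_blue)
print(limited)
print(has_red)
```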
Example .
question2 = question.replace("swallow", "European swallow")
question3 = question.replace("swallow", "African swallow")
The native string type in Python provides a replace(old, new[, maxreplace]) method to replace a specific substring with new text. The replace method accepts a search string as the first argument and a replacement string as the second argument. It returns a new string in which each occurrence of the search string is replaced with the new string; the original string is unchanged. Optionally, you can specify a maximum number of replacements to perform as the third argument.
question = "What is the air speed velocity of an unladen swallow?"
print question
question2 = question.replace("swallow", "European swallow")
print question2
question3 = question.replace("swallow", "African swallow")
print question3
replace_str.py
What is the air speed velocity of an unladen swallow?
What is the air speed velocity of an unladen European swallow?
What is the air speed velocity of an unladen African swallow?
Output from replace_str.py code
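The optional third argument is not exercised in replace_str.py, so here is a minimal sketch of it. The sample text is made up for illustration, and print() calls are used so the sketch should also run on Python 3.

```python
text = "spam spam spam"

# With no limit, every occurrence is replaced.
all_replaced = text.replace("spam", "eggs")

# The third argument caps the number of replacements, left to right.
first_only = text.replace("spam", "eggs", 1)

# replace() returns a new string; the original is left untouched.
unchanged = text

print(all_replaced)
print(first_only)
print(unchanged)
```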
Example .
if f.endswith('.py'):
    print "Python file: " + f
elif f.endswith('.txt'):
    print "Text file: " + f
The endswith(suffix[, start[, end]]) and startswith(prefix[, start[, end]]) methods provide a simple and safe way to determine whether a string ends or begins with a specific suffix or prefix, respectively. The first argument is the string to compare against the end or start of the string. The endswith and startswith methods are flexible enough for you to limit the search to a specific range of the string using the start and/or end arguments.
The endswith and startswith methods are extremely useful when parsing file lists for extensions or filenames.
import os

#The backslash is doubled so that \t is not read as a tab
for f in os.listdir('C:\\txtfiles'):
    if f.endswith('.py'):
        print "Python file: " + f
    elif f.endswith('.txt'):
        print "Text file: " + f
end_str.py
Python file: comp_str.py
Python file: end_str.py
Python file: eval_str.py
Python file: join_str.py
Text file: output.txt
Python file: replace_str.py
Python file: search_str.py
Python file: split_str.py
Python file: trim_str.py
Python file: unicode_str.py
Python file: var_str.py
Output from end_str.py code
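The start and end arguments are described above but not used in end_str.py. The sketch below shows them on a hypothetical filename, with no directory access so that it runs anywhere; print() calls are used so it should also run on Python 3.

```python
filename = "report_2024.txt"  # hypothetical name, for illustration only

is_report = filename.startswith("report")
is_text = filename.endswith(".txt")

# start=7 skips the 7-character "report_" prefix before comparing.
year_at_7 = filename.startswith("2024", 7)

# end=11 makes the comparison treat "report_2024" as the whole string.
ends_with_year = filename.endswith("2024", 0, 11)

# Without the range, the suffix test looks at the real end of the string.
ends_with_year_unbounded = filename.endswith("2024")

print(is_report)
print(is_text)
print(year_at_7)
print(ends_with_year)
print(ends_with_year_unbounded)
```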
Example .
str(len(badSentence.rstrip(' ')))
print badSentence.lstrip('\t')
print badParagraph.strip((' ?!\t'))
A common problem when parsing text is leftover characters at the beginning or end of the string. Python provides several strip methods to remove those characters. The strip([chrs]), lstrip([chrs]), and rstrip([chrs]) methods accept a string of characters as the only argument and return a new string with any of those characters trimmed from both ends, the start, or the end of the string, respectively. If the argument is omitted, whitespace is trimmed.
The strip method removes the specified characters from both the beginning and end of the string. The lstrip and rstrip methods remove the characters only from the beginning or end of the string, respectively.
import string

badSentence = "\t\tThis sentence has problems.   "
badParagraph = "\t\tThis paragraph has even more \
problems.!?   "

#Strip trailing spaces
print "Length = " + str(len(badSentence))
print "Without trailing spaces = " + \
    str(len(badSentence.rstrip(' ')))

#Strip tabs
print "\nBad: " + badSentence
print "Fixed: " + badSentence.lstrip('\t')

#Strip leading and trailing characters
print "\nBad: " + badParagraph
print "Fixed: " + badParagraph.strip((' ?!\t'))
trim_str.py
Length = 32
Without trailing spaces = 29

Bad:            This sentence has problems.
Fixed: This sentence has problems.

Bad:            This paragraph has even more problems.!?
Fixed: This paragraph has even more problems.
Output from trim_str.py code
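One detail worth illustrating is that the argument to the strip methods is treated as a set of characters rather than as a literal prefix or suffix. The sample strings below are made up for illustration, and print() calls are used so the sketch should also run on Python 3.

```python
raw = "  ??hello world!!  "

default_strip = raw.strip()       # no argument: whitespace only
custom_strip = raw.strip(" ?!")   # any mix of space, ?, and !

left_only = "xxabcxx".lstrip("x")
right_only = "xxabcxx".rstrip("x")

# The characters are a set: order and repetition do not matter, and
# trimming stops at the first character that is not in the set.
domain = "www.example.com".strip("cmowz.")

print(default_strip)
print(custom_strip)
print(left_only)
print(right_only)
print(domain)
```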
Example .
print "Chapter " + str(x) + str(chapters[x]).rjust(15,'.')
print "\nHex String: " + hexStr.upper().ljust(8,'0')
print "Chapter %d %15s" % (x,str(chapters[x]))
One of the biggest advantages of the Python language is its capability to process and manipulate strings quickly and effectively. The native string type implements the rjust(width[, fill]) and ljust(width[, fill]) methods to quickly justify the text in a string to the right or left, respectively, within a field of a specific width. The optional fill argument to the rjust and ljust methods fills the space created by the justification with the specified character; the default is a space.
Another extremely useful part of Python’s string management is the capability to create complex string formatting on the fly by creating a format string and passing arguments to that string using the % operator. This results in a new formatted string that can be used in a string assignment, passed as an argument, or used in a print statement.
chapters = {1:5, 2:46, 3:52, 4:87, 5:90}
hexStr = "3f8"

#Right justify
print "Hex String: " + hexStr.upper().rjust(8,'0')
print
for x in chapters:
    print "Chapter " + str(x) + \
        str(chapters[x]).rjust(15,'.')

#Left justify
print "\nHex String: " + hexStr.upper().ljust(8,'0')

#String format
print
for x in chapters:
    print "Chapter %d %15s" % (x,str(chapters[x]))
format_str.py
Hex String: 000003F8

Chapter 1..............5
Chapter 2.............46
Chapter 3.............52
Chapter 4.............87
Chapter 5.............90

Hex String: 3F800000

Chapter 1               5
Chapter 2              46
Chapter 3              52
Chapter 4              87
Chapter 5              90
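For a closer look at the padding rules, the following sketch applies the justification methods and the % operator to short hypothetical values. It uses print() calls so that it should also run on Python 3.

```python
padded_spaces = "42".rjust(6)        # default fill is a space
padded_zeros = "42".rjust(6, "0")
padded_dots = "42".ljust(6, ".")

# A string that is already wider than the field is returned unchanged.
too_long = "abcdefgh".rjust(3)

# In a format string, a width after % right-justifies the value and a
# leading minus sign left-justifies it.
row = "%-10s|%5d" % ("item", 7)

print(repr(padded_spaces))
print(padded_zeros)
print(padded_dots)
print(too_long)
print(row)
```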
One of the most dynamic features of Python is the capability to evaluate a string that contains code and execute the code locally. The exec(str [,globals [,locals]]) function executes the Python code contained in the str string. It does not return a value; any results are produced through side effects such as printed output or variable assignments. Local and global variables can be added to the environment used to execute the code by specifying global and/or local dictionaries containing the corresponding variable names and values.
The eval(str [,globals [,locals]]) function works in a similar manner to the exec function except that it only evaluates the string as a single Python expression and returns the result.
cards = ['Ace', 'King', 'Queen', 'Jack']
#Single quotes inside the string avoid ending it early
codeStr = "for card in cards: print 'Card = ' + card"
areaStr = "pi*(radius*radius)"

#Execute string
exec(codeStr)

#Evaluate string
print "\nArea = " + str(eval(areaStr,
    {"pi":3.14}, {"radius":5}))
eval_str.py
Card = Ace
Card = King
Card = Queen
Card = Jack

Area = 78.5
Output from eval_str.py code
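The role of the globals and locals dictionaries can be sketched as follows. The namespace dictionary and the code strings are hypothetical examples, and the sketch avoids print statements inside the executed string so that it should run on both Python 2 and Python 3. Strings passed to exec or eval run with the full power of the interpreter, so they should never come from untrusted input.

```python
# eval() returns the value of a single expression. Names are looked up
# in the locals dictionary first, then in the globals dictionary.
area = eval("pi * radius ** 2", {"pi": 3.14}, {"radius": 5})

# exec() runs statements and returns nothing, so results are read back
# from the dictionary that was supplied as its namespace.
namespace = {"cards": ["Ace", "King"]}
exec("labels = ['Card = ' + c for c in cards]", namespace)
labels = namespace["labels"]

print(area)
print(labels)
```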
Python provides the capability to interpolate variables inside strings. This functionality provides the ability to create string templates and then apply variable values to them based on the state of an existing variable.
Interpolating variables is accomplished in two steps. The first step is to create a string template by passing a string to the Template(string) constructor from the string module. The string includes the formatted text and properly placed variable names preceded by the $ character.
To include a $ character in your template string, use a double $$. The $$ will be replaced with a single $ when the template is applied.
Once the template has been created, the second step is to apply variable values to the template using the substitute(m[, kwargs]) method of the Template class. Values can be supplied as a dictionary passed as m, as keyword arguments such as v=x, or both.
import string

values = [5, 3, 'blue', 'red']
s = string.Template("Variable v = $v")
for x in values:
    print s.substitute(v=x)
var_str.py
Variable v = 5
Variable v = 3
Variable v = blue
Variable v = red
Output from var_str.py code
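The dictionary form of substitute and the $$ escape are not shown in var_str.py, so here is a brief sketch of both. The template text and the names in it are made up for illustration, and print() calls are used so the sketch should also run on Python 3.

```python
from string import Template

# "$$" produces a literal dollar sign; "$name" and "$amount" are
# placeholders (hypothetical names chosen for this illustration).
bill = Template("$name owes $$$amount")

by_keyword = bill.substitute(name="Ann", amount=5)
by_dict = bill.substitute({"name": "Bob", "amount": 7})

# substitute() raises KeyError when a placeholder has no value.
try:
    bill.substitute(name="Ann")
    missing_detected = False
except KeyError:
    missing_detected = True

print(by_keyword)
print(by_dict)
print(missing_detected)
```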
Example .
print uniStr.encode('utf-8')
print uniStr.encode('utf-16')
print uniStr.encode('iso-8859-1')
asciiStr = asciiStr.translate(
    string.maketrans('\xF1','n'), '')
print asciiStr.encode('ascii')
The Python language provides a simple encode(encoding) method to convert unicode strings to a local (byte) string for easier processing. The encode method takes the name of an encoding such as utf-8, utf-16, iso-8859-1, or ascii as its argument and returns a string encoded in that format.
Strings can be converted to unicode in several different ways. One is to define the string as unicode by prefixing it with a u when assigning it to a variable. Another is to combine a unicode string with another string. The resulting string will be unicode. You can also use the decode(encoding) method to decode the string. The decode method returns a unicode form of the string.
The ASCII encoding allows only for character codes 0 through 127. If your string includes characters that are above that range, you will need to translate those characters before encoding the string to ASCII.
import string

locStr = "El "
uniStr = u"Ni\u00F1o"

print uniStr.encode('utf-8')
print uniStr.encode('utf-16')
print uniStr.encode('iso-8859-1')

#Combine local and unicode results
#in new unicode string
newStr = locStr+uniStr
print newStr.encode('iso-8859-1')

#ascii will error because character '\xF1'
#is out of range
asciiStr = newStr.encode('iso-8859-1')
asciiStr = asciiStr.translate(
    string.maketrans('\xF1','n'), '')
print asciiStr.encode('ascii')
print newStr.encode('ascii')
unicode_str.py
Niño
ÿþN|I|ñ|o
Niño
El Niño
El Nino
Traceback (most recent call last):
  File "C:ookspythonCH2codeunicode_str.py", line 19, in ?
    print newStr.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 5: ordinal not in range(128)
Output from unicode_str.py code
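The encode and decode round trip, and one way to avoid the ASCII error shown in the traceback, can be sketched as follows. This is an illustration under the assumption that replacing unencodable characters with a question mark is acceptable; the "replace" error handler is a standard option of encode, but it is not used in unicode_str.py. The sketch is written so that it should run on both Python 2 and Python 3.

```python
text = u"Ni\u00F1o"

# The same character becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")         # n-tilde takes two bytes
latin1_bytes = text.encode("iso-8859-1")  # n-tilde takes one byte

# decode() reverses encode() when the same encoding is named.
round_trip = utf8_bytes.decode("utf-8")

# ASCII cannot represent code points above 127, so encoding fails...
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ...unless an error handler substitutes a placeholder character.
ascii_replaced = text.encode("ascii", "replace")

print(len(utf8_bytes))
print(len(latin1_bytes))
print(round_trip == text)
print(ascii_failed)
print(ascii_replaced)
```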