As with any well-developed scripting language, Python is well equipped to manage and manipulate files directly. Python includes several built-in functions, as well as additional modules, to help manage files. These functions and modules provide the versatility and power to handle file parsing, data storage and retrieval, filesystem management, and archive management.
It’s not possible to adequately address all the file management features of Python in this book; however, this chapter provides the most common phrases to create and use files, manage files on a filesystem, and archive files for storage or distribution.
To use most of the built-in file functions in Python, you first open the file, perform whatever file operations are necessary, and then close it. Python uses the simple open(path [,mode [,buffersize]]) call to open files for both reading and writing. The path is a path string pointing to the file. The mode determines what mode the file will be opened in, as shown in Table 4.1.
Table 4.1. File Modes for Python’s Built-In File Functions
Mode | Description |
---|---|
r | Opens an existing file for reading. |
w | Opens a file for writing. If the file already exists, the contents are deleted. If the file does not already exist, a new one is created. |
a | Opens a file for appending, keeping the existing contents intact. If the file does not already exist, a new one is created. |
r+ | Opens an existing file for both reading and writing. The existing contents are kept intact. |
w+ | Opens a file for both writing and reading. The existing contents are deleted. |
a+ | Opens a file for both reading and appending. The existing contents are kept intact, and new data is written at the end of the file. |
b | Is applied in addition to one of the read, write, or append modes. Opens the file in binary mode. |
U | Is applied in addition to the read mode (for example, rU). Applies the “universal” newline translator to the file as it is opened. |
The optional buffersize argument specifies which buffering mode should be used when accessing the file. 0 indicates that the file should be unbuffered, 1 indicates line-buffering, and any other positive number indicates a specific buffer size to be used when accessing the file. Buffering the file improves performance because part of the file is cached in computer memory. Omitting this argument or specifying a negative number causes the system default buffer size to be used.
After using the file, you should close it using the file object's close() method. This frees the system resources and keeps the file from being held open any longer than necessary.
Using the universal newline mode U is extremely useful if you need to deal with files created by applications that are not consistent in managing newline characters. The universal newline mode converts all the different variations (\r, \n, \r\n) to the standard \n character.
inPath = "input.txt"
outPath = "output.txt"

#Open a file for reading
#(open() raises IOError on failure instead of returning a false value)
try:
    file = open(inPath, 'rU')
    # read from file here (see Reading an Entire File
    # later in this chapter for more info)
    file.close()
except IOError:
    print "Error Opening File."

#Open a file for writing
try:
    file = open(outPath, 'wb')
    # write to file here (see Writing a File later
    # in this chapter for more info)
    file.close()
except IOError:
    print "Error Opening File."
open_file.py
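The listings in this chapter use Python 2 syntax. As a minimal sketch, assuming Python 3 instead, the same open, use, and close sequence can be written with a with statement, which closes the files even if an error occurs partway through. The helper name copy_text and the temporary demo files are hypothetical and not part of the chapter's code.

```python
import os
import tempfile


def copy_text(in_path, out_path):
    """Copy a text file; return True on success, False if a file cannot be opened."""
    try:
        # In Python 3, text mode with the default newline handling translates
        # \r, \r\n, and \n to \n on reading, much like the 'U' mode above.
        with open(in_path, "r") as src, open(out_path, "w") as dst:
            dst.write(src.read())
        return True
    except OSError as err:
        print("Error opening file: %s" % err)
        return False


# Demonstration in a throwaway directory (assumed file names).
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, "input.txt")
demo_out = os.path.join(demo_dir, "output.txt")
with open(demo_in, "wb") as f:
    f.write(b"Line 1\r\nLine 2\rLine 3\n")
print(copy_text(demo_in, demo_out))
```

Both files are closed when the with block exits, so no explicit close() call is needed.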
Example:
buffer += open(filePath, 'rU').read()
inList = open(filePath, 'rU').readlines()
while(1):
    bytes = file.read(5)
    if bytes:
        buffer += bytes
Python provides several methods to read the entire contents of a file. The first is to open the file and call the read() function. This reads the entire contents of the file until an EOF marker is encountered and returns the contents of the file as a string.
Another method to read an entire file is to use the readlines() function. This reads the entire contents of the file, separating each line into individual strings, until an EOF marker is encountered. Once the end of the file is found, a list of strings representing each line is returned.
In the case of very large files, you might want to read only a specific number of bytes at a time. Use the read(bytes) function to read a specific number of bytes at a time, which can then be processed more easily. This reads up to that number of bytes from the file and returns them as a string. If the end of the file has already been reached, an empty string is returned.
The code in read_file.py demonstrates how to read the entire contents at once, one line at a time, as well as a specific number of bytes from a file.
filePath = "input.txt"

#Read entire file into a buffer
buffer = "Read buffer: "
buffer += open(filePath, 'rU').read()
print buffer

#Read lines into a buffer
buffer = "Readline buffer: "
inList = open(filePath, 'rU').readlines()
print inList
for line in inList:
    buffer += line
print buffer

#Read bytes into a buffer
buffer = "Read buffer: "
file = open(filePath, 'rU')
while(1):
    bytes = file.read(5)
    if bytes:
        buffer += bytes
    else:
        break
print buffer
read_file.py
Read buffer: Line 1
Line 2
Line 3
Line 4
['Line 1\n', 'Line 2\n', 'Line 3\n', 'Line 4\n']
Readline buffer: Line 1
Line 2
Line 3
Line 4
Read buffer: Line 1
Line 2
Line 3
Line 4
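The fixed-size read loop can also be wrapped in a generator so that callers simply iterate over chunks. This is a sketch assuming Python 3; the name read_in_chunks is hypothetical, and an in-memory io.StringIO stands in for a real file.

```python
import io


def read_in_chunks(file_obj, chunk_size=5):
    """Yield successive chunks until read() returns an empty string (end of file)."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# An in-memory text stream stands in for an opened file.
sample = io.StringIO("Line 1\nLine 2\n")
chunks = list(read_in_chunks(sample, 5))
print(chunks)
```

Each chunk can be processed and discarded before the next one is read, which keeps memory use bounded for large files.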
Example:
print linecache.getline(filePath, 1)
print linecache.getline(filePath, 3)
linecache.clearcache()
The linecache module in Python is an extremely useful tool if you need to access specific lines in certain files multiple times. The linecache module caches the lines in a file in memory the first time they are read. Although this does not provide any advantage the first time the file is accessed, it does speed up consecutive accesses immensely.
The getline(filename, lineno) function of the linecache module accepts a filename and line number as its arguments. It reads the line from the file, caches it in memory for later use, and returns the line as a string. The clearcache() function of the linecache module frees up the cache memory by removing all lines that have been previously read.
import linecache

filePath = "input.txt"
print linecache.getline(filePath, 1)
print linecache.getline(filePath, 3)
linecache.clearcache()
Line 1
Line 3
Output from line_cache.py code
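As a small Python 3 sketch of the same idea, the hypothetical helper get_lines below fetches several lines by number. Line numbers are 1-based, and linecache.getline returns an empty string rather than raising an error when the requested line does not exist. The temporary file is an assumption made for the demonstration.

```python
import linecache
import os
import tempfile


def get_lines(path, line_numbers):
    """Return the requested 1-based lines; missing lines come back as ''."""
    return [linecache.getline(path, n) for n in line_numbers]


# Demonstration with a temporary three-line file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("Line 1\nLine 2\nLine 3\n")

result = get_lines(demo_path, [1, 3, 10])
print(result)

# Release the cached lines and remove the temporary file.
linecache.clearcache()
os.remove(demo_path)
```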
Example:
file = open(filePath, 'rU')
for line in file:
    for word in line.split():
        wordList.append(word)
A useful technique when processing files is to separate each word in the file and process the words one at a time. The words can be individually processed by opening the file, reading each line into a string, and then splitting the strings into words using the split() function.
The program read_words.py shows a simple example of reading a file and processing the words one at a time. The lines in the file are processed one at a time using a for loop. The split() function splits the line into a list of words based on whitespace because no other character was passed as the separator argument. Once the words are separated, they can be individually processed into lists, dictionaries, and so on.
filePath = "input.txt"
wordList = []
wordCount = 0

#Read lines into a list
file = open(filePath, 'rU')
for line in file:
    for word in line.split():
        wordList.append(word)
        wordCount += 1
print wordList
print "Total words = %d" % wordCount
['Line', '1', 'Line', '2', 'Line', '3', 'Line', '4']
Total words = 8
Output from read_words.py code
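One way to process the separated words into a dictionary, as mentioned above, is to tally them with collections.Counter. This is only a sketch, assuming Python 3; count_words is a hypothetical name, and an in-memory stream replaces input.txt.

```python
import io
from collections import Counter


def count_words(lines):
    """Tally whitespace-separated words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# An in-memory stream stands in for the opened input file.
sample = io.StringIO("Line 1\nLine 2\nLine 3\nLine 4\n")
word_counts = count_words(sample)
print(word_counts["Line"])
print("Total words = %d" % sum(word_counts.values()))
```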
Example:
file.writelines(wordList)
file.write(" Formatted text: ")
print >>file," %s Color Adjust" % word
Just as with reading the contents of a file, there are several ways to write data out to a file. The easiest, yet the most dynamic and powerful, is the write(string) function. The write function writes the string argument to the file at the current file pointer. Although the write function itself is relatively simple, the power of Python with regard to string manipulation makes the capabilities of the write function virtually limitless.
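Python provides the writelines(sequence) function to save time writing a list of data out to the file. The writelines function typically accepts a list of strings and writes those strings to the file. It does not add line separators between the strings, so include newline characters in the strings themselves if you want each one on its own line.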
Another option available in Python is to redirect the print statement out to a file using the >> redirection operator. This allows you to use the versatility of the Python print statement to format and write data out to a file.
wordList = ["Red", "Blue", "Green"]
filePath = "output.txt"

#Write a list to a file
#(universal newline mode U applies only to reading, so plain 'w' is used)
file = open(filePath, 'w')
file.writelines(wordList)

#Write a string to a file
file.write(" Formatted text: ")

#Print directly to a file
for word in wordList:
    print >>file," %s Color Adjust" % word
file.close()
write_file.py
RedBlueGreen Formatted text: Red Color Adjust
Blue Color Adjust
Green Color Adjust
Contents of output.txt file
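For comparison, here is a sketch of the same three writing techniques under the assumption of Python 3, where print is a function that takes a file argument instead of the >> operator. The helper write_report and the temporary path are hypothetical. The sketch adds a newline to each word because writelines does not insert separators.

```python
import os
import tempfile


def write_report(path, words):
    """Write a word list, a heading, and one formatted line per word."""
    with open(path, "w") as out:
        # writelines() adds no separators, so each string carries its own "\n".
        out.writelines(word + "\n" for word in words)
        out.write("Formatted text:\n")
        for word in words:
            # Python 3 form of redirecting print to a file.
            print("%s Color Adjust" % word, file=out)


# Demonstration with a temporary output file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
write_report(demo_path, ["Red", "Blue", "Green"])
with open(demo_path, "r") as f:
    report_text = f.read()
print(report_text)
os.remove(demo_path)
```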
Example:
lineCount = len(open(filePath, 'rU').readlines())
print "File %s has %d lines." % (filePath, lineCount)
When parsing files using Python, it’s useful to know exactly how many lines are contained in the file. The example in file_lines.py shows a simple method to determine the number of lines contained in a file by first opening it, then using readlines() to generate a list of lines, and finally using the len() function to determine the number of lines in the list.
For large files, using readlines() to generate a list of lines in a file might be impractical because of the amount of memory and processing time necessary.
filePath = "input.txt"
lineCount = len(open(filePath, 'rU').readlines())
print "File %s has %d lines." % (filePath, lineCount)
file_lines.py
File input.txt has 4 lines.
Output from file_lines.py code
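For the large-file case, one possible alternative is to iterate over the file object and count as you go, so the full list of lines is never built. The sketch below assumes Python 3; count_lines is a hypothetical helper, and the four-line temporary file mirrors the input.txt used in this chapter.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating, so only one line is held in memory at a time."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


# Demonstration with a temporary four-line file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("Line 1\nLine 2\nLine 3\nLine 4\n")
line_count = count_lines(demo_path)
print("File has %d lines." % line_count)
os.remove(demo_path)
```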
Python provides a powerful directory tree-walking function in the os module. The walk(path) function walks the directory tree and, for each directory in the tree, creates a three-tuple containing (1) the dirpath, (2) a list of dirnames, and (3) a list of filenames.
As the tuples are generated, they can be processed one at a time in a loop. For each tuple, you can access the path to the directory directly by using the 0 index into the tuple. Lists of the subdirectories and files contained in the directory can likewise be accessed using the 1 and 2 indexes, respectively.
The example in dir_tree.py shows how to use the os.walk(path) function to walk a directory tree and print out a formatted listing of the tree.
import os

path = "/books/python"

def printFiles(dirList, spaceCount):
    for file in dirList:
        print "/".rjust(spaceCount+1) + file

def printDirectory(dirEntry):
    print dirEntry[0] + "/"
    printFiles(dirEntry[2], len(dirEntry[0]))

tree = os.walk(path)
for directory in tree:
    printDirectory(directory)
dir_tree.py
/books/python/
             /Python Proposal.doc
             /Python_Phrasebook_TOC.doc
             /python_schedule.xls
             /template.doc
             /TOC_Notes.doc
/books/python\CH2/
                 /ch2.doc
/books/python\CH2\code/
                      /comp_str.py
                      /end_str.py
                      /eval_str.py
                      /format_str.py
                      /join_str.py
                      /output.txt
                      /replace_str.py
                      /search_str.py
                      /split_str.py
                      /trim_str.py
                      /unicode_str.py
                      /var_str.py
/books/python\CH3/
                 /ch3.doc
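The three-tuple can also be unpacked by name instead of indexed. The following is a sketch assuming Python 3; list_tree is a hypothetical helper, and the small temporary tree exists only for the demonstration.

```python
import os
import tempfile


def list_tree(root):
    """Return the full path of every file under root, directory by directory."""
    listing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place fixes the order in which os.walk descends.
        dirnames.sort()
        for name in sorted(filenames):
            listing.append(os.path.join(dirpath, name))
    return listing


# Build a tiny throwaway tree: a.txt and sub/b.txt.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.txt")):
    with open(os.path.join(demo_root, rel), "w") as f:
        f.write("x")

demo_listing = list_tree(demo_root)
for entry in demo_listing:
    print(entry)
```

Using os.path.join rather than concatenating "/" keeps the separators consistent with whatever the operating system uses.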
A common task when parsing files using Python is to either delete the file or at least rename it once the data has been processed. The easiest way to accomplish this is to use the os.remove(newFile) and os.rename(oldFile, newFile) functions in the os module.
The example in ren_file.py shows how to rename a file by first detecting whether the new filename already exists and then removing the existing file. Once the existing file has been removed, the rename function can be used to rename the file.
import os

oldFileName = "/books/python/CH4/code/output.txt"
newFileName = "/books/python/CH4/code/output.old"

#Old Listing
for file in os.listdir("/books/python/CH4/code/"):
    if file.startswith("output"):
        print file

#Remove file if the new name already exists
if os.access(newFileName, os.F_OK):
    print "Removing " + newFileName
    os.remove(newFileName)

#Rename the file
os.rename(oldFileName, newFileName)

#New Listing
for file in os.listdir("/books/python/CH4/code/"):
    if file.startswith("output"):
        print file
ren_file.py
output.old
output.txt
Removing /books/python/CH4/code/output.old
output.old
Output from ren_file.py code
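The check-remove-rename sequence can be collected into one small function. This is a sketch assuming Python 3; rename_replacing is a hypothetical name, and os.path.exists is used here as the existence test. The temporary files stand in for output.txt and output.old.

```python
import os
import tempfile


def rename_replacing(old_name, new_name):
    """Rename old_name to new_name, removing any existing file at new_name first."""
    if os.path.exists(new_name):
        os.remove(new_name)
    os.rename(old_name, new_name)


# Demonstration: output.old already exists and gets replaced.
demo_dir = tempfile.mkdtemp()
old_name = os.path.join(demo_dir, "output.txt")
new_name = os.path.join(demo_dir, "output.old")
for path, text in ((old_name, "new data"), (new_name, "stale data")):
    with open(path, "w") as f:
        f.write(text)

rename_replacing(old_name, new_name)
print(os.listdir(demo_dir))
```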
To recursively delete files and subdirectories in Python, use the walk(path) function in the os module. For a more detailed description of the walk function, refer to the “Walking the Directory Tree” section earlier in this chapter.
The walk function will automatically create a list of tuples representing the directories that need to be deleted. To recursively delete a tree, walk through the list of directories and delete each file contained in the files list (third item in the tuple).
The trick is removing the directories. Because a directory cannot be removed until it is completely empty, the files must first be deleted and then the directories must be removed in reverse order, starting with the deepest subdirectory.
The example in del_tree.py shows how to use the os.walk(path) function to walk a directory tree and delete the files, and then recursively remove the subdirectories.
import os

emptyDirs = []
path = "/trash/deleted_files"

def deleteFiles(dirList, dirPath):
    for file in dirList:
        print "Deleting " + file
        os.remove(dirPath + "/" + file)

def removeDirectory(dirEntry):
    print "Deleting files in " + dirEntry[0]
    deleteFiles(dirEntry[2], dirEntry[0])
    emptyDirs.insert(0, dirEntry[0])

#Enumerate the entries in the tree
tree = os.walk(path)
for directory in tree:
    removeDirectory(directory)

#Remove the empty directories
for dir in emptyDirs:
    print "Removing " + dir
    os.rmdir(dir)
del_tree.py
Deleting files in /trash/deleted_files
Deleting 102.ini
Deleting 103.ini
Deleting 104.ini
Deleting 105.ini
Deleting 106.ini
Deleting 107.ini
Deleting 108.ini
Deleting 109.ini
Deleting files in /trash/deleted_files\Test
Deleting 111.ini
Deleting 114.ini
Deleting 115.ini
Deleting files in /trash/deleted_files\Test\Test2
Deleting 112.ini
Deleting 113.ini
Removing /trash/deleted_files\Test\Test2
Removing /trash/deleted_files\Test
Removing /trash/deleted_files
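An alternative sketch, assuming Python 3, asks os.walk for bottom-up order with topdown=False, so each directory is visited only after its subdirectories and can be removed as soon as its files are gone. The name delete_tree is hypothetical, and the sketch assumes the tree contains no symbolic links to directories. The standard library's shutil.rmtree offers a ready-made way to remove a whole tree as well.

```python
import os
import tempfile


def delete_tree(root):
    """Delete every file and directory under root; return directories in removal order."""
    removed = []
    # topdown=False yields the deepest directories first.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
        # Subdirectories were removed on earlier iterations, so dirpath is empty now.
        os.rmdir(dirpath)
        removed.append(dirpath)
    return removed


# Build a throwaway tree: root/Test/Test2 with one file at each level.
demo_root = tempfile.mkdtemp()
test_dir = os.path.join(demo_root, "Test")
test2_dir = os.path.join(test_dir, "Test2")
os.makedirs(test2_dir)
for folder in (demo_root, test_dir, test2_dir):
    with open(os.path.join(folder, "sample.ini"), "w") as f:
        f.write("x")

removed_dirs = delete_tree(demo_root)
for name in removed_dirs:
    print("Removing " + name)
```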
Example:
for ext in pattern.split(";"):
    extList.append(ext.lstrip("*"))
....
if file.endswith(ext):
    print "/".rjust(spaceCount+1) + file
One of the most common file functions is to search for files based on extension. The example in find_file.py shows one way to search for files based on a string of extensions. The search is handled by first creating a list of the file extensions by splitting the pattern string using the split() function.
Once the list of extensions is created, walk the directory tree and check to see whether the file’s extension matches one in the list by using the endswith(string) function on the file.
import os

path = "/books/python"
pattern = "*.py;*.doc"

#Print files that match to file extensions
def printFiles(dirList, spaceCount, typeList):
    for file in dirList:
        for ext in typeList:
            if file.endswith(ext):
                print "/".rjust(spaceCount+1) + file
                break

#Print each sub-directory
def printDirectory(dirEntry, typeList):
    print dirEntry[0] + "/"
    printFiles(dirEntry[2], len(dirEntry[0]), typeList)

#Convert pattern string to list of file extensions
extList = []
for ext in pattern.split(";"):
    extList.append(ext.lstrip("*"))

#Walk the tree to print files
for directory in os.walk(path):
    printDirectory(directory, extList)
find_file.py
/books/python/
             /Python Proposal.doc
             /Python_Phrasebook_TOC.doc
             /template.doc
             /TOC_Notes.doc
/books/python\CH2/
                 /ch2.doc
/books/python\CH2\code/
                      /comp_str.py
                      /end_str.py
                      /eval_str.py
                      /format_str.py
                      /join_str.py
                      /replace_str.py
                      /search_str.py
                      /split_str.py
                      /trim_str.py
                      /unicode_str.py
                      /var_str.py
/books/python\CH3/
                 /ch3.doc
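Because the pattern string already uses wildcard syntax, another option is to match each filename against the patterns with the standard fnmatch module rather than stripping the * and calling endswith. This is a sketch assuming Python 3; find_files and the temporary tree are hypothetical.

```python
import fnmatch
import os
import tempfile


def find_files(root, pattern_string):
    """Return sorted paths under root whose names match any ';'-separated pattern."""
    patterns = pattern_string.split(";")
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Build a throwaway tree with matching and non-matching files.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "sub"))
for rel in ("a.py", "b.doc", "c.txt", os.path.join("sub", "d.py")):
    with open(os.path.join(demo_root, rel), "w") as f:
        f.write("x")

found = find_files(demo_root, "*.py;*.doc")
for path in found:
    print(path)
```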
Example:
tFile = tarfile.open("files.tar", 'w')
files = os.listdir(".")
for f in files:
    tFile.add(f)
The tarfile module, included with Python, provides a set of easy-to-use methods to create and manipulate TAR files. The open(filename [, mode [, fileobj [, bufsize]]]) method must be called with the write mode set to create a new TAR file. Table 4.2 shows the different modes available when opening a TAR file.
Table 4.2. File Modes for Python’s tarfile Module
Mode | Description |
---|---|
r | (Default) Opens a TAR file for reading. If the file is compressed, it will be decompressed. |
r: | Opens a TAR file for reading with no compression. |
w or w: | Opens a TAR file for writing with no compression. |
a or a: | Opens a TAR file for appending with no compression. |
r:gz | Opens a TAR file for reading with gzip compression. |
w:gz | Opens a TAR file for writing with gzip compression. |
r:bz2 | Opens a TAR file for reading with bzip2 compression. |
w:bz2 | Opens a TAR file for writing with bzip2 compression. |
Once the TAR file has been opened in write mode, files can be added to it using the add(name [,arcname [, recursive]]) method. The add method adds the file or directory specified in name to the archive. The optional arcname argument enables you to specify what name the file should have inside the archive. The recursive argument accepts a Boolean true or false to determine whether or not to recursively add the contents of directories to the archive.
To open a TAR file for sequential access only, replace the : character in the mode with a | character. The append mode is not available for the sequential access option.
import os
import tarfile

#Create Tar file
tFile = tarfile.open("files.tar", 'w')

#Add directory contents to tar file
files = os.listdir(".")
for f in files:
    tFile.add(f)

#List files in tar
for f in tFile.getnames():
    print "Added %s" % f
tFile.close()
tar_file.py
Added add_zip.py
Added del_tree.py
Added dir_tree.py
Added extract.txt
Added extract_tar.py
Added file_lines.py
Added find_file.py
Added get_zip.py
Added input.txt
Added open_file.py
Added output.old
Added read_file.py
Added read_line.py
Added read_words.py
Added ren_file.py
Added tar_file.py
Added write_file.py
Output from tar_file.py code
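To illustrate the w:gz mode from Table 4.2 together with the arcname argument, here is a sketch assuming Python 3. The helper make_tar_gz and the temporary directories are hypothetical. The archive is written outside the source directory so that it cannot end up adding itself.

```python
import os
import tarfile
import tempfile


def make_tar_gz(archive_path, source_dir):
    """Create a gzip-compressed TAR of the entries in source_dir; return member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(source_dir)):
            # arcname stores just the bare name instead of the full source path.
            tar.add(os.path.join(source_dir, name), arcname=name)
        return tar.getnames()


# Demonstration with two small files.
source_dir = tempfile.mkdtemp()
for name in ("a.py", "b.txt"):
    with open(os.path.join(source_dir, name), "w") as f:
        f.write("sample\n")
archive_path = os.path.join(tempfile.mkdtemp(), "files.tar.gz")

added = make_tar_gz(archive_path, source_dir)
for name in added:
    print("Added %s" % name)
```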
The tarfile module includes the extract(file [, path]) method to extract files specified by the file argument and place them in the location specified by the path argument. If no path is specified, the current working directory becomes the destination.
The example in extract_tar.py opens the TAR file created in the previous phrase and extracts only the Python files to a directory called /bin/py.
import os
import tarfile

extractPath = "/bin/py"

#Open Tar file
tFile = tarfile.open("files.tar", 'r')

#Extract py files in tar
for f in tFile.getnames():
    if f.endswith("py"):
        print "Extracting %s" % f
        tFile.extract(f, extractPath)
    else:
        print "%s is not a Python file." % f
tFile.close()
extract_tar.py
Extracting add_zip.py
Extracting del_tree.py
Extracting dir_tree.py
extract.txt is not a Python file.
Extracting extract_tar.py
Extracting file_lines.py
Extracting find_file.py
Extracting get_zip.py
input.txt is not a Python file.
Extracting open_file.py
output.old is not a Python file.
Extracting read_file.py
Extracting read_line.py
Extracting read_words.py
Extracting ren_file.py
Extracting tar_file.py
Extracting write_file.py
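A sketch of selective extraction under the assumption of Python 3 follows. The helper extract_by_suffix is hypothetical; it matches on the full ".py" suffix and skips members that are not regular files. The small archive is built inside the example so that it can run on its own.

```python
import os
import tarfile
import tempfile


def extract_by_suffix(archive_path, suffix, dest_dir):
    """Extract regular files whose names end with suffix; return their names."""
    extracted = []
    with tarfile.open(archive_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                tar.extract(member, dest_dir)
                extracted.append(member.name)
    return extracted


# Build a small archive containing one Python file and one text file.
work_dir = tempfile.mkdtemp()
for name in ("tar_file.py", "input.txt"):
    with open(os.path.join(work_dir, name), "w") as f:
        f.write("sample\n")
archive_path = os.path.join(work_dir, "files.tar")
with tarfile.open(archive_path, "w") as tar:
    for name in ("tar_file.py", "input.txt"):
        tar.add(os.path.join(work_dir, name), arcname=name)

dest_dir = os.path.join(work_dir, "py")
os.makedirs(dest_dir)
extracted = extract_by_suffix(archive_path, ".py", dest_dir)
for name in extracted:
    print("Extracting %s" % name)
```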
Example:
tFile = zipfile.ZipFile("files.zip", 'w')
files = os.listdir(".")
for f in files:
    tFile.write(f)
The zipfile module, included with Python, provides a set of easy-to-use methods to create and manipulate ZIP files. The ZipFile(filename [, mode [, compression]]) constructor creates or opens a ZIP file depending on the mode specified. The available modes for ZIP files are r, w, and a to read, write, or append, respectively. Using the w mode will create a new ZIP file or truncate the existing file to zero length if it already exists.
The optional compression argument accepts either ZIP_STORED (not compressed) or ZIP_DEFLATED (compressed) to set the default compression used when writing files to the archive.
Once the ZIP file has been opened in write mode, files can be added to it using the write(filename [,arcname [, compression]]) method. The write method adds the file specified in filename to the archive. The optional arcname argument enables you to specify what name the file should have inside the archive.
import os
import zipfile

#Create the zip file
tFile = zipfile.ZipFile("files.zip", 'w')

#Write directory contents to the zip file
files = os.listdir(".")
for f in files:
    tFile.write(f)

#List archived files
for f in tFile.namelist():
    print "Added %s" % f
tFile.close()
add_zip.py
Added add_zip.py
Added del_tree.py
Added dir_tree.py
Added extract.txt
Added extract_tar.py
Added files.zip
Added file_lines.py
Added find_file.py
Added get_zip.py
Added input.txt
Added open_file.py
Added output.old
Added read_file.py
Added read_line.py
Added read_words.py
Added ren_file.py
Added tar_file.py
Added write_file.py
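The listing above includes files.zip itself because the archive sits in the directory being archived. As a sketch assuming Python 3, the hypothetical helper make_zip below writes the archive to a separate location and passes ZIP_DEFLATED so that the files are compressed rather than just stored.

```python
import os
import tempfile
import zipfile


def make_zip(archive_path, source_dir):
    """Create a compressed ZIP of the regular files in source_dir; return member names."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(source_dir)):
            full_path = os.path.join(source_dir, name)
            if os.path.isfile(full_path):
                # arcname keeps only the bare filename inside the archive.
                zf.write(full_path, arcname=name)
        return zf.namelist()


# Demonstration with two small files; the archive lives in a different directory.
source_dir = tempfile.mkdtemp()
for name in ("add_zip.py", "input.txt"):
    with open(os.path.join(source_dir, name), "w") as f:
        f.write("sample\n")
archive_path = os.path.join(tempfile.mkdtemp(), "files.zip")

added = make_zip(archive_path, source_dir)
for name in added:
    print("Added %s" % name)
```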
Retrieving file contents from a ZIP file is easily done using the read(filename) method included in the zipfile module. Once the ZIP file is opened in read mode, the read method is called and the contents of the specified file are returned as a string. Once the contents are returned, they can be added to a list or dictionary, printed to the screen, written to a file, or any number of other possibilities.
The example in get_zip.py opens the ZIP file created in the previous phrase, reads the Python file ren_file.py, prints the contents to the screen, and then writes the contents to a new file called extract.txt.
import os
import zipfile

tFile = zipfile.ZipFile("files.zip", 'r')

#List info for archived file
print tFile.getinfo("input.txt")

#Read zipped file into a buffer
buffer = tFile.read("ren_file.py")
print buffer

#Write zipped file contents to new file
f = open("extract.txt", "w")
f.write(buffer)
f.close()
tFile.close()
get_zip.py
<zipfile.ZipInfo instance at 0x008DCB70>
import os

oldFileName = "/books/python/CH4/code/output.txt"
newFileName = "/books/python/CH4/code/output.old"

#Old Listing
for file in os.listdir("/books/python/CH4/code/"):
    if file.startswith("output"):
        print file

#Remove file if the new name already exists
if os.access(newFileName, os.F_OK):
    print "Removing " + newFileName
    os.remove(newFileName)

#Rename the file
os.rename(oldFileName, newFileName)

#New Listing
for file in os.listdir("/books/python/CH4/code/"):
    if file.startswith("output"):
        print file
Output from get_zip.py code
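One difference worth noting if this phrase is adapted to Python 3 (an assumption; the listing above is Python 2): read() returns bytes there, so text has to be decoded before it is treated as a string. The sketch below uses a hypothetical helper, read_member_text, and builds a tiny archive with writestr so that it can run on its own.

```python
import os
import tempfile
import zipfile


def read_member_text(archive_path, member, encoding="utf-8"):
    """Return (uncompressed size, decoded text) for one member of a ZIP file."""
    with zipfile.ZipFile(archive_path, "r") as zf:
        info = zf.getinfo(member)   # ZipInfo object with metadata for the member
        data = zf.read(member)      # raw bytes in Python 3
    return info.file_size, data.decode(encoding)


# Build a one-file archive to read back.
archive_path = os.path.join(tempfile.mkdtemp(), "files.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("hello.txt", "hi there\n")

size, text = read_member_text(archive_path, "hello.txt")
print(size)
print(text)
```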