IN THIS CHAPTER
Data is everywhere. For example, if you run a website, you're collecting data continually and you may not even know it. Every visit to your site generates information that is stored in a file on your server. This file contains lots of useful information, if you take the time to examine it.
That's just one example of data collection. Virtually every automated system collects data and stores it. Most of the time, the system that collects the data is also equipped to verify and analyze the data — but not always. And, of course, data is collected manually. A nonautomated telephone survey is a good example.
Excel is a good tool for analyzing data, and it's often used to summarize the information and display it in the form of tables and charts. But often, the data that's collected isn't perfect. For one reason or another, it needs to be cleaned up before it can be analyzed.
One common use for Excel is as a tool to clean up data. Cleaning up data involves getting raw data into a worksheet and then manipulating it so it conforms to various requirements. In the process, the data will be made consistent so it can be properly analyzed.
This chapter describes various ways to get data into a worksheet and provides some tips to help you clean it up.
Before you can do anything with data, you must get it into a worksheet. Excel is able to import most common text file formats and can retrieve data from websites.
This section describes file types that Excel can open directly, using the File Open command. Figure 32.1 shows the list of file filter options you can specify in the Open dialog box.
In addition to the current file formats (XLSX, XLSM, XLSB, XLTX, XLTM, and XLAM), Excel 2016 can open workbook files from all previous versions of Excel:
Excel can also open one file format created by other spreadsheet products: ODS, the OpenDocument spreadsheet format. ODS files are produced by a variety of “open” software, including Google Drive, OpenOffice, LibreOffice, StarOffice, and several others.
Excel 2016 can open the following database file formats:
.mdb
and .accdb
.In addition, Excel supports various types of database connections that enable you to access data selectively. For example, you can perform a query on a large database to retrieve only the records you need (rather than the entire database).
A text file contains raw characters, with no formatting. Excel can open most types of text files:
Most of these text file types have variants. For example, text files produced on a Mac have different end-of-row characters. Excel can usually handle the variants without a problem.
When you attempt to open a text file in Excel, the Text Import Wizard might kick in to help you specify how you want the data to be retrieved.
Excel can open most HTML files, which can be stored on your local drive or on a web server. Choose File Open and locate the HTML file. If the file is on a web server, you'll need to copy the URL and paste it into the File Name field in the Open dialog box.
The way the HTML code renders in Excel varies considerably. Sometimes the HTML file may look exactly as it does in a browser. Other times it may bear little resemblance, especially if the HTML file uses Cascading Style Sheets (CSS) for layout.
In some cases, you can access data on the Web by using the Get & Transform feature. I discuss this topic in Chapter 38, “Working with Get & Transform.”
XML (Extensible Markup Language) is a text file format suitable for structured data. Data is enclosed in tags, which also serve to describe the data.
Excel can open XML files, and simple files will display with little or no effort. Complex XML files will require some work, however. A discussion of this topic is beyond the scope of this book. You'll find information about getting data from XML files in Excel's Help system and online.
If you need to insert a text file into a specific range in a worksheet, you might think that your only choice is to import the text into a new workbook and then to copy the data and paste it into the range where you want it to appear. However, you can do it in a more direct way.
Figure 32.2 shows a small CSV file. The following instructions describe how to import this file, named monthly.csv
, beginning at cell C3.
If all else fails, you can try standard copy-and-paste techniques. If you can copy data from an application (for example, a word-processing program or a document displayed in PDF viewer), there's a good chance you can paste it into an Excel workbook. For best results, try pasting using the Home Clipboard Paste Paste Special command, and try various paste options listed. Usually, pasted data will require some cleanup.
This section discusses a variety of techniques that you can use to clean up data in a worksheet.
If data is compiled from multiple sources, it may contain duplicate rows. Most of the time, you want to eliminate the duplicates. In the past, removing duplicate data was essentially a manual task — although it could be automated by using a confusing advanced filter technique. But now removing duplicate rows is easy, thanks to Excel's Remove Duplicates command (introduced in Excel 2007).
Start by moving the cell cursor to any cell within your data range. Choose Data Data Tools Remove Duplicates, and the Remove Duplicates dialog box, shown in Figure 32.5, appears.
The Remove Duplicates dialog box lists all the columns in your data range or table. Place a check mark next to the columns that you want to be included in the duplicate search. Most of the time, you'll want to select all the columns, which is the default. Click OK, and Excel weeds out the duplicate rows and displays a message that tells you how many duplicates it removed. It would be nice if Excel gave you the option to change your mind, but it doesn't. If Excel deleted too many rows, you can undo the procedure by clicking Undo (or by pressing Ctrl+Z).
When you select all columns in the Remove Duplicates dialog box, Excel will delete a row only if the content of every column is duplicated. In some situations, you may not care about matching some columns, so you would deselect those columns in the Remove Duplicates dialog box. For example, if each row has a unique ID code, Excel would never find any duplicate rows. So you'd want to uncheck that column in the Remove Duplicates dialog box.
When duplicate rows are found, the first row is kept and subsequent duplicate rows are deleted.
If you would like to identify duplicate rows so you can examine them without automatically deleting them, here's another method. Unlike the technique described in the previous section, this method looks at actual values, not formatted values.
Create a formula to the right of your data that concatenates each of the cells to the left. The formulas that follow assume that the data is in columns A:F.
Enter this formula into cell G2:
=A2&B2&C2&D2&E2&F2
Add another formula in cell H2. This formula displays the number of times a value in column G occurs:
=COUNTIF(G:G,G2)
Copy these formulas down the column for each row of your data.
Column H displays the number of occurrences of that row. Unduplicated rows will display 1. Duplicated rows will display a number that corresponds to the number of times that row appears.
Figure 32.6 shows a simple example. If you don't care about a particular column, just omit it from the formula in column G. For example, if you want to find duplicates regardless of the Status column, just omit D2 from the concatenating formula.
When importing data, you might find that multiple values are imported into a single column. Figure 32.7 shows an example of this type of import problem.
If the text is all the same length (as in the example), you might be able to write a series of formulas that extract the information to separate columns. The LEFT
, RIGHT
, and MID
functions are useful for this task.
You should also be aware that Excel offers two nonformula methods to assist in splitting data so it occupies multiple columns: Text to Columns and Flash Fill.
The Text to Columns command can parse strings into their component parts.
First, make sure that the column that contains the data to be split up has enough empty columns to the right to accommodate the extracted data. Then select the data to be parsed and choose Data Data Tools Text to Columns. Excel displays the Convert Text to Columns Wizard, which consists of a series of dialog boxes that walk you through the steps to convert a single column of data into multiple columns. Figure 32.8 shows the initial step, in which you choose the type of data:
Make your choice and click Next to move on to step 2, which depends on the choice you made in step 1.
If you're working with delimited data, specify the delimiting character or characters. You'll see a preview of the result. If you're working with fixed-width data, specify the column breaks directly in the preview window.
When you're satisfied with the column breaks, click Next to move to step 3. In this step you can click a column in the preview window and specify general formatting for the column. Click Finish, and Excel splits the data as specified.
The Text to Columns Wizard works well for many types of data. But sometimes you'll encounter data that can't be parsed by that wizard. For example, the Text to Columns Wizard is useless if you have variable-width data that doesn't have delimiters. In such a case, the Flash Fill feature might save the day. But keep in mind that Flash Fill works successfully only when the data is very consistent. Flash Fill was introduced in Excel 2013.
Flash Fill uses pattern recognition to extract data (and also concatenate data). Just enter a few examples in a column that's adjacent to the data and choose Data Data Tools Flash Fill (or press Ctrl+E). Excel analyzes the examples and attempts to fill in the remaining cells. If Excel didn't recognize the pattern you had in mind, press Ctrl+Z, add another example or two, and try again.
Figure 32.9 shows a worksheet with some text in a single column. The goal is to extract the numeric value from each text string and put the number into a separate cell. The Text to Columns Wizard can't do it because the space delimiters aren't consistent. It might be possible to write an array formula, but it would be complicated.
To try using Flash Fill, activate cell B1 and type the first number (20). Move to B2, and type the second number (6). Can Flash Fill identify the remaining numbers and fill them in? Choose Data Data Tools Flash Fill (or press Ctrl+E), and Excel fills in the remaining cells in a flash. Figure 32.10 shows the result.
As you see, Excel identified most of the values. Accuracy increases if you provide more examples. For example, provide an example of a decimal number. Delete the suggested values in Column B, enter 3.12 in cell B6, and press Ctrl+E. This time, Flash Fill gets all of them correct (see Figure 32.11).
This simple example demonstrates two important points:
Figure 32.12 shows another example, names in column A. The goal is to extract the first, last, and middle name (if it has one). In column B Flash Fill successfully gets all the first names using only two examples (Mark and Tim). Plus, it successfully extracted all the last names (column C), using Russell and Colman. Extracting the middle names or initials (column D) eluded me until I provided examples that included a space on either side of the middle name).
To summarize, Excel's Flash Fill is an interesting idea, but it works reliably only if the data is very consistent. Even when you think it worked correctly, make sure you examine the results carefully. And think twice before trusting it with important data because there's no way to document the way the data was extracted. But the main limitation is that (unlike formulas) Flash Fill is not a dynamic technique. If your data changes, the flash-filled column does not update.
Often, you'll want to make text in a column consistent, in terms of case. Excel provides no direct way to change the case of text, but it's easy to do with formulas (see the sidebar “Transforming Data with Formulas”).
The three relevant functions are
UPPER
: Converts the text to ALL UPPERCASE.LOWER
: Converts the text to all lowercase.PROPER
: Converts the text to Proper Case. (The first letter in each word is capitalized, as in a proper name.)These functions are quite straightforward. They operate only on alphabetic characters and just ignore all other characters and return them unchanged.
If you use the PROPER
function, you'll probably need to do some additional cleanup to handle exceptions. Following are examples of transformations that you probably would consider incorrect:
PROPER
function doesn't handle names with an embedded capital letter, such as McDonald.Often, you can correct some of these problems by using Find and Replace.
It's usually a good idea to ensure that data doesn't have extra spaces. It's impossible to spot a space character at the end of a text string. Extra spaces can cause lots of problems, especially when you need to compare text strings. The text July is not the same as the text July with a space appended to the end. The first is four characters long, and the second is five characters long.
Create a formula that uses the TRIM
function to remove all leading and trailing spaces and to replace multiple spaces with a single space. This example uses the TRIM
function. The formula returns Fourth Quarter Earnings
(with no excess spaces):
=TRIM(" Fourth Quarter Earnings ")
Data that is imported from a web page often contains a different type of space: a nonbreaking space, indicated by  
in HTML code. In Excel, this character can be generated by this formula:
=CHAR(160)
You can use a formula like this to replace those spaces with normal spaces:
=SUBSTITUTE(A2,CHAR(160)," ")
Or use this formula to replace the nonbreaking space character with normal spaces and to remove excess spaces:
=TRIM(SUBSTITUTE(A2,CHAR(160)," "))
Often, data imported into an Excel worksheet contains strange (sometimes unprintable) characters. You can use the CLEAN
function to remove all nonprinting characters from a string. If the data is in cell A2, this formula will do the job:
=CLEAN(A2)
In some cases you may need to convert values from one system to another. For example, you may import a file that has values in fluid ounces, and they need to be expressed in milliliters. Excel's handy CONVERT
function can perform that and many other conversions.
If cell A2 contains a value in ounces, the following formula converts it to milliliters:
=CONVERT(A2,"oz","ml")
This function is extremely versatile and can handle most common measurement units in the following categories: weight and mass, distance, time, pressure, force, energy, power, magnetism, temperature, volume, liquid, area, bits and bytes, and speed.
Excel can also convert between number bases. You may import a file that contains hexadecimal values, and you need to convert them to decimal. Use the HEX2DEC
function to perform this conversion. For example, the following formula returns 1,279, the decimal equivalent of its hex argument.
=HEX2DEC("4FF")
Excel can also convert from binary to decimal (BIN2DEC
) and from octal to decimal (OCT2DEC
).
Functions that convert from decimal to another number base are DEC2HEX
, DEC2BIN
, and DEC2OCT
.
Excel 2013 introduced a new function, BASE
, that converts a decimal number to any number base. Note that there is not a function that works in the opposite direction. Excel does not provide a function that converts any number base to decimal. You're limited to binary, octal, and hexadecimal.
Often, you may have values that need to be classified into a group. For example, if you have ages of people, you might want to classify them into groups such as 17 or younger, 18–24, 25–34, and so on.
The easiest way to perform this classification is with a lookup table. Figure 32.13 shows ages in column A and classifications in column B. Column B uses the lookup table in D2:E9. The formula in cell B2 is
=VLOOKUP(A2,$D$2:$E$9,2)
This formula was copied to the cells below.
You can also use a lookup table for nonnumeric data. Figure 32.14 shows a lookup table that is used to assign a region to a state.
The two-column lookup table is in the range D2:E52. The formula in cell B2, which was copied to the cells below, is
=VLOOKUP(A2,$D$2:$E$52,2,FALSE)
To combine data in two more columns, you can usually use the concatenation operator (&
) in a formula. For example, the following formula combines the contents of cells A1, B1, and C1:
=A1&B1&C1
Often, you'll need to insert spaces between the cells — for example, if the columns contain a title, first name, and last name. Concatenating using the formulas above would produce something like Mr.ThomasJones. To add spaces (to produce Mr. Thomas Jones), modify the formula:
=A1&" "&B1&" "&C1
You can also use the Flash Fill feature (discussed earlier in this chapter) to join columns without using formulas. Just provide an example or two in an adjacent column, and press Ctrl+E. Excel will perform the concatenation for the other rows.
If you need to rearrange the columns in a worksheet, you could insert a blank column and then drag another column into the new blank column. But the moved column leaves a gap, which you need to delete.
Here's an easier way:
Repeat these steps until the columns are in the order you want.
If you need to arrange the rows in random order, here's a quick way to do it. In the column to the right of the data, insert this formula into the first cell and copy it down:
=RAND()
Then sort the data using this column as the sort key. The rows will be in random order, and you can delete the column.
In some cases, you may have a list of URLs and need to extract only the filename. The following formula returns the filename from a URL. Assume cell A2 contains this URL:
http://example.com/assets/images/horse.jpg
The following formula returns horse.jpg
:
=RIGHT(A2,LEN(A2)-FIND("*",SUBSTITUTE(A2,"/","*",LEN(A2)-LEN(SUBSTITUTE(A2,"/","")))))
This formula returns all text that follows the last slash character. If cell A2 doesn't contain a slash character, the formula returns an error.
To extract the URL without the filename, use this formula:
=LEFT(A2,FIND("*",SUBSTITUTE(A2,"/","*",LEN(A2)-LEN(SUBSTITUTE(A2,"/","")))))
You may have some data that you need to check against another list. For example, you may want to identify the data rows in which data in a particular column appears in a different list. Figure 32.15 shows a simple example. The data is in columns A:C. The goal is to identify the rows in which the Member Num appears in the Resigned Members list, in column F. These rows can then be deleted.
Here's a formula, entered into cell D2 and copied down, that will do the job:
=IF(COUNTIF($F$2:$F$22,B2)>0,"Resigned","" )
This formula displays the word Resigned if the Member Num in column B is found in the Resigned Members list. If the Member number is not found, it returns an empty string. If the list is sorted by column D, the rows for all resigned members will appear together and can be quickly deleted.
This technique can be adapted to other types of list-matching tasks.
Figure 32.16 shows a common type of data layout that you might see when importing a file. Each record consists of three consecutive cells in a single column: Name, Department, and Location. The goal is to convert this data so each record appears in three columns.
There are several ways to convert this type of data, but here's a method that's fairly easy. It requires a small amount of setup, but the work is done with a single formula, which is copied to a range.
Start by creating some numeric vertical and horizontal “headers,” as shown in Figure 32.17. Column C contains numbers that correspond to the first row of each data item (in this case, the Name). In this example, I put the following values in column C: 1, 4, 7, 10, 13, 16, and 19. You can use a simple formula to generate this series of numbers.
The horizontal range of headers consists of consecutive integers, starting with 1. In this example, each record contains three cells of data, so the horizontal header contains 1, 2, and 3.
Here's the formula that goes into cell D2:
=OFFSET($A$1,$C2+D$1-2,0)
Copy this formula across to the next two columns and down to the next six rows. The result is shown in Figure 32.18.
You can easily adapt this technique to work with vertical data that contains a different number of rows. For example, if each record contained ten rows of data, the column C header values would be 1, 11, 21, 31, and so on. The horizontal headers would consist of values 1 through 10 rather than 1 through 3.
Notice that the formula uses an absolute reference to cell A1. That reference won't change when the formula is copied, so all the formulas use cell A1 as the base. If the data begins in a different cell, change $A$1
to the address of the first cell.
The formula also uses “mixed” referencing in the second argument of the OFFSET
function. The C2 reference has a dollar sign in front of C, so column C is the absolute part of the reference. In the D1 reference, the dollar sign is before the 1, so row 1 is the absolute part of the reference.
When you import data, you can sometimes end up with a worksheet that looks something like the one shown in Figure 32.19. This type of report formatting is common. As you can see, an entry in column A applies to several rows of data. If you sort this type of list, the missing data messes things up, and you can no longer tell who sold what when.
If the report is small, you can enter the missing cell values manually or by using a series of Home Editing Fill Down commands (or the Ctrl+D shortcut). But if you have a large list that's in this format, here's a better way:
After you complete these steps, the gaps are filled in with the correct information, and your worksheet looks similar to the one shown in Figure 32.20.
If you use a word-processing program, you probably take advantage of its spell checker feature. Spelling mistakes can be embarrassing when they appear in a text document, but they can cause serious problems when they occur within your data. For example, if you tabulate data by month, a misspelled month name will make it appear that a year has 13 months.
To access the Excel spell checker, choose Review Proofing Spelling, or press F7. To check the spelling in just a particular range, select the range before you activate the spell checker.
If the spell checker finds any words it doesn't recognize as correct, it displays the Spelling dialog box. The options are fairly self-explanatory.
You may need to systematically replace (or remove) certain characters in a column of data. For example, you may need to replace all backslash characters with forward slash characters. In many cases, you can use Excel's Find and Replace dialog box to accomplish this task. To remove text using the Find and Replace dialog box, just leave the Replace With field empty.
In other situations, you may need a formula-based solution. Consider the data shown in Figure 32.21. The goal is to replace the second hyphen character with a colon for the part numbers in Column A. Using Find and Replace wouldn't work because there isn't a way to specify that only the second hyphen should be replaced.
In this case, the solution is a fairly simple formula that replaces the second occurrence of a hyphen with a colon:
=SUBSTITUTE(A2,"-",":",2)
To remove the second occurrence of a hyphen, just omit the third argument for the SUBSTITUTE
function:
=SUBSTITUTE(A2,"-",,2)
This is another example where Flash Fill can also do the job.
If you need to add text to a cell, one solution is to use a new column of formulas. Here are some examples:
ID:
and a space to the beginning of a cell:
="ID: "&A2
.mp3
to the end of a cell:
=A2&".mp3"
=LEFT(A2,3)&"-"&RIGHT(A2,LEN(A2)-3)
You can also use the Flash Fill feature to add text to cells.
Imported data sometimes displays negative values with a trailing minus sign. For example, a negative value may appear as 3,498– rather than the more common –3,498. Excel does not convert these values. In fact, it considers them to be nonnumeric text.
The solution is so simple it may even surprise you:
This procedure works because of a default setting in the Advanced Text Import Settings dialog box (which you don't even see, normally). To display this dialog box, shown in Figure 32.22, go to step 3 in the Text to Columns Wizard dialog box and click Advanced.
This section contains a list of items that could cause problems with data. Not all these are relevant to every set of data.
This chapter began with a section on importing data, so it's only appropriate to end it with a discussion of exporting data to a file that's not a standard Excel file.
When you choose File Save As, the Save As dialog box lets you choose from a variety of text file formats. The three types are
I discuss these file types in the sections that follow.
When you export a worksheet to a CSV file, the data is saved as displayed. In other words, if a cell contains 12.8312344 but is formatted to display with two decimal places, the value will be saved as 12.83.
Cells are delimited with a comma character, and rows are delimited with a carriage return and line feed.
Note that if a cell contains a comma, the cell value is saved within quotation marks. If a cell contains a quotation mark character, that character appears twice.
Exporting a workbook to a TXT file is almost identical to the CSV file format described earlier. The only difference is that cells are separated by a tab character rather than a comma.
If your worksheet contains any Unicode characters, you should export the file using the Unicode variant. Otherwise, Unicode characters will be saved as question mark characters.
A PRN file is very much like a printed image of the worksheet. The cells are separated by multiple space characters. Also, a line is limited to 240 characters. If a line exceeds that limit, the remainder appears on the next line. PRN files are rarely used.
Excel also lets you save your work in several other formats:
.dif
extension. Not used very often..sylk
extension. Not used very often..pdf
extension. This is a common “read-only” file format..xps
extension. Microsoft's alternative to PDF files. Not used very often..htm
extension. Often, saving a file as a workbook will generate a directory of ancillary files required to render the page accurately..ods
extension. They're compatible with various open source spreadsheet programs.