Chapter 9
IN THIS CHAPTER
Defining the kinds of data manipulation
Changing dataset size using slicing and dicing
Changing dataset content using mapping and filtering
Organizing your data
Previous chapters in this book spend a lot of time looking at how to perform basic application tasks and how to view data in various ways to see what it contains. Just viewing the data won’t do you much good, however. Data rarely comes in the form you need, and even if it does, you still want the option to mix it with other data to create newer views of the real world. Gaining the ability to shape data in certain ways, throw out what you don’t need, refine its appearance, change its type, and otherwise condition it to meet your needs is the essential goal of this chapter.
Shaping, in the form of slicing and dicing, is the most common kind of manipulation. Data analysis can take hours, days, or even weeks at times. Anything you can do to refine the data to match specific criteria is important in getting answers fast. Obtaining answers quickly is essential in today’s world. Yes, you need the correct answer, but if someone else gets the correct answer first, you may find that the answer no longer matters. You lose your competitive edge.
Also essential is having the right data. The use of data mapping enables you to correlate data between information systems so that you can draw new conclusions. In addition, information overload, especially the wrong kind of information, is never productive, so filtering is essential as well. The combination of mapping and filtering lets you control the dataset content without changing the dataset truthfulness. In short, you get a new view of the same old information.
Data presentation — that is, its organization — is also important. The final section of this chapter discusses the issue of how to organize data to better see the patterns it contains. Given that there isn’t just one way to organize data, one presentation may show one set of patterns, and another presentation could display other patterns. The goal of all this data manipulation is to see something in the data that you haven’t seen before. Perhaps the data will give you an idea for a new product or help you market products to a new group of users. The possibilities are nearly limitless.
When you mention the term data manipulation, you convey different information to different people, depending on their particular specialty. An overview of data manipulation may include the term CRUD, which stands for Create, Read, Update, and Delete. A database manager may view data solely from this low-level perspective that involves just the mechanics of working with data. However, a database full of data, even accurate and informative data, isn’t particularly useful, even if you have all the best CRUD procedures and policies in place. Consequently, just defining data manipulation as CRUD isn’t enough, but it’s a start.
Another kind of data transformation actually does something worthwhile. In this case, the meaning of the data doesn’t change; only the presentation of the data changes. You can separate this kind of transformation into a number of methods that include (but aren’t necessarily limited to) tasks such as the following:
http://www.animations.physics.unsw.edu.au/jw/dB.htm
. Without a reference value or a baseline, determining what the dB value truly means is impossible. For audio, the dB is referenced to 1 volt (dBV), as described at http://www.sengpielaudio.com/calculator-db-volt.htm
. The reference is standard and therefore implied, even though few people actually know that a reference is involved. Now, imagine the chaos that would result if some people used 1 volt for a reference and others used 2 volts. dBV would become meaningless as a unit of measure. Many kinds of data form a ratio or other value that requires a reference. Transformations can adjust the reference or baseline value as needed so that the values can be compared in a meaningful way.
Slicing and dicing are two ways to control the size of a dataset. Slicing occurs when you use a subset of the dataset in a single axis. For example, you may want only certain records (also called cases) or you may want only certain columns (also called features). Dicing occurs when you perform slicing in multiple directions. When working with two-dimensional data, you select certain rows and certain columns from those rows. You see dicing used more often with three-dimensional or higher data, when you want to restrict the x-axis and the y-axis but keep all the z-axis (as an example). The following sections describe slicing and dicing in more detail and demonstrate how to perform this task using both Haskell and Python.
Datasets can become immense. The data continues to accumulate from various sources until it becomes impossible for the typical human to comprehend it all. So slicing and dicing might at first seem to be a means for making data more comprehensible. It can do that, but making the data comprehensible isn’t the point. Too much data can even overwhelm a computer — not in the same way as a human gets overwhelmed, because a computer doesn’t understand anything, but to the point where processing proceeds at a glacial pace. As the cliché says, time is money, which is precisely why you want to control dataset size. The more focused you can make any data analysis, the faster the analysis will proceed.
Sometimes you must use slicing and dicing to break the data down into training and testing units for computer technologies such as machine learning. You use the training set to help an algorithm perform the correct processing in the correct way through examples. The testing set then verifies that the training went as planned. Even though machine learning is the most prominent technology today that requires breaking data into groups, you can find other examples. Many database managers work better when you break data into pieces and perform batch processing on it, for example.
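As a minimal sketch of the training and testing split just described, the following Python code divides a list of cases by position. The records list and the 80/20 ratio are assumptions made for illustration, not values taken from this chapter.

```python
# Hypothetical records; any list of cases would do.
records = list(range(1, 11))      # ten cases, numbered 1 through 10

# Assumed 80/20 split; the ratio is an illustrative choice.
split_at = int(len(records) * 0.8)

training = records[:split_at]     # the first 80 percent of the cases
testing = records[split_at:]      # the remaining 20 percent

print(training)
print(testing)
```

Because the two slices share no elements and together cover the whole list, every case lands in exactly one of the two sets.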
Slicing and dicing techniques can also help you improve the focus of a particular analysis. For example, you may not actually require all the columns (features) in a dataset. Removing the extraneous columns can actually make the data easier to use and provide results that are more reliable.
Likewise, you may need to remove unneeded information from the dataset. For example, a dataset that contains entries from the last three years requires slicing or dicing when you need to analyze only the results from one year. Even though you could use various techniques to ignore the extra entries in code, eliminating the unwanted years from the dataset using slicing and dicing techniques makes more sense.
Haskell slicing and dicing requires a bit of expertise to understand because you don’t directly access the slice as you might with other languages through indexing. Of course, there are libraries that encapsulate the process, but this section reviews a native language technique that will do the job for you using the take
and drop
functions. Slicing can be a single-step process if you have the correct code. The following code starts with a one-dimensional list, let myList = [1, 2, 3, 4, 5]
.
-- Display the first two elements.
take 2 myList
-- Display the remaining three elements.
drop 2 myList
-- Display a data slice of just the center element.
take 1 $ drop 2 myList
The slice created by the last statement begins by dropping the first two elements using drop 2 myList
, leaving [3,4,5]
. The $
operator connects this output to the next function call, take 1
, which produces an output of [3]
. Using this little experiment, you can easily create a slice function that looks like this:
slice xs x y = take y $ drop x xs
To obtain just the center element from myList
, you would call slice myList 2 1
, where 2
is the zero-based starting index and 1
is the length of the output you want. Figure 9-1 shows how this sequence works.
Of course, slicing that works only on one-dimensional arrays isn't particularly useful. You can test the slice
function on a two-dimensional array by starting with a new list, let myList2 = [[1,2],[3,4],[5,6],[7,8],[9,10]]
. Try the same call as before, slice myList2 2 1
, and you see the expected output of [[5,6]]
. So, slice
works fine even with a two-dimensional list.
Dicing is somewhat the same, but not quite. To test the dice
function, begin with a slightly more robust list, let myList3 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
. Because you're now dealing with the inner values rather than the lists contained within a list, you must rely on recursion to perform the task. The “Defining the need for repetition” section of Chapter 8 introduces you to the forM
function, which repeats a particular code segment. The following code shows a simplified, but complete, dicing sequence.
import Control.Monad
let myList3 =
[[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
slice xs x y = take y $ drop x xs
dice lst x y = forM lst (\i -> do return(slice i x y))
lstr = slice myList3 1 3
lstr
lstc = dice lstr 1 1
lstc
To use forM
, you must import Control.Monad
. The slice
function is the same as before, but you must define it within the scope created after the import. The dice
function uses forM
to examine every element within the input list and then slice it as required. What you're doing is slicing the list within the list. The next items of code first slice myList3
into rows, and then into columns. The output is as you would expect: [[5],[8],[11]]
. Figure 9-2 shows the sequence of events.
In some respects, slicing and dicing is considerably easier in Python than in Haskell. For one thing, you use indexes to perform the task. Also, Python offers more built-in functionality. Consequently, the one-dimensional list example looks like this:
myList = [1, 2, 3, 4, 5]
print(myList[:2])
print(myList[2:])
print(myList[2:3])
The use of indexes enables you to write the code succinctly and without using special functions. The output is as you would expect:
[1, 2]
[3, 4, 5]
[3]
Slicing a two-dimensional list is every bit as easy as working with a one-dimensional list. Here's the code and output for the two-dimensional part of the example:
myList2 = [[1,2],[3,4],[5,6],[7,8],[9,10]]
print(myList2[:2])
print(myList2[2:])
print(myList2[2:3])
[[1, 2], [3, 4]]
[[5, 6], [7, 8], [9, 10]]
[[5, 6]]
Notice that the Python functionality matches that of Haskell’s take
and drop
functions; you simply perform the task using indexes instead. Dicing does require using a special function, but the function is concise in this case and doesn't require multiple steps:
def dice(lst, rb, re, cb, ce):
    lstr = lst[rb:re]
    lstc = []
    for i in lstr:
        lstc.append(i[cb:ce])
    return lstc
In this case, you can’t really use a lambda function — or not easily, at least. The code slices the incoming list first and then dices it, just as in the Haskell example, but everything occurs within a single function. Notice that Python requires the use of looping, but this function uses a standard for
loop instead of relying on recursion. The disadvantage of this approach is that the loop relies on state, which means that you can’t really use it in a fully functional setting. Here’s the test code for the dicing part of the example:
myList3 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
print(dice(myList3, 1, 4, 1, 2))
[[5], [8], [11]]
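If the reliance on state is a concern, one possible alternative is a list comprehension, which builds the result without appending to a list. The sketch below also mirrors the Haskell slice function, which takes a start index and a length. The names slice_list and dice_stateless are hypothetical and don't appear elsewhere in this chapter.

```python
def slice_list(xs, start, length):
    """Mirror the Haskell slice: skip start items, then keep length items."""
    return xs[start:start + length]


def dice_stateless(lst, rb, re, cb, ce):
    """Slice rows rb..re, then columns cb..ce, without mutating a list."""
    return [row[cb:ce] for row in lst[rb:re]]


myList3 = [[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]]
print(slice_list(myList3, 1, 3))
print(dice_stateless(myList3, 1, 4, 1, 2))
```

The second call produces the same [[5], [8], [11]] result as the loop-based dice function.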
You can find a number of extremely confusing references to the term map in computer science. For example, a map is associated with database management (see https://en.wikipedia.org/wiki/Data_mapping
), in which data elements are mapped between two distinct data models. However, for this chapter, mapping refers to the process of applying a function to each member of a list by using a higher-order function. Because the function is applied to every member of the list, the relationships among list members are unchanged. Many reasons exist to perform mapping, such as ensuring that the range of the data falls within certain limits. The following sections of the chapter help you better understand the uses for mapping and demonstrate the technique using the two languages supported in this book.
The main idea behind mapping is to apply a function to all members of a list or similar structure. Using mapping can help you adjust the range of the values or prepare the values for particular kinds of analysis. Functional languages originated the idea of mapping, but mapping now sees use in most programming languages that support first-class functions.
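To make the idea of adjusting a range concrete, here is a small Python sketch that rescales a list into the range 0 through 1 using map. The values list is made up for illustration, and min-max scaling is just one of several ways to adjust a range.

```python
values = [12, 18, 30, 24, 6]      # hypothetical measurements

low, high = min(values), max(values)

# Min-max scaling: map every value into the range 0 through 1.
scaled = list(map(lambda v: (v - low) / (high - low), values))

print(scaled)
```

The output, [0.25, 0.5, 1.0, 0.75, 0.0], keeps the original ordering, so the largest input is still the largest output.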
Haskell is one of the few computer languages whose map
function isn't necessarily what you want. For example, the map
associated with Data.Map.Strict
, Data.Map.Lazy
, and Data.IntMap
works with the creation and management of dictionaries, not the application of a consistent function to all members of a list (see https://haskell-containers.readthedocs.io/en/latest/map.html
and http://hackage.haskell.org/package/containers-0.5.11.0/docs/Data-Map-Strict.html
for details). What you want instead is the map
function that appears as part of the base prelude so that you can access map
without importing any libraries.
The map
function accepts a function as input, along with one or more values in a list. You might create a function, square
, that outputs the square of the input value: square x = x * x
. A list of values, items = [0, 1, 2, 3, 4]
, serves as input. Calling map square items
produces an output of [0,1,4,9,16]
. Of course, you could easily create another function: double x = x + x
, with a map double items
output of [0,2,4,6,8]
. The output you receive clearly depends on the function you use as input (as expected).
The apply operator ($) is also important to mapping. You can create a condition for which you apply an argument to a list of functions. As shown in Figure 9-3, you place the argument first in the list, followed by the function list (map ($4) [double, square]
). The output is a list with one element for each function, which is [8,16]
in this case. Using recursion would allow you to apply a list of numbers to a list of functions.
Python performs many of the same mapping tasks as Haskell, but often in a slightly different manner. Look, for example, at the following code:
square = lambda x: x**2
double = lambda x: x + x
items = [0, 1, 2, 3, 4]
print(list(map(square, items)))
print(list(map(double, items)))
You obtain the same output as you would with Haskell using similar code. However, note that you must convert the map
object to a list
object before printing it. Given that Python is an impure language, creating code that processes a list of inputs against two or more functions is relatively easy, as shown in this code:
funcs = [square, double]
for i in items:
    value = list(map(lambda func: func(i), funcs))
    print(value)
Note that, as with the Haskell code, you're actually applying individual list values against the list of functions. However, Python requires a lambda function to get the job done. Figure 9-4 shows the output from the example.
Most programming languages provide specialized functions for filtering data today. Even when the language doesn’t provide a specialized function, you can use common methods to perform filtering manually. The following sections discuss what filtering is all about and how to use the two target languages to perform the task.
Data filtering is an essential tool in removing outliers from datasets, as well as selecting specific data based on one or more criteria for analysis. While slicing and dicing selects data regardless of specific content, data filtering makes specific selections to achieve particular goals. Consequently, the two techniques aren’t mutually exclusive; you may well employ both on the same dataset in an effort to locate the particular data needed for an analysis. The following sections discuss details of filtering use and provide examples of simple data filtering techniques for both of the languages used in this book.
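The following Python sketch shows one way the two techniques might work together: filtering removes an outlier based on content, and slicing then selects by position. The readings list and the two-standard-deviation cutoff are assumptions chosen for illustration, not a recommended rule.

```python
from statistics import mean, stdev

readings = [10, 12, 11, 13, 12, 95, 11]   # hypothetical data; 95 looks suspect

center = mean(readings)
spread = stdev(readings)

# Filtering: keep values within two standard deviations of the mean.
kept = list(filter(lambda r: abs(r - center) <= 2 * spread, readings))

# Slicing: select by position, regardless of content.
first_three = kept[:3]

print(kept)
print(first_three)
```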
Haskell relies on a filter
function to remove unwanted elements from lists and other dataset structures. The filter
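function accepts two inputs: a description of what you want to keep and the list of elements to filter. Elements that don't match the description are removed. The filter descriptions come in three forms:
Predicate functions, such as odd and even
Relational operators, such as >
Lambda functions, such as \x -> mod x 3 == 0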
To see how this all works, you could create a list such as items = [0, 1, 2, 3, 4, 5]
. Figure 9-5 shows the results of each of the filtering scenarios.
Python doesn't provide a few of the niceties that Haskell does when it comes to filtering. For example, you don’t have access to built-in predicates, such as odd or even. In practice, filtering in Python usually relies on lambda functions, although any function that returns a truth value works. Consequently, to obtain the same results for the three cases in the previous section, you use code like this:
items = [0, 1, 2, 3, 4, 5]
print(list(filter(lambda x: x % 2 == 1, items)))
print(list(filter(lambda x: x > 3, items)))
print(list(filter(lambda x: x % 3 == 0, items)))
Notice that you must convert the filter
output using a function such as list
. You don't have to use list
; you could use any data structure, including set
and tuple
. The lambda function you create must evaluate to True
or False
, just as it must with Haskell. Figure 9-6 shows the output from this example.
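The filter function accepts any callable as its first argument, so a named function can stand in when a lambda becomes repetitive. The following sketch uses a hypothetical is_odd helper and converts the results to different data structures.

```python
def is_odd(x):
    """Named predicate that stands in for Haskell's odd."""
    return x % 2 == 1


items = [0, 1, 2, 3, 4, 5]

print(list(filter(is_odd, items)))                 # list output
print(tuple(filter(lambda x: x > 3, items)))       # tuple output
print(set(filter(lambda x: x % 3 == 0, items)))    # set output
```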
None of the techniques discussed so far changes the organization of the data directly. All these techniques can indirectly change organization through a process of data selection, but that's not the goal of the methods applied. However, sometimes you do need to change the organization of the data. For example, you might need it sorted or grouped based on specific criteria. In some cases, organizing the data can also mean to randomize it in some manner to ensure that an analysis reflects the real world. The following sections discuss the kinds of organization that most people apply to data; also covered is how you can implement sorting using the two languages that appear in this book.
Organization — the forming of any object based on a particular pattern—is an essential part of working with data for humans. The coordination of elements within a dataset based on a particular need is usually the last step in making the data useful, except when other parts of the cleaning process require organization to work properly. How something is organized affects the way in which humans view it, and organizing the object in some other manner will change the human perspective, so often people find themselves organizing datasets one way and then reorganizing them in another. No right or wrong way to organize data exists; you just want to use the approach that works best for viewing the information in a way that helps see the desired pattern.
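As a rough sketch of what sorting, grouping, and randomizing can look like in practice, the following Python code applies all three to one small list. The words list and the seed value are assumptions made for illustration.

```python
import random
from itertools import groupby

words = ["Hello", "Yellow", "Goodbye", "Yes", "No", "Maybe"]   # sample values

# Sorting: order by length so that equal lengths sit next to each other.
by_length = sorted(words, key=len)

# Grouping: groupby merges only adjacent items, hence the sort above.
groups = {size: list(group) for size, group in groupby(by_length, key=len)}

# Randomizing: a seeded generator (42 is arbitrary) gives a repeatable order.
shuffled = random.Random(42).sample(words, k=len(words))

print(by_length)
print(groups)
print(shuffled)
```

Each result is a new object, so the words list keeps its original order throughout.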
https://fractalfoundation.org/resources/what-is-chaos-theory/
for an explanation) finds use in a wide variety of everyday events. In fact, many of today’s sciences rely heavily on the effects of chaos. Data shuffling often enhances the output of algorithms and creates conditions that enable you to see unexpected patterns. Creating a kind of organization through the randomization of data may seem counter to human thought, but it works nonetheless.
Haskell provides a wide variety of sorting mechanisms, such that you probably won’t have to resort to doing anything of a custom nature unless your data is unique and your requirements are unusual. However, getting the native functionality that's found in existing libraries can prove a little daunting at times unless you think the process through first. To start, you need a list that’s a little more complex than others used in this chapter: original = [(1, "Hello"), (4, "Yellow"), (5, "Goodbye"), (2, "Yes"), (3, "No")]
. Use the following code to perform a basic sort:
import Data.List as Dl
sort original
The output is based on the first member of each tuple: [(1,"Hello"),(2,"Yes"),(3,"No"),(4,"Yellow"),(5,"Goodbye")]
. If you want to perform a reverse sort, you can use the following call instead:
(reverse . sort) original
sortBy (\x y -> compare y x) original
The sortBy
function lets you use any comparison function needed to obtain the desired result. For example, you might not be interested in sorting by the first member of the tuple but instead prefer to sort by the second member. In this case, you must use the snd
function from Data.Tuple
(which loads with Prelude) with the comparing
function from Data.Ord
(which you must import), as shown here:
import Data.Ord as Do
sortBy (comparing $ snd) original
sortBy (comparing $ length . snd) original
The call applies comparing
to the result of the composition of snd
, followed by length
(essentially, the length of the second tuple member). The output reflects the change in comparison: [(3,"No"),(2,"Yes"),(1,"Hello"),(4,"Yellow"),(5,"Goodbye")]
. The point is that you can sort in any manner needed using relatively simple statements in Haskell unless you work with complex data.
The examples in this section use the same list as that found in the previous section: original = [(1, "Hello"), (4, "Yellow"), (5, "Goodbye"), (2, "Yes"), (3, "No")]
, and you'll see essentially the same sorts, but from a Python perspective. To understand these examples, you need to know how to use the sort
method, versus the sorted
function. When you use the sort
method, Python changes the original list, which may not be what you want. In addition, sort
works only with lists, while sorted
works with any iterable. The sorted
function produces output that doesn't change the original list. Consequently, if you want to maintain your original list form, you use the following call:
sorted(original)
The output is sorted by the first member of the tuple: [(1, 'Hello'), (2, 'Yes'), (3, 'No'), (4, 'Yellow'), (5, 'Goodbye')]
, but the original list remains intact. Sorting in reverse order requires the use of the reverse
keyword argument, as shown here:
sorted(original, reverse=True)
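The difference between the sorted function and the sort method is easy to check for yourself. The sketch below uses the chapter's list; the variable names as_tuple, working, and returned are introduced only for this illustration.

```python
original = [(1, "Hello"), (4, "Yellow"), (5, "Goodbye"), (2, "Yes"), (3, "No")]

# sorted returns a new list and accepts any iterable, such as a tuple.
as_tuple = tuple(original)
result = sorted(as_tuple)
print(result)
print(original[1])          # still (4, 'Yellow'); the original is untouched

# The sort method reorders a list in place and returns None.
working = list(original)    # copy, so that original survives
returned = working.sort()
print(returned)
print(working == result)
```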
Both Haskell and Python make use of lambda functions to perform special sorts. For example, to sort by the second element of the tuple, you use the following code:
sorted(original, key=lambda x: x[1])
from operator import itemgetter
sorted(original, key=itemgetter(1))
You can also create complex sorts. For example, you can sort by the length of the second tuple element by using this code:
sorted(original, key=lambda x: len(x[1]))
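# This call raises a TypeError; it appears here only to show the problem.
sorted(original, key=len(itemgetter(1)))
The second call fails. Even though itemgetter
is obtaining the key from the second element of the tuple, the itemgetter object itself doesn’t possess a length, so len raises a TypeError. To use the length of the second tuple element, you must work with the tuple directly, as the lambda version does.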
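If you still prefer itemgetter, one possible workaround is to call the getter on each tuple first and then measure the result. This is a sketch rather than the only approach, and the names second and by_length are hypothetical.

```python
from operator import itemgetter

original = [(1, "Hello"), (4, "Yellow"), (5, "Goodbye"), (2, "Yes"), (3, "No")]

second = itemgetter(1)      # callable that returns element 1 of its argument

# Call the getter first, and then measure the string it returns.
by_length = sorted(original, key=lambda pair: len(second(pair)))

print(by_length)
```

The result matches the output of the Haskell sort by length shown earlier in the chapter.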