Other data structures

Python has data structures such as stacks, lists, sets, sequences, tuples, lists, heaps, arrays, dictionaries, and deque. We have already discussed lists while attempting to understand arrays. Tuples are typically more memory efficient than lists because they are immutable.

Stacks

The list method is very convenient to be used as a stack, which is known to be an abstract data type with the principle of operation last-in, first-out. The known operations include adding of an item at the top of the stack using append(), extracting of the item from the top of the stack using pop(), and removing of the item using remove(item-value), as shown in the following code:

stack = [5, 6, 8]
stack.append(6)
stack.append(8)

stack
[5, 6, 8, 6, 8]

stack.remove(8)
stack
[5, 6, 6, 8]

stack.pop()
8

stack.remove(8)
Traceback (most recent call last): 
File "<ipython-input-339-61d6322e3bb8>", line 1, in <module>     stack.remove(8)
  ValueError: list.remove(x): x not in list

The pop() function is most efficient (constant-time) because all the other elements remain in their location. However, the parameterized version, pop(k), removes the element that is at the k < n index of a list, shifting all the subsequent elements to fill the gap that results from the removal. The efficiency of this operation is linear because the amount of shifting depends on the choice of index k, as illustrated in the following image:

Stacks

Tuples

A tuple is a sequence of immutable objects that look similar to lists. Tuples are heterogeneous data structures, which means that their elements have different meanings, whereas lists are a homogeneous sequence of elements. Tuples have structure, and lists have order. Some examples of tuples are days of the week, course names, and grading scales, as shown in the following code:

#days of the week
weekdays = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

#course names
courses = ("Chemistry", "Physics", "Mathematics", "Digital Logic", "Circuit Theory")

#grades
grades = ("A+", "A", "B+", "B", "C+", "C", "I")

Tuples have immutable objects. This means that you cannot change or remove them from tuple. However, the tuple can be deleted completely, for example, "del grades" will delete this tuple. After this, if an attempt is made to use that tuple, an error will occur. The following are the built-in tuple functions:

  • cmp(tup1, tup2): This function can be used to compare the elements of two tuples
  • len(tuple): This function can be used to get the total length of the tuple
  • max(tuple): This function can be used to determine the maximum value in the tuple
  • min(tuple): This function can be used to determine the minimum value in the tuple
  • tuple(lista): This function can be used to convert lista to tuple

Python has a max() function that behaves as expected for numerical values. However, if we pass a list of strings, max() returns the item that is the longest.

weekdays = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
print max(weekdays) 
Wednesday  

Similarly min() has the same behavior for strings.

print min(weekdays) 
Friday

When we need to find how many elements are in an array or list, len() is a convenient method that does the job.

len(weekdays)
7

Sets

Sets are similar to lists, but are different in two aspects. Firstly, they are an unordered collection as compared to lists (which are ordered by location or index). Secondly, they do not have duplicates (if you know the mathematical definition of sets). The notation used for a set is shown in the following command:

setoftrees = { 'Basswood', 'Red Pine', 'Chestnut', 'Gray Birch', 'Black Cherry'} 

newtree = 'Tulip Tree' 
if newtree not in setoftrees:  setoftrees.add(newtree)

Now with this command, you can see what is on setoftrees:

setoftrees  # typing this shows list of elements shown below
{'Basswood', 'Black Cherry', 'Chestnut', 'Gray Birch', 'Red Pine', 'Tulip Tree'}

Then, build charsinmath and charsinchem using the appropriate spelling, as shown in the following code

#example of set of operation on letters
charsinmath = set('mathematics')
charsinchem = set('chem')

Now, let's try to see what the values are in these sets:

Charsinmath # typing this shows letters in charsinmath
{'a', 'c', 'e', 'h', 'i', 'm', 's', 't'}

charsinchem # typing this shows letters in charsinchem
{'c', 'e', 'h', 'm'}

In order to find the set difference, we need to display charsinmath – charsinchem as follows:

# take away letters from charsinchem from charsinmath
charsinmath - charsinchem
{'a', 'i', 's', 't'}

Queues

Just like stacks, it is possible to use a list as a queue. However, the difference is that elements can be added or removed from the end of the list or from the beginning of the list. Although adding and removing from the end of a list is efficient, doing the same from the beginning is not efficient because in this case, elements have to be shifted.

Fortunately, Python has deque in its collections package that efficiently implements the adding and removing of elements from both ends using append(), pop(), appendleft(), and popleft(), as shown in the following code:

from collections import deque

queue = deque(["Daniel", "Sid", "Mathew",  "Michael"]) 
queue.append("Dirk")      # Dirk arrives 
queue.append("Monte")   # Monte arrives queue

queue
deque(['Daniel', 'Sid', 'Mathew', 'Michael', 'Dirk', 'Monte'])

queue.popleft()  
'Daniel'  

queue.pop() 
'Monte'

queue.appendleft('William')
queue
deque(['William', 'Sid', 'Mathew', 'Michael', 'Dirk'])

queue.append('Lastone')
queue
deque(['William', 'Sid', 'Mathew', 'Michael', 'Dirk', 'Lastone'])

Dictionaries

Dictionaries are a collection of unordered data values that are composed of a key/value pair, which has the unique advantage of accessing a value based on the key as an index. The question is that if the key is a string, then how does the indexing work? The key has to be hashable: a hash function is applied on the key to extract the location where the value is stored. In other words, the hash function takes a key value and returns an integer. Dictionaries then use these integers (or hash values) to store and retrieve the value. Some examples are shown here:

#example 1: Top 10 GDP of Africa
gdp_dict = { 'South Africa': 285.4, 'Egypt': 188.4, 'Nigeria': 173, 'Algeria': 140.6, 'Morocco': 91.4, 'Angola': 75.5, 'Libya': 62.3, 'Tunisia': 39.6, 'Kenya': 29.4, 'Ethiopia': 28.5, 'Ghana': 26.2, 'Cameron': 22.2}

gdp_dict['Angola']
75.5

#example 2: English to Spanish for numbers one to ten
english2spanish = { 'one' : 'uno', 'two' : 'dos', 'three': 'tres', 
'four': 'cuatro', 'five': 'cinvo', 'six': 'seis', 'seven': 'seite',  'eight': 'ocho', 'nine': 'nueve', 'ten': 'diez'}

english2spanish['four'] 
'cuatro'

The keys should be immutable to have a predictable hash value; otherwise, the hash value change will result in a different location. Also, unpredictable things could occur. The default dictionary does not keep the values in the order they is inserted; therefore, by iterating after the insertion, the order of the key/value pair is arbitrary.

Python's collections package has an equivalent OrderedDict() function that keeps the order of pairs in the inserted order. One additional difference between the default dictionary and the ordered dictionary is that in the former, equality always returns true if they have an identical set of key/value pairs (not necessarily in the same order), and in the latter, equality returns true only when they have an identical set of key/value pairs and when they are in the same order. The following example demonstrates this:

# using default dictionary
dict = {}

dict['cat-ds1'] = 'Introduction to Data Structures'
dict['cat-ds2'] = 'Advanced Data Structures'
dict['cat-la1'] = 'Python Programming'
dict['cat-la2'] = 'Advanced Python Programming'
dict['cat-pda'] = 'Python for Data Analysis'
dict['cat-ps1'] = 'Data Science in Python'
dict['cat-ps2'] = 'Doing Data Science'

for key, val in dict.items():  print key,val

cat-ps1 Data Science in Python 
cat-ps2 Doing Data Science 
cat-pda Python for Data Analysis 
cat-la2 Advanced Python Programming 
cat-la1 Python Programming 
cat-ds1 Introduction to Data Structures 
cat-ds2 Advanced Data Structures

#using OrderedDict (inserting data the same way as before)
odict = OrderedDict()

odict['cat-ds1'] = 'Introduction to Data Structures'
odict['cat-ds2'] = 'Advanced Data Structures'
odict['cat-la1'] = 'Python Programming'
odict['cat-la2'] = 'Advanced Python Programming'
odict['cat-pda'] = 'Python for Data Analysis'
odict['cat-ps1'] = 'Data Science in Python'
odict['cat-ps2'] = 'Doing Data Science'

for key, val in odict.items():  print key,val

cat-ds1 Introduction to Data Structures 
cat-ds2 Advanced Data Structures 
cat-la1 Python Programming 
cat-la2 Advanced Python Programming 
cat-pda Python for Data Analysis 
cat-ps1 Data Science in Python 
cat-ps2 Doing Data Science

If you have to implement something similar to this, it is computationally better to use the ISBN as the key, rather than the catalog number as in a library. However, there may be old books that do not have an ISBN; therefore, an equivalent unique key/value has to be used to maintain consistency with other new books that have an ISBN number. A hash value is usually a number, and with a numeric key, the hash function might be much easier compared to an alphanumeric key.

Dictionaries for matrix representation

Usually, there are many examples where you can apply dictionaries when there is a key/value association. For example, state abbreviations and names; either one could be the key and the other the value, but it would be more efficient to have the abbreviation as the key. Other examples are word and word count or city names and population. One interesting area of computation where dictionaries could really be efficient is the representation of a sparse matrix.

Sparse matrices

Let's examine the space utilization of a matrix; for a 100 x 100 matrix represented using a list, each element occupies 4 bytes; therefore, the matrix would need 40,000 bytes, which is approximately 40 KB of space. However, among these 40,000 bytes, if only 100 of them have a nonzero value and the others are all zero, then the space is wasted. Now, let's consider a smaller matrix for the simplicity of discussion, as shown in the following image:

Sparse matrices

This matrix has approximately 20 percent of nonzero values; therefore, finding an alternative way to represent the nonzero elements of the matrix would be a good start. There are seven values of 1, five values of 2 and 3 each, and one value of 4, 6, and 7. This could be represented as follows:

A = {1: [(2,2),(6,6), (0,7),(1,8),(7,8),(3,9),(8,9)],
  2: [(5,2),(8,2),(6,3),(0,4),(0,9)],
  3: [(5,0),(8,0),(9,1),(1,3),(5,8)],
  4:[(1,1)], 6:[(2,0)], 7:[(2,5)]} 

However, this representation makes it harder to access the (i,j)th value of A. There is a better way to represent this sparse matrix using dictionary, as shown in the following code:

def getElement(row, col):
    if (row,col) in A.keys():
       r = A[row,col]
    else:
       r = 0
    return r

A={(0,4): 2, (0,7): 1, (1,1): 4, (1,3):3, (1,8): 1, (2,0): 6, (0,9): 2, (2,2):1, (2,5): 7, (3,9): 1, (5,0): 3, (5,2): 2, (5,8): 3, (6,3): 2, (6,6):1, (7,8): 1, (8,0): 3, (8,2): 2, (8,9): 1, (9,1): 3}

print getElement(1,3)
3

print getElement(1,2)
0

To access an element at (1, 3) of the matrix A, we could use A[(1, 3)], but if the key does not exist, it will throw an exception. In order to get the nonzero value using the key and return 0 if the key does not exist, we can use a function called getElement(), as shown in the preceding code.

Visualizing sparseness

We can visually see how sparse the matrix is with the help of SquareBox diagrams. The following image shows the sparseDisplay() function. This uses square boxes for each matrix entry that attempts to view the display. The black color represents sparseness, whereas the green color represents nonzero elements:

Visualizing sparseness

The following code demonstrates how one can display sparseness:

import numpy as np
import matplotlib.pyplot as plt

"""
  SquareBox diagrams are useful for visualizing values of a 2D array,
  Where black color representing sparse areas.  
"""
def sparseDisplay(nonzero, squaresize, ax=None):
    ax = ax if ax is not None else plt.gca()

    ax.patch.set_facecolor('black')
    ax.set_aspect('equal', 'box')
    for row in range(0,squaresize):
      for col in range(0,squaresize):
        if (row,col) in nonzero.keys():
           el = nonzero[(row,col)]
           if el == 0:  color='black' 
           else:  color = '#008000'
           rect = plt.Rectangle([col,row], 1, 1, 
                   facecolor=color, edgecolor=color)
           ax.add_patch(rect)

    ax.autoscale_view()
    ax.invert_yaxis()

if __name__ == '__main__': 
    nonzero={(0,4): 2, (0,7): 1, (1,1): 4, (1,3): 3, (1,8): 1, 
(2,0): 6, (0,9): 2, (2,2): 1, (2,5): 7, (3,9): 1, (5,0): 3, 
(5,2): 2, (5,8): 3, (6,3): 2, (6,6): 1, (7,8): 1, (8,0): 3, (8,2): 2, (8,9): 1, (9,1): 3}
    
    plt.figure(figsize=(4,4))
    sparseDisplay(nonzero, 10)
    plt.show()

This is only a quick example to display the sparse matrix. Imagine that you have a 30 x 30 matrix with only a few nonzero values, then the display would look somewhat similar to the following image. The saving in this case is 97 percent as far as space utilization is concerned. In other words, the larger the matrix, the lesser the space utilized, as shown in the following image:

Visualizing sparseness

Having found a way to store the sparse matrix using dictionary, you may have to remember that there is no need to reinvent the wheel. Moreover, it makes sense to consider the possibility of storing the sparse matrix to understand the power of dictionary. However, what is really recommended is to take a look at the SciPy and pandas package for the sparse matrix. There may be further opportunities in this book to use these approaches in some examples.

Dictionaries for memoization

Memoization is an optimization technique in computational science that enables one to store intermediate results, which otherwise could be expensive. Not every problem needs memoization, but when there is a pattern of computing the same values by calling the function, it is often useful to use this approach. One example where this approach can be used is in the computation of the Fibonacci function using the dictionary to store the already computed value, so next time, you can just search for the value, rather than recompute it again, as shown in the following code:

fibvalues = {0: 0, 1: 1, 2:1, 3:2, 4:3, 5:5}

def fibonacci(n):
    if n not in fibvalues: 
        sumvalue = fibonacci(n-1) + fibonacci(n-2)
        fibvalues[n] = sumvalue  
    return fibvalues[n]

fibonacci(40)
102334155


print sorted(fibvalues.values())
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155]

#regular fibonacci without using dictionary
def fib(n):
   if n <= 1 : return 1
   sumval = fib(n-1)+fib(n-2)
   return sumval

The dictionary of fibvalues is very useful to prevent the recomputation of the values of Fibonacci, but fibcalled is used here only to demonstrate that by using dictionary, there cannot be more than one call to fibonacci() for a particular value of n. By comparing the ratio of the running times for fib() (without using dictionary to store the computed value) and fibonacci(), we can see that when plotted, it looks similar to the following screenshot:

Dictionaries for memoization
from time import time

for nval in range(16,27):
  fibvalues = {0: 0, 1: 1, 2:1, 3:2, 4:3, 5:5}
  t3 = time()
  fibonacci(nval)
  diftime1 = time()-t3
  t2 = time()
  fib(nval)
  diftime2 = time()-t2
  print "The ratio of time-2/time-1 :"+str(diftime2/diftime1)

Tries

Trie (pronounced trie or trai) is a data structure that has different names (digital tree, radix tree, or prefix tree). Tries are very efficient for search, insert, and delete functions. This data structure is very optimal for storage. For example, when the words add, also, algebra, assoc, all, to, trie, tree, tea, and ten are stored in the trie, it will look similar to the following diagram:

Tries

The characters are shown in uppercase just for clarity purposes in the preceding diagram, whereas in real storage, the characters are stored as they appear in words. In the implementation of trie, it makes sense to store the word count. The search functionality is very efficient and in particular when the pattern does not match, the results are even quicker. In other words, if the search is for are, then the failure is determined at the level when the letter r is not found.

One of the popular functionalities is longest prefix matching. In other words, if we were to find all the words in the dictionary that have the longest prefix match with a particular search string: base (for example). The results could be base, based, baseline, or basement, or even more words if they are found in the dictionary of words.

Python has many different implementations: suffix_tree, pytire, trie, datrie, and so on. There is a nice comparison study done by J. F. Sebastian that can be accessed at https://github.com/zed/trie-benchmark.

Most search engines have an implementation of trie called inverted index. This is the central component where space optimization is very important. Moreover, searching for this kind of structure is very efficient to find the relevance between the search string and the documents. Another interesting application of trie is IP routing, where the ability to contain large ranges of values is particularly suitable. It also saves space.

A simple implementation in Python (not necessarily the most efficient) is shown in the following code:

_end = '_end_'

# to search if a word is in trie
def in_trie(trie, word):
     current_dict = trie
     for letter in word:
         if letter in current_dict:
             current_dict = current_dict[letter]
         else:
             return False
     else:
         if _end in current_dict:
             return True
         else:
             return False

#create trie stored with words
def create_trie(*words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict = current_dict.setdefault(_end, _end)
    return root

def insert_word(trie, word):
    if in_trie(trie, word): return

    current_dict = trie
    for letter in word:
            current_dict = current_dict.setdefault(letter, {})
    current_dict = current_dict.setdefault(_end, _end)

def remove_word(trie, word):
    current_dict = trie
    for letter in word:
        current_dict = current_dict.get(letter, None)
        if current_dict is None:
            # the trie doesn't contain this word.
            break
    else:
        del current_dict[_end]

dict = create_trie('foo', 'bar', 'baz', 'barz', 'bar')
print dict
print in_trie(dict, 'bar')
print in_trie(dict, 'bars')
insert_word(dict, 'bars')
print dict
print in_trie(dict, 'bars')
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset