CHAPTER 4: Going Beyond the Basics in Julia


This chapter will help you get comfortable with advanced aspects of the Julia language, and allow you to use it for more customized tasks. We cover:

  • String manipulation basics
  • Custom functions
  • Implementation of a simple algorithm
  • Creating a complete solution

String Manipulation

As numbers are generally easy to handle, the biggest challenge of data engineering often boils down to manipulating strings. Strings also constitute the vast majority of data nowadays, making them practically ubiquitous. In addition, the fact that any data type can be converted to a string in one way or another makes strings the most versatile of data types. As such, strings require special attention.

Although string manipulation is a fairly large field, we’ll delve into the most fundamental functions, and then direct you toward resources for further exploration of the topic. One thing to always keep in mind when handling strings in Julia is that each individual element of a string is of type Char (a character, not a string) and cannot be directly compared to a string.

To access any part of a string, just index the characters you are interested in as if indexing an array, using an integer, a range, or an array of integers. For example:

In[1]: q = "Learning the ropes of Julia"

In[2]: q[14:18]

Out[2]: "ropes"

In[3]: q[23]

Out[3]: 'J'

In[4]: q[[1,6,10,12]] #1

Out[4]: "Lite"

#1 Note that the outer set of brackets are for referencing the q variable, while the inner ones are for defining the array of indexes (characters) to be accessed in that variable. If this seems confusing, try breaking it up into two parts: indexes = [1, 6, 10, 12] and q[indexes].

In the first case we obtain a string output, while in the second we receive a character. Now, let’s look into some more powerful ways of manipulating strings.
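To make the character-versus-string distinction concrete, here is a minimal sketch. It assumes a current Julia 1.x release and an ASCII-only string, so that every index points at a whole character; the variable names are illustrative.

```julia
q = "Learning the ropes of Julia"

word = q[14:18]     # a range index returns a String
letter = q[23]      # an integer index returns a Char

println(typeof(word))
println(typeof(letter))
println(letter == 'J')           # Char compared with Char
println(letter == "J")           # a Char never equals a String
println(string(letter) == "J")   # equal once the Char is converted to a String
```

Converting the character with string() is the simplest way to compare it against a string.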

split()

Syntax: split(S1, S2), where S1 is the string variable to be split, and S2 is the character or string to be used as the separator. S2 can be an empty string ("").

This is a useful command that allows you to turn a string variable into an array of strings, which you can process later in a more systematic way. For example, say you have a sentence (string s) and you want to obtain a list of its words. You can do this easily by typing split(s) or split(s, " "):

In[5]: s = "Winter is coming!";

In[6]: show(split(s))

Out[6]: SubString{ASCIIString}["Winter","is","coming!"]

If you want a list of characters in that sentence, just use "" as a separator string:

In[7]: s = "Julia";

In[8]: show(split(s, ""))

Out[8]: SubString{ASCIIString}["J","u","l","i","a"]

In general, this function is applied with two arguments: the string you want to analyze and the separator string (which will be omitted at the output, naturally). If the second argument is not provided, blank spaces will be used by default. This function is particularly useful for analyzing different pieces of text and organizing text data.
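As a small illustration of the separator argument, the sketch below splits a comma-separated record and trims each piece. The record text is made up for the example, and the code assumes Julia 1.x.

```julia
record = "2016-01-15, Athens, 14.5"

# Split on commas, then remove the leading/trailing spaces of each field
fields = [strip(f) for f in split(record, ",")]

for (i, f) in enumerate(fields)
    println(i, ": ", f)
end

# The separator itself never appears in the output
println(length(fields))
```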

join()

Syntax: join(A, S), where A is an array (of any type of variable), and S is the connector string. S could be an empty string ("").

This function is in essence the opposite of split(), and is handy for concatenating the elements of an array together. All elements of the array will be converted to the string type first, so if there are any Boolean variables in the array they will remain as they appear (instead of turning into 1s and 0s first).

Such a concatenation is rarely useful, though, because the end result is a rather long, incomprehensible string. This is why it is helpful to add a separator string as well. Enter the second argument of the function. So, if you want to join all the elements of an array z by putting a space in between each pair of them, you just need to type join(z, " "). Using the array z from a previous example we get:

In[9]: z = [235, "something", true, -3232.34, 'd', 12345];

In[10]: join(z, " ")

Out[10]: "235 something true -3232.34 d 12345"
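Because join() undoes split(), the two are often used as a pair. The following sketch (Julia 1.x assumed; the sentence is illustrative) rebuilds a sentence with a different connector and then joins an array of mixed values.

```julia
s = "Winter is coming!"

words = split(s)              # split on whitespace
snake = join(words, "_")      # put an underscore between the words
println(snake)

# Non-string elements are converted to strings before being joined
z = [235, "something", true, -3232.34, 'd', 12345]
println(join(z, " "))
```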

Regex functions

Syntax: r"re", where re is some regular expression code.

Unlike other languages, Julia doesn’t have a particularly rich package for string manipulation. That’s partly because Julia has a built-in regular expressions package that can handle all kinds of tasks involving string search, which is undoubtedly the most important part of string manipulation.

We have already seen how it is possible to find various parts of a string if we know the indexes of the characters involved. However, most of the time this information is not available; we need to parse the string intelligently to find what we are looking for. Regular expressions make this possible through their unique way of handling patterns in these strings.

We will not go into much depth on how regex objects are created, as this involves a whole new language. It would behoove you to look into that on your own, though, by spending some time learning the intricacies of regex structures on websites like http://www.rexegg.com and http://bit.ly/1mXMXbr. Once you get the basics down, you can practice the pure regex code in interactive regex editors like http://www.regexr.com and http://www.myregexp.com.

Unlike other aspects of programming, it is not essential to have a mastery of regexes to make good use of them. Many useful pieces of regex code are already available on the web, so you do not have to build them from scratch. And since they are universal in the programming world, you can use them as-is in Julia.

Once you have a basic familiarity with regexes, you can see how Julia integrates them gracefully in the corresponding (built-in) regex functions, the most important of which will be described in this section. Before you do anything with a regex function, though, you need to define the regex as a corresponding object. For instance: pattern = r"([A-Z])\w+" would be a regex for identifying all words starting with a capital letter (in a Latin-based language). Notice the r part right before the regex string (in double quotes); this denotes that the string that follows is actually a regular expression and that Julia should therefore interpret it as such.

Although most of the tasks involving regex can be accomplished by some combination of conventional string searching code, those methods may not be the most advisable. Using conventional string searching code for such a task would entail immense overhead, carry great risk of making a mistake, generate incomprehensible code, and ultimately compromise performance. Therefore, it is worth investing some time in learning regex. Remember, you don’t need to master them before you can find them useful. We recommend you start with simpler regexes and develop them further as you get more experienced in this topic. Also, you can always get some pre-made regexes from the web and alter them to meet your needs.

ismatch()

Syntax: ismatch(R, S), where R is the regex you wish to use, and S is the string variable you wish to apply it on. R has to be prefixed with the letter r (e.g. r"[0-9]").

This is a useful regex function that performs a check on a given string for the presence or absence of a regex pattern. So, if we had the string s = "The days of the week are Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday." and the regex p = r"([A-Z])\w+" from before, we could use ismatch() as follows:

In[11]: ismatch(p, s)

Out[11]: true

In[12]: ismatch(p, "some random string without any capitalized words in it")

Out[12]: false

As you probably have guessed, ismatch() returns a Boolean value, i.e. “true” or “false.” This you can store in a corresponding variable, or use as-is in a conditional, to save some time.
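A check like this is most often used inside a conditional. The sketch below assumes Julia 1.x, where the same test is written occursin(R, S) instead of ismatch(R, S); the sample lines are invented for illustration.

```julia
p = r"([A-Z])\w+"   # a capital letter followed by word characters

lines = ["meeting moved to Friday",
         "nothing capitalized here",
         "ask Julia about it"]

# Keep only the lines that contain at least one capitalized word
flagged = String[]
for line in lines
    if occursin(p, line)
        push!(flagged, line)
    end
end

println(flagged)
```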

match()

Syntax: match(R, S, ind), where R is the regex that you wish to use, S is the string variable you wish to apply it on, and ind is the starting point of your search. The final argument is optional and has the default value of 1 (i.e. the beginning of the string).

Once you have confirmed that a certain character pattern exists within a string using the ismatch() function, you may want to dig deeper and find out which substring corresponds to this pattern and where exactly this substring dwells. In this case, match() is the function you need. So, for the aforementioned example with the days of the week, you can apply match() in the following way:

In[13]: m = match(p, s)

Out[13]: RegexMatch("The", 1="T")

In[14]: m.match

Out[14]: "The"

In[15]: m.offset

Out[15]: 1

As you can see from this example, the output of match() is an object that contains information about the first match of the regex pattern in the given string. The most important parts of it are the actual matching substring (.match part of the object) and its position in the original string (.offset part of the object). You can access these attributes of the object by referring to them in the following format: ObjectName.AttributeName.
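One detail worth knowing is what happens when there is no match. In Julia 1.x (assumed here), match() then returns nothing, so it is safest to test for that before reading the fields. The helper name first_capitalized is hypothetical.

```julia
p = r"([A-Z])\w+"

# Return the first capitalized word, its position, and its captured initial,
# or nothing if the text contains no capitalized word
function first_capitalized(text::AbstractString)
    m = match(p, text)
    m === nothing && return nothing
    return (word = String(m.match), position = m.offset, initial = m.captures[1])
end

println(first_capitalized("The days of the week"))
println(first_capitalized("no capitals at all"))
```

The captures field holds whatever the parenthesized groups of the regex matched, here the capital letter itself.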

matchall()

Syntax: matchall(R, S), where R is the regex that you wish to use, and S is the string variable you wish to apply it on.

You’ll often need more than just the first match of a pattern. In the example used previously, it is clear that there are several words in the original string (s) that fit the given regex pattern (p). What if we need all of them? That’s where the matchall() function comes in handy. By applying it to the original string containing the days of the week, you can obtain all of the matching substrings, in the form of an array:

In[16]: matchall(p, s)

Out[16]: 8-element Array{SubString{UTF8String},1}:
 "The"
 "Monday"
 "Tuesday"
 "Wednesday"
 "Thursday"
 "Friday"
 "Saturday"
 "Sunday"

Although this may seem like a trivial example, it can be useful when trying to find names or other special words (e.g. product codes) within a text, in a simple and efficient way.

eachmatch()

Syntax: eachmatch(R, S), where R is the regex that you wish to use, and S is the string variable you wish to apply it on.

This function allows you to parse all the match objects of the string, as if you were to call match() on each one of them. This makes the whole process of searching and outputting results efficient and allows for cleaner and faster code. To make the most of this function, you need to incorporate it into a for-loop, so that you can access all of the found elements. So, if we were to employ the previous example with the days of the week string s, and the capitalized words regex p, we would type:

In[17]: for m in eachmatch(p, s)
    println(m.match, " - ", m.offset)
end

Out[17]: The - 1
Monday - 26
Tuesday - 34
Wednesday - 43
Thursday - 54
Friday - 64
Saturday - 72
Sunday - 86

This simple program provides you with all the matching strings and their corresponding locations. With minor adjustments you can store this information in arrays and use it in various ways.
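The minor adjustments mentioned above could look like the following sketch, which collects the matches and their positions into two arrays. It assumes Julia 1.x; there, matchall() is no longer available and a comprehension over eachmatch() plays its role.

```julia
s = "The days of the week are Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday."
p = r"([A-Z])\w+"

words = String[]
offsets = Int[]
for m in eachmatch(p, s)
    push!(words, m.match)      # the SubString is converted to a String on push!
    push!(offsets, m.offset)
end

# One-line equivalent of the loop, for the matched substrings only
all_matches = [m.match for m in eachmatch(p, s)]

println(words)
println(offsets)
```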

Custom Functions

Function structure

Although Julia offers a large variety of built-in functions, the day will come when you need to create your own. When performing custom tasks, you can save yourself some time by tailoring existing functions to your needs. The good thing about Julia is that even these custom functions are super-fast. In order to create your own function you need to use the following structure:

function name_of_function(variable1::type1, variable2::type2, ...)

  [some code]

  return output_variable_1, output_variable_2, ...    #A

end

#A return(output) is also fine

You can have as many (input) variables as you want, including none at all, and the type of each argument is optional. It is good practice, however, to include the type of each input variable; it documents what the function expects and lets Julia choose among several methods of the same name (multiple dispatch).

If you are using arrays as inputs, you can also include the type of the elements of these arrays in curly brackets after the array term (this can also be applied to other variables too):

function name_of_function{T <: Type}(var1::Array{T, NumberOfDimensions}, var2::T, ...)

  [some code]

  return output

end

The output of the function (which is also optional) is provided using the return() command (which can be used with or without the brackets).

For simpler functions, you can just use the one-liner version instead:

res(x::Array) = x - mean(x)
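To see the general structure at work, consider the following sketch of a function with a typed input and more than one output. It assumes Julia 1.x, where mean() and std() live in the Statistics standard library; the function name standardize is made up for the example.

```julia
using Statistics

# Rescale a numeric vector to zero mean and unit standard deviation,
# returning the rescaled values together with the two statistics used
function standardize(x::Vector{Float64})
    m = mean(x)
    s = std(x)
    return (x .- m) ./ s, m, s
end

z, m, s = standardize([1.0, 2.0, 3.0])
println(z)   # rescaled values
println(m)   # mean of the input
println(s)   # standard deviation of the input
```

Several outputs are returned as a tuple, which the caller can unpack into separate variables in one assignment.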

Anonymous functions

If you want to create a function that you are going to use only once (or you are just paranoid that someone else might use that same function without your permission!), you can use what are known as “anonymous functions.” Simply put, these are functions that have no name; they are typically written right where they are used (for example, as an argument to another function), and unless you assign one to a variable you cannot refer to it again afterward. Here is an example to clarify this concept:

In[18]: mx = mean(X)

x -> x - mx

This simple function from before is now broken into two parts: calculating the mean of variable X and subtracting it from each element x. This function doesn’t have a name, hence its category (anonymous function). The idea is that it’s not going to be around for long, so we needn’t bother naming it.

Anonymous functions are fairly common, at least among more experienced programmers. They are often used for applying a transformation to a bunch of values in an array, as we previously discussed (see map() function). Specifically:

In[19]: X = [1,2,3,4,5,6,7,8,9,10,11];

     mx = mean(X);

     show(map(x -> x - mx, X))

Out[19]: [-5.0,-4.0,-3.0,-2.0,-1.0,0.0,1.0,2.0,3.0,4.0,5.0]
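Anonymous functions are handy with other higher-order functions too. The sketch below (Julia 1.x assumed) passes one to filter() and also binds one to a variable so that it can be reused.

```julia
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Keep the even elements; the test is written inline and never named
evens = filter(x -> x % 2 == 0, X)
println(evens)

# Binding an anonymous function to a variable makes it reusable
square = x -> x^2
println(map(square, evens))
```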

Multiple dispatch

Multiple dispatch refers to using the same function for different data types through completely different methods. In other words, a function fun(a::Int) can incorporate a completely different process than fun(a::String), even though both functions have the same name. This is particularly useful if you want to make your function versatile and not have to remember a dozen names of its variants. Multiple dispatch allows for more intuitive code and is widely used in both the base and the auxiliary packages in Julia. So, for the example of the residuals function in the previous sections, we can also define it for single numbers:

In[20]: res(x::Number) = x

Out[20]: res (generic function with 2 methods)

Julia recognizes that this function already exists for array inputs, and sees this new definition as a new method for its use. Now, the next time you call it, Julia will try to match your inputs to the right method:

In[21]: show(res(5))

Out[21]: 5

In[22]: show(res([1,2,3,4,5,6,7,8,9,10,11]))

Out[22]: [-5.0,-4.0,-3.0,-2.0,-1.0,0.0,1.0,2.0,3.0,4.0,5.0]

Multiple dispatch can be useful when creating (or extending) generic functions that can have any kind of inputs (e.g. length()). However, it requires some careful attention; you can easily get lost in the variety of functions with the same name, as you are building something and running it again and again while refining it. If you are creating a new function and change its inputs ever so slightly, Julia will recognize it as a totally different function. This can be an issue when debugging a script, which is why we recommend you restart the kernel whenever you get into this situation.

Here’s an example of a typical case where multiple dispatch could be handy: you have created a function that you’ve made too specific (e.g. fun(Array{Float64, 1}) ) and you try to run it on inputs that don’t fit its expectations (e.g. [1,2,3], an array of integers). In this case you could simply create another function, fun(Array{Int64, 1}), that is equipped to handle that particular input, making your function more versatile.
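The too-specific-method situation can be sketched as follows, assuming Julia 1.x. The function names total and total_any are hypothetical. Rather than adding one method per element type, the last definition uses an abstract element type so that a single method covers both integer and floating-point vectors.

```julia
# A method that is too specific: it only accepts vectors of Float64
total(x::Vector{Float64}) = sum(x)

# total([1, 2, 3]) would raise a MethodError at this point,
# because a Vector{Int} is not a Vector{Float64}

# Option 1: add a second method for integer vectors
total(x::Vector{Int}) = sum(x)

# Option 2: one method for any vector whose elements are real numbers
total_any(x::AbstractVector{<:Real}) = sum(x)

println(total([1.0, 2.0, 3.0]))
println(total([1, 2, 3]))
println(total_any([1, 2.5, 3]))
println(length(methods(total)))   # number of methods defined for total
```

The second option trades a little specificity for much less code to maintain, which is usually the better bargain.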

Function example

Let’s look now at a simple example of a custom function, putting together some of the things we’ve examined. This function, which is described below, calculates the Hamming distance between two strings X and Y. The objective of the function is described in a brief comment right after the declaration of the function, using the “#” character. In general, you can put this character wherever you want in the program and be assured that whatever comes after it is not going to be checked by Julia.

In[23]: function hdist(X::AbstractString, Y::AbstractString)
    # Hamming distance between two strings
    lx = length(X)
    ly = length(Y)
    if lx != ly                          #A
        return -1
    else                                 #B
        lettersX = split(X, "")          #C
        lettersY = split(Y, "")          #C
        dif = (lettersX .!= lettersY)    #D
        return sum(dif)
    end
end

#A strings are of different length

#B strings are of the same length

#C get the individual letters of each string

#D create a (binary) array marking the positions where the letters differ

Remember that if you want to use more generic or abstract types in your arrays (e.g. Real for real numbers, Number for all kinds of numeric variables) you must define them before or during the declaration of the function. For example, if you want the above function to work on any kind of string, define it as follows:

function hdist{T <: AbstractString}(X::T, Y::T)

Pay close attention to this rule, as it will save you a lot of frustration when you start building your own functions. Otherwise, you’ll be forced to rely heavily on multiple dispatch to cover all possibilities, making the development of a versatile function a somewhat time-consuming process.
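For comparison, here is a compact sketch of the Hamming distance that avoids building intermediate arrays by walking over both strings in parallel. It assumes Julia 1.x and keeps the convention of returning -1 for strings of unequal length; the name hamming is chosen to avoid clashing with hdist().

```julia
function hamming(x::AbstractString, y::AbstractString)
    # Strings of different length have no Hamming distance
    length(x) == length(y) || return -1
    # Count the positions at which the two strings disagree
    return count(pair -> pair[1] != pair[2], zip(x, y))
end

println(hamming("karolin", "kathrin"))
println(hamming("Julia", "Julia"))
println(hamming("short", "longer"))
```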

To call a function you must use the form function_name(inputs). If you just type function_name, Julia will interpret it as a wish to obtain a high-level view of the given function (which is typically not very useful). You can view more details about the function, including all of its versions (or “methods”) and which inputs they require, by using methods(function_name). For example:

In[24]: methods(median)

Out[24]: # 2 methods for generic function "median":

median{T}(v::AbstractArray{T,N}) at statistics.jl:475

median{T}(v::AbstractArray{T,N},region) at statistics.jl:477

Implementing a Simple Algorithm

Now, let’s look into how skewness can be implemented elegantly and efficiently in Julia. As you may already know from statistics, skewness is a useful measure for providing insights about the nature of a distribution. Here we’ll discuss the type of the skewness: whether it’s negative, positive, or zero.

Finding the type of skewness boils down to comparing the mean and the median of the distribution. As these metrics can apply to all kinds of numbers, the input will have to be an array, having a single dimension. So, the skewness type program will be something like the function in Listing 4.1.

In[25]: function skewness_type(X::Array)   #A
    m = mean(X)                            #B
    M = median(X)                          #C
    if m > M                               #D
        output = "positive"
    elseif m < M
        output = "negative"
    else
        output = "balanced"
    end
    return output                          #E
end

#A Function definition. This method applies to all kinds of arrays.

#B Calculate the arithmetic mean of the data and store it in variable m

#C Calculate the median of the data and store it in variable M

#D Compare mean to median

#E Output the result (variable “output”)

Listing 4.1 An example of a simple algorithm implemented in Julia: skewness type.

Although the above program will work with all of the single-dimensional arrays we give it, multi-dimensional arrays will confuse it. To resolve this issue, we can be more specific about what inputs it will accept by changing the first line to:

function skewness_type{T<:Number}(X::Array{T, 1})

We can test this function using various distributions, to ensure that it works as expected:

In[26]: skewness_type([1,2,3,4,5])

Out[26]: "balanced"

In[27]: skewness_type([1,2,3,4,100])

Out[27]: "positive"

In[28]: skewness_type([-100,2,3,4,5])

Out[28]: "negative"

In[29]: skewness_type(["this", "that", "other"])

Out[29]: ERROR: `skewness_type` has no method matching skewness_type(::Array{ASCIIString,1})

In[30]: A = rand(Int64, 5, 4); skewness_type(A)

Out[30]: ERROR: `skewness_type` has no method matching skewness_type(::Array{Int64,2})
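A sketch of the same function for a current Julia 1.x release is given below. Two things are assumed to differ from the listing above: mean() and median() come from the Statistics standard library, and the element-type restriction is written inline as Vector{<:Number} rather than with curly braces after the function name.

```julia
using Statistics

function skewness_type(X::Vector{<:Number})
    m = mean(X)     # arithmetic mean
    M = median(X)   # median
    if m > M
        return "positive"
    elseif m < M
        return "negative"
    else
        return "balanced"
    end
end

println(skewness_type([1, 2, 3, 4, 5]))
println(skewness_type([1, 2, 3, 4, 100]))
println(skewness_type([-100, 2, 3, 4, 5]))

# Inputs of the wrong shape or element type are rejected with a MethodError
try
    skewness_type(["this", "that", "other"])
catch err
    println(typeof(err))
end
```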

Creating a Complete Solution

Now, let’s say that we need to create a more sophisticated program, involving more than a single function, in order to handle the missing values in a dataset. This is the time to demonstrate how various functions can be integrated into a whole (referred to as “a solution”), how we can design the solution effectively, and how we can develop the corresponding functions. To achieve this, we will need to split the problem into smaller ones and solve each one as a separate routine. For example, say that we need to fill in the missing values of a dataset using either the mode or the median, for discrete and continuous variables, respectively. One way of structuring the solution would be to build the workflow shown in Figure 4.1.

In this solution we employ the following functions, all of which are custom-built and aim to fulfill a particular purpose:

has_missing_values() – a function to check whether a feature has missing values. The input will need to be a one-dimensional array containing elements of type Any, while the output will be a Boolean (“true” if the feature has missing values and “false” otherwise). This will be used to assess whether a feature contains one or more missing values.

feature_type() – a function to assess whether a feature is discrete or continuous. The input here will need to be the same as in the previous function, while the output will be a string variable labeling the feature as either discrete or continuous. This function is essential in figuring out the approach to take when filling in missing values for a feature.


Figure 4.1 The workflow of one solution for handling missing values in a dataset.

mode() – a function for finding the mode of a discrete variable. Although the function for calculating the mode already exists in the Distributions package, here we’ll build it from scratch for additional practice. The input of this function will need to be the same as in the previous two functions, while the output will be a number denoting the mode of that variable. This simple function will provide us with the most common element in a given discrete feature.

main() – the wrapper program that will integrate all the auxiliary programs together. The input will be a two-dimensional array with elements of type Any, while the output will be a similar array with all the missing values filled in, based on the types of the features involved. This is the function that the end-user will use in order to process a dataset containing missing values.

In this solution, the assumption made about missing values is that they are blank values in the data matrix. With that in mind, one solution is provided in Listings 4.2 - 4.5. All the empty lines and the indentation are not essential but make the code easier to read.

In[31]: function mode{T<:Any}(X::Array{T})
    ux = unique(X)           #1
    n = length(ux)           #2
    z = zeros(n)             #3
    for x in X               #4
        ind = findin(ux, x)  #5
        z[ind] += 1          #6
    end
    m_ind = findmax(z)[2]    #7
    return ux[m_ind]         #8
end

#1 find the unique values of the given Array X

#2 find the number of unique values (the length of Array ux)

#3 create a 1-dim array of n zeros to act as counters

#4 iterate over all elements in Array X

#5 find which unique value x corresponds to

#6 increase the counter for that unique value

#7 get the index of the largest counter in Array z

#8 output the mode of X

Listing 4.2 Code for an auxiliary function of the missing values solution: mode().
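An alternative way to count occurrences is a dictionary keyed by value, which needs only one pass over the data to build the counts. The sketch below assumes Julia 1.x; the name mode_dict is hypothetical, and ties are resolved in favor of the value that appears first in the input.

```julia
function mode_dict(X::AbstractVector)
    counts = Dict{eltype(X), Int}()
    for x in X
        counts[x] = get(counts, x, 0) + 1   # increase the counter for x
    end
    best = first(X)
    for x in X
        # a strictly larger count is required, so earlier values win ties
        if counts[x] > counts[best]
            best = x
        end
    end
    return best
end

println(mode_dict([3, 1, 3, 2, 1, 3]))
println(mode_dict(["b", "a", "b", "a"]))
```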

In[32]: function missing_values_indexes{T<:Any}(X::Array{T})
    ind = Int[]              #1
    n = length(X)            #2
    for i = 1:n              #3
        if isempty(X[i])     #4
            push!(ind, i)    #5
        end
    end
    return ind               #6
end

#1 create empty Int array named “ind” for “index”

#2 find the number of elements in Array X

#3 repeat n times (i = index variable)

#4 check if there is a missing value in this location of the X array

#5 missing value found. Add its index to “ind” array

#6 output index array “ind”

Listing 4.3 Code for an auxiliary function of the missing values solution: missing_values_indexes().
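The Boolean check described earlier as has_missing_values() is not among the listings here, but it could be sketched along the following lines. This assumes Julia 1.x and, as in the rest of the solution, that a missing value shows up as an empty string; the helper names is_blank and blank_indexes are hypothetical.

```julia
# A value counts as missing here if it is an empty string
is_blank(x) = x isa AbstractString && isempty(x)

# true if at least one element of the feature is blank
has_missing_values(X::AbstractVector) = any(is_blank, X)

# Indexes of the blank elements, for comparison with the listing above
blank_indexes(X::AbstractVector) = findall(is_blank, X)

feature = Any[1.5, "", 3.0, "", 2.5]
println(has_missing_values(feature))
println(blank_indexes(feature))
println(has_missing_values([1, 2, 3]))
```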

In[33]: function feature_type{T<:Any}(X::Array{T})
    n = length(X)                                     #1
    ft = "discrete"                                   #2
    for i = 1:n                                       #3
        if length(X[i]) > 0                           #4
            tx = string(typeof(X[i]))                 #5
            if tx in ["ASCIIString", "Char", "Bool"]  #6
                ft = "discrete"                       #7
                break                                 #8
            elseif contains(tx, "Float")              #9
                ft = "continuous"                     #10
            end
        end
    end
    return ft                                         #11
end

#1 find the number of elements in array X

#2 set feature type variable to one of the possible values

#3 do n iterations of the index variable i

#4 check if the i-th element of X isn’t empty

#5 get the type of that element

#6 is its type one of these types?

#7 feature X is discrete

#8 exit the loop

#9 is the type some kind of Float?

#10 feature X is continuous (for the time being)

#11 output feature type variable

Listing 4.4 Code for an auxiliary function of the missing values solution: feature_type().

In[34]: function main{T<:Any}(X::Array{T})
    N, n = size(X)                            #1
    Y = Array(T,N,n)                          #2
    for i = 1:n                               #3
        F = X[:,i]                            #4
        ind = missing_values_indexes(F)       #5
        if length(ind) > 0                    #6
            ind2 = setdiff(1:N, ind)          #7
            if feature_type(F) == "discrete"  #8
                y = mode(F[ind2])             #9
            else
                y = median(F[ind2])           #10
            end
            F[ind] = y                        #11
        end
        Y[:,i] = F                            #12
    end
    return Y                                  #13
end

#1 get the dimensions of array X

#2 create an empty array of the same dimensions and of the same type

#3 do n iterations

#4 get the i-th feature of X and store it in F

#5 find the indexes of the missing values of that feature

#6 feature F has at least one missing value

#7 indexes having actual values

#8 is that feature a discrete variable?

#9 calculate the mode of F and store it in y

#10 calculate the median of F and store it in y

#11 replace all missing values of F with y

#12 store F in Y as the i-th column

#13 output array Y

Listing 4.5 Code for the main function of the missing values solution: main().

As you would expect, there are several ways to implement this solution, some more elegant than others. This one could have been made with fewer lines of code, but sometimes it is worthwhile to sacrifice brevity for the sake of having more comprehensible code. We encourage you to come back to this chapter and try out this solution on various datasets to see how it works, and try to figure out the details of the newly introduced functions.

You can always apply each one of the components of this solution independently, although in order to run the main() function you must have loaded the auxiliary functions into Julia first (by inputting the corresponding pieces of code). To test these functions, make up a simple dataset of your own (ideally a variety of arrays, each of a different type) and run the main() function on it:

data = readdlm("my dataset.csv", ',')

main(data)

If you have time, we recommend you test each one of the functions separately, to make sure you understand its functionality. Once you become more experienced with Julia, you can come back to this solution and see if you can find ways to improve it or refactor it in a way that makes more sense to you.
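For readers on a current release, the whole workflow might be condensed as in the sketch below. It is one possible arrangement, not a replacement for Listings 4.2 - 4.5, and it rests on several assumptions: Julia 1.x, the Statistics standard library for median(), blank values stored as empty strings, and a feature treated as continuous only when all of its non-blank values are floating-point numbers. The names is_blank, most_common, and fill_blanks are hypothetical.

```julia
using Statistics

# A value counts as missing here if it is an empty string
is_blank(x) = x isa AbstractString && isempty(x)

# Most frequent value; the value seen first wins a tie
function most_common(v::AbstractVector)
    u = unique(v)
    counts = [count(y -> y == x, v) for x in u]
    return u[argmax(counts)]
end

function fill_blanks(X::AbstractMatrix)
    Y = copy(X)                                # leave the input untouched
    for j in axes(X, 2)                        # one feature (column) at a time
        col = X[:, j]
        blank = findall(is_blank, col)
        present = col[setdiff(eachindex(col), blank)]
        (isempty(blank) || isempty(present)) && continue
        if all(x -> x isa AbstractFloat, present)
            fill_value = median(Float64[x for x in present])   # continuous feature
        else
            fill_value = most_common(present)                  # discrete feature
        end
        Y[blank, j] .= fill_value
    end
    return Y
end

data = Any[1.5 "red"; "" "blue"; 2.5 "red"; 4.0 ""]
println(fill_blanks(data))
```

Working on a copy keeps the original dataset available for comparison after the blanks are filled.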

Summary

  • String manipulation in Julia takes place mainly through the use of regex functions, such as match(), matchall(), and eachmatch().
  • Regexes in Julia are prefixed by “r”. For example: r"([A-Z])\w+" is a regex for identifying words starting with a capital letter.
  • When defining a function that makes use of a generic or abstract type, you must define it beforehand (usually right before the input arguments of the function). This is done as follows: function F{T <: TypeName}(X::Array{T}), where TypeName is the name of the type (with first letter capitalized, as always).
  • When developing a complete solution, it is a good idea to create a workflow of the algorithm that you are planning to implement, making a list of all necessary functions. The auxiliary functions need to be loaded into memory before the wrapper (main) function is executed.

Chapter Challenge

  1. Can you use the same function for processing completely different types of data? If so, which feature of the language would you make use of?
  2. Consider the function hdist() from previously. Why doesn’t it work with the inputs 'a', 'b'? Shouldn’t these have a distance of 1?
  3. Is it possible to extend the function mode() from before so that it can handle an input like 234 (a single number instead of an array) and return that input as an output? What characteristic of Julia would such a modification make use of?
  4. Write a simple function that counts the number of words in a given text (assume that there are no line breaks in this text). Once you are done, test it with a couple of cases (you can take a random paragraph from a blog post or an ebook) and evaluate its performance.
  5. Write a simple function that counts the number of characters in a given text and calculates the proportion of the ones that are not spaces.
  6. Write a complete solution that takes an array of numbers (you can assume they are all floats) and provides you with a distribution of all the digits present in that array (i.e. how many 0s there are, how many 1s, and so on). What’s the most common digit out there? The simplest way to implement this is through a function that counts how many characters x exist in a given string, and a wrapper method that accumulates and outputs all the stats. You may use additional auxiliary functions if you find them helpful.