Lesson 21
Pulling It All Together: Word Analysis in Go

In this lesson, you will apply what you've learned up to this point to perform a common text analysis process using Go. Specifically, you will build a program that takes a dataset of e-commerce reviews and analyzes them to calculate the number of times each word appears.

EXAMINING THE DATA

When you start any data analysis process, the first step is to ensure that the data is in a format that your system can use and that the data is available for use. For our project, you will need to download the data. You'll use the Digital Music review set from Julian McAuley's Amazon product data website, which can be found at http://jmcauley.ucsd.edu/data/amazon. The data is in the file reviews_Digital_Music_5.json.gz and will need to be extracted. The extracted file, reviews.json, can also be found in the downloadable zip file for this book at www.wiley.com/go/jobreadygo.

This file is in a modified JSON format. If you open the extracted file using any text editor, the first two records look like this:

{"reviewerID": "A3EBHHCZO6V2A4", "asin": "5555991584", "reviewerName": "Amaranth "music fan"", "helpful": [3, 3], "reviewText": "It's hard to believe "Memory of Trees" came out 11 years ago;it has held up well over the passage of time.It's Enya's last great album before the New Age/pop of "Amarantine" and "Day without rain." Back in 1995,Enya still had her creative spark,her own voice.I agree with the reviewer who said that this is her saddest album;it is melancholy,bittersweet,from the opening title song."Memory of Trees" is elegaic&majestic.;"Pax Deorum" sounds like it is from a Requiem Mass,it is a dark threnody.Unlike the reviewer who said that this has a "disconcerting" blend of spirituality&sensuality;,I don't find it disconcerting at all."Anywhere is" is a hopeful song,looking to possibilities."Hope has a place" is about love,but it is up to the listener to decide if it is romantic,platonic,etc.I've always had a soft spot for this song."On my way home" is a triumphant ending about return.This is truly a masterpiece of New Age music,a must for any Enya fan!", "overall": 5.0, "summary": "Enya's last great album", "unixReviewTime": 1158019200, "reviewTime": "09 12, 2006"}
{"reviewerID": "AZPWAXJG9OJXV", "asin": "5555991584", "reviewerName": "bethtexas", "helpful": [0, 0], "reviewText": "A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety. Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice. But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe. But if you're already a fan, then your collection is not complete without this beautiful work of musical art.", "overall": 5.0, "summary": "Enya at her most elegant", "unixReviewTime": 991526400, "reviewTime": "06 3, 2001"}

Each record is enclosed in curly brackets ({ }), and the records are separated by new lines. In standard JSON, the records would instead be collected into a single array enclosed in square brackets ([ ]) and separated by commas. Your code will need to take this into account when you import the data to be analyzed.

The fields in each record include a name and value using a colon (:) as the separator:

"reviewerID": "A3EBHHCZO6V2A4"

The fields are separated by commas:

"reviewerID": "A3EBHHCZO6V2A4", "asin": "5555991584"

For this analysis, we are most interested in the reviews themselves. The review of the first record looks like the following:

"reviewText": "It's hard to believe "Memory of Trees" came out 11 years ago;it has held up well over the passage of time.It's Enya's last great album before the New Age/pop of "Amarantine" and "Day without rain." Back in 1995,Enya still had her creative spark,her own voice.I agree with the reviewer who said that this is her saddest album;it is melancholy,bittersweet,from the opening title song."Memory of Trees" is elegaic&majestic.;"Pax Deorum" sounds like it is from a Requiem Mass,it is a dark threnody.Unlike the reviewer who said that this has a "disconcerting" blend of spirituality&sensuality;,I don't find it disconcerting at all."Anywhere is" is a hopeful song,looking to possibilities."Hope has a place" is about love,but it is up to the listener to decide if it is romantic,platonic,etc.I've always had a soft spot for this song."On my way home" is a triumphant ending about return.This is truly a masterpiece of New Age music,a must for any Enya fan!"

You can see that in addition to words, the text includes punctuation. While spaces are one option for separating and identifying distinct words in the data, you will also need to treat punctuation as a word separator.

Because the focus of our project is to count the number of times each word appears, the raw data presents another problem: some words are capitalized and others are not. Because string comparisons in Go are case-sensitive, you will also want to normalize the text to lowercase so that both hope and Hope are counted as the same word.

READING THE REVIEW DATA

Now that you have looked at the data and identified what you want your code to do, you're ready to start writing code.

In the previous lessons, you didn't need to read JSON files, but the process is similar to reading any other text file. Let's look at the function read_json_file in Listing 21.1.

The read_json_file function takes the file path as input. The function uses the Open function from the os package to open the file.

Next, the code checks to see if an error occurred. If that's the case, you log/display the error and the program is terminated. Otherwise, the file is valid, and you can proceed to read it.

Because you are opening a file, you also want to make sure it gets closed. The listing defers closing the file until the function finishes executing.

To read the file, you will leverage the NewScanner function from the bufio package to create a scanner called scanner. If you look closely, each review in the JSON file is on a separate line, so you will need to split the text based on lines using the scanner's Split method.
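As a reference point, a minimal sketch of what the function might look like at this stage follows. This is not Listing 21.1 itself, which may differ in its details; it assumes the bufio, log, and os packages are imported:

func read_json_file(filepath string) {
    file, err := os.Open(filepath)
    if err != nil {
        log.Fatal(err) // log the error and terminate the program
    }
    // Defer closing the file until the function finishes executing.
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines) // split the input on new lines
}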

At this point, all the function does is read the file and split the JSON file based on new lines. The next step will be to iterate and scan the lines in the text. Add the code shown in Listing 21.2 to the same read_json_file function.

The code adds a for loop that iterates and scans each line using the Scan function, and then it prints each line. Note that in the listing, you have commented out the Println function. You will need to uncomment it (remove the //) to see the data actually print. The entire contents of the file will be printed, which can take a while.
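A sketch of the loop added to the end of read_json_file might look like this, with the print statement commented out as described:

for scanner.Scan() { // advance the scanner to the next line
    // fmt.Println(scanner.Text()) // uncomment to print each line
}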

It's important to remember that at this point, you are still reading each review as a string: the raw JSON text of the review. You will have to convert that string into a representation where you can access all the attributes of the review easily. In this case, using structs is the best choice. Go ahead and model the JSON review using structs, as shown in Listing 21.3.

As Listing 21.3 shows, the Review struct represents the different fields in the JSON review. For each field, you use the appropriate data type. For instance, the helpful field must be [2]int to be parsed correctly. (If you choose string, then the JSON won't parse.)
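Listing 21.3 isn't reproduced here, but one plausible way to model the struct follows. The field names are illustrative (though ReviewText matches a name used later in this lesson), and struct tags map each field to its JSON key:

type Review struct {
    ReviewerID     string  `json:"reviewerID"`
    Asin           string  `json:"asin"`
    ReviewerName   string  `json:"reviewerName"`
    Helpful        [2]int  `json:"helpful"`
    ReviewText     string  `json:"reviewText"`
    Overall        float64 `json:"overall"`
    Summary        string  `json:"summary"`
    UnixReviewTime int64   `json:"unixReviewTime"`
    ReviewTime     string  `json:"reviewTime"`
}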

Going back to the reviews, each review is defined as:

var review Review

The goal now is to take the string representation of the JSON review and convert it into the Review struct that you've defined. This is where the json package comes into play. The json package allows you to encode and decode JSON objects. In our example, it will allow you to convert the string into a valid review represented by the Review struct in Listing 21.3.

To do that, you will need to use the Unmarshal function, which has the following signature:

func Unmarshal(data []byte, v interface{}) error

As you can see, the Unmarshal function takes as input the data that you want to unmarshal. The function parses the JSON data and stores the result in the value pointed to by v. In our case, v is the review (the struct defined earlier). The type of v is the empty interface, written interface{}: an interface type that specifies no methods at all.

Since every type in Go implements at least zero methods, every type satisfies the empty interface, which is why you can pass the Review struct you defined earlier (via a pointer) to the Unmarshal function (there is some similarity with object-oriented programming [OOP] concepts here).

One thing to notice is that the data must be a slice of bytes, so you will need to convert the string representation of the review into the equivalent byte representation. Conveniently, you can do that with a simple byte-slice conversion of the text, as shown here:

[]byte(scanner.Text())

Going back to the Unmarshal function, you can parse the review into the struct defined earlier by using the following code:

var review Review
json.Unmarshal([]byte(scanner.Text()), &review)

This code will parse the JSON-encoded review into the variable of type Review. Note that to use the json.Unmarshal function, you have to add "encoding/json" to the current list of imports in the listing:

import (
   "bufio"
   "encoding/json"
   "fmt"
   "log"
   "os"
)

Going back to the read_json_file function, you have to perform this process for each line of text. Listing 21.4 adds this process to the function.

Listing 21.4 adds a few lines within the for loop so that the code will scan and convert each line to a type Review. It also checks for errors returned by the Unmarshal function, and in case an error occurs, it logs the error and exits the function.
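The body of the loop might now look like the following sketch; the exact listing may handle the error differently:

for scanner.Scan() {
    var review Review
    // Parse the JSON-encoded line into the Review struct.
    err := json.Unmarshal([]byte(scanner.Text()), &review)
    if err != nil {
        log.Println(err) // log the error
        return           // and exit the function
    }
}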

At this point, you can display individual attributes of each review by adding the code shown in Listing 21.5.

In Listing 21.5, you are simply displaying the asin attribute of each review. Let's see the code in action. Listing 21.6 presents a full listing using the read_json_file function.

When you execute the listing, the code will read the file, iterate through each line, parse the data from the line into a type Review, and display the asin attribute. Note that you pass the location of the review file to the read_json_file function. In Listing 21.6, the JSON file is in the same directory as the Go program. If you saved the JSON file in a different directory, then you will need to adjust the path accordingly. If you downloaded a different review file from Amazon, then you'll want to change the JSON filename to match as well.

Returning the Reviews

At this point, let's discuss what the read_json_file function should return. Ideally, you want to return a slice of reviews. In other words, the read_json_file function should return the following type:

[]Review

As you can see, the slice holds elements where each element is of type Review. By doing so, you can iterate through the file, parse each review, append it to the slice of reviews, and return the slice when the function is done.

Listing 21.7 makes a few changes to the code.

Listing 21.7 adds a few things to the read_json_file function (a sketch of the updated function follows this list):

  • The returned type in the function signature: read_json_file now returns a slice where each element is of type Review.
  • A slice named reviews, which will hold all the reviews from the JSON file.
  • Within the for loop, a line that appends each parsed review to the slice of reviews.
  • A return of the slice of reviews once you finish iterating through the file.
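A sketch of the updated function, combining the pieces described so far (error handling via log.Fatal here; Listing 21.7 may differ):

func read_json_file(filepath string) []Review {
    file, err := os.Open(filepath)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // Slice that will hold all the reviews from the JSON file.
    var reviews []Review

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)
    for scanner.Scan() {
        var review Review
        if err := json.Unmarshal([]byte(scanner.Text()), &review); err != nil {
            log.Fatal(err)
        }
        reviews = append(reviews, review) // append the parsed review
    }
    return reviews
}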

It is time now to see this code in action. Update the main function with the code shown in Listing 21.8. Again, remember to adjust your filename and path if necessary.

This new main function reads the JSON file using your latest read_json_file and stores the output in a slice named reviews. Next, you display the first two reviews in the file with a set of dashes between them to provide some separation. The output should look like this:

 It's hard to believe "Memory of Trees" came out 11 years ago;it has held up well over the passage of time.It's Enya's last great album before the New Age/pop of "Amarantine" and "Day without rain." Back in 1995,Enya still had her creative spark,her own voice.I agree with the reviewer who said that this is her saddest album;it is melancholy,bittersweet,from the opening title song."Memory of Trees" is elegaic&majestic.;"Pax Deorum" sounds like it is from a Requiem Mass,it is a dark threnody.Unlike the reviewer who said that this has a "disconcerting" blend of spirituality&sensuality;,I don't find it disconcerting at all."Anywhere is" is a hopeful song,looking to possibilities."Hope has a place" is about love,but it is up to the listener to decide if it is romantic,platonic,etc.I've always had a soft spot for this song."On my way home" is a triumphant ending about return.This is truly a masterpiece of New Age music,a must for any Enya fan!
----------
A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety.  Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice.  But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe.  But if you're already a fan, then your collection is not complete without this beautiful work of musical art.

TOKENIZING AN INPUT STRING

You need to split the reviews into individual words so that the words can be counted. A simple approach is to write a function that tokenizes any input string. The function accepts a string as input and returns a list representing the words in the string, with their order preserved. This function should split the string into words based on spaces or punctuation.

You can simplify the problem by first identifying each punctuation mark and replacing it with a space. You can then split based solely on spaces. For example, consider the following:

"Hello, Sean! -How are you?"

The first step is to replace the punctuation with a space. This results in the following string:

 Hello  Sean   How are you  

Next, you can split the string based on spaces to get the list of words in the string. After the text is also converted to lowercase (the second step listed below), the result will be the following:

[hello sean how are you]

The logic of this function is as follows:

  1. Identify and replace punctuation with a space in the string.
  2. Convert the input text string to lowercase.
  3. Split the string into words based on spaces.

Identifying and Replacing Punctuation with a Space

First, let's focus on step 1. To identify punctuation and replace it with a space, you will leverage regular expressions, or regex, which you learned about in Lesson 19, “Sorting and Data Processing.” Regex is a standard pattern-matching syntax that developers use frequently for searching through strings, and it is supported by most modern text-processing tools. You will leverage regex to identify and replace punctuation with a space.

In our example, you will use the regexp package from Go to search for punctuation marks and the ReplaceAllString function to substitute spaces for those punctuation marks. Let's consider the code in Listing 21.9.

First, you use the MustCompile function, to which you pass a regex expression. MustCompile returns a compiled Regexp value, and the subsequent matching and replacement operations are methods called on that value. Note that if you need to apply a different regex expression, you must compile it separately. In this listing, your regex is a character class listing all the punctuation that you want to replace with a space ([.,!?-_#^()+=;/&'"]). Be aware that inside a character class, a hyphen between two characters defines a range, so the hyphen should be escaped (\-) or placed first or last in the class if you want it matched literally.

Keep in mind that MustCompile will panic if the input regular expression is not valid. If the program panics at this point, double-check your regex.

The next step is to use the ReplaceAllString method to replace any identified punctuation with a space. Since you compiled the regex in the previous step, you call ReplaceAllString directly on the compiled value, with no need to specify the regex expression again.

Finally, you display both strings so that you can compare the results. As you can see, you are able to replace any punctuation with a space. The results should look like the following:

original string: Hello, Sean! -How are you?
string after replacing punctuation with a space: Hello  Sean   How are you
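A compact sketch along the lines of Listing 21.9 follows; note that the hyphen is escaped here (\-) so it is matched literally:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    s := "Hello, Sean! -How are you?"
    // MustCompile panics if the pattern is invalid.
    re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
    // Replace every punctuation mark with a single space.
    w := re.ReplaceAllString(s, " ")
    fmt.Println("original string:", s)
    fmt.Println("string after replacing punctuation with a space:", w)
}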

Converting Input Text to Lowercase

The next step in the tokenization process is to convert the string to lowercase. Listing 21.10 adds this code to your main function.

This listing introduces an additional line that calls the ToLower function from the strings package, so you will also need to import strings into your program. The strings.ToLower function returns a lowercase copy of the string it receives (w in this case). In our example, you assign the result back to w. You also add the punctuation character class mentioned in the previous note to the regex expression.
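Assuming the punctuation-stripped string is stored in w, as above, the added line might look like this:

w = strings.ToLower(w) // normalize case so Hope and hope count as the same word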

Splitting the String into Words

Finally, you need to split the string at spaces and retrieve the slice of strings that represents the different words (in order). To do that, you can use the Fields function from the strings package.

The Fields function splits an input string around each instance of one or more consecutive whitespace characters. It is important to note that there could be multiple spaces grouped together. Since you are replacing punctuation with a space, if you have two consecutive punctuation marks, then you will have double spaces. The Fields function allows you to split based on any number of consecutive spaces. Let's add another instruction to the previous code, as shown in Listing 21.11.

In this code, a call to the Fields function has been added to split the string w into a slice of strings representing the different words/tokens in the string. Finally, the tokens extracted from the input string are displayed. Running the code should produce the following results:

original string: Hello, Sean! --How are you?
string after replacing punctuation with a space: Hello  Sean    How are you 
Tokens: [hello sean how are you]

CREATING A TOKENIZE FUNCTION

Now that you have working code, you can create a function that leverages it to tokenize any input string. Create a function called tokenize, as shown in Listing 21.12.

In this listing, you wrap the previous code into a function called tokenize that takes as input a string and returns a slice of strings that represent the different tokens/words in the input string.
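Pulling the three steps together, a sketch of such a tokenize function might look like this:

// tokenize converts an input string into a slice of lowercase word tokens.
func tokenize(text string) []string {
    re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
    w := re.ReplaceAllString(text, " ") // step 1: replace punctuation with spaces
    w = strings.ToLower(w)              // step 2: convert to lowercase
    return strings.Fields(w)            // step 3: split on runs of whitespace
}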

Tokenizing an Input Review

Let's leverage the tokenize function to tokenize a review from the JSON file. Replace the main function you used in Listing 21.8 with the code in Listing 21.13.

In this code, you use the read_json_file function to read the JSON file containing the reviews. Next, you call the tokenize function to tokenize the text of the first review and display the tokens.

Note that in addition to updating the main function with the code in Listing 21.13, you also have to add the tokenize function from Listing 21.12 to your program as well as include "strings" and "regexp" in your list of imported packages. With the updated code, the output produced should be similar to the following:

tokens: [it s hard to believe memory of trees came out 11 years ago it has held up well over the passage of time it s enya s last great album before the new age pop of amarantine and day without rain back in 1995 enya still had her creative spark her own voice i agree with the reviewer who said that this is her saddest album it is melancholy bittersweet from the opening title song memory of trees is elegaic majestic pax deorum sounds like it is from a requiem mass it is a dark threnody unlike the reviewer who said that this has a disconcerting blend of spirituality sensuality i don t find it disconcerting at all anywhere is is a hopeful song looking to possibilities hope has a place is about love but it is up to the listener to decide if it is romantic platonic etc i ve always had a soft spot for this song on my way home is a triumphant ending about return this is truly a masterpiece of new age music a must for any enya fan]

Tokenizing the Entire Dataset

The last step is to implement the code that will iterate through the entire dataset and tokenize each review. First, let's start with the basic code in Listing 21.14.

Here you open and read the JSON file and then iterate through all the reviews, tokenizing each one. There is no output at this time.
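A sketch of such a main function, assuming the JSON file is named reviews.json and sits next to the program:

func main() {
    reviews := read_json_file("reviews.json")
    // Tokenize every review; the tokens are not stored or printed yet.
    for i := 0; i < len(reviews); i++ {
        tokenize(reviews[i].ReviewText)
    }
}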

COUNTING THE WORDS IN EACH REVIEW

Now that each review in the dataset is tokenized, you can proceed to compute the word count in each review. The word count is the frequency of occurrence of each unique word in the review.

In order to achieve this, you will adopt the same logic as the tokenize step. That is, you will build a function that computes the word count of an input list of words. This function, shown in Listing 21.15, takes a list of words as input, iterates through it, and computes the frequency of occurrence of each unique word in the list.

The count_words function takes a slice of strings as input, and it returns a map. In the returned map, the keys are strings, which represent the unique words in the slice, and the values are integers, which represent the frequency of occurrence of the corresponding unique word.

In the count_words function, you create an empty map that will hold the unique words/frequency of occurrence. You then iterate through the slice of words. First, you check if the current word in the slice already exists in the map:

if _, ok := word_count[words[i]]; ok {

If the word is in the map, then you have seen this word before, so you need to increase the current count by 1:

word_count[words[i]] = word_count[words[i]] + 1

If the word doesn't exist in the map already, that means that this is the first time you see this word. In this case, you will need to initialize the count to 1:

word_count[words[i]] = 1

After looping through the entire slice, you return the word_count map.
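Putting those pieces together, the whole function might look like this sketch of Listing 21.15:

// count_words computes the frequency of each unique word in the input slice.
func count_words(words []string) map[string]int {
    word_count := make(map[string]int)
    for i := 0; i < len(words); i++ {
        if _, ok := word_count[words[i]]; ok {
            word_count[words[i]] = word_count[words[i]] + 1 // word seen before
        } else {
            word_count[words[i]] = 1 // first occurrence of the word
        }
    }
    return word_count
}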

TOKENIZING AND COUNTING THE REVIEWS

With the count_words function written, you can add it to your reviews listing along with the tokenize function. Listing 21.16 pulls all the code you've written into a single listing.

The only code in this listing that is different from what you've seen before is in the main function. In the main function you can see that the tokenization and word counting operations have been combined in the code.

When this listing is executed, nothing is shown on the screen; however, each review is being read and broken into tokens, and the word count of each word in each review is being counted.

DESIGNING IMPROVEMENTS

The code you have so far performs all the necessary tasks that you set out to do. However, you can give it a few tweaks to make it more elegant and reusable. Here are some changes you can make:

  • Improve the structs
  • Add custom error and exception handling
  • Improve tokenizing
  • Improve word counting

Improvement 1: Improving the Structs

If you look closely, you'll notice that so far you haven't stored the tokens or the word counts anywhere. It would be beneficial to keep track of that data. To do that, you will leverage structs. First, let's modify the Review struct, as shown in Listing 21.17.

In this updated Review struct, you include two additional fields, Tokens and WordCount. This means that you will have a place to store the tokens and word count of each review in the review variable itself.
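A sketch of the updated struct might look like this; the two new fields carry no JSON tags because they are computed by your code rather than parsed from the file:

type Review struct {
    ReviewerID     string  `json:"reviewerID"`
    Asin           string  `json:"asin"`
    ReviewerName   string  `json:"reviewerName"`
    Helpful        [2]int  `json:"helpful"`
    ReviewText     string  `json:"reviewText"`
    Overall        float64 `json:"overall"`
    Summary        string  `json:"summary"`
    UnixReviewTime int64   `json:"unixReviewTime"`
    ReviewTime     string  `json:"reviewTime"`
    Tokens         []string       // tokens extracted from ReviewText
    WordCount      map[string]int // frequency of each unique token
}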

Next, add the Dataset struct shown in Listing 21.18.

The Dataset struct includes two attributes:

  • filepath: This represents the file path to the dataset.
  • reviews: This is a slice where each element is of type Review.
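A sketch of such a struct:

type Dataset struct {
    filepath string   // path to the JSON dataset on disk
    reviews  []Review // the parsed reviews
}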

Update read_json_file

With these two structs, you can make some modifications to your code base. You can refactor read_json_file to have a receiver of type Dataset. For instance, let's consider the read_json_file function presented in Listing 21.19. Because this version is a method with a receiver rather than a package-level function, you can add it (for now) to the same file as the existing read_json_file function.

The code in this new read_json_file function is similar to the original read_json_file function except for the following (a sketch follows this list):

  • The function includes a receiver of type Dataset, which allows you to execute this function (now called a method) on Dataset values.
  • The review variable is appended to the list of reviews in the dataset.
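A sketch of the method follows. It assumes a pointer receiver (*Dataset) so that the appended reviews persist on the caller's value; a value receiver would operate on a copy:

// read_json_file reads the file at dataset.filepath and appends each
// parsed review to dataset.reviews.
func (dataset *Dataset) read_json_file() {
    file, err := os.Open(dataset.filepath)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)
    for scanner.Scan() {
        var review Review
        if err := json.Unmarshal([]byte(scanner.Text()), &review); err != nil {
            log.Fatal(err)
        }
        dataset.reviews = append(dataset.reviews, review)
    }
}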

Let's see the new read_json_file function in action. In the main function, add the code shown in Listing 21.20.

In the code, you create a variable of type Dataset and initialize the filepath with the JSON file that you want to read. Next, you execute the read_json_file function, which populates the field reviews within the dataset type with the raw reviews from the JSON file. Finally, you display the second and third reviews from the dataset.
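A sketch of that main function, assuming the same reviews.json filename as before:

func main() {
    dataset := Dataset{filepath: "reviews.json"}
    dataset.read_json_file() // populate dataset.reviews from the JSON file
    fmt.Println(dataset.reviews[1].ReviewText) // second review
    fmt.Println("---")
    fmt.Println(dataset.reviews[2].ReviewText) // third review
}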

Update to Tokenize

Similar to read_json_file, you can implement a tokenize method that tokenizes the entire dataset. Let's look at the tokenize function presented in Listing 21.21. Note that you can keep the other tokenize function for now.

The tokenize method is very simple. First, it includes a receiver of type Dataset in its signature. The body of the method is a for loop that iterates through the reviews and tokenizes each one using the tokenize function you built earlier.
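A sketch of the method, again assuming a pointer receiver so the stored tokens persist:

// tokenize tokenizes every review in the dataset, storing the tokens
// on the corresponding Review value.
func (dataset *Dataset) tokenize() {
    for i := 0; i < len(dataset.reviews); i++ {
        dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
    }
}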

main Function Update

Listing 21.22 adds more code to the main function so that you can see the new tokenize function in action.

In this updated code, you first read the dataset using the read_json_file function, and then you execute the tokenize function, which tokenizes the entire dataset. Finally, you display the second and third reviews as well as their corresponding tokens. The output should show the review text and then the tokens:

A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety.  Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice.  But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe.  But if you're already a fan, then your collection is not complete without this beautiful work of musical art.
---
[a clasically styled and introverted album memory of trees is a masterpiece of subtlety many of the songs have an endearing shyness to them soft piano and a lovely quiet voice but within every introvert is an inferno and enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power if you ve never heard enya before you might want to start with one of her more popularized works like watermark just to play it safe but if you re already a fan then your collection is not complete without this beautiful work of musical art]
---
I never thought Enya would reach the sublime heights of Evacuee or Marble Halls from 'Shepherd Moons.' 'The Celts, Watermark and Day…' were all pleasant and admirable throughout, but are less ambitious both lyrically and musically. But Hope Has a Place from 'Memory…' reaches those heights and beyond. It is Enya at her most inspirational and comforting. I'm actually glad that this song didn't get overexposed the way Only Time did. It makes it that much more special to all who own this album.
---
[i never thought enya would reach the sublime heights of evacuee or marble halls from shepherd moons the celts watermark and day were all pleasant and admirable throughout but are less ambitious both lyrically and musically but hope has a place from memory reaches those heights and beyond it is enya at her most inspirational and comforting i m actually glad that this song didn t get overexposed the way only time did it makes it that much more special to all who own this album]

Word Count Update

Finally, let's implement the count_words function for the Dataset struct. This function will allow you to count the unique words in each review in the entire dataset.

Let's consider the simple implementation of the count_words function in Listing 21.23. This function iterates through the reviews in the dataset and performs a word count for each review.
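A sketch of that simple implementation:

// count_words computes the word count of every review in the dataset.
func (dataset *Dataset) count_words() {
    for i := 0; i < len(dataset.reviews); i++ {
        dataset.reviews[i].WordCount = count_words(dataset.reviews[i].Tokens)
    }
}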

With this code, you can update your main function to execute the count_words function on your dataset of reviews. Update the code to use the main function shown in Listing 21.24.

As the listing shows, you add the execution of the count_words method, and then display the word_count for the second and third reviews.

Improvement 2: Adding Custom Error and Exception Handling

So far, you've implemented minimal error and exception handling. Let's improve the way the program handles unexpected errors and exceptions.

First, to add the appropriate errors for your methods, replace the existing read_json_file method with the code in Listing 21.25.

Compared to the previous implementation of read_json_file, not much has changed. The first change is that the method now returns a Boolean and an error type. The Boolean is true if there is an error, and it is false if there is no error. The error type returns an error description if an error occurs. Next, you need to identify where possible errors might occur. In our example, there are two places where an error can arise.

The first error might occur if you can't open the file because it's corrupted or the path is erroneous:

if err != nil {
   return true, err
}

The second error might occur when unmarshaling the JSON data into the struct type, for example because of a malformed JSON file.

The Unmarshal function returns an error type, so it is a matter of propagating the error:

err := json.Unmarshal([]byte(scanner.Text()), &review)
if err != nil {
   return true, err
}

Finally, if the function finishes executing (meaning everything went well), then you simply return no error:

return false, nil
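Combining those fragments, the error-returning method might look like this sketch:

func (dataset *Dataset) read_json_file() (bool, error) {
    file, err := os.Open(dataset.filepath)
    if err != nil {
        return true, err // the file could not be opened
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    scanner.Split(bufio.ScanLines)
    for scanner.Scan() {
        var review Review
        err := json.Unmarshal([]byte(scanner.Text()), &review)
        if err != nil {
            return true, err // the line could not be parsed as JSON
        }
        dataset.reviews = append(dataset.reviews, review)
    }
    return false, nil // everything went well
}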

Improvement 3: Improving Tokenizing

What if you execute the tokenize method before reading the JSON data? That means you are tokenizing an empty dataset. In our example, if someone executes the tokenize method without reading the JSON data first, you want to show them an error message.

To do that, first add the method shown in Listing 21.26. The method, called empty, reports whether the dataset is empty by checking whether the reviews slice holds any elements.
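A sketch of such a method:

// empty reports whether the dataset holds no reviews yet.
func (dataset *Dataset) empty() bool {
    return len(dataset.reviews) == 0
}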

With this function added to your program, you then need to update the tokenize method. Listing 21.27 contains the new code.

In this update, you have added a few things to the tokenize method. First, you now return a Boolean and an error type. Additionally, before you perform any tokenization, you check if the dataset is empty. If that's the case, then you return true and a custom message instructing the user to read the data before performing tokenization. If there's no error, you perform the tokenization on the dataset. At the end of the method, you return false and nil, which means no error occurred.
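A sketch of the updated method, using the error message shown in the output below:

func (dataset *Dataset) tokenize() (bool, error) {
    if dataset.empty() {
        // Custom error instructing the user to load the data first.
        return true, errors.New("Dataset is empty. Please read data from json first.")
    }
    for i := 0; i < len(dataset.reviews); i++ {
        dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
    }
    return false, nil
}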

To see these custom errors in action, update the main function with the code in Listing 21.28. You must also include the errors package, so add "errors" to the imports in the listing.

If you look at the code in Listing 21.28, you will notice that the tokenize method is being executed before reading any data, which is erroneous. This should trigger the custom exception handling, which you implemented in the previous listing:

2022/02/09 19:16:53 Dataset is empty. Please read data from json first.
exit status 1

Improvement 4: Improving Word Counting

If you execute the count_words method before executing the tokenize method, it means that the tokens aren't computed and so you would be performing a word count on empty slices, which is not what you want. For instance, look at the code in Listing 21.29.

This code will execute without any issues. However, you are running count_words before performing any tokenization. This results in the code running the word count on empty slices, which returns an empty word_count. You can see this in the output:

A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety.  Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice.  But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe.  But if you're already a fan, then your collection is not complete without this beautiful work of musical art.
[]
map[]

Let's fix the situation by first checking whether the token slice is empty. If it is, you execute the tokenization (just in case it wasn't run before). This effectively allows count_words to work correctly with or without a prior call to the tokenize method. Update the count_words method with the code in Listing 21.30.

Now you simply check if the dataset is empty. If that's the case, you return an error message. Otherwise, you iterate through each review, first determining whether it has any tokens by using the following check:

if len(dataset.reviews[i].Tokens) == 0 {
}

If the number of tokens is 0, it means that you didn't run any tokenization yet. In this case, you force the method to perform the tokenization prior to performing a word count:

if len(dataset.reviews[i].Tokens) == 0 {
    dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
}
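Putting it together, the updated method might look like this sketch:

func (dataset *Dataset) count_words() (bool, error) {
    if dataset.empty() {
        return true, errors.New("Dataset is empty. Please read data from json first.")
    }
    for i := 0; i < len(dataset.reviews); i++ {
        // If tokenize was never run on this review, tokenize it first.
        if len(dataset.reviews[i].Tokens) == 0 {
            dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
        }
        dataset.reviews[i].WordCount = count_words(dataset.reviews[i].Tokens)
    }
    return false, nil
}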

POSSIBLE FURTHER IMPROVEMENTS

At this point, you have a working program that reads a JSON file, tokenizes the data, and counts the words. We've gone through a few improvements that could be made; however, many more could be applied as well. Additional changes could include:

  • Adding support to read reviews from CSV files.
  • Performing a word count on the entire dataset. This means that after the word count for each review is completed, the results could then be combined into one single count for the entire dataset.

FINAL CODE LISTING

Your final code is going to be dependent on how you added the various suggestions throughout this lesson. Listing 21.31 contains a complete listing that includes both versions of tokenize and count_words.

Running this listing will tokenize the JSON file and print the first review. The review will be printed, followed by its tokens, followed by the word counts.

SUMMARY

In this lesson, you applied what you've learned up to this point to implement a tokenizer/word counter from scratch using built-in Go packages. Tokenization and word count are both important concepts used in many data analysis tasks such as topic detection. In this lesson you created a program that did the following:

  • Read a JSON file that contains a list of online e-commerce reviews.
  • Tokenized each review in the dataset.
  • Computed word counts in each review in the dataset.