In this lesson, you will apply what you've learned up to this point to perform a common text analysis process using Go. Specifically, you will build a program that takes a dataset of e-commerce reviews and analyzes them to calculate the number of times each word appears.
When you start any data analysis process, the first step is to ensure that the data is available and in a format that your system can use. For our project, you will need to download the data. You'll use the Digital Music review set from Julian McAuley's Amazon product data website, which can be found at http://jmcauley.ucsd.edu/data/amazon. This file, reviews.json, can also be found in the downloadable zip file for this book at www.wiley.com/go/jobreadygo. The data is in the file reviews_Digital_Music_5.json.gz and will need to be extracted.
This file is in a modified JSON format. If you open the extracted file using any text editor, the first two records look like this:
{"reviewerID": "A3EBHHCZO6V2A4", "asin": "5555991584", "reviewerName": "Amaranth "music fan"", "helpful": [3, 3], "reviewText": "It's hard to believe "Memory of Trees" came out 11 years ago;it has held up well over the passage of time.It's Enya's last great album before the New Age/pop of "Amarantine" and "Day without rain." Back in 1995,Enya still had her creative spark,her own voice.I agree with the reviewer who said that this is her saddest album;it is melancholy,bittersweet,from the opening title song."Memory of Trees" is elegaic&majestic.;"Pax Deorum" sounds like it is from a Requiem Mass,it is a dark threnody.Unlike the reviewer who said that this has a "disconcerting" blend of spirituality&sensuality;,I don't find it disconcerting at all."Anywhere is" is a hopeful song,looking to possibilities."Hope has a place" is about love,but it is up to the listener to decide if it is romantic,platonic,etc.I've always had a soft spot for this song."On my way home" is a triumphant ending about return.This is truly a masterpiece of New Age music,a must for any Enya fan!", "overall": 5.0, "summary": "Enya's last great album", "unixReviewTime": 1158019200, "reviewTime": "09 12, 2006"}
{"reviewerID": "AZPWAXJG9OJXV", "asin": "5555991584", "reviewerName": "bethtexas", "helpful": [0, 0], "reviewText": "A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety. Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice. But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe. But if you're already a fan, then your collection is not complete without this beautiful work of musical art.", "overall": 5.0, "summary": "Enya at her most elegant", "unixReviewTime": 991526400, "reviewTime": "06 3, 2001"}
Each record is enclosed in curly brackets ({ }), and the records are separated by new lines. In standard JSON, each record would be enclosed in square brackets ([ ]). Your code will need to take this into account when you import the data to be analyzed.
The fields in each record include a name and value, using a colon (:) as the separator:
"reviewerID": "A3EBHHCZO6V2A4"
The fields are separated by commas:
"reviewerID": "A3EBHHCZO6V2A4", "asin": "5555991584"
For this analysis, we are most interested in the reviews themselves. The review of the first record looks like the following:
"reviewText": "It's hard to believe "Memory of Trees" came out 11 years ago;it has held up well over the passage of time.It's Enya's last great album before the New Age/pop of "Amarantine" and "Day without rain." Back in 1995,Enya still had her creative spark,her own voice.I agree with the reviewer who said that this is her saddest album;it is melancholy,bittersweet,from the opening title song."Memory of Trees" is elegaic&majestic.;"Pax Deorum" sounds like it is from a Requiem Mass,it is a dark threnody.Unlike the reviewer who said that this has a "disconcerting" blend of spirituality&sensuality;,I don't find it disconcerting at all."Anywhere is" is a hopeful song,looking to possibilities."Hope has a place" is about love,but it is up to the listener to decide if it is romantic,platonic,etc.I've always had a soft spot for this song."On my way home" is a triumphant ending about return.This is truly a masterpiece of New Age music,a must for any Enya fan!"
You can see that in addition to words, the text includes punctuation. While spaces are one way to separate and identify distinct words in the data, punctuation can separate words as well, so your code must account for it.
Because the focus of our project is to count the number of times a word appears, the raw data presents another problem: some words are capitalized and others are not. Because string comparisons in Go are case-sensitive, you also want to normalize the text to lowercase so that both hope and Hope are counted as the same word.
Now that you have looked at the data and identified what you want your code to do, you're ready to start writing code.
In the previous lessons, you didn't have to read a JSON file, but the process is quite similar to reading any other text file. Let's look at the read_json_file function in Listing 21.1.
The read_json_file function takes the file path as input. It uses the Open function from the os package to open the file.
Next, the code checks to see if an error occurred. If that's the case, you log/display the error and the program is terminated. Otherwise, the file is valid, and you can proceed to read it.
Because you are opening a file, you also want to make sure it gets closed. The listing defers closing the file until the function finishes executing.
To read the file, you will leverage the NewScanner function from the bufio package to create a scanner called scanner. If you look closely, each review in the JSON file is on a separate line. Thus, you will need to split the text into lines using the scanner's Split function.
At this point, all the function does is open the file and set the scanner to split its contents on new lines. The next step is to iterate through and scan the lines of text. Add the code shown in Listing 21.2 to the same read_json_file function.
The code adds a for loop that iterates and scans each line using the Scan function, and then prints each line. Note that in the listing, the Println call is commented out. You will need to uncomment it (remove the //) to see the data actually print. The entire contents of the file will be printed, which can take a while.
It's important to remember that at this point, you are still reading each review as a string. The string represents the bulk of the JSON version of the review. You will have to convert that string into a valid representation where you can access all the attributes of the review easily. In this case, using structs sounds like the best choice. Go ahead and model the JSON review using structs, as shown in Listing 21.3.
As Listing 21.3 shows, the Review struct represents the different fields in the JSON review. For each field, you use the appropriate data type. For instance, the helpful field must be [2]int to be parsed correctly. (If you choose string, then the JSON won't parse.)
Going back to the reviews, each review is defined as:
var review Review
The goal now is to take the string representation of the JSON review and convert it into the Review struct that you've defined. This is where the json package comes into play. The json package allows you to encode and decode JSON objects. In our example, it will allow you to convert the string into a valid review represented by the Review struct in Listing 21.3.
To do that, we will need to use the Unmarshal function, which has the following signature:
json.Unmarshal(data []byte, v interface{}) error
As you can see, the Unmarshal function takes as input the data that we want to unmarshal. The function parses the JSON data and stores the result in the value pointed to by v. In our case, v is the review (the struct value defined earlier). The type of v is the empty interface, written interface{}. The empty interface specifies no methods, so every Go type satisfies it.
Because every type in Go implements at least zero methods, you can pass a pointer to the Review struct you defined earlier to the Unmarshal function (there is some similarity with object-oriented programming [OOP] concepts here).
One thing to notice is that the data must be a slice of bytes, so you will need to convert the string representation of the review into the equivalent byte representation. Conveniently, you can do that by passing the text data during initialization of the byte slice, as shown here:
[]byte(scanner.Text())
Going back to the Unmarshal function, you can parse the review into the struct defined earlier by using the following code:
var review Review
json.Unmarshal([]byte(scanner.Text()), &review)
This code will parse the JSON-encoded review into the variable of type Review. Note that to use the json.Unmarshal function, you have to add "encoding/json" to the current list of imports in the listing:
import (
"bufio"
"encoding/json"
"fmt"
"log"
"os"
)
Going back to the read_json_file function, you have to perform this process for each line of text. Listing 21.4 adds this process to the function.
Listing 21.4 adds a few lines within the for loop so that the code will scan and convert each line to a Review value. It also checks for errors returned by the Unmarshal function; if an error occurs, it logs the error and exits the function.
At this point, you can display individual attributes of each review by adding the code shown in Listing 21.5.
In Listing 21.5, you are simply displaying the asin attribute of each review. Let's see the code in action. Listing 21.6 presents a full listing using the read_json_file function.
When you execute the listing, the code will read the file, iterate through each line, parse the data from the line into a Review value, and display the asin attribute. Note that you pass the location of the review file to the read_json_file function. In Listing 21.6, the JSON file is in the same directory as the Go program. If you saved the JSON file in a different directory, then you will need to adjust the path accordingly. If you downloaded a different review file from Amazon, then you'll want to change the JSON filename to match as well.
At this point, let's discuss what the read_json_file function should return. Ideally, you want to return a slice of reviews. In other words, read_json_file should return the following type:
[]Review
As you can see, the slice holds elements of type Review. With this return type, you can iterate through the file, parse each review, append it to the slice of reviews, and return the slice when the function is done.
Listing 21.7 makes a few changes to the code.
Listing 21.7 adds a few things to the read_json_file function:

- read_json_file now returns a slice where each element is of type Review.
- It creates a slice named reviews, which will hold all the reviews from the JSON file.
- Within the for loop, once you parse a review, you append it to the slice of reviews.

It is time now to see this code in action. Update the main function with the code shown in Listing 21.8. Again, remember to adjust your filename and path if necessary.
This new main function reads the JSON file using your latest read_json_file and stores the output in a slice named reviews. Next, you display the first two reviews in the file with a set of dashes between them to provide some separation. The output should look like this:
It's hard to believe "Memory of Trees" came out 11 years ago;it has held up well over the passage of time.It's Enya's last great album before the New Age/pop of "Amarantine" and "Day without rain." Back in 1995,Enya still had her creative spark,her own voice.I agree with the reviewer who said that this is her saddest album;it is melancholy,bittersweet,from the opening title song."Memory of Trees" is elegaic&majestic.;"Pax Deorum" sounds like it is from a Requiem Mass,it is a dark threnody.Unlike the reviewer who said that this has a "disconcerting" blend of spirituality&sensuality;,I don't find it disconcerting at all."Anywhere is" is a hopeful song,looking to possibilities."Hope has a place" is about love,but it is up to the listener to decide if it is romantic,platonic,etc.I've always had a soft spot for this song."On my way home" is a triumphant ending about return.This is truly a masterpiece of New Age music,a must for any Enya fan!
----------
A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety. Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice. But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe. But if you're already a fan, then your collection is not complete without this beautiful work of musical art.
You need to split the reviews into individual words so that the words can be counted. A simple approach is to write a function that tokenizes any input string: it accepts a string and returns a list representing the words in the string, with order preserved. This function should split the string into words based on spaces or punctuation.
You can simplify the problem by first identifying each punctuation mark and replacing it with a space. You can then split based solely on spaces. For example, consider the following:
"Hello, Sean! -How are you?"
The first step is to replace the punctuation with a space. This results in the following string:
Hello Sean How are you
Next, after converting the string to lowercase (covered shortly), you can use a split function to split the string based on spaces and get the list of words in the string, which will be the following:
[hello sean how are you]
The logic of this function is as follows:

1. Replace each punctuation mark in the string with a space.
2. Convert the string to lowercase.
3. Split the string on spaces to produce the list of words.

First, let's focus on step 1. To identify and replace punctuation with a space in the string, you will leverage regular expressions, or regex, which you learned about in Lesson 19, “Sorting and Data Processing.” Regex is a standard that developers use frequently for searching through strings, and it is widely used in search and text-processing tools. You will leverage regex to identify and replace punctuation with a space.
In our example, you will use the regexp package from Go to search for punctuation marks and the ReplaceAllString function to substitute spaces for those punctuation marks. Let's consider the code in Listing 21.9.
First, you use the MustCompile function, to which you pass a regex expression. MustCompile returns a compiled expression whose methods you use for the subsequent matching operations; if you need to apply a different regex expression, you must compile it separately. In this listing, your regex is a character class listing all the punctuation that you want to replace with a space ([.,!?-_#^()+=;/&'"]).
Keep in mind that MustCompile will panic if the input regular expression is not valid, so you should double-check your regex if the program panics at that call.
The next step is to use the ReplaceAllString function to replace any identified punctuation with a space. Since the compiled regex carries the pattern, you don't need to specify the expression again when calling ReplaceAllString.
Finally, you display both strings so that you can compare the results. As you can see, you are able to replace any punctuation with a space. The results should look like the following:
original string: Hello, Sean! -How are you?
string after replacing punctuation with a space: Hello Sean How are you
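A minimal, self-contained version of this step might look like the following. One assumption worth flagging: inside a Go character class, a bare - between two characters denotes a range, so the hyphen is escaped here to keep it a literal hyphen:

```go
package main

import (
	"fmt"
	"regexp"
)

// stripPunct replaces every punctuation mark in s with a space.
// The hyphen is escaped (\-) so it is matched literally rather
// than forming a character range.
func stripPunct(s string) string {
	re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
	return re.ReplaceAllString(s, " ")
}

func main() {
	s := "Hello, Sean! -How are you?"
	fmt.Println("original string:", s)
	fmt.Println("after replacing punctuation:", stripPunct(s))
}
```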
The next step in the tokenization process is to convert the string to lowercase. Listing 21.10 adds this code to your main function.
This listing introduces an additional line that calls the ToLower function from the strings package, which you will also need to import into your program. The strings.ToLower function converts the received string (w in this case) to lowercase and returns it; in our example, the result is assigned back to w. You also added the punctuation class mentioned previously for the regex expression.
Finally, you need to split the string at spaces and retrieve the slice of strings representing the different words (in order). To do that, you can use the Fields function from the strings package.
The Fields function splits an input string around each instance of one or more consecutive whitespace characters. This matters because the string can contain multiple spaces in a row: since you are replacing punctuation with a space, two consecutive punctuation marks produce a double space. The Fields function handles any number of consecutive spaces. Let's add another instruction to the previous code, as shown in Listing 21.11.
In this code, the call to the Fields function has been added to split the string w into a slice of strings representing the different words/tokens. Finally, the tokens from the input string are displayed. Running the code should produce the following results:
original string: Hello, Sean! --How are you?
string after replacing punctuation with a space: Hello Sean How are you
Tokens: [hello sean how are you]
Now that you have working code, you can create a function that leverages it to tokenize any input string. Create a function called tokenize, as shown in Listing 21.12.
In this listing, you wrap the previous code into a function called tokenize that takes a string as input and returns a slice of strings representing the different tokens/words in the input string.
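A sketch of such a tokenize function, combining the punctuation replacement, lowercasing, and Fields steps described above (the hyphen in the character class is escaped to keep it literal):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenize replaces punctuation with spaces, lowercases the input,
// and splits on whitespace, returning the words in order.
func tokenize(text string) []string {
	re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
	text = re.ReplaceAllString(text, " ")
	text = strings.ToLower(text)
	return strings.Fields(text)
}

func main() {
	fmt.Println(tokenize("Hello, Sean! -How are you?"))
	// [hello sean how are you]
}
```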
Let's leverage the tokenize function to tokenize a review from the JSON file. Replace the main function you used in Listing 21.8 with the code in Listing 21.13.
In this code, you use the read_json_file function to read the JSON file containing the reviews. Next, you call the tokenize function to tokenize the text of the first review and display the tokens.
Note that in addition to updating the main function with the code in Listing 21.13, you also have to add the tokenize function from Listing 21.12 to your program, as well as include "strings" and "regexp" in your list of imported packages. With the updated code, the output produced should be similar to the following:
tokens: [it s hard to believe memory of trees came out 11 years ago it has held up well over the passage of time it s enya s last great album before the new age pop of amarantine and day without rain back in 1995 enya still had her creative spark her own voice i agree with the reviewer who said that this is her saddest album it is melancholy bittersweet from the opening title song memory of trees is elegaic majestic pax deorum sounds like it is from a requiem mass it is a dark threnody unlike the reviewer who said that this has a disconcerting blend of spirituality sensuality i don t find it disconcerting at all anywhere is is a hopeful song looking to possibilities hope has a place is about love but it is up to the listener to decide if it is romantic platonic etc i ve always had a soft spot for this song on my way home is a triumphant ending about return this is truly a masterpiece of new age music a must for any enya fan]
The last step is to implement the code that will iterate through the entire dataset and tokenize each review. First, let's start with the basic code in Listing 21.14.
Here we open and read the JSON file, and then iterate through all the reviews and tokenize each of them. There is no output at this time.
Now that each review in the dataset is tokenized, you can proceed to compute the word count in each review. The word count is the frequency of occurrence of each unique word in the review.
In order to achieve this, you will adopt the same logic as the tokenize step. That is, you will build a function that computes the word count of an input list of words. This function, shown in Listing 21.15, takes as input a list of words and iterates through them and computes the frequency of occurrence of each unique word in the input list.
The count_words function takes a slice of strings as input and returns a map. In the returned map, the keys are strings representing the unique words in the slice, and the values are integers representing the frequency of occurrence of each unique word.
In the count_words function, you create an empty map that will hold the unique words and their frequencies. You then iterate through the slice of words. First, you check if the current word in the slice already exists in the map:
if _, ok := word_count[words[i]]; ok {
If the word is in the map, then you have seen this word before, so you need to increase the current count by 1:
word_count[words[i]] = word_count[words[i]] + 1
If the word doesn't exist in the map already, that means that this is the first time you see this word. In this case, you will need to initialize the count to 1:
word_count[words[i]] = 1
After looping through the entire slice, you return the word_count map.
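Putting those pieces together, a count_words function along the lines just described might look like this:

```go
package main

import "fmt"

// count_words maps each unique word in words to the number of
// times it appears, using the check-then-increment logic above.
func count_words(words []string) map[string]int {
	word_count := make(map[string]int)
	for i := 0; i < len(words); i++ {
		if _, ok := word_count[words[i]]; ok {
			// Seen before: increase the current count by 1.
			word_count[words[i]] = word_count[words[i]] + 1
		} else {
			// First occurrence: initialize the count to 1.
			word_count[words[i]] = 1
		}
	}
	return word_count
}

func main() {
	fmt.Println(count_words([]string{"hope", "has", "a", "place", "hope"}))
	// map[a:1 has:1 hope:2 place:1]
}
```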
With the count_words function written, you can add it to your reviews listing along with the tokenize function. Listing 21.16 pulls all the code you've written into a single listing.
The only code in this listing that differs from what you've seen before is in the main function, where the tokenization and word-counting operations have been combined.
When this listing is executed, nothing is shown on the screen; however, each review is being read and broken into tokens, and the word count of each word in each review is being counted.
The code you have so far performs all the necessary tasks that you set out to do. However, you can give it a few tweaks to make it more elegant and reusable.
If you look closely, so far you didn't store the tokens or the word count anywhere. It would be beneficial to keep track of that data, and to do so, you will leverage structs. First, let's modify the Review struct, as shown in Listing 21.17.
In this updated Review struct, you include two additional fields, Tokens and WordCount. This gives you a place to store the tokens and word count of each review in the review value itself.
Next, add the Dataset struct shown in Listing 21.18.
The Dataset struct includes two attributes:

- filepath: This represents the file path to the dataset.
- reviews: This is a slice where each element is of type Review.
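A sketch of the two structs might look like this; the Review struct here is trimmed to the fields relevant to tokenization, and the book's listings may differ in detail:

```go
package main

import "fmt"

// Review now carries the parsed text plus room for derived data.
type Review struct {
	ReviewText string `json:"reviewText"`
	Tokens     []string       // filled in by tokenization
	WordCount  map[string]int // filled in by word counting
}

// Dataset bundles the file path with the reviews parsed from it.
type Dataset struct {
	filepath string
	reviews  []Review
}

func main() {
	dataset := Dataset{filepath: "reviews.json"}
	fmt.Println(dataset.filepath, len(dataset.reviews)) // reviews.json 0
}
```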
With these two structs, you can make some modifications to your code base. You can refactor read_json_file to have a receiver of type Dataset. For instance, let's consider the read_json_file function presented in Listing 21.19. Because they have different signatures, you can add this read_json_file function to the same file (for now) as the existing read_json_file function.
The code in this new read_json_file function is similar to the original read_json_file function except for the following:

- It includes a receiver of type Dataset, which allows you to execute this function (now called a method) on Dataset values.
- It appends each review variable to the list of reviews in the dataset.

Let's see the new read_json_file method in action. In the main function, add the code shown in Listing 21.20.
In the code, you create a variable of type Dataset and initialize the filepath with the JSON file that you want to read. Next, you execute the read_json_file method, which populates the reviews field within the dataset with the raw reviews from the JSON file. Finally, you display the second and third reviews from the dataset.
Similar to read_json_file, we can implement a tokenize method that allows you to tokenize the entire dataset. Let's look at the tokenize function presented in Listing 21.21. Note that you can keep the other tokenize function for now.
The tokenize method is very simple. First, it includes a receiver of type Dataset in its signature. The body of the method includes a for loop that iterates through the reviews and tokenizes each one using the tokenize function you built earlier.
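A sketch of this method, assuming a pointer receiver so the computed tokens are stored back into the dataset (Go permits a method named tokenize alongside the package-level tokenize function):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

type Review struct {
	ReviewText string
	Tokens     []string
}

type Dataset struct {
	filepath string
	reviews  []Review
}

// tokenize splits one string into lowercase words.
func tokenize(text string) []string {
	re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
	return strings.Fields(strings.ToLower(re.ReplaceAllString(text, " ")))
}

// tokenize on a *Dataset receiver tokenizes every review in place.
func (dataset *Dataset) tokenize() {
	for i := 0; i < len(dataset.reviews); i++ {
		dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
	}
}

func main() {
	dataset := Dataset{reviews: []Review{{ReviewText: "Enya's last great album"}}}
	dataset.tokenize()
	fmt.Println(dataset.reviews[0].Tokens) // [enya s last great album]
}
```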
Listing 21.22 adds more code to the main function so that you can see the new tokenize method in action.
In this updated code, you first read the dataset using the read_json_file method, and then you execute the tokenize method, which tokenizes the entire dataset. Finally, you display the second and third reviews as well as their corresponding tokens. The output should show the review text and then the tokens:
A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety. Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice. But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe. But if you're already a fan, then your collection is not complete without this beautiful work of musical art.
---
[a clasically styled and introverted album memory of trees is a masterpiece of subtlety many of the songs have an endearing shyness to them soft piano and a lovely quiet voice but within every introvert is an inferno and enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power if you ve never heard enya before you might want to start with one of her more popularized works like watermark just to play it safe but if you re already a fan then your collection is not complete without this beautiful work of musical art]
---
I never thought Enya would reach the sublime heights of Evacuee or Marble Halls from 'Shepherd Moons.' 'The Celts, Watermark and Day…' were all pleasant and admirable throughout, but are less ambitious both lyrically and musically. But Hope Has a Place from 'Memory…' reaches those heights and beyond. It is Enya at her most inspirational and comforting. I'm actually glad that this song didn't get overexposed the way Only Time did. It makes it that much more special to all who own this album.
---
[i never thought enya would reach the sublime heights of evacuee or marble halls from shepherd moons the celts watermark and day were all pleasant and admirable throughout but are less ambitious both lyrically and musically but hope has a place from memory reaches those heights and beyond it is enya at her most inspirational and comforting i m actually glad that this song didn t get overexposed the way only time did it makes it that much more special to all who own this album]
Finally, let's implement the count_words method for the Dataset struct. This method will allow you to count the unique words in each review in the entire dataset.
Let's consider the simple implementation of the count_words method in Listing 21.23. It iterates through the reviews in the dataset and performs a word count for each review.
With this code, you can update your main function to execute the count_words method on your dataset of reviews. Update the code to use the main function shown in Listing 21.24.
As the listing shows, you add the execution of the count_words method and then display the WordCount for the second and third reviews.
So far, you've implemented minimal error handling. Let's investigate improving the way our program handles unexpected errors.
First, to add the appropriate error handling to your methods, replace the existing read_json_file method with the code in Listing 21.25.
Compared to the previous implementation of read_json_file, not much has changed. The first change is that the method now returns a Boolean and an error type. The Boolean is true if an error occurred and false otherwise, and the error value carries the error description. Next, you need to identify where errors might occur. In our example, there are two possibilities.
The first error might occur if you can't open the file because it's corrupted or the path is erroneous:
if err != nil {
return true, err
}
The second error might occur when unmarshaling the JSON data into the struct type, for instance due to an erroneous JSON file. The Unmarshal function returns an error type, so it is simply a matter of propagating the error:
err := json.Unmarshal([]byte(scanner.Text()), &review)
if err != nil {
return true, err
}
Finally, if the function finishes executing (meaning everything went well), then you simply return no error:
return false, nil
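A self-contained sketch of this error-returning method is shown below; the Review struct is trimmed, and main demonstrates the missing-file branch by pointing at a path that does not exist:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type Review struct {
	ReviewText string `json:"reviewText"`
}

type Dataset struct {
	filepath string
	reviews  []Review
}

// read_json_file returns (true, err) on failure and (false, nil)
// on success, mirroring the Boolean-plus-error convention above.
func (dataset *Dataset) read_json_file() (bool, error) {
	file, err := os.Open(dataset.filepath)
	if err != nil {
		// First error case: the file can't be opened.
		return true, err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	scanner.Split(bufio.ScanLines)
	for scanner.Scan() {
		var review Review
		if err := json.Unmarshal([]byte(scanner.Text()), &review); err != nil {
			// Second error case: the line isn't valid JSON.
			return true, err
		}
		dataset.reviews = append(dataset.reviews, review)
	}
	return false, nil
}

func main() {
	// A path that does not exist triggers the first error branch.
	dataset := Dataset{filepath: "no_such_file.json"}
	if failed, err := dataset.read_json_file(); failed {
		fmt.Println("error:", err)
	}
}
```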
What if you execute the tokenize method before reading the JSON data? That would mean tokenizing an empty dataset. In our example, if someone executes the tokenize method without reading the JSON data first, you want to show them an error message.
To do that, first add the method shown in Listing 21.26. The method, called empty, checks whether the dataset is empty by checking whether the reviews slice is empty.
With this method added to your program, you then need to update the tokenize method. Listing 21.27 contains the new code.
In this update, you have added a few things to the tokenize method. First, it now returns a Boolean and an error type. Additionally, before performing any tokenization, you check if the dataset is empty. If it is, you return true and a custom message instructing the user to read the data before performing tokenization. Otherwise, you perform the tokenization on the dataset and, at the end of the method, return false and nil, which means no error occurred.
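A sketch of the empty helper and the guarded tokenize method might look like this (the error text follows the message shown in the sample output below):

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

type Review struct {
	ReviewText string
	Tokens     []string
}

type Dataset struct {
	reviews []Review
}

// empty reports whether no reviews have been loaded yet.
func (dataset *Dataset) empty() bool {
	return len(dataset.reviews) == 0
}

func tokenize(text string) []string {
	re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
	return strings.Fields(strings.ToLower(re.ReplaceAllString(text, " ")))
}

// tokenize refuses to run on an empty dataset and reports the
// problem through its (bool, error) return values.
func (dataset *Dataset) tokenize() (bool, error) {
	if dataset.empty() {
		return true, errors.New("Dataset is empty. Please read data from json first.")
	}
	for i := 0; i < len(dataset.reviews); i++ {
		dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
	}
	return false, nil
}

func main() {
	var dataset Dataset // nothing read yet
	if failed, err := dataset.tokenize(); failed {
		fmt.Println(err) // Dataset is empty. Please read data from json first.
	}
}
```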
To see these custom errors in action, update the main function with the code in Listing 21.28. You must also add "errors" to the imports in the listing, since the custom message is created with the errors package.
If you look at the code in Listing 21.28, you will notice that the tokenize method is executed before reading any data, which is erroneous. This triggers the custom error handling that you implemented in the previous listing:
2022/02/09 19:16:53 Dataset is empty. Please read data from json first.
exit status 1
If you execute the count_words method before executing the tokenize method, the tokens haven't been computed, so you would be performing a word count on empty slices, which is not what you want. For instance, look at the code in Listing 21.29.
This code executes without any issues. However, because count_words runs before any tokenization, the word count operates on empty slices and returns an empty word_count. You can see this in the output:
A clasically-styled and introverted album, Memory of Trees is a masterpiece of subtlety. Many of the songs have an endearing shyness to them - soft piano and a lovely, quiet voice. But within every introvert is an inferno, and Enya lets that fire explode on a couple of songs that absolutely burst with an expected raw power. If you've never heard Enya before, you might want to start with one of her more popularized works, like Watermark, just to play it safe. But if you're already a fan, then your collection is not complete without this beautiful work of musical art.
[]
map[]
Let's fix the situation by first checking whether the token slice is empty. If it is, you execute the tokenization (in case it didn't run before). This effectively lets count_words work with or without a prior call to the tokenize method. Update the count_words method with the code in Listing 21.30.
Now you first check if the dataset is empty; if it is, you return an error message. Otherwise, you iterate through each review, first determining whether you have any tokens with the following check:
if len(dataset.reviews[i].Tokens) == 0 {
}
If the number of tokens is 0, it means that you didn't run any tokenization yet. In this case, you force the method to perform the tokenization prior to performing a word count:
if len(dataset.reviews[i].Tokens) == 0 {
dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
}
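A sketch of the updated count_words method with this fallback is shown below. It uses Go's shorthand map increment inside the per-slice count_words function, which is equivalent to the if/else form shown earlier:

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

type Review struct {
	ReviewText string
	Tokens     []string
	WordCount  map[string]int
}

type Dataset struct {
	reviews []Review
}

func tokenize(text string) []string {
	re := regexp.MustCompile(`[.,!?\-_#^()+=;/&'"]`)
	return strings.Fields(strings.ToLower(re.ReplaceAllString(text, " ")))
}

func count_words(words []string) map[string]int {
	word_count := make(map[string]int)
	for _, w := range words {
		word_count[w]++ // shorthand for the if/else shown earlier
	}
	return word_count
}

// count_words on the dataset tokenizes any review whose Tokens slice
// is still empty before counting, so it works with or without a
// prior call to the tokenize method.
func (dataset *Dataset) count_words() (bool, error) {
	if len(dataset.reviews) == 0 {
		return true, errors.New("Dataset is empty. Please read data from json first.")
	}
	for i := 0; i < len(dataset.reviews); i++ {
		if len(dataset.reviews[i].Tokens) == 0 {
			dataset.reviews[i].Tokens = tokenize(dataset.reviews[i].ReviewText)
		}
		dataset.reviews[i].WordCount = count_words(dataset.reviews[i].Tokens)
	}
	return false, nil
}

func main() {
	dataset := Dataset{reviews: []Review{{ReviewText: "Hope has a place, hope."}}}
	dataset.count_words() // no prior tokenize call needed
	fmt.Println(dataset.reviews[0].WordCount["hope"]) // 2
}
```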
At this point, you have a working program that reads a JSON file, tokenizes the data, and counts the words. We've gone through a few improvements that could be made, and there are many more that could be applied as well.
Your final code will depend on how you incorporated the various suggestions throughout this lesson. Listing 21.31 contains a complete listing that includes both versions of tokenize and count_words.
Running this listing will tokenize the JSON file and print the first review. The review will be printed, followed by its tokens, followed by the word counts.
In this lesson, you applied what you've learned up to this point to implement a tokenizer and word counter from scratch using built-in Go packages. Tokenization and word counting are important concepts used in many data analysis tasks, such as topic detection. The program you created read a dataset of reviews from a JSON file, tokenized the text of each review, and counted the number of times each word appeared.