Converting CSV variation files into SQLite3

In this section, we are going to discuss converting CSV variations into SQLite3. CSV stands for comma-separated values. As I mentioned in Chapter 1, Descriptive Statistics, CSV isn't really a standard, and it may come as a surprise to you that the values in a CSV file do not have to be separated by commas for the file to still be considered CSV. That's the lack of standardization at work. So, in this section, we're going to download the MovieLens dataset, explore the types of CSV file formats in this particular dataset, and convert these datasets using SQLite3.

Let's go back to our machine and, in a Google search, type in MovieLens dataset. The first link in the search results leads to our dataset, and we want to download the MovieLens 100K dataset, as shown in the following screenshot:

Now, there are several datasets in this zipped file. MovieLens is a project at the University of Minnesota in which researchers have asked people to rate movies on a scale of 1 to 5, with 1 being the lowest rating and 5 the highest. The first 100K dataset was released back in 1998. We're going to work with this older dataset because it's smaller. Let's download the ml-100k.zip file, go to our Download folder, unzip the file, and explore it, as shown in the following screenshot:

Inside the ml-100k folder you will see the following files:

There is a file in our folder called README. Let's open that up. I do encourage you to read this entire file. There are guidelines on how this dataset can be used. For example, you must download it from the MovieLens site and you must not use this dataset for commercial purposes. We're using it for academic reasons, so it shouldn't be a problem. So, let's scroll down until we find the description of the file formats. The following screenshot shows the brief descriptions of the data:

We're looking for u.data, which is our primary file. Let's open up a new terminal and explore u.data, as demonstrated in the following screenshot:
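If you would like to follow along without the screenshot, the exploration boils down to head and wc. The following sketch runs against a tiny stand-in file (named sample.data here so we don't touch the real u.data; the two records are illustrative, not the dataset's actual rows):

```shell
# Build a stand-in with the same shape as u.data: tab-separated
# user id, item id, rating, and timestamp (values are made up).
printf '196\t242\t3\t881250949\n186\t302\t3\t891717742\n' > sample.data

head -1 sample.data   # peek at the first record
wc -l sample.data     # line count; the real u.data reports 100000
```

Running the same two commands against u.data inside the ml-100k folder reproduces what the screenshot showed.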

There are 100,000 ratings in this file, and you can see the line count of 100,000 at the bottom of our window. So, one line represents one record. The description says that the values are separated by tabs and that there are four fields. In the terminal below, we can see those four fields; they are all integers, and they represent the user ID, the item ID, the rating, and the timestamp. Now, we will have to open one more terminal; hopefully, you will stay with me as I jump back and forth between the two terminals. In the newly opened terminal, we're going to type in the following command:
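The command is simply the sqlite3 binary followed by a database file to create or open (the name ml.db is our own choice; any name will do):

```
$ sqlite3 ml.db
sqlite>
```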

Here's where we'll be creating the tables. Let's create a table for our data, as demonstrated by the following command:
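Since the description tells us all four fields are integers, a minimal definition might look like this (the column names user_id, item_id, rating, and timestamp are our own choice):

```
sqlite> CREATE TABLE data (user_id INTEGER, item_id INTEGER,
   ...>                    rating INTEGER, timestamp INTEGER);
```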

Now, because this is a tab-separated value file, we need to set our mode to tabs, as shown in the following example:
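At the sqlite> prompt, that is a single dot-command:

```
sqlite> .mode tabs
```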

In the past, we've used the csv mode, but there's also a built-in mode just for tab-separated values, and it's called tabs. Now we need to import our data, as shown in the following screenshot:
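The .import dot-command takes the file name first and the table name second (here we assume the table is named data):

```
sqlite> .import u.data data
```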

If all goes according to plan, you should now be able to run SELECT queries on your data. Let's test it. You should receive the following result:
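A query such as the following (assuming our table name data) pulls the first three records:

```
sqlite> SELECT * FROM data LIMIT 3;
```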

We've got the first three rows in our dataset.

Now, go back to our README file and scroll down to the next file, which is u.info. We will see that it holds the number of users, items, and ratings. Open the file using the terminal and you will see the following output:
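Something like cat is enough here; the three lines report the counts for this release:

```
$ cat u.info
943 users
1682 items
100000 ratings
```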

There are only three lines in the file, and it simply summarizes the dataset. Next up, we have u.item, which includes information about the movies, although the dataset calls them items, as shown in the following example:
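A quick peek with head shows the pipe-separated layout (output truncated to the first couple of fields here):

```
$ head -1 u.item
1|Toy Story (1995)|...
```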

Now, the documentation describes this as a tab-separated list, and it has a lot of fields, 19 in total. But, if we look at the file itself, the values are separated by pipes rather than tabs. So, what we would like to do next is create a table for our items. Because the statement is a really long line, I have included it in the code files that will be provided for you along with this book, so that you don't have to retype all of it.

The following example makes this clear:
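The statement itself is just an ordinary CREATE TABLE that lists every pipe-separated field. A shortened sketch (with hypothetical column names, and most columns omitted) looks like this:

```
sqlite> CREATE TABLE items (
   ...>   item_id INTEGER,
   ...>   title   TEXT
   ...>   -- ...plus the remaining columns, as listed in the code files
   ...> );
```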

Because this is a pipe-separated file, and there is no built-in pipe-separated mode, we just have to declare that the separator is a pipe, as shown in the following example:
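The dot-command takes the separator character as its argument:

```
sqlite> .separator "|"
```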

Next, we set the mode to list. In list mode, SQLite3 uses the separator we have declared to split up each line of the file, as shown in the following example:
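Again, a single dot-command:

```
sqlite> .mode list
```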

Now, we import u.item into items, as demonstrated in the following code:
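As before, the file name comes first and the table name second:

```
sqlite> .import u.item items
```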

If all goes according to plan, you should have your items set up. Now we are going to select the first two items using the following command:
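Assuming our table name items, the query and its output (truncated here after the title field) look like this:

```
sqlite> SELECT * FROM items LIMIT 2;
1|Toy Story (1995)|...
2|GoldenEye (1995)|...
```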

Well, we got Toy Story and GoldenEye.

Let's continue looking at our README file. If you scroll down, you will see u.genre, which is just a listing of genres. We are going to skip it, as it's not that interesting. Then we have u.user, which holds demographic information about the users surveyed by the researchers. Let's open that file. We should see the following output:
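A peek with head shows the pipe-separated records; the first line should look something like this:

```
$ head -1 u.user
1|24|M|technician|85711
```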

You'll see that, again, it is another pipe-separated file, despite the fact that the documentation says it's tab-separated. Again, we're going to create a table based on this content, as shown in the following example:
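The README lists the fields as user id, age, gender, occupation, and zip code, so a matching definition (column names our own) could be:

```
sqlite> CREATE TABLE users (user_id INTEGER, age INTEGER, gender TEXT,
   ...>                     occupation TEXT, zip_code TEXT);
```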

Now we already have our mode set up for pipe-separated values, so we should just be able to do an import, as demonstrated by the following example:
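Once more, file name first, table name second:

```
sqlite> .import u.user users
```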

Now let's do a SELECT query, as shown in the following example:
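For instance, to fetch the first few users (assuming our table name users):

```
sqlite> SELECT * FROM users LIMIT 3;
```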

Alright. Next, in the README file, you have u.occupation, which is a list of occupations; again, we are skipping it, as it is not that interesting. The remaining files in this dataset are used for machine learning projects, as shown in the following example:
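These are pre-built training/test splits; in the folder, they show up as paired base and test files:

```
$ ls u*.base u*.test
u1.base  u1.test  u2.base  u2.test  u3.base  u3.test  u4.base  u4.test
u5.base  u5.test  ua.base  ua.test  ub.base  ub.test
```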

Since machine learning is not within the scope of our book, we are just going to stick with the three tables that we explored in this section. So, we have our dataset set up, and I believe we're ready to go. In the next section, we're going to be using descriptive statistics with SQLite3.
