A sample use case

To better understand what a Watson Analytics Content Analytics cycle might be like, let's walk through a use case scenario.

Let's suppose I am engaged by a university in the United States that is interested in investing unallocated budget resources in developing additional extracurricular programs for its students. It is not, however, willing to risk negatively affecting its students' academic performance.

The university wonders: if a broader range of, or more cultured, non-academic activities are offered and more of the student body becomes involved in them, will grade point averages drop, or will the ratio of students graduating to students enrolled change? If so, perhaps the university would allocate the additional resources elsewhere. Thankfully, we have data available to load into Watson Analytics to see if we can gain insight into this decision-making process.

Step 1: Define the purpose

As we mentioned earlier in this chapter, the first step is to explicitly define what we believe our purpose or objective is. This can be something like, "Will more student involvement in activities affect grade point averages or the number of students who ultimately graduate?"

Step 2: Obtaining the data

With our objective or purpose clearly in mind, we can now start obtaining what we think is relevant data. The university has supplied us with a file containing student information from (approximately) the last 10 years, complete with sex, age, home addresses, enrollment dates, credits attempted, credits completed, average GPAs, graduation dates, average credits per semester, and so on. It also includes the number of non-academic activities each student is enrolled in.

This appears to be enough to start our development of the data, so let's take a look at what's provided. The goal is first to identify and understand the fields. Looking through the file, I see that there is a large variety of information; some of it seems relevant to our purpose, and some of it may not be. Part of this exercise is to look for missing, incomplete, or otherwise seemingly incorrect data. The following fields stood out to me, so I've made an effort to clarify their meaning (for brevity, I've listed only the fields that I decided were relevant to our purpose):

  • Enrollment age: The age of the student upon enrollment/acceptance
  • Sex, Marital status: Personal details
  • Home state: Geographical location
  • Major (Primary, Secondary): Could be undeclared or blank
  • Enrollment date: Date of joining
  • Sponsor: This text field is usually blank
  • Credits attempted: The number of credits the student enrolled in
  • Credits completed: The number of credits the student completed successfully
  • Credits dropped: The number of credits dropped (important to know if they dropped it, or failed to complete it)
  • Current GPA/average GPA: reports for all credits completed to date
  • FT or PT: Is the student full time or part time?
  • Athlete: Is the student an active athlete (in any sport)? Values are active, inactive, or blank
  • Number of sports: Zero or any other number
  • Intramural participant: Does the student participate in intramural sports? Only tracked as yes or no
  • Scholarship type: Academic award by university, privately awarded by a non-university source, none (no scholarship), or blank
  • Number of clubs enrolled in: The number of organized clubs (non-sport, non-intramural activities) the student is a part of
  • Expected grad date: Calculated by major on enrollment date
  • Class of: Inferred from expected grad date
  • Actual grad date: The actual date the student graduated; can be blank for active students
  • Employed by university: Does the student have a job at the university? (Paid, unpaid, none)
  • Transferred in: Did the student transfer in from another university?
  • Transferred out: Did the student transfer out to another university?
  • RA: Is or was the student a resident advisor?
  • Comments: These are free form comments from teachers or faculty members about the student
  • Residence: Where is the student residing while attending? (Dorm, other university housing, other, commuter)
  • Alumni: Is the student an alumnus (yes or no)? Did a family member graduate from here?
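
Before loading a file like this, the hunt for missing or unexpected values can also be done with a short script. The following is only a minimal sketch in Python: the sample rows and the three column names are invented for illustration and are not the university's actual data, and the helper names profile_blanks and distinct_values are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the university file; the column
# names and values are illustrative assumptions, not the real data.
SAMPLE_CSV = """Athlete,GPA Average,Sponsor
Athlete,3.4,
Non-Athlete,,
RA,3.1,Acme
,2.9,
"""


def profile_blanks(rows, fields):
    """Count blank (empty or whitespace-only) values per field."""
    blanks = Counter()
    for row in rows:
        for field in fields:
            if not (row.get(field) or "").strip():
                blanks[field] += 1
    return {field: blanks[field] for field in fields}


def distinct_values(rows, field):
    """Tally the distinct non-blank values seen in one field."""
    return Counter(
        row[field].strip() for row in rows if (row.get(field) or "").strip()
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
fields = ["Athlete", "GPA Average", "Sponsor"]

blank_counts = profile_blanks(rows, fields)
athlete_values = distinct_values(rows, "Athlete")

print(blank_counts)    # how many blanks each field has
print(athlete_values)  # which values actually occur in the Athlete field
```

Listing the distinct values of a field is a quick way to spot values you did not expect to find there.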

Step 3: Performing the analysis

A good amount of time can be required for the previous step; at some point, though, you'll want to move on and load the data into Watson Analytics. Once it's loaded, you certainly have the option to further develop the data, either by using the Watson Analytics Refine feature or by externally modifying the data and reloading it (replacing the file or adding it as a new file).

So let's load this file.

From the welcome page, select Add then upload data (as we did in previous examples):

Now that Watson Analytics has our data, the fun begins. Note that Watson Analytics gave our data a score of 84 and considers it of high quality. Remember, the higher the quality of your data, the better Watson Analytics predicts and explores. We've had higher data quality scores in some of the other sample data files in the book, so let's see if we can improve the score of this file.

To do that, let's click on the file panel shown in the preceding screenshot and then select Refine:

Now, our data is visible in the Refinement screen, and we can scroll through the columns and rows and determine what developments we may want to make.

First, a bit of housekeeping: I noticed that one of the column headings contains a typo, and I'm already thinking ahead to how this data will look in visualizations. Let's clean that up:

This is better:

One of the more important fields in my file is GPA Average. When I click on the heading (see the next screenshot), I see that some records in my file have a blank or missing value for this field. Those records won't contribute well to my analysis, so I can uncheck Include (blank) to ignore them:

The field GPA current has the same dilemma (some blank values), so we can eliminate records with blanks in that field as well. It is a good practice to identify blank or missing values and consider removing those records or populating those values based upon reasonable assumptions.
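
The two options just mentioned, removing records with blanks or populating the blanks, can be sketched outside Watson Analytics as well. This is a minimal Python illustration under stated assumptions: the records are made up, the helper names are hypothetical, and filling with the mean of the known values is just one possible "reasonable assumption", not a recommendation from the university or from Watson Analytics.

```python
from statistics import mean

# Hypothetical records; the field name mirrors the one discussed in the
# text, and the values are made up for illustration.
records = [
    {"Student": "A", "GPA Average": "3.2"},
    {"Student": "B", "GPA Average": ""},
    {"Student": "C", "GPA Average": "2.8"},
]


def is_blank(value):
    """Treat None, empty strings, and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def drop_blanks(rows, field):
    """Option 1: keep only the rows that have a value for the field."""
    return [row for row in rows if not is_blank(row.get(field))]


def fill_blanks_with_mean(rows, field):
    """Option 2: replace blanks with the mean of the known values.

    Assumes at least one row has a known value for the field.
    """
    known = [float(row[field]) for row in rows if not is_blank(row.get(field))]
    fill = str(round(mean(known), 2))
    return [
        {**row, field: fill} if is_blank(row.get(field)) else dict(row)
        for row in rows
    ]


kept = drop_blanks(records, "GPA Average")
filled = fill_blanks_with_mean(records, "GPA Average")

print(len(kept))                 # records that survive the filter
print(filled[1]["GPA Average"])  # value substituted for student B
```

Both helpers return new lists and leave the original records untouched, so the two approaches can be compared side by side before deciding which one to load.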

Another very important field (based upon our defined purpose) is Athlete. We understood that this field would indicate if the student was an active athlete (in any sport) and values would be active/inactive or blank. But after clicking on that field heading, we see something different:

We see that this field actually contains three values: Athlete, Non-Athlete, and RA. The concern is, can a student classified as RA also be an athlete?

Whenever possible, you should take the time to validate what you see in the file against what you thought should be there. In this example, suppose we checked with the university and they explained that RA indicates that the student was a Resident Advisor for other students. Although RA is a highly regarded job, it isn't an indication of whether the individual was an athlete (or not). With further discussion, we found that the university, as a rule, does not encourage athletes to also be RAs, so it's very rare that a student is both an athlete and an RA. For our needs, we can treat a student marked RA as a Non-Athlete. So that Watson Analytics understands this, we want to transform the field values (change RA to Non-Athlete).

Note

Transformations are rules applied to a field to change values of the field.
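
To make the idea of a transformation rule concrete, here is a minimal sketch in Python of the RA to Non-Athlete change described above. It is only an illustration: the rows are invented, and the names ATHLETE_RULE and transform_field are hypothetical helpers, not part of Watson Analytics.

```python
# Transformation rule: map a raw value to the value we want to analyze.
# The mapping reflects the decision described in the text (RA -> Non-Athlete).
ATHLETE_RULE = {"RA": "Non-Athlete"}


def transform_field(rows, field, rule):
    """Return new rows with the rule applied to one field.

    Values that are not mentioned in the rule pass through unchanged.
    """
    return [{**row, field: rule.get(row[field], row[field])} for row in rows]


# Hypothetical rows for illustration only.
students = [
    {"Student": "A", "Athlete": "Athlete"},
    {"Student": "B", "Athlete": "RA"},
    {"Student": "C", "Athlete": "Non-Athlete"},
]

transformed = transform_field(students, "Athlete", ATHLETE_RULE)
remaining_values = sorted({row["Athlete"] for row in transformed})

print(remaining_values)  # only two distinct values remain after the rule
```

Keeping the rule as a small lookup table makes it easy to review the transformation with the data owner before it is applied.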

Although Watson Analytics Refine offers certain features like the ability to add calculations and create data groupings to perform filtering, at the time of writing, the easiest method for performing field transformations is external to Watson Analytics using a tool such as IBM SPSS or perhaps even MS Excel (if the volume of your data is small enough).

For our example here, I used MS Excel. Once my transformations were completed, I saved the file with a new name and loaded that file into Watson Analytics. Notice that, even though we've made a few changes to our file, the data quality score did not change:

Now when we look at the field Athlete, we see only two values:

We could continue to develop our data, but we can also have a look at what Watson Analytics can expose using the data as it is right now. To do that, we can click on the data panel and select Explore. Watson Analytics already prompts us with some starting points (questions), but I have a particular question in mind. I want to see what the average GPA is for athletes and non-athletes. So I can type my question and hit Enter:

Watson Analytics rephrases my question a bit into possible visualizations:

Clicking on the first (left to right) question, we see the following visualization:

It would appear that, on average, students who are athletes have accumulated higher GPAs than those students who are non-athletes. However, is that the full story? What if we looked at a few other important indicators (for athletes versus non-athletes)? Let's take a look at the credits completed.

To do this, we can click on GPA Average, and from the drop-down list select Credits Completed. Watson Analytics shows us the updated visualization:

Here, Watson Analytics illustrates that athletes, although they have higher GPAs on average, have actually completed fewer credits. What about dropped credits?

Another question might be, how many credits were attempted?

With the preceding Watson Analytics visualizations, we might come to understand that non-athletes attempt more credits and complete more credits, but also drop more credits. Athletes seem to be more stable in the number of credits attempted, dropped, and completed, all the while outperforming non-athletes from an average GPA perspective. This insight might support the proposal originally stated of adding funding to the university's athletic programs.
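
The comparisons above all boil down to averaging one measure per group. As a rough cross-check outside Watson Analytics, that calculation can be sketched in Python as follows. The rows and numbers here are invented purely to show the mechanics; they do not reproduce the university's data or the visualizations, and average_by_group is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student rows; the numbers are invented for illustration.
students = [
    {"Athlete": "Athlete", "GPA Average": 3.4, "Credits Attempted": 60,
     "Credits Completed": 58, "Credits Dropped": 2},
    {"Athlete": "Athlete", "GPA Average": 3.2, "Credits Attempted": 64,
     "Credits Completed": 62, "Credits Dropped": 2},
    {"Athlete": "Non-Athlete", "GPA Average": 3.0, "Credits Attempted": 72,
     "Credits Completed": 64, "Credits Dropped": 8},
    {"Athlete": "Non-Athlete", "GPA Average": 2.8, "Credits Attempted": 68,
     "Credits Completed": 62, "Credits Dropped": 6},
]


def average_by_group(rows, group_field, measure):
    """Average one numeric measure for each distinct value of group_field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[measure])
    return {group: round(mean(values), 2) for group, values in groups.items()}


measures = ["GPA Average", "Credits Attempted", "Credits Completed", "Credits Dropped"]
summary = {m: average_by_group(students, "Athlete", m) for m in measures}

for measure, by_group in summary.items():
    print(measure, by_group)
```

Looking at several measures side by side, rather than GPA alone, is what keeps a single favorable average from telling the whole story.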

Step 4: Determining actions to take

Of course, in the preceding example, we completed only superficial development of our data and looked at only a few visualizations (based upon that data) before noticing an insight and drawing a conclusion. Practically speaking, you would perform multiple iterations of steps two (obtaining the data) and three (performing the analysis) before considering any action to take based upon the data.

As you can see, insights need to be validated with others. This may require you to gather more data, or perhaps better understand the data you have (for example, were transformations made actually correct?). In addition, once an insight is validated and a conclusion accepted, don't stop there. For example, if it is eventually accepted that it is a good idea to increase funding for athletic programs (since it appears to positively affect student academic performance), you can ask, "Is there an even better option for the available funding?" What about offering more university scholarships? What about funding more clubs?

Let's have a quick look at these two questions with Watson Analytics.

Next, I've asked Watson Analytics, "How do the values of GPA Average compare by Scholarship?" From this visualization, we notice a new data anomaly. The Scholarship field includes the value All Scholarship Types, which is a consolidation of each value (including the value None), so it doesn't make sense to include it here:

Rather than take the time to cleanse the data (remove this consolidation), we can click on that value and select Exclude (shown in the next screenshot):

Now, Watson Analytics displays an updated visualization (shown in the next screenshot) that seems to indicate that scholarships don't really affect the average GPA of the student:

Let's ask Watson Analytics about the effect of club involvement on average GPA: "How do the values of GPA Average compare by Number of Clubs?"

So again, according to the preceding Watson Analytics visualization, it would appear that the number of clubs a student is a member of has no material effect on the student's average GPA.
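
For readers who want to check a "no material effect" reading by hand, the Exclude step used for the scholarship question and a rough comparison of the group averages can be sketched in Python. Everything here is an assumption made for illustration: the rows are invented, the helper names are hypothetical, and using the gap between the highest and lowest group average is only one crude way to judge how much a field matters.

```python
from collections import defaultdict
from statistics import mean

ROLLUP = "All Scholarship Types"  # the consolidated value named in the text

# Hypothetical rows; the GPA numbers are invented for illustration.
rows = [
    {"Scholarship": "Academic", "GPA Average": 3.0},
    {"Scholarship": "Academic", "GPA Average": 3.2},
    {"Scholarship": "Private", "GPA Average": 3.1},
    {"Scholarship": "None", "GPA Average": 2.9},
    {"Scholarship": "None", "GPA Average": 3.1},
    {"Scholarship": ROLLUP, "GPA Average": 3.06},
]


def exclude_values(rows, field, excluded):
    """Drop rows whose field holds one of the excluded values."""
    return [row for row in rows if row[field] not in excluded]


def group_means(rows, group_field, measure):
    """Average the measure for each distinct value of group_field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[measure])
    return {group: round(mean(values), 2) for group, values in groups.items()}


detail_rows = exclude_values(rows, "Scholarship", {ROLLUP})
means = group_means(detail_rows, "Scholarship", "GPA Average")

# Gap between the highest and lowest group average: a small gap is
# consistent with the field having little effect on the measure.
spread = round(max(means.values()) - min(means.values()), 2)

print(means)
print(spread)
```

The same two helpers could be pointed at a Number of Clubs column to repeat the check for club involvement.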

As you can see, there are numerous patterns of thought that can be explored using Watson Analytics.

Step 5: Validation

Finally, as noted earlier, validating any actions taken or decisions made based upon analytical insights is the final step in the cycle. In fact, it is absolutely critical to carefully appraise and record the effects of any actions taken. You will find (as we mentioned earlier in this chapter) that the results of this step will be input to the next cycle of analysis, perhaps driving realignment of purpose and/or further development of the data used.

With each iteration (of the analysis cycle), your data quality scores should improve and the number of useable insights noticed should increase. In my experience, insights drive data development, which in turn drives additional insights. In addition, using Watson Analytics sharpens your data analysis skills.
