How do we obtain and sample data?

If statistics is about taking samples of populations, then knowing how we obtain those samples must be very important, and it is. Let's focus on just a few of the many ways of obtaining and sampling data.

Obtaining data

There are two main ways of collecting data for our analysis: observation and experimentation. Both have their pros and cons, of course. Each produces different types of behavior and, therefore, warrants different types of analysis.

Observational

We might obtain data through observational means, which consist of measuring specific characteristics without attempting to modify the subjects being studied. For example, if you have tracking software on your website that observes users' behavior, such as the length of time spent on certain pages and the rate of clicking on ads, all without affecting the users' experience, then that is an observational study.

This is one of the most common ways to get data because it's just plain easy. All you have to do is observe and collect data. However, observational studies are limited in the types of data you may collect, because the observer (you) is not in control of the environment. You may only watch and collect natural behavior. If you are looking to induce a certain type of behavior, an observational study would not be useful.

Experimental

An experiment consists of a treatment and the observation of its effect on the subjects. Subjects in an experiment are called experimental units. This is how most scientific labs collect data. They put subjects into two or more groups (usually just two) and call them the control group and the experimental group.

The control group is exposed to a certain environment and then observed. The experimental group is then exposed to a different environment and then observed. The experimenter then aggregates data from both the groups and makes a decision about which environment was more favorable (favorable is a quality that the experimenter gets to decide).
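The text does not say how subjects end up in each group. One common approach is random assignment, sketched below under that assumption. The function name `assign_groups`, the `seed` parameter, and the integer user IDs are hypothetical, chosen only for illustration.

```python
import random

def assign_groups(user_ids, seed=0):
    """Randomly split user_ids into (control, experimental) halves.

    Hypothetical helper: random assignment is an assumption here,
    not something the text prescribes.
    """
    shuffled = list(user_ids)
    rng = random.Random(seed)   # seeded so the split is reproducible
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Ten made-up user IDs, split into two groups of five.
control, experimental = assign_groups(range(10))
print(len(control), len(experimental))
```

Shuffling before splitting keeps any ordering in the original list (for example, sign-up date) from leaking into the groups.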

In a marketing example, suppose we expose half of our users to a landing page with certain images and a certain style (website A), and we measure whether or not they sign up for the service. Then, we expose the other half to a different landing page with different images and a different style (website B) and again measure whether or not they sign up. We can then decide which of the two sites performed better and should be used going forward. This, specifically, is called an A/B test.

Let's see an example in Python! Suppose we run the preceding test and obtain the following results as a list of lists:

results = [ ['A', 1], ['B', 1], ['A', 0], ['A', 0] … ]

Here, each element of the results list represents a subject (person). Each person has the following two attributes:

  • Which website they were exposed to, represented by a single character
  • Whether or not they converted (0 for no and 1 for yes)

We can then aggregate these results, starting with two empty lists:

# create two lists to hold the results of each individual website
users_exposed_to_A = []
users_exposed_to_B = []

Once we have created these two lists, which will eventually hold each individual conversion Boolean (0 or 1), we iterate over all of the test results and add each one to the appropriate list, as shown:

for website, converted in results: # iterate through the results
  # will look something like website == 'A' and converted == 0
  if website == 'A':
    users_exposed_to_A.append(converted)
  elif website == 'B':
    users_exposed_to_B.append(converted)

Now, each list contains a series of 1s and 0s.

Note

Remember that a 1 represents a user who signed up (converted) after seeing that web page, and a 0 represents a user who saw the page and left before signing up.

To get the total number of people exposed to each website, we can use Python's built-in len() function, as illustrated:

len(users_exposed_to_A) == 188 # number of people exposed to website A
len(users_exposed_to_B) == 158 # number of people exposed to website B

To count the number of people who converted, we can call sum() on each list, as shown:

sum(users_exposed_to_A) == 54 # people converted from website A
sum(users_exposed_to_B) == 48 # people converted from website B

If we subtract the sum of each list from its length, we are left with the number of people who did not convert for each site, as illustrated:

len(users_exposed_to_A) - sum(users_exposed_to_A) == 134 # did not convert from website A

len(users_exposed_to_B) - sum(users_exposed_to_B) == 110 # did not convert from website B

We can aggregate and summarize our results in the following table that represents our experiment of website conversion testing:

 

             Did not sign up    Signed up
Website A    134                54
Website B    110                48
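The same kind of table can also be built in a single pass with collections.Counter. This is a minimal sketch, not the book's method: the helper `summarize` is a hypothetical name, and `toy_results` is made-up data in the same [website, converted] shape, used because the full results list is elided above.

```python
from collections import Counter

# Hypothetical toy data in the same shape as `results`; these six rows are
# invented for illustration and are NOT the experiment's actual data.
toy_results = [['A', 1], ['B', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1]]

def summarize(results):
    """Count converted / not-converted subjects per website."""
    # Key each subject by (website, converted) and count occurrences.
    counts = Counter((website, converted) for website, converted in results)
    sites = sorted({website for website, _ in results})
    return {
        site: {
            'did_not_sign_up': counts[(site, 0)],
            'signed_up': counts[(site, 1)],
        }
        for site in sites
    }

table = summarize(toy_results)
for site, row in table.items():
    print(f"Website {site}: {row['did_not_sign_up']} did not sign up, "
          f"{row['signed_up']} signed up")
```

A Counter returns 0 for missing keys, so a website with no conversions at all still gets a complete row.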

We can quickly drum up some descriptive statistics. The conversion rates for the two websites are as follows:

  • Conversion for website A: 54 / (134 + 54) ≈ 0.287
  • Conversion for website B: 48 / (110 + 48) ≈ 0.304

Not much difference, but different nonetheless. Even though B has the higher conversion rate, can we really say that version B converts significantly better? Not yet. To test the statistical significance of such a result, a hypothesis test should be used. These tests will be covered in depth in the next chapter, where we will revisit this exact example and finish it using the proper statistical test.
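For reference, the conversion rates discussed above follow directly from the len() and sum() results. The sketch below only reproduces that arithmetic; the dictionary names `exposed`, `converted`, and `conversion_rates` are illustrative choices rather than names from the original example.

```python
# Counts taken from the len() and sum() results earlier in this section.
exposed = {'A': 188, 'B': 158}     # people exposed to each website
converted = {'A': 54, 'B': 48}     # people who signed up on each website

# Conversion rate = signed up / total exposed, computed per website.
conversion_rates = {site: converted[site] / exposed[site] for site in exposed}

for site, rate in conversion_rates.items():
    print(f"Website {site}: {rate:.3f}")
# Website A: 0.287
# Website B: 0.304
```

The gap between the two rates is under two percentage points, which is why a descriptive comparison alone cannot settle which page is better.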
