Case study: making a chi-squared decision

We'll look at a common statistical decision. The decision is described in detail at http://www.itl.nist.gov/div898/handbook/prc/section4/prc45.htm.

This is a chi-squared decision on whether or not data is distributed randomly. To make this decision, we'll need to compute an expected distribution and compare the observed data to our expectations. A significant difference means there's something that needs further investigation. An insignificant difference means we cannot reject the null hypothesis that there's nothing more to study: the differences are simply random variation.

We'll show how we can process the data with Python. We'll start with some backstory: details that are not part of the case study, but that often feature in an Exploratory Data Analysis (EDA) application. We need to gather the raw data and produce a useful summary that we can analyze.

Within the production quality assurance operations, silicon wafer defect data is collected in a database. We may use SQL queries to extract defect details for further analysis. For example, a query could look like this:

SELECT SHIFT, DEFECT_CODE, SERIAL_NUMBER FROM some tables;

The output from this query could be a .csv file with individual defect details:

shift,defect_code,serial_number 
1,None,12345 
1,None,12346 
1,A,12347 
1,B,12348 
and so on, for thousands of wafers.

We need to summarize the preceding data. We may summarize at the SQL query level using the COUNT function and a GROUP BY clause. We may also summarize at the Python-application level. While a pure database summary is often described as being more efficient, this isn't always true. In some cases, a simple extract of raw data and a Python application to summarize can be faster than a SQL summary. If performance is important, both alternatives must be measured, rather than hoping that the database is fastest.
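As a minimal sketch of the Python-level summary, the raw rows can be counted by shift and defect code with collections.Counter. The sample rows in RAW_CSV and the summarize() helper are hypothetical, introduced only for illustration; a real application would read the extracted .csv file rather than an in-memory string.

```python
import csv
import io
from collections import Counter

# Hypothetical raw extract, in the same layout as the sample rows above.
RAW_CSV = """\
shift,defect_code,serial_number
1,None,12345
1,None,12346
1,A,12347
1,B,12348
2,A,12349
2,A,12350
"""

def summarize(source):
    """Count defects by (shift, defect_code), skipping wafers with no defect."""
    reader = csv.DictReader(source)
    return Counter(
        (row["shift"], row["defect_code"])
        for row in reader
        if row["defect_code"] != "None"  # assumes "None" marks a defect-free wafer
    )

defect_counts = summarize(io.StringIO(RAW_CSV))
for (shift, defect), count in sorted(defect_counts.items()):
    print(shift, defect, count)
```

The keys are (shift, defect_code) pairs, so the result has the same three attributes as a database summary: shift, type of defect, and count.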

In some cases, we may be able to get summary data from the database efficiently. This summary must have three attributes: the shift, the type of defect, and a count of defects observed. The summary data looks like this:

shift,defect_code,count 
1,A,15 
2,A,26 
3,A,33 
and so on. 

The output will show all 12 combinations of shift and defect type.
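For comparison, a database-level summary of this kind might be sketched with Python's built-in sqlite3 module. The defects table, its column types, and the sample rows are assumptions made for illustration; the production database and its schema are not described here. The sketch also assumes a defect-free wafer is stored as NULL.

```python
import sqlite3

# In-memory database with a hypothetical "defects" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE defects (shift INTEGER, defect_code TEXT, serial_number INTEGER)"
)
conn.executemany(
    "INSERT INTO defects VALUES (?, ?, ?)",
    [
        (1, None, 12345),  # None is stored as NULL: no defect on this wafer
        (1, None, 12346),
        (1, "A", 12347),
        (1, "B", 12348),
        (2, "A", 12349),
        (2, "A", 12350),
    ],
)

# COUNT with GROUP BY produces one row per (shift, defect_code) pair.
summary = conn.execute(
    """
    SELECT shift, defect_code, COUNT(*)
    FROM defects
    WHERE defect_code IS NOT NULL
    GROUP BY shift, defect_code
    ORDER BY shift, defect_code
    """
).fetchall()
conn.close()

for shift, defect, count in summary:
    print(shift, defect, count)
```

Each resulting row carries the shift, the defect code, and the count, which is the shape of the summary data shown above.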

In the next section, we'll focus on reading the raw data to create summaries. This is the kind of context in which Python is particularly powerful: working with raw source data.

We need to observe and compare shift and defect counts with an overall expectation. If the difference between observed counts and expected counts can be attributed to random fluctuation, we cannot reject the null hypothesis that nothing interesting is going wrong. If, on the other hand, the numbers don't fit with random variation, then we have a problem that requires further investigation.
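The comparison can be sketched as follows, assuming the usual contingency-table approach: the expected count for each cell is the shift total times the defect total divided by the grand total, and the chi-squared statistic sums the squared differences scaled by the expected counts. The observed counts here are made-up numbers for a small two-by-two table, not the case study data.

```python
from collections import Counter

# Hypothetical observed counts keyed by (shift, defect_code).
observed = {
    (1, "A"): 10, (1, "B"): 20,
    (2, "A"): 30, (2, "B"): 40,
}

total = sum(observed.values())

# Marginal totals by shift and by defect type.
shift_totals = Counter()
defect_totals = Counter()
for (shift, defect), count in observed.items():
    shift_totals[shift] += count
    defect_totals[defect] += count

# Expected count per cell if shift and defect type are independent.
expected = {
    (shift, defect): shift_totals[shift] * defect_totals[defect] / total
    for shift in shift_totals
    for defect in defect_totals
}

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (observed.get(cell, 0) - exp) ** 2 / exp
    for cell, exp in expected.items()
)

# Degrees of freedom for an r-by-c table: (r - 1) * (c - 1).
dof = (len(shift_totals) - 1) * (len(defect_totals) - 1)

print(f"chi-squared = {chi2:.4f}, degrees of freedom = {dof}")
```

The statistic is then compared against the chi-squared distribution for the given degrees of freedom; a value that is improbably large under that distribution suggests the differences are not simply random variation.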
