Reason-for-visit codes

The reason-for-visit variables encode why the patient came in, which can be thought of as the chief complaint of the visit (we discussed chief complaints in Chapter 2, Healthcare Foundations). In this dataset, these reasons are coded using a code set called A Reason for Visit Classification for Ambulatory Care (refer to Page 16 and Appendix II of the 2011 documentation for further information; a screenshot of the first page of the Appendix is provided at the end of the chapter). Although the exact code may not be determined early in the patient encounter, we include it here for two reasons:

  • It reflects information available early during the patient encounter.
  • We would like to demonstrate how to process a coded variable (all the other coded variables occur too late in the patient encounter to be of use for this modeling task).

Coded variables require special attention for the following reasons:

  • Often a table provides several columns for the same kind of code so that more than one code can be recorded per row, and the reason-for-visit codes are no exception. Notice that this dataset contains three RFV columns (RFV1, RFV2, and RFV3). A code for asthma, for example, may appear in any of these columns. Therefore, one-hot encoding each column separately is not enough. We must detect the presence of each code in any of the three columns, and we must write a special function to do that.
  • Codes are categorical, but the numbers themselves usually carry no meaning. For easier interpretation, we must name the columns accordingly, using suitable descriptions. To do this, we have put together a special .csv file that contains the typed description for each code (available for download at the book's GitHub repository).
  • One output format possibility is a column for each code, where a 1 indicates the presence of that code and a 0 indicates its absence (as done in Futoma et al., 2015). Any desired combinations/transformations can then be performed. We have used that format here.
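To make the first and third points concrete, here is a minimal sketch on a toy DataFrame. The codes "A" through "D" are made up for illustration and are not real RFV codes; only the column names RFV1, RFV2, and RFV3 come from the dataset.

```python
import pandas as pd

# Toy data; the codes "A" to "D" are invented placeholders, not real RFV codes.
visits = pd.DataFrame({
    "RFV1": ["A", "B", "C", "B"],
    "RFV2": ["B", "A", "D", "C"],
    "RFV3": ["D", "D", "D", "A"],
})

# One-hot encoding each column separately spreads the same code over
# several indicator columns (RFV1_A, RFV2_A, RFV3_A).
per_column = pd.get_dummies(visits)
print(sorted(c for c in per_column.columns if c.endswith("_A")))

# Checking all three columns at once gives a single 0/1 indicator per code.
has_a = (visits == "A").any(axis=1).astype(int)
print(has_a.tolist())
```

The second approach is the one used in the rest of this section: one column per code, regardless of which RFV column the code was recorded in.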

Without further ado, let's start transforming our reason-for-visit variables. First, we import the RFV code descriptions:

rfv_codes_path = HOME_PATH + 'RFV_CODES.csv'

rfv_codes = pd.read_csv(rfv_codes_path,header=0,dtype='str')
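The dtype='str' argument matters because the codes are labels, not quantities. As a small sketch, assume a file with the same Code and Description columns but with made-up rows (the real RFV_CODES.csv has its own contents). Reading it with and without dtype='str' shows what type inference can do to codes that happen to look like numbers:

```python
import io
import pandas as pd

# Hypothetical file contents, used only to illustrate the effect of dtype.
csv_text = (
    "Code,Description\n"
    "00100,First made up reason\n"
    "00200,Second made up reason\n"
)

as_default = pd.read_csv(io.StringIO(csv_text), header=0)
as_text = pd.read_csv(io.StringIO(csv_text), header=0, dtype="str")

# Default inference parses the codes as integers, so leading zeros are lost.
print(as_default["Code"].tolist())
# dtype="str" keeps every value exactly as it is written in the file.
print(as_text["Code"].tolist())
```

Keeping the codes as strings also means they can be compared directly against RFV columns that hold strings, which is what the matching step later in this section relies on.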

Now we will do our RFV code processing.

First, to name the columns properly, we import the sub() function from the re module (re stands for regular expression).

Then we write a function that scans any given RFV columns for the presence of an indicated code, and returns the dataset with a new column, with a 1 if the code is present and a 0 if the code is absent.

Next, we use a for loop to iterate through every code in the .csv file, effectively adding a binary column for every possible code. We do this for both the training and testing sets.

Finally, we drop the original RFV columns, since we no longer need them. The full code is as follows:

from re import sub

def add_rfv_column(data, code, desc, rfv_columns):
    # Name the new column after the code's description
    column_name = 'rfv_' + sub(" ", "_", desc)
    # 1 if the code appears in any of the RFV columns for that row, else 0
    data[column_name] = (data[rfv_columns] == code).any(axis=1).astype('int')
    return data

rfv_columns = ['RFV1','RFV2','RFV3']
for (rfv_code,rfv_desc) in zip(
    rfv_codes['Code'].tolist(),rfv_codes['Description'].tolist()
):
    X_train = add_rfv_column(
        X_train,
        rfv_code,
        rfv_desc,
        rfv_columns
    )
    X_test = add_rfv_column(
        X_test,
        rfv_code,
        rfv_desc,
        rfv_columns
    )

# Remove original RFV columns
X_train.drop(rfv_columns, axis=1, inplace=True)
X_test.drop(rfv_columns, axis=1, inplace=True)
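Adding many columns to a DataFrame one at a time can be slow, and recent pandas versions may warn about a fragmented DataFrame when this happens. As an alternative sketch (not the approach used in this chapter), the indicator columns can be collected first and attached in a single pd.concat() call. The helper name build_rfv_indicators and the toy codes and descriptions below are hypothetical.

```python
from re import sub

import pandas as pd

def build_rfv_indicators(data, codes, descriptions, rfv_columns):
    """Return a DataFrame with one 0/1 column per code (hypothetical helper)."""
    indicators = {
        # Same naming and matching rule as the loop-based version
        "rfv_" + sub(" ", "_", desc):
            (data[rfv_columns] == code).any(axis=1).astype(int)
        for code, desc in zip(codes, descriptions)
    }
    # Dictionary keys become the column names of the result
    return pd.concat(indicators, axis=1)

# Toy data with made-up codes and descriptions
rfv_cols = ["RFV1", "RFV2", "RFV3"]
toy = pd.DataFrame({"RFV1": ["A", "B"], "RFV2": ["B", "C"], "RFV3": ["C", "C"]})
flags = build_rfv_indicators(
    toy, ["A", "B"], ["made up one", "made up two"], rfv_cols
)

# Swap the original RFV columns for the indicator columns in one step
toy = pd.concat([toy.drop(columns=rfv_cols), flags], axis=1)
print(toy)
```

Note that this variant returns a new DataFrame rather than modifying the input in place, so the result has to be assigned back, as in the last step of the example.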

Let's take a look at our transformed dataset with the head() function:

X_train.head(n=5)

Notice that there are now 1,264 columns. While the full DataFrame has been truncated, if you scroll horizontally, you should see some of the new rfv_ columns appended to the end of the DataFrame.
