Almost every exploration of such provider-centered data will include analyzing facilities based on their quality scores. We will dive into that next.
First, let's get a count on the different scores received by the dialysis facilities:
print(df.groupby('Total Performance Score').size())
The output is as follows:
Total Performance Score 10 10 100 30 11 2 12 2 13 1 14 3 15 1 ... 95 15 96 2 97 11 98 8 99 12 No Score 276 Length: 95, dtype: int64
One thing to notice is that the Total Performance Score column is in a string rather than integer format, and so to do numerical sorting, we must first convert the column to integer format. Second, after running the preceding code, you'll notice that 276 dialysis facilities have a value of No Score for the Total Performance Score column. These rows must be eliminated before converting to integer format to avoid throwing an error.
In the following code, we first eliminate the No Score rows, and then we use the to_numeric() function of pandas to convert the string column to an integer column:
df_filt= df.loc[df['Total Performance Score'] != 'No Score']
df_filt['Total Performance Score'] = pd.to_numeric(
df_filt['Total Performance Score']
)
Now, we create a new DataFrame that selects just a few columns that we are interested in and sorts them, with the worst performing centers at the top. Such a code block would be helpful for identifying the worst performing dialysis centers, for example. We display the first five results:
df_tps = df_filt[[
'Facility Name',
'State',
'Total Performance Score'
]].sort_values('Total Performance Score')
print(df_tps.head(n=5))
The output is as follows:
Facility Name State 5622 462320 PRIMARY CHILDREN'S DIALYSIS CENTER UT 698 PEDIATRIC DIALYSIS UNIT AT UCSF CA 6766 VITAL LIFE DIALYSIS CENTER FL 4635 BELMONT COURT DIALYSIS - DOYLESTOWN CAMPUS PA 3763 WOODMERE DIALYSIS LLC NY Total Performance Score 5622 5 698 7 6766 8 4635 8 3763 9
Alternatively, if we wished to analyze the mean total performance of each state's dialysis centers, we could do that by using numpy.mean() in conjunction with groupby():
import numpy as np
df_state_means = df_filt.groupby('State').agg({
'Total Performance Score': np.mean
})
print(df_state_means.sort_values('Total Performance Score', ascending=False))
The output is as follows:
Total Performance Score State ID 73.178571 WY 71.777778 HI 70.500000 UT 70.421053 CO 70.173333 WA 70.146067 ME 70.058824 OR 70.046154 KS 69.480769 AZ 68.905983
...
According to the results of this query, Idaho and Wyoming have the nation's best performing dialysis centers. You could also add a column including the number of dialysis centers in each state using the following modification:
import numpy as np
df_state_means = df_filt.groupby('State').agg({
'Total Performance Score': np.mean,
'State': np.size
})
print(df_state_means.sort_values('Total Performance Score', ascending=False))
The output is as follows:
Total Performance Score State State ID 73.178571 28 WY 71.777778 9 HI 70.500000 26 UT 70.421053 38 CO 70.173333 75 WA 70.146067 89 ME 70.058824 17 OR 70.046154 65 KS 69.480769 52 AZ 68.905983 117
...
The results indicate that when considering only states that have at least 100 dialysis centers, Arizona has the best total performance.