For the remainder of this section, we will be using a lot of SQL-like operations in pandas to manipulate the data. Here is a conversion table between SQL and pandas for some basic operations:
Action | SQL syntax | pandas function | SQL example | pandas example |
Selecting columns |
SELECT | [[]] | SELECT col1, col2, FROM df; | df[['col1','col2']] |
Selecting rows |
WHERE | loc(), iloc() |
SELECT * FROM df WHERE age=50; |
df.loc[df['age']==50] |
Sorting by a column | ORDER BY | sort_values() | SELECT * FROM df ORDER BY col1; | df.sort_values('col1') |
Aggregating by a column | GROUP BY | groupby() | SELECT COUNT(*) FROM df GROUP BY col1; | df.groupby('col1').size() |
Limiting the number of rows | LIMIT | head() | SELECT * FROM df LIMIT 5; | df.head(n=5) |
With these conversions in mind, we can begin to explore the data geographically.
For starters, 6,825 dialysis facilities is a lot. Let's try narrowing that down by state. First, we count the dialysis facilities in each state:
"""Equivalent SQL: SELECT COUNT(*)
FROM df
GROUP BY State;
"""
df_states = df.groupby('State').size()
print(df_states)
The output is as follows:
State AK 9 AL 170 AR 69 AS 1 AZ 120 CA 625 CO 75 CT 49 DC 23 DE 27
...
You should see a table containing 50 rows (one per state, each containing the associated count).
Let's now sort the rows in descending order:
"""Equivalent SQL: SELECT COUNT(*) AS Count
FROM df
GROUP BY State
ORDER BY Count ASC;
"""
df_states = df.groupby('State').size().sort_values(ascending=False)
print(df_states)
The output is as follows:
State CA 625 TX 605 FL 433 GA 345 OH 314 IL 299 PA 294 NY 274 NC 211 MI 211
...
Let's further refine our query by limiting the output to 10 states:
"""Equivalent SQL: SELECT COUNT(*) AS Count
FROM df
GROUP BY State
ORDER BY Count DESC
LIMIT 10;
"""
df_states = df.groupby('State').size().sort_values(ascending=False).head(n=10)
print(df_states)
According to the results, California is the state with the most dialysis centers, followed by Texas. If we wanted to filter dialysis facilities based on state, we could do that by selecting the appropriate rows:
"""Equivalent SQL: SELECT *
FROM df
WHERE State='CA';
"""
df_ca = df.loc[df['State'] == 'CA']
print(df_ca)