Exploring the data geographically

For the remainder of this section, we will be using a lot of SQL-like operations in pandas to manipulate the data. Here is a conversion table between SQL and pandas for some basic operations:

Action	SQL syntax	`pandas` function	SQL example	`pandas` example
Selecting columns	`SELECT`	`[[]]`	`SELECT col1, col2, FROM df;`	`df[['col1','col2']]`
Selecting rows	`WHERE`	`loc()`, `iloc()`	`SELECT * FROM df WHERE age=50;`	`df.loc[df['age']==50]`
Sorting by a column	`ORDER BY`	`sort_values()`	`SELECT * FROM df ORDER BY col1;`	`df.sort_values('col1')`
Aggregating by a column	`GROUP BY`	`groupby()`	`SELECT COUNT(*) FROM df GROUP BY col1;`	`df.groupby('col1').size()`
Limiting the number of rows	`LIMIT`	`head()`	`SELECT * FROM df LIMIT 5;`	`df.head(n=5)`

With these conversions in mind, we can begin to explore the data geographically.

For starters, 6,825 dialysis facilities is a lot. Let's try narrowing that down by state. First, we count the dialysis facilities in each state:

"""Equivalent SQL: SELECT COUNT(*) 
                   FROM df 
                   GROUP BY State;
"""
df_states = df.groupby('State').size()
print(df_states)

The output is as follows:

State
AK      9
AL    170
AR     69
AS      1
AZ    120
CA    625
CO     75
CT     49
DC     23
DE     27
...

You should see a table containing 50 rows (one per state, each containing the associated count).

Let's now sort the rows in descending order:

"""Equivalent SQL: SELECT COUNT(*) AS Count 
                   FROM df 
                   GROUP BY State 
                   ORDER BY Count ASC;
"""
df_states = df.groupby('State').size().sort_values(ascending=False)
print(df_states)

The output is as follows:

State
CA    625
TX    605
FL    433
GA    345
OH    314
IL    299
PA    294
NY    274
NC    211
MI    211
...

Let's further refine our query by limiting the output to 10 states:

"""Equivalent SQL: SELECT COUNT(*) AS Count 
                   FROM df 
                   GROUP BY State 
                   ORDER BY Count DESC
                   LIMIT 10;
"""
df_states = df.groupby('State').size().sort_values(ascending=False).head(n=10)
print(df_states)

According to the results, California is the state with the most dialysis centers, followed by Texas. If we wanted to filter dialysis facilities based on state, we could do that by selecting the appropriate rows:

"""Equivalent SQL: SELECT *
                   FROM df
                   WHERE State='CA';
"""
df_ca = df.loc[df['State'] == 'CA']
print(df_ca)

Table of Contents for Exploring the data geographically

Create new playlist

Sign In

Sign Up

Table of Contents for
Exploring the data geographically