Exploring the data geographically

For the remainder of this section, we will be using a lot of SQL-like operations in pandas to manipulate the data. Here is a conversion table between SQL and pandas for some basic operations:

Action SQL syntax pandas function SQL example pandas example

Selecting columns

SELECT [[]] SELECT col1, col2, FROM df; df[['col1','col2']]

Selecting rows

WHERE loc(), iloc()

SELECT * FROM df WHERE age=50;

df.loc[df['age']==50]
Sorting by a column ORDER BY sort_values() SELECT * FROM df ORDER BY col1; df.sort_values('col1')
Aggregating by a column GROUP BY groupby() SELECT COUNT(*) FROM df GROUP BY col1; df.groupby('col1').size()
Limiting the number of rows LIMIT head() SELECT * FROM df LIMIT 5; df.head(n=5)

 

With these conversions in mind, we can begin to explore the data geographically.

For starters, 6,825 dialysis facilities is a lot. Let's try narrowing that down by state. First, we count the dialysis facilities in each state:

"""Equivalent SQL: SELECT COUNT(*) 
FROM df
GROUP BY State;
"""
df_states = df.groupby('State').size()
print(df_states)

The output is as follows:

State
AK      9
AL    170
AR     69
AS      1
AZ    120
CA    625
CO     75
CT     49
DC     23
DE     27
...

You should see a table containing 50 rows (one per state, each containing the associated count).

Let's now sort the rows in descending order:

"""Equivalent SQL: SELECT COUNT(*) AS Count 
FROM df
GROUP BY State
ORDER BY Count ASC;
"""
df_states = df.groupby('State').size().sort_values(ascending=False)
print(df_states)

The output is as follows:

State
CA    625
TX    605
FL    433
GA    345
OH    314
IL    299
PA    294
NY    274
NC    211
MI    211
...

Let's further refine our query by limiting the output to 10 states:

"""Equivalent SQL: SELECT COUNT(*) AS Count 
FROM df
GROUP BY State
ORDER BY Count DESC
LIMIT 10;
"""
df_states = df.groupby('State').size().sort_values(ascending=False).head(n=10)
print(df_states)

According to the results, California is the state with the most dialysis centers, followed by Texas. If we wanted to filter dialysis facilities based on state, we could do that by selecting the appropriate rows:

"""Equivalent SQL: SELECT *
FROM df
WHERE State='CA';
"""
df_ca = df.loc[df['State'] == 'CA']
print(df_ca)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset