SAS is the analytics sledgehammer of yesteryear. It was the market leader in analytics solutions before R and Python, the poster boys of the open source movement, dethroned it from its numero uno position. Nevertheless, many enterprises still trust it with all their analytics requirements, despite the unreasonably high costs.
In this section, we will keep all the comparisons to a tabular format. The SAS and pandas equivalents are summarized in the following table:
Pandas |
SAS |
DataFrame |
dataset |
column |
variable |
row |
observation |
groupby |
BY-group |
NaN |
. |
Now, let's see how we can perform the basic data operations in pandas and SAS:
Task |
Pandas |
SAS |
Creating a dataset |
pd.DataFrame({'odds': [1, 3, 5, 7, 9], 'evens': [2, 4, 6, 8, 10]}) |
data df; input x y; datalines; 1 2 3 4 5 6 7 8 9 10; run; |
Reading a dataset |
pd.read_csv(‘DallasData.csv’) |
proc import datafile='DallasData.csv' dbms=csv out=tips replace; getnames=yes; run; |
Exporting a dataset |
DallasData.to_csv(‘dallas.csv’) |
proc export data=DallasData outfile='dallas.csv' dbms=csv; run; |
Column operations |
DallasData['days_old_year'] = DallasData['days_old']/365 |
data DallasData; set DallasData; days_old_year = days_old / 365; run; |
Filtering |
DallasData[DallasData['days_old']>800].head() |
data tips; set DallasData; if days_old > 800; run; |
If-else |
DallasData['income_class'] = np.where(DallasData['income_average'] < 40000, 'low', 'high') |
data DallasData; set dallas; format income_average $5.;
if days_old < 40000 then bucket = 'low'; else bucket = 'high'; run; |
Column selection |
DallasData[['zip','customerID','days_old','services_due']].head() |
data dallas; set DallasData; keep zip CustomerID days_old services_due; run; |
Sort |
dallas = DallasData.sort_values(['days_old','services_completed']) |
proc sort data=DallasData; by days_old services_completed; run; |
String length |
DallasData['state_name'].str.len().head() |
data _null_; set DallasData; put(LENGTHN(state_name)); put(LENGTHC(state_name)); run; |
Groupby aggregation |
dallas_grouped = DallasData.groupby(['zip', 'customerID'])['days_old', 'services_completed'].sum() |
proc summary data=DallasData nway; class zip customerID; var days_old services_completed; output out=dallas_summed sum=; run; |
Join |
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],'value': np.random.randn(4)})
inner_join = df1.merge(df2, on=['key'], how='inner')
left_join = df1.merge(df2, on=['key'], how='left')
right_join = df1.merge(df2, on=['key'], how='right') |
proc sort data=df1; by key; run;
proc sort data=df2; by key; run;
data left_join inner_join right_join outer_join; merge df1(in=a) df2(in=b);
if a and b then output inner_join; if a then output left_join; if b then output right_join; if a or b then output outer_join; run; |