Comparison with SAS

SAS is the analytics sledgehammer of yesteryear. It was the market leader in analytics solutions before R and Python, the poster boys of the open source movement, dethroned it from its numero uno position. Nevertheless, many enterprises still trust it with all their analytics requirements, despite the unreasonably high costs.

In this section, we will keep all the comparisons to a tabular format. The SAS and pandas equivalents are summarized in the following table: 

Pandas

SAS

DataFrame

dataset

column

variable

row

observation

groupby

BY-group

NaN

.

 

Now, let's see how we can perform the basic data operations in pandas and SAS: 

Task

Pandas

SAS

 

 

Creating a dataset

 

 

pd.DataFrame({'odds': [1, 3, 5, 7, 9], 'evens': [2, 4, 6, 8, 10]})

data df;

    input x y;

    datalines;

    1 2

    3 4

    5 6

    7 8

    9 10;

run;

 

Reading a dataset

 

 

pd.read_csv(‘DallasData.csv’)

proc import datafile='DallasData.csv' dbms=csv out=tips replace;

    getnames=yes;

run;

Exporting a dataset

 

DallasData.to_csv(‘dallas.csv’)

proc export data=DallasData outfile='dallas.csv' dbms=csv;

run;

 

Column operations

 

DallasData['days_old_year'] = DallasData['days_old']/365

data DallasData;

    set DallasData;

    days_old_year = days_old / 365;

run;

 

Filtering

 

DallasData[DallasData['days_old']>800].head()

data tips;

    set DallasData;

    if days_old > 800;

run;

 

 

 

If-else

 

 

DallasData['income_class'] = np.where(DallasData['income_average'] < 40000, 'low', 'high')

data DallasData;

    set dallas;

    format income_average $5.;

 

    if days_old < 40000 then bucket = 'low';

    else bucket = 'high';

run;

 

 

Column selection

 

 

DallasData[['zip','customerID','days_old','services_due']].head()

data dallas;

    set DallasData;

    keep zip CustomerID days_old services_due;

run;

 

 

Sort

 

dallas = DallasData.sort_values(['days_old','services_completed'])

 

proc sort data=DallasData;

    by days_old services_completed;

run;

 

String length

 

DallasData['state_name'].str.len().head()

data _null_;

set DallasData;

put(LENGTHN(state_name));

put(LENGTHC(state_name));

run;

 

Groupby aggregation

 

dallas_grouped = DallasData.groupby(['zip', 'customerID'])['days_old', 'services_completed'].sum()

proc summary data=DallasData nway;

    class zip customerID;

    var days_old services_completed;

    output out=dallas_summed sum=;

run;

 

 

 

 

 

 

Join

 

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})

 

df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],'value': np.random.randn(4)})

 

inner_join = df1.merge(df2, on=['key'], how='inner')

 

left_join = df1.merge(df2, on=['key'], how='left')

 

right_join = df1.merge(df2, on=['key'], how='right')

proc sort data=df1;

    by key;

run;

 

proc sort data=df2;

    by key;

run;

 

data left_join inner_join right_join outer_join;

    merge df1(in=a) df2(in=b);

 

    if a and b then output inner_join;

    if a then output left_join;

    if b then output right_join;

    if a or b then output outer_join;

run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset