Randy Betancourt and Sarah Chen

Python for SAS Users

A SAS-Oriented Introduction to Python

Randy Betancourt

Chadds Ford, PA, USA

Sarah Chen

Livingston, NJ, USA

ISBN 978-1-4842-5000-6 e-ISBN 978-1-4842-5001-3

https://doi.org/10.1007/978-1-4842-5001-3

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

Introduction

For decades, Base SAS software has been the “gold standard” for data manipulation and analysis. The software can read any data source and is superb at transforming and shaping data for analysis. It has been the beneficiary of enormous resource investments over its lifetime. The company has one of the industry’s most innovative R&D staffs, and its products are well supported by an outstanding technical support team and well documented by very capable technical writers. SAS Institute Inc. has remained focused on gathering customer input and building desired features. All of these characteristics help explain its popularity.

Since the beginning of this millennium, the accelerated growth of open source software has produced outstanding projects offering data scientists enormous capabilities to tackle problems that were previously considered outside the realm of feasibility. Chief among these is Python. Python has its heritage in scientific and technical computing domains and has a very compact syntax. It is a full-featured language that is relatively easy to learn and scales well, offering good performance with large data volumes. This is one of the reasons why firms like Netflix¹ use it so extensively.

By nature, SAS users are intrepid and are constantly trying to find new ways to expand the use of the software in pursuit of meeting business objectives. And given the extensive role of SAS within organizations, it only makes sense to find ways to combine the capabilities of these two languages to complement one another.

We have four main goals for our readers. The first is to provide a quick start to learning Python for users already familiar with the SAS language.

Both languages have advantages and disadvantages when it comes to a particular task. And since they are programming languages, their designers had to make certain trade-offs which can manifest themselves as features or quirks, depending on one’s perspective. This is our second goal: to help readers compare and contrast common tasks, taking into account differences in their default behaviors. For example, SAS names are case-insensitive, while Python names are case-sensitive. Likewise, the default sort sequence for the pandas library is the opposite of SAS’ default sort sequence.
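The case-sensitivity difference can be seen in a few lines of Python. This is only a minimal sketch using the standard library; the variable names (total, Total, TOTAL, count) are hypothetical and chosen purely for illustration.

```python
# In Python, names that differ only in case are distinct identifiers.
# (In SAS, total, Total, and TOTAL would all refer to the same variable.)
total = 10
Total = 20
TOTAL = 30

print(total, Total, TOTAL)

# Referring to a name with the wrong case raises a NameError,
# because Count has never been assigned; only count has.
count = 1
try:
    print(Count)
except NameError as err:
    message = str(err)
    print("NameError:", message)
```

A SAS user who habitually mixes case when typing variable names will want to settle on one convention, such as all lowercase, when moving to Python.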

Rather than attempting to promote one language over the other, our third goal is to point out the integration points between the two languages. The choice of which tool to utilize for a given task typically comes down to a combination of what you as a user are familiar with and the context of the problem being solved. Knowing both languages enlarges the set of tools you can apply for the task at hand.

And finally, our fourth goal is to develop working examples for all of the topics in both Python and SAS, giving you the opportunity to “try out” the examples not just by executing them but by extending them to suit your own needs.

We assume you already have some basic knowledge of Python; for example, you already know how to import modules and execute Python scripts. If you don’t, then you will want to spend more time with Chapter 1, “Why Python?”, which covers topics such as Python installation, executing Python in a Windows environment, and executing Python in a Linux environment.

In Chapter 2, “Python Types and Formatting,” we cover topics related to the Python Standard Library such as data types, Booleans with a focus on truth testing, numerical and string manipulations, and basic formatting. If you are new to Python, then it is worthwhile to spend time on this chapter practicing execution of the Python and SAS examples.

If you have a solid grasp of the Python Standard Library, you can skip to Chapter 3, “pandas Library.” Beyond introducing you to DataFrames, we deal with the missing data problem endemic to any analysis task. An understanding of the pandas library underpins the remainder of the book.

Chapter 4, “Indexing and GroupBy,” extends your knowledge of the pandas library by focusing on DataFrame indexing and GroupBy operations. A detailed understanding of these operations is essential for shaping data. We end this chapter by introducing techniques you can use for report production.

Data manipulation tasks such as merging, concatenation, subsetting, updating, appending, sorting, finding duplicates, drawing samples, and transposing are covered in Chapter 5. We have developed scores of examples in both Python and SAS to address and illustrate the range of problems you commonly face in preparing data for analysis.

In Chapter 6, “pandas Readers and Writers,” we cover many of the popular readers and writers used to read and write data from a range of different sources including Excel, .csv files, relational databases, JSON, web APIs, and more. And while we offer detailed explanations, it is the numerous working examples you can use in your own work that make this chapter so valuable.

Working with date, datetime, time, and time zone values is the focus of Chapter 7. In this increasingly instrumented world, we are faced with processing time-based data from literally trillions of sensors. Forming and appropriately handling time series data is no longer just the domain of time-based forecasting. Once again, we rely on the breadth of the provided examples to help you improve your skills.
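As a small taste of the date, datetime, and time zone handling mentioned above, the following sketch uses only the standard library datetime module. The timestamp and the fixed UTC-5 offset labeled "EST" are assumptions made for illustration and are not taken from the book's examples.

```python
from datetime import datetime, timedelta, timezone

# A naive datetime carries no time zone information.
naive = datetime(2019, 1, 15, 9, 30)

# An aware datetime is tied to a time zone, here UTC.
aware = datetime(2019, 1, 15, 9, 30, tzinfo=timezone.utc)

# A fixed-offset zone five hours behind UTC (illustrative label "EST").
est = timezone(timedelta(hours=-5), "EST")
aware_est = aware.astimezone(est)

print(naive.tzinfo)           # None
print(aware_est.isoformat())  # 2019-01-15T04:30:00-05:00

# Both aware values describe the same instant, so they compare equal.
print(aware_est == aware)     # True
```

A fixed offset does not track daylight saving time, which is one reason a time zone library such as pytz is used in practice.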

In our last chapter, we introduce and discuss SASPy, the open source library from SAS Institute used to expose a Python interface to Base SAS software. The provided examples focus on building useful pipelines where the strengths of both languages come together in a single program to accomplish common data analysis tasks. This integration point between SAS and Python offers an enormous range of possibilities limited only by your imagination.

We hope you enjoy this book as much as we enjoyed putting it together!

Feedback

We would love to receive your feedback. Tell us what you liked, what you didn’t like, and provide suggestions for improvements. You can go to our web site, www.pythonforsasusers.com, where you can find all of the examples from this book, get updated examples, and provide us feedback.

Acknowledgments

As this project developed, there were a number of individuals whose contributions and insights were invaluable. Alan Churchill provided many suggestions used to form the introduction and provided sound advice on how to effectively write the SAS and Python examples. Soon Tan, CEO of Ermas, and Kit Chaksuvej, Senior Technical Consultant at Ermas, provided ideas for improving the source code examples. Tom Weber provided detailed feedback on Chapter 8, “SASPy Module.” Tom has been with SAS Institute’s R&D for 31 years and recently co-authored “SASPy, a Python interface to Base SAS,” available on GitHub.

We were fortunate to have Randy’s former boss at SAS Institute, Ferrell Drewry, generously agree to review all of the SAS code and related content. He is an outstanding SAS programmer. Ferrell’s meticulous attention to detail helped us tremendously.

And in another piece of good fortune, we benefited a great deal from having Travis Oliphant as a technical reviewer for the Python sections of the book. Not only was he generous with his time in providing detailed input, but Travis’ experience with Python also gave us insights into how the language is constructed. His commitment to the open source development community is a genuine inspiration.

We also want to acknowledge the great support from the Apress Media team we collaborated with. Susan McDermott, Senior Editor, helped by navigating us through the process of book writing, and we sincerely appreciate her enthusiasm and encouragement. Rita Fernando, Coordinating Editor, provided steady guidance that kept us on track and was always available to offer assistance when we needed it.

Finally, a personal acknowledgment from Randy to his wife Jacqueline and sons Ethan and Adrian who were enthusiastic supporters from the very beginning of this project.

Table of Contents

Chapter 1: Why Python?
  Setting Up a Python Environment
  Anaconda3 Install Process for Windows
  Troubleshooting Python Installation for Windows
  Anaconda3 Install Process for Linux
  Executing a Python Script on Windows
  Case Sensitivity
  Line Continuation Symbol
  Executing a Python Script on Linux
  Integrated Development Environment (IDE) for Python
  Jupyter Notebook
  Jupyter Notebook for Linux
  Summary

Chapter 2: Python Types and Formatting
  Numerics
  Python Operators
  Boolean
  Comparison Operators
  IN/NOT IN
  AND/OR/NOT
  Numerical Precision
  Strings
  String Slicing
  Formatting
  Summary

Chapter 3: pandas Library
  Column Types
  Series
  DataFrames
  DataFrame Validation
  DataFrame Inspection
  Summary

Chapter 4: Indexing and GroupBy
  Create Index
  Return Columns by Position
  Return Rows by Position
  Return Rows and Columns by Label
  Conditionals
  Updating
  Return Rows and Columns by Position
  MultiIndexing
  Basic Subsets with MultiIndexes
  Advanced Indexing with MultiIndexes
  Cross Sections
  GroupBy
  Iteration Over Groups
  GroupBy Summary Statistics
  Filtering by Group
  Group by Column with Continuous Values
  Transform Based on Group Statistic
  Pivot
  Summary

Chapter 5: Data Management
  SAS Sort/Merge
  Inner Join
  Right Join
  Left Join
  Outer Join
  Right Join Unmatched Keys
  Left Join Unmatched Keys
  Outer Join Unmatched Keys
  Validate Keys
  Joining on an Index
  Join Key Column with an Index
  Update
  Conditional Update
  Concatenation
  Finding Column Min and Max Values
  Sorting
  Finding Duplicates
  Dropping Duplicates
  Sampling
  Convert Types
  Rename Columns
  Map Column Values
  Transpose
  Summary

Chapter 6: pandas Readers and Writers
  Reading .csv Files
  Date Handling in .csv Files
  Read .xls Files
  Write .csv Files
  Write .xls Files
  Read JSON
  Write JSON
  Read RDBMS Tables
  Query RDBMS Tables
  Read SAS Datasets
  Write RDBMS Tables
  Summary

Chapter 7: Date and Time
  Date Object
  Return Today’s Date
  Date Manipulation
  Shifting Dates
  Date Formatting
  Dates to Strings
  Strings to Dates
  Time Object
  Time of Day
  Time Formatting
  Times to Strings
  Strings to Time
  Datetime Object
  Combining Times and Dates
  Returning Datetime Components
  Strings to Datetimes
  Datetimes to Strings
  Timedelta Object
  Time zone Object
  Naïve and Aware Datetimes
  pytz Library
  SAS Time zone
  Summary

Chapter 8: SASPy Module
  Install SASPy
  Set Up the sascfg_personal.py Configuration File
  Make SAS-Supplied .jar Files Available
  SASPy Examples
  Basic Data Wrangling
  Write DataFrame to SAS Dataset
  Define the Libref to Python
  Write the DataFrame to a SAS Dataset
  Execute SAS Code
  Write SAS Dataset to DataFrame
  Passing SAS Macro Variables to Python Objects
  Prompting
  Scripting SASPy
  Datetime Handling
  Summary

Appendix A: Generating the Tickets DataFrame

Appendix B: Many-to-Many Use Case

Index

About the Authors and About the Technical Reviewers

About the Authors

Randy Betancourt’s professional career has been in and around data analysis. His journey began by managing a technical support group supporting over 2000 technical research analysts and scientists from the US Environmental Protection Agency at one of the largest mainframe complexes run by the federal government. He moved to Duke University, working for the administration, to analyze staff resource utilization and costs. There, he was introduced to the politics of data access as the medical school had most of the data and computer resources.

He spent the majority of his career at SAS Institute Inc. in numerous roles, starting in marketing and later moving into field enablement and product management. He subsequently developed the role for Office of the CTO consultant.

Randy traveled the globe meeting with IT and business leaders to discuss how data analysis drives their business and the challenges they faced. At the same time, he talked to end users, wanting to hear their perspective. Together, these experiences shaped his understanding of the trade-offs that businesses make when allocating scarce resources to data collection, analysis, and deployment of models.

More recently, he has worked as an independent consultant for firms including the International Institute for Analytics, Microsoft’s SQL Server Group, and Accenture’s Applied Intelligence Platform.

Sarah Chen has 12 years of analytics experience in banking and insurance, including personal auto pricing, compliance, surveillance, and fraud analytics, sales analytics, credit risk modeling for business, and regulatory stress testing. She is a Fellow of both the Casualty Actuarial Society and the Society of Actuaries (FCAS, FSA), an actuary, data scientist, and innovator.

Sarah’s career began with five and a half years at Verisk Analytics in the Personal Auto Actuarial division, building predictive models for various ISO products. At Verisk she learned and honed core skills in data analysis and data management.

Her skills and domain expertise were broadened when she moved to KPMG, working with leading insurers, banks, and large online platforms on diverse business and risk management problems.

From 2014 to present, Sarah has been working at HSBC bank on wholesale credit risk models. She has experience with PD, LGD, and EAD models in commercial real estate, commercial and industrial banks, and non-bank financial institution portfolios. She has been active in innovations within the organization.

Over the years, she has used many analytics tools including R, SAS, and Python.

Sarah graduated summa cum laude with a BA in Mathematics from Columbia University in 2007. She is the founder of Magic Math Mandarin, a school that emphasizes values and tomorrow’s skills for children.

About the Technical Reviewers

Ferrell Drewry wrote his first SAS program in 1977 and has continued to use SAS software throughout his 30-year career in the pharmaceutical industry that includes experience managing data management, programming, biostatistics, and information technology departments. Today, Ferrell works as a SAS programmer and manager (in that order) on phase II/III clinical trials.

Ferrell holds a BS in Accounting from the University of Northern Colorado and an MS in Business Administration with a concentration in Management Information Sciences from Colorado State University.

Ferrell lives on the southeastern North Carolina coast where he enjoys being close to his grandchildren, surfing, fishing, and tinkering with his Raspberry Pi (almost in that order).

Travis E. Oliphant is the Founder and CEO/CTO of Quansight, an innovation incubation company that builds and connects companies with open source communities to help gain actionable, quantitative insight from their data. Travis previously co-founded Anaconda, Inc. and is still a Director. Since 1997, he has worked in the Python ecosystem, notably as the primary creator of the NumPy package and as a founding contributor of the SciPy package. Travis also started the Numba project and organized and led the teams that built Conda, Dask, Bokeh, and XND. Travis holds a PhD from the Mayo Clinic and BS and MS in Mathematics and Electrical Engineering from Brigham Young University.
