Randy Betancourt and Sarah Chen

Python for SAS Users

A SAS-Oriented Introduction to Python

Randy Betancourt
Chadds Ford, PA, USA
Sarah Chen
Livingston, NJ, USA
ISBN 978-1-4842-5000-6
e-ISBN 978-1-4842-5001-3
© Randy Betancourt, Sarah Chen 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
Introduction

For decades, Base SAS software has been the “gold standard” for data manipulation and analysis. The software can read any data source and is superb at transforming and shaping data for analysis. It has been the beneficiary of enormous resource investments over its lifetime. The company has one of the industry’s most innovative R&D staffs, and its products are well supported by an outstanding technical support staff and well documented by very capable technical writers. SAS Institute Inc. has remained focused on gathering customer input and building desired features. All of these characteristics help explain its popularity.

Since the beginning of this millennium, the accelerated growth of open source software has produced outstanding projects offering data scientists enormous capabilities to tackle problems that were previously considered outside the realm of feasibility. Chief among these is Python. Python has its heritage in scientific and technical computing domains and has a very compact syntax. It is a full-featured language that is relatively easy to learn and scales well, offering good performance with large data volumes. This is one of the reasons why firms like Netflix 1 use it so extensively.

By nature, SAS users are intrepid and are constantly trying to find new ways to expand the use of the software in pursuit of meeting business objectives. And given the extensive role of SAS within organizations, it only makes sense to find ways to combine the capabilities of these two languages to complement one another.

We have four main goals for our readers. The first is to provide a quick start to learning Python for users already familiar with the SAS language.

Both languages have advantages and disadvantages when it comes to a particular task. And since they are programming languages, their designers had to make certain trade-offs which can manifest themselves as features or quirks, depending on one’s perspective. This is our second goal: to help readers compare and contrast common tasks, taking into account differences in the languages’ default behaviors. For example, SAS names are case-insensitive, while Python names are case-sensitive. Likewise, the default sort sequence for the pandas library is the opposite of SAS’ default sort sequence.
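The two default behaviors mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not an example taken from the book: the variable names and the small DataFrame are made up, and we are assuming that the sort difference in question is the placement of missing values. SAS treats a missing numeric value as smaller than any number, so it sorts first in an ascending sort, whereas pandas places NaN last unless told otherwise.

```python
import numpy as np
import pandas as pd

# Python names are case-sensitive: these are two distinct variables.
# (In SAS, total and Total would refer to the same variable.)
total = 1
Total = 2
print(total == Total)

# Hypothetical data with one missing value in column x.
df = pd.DataFrame({"id": ["a", "b", "c"], "x": [3.0, np.nan, 1.0]})

# pandas default: ascending order, with missing values (NaN) placed last.
default_sorted = df.sort_values("x")
print(default_sorted["id"].tolist())

# na_position="first" moves NaN to the front, which is where SAS
# places missing numeric values in an ascending sort.
sas_like = df.sort_values("x", na_position="first")
print(sas_like["id"].tolist())
```

The na_position argument of sort_values is the switch to reach for when a pandas result needs to line up row for row with output sorted by SAS.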

Rather than attempting to promote one language over the other, our third goal is to point out the integration points between the two languages. The choice of which tool to utilize for a given task typically comes down to a combination of what you as a user are familiar with and the context of the problem being solved. Knowing both languages enlarges the set of tools you can apply for the task at hand.

And finally, our fourth goal is to develop working examples for all of the topics in both Python and SAS, giving you the opportunity to “try out” the examples not just by executing them but also by extending them to suit your own needs.

We assume you already have some basic knowledge of Python; for example, you already know how to import modules and execute Python scripts. If you don’t, then you will want to spend more time with Chapter 1, “Introduction,” covering topics such as Python installation, executing Python in a Windows environment, and executing Python in a Linux environment.

In Chapter 2, “Python Types and Formatting,” we cover topics related to the Python Standard Library such as data types, Booleans with a focus on truth testing, numerical and string manipulations, and basic formatting. If you are new to Python, then it is worthwhile to spend time on this chapter practicing execution of the Python and SAS examples.

If you have a solid grasp of the Python Standard Library, you can skip to Chapter 3, “pandas Library.” Beyond introducing you to DataFrames, we deal with the missing data problem endemic to any analysis task. An understanding of the pandas library underpins the remainder of the book.

Chapter 4, “Indexing and Grouping,” extends your knowledge of the pandas library by focusing on DataFrame indexing and GroupBy operations. A detailed understanding of these operations is essential for shaping data. We end this chapter by introducing techniques you can use for report production.

Data manipulation tasks such as merging, concatenation, subsetting, updating, appending, sorting, finding duplicates, drawing samples, and transposing are covered in Chapter 5. We have developed scores of examples in both Python and SAS to address and illustrate the range of problems you commonly face in preparing data for analysis.

In Chapter 6, “pandas Readers,” we cover many of the popular readers and writers used to read and write data from a range of different sources including Excel, .csv files, relational databases, JSON, web APIs, and more. And while we offer detailed explanations, it is the numerous working examples you can use in your own work that make this chapter so valuable.

Working with dates, datetimes, times, and time zones is the focus of Chapter 7. In this increasingly instrumented world we live in, we are faced with processing time-based data from literally trillions of sensors. Forming and appropriately handling time series data is no longer just the domain of time-based forecasting. Once again, we rely on the breadth of the provided examples to help you improve your skills.

In our last chapter, we introduce and discuss SASPy, the open source library from SAS Institute used to expose a Python interface to Base SAS software. The provided examples focus on building useful pipelines where the strengths of both languages come together in a single program to accomplish common data analysis tasks. This integration point between SAS and Python offers an enormous range of possibilities limited only by your imagination.

We hope you enjoy this book as much as we enjoyed putting it together!

Feedback

We would love to receive your feedback. Tell us what you liked, what you didn’t like, and provide suggestions for improvements. You can go to our web site, www.pythonforsasusers.com, where you can find all of the examples from this book, get updated examples, and provide us feedback.

Acknowledgments

As this project developed, there were a number of individuals whose contributions and insights were invaluable. Alan Churchill provided many suggestions used to form the introduction and provided sound advice on how to effectively write the SAS and Python examples. Soon Tan, CEO of Ermas, and Kit Chaksuvej, Senior Technical Consultant at Ermas, provided ideas for improving the source code examples. Tom Weber provided detailed feedback on Chapter 8, “SASPy Module.” Tom has been with SAS Institute’s R&D for 31 years and recently co-authored “SASPy, a Python interface to Base SAS,” available on GitHub.

We were fortunate to have Randy’s former boss at SAS Institute, Ferrell Drewry, generously agree to review all of the SAS code and related content. He is an outstanding SAS programmer. Ferrell’s meticulous attention to detail helped us tremendously.

And in another piece of good fortune, we benefited a great deal from having Travis Oliphant as a technical reviewer for the Python sections of the book. Not only was he generous with his time in providing detailed input, but Travis’ experience with Python also gave us insights into how the language is constructed. His commitment to the open source development community is a genuine inspiration.

We also want to acknowledge the great support from the Apress Media team we collaborated with. Susan McDermott, Senior Editor, guided us through the process of book writing, and we sincerely appreciate her enthusiasm and encouragement. Rita Fernando, coordinating editor, offered steady guidance that helped keep us on track, and she was always available to assist us when we needed it.

Finally, a personal acknowledgment from Randy to his wife Jacqueline and sons Ethan and Adrian who were enthusiastic supporters from the very beginning of this project.

About the Authors and About the Technical Reviewers

About the Authors

Randy Betancourt’s professional career has been in and around data analysis. His journey began by managing a technical support group supporting over 2000 technical research analysts and scientists from the US Environmental Protection Agency at one of the largest mainframe complexes run by the federal government. He moved to Duke University, working for the administration, to analyze staff resource utilization and costs. There, he was introduced to the politics of data access as the medical school had most of the data and computer resources.

He spent the majority of his career at SAS Institute Inc. in numerous roles, starting in marketing and later moving into field enablement and product management. He subsequently developed the role of Office of the CTO consultant.

Randy traveled the globe meeting with IT and business leaders to discuss the impact of data analysis on their businesses and the challenges they faced. At the same time, he talked to end users, wanting to hear their perspective. Together, these experiences shaped his understanding of the trade-offs that businesses make when allocating scarce resources to data collection, analysis, and deployment of models.

More recently, he has worked as an independent consultant for firms including the International Institute for Analytics, Microsoft’s SQL Server Group, and Accenture’s Applied Intelligence Platform.

Sarah Chen has 12 years of analytics experience in banking and insurance, including personal auto pricing, compliance, surveillance, and fraud analytics, sales analytics, credit risk modeling for business, and regulatory stress testing. She is a Fellow of both the Casualty Actuarial Society and the Society of Actuaries (FCAS, FSA), an actuary, data scientist, and innovator.

Sarah’s career began with five and a half years at Verisk Analytics in the Personal Auto Actuarial division, building predictive models for various ISO products. At Verisk she learned and honed core skills in data analysis and data management.

Her skills and domain expertise were broadened when she moved to KPMG, working with leading insurers, banks, and large online platforms on diverse business and risk management problems.

From 2014 to the present, Sarah has been working at HSBC on wholesale credit risk models. She has experience with PD, LGD, and EAD models in commercial real estate, commercial and industrial banks, and non-bank financial institution portfolios. She has been active in innovations within the organization.

Over the years, she has used many analytics tools including R, SAS, and Python.

Sarah graduated summa cum laude with a BA in Mathematics from Columbia University in 2007. She is the founder of Magic Math Mandarin, a school that emphasizes values and tomorrow’s skills for children.

About the Technical Reviewers

Ferrell Drewry wrote his first SAS program in 1977 and has continued to use SAS software throughout his 30-year career in the pharmaceutical industry, which has included managing data management, programming, biostatistics, and information technology departments. Today, Ferrell works as a SAS programmer and manager (in that order) on phase II/III clinical trials.

Ferrell holds a BS in Accounting from the University of Northern Colorado and an MS in Business Administration with a concentration in Management Information Sciences from Colorado State University.

Ferrell lives on the southeastern North Carolina coast where he enjoys being close to his grandchildren, surfing, fishing, and tinkering with his Raspberry Pi (almost in that order).

 
Travis E. Oliphant is the Founder and CEO/CTO of Quansight, an innovation incubation company that builds and connects companies with open source communities to help gain actionable, quantitative insight from their data. Travis previously co-founded Anaconda, Inc. and is still a Director. Since 1997, he has worked in the Python ecosystem, notably as the primary creator of the NumPy package and as a founding contributor of the SciPy package. Travis also started the Numba project and organized and led the teams that built Conda, Dask, Bokeh, and XND. Travis holds a PhD from the Mayo Clinic and BS and MS in Mathematics and Electrical Engineering from Brigham Young University.
 