Chapter 11: Integrated Data and Regulatory Submissions

Regulatory Guidance

Data Integration Challenges

Data Integration Strategies

Representing Subjects Who Appear in Multiple Studies in Subject-Level Data Sets

Deciding Which Data to Integrate

Coding Dictionary Issues

Summary of Data Integration Strategies

Data Integration and Submission Tools

Setting Variable Lengths Based on the Longest Observed Value

Converting from Native SAS to Version 5.0 Transport Files

Converting from Version 5.0 Transport Files to Native SAS

Getting Submission Ready

Chapter Summary

The term integration in clinical trial parlance has its roots in the Code of Federal Regulations that outlines the content and format of a new drug application. These regulations state that applicants shall submit an “integrated summary of the data demonstrating substantial evidence of effectiveness for the claimed indications” and “an integrated summary of all available information about the safety of the drug product.” The terms integrated summary of safety and integrated summary of efficacy came about in subsequent guidance documents to describe these two cornerstones of the NDA submission. To meet the needs of SAS programmers responsible for placing key study results side by side in pages of output, or for presenting adverse event rates with patients pooled and accounted for in one denominator, the patient-level data also had to be combined, or integrated, into one database.

On the surface, the concept of integrating CDISC data seems straightforward. In fact, one of the primary motivations behind data standards has been to facilitate data integration. In concept, with similar (or identical) formats across studies, like data could be stacked on top of one another using a study identifier to distinguish between studies and unique subject IDs to differentiate data. For example, Subject 101 in Study 301 would be differentiated from Subject 101 in Study 302. Not only would this make life easier for the programmers working on integration for a regulatory submission, but it would also make life easier for statistical and medical reviewers at the FDA who are responsible for investigating, in great detail, safety signals that appeared to be part of a drug class effect. Ideally, data from various sponsors, who at one point in time all had their own proprietary data format, could be combined, stored, and extracted from one common database.

However, when the people working on data standards started delving into the details, they began to realize that, in practice, data integration could be a little more complicated. In this chapter, we discuss some of these challenges with integrating data, strategies for data integration, and tools for preparing your CDISC data for an FDA submission.

In the absence of any direct advice from regulators, an ADaM team has been organized to help tackle the problem of how to integrate standardized data for regulatory submissions.  They have released a draft version of a document titled “CDISC ADaM Data Structures for Integration: General Considerations and Model for Integrated ADSL (IADSL).”  This document will be referenced throughout this chapter.

See Also

CDISC ADaM Data Structures for Integration: General Considerations and Model for Integrated ADSL (IADSL), Version 1.0 Draft (http://www.cdisc.org/standards/foundational/adam)

ADaM Structure for Occurrence Data (OCCDS) version 1.0 (http://www.cdisc.org/standards/foundational/adam)

Study Data Technical Conformance Guide (http://www.fda.gov/downloads/ForIndustry/DataStandards/StudyDataStandards/UCM384744.pdf)

Regulatory Guidance

An abundance of regulatory guidance addresses integrated summaries and data pooling. Here is a list of some of the documents that contain this guidance:

   ICH M4E(R1)–Clinical Overview and Clinical Summary of Module 2

   FDA, Guidance for Industry: Integrated Summary of Effectiveness

   FDA, Guidance for Industry: Premarketing Risk Assessment

   FDA, Study Data Technical Conformance Guide

   EMA, Points to Consider on Application with 1. Meta-Analyses; 2. One Pivotal Study

Much of the language in the first two documents comes from the original 1988 FDA Guideline for the Format and Content of the Clinical and Statistical Sections of an Application (a.k.a. the “Clin-Stat guidance”). These documents provide important information about how and when to conduct integrated analyses and meta-analyses. However, they do not go into detail about how to put an integrated database together.

It should be pointed out that the scope of an integrated database can go beyond the integrated analyses. An integrated database might contain studies that are not necessarily valid for pooling or summarizing with other studies in a regulatory submission, but might be worthwhile for other purposes, such as an organization’s pharmacovigilance responsibilities.

Data Integration Challenges

As previously mentioned, the process of integrating data is not always clear-cut. For example, consider a key efficacy variable that collects subject assessments of improvements from baseline on a Likert scale. One study could have categories for 1=None, 2=A Little Better, 3=Better, 4=A Lot Better, and 5=Completely Better (No Symptoms at All). Another very similar study could have categories for -1 = Worse, 0 = No Improvement, 1=Somewhat Improved, 2=Improved, and 3=Greatly Improved. Not only is the wording of the categories different, but so are the numeric ratings, making the task of presenting the data in a unified way more challenging. How does this happen? Sometimes there are good intentions to make improvements from earlier studies. Sometimes the changes are at the behest of regulatory agencies. Sometimes changes are made to reflect new standards adopted by working groups. Whatever the reason, the question about how to deal with the change often is not asked until the integration work needs to be done.
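To illustrate, the kind of re-mapping decision this forces might look like the following sketch, in which the second study's -1 to 3 scale is re-coded onto the first study's 1 to 5 scale during integration. The library names, the variable AVALI, and, most importantly, the category alignment itself are assumptions made purely for illustration; deciding how (and whether) categories such as Worse and No Improvement should be collapsed is a clinical and statistical decision that must be documented.

* Hypothetical re-mapping of the Study 2 Likert responses onto the ;
* Study 1 scale so that the two studies can be pooled.             ;
data adqs_int;
  set adam301.adqs (in=in301)
      adam302.adqs (in=in302);
  if in301 then avali = aval;      * already on the 1 to 5 scale ;
  else if in302 then
    do;
      select (aval);
        when (-1, 0) avali = 1;    * Worse / No Improvement maps to None ;
        when (1)     avali = 2;    * Somewhat Improved maps to A Little Better ;
        when (2)     avali = 3;    * Improved maps to Better ;
        when (3)     avali = 4;    * Greatly Improved maps to A Lot Better ;
        otherwise;
      end;
    end;
run;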

In other cases, the challenges have less to do with the data collection and more to do with the study design. Consider open-label extension trials, for example. In these studies, for the sake of capturing long-term safety data, subjects being treated for a chronic condition are rolled over from a double-blind, randomized trial whose primary objective is to capture efficacy data into an open-label extension trial in which everyone is given the experimental treatment and followed primarily for long-term safety evaluations. In rarer cases, subjects are enrolled in extension trials and just observed for long-term safety or efficacy, without any continued experimental intervention. One challenge that these studies present is how to represent study subjects who appear in multiple trials in an integrated summary and in a subject-level data set such as DM (for SDTM data) or ADSL (for ADaM data).

With regard to the subject’s unique identifier, the SDTM IG and the CDER Common Issues document are very clear that USUBJID should be unique for any individual within a submission. It is therefore the responsibility of the sponsor, and SAS programmers involved with the studies, to ensure that study subjects who are rolled over from one study to the next are tracked and that only one ID is used for these subjects across studies. A common convention is to construct USUBJID as a concatenation of the study ID, the site ID, and the subject ID (or, if the subject ID already contains the site ID, just the study and subject IDs). For subjects who appear in multiple studies, the ID of the first study in which the subject was enrolled could be used for the USUBJID throughout.
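A minimal sketch of this convention follows. The input data set RAW_DM and the variable ORIGSTUD (holding the ID of the first study in which a rolled-over subject enrolled, as captured on the extension CRF) are assumptions for illustration only.

* Build USUBJID as <first study ID>-<subject ID> so that subjects  ;
* rolled over into the extension keep the ID from the parent study.;
data dm;
  set raw_dm;
  length usubjid $40;
  if not missing(origstud) then usubjid = catx('-', origstud, subjid);
  else usubjid = catx('-', studyid, subjid);
run;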

One other challenge when integrating data is deciding whether you should “up version” your data to a newer standard. The need to up version could occur when different studies are based on different versions of the same standard or when individual studies are in the same version of the same standard, but a newer version exists that might be preferred. In particular, if this problem relates to your SDTM data, then the consequences of your decision on the ADaM data also have to be given consideration. In such cases, a change to your SDTM data might necessitate a change to your ADaM data and metadata as well.

Data Integration Strategies

With regard to how to represent subjects who appear in multiple studies in subject-level integrated data sets (such as DM and ADSL), the solution is rarely obvious and not yet prescribed by CDISC standards or regulatory guidance. The options include representing subjects on multiple rows (one row for each study), representing subjects only once (regardless of the number of previous studies), or some combination of these two approaches. Throughout this chapter, when speaking of this situation, we are primarily considering the scenario where subjects are rolled over from an earlier trial to an extension trial. Having the same subject appear in multiple unrelated trials within a single clinical development program is not commonly allowed. Even if it is allowed, identifying such subjects is often difficult for the programmer or others involved with trial operations. This is due to a combination of privacy regulations that restrict the amount of personal data that can be collected and the lack of a CRF field, or other supporting data, to indicate which subjects have appeared in previous trials.

Another integration strategy that deserves some thought is whether to first integrate the SDTM data and then, from there, integrate the ADaM data, or whether to integrate only the ADaM data from the ADaM data sets in the individual studies.

The advantages and disadvantages of these different approaches are discussed in the following sections.

Representing Subjects Who Appear in Multiple Studies in Subject-Level Data Sets

A common approach to dealing with the integration of subject-level data sets is to keep one record per subject per study, even if this means subjects who appear in multiple studies are represented in multiple rows. This approach is typically easier than the alternative of collapsing subjects who appear in multiple studies into one row of data. It was also supported, to some extent, by the former CDER Common Data Standards Issues document, which stated (before being supplanted by the Study Data Technical Conformance Guide):

“Integrated summaries may contain more than one record per unique subject in the case that an individual subject was enrolled in more than one study.”

This pragmatic recommendation was a motivation behind a subsequent recommendation by the ADaM integration sub-team to allow the same subject to be represented in multiple rows in IADSL, if required by analysis needs.

One drawback of this approach is that it can complicate counting of unique subjects, particularly if pre-existing macros, functions, or software were built under the assumption that each subject is represented only once in data sets such as DM and ADSL. For example, in an integrated summary of adverse events, subjects who appear in multiple studies should be counted only once in denominators when calculating adverse event rates. This is because these denominators represent the number of subjects at risk for an event. If a data set contains the same subject on multiple rows and the AE reporting macro does nothing to check or correct for this assumption, then the results would be incorrect.  The IADSL structure includes a flag variable, UADSLFL, that can be used for reducing the data set to one record per subject.
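For example, a denominator of unique subjects at risk could be obtained by subsetting on this flag before counting, as in the following sketch (the data set name IADSL is assumed):

* Count each subject only once in the AE denominator, regardless of ;
* how many studies the subject appears in.                          ;
proc sql noprint;
  select count(distinct usubjid)
    into :denom trimmed
    from iadsl
    where uadslfl = 'Y';
quit;
%put NOTE: Unique subjects at risk = &denom;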

Another issue to consider is that subjects can change treatment groups as they go from one study to the next. Then some thought has to go into how to summarize subjects by treatment group. Take the earlier example of the randomized trial that is followed by an open-label extension trial. In the open-label extension, everybody receives the experimental treatment. Subjects who were initially randomized to receive a placebo will therefore switch treatments from placebo to the experimental treatment. Table 11.1 displays sample data for such a scenario.

Table 11.1: DM Data for a Double-Blind Study and Open-Label Extension Study

STUDYID | USUBJID       | ARMCD
Study 1:
XYZ123  | XYZ123-UNI101 | PLACEBO
XYZ123  | XYZ123-UNI102 | ALG123
XYZ123  | XYZ123-UNI103 | PLACEBO
Study 1 Extension:
XYZ124  | XYZ123-UNI101 | ALG123
XYZ124  | XYZ123-UNI102 | ALG123

Different organizations can implement different strategies for such a scenario. They might summarize AEs by individual treatment group so that subjects who receive different treatments are represented multiple times—once for each unique treatment that they receive. Or they might summarize AEs by a treatment sequence, as is done in crossover trials. Subjects who cross over from the placebo to the experimental treatment would be represented in one summary group. An example of how to represent this in an IADSL data set, using one record per subject per study, is shown in Table 11.2. The reserved variable TRTSEQP, which exists in the ADaM IG specifically for representing a subject’s planned treatment sequence, is used to pool treatment sequences across trials.

Table 11.2: Integration Option 1–IADSL Data for a Double-Blind Study and Open-Label Extension Study (One Record per Subject per Study)

STUDIES       | USUBJID       | ANLCAT | NUMSTUDY | ARMCD   | TRT01P              | TRTSEQP
Study 1:
XYZ123        | XYZ123-UNI101 | DB     | 2        | PLACEBO | Placebo             | Placebo/Analgezia
XYZ123        | XYZ123-UNI102 | DB     | 2        | ALG123  | Analgezia HCL 30 mg | Analgezia/Analgezia
XYZ123        | XYZ123-UNI103 | DB     | 1        | PLACEBO | Placebo             | Placebo/Analgezia
Study 1 Extension:
XYZ124        | XYZ123-UNI101 | EXT    | 2        | ALG123  | Analgezia HCL 30 mg | Placebo/Analgezia
XYZ124        | XYZ123-UNI102 | EXT    | 2        | ALG123  | Analgezia HCL 30 mg | Analgezia/Analgezia
Study 1 and the Extension Study:
XYZ123+XYZ124 | XYZ123-UNI101 | DB+EXT | 2        | PLACEBO | Placebo             | Placebo/Analgezia
XYZ123+XYZ124 | XYZ123-UNI102 | DB+EXT | 2        | ALG123  | Analgezia HCL 30 mg | Analgezia/Analgezia

Note subject XYZ123-UNI103, who did not roll over into the extension trial. This subject’s TRTSEQP value is Placebo/Analgezia, even though she never enrolled in the extension trial. For such subjects, you could then use TRTSEQA to represent the actual treatment sequence, which, for this subject, could contain a value of Placebo or Placebo Only.
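A rough sketch of such a derivation follows; the flag EXTFL (indicating entry into the extension) is an assumed variable, and the logic is simplified relative to what a real study would require.

* Actual treatment sequence: subjects who never entered the        ;
* extension get a single-study value such as 'Placebo Only'.       ;
data iadsl;
  set iadsl;
  length trtseqa $40;
  if extfl = 'Y' then trtseqa = trtseqp;
  else trtseqa = catx(' ', scan(trtseqp, 1, '/'), 'Only');
run;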

Note also the use of three new variables that appear in the IADSL draft document: STUDIES replaces STUDYID to allow multiple studies to be concatenated, thereby indicating the record to be used for integrated or pooled summaries; NUMSTUDY is an integer that indicates the number of studies a subject participated in; and ANLCAT is used to define an analysis category associated with each record.  A fourth new variable mentioned earlier, UADSLFL, is not shown but is required to be populated with a Y for one and only one record per unique subject.

The approach of thinking of extension trials in similar terms to the crossover trial provides the motivation for constructing the integrated ADSL data set in a way similar to how an ADSL data set would be constructed for a crossover trial—with one record per subject, but multiple columns for each treatment received. Because ADSL was designed with crossover studies in mind (and other study designs that allow subjects to receive multiple treatment groups), this is rather straightforward. Table 11.3 displays an example for such a scenario.

Table 11.3: Integration Option 2–ADSL Data for a Double-Blind Study and Open-Label Extension Study Combined (One Record per Subject)

STUDYID | STUDYID2 | USUBJID       | ARMCD   | TRT01P              | TRT02P              | TRTSEQP
XYZ123  | XYZ124   | XYZ123-UNI101 | PLACEBO | Placebo             | Analgezia HCL 30 mg | Placebo/Analgezia
XYZ123  | XYZ124   | XYZ123-UNI102 | ALG123  | Analgezia HCL 30 mg | Analgezia HCL 30 mg | Analgezia/Analgezia
XYZ123  |          | XYZ123-UNI103 | PLACEBO | Placebo             |                     | Placebo/Analgezia

In this example, rows of data for subjects who entered the extension trial are replaced with new columns such as STUDYID2 and TRT02P. STUDYID2 is not in the ADaM IG. It was added to capture additional study IDs for situations such as this. TRT02P does exist in the ADaM IG. Here again, the variable TRTSEQA could be used to display the actual treatment sequence for patients such as XYZ123-UNI103 who do not enter the extension trial.

There are many other columns that could be added to capture information specific to the extension trial. Examples include the enrollment date, flags to indicate which subjects entered the extension trial, and treatment start and stop dates (TR02SDT and TR02EDT).
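A simplified sketch of how such a one-record-per-subject data set might be assembled from the two study-level ADSL data sets follows. The library names, the flag EXTFL, and the renaming of TRTSDT/TRTEDT to TR02SDT/TR02EDT are assumptions, and the TRTSEQP derivation is deliberately simplified.

* Merge the double-blind ADSL with the extension ADSL by USUBJID,     ;
* moving the extension study variables into the second-period columns.;
proc sort data=adam301.adsl out=adsl_db;  by usubjid; run;
proc sort data=adam302.adsl (keep=usubjid studyid trt01p trtsdt trtedt
                             rename=(studyid=studyid2 trt01p=trt02p
                                     trtsdt=tr02sdt   trtedt=tr02edt))
          out=adsl_ext; by usubjid; run;

data iadsl;
  merge adsl_db (in=indb) adsl_ext (in=inext);
  by usubjid;
  if indb;                        * keep all double-blind subjects ;
  extfl = ifc(inext, 'Y', 'N');   * flag subjects entering the extension ;
  trtseqp = catx('/', scan(trt01p, 1), 'Analgezia');  * planned sequence (simplified) ;
run;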

For this set of studies, this integration solution might work. There can be other studies involved with the integration that can complicate matters. Consider another study similar to Study 1 but without an extension (we will call it Study 2). The trick then becomes how to combine treatment groups. For example, should placebo subjects from Study 2 be grouped and summarized with placebo subjects from Study 1 who switch to active drug in the extension phase? When it comes to integration, the scenarios, and methods for dealing with them, can be quite complicated. As the saying goes, the devil is in the details.

Deciding Which Data to Integrate

When you are integrating ADaM data only, the integrated analysis data sets are created from the ADaM data that was created for the individual studies. The primary advantage of this approach is that, by not having to integrate the SDTM data, it involves less work. There are some issues to consider, however, when taking this route:

   There is no source SDTM data: As emphasized throughout this book, one of the underlying assumptions of ADaM data (and one of the ADaM principles) is that there is corresponding SDTM data to which the ADaM data can be traced. When you are taking the approach of integrating ADaM data only, the source data (or immediate predecessor) becomes the ADaM data from the individual studies, which could create some traceability problems. For example, you will need to decide how to populate variables that support traceability such as SRCDOM. Should it refer to the same analysis data set at the study level, or should it refer to the SDTM domain at the study level? If referring to the same analysis data set at the study level, then is an additional field needed to identify the source study? Or will this be obvious from the STUDYID?

   With integrated SDTM data, the obvious choice for ADaM traceability variables would be to refer to the integrated SDTM data.

   Certain ADaM data might not exist at the study level: Not all summaries done for a clinical study report necessarily have a corresponding ADaM data set created. For many standard yet simple summaries, such as a summary of medical history, there are few derivations and less of a need to create ADaM data. Rather than conduct another set of transformations, programmers might instead want to simply merge in the treatment codes and produce summaries using the SDTM data (a sketch appears after this list). Not only is this easier for programmers, but it also gives data reviewers one less data format with which to become familiar.

   Analysis data that do not exist at the study level might be needed at the integration level: When taking the ADaM-only integration approach, you might find yourself with a need to summarize data that do not exist in ADaM at the individual study level across studies. You might therefore end up having to integrate SDTM data anyway in order to create the required integrated analysis data.
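As referenced above, a sketch of that shortcut follows: the ADSL planned treatment is merged onto the SDTM MH domain and a simple record-count summary is produced. The library names are assumptions.

* Attach the planned treatment to SDTM medical history and summarize ;
* directly, rather than creating a dedicated ADaM data set.          ;
proc sql;
  create table mh_trt as
    select mh.*, adsl.trt01p
      from sdtm.mh as mh
           left join adam.adsl as adsl
           on mh.usubjid = adsl.usubjid;
quit;

proc freq data=mh_trt;
  tables trt01p*mhdecod / norow nocol nopercent;
run;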

The problems noted above with the ADaM-only integration approach can be avoided by instead integrating the SDTM data first. When you are deciding which approach to take, these points must be weighed against the time and effort required to integrate the SDTM data. Any other potential benefits gained from integrating SDTM data must also be given careful consideration, especially in the context of how medical reviewers, both at regulatory agencies and with sponsor companies, evaluate safety.

At the study level, reviewers rely on SDTM data for two primary reasons: 1) to gain insight into the data’s lineage from collection to analysis and 2) to have access to all data collected (which is particularly important for a clinical review). With respect to the latter reason, sometimes certain data that are not used for analysis are needed after analyses are conducted. This can be for reasons that are unforeseen when preparing an analysis plan. At the submission level, the same thing is often true. Reviewers will need access to data that are not summarized in a report, but that are perhaps needed for some other reason.

Consider a medical reviewer who wants to look into the background information for all subjects in a submission who experienced AEs with fatal outcomes. This can involve medical and disease history and other data elements that are not typically provided in analysis data sets. Without such data integrated in SDTM data sets, the reviewer must manually pull the data from the individual studies, which can be much more time-consuming compared to having all of the data already available in one location.

Coding Dictionary Issues

Another common issue with integrated data is dealing with the various versions of coding dictionaries that were used across the individual trials. With MedDRA, for example, which is updated every six months, sometimes just keeping the version consistent within a study can be a challenge. With regard to integrated data, this issue is also recognized in the FDA’s Study Data Technical Conformance Guide, which states:

“Regardless of the specific versions used for individual studies, pooled analyses of coded terms across multiple studies (e.g., for an integrated summary of safety) should be conducted using a single version of a terminology. This will ensure a consistent and coherent comparison of clinical and scientific concepts across multiple studies . . . .”

For MedDRA codes, it is possible to programmatically update coding from an older version to a more recent version. With access to the original lower-level term (LLT), programmers can merge the old LLT with the current version of the MedDRA dictionary to find the new mapping from the LLT to the preferred term (PT). Then, with the new PT, mappings to the higher levels of the MedDRA hierarchy can also be updated. The other alternative is to re-code all events from scratch. This might be necessary if new LLTs exist that better represent the investigator’s description of the event. In practice, depending on the number of events, a combination of these two approaches might be most appropriate.
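A minimal sketch of the programmatic approach follows, assuming the current dictionary has been loaded into a data set named MEDDRA150 containing one row per lower-level term with variables LLT_CODE and PT_NAME (all of these names are assumptions):

* Re-map each event to the current MedDRA version by joining the   ;
* original LLT code to the new LLT-to-PT relationship.             ;
proc sql;
  create table ae_recoded as
    select ae.*,
           new.pt_name as aedecod_new
      from sdtm.ae as ae
           left join meddra150 as new
           on ae.aelltcd = new.llt_code;
quit;

The higher levels of the hierarchy (HLT, HLGT, SOC) can then be re-derived from the new preferred term, and any LLT that is no longer current in the new version can be flagged for manual re-coding.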

When events are re-coded, it is important to maintain traceability to the original coding used in earlier reports. For example, if a study report for a pivotal study contains events that do not appear in an integrated summary, a reviewer might question the integrity of the data. This is not an uncommon issue when coding is updated.

In an integrated SDTM AE data set, the coding variables should reflect the new coding. To see what changed from the original study, reviewers must manually make comparisons.

However, in an integrated ADAE data set, the original coding can exist on the same row of data that contains the new coding. The variables that allow you to trace back to the original coding are shown in Table 11.4.

Table 11.4: Historic Coding Variables Recommended by ADAE

Variable Name | Variable Label
DECDORGw      | PT in Original Dictionary w
BDSYORGw      | SOC in Original Dictionary w
HLGTORGw      | HLGT in Original Dictionary w
HLTORGw       | HLT in Original Dictionary w
LLTORGw       | LLT in Original Dictionary w
LLTNORGw      | LLT Code in Original Dictionary w

The suffix w represents an integer (1-9) that corresponds to a previous version. The metadata should include the dictionary name and version for each variable. For example, consider a study in which one report is done for an interim analysis and another is done for the final analysis. Assume that this study also appears in an integrated analysis. Also assume that version 10.0 of MedDRA was used for the interim analysis, version 12.1 was used for the final analysis, and version 15.0 was used for the integrated analysis. In the integrated ADAE data set, DECDORG1 could contain the preferred term coding used for the interim analysis, and DECDORG2 could contain the preferred term coding used for the final analysis. AEDECOD would contain the term coding used for the integrated analysis.
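One way this might be put together, sketched below with assumed data set names and assuming that AESEQ identifies the same event in both analysis snapshots, is to merge the interim and final ADAE data sets and rename their preferred term variables to the historic coding slots before re-deriving AEDECOD under MedDRA 15.0 (for example, via the LLT merge sketched earlier in this section):

* Preserve the interim (10.0) and final (12.1) preferred terms as  ;
* DECDORG1 and DECDORG2 before re-deriving AEDECOD under 15.0.     ;
proc sort data=adam.adae_final out=ae_fin; by usubjid aeseq; run;
proc sort data=adam.adae_interim (keep=usubjid aeseq aedecod
                                  rename=(aedecod=decdorg1))
          out=ae_int; by usubjid aeseq; run;

data adae_int;
  merge ae_fin (in=infin rename=(aedecod=decdorg2)) ae_int;
  by usubjid aeseq;
  if infin;
run;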

The metadata for all three variables should describe the version of MedDRA used for each of these variables. However, this can be problematic if multiple studies have multiple coding in an integrated data set. DECDORG1 might contain version 10.0 coding for one study, but version 8.1 for a different study. When many studies are being integrated and they all have used different versions of the dictionary in their original analysis, the metadata would have to be quite extensive to describe this history for all studies.

Summary of Data Integration Strategies

Ultimately, the decision on how to integrate clinical trial data requires careful planning and consideration of the issues pointed out above and in regulatory guidance documents. Many of the decisions depend on the number of trials in a particular development program, the design of those trials, and the number of different treatment groups evaluated in those trials. In summary, the list of possibilities might not be endless, but enumerating them all goes well beyond the scope of this book.

Lastly, before implementing any major decisions for an FDA submission, be sure to discuss your issues and proposals with your review division beforehand.

Data Integration and Submission Tools

Similar to the conversion process itself from raw or collected data to the SDTM and from the SDTM to ADaM, data integration can be automated only to a certain extent. The issues described earlier in this chapter need to be addressed, for the most part, on a case-by-case basis. For the actual integration work, of course, the SAS CDI tool can be used in a fashion similar to that illustrated in Chapter 5. Assuming that the data being integrated all have similar (or identical) CDISC structures, the integration process will be much easier than the conversion process shown in Chapter 5.

However, that is not to say that tools cannot be developed to assist with expected common tasks. Such tools are discussed in the following sections. One tool can be used to address the issue of large file sizes in integrated data. Another addresses conversions from native SAS data sets to the version 5 transport format needed for FDA submissions (and conversions back from the transport format to native SAS).

Setting Variable Lengths Based on the Longest Observed Value

Both the CDER Common Issues document and the FDA’s Study Data Specifications address an issue that can often occur with integrated data, especially lab data: data sets that are so big that they are difficult to open and work with on a standard-issue PC or laptop. As stated in the Study Data Technical Conformance Guide:

“The allotted length for each column containing character (text) data should be set to the maximum length of the variable used across all datasets in the study. This will significantly reduce file sizes. For example, if USUBJID has a maximum length of 18, the USUBJID’s column size should be set to 18, not 200.”

The habit of assigning character variables a length of 200 is tempting because it avoids the need to investigate beforehand how long a field actually needs to be to avoid truncation. In some cases, such as with --TESTCD variables, the length is predefined (to eight characters). In many other cases, however, there is no such restriction (aside from the length being less than 200 characters due to the SAS transport file limitation).

In order to avoid the problem of unnecessarily inflating data set file sizes, a macro could be used to determine the maximum observed length of character fields and to then dynamically assign the length based on the data. This does, admittedly, go against the virtue that we have espoused of defining your metadata (including variable lengths) up-front. But during the integration process, when the studies to be integrated tend to be complete or mostly complete, this can be evaluated beforehand. Alternatively, there could be a loop-back process where the metadata are updated after the integration.

The following macro, called %MAXLENGTH, performs this task.

/*----------------------------------------------------------------
 This macro determines the minimum required length of a variable,
      based on the maximum length observed
 DATALIST should be a list of similar data sets (for example, from
          different studies) that each contain each variable in
          VARLIST, separated by a space
 VARLIST should be a list of character variables for which the
         maximum length is to be examined; each variable in the
         list should be separated by a space
 Both DATALIST and VARLIST may consist of just one item
 Set INTEGRATE=1 if you want to have all datasets in DATALIST
     combined into one data set via: SET &DATALIST
 If INTEGRATE=1, then IDS should contain the name of the resulting
    Integrated Data Set
----------------------------------------------------------------*/
%macro maxlength(datalist, varlist, integrate=0, ids= );

  ** create global macro variables that will contain the
  **   maximum required length for each variable in &VARLIST;
  %let wrd = 1;
  %do %while(%scan(&varlist,&wrd)^= );
     %global %scan(&varlist,&wrd)max ;
     %let wrd = %eval(&wrd+1);
  %end;

  ** initialize each maximum length to 1;
  %let wrd = 1;
  %do %while(%scan(&varlist,&wrd)^= );
    %let %scan(&varlist,&wrd)max=1;
    %let wrd = %eval(&wrd+1);
  %end;

  ** find the maximum required length across each data
  ** set in &DATALIST;
  %let d = 1;
  %do %while(%scan(&datalist,&d, )^= );
    %let data=%scan(&datalist,&d, );
    %put data=&data;
    %let wrd = 1;
    data _null_;
      set &data end=eof;
      %do %while(%scan(&varlist,&wrd)^= );
        %let thisvar = %scan(&varlist,&wrd) ;
        retain &thisvar.max &&&thisvar.max ;
        &thisvar.max = max(&thisvar.max,length(&thisvar));
        if eof then
          call symput("&thisvar.max", put(&thisvar.max,4.));
        %let wrd = %eval(&wrd+1);
      %end;
    run;
    %let d = %eval(&d+1);
  %end;
  %let datasets=%eval(&d - 1);

  %if (&integrate=1 and &ids^= ) or &datasets=1 %then
    %do;
      %let wrd = 1;
      data %if &integrate=1 %then &ids; %else &datalist; ;
        length %do %while(%scan(&varlist,&wrd)^= );
                 %let thisvar=%scan(&varlist,&wrd);
                 &thisvar   $&&&thisvar.max..
                 %let wrd = %eval(&wrd+1);
               %end;
        ;
        set &datalist;
      run;
    %end;
  %else %if &integrate=1 and &ids= %then
    %put PROBLEM: Integration requested but parameter IDS is blank;

%mend maxlength;

   Each variable in &VARLIST becomes a global macro variable with MAX appended to the end of the macro variable name. These macro variables are initialized to a value of 1.

   In this DO-WHILE loop, each data set specified in &DATALIST is evaluated separately. The values of the --MAX macro variables are compared to the largest observed length for each new data set and, if necessary, are re-assigned to the new largest observed value.

In order to test the macro, consider the following sample program:

data a b c;
  length x y z $200 ;
  x = 'CDISC, SDTM, ADaM';
  y = 'Y';
  z = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
  output;
  x = 'hi';
  output;
run;

%maxlength(work.a work.b c, x y z, integrate=1, ids=d);
%put xmax=&xmax ymax=&ymax zmax=&zmax;

proc contents data=d;
run;

The maximum lengths of variables X, Y, and Z are evaluated in each data set A, B, and C. With the parameters INTEGRATE=1 and IDS=D, these three data sets are SET together to create integrated data set D. Before the SET statement and after the DATA statement, a LENGTH statement appears that assigns the lengths of variables X, Y, and Z using the &XMAX, &YMAX, and &ZMAX global macro variables. The LOG file shows the following results from the %PUT statement:

%put xmax=&xmax ymax=&ymax zmax=&zmax;

xmax=  17 ymax=   1 zmax=  40

The output file shows that the lengths of variables X, Y, and Z in data set D are, as expected, 17, 1, and 40, respectively. Applied to a large data set with a lot of padded character variables, the result can be a significant reduction in the size of the data set.

There is one potential problem to be aware of with this approach, however. Keep in mind that certain variables appear in multiple data sets. Having different lengths for these variables in these different data sets could cause truncation problems when merging data sets. To address this problem, another macro can be used that looks for the minimum required length to avoid truncation across all data sets in the same directory. This macro, called %MAXLENGTH2, is a bit more complicated and too long to include within the text of this book, but it can be found online on the authors’ SAS web pages (http://support.sas.com/publishing/authors/index.html). An example of how this macro could be used in practice appears in a later section of this chapter titled “Getting Submission Ready.”

Converting from Native SAS to Version 5.0 Transport Files

Many programmers might prefer to keep their SDTM and ADaM data sets in a native SAS format until they are ready to submit the data to the FDA for a regulatory submission. SAS provides a set of macros that facilitate conversions from native SAS formats to SAS transport files. These can be found at http://www.sas.com/en_us/industry/government/accessibility/accessibility-compliance.html#fda-submission-standards. As suggested by the URL, these macros are to help programmers convert their data for FDA submissions.

Many of these macros build off one another. The %TOEXP macro converts a library of native SAS data sets to a separate directory of XPT files. It calls another macro, %DIRDELIM, which is also included in the set. The %TOEXP macro is quite simple to use:

%let outdir=&path\data\sdtm\xpt2;
%let indir=&path\data\sdtm;
%toexp(&indir, &outdir);

These macros are designed to work in Microsoft Windows (PC), UNIX, or Linux environments. They can handle format catalogs as well (although that is not needed when dealing with SDTM and ADaM data sets because they should have no user-defined formats attached).

Because validation software such as Pinnacle 21 Community requires data sets to be in the transport file format, this macro can also be used before running a validation report with such software. (See Chapters 8 and 9.)

Converting from Version 5.0 Transport Files to Native SAS

Regulatory reviewers and, perhaps, anyone else who prefers to deal with native SAS data sets rather than transport files, might be more interested in the %FROMEXP macro. As the name suggests, this macro converts a directory of transport files to native SAS data sets. Coupling this macro with the %MERGSUPP macro, reviewers can quickly convert their submission data to native SAS data sets and have all supplemental qualifiers merged in with their parent domains, which can make review work much easier.

Here is a sample demonstration:

%let indir=&path\data\sdtm\xpt2;
%let outdir=&path\data\sdtm\mergsupp;
%fromexp(&indir, &outdir);

libname sdtm "&outdir";
%mergsupp(sourcelib=sdtm, outlib=sdtm);

Running these macros on the SDTM data that were created in earlier chapters creates a version of the DM domain (in the MERGSUPP subdirectory) that has columns for RANDDTC and RACEOTH, the two supplemental qualifiers in SUPPDM.

Getting Submission Ready

Assume that you have finished creating your SDTM and ADaM data, have used them to run analyses for a clinical study report, and are ready to submit them for a new drug application. Now you can use tools that you have learned about in this and other chapters to create and run a short program that accomplishes three important tasks:

1.   Convert the data sets from their native SAS format to the SAS transport file format.

2.   Shorten text strings to the minimum required length (directory-wide) to avoid truncation (per FDA and CDER recommendations).

3.   Run a Pinnacle 21 Community report on the final data.

The following program can achieve these tasks with a minimal amount of code:

%let sdtmpath=&path\data\sdtm;
%let truncpath=&sdtmpath\trunc;
%let xptpath=&path\data\sdtm\xpt;

libname sdtm "&sdtmpath";
libname trunc "&truncpath";

%maxlength2(sourcelib=sdtm, outlib=trunc);
%toexp(&truncpath, &xptpath);
%run_p21(sources=&xptpath, config=config-sdtm-3.1.2.xml, define=N);

   The %MAXLENGTH2 macro was mentioned in the earlier section (“Setting Variable Lengths Based on the Longest Observed Value”), but was not shown due to its length. (But it is available on the authors’ web pages at http://support.sas.com/publishing/authors/index.html.) This macro reads in all of the character variables from all of the data sets in the directory specified by the SOURCELIB parameter and determines the minimum length required to avoid truncation (with the exceptions of --TESTCD or PARAMCD variables, which are set to a length of $8; --TEST variables, which are set to a length of $40; and --DTC variables, which are set to a length of $20). Numeric variables are assigned a length of 8. The macro writes the new data sets with the shortened text fields to the directory specified by the OUTLIB parameter. Remember that changing the lengths of variables will affect your metadata. As a result, some sort of loop-back is needed so that those length values can be corrected.

   The macro %TOEXP was just introduced in a previous section titled “Converting from Native SAS to Version 5.0 Transport Files.” It converts the data sets to XPT files.

   Finally, using the %run_p21 macro introduced in Chapter 9, the newly created transport files are checked for their SDTM compliance. As specified by that macro, results are written to an Excel file in the same directory where the data sets reside.

Chapter Summary

Ease of integration is one of the primary motivations for developing data standards. However, even data with common formats and structures across multiple studies present some challenges when it comes to integration. Considerations for dealing with some of these issues are discussed in this chapter. The process of doing the actual integrations can be a manual one, with Base SAS programming, or performed using SAS CDI, as shown in Chapter 5.

Certain macros were introduced in this chapter to assist with data integration and regulatory submissions (and review). The %maxlength macro can be used to determine the minimum length needed to avoid truncation of character fields without unnecessary padding that can greatly inflate data set sizes. Other macros available from the SAS website were shown to demonstrate how you can easily convert an entire directory of native SAS data sets to transport files and, for reviewers, how you can convert an entire directory of transport files back to native SAS data sets, while also using the %mergsupp macro (introduced in Chapter 7) to merge supplemental qualifiers into their parent domain along the way.
