Thursday, April 25, 2013

Appending Dates to File Names

I have gotten to the point in my work where people like to request the same things over and over again just to see how things change with updated data.  At first, I was just overwriting my output (VERY BAD IDEA).  Once I saw the flicker of light, I started giving ridiculous, generic names like Safety_Update_1.rtf, Safety_Update_2.rtf, etc.

This is by no means ground breaking code.  It is not even an ingenious idea.  But maybe it will help someone out there save a little time with trying to find a way to simply append the date to your output files.

When I use dates in file names (or even folder names), I like to use pure numerals rather than alphanumeric for sorting purposes only.  For example, I could name a file "Updated_Output_25APR2013.csv", but this does not like to be sorted properly; therefore, making it difficult to find files quickly.  In other words, usingly strictly numbers for the date, the same file would read "Updated_Output_2013-04-25.csv".

Suppose you have 3 updated files named

  • Updated_Output_2Jan2013.csv
  • Updated_Output_11Jan2013.csv
  • Updated_Output_1Feb2013.csv

The would show up in your output folder sorted as:

  • Updated_Output_1Feb2013.csv
  • Updated_Output_2Jan2013.csv
  • Updated_Output_11Jan2013.csv

This is not the correct ordering because Feb. 1, 2013 does not come before Jan. 2, 2013.  In fact, it should be on the bottom.  This is because the date is a character string and is sorted alphabetically.

Now suppose you have the same files, but you used a different date convention:

  • Updated_Output_2013-01-02.csv
  • Updated_Output_2013-01-11.csv
  • Updated_Output_2013-02-01.csv
The ordered file names will come out in the same order.  Note that you must use YYYY-MM-DD or things will not sort chronologically.  If you did MM-DD-YYYY like Americans like to speak, things can go amiss.  They will sort fine for the examples above, but if you throw in different years, you can see that "Updated_Output_12-31-2013.csv" will come after "Updated_Output_02-01-2013.csv" because alphabetically, a "0" comes before a "1".  Thus, "02" (or February) will ALWAYS come before "12" (or December) regardless of the year.

Some might ask, "Well, you can sort files in a folder by the 'Date modified' column, so why do we need the date to sort things?"  There are two issues with that.  First this is the date modified, not the date created.  If you were to go into the Jan. 1 file and make any edit and then save it, the date modified would change to today.  Now when you sort, it shows up as the most recent when it is actually the oldest file.  Second, if you are storing a whole bunch of files with different files names, things can become very jumbled very quickly.  You may be able to assume that no changes are ever made to the files, so the date modified will remain the same as the date created; however, with different files being added in between updates with toally unrelated names, you can see how they may be sorted in date order, but they will not be grouped by file name.

It is best to use the same base name for the file and then append the date to it.  That way, when you sort by Name, everything is grouped together, and all similar file names are sorted chronologically.

Code

First, let me show you my little snippet of code, then we can walk through what it means.  This code would obviously be put at the end of a larger program.  You would do your data manipulation and then add:

data _null_;


      year=year(today());
      month=month(today());
      day=day(today());
      date=put(year,best.) || '-' || put(month,z2.) || '-' || put(day,z2.);
      call symputx('date',date);
run;
%put ~~~&date.~~~;

proc export data=safety outfile="C:\Temp\Safety Update - &date..csv" dbms=csv replace;
run;

Explanation

data _null_ is a way to do data-step processing without taking the time to actually create a dataset.  It is kind of like a function in Excel where you give the formula some parameters, and it gives back a value.  None of the processing is saved anywhere.  In this case, we are going to create a macro variable, but we need to do a little processing in order to get it.

SAS has some nice date functions.  The first is today().  You can get the same result by using date().  The function takes in no parameters, but it returns today's date.  So, if we submitted that code today (April 25, 2013), then you would get the SAS date for today.  Remember that SAS dates are the number of days from January 1, 1960, so today would be stored as 19473.

The other important date function used here are year()month(), and day().  These return just what you would expect: the year of a given date, the numeric month of a given date, or the day of a given date, respectively.  Thus, you can see that the first 3 lines inside the data _null_ statement are saving the numeric parts of today's date.

The 4th line is concatenating ("||" is the concatenation operator in SAS) the strings to appear as we want.  I have to use the put() function to change a numeric value to a character value.  The z2. format will change the first 9 months of the year or the first 9 days of the month to 01, 02, 03, etc. in stead of 1, 2, 3, etc.  It just makes things look nicer.  It essetially says, "I want my value to be 2 digits long.  If the value has fewer digits than that, then add zeros to the front to make it 2 digits long."

Finally, using the call symputx() function, we can take the value stored in the variable date and store it in a macro variable (which I also named date).

Just for fun, I used the %put ~~~&date.~~~; line to print to the log the value of the macro variable &date. This way, I can be sure that it will come out correctly without any spaces.  For example, try running the same code, but use call symput('date',date); instead of call symputx('date',date);.  Notice that I just removed the x from the end of the function name.  call symputx() removes leading and trailing blanks.  My %put statement just makes sure that there are not leading or trailing blanks.

The last thing to do is to output your results and simply adding the &date macro variable in the file name.  Tada!  Now you can run your same code over and over with updated data, and not tweaks are required to manually change the file name.

Well, this certainly turned into a much longer post that I imagined, but I hope it is a little bit of help in better understanding the how and why of appending dates to file names.

Wednesday, September 26, 2012

Clopper-Pearson Confidence Interval

Confidence intervals are a common statistical output.  In clinical trials, it is not uncommon to report rates such as the Overall Response Rate (ORR) or the Disease Control Rate (DCR).  Such rates are accompanied by confidence intervals.  There are many types of confidence intervals, but the interval that summarizes these rates is a binomial proportion confidence interval.

If you have taken an elementary probability course, you have been exposed to the binomial distribution.  Remember that the binomial distribution finds the probability of x success occurring in n trials.  In other words, if we flip a coin 10 times (n = 10), what is the probability that we would observe 6 heads (x = 6)?  My mother gave birth to 7 children, 4 of which were boys.  We could apply the binomial distribution to discover the probability that my mom could have 4 boys (x = 4) out of 7 children (n = 7).

ORR and DCR can be related to the binomial distribution because either a patient experiences an ORR (or a DCR) or she doesn't.  Success or failure.  Heads or tails.  Boy or girl.  Recall that ORR and DCR are derivations from RECIST standards that show if a patient's tumor is increasing (progressing), decreasing (responding), or stable (no getting better and not getting worse).  Here is a general table to help define ORR and DCR:

Term
Abbreviation
Definition
Complete Response
CR
A tumor completely disappears
Partial Response
PR
A tumor decreases in size by at least 30% from baseline but does not fully disappear
Stable Disease
SD
A tumor does not decrease enough to be a PR or increase enough to be a PD
Progressive Disease
PD
A tumor increases by 20% or more from baseline



Overall Response Rate
ORR
Number of patients experiencing a CR or PR out of all patients of interest
Disease Control Rate
DCR
Number of patients experiencing a CR, PR, or SD out of all patients of interest



Now that we have cleared up the background information, let's look at how to get the confidence intervals.

There are many forms of the binomial proportion confidence interval, but the most common used among statisticians is the Clopper-Pearson confidence interval.  The formula for the confidence interval is as follows:



Where:
  • x is the number of successes
  • n is the number of trials
  • p is the proportion of successes (x/n)
  • α is the level of significance
  • F() is the specified percentile of the F distribution

Now that the hard part is out of the way, we get to do the fun partSAS programming!



The Macro

This macro truly only requires 2 parameters.  The other 5 are for more specification and output preferences. The 2 required parameters are any two of x, n, or p (i.e., x and n, x and p, or n and p).  If only one of these parameters is specified, an error message is displayed, and the macro stops execution.  xn, and p have the same meanings as above (number of successes, number of trials, and proportion of successes).

alpha is defaulted to 0.05 for a 95% confidence interval, but it can be specified to 0.01 or 0.10 (for a 99% or 90% confidence interval, respectively) or any other level of desired significance.

sided defaults to 2, meaning a two-sided confidence interval.  Replace this with  sided=1 for a one-sided confidence interval.

print defaults to Y (or yes), meaning that the user wants the results to print to the output screen in SAS.  If printed output is not needed, this can be changed to N (or no).

Finally,  dsout allows the user to specify an output dataset.  If the results are not printed, it would be a good idea to specify an output dataset; otherwise, no results will be retained.  This is also handy if the results are needed for use in another program.



%macro cp_ci(x=,n=,p=,alpha=0.05,sided=2,print=Y,dsout=);
      %if (%sysfunc(upcase(&print.))=Y | %sysfunc(upcase(&print.))=YES) & &dsout.= %then %do;
            %put WARNING: No print option or output dataset has been specified.  No output will be displayed or saved.;
      %end;

      %let missing=0;
      %if &x.= %then %let missing=%eval(&missing.+1);
      %if &n.= %then %let missing=%eval(&missing.+1);
      %if &p.= %then %let missing=%eval(&missing.+1);

      /***  Print ERROR message and quit execution if insufficient number of entries  ***/
      %if &missing.>1 %then %do;
            %put ERROR: The CP_CI macro requires at least 2 of the 3 parameters - x, n, p;
            %if &x.= %then %put WARNING: x is empty.;
            %if &n.= %then %put WARNING: n is empty.;
            %if &p.= %then %put WARNING: p is empty.;
            %return;
      %end;

      /***  Calculate Confidence Limits  ***/
      data cp;
            /* calculate missing parameter */
            %if &x.= %then %do;
                  n=&n.;
                  p=&p.;
                  x=round(n*p);
            %end;
                  %else %if &n.= %then %do;
                        x=&x.;
                        p=&p.;
                        n=round(x/p);
                  %end;
                  %else %if &p.= %then %do;
                        x=&x.;
                        n=&n.;
                        p=x/n;
                  %end;
                  %else %do* if all 3 parameters are given, run a check to overwrite p based on x and n;
                        x=&x.;
                        n=&n.;
                        p=x/n;
                  %end;

            /* determine alpha based on 1-sided or 2-sided test */
            %if &sided.=1 %then %do;
                  a=1-(&alpha.);
            %end;
            %else %if &sided.=2 %then %do;
                  a=1-(&alpha./2);
            %end;

            /* calculate the lower limit */
            v11=2*(n-x+1);
            v12=2*x;
            fscore1=finv(a,v11,v12);
            coef1=(n-x+1)/x;
            cp_lcl=1/(1+coef1*fscore1);

            /* calculate the upper limit */
            v21=2*(x+1);
            v22=2*(n-x);
            fscore2=finv(a,v21,v22);
            coef2=(x+1)/(n-x);
            cp_ucl=(coef2*fscore2)/(1+coef2*fscore2);

            /* combine lower and upper limits into a single string and add variable labels */
            label x='Successes'
                    n='Trials'
                    p='Proportion'
                    cp_lcl='Clopper-Pearson Lower Limit'
                    cp_ucl='Clopper-Pearson Upper Limit';
            keep x n p cp_lcl cp_ucl;
      run;

      %let ci=%sysevalf(100*(1-&alpha.));       * for use in the title;

      /***  Set up title information  ***/
      %if &sided.=1 %then %do;
            title "1-sided &ci.% Clopper-Pearson Confidence Interval";
      %end;
      %else %if &sided.=2 %then %do;
            title "2-sided &ci.% Clopper-Pearson Confidence Interval";
      %end;

      /***  Print to Output screen if requested  ***/
      %if %sysfunc(upcase(&print.))=Y | %sysfunc(upcase(&print.))=YES %then %do;
            proc print data=cp label noobs;
                  var x n p cp_lcl cp_ucl;
                  format p cp_lcl cp_ucl 6.4;
            run;
      %end;

      /***  Save to output dataset if requested  ***/
      %if &dsout.^= %then %do;
            data &dsout.;
                  set cp;
            run;
      %end;

      /***  Clean up  ***/
      %if &dsout.= | %sysfunc(upcase(&dsout.))^=CP %then %do;
            proc datasets lib=work;
                  delete cp;
            run;
            quit;
      %end;
      title;
%mend cp_ci;



An Example

An investigational drug is given to 45 patients.  The following RECIST results are collected:



% (n)
CR
0% (0)
PR
4% (2)
SD
36% (16)
PD
60% (27)
ORR (CR+PR)
4% (2)
DCR (CR+PR+SD)
40% (18)


You now know the ORR and DCR (4% and 40%, respectively), but your manager is interested in the 95% confidence intervals for each result.  You should be able to see this as a binomial problem (a success/failure situation).  A patient is either counted in the ORR or not.  A patient is either counted in the DCR or not.

Simply running your new macro, you can now code:


%cp_ci(x=2,n=45);
%cp_ci(x=18,n=45);


and the following results will magically appear in your SAS output:


If you need the output saved to a dataset and do not need the results printed to the screen, the following code will get the desired results:


%cp_ci(x=2,n=45,dsout=orr_cp);
%cp_ci(x=18,n=45,dsout=dcr_cp);



where orr_cp is a SAS dataset containing the Clopper-Pearson confidence limits for the ORR and dcr_cp is a SAS dataset containing the Clopper-Pearson confidence limits for the DCR, both datasets saved in the WORK library.

Pretty simple, huh?