Often in an analysis, several variables in the data measure the same underlying quantity. For example, in the mri data (available from, and documented on, Scott Emerson’s website), both the yrsquit and packyrs variables describe a person’s smoking history.

To assess such variables jointly, we need a multiple-partial F-test, which tests several coefficients at once. Before the uwIntroStats package, performing these tests took more code than necessary: the user first had to fit a linear model (or perhaps several linear models) and then run an ANOVA test.
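As a rough sketch of that older workflow, the base R code below fits a reduced and a full linear model and compares them with anova(). The data are simulated stand-ins that merely borrow the mri variable names, and the coefficients used to generate them are made up, so the numbers will not match the real data. Note also that anova() performs the classical F-test, whereas regress() reports robust standard errors by default, so the two approaches need not give identical statistics.

```r
# Simulated stand-in for the mri data; all values are made up for illustration.
set.seed(1)
n <- 200
dat <- data.frame(
  age     = rnorm(n, mean = 75, sd = 5),
  packyrs = rexp(n, rate = 1 / 20),
  yrsquit = rexp(n, rate = 1 / 10)
)
dat$atrophy <- -18 + 0.7 * dat$age + 0.03 * dat$packyrs +
  0.07 * dat$yrsquit + rnorm(n, sd = 12)

# The reduced model omits the smoking variables; the full model includes both.
reduced <- lm(atrophy ~ age, data = dat)
full    <- lm(atrophy ~ age + packyrs + yrsquit, data = dat)

# Classical multiple-partial F-test: are both smoking coefficients zero?
partial_f <- anova(reduced, full)
partial_f
```

The second row of the resulting table holds the F statistic, its 2 numerator degrees of freedom, and the p-value for the two smoking variables taken together.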

Now, using the U() function, the user can specify multiple-partial F-tests within a call to regress(), the regression function supplied by uwIntroStats. A full explanation of that function can be found in “Regression in uwIntroStats”.

This document provides an introduction to using the U() function as a supplement to regression analyses. To keep things simple, every example uses linear regression; the arguments to regress() are covered in its own vignette.

# Arguments to the U() function

To continue our example above, if we want to describe the association between cerebral atrophy and smoking and age using linear regression, we would have to use both the yrsquit and packyrs variables, in addition to the age variable. But as described above, the first two both measure smoking habits, so together they represent a single quantity and should be tested as a group.

The U() function requires only a formula when it is used to create a multiple-partial F-test. However, this is not a usual formula, because the response variable has already been defined in the outer formula in the call to regress(). For example, the formula given to regress() without the multiple-partial F-test follows the usual convention of lm():

atrophy ~ age + packyrs + yrsquit

Now if we want to make the F-test, we give U() the formula

~ packyrs + yrsquit

and it uses the response variable atrophy from the outer formula. In fact, supplying a response variable in the U() formula results in an error.
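The difference between the two kinds of formula can be seen in base R alone. The short sketch below only illustrates one-sided versus two-sided formula objects; it is not a description of how U() performs its check internally.

```r
# A one-sided formula has no response; a two-sided formula has one.
one_sided <- ~ packyrs + yrsquit
two_sided <- atrophy ~ age + packyrs + yrsquit

length(one_sided)    # 2: the tilde and the right-hand side
length(two_sided)    # 3: the tilde, the response, and the right-hand side
all.vars(one_sided)  # "packyrs" "yrsquit"
```

Creating a formula does not evaluate its variables, so this runs without any data loaded.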

Now we can run the regression.

library(uwIntroStats)
##
## Attaching package: 'uwIntroStats'
##
## The following object is masked from 'package:base':
##
##     tabulate
data(mri)
regress("mean", atrophy ~ age + U(~packyrs + yrsquit), data = mri)
## ( 1  cases deleted due to missing values)
##
##
## Call:
## regress(fnctl = "mean", formula = atrophy ~ age + U(~packyrs +
##     yrsquit), data = mri)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -35.673  -8.610  -0.873   7.727  52.552
##
## Coefficients:
##                             Estimate  Naive SE  Robust SE    95%L
## [1] Intercept                -18.22     6.312     6.812        -31.60
## [2] age                       0.7096   0.08401   0.09077       0.5314
##     U(packyrs + yrsquit)
## [3]   packyrs                0.02860   0.01694   0.01685     -4.488e-03
## [4]   yrsquit                0.07252   0.03241   0.03221      9.288e-03
##                             95%H         F stat    df Pr(>F)
## [1] Intercept                -4.850           7.16 1    0.0076
## [2] age                       0.8878         61.12 1  < 0.00005
##     U(packyrs + yrsquit)                      4.37 2    0.0130
## [3]   packyrs                0.06168          2.88 1    0.0901
## [4]   yrsquit                 0.1358          5.07 1    0.0246
##
## Residual standard error: 12.27 on 730 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.09961,    Adjusted R-squared:  0.09591
## F-statistic: 23.05 on 3 and 730 DF,  p-value: 2.882e-14

The regression output suggests that smoking history belongs in the model. The multiple-partial F-test, which tests the null hypothesis that the packyrs and yrsquit coefficients are simultaneously equal to zero, has an F-statistic of 4.37 on 2 degrees of freedom, with a p-value of 0.0130. Thus, at the 0.05 level, we would conclude that smoking is associated with cerebral atrophy after adjusting for age; the test for age (p-value less than 0.00005) indicates that age is also associated with atrophy. For a full example of the inference we would make from this model, see the vignette for using regress().
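The p-value for the grouped test can be checked by hand from the numbers printed in the output. The sketch below assumes that the reference distribution is an F distribution with 2 numerator degrees of freedom and the 730 residual degrees of freedom shown in the output; the inputs are copied from the rounded table, so the result can only be approximate.

```r
# Values copied from the U(packyrs + yrsquit) row and the residual df above.
f_stat <- 4.37
df_num <- 2
df_den <- 730

# Upper-tail probability of the F distribution (assumed reference distribution).
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
round(p_value, 3)  # approximately 0.013
```

This agrees, up to rounding, with the Pr(>F) value of 0.0130 reported for the group.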

# Naming the groups defined by U()

In our example above, we noted that both variables measure smoking habits. We can therefore name this group in the regression call to make the output more informative. The U() function lets us name a group by writing the name, followed by “=”, before the tilde in the formula. Here, we could name the group “smoke” by writing

U(smoke = ~packyrs + yrsquit)

This would return the following output.

regress("mean", atrophy ~ age + U(smoke = ~packyrs + yrsquit), data = mri)
## ( 1  cases deleted due to missing values)
##
##
## Call:
## regress(fnctl = "mean", formula = atrophy ~ age + U(smoke = ~packyrs +
##     yrsquit), data = mri)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -35.673  -8.610  -0.873   7.727  52.552
##
## Coefficients:
##                  Estimate  Naive SE  Robust SE    95%L       95%H
## [1] Intercept     -18.22     6.312     6.812        -31.60    -4.850
## [2] age            0.7096   0.08401   0.09077       0.5314     0.8878
##     smoke
## [3]   packyrs     0.02860   0.01694   0.01685     -4.488e-03  0.06168
## [4]   yrsquit     0.07252   0.03241   0.03221      9.288e-03   0.1358
##                     F stat    df Pr(>F)
## [1] Intercept            7.16 1    0.0076
## [2] age                 61.12 1  < 0.00005
##     smoke                4.37 2    0.0130
## [3]   packyrs            2.88 1    0.0901
## [4]   yrsquit            5.07 1    0.0246
##
## Residual standard error: 12.27 on 730 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.09961,    Adjusted R-squared:  0.09591
## F-statistic: 23.05 on 3 and 730 DF,  p-value: 2.882e-14

This output is more informative than the earlier version: the group label immediately reminds us that yrsquit and packyrs both measure smoking history.
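The naming syntax is ordinary R: the group name is simply the name of an argument in the call to U(). The sketch below uses only base R to take the unevaluated call apart. It illustrates the syntax and is not meant to show how uwIntroStats processes the call.

```r
# Capture the call without evaluating it, so uwIntroStats is not needed here.
grp <- quote(U(smoke = ~packyrs + yrsquit))

names(grp)                      # "" "smoke": the function slot is unnamed
deparse(as.list(grp)[[1]])      # "U"
deparse(as.list(grp)$smoke)     # "~packyrs + yrsquit"
```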