Lack of independence in anova() 2005-07-06 - By Spencer Graves
I'm confused: I understood Doug to be describing a traditional, normal theory ANOVA with k rows in the table for different effects plus a (k+1)st for residuals and the mean squares (MS) column are all indpendent chi-squares scaled by the same unknown sigma^2. The k statistics in the F column are ratios of independent chi-squares with the same independent denominator chi-square. How can they be indpendent?
spencer graves p.s. How is the method of synchronized permutations relevant to a traditional, normal theory ANOVA?
Phillip Good wrote:
> Do you or Lumley have a citation for this conclusion? Most people go > forward with the ANOV on the basis that the various tests are independent. > > Phillip Good > P.S. Tests based on the method of synchronized permutations are > independent. > > -----Original Message----- > From: Douglas Bates [mailto:dmbates@(protected)] > Sent: Monday, July 04, 2005 1:44 PM > To: pigood@(protected) > Cc: r-help > Subject: Re: [R] Lack of independence in anova() > > > I wrote up a note on how to do a more efficient simulation of the > p-values from anova then discovered to my surprise that the chi-square > test for independence of the significance of the F-tests indicated > that they were not independent. I was stumped by this but fortunately > Thomas Lumley came to my rescue with an explanation. There is no > reason why the results of the F tests should be independent. The > numerators are independent but the denominator is the same for both > tests. When, due to random variation, the denominator is small, then > the p-values for both tests will tend to be small. If, instead of > F-tests you use chi-square tests then you do see independence. > > Here is the note on the simulation. > > There are several things that could be done to speed the simulation of > the p-values of an anova like this under the null distribution. > > If you examine the structure of a fitted lm object (use the function > str()) you will see that there are components called `qr', `effects' > and `assign'. You can verify by experimentation that `qr' and > `assign' depend only on the experimental design. Furthermore, the > `effects' vector can be reproduced as qr.qty(qrstr, y). > > The sums of squares for the different terms in the model are the sums > of squares of elements of the effects vector as indexed by the assign > vector. The residual sum of squares is the sum of squares of the > remaining elements of the assign vector. You can generate 10000 > replications of the calculations of the relevant sums of squares as > > >>set.seed(1234321) >>vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18)) >>vv > > c r > 1 1 1 > 2 1 1 > 3 1 1 > 4 2 1 > 5 2 1 > 6 2 1 > 7 3 1 > 8 3 1 > 9 3 1 > 10 1 2 > 11 1 2 > 12 1 2 > 13 2 2 > 14 2 2 > 15 2 2 > 16 3 2 > 17 3 2 > 18 3 2 > >>fm1 <- lm(rnorm(18) ~ c*r, vv) >>fm1$assign > > [1] 0 1 1 2 3 3 > >>asgn <- c(fm1$assign, rep(4, 12)) >>system.time(res <- replicate(10000, tapply(qr.qty(fm1$qr, rnorm(18))^2, > > asgn, sum))) > [1] 20.61 0.01 20.61 0.00 0.00 > >>res[,1:6] > > [,1] [,2] [,3] [,4] [,5] [,6] > 0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023 2.63392426 > 1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199 3.32972093 > 2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963 0.03411672 > 3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018 4.53895212 > 4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118 > > The rows of that array correspond to the sum of squares for the > Intercept (which we generally ignore), the factor 'c', the factor 'r', > their interaction and the residuals. > > As you can see that took about 21 seconds on my system, which I expect > is a bit faster than your simulation ran. > > Because I set the seed to a known value I can reproduce the results > for the first few simulations to check that the sums of squares are > correct. Remember that the original fit (fm1) is not included in the > table. > > >>set.seed(1234321) >>fm1 <- lm(rnorm(18) ~ c*r, vv) >>anova(fm2 <- lm(rnorm(18) ~ c*r, vv)) > > Analysis of Variance Table > > Response: rnorm(18) > Df Sum Sq Mean Sq F value Pr(>F) > c 2 0.5825 0.2913 0.3801 0.6917 > r 1 0.2613 0.2613 0.3410 0.5701 > c:r 2 2.6260 1.3130 1.7137 0.2215 > Residuals 12 9.1943 0.7662 > > You can continue this process if you wish further verification that > the results correspond to the fitted models. > > You can get the p-values for the F-tests as > > >>pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = > > FALSE), > + pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = > FALSE), > + pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = > FALSE)) > > Again you can check this for the first few by hand. > > >>pvalsF[1:5,] > > pc pr pint > 1 0.69171238 0.57006574 0.2214847 > 2 0.37102129 0.03975286 0.1158059 > 3 0.28634939 0.26771167 0.1740633 > 4 0.57775850 0.13506561 0.8775828 > 5 0.06363138 0.16124100 0.1224806 > > If you wish you could then check marginal distributions using > techniques like an empirical density plot. > > >>library(lattice) >>densityplot(~ pc, pvals) > > > At this point I would recommend checking the joint distribution but if > you want to choose a specific level and check the contingency table > that could be done as > > >>xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF) > > I(pint < 0.16) > I(pr < 0.16) FALSE TRUE > FALSE 7204 1240 > TRUE 1215 341 > > The summary method for an xtabs object provides a test of independence > > >>summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)) > > Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF) > Number of cases in table: 10000 > Number of factors: 2 > Test for independence of all factors: > Chisq = 51.6, df = 1, p-value = 6.798e-13 > > for which you can see the puzzling result. However, if you use the > chisquared test based on the known residual variance of 1, you get > > >>pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE), > > + pr = pchisq(res[3,], 1, low = FALSE), > + pint = pchisq(res[4,], 2, low = FALSE)) > >>pvalsC[1:5,] > > pc pr pint > 1 0.7473209 0.60924741 0.2690144 > 2 0.4781553 0.05642013 0.1694446 > 3 0.5692081 0.45933855 0.4392846 > 4 0.7706757 0.28059805 0.9418946 > 5 0.5523983 0.53843951 0.6525907 > >>xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC) > > I(pint < 0.16) > I(pr < 0.16) FALSE TRUE > FALSE 7121 1319 > TRUE 1324 236 > >>summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)) > > Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC) > Number of cases in table: 10000 > Number of factors: 2 > Test for independence of all factors: > Chisq = 0.25041, df = 1, p-value = 0.6168 > > > > On 7/4/05, Phillip Good <pigood@(protected)> wrote: > >>To my surprise, the R functions I employed did NOT verify the independence >>property. I've no quarrel with the theory--it's the R functions that > > worry > >>me. >> >>pG >> >>-----Original Message----- >>From: Douglas Bates [mailto:dmbates@(protected)] >>Sent: Monday, July 04, 2005 9:13 AM >>To: pigood@(protected); r-help >>Subject: Re: [R] Lack of independence in anova() >> >> >>I have already had email exchanges off-list with Phillip Good pointing >>out that the independence property that he wishes to establish by >>simulation is a consequence of orthogonality of the column span of the >>row contrasts and the interaction contrasts. If you set the contrasts >>option to a set of orthogonal contrasts such as contr.helmert or >>contr.sum, which has no effect on the results of the anova, this is >>easily established. >> >> >>>build >> >>function(size, v = rnorm(sum(size))) { >> col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]), >> rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8])) >> row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6] >> +size[7]+size[8])) >> return(data.frame(c=factor(col), r=factor(row),yield=v)) >>} >> >>>size >> >>[1] 3 3 3 0 3 3 3 0 >> >>>set.seed(1234321) >>>vv <- build(size) >>>vv >> >> c r yield >>1 0 0 1.2369081 >>2 0 0 1.5616230 >>3 0 0 1.8396185 >>4 1 0 0.3173245 >>5 1 0 1.0715115 >>6 1 0 -1.1459955 >>7 2 0 0.2488894 >>8 2 0 0.1158625 >>9 2 0 2.6200816 >>10 0 1 1.2624048 >>11 0 1 -0.9862578 >>12 0 1 -0.3235653 >>13 1 1 0.2039706 >>14 1 1 -1.4574576 >>15 1 1 1.9158713 >>16 2 1 -2.0333909 >>17 2 1 1.0050272 >>18 2 1 0.6789184 >> >>>options(contrasts = c('contr.helmert', 'contr.poly')) >>>crossprod(model.matrix(~c*r, vv)) >> >> (Intercept) c1 c2 r1 c1:r1 c2:r1 >>(Intercept) 18 0 0 0 0 0 >>c1 0 12 0 0 0 0 >>c2 0 0 36 0 0 0 >>r1 0 0 0 18 0 0 >>c1:r1 0 0 0 0 12 0 >>c2:r1 0 0 0 0 0 36 >> >> >>On 7/4/05, Phillip Good <pigood@(protected)> wrote: >> >>> If the observations are normally distributed and the 2xk design is >>>balanced, theory requires that the tests for interaction and row > > effects > >>be >> >>>independent. In my program, appended below, this would translate to > > cntT > >>>(approx)= cntR*cntI/N if all R routines were functioning correctly. > > They > >>>aren't. >>> >>>sim2=function(size,N,p){ >>> cntR=0 >>> cntC=0 >>> cntI=0 >>> cntT=0 >>> cntP=0 >>> for(i in 1:N){ >>> #generate data >>> v=gendata(size) >>> #analyze after build(ing) design containing data >>> lm.out=lm(yield~c*r,build(size,v)) >>> av.out=anova(lm.out) >>> #if column effect is significant, increment cntC >>> if (av.out[[5]][1]<=p)cntC=cntC+1 >>> #if row effect is significant, increment cntR >>> if (av.out[[5]][2]<=p){ >>> cntR=cntR+1 >>> tmp = 1 >>> } >>> else tmp =0 >>> if (av.out[[5]][3]<=p){ >>> #if interaction is significant, increment cntI >>> cntI=cntI+1 >>> #if both interaction and row effect are significant, increment >> >>cntT >> >>> cntT=cntT + tmp >>> } >>> } >>> list(cntC=cntC, cntR=cntR, cntI=cntI, cntT=cntT) >>>} >>> >>>build=function(size,v){ >>>#size is a vector containing the sample sizes >>>col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]), >>>rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8])) >>>row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6] >>>+size[7]+size[8])) >>>return(data.frame(c=factor(col), r=factor(row),yield=v)) >>>} >>> >>>gendata=function(size){ >>> ssize=sum(size); >>> return (rnorm(ssize)) >>>} >>> >>>#Example >>> size=c(3,3,3,0,3,3,3,0) >>> sim2(size,10000,10,.16) >>> >>> >>> >>>Phillip Good >>>Huntington Beach CA >>> >>> >>> >>>______________________________________________ >>>R-help@(protected) mailing list >>>https://stat.ethz.ch/mailman/listinfo/r-help >>>PLEASE do read the posting guide! >> >>http://www.R-project.org/posting-guide.html >> >>> >>______________________________________________ >>R-help@(protected) mailing list >>https://stat.ethz.ch/mailman/listinfo/r-help >>PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html > > > ______________________________________________ > R-help@(protected) mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
-- Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA
spencer.graves@(protected) www.pdf.com <http://www.pdf.com> Tel: 408-938-4420 Fax: 408-280-7915
______________________________________________ R-help@(protected) mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
|
|