DETECTING PATTERNS IN DATA: A NEW STATISTIC FOR SMOOTHNESS AND NONRANDOMNESS (RESIDUALS, MODELING, RANDOMNESS TESTS)
The problem of detecting a pattern in data obscured by noise is common in much of science. An intuitive and often unstated assumption about this pattern is that it is smooth. We therefore consider the intuitive concept of "smoothness" as the basis of pattern detection. This concept is formalized in the context of mathematical splines and leads to a convenient, powerful nonrandomness measure called the curvature statistic. We define the roughness of a function $f$ as the integrated squared second derivative, $R(f) = \int f''(x)^2 \, dx$. It is known that, among all functions interpolating a set of data, the cubic spline $g$ is the smoothest, i.e. $R(g) = \inf_f R(f)$. We show that $R(g)$ may be calculated as a quadratic form $\mathbf{y}^T A \mathbf{y}$ in the vector of observations $\mathbf{y}$. The matrix $A$ is given as a product of tridiagonal matrices and inverses of tridiagonal matrices. The curvature statistic is defined to be $T(\mathbf{y}) = \mathbf{y}^T A \mathbf{y} / \sum_i (y_i - \bar{y})^2$. Under the null assumption that the $y_i$ are independent and normally distributed, the distribution of $T$ may be approximated by matching its low-order moments to a Beta-Jacobi polynomial series. The curvature statistic $T$ is closely related to the well-known mean square successive difference (MSSD) statistic. The numerator of the MSSD is exactly $\int h'(x)^2 \, dx$ for the interpolating linear spline $h$. This observation leads to a generalized MSSD (GMSSD) for unequally spaced data, specifically $\left[ \sum_i (y_{i+1} - y_i)^2 / (x_{i+1} - x_i) \right] / \sum_i (y_i - \bar{y})^2$. An analogous statistic, based on properties of the smoothing spline, is named the penalized curvature statistic, $T_\lambda$. In a Monte Carlo study, the curvature statistic is shown to have more power for detecting smooth patterns than existing approaches when the pattern sought is moderately complex and the expected level of contamination of the underlying deterministic function is small.
The advantages of using the curvature statistic increase when the data are non-uniformly spaced and as the number of data points increases. The penalized curvature statistic is more resistant to large error contamination levels yet retains most of the power to detect complex patterns, and is thus to be recommended. The calculation and use of this statistic are illustrated on several problems involving regression residuals.
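
As a rough illustration of the two unpenalized statistics defined above, the following Python sketch computes the curvature statistic and the GMSSD for data at distinct, increasing x values. It is not taken from the dissertation. It assumes that the interpolating cubic spline is the natural cubic spline and uses the standard textbook construction of its roughness, R(g) = y'QR^{-1}Q'y, with a banded matrix Q and a tridiagonal matrix R. The names Q, R, curvature_statistic and gmssd are introduced here for illustration only, and the author's matrix A may be organized differently. The penalized statistic and the Beta-Jacobi approximation of the null distribution are not covered.

```python
import numpy as np


def curvature_statistic(x, y):
    """Roughness of the natural cubic spline through (x, y), divided by the
    corrected sum of squares of y. Assumes x is strictly increasing, n >= 3."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    h = np.diff(x)  # h[i] = x[i+1] - x[i]

    # Q is n x (n-2); column j holds the weights of the second divided
    # difference centred on interior point j+1 (assumed standard form).
    Q = np.zeros((n, n - 2))
    # R is (n-2) x (n-2), symmetric and tridiagonal.
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):
        Q[j, j] = 1.0 / h[j]
        Q[j + 1, j] = -1.0 / h[j] - 1.0 / h[j + 1]
        Q[j + 2, j] = 1.0 / h[j + 1]
        R[j, j] = (h[j] + h[j + 1]) / 3.0
        if j + 1 < n - 2:
            R[j, j + 1] = R[j + 1, j] = h[j + 1] / 6.0

    d = Q.T @ y
    # y' Q R^{-1} Q' y; a dense solve keeps the sketch short, although a
    # banded solver would be the natural choice for large n.
    roughness = d @ np.linalg.solve(R, d)
    return roughness / np.sum((y - y.mean()) ** 2)


def gmssd(x, y):
    """Generalized MSSD: sum of (y[i+1]-y[i])^2 / (x[i+1]-x[i]), divided by
    the corrected sum of squares of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.diff(y) ** 2 / np.diff(x)) / np.sum((y - y.mean()) ** 2)


# Small hand-checkable case: three equally spaced points.
t_small = curvature_statistic([0, 1, 2], [0, 1, 0])  # approximately 9
g_small = gmssd([0, 1, 2], [0, 1, 0])                # approximately 3
```

In this construction, small values of either statistic indicate a smooth sequence (exactly linear data give a curvature statistic of zero), so when the statistics are applied to regression residuals, unusually small values relative to the null distribution would point to nonrandom structure.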