Curve Fitting, Data Modelling, Interpolation, Extrapolation

This vague heading covers a multitude of closely related endeavors. There are many approaches, described in the following sections.
• Interpolation
• Statistical Curve Fitting
• Levenberg-Marquardt algorithm
• Time series analysis
• Splines
• Nearest Neighbor algorithms
• Fitting a circle to a set of points
• ZunZun Interactive curve fitting

Commercial packages:

• CurveExpert

Other approaches to data modelling are:

• Neural Networks .
• Wavelets .

Interpolation

For one dimensional work, see Stoer & Bulirsch 1980 or Press et al 1992. No doubt some of the techniques described there can be found in netlib (see Netlib ) by using the NIST guide (see Indices of NA Software on the Net ).

For greater than or equal one dimensions, some references are:

IAMG Resources: IAMG International Association for Mathematical Geology: resources

Watson, Dave, 1992, Contouring: A guide to the analysis and display of spatial data, Pergamon Press, ISBN 0 08 040286 0. Text includes a floppy with source code in Basic (sigh). Subject areas: isoline maps, isochor maps, methodology, data sorting, subset selection, local coordinates, binary local coordinates, barycentric coordinates, rectangular local coordinates, natural neighbour coordinates, gradient estimation, least squares gradients, least squares quadratics, spline gradients, cross product gradients, neighbourhood-based gradients, interpolation, inverse distance weighting, inverse distance weighted gradients, minimum curvature splines, neighbourhood-based interpolation, selected references to contouring literature

[Dave Watson] writes: Linear triangle-based interpolation is quick and dirty - probably the best compromise available if you are in a hurry. Natural neighbor interpolation is much slower but does provide a result that most people consider to be closest to that which an experienced manual draftsman would generate. The 'kriging' method is a variation on the splines technique - the main difference being that a subjectively selected probabilistic function is used rather than one of the analytically derived radial basis functions. This class of method generates discontinuities at points where the interpolation subset changes, and kriging attempts to ameliorate this effect be recomputing the coefficients for each interpolation point - this makes it very slow.

Interpolation on spheres: Dave Rusin's FAQ See also the comp.graphics.algorithms FAQ and the graphics FAQ.

Software:

[Albrecht Preusser] wrote: You could take routines from ACM Alg. 624 "Triangulation and Interpolation of Arbitrarily Distributed Points in the Plane" by Robert J. Renka. This Algorithm is available from netlib (see Netlib ) in the toms directory. Another algorithm in FORTRAN is contained in Akima's ACM Alg. 526, it is faster, but takes much more memory. By the way, I have developed methods based on the Delaunay Triangulation from both algorithms for computing nonlinear contours and nonlinear interpolation, which are published as ACM Alg. 626 and 684.

Spatial and Geometric Analysis toolbox (SaGA) - a MATLAB package for various aspects of spatial data interpolation and analysis and geometric modeling. It is available at: Glenn Kirill OAX is an spacial objective analysis program. OAX applies the method of optimal interpolation (objective analysis) to estimate the values of variables at specified points in a multidimensional space. For each point, the computation is based on a weighted average of the values of a specified number of data points (i.e. "nearest neighbours") which are closest to that point. For the sake of efficiency, several dependent variables can be interpolated simultaneously using the same weights if the underlying statistical model and relative noise level in the input data are assumed to be same for each dependent variable.

In this version, OAX extends the capability of earlier implementations (OPTIMAL_ESTIMATOR directive, oax3.0, oax4.0) by allowing the parameters controlling the calculation of the weights to vary as a function of the independent variables, and by allowing the specification of a relative uncertainty (NOISE) for each input data record separately.

Statistical Curve Fitting

Books

Jobson, J.D., Applied Multivariate Data Analysis, volumes 1 and 2, Springer-Verlag, New York. 1991.

Johnson, Richard A.; Wichern, Dean W, Applied Multivariate Statistical Analysis, Third Edition. [Ramin Samadani] writes: I like the book by Lawson and Hanson, solving least squares problems. I think its supposed to be back in print now or soon. Lawson, Charles L.; Hanson, Richard J. Prentice-Hall, Englewood Cliffs, N.J. 1974

Press et al 1991 also has a description of the field.

Spaeth, Helmuth: "Mathematical Algorithms for Linear Regression" 1987, Academic Press.

Software

ODRPACK, from Netlib , is a portable collection of Fortran subprograms for fitting a model to data. It is designed primarily for instances when the explanatory as well as the response variables have significant errors, implementing a highly efficient algorithm for solving the weighted orthogonal distance regression problem, i.e., for minimizing the sum of the squares of the weighted orthogonal distances between each data point and the curve described by the model equation.

Levenberg-Marquardt algorithm

[Alan Miller]: The non-linear least-squares problem is to find a vector of parameter values, a, to minimize the sum of squares of differences between a given vector, y, of observed values and a vector of fitted values f(a,x) where the values of x are known. e.g. a particularly notorious problem is to fit:

` y = a(0) + a(1).exp(-a(2).x) + a(3).exp(-a(4).x)`

sometimes with more exponential terms, to a set of (x,y)-pairs.

Writing S = sum [y(i) - f(a,x(i))]^2, where i denotes the i-th observation pair, we have that the first and second derivatives of S with respect to the parameters, a(j), are:

` S' = -2.sum [r(i).(df/da(j))]`

` S'' = 2.sum [(df/da(j)).(df/da(k))] - 2.sum [r(i).d2f/(da(j)da(k))]`

where `r(i)` is the `i-th residual = y(i) - f(a,x(i)).`

The Gauss-Newton (G-N) method uses the linearization of f about the latest estimates of a, and solves a linear least sqaures problem to obtain a new vector a.

If J denotes the Jacobian = the n x k matrix of values of (df/da(j)) at the current estimates of the parameter values, where n is the number of observations and k is the number of parameters, then the G-N method ignores the second derivatives of f in S'' and uses

` a(new) = a(old) + inv(J'J).J'r`

This guarantees a downhill search direction but often overshoots unless good starting values are available.

The Levenberg-Marquardt (L-M) method uses a trust-region approach, also known as ridge regression or regularization, to shrink the step sizes to ensure a reduction in S at each iteration. In the L-M method

` a(new) = a(old) + inv(J'J + D).J'r`

where D is a diagonal matrix. Levenberg suggested that D = d.I, that is that the same scalar value be added to each diagonal element. In practice, the diagonal elements of J'J are often of vastly different magnitudes, and Marquardt suggested

`D = lambda.diag(J'J).`

For the first iteration, lambda is given a fairly large value (say 1.0). If the first attempt at an iteration reduces S then lambda is reduced for the next iteration by a factor of say 5-10. If the first step increases S then lambda is increased by a factor of perhaps 3-5 until S is reduced. The final value of lambda for the iteration is then used for the next iteration.

The L-M method converges upon a stationary point (or sometimes a sub-space) which may be the minimum which is being sought, but it could be a saddle-point, a local minimum, or one or more of the parameter values may tend to infinity.

The Jacobian of first derivatives is often ill-conditioned so that QR factorizations are usually employed to solve the LS equations.

The L-M algorithm is often very efficient. Methods which use second derivatives require substantially more computation and then it is common to find that the Hessian matrix is not positive definite very close to the minimum.

Software:

MINPACK includes L-M algorithms using either analytic derivatives or derivatives estimated from differences. Obtain from the minpack directory of netlib (see Netlib ).

For many problems, the function f is linear in some of the parameters. In the sum of exponentials example mentioned, f is linear in a(0), a(1) and a(3). The VARPRO code uses an L-M type of algorithm but separates the linear from the non-linear parameters. This code is available from the opt directory of netlib (see Netlib ).

The package NL2SOL (TOMS algorithm 573) uses a hybrid combination of G-N and estimation of a positive-definite approximation to the Hessian from differences of first derivatives. It is available from the toms directory of netlib (see Netlib ). (Fortran 90 version from Alan Miller ).

Updates which allow for bounds on parameter values can be found in netlib directory port with file names starting n2f, n2g, rn2f or rn2g. (See Netlib ).

There is a very easy-to-use shareware package for PCs including graphics (NLREG) which is based upon the NL2SOL package, and includes graphics. It is more successful than many of the non-linear least-squares solvers found in commercial statistical packages. It can be downloaded from the math directory of simtelnet.

Time series analysis

Time series analysis may be used for some types of data. See:

Brockwell, Peter J.; Davis, Richard A., 1991, Time Series: Theory and Methods, Second Edition, Springer-Verlag, New York. ISBN 0-387-97429-6. [SJS]: Beautiful approach with emphasis on theory. Emphasis on the linear model.

Hamilton, James D., 1994, Time Series Analysis, Princeton U. Press, Princeton, NJ. ISBN 0-691-04289-6. [SJS]: Applied approach, with heavy emphasis on applications from economics. Not having an economics background is a detriment in reading this text. Broad mix of models.

Montgomery, Douglas C. / Forecasting and time series analysis (1990)

Pankratz, Alan / Forecasting with Univariate Time Series Models: Concepts and Cases / (1983) / good, clear

Vandaele, Walter / Applied Time Series and Box-Jenkins Models (1983). Good; pedagogical and practical

ARfit is a Matlab package for the estimation and spectral decomposition of multivariate autoregressive models.

ARfit is a collection of Matlab routines for

• estimation of parameters and confidence regions for multivariate autoregressive (AR) processes,
• diagnostic checking of fitted models, and
• spectral decomposition of AR models

ARfit may be freely distributed. Suggestions for improvement are greatly appreciated. M-files, documentation, and a paper describing the algorithms used are available.

Also see Statlib .

Also see Probability and Statistics .

Splines

Useful for modelling single and multiple dimensional data.

Books:

de Boor, Carl, 1978, A Practical Guide to Splines, Springer-Verlag, New York, ISBN 0-540-90356-9. [Tim Strotman]: Provides an excellent blending of theory and practical implementation.

Dierckx, Paul, 1993, Curve and surface fitting with splines, Clarendon Press, ISBN 0198534418

Farin, Gerald E., 1995, NURB curves and surfaces : from projective geometry to practical use, A.K. Peters, ISBN 1568810385

Farin, Gerald E., 1993, Curves and surfaces for computer aided geometric design : a practical guide, third edition, Academic Press

Nurnberger, Gunther, 1989, Approximation by Spline Functions, Springer-Verlag.

Piegl, Les; Tiller, Wayne, 1995, The NURBS Book, Springer-Verlag, New York, ISBN 0-387-55069-0. [Tim Strotman]: Provides an excellent blending of theory and practical implementation.

Spaeth, Helmuth, 1995, One Dimensional Spline Interpolation Algorithms, A.K. Peters, ISBN 1-56881-016-4

Spaeth, Helmuth, and A.K. Peters, 1995, Two Dimensional Spline Interpolation Algorithms, A.K. Peters, ISBN 1-56881-017-2

Wahba, Grace, 1990, Spline Models for Observational Data

Software:

See Indices of NA Software on the Net .

Nearest Neighbor algorithms

Books:

Dasarathy, Belur V., 1991, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA. [SJS]: Excellent.

Fitting a circle to a set of points

Problem: Given N Points:

`(x1,y1); (x2,y2).. (xN,yN),`

we want to find for best-fit circle:

`(X0,Y0), R.`

(Note: for fitting an *ellipse*, substitute the equation for an ellipse for the equation for a circle in the "Brute Force Approach").

Brute Force Approach (leads to a non-linear system): [Amara Graps]

Idea: Minimize by least squares the root-mean-squared error of the radius in the equation for a circle. In this method, one minimizes the sum of the squares of the *length* differences, which gives you a non-linear problem with all the associated pitfalls.

` (x-X0)^2 + (y-Y0)^2 = R^2 Equation for a circle`

` R = SQRT[ (x-X0)^2 + (y-Y0)^2) ] Radius of the circle`

where:

` (X0,Y0) = center of circle`

` (x,y) = point coordinates`

` R = radius`

1) Get first estimate for `(X0,Y0,R).`

(Details: Find some points to make first estimates- either solve the circle exactly (3 equations, 3 unknowns) to get a first estimate of the center and radius, or just do a kind of centroid calculation on all points- to get a rough estimate of the center and radius.)

2) Calculate `r (r1, r2,.. rN)` for each of your N points from the equation for a radius of a circle.

3) Calculate the root-mean-squared error For example, for 5 given points on the circle:

`RMS error = SQRT[ [ (r1-R)^2 + (r2-R)^2 + (r3-R)^2 + (r4-R)^2`

` + (r5-R)^2] / 3 ]`

(dividing by "3" because we have 3 free parameters.)

4) Change `(X0,Y0,R)` `slightly in your minimization algorithm` to try a new `(X0,Y0,R).`

Minimization details:
Because minimization algorithms can get very computationally intensive, if one's circle-fitting problem is simple, I would look for a "canned" minimization routine. Some commercial computer programs for plotting and spreadsheets do this sort of thing. For example, the Excel spreadsheet has the built-in "Solver" that will perform minimzation.
Other possibilties for minimization software:

Matlab with the optimization toolbox.

IDL with Markwardt's MPFIT or MACSYMA.

ODRPACK from Netlib (see Netlib ) is an excellent library for a production mode approach to fitting these data, with a graphical user interface at Plato: Decision Tree for Optimization Software At the very beginning of the User's Guide there is a description of how to fit data to an ellipse or circle. [Don Wells, Hans D. Mittelmann].

Gaussfit is a superb response to the whole family of problems of adjusting parameters to fit functions to data; it is especially effective and easy to use for one-time problems and for exploratory work [Don Wells].

The L-BFGS-B Nonlinear Optimization package allows one to specify bounds on the variables, from: L-BFGS-B [Helmut Jarausch].

The Fortran programs fcircle.f, fellipse.f and the Lawson & Hanson least-squares routines ls.f shows how to implement these least-squares problems at: fcircle.f, fellipse.f, etc.

Alan Miller has updated the above for Fortran 90. The least-squares software is at: Alan Miller's Sofware

Some Web Guides [Amara Graps]

User's Manual -- GaussFit: {A} system for least squares and robust estimation", William H. Jefferys and Michael J. Fitzpatrick and Barbara E. McArthur and James E. McCartney", [Don Wells]

Other Papers

I. Kasa, "A Curve Fitting Procedure and Its Error Analysis", IEEE Trans. on Inst. Meas., 3/96, describes an algorithm that is a modified least squares analysis that uses the difference of the square of the radii rather than the absolute difference. [Rod Loewen]

Moura, L. and Kitney, R. I. (1991). A direct method for least -squares circle fitting. Computer Physics Communications 64, 57-63. [Colin B. Desmond]

"A Buyers Guide to Conic Fitting", A.W. Fitzgibbon, British Machine Vision Conference, 1995.

Cunningharn, Robert W.: Comparison of three methods for determining fit parameter uncertainties for the Marquardt Compromise, Computers in Physics 7(5), 570, (1993) [Amara Graps]

W. Gander, G. H. Golub and R. Strebel Least Squares Fitting of Circles and Ellipses BIT vol.34, 1994, pgs. 558-578 [Suhail A Islam].

Berman & Somlo: "Efficient procedures for fitting circles..." IEEE Instrum & Meast., IM-35, no.1, pp. 31-35, 1986. [Peter Somlo]

Paul T. Boggs and Richard H. Byrd and Robert B. Schnabel, "A Stable and efficient algorithm for nonlinear orthogonal distance regression, SIAM Journal on Scientific and Statistical Computing, 1987, 8, 6,1052-1078 [Don Wells].

P. T. Boggs and J. R. Donaldson and R. H. Byrd and R. B. Snabel, "Algorithm 676, {ODRPACK} -- Software for Weighted orthogonal distance regression", ACM Transactions On Mathematical Software, 1989, 15, 348-364 [Don Wells].

5) Calculate r (r1, r2 etc.) again from new (X0,Y0,R) above.

6) Calculate RMS error again.

7) Compare current RMS error with previous RMS error. If it doesn't vary by more some small amount, say 10^{-3} then you're done, otherwise continue steps 4 -- 7.

Other (more elegant) approaches that reduce to a linear system.

Linear solution

If you choose to minimize the squares of the *area* differences, you get a linear problem, which is a much safer way. [Pawel Gora, Zdislav V. Kovarik, Daniel Pfenniger, Condensed by Amara Graps]

1. Function to minimize is (sum of the area differences):

` Q = Sum { [ (xi - X0)^2 + (yi -Y0)^2 - R^2 ]^2 , i=1..N }`

2. A necessary condition is the system of 3 equations with 3 unknowns `X0, Y0, R.` Calculate the partial derivatives of Q, with respect to ` X0, Y0, R.` (all d's are partial)

` dQ / dX0 = 0`

` dQ / dY0 = 0`

` dQ / dR = 0`

3. Developing we get the linear least-squares problem:

` | x1 y1 1 | | a | | -x1^2-y1^2 |`

` | x2 y2 1 | | b | =~ | -x2^2-y2^2 |`

` | x3 y3 1 | | c | | -x3^2-y3^2 |`

` | x4 y4 1 | | -x4^2-y4^2 |`

` | x5 y5 1 | | -x5^2-y5^2 |`

` ..... .........`

(for example, for 5 points)

where

` a = -2 X0; b = -2 Y0 ; c = X0^2 + Y0^2 - R^2.`

Take any good least-squares algorithm to solve it, yielding a,b,c. So the final circle solution will be given with

`X0 = -a/2; Y0 = -b/2; R^2 = X0^2+Y0^2 - c.`

By the way, with 5 points you can also find the unique quadratic form (ellipse, hyperbola) which passes exactly through 5 points. With more than 5 points one can do a linear least-squares as well. The problem is then to minimize:

` | x1^2-y1^2 x1y1 x1 y1 1 | | a | | -x1^2-y1^2 |`

` | x2^2-y2^2 x2y2 x2 y2 1 | | b | =~ | -x2^2-y2^2 |`

` | x3^2-y3^2 x3y3 x3 y3 1 | | c | | -x3^2-y3^2 |`

` | x4^2-y4^2 x4y4 x4 y4 1 | | e | | -x4^2-y4^2 |`

` | x5^2-y5^2 x5y5 x5 y5 1 | | f | | -x5^2-y5^2 |`

` ..... .........`

The robust or L1 or least-first-power approximation [Zdislav V. Kovarik].

If you try to minimize

` W(a,b,r) = SUM(j=1:N) ABS (((x_j-a)^2 + (y_j-b)^2)^(1/2) - r)`

all you have to do is set up the 10 (i.e. if 5, choose 3) circles passing through every choice of 3 of 5 points, calculate W(a,b,r) for each of them and pick the minimizing circle. The extra feature is that this procedure separates and discards points which are conspicuously far from the majority trend. (Of course, it becomes increasingly time-consuming when the number of given points increases.)

This method of determining the minimum bounding circle from a set of _circles_ is solved, and with code available at: [Sky Coyote]

ZunZun Interactive 2D and 3D Data Modeling

ZunZun is an online interactive fitting and graphing for 2D and 3D data sets.

CurveExpert

CurveExpert is a comprehensive curve fitting system for Windows. XY data can be modelled using a toolbox of linear regression models, nonlinear regression models, interpolation, or splines. Over 30 models are built-in, but custom regression models may also be defined by the user. Full-featured graphing capability allows thorough examination of the curve fit. The process of finding the best fit can be automated by letting CurveExpert compare your data to each model to choose the best curve. This program was designed to be simple but powerful, so that all users can obtain a model for their data quickly and easily.

Copyright 1995-2013 Mathcom Solutions Inc                               Updated May 15, 2013