PRO fit_polynomial_1d, xdata, ydata, weights, polyterms, polycoeffs, polycoeffs_err, model, chisq, status, danidl_cppcode, ERRBAR = errbar, AIC = aic, BIC = bic, $
; Description: This module fits a polynomial model to a set of input data in one dimension. The model used is:
;
; y = P(x)
;
; where x is the independent variable with data "xdata", y is the dependent variable with data
; "ydata", and P is the polynomial function consisting of the polynomial terms specified by
; "polyterms". If required, the fit can be weighted via the parameter "weights". The fitted
; polynomial coefficients are returned via the parameter "polycoeffs" along with the status of
; the fit "status".
; In the case that the weights of the fit "weights" represent the inverse variance weights for
; the dependent data "ydata", then the uncertainties on the fitted parameters are represented by
; "polycoeffs_err", and the chi squared of the fit is represented by "chisq".
; The module also provides the option of treating the weights as representing the 1-sigma
; uncertainties on the data values "ydata", in which case the actual weights that are used in
; the fit are the inverse variance weights, and again the uncertainties on the fitted parameters
; are represented by "polycoeffs_err", and the chi squared of the fit is represented by "chisq".
; This module may also be used to attempt to automatically find the most appropriate degree
; of the polynomial to fit to the data. This option will only produce a meaningful result if the
; weights used in the fit are the inverse variance weights. By setting the keyword AIC to a
; non-negative integer N, the module will perform (N + 1) fits to the data using polynomials
; with degrees running from zero to N, and it will calculate the Akaike information criterion
; (AIC) for each fit. The polynomial model with the smallest value of the AIC is chosen as the
; preferred model for the data, and the corresponding polynomial coefficients "polycoeffs",
; polynomial coefficient uncertainties "polycoeffs_err", model values "model", and chi squared
; "chisq" are returned. Clearly, if the keyword AIC is set as described, then the input parameter
; "polyterms" is ignored. We note that the AIC is defined as:
;
; AIC = Chi^2 + [ (2*N_p) / (1 - ((N_p - 1)/N_d)) ]
;
; where Chi^2 is the chi squared of the fit, N_p is the number of free parameters, and N_d is
; the number of data points.
; This module also provides the option of using the Bayesian information criterion (BIC) to
; automatically find the most appropriate degree of the polynomial to fit to the data. The
; keyword BIC works in exactly the same way as the keyword AIC, except that the BIC is defined
; as:
;
; BIC = Chi^2 + N_p*ln(N_d)
;
; where Chi^2 is the chi squared of the fit, N_p is the number of free parameters, and N_d is
; the number of data points. Clearly, it is not possible to set the AIC and BIC keywords at the
; same time, and if this is the case, then the module will fail.
; We note that the BIC tends to perform better than the AIC since the AIC tends to "over fit"
; the data. Our tests on fake data (Horne, private communication) indicate that the BIC is more
; reliable in recovering the correct polynomial degree than the AIC for polynomials in one
; dimension.
; For further flexibility, the module provides the option of iterating the fitting process. By
; setting the keyword ITERATION_PAR to an IDL structure that contains the six tags "method",
; "sigma", "maxrej", "permanent", "tol" and "maxiter" defined below (see under "Keywords"), the
; module will do the following.
; If "sigma" is positive, then the sigma-clip threshold S is taken to be the value of "sigma".
; If "sigma" is negative or zero, then the sigma-clip threshold S is calculated as sqrt(ln(Ngood)),
; where Ngood is the number of data values in "ydata" with positive weights. This expression for
; the sigma-clip threshold S may be derived from considerations of the BIC.
; If the tag "method" is set to "abs_resid", then, during each iteration of the fitting process,
; all data values with absolute residuals greater than S times the standard deviation of the
; residuals are flagged as candidate outliers. If the tag "method" is set to "norm_abs_resid",
; then, during each iteration of the fitting process, all data values with normalised absolute
; residuals greater than S, where the normalising factors are the square roots of the corresponding
; weights, are flagged as candidate outliers.
; If the tag "maxrej" is set to an INTEGER/LONG type positive number, then this number is
; considered to be the maximum number of data values that can be rejected in any one iteration,
; and consequently, only this number of candidate outliers with the largest residuals are
; rejected. If the tag "maxrej" is set to a FLOAT/DOUBLE type positive number that is less than
; or equal to "1.0", then this number is considered to be the maximum fraction of data values
; that can be rejected in any one iteration, and consequently, only the corresponding number of
; candidate outliers with the largest residuals are rejected. If the tag "maxrej" is set to
; anything else, then there is no limit on the number of data values that can be rejected in any
; one iteration.
; If the tag "permanent" is set to "1", then data values that are rejected will remain
; rejected for all subsequent iterations, meaning that the set of data values used at each
; iteration will only diminish. If the tag "permanent" is set to any other value, then data
; values will not be permanently rejected so that if a data value is initially rejected, then it
; may be subsequently included in the fitting process during the later iterations. This protects
; against non-outlier data values being permanently excluded from the calculation of the fit
; parameters when real outlier data values skew the parameter estimates during the early
; iterations.
; The iterations stop when the maximum fractional change in the fit parameters is less than or
; equal to "tol", or when there are not enough good data values to be used in the fit. To protect
; against the possibility of the fitting process not converging (or entering into an oscillatory
; regime), a limit of "maxiter" iterations is placed on the number of iterations that are
; performed.
; Useful extra information about the fitting process is returned via the keyword EXTRA_INFO,
; and the subscripts of the data values that were used in the (final iteration of the) fit are
; returned via the keyword DATA_USED_SUBS.
; This module will use C++ code if possible to achieve a faster execution than the default IDL
; code. The C++ code can be up to a factor of ~1.05 times faster than the IDL code in this case.
;
; Input Parameters:
;
; xdata - FLOAT/DOUBLE SCALAR/VECTOR/ARRAY - A scalar/vector/array of input data for the independent variable.
; ydata - FLOAT/DOUBLE SCALAR/VECTOR/ARRAY - A scalar/vector/array of input data for the dependent variable.
; This input parameter must have the same number of elements as
; "xdata".
; weights - FLOAT/DOUBLE SCALAR/VECTOR/ARRAY - A scalar/vector/array with the same number of elements as
; "xdata", that contains the weights to be used in the calculation
; of the fitted polynomial model. If this parameter does not have
; the correct number type or the same number of elements as
; "xdata", then all data points are weighted equally. Data points
; to be ignored can be pre-flagged by setting the corresponding
; weights in this scalar/vector/array to zero (or negative
; values). If the keyword ERRBAR is set, then the module will
; assume that the values in this parameter represent the 1-sigma
; uncertainties on the data values, and it will calculate the
; actual weights to be used in the fit as the inverse variance
; weights.
; polyterms - INTEGER/LONG VECTOR - A one-dimensional vector of length "nterms" that contains the exponents of
; the independent variable for each polynomial term. All values in this
; vector must be non-negative, and each value must be unique.
; danidl_cppcode - STRING - The full directory path indicating where the DanIDL C++ code is installed. The
; shared library of DanIDL C++ routines should exist as
; "/dist/libDanIDL.so" within the installation.
;
; Output Parameters:
;
; polycoeffs - DOUBLE VECTOR - A one-dimensional vector that contains the set of fitted polynomial coefficients.
; If neither of the keywords AIC or BIC have been set, then this output parameter
; is of length "nterms" and the polynomial coefficients correspond to the polynomial
; terms defined by "polyterms". However, if either of the keywords AIC or BIC have
; been set, then the polynomial coefficients contained in this parameter correspond
; to those of the preferred polynomial model.
; polycoeffs_err - DOUBLE VECTOR - A one-dimensional vector of the same length as "polycoeffs" that contains
; the uncertainties on the set of fitted polynomial coefficients "polycoeffs".
; The elements of this output parameter are only meaningful if the input
; weights "weights" represent the inverse variances for the dependent data
; "ydata", or if the keyword ERRBAR is set (see below) and the weights
; represent the 1-sigma uncertainties on the dependent data "ydata".
; model - DOUBLE SCALAR/VECTOR/ARRAY - A scalar/vector/array (of the same dimensions as the input parameter
; "ydata") representing the fitted model values for the dependent data
; "ydata".
; chisq - DOUBLE - The chi squared of the fit ignoring bad data values flagged in "weights" and rejected data
; values identified during the iterative fitting process. This quantity only represents the
; chi squared in the case where the input weights "weights" represent the inverse variances
; for the dependent data "ydata", or if the keyword ERRBAR is set (see below) and the
; weights represent the 1-sigma uncertainties on the dependent data "ydata".
; status - INTEGER - If the fit parameters were calculated successfully and no iterations were performed,
; then "status" is returned with a value of "4". If the fit parameters were calculated
; successfully and the iterations were stopped because the algorithm tolerance was
; achieved, then "status" is returned with a value of "3". If the fit parameters were
; calculated successfully but the iterations were stopped because the maximum number
; of iterations "maxiter" had been performed, then "status" is returned with a value
; of "2". If the fit parameters were calculated successfully but the iterations were
; stopped because there were not enough good data values to be used in the next
; iteration, then "status" is returned with a value of "1". If the fit parameters were
; not calculated successfully, then "status" is returned with a value of "0".
;
; Keywords:
;
; If the keyword ERRBAR is set (as "/ERRBAR"), then the module will assume that the weights "weights"
; represent the 1-sigma uncertainties on the data values "ydata", and it will calculate the actual weights
; to be used in the fit as the inverse variance weights.
;
; If the keyword AIC is set to an INTEGER/LONG type non-negative number N, then the input parameter
; "polyterms" will be ignored, and the module will perform (N + 1) fits to the data using polynomials with
; degrees running from zero to N, and it will calculate the Akaike information criterion (AIC - defined
; above) for each fit. The polynomial model with the smallest value of the AIC is then chosen as the
; preferred model for the data. This option will only produce a meaningful result if the weights used in
; the fit are the inverse variance weights. This keyword cannot be set at the same time as the keyword BIC.
;
; If the keyword BIC is set to an INTEGER/LONG type non-negative number N, then the input parameter
; "polyterms" will be ignored, and the module will perform (N + 1) fits to the data using polynomials with
; degrees running from zero to N, and it will calculate the Bayesian information criterion (BIC - defined
; above) for each fit. The polynomial model with the smallest value of the BIC is then chosen as the
; preferred model for the data. This option will only produce a meaningful result if the weights used in
; the fit are the inverse variance weights. This keyword cannot be set at the same time as the keyword AIC.
;
; If the keyword ITERATION_PAR is set to an IDL structure with the six tags described below, then the
; module will iterate the fitting process (as described under "Description"). The tags that must be present
; are:
;
; (i) method - STRING - This tag has two acceptable values. If "method" is set to "abs_resid", then, during
; each iteration of the fitting process, all data values with absolute residuals
; greater than S times (see "sigma") the standard deviation of the residuals are
; flagged as candidate outliers. If "method" is set to "norm_abs_resid", then, during
; each iteration of the fitting process, all data values with normalised absolute
; residuals greater than S, where the normalising factors are the square roots of the
; corresponding weights, are flagged as candidate outliers.
;
; (ii) sigma - FLOAT/DOUBLE - This tag specifies the sigma-clip rejection threshold to be used. If "sigma"
; is positive, then the sigma-clip threshold S is taken to be the value of
; "sigma". If "sigma" is negative or zero, then the sigma-clip threshold S is
; calculated as sqrt(ln(Ngood)), where Ngood is the number of data values in
; "ydata" with positive weights. This expression for the sigma-clip threshold
; S may be derived from considerations of the Bayesian information criterion
; (BIC).
;
; (iii) maxrej - INTEGER/LONG or FLOAT/DOUBLE - If this tag is set to an INTEGER/LONG type positive number,
; then the value of this tag is considered to be the maximum
; number of data values that can be rejected in any one
; iteration. If this tag is set to a FLOAT/DOUBLE type
; positive number that is less than or equal to "1.0", then
; the value of this tag is considered to be the maximum
; fraction of data values that can be rejected in any one
; iteration. If this tag is set to anything else, then there
; is no limit on the number of data values that can be
; rejected in any one iteration.
;
; (iv) permanent - INTEGER/LONG - If this tag is set to "1", then data values that are rejected will remain
; rejected for all subsequent iterations, meaning that the set of data
; values used at each iteration will only diminish. If this tag is set to
; any other value, then data values will not be permanently rejected so that
; if a data value is initially excluded then it may be subsequently included
; in the fitting process during the later iterations.
;
; (v) tol - FLOAT/DOUBLE - This tag specifies the tolerance of the iterative fitting process. The iterations
; will stop if the maximum fractional change in the fit parameters between subsequent
; iterations is less than or equal to "tol". This parameter must have a value
; greater than or equal to "0.0", and less than "1.0". A value for "tol" of "0.0"
; means that the iterations will continue until none of the fit parameters change
; from one iteration to the next.
;
; (vi) maxiter - INTEGER/LONG - This tag specifies the maximum number of iterations to be performed before
; returning the fit parameters. If this parameter is set to a negative value
; or zero, then there will be no constraint on the maximum number of
; iterations that may be performed.
;
; If the keyword EXTRA_INFO is set to a named variable, then, on return, this variable will contain an IDL
; structure with the following tags:
;
; (i) ndata - LONG - The number of data values in the input parameter "ydata".
; (ii) ndata_good - LONG - The initial number of data values with positive weights.
; (iii) ndata_used - LONG - The number of data values used in the calculation of the final fit parameters.
; (iv) niter - LONG - The number of iterations performed in the fitting process.
;
; If the keyword DATA_USED_SUBS is set to a named variable, then, on return, this variable will contain
; a VECTOR of LONG type numbers representing the subscripts for the data values that were used in the
; calculation of the final fit parameters.
;
; Author: Dan Bramich (dan.bramich@hotmail.co.uk)
;
; History:
;
; 30/01/2011 - Module created (dmb).
;Set the default output parameter values
;Check that "xdata" is of the correct number type
;Determine if the AIC or BIC are to be used to automatically determine the degree of the polynomial fit
;If neither the AIC or BIC are to be used to automatically determine the degree of the polynomial fit
;Check that "polyterms" is a one-dimensional vector of the correct number type, and that it has no negative elements
;Check that each element in "polyterms" is unique
;Generate the required set of polynomial basis functions
;Fit the polynomial model to the input data
;Return with the details of the fitted polynomial
;If either the AIC or BIC are to be used to automatically determine the degree of the polynomial fit, then
;generate the required set of polynomial basis functions
;For each polynomial degree to be fit
;Fit the polynomial model to the input data
;Check that the fit was successful
;If this fit was the first successful fit
;Calculate the relevant statistic
;Store the current information associated with the fit
;Record that the next fit is not the first
;If this fit is the second (or later) successful fit
;Calculate the relevant statistic
;If the current statistic is smaller than the stored statistic, then the current fit is the preferred fit so far
;Check that there was at least one successful fit
;Set the output parameters