Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polynomial1D model not fitting at all, and other models are not matching gbtidl #86

Open
etsmit opened this issue Oct 5, 2023 · 12 comments · May be fixed by #403
Open

Polynomial1D model not fitting at all, and other models are not matching gbtidl #86

etsmit opened this issue Oct 5, 2023 · 12 comments · May be fixed by #403
Assignees
Labels
bug Something isn't working high priority

Comments

@etsmit
Copy link
Collaborator

etsmit commented Oct 5, 2023

First issue: Near misses

Using 'legendre' baseline model option fits nearly the same baseline as gbtidl, with most deviation at the edges of the baseline. The attached plot shows there is a difference in at least the third order between Dysh legendre and gbtidl, shown by the red line. (This is example dataset AGBT17A_404_...)

bline_comparison_2

Comparing the coefficients themselves is apples to oranges since dysh is using a Legendre model while gbtidl is using some form of basic polynomial sum(m) a_m x^m. Using the Polynomial1D model from the dysh side results in the second issue.

Second Issue: Polynomial1D fits are not working

A 2nd or 3rd order polynomial fit was applied to the data below, respectively. The 2nd order fit seemed to just find an average (indeed, all coefficients after the first are 0) and the 3rd order fit applied a straight, sloped line. Trying higher orders resulted in no change. The vertical black lines denote the regions of the fit.

bline_comparison3

bline_comparison_5

@etsmit
Copy link
Collaborator Author

etsmit commented Oct 11, 2023

GBTIDL fitting code: /home/gbtidl/2.10.1/gbtidl/pro/toolbox/ortho_fit.pro

`;+
; Function uses general orthogonal polynomial to do least squares
; fitting.
;
;

This code came from Tom Bania's GBT_IDL work. Local
; modifications include:
;


    ;
  • Documentation modified for use by idldoc
    ;
  • Array syntax changed to [] and compile_opt idl2 used.
    ;
  • Indententation used to improve readability.
    ;
  • Unnecessary code removed.
    ;
  • Some argument checks added.
    ;

;
; @param xx {in}{required} The x-values to use in the fit.
; @param yy {in}{required} The data to be fit at xx.
; @param nfit {in}{required}{type=integer} The order of the polynomial to fit.
; @param cfit {out}{required} On return, cfit has coefficients of
; the polynomial
;

; fit = sum(m) a_m x^m
;

; because of round-off using this form gives unreliable results
; above about order 15 even for double precision. See <a
; href="ortho_poly.html">ortho_poly for a better discussion on the
; contents of cfit.
; @param rms {out}{required} The rms error for each polynomial up to
; order nfit.
;
; @returns polyfit array: contains information for fitting to arbitrary
; points using the recursion relations
;
; @examples
; fit a 3rd order polynomial to the data in !g.s[0]
;

; yy = *(!g.s[0].data_ptr)
; xx = dindgen(n_elements(yy))
; f = ortho_fit(xx, yy, 3, cfit, rms)
;

;
; @Version $Id$
;-

function ortho_fit,xx,yy,nfit,cfit,rms
compile_opt idl2

; argument checks
if (n_elements(xx) ne n_elements(yy)) then $
    message, 'number of elements of xx is not equal to number of elements of yy'

if (nfit lt 0) then message, 'nfit must be >= 0'

n = n_elements(xx)

; initialize needed arrays
polyfit = dblarr(4,nfit+1)
rms = dblarr(nfit+1)
coef = dblarr(nfit+1)
; dblarr initializes these to 0.0, no need to do so here
c0 = coef 
c1 = coef
c2 = coef
cfit = coef 

fit = dblarr(n)
p0 = fit
p1 = fit
pnp1 = fit 
pn = fit 
pnm1 = fit

; copy values to ensure double precision 
x = double(xx)
f = double(yy)

; get the first couple of polynomials and fit coefficients

p0[0:n-1] = 1.0
xnorm = sqrt(total(p0^2))
p0 = p0/xnorm
c0[0] = 1./xnorm
coef[0] = total(f*p0)
polyfit[0,0] = c0[0]
polyfit[3,0] = coef[0]
rms[0] = total( (f-coef[0]*p0)^2)

if (nfit eq 0) then begin
    cfit = coef[0]*c0
    return, polyfit
endif

a = total(x*p0)
p1 = x - a*p0
xnorm = sqrt(total(p1^2))
c1[1] = 1.0
p1 = p1/xnorm
c1 = (c1 - a*c0)/xnorm

; first couple of fit coefficients
coef[1] = total(f*p1)
polyfit[0:1,1] = c1[0:1]
polyfit[3,1] = coef[1]
cfit = coef[0]*c0 + coef[1]*c1
fit = coef[0]*p0 + coef[1]*p1
rms[1] = total( (f-fit)^2 )

; loop up the order, using general recursion relation
pnm1 = p0
pn = p1
cnm1 = c0
cn = c1
;  print, 'cfit before loop', cfit
; in IDL, this loop is not entered if nfit < 2
for m=2,nfit do begin
    a = -1./total(x*pn*pnm1)
    b = -a*total(x*pn^2)
    ;  print,'iteration ',m,a,b
    pnp1 = (a*x + b)*pn + pnm1
 
    xnorm = sqrt(total(pnp1^2))
    pnp1 = pnp1/xnorm
    coef[m] = total(f*pnp1)
    ctmp = shift(cn,1)
    ctmp[0] = 0.
    cnp1 = (a*ctmp + b*cn + cnm1)/xnorm
    cfit = cfit + coef[m]*cnp1
    fit = fit + coef[m]*pnp1
    polyfit[0,m] = 1./xnorm
    polyfit[1,m] = b/xnorm
    polyfit[2,m] = a/xnorm
    polyfit[3,m] = coef[m]
    rms[m] = total( (f-fit)^2 )
    cnm1 = cn
    cn = cnp1
    pnm1 = pn
    pn = pnp1
endfor

rms = sqrt( rms/double(n) )

return, polyfit

end`

@etsmit
Copy link
Collaborator Author

etsmit commented Oct 19, 2023

I've verified that specutils is treating masked parts of the spectrum differently.

After recreating ortho_fit in my jupyter notebook, I can confirm that it produces the same outputs as gbtidl - for an unmasked spectrum.

I can produce a match between dysh and gbtidl for ABGT17A_404 scan 19 if there are no regions selected excluded in either - and the differences are no greater than 1e-9.

If I just use one region to fit (channels 0-750 selected in gbtidl, 751-819 excluded in dysh), it produces a match with no differences greater than 1e-4.

If I go back to using the full region selection before, the differences reach <1e-2, with <1e-3 in the selected regions of the spectrum. In this case, I'm selecting the (100, 380), (450,750) regions in gbtidl and excluding [(0,99),(379,449),(749,819)] in dysh. The difference between the spectra is functional, shown by the green line below.

Pasted image 20231019112417

Changing the dysh exclude regions by +/- 1 channel on either side (e.g. (381,451 instead of 379,449 and any permutation) doesn't result in any large difference, and certainly doesn't let us match gbtidl.

Also, the Polynomial1D errors cropping up above can be solved by using the LevMarLSQFitter or LMLSQFitter options.

@etsmit
Copy link
Collaborator Author

etsmit commented Jan 19, 2024

The numpy polyfit function returns a better fit than any of the dysh options, but it is still not quite right. Using the function below, I get 0.6717 as the sum of the absolute difference between the numpy and gbtidl models, while the dysh options return 2.0868.

    if include is not None:
        popt = np.polyfit(x[include],y[include],deg)
    else:
        popt = np.polyfit(x,y,deg)
    p = np.poly1d(popt)
    fit = p(x)
    if remove:
        return y-fit,fit
    else:
        return fit

Here are the differences between the two models compared to the GBTIDL result, overlaid on top of the GBTIDL-fitted spectrum.
image

@mpound
Copy link
Collaborator

mpound commented Jan 24, 2024

This is all very interesting as well as disturbing. Have you done the test where you create a spectrum with a known polynomial baseline added plus gaussian noise and see if both GBTIDL and dysh recover the polynomial? And repeat test with and without channel exclusion.
It's important to establish that GBTIDL is actually delivering the correct answer if that is going to be our reference.

@etsmit
Copy link
Collaborator Author

etsmit commented Jan 25, 2024

It appears that I was using the wrong boundaries to fit in dysh compared to GBTIDL. After fixing this, it looks like numpy polyfit returns the same values as the python implementation of ortho_fit.pro and the GBTIDL result. I suggest using numpy polyfit as a baseline fitting tool.

@astrofle astrofle added the bug Something isn't working label Apr 26, 2024
@astrofle
Copy link
Collaborator

Use numpy.polyfit for polynomial models, astropy modeling functions for the rest.

@astrofle
Copy link
Collaborator

astrofle commented Aug 28, 2024

Double check that GBTIDL is accurate by using synthetic data.
Same for dysh.
Use a grid of models and exclusion/inclusion regions.

@etsmit
Copy link
Collaborator Author

etsmit commented Sep 10, 2024

This code:

#numpy poly1d baseline removal, with extra model choices!
#x: channel index array
#y: data values
#deg: degree of fit (deg=3 means 4 terms in the polynomial)
#include: array of indices to keep (np.r_[])
#remove: True = return subtracted spectrum and model; False = just return model
def np_baseline_v2(x,y,deg,model,include=None,remove=True):
    if include is not None:
        xx = np.copy(x[include])
        yy = np.copy(y[include])
    else:
        xx = np.copy(x)
        yy = np.copy(y)
        
    if model == "polynomial":
        popt = np.polynomial.Polynomial.fit(xx,yy,deg)
    elif model == "chebyshev":
        popt = np.polynomial.Chebyshev.fit(xx,yy,deg)
    elif model == "legendre":
        popt = np.polynomial.Legendre.fit(xx,yy,deg)
    elif model == "hermite":
        popt = np.polynomial.Hermite.fit(xx,yy,deg)

    fit = popt(x)
    if remove:
        return y-fit,fit
    else:
        return fit

returns total errors on order 1e-6 (1.6e-9 per channel, and baselined spectrum has 0.018 standard deviation) for each of the model cases above. This is for the AGBT17A_404_01 scan 19 test data set.

@etsmit
Copy link
Collaborator Author

etsmit commented Sep 10, 2024

*note, you must pass the "window = [0,xx[-1]]" argument to the polynomial fit function to map the returned series' domain correctly. This will return the same coefficients as the GBTIDL format (domain = channel index instead of [-1,1]). Currently working on testing fake polynomials.

@etsmit
Copy link
Collaborator Author

etsmit commented Sep 11, 2024

I am finding good agreement with polynomial fitting, with a (small) range of polynomial degrees and selection ranges.

Chebyshev polynomial fitting is also agreeing, although I am seeing a different on order 1e-(2*n), where n is the order of the given polynomial coefficient. I have not compared the actual spectra to each other since this is starting to get quite time intensive (generating identical arrays in GBTIDL vs. Python, porting spectra between the two platforms, etc.), but they are showing good fits by eye.

Also, the numpy polynomial fit routine translates the dataset to the domain [-1, 1], which may be useful for higher-order fits when the x-axis has a large magnitude (chan number, Hz)

@etsmit
Copy link
Collaborator Author

etsmit commented Sep 16, 2024

The problem is indeed the specutils inclusion region error as found by issue 378. With that fix, the specutils baseline fitting returns the same precision as the numpy and python-based gbtidl ortho_fit implementation, in all cases except when using the Polynomial model, which still gives the "Fit may be poorly conditioned" warning.

image

@astrofle
Copy link
Collaborator

A possibility, that may not require modifying specutils (although they should make some changes on their end, at least to expose the excise method to the users doing continuum or line fits) would be to excise in dysh and then do the fitting without an exclude argument. That would require writing a new excise method as you did in issue #378, using this excise method and then doing the fitting in the excised Spectrum. It may require additional bookkeeping though.

@etsmit etsmit linked a pull request Oct 15, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants