Spectral Emulator

The spectral emulator can be likened to the engine behind Starfish. While the novelty of Starfish comes from using Gaussian processes to model and account for the covariances of spectral fits, we still need a way to produce model spectra by interpolating from our synthetic library. We could interpolate spectra using something like linear interpolation in each of the library parameters, but high signal-to-noise data require something more sophisticated, because the interpolation error can constitute a significant portion of the error budget. If that error is ignored, non-interpolated spectra (i.e., spectra at the exact grid points of the synthetic library) may be given preference over interpolated spectra, and the posteriors will be peaked at the grid point locations. Because the spectral emulator returns a probability distribution over possible interpolated spectra, this interpolation error can be quantified and propagated forward into the likelihood calculation.

Eigenspectra decomposition

The first step of configuring the spectral emulator is to choose a subregion of the spectral library corresponding to the star that you will fit. Then, we want to decompose the information content in this subset of the spectral library into several eigenspectra. [Figure A.1 here].

The eigenspectra decomposition is performed via Principal Component Analysis (PCA). Thankfully, most of the heavy lifting is already implemented by the sklearn package.
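
For intuition, here is a minimal sketch of this kind of decomposition using sklearn directly. The file name, the (n_models, n_pixels) flux array, and the mean/standard-deviation scaling are illustrative assumptions; in practice, Emulator.from_grid() (introduced below) performs the decomposition for you.

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> # hypothetical (n_models, n_pixels) array of synthetic library fluxes
>>> fluxes = np.load('library_fluxes.npy')
>>> flux_mean = fluxes.mean(axis=0)
>>> flux_std = fluxes.std(axis=0)
>>> normed = (fluxes - flux_mean) / flux_std
>>> # keep enough components to explain 99% of the variance
>>> pca = PCA(n_components=0.99, svd_solver='full')
>>> weights = pca.fit_transform(normed)   # (n_models, n_components)
>>> eigenspectra = pca.components_        # (n_components, n_pixels)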

Emulator.from_grid() allows easy creation of spectral emulators from a Starfish.grid_tools.HDF5Interface, including performing the initial PCA to create the eigenspectra.

>>> from Starfish.grid_tools import HDF5Interface
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.from_grid(HDF5Interface('grid.hdf5'))

Optimizing the emulator

Once the synthetic library is decomposed into a set of eigenspectra, the next step is to train the Gaussian processes (GPs) that will serve as interpolators. For more explanation about the choice of GP covariance functions and the design of the emulator, see the appendix of our paper.

The optimization of the GP hyperparameters can be carried out by any maximum likelihood estimation framework, but we include a direct method that uses scipy.optimize.minimize.

To optimize the hyperparameters, we use the Emulator.train() routine.

Example: optimizing with the built-in minimizer

>>> from Starfish.grid_tools import HDF5Interface
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.from_grid(HDF5Interface('grid.hdf5'))
>>> emulator
Emulator
--------
Trained: False
lambda_xi: 2.718
Variances:
    10000.00
    10000.00
    10000.00
    10000.00
Lengthscales:
    [ 600.00  1.50  1.50 ]
    [ 600.00  1.50  1.50 ]
    [ 600.00  1.50  1.50 ]
    [ 600.00  1.50  1.50 ]
Log Likelihood: -1412.00
>>> emulator.train()
>>> emulator
Emulator
--------
Trained: True
lambda_xi: 2.722
Variances:
    238363.85
    5618.02
    9358.09
    2853.22
Lengthscales:
    [ 1582.39  3.19  3.11 ]
    [ 730.81  1.61  2.14 ]
    [ 1239.45  3.71  2.78 ]
    [ 1127.40  1.63  4.46 ]
Log Likelihood: -1158.83
>>> emulator.save('trained_emulator.hdf5')

Note

The built-in optimization target changes the state of the emulator as it runs, so even if the minimizer has not converged, you can simply run Emulator.train() again to continue from the current hyperparameters.

If you want to perform MLE with a different method, feel free to make use of the general modeling framework provided by the methods Emulator.get_param_vector(), Emulator.set_param_vector(), and Emulator.log_likelihood(), as in the sketch below.
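
As a minimal sketch (reusing the emulator from the examples above), a custom training loop might look like the following; the nll wrapper and the LinAlgError guard are illustrative, not part of the Starfish API.

>>> import numpy as np
>>> from scipy.linalg import LinAlgError
>>> from scipy.optimize import minimize
>>> def nll(P):
...     # update the emulator in place, then evaluate the (negative) log likelihood
...     emulator.set_param_vector(P)
...     try:
...         return -emulator.log_likelihood()
...     except LinAlgError:
...         return np.inf
>>> P0 = emulator.get_param_vector()
>>> soln = minimize(nll, P0, method='Nelder-Mead')
>>> emulator.set_param_vector(soln.x)
>>> emulator.save('custom_trained_emulator.hdf5')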

Model spectrum reconstruction

Once the emulator has been optimized, we can finally use it as a means of interpolating spectra.

>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.load('trained_emulator.hdf5')
>>> flux = emulator.load_flux([7054, 4.0324, 0.01])
>>> wl = emulator.wl
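
For a quick visual sanity check, you could plot the interpolated spectrum; the matplotlib usage here is purely illustrative.

>>> import matplotlib.pyplot as plt
>>> plt.plot(wl, flux)
>>> plt.xlabel('Wavelength')
>>> plt.ylabel('Flux')
>>> plt.show()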

If you want to take advantage of the emulator covariance matrix, you must use the interface provided by the Emulator.__call__() function:

>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.load('trained_emulator.hdf5')
>>> weights, cov = emulator([7054, 4.0324, 0.01])
>>> X = emulator.eigenspectra * emulator.flux_std
>>> flux = weights @ X + emulator.flux_mean
>>> emu_cov = X.T @ cov @ X
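
This covariance can then be folded into a Gaussian likelihood alongside the data uncertainties. The following is a rough sketch (up to an additive constant), assuming hypothetical data_flux and data_var arrays that are already sampled on the emulator wavelength grid:

>>> import numpy as np
>>> R = data_flux - flux                 # residuals against the emulated spectrum
>>> C = np.diag(data_var) + emu_cov      # data variance plus emulator covariance
>>> lnlike = -0.5 * (R @ np.linalg.solve(C, R) + np.linalg.slogdet(C)[1])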

Lastly, if you want to post-process the model spectrum (e.g., broaden or resample it), it is useful to process the eigenspectra before reconstructing, especially if a resampling step has to occur. The Emulator provides the Emulator.bulk_fluxes attribute for such processing. For example:

>>> from Starfish.emulator import Emulator
>>> from Starfish.transforms import instrumental_broaden
>>> emulator = Emulator.load('trained_emulator.hdf5')
>>> fluxes = emulator.bulk_fluxes
>>> fluxes = instrumental_broaden(emulator.wl, fluxes, 10)
>>> eigs = fluxes[:-2]
>>> flux_mean, flux_std = fluxes[-2:]
>>> weights, cov = emulator([7054, 4.0324, 0.01])
>>> X = eigs * flux_std  # use the processed eigenspectra, not emulator.eigenspectra
>>> flux = weights @ X + flux_mean
>>> emu_cov = X.T @ cov @ X

Note

Emulator.bulk_fluxes provides a copy of the underlying arrays, so bulk processing does not alter the emulator's state.

Reference

Emulator

class Starfish.emulator.Emulator(grid_points: nptyping.types._ndarray.NDArray, param_names: Sequence[str], wavelength: nptyping.types._ndarray.NDArray, weights: nptyping.types._ndarray.NDArray, eigenspectra: nptyping.types._ndarray.NDArray, w_hat: nptyping.types._ndarray.NDArray, flux_mean: nptyping.types._ndarray.NDArray, flux_std: nptyping.types._ndarray.NDArray, factors: nptyping.types._ndarray.NDArray, lambda_xi: float = 1.0, variances: Optional[nptyping.types._ndarray.NDArray] = None, lengthscales: Optional[nptyping.types._ndarray.NDArray] = None, name: Optional[str] = None)[source]

A Bayesian spectral emulator.

This emulator provides an interface to spectral libraries that performs interpolation while also returning a variance-covariance matrix that can be forward-propagated in likelihood calculations. For more details, see the appendix of Czekala et al. (2015).

Parameters
  • grid_points (numpy.ndarray) – The parameter space from the library.

  • param_names (array-like of str) – The names of each parameter from the grid

  • wavelength (numpy.ndarray) – The wavelength of the library models

  • weights (numpy.ndarray) – The PCA weights for the original grid points

  • eigenspectra (numpy.ndarray) – The PCA components from the decomposition

  • w_hat (numpy.ndarray) – The best-fit weights estimator

  • flux_mean (numpy.ndarray) – The mean flux spectrum

  • flux_std (numpy.ndarray) – The standard deviation flux spectrum

  • lambda_xi (float, optional) – The scaling parameter for the augmented covariance calculations, default is 1

  • variances (numpy.ndarray, optional) – The variance parameters for each Gaussian process; default is an array of 1s

  • lengthscales (numpy.ndarray, optional) – The lengthscales for each Gaussian process; each row should have length equal to the number of library parameters. Default is 3 * the maximum grid separation of the grid_points in each parameter

  • name (str, optional) – If provided, will give a name to the emulator; useful for keeping track of filenames. Default is None.

params

The underlying hyperparameter dictionary

Type

dict

__call__(params: Sequence[float], full_cov: bool = True, reinterpret_batch: bool = False) → Tuple[nptyping.types._ndarray.NDArray, nptyping.types._ndarray.NDArray][source]

Gets the mu and cov matrix for a given set of params

Parameters
  • params (array_like) – The parameters to sample at. Should be consistent with the shapes of the original grid points.

  • full_cov (bool, optional) – Return the full covariance or just the variance, default is True. This will have no effect if reinterpret_batch is True

  • reinterpret_batch (bool, optional) – Will try to return a batch of output matrices if the input params are a list of parameter sets, default is False.

Returns

  • mu (numpy.ndarray (len(params),))

  • cov (numpy.ndarray (len(params), len(params)))

Raises
  • ValueError – If full_cov and reinterpret_batch are True

  • ValueError – If querying the emulator outside of its trained grid points
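
For example, to retrieve only the per-weight variances rather than the full covariance matrix (the parameter values are illustrative):

>>> weights, variances = emulator([7054, 4.0324, 0.01], full_cov=False)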

property bulk_fluxes

A vertically concatenated vector of the eigenspectra, flux_mean, and flux_std (in that order). Used for bulk processing with the emulator.

Type

numpy.ndarray

determine_chunk_log(wavelength: Sequence[float], buffer: float = 50)[source]

Possibly truncate the wavelength grid and eigenspectra in response to a new set of wavelengths

Parameters
  • wavelength (array_like) – The new wavelengths to truncate to

  • buffer (float, optional) – The wavelength buffer, in Angstrom. Default is 50

See also

Starfish.grid_tools.utils.determine_chunk_log()
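
For example, to truncate the emulator to the range of an observed spectrum (data_wavelengths is a hypothetical array of observed wavelengths in Angstrom):

>>> emulator.determine_chunk_log(data_wavelengths, buffer=50)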

classmethod from_grid(grid, **pca_kwargs)[source]

Create an Emulator using PCA decomposition from a GridInterface.

Parameters
  • grid (GridInterface or str) – The grid interface to decompose

  • pca_kwargs (dict, optional) – The keyword arguments to pass to PCA. By default, n_components=0.99 and svd_solver='full'.

See also

sklearn.decomposition.PCA()
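
For example, to override the default PCA settings (the choice of 6 components here is purely illustrative):

>>> from Starfish.grid_tools import HDF5Interface
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.from_grid(HDF5Interface('grid.hdf5'), n_components=6)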

get_index(params: Sequence[float]) → int[source]

Given a list of stellar parameters (corresponding to a grid point), deliver the index that corresponds to the entry in the fluxes, grid_points, and weights.

Parameters

params (array_like) – The stellar parameters

Returns

index

Return type

int

get_param_dict() → dict[source]

Gets the dictionary of parameters. This is the same as Emulator.params

Returns

Return type

dict

get_param_vector() → nptyping.types._ndarray.NDArray[source]

Get a vector of the current trainable parameters of the emulator

Returns

Return type

numpy.ndarray

property lambda_xi

The tuning hyperparameter

Setter

Sets the value.

Type

float

property lengthscales

The lengthscales for each Gaussian process kernel.

Setter

Sets the lengthscales given a 2d array

Type

numpy.ndarray

classmethod load(filename: Union[str, os.PathLike])[source]

Load an emulator from an HDF5 file

Parameters

filename (str or path-like) –

load_flux(params: Union[Sequence[float], nptyping.types._ndarray.NDArray], norm=False) → nptyping.types._ndarray.NDArray[source]

Interpolate a model given any parameters within the grid’s parameter range using eigenspectrum reconstruction by sampling from the weight distributions.

Parameters

params (array_like) – The parameters to sample at.

Returns

flux

Return type

numpy.ndarray

log_likelihood() → float[source]

Get the log likelihood of the emulator in its current state as calculated in the appendix of Czekala et al. (2015)

Returns

Return type

float

Raises

scipy.linalg.LinAlgError – If the Cholesky factorization fails

norm_factor(params: Union[Sequence[float], nptyping.types._ndarray.NDArray]) → float[source]

Return the scaling factor for the absolute flux units in flux-normalized spectra

Parameters

params (array_like) – The parameters to interpolate at

Returns

factor – The multiplicative factor to normalize a spectrum to the model’s absolute flux units

Return type

float
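
For example, a sketch assuming flux is a flux-normalized reconstruction at the same parameters (values illustrative):

>>> factor = emulator.norm_factor([7054, 4.0324, 0.01])
>>> absolute_flux = flux * factor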

save(filename: Union[str, os.PathLike])[source]

Save the emulator to an HDF5 file

Parameters

filename (str or path-like) –

set_param_dict(params: dict)[source]

Sets the parameters with a dictionary

Parameters

params (dict) – The new parameters.

set_param_vector(params: nptyping.types._ndarray.NDArray)[source]

Set the current trainable parameters given a vector. Must have the same form as get_param_vector()

Parameters

params (numpy.ndarray) –

train(**opt_kwargs)[source]

Trains the emulator’s hyperparameters by numerically maximizing the log likelihood. This is a light wrapper around scipy.optimize.minimize. If you are experiencing problems optimizing the emulator, consider implementing your own training loop, using this function as a template.

Parameters

**opt_kwargs – Any arguments to pass to the optimizer. By default, method='Nelder-Mead' and maxiter=10000.

See also

scipy.optimize.minimize()
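
For example, assuming these keywords are forwarded directly to scipy.optimize.minimize (values illustrative):

>>> emulator.train(method='Nelder-Mead', options={'maxiter': 20000})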

property variances

The variances for each Gaussian process kernel.

Setter

Sets the variances given an array.

Type

numpy.ndarray