Spectral Emulator¶
The spectral emulator can be likened to the engine behind Starfish. While the novelty of Starfish comes from using Gaussian processes to model and account for the covariances of spectral fits, we still need a way to produce model spectra by interpolating from our synthetic library. We could interpolate spectra from the synthetic library using something like linear interpolation in each of the library parameters, but it turns out that high signal-to-noise data require something more sophisticated, because the interpolation error can constitute a significant portion of the error budget. In that case, non-interpolated spectra (i.e., those at the exact parameters of the synthetic spectra in the library) might be given preference over interpolated spectra, and the posteriors will peak at the grid-point locations. Because the spectral emulator returns a probability distribution over possible interpolated spectra, this interpolation error can be quantified and propagated forward into the likelihood calculation.
Eigenspectra decomposition¶
The first step of configuring the spectral emulator is to choose a subregion of the spectral library corresponding to the star that you will fit. Then, we want to decompose the information content in this subset of the spectral library into several eigenspectra. [Figure A.1 here].
The eigenspectra decomposition is performed via Principal Component Analysis (PCA). Thankfully, most of the heavy lifting is already implemented by the sklearn package.
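To make the decomposition concrete, here is a minimal sketch with a synthetic flux matrix standing in for a real spectral library (the standardization mirrors the emulator's flux_mean / flux_std; the array shapes and values are illustrative assumptions, not Starfish's actual data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a spectral library: 8 spectra, 100 pixels each.
rng = np.random.default_rng(42)
fluxes = rng.normal(1.0, 0.1, size=(8, 100))

# Standardize each pixel, mirroring the emulator's flux_mean / flux_std.
flux_mean = fluxes.mean(axis=0)
flux_std = fluxes.std(axis=0)
whitened = (fluxes - flux_mean) / flux_std

# Keep enough components to explain 99% of the variance (the default used by from_grid).
pca = PCA(n_components=0.99, svd_solver="full")
weights = pca.fit_transform(whitened)   # (n_spectra, n_components)
eigenspectra = pca.components_          # (n_components, n_pixels)

# Any library spectrum is (approximately, if components were truncated)
# recovered from its weights and the eigenspectra.
reconstructed = weights @ eigenspectra * flux_std + flux_mean
```

The emulator's job is then to interpolate the low-dimensional weights, rather than the full pixel-by-pixel spectra.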
Emulator.from_grid() allows easy creation of spectral emulators from a Starfish.grid_tools.HDF5Interface, which includes performing the initial PCA to create the eigenspectra.
>>> from Starfish.grid_tools import HDF5Interface
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.from_grid(HDF5Interface('grid.hdf5'))
Optimizing the emulator¶
Once the synthetic library is decomposed into a set of eigenspectra, the next step is to train the Gaussian Processes (GP) that will serve as interpolators. For more explanation about the choice of Gaussian Process covariance functions and the design of the emulator, see the appendix of our paper.
The optimization of the GP hyperparameters can be carried out by any maximum likelihood estimation framework, but we include a direct method that uses scipy.optimize.minimize. To optimize the emulator, we will use the Emulator.train() routine.
Example: optimizing with the built-in minimizer
>>> from Starfish.grid_tools import HDF5Interface
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.from_grid(HDF5Interface('grid.hdf5'))
>>> emulator
Emulator
--------
Trained: False
lambda_xi: 2.718
Variances:
10000.00
10000.00
10000.00
10000.00
Lengthscales:
[ 600.00 1.50 1.50 ]
[ 600.00 1.50 1.50 ]
[ 600.00 1.50 1.50 ]
[ 600.00 1.50 1.50 ]
Log Likelihood: -1412.00
>>> emulator.train()
>>> emulator
Emulator
--------
Trained: True
lambda_xi: 2.722
Variances:
238363.85
5618.02
9358.09
2853.22
Lengthscales:
[ 1582.39 3.19 3.11 ]
[ 730.81 1.61 2.14 ]
[ 1239.45 3.71 2.78 ]
[ 1127.40 1.63 4.46 ]
Log Likelihood: -1158.83
>>> emulator.save('trained_emulator.hdf5')
Note
The built-in optimization target changes the state of the emulator, so even if the output of the minimizer has not converged, you can simply run Emulator.train() again.
If you want to perform MLE with a different method, feel free to make use of the general modeling framework provided by the functions Emulator.get_param_vector(), Emulator.set_param_vector(), and Emulator.log_likelihood().
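Those three methods form a complete optimization interface. A minimal sketch of such a custom training loop, using a toy stand-in class (NOT the real Emulator; its one-parameter Gaussian log likelihood exists only so the pattern is runnable end to end):

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in exposing the same three methods as the Emulator:
# a 1-D Gaussian log likelihood whose optimum is the data mean (3.0).
class ToyEmulator:
    def __init__(self):
        self.params = {"mu": 0.0}
        self.data = np.array([2.9, 3.1, 3.0])

    def get_param_vector(self):
        return np.array([self.params["mu"]])

    def set_param_vector(self, vec):
        self.params["mu"] = float(vec[0])

    def log_likelihood(self):
        return -0.5 * np.sum((self.data - self.params["mu"]) ** 2)

emulator = ToyEmulator()

def nll(vec):
    # Mirror the built-in training target: set the parameters from the
    # vector, then return the negative log likelihood for the minimizer.
    emulator.set_param_vector(vec)
    return -emulator.log_likelihood()

soln = minimize(nll, emulator.get_param_vector(), method="Nelder-Mead")
emulator.set_param_vector(soln.x)  # leave the emulator in its optimized state
```

Swapping in a different optimizer (or a sampler) only requires replacing the `minimize` call; the set/get/log_likelihood pattern stays the same.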
Model spectrum reconstruction¶
Once the emulator has been optimized, we can finally use it as a means of interpolating spectra.
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.load('trained_emulator.hdf5')
>>> flux = emulator.load_flux([7054, 4.0324, 0.01])
>>> wl = emulator.wl
If you want to take advantage of the emulator covariance matrix, you must use the interface via the Emulator.__call__() function:
>>> from Starfish.emulator import Emulator
>>> emulator = Emulator.load('trained_emulator.hdf5')
>>> weights, cov = emulator([7054, 4.0324, 0.01])
>>> X = emulator.eigenspectra * emulator.flux_std
>>> flux = weights @ X + emulator.flux_mean
>>> emu_cov = X.T @ cov @ X
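The payoff of this interface is that the emulator covariance can be folded directly into a spectral likelihood. A hedged sketch with synthetic arrays (in practice, model_flux and emu_cov are the reconstructed flux and covariance from above, and data_cov comes from the observation's per-pixel uncertainties):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy stand-ins for an emulated model spectrum and its covariance.
npix = 50
rng = np.random.default_rng(0)
model_flux = np.ones(npix)
emu_cov = 0.01 * np.eye(npix)        # emulator (interpolation) uncertainty
data_flux = model_flux + rng.normal(0, 0.1, npix)
data_cov = 0.1 ** 2 * np.eye(npix)   # per-pixel measurement noise

# The emulator covariance simply adds to the data covariance
# in the Gaussian likelihood.
lnprob = multivariate_normal.logpdf(data_flux, mean=model_flux,
                                    cov=data_cov + emu_cov)
```

This additive propagation is what prevents the posteriors from pinning themselves to the library grid points.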
Lastly, if you want to post-process the model, it is useful to process the eigenspectra before reconstructing, especially if a resampling action has to occur. The Emulator provides the attribute Emulator.bulk_fluxes for such processing. For example:
>>> from Starfish.emulator import Emulator
>>> from Starfish.transforms import instrumental_broaden
>>> emulator = Emulator.load('trained_emulator.hdf5')
>>> fluxes = emulator.bulk_fluxes
>>> fluxes = instrumental_broaden(emulator.wl, fluxes, 10)
>>> eigs = fluxes[:-2]
>>> flux_mean, flux_std = fluxes[-2:]
>>> weights, cov = emulator([7054, 4.0324, 0.01])
>>> X = eigs * flux_std
>>> flux = weights @ X + flux_mean
>>> emu_cov = X.T @ cov @ X
Note
Emulator.bulk_fluxes provides a copy of the underlying arrays, so there is no change to the emulator when bulk processing.
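To make the slicing pattern concrete, here is a runnable toy version, with scipy.ndimage.gaussian_filter1d standing in for Starfish.transforms.instrumental_broaden and random arrays standing in for the real bulk fluxes (the shapes are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Toy stand-in for bulk_fluxes: 4 "eigenspectra" rows, then the
# flux_mean and flux_std rows, vertically stacked as one array.
rng = np.random.default_rng(1)
fluxes = rng.normal(1.0, 0.1, size=(6, 200))

# One broadening call processes every row at once along the pixel axis
# (stand-in for instrumental_broaden).
broadened = gaussian_filter1d(fluxes, sigma=3, axis=-1)

# Unpack exactly as in the example above.
eigs = broadened[:-2]
flux_mean, flux_std = broadened[-2:]
```

Processing the stacked array once is cheaper than broadening each eigenspectrum, the mean, and the standard deviation separately, and guarantees they all receive identical treatment.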
Reference¶
Emulator¶
-
class
Starfish.emulator.
Emulator
(grid_points: nptyping.types._ndarray.NDArray, param_names: Sequence[str], wavelength: nptyping.types._ndarray.NDArray, weights: nptyping.types._ndarray.NDArray, eigenspectra: nptyping.types._ndarray.NDArray, w_hat: nptyping.types._ndarray.NDArray, flux_mean: nptyping.types._ndarray.NDArray, flux_std: nptyping.types._ndarray.NDArray, factors: nptyping.types._ndarray.NDArray, lambda_xi: float = 1.0, variances: Optional[nptyping.types._ndarray.NDArray] = None, lengthscales: Optional[nptyping.types._ndarray.NDArray] = None, name: Optional[str] = None)[source]¶ A Bayesian spectral emulator.
This emulator provides an interface to spectral libraries that supports interpolation while providing a variance-covariance matrix that can be forward-propagated in likelihood calculations. For more details, see the appendix of Czekala et al. (2015).
- Parameters
grid_points (numpy.ndarray) – The parameter space from the library.
param_names (array-like of str) – The names of each parameter from the grid
wavelength (numpy.ndarray) – The wavelength of the library models
weights (numpy.ndarray) – The PCA weights for the original grid points
eigenspectra (numpy.ndarray) – The PCA components from the decomposition
w_hat (numpy.ndarray) – The best-fit weights estimator
flux_mean (numpy.ndarray) – The mean flux spectrum
flux_std (numpy.ndarray) – The standard deviation flux spectrum
lambda_xi (float, optional) – The scaling parameter for the augmented covariance calculations, default is 1
variances (numpy.ndarray, optional) – The variance parameters for each Gaussian process, default is array of 1s
lengthscales (numpy.ndarray, optional) – The lengthscales for each Gaussian process, each row should have length equal to number of library parameters, default is arrays of 3 * the max grid separation for the grid_points
name (str, optional) – If provided, will give a name to the emulator; useful for keeping track of filenames. Default is None.
-
params
¶ The underlying hyperparameter dictionary
- Type
dict
-
__call__
(params: Sequence[float], full_cov: bool = True, reinterpret_batch: bool = False) → Tuple[nptyping.types._ndarray.NDArray, nptyping.types._ndarray.NDArray][source]¶ Gets the mu and cov matrix for a given set of params
- Parameters
params (array_like) – The parameters to sample at. Should be consistent with the shapes of the original grid points.
full_cov (bool, optional) – Return the full covariance or just the variance, default is True. This will have no effect if reinterpret_batch is True.
reinterpret_batch (bool, optional) – Will try to return a batch of output matrices if the input params are a list of params, default is False.
- Returns
mu (numpy.ndarray (len(params),))
cov (numpy.ndarray (len(params), len(params)))
- Raises
ValueError – If full_cov and reinterpret_batch are True
ValueError – If querying the emulator outside of its trained grid points
-
property
bulk_fluxes
¶ A vertically concatenated vector of the eigenspectra, flux_mean, and flux_std (in that order). Used for bulk processing with the emulator.
- Type
numpy.ndarray
-
determine_chunk_log
(wavelength: Sequence[float], buffer: float = 50)[source]¶ Truncate the wavelengths and eigenspectra, if necessary, to match a new wavelength grid
- Parameters
wavelength (array_like) – The new wavelengths to truncate to
buffer (float, optional) – The wavelength buffer, in Angstrom. Default is 50
See also
Starfish.grid_tools.utils.determine_chunk_log()
-
classmethod
from_grid
(grid, **pca_kwargs)[source]¶ Create an Emulator using PCA decomposition from a GridInterface.
- Parameters
grid (GridInterface or str) – The grid interface to decompose
pca_kwargs (dict, optional) – The keyword arguments to pass to PCA. By default, n_components=0.99 and svd_solver=’full’.
See also
sklearn.decomposition.PCA()
-
get_index
(params: Sequence[float]) → int[source]¶ Given a list of stellar parameters (corresponding to a grid point), deliver the index that corresponds to the entry in the fluxes, grid_points, and weights.
- Parameters
params (array_like) – The stellar parameters
- Returns
index
- Return type
int
-
get_param_dict
() → dict[source]¶ Gets the dictionary of parameters. This is the same as Emulator.params
- Returns
- Return type
dict
-
get_param_vector
() → nptyping.types._ndarray.NDArray[source]¶ Get a vector of the current trainable parameters of the emulator
- Returns
- Return type
numpy.ndarray
-
property
lambda_xi
¶ The tuning hyperparameter
- Setter
Sets the value.
- Type
float
-
property
lengthscales
¶ The lengthscales for each Gaussian process kernel.
- Setter
Sets the lengthscales given a 2d array
- Type
numpy.ndarray
-
classmethod
load
(filename: Union[str, os.PathLike])[source]¶ Load an emulator from an HDF5 file
- Parameters
filename (str or path-like) –
-
load_flux
(params: Union[Sequence[float], nptyping.types._ndarray.NDArray], norm=False) → nptyping.types._ndarray.NDArray[source]¶ Interpolate a model given any parameters within the grid’s parameter range using eigenspectrum reconstruction by sampling from the weight distributions.
- Parameters
params (array_like) – The parameters to sample at.
- Returns
flux
- Return type
numpy.ndarray
-
log_likelihood
() → float[source]¶ Get the log likelihood of the emulator in its current state as calculated in the appendix of Czekala et al. (2015)
- Returns
- Return type
float
- Raises
scipy.linalg.LinAlgError – If the Cholesky factorization fails
-
norm_factor
(params: Union[Sequence[float], nptyping.types._ndarray.NDArray]) → float[source]¶ Return the scaling factor for the absolute flux units in flux-normalized spectra
- Parameters
params (array_like) – The parameters to interpolate at
- Returns
factor – The multiplicative factor to normalize a spectrum to the model’s absolute flux units
- Return type
float
-
save
(filename: Union[str, os.PathLike])[source]¶ Save the emulator to an HDF5 file
- Parameters
filename (str or path-like) –
-
set_param_dict
(params: dict)[source]¶ Sets the parameters with a dictionary
- Parameters
params (dict) – The new parameters.
-
set_param_vector
(params: nptyping.types._ndarray.NDArray)[source]¶ Set the current trainable parameters given a vector. Must have the same form as get_param_vector()
- Parameters
params (numpy.ndarray) –
-
train
(**opt_kwargs)[source]¶ Trains the emulator’s hyperparameters via maximum likelihood estimation. This is a light wrapper around scipy.optimize.minimize. If you are experiencing problems optimizing the emulator, consider implementing your own training loop, using this function as a template.
- Parameters
**opt_kwargs – Any arguments to pass to the optimizer. By default, method=’Nelder-Mead’ and maxiter=10000.
See also
scipy.optimize.minimize()
-
property
variances
¶ The variances for each Gaussian process kernel.
- Setter
Sets the variances given an array.
- Type
numpy.ndarray