# SCoPe script guide
The `hpc_files` directory in the `scope-ml` repository contains scripts, files, and directory structures that can be used to quick-start running SCoPe on HPC resources (like SDSC Expanse or NCSA Delta). This page documents the constituents of this directory and provides a high-level overview of what the scripts do and how to use them. After installing SCoPe, all the contents of `hpc_files` can be placed within the `scope` directory generated by the `scope-initialize` command.

Note that data files are not included in the `hpc_files` directory. The main files necessary to run the scripts detailed below are listed here and available on Zenodo:

- `trained_models_dnn` and `trained_models_xgb`: download on Zenodo, unzip, and place the directories into the `models_dnn` and `models_xgb` directories, respectively
- `training_set.parquet`: download on Zenodo and place into the directory called `fritzDownload` (see the placement sketch after this list)
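For orientation, here is a minimal shell sketch of placing these downloads. It assumes the archives and `training_set.parquet` have already been fetched from Zenodo into the `scope` directory; the archive names follow the Output section below and may differ between releases.

```bash
# From the scope directory created by scope-initialize; adjust archive names to your download.
unzip trained_dnn_models.zip -d models_dnn/
unzip trained_xgb_models.zip -d models_xgb/
mkdir -p fritzDownload
mv training_set.parquet fritzDownload/
```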
Note also that most included scripts and directories can also be generated from scratch using the following SCoPe scripts: `train-algorithm-slurm`, `generate-features-slurm`, `run-inference-slurm`, and `combine-preds-slurm`. The directories generated by these scripts are generally populated with two subdirectories: `logs`, containing slurm logs, and `slurm`, containing slurm scripts.
## Configuration
The `hpc_config.yaml` file contains the settings that have been used for SCoPe through April 2024. This file can be renamed to `config.yaml`, overwriting the standard file generated by `scope-initialize` and fast-tracking HPC runs. Tokens for kowalski, wandb, and fritz should be obtained and added to this file to enable the SCoPe code to run.

It is generally advisable to run SCoPe scripts from the main `scope` directory that contains your config file. You can also provide the `--config-path` argument to any script. Keep in mind that the code will default to checking your current directory for a file called `config.yaml` if you have not specified this argument.
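As an illustration (the paths, tag, and group names below are placeholders rather than required values), the two ways of pointing a script at your configuration look like this:

```bash
# Option 1: run from the scope directory that contains config.yaml
cd /path/to/scope                              # placeholder path
scope-train --tag vnv --group DR16_example     # illustrative tag/group values

# Option 2: run from elsewhere and pass the config location explicitly
scope-train --config-path /path/to/scope/config.yaml --tag vnv --group DR16_example
```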
## Training
### Training scripts: `train_dnn_DR16.sh` and `train_xgb_DR16.sh`
Each of these scripts can be generated with `create-training-script`. They contain several calls to `scope-train` and initially served as the primary way to sequentially train each model. When `train-algorithm-job-submission` is run to train all classifiers in parallel, these scripts are parsed to identify the tags, group name, and algorithm to pass to the training code.
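For orientation, the calls inside these scripts look roughly like the following sketch; the specific tags, group name, and the flag used to select the algorithm are assumptions here, not text copied from the generated scripts.

```bash
# train_dnn_DR16.sh (sketch): one scope-train call per classifier tag, run sequentially.
# The --algorithm flag name is an assumption; check a generated script for the exact arguments.
scope-train --tag vnv --group DR16_dnn --algorithm dnn
scope-train --tag pnp --group DR16_dnn --algorithm dnn
# ...one line per classification tag
```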
### Directories: `dnn_training` and `xgb_training`
These two directories are generated when running `train-algorithm-slurm`. The `slurm` subdirectories within each one are populated with three example scripts:

- `slurm_sing.sub`: trains a single classifier (specified with the `--tag` argument) using `scope-train`
- `slurm.sub`: uses a wildcard to serve as a training script for any `--tag`
- `slurm_submission.sub`: runs the `train-algorithm-job-submission` python code to submit training jobs for all classifiers, referencing the training scripts mentioned above (see the submission example after this list)
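A typical submission, assuming the account and partition settings inside these scripts have already been adapted to your allocation, is simply:

```bash
# Queue training for every classifier via the job-submission wrapper (dnn_training or xgb_training)
sbatch dnn_training/slurm/slurm_submission.sub

# Or train one classifier at a time with the single-tag script
sbatch dnn_training/slurm/slurm_sing.sub
```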
### Output: trained models in `models_dnn` and `models_xgb`
Trained models are saved in these two directories. The `--group` name passed to the training code will determine the subdirectory where the models are saved. Within this, each classifier gets its own subdirectory that includes the model files, diagnostic plots, and feature importance data (XGB only).

To run inference with the latest trained models, download `trained_dnn_models.zip` and `trained_xgb_models.zip` from Zenodo and unzip them within the corresponding `models_dnn` or `models_xgb` directory.
## Generating Features
### Field-by-field feature generation
The primary way to generate features with SCoPe is to specify a particular ZTF field to run. The following directories contain example slurm scripts to perform this process.
#### Directories: `generated_features_new`, `generated_features_delta`
Each of these directories can be generated with `generate-features-slurm`. `generated_features_new` has been used extensively for SDSC Expanse jobs, while `generated_features_delta` contains experimental slurm scripts for the NCSA Delta resource. The `slurm` subdirectories within each one are populated with a data file and three example scripts:

- `slurm.dat`: the “quadrant file”, generated using `check-quads-for-sources`, mapping each field/ccd/quadrant combination to an integer job number. Files for DR16, DR19, and DR20 are also included; the generic `slurm.dat` file is identical to the DR20 file.
- `slurm_sing.sub`: generates features for a single field, CCD, and quad (specified with the `--field`, `--ccd`, and `--quad` arguments) using `generate-features`
- `slurm.sub`: uses a wildcard to serve as a feature generation script for any `--quadrant-index` in `slurm.dat`
- `slurm_submission.sub`: runs the `generate-features-job-submission` python code to submit feature generation jobs for all config-specified fields (`feature_generation: fields_to_run:`) while excluding fields listed in `fields_to_exclude:` (see the submission sketch after this list)
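As with training, these scripts are submitted with `sbatch`. In the sketch below, the field/ccd/quad values are placeholders, and the account/partition settings inside the scripts are assumed to match your allocation.

```bash
# Queue feature generation for all fields listed under feature_generation: fields_to_run:
sbatch generated_features_new/slurm/slurm_submission.sub

# Or generate features for a single field/ccd/quad directly with the underlying CLI
generate-features --field 296 --ccd 1 --quad 1
```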
### Lightcurve-by-lightcurve feature generation
Another way to run SCoPe feature generation is to provide individual ZTF lightcurve IDs instead of fields. This requires some data wrangling to put the source list in the appropriate format for SCoPe to recognize.
#### Notebook: `underMS_data_wrangling_notebook.ipynb`
This notebook contains an example of wrangling a list of designations, right ascensions, and declinations into a SCoPe-friendly format. It demonstrates running a cone search for all ZTF lightcurves within a specified radius and then formatting the column names as SCoPe requires. The notebook then saves the resulting lightcurve list in batches so that the feature generation process does not time out when running on SDSC Expanse.
#### Directory: `generated_features_underMS`
Once the lightcurve lists are generated and saved (in this example, to the `underMS_ids_DR20` subdirectory), the following slurm script can be repeatedly queued to run feature generation:

- `dr20_slurm.sub`: uses a wildcard to serve as a feature generation script for any index (`$IDX`) in the batched filenames. For example, run `sbatch --export=IDX=0 dr20_slurm.sub` to run feature generation on `sources_ids_2arcsec_renamed_0.parquet`
### General feature generation advice/troubleshooting
The following advice and troubleshooting list is based on running ~70 fields’ worth of feature generation on SDSC Expanse resources. It may need to be adjusted when running on other resources.
#### Ensuring all quads run successfully
When a field/ccd/quad job is queued to run, an empty file with a `.running` extension will be saved. The code uses this file to keep track of which fields/ccds/quads have been queued for feature generation. Note that the existence of this file does not mean that feature generation necessarily completed; in some cases, the job may fail. It is important to verify that feature generation actually succeeded for all quads in a field. One may do this either by manually counting the files and comparing with expectations, or by re-running feature generation job submission for the same fields while setting the `--reset-running` flag (assuming all jobs have concluded). This will re-submit any jobs that did not produce the requisite `.parquet` file, or conclude immediately if all jobs are complete.
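For example (a sketch: it assumes the `--reset-running` flag is added to the same `generate-features-job-submission` call used for the original submission):

```bash
# After all queued jobs have concluded, re-run job submission with --reset-running;
# jobs missing their .parquet output are re-submitted, otherwise this exits immediately.
generate-features-job-submission --reset-running
```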
#### Fields with > 10,000,000 lightcurves
Some fields have a particularly large number of lightcurves, especially those near the Galactic Plane. On the Expanse `gpu-shared` partition, there have been out-of-memory issues when using the standard `91G` of memory for fields with more than around 10,000,000 lightcurves. To avoid lost GPU time, identify these fields ahead of time using the included `DR19_field_counts.json` file. (This file was generated by running `scope.utils.get_field_count` on the `DR19_catalog_completeness.json` file, which itself was obtained using `tools.generate_features_slurm.check_quads_for_sources`.) Next, scale up the requested memory in `slurm.sub` in proportion to the number of lightcurves in the field divided by 10,000,000. Scale down the `--max-instances` argument in `slurm_submission.sub` by the same fraction to avoid running into cluster limitations on memory requested per user. Note that as a result of this scaling, “large” fields will take more real time to run than they would if the maximum number of instances could be used simultaneously.
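As a worked example of that scaling (the lightcurve count and baseline instance count are illustrative, and the exact `#SBATCH` lines in `slurm.sub` may differ):

```bash
# Field with ~20,000,000 lightcurves -> scaling factor = 20,000,000 / 10,000,000 = 2
#SBATCH --mem=182G          # memory scaled up:  91G x 2
# --max-instances 10        # instances scaled down, e.g. 20 / 2, in slurm_submission.sub
```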
#### Kowalski query limitations
Note that while some compute resources may offer many cores that could parallelize and speed up Kowalski queries, once the number of simultaneous queries exceeds ~200 (for reference, we currently run 20 jobs with 9 cores each in parallel), queries will begin to fail and compute time will be wasted.
#### Path to scope code/inputs/outputs should be the same
While the config file supports specifying a `path_to_features` and `path_to_preds` that differ from the code installation location, it is easiest to install `scope-ml` in the same directory where the inputs will be stored and the outputs will be written. On a cluster, make sure this is not the home or scratch directory, but instead the project storage location.
#### Lightcurve-by-lightcurve memory requirements
For lightcurve-by-lightcurve feature generation, try to limit the number of lightcurves in a batch to 100,000 and increase the memory to `182G`. The current code requires the user to manually run `sbatch` for each batch file, modifying the `--export=IDX=N` argument for each `N` in the batched filenames.
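A small shell loop can issue those `sbatch` calls one per batch file (a sketch assuming ten batches numbered 0–9, following the filename pattern from the notebook above):

```bash
# Submit one feature-generation job per batched lightcurve file,
# e.g. sources_ids_2arcsec_renamed_0.parquet through _9.parquet
for N in $(seq 0 9); do
    sbatch --export=IDX=$N dr20_slurm.sub
done
```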
## Running Inference
### Inference scripts: `get_all_preds_dnn_DR16.sh` and `get_all_preds_xgb_DR16.sh`
Each of these scripts can be generated with `create-inference-script`. They contain a call to `run-inference` and can be run on their own to perform inference (one field at a time). They can also be used by running `run-inference-job-submission` to perform inference for all fields in parallel.
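For example (a sketch: it assumes the generated script already lists the fields to run, and that `run-inference-job-submission` is invoked the same way it is in the slurm scripts described below):

```bash
# Run the generated script directly, performing inference one field at a time
bash get_all_preds_dnn_DR16.sh

# Or submit inference jobs for all config-specified fields in parallel
run-inference-job-submission
```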
### Directories: `dnn_inference` and `xgb_inference`
These two directories are generated when running `run-inference-slurm`. The `slurm` subdirectories within each one are populated with three example scripts:

- `slurm_sing.sub`: runs inference on a single field specified in the file
- `slurm.sub`: uses a wildcard to serve as an inference script for any field
- `slurm_submission.sub`: runs the `run-inference-job-submission` python code to submit inference jobs for all config-specified fields (`inference: fields_to_run:`) while excluding fields listed in `fields_to_exclude:`
## Combining Predictions
### Directory: `combine_preds`
The `slurm` subdirectory here contains a script to combine the predictions for the DNN and XGB algorithms (see the submission example after this list):

- `slurm.sub`: runs `combine-preds` for all config-specified fields (`inference: fields_to_run:`) while excluding fields listed in `fields_to_exclude:`, writing both parquet and CSV files
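Submission follows the same pattern as the other slurm directories (the path is relative to the `scope` directory):

```bash
# Combine the DNN and XGB predictions for all config-specified fields
sbatch combine_preds/slurm/slurm.sub
```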
## Classifying variables near GCN transient candidates
One special application of SCoPe is to classify variable sources that are close (in angular separation) to GCN transient candidates listed on fritz. In this workflow, small-scale feature generation is run on SDSC Expanse before running inference locally and uploading any high-confidence classifications to fritz (see the Guide for Fritz Scanners for more details).
### GCN inference scripts: `get_all_preds_dnn_GCN.sh`, `get_all_preds_xgb_GCN.sh`
These scripts are nearly identical to the inference scripts referenced above, but inference results are saved to different directories.
### Directory: `generated_features_GCN_sources`
The `slurm` subdirectory within it contains two example scripts (submitted as sketched after this list):

- `gpu-debug_slurm.sub`: uses wildcards to run small-scale feature generation for a list of sources from a given GCN `dateobs`. This is the script that is run by default, since the `gpu-debug` partition on Expanse offers enough resources with shorter wait times than `gpu-shared`.
- `gpu-shared_slurm.sub`: the same as `gpu-debug_slurm.sub`, but running on the `gpu-shared` partition of Expanse.
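These scripts are again submitted with `sbatch`; the wildcard values (such as the GCN `dateobs`) are supplied via `--export`, but the exact variable names are defined inside the scripts, so the sketch below omits them.

```bash
# Submit GCN-source feature generation on the gpu-debug partition (the default choice).
# Check the script header for the variable names expected via sbatch --export.
sbatch generated_features_GCN_sources/slurm/gpu-debug_slurm.sub
```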
### Script: `gcn_cronjob.py`
See more details about how this script can be run automatically in the Usage/Running automated analyses section of the documentation.