SCoPe script guide

The hpc_files directory in the scope-ml repository contains scripts, files and directory structures that can be used to quick-start running SCoPe on HPC resources (like SDSC Expanse or NCSA Delta). This page documents the constituents of this directory and provides a high-level overview of what the scripts do and how to use them. After installing SCoPe, all the contents of hpc_files can be placed within the scope directory generated by the scope-initialize command.

Note that data files are not included in the hpc_files directory. The main files necessary to run the scripts detailed below are listed here and available on Zenodo:

  • trained_models_dnn and trained_models_xgb: download from Zenodo, unzip, and place the directories into the models_dnn and models_xgb directories, respectively

  • training_set.parquet: download from Zenodo and place into the directory called fritzDownload

Note also that most of the included scripts and directories can be generated from scratch using the following SCoPe scripts: train-algorithm-slurm, generate-features-slurm, run-inference-slurm, and combine-preds-slurm. The directories generated by these scripts are generally populated with two subdirectories: logs, which contains slurm logs, and slurm, which contains slurm scripts.

Configuration

The hpc_config.yaml file contains the settings that have been used for SCoPe through April 2024. This file can be renamed to config.yaml, overwriting the standard file generated by scope-initialize and fast-tracking HPC runs. Tokens for kowalski, wandb and fritz should be obtained and added to this file to enable SCoPe code to run.

It is generally advisable to run SCoPe scripts from the main scope directory that contains your config file. You can also provide the --config-path argument to any script. Keep in mind that the code will default to checking your current directory for a file called config.yaml if you have not specified this argument.
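
The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior: resolve_config_path is a hypothetical helper, not a function from scope-ml, and the real scripts may resolve the path differently.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(config_path: Optional[str] = None) -> Path:
    """Return the config file a script would read under the rule above.

    Hypothetical helper for illustration; not part of the scope-ml API.
    """
    # An explicit --config-path value takes precedence.
    if config_path is not None:
        return Path(config_path).expanduser().resolve()
    # Otherwise fall back to config.yaml in the current working directory.
    return (Path.cwd() / "config.yaml").resolve()


# Report which file would be used when no argument is given.
default_path = resolve_config_path()
print(default_path, "(found)" if default_path.is_file() else "(missing)")
```

Running a check like this from the directory where a job will start is a quick way to confirm that the intended config file will be picked up.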

Training

Training scripts: train_dnn_DR16.sh and train_xgb_DR16.sh

Each of these scripts can be generated with create-training-script. They contain several calls to scope-train and initially served as the primary way to sequentially train each model. When train-algorithm-job-submission is run to train all classifiers in parallel, these scripts are parsed to identify the tags, group name and algorithm to pass to the training code.

Directories: dnn_training and xgb_training

These two directories are generated when running train-algorithm-slurm. The slurm subdirectories within each one are populated with three example scripts:

  • slurm_sing.sub: trains a single classifier (specified with the --tag argument) using scope-train

  • slurm.sub: uses a wildcard to serve as a training script for any --tag

  • slurm_submission.sub: runs the train-algorithm-job-submission python code to submit training jobs for all classifiers, referencing the training scripts mentioned above

Output: trained models in models_dnn and models_xgb

Trained models are saved in these two directories. The --group name passed to the training code will determine the subdirectory where the models are saved. Within this, each classifier gets its own subdirectory that includes the model files, diagnostic plots, and feature importance data (XGB only).

To run inference with the latest trained models, download trained_dnn_models.zip and trained_xgb_models.zip from Zenodo and unzip them within the corresponding models_dnn or models_xgb directory.

Generating Features

Field-by-field feature generation

The primary way to generate features with SCoPe is to specify a ZTF field to run. The following directories contain example slurm scripts to perform this process.

Directories: generated_features_new, generated_features_delta

Each of these directories can be generated with generate-features-slurm. generated_features_new has been used extensively for SDSC Expanse jobs, while generated_features_delta contains experimental slurm scripts for the NCSA Delta resource. The slurm subdirectories within each one are populated with a data file and three example scripts:

  • slurm.dat: “quadrant file”, generated using check-quads-for-sources, mapping each field/ccd/quadrant combination to an integer job number. Quadrant files named for DR16, DR19 and DR20 are also included. The generic slurm.dat file is identical to the DR20 file.

  • slurm_sing.sub: generates features for a single field, CCD, and quad (specified with --field, --ccd, and --quad arguments) using generate-features

  • slurm.sub: uses a wildcard to serve as a feature generation script for any --quadrant-index in slurm.dat

  • slurm_submission.sub: runs the generate-features-job-submission python code to submit feature generation jobs for all config-specified fields (feature_generation: fields_to_run:) while excluding fields listed in fields_to_exclude:
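
The interplay between fields_to_run and fields_to_exclude amounts to an ordered set difference. The sketch below illustrates that idea only; select_fields is a hypothetical helper, the field numbers are made up, and the actual job submission code may apply the two lists differently.

```python
from typing import Iterable, List


def select_fields(fields_to_run: Iterable[int],
                  fields_to_exclude: Iterable[int]) -> List[int]:
    """Return the fields to submit, keeping input order and dropping duplicates.

    Hypothetical helper for illustration; not part of the scope-ml API.
    """
    excluded = set(fields_to_exclude)
    seen = set()
    selected = []
    for field in fields_to_run:
        # Skip fields that are excluded or already selected.
        if field in excluded or field in seen:
            continue
        seen.add(field)
        selected.append(field)
    return selected


# Illustrative field numbers only.
print(select_fields([296, 297, 298, 299], [298]))  # [296, 297, 299]
```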

Lightcurve-by-lightcurve feature generation

Another way to run SCoPe feature generation is to provide individual ZTF lightcurve IDs instead of fields. This requires some data wrangling to put the source list in the appropriate format for SCoPe to recognize.

Notebook: underMS_data_wrangling_notebook.ipynb

This notebook contains an example of wrangling a list of designations, right ascensions and declinations into a SCoPe-friendly format. It demonstrates running a cone search for all ZTF lightcurves within a specified radius and then formatting column names as SCoPe requires. The notebook then saves the resulting lightcurve list in batches so that the feature generation process does not time out when running on SDSC Expanse.

Directory: generated_features_underMS

Once the lightcurve lists are generated and saved (in this example to the underMS_ids_DR20 subdirectory), the following slurm script can be queued repeatedly to run feature generation:

  • dr20_slurm.sub: uses a wildcard to serve as a feature generation script for any index ($IDX) in the batched filenames. For example, run sbatch --export=IDX=0 dr20_slurm.sub to run feature generation on sources_ids_2arcsec_renamed_0.parquet
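
Because one sbatch call is needed per batch file, it can help to generate the command lines rather than type them by hand. The sketch below only builds and prints the commands and does not submit anything. build_sbatch_commands is a hypothetical helper, and the number of batches is an assumed input.

```python
import shlex
from typing import List


def build_sbatch_commands(n_batches: int,
                          script: str = "dr20_slurm.sub") -> List[str]:
    """Build one sbatch command per batch index (0 .. n_batches - 1).

    Hypothetical helper for illustration; it returns strings and does
    not call sbatch itself.
    """
    commands = []
    for idx in range(n_batches):
        # IDX selects the batched file, as in the example above.
        args = ["sbatch", f"--export=IDX={idx}", script]
        commands.append(shlex.join(args))
    return commands


# Assumed: three batch files with indices 0, 1 and 2.
for command in build_sbatch_commands(3):
    print(command)
# sbatch --export=IDX=0 dr20_slurm.sub
# sbatch --export=IDX=1 dr20_slurm.sub
# sbatch --export=IDX=2 dr20_slurm.sub
```

The printed lines can be reviewed and then pasted into a shell on the cluster login node.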

General feature generation advice/troubleshooting

The following advice and troubleshooting list is based on running ~70 fields’ worth of feature generation on SDSC Expanse resources. It may need to be adjusted when running on other resources.

Ensuring all quads run successfully

  • When a field/ccd/quad job is queued to run, an empty file with a .running extension will be saved. The code uses this file to keep track of which fields/ccds/quads have been queued for feature generation. Note that the existence of this file does not mean that feature generation necessarily completed; in some cases, the job may fail. It is important to verify that feature generation actually succeeded for all quads in a field. One may do this either by manually counting the files and comparing with expectations, or by re-running feature generation job submission for the same fields while setting the --reset-running flag (assuming all jobs have concluded). This will re-submit any jobs that did not produce the requisite .parquet file or conclude immediately if all jobs are complete.
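
The manual check can be partly automated by comparing marker files against output files. The sketch below assumes that each .running marker and its .parquet output share the same base name and live under one directory tree. That naming is an assumption made for illustration and may not match the actual SCoPe output layout. find_unfinished_jobs is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_unfinished_jobs(directory: str) -> List[str]:
    """List base names that have a .running marker but no .parquet output.

    Hypothetical helper for illustration. Assumes marker and output
    files share a base name, which may not hold for real SCoPe output.
    """
    root = Path(directory)
    # Base names of jobs that produced an output file.
    finished = {path.stem for path in root.rglob("*.parquet")}
    # Base names of jobs that were queued at some point.
    queued = {path.stem for path in root.rglob("*.running")}
    return sorted(queued - finished)
```

An empty list suggests every queued job wrote its output. Any names returned are candidates for resubmission with the --reset-running flag once all jobs have concluded.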

Fields with > 10,000,000 lightcurves

  • Some fields have a particularly large number of lightcurves, especially those near the Galactic Plane. On the Expanse gpu-shared partition, there have been out-of-memory issues when using the standard 91G of memory for fields with more than around 10,000,000 lightcurves. To avoid lost GPU time, identify these fields ahead of time using the included DR19_field_counts.json file. (This file was generated by running scope.utils.get_field_count on the DR19_catalog_completeness.json file, which itself was obtained using tools.generate_features_slurm.check_quads_for_sources.) Next, scale up the requested memory in slurm.sub in proportion to the number of lightcurves in the field divided by 10,000,000. Scale down the --max-instances argument in slurm_submission.sub by the same factor to avoid running into cluster limits on memory requested per user. Note that as a result of this scaling, “large” fields will take more wall-clock time to run than they would if the maximum number of instances could be used simultaneously.
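
The scaling rule above is simple arithmetic, sketched here in Python. The 91G baseline and the 10,000,000 threshold come from the text. The rounding choices (memory rounded up, instances rounded down), the example value of 20 instances, and the function name scale_resources are assumptions made for illustration.

```python
import math
from typing import Tuple

BASE_MEMORY_GB = 91            # standard gpu-shared request quoted above
BASE_LIGHTCURVES = 10_000_000  # threshold quoted above


def scale_resources(n_lightcurves: int, max_instances: int) -> Tuple[int, int]:
    """Return (memory in GB, max instances) scaled for a large field.

    Hypothetical helper for illustration; rounding is an assumption.
    """
    # Fields below the threshold keep the standard settings.
    factor = max(1.0, n_lightcurves / BASE_LIGHTCURVES)
    # Round memory up so the request is never below the scaled value.
    memory_gb = math.ceil(BASE_MEMORY_GB * factor)
    # Round instances down, but always allow at least one job.
    instances = max(1, math.floor(max_instances / factor))
    return memory_gb, instances


# A field with 20,000,000 lightcurves and 20 instances as the starting point.
print(scale_resources(20_000_000, 20))  # (182, 10)
```

Whether a single job can actually be granted the scaled memory depends on the node size and per-user limits of the cluster, so check the result against the resource documentation before submitting.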

Kowalski query limitations

  • Some compute resources offer many cores that could parallelize and speed up Kowalski queries. However, once the number of simultaneous queries exceeds ~200, queries begin to fail and compute time is wasted. For reference, we currently run 20 jobs in parallel with 9 cores each.
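
As a quick sanity check before changing the job layout, the query budget can be worked out as follows. The sketch assumes each core issues one query at a time, which is an assumption made for illustration, and treats the ~200 limit as approximate. Both function names are hypothetical.

```python
QUERY_LIMIT = 200  # approximate limit quoted above


def simultaneous_queries(n_jobs: int, cores_per_job: int) -> int:
    """Estimate simultaneous queries, assuming one query per core."""
    return n_jobs * cores_per_job


def max_parallel_jobs(cores_per_job: int, limit: int = QUERY_LIMIT) -> int:
    """Largest number of jobs that keeps the estimate within the limit."""
    return limit // cores_per_job


# The layout quoted above: 20 jobs with 9 cores each.
print(simultaneous_queries(20, 9))  # 180
print(max_parallel_jobs(9))         # 22
```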

Path to scope code/inputs/outputs should be the same

  • While the config file supports specifying a path_to_features and path_to_preds that differ from the code installation location, it is easiest to install scope-ml in the same directory where the inputs will be stored and outputs will be written. On a cluster, make sure this is not the home or scratch directory, but instead the project storage location.

Lightcurve-by-lightcurve memory requirements

  • For lightcurve-by-lightcurve feature generation, try to limit the number of lightcurves in a batch to 100,000 and increase the memory to 182G. The current code requires the user to manually run sbatch for each batch file, modifying the --export=IDX=N argument for each N in the batched filenames.
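
To see how a source list maps onto batch files under the 100,000 limit, the splitting step can be sketched as below. The filename pattern follows the earlier example. batch_ids is a hypothetical helper, the ID values are placeholders, and no files are written.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 100_000  # recommended upper limit quoted above


def batch_ids(ids: Sequence[int],
              batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive slices of ids with at most batch_size entries.

    Hypothetical helper for illustration; not part of the scope-ml API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


# Placeholder IDs: 250,000 sources split into three batches.
placeholder_ids = range(250_000)
for index, batch in enumerate(batch_ids(placeholder_ids)):
    name = f"sources_ids_2arcsec_renamed_{index}.parquet"
    print(name, len(batch))
```

Each batch index printed here corresponds to one value of N in the --export=IDX=N argument.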

Running Inference

Inference scripts: get_all_preds_dnn_DR16.sh and get_all_preds_xgb_DR16.sh

Each of these scripts can be generated with create-inference-script. They contain a call to run-inference and can be run on their own to perform inference (one field at a time). They can also be used by run-inference-job-submission to perform inference for all fields in parallel.

Directories: dnn_inference and xgb_inference

These two directories are generated when running run-inference-slurm. The slurm subdirectories within each one are populated with three example scripts:

  • slurm_sing.sub: runs inference on a single field specified in the file

  • slurm.sub: uses a wildcard to serve as an inference script for any field

  • slurm_submission.sub: runs the run-inference-job-submission python code to submit inference jobs for all config-specified fields (inference: fields_to_run:) while excluding fields listed in fields_to_exclude:

Combining Predictions

Directory: combine_preds

The slurm subdirectory here contains a script to combine the predictions for the DNN and XGB algorithms:

  • slurm.sub: runs combine-preds for all config-specified fields (inference: fields_to_run:) while excluding fields listed in fields_to_exclude:, writing both parquet and CSV files

Classifying variables near GCN transient candidates

One special application of SCoPe is to classify variable sources that are near (in angular separation) to GCN transient candidates listed on fritz. In this workflow, small-scale feature generation is run on SDSC Expanse before running inference locally and uploading any high-confidence classifications to fritz (see Guide for Fritz Scanners for more details).

GCN inference scripts: get_all_preds_dnn_GCN.sh, get_all_preds_xgb_GCN.sh

These scripts are nearly identical to the inference scripts referenced above, but inference results are saved to different directories.

Directory: generated_features_GCN_sources

The slurm subdirectory within contains two example scripts:

  • gpu-debug_slurm.sub: uses wildcards to run small-scale feature generation for a list of sources from a given GCN dateobs. This is the script that is run by default, since the gpu-debug partition on Expanse offers enough resources with shorter wait times than gpu-shared.

  • gpu-shared_slurm.sub: the same as gpu-debug_slurm.sub, but running on the gpu-shared partition of Expanse.

Script: gcn_cronjob.py

See more details about how this script can be run automatically in the Usage/Running automated analyses section of the documentation.