SCoPe script guide¶
The hpc_files directory in the scope-ml repository contains scripts, files and directory structures that can be used to quick-start running SCoPe on HPC resources (like SDSC Expanse or NCSA Delta). This page documents the constituents of this directory and provides a high-level overview of what the scripts do and how to use them. After installing SCoPe, all the contents of hpc_files can be placed within the scope directory generated by the scope-initialize command.
Note that data files are not included in the hpc_files directory. The main files necessary to run the scripts detailed below are listed here and available on Zenodo:
- `trained_models_dnn` and `trained_models_xgb`: download from Zenodo, unzip, and place the directories into the `models_dnn` and `models_xgb` directories, respectively
- `training_set.parquet`: download from Zenodo and place into the directory called `fritzDownload`
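A minimal sketch of that placement is below; the zip file names follow the Zenodo downloads referenced later in this guide, and the download location is a placeholder.

```bash
# Unzip the trained models into the directories created by scope-initialize,
# and place the training set into fritzDownload (download path is an assumption).
cd /path/to/scope
unzip ~/Downloads/trained_dnn_models.zip -d models_dnn/
unzip ~/Downloads/trained_xgb_models.zip -d models_xgb/
mv ~/Downloads/training_set.parquet fritzDownload/
```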
Note that most of the included scripts and directories can also be generated from scratch using the following SCoPe scripts: `train-algorithm-slurm`, `generate-features-slurm`, `run-inference-slurm`, and `combine-preds-slurm`. The directories generated by these scripts are generally populated with two subdirectories: `logs`, containing slurm logs, and `slurm`, containing slurm scripts.
Configuration¶
The `hpc_config.yaml` file contains the settings that have been used for SCoPe through April 2024. This file can be renamed to `config.yaml`, overwriting the standard file generated by `scope-initialize` and fast-tracking HPC runs. Tokens for kowalski, wandb, and fritz should be obtained and added to this file to enable the SCoPe code to run.
It is generally advisable to run SCoPe scripts from the main scope directory that contains your config file. You can also provide the `--config-path` argument to any script. Keep in mind that the code will default to checking your current directory for a file called `config.yaml` if you have not specified this argument.
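A minimal sketch of both approaches (paths are placeholders, and `run-inference-job-submission` is used here only as an example command):

```bash
# Option 1: run from the directory that contains config.yaml
cd /path/to/scope
run-inference-job-submission                 # picks up ./config.yaml by default

# Option 2: point any SCoPe script at an explicit config file
run-inference-job-submission --config-path /path/to/scope/config.yaml
```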
Training¶
Training scripts: train_dnn_DR16.sh and train_xgb_DR16.sh¶
Each of these scripts can be generated with create-training-script. They contain several calls to scope-train and initially served as the primary way to sequentially train each model. When train-algorithm-job-submission is run to train all classifiers in parallel, these scripts are parsed to identify the tags, group name and algorithm to pass to the training code.
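As an illustration, a generated training script consists of repeated `scope-train` calls, one per classifier. The sketch below is hypothetical: `--tag` and `--group` appear elsewhere in this guide, while the example tag names and the `--algorithm` flag name are assumptions.

```bash
# Hypothetical excerpt of a generated training script; each line trains one classifier.
scope-train --tag vnv --algorithm dnn --group DR16_stats
scope-train --tag e --algorithm dnn --group DR16_stats
scope-train --tag rrlyr --algorithm dnn --group DR16_stats
```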
Directories: dnn_training and xgb_training¶
These two directories are generated when running train-algorithm-slurm. The slurm subdirectories within each one are populated with three example scripts:
- `slurm_sing.sub`: trains a single classifier (specified with the `--tag` argument) using `scope-train`
- `slurm.sub`: uses a wildcard to serve as a training script for any `--tag`
- `slurm_submission.sub`: runs the `train-algorithm-job-submission` python code to submit training jobs for all classifiers, referencing the training scripts mentioned above
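For reference, a minimal sketch of submitting these scripts on Expanse, assuming the `hpc_files` contents have been placed inside the scope directory:

```bash
# Train the single classifier configured inside slurm_sing.sub
sbatch dnn_training/slurm/slurm_sing.sub

# Or submit training jobs for every classifier in parallel
sbatch dnn_training/slurm/slurm_submission.sub
```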
Output: trained models in models_dnn and models_xgb¶
Trained models are saved in these two directories. The --group name passed to the training code will determine the subdirectory where the models are saved. Within this, each classifier gets its own subdirectory that includes the model files, diagnostic plots, and feature importance data (XGB only).
To run inference with the latest trained models, download trained_dnn_models.zip and trained_xgb_models.zip from Zenodo and unzip them within the corresponding models_dnn or models_xgb directory.
Generating Features¶
Field-by-field feature generation¶
The primary way to generate features with SCoPe is to specify a ZTF field to run. The following directories contain example slurm scripts to perform this process.
Directories: generated_features_new, generated_features_delta¶
Each of these directories can be generated with generate-features-slurm. generated_features_new has been used extensively for SDSC Expanse jobs, while generated_features_delta contains experimental slurm scripts for the NCSA Delta resource. The slurm subdirectories within each one are populated with a data file and three example scripts:
- `slurm.dat`: the “quadrant file”, generated using `check-quads-for-sources`, mapping each field/ccd/quadrant combination to an integer job number. Files for DR16, DR19 and DR20 are also included; the generic `slurm.dat` file is identical to the DR20 file.
- `slurm_sing.sub`: generates features for a single field, CCD, and quad (specified with the `--field`, `--ccd`, and `--quad` arguments) using `generate-features`
- `slurm.sub`: uses a wildcard to serve as a feature generation script for any `--quadrant-index` in `slurm.dat`
- `slurm_submission.sub`: runs the `generate-features-job-submission` python code to submit feature generation jobs for all config-specified fields (`feature_generation: fields_to_run:`) while excluding fields listed in `fields_to_exclude:`
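A minimal sketch of the two modes (field/ccd/quad values are placeholders, additional options may be required depending on your setup, and paths assume `hpc_files` was placed inside the scope directory):

```bash
# Generate features for one field/CCD/quadrant directly
generate-features --field 296 --ccd 1 --quad 1

# Or submit feature-generation jobs for every field listed in the config
sbatch generated_features_new/slurm/slurm_submission.sub
```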
Lightcurve-by-lightcurve feature generation¶
Another way to run SCoPe feature generation is to provide individual ZTF lightcurve IDs instead of fields. This requires some data wrangling to put the source list in the appropriate format for SCoPe to recognize.
Notebook: underMS_data_wrangling_notebook.ipynb¶
This notebook contains an example of wrangling a list of designations, right ascensions and declinations into a SCoPe-friendly format. It demonstrates running a cone search for all ZTF lightcurves within a specified radius and then formatting column names as SCoPe requires. The notebook then saves the resulting lightcurve list in batches so the feature generation process does not time out when running on SDSC Expanse.
Directory: generated_features_underMS¶
Once the lightcurve lists are generated and saved (in this example to the `underMS_ids_DR20` subdirectory), the following slurm script can be repeatedly queued to run feature generation:
- `dr20_slurm.sub`: uses a wildcard to serve as a feature generation script for any index (`$IDX`) in the batched filenames. For example, run `sbatch --export=IDX=0 dr20_slurm.sub` to run feature generation on `sources_ids_2arcsec_renamed_0.parquet`
General feature generation advice/troubleshooting¶
The following advice and troubleshooting list is based on running ~70 fields’ worth of feature generation on SDSC Expanse resources. It may need to be adjusted when running on other resources.
Ensuring all quads run successfully¶
When a field/ccd/quad job is queued to run, an empty file with a `.running` extension will be saved. The code uses this file to keep track of which fields/ccds/quads have been queued for feature generation. Note that the existence of this file does not mean that feature generation necessarily completed; in some cases, the job may fail. It is important to verify that feature generation actually succeeded for all quads in a field. One may do this either by manually counting the files and comparing with expectations, or by re-running feature generation job submission for the same fields while setting the `--reset-running` flag (assuming all jobs have concluded). This will re-submit any jobs that did not produce the requisite `.parquet` file, or conclude immediately if all jobs are complete.
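A sketch of that re-check step, assuming the fields of interest are already listed under `feature_generation: fields_to_run:` in the config:

```bash
# Re-submit only the field/ccd/quad jobs that never produced their .parquet output;
# exits quickly if all jobs are already complete.
generate-features-job-submission --reset-running
```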
Fields with > 10,000,000 lightcurves¶
Some fields have a particularly large number of lightcurves, especially those near the Galactic Plane. On the Expanse `gpu-shared` partition, there have been out-of-memory issues when using the standard `91G` of memory for fields with more than around 10,000,000 lightcurves. To avoid lost GPU time, identify these fields ahead of time using the included `DR19_field_counts.json` file. (This file was generated by running `scope.utils.get_field_count` on the `DR19_catalog_completeness.json` file, which itself was obtained using `tools.generate_features_slurm.check_quads_for_sources`.) Next, scale up the requested memory in `slurm.sub` in proportion to the number of lightcurves in the field divided by 10,000,000. Scale down the `--max-instances` argument in `slurm_submission.sub` by the same fraction to avoid running into cluster limitations on memory requested per user. Note that as a result of this scaling, “large” fields will take more real time to run than they would if the maximum number of instances could be simultaneously used.
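As a worked example of this scaling (the lightcurve count here is hypothetical; the 91G baseline and 10,000,000-lightcurve threshold come from the paragraph above):

```bash
# A field with ~20,000,000 lightcurves is twice the 10M baseline, so request
# roughly 2x the standard 91G in slurm.sub and divide --max-instances by 2.
N_LIGHTCURVES=20000000
SCALE=$((N_LIGHTCURVES / 10000000))
echo "Request $((91 * SCALE))G of memory and divide --max-instances by ${SCALE}"
```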
Kowalski query limitations¶
Note that while some compute resources may offer many cores that could parallelize and speed up Kowalski queries, once the number of simultaneous queries exceeds ~200 (e.g. beyond the 20 jobs with 9 cores each that we currently run in parallel), queries will begin to fail and compute time will be wasted.
Path to scope code/inputs/outputs should be the same¶
While the config file supports specifying a `path_to_features` and `path_to_preds` that differ from the code installation location, it is easiest to install `scope-ml` in the same directory where the inputs will be stored and the outputs will be written. On a cluster, make sure this is not the home or scratch directory, but instead the project storage location.
Lightcurve-by-lightcurve memory requirements¶
For lightcurve-by-lightcurve feature generation, try to limit the number of lightcurves in a batch to 100,000 and increase the memory to `182G`. The current code requires the user to manually run `sbatch` for each batch file, modifying the `--export=IDX=N` argument for each `N` in the batched filenames.
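One way to script those repeated submissions (a sketch; the batch count and filename pattern follow the underMS example above and may differ for your data):

```bash
# Submit one feature-generation job per batched lightcurve file,
# e.g. sources_ids_2arcsec_renamed_0.parquet through _9.parquet (10 batches assumed).
for IDX in $(seq 0 9); do
    sbatch --export=IDX=${IDX} dr20_slurm.sub
done
```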
Running Inference¶
Inference scripts: get_all_preds_dnn_DR16.sh and get_all_preds_xgb_DR16.sh¶
Each of these scripts can be generated with `create-inference-script`. They contain a call to `run-inference` and can be run on their own to perform inference (one field at a time). They can also be used by running `run-inference-job-submission` to perform inference for all fields in parallel.
Directories: dnn_inference and xgb_inference¶
These two directories are generated when running run-inference-slurm. The slurm subdirectories within each one are populated with three example scripts:
- `slurm_sing.sub`: runs inference on a single field specified in the file
- `slurm.sub`: uses a wildcard to serve as an inference script for any field
- `slurm_submission.sub`: runs the `run-inference-job-submission` python code to submit inference jobs for all config-specified fields (`inference: fields_to_run:`) while excluding fields listed in `fields_to_exclude:`
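A minimal sketch of submitting parallel inference for both algorithms, assuming the `hpc_files` layout inside the scope directory:

```bash
# Submit inference jobs for all config-specified fields, DNN and XGB
sbatch dnn_inference/slurm/slurm_submission.sub
sbatch xgb_inference/slurm/slurm_submission.sub
```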
Combining Predictions¶
Directory: combine_preds¶
The slurm subdirectory here contains a script to combine the predictions for the DNN and XGB algorithms:
- `slurm.sub`: runs `combine-preds` for all config-specified fields (`inference: fields_to_run:`) while excluding fields listed in `fields_to_exclude:`, writing both parquet and CSV files
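As with the other workflows, this is submitted as a slurm job (a sketch assuming the `hpc_files` layout):

```bash
# Combine DNN and XGB predictions for all config-specified fields
sbatch combine_preds/slurm/slurm.sub
```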
Classifying variables near GCN transient candidates¶
One special application of SCoPe is to classify variable sources that are close (in angular separation) to GCN transient candidates listed on fritz. In this workflow, small-scale feature generation is run on SDSC Expanse before running inference locally and uploading any high-confidence classifications to fritz (see Guide for Fritz Scanners for more details).
GCN inference scripts: get_all_preds_dnn_GCN.sh, get_all_preds_xgb_GCN.sh¶
These scripts are nearly identical to the inference scripts referenced above, but inference results are saved to different directories.
Directory: generated_features_GCN_sources¶
The slurm subdirectory within contains two example scripts:
- `gpu-debug_slurm.sub`: uses wildcards to run small-scale feature generation for a list of sources from a given GCN `dateobs`. This is the script that is run by default, since the `gpu-debug` partition on Expanse offers enough resources with shorter wait times than `gpu-shared`.
- `gpu-shared_slurm.sub`: the same as `gpu-debug_slurm.sub`, but running on the `gpu-shared` partition of Expanse.
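A hypothetical submission of the default script; the exported variable name for the GCN `dateobs` is an assumption based on the wildcard description above, and the date is a placeholder.

```bash
# Hypothetical: run small-scale feature generation for sources near one GCN event
sbatch --export=DATEOBS=2024-04-01T00:00:00 generated_features_GCN_sources/slurm/gpu-debug_slurm.sub
```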
Script: gcn_cronjob.py¶
See more details about how this script can be run automatically in the Usage/Running automated analyses section of the documentation.
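As a hedged illustration only (the Usage/Running automated analyses section is authoritative), an automated run could look like a crontab entry along these lines, with the schedule, environment activation, and paths adjusted to your setup:

```bash
# Hypothetical crontab entry: run the GCN cronjob script once per hour,
# appending output to a log file in the scope directory (paths are placeholders).
0 * * * * cd /path/to/scope && python gcn_cronjob.py >> logs/gcn_cronjob.log 2>&1
```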