Usage

Download ids for ZTF fields/CCDs/quadrants

  • Create HDF5 file for single CCD/quad pair in a field:

get-quad-ids --catalog ZTF_source_features_DR16 --field 301 --ccd 2 --quad 3 --minobs 20 --skip 0 --limit 10000
  • Create multiple HDF5 files for some CCD/quad pairs in a field:

get-quad-ids --catalog ZTF_source_features_DR16 --field 301 --multi-quads --ccd-range 1 8 --quad-range 2 4 --minobs 20 --limit 10000
  • Create multiple HDF5 files for all CCD/quad pairs in a field:

get-quad-ids --catalog ZTF_source_features_DR16 --field 301 --multi-quads --minobs 20 --limit 10000
  • Create single HDF5 file for all sources in a field:

get-quad-ids --catalog ZTF_source_features_DR16 --field 301 --whole-field

Download SCoPe features for ZTF fields/CCDs/quadrants

  • First, run get-quad-ids for the desired fields/CCDs/quads.

  • Download features for all sources in a field:

get-features --field 301 --whole-field
  • Download features for all sources in a field, imputing missing features using the strategies in config.yaml:

get-features --field 301 --whole-field --impute-missing-features
  • Download features for a range of ccd/quads individually:

get-features --field 301 --ccd-range 1 2 --quad-range 3 4
  • Download features for a single pair of ccd/quad:

get-features --field 301 --ccd-range 1 --quad-range 2

Training deep learning models

For details on the SCoPe taxonomy and architecture, please refer to arxiv:2102.11304.

  • The training pipeline can be invoked with the scope.py utility. For example:

scope-train --tag vnv --path-dataset data/training/dataset.d15.csv --batch-size 64 --epochs 100 --verbose 1 --pre-trained-model models/experiment/vnv/vnv.20221117_001502.h5

Refer to scope-train --help for details.

  • All the necessary metadata/configuration can be defined in config.yaml under training, but can also be overridden with optional scope-train arguments, e.g. scope-train ... --batch-size 32 --threshold 0.6 ....

  • By default, the pipeline uses the DNN models defined in scope/nn.py, built with TensorFlow’s Keras functional API. SCoPe also supports an XGBoost implementation (set --algorithm xgb; see scope/xgb.py).

  • If --save is specified during DNN training, an HDF5 file of the model’s layers and weights is saved. This file can be used directly for additional training and inference (see the loading sketch after this list). For XGBoost, the model is saved as a JSON file along with a .params file containing the model parameters.

  • The Dataset class defined in scope.utils hides the complexity of our dataset handling “under the rug”.

  • You can request access to a Google Drive folder containing the latest trained models here.

  • Feature name sets are specified in config.yaml under features. These are referenced in config.yaml under training.classes.<class>.features.

  • Feature stats to be used for feature scaling/standardization before training are either computed by the code (default) or defined in config.yaml under feature_stats.

  • We use Weights & Biases to track experiments. Project details and access credentials can be defined in config.yaml under wandb.
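
A saved DNN model can be reloaded directly with Keras for further training or prediction. The snippet below is a minimal sketch using the example model path from the training command above; if the model uses custom layers or losses, they may need to be supplied via custom_objects.

import tensorflow as tf

# Reload a DNN model saved with --save (path reuses the example shown above)
model = tf.keras.models.load_model("models/experiment/vnv/vnv.20221117_001502.h5")
model.summary()

# The reloaded model can be passed back to scope-train via --pre-trained-model,
# or used to predict on feature arrays scaled the same way as during training.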

Initially, SCoPe used a bash script to train all classifier families, e.g.:

for class in pnp longt i fla ew eb ea e agn bis blyr ceph dscu lpv mir puls rrlyr rscvn srv wuma yso; \
  do echo $class; \
  for state in 1 2 3 4 5 6 7 8 9 42; \
    do scope-train \
      --tag $class --path-dataset data/training/dataset.d15.csv \
      --scale-features min_max --batch-size 64 \
      --epochs 300 --patience 30 --random-state $state \
      --verbose 1 --gpu 1 --conv-branch --save; \
  done; \
done;

Now, a training script containing one line per class to be trained can be generated by running create-training-script, for example:

create-training-script --filename train_dnn.sh --min-count 100 --pre-trained-group-name experiment --add-keywords '--save --batch-size 32 --group new_experiment --period-suffix ELS_ECE_EAOV'

A path to the training set may be provided as input to this method or otherwise taken from config.yaml (training: dataset:). To continue training on existing models, specify the --pre-trained-group-name keyword containing the models in create-training-script. If training on a feature collection containing multiple sets of periodic features (from different algorithms), set the suffix corresponding to the desired algorithm using --period-suffix or the features: info: period_suffix: field in the config file. The string specified in --add-keywords serves as a catch-all for additional keywords that the user wishes to be included in each line of the script.

If --pre-trained-group-name is specified and the --train-all keyword is set, the output script will train all classes specified in config.yaml regardless of whether they have a pre-trained model. If --train-all is not set (the default), the script will limit training to classes that have an existing trained model.

Adding new features for training

To add a new feature, first ensure that it has been generated and saved in the training set file. Then, update the features: section of the config file. This section lists each feature used by SCoPe. Along with the name of the feature, be sure to specify the boolean include value (as true), the dtype, and whether the feature is periodic (i.e. whether the code should append a period_suffix to the name).

If the new feature is ontological in nature, add the same config info to both the phenomenological: and ontological: lists. For a phenomenological feature, only add this info to the phenomenological: list. Note that changing the config in this way will raise an error when running SCoPe with pre-existing trained models that lack the new feature.
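
Before editing the config, it can help to inspect how existing features are declared. The snippet below is a sketch; the exact nesting of the features: section assumed here follows the description above and should be checked against your config.yaml.

import yaml

# Sketch: list how many features are declared in each group before adding a
# new entry with the same fields (include, dtype, periodic)
with open("config.yaml") as f:
    config = yaml.safe_load(f)

for group in ("phenomenological", "ontological"):
    declared = config["features"][group]
    print(f"{group}: {len(declared)} features declared")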

Running inference

Running inference requires the following steps: download the ids of a field, download (or generate) features for all downloaded ids, and run inference for all available trained models, e.g.:

get-quad-ids --field <field_number> --whole-field
get-features --field <field_number> --whole-field --impute-missing-features

OR

generate-features --field <field_number> --ccd <ccd_number> --quad <quad_number> --doGPU

The optimal way to run inference is through an inference script generated by running create-inference-script with the appropriate arguments. After creating the script and adding the needed permissions (e.g. using chmod +x), the command to run inference on field <field_number> is:

./get_all_preds.sh <field_number>
  • Requires models_dnn/ or models_xgb/ folder in the root directory containing the pre-trained models for DNN and XGBoost, respectively.

  • In a preds_dnn or preds_xgb directory, creates a single .parquet (and optionally .csv) file containing all ids of the field in the rows and inference scores for the different classes across the columns (see the reading sketch after this list).

  • If running inference on specific ids instead of a field/ccd/quad (e.g. on GCN sources), run ./get_all_preds.sh specific_ids
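
The prediction files can be inspected with pandas. The sketch below is illustrative: the file path and the class column name are assumptions, since actual filenames depend on the field and model group.

import pandas as pd

# Sketch: load an inference output file (path is hypothetical)
preds = pd.read_parquet("preds_dnn/field_301/field_301.parquet")
print(preds.shape)

# e.g. select sources with a high RR Lyrae score (column name assumed)
print(preds.loc[preds["rrlyr"] > 0.9].head())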

Handling different file formats

When our manipulations of pandas dataframes are complete, we want to save them in an appropriate file format with the desired metadata. Our code works with multiple formats, each of which has advantages and drawbacks:

  • Comma Separated Values (CSV, .csv): in this format, data are plain text and columns are separated by commas. While this format offers a high level of human readability, it also takes more space to store and a longer time to write and read than other formats.

    pandas offers the read_csv() function and to_csv() method to perform I/O operations with this format. Metadata must be included as plain text in the file.

  • Hierarchical Data Format (HDF5, .h5): this format stores data in binary form, so it is not human-readable. It takes up less space on disk than CSV files, and it writes/reads faster for numerical data. HDF5 does not serialize data columns containing structures like a numpy array, so file size improvements over CSV can be diminished if these structures exist in the data.

    pandas includes read_hdf() and to_hdf() to handle this format, and they require a package like PyTables to work. pandas does not currently support the reading and writing of metadata using the above function and method. See scope/utils.py for code that handles metadata in HDF5 files.

  • Apache Parquet (.parquet): this format stores data in binary form like HDF5, so it is not human-readable. Like HDF5, Parquet also offers significant disk space savings over CSV. Unlike HDF5, Parquet supports structures like numpy arrays in data columns.

    While pandas offers read_parquet() and to_parquet() to support this format (requiring e.g. PyArrow to work), these again do not support the reading and writing of metadata associated with the dataframe. See scope/utils.py for code that reads and writes metadata in Parquet files; a sketch of the general approach appears after this list.
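
The following is a minimal sketch (not the scope/utils.py implementation) of one way to attach and recover dataframe-level metadata in a Parquet file using PyArrow's schema metadata:

import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Example dataframe; the column names here are illustrative
df = pd.DataFrame({"ztf_id": [1, 2, 3], "period": [0.52, 1.13, 2.40]})

# Attach custom metadata to the Arrow table's schema before writing
table = pa.Table.from_pandas(df)
custom = {b"scope_metadata": json.dumps({"note": "example"}).encode()}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **custom})
pq.write_table(table, "example.parquet")

# Read the table and the metadata back
restored = pq.read_table("example.parquet")
metadata = json.loads(restored.schema.metadata[b"scope_metadata"])
print(metadata, restored.to_pandas().shape)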

Mapping between column names and Fritz taxonomies

The column names of training set files and Fritz taxonomy classifications are not the same by default. Training sets may also contain columns that are not meant to be uploaded to Fritz. To address both of these issues, we use a ‘taxonomy mapper’ file to connect local data and Fritz taxonomies.

This file must currently be generated manually, entry by entry. Each entry’s key corresponds to a column name in the local file. The set of all keys is used to establish the columns of interest for upload or download. For example, if the training set includes columns that are not classifications, like RA and Dec, these columns should not be included among the entries in the mapper file. The code will then ignore these columns for the purpose of classification.

The fields associated with each key are fritz_label (containing the associated Fritz classification name) and taxonomy_id identifying the classification’s taxonomy system. The mapper must have the following format, also demonstrated in golden_dataset_mapper.json and DNN_AL_mapper.json:

{
  "variable": {
    "fritz_label": "variable",
    "taxonomy_id": 1012
  },

  "periodic": {
    "fritz_label": "periodic",
    "taxonomy_id": 1012
  },

  .
  . [add more entries here]
  .

  "CV": {
    "fritz_label": "Cataclysmic",
    "taxonomy_id": 1011
  }
}
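
As an illustration of how the mapper connects local columns to Fritz labels, the sketch below reads the mapper and selects only the classification columns from a local file (the filenames are placeholders):

import json

import pandas as pd

with open("golden_dataset_mapper.json") as f:
    mapper = json.load(f)

df = pd.read_csv("sample.csv")  # local file with classification columns

# Columns like RA and Dec are ignored because they are not keys in the mapper
class_columns = [col for col in df.columns if col in mapper]
for col in class_columns:
    entry = mapper[col]
    print(f"{col} -> {entry['fritz_label']} (taxonomy {entry['taxonomy_id']})")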

Generating features

Code has been adapted from ztfperiodic and other sources to calculate basic and Fourier stats for light curves along with other features. This allows new features to be generated with SCoPe, both locally and using GPU cluster resources. The feature generation script is run using the generate-features command.

Currently, the basic stats are calculated via tools/featureGeneration/lcstats.py, and a host of period-finding algorithms are available in tools/featureGeneration/periodsearch.py. Among the CPU-based period-finding algorithms, there is not yet support for AOV_cython. For the AOV algorithm to work, run source build.sh in the tools/featureGeneration/pyaov/ directory, then copy the newly created .so file (aov.cpython-310-darwin.so or similar) to lib/python3.10/site-packages/ or equivalent within your environment. The GPU-based algorithms require CUDA support (so Mac GPUs are not supported).

inputs:

  1. --source-catalog* : name of Kowalski catalog containing ZTF sources (str)

  2. --alerts-catalog* : name of Kowalski catalog containing ZTF alerts (str)

  3. --gaia-catalog* : name of Kowalski catalog containing Gaia data (str)

  4. --bright-star-query-radius-arcsec : maximum angular distance from ZTF sources to query nearby bright stars in Gaia (float)

  5. --xmatch-radius-arcsec : maximum angular distance from ZTF sources to match external catalog sources (float)

  6. --limit : maximum number of sources to process in batch queries / statistics calculations (int)

  7. --period-algorithms* : names of period algorithms to run. Normally specified in the config as a dictionary; if specified here, should be a (list)

  8. --period-batch-size : maximum number of sources for which to simultaneously perform period finding (int)

  9. --doCPU : flag to run config-specified CPU period algorithms (bool)

  10. --doGPU : flag to run config-specified GPU period algorithms (bool)

  11. --samples-per-peak : number of samples per periodogram peak (int)

  12. --doScaleMinPeriod : for period finding, scale the minimum period based on --min-cadence-minutes (bool). Otherwise, set --max-freq to the desired value

  13. --doRemoveTerrestrial : remove terrestrial frequencies from period-finding analysis (bool)

  14. --Ncore : number of CPU cores over which to parallelize queries (int)

  15. --field : ZTF field to run (int)

  16. --ccd : ZTF ccd to run (int)

  17. --quad : ZTF quadrant to run (int)

  18. --min-n-lc-points : minimum number of points required to generate features for a light curve (int)

  19. --min-cadence-minutes : minimum cadence between light curve points. Higher-cadence data are dropped except for the first point in the sequence (float)

  20. --dirname : name of generated feature directory (str)

  21. --filename : prefix of each feature filename (str)

  22. --doCesium : flag to compute config-specified cesium features in addition to the default list (bool)

  23. --doNotSave : flag to avoid saving generated features (bool)

  24. --stop-early : flag to stop feature generation before the entire quadrant is run. Pair with --limit to run small-scale tests (bool)

  25. --doQuadrantFile : flag to use a generated file containing [jobID, field, ccd, quad] columns instead of specifying --field, --ccd and --quad (bool)

  26. --quadrant-file : name of quadrant file in the generated_features/slurm directory or equivalent (str)

  27. --quadrant-index : number of the job in the quadrant file to run (int)

  28. --doSpecificIDs : flag to perform feature generation for the ztf_id column in a config-specified file (bool)

  29. --skipCloseSources : flag to skip removal of sources too close to bright stars via Gaia (bool)

  30. --top-n-periods : number of (E)LS, (E)CE periods to pass to (E)AOV if using the (E)LS_(E)CE_(E)AOV algorithm (int)

  31. --max-freq : maximum frequency [1 / days] to use for period finding (float). Overridden by --doScaleMinPeriod

  32. --fg-dataset* : path to parquet, hdf5 or csv file containing specific sources for feature generation (str)

  33. --max-timestamp-hjd* : maximum timestamp of queried light curves, HJD (float)

output: feature_df : dataframe containing generated features

* - specified in config.yaml

Example usage

The following is an example of running the feature generation script locally:

generate-features --field 301 --ccd 2 --quad 4 --source-catalog ZTF_sources_20230109 --alerts-catalog ZTF_alerts --gaia-catalog Gaia_EDR3 --bright-star-query-radius-arcsec 300.0 --xmatch-radius-arcsec 2.0 --query-size-limit 10000 --period-batch-size 1000 --samples-per-peak 10 --Ncore 4 --min-n-lc-points 50 --min-cadence-minutes 30.0 --dirname generated_features --filename gen_features --doCPU --doRemoveTerrestrial --doCesium

Setting --doCPU will run the config-specified CPU period algorithms on each source. Setting --doGPU instead will do likewise with the specified GPU algorithms. If neither of these keywords is set, the code will assign a value of 1.0 to each period and compute Fourier statistics using that number.

Below is an example of running the script using a job/quadrant file (containing [job id, field, ccd, quad] columns) instead of specifying field/ccd/quad directly:

generate-features --source-catalog ZTF_sources_20230109 --alerts-catalog ZTF_alerts --gaia-catalog Gaia_EDR3 --bright-star-query-radius-arcsec 300.0 --xmatch-radius-arcsec 2.0 --query-size-limit 10000 --period-batch-size 1000 --samples-per-peak 10 --Ncore 20 --min-n-lc-points 50 --min-cadence-minutes 30.0 --dirname generated_features_DR15 --filename gen_features --doGPU --doRemoveTerrestrial --doCesium --doQuadrantFile --quadrant-file slurm.dat --quadrant-index 5738
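
Each run writes the generated features beneath the directory given by --dirname using the --filename prefix. Below is a minimal sketch of loading a result for inspection; the exact filename pattern is an assumption and will vary with the field/ccd/quad that was processed.

import pandas as pd

# Sketch: load a generated feature file (path/filename pattern is assumed)
features = pd.read_parquet("generated_features/field_301/gen_features_field_301_ccd_2_quad_4.parquet")
print(features.shape)
print(features.columns.tolist()[:20])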

Slurm scripts

For large-scale feature generation, generate-features is intended to be run on a high-performance computing cluster. These clusters often require jobs to be submitted via a utility like Slurm (Simple Linux Utility for Resource Management) using job scripts. These scripts specify the type, amount, and duration of computing resources to allocate to the user.

SCoPe’s generate-features-slurm code creates two Slurm scripts: (1) runs a single instance of generate-features, and (2) runs generate-features-job-submission, which submits multiple jobs in parallel, periodically checking whether additional jobs can be started. See below for more information about these components of feature generation.

generate-features-slurm can receive all of the arguments used by generate-features. These arguments are passed to the instances of feature generation started by Slurm script (1). There are also additional arguments specific to cluster resource management:

inputs:

  1. --job-name : name of submitted jobs (str)

  2. --cluster-name : name of HPC cluster (str)

  3. --partition-type : cluster partition to use (str)

  4. --nodes : number of nodes to request (int)

  5. --gpus : number of GPUs to request (int)

  6. --memory-GB : amount of memory to request in GB (int)

  7. --submit-memory-GB : memory allocation to request for job submission (int)

  8. --time : amount of time before an instance times out (str)

  9. --mail-user : user’s email address for job updates (str)

  10. --account-name : name of account having the HPC allocation (str)

  11. --python-env-name : name of Python environment to activate before running generate_features.py (str)

  12. --generateQuadrantFile : flag to map fields/ccds/quads containing sources to job numbers and save the resulting quadrant file (bool); see the lookup sketch after this list

  13. --field-list : space-separated list of fields for which to generate the quadrant file. If None, all populated fields are included (int)

  14. --max-instances : maximum number of HPC instances to run in parallel (int)

  15. --wait-time-minutes : amount of time to wait between status checks, in minutes (float)

  16. --doSubmitLoop : flag to run a loop initiating instances until out of jobs (hard on Kowalski)

  17. --runParallel : flag to run jobs in parallel using Slurm [recommended]. Otherwise, run in series on a single instance

  18. --user : if using Slurm, your username. This will be used to periodically run squeue and list your running jobs (str)

  19. --submit-interval-minutes : time to wait between job submissions, in minutes (float)
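
The quadrant file produced with --generateQuadrantFile maps job indices to fields/ccds/quads. The sketch below looks up the job index for a given quadrant; the whitespace delimiter and absence of a header row are assumptions about the file layout.

import pandas as pd

# Sketch: find the job index for a given field/ccd/quad in a quadrant file
quad_jobs = pd.read_csv(
    "generated_features/slurm/slurm.dat",
    sep=r"\s+",
    names=["job_id", "field", "ccd", "quad"],
)
print(quad_jobs.query("field == 301 and ccd == 2 and quad == 4"))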

Feature definitions

Selected phenomenological feature definitions

name                      definition
ad                        Anderson-Darling statistic
chi2red                   Reduced chi^2 after mean subtraction
f1_BIC                    Bayesian information criterion of best-fitting series (Fourier analysis)
f1_a                      a coefficient of best-fitting series (Fourier analysis)
f1_amp                    Amplitude of best-fitting series (Fourier analysis)
f1_b                      b coefficient of best-fitting series (Fourier analysis)
f1_phi0                   Zero-phase of best-fitting series (Fourier analysis)
f1_power                  Normalized chi^2 of best-fitting series (Fourier analysis)
f1_relamp1                Relative amplitude, first harmonic (Fourier analysis)
f1_relamp2                Relative amplitude, second harmonic (Fourier analysis)
f1_relamp3                Relative amplitude, third harmonic (Fourier analysis)
f1_relamp4                Relative amplitude, fourth harmonic (Fourier analysis)
f1_relphi1                Relative phase, first harmonic (Fourier analysis)
f1_relphi2                Relative phase, second harmonic (Fourier analysis)
f1_relphi3                Relative phase, third harmonic (Fourier analysis)
f1_relphi4                Relative phase, fourth harmonic (Fourier analysis)
i60r                      Mag ratio between 20th, 80th percentiles
i70r                      Mag ratio between 15th, 85th percentiles
i80r                      Mag ratio between 10th, 90th percentiles
i90r                      Mag ratio between 5th, 95th percentiles
inv_vonneumannratio       Inverse of von Neumann ratio
iqr                       Mag ratio between 25th, 75th percentiles
median                    Median magnitude
median_abs_dev            Median absolute deviation of magnitudes
norm_excess_var           Normalized excess variance
norm_peak_to_peak_amp     Normalized peak-to-peak amplitude
roms                      Root of mean magnitudes squared
skew                      Skew of magnitudes
smallkurt                 Kurtosis of magnitudes
stetson_j                 Stetson J coefficient
stetson_k                 Stetson K coefficient
sw                        Shapiro-Wilk statistic
welch_i                   Welch I statistic
wmean                     Weighted mean of magnitudes
wstd                      Weighted standard deviation of magnitudes
dmdt                      Magnitude-time histograms (26x26)
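
For reference, a few of the statistics above can be written down compactly. The snippet below is an illustrative sketch, not the tools/featureGeneration/lcstats.py implementation.

import numpy as np

def example_stats(mag: np.ndarray, magerr: np.ndarray) -> dict:
    # Illustrative versions of a few phenomenological statistics
    w = 1.0 / magerr**2
    wmean = np.sum(w * mag) / np.sum(w)  # weighted mean of magnitudes
    wstd = np.sqrt(np.sum(w * (mag - wmean) ** 2) / np.sum(w))  # weighted std
    chi2red = np.sum(w * (mag - wmean) ** 2) / (len(mag) - 1)  # reduced chi^2 after mean subtraction
    median_abs_dev = np.median(np.abs(mag - np.median(mag)))
    eta = np.mean(np.diff(mag) ** 2) / np.var(mag)  # von Neumann ratio
    return {
        "wmean": wmean,
        "wstd": wstd,
        "chi2red": chi2red,
        "median_abs_dev": median_abs_dev,
        "inv_vonneumannratio": 1.0 / eta,
    }

rng = np.random.default_rng(42)
mag = rng.normal(15.0, 0.1, size=200)
print(example_stats(mag, np.full_like(mag, 0.1)))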

Selected ontological feature definitions

name                                   definition
mean_ztf_alert_braai                   Mean significance of ZTF alerts for this source
n_ztf_alerts                           Number of ZTF alerts for this source
period                                 Period determined by subscripted algorithms (e.g. ELS_ECE_EAOV)
significance                           Significance of period
AllWISE_w1mpro                         AllWISE W1 mag
AllWISE_w1sigmpro                      AllWISE W1 mag error
AllWISE_w2mpro                         AllWISE W2 mag
AllWISE_w2sigmpro                      AllWISE W2 mag error
AllWISE_w3mpro                         AllWISE W3 mag
AllWISE_w4mpro                         AllWISE W4 mag
Gaia_EDR3__parallax                    Gaia parallax
Gaia_EDR3__parallax_error              Gaia parallax error
Gaia_EDR3__phot_bp_mean_mag            Gaia BP mag
Gaia_EDR3__phot_bp_rp_excess_factor    Gaia BP-RP excess factor
Gaia_EDR3__phot_g_mean_mag             Gaia G mag
Gaia_EDR3__phot_rp_mean_mag            Gaia RP mag
PS1_DR1__gMeanPSFMag                   PS1 g mag
PS1_DR1__gMeanPSFMagErr                PS1 g mag error
PS1_DR1__rMeanPSFMag                   PS1 r mag
PS1_DR1__rMeanPSFMagErr                PS1 r mag error
PS1_DR1__iMeanPSFMag                   PS1 i mag
PS1_DR1__iMeanPSFMagErr                PS1 i mag error
PS1_DR1__zMeanPSFMag                   PS1 z mag
PS1_DR1__zMeanPSFMagErr                PS1 z mag error
PS1_DR1__yMeanPSFMag                   PS1 y mag
PS1_DR1__yMeanPSFMagErr                PS1 y mag error

Running automated analyses

The primary deliverable of SCoPe is a catalog of variable source classifications across all of ZTF. Since ZTF contains billions of light curves, this catalog requires significant compute resources to assemble. We may still want to study ZTF’s expansive collection of data with SCoPe before the classification catalog is complete. For example, SCoPe classifiers can be applied to the realm of transient follow-up.

It is useful to know the classifications of any persistent ZTF sources that are close to transient candidates on the sky. Once SCoPe’s primary deliverable is complete, obtaining these classifications will involve a straightforward database query. Presently, however, we must run the SCoPe workflow on a custom list of sources repeatedly to account for the rapidly changing landscape of transient events. See “Guide for Fritz Scanners” for a more detailed explanation of the workflow itself. This section continues with a discussion of how the automated analysis in gcn_cronjob.py is implemented using cron.

cron job basics

cron runs scripts at specific time intervals in a simple environment. While this simplicity fosters compatibility between different operating systems, the trade-off is that some extra steps are required to run scripts compared to more familiar coding environments (e.g. within scope-env for this project).

To set up a cron job, first run EDITOR=emacs crontab -e. You can replace emacs with your text editor of choice as long as it is installed on your machine. This command will open a text file in which to place cron commands. An example command is as follows:

0 */2 * * * cd scope && ~/miniforge3/envs/scope-env/bin/python ~/scope/gcn_cronjob.py > ~/scope/log_gcn_cronjob.txt 2>&1

Above, the 0 */2 * * * means that this command will run every two hours, on minute 0 of that hour. Time increments increase from left to right; in this example, the five numbers are minute, hour, day (of month), month, day (of week). The */2 means that the hour has to be divisible by 2 for the job to run. Check out crontab.guru to learn more about cron timing syntax.

Next in the line, we change directories to scope so that the code can access the config.yaml file located in this directory. Then, ~/miniforge3/envs/scope-env/bin/python ~/scope/gcn_cronjob.py is the command that gets run (using the Python environment installed in scope-env). The > character redirects the output from the command (e.g. what your script prints) into a log file at a specific location (here ~/scope/log_gcn_cronjob.txt). Finally, 2>&1 redirects error output to the same log file, so cron does not send status ‘emails’ about your job (unnecessary since the log is saved to the user-specified file).

Save the text file once you finish modifying it to install the cron job. Ensure that the last line of your file is a newline to avoid issues when running. Your computer may pop up a window to which you should respond in the affirmative in order to successfully initialize the job. To check which cron jobs have been installed, run crontab -l. To uninstall your jobs, run crontab -r.

Additional details for cron environment

Because cron runs in a simple environment, the usual details of environment setup and paths cannot be overlooked. In order for the above job to work, we need to add more information when we run EDITOR=emacs crontab -e. The lines below will produce a successful run (if SCoPe is installed in your home directory):

PYTHONPATH = /Users/username/scope

0 */2 * * * /opt/homebrew/bin/gtimeout 2h ~/miniforge3/envs/scope-env/bin/python ~/scope/gcn_cronjob.py > ~/scope/log_gcn_cronjob.txt 2>&1

In the first line above, the PYTHONPATH environment variable is defined to include the scope directory. Without this line, any code that imports from scope will throw an error, since the user’s usual PYTHONPATH variable is not accessed in the cron environment.

The second line begins with the familiar cron timing pattern described above. It continues by specifying a maximum runtime of 2 hours before timing out, using the gtimeout command. On a Mac, this can be installed with Homebrew by running brew install coreutils. Note that the full path to gtimeout must be specified. After the timeout comes the call to the gcn_cronjob.py script. Note that the usual #!/usr/bin/env python line at the top of SCoPe’s Python scripts does not work within the cron environment. Instead, python must be explicitly specified, and in order to have access to the modules and scripts installed in scope-env we must provide a full path like the one above (~/miniforge3/envs/scope-env/bin/python). The line concludes by sending the script’s output to a dedicated log file. This file gets overwritten each time the script runs.

Check if cron job is running

It can be useful to know whether the script within a cron job is currently running. One way to do this for gcn_cronjob.py is to run the command ps aux | grep gcn_cronjob.py. This will always return one item (representing the command you just ran), but if the script is currently running you will see more than one item.

Local feature generation/inference

SCoPe contains a script that runs local feature generation and inference on sources specified in an input file. Example input files are contained within the tools directory (local_scope_radec.csv and local_scope_ztfid.csv). After receiving either ra/dec coordinates or ZTF light curve IDs (plus an object ID for each entry), the run-scope-local script will generate features and run inference using existing trained models, saving the results to timestamped directories. This script accepts most arguments from generate-features and scope-inference. Additional inputs specific to this script are listed below.

inputs:

  1. --path-dataset : path (from base scope directory or fully qualified) to parquet, hdf5 or csv file containing specific sources (str)

  2. --cone-radius-arcsec : radius of cone search query for ZTF lightcurve IDs, if inputting ra/dec (float)

  3. --save-sources-filepath : path to parquet, hdf5 or csv file to save specific sources (str)

  4. --algorithms : ML algorithms to run (currently dnn/xgb)

  5. --group-names : group names of trained models (with order corresponding to --algorithms input)

output: current_dt : formatted datetime string used to label output directories

Example usage

run-scope-local --path-dataset tools/local_scope_ztfid.csv --doCPU --doRemoveTerrestrial --scale_features min_max --group-names DR16_stats nobalance_DR16_DNN_stats --algorithms xgb

run-scope-local --path-dataset tools/local_scope_radec.csv --doCPU --write_csv --doRemoveTerrestrial --group-names DR16_stats nobalance_DR16_DNN_stats --algorithms xgb dnn

scope-download-classification

inputs:

  1. --file : CSV file containing obj_id and/or ra dec coordinates. Set to “parse” to download sources by group id.

  2. --group-ids : target group id(s) on Fritz for download, space-separated (if CSV file not provided)

  3. --start : Index or page number (if in “parse” mode) to begin downloading (optional)

  4. --merge-features : Flag to merge features from Kowalski with downloaded sources

  5. --features-catalog : Name of features catalog to query

  6. --features-limit : Limit on number of sources to query at once

  7. --taxonomy-map : Filename of taxonomy mapper (JSON format)

  8. --output-dir : Name of directory to save downloaded files

  9. --output-filename : Name of file containing merged classifications and features

  10. --output-format : Output format of saved files, if not specified in (9). Must be one of parquet, h5, or csv.

  11. --get-ztf-filters : Flag to add ZTF filter IDs (separate catalog query) to default features

  12. --impute-missing-features : Flag to impute missing features using scope.utils.impute_features

  13. --update-training-set : if downloading an active learning sample, update the training set with the new classification based on votes

  14. --updated-training-set-prefix : Prefix to add to updated training set file

  15. --min-vote-diff : Minimum number of net votes (upvotes - downvotes) to keep an active learning classification. Caution: if zero, all classifications of reviewed sources will be added

process:

  1. if CSV file provided, query by object ids or ra, dec

  2. if CSV file not provided, bulk query based on group id(s)

  3. get the classification/probabilities/periods of the objects in the dataset from Fritz

  4. append these values as new columns on the dataset, save to new file

  5. if merge_features, query Kowalski and merge sources with features, saving new CSV file

  6. Fritz sources with multiple associated ZTF IDs will generate multiple rows in the merged feature file

  7. To skip the source download part of the code, provide an input CSV file containing columns named ‘obj_id’, ‘classification’, ‘probability’, ‘period_origin’, ‘period’, ‘ztf_id_origin’, and ‘ztf_id’.

  8. Set --update-training-set to read the config-specified training set and merge new sources/classifications from an active learning group

output: data with new columns appended.

scope-download-classification --file sample.csv --group-ids 360 361 --start 10 --merge-features True --features-catalog ZTF_source_features_DR16 --features-limit 5000 --taxonomy-map golden_dataset_mapper.json --output-dir fritzDownload --output-filename merged_classifications_features --output-format parquet --get-ztf-filters --impute-missing-features

scope-download-gcn-sources

inputs:

  1. --dateobs : unique dateObs of GCN event (str)

  2. --group-ids : group ids to query sources, space-separated [all if not specified] (list)

  3. --days-range : max days past event to search for sources (float)

  4. --radius-arcsec : radius [arcsec] around new sources to search for existing ZTF sources (float)

  5. --save-filename : filename to save source ids/coordinates (str)

process:

  1. query all sources associated with GCN event

  2. get fritz names, ras and decs for each page of sources

  3. save a json file in a useful format to use with generate-features --doSpecificIDs

scope-download-gcn-sources --dateobs 2023-05-21T05:30:43

scope-upload-classification

inputs:

  1. --file : path to CSV, HDF5 or Parquet file containing ra, dec, period, and labels

  2. --group-ids : target group id(s) on Fritz for upload, space-separated

  3. --classification : Name(s) of input file columns containing classification probabilities (one column per label). Set this to “read” to automatically upload all classes specified in the taxonomy mapper at once.

  4. --taxonomy-map : Filename of taxonomy mapper (JSON format)

  5. --comment : Comment to post (if specified)

  6. --start : Index to start uploading (zero-based)

  7. --stop : Index to stop uploading (inclusive)

  8. --classification-origin : origin of classifications. If ‘SCoPe’ (default), Fritz will apply custom color-coding

  9. --skip-phot : flag to skip photometry upload (skips for existing sources only)

  10. --post-survey-id : flag to post an annotation for the Gaia, AllWISE or PS1 id associated with each source

  11. --survey-id-origin : Annotation origin name for survey_id

  12. --p-threshold : Probability threshold for posted classifications (values must be >= this number to post)

  13. --match-ids : flag to match input and existing survey_id values during upload. It is recommended to instead match obj_ids (see next item)

  14. --use-existing-obj-id : flag to use existing source names in a column named ‘obj_id’ (a coordinate-based ID is otherwise generated by default)

  15. --post-upvote : flag to post an upvote to newly uploaded classifications. Not recommended when posting automated classifications for active learning.

  16. --check-labelled-box : flag to check the ‘labelled’ box for each source when uploading classifications. Not recommended when posting automated classifications for active learning.

  17. --write-obj-id : flag to output a copy of the input file with an ‘obj_id’ column containing the coordinate-based IDs for each posted object. Use this file as input for future uploads to add to this column.

  18. --result-dir : name of directory where the upload results file is saved. Default is ‘fritzUpload’ within the tools directory.

  19. --result-filetag : name of tag appended to the result filename. Default is ‘fritzUpload’.

  20. --result-format : result file format; one of csv, h5 or parquet. Default is parquet.

  21. --replace-classifications : flag to delete each source’s existing classifications before posting new ones.

  22. --radius-arcsec : photometry search radius for uploaded sources.

  23. --no-ml : flag to post classifications that do not originate from an ML classifier.

  24. --post-phot-as-comment : flag to post photometry as a comment on the source (bool)

  25. --post-phasefolded-phot : flag to post phase-folded photometry as a comment in addition to the time series (bool)

  26. --phot-dirname : name of directory in which to save photometry plots (str)

  27. --instrument-name : name of instrument used for observations (str)

process: 0. include Kowalski host, port, protocol, and token or username+password in config.yaml

  1. check if each input source exists by comparing input and existing obj_ids and/or survey_ids

  2. save the objects to Fritz group if new

  3. in batches, upload the classifications of the objects in the dataset to target group on Fritz

  4. duplicate classifications will not be uploaded to Fritz. If n classifications are manually specified, probabilities will be sourced from the last n columns of the dataset.

  5. post survey_id annotations

  6. (post comment to each uploaded source)

scope-upload-classification --file sample.csv --group-ids 500 250 750 --classification variable flaring --taxonomy-map map.json --comment confident --start 35 --stop 50 --skip-phot --p-threshold 0.9 --write-obj-id --result-format csv --use-existing-obj-id --post-survey-id --replace-classifications

scope-manage-annotation

inputs:

  1. --action : one of “post”, “update”, or “delete”

  2. --source : ZTF ID or path to .csv file with multiple objects (ID column “obj_id”)

  3. --group-ids : target group id(s) on Fritz, space-separated

  4. --origin : origin name of the annotation

  5. --key : key name of the annotation

  6. --value : value of annotation (required for “post” and “update” - if source is a .csv file, value will auto-populate from source[key])

process:

  1. for each source, find existing annotations (for “update” and “delete” actions)

  2. interact with API to make desired changes to annotations

  3. confirm changes with printed messages

scope-manage-annotation --action post --source sample.csv --group-ids 200 300 400 --origin revisedperiod --key period

Scope Upload Disagreements (deprecated)

inputs:

  1. dataset

  2. group id on Fritz

  3. gloria object

process:

  1. read in the csv dataset to pandas dataframe

  2. get high scoring objects on DNN or on XGBoost from Fritz

  3. get objects that have high confidence on DNN but low confidence on XGBoost and vice versa

  4. get different statistics of those disagreeing objects and combine to a dataframe

  5. filter those disagreeing objects that are contained in the training set and remove them

  6. upload the remaining disagreeing objects to target group on Fritz

./scope_upload_disagreements.py -file dataset.d15.csv -id 360 -token sample_token