Observatory Data Flow

This document describes the components needed by users who wish to use the Science Archive for data archival and retrieval.

Main Components

  • OCS Ingester - Commonly used as a library, it simplifies the ingestion of observatory data products into the Science Archive.
  • OCS Science Archive - A Django application which aids in the archival storage and retrieval of observatory data products.

Both of these applications make use of a common library, the OCS Archive Library, which allows the OCS Ingester and Science Archive to share common configuration and reduces the need for duplicated code across the two projects. This shared code contains classes used to abstract data files and data storage backends. See the Advanced Topics section for more information about customizing the behavior of the ocs_archive library.

Configuration

Because both the OCS Science Archive and OCS Ingester depend on the OCS Archive Library, the two can share common configuration. Runtime configuration for the OCS Archive Library is set using environment variables - see OCS Archive Configuration Documentation for more detail about possible configuration options.

Setting up FITS Headers

The OCS Archive Library has built-in support for .fits files, but certain environment variables must be set to map .fits header keys to the concepts the Science Archive requires for data filtering. The table below lists each expected .fits header concept, the environment variable used to override its header key, and the default header key.

Archive model field | Description | Environment Variable | Default Header Key
------------------- | ----------- | -------------------- | ------------------
observation_date | The time of the observation in UTC | OBSERVATION_DATE_KEY | DATE-OBS
observation_day | The observing night in YYYYMMDD | OBSERVATION_DAY_KEY | DAY-OBS
site_id | The configdb site code the data was taken at | SITE_ID_KEY | SITEID
telescope_id | The configdb telescope code this data was taken at | TELESCOPE_ID_KEY | TELID
instrument_id | The configdb instrument code this data was taken with | INSTRUMENT_ID_KEY | INSTRUME
observation_id | The observation ID for this piece of data | OBSERVATION_ID_KEY | BLKUID
request_id | The request ID for this piece of data | REQUEST_ID_KEY | REQNUM
target_name | The user-defined name of the target for this data | TARGET_NAME_KEY | OBJECT
configuration_type | The configuration type of this observation | CONFIGURATION_TYPE_KEY | OBSTYPE
proposal_id | The proposal ID this data was requested under | PROPOSAL_ID_KEY | PROPID
public_date | The date at which this data should be made publicly accessible | PUBLIC_DATE_KEY | L1PUBDAT
primary_optical_element | The value of the primary optical element this data was taken with | PRIMARY_OPTICAL_ELEMENT_KEY | FILTER
exposure_time | The exposure time this data was taken with | EXPOSURE_TIME_KEY | EXPTIME
reduction_level | The reduction level of this data (0 is considered raw) | REDUCTION_LEVEL_KEY | RLEVEL

Note: If they do not already exist, these values must be added to the headers of your .fits files by your Telescope Control System when writing the files. Many of them can be pulled out of the Observation Portal’s schedule observation JSON and put directly into the header of data taken for that observation, as sketched below.
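
For example, a minimal sketch using astropy to stamp the default header keys into a file. The observation dict stands in for values pulled from the Observation Portal’s schedule JSON; its field names are illustrative, not the portal’s exact schema:

from astropy.io import fits

# Values pulled from the Observation Portal's schedule JSON for this
# observation; these dict keys are illustrative.
observation = {
    'site': 'tst',
    'telescope': '1m0a',
    'instrument': 'fa01',
    'proposal': 'TestProposal',
}

with fits.open('/path/to/file.fits', mode='update') as hdulist:
    header = hdulist[0].header
    header['SITEID'] = observation['site']          # maps to site_id
    header['TELID'] = observation['telescope']      # maps to telescope_id
    header['INSTRUME'] = observation['instrument']  # maps to instrument_id
    header['PROPID'] = observation['proposal']      # maps to proposal_id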

There are also several environment variables for mapping .fits header keys that are used in the ingestion of data but not stored directly into the archive. These include:

Environment Variable | Description | Default Header Key
-------------------- | ----------- | ------------------
OBSERVATION_END_TIME_KEY | The ISO formatted observation end date | UTSTOP
CONFIGURATION_ID_KEY | The configuration ID for this piece of data | MOLUID
REQUESTGROUP_ID_KEY | The requestgroup ID for this piece of data | TRACKNUM
CATALOG_TARGET_FRAME_KEY | The base filename of the catalog file for this target | L1IDCAT
RADIUS_KEY | The FOV radius in arcseconds for a circular FOV, used to calculate WCS if specified | RADIUS
RA_KEY | The FOV center RA for a circular FOV, used to calculate WCS if specified | RA
DEC_KEY | The FOV center DEC for a circular FOV, used to calculate WCS if specified | DEC
RELATED_FRAME_KEYS | A comma-delimited list of header keys to search for related frame base filenames in the header | L1IDBIAS,L1IDDARK,L1IDFLAT,L1IDSHUT,L1IDMASK,L1IDFRNG,L1IDCAT,L1IDARC,L1ID1D,L1ID2D,L1IDSUM,TARFILE,ORIGNAME,ARCFILE,FLATFILE,GUIDETAR

If you need more complicated mappings to pull data out of your .fits files, or if you have a different file type entirely, then please read the section below on adding custom data product classes.

Telescope Control System Integration

In the execution of an observation, the observatory’s Telescope Control System generates data products. For these products to be archived and served to users or staff via an OCS Science Archive, they must be ingested.

A Simple Ingestion Scheme

Once a TCS has created its final data products to be archived, it may call out to the Ingester library directly. If the TCS is written in Python, the call is as simple as setting the relevant environment variables for your filestore type and running:

from ocs_ingester.ingester import upload_file_and_ingest_to_archive

# Upload the finished data product to the configured filestore and
# register it in the Science Archive.
with open('/path/to/file', 'rb') as fileobj:
    ingester_response = upload_file_and_ingest_to_archive(fileobj)

# Optionally do something with the ingester response

The ingester response contains information about the archived frame, such as its location (a URL, for a cloud filestore), its ID in the Science Archive database, and other frame metadata. This can be useful if the application ingesting the data wishes to keep its own record of data products and their locations in the archive.
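
For instance, continuing the example above, an application might record the archive ID of each frame it ingests. The 'id' field name here is an assumption; check the ingester documentation for the exact response shape:

# The 'id' key is an assumption about the response shape; verify it
# against the ingester documentation for your installed version.
frame_id = ingester_response['id']
print(f'Frame {frame_id} archived in the Science Archive')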

For more detailed information about the OCS Ingester and its capabilities, see the developer documentation.

Advanced Topics

Overriding Classes in the ocs_archive Library

The ocs_archive library provides base classes to represent observatory data products and filestore locations in order to provide a unified interface to various types of data products and filestore solutions.

Data Products

Within the input module of the ocs_archive library there exist classes to represent different types of files, such as FITS files and gzipped TAR files.

If an observatory wishes to add its own file specification, it may do so by sub-classing the DataFile class and adding any customizations needed. For examples of this functionality, see the implementation of the FitsFile class, which customizes the base DataFile class to work with FITS files. For Las Cumbres Observatory’s purposes, we have further sub-classed the FitsFile class into an LcoFitsFile to add functionality that meets the needs of our observatory.
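
As a rough illustration only: the import path matches the mapping shown below, but the exact base-class hooks to override depend on your ocs_archive version, and get_header_data here is an assumption, not a confirmed interface:

from ocs_archive.input.file import DataFile


class MyInstrumentFile(DataFile):
    """Hypothetical DataFile subclass for a custom data format."""

    def get_header_data(self):
        # Parse the metadata your format carries and return it in the
        # structure the archive expects. This method name is illustrative;
        # consult the DataFile base class for the real hooks to override.
        raise NotImplementedError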

Once a new file specification has been added, it should be added to the FileFactory class, which is responsible for returning the appropriate DataFile subclass given a file’s extension. It utilizes a mapping from file extension -> DataFile class to determine which class to construct, given an input file:

EXTENSION_TO_FILE_CLASS = {
    '.fits.fz': 'ocs_archive.input.lcofitsfile.LcoFitsFile',
    '.fits': 'ocs_archive.input.fitsfile.FitsFile',
    '.tar.gz': 'ocs_archive.input.tarwithfitsfile.TarWithFitsFile',
    '.pdf': 'ocs_archive.input.file.DataFile'
}

This mapping can be overridden using the FILETYPE_MAPPING_OVERRIDES environment variable if the default is not sufficient or needs to be changed.
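
For example, assuming the variable accepts a JSON-style mapping of extensions to dotted class paths (verify the exact format against the OCS Archive Configuration Documentation):

import os

# Hypothetical: register a custom extension before the ocs_archive
# settings are loaded. The string format is an assumption to verify.
os.environ['FILETYPE_MAPPING_OVERRIDES'] = '{".spec": "myobservatory.input.specfile.SpecFile"}'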

Filestore

A set of classes which represent storage backends exists within the storage module of the ocs_archive library.

The base FileStore class is intended to represent a data product storage backend, and can be sub-classed and extended to work with many different storage solutions.

For example, the S3Store contains customizations to the base FileStore class to allow observatory data products to be stored in Amazon S3. If an observatory needs to use a different cloud provider, then a new subclass can be created.
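
A rough sketch of what such a subclass might look like; the import path and the set of methods to override are assumptions that depend on your ocs_archive version:

from ocs_archive.storage.filestore import FileStore


class GcsStore(FileStore):
    """Hypothetical FileStore subclass backed by Google Cloud Storage."""

    # Override the base class's upload and retrieval hooks here to call
    # the google-cloud-storage client instead of boto3, mirroring what
    # S3Store does for Amazon S3.
    pass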

The ocs_archive library also provides a FileStore subclass, FileSystemStore, designed to work with network or local filestores if an observatory does not wish to store its data with a cloud provider. Filesystem storage does not provide version history or time-authenticated generated URLs, so S3 storage should be strongly preferred for production setups.

Persistent Queue for Data Products

Note: This only applies when a separate process or application is responsible for ingesting data products from the TCS.

When a TCS generates data products to be ingested, it’s helpful to use a queueing system such as RabbitMQ to keep track of them. With a simple queueing architecture, if the application responsible for ingesting images ever experiences downtime, the TCS can still queue images, and the ingestion application will be able to catch up after its outage.

[Diagram: Ingester queueing scheme]

The queue message can be as simple as the location from which to retrieve the file to be ingested, but it should contain the minimum amount of information necessary for the ingestion application to successfully ingest the data into the Science Archive. For example:

{"filepath": "/path/to/file/on/disk"}

Retry Scheme for Data Product Ingestion

For added robustness, it is good practice to integrate retry logic into the data product ingestion process. It’s for this reason that the OCS Ingester defines a set of Python exception classes that will be raised in appropriate situations:

  • RetryError - Exception that is raised when an error happens that can be retried.
  • BackoffRetryError - Exception that is raised when an error happens that can be retried with an exponential backoff. For example, networking latency errors that may succeed at a later time.
  • DoNotRetryError - Exception that is raised when an error happens that will undoubtedly repeat if called again. The task should not be retried.
  • NonFatalDoNotRetryError - Exception that is raised when an error happens that should not be retried but is also not a fatal condition. This differs from DoNotRetryError, which indicates a fatal failure in the ingestion of an image (e.g. the file doesn’t exist on disk, or Astropy raised an exception); NonFatalDoNotRetryError instead indicates an error such as the file already existing in the filestore.

Handling these errors and retrying ingestion if appropriate will increase the robustness of an observatory’s data product flow. Appropriate retry logic combined with queueing infrastructure leads to a system which is fault-tolerant and able to recover easily from intermittent service outages.

In Python, the Tenacity package provides a set of function decorators that allow for various retry schemes.
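
For instance, a sketch that retries ingestion on BackoffRetryError with exponential backoff. The retry policy is illustrative, and the exception is assumed to be importable from ocs_ingester.exceptions (check your installed version):

from ocs_ingester.exceptions import BackoffRetryError
from ocs_ingester.ingester import upload_file_and_ingest_to_archive
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(BackoffRetryError),  # only retry backoff-able errors
    wait=wait_exponential(multiplier=1, max=60),       # exponential backoff, capped at 60s
    stop=stop_after_attempt(5),                        # give up after five attempts
)
def ingest(filepath):
    # DoNotRetryError and NonFatalDoNotRetryError propagate immediately,
    # since only BackoffRetryError is configured for retry.
    with open(filepath, 'rb') as fileobj:
        return upload_file_and_ingest_to_archive(fileobj)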

Keeping a Record of Data Product Ingestion

Applications may need to keep track of some of the data products they’ve ingested into the Science Archive. For instance, a data pipeline that creates calibration frames may need to retrieve those frames from the Science Archive later to perform science data reductions.

One method of keeping a record of these data products is a relational database table that the application can access as needed. When the application creates and ingests a data product, it adds a new image record to its database containing the Science Archive frame ID alongside any other useful metadata the application may need (filter used, binning, instrument name, etc.).

By storing the frame ID in the record, the lookup process in the Science Archive is greatly sped up, as the Science Archive record for this image can be accessed simply by sending an HTTP GET request to the /frames/<image-ID> endpoint.
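
For example, a lookup by stored frame ID using the requests library; the archive URL below is a placeholder for your deployment’s Science Archive API:

import requests

ARCHIVE_API = 'https://archive-api.example.com'  # placeholder for your archive's URL
frame_id = 12345  # a frame ID previously stored by the application

# Fetch the archive record for this frame directly by ID.
response = requests.get(f'{ARCHIVE_API}/frames/{frame_id}/')
response.raise_for_status()
frame = response.json()  # metadata for the archived frame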

For an observatory whose Science Archive is handling many requests, this scheme prevents internal applications from adding unnecessary load to the system by allowing for near-instant access by ID, rather than using more expensive HTTP filter parameters.