Accessing HEASARC and LAMBDA data in the Cloud

Intro

Since 2023, the Year of Open Science, HEASARC data have been available in the cloud, as part of NASA's Open Science Initiative and in collaboration with the Amazon Web Services (AWS) Open Data project. This effort is motivated by the need to make these data more accessible to the broader community and to enable science that requires the substantial resources of cloud computing.

HEASARC data are now on AWS and registered in their Open Data Registry. HEASARC is building a next-generation platform that will be like HEASARC@SciServer but running in AWS; in the meantime, the data are already available. Below, a tutorial notebook shows how to access them in Python. The cloud locations of the data can be used with cloud-compatible client software, such as the Astropy-affiliated packages Astroquery and PyVO, to provide seamless access to data in the cloud. Our Xamin data portal offers results in various formats, including a list of cloud URIs.

Pythonic Data Access Tutorial

For a quick tutorial on accessing HEASARC or LAMBDA data in the cloud using Python, we have prepared a Python notebook that you can download or view rendered as HTML.

Some software, such as Astropy's FITS IO routines, can read data directly from the S3 bucket, with options to read only a subset of a FITS file. Tools like HEASoft that are based on cfitsio can also read a file directly from a URL; the Direct Bucket Access section below shows how to construct these URLs.
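As a minimal sketch of the Astropy route, the snippet below opens a FITS file in place on S3 and returns only its primary header. It assumes Astropy 5.2 or later with the optional fsspec and s3fs packages installed. The helper names are illustrative and not part of any HEASARC tool. The example URI is the uncompressed WMAP map from the Direct Bucket Access section, chosen because partial reads may not be possible for gzip-compressed files.

```python
# Example object in the public LAMBDA bucket (an uncompressed FITS file).
WMAP_URI = (
    "s3://nasa-lambda/wmap/dr5/skymaps/9yr/smoothed/"
    "wmap_band_smth_iqumap_r9_9yr_K_v5.fits"
)


def s3_open_options(anonymous=True):
    """Keyword arguments that make astropy.io.fits.open read through fsspec.

    anon=True asks s3fs for unsigned requests, so no AWS account is needed.
    """
    return {"use_fsspec": True, "fsspec_kwargs": {"anon": anonymous}}


def read_primary_header(uri):
    """Return the primary header of a FITS file stored on S3."""
    # Imported here so the helpers above load even without Astropy installed.
    from astropy.io import fits

    with fits.open(uri, **s3_open_options()) as hdul:
        # A copy stays valid after the file is closed.
        return hdul[0].header.copy()
```

Calling `read_primary_header(WMAP_URI)` should then transfer only the parts of the file that Astropy needs for the header, rather than the whole map.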

Note that some HEASoft tools that rely on knowing the directory structure of an input dataset might require you to copy the data out of the S3 object store and into a file system they can access.
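For tools that need the whole observation directory on a local file system, one option is to copy everything under the observation's prefix out of the bucket. The sketch below assumes the third-party s3fs package. The function names and the `levels_up` parameter are hypothetical, and the default of two levels (the file name and `primary`) is based only on the Chandra example path used in the Direct Bucket Access section.

```python
from pathlib import PurePosixPath


def observation_prefix(s3_uri, levels_up=2):
    """Drop the last `levels_up` path components from an S3 URI.

    Returns a "bucket/key" path without the "s3://" scheme, a form s3fs accepts.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {s3_uri}")
    path = PurePosixPath(s3_uri[len("s3://"):])
    for _ in range(levels_up):
        path = path.parent
    return str(path)


def copy_observation(s3_uri, local_dir, levels_up=2):
    """Recursively copy an observation directory from S3 to local_dir."""
    # Imported here so observation_prefix works without s3fs installed.
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)  # anonymous access to the public bucket
    fs.get(observation_prefix(s3_uri, levels_up), local_dir, recursive=True)
```

Other missions may lay out their directories differently, so `levels_up` would need to be checked against the actual path before copying.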

Direct Bucket Access

These data can currently be accessed by using the HEASARC or LAMBDA web tools to browse the archive and retrieve a list of observations or files to download, or by doing the same with one of our APIs. (See our archive pages for the HEASARC options or the LAMBDA data portal.) If a given tool does not return cloud URIs, they can be inferred from the on-premises URL: simply replace the beginning of the traditional access URL with the AWS S3 bucket address. For example, a Chandra image located at

https://heasarc.gsfc.nasa.gov/FTP/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz

can also be found at

s3://nasa-heasarc/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz
or
https://nasa-heasarc.s3.amazonaws.com/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz

For LAMBDA data, similar URLs can be turned into URIs starting with "s3://nasa-lambda/". Note that for WMAP, there is one small change to the path, from "map" to "wmap", to make clear that it is the mission name. For example,

https://lambda.gsfc.nasa.gov/data/map/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits

can also be found at

s3://nasa-lambda/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
or
https://nasa-lambda.s3.amazonaws.com/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
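The substitution rules above can be sketched as a small helper. This is an illustration rather than an official tool: the function name is hypothetical, and the two on-premises prefixes are inferred only from the example URLs on this page, so other archive paths may need different handling.

```python
# On-premises URL prefixes, inferred from the examples above (an assumption).
HEASARC_PREFIX = "https://heasarc.gsfc.nasa.gov/FTP/"
LAMBDA_PREFIX = "https://lambda.gsfc.nasa.gov/data/"


def to_cloud_locations(url):
    """Map an on-premises archive URL to its (S3 URI, HTTPS URL) in the cloud."""
    if url.startswith(HEASARC_PREFIX):
        bucket = "nasa-heasarc"
        key = url[len(HEASARC_PREFIX):]
    elif url.startswith(LAMBDA_PREFIX):
        bucket = "nasa-lambda"
        key = url[len(LAMBDA_PREFIX):]
        # WMAP special case: the top-level "map" directory becomes "wmap".
        if key.startswith("map/"):
            key = "wmap/" + key[len("map/"):]
    else:
        raise ValueError(f"unrecognized archive URL: {url}")
    return f"s3://{bucket}/{key}", f"https://{bucket}.s3.amazonaws.com/{key}"
```

Applied to the Chandra and WMAP URLs above, this returns the same pairs of cloud locations listed in the text.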

Thanks to Amazon's Open Data project, these data are free to access from anywhere and are not subject to cloud data egress costs. As described on HEASARC's data policy web page, these data are freely available for your use.

Datasets

The datasets currently available include:

  • High-energy astrophysics datasets
    • Ariel5
    • ASCA
    • BBXRT
    • Chandra
    • Compton
    • Copernicus
    • COS-B
    • DXS
    • EXOSAT
    • Fermi (subset)
    • Ginga
    • HaloSat
    • HEAO-1
    • Hitomi
    • IXPE
    • NICER
    • NuSTAR
    • OSO-8
    • ROSAT
    • SAS-2
    • BeppoSAX
    • Suzaku
    • Swift
    • VELA 5B
    • WASS
    • XQC
    • Rossi XTE
    • XMM-Newton
  • CMB datasets
    • WMAP
    • COBE

Please also see the HEASARC and LAMBDA entries in the AWS Open Data Registry.

Caveats

Not every dataset has been put into the cloud: we have left out data that we do not believe will be useful to access this way, such as older mission data in non-standard file formats. For the ongoing missions, we keep the nasa-heasarc bucket in sync with the on-premises archive on a best-effort basis. As a result, the most recent data products may be available only from the HEASARC on-premises archive for a few days, until the next sync.