Accessing HEASARC and LAMBDA data in the Cloud

Intro

Since 2023, the Year of Open Science, HEASARC data have been available in the cloud as part of NASA's Open Science Initiative and in collaboration with the Amazon Web Services (AWS) Open Data project. This effort is motivated by the need to make these data more accessible to the broader community and to enable the kind of science that requires the significant resources of cloud computing.

For most of our user community, the traditional workflow is to browse for data in a portal such as the HEASARC's Xamin and from there to download the datasets of interest. With the increasing size of modern datasets, this approach is becoming increasingly impractical. HEASARC's earliest science platform was Hera, where users could do limited analyses directly against data in the archive. The HEASARC@SciServer project, a collaboration with SciServer, is our newest science platform. With it, our users can log into a fully featured system for almost all analysis needs without needing to download any data or build any software. The data and software are made available to anyone through their browser, along with free but limited compute resources provided on premises at SciServer.

The next step of this evolution is to take advantage of the expandable compute capacity provided by the cloud. For this reason, HEASARC data are now on AWS and registered in the AWS Open Data Registry. HEASARC is building a next-generation platform that will be like SciServer in AWS, but in the meantime, the data are already available. We are working to update our services to offer cloud locations as an option. These locations could then be used with cloud-compatible client software, such as the Astropy-affiliated packages Astroquery and PyVO, to provide seamless access to data in the cloud.

Access

These data can currently be accessed by using the HEASARC or LAMBDA web tools to browse the archive and retrieve a list of observations or files to download, or by doing the same with one of our APIs. (See our archive pages for the HEASARC options or the LAMBDA data portal.) Once the user has the location of the dataset, they can replace the beginning of the traditional access URL with the AWS S3 bucket address. For example, a Chandra image located at

https://heasarc.gsfc.nasa.gov/FTP/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz

can also be found at

s3://nasa-heasarc/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz
or
https://nasa-heasarc.s3.amazonaws.com/chandra/data/byobsid/5/4475/primary/acisf04475N004_full_img2.fits.gz

For LAMBDA data, similar URLs can be turned into URIs starting with "s3://nasa-lambda/". Note that for WMAP, the path component "map" becomes "wmap" to make clear that it is the mission name. For example,

https://lambda.gsfc.nasa.gov/data/map/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits

can also be found at

s3://nasa-lambda/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
or
https://nasa-lambda.s3.amazonaws.com/wmap/dr5/skymaps/9yr/smoothed/wmap_band_smth_iqumap_r9_9yr_K_v5.fits
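
The prefix substitutions shown in these examples can be sketched in a few lines of Python. This is only an illustration: the function name to_cloud_locations and the mapping table are hypothetical, and the generic LAMBDA rule (dropping "/data/") is an assumption inferred from the single WMAP example above, so it may not hold for every dataset.

```python
from urllib.parse import urlparse

# (archive host, on-prem path prefix, S3 bucket, key prefix in the bucket)
# The first two rows come from the examples above. The last row is an
# assumption generalized from the WMAP example and may not hold everywhere.
_MAPPINGS = [
    ("heasarc.gsfc.nasa.gov", "/FTP/", "nasa-heasarc", ""),
    ("lambda.gsfc.nasa.gov", "/data/map/", "nasa-lambda", "wmap/"),
    ("lambda.gsfc.nasa.gov", "/data/", "nasa-lambda", ""),
]


def to_cloud_locations(url):
    """Return (s3_uri, https_url) for a traditional archive URL."""
    parsed = urlparse(url)
    # Rows are checked in order, so the more specific WMAP rule wins.
    for host, prefix, bucket, key_prefix in _MAPPINGS:
        if parsed.netloc == host and parsed.path.startswith(prefix):
            key = key_prefix + parsed.path[len(prefix):]
            return (
                f"s3://{bucket}/{key}",
                f"https://{bucket}.s3.amazonaws.com/{key}",
            )
    raise ValueError(f"no known cloud mapping for {url}")
```

Either returned form can then be handed to whichever client you prefer: the s3:// URI for S3-aware tools, the https:// URL for ordinary HTTP clients.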

Thanks to Amazon's Open Data project, these data are free to access from anywhere and are not subject to cloud data egress costs. As described on HEASARC's data policy web page, these data are freely available for your use.
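
As a rough illustration of what open access means in practice, the sketch below fetches only the first bytes of a file over plain HTTPS using the Python standard library. It assumes the bucket allows anonymous reads and honors HTTP Range requests, as is typical for S3; the helper names range_request and fetch_first_bytes are hypothetical.

```python
import urllib.request

FITS_BLOCK = 2880  # FITS files are organized in 2880-byte blocks


def range_request(url, nbytes=FITS_BLOCK):
    """Build an HTTP request asking for only the first nbytes of url."""
    return urllib.request.Request(
        url, headers={"Range": f"bytes=0-{nbytes - 1}"}
    )


def fetch_first_bytes(url, nbytes=FITS_BLOCK):
    """Download the first nbytes of a public object; no credentials are sent."""
    with urllib.request.urlopen(range_request(url, nbytes)) as response:
        return response.read()


# Example (requires network access):
# head = fetch_first_bytes(
#     "https://nasa-lambda.s3.amazonaws.com/wmap/dr5/skymaps/9yr/smoothed/"
#     "wmap_band_smth_iqumap_r9_9yr_K_v5.fits"
# )
```

For an uncompressed FITS file, the first 2880 bytes are the first block of the primary header, which makes this a cheap way to peek at a file before downloading all of it.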

Tutorial

For a quick tutorial on accessing HEASARC or LAMBDA data in the cloud using Python, we have prepared a Python notebook that you can download or view rendered as HTML.

Analysis of these data with traditional tools such as HEASoft still requires the user to have a compute environment with the software built and to copy the data out of the S3 object store and into a file system it can access. Some software, such as Astropy's FITS IO routines, can read data directly from the S3 bucket, including with options to read only a subset of a FITS file. (Neither is currently possible in HEASoft, but we are looking into it.)
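
A minimal sketch of the direct-read approach with Astropy follows. It assumes a recent Astropy (5.2 or later) with the optional fsspec and s3fs packages installed, and that the bucket permits anonymous access. The function names are hypothetical, and the cutout helper further assumes an uncompressed image HDU, since the .section interface is meant for that case.

```python
ANONYMOUS = {"anon": True}  # passed through fsspec to s3fs: unsigned requests


def read_primary_header(s3_uri):
    """Return the primary header of a FITS file stored in a public S3 bucket."""
    # Imported here so this sketch can be loaded without astropy installed.
    from astropy.io import fits

    # For s3:// URIs, astropy opens the file through fsspec and fetches
    # only the byte ranges it needs rather than the whole object.
    with fits.open(s3_uri, fsspec_kwargs=ANONYMOUS) as hdul:
        return hdul[0].header.copy()


def read_image_cutout(s3_uri, hdu_index, rows, cols):
    """Return a sub-array of an uncompressed image HDU.

    rows and cols are slice objects, e.g. slice(100, 200).
    """
    from astropy.io import fits

    with fits.open(s3_uri, fsspec_kwargs=ANONYMOUS) as hdul:
        # .section reads only the requested pixels from the remote file.
        return hdul[hdu_index].section[rows, cols]


# Example (requires network access, astropy, fsspec and s3fs):
# header = read_primary_header(
#     "s3://nasa-lambda/wmap/dr5/skymaps/9yr/smoothed/"
#     "wmap_band_smth_iqumap_r9_9yr_K_v5.fits"
# )
```

Gzip-compressed files, such as the Chandra example above, generally cannot be subset this way, because the whole file must be decompressed before any part of it can be read.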

Datasets

The datasets currently available include:

  • High-energy astrophysics datasets
    • Ariel5
    • ASCA
    • BBXRT
    • Chandra
    • Compton
    • Copernicus
    • COS-B
    • DXS
    • EXOSAT
    • Fermi (subset)
    • Ginga
    • HaloSat
    • HEAO-1
    • Hitomi
    • NICER
    • NuSTAR
    • OSO-8
    • ROSAT
    • SAS-2
    • BeppoSAX
    • Suzaku
    • Swift
    • VELA 5B
    • WASS
    • XQC
    • Rossi XTE
    • XMM-Newton
  • CMB datasets
    • WMAP
    • COBE

Please also see the HEASARC and LAMBDA entries in the AWS Open Data Registry.

Caveats

Not every dataset has been copied to the cloud. We have left out data that we do not believe will be useful to access this way, such as older mission data in non-standard file formats. For ongoing missions, we will keep the nasa-heasarc bucket in sync with the on-prem archive on a best-effort basis, so the most recent data products may be available only from the HEASARC on-prem archive.
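
Because a given file may not yet be in the bucket, client code may want to fall back to the on-prem archive. The sketch below shows one way to do that with the Python standard library. The function names url_exists and pick_location are hypothetical, and the approach assumes that a missing or unreadable object produces an HTTP error response (S3 typically answers 403 or 404 in that case).

```python
import urllib.error
import urllib.request


def url_exists(url):
    """Return True if a HEAD request for url succeeds, False on an HTTP error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request):
            return True
    except urllib.error.HTTPError:
        # Raised for 4xx/5xx responses, e.g. when the object is not in the bucket.
        return False


def pick_location(cloud_url, onprem_url, exists=url_exists):
    """Prefer the cloud copy when it is present, else use the on-prem URL."""
    return cloud_url if exists(cloud_url) else onprem_url
```

The exists argument is injectable so that the selection logic can be exercised without network access, or replaced with a different existence check.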