The HEASARC Database System

Adding Tables to the HEASARC Database

This chapter discusses building a HEASARC table for use in the Browse system. The first section discusses the logical steps required in building a catalog, while the second section breaks these down into specific actions in the process of building a table. The primary tool for physically generating a table, HDBingest, and the complementary tool for exporting a table into a file, HDBexgest, are described below.

Catalog Design and Preparation Issues

A number of tasks are required for the successful creation of a table in the HEASARC database. These include:

  1. Accessing, generating or recovering the information that will be used in the table.
  2. Designing the structure that will be used for the table.
  3. Modification of the original information to meet HEASARC standards.
  4. Definition of the actual structure of the table.
  5. Creation of any indices that may be helpful in searching the table.
  6. Ingestion of the actual data values for the table.
  7. Updating the ZZGEN, ZZEXT and ZZPAR metadata tables to reflect the existence of the new table.
  8. Definition of the data products associated with the table.
  9. Definition of any links between this table and any other tables in the database.
  10. Creation of appropriate table documentation.
  11. Modification of Browse HTML pages if the table refers to a new mission.
  12. Scanning of the HEASARC archive to find any new data product tags that should be associated with the new table.

In this section we will discuss the issues that may come up with each of these. The next section describes a specific scenario and how it might be addressed.

Step 1: Getting the Information

The first step in building a table is understanding the source of the information that will be in the table. Many tables are static. They are generated once and then never modified. Other tables, especially those describing observations from active missions, are dynamic. These tables need to be updated periodically with new information.

A full discussion of how to get data is beyond the scope of this document. Dynamic tables often involve some collaboration with the mission operations centers or guest observer facilities of some instrument, or periodic scanning of a location where another organization posts updates. Information for static tables may be obtained directly from the scientists who generated them or from other organizations. Note that in an increasingly interconnected world, it may not be necessary for a table to be included directly within the HEASARC for the HEASARC Browse system to use it. Browse can currently query any table in the VizieR system and will soon be able to query any VO-compliant database.

Step 2: Designing the Overall Table Structure

Once the input information is at hand, it is possible to consider how this information may most effectively be stored in the HEASARC database. The basic constraint on tables is that each row in the table describes one 'thing': an observation, an object, or perhaps one observation of one object. Once this basic concept for the table is established, the appropriate structure usually follows.

For example, consider a table of proposed targets. Should this table include the abstract for the proposal as a field? In general the abstract does not refer to the individual targets in a proposal, but to the proposal as a whole. So the abstract probably does not belong in a table of proposed targets. On the other hand it is a very natural field in a table of proposals. So the most elegant approach may be to create two tables: proposals and proposed targets, and link them together using ZZLINK.
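As a rough illustration of this two-table design, the following Python sketch uses the standard-library sqlite3 module as a stand-in for the HEASARC RDBMS. The table names, column names, and sample rows are all hypothetical, and the join is written by hand rather than through ZZLINK.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in RDBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table proposals (
        proposal_id integer primary key,
        pi_name     char(40),
        abstract    char(200)
    );
    create table proposed_targets (
        target_name char(20),
        ra          float,
        dec         float,
        proposal_id integer
    );
""")
# Made-up sample rows: one proposal with two targets.
conn.execute(
    "insert into proposals values (1, 'Messier', 'Survey of extended objects')")
conn.executemany(
    "insert into proposed_targets values (?, ?, ?, ?)",
    [("M1", 83.63, 22.01, 1), ("M2", 323.36, -0.82, 1)],
)

def abstract_for_target(connection, target_name):
    """Follow the target -> proposal link to fetch the proposal abstract."""
    row = connection.execute(
        "select p.abstract from proposed_targets t "
        "join proposals p on p.proposal_id = t.proposal_id "
        "where t.target_name = ?",
        (target_name,),
    ).fetchone()
    return row[0] if row else None
```

The abstract is stored once per proposal, and each target reaches it through proposal_id instead of carrying its own copy.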

On the other hand, it may be very convenient to include the principal investigator for every target, even though that could also be obtained by linking back to the proposal. The question of how well 'normalized' the database should be is a matter of judgement.

For most tables the database structure and organization should follow the guidelines established by earlier tables in the HEASARC.

Step 3: Modification of the Data to HEASARC Standards

Unlike some other database systems (e.g., VizieR at the CDS), the HEASARC is willing to modify tables to try to make them interoperable. The following modifications are commonly needed:

  • Column names may be changed to agree with usage in other tables. RA and Dec are normally used for the primary J2000 positions and there is now some effort to standardize on names for exposure and time fields.
  • The base coordinates of the table are converted to J2000 positions. If the table uses some other coordinates (e.g., Galactic), then these positions should be kept, but additional columns added using standard coordinate transformations. In some previous tables, RA and Dec columns expressed in B1950 coordinates were converted to J2000 and the original positions dropped. This is no longer recommended.
  • Tables where there is information about the types of sources described in each row (or to be studied in the observations described by the row) should have a class field added where these classes are transformed into the HEASARC's standard class system. The original class information should also be retained.
  • The structure of the table should be made uniform: each row should have the same number and type of columns, though one or more values may be null.
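The last point can be sketched in a few lines of Python. This is only an illustration: the column set below is hypothetical, and None stands in for a database null.

```python
# Hypothetical column set for a table that keeps the original class
# alongside the standardized one.
STANDARD_COLUMNS = ("name", "ra", "dec", "class", "original_class")

def to_uniform_row(record, columns=STANDARD_COLUMNS):
    """Return a row with exactly the expected columns.

    Columns missing from the input become None (null); input keys that
    are not table columns are dropped.
    """
    return {column: record.get(column) for column in columns}
```

Every row that passes through to_uniform_row has the same columns in the same order, whatever the input record contained.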

Step 4: Defining the Table

The table is defined by an SQL create table statement which gives the names and types of the fields in the database. Since the HEASARC Database System was designed to be easily implemented and ported between a variety of commercial (and free) RDBMS implementations, the HEASARC Database System supports only a common subset of available SQL data types. In particular there is (currently) no support for variable-length character fields (varchar), very long (8 byte) integers, or any vendor-specific "blob" (binary) types. If such types are desirable, please contact Ed Sabol or Tom McGlynn.

In addition to creating the table, permission to view the table must be given to allow users other than the creator to query it. This usually involves a command of the form grant select on xxx to public.
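A minimal sketch of generating these two statements from a column list follows. The set of portable type names is an assumption made for illustration and is not the authoritative HEASARC list; the helper name create_table_sql is likewise hypothetical.

```python
# Types assumed to be in the portable subset; the real list may differ.
PORTABLE_TYPES = {"char", "smallint", "integer", "real", "float"}

def create_table_sql(table, columns):
    """Build 'create table' and 'grant select' statements.

    columns is a list of (name, type, length) tuples; length is None
    except for fixed-length char fields.
    """
    parts = []
    for name, sql_type, length in columns:
        if sql_type not in PORTABLE_TYPES:
            raise ValueError(f"unsupported type for {name}: {sql_type}")
        if length:
            parts.append(f"{name} {sql_type}({length})")
        else:
            parts.append(f"{name} {sql_type}")
    create = f"create table {table} ({', '.join(parts)})"
    grant = f"grant select on {table} to public"
    return create, grant
```

Rejecting unsupported types such as varchar before any SQL is issued keeps the table definition within the common subset described above.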

Step 5: Indexing the Table

To search tables effectively, indices are created on columns that are likely to be part of a query. Simple indices are created using the create index SQL command. The HEASARC currently indexes almost all columns. This is probably not the most effective strategy, but it is simple. The primary index for most tables is the position or, if only one field is allowed, the declination. However, even when both position values are supported in the primary index, this is not the kind of R-tree indexing that allows for rapid spatial searches, and in practice the system treats the database as if the index were on declination alone.

For tables with more than about 1 million entries, alternative approaches to organizing the data would probably be useful. If you have very large tables, you may wish to explore generating 2-D indices and making these the primary key of the table. There are a number of effective mechanisms for this: Hierarchical Triangular Meshes (HTM), Hierarchical Equal Area isoLatitude Pixelization (HEALPix), and other simpler tools.

Note that effective use of two-dimensional indices requires that the query software be aware of the existence of these indices. Standard Browse software does not currently support them.
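To make the declination-only behavior concrete, here is a small Python sketch of a cone search that uses a sorted declination list as its only index. It is an illustration under stated assumptions, not the Browse implementation: rows are (name, ra, dec) tuples in degrees, and the function names are hypothetical.

```python
import bisect
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cone_search(rows, ra, dec, radius):
    """Return names of rows within radius degrees of (ra, dec).

    rows must be sorted by declination; each row is (name, ra, dec).
    """
    decs = [row[2] for row in rows]
    # Index step: narrow to the declination band around the target.
    lo = bisect.bisect_left(decs, dec - radius)
    hi = bisect.bisect_right(decs, dec + radius)
    # Filter step: every row in the band is tested, whatever its RA.
    return [row[0] for row in rows[lo:hi]
            if angular_separation(ra, dec, row[1], row[2]) <= radius]
```

The index narrows the candidates to a band in declination, but every row in that band must still be checked individually. For very large tables that band can hold many rows, which is why a true two-dimensional index becomes attractive.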

Step 6: Copying Data into the Table

A common issue to be addressed in copying data is the frequency with which the table will be updated and whether changes are purely of the form of additional rows. If the latter is not true, then the table may need to be recreated from scratch every time it is updated.

Tables can be ingested either using SQL insert statements or through vendor-specific "bulk copy" tools. The HDBingest utility handles the ingest of data for almost all HEASARC tables. The HDBingest utility may not be adequate if binary data ("blob" fields, e.g., images) or other unsupported data types are needed. Tables with these data types cannot be accommodated in the current Browse interfaces either.

HDBingest uses an ASCII transfer mechanism. For very large tables a binary transfer mechanism may be preferable. Most database systems provide vendor-specific facilities for binary bulk copying of databases.

Step 7: Adding Table-Specific Metadata to the HEASARC Metabase

Adding the metadata information is a relatively straightforward process, though it can be tedious if automated tools are not used. The HDBingest tool takes care of this step automatically. When a table is first ingested using HDBingest, the relevant ZZGEN, ZZPAR and ZZEXT entries will be inserted by HDBingest. If the table is dynamically updated, then usually only ZZGEN's table_rows and modify_date fields and ZZPAR's parameter_min and parameter_max fields will need to be updated in subsequent ingests of data.
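The values refreshed on a re-ingest can be computed directly from the table contents. The Python sketch below shows the idea only; it is not how HDBingest is implemented, the row representation (a list of dicts with None for nulls) is an assumption, and so is the ISO date format used for modify_date.

```python
import datetime

def summarize_table(rows, columns):
    """Compute the values a re-ingest would typically refresh.

    rows is a list of dicts; None marks a null value.
    """
    zzgen = {
        "table_rows": len(rows),
        "modify_date": datetime.date.today().isoformat(),
    }
    zzpar = {}
    for column in columns:
        values = [row[column] for row in rows if row.get(column) is not None]
        zzpar[column] = {
            "parameter_min": min(values) if values else None,
            "parameter_max": max(values) if values else None,
        }
    return zzgen, zzpar
```

Null values are skipped when computing the minimum and maximum, so a column that is entirely null reports None for both.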

Step 8: Defining Data Products

One of the more complex issues in developing a catalog is understanding the data products to be associated with the catalog. Internally all data products are stored as URLs. When a user requests a single data product they can access it through the URL. When multiple data products are to be combined into a TAR file, the underlying system recognizes local files and accesses them directly rather than using HTTP or FTP.

Data products are generally grouped into data product sets where each member of the set is a file, directory or URL to a remote service. Users can retrieve data product sets as an entity. Sets should be created so that the science that can be done with a given set is clear. Data product sets can overlap: the same data product can be in many sets. The system will try to ensure that only a single copy is retrieved if the user asks for many sets that include the same product. Data product sets might include:

"Quick look" files
    GIFs, JPEGs, or ASCII summaries of the observation
An image data product
    Counts, intensity and exposure maps
Spectral data products
    PHA and background data, perhaps with redistribution matrix files
Housekeeping data
    Records describing instrument health and engineering readouts
Event data
    Event files, good time intervals, aspect files

The science goals of a mission should drive the data products associated with its tables. Remember that data products need not be at the HEASARC and that they may include dynamic Web pages.

The association of data products with entries in the catalog tables ultimately depends upon an association of the full path to the file with fields in the database. The organization of files in the archive should be looked at and the contents of the catalog should include fields that reference elements of the path.

Step 9: Linking Catalogs

Recent versions of Browse allow a user to jump from a row in one catalog to related rows in another. In designing a new catalog, users should consider what, if any, associated catalogs are appropriate. Some links to consider:

  • Observations to (and from) the objects found in the observations.
  • Proposed targets to (and from) the proposal.
  • Master observation tables to the detailed observation tables.
  • One object table to nearby objects in another object table.
  • Observations in one instrument to nearby observations in another table.

Links can be made on spatial or temporal proximity, or any fields that the two tables have in common using standard SQL.

Step 10: Writing Documentation

Each table should have a general description of the table, and each column in the table should have its own description as well. Table documentation for the HEASARC Database System should be created in the ".info" file format. A software tool is then used to render the ".info" files into HTML files used for the Browse Web interface. Many examples of ".info" files can be found under /dba_dbase/work/. Note that the Browse system expects that there will be named anchors in the HTML document with the name of each column in the table.

Step 11: New Missions

If a table represents a new mission for Browse, then certain handcrafted HTML files in the Browse Web interface may need to be updated. These organize the display of missions in the top Browse pages. Contact the Browse software developer to have this task performed.

Step 12: Scanning the Archive

Once a table is in the archive, if it has data products, the archive must be scanned to ensure that appropriate entries are put in the ZZDP table. The procedure for this is discussed in detail below. This activity typically takes place every few days. If new data product files need to be recognized immediately, then special arrangements can be made. It is possible to create data products proactively, so that there are placeholders in the database for the data products; however, this may cause Browse to create links to empty files.

Building a Catalog

In this section we reiterate the discussion of the previous section in the concrete context of building a table for inclusion in the HEASARC. In this scenario Dr. Charles Messier has agreed to periodically deliver updates of his ongoing project to survey extended optical objects. The HEASARC will store a set of images and build a catalog of the objects he is delivering.

Step 1: Getting the Data

In this scenario we are apparently building a dynamic table and archive which will be updated by deliveries from Dr. Messier. Ideally, a memorandum of understanding (MOU) will be agreed upon between the HEASARC and Dr. Messier. This will describe in detail the responsibilities of both parties in the data delivery. Suppose in this simple example Dr. Messier agrees that he will deliver optical images in FITS files to an agreed drop-off area. Information from within the FITS files will be used to create the catalog data. The MOU describes the information that will be given for each keyword. The HEASARC would thus be responsible for periodically scanning the delivery area and copying new files when they are found. The HEASARC will also build the software to analyze the FITS files.
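The periodic scan of the delivery area could be as simple as the following Python sketch. The directory layout and the function name copy_new_deliveries are hypothetical, and a real implementation would also need to guard against files that are still being written when the scan runs.

```python
import shutil
from pathlib import Path

def copy_new_deliveries(delivery_dir, archive_dir, pattern="*.fits"):
    """Copy files not yet present in archive_dir; return their names."""
    delivery, archive = Path(delivery_dir), Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(delivery.glob(pattern)):
        target = archive / source.name
        if not target.exists():          # only files we have not seen before
            shutil.copy2(source, target)
            copied.append(source.name)
    return copied
```

Running the function from a scheduled job gives the periodic behavior: files already copied are skipped, so repeated scans are harmless.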

Step 2: Designing the Table

The key information that this new catalog is to contain is the name, position, brightness and size of the objects being observed. Some objects may have class information, and the time of observation is to be included. This table seems to be a hybrid object/observation table, describing an observation of a particular object, similar to our WGACAT table. The table will need name, ra, dec, size, magnitude, class and time columns. Where possible the field names used in other HEASARC tables should be reused here.

Some information may be missing, and the appropriate action should be defined for each field. Can the field be left null? That is probably appropriate for the size and magnitude columns. However, we may wish to treat missing RA, Dec or time values as fatal errors. Dr. Messier will be requested to resend these data. The class field should be given the default value 'Unclassified' when no classification information is received.
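These rules can be written down as a small validation step. The Python sketch below is one possible encoding under the assumption that each delivered record is a dict keyed by column name; the function name check_record is hypothetical.

```python
REQUIRED = ("ra", "dec", "time")     # missing values are fatal errors
NULLABLE = ("size", "magnitude")     # may be left null

def check_record(record):
    """Return a cleaned copy of record, or raise ValueError if unusable."""
    missing = [field for field in REQUIRED if record.get(field) is None]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    cleaned = dict(record)
    for field in NULLABLE:
        cleaned.setdefault(field, None)      # explicit null when absent
    if not cleaned.get("class"):
        cleaned["class"] = "Unclassified"    # default when no class is given
    return cleaned
```

A record that raises ValueError is one that would be sent back to Dr. Messier rather than ingested.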

This table has both temporal and spatial information. We would anticipate needing to specify the following ZZEXT entries:

frequency_regime
    'Optical'
default_search_radius
    Dr. Messier suggests that the maximum size of the images that he is going to deliver is approximately 1 degree, so this field gets a value of '60' (arcminutes).
right_ascension
    Just the pointer to the RA column, so '@ra'.
declination
    Ditto for dec: '@dec'.
equinox
    The equinox field should always be 2000. While other equinoxes are supported, it is more efficient to do comparisons using a standard coordinate system.
start_time
    While we're not sure that the time given is the start time of the observation, our convention is to use start_time when only one time is specified, so '@time'.
observatory_name
    This one is required, but this catalog doesn't fit well. It's not associated with any specific mission, nor does it fit cleanly within the pseudo-missions like 'STAR CATALOG'. By default this table slips into 'GENERAL CATALOG'.
unique_key
    This indicates what is required to distinguish each row of this table from all others. Since we're getting just one observation per object, the object name is probably enough: '@name'.
table_priority
    Clearly this is a very important table. We give it a priority of 3, so that it shows up early in lists of tables.
table_type
    This is an object catalog, so it gets the value 'Object'.

Step 3: Transforming the Input Data

Dr. Messier chooses to send us the data using B1775 coordinates. Since this is not compatible with our other Browse tables, we prepare to transform the positions using software tools. Similarly, his classification fields use phrases like 'spiral nebula' that we plan on translating to 'spiral galaxy'. His naming convention for the sources is quite simple.

The actual software used for this transformation is not part of the database system. One might build an IDL script that looked at the input directory and did something like:

   pro process_inputs, deliveryDirectory

      ; Find the delivered FITS files and open the output file.
      files = findfile(deliveryDirectory + "/*.fits")
      openw, lun, "Output.file", /get_lun

      for ifile = 0, n_elements(files)-1 do begin

         ; Read the primary HDU; only the header is used here.
         data = mrdfits(files[ifile], 0, header)

         targetname = fxpar(header, "TARGET")
         ra1775     = fxpar(header, "RA")
         dec1775    = fxpar(header, "DEC")
         type       = fxpar(header, "TARGTYPE")
         time       = fxpar(header, "OBS-TIME")
         size       = fxpar(header, "EXTENT")
         magnitude  = fxpar(header, "MAG")

         ; Keep the original B1775 values and precess copies to 2000.
         ra         = ra1775
         dec        = dec1775
         precess, ra, dec, 1775, 2000

         ; Translate Messier's classification to a Browse class code.
         class = ""
         if (type eq "stellar cluster") then class = 8800
         if (type eq "spiral nebulae") then class = 6600
         ...

         ; Write the converted values followed by the original ones.
         printf, lun, targetname, ra, dec, class, time, size, magnitude, $
                 ra1775, dec1775, type, format='(...)'
      endfor

      free_lun, lun
   end

Of course the software used at this point is entirely up to the convenience of the table developer, and the IDL script above is a very simple example of such software. In practice almost all of the database tables ingested by the HEASARC are processed using Perl, since Perl is particularly adept at that sort of thing. If many tables are to be ingested, scripts that automate some of this process may be desirable. Note that we saved the original data values, and we should anticipate including them in the table as the ra_1775, dec_1775, and type columns.
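The classification step translates to any scripting language. As a hedged illustration, here is the same idea in Python, using only the two class codes that appear in the IDL example; the rest of the mapping is elided there and is not filled in here. The function name translate_class is hypothetical.

```python
# Codes taken from the IDL example above; the full mapping is elided there.
CLASS_CODES = {"stellar cluster": 8800, "spiral nebulae": 6600}

def translate_class(record):
    """Add a Browse class code while keeping Messier's original type.

    Unmapped types get a class of None; the 'type' value is never altered.
    """
    original = record["type"]
    code = CLASS_CODES.get(original.strip().lower())
    return {**record, "class": code, "type": original}
```

The original string is carried through unchanged, so the type column can always be used to re-derive or audit the class later.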

Steps 4-7: Table Creation, Ingest, Indexing, and Metadata Update

These steps are all done by first creating a TDAT file from the inputs described above and then using the HDBingest command.

The TDAT file might look something like:

<HEADER>
table_name = heasarc_messier
table_description = "New Messier catalog"
table_document_url = http://heasarc.gsfc.nasa.gov/W3Browse/general-catalog/messier.html

field[name]      = char6             (index) // Source Designation
field[ra]        = float8:.4f_degree (key)   // Right Ascension
field[dec]       = float8:.4f_degree (key)   // Declination
field[size]      = float4:.1f_degree (index) // Size of the Object
field[magnitude] = float4:.1f        (index) // Visual Magnitude
field[class]     = int2              (index) // Browse Object Classification
field[ra_1775]   = float8:.4f_degree (index) // Right Ascension B1775
field[dec_1775]  = float8:.4f_degree (index) // Declination B1775
field[type]      = char20            (index) // Messier's Original Classification

parameter_defaults = name ra dec size magnitude class

default_search_radius = 60
equinox = 2000
right_ascension = @ra
declination = @dec
target_name = @name
frequency_regime = Optical
observatory_name = GENERAL CATALOG
table_priority = 3
table_type = Object
unique_key = name
#
# Data Format Specification
#
line[1] = name ra dec size magnitude class ra_1775 dec_1775 type
#
<DATA>
M1|10.3|10.3|0.3|7.2|8800|10.0|10.0|Nebula|
M2|190.2|-14.3|1.3||8840|190.0|-14.0|Cluster of Stars|
<END>

Let's go through this TDAT file in detail.

The table name and description are a short name and brief description of the table. We'll use the table name throughout the system to identify this table, and users will normally see the table description in contexts where they may choose to use this table.

The table document URL is the documentation we need to write as step 10.

The fields of the table are given with each parameter having a single field element. There is a lot of information encoded in each line. Look at the description of the TDAT file for more details.

We have made the ra and dec the key fields in the table -- the database should try to organize the table physically according to these parameters (with dec first, since it comes first alphabetically). All of the other fields are made indices (which is probably not the optimal choice).

Note that the output format of the position fields is limited to 4 decimal places of a degree, or roughly 0.4 arcseconds. The size and magnitude are output to only one decimal place.

We have kept the input position and class information but put it in table-specific columns. The parameter_defaults entry gives the fields to be displayed by default and their order. The 1775 positions and the original type will not be displayed by default.

The set of entries starting with default_search_radius is where we indicate the values we want to set in ZZEXT.

The line[1] line gives the actual order of the fields in the data lines below. It need not be the same as either the default fields or the field name specifications.
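Because the data lines carry no field names, a quick consistency check against the line[1] specification can catch rows with a missing or extra field before ingest. The Python sketch below is not part of HDBingest. It assumes, as in the example rows, that each line ends with a trailing '|' and that an empty field denotes a null.

```python
def parse_data_line(line, field_order):
    """Split one pipe-delimited data line into a dict keyed by field name."""
    values = line.rstrip("\n").split("|")
    if values and values[-1] == "":
        values.pop()                      # drop the piece after the final '|'
    if len(values) != len(field_order):
        raise ValueError(
            f"expected {len(field_order)} fields, found {len(values)}: {line!r}")
    # Empty strings are treated as nulls.
    return {name: (value if value != "" else None)
            for name, value in zip(field_order, values)}

# Field order copied from the line[1] entry of the example TDAT file.
FIELD_ORDER = "name ra dec size magnitude class ra_1775 dec_1775 type".split()
```

Values are returned as strings; converting them to the types declared in the field[...] lines is left out of this sketch.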

Once the TDAT file is created, we ingest it with the HDBingest command, e.g., HDBingest updatefile.tdat. Assuming Dr. Messier never sends us updates on old objects, we can just run updates on the table. However, if he occasionally generates a new entry for an old target, we may want to generate a TDAT file for the entire table and use the HDBingest -rebuild option to regenerate the table from scratch.

Step 8: Defining Data Products

Dr. Messier delivered his data products to the delivery directory using arbitrary file names. For each of these files he also included a small ASCII file giving a photometric fit to the data. We have files like 'Qxx9xxx.fits' and 'Qxx9xxx.phot'. These data products need to be linked to the table so that the table functions as an index for the archive.

The names of the data files are not directly derivable from the entries we have in the table. There are several choices we can make. We can add the strings of the form 'Qxx9xxx' to the table as an 'archive_name' field. However, we can just as easily choose to rename our files, so that when we copy the files from the delivery directory to where they reside in the permanent archive they are named 'm1.fits' and 'm1.phot'. We could also create directories 'm1', 'm2', and so on in the archive, and then place each observation in a separate directory. Each of these options has been chosen in various places in the HEASARC system. The third is the recommended approach.

If we choose the third option, then our archive may look like:

   /FTP/messier/data/
       m1/
          Q019abc.fits
          Q019abc.phot
       m2/
          Q239ffa.fits
          Q239ffa.phot
       ...

In looking at these data we may decide that it's appropriate to generate a quicklook GIF image as part of our ingest processing. We can use IDL, possibly Ximage, or other tools like ImageMagick to create the GIFs. These will be stored with the delivered data.

Dr. Messier also supplied us with a flat field file that is to be used in flat fielding all of our datasets. We place this in the archive as /FTP/messier/data/calib/flatfield.fits.

What data products do we want? An appropriate choice might be two sets: an analysis set (an image data product consisting of the FITS file and the flat field) and a quicklook product consisting of the photometry description and the GIF file. We have to link the table entries to the data product files. However, in the HEASARC we do not link to the files themselves. We link to proxies of these files, called data product tags. In Step 12 below we'll discuss how to build the tags. For the moment we assume that the data product tags for these products look like:

messier.m1.img
    for the FITS file for the M1 observation
messier.m1.gif
    for the corresponding GIF image
messier.m1.phot
    for the photometry file
messier.calib.flatfield
    for the flat field file

For other objects the object name in the middle of the tag would vary.
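The relationship between archive paths and tags can be sketched with a few regular expressions. The Python below is an illustration only: the rule list and the function tag_for_path are hypothetical, the paths follow the archive layout shown earlier, and this is not the syntax or behavior of the actual tag-building tool.

```python
import re

# Hypothetical rules: (regular expression on the archive path, tag template).
# The calibration rule comes first so that flatfield.fits is not mistaken
# for an object image.
TAG_RULES = [
    (re.compile(r"^/FTP/messier/data/calib/flatfield\.fits$"),
     "messier.calib.flatfield"),
    (re.compile(r"^/FTP/messier/data/([^/]+)/[^/]+\.fits$"), r"messier.\1.img"),
    (re.compile(r"^/FTP/messier/data/([^/]+)/[^/]+\.gif$"), r"messier.\1.gif"),
    (re.compile(r"^/FTP/messier/data/([^/]+)/[^/]+\.phot$"), r"messier.\1.phot"),
]

def tag_for_path(path):
    """Return the data product tag for an archive path, or None."""
    for pattern, template in TAG_RULES:
        match = pattern.match(path)
        if match:
            return match.expand(template)
    return None
```

The object name in the middle of the tag comes from the directory name, so the arbitrary delivered file names never appear in the tags.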

We want to put two entries into the ZZDPSETS table. We'll defer discussing exactly how to do that, and just look at what we want the entries to be. The set_names might just be 'Messier Image' and 'Messier Quicklook', with descriptions of 'Image and Flat Field for Messier Object' and 'Quick look data for Messier Object'. The table name is just 'heasarc_messier'. The hard part is deciding what to put in the set_description.

The set_description is a kind of regular expression for finding matching tags that can use the fields of the table. In our case for the Image set, we might use the string:

  messier.@{name}.img,messier.calib.flatfield

This says that we want to look for two possible kinds of tags. The first one substitutes the name field from the current row of the table to find the right tag. The second one is constant. Every image data set will look for the tag messier.calib.flatfield. It's possible to do wild card matches too. E.g., if Dr. Messier sent us three different images so that we had tags for

  messier.m1.img.red
  messier.m1.img.blue
  messier.m1.img.green

we could match all of them with a set description of the form:

  messier.@{name}.img.*,messier.calib.flatfield

This would match 4 files, and all four would be considered as part of the same dataset. Note the use of '*' as a wildcard match here.

The quicklook dataset would similarly have a set_description like:

  messier.@{name}.gif,messier.@{name}.phot
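The matching described here can be mimicked in a few lines of Python. This is a sketch of the idea rather than the Browse implementation: it assumes the only special syntax is @{field} substitution and '*' wildcards, and it lower-cases the substituted value on the assumption that a name stored as 'M1' must match tags written with 'm1'.

```python
import re
from fnmatch import fnmatchcase

def expand_set(set_description, row, tags):
    """Return the tags selected by a set description for one table row."""
    selected = []
    for pattern in set_description.split(","):
        # Substitute @{field} with the row value (lower-cased, an assumption).
        pattern = re.sub(r"@\{(\w+)\}",
                         lambda m: str(row[m.group(1)]).lower(), pattern)
        for tag in tags:
            if fnmatchcase(tag, pattern) and tag not in selected:
                selected.append(tag)     # keep a single copy of each tag
    return selected
```

Each comma-separated pattern is expanded independently, and a tag that matches more than one pattern is returned only once.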

The details of the standard procedure for entering these descriptions in the database are given below in Step 12.

Step 9: Linking Catalogs

Linking of catalogs is done through entries in the ZZLINK table. Currently there is no standard task for updating ZZLINK entries, so it is best done by creating a small SQL script that does the update.

For our new table we want to create two links. The first is to the WGACAT table, where we want to find all WGACAT sources within 30 arcminutes of the Messier object. The second is to the ROSMASTER table, where we are going to ask for all observations where the target name is identical to the name of the Messier object.

These aren't especially good links: WGACAT objects have little a priori relationship with Messier objects, and even if a ROSAT observation were of a Messier object, it is likely that the name would not have exactly the same (or even a similar) format. However, these two illustrate common approaches to linking tables.

To create the link we need to know the table we are linking from, the table we are linking to, the criterion for the link, and two bits of description: something to display for the user to click on when they want to link, and a text description of the link. In Browse this text description is displayed if you hover over the link.

So the SQL file for these two links might look like:

    delete from zzlink where table_name='heasarc_messier'

    insert into zzlink (table_name, link_table_name, link_priority, link_symbol,
           link_criterion, link_description)
      values (
        'heasarc_messier', 'heasarc_wgacat', 1, 'W',
        'cone: heasarc_messier.ra,heasarc_messier.dec,30', 'Nearby WGACAT sources')

    insert into zzlink (table_name, link_table_name, link_priority, link_symbol,
           link_criterion, link_description)
      values (
        'heasarc_messier', 'heasarc_rosmaster', 2, 'R',
        'target_name = heasarc_messier.name', 'ROSAT observations of this source')

Note that the script first ensures that any old link definitions for the table are removed. Then it adds the link with the WGACAT table. This has a link priority of 1 -- that controls the order of display when multiple links are present. It has a link_symbol of 'W', so the user will see a 'W' as the symbol to click on to go to the linked WGACAT resources. The link description of 'Nearby WGACAT sources' can be used by programs to tell the user what kind of data is being linked to.

The interesting part is the link_criterion. The first link uses a special cone search syntax. It says that we want each row of the Messier table to link to all the rows in the WGACAT table that are within 30' of the RA and Dec specified in the Messier table.

The second link is very similar, but it uses a straightforward SQL join condition to find the rows of interest. The row in ROSMASTER must have an identical target name to the Messier name to be linked.

Generally, in building an SQL style link criterion, the fields that are from the source table (heasarc_messier) should be qualified with the table name. The fields that are in the destination table are not qualified.
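To see how such a criterion behaves as a join condition, the following Python sketch runs the second link's criterion against two tiny stand-in tables in SQLite. The column lists, the seq_id column, and the sample rows are made up for illustration, and the way Browse actually assembles its query from link_criterion is an assumption here.

```python
import sqlite3

# Stand-in tables with made-up rows; real column lists are much longer.
link_db = sqlite3.connect(":memory:")
link_db.executescript("""
    create table heasarc_messier (name char(6), ra float, dec float);
    create table heasarc_rosmaster (target_name char(20), seq_id char(12));
    insert into heasarc_messier values ('M1', 10.3, 10.3);
    insert into heasarc_rosmaster values ('M1', 'obs-a');
    insert into heasarc_rosmaster values ('CRAB NEBULA', 'obs-b');
""")

def linked_rows(connection, source, destination, criterion, source_name):
    """Rows of destination linked to one source row by an SQL criterion."""
    query = (f"select {destination}.* from {source}, {destination} "
             f"where {criterion} and {source}.name = ?")
    return connection.execute(query, (source_name,)).fetchall()

matches = linked_rows(link_db, "heasarc_messier", "heasarc_rosmaster",
                      "target_name = heasarc_messier.name", "M1")
```

The unqualified target_name resolves to the destination table because the source table has no column of that name, while heasarc_messier.name is qualified to make its origin explicit. Only the row whose target name is exactly 'M1' is returned, which also shows why an exact name match makes a fragile link.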

Once a script is written it should be executed using the SQL commands provided by the database system.

Recently, the HEASARC has created a new internal tool to edit and populate links in ZZLINK. This tool is called zzlink, and it is available in /heasarc/bin/ on heasarcdev.gsfc.nasa.gov, dbms1.gsfc.nasa.gov, and dbsm2.gsfc.nasa.gov. This tool is now the preferred procedure for performing this step.

Step 10: Writing the Documentation

Ideally, this has not been left so late in the development process. Documentation is essential for effective use of the table. A suggested skeleton for the format of the documentation is given here, and the developer should consult other documentation. The documentation files may be found in the W3Browse area of the Web server. Tools for automatically generating documentation, or at least document stubs from table descriptions, are available.

Here is an example:

<HTML>
<HEAD>
<TITLE>MESSIER - Messier Catalogue</TITLE>
</HEAD>
<BODY BGCOLOR="#CCEEFF">
<CENTER><H1>MESSIER - Messier Catalogue</H1></CENTER>
<hr>
<H2><a name="overview">Overview</a></H2>
The Messier Catalog of bright, extended objects is being compiled by the
comet-hunter Charles Messier in the 18th century.
<p>

<hr><H3><a name="parameters">Parameters</a></H3>
<p>
<b><a name="name">Name</a></b><br>
  The Messier Catalog designation.
<p>
<b><a name="ra">ra</a></b><br>
  Right Ascension in J2000 coordinates.
<p>
<b><a name="dec">dec</a></b><br>
  Declination in J2000 coordinates.
<p>
<b><a name="ra_1775">ra_1775</a></b><br>
  Original Right Ascension in B1775 coordinates.
<p>
<b><a name="dec_1775">dec_1775</a></b><br>
  Original Declination in B1775 coordinates.
<p>
<b><a name="size">Size</a></b><br>
  The Dimension of the source.  Angular size in degrees.
<p>
<b><a name="magnitude">Magnitude</a></b><br>
  The visual magnitude of the object.
<p>
<b><a name="type">Type</a></b><br>
  The original classification as given by Messier.
<p>
<b><a name="class">Class</a></b><br>
  BROWSE classification type.  The classification is based on the type
  parameter, if one is available.

<hr><H3><a name="dataproducts">Data Products</a></H3>
<dl>
<dt>
<a name=quicklook> Quicklook data products. </a>
<dd> GIF images and ASCII description of the object.
<dt><a name=image>Image </a>
<dd>Image and flat field.
</dl>


<hr><H3><a name="contact_person">Contact Person</a></H3>
Questions regarding the MESSIER database table can be addressed to the
<a href="/cgi-bin/Feedback">HEASARC User Hotline</a>.

<!--#include virtual="/W3Browse/.misc/w3browse-help-footer.html"-->
</body>
</html>

As mentioned above, the one feature that the Browse Web interface expects HTML help documents to have is an internal anchor for each of the columns. The Browse system will link to the table documentation with the column name used as the fragment identifier to jump directly to the appropriate location in the document. The lack of these internal anchors will not break the system, but will make accessing the table documentation less convenient.
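A quick way to confirm that a help document has an anchor for every column is to parse it and compare the anchor names with the column list. The Python sketch below uses only the standard library; the function name missing_anchors is hypothetical, and the check is not part of the HEASARC tool set.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the name attribute of every <a name=...> tag."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "name" and value:
                    self.names.add(value)

def missing_anchors(html_text, columns):
    """Return the columns that have no matching named anchor."""
    collector = AnchorCollector()
    collector.feed(html_text)
    return [column for column in columns if column not in collector.names]
```

Anchor names are compared exactly, so an anchor called ra1775 does not satisfy a column called ra_1775.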

Step 11: Handling New Missions

Occasionally a new table will require a change to the static HTML pages (or page templates) used in Browse. This happens most commonly when a table is for a new mission or possibly a new wavelength regime. If a new table is to be directly linked from the initial Browse page, then the static HTML used there needs to be updated.

New missions may also wish to build their own customized interface to Browse. This involves writing custom HTML that invokes the same CGI script and provides the same parameters as the standard Browse page. However, the customized page need not include tables or missions that are not directly relevant, and it can expose on the initial page parameter search capabilities that are normally buried deep within the Browse environment.

In most cases, including our Messier table, no changes are required here.

Step 12: Scanning the Archive

In Step 8 we discussed the design of the data products, but we did not describe how the ZZDPSETS and ZZDP tables are populated. The procedure is somewhat complex and is detailed here. The ingest of data into the ZZDPSETS and ZZDP tables is currently done with the build-zzdp.pl script, which in turn is controlled by a configuration file. The build-zzdp.pl script reads the configuration file, updates the ZZDPSETS and ZZEXT tables as described below, and then scans the HEASARC archive and creates data product tags as needed.

The minimal configuration file for the data products we defined above might look like:

#
# Directory prefixes for URLs and local file system access:
#
url_prefix = ftp://heasarc.gsfc.nasa.gov
dir_prefix = /FTP
#
#
# Directory shortcut:
#
${messier} = ${prefix}/messier/data
#
#
# Data products sets:
#
set[heasarc_messier(image)]        = messier.@{name}.img,messier.calib.flatfield  // Image for Messier Data
set[heasarc_messier(quicklook)]    = messier.@{name}.gif,messier.@{name}.phot     // Quicklook Messier Data
#
# Messier data products tags:
#
tag[messier.{%1}.img]       = ${messier}/(.+)/(.*)\.img // Messier Image (FITS)
tag[messier.{%1}.gif]       = ${messier}/(.+)/(.*)\.gif // Messier Quicklook Image (GIF)
tag[messier.{%1}.phot]      = ${messier}/(.+)/(.*)\.phot // Messier Photometry Analysis
tag[messier.calib.flatfield] = ${messier}/calib/flatfield.fits // Flatfield for Messier Images (FITS)

The configuration file has a complex syntax with four distinct elements that we now discuss in detail. Comment lines may be included by starting a line with the '#' character.

The first two non-comment lines describe where the root of the HEASARC archive is located. The root is given both as a URL and as a directory in the local file system. Within Browse the URL will normally be used, but the build-zzdp.pl command scans the archive through the local file system and can transform a local file name into a URL simply by replacing the initial directory given in the dir_prefix line with the base URL given in the url_prefix line. E.g., in the example given above the file /FTP/messier/data/m1/Q02X123.img can be accessed through the URL ftp://heasarc.gsfc.nasa.gov/messier/data/m1/Q02X123.img. The value of url_prefix is stored in ZZEXT as a shortcut with the name prefix. All local data products will use this prefix value as the base location for building data product URLs.
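
The prefix substitution can be illustrated with a short Python sketch. This is not the build-zzdp.pl code; the function name local_path_to_url is hypothetical, and the two constants simply repeat the values from the sample configuration file.

```python
URL_PREFIX = "ftp://heasarc.gsfc.nasa.gov"  # url_prefix from the sample file
DIR_PREFIX = "/FTP"                         # dir_prefix from the sample file


def local_path_to_url(path, dir_prefix=DIR_PREFIX, url_prefix=URL_PREFIX):
    """Replace the leading dir_prefix of a local path with url_prefix."""
    if not path.startswith(dir_prefix + "/"):
        raise ValueError(f"{path!r} is not under {dir_prefix!r}")
    return url_prefix + path[len(dir_prefix):]


print(local_path_to_url("/FTP/messier/data/m1/Q02X123.img"))
```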

The next set of lines to be discussed are those that begin with a '$'. These are directory shortcuts that will be used (along with the prefix shortcut described in the previous paragraph) to make it easier to define the URLs that will be placed in the ZZDP table. In our example there is only one of these, which defines a shortcut for messier. This shortcut is defined in terms of the prefix shortcut. Essentially it says that when we see the shortcut ${messier} in a file name or URL, it is to be replaced by the directory /FTP/messier/data if we are looking at local files, or by ftp://heasarc.gsfc.nasa.gov/messier/data if we are addressing the file as a URL. Shortcuts are stored in ZZEXT with a table_name of 'zzdp', the parameter_name as the shortcut, and the parameter value as the shortcut value. Note that shortcuts can be (and usually are) defined using other shortcuts. Shortcuts are not resolved until a file name is actually needed by either Browse or the build-zzdp.pl command.
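
Because shortcuts may be defined in terms of other shortcuts, resolving one means substituting repeatedly until no ${...} reference remains. The Python sketch below shows one way this could work. The function name expand, the two dictionaries and the depth limit are illustrative assumptions and do not describe how build-zzdp.pl or Browse actually implement it.

```python
import re

SHORTCUT = re.compile(r"\$\{(\w+)\}")

# Shortcut tables for local access and for URL access (values from the example).
LOCAL = {"prefix": "/FTP", "messier": "${prefix}/messier/data"}
REMOTE = {"prefix": "ftp://heasarc.gsfc.nasa.gov", "messier": "${prefix}/messier/data"}


def expand(text, shortcuts, max_depth=10):
    """Substitute ${name} references until none remain (hypothetical helper)."""
    for _ in range(max_depth):
        if not SHORTCUT.search(text):
            return text
        # An undefined shortcut raises KeyError here.
        text = SHORTCUT.sub(lambda m: shortcuts[m.group(1)], text)
    if SHORTCUT.search(text):
        raise ValueError("shortcut definitions appear to be circular")
    return text


print(expand("${messier}/m101/xyzzy.img", LOCAL))
print(expand("${messier}/m101/xyzzy.img", REMOTE))
```

Keeping the stored value unexpanded, as the text describes, means the same entry can serve both local scans and URL generation simply by switching the value of prefix.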

The next set of lines is how we populate the ZZDPSETS table. Every line that begins with 'set' corresponds to one row in ZZDPSETS. The table_name and set_name fields are to the left of the '=' sign, in the form set[table_name(set_name)]. The tag_format immediately follows the '=' and the set_description follows the '//'. The appropriate values for these fields in our example were discussed above in Step 8.
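
As an illustration of how a set line maps onto the four ZZDPSETS fields, here is a minimal Python sketch. The regular expression and the function name parse_set_line are assumptions made for this example; the real parser in build-zzdp.pl may accept a wider syntax.

```python
import re

SET_LINE = re.compile(r"^set\[(\w+)\((\w+)\)\]\s*=\s*(.*?)\s*//\s*(.*)$")


def parse_set_line(line):
    """Split one 'set' line into ZZDPSETS field values, or return None."""
    match = SET_LINE.match(line.strip())
    if match is None:
        return None
    table_name, set_name, tag_format, set_description = match.groups()
    return {
        "table_name": table_name,
        "set_name": set_name,
        "tag_format": tag_format,
        "set_description": set_description,
    }


line = ("set[heasarc_messier(image)]        = "
        "messier.@{name}.img,messier.calib.flatfield  // Image for Messier Data")
print(parse_set_line(line))
```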

The last set of lines begins with 'tag'. These are the lines that control the scan of the archive that is used to populate the ZZDP table. There are four of these lines in our sample file: one for each type of file in the Messier archive. Each line is of the form:

tag[tag_expression] = file_match // Description (format)

The description and format will be used to populate the ZZDP dp_type and dp_format fields. The file_match is a Perl regular expression. build-zzdp.pl will expand any and all shortcuts found in the file_match to generate a full file path. E.g., for the first tag entry the file_match is expanded to /FTP/messier/data/(.+)/(.*)\.img. The program then checks to see if there is any file that matches this regular expression. Regular expressions can be quite complex, and building these matches is one of the trickiest parts of ingesting new data. In this case note that the role of '*' is rather different in Perl regular expressions than in wildcard matching -- it means 0 or more repetitions of whatever immediately precedes it. The '+' has a similar meaning, but it requires that there be at least 1 match -- 0 is not allowed. The period '.' is special: it matches any single character. So to match the literal '.' before the file type, we need to escape it with a backslash. The parentheses in the regular expression are used to group elements. Grouping may be needed to apply '*' or '+' to more than one character, but the parentheses also play a key role in parsing the file path, as discussed below.

This regular expression will match a file with a path like /FTP/messier/data/m101/xyzzy.img, but it will not match /FTP/messier/data/m101/xyzzy.gif since the file type does not match.

Perl supports the concept of back-references in regular expressions. In Perl these use the variables $1, $2, .... Back-references are used within build-zzdp.pl, but with the slightly different syntax {%1}, {%2}, ... The idea of a back-reference is to look at whatever was matched by the n'th set of parentheses in the regular expression. E.g., our first tag definition has two sets of parentheses. So for /FTP/messier/data/m101/xyzzy.img we find that {%1} has the value m101 and {%2} has the value xyzzy. One set of parentheses may be enclosed in another. The order of back-references is based upon the order of the left parentheses. Thus by careful use of parentheses we can pick out any part of the file path.

These parts of the file path are inserted into the tag expression at the beginning of the line. We now have a tag, URL, description and format and are ready to insert an entry into ZZDP. E.g., after the match that we have discussed the program does the equivalent of:

  insert into zzdp (dp_tag, dp_url, dp_type, dp_format)
    values ('messier.m101.img', '${messier}/m101/xyzzy.img', 'Messier Image', 'FITS');
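
The match-and-substitute step described above can be sketched in Python, whose regular expression syntax agrees with Perl's for the constructs used here. This is only an illustration: the function name make_tag is hypothetical, and the use of a whole-path match (re.fullmatch) is an assumption about how build-zzdp.pl anchors its patterns.

```python
import re


def make_tag(tag_expression, file_match, path):
    """Return the tag for path, or None if path does not match file_match.

    {%n} in tag_expression is replaced by the text captured by the
    n'th set of parentheses in file_match.
    """
    match = re.fullmatch(file_match, path)
    if match is None:
        return None
    return re.sub(r"\{%(\d+)\}",
                  lambda m: match.group(int(m.group(1))),
                  tag_expression)


pattern = r"/FTP/messier/data/(.+)/(.*)\.img"   # expanded file_match
print(make_tag("messier.{%1}.img", pattern, "/FTP/messier/data/m101/xyzzy.img"))
print(make_tag("messier.{%1}.img", pattern, "/FTP/messier/data/m101/xyzzy.gif"))

# Nested parentheses are numbered by the position of the left parenthesis.
print(re.fullmatch(r"((\w)\d+)", "m101").groups())
```

The first call yields the tag used in the insert statement above, the second yields no tag because the file type differs, and the last line shows the outer group numbered before the inner one.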

The build-zzdp.pl script scans the entire archive, matching all of the regular expressions against the files it finds. Periodically, the entire HEASARC archive is scanned for new data products using this tool.

The build-zzdp.pl command is normally run as a system service. New shortcut, tag and set entries can be debugged and are then added to the system file.

Remote Data Products

The build-zzdp.pl tool cannot build data products that are not local to the HEASARC. Such data products can be built in two ways: prospectively or through custom tools. If there is special knowledge of or access to the remote system, then tools to build the ZZDP entries can be configured for a given mission or dataset.

Often prospective data products may be built using information in the table from which the dataset is to be created. These are usually populated into the Data Products Layer using SQL statements. For example, suppose the remote system has a preview mechanism that we wish to link to, and the URL to get a preview is http://remote_url?sometstuff&dataset=dataset_id, where the dataset ID is a field stored in the HEASARC table. We can then create data product tags pointing to the remote URL for each distinct entry in the HEASARC table. The remote system may not have a preview for every dataset, in which case Browse will occasionally link to a remote URL that does not work. This may be an acceptable cost, but it should be noted in the table documentation.
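
A prospective build of this kind amounts to generating one tag and URL pair per distinct dataset ID. The Python sketch below illustrates the idea under stated assumptions: the base URL http://remote.example/preview, the tag pattern remote.<id>.preview and the function name prospective_products are all hypothetical, and in practice the rows would normally be produced with SQL as noted above.

```python
from urllib.parse import urlencode


def prospective_products(dataset_ids, base_url="http://remote.example/preview"):
    """Build (dp_tag, dp_url) pairs, one per distinct dataset ID.

    The tag pattern and base_url are placeholders for illustration only.
    """
    seen = set()
    rows = []
    for dataset_id in dataset_ids:
        if dataset_id in seen:
            continue  # only one data product per distinct ID
        seen.add(dataset_id)
        url = f"{base_url}?{urlencode({'dataset': dataset_id})}"
        rows.append((f"remote.{dataset_id}.preview", url))
    return rows


for tag, url in prospective_products(["a1", "a1", "b2"]):
    print(tag, url)
```

Nothing in this approach checks that the remote preview exists, which is exactly the limitation described in the preceding paragraph.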

Table Building Utilities

HDBingest

HDBingest (/heasarc/hrcdba/bin/HDBingest on dbms1.gsfc.nasa.gov and dbms2.gsfc.nasa.gov) accepts a file in the Transportable Database Aggregate Table (TDAT) format and creates a table ready for inclusion in the HEASARC database.

HDBexgest

HDBexgest (/heasarc/bin/HDBexgest on dbsrv.gsfc.nasa.gov) will export a HEASARC database table to an ASCII file of the same format that HDBingest uses.


Documentation prepared by the HEASARC Database Group

Last modified: Wednesday, 20-Oct-2021 11:16:38 EDT
