Developing a Federated NASA Astrophysics Archive

1. Introduction

This paper describes a proposal for providing a unified NASA Astrophysics data archive system. It describes integration of the resources of the NASA Infrared Science Archive (IRSA), the NASA Extragalactic Database (NED), the High Energy Astrophysics Science Archive Research Center (HEASARC), the Multi-mission Archive at Space Telescope (MAST), the Chandra X-ray Observatory Science Center (CXC), the Astronomy Data Center (ADC), and the Astrophysics Data System (ADS). The proposal addresses actions that these centers can take immediately, in the context of the developing Virtual Observatory frameworks, to facilitate user access to the full breadth of NASA's astrophysics resources. However, this short-term effort is not intended to provide the full level of integration of data archives envisaged in the NVO.

The following suggests two distinct activities. The first is a relatively straightforward cross-linking of our archive resources: users querying at any one of our sites can go directly to query the other sites. Technically this is a fairly trivial operation, and the maintenance issues can be managed using the same approach as Astrobrowse.

The second activity is more complex but begins to allow true integration of our resources. Each site will provide metadata and query results in a common format. Given trends both in industry and astronomy, we suggest one of the emerging XML standards. Our user interfaces can then be built to query these remote resources and provide results to users in a fashion where the distributed nature of the query is transparent. The structure of the user interfaces would be specific to each of the sites.

2. Cross Linking Searches

The first stage in providing an integrated archive system is cross-linking of the major query tools at each site.
Currently all of our sites use a common paradigm in which the user specifies a set of criteria describing the information they are interested in, and the system then returns a 'tabular' listing of the information available. If there are additional information or archive resources associated with a given row in the results, the system allows the user to extract them in subsequent steps.

While there are many criteria that a user can specify to get information from a site, a few are very commonly used, notably position, target name and time of observation. As part of the query results, each site will provide a cross-link to the query systems of the other archive centers with at least this information filled in, where it was provided by the user and is used by the other site.

E.g., suppose a user queries MAST for HST WFPC2 observations of 3C273 prior to 1995. Currently the user gets a table listing 5 observations. With the archive site cross-linking there would also be a link to query the other archive sites for information at the same position and time. So if a user is interested in Chandra observations of the same object, they can follow the link to the CXC. In this way MAST becomes a portal to all of the NASA sites.

The following pages are suggested as the initial portal destinations for each site.
Each of these pages should be modified as needed to allow the user to bring up the form with the indicated fields filled in:

  archive.stsci.edu/index.html                          target name/position
  heasarc.gsfc.nasa.gov/cgi-bin/W3Browse/w3browse.pl    target name/position, time
  ned.ipac.caltech.edu/forms/nearposn.html              position
  ned.ipac.caltech.edu/forms/byname.html                name
  cda.harvard.edu:9011/chaser/mainEntry.jsp             name or position
  irsa.ipac.caltech.edu/applications/Gator/             target name/position [1]
  adc.gsfc.nasa.gov/viewer/                             target name/position [1]
  adsabs.harvard.edu/abstract_service.html              target name

The result pages for a query at each NASA site (i.e., from one of these portal pages or one of the closely related pages at the site) should include a link to all the other NASA sites. The ADEC web site will provide a single page that could be used as the destination address. For example, IRSA results might have a link to adec.gsfc.nasa.gov/arclinks.html, which would in turn link to the appropriately filled-in forms at each site. Alternatively, a Web page may provide direct cross-links.

The portals used at each site should not change frequently; however, GLU/Astrobrowse entries for both the ADEC linking page and the ultimate destination pages for each NASA site will be developed. Using GLU, each site will be able to move or modify its portal and propagate the changes to other sites automatically. Sites can also choose to manage changes manually.

Note that the links should not lead directly to the results pages of queries at the other sites; rather, clicking on a link will put the user in a filled-out form. The user will then need to press 'SUBMIT' or some equivalent. In our example, the user may wish to remove some of the requirements from the query. The restriction to observations before 1995 would make a CXC query rather uninteresting! Or they might add specific requirements that are appropriate only to the second query, e.g., restrict the Chandra observations to the ACIS instrument.
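A pre-filled cross-link of the kind described above could be generated along the following lines. This is a minimal sketch, not the interface of any archive: the parameter names (`target`, `position`, `time`) and the placeholder address are assumptions made for illustration, and each site would have to map them onto the fields of its own form. Only criteria that the user supplied and that the destination form can use are passed along, and the resulting link opens a form that the user still has to submit.

```python
from urllib.parse import urlencode


def build_cross_link(form_url, criteria, supported):
    """Return a URL that opens a remote query form with fields pre-filled.

    form_url  -- address of the destination form (hypothetical here)
    criteria  -- what the user supplied at the originating site
    supported -- field names the destination form understands
    """
    # Pass along only criteria the user gave AND the other site can use.
    params = {name: value for name, value in criteria.items()
              if name in supported and value not in (None, "")}
    if not params:
        return form_url
    return form_url + "?" + urlencode(params)


# Hypothetical example: the user searched by target name and time,
# but the destination form accepts only a name or a position.
user_query = {"target": "3C273", "time": "<1995", "instrument": ""}
link = build_cross_link("http://archive.example/form", user_query,
                        supported={"target", "position"})
print(link)
```

Filtering against the destination's supported fields keeps the originating site from sending constraints the other form would ignore or misread.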
Once the user has followed a cross-link, the user is in the interface of the cross-linked site, not the originating site. The cross-link portals should ensure that the user is informed of this transition.

3. Standard query results

While cross-linking of our archives will help users to get to our data archives, it still leaves the user with a very scattershot view of them. To provide integrated access to our archives, we propose that NASA archives support a common format for query results. The Astrores XML format (or its recent enhancement as VOTable) seems well adapted to our needs.

To develop an interface to a catalog/archive system we need three levels of information: metadata describing the tables to be queried, query results, and information regarding data products associated with particular query results. The Astrores format adequately describes the first two. A prototype for the third is in use at the HEASARC.

Metadata. Each site would provide a link that would allow users to retrieve just the metadata information for a specified table. There might also be a link providing the list of tables available, for systems like the ADC where there are many tables. The formats for these metadata queries should be standardized among all sites. As with the portals, GLU may be used to allow local management of the URLs for these queries with automatic propagation to the other ADEC sites.

Query data. The key capability is to query and get the result in XML/Astrores. Sites could decide to make this an option on the existing Web pages, or provide a new page for this format. The specific URL syntax will be provided in metadata queries. The metadata will describe at least how to query by position/name and/or time.

Data Products.
In most of our sites, data products are available which correspond to given 'rows' in the results: archive products for the observation tables of MAST, IRSA, CXC and the HEASARC; images and other datasets for NED; journal articles from the bibliographic tables of the ADS. Sites that support data products should also support an Astrores-style query to retrieve them. The table metadata should describe the data product types available for a table, and which columns in the table are needed to get the data products. The user specifies the table name, the data product type, and the values of the key columns to get back a data products table which gives URLs to the data products.

4. User interfaces

Providing output in XML does not itself address the issue of integration of data. Substantial work is needed to develop user interfaces that can access these XML resources. Some generic query tools for XML databases do exist, but we anticipate that several sites will want to provide links to other sites using the XML output format.

Today STScI dynamically queries the HEASARC archive to provide a ROSAT page within MAST. This uses relatively fragile links to an ASCII output format. The HEASARC uses a similar mechanism to periodically update its Chandra tables from the CXC. Use of XML formats will make this kind of transfer much more robust. The HEASARC has recently been able to support queries to all VizieR tables within the context of its own interface by using VizieR's Astrores output option.

The catalog interfaces of MAST, the HEASARC, the ADC and IRSA are all well suited to providing query capabilities to remote data using the XML output. Essentially they can just add more tables to their existing lists. With the rapid development of XML tools both in the general community and within astronomy as part of the VO effort, these sites can add remote queries within their existing frameworks.
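To illustrate how an existing catalog interface might ingest a remote result, the sketch below reads a small table laid out in the VOTable style, with FIELD elements describing the columns and the rows held in TABLEDATA/TR/TD. It assumes a simplified document without XML namespaces; the sample table, its name and its column names are invented for the example, and Astrores output may be laid out differently and would need its own reader.

```python
import xml.etree.ElementTree as ET

# Invented sample in a simplified VOTable-style layout (no namespaces).
SAMPLE = """<VOTABLE>
  <RESOURCE>
    <TABLE name="example_obs">
      <FIELD name="target" datatype="char"/>
      <FIELD name="ra" datatype="double"/>
      <FIELD name="dec" datatype="double"/>
      <DATA><TABLEDATA>
        <TR><TD>3C273</TD><TD>187.2779</TD><TD>2.0524</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def read_table(xml_text):
    """Return (column names, rows as dicts) for the first TABLE found."""
    root = ET.fromstring(xml_text)
    table = root.find(".//TABLE")
    # Column names come from the FIELD metadata, in document order.
    names = [field.get("name") for field in table.findall("FIELD")]
    rows = []
    for tr in table.iter("TR"):
        # Cell values are kept as strings; empty cells become "".
        cells = [td.text or "" for td in tr.findall("TD")]
        rows.append(dict(zip(names, cells)))
    return names, rows


names, rows = read_table(SAMPLE)
print(names)
print(rows[0]["target"])
```

Because the column names are taken from the metadata rather than hard-coded, the same reader can serve any remote table that follows this layout, which is what lets a site "just add more tables" to its existing list.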
NED, the CXC and ADS, with their narrower focus on a single mission or view of data, may not choose initially to build in dynamic queries of remote tables. However, it is certainly easy to envisage applications in which such a query could augment their current services. This would allow NED to easily link to source observations for objects. The CXC could choose to add a few critical foreign catalogs and archive resources. If the current effort to provide source observation information to the ADS is successful, the availability of a common interface to NASA archives will make it easy to link back from journal papers to the observations and data on which they are based.

User interfaces that integrate results from remote sites will need to be careful to inform the user of the provenance of the data. 'Foreign' data should be prominently marked to indicate its origin. E.g., in the current HEASARC interface to VizieR, all data from VizieR is colored differently from the HEASARC data.

[1] Currently requires the user to choose a catalog prior to specifying selection criteria. The position should be held and passed on to the next page where the user specifies selection criteria, e.g., as a hidden parameter.
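The hidden-parameter approach mentioned in the footnote above can be sketched as follows. This is only an illustration: the field name `position` and the way the value is formatted are assumptions, not the markup actually used by IRSA or the ADC.

```python
from html import escape


def hidden_field(name, value):
    """Render an HTML hidden input that carries a value to the next page."""
    # Escape both parts so quotes or angle brackets in user input
    # cannot break out of the attribute.
    return '<input type="hidden" name="{}" value="{}">'.format(
        escape(name, quote=True), escape(value, quote=True))


# Hypothetical: a position supplied before the catalog-selection page.
print(hidden_field("position", "12 29 06.7 +02 03 09"))
```

The catalog-selection page would embed this field in its form, so that the position arrives on the selection-criteria page without the user having to retype it.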