SkyView as an Archetype of Archival Systems of the Future
T.A. McGlynn (HEASARC/USRA), N.E. White (HEASARC), &
K.A. Scollick (CSC)
We discuss the SkyView virtual observatory and how it represents a new approach to the archiving of astronomical information. This approach is essential if astronomers are to effectively use the vast information resources that are now coming online.
SkyView takes all-sky or large-area surveys in wavelengths ranging from radio to gamma rays and provides the data in a convenient, easy-to-use form. Astronomers need not concern themselves with the geometric issues of projections, coordinate systems, and resampling -- these are addressed automatically and are largely invisible to the user. Rather than simply giving users a copy of observations as they were taken (the conventional archive approach), SkyView transforms these data into a form which allows astronomers to immediately begin addressing the astronomical questions that interest them.
Since its introduction, SkyView has been very successful. In its first year we anticipate users connecting from over 10,000 distinct addresses, and users of the preliminary version of the system already retrieve about 100 images per day. With the wide availability of Mosaic to the public and the simplicity of the SkyView interface, it is used not only by professional astronomers but also by interested members of the public.
In the next decade the availability of astronomical data to astronomers will grow at an unprecedented rate. New sources are coming on line, e.g., the terabytes of the Sloan Sky Survey. Simultaneously, the availability of increased network connectivity and cheap distribution by CD makes existing resources far easier to access. To deal with this exciting but potentially bewildering array of information, our community must begin to build interfaces of a new kind, with an intrinsic understanding of basic astronomy. The current generation of archival interfaces, itself a technology only a decade old, is based on a paradigm of atomic units of information, like books in a library. In the era we are entering, new ideas which recognize the malleability of digital information are essential. The next generation of archives will enable users to retrieve information in forms directly relevant to their immediate research goals, rather than requiring a series of tedious and often mechanical steps before the users can begin to do astronomy.
Public archives of astronomical information have undergone enormous changes in the past decade, and this rate of change is certain to continue. The first electronically accessible archives became available only about a decade ago (e.g., at the International Ultraviolet Explorer Observatory at GSFC). Prior to that, services like the National Space Science Data Center (NSSDC) supported (and still support) retrieval of requested information by shipping out tapes and other physical media.
Since the first electronic archives came online, they have become increasingly sophisticated. Starting from little more than tables of contents, archives now are indexed using sophisticated databases like HST StarView which uses a relational database with hundreds of different tables.
The ability to distribute data electronically has also grown enormously. The steady growth of network capacity has culminated in the last year in the explosion of the World Wide Web, which has made available more resources than any one person could possibly deal with. Data centers can now distribute data using standard mail or FTP protocols, or can bring up client-server models which use some internal distribution protocol.
These later developments have not obviated the need for the simpler archives. Just as Fortran did not replace assembly language, and 4GLs are not replacing Fortran and C but supplementing them, this growth in sophistication has led to archives at many levels of sophistication, each appropriate to particular purposes. In this paper we discuss a general classification of existing archive systems into three categories and discuss a new kind of archive that is just beginning to become available. We use SkyView as an example of these fourth-generation archives.
What is SkyView?
SkyView is a network service to allow astronomers to make a virtual observation of the sky using existing all-sky and large area surveys. If an astronomer wishes to make an observation of M31 in the infrared, he or she asks the system for an image of a particular region of the sky and specifies the kind of data wanted. For example, the IRAS data is distributed on CD-ROM in B1950 coordinates, but a user may wish the data in J2000. Similarly the user may want a different scale than the default, or perhaps wishes to view some large region of the sky which requires mosaicking several of the distribution images together. SkyView addresses these and other geometric issues and immediately gives the user the needed data.
In many cases the astronomer would be perfectly capable of doing the manipulations that SkyView performs on the data. However, having to do these -- and having to deal with a different set of manipulations for every type of data in a multi-wavelength investigation -- sets up a serious barrier for astronomers in using this information. SkyView lets the astronomer get a quick look at the situation immediately.
SkyView allows the astronomer to view the data and can also create FITS files which the astronomer can use for further analysis. The system has extensive capabilities for manipulating the image and color tables, for overlaying images, for contour mapping, and for performing overlays of astronomical catalog sources. While these are very useful, the heart of the system and what distinguishes SkyView is its geometry engine.
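The kind of geometric bookkeeping such an engine hides from the user can be illustrated with a single map projection. The following is a minimal sketch of the standard gnomonic (tangent-plane) projection and its inverse, not SkyView's actual code; the function names are hypothetical, and the position used for M31 is approximate.

```python
import math


def gnomonic_forward(ra_deg, dec_deg, ra0_deg, dec0_deg):
    """Project a sky position onto the plane tangent at (ra0, dec0).

    Returns (xi, eta) standard coordinates in degrees. Function name and
    signature are illustrative assumptions, not SkyView's API.
    """
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    ra0, dec0 = math.radians(ra0_deg), math.radians(dec0_deg)
    # Cosine of the angular distance from the tangent point.
    cos_c = (math.sin(dec0) * math.sin(dec)
             + math.cos(dec0) * math.cos(dec) * math.cos(ra - ra0))
    if cos_c <= 0.0:
        raise ValueError("position is 90 degrees or more from the tangent point")
    xi = math.cos(dec) * math.sin(ra - ra0) / cos_c
    eta = (math.cos(dec0) * math.sin(dec)
           - math.sin(dec0) * math.cos(dec) * math.cos(ra - ra0)) / cos_c
    return math.degrees(xi), math.degrees(eta)


def gnomonic_inverse(xi_deg, eta_deg, ra0_deg, dec0_deg):
    """Recover (ra, dec) in degrees from tangent-plane coordinates."""
    xi, eta = math.radians(xi_deg), math.radians(eta_deg)
    ra0, dec0 = math.radians(ra0_deg), math.radians(dec0_deg)
    denom = math.cos(dec0) - eta * math.sin(dec0)
    ra = ra0 + math.atan2(xi, denom)
    dec = math.atan2(math.sin(dec0) + eta * math.cos(dec0),
                     math.hypot(xi, denom))
    return math.degrees(ra) % 360.0, math.degrees(dec)


# Approximate J2000 position of M31, used only as an illustrative tangent point.
M31_RA, M31_DEC = 10.68, 41.27
xi, eta = gnomonic_forward(11.5, 42.0, M31_RA, M31_DEC)
ra, dec = gnomonic_inverse(xi, eta, M31_RA, M31_DEC)
```

A real service has to combine a projection like this with coordinate-system and epoch conversions and with resampling, once for every survey it serves, which is exactly the burden the text argues should be lifted from the user.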
The Types of Archives
We propose to classify archives into four categories which represent increasing levels of sophistication of the interface and abstraction of users from the data.
Level 1: Archive as Ordered Files
If the purpose of an archive is to provide some means of recovering information, then a random collection of files should not be classified as an archive. The lowest level of archive requires that some order is imposed upon files. A collection of telemetry tapes taken in time order, perhaps one tape per day, would represent this kind of simple archive. Even at this level there is some meta-information required for the archive, e.g., what is the format of the files, and what is the sense of ordering of the data. Many more advanced archives may be viewed as containing sets of level 1 archives.
Level 2: Archive with File Index
Beyond the simplest level, archives provide a mechanism which mediates between the data and the user's requests. The simplest mechanism is a file index. This is just a static list of the files included in the archive. Users can search this list and choose a set of files to retrieve. Many archives are of this form. The typical anonymous FTP archive uses the directory hierarchy to provide the file index. Many missions provide an observation index which includes direct pointers to the observation data. In using those elements alone, one has a level 2 archive.
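A level 2 archive can be pictured as nothing more than a fixed list that the user scans. The sketch below is an illustration under assumed names: the file paths, targets, dates, and the `search_index` helper are all invented for the example and do not describe any real mission index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    path: str    # direct pointer to the observation file
    target: str  # name of the observed object
    date: str    # ISO 8601 date, so string order matches time order


# Hypothetical static index; every entry here is made up for illustration.
FILE_INDEX = [
    IndexEntry("mission/1994/obs_0001.fits", "M31", "1994-01-12"),
    IndexEntry("mission/1994/obs_0002.fits", "Crab", "1994-02-03"),
    IndexEntry("mission/1994/obs_0003.fits", "M31", "1994-06-21"),
]


def search_index(index, target=None, since=None):
    """Return the paths of entries matching the given target and start date."""
    hits = []
    for entry in index:
        if target is not None and entry.target != target:
            continue
        if since is not None and entry.date < since:
            continue
        hits.append(entry.path)
    return hits


m31_files = search_index(FILE_INDEX, target="M31")
```

The index knows nothing about astronomy: it can match only the fields that were written into it, which is the limitation the next level addresses.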
Level 3: Archive with Database
The typical level of archive access that astronomers now demand is what we term a level 3 archive. While a level 2 archive has a static index of its contents, a level 3 archive has a database system which allows users to make queries about the contents of the archive. Thus users can pose a question in terms they understand, e.g., "what observations have you made of stars brighter than B = 5?", and the system will respond. Once the number of files maintained in the archive gets large and the number of different types of data multiplies, it becomes very difficult for many users to find data using a static index.
Typically two distinct elements are now present in the user's interaction with the archive: a series of queries of the archive database, and a separate retrieval process.
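The two elements can be sketched with an in-memory SQLite database: first a query phrased in astronomical terms (stars brighter than B = 5, i.e. with a numerically smaller magnitude), then a separate retrieval step. The table layout, star names, magnitudes, and file paths below are hypothetical, chosen only to make the example run.

```python
import sqlite3


def build_catalog():
    """Create a toy observation catalog. Schema and rows are invented."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "obs_id INTEGER PRIMARY KEY, target TEXT, b_mag REAL, file_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO observations (target, b_mag, file_path) VALUES (?, ?, ?)",
        [
            ("star_a", 3.2, "obs/0001.fits"),
            ("star_b", 7.8, "obs/0002.fits"),
            ("star_c", 4.9, "obs/0003.fits"),
            ("star_d", 5.0, "obs/0004.fits"),
        ],
    )
    conn.commit()
    return conn


def query_brighter_than(conn, b_limit):
    """Step 1: query the database. Brighter means a smaller magnitude."""
    cursor = conn.execute(
        "SELECT target, b_mag, file_path FROM observations "
        "WHERE b_mag < ? ORDER BY b_mag",
        (b_limit,),
    )
    return cursor.fetchall()


def retrieve(rows):
    """Step 2: a separate retrieval pass over the files the query selected."""
    return [file_path for _target, _b_mag, file_path in rows]


conn = build_catalog()
rows = query_brighter_than(conn, 5.0)
files = retrieve(rows)
```

Note that the files handed back in the second step are still whole, unmodified observations; only the search has become smarter.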
Level 4: Archive with Data Service
In moving from level 2 to level 3, the user's interaction with the archive catalog goes from dealing directly with the index to dealing with the index through a database intermediary which interprets the user's astronomical requirements. As we move to a level 4 archive a similar intermediary is established between the user and the archive data itself. In level 1-3 archives the system provides the user with data which is atomic -- unchangeable and indivisible. At these levels we may envisage the archive as a library which lends out books but is loath to rip out individual pages. A level 4 archive recognizes the intrinsic malleability of digital data and can extract elements from the various archive files for processing prior to delivery to the user.
SkyView as a Level 4 archive
The element that distinguishes SkyView as a level 4 archive is that it generates its products for the user dynamically. When a user requests an image the system determines the parameters of the request and extracts and manipulates data from the existing all-sky surveys. Then it creates an output product to the user specifications.
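As a rough illustration of generating a product on demand, the sketch below resamples a stored image onto a grid with a user-chosen size and pixel scale using nearest-neighbor sampling. It assumes both grids share the same center and orientation, which a real geometry engine would not; the function name and parameters are hypothetical, and the resampling method SkyView actually uses is not specified here.

```python
import math


def resample_nearest(image, src_scale, out_shape, out_scale, blank=None):
    """Resample `image` (a list of rows) onto a grid of `out_shape` pixels.

    Scales are in the same angular unit per pixel. Output pixels that fall
    outside the stored image are filled with `blank`.
    """
    ny_src, nx_src = len(image), len(image[0])
    ny_out, nx_out = out_shape
    result = []
    for i in range(ny_out):
        # Angular offset of this output row from the common center.
        dy = (i - (ny_out - 1) / 2.0) * out_scale
        src_row = math.floor(dy / src_scale + (ny_src - 1) / 2.0 + 0.5)
        row = []
        for j in range(nx_out):
            dx = (j - (nx_out - 1) / 2.0) * out_scale
            src_col = math.floor(dx / src_scale + (nx_src - 1) / 2.0 + 0.5)
            if 0 <= src_row < ny_src and 0 <= src_col < nx_src:
                row.append(image[src_row][src_col])
            else:
                row.append(blank)  # request extends beyond the stored data
        result.append(row)
    return result


# A 2x2 stored tile requested at twice the resolution (half the pixel scale).
tile = [[1, 2], [3, 4]]
zoomed = resample_nearest(tile, src_scale=1.0, out_shape=(4, 4), out_scale=0.5)
```

The output image never exists in the archive; it is computed from the stored pixels for each request, which is the sense in which the product is dynamic.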
Several things are key to making it possible to have a fourth-level archive. First, there must be some agreement among the community of the scope and type of data manipulations possible. If there is no way to predict the kinds of manipulations that would be useful the system may have very limited appeal. SkyView deals with clearly defined geometric transforms. While there are a number of coordinate projections and coordinate systems this number is manageable.
Similarly, it is very important that there be some way of presenting the data to the user in a fashion that he or she can be expected to understand. The universal adoption of FITS formats by the astronomical community makes this possible for SkyView. The current draft World Coordinate System proposal, which SkyView uses, addresses precisely the same geometric issues. The existence of these standard data formats greatly enhances the usefulness of the fourth-generation archive by making its data products immediately usable in community software. The alternative is for the archive to be able to generate data in a variety of formats. SkyView does this for its image data, which can be generated as GIF, JPEG, TIFF, etc., but this is obviously more work.
Another essential element for the fourth-level archive is the ability to distribute and display information directly to the user. One can imagine systems where users' requests are responded to non-interactively. But in such a system an essential coupling between the user and the archive is lost, just as there is a difference between written and oral communication between people. In the future we may envisage the relation between the archive and user not as a set of commands and responses but as a dialog where the archive begins to anticipate the requests and sensibilities of the user.
The character of interaction with a fourth-level archive is different than with a third-level archive. The separation that used to be present between querying and data retrieval begins to blur. Since a fourth-level archive must generate a data product dynamically, the response to a query is not just a listing but at least a sample of what the data product looks like. In SkyView, the user immediately gets back the requested image on the screen. There is still a separate step to retrieve FITS or image files, but the differentiation between catalog data and archive data is less meaningful.
One test capability we have implemented in SkyView is an attempt to create an all-sky mosaic of ROSAT pointed observations. With this mosaic, users do not need to individually add a set of observations -- the data product is provided ready for use. A user selects a position and a minute or two later the ROSAT image, or a blank field if there has been no observation, is returned. By providing this kind of value-added product, a fourth-level archive can enhance the value of a third-level system. Once the user sees the data, perhaps discovering that the object of interest is seen in the field of view, he or she is motivated to work with the original atomic observations to do the very best science possible. With a third-level archive alone, the process is much more cumbersome: search the catalog, extract the needed observations, add the observations, view the result. If a user is unfamiliar with the data, it can be days before he or she can determine if there is anything interesting in the field, a formidable barrier to getting started with a new kind of data.
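The "add the observations" step that such a mosaic spares the user can be sketched as an exposure-weighted co-addition on a common pixel grid, with pixels left blank where nothing was observed. This is an illustrative simplification, not the procedure used for the ROSAT mosaic: the observation layout, the field names (`origin`, `counts`, `exposure_s`), and the assumption that all pointings are already aligned to one grid are hypothetical.

```python
def build_mosaic(shape, observations, blank=None):
    """Co-add pointed observations into a count-rate mosaic.

    Each observation is a dict with an `origin` (row, col) on the mosaic
    grid, a 2-D `counts` list, and an `exposure_s` in seconds. The result
    holds counts per second, or `blank` where no observation covers a pixel.
    """
    ny, nx = shape
    counts = [[0.0] * nx for _ in range(ny)]
    exposure = [[0.0] * nx for _ in range(ny)]
    for obs in observations:
        row0, col0 = obs["origin"]
        for dr, obs_row in enumerate(obs["counts"]):
            for dc, value in enumerate(obs_row):
                r, c = row0 + dr, col0 + dc
                if 0 <= r < ny and 0 <= c < nx:
                    counts[r][c] += value
                    exposure[r][c] += obs["exposure_s"]
    return [
        [counts[r][c] / exposure[r][c] if exposure[r][c] > 0 else blank
         for c in range(nx)]
        for r in range(ny)
    ]


# Two overlapping hypothetical pointings on a 1x4 strip of sky.
pointings = [
    {"origin": (0, 0), "counts": [[10, 20]], "exposure_s": 10.0},
    {"origin": (0, 1), "counts": [[30, 40]], "exposure_s": 20.0},
]
mosaic = build_mosaic((1, 4), pointings)
```

Where pointings overlap, the counts and exposures are summed before dividing, so deeper observations carry more weight; the last pixel of the strip stays blank because no pointing covers it.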
SkyView is not the only fourth-level archive effort underway in astronomy. Elements of this emerging technology can be seen in a number of systems. For example, the spectral plotting features of the ESIS system allow users to retrieve multi-mission data on sources very readily. The quick-look capabilities that have been built into the CADC HST Starcat and have been brought up at the Space Telescope Science Institute (STScI) allow users to browse data. The HEASARC Xobserver system couples an analysis environment very tightly to the database system which allows users to see and browse the data. However, in these cases the data products are still typically bound by the library paradigm.
Elsewhere the EUVE Guest Observer Facility has recently started a service to provide all-sky products on demand, but the service is not interactive, requiring waits of several hours for the data products. At STScI there are projects to develop expert systems to assist in data analysis. While the emphasis here is on analysis, to the extent that such assistants interact with archives they may be seen as a fourth-level archive.
The effort underway to provide a data system for the Earth Observing System, the EOSDIS system, currently envisages many fourth-level archive elements where the data will be processed at user request. Since the data volumes there are so enormous, terabytes per day, it behooves the astronomical community to keep up with the developments there, likely learning as much from the mistakes as the successes of this system.
Astronomy and the other physical sciences are seeing an explosion in the amount of digital data available. New sources of data such as the Sloan Sky Survey and new NASA missions continually increase the base volume of data, while the increases in network capacity continuously add to the effective number of datasets online. This information is leading to an embarrassment of riches where astronomers may not know what data exist to answer their questions, or where to find them. Nor can they cope with the varying formats used in different specialities or at different times.
The only way in which we are going to enable our constituents to deal with this explosion is to use the comparable increases in capacity of our computers to develop information systems which provide users with data in forms they can use immediately. The emergence of third-level archives in the past decade has enabled archive systems to cope with the very large databases from individual missions. At the HEASARC, much effort has gone into developing a discipline-wide third-level archive for high-energy astronomy, but it remains incomplete and deals with only a small fraction of the astronomical community's data. SkyView has been developed to address some of the concerns that have arisen as we have begun to use the resources of the HEASARC and other facilities, but much more remains to be done.
We feel that it is important to explore new ways in which we present data to our community. Not only must we provide astronomers with original observations, we must also give them the capability of using data from the multitude of sources transparently. The purpose of archives is not simply to preserve information; it is to make that information useful to the community. We must update our paradigm of the archive as data library and have archives which tear pages out of their books and present newly formatted volumes to their users.
Last modified: Monday, 19-Jun-2006 11:40:52 EDT