By Ara Howard
Introduction

The accuracy of any weather forecast depends on the quality of the data and metadata used to formulate that forecast. FSL supports NOAA in developing systems that perform real-time acquisition, integration, and display of environmental data in highly demanding forecast operations. The process of data analysis and dissemination requires constant alertness toward new methods of acquiring, validating, integrating, and presenting data.
A crucial factor to the successful operation of any meteorological data processing system is the rapid access to accurate metadata. For example, an update to a weather station’s recorded elevation above sea level or a change to the units associated with a potential vorticity parameter can have a profound impact on the accuracy and correctness of decoded meteorological data. Other types of metadata (such as the history of availability of a data product) are useful when designing scientific case studies or analyzing the historical reliability of data product generation and archival systems.
FSL manages data products assimilated from sources ranging from satellite feeds to reports handwritten by forecasters at airports around the world. Each data source requires a specialized decoder that nearly always relies on rapid access to unique lookup tables used to store metadata elements. These lookup tables may be unique, but they are certainly not constant – revisions occur almost daily. A significant number of hours are spent managing these tables. More problematic still is the manner in which decoders treat the kind of retrospective data needed for case studies and other processing tasks, such as validation and verification, which require equivalent access to both current and historic metadata. The treatment of the many differing datasets required at FSL has been streamlined and homogenized through implementation of the Object Data System (ODS), a basic system of "temporally versioned" tables (see the next article in this issue by Paul Hamer on ODS).
Here we discuss the development of a bitemporal relational data management system, including the management of those metadata which directly support the ODS meteorological data processing systems. Also covered are the methods used to disseminate these data to remote near real-time processing nodes, designed to rapidly access the metadata.
Design Strategies

A consistent pattern for managing the many revisions made to metadata tables has been developed, as well as a system that provides uniform access to current and historic metadata for the near real-time decoders while allowing changes to these metadata to be disseminated within a collection of distributed processors. To understand the difficulties involved in designing a system that fulfills these requirements, it is important to fully comprehend the nature of the problem.
An objective of developing a bitemporal relational data management system was to minimize overall costs and maximize the potential that users inside and outside FSL might absorb some or all of the system's components into their own environments at low additional cost. An exclusively open source tool set was used, consisting of the PostgreSQL relational database management system, the Constant Database library, the object-oriented scripting language Ruby, the GNU Compiler Collection, and the GNU/Linux operating system.
Using Metadata Lookup Tables

Most of the tables used in FSL's decoders are temporal tables, which contain metadata whose keys may have differing values during nonoverlapping time periods. For example, the International Civil Aviation Organization (ICAO) key for the Denver, Colorado, station (KDEN) may have the value "elevation 5000 feet above sea level" during the period 1970–1990, yet be associated with a different value, "elevation 5010 feet above sea level," during the subsequent period 1990–2002. This discrepancy may be the result of improved measurement accuracy, or of the station being physically relocated. In either case, it is necessary to know which elevation applies for a given time period so that, for example, given data to decode from 1989, decoders may ascertain that the correct elevation for KDEN at that time was 5000 feet, and not the currently valid 5010 feet.
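The valid-time lookup described above can be sketched as follows. This is a minimal in-memory illustration, not the actual FSL implementation (the real tables live in PostgreSQL); the `ValidRow` structure and `valid_at` helper are names invented here.

```ruby
# Each row associates a key with a value over a half-open valid-time
# period [valid_from, valid_to).
ValidRow = Struct.new(:key, :value, :valid_from, :valid_to)

# Hypothetical elevation history for the Denver station (KDEN),
# mirroring the example in the text (values in feet above sea level).
ELEVATIONS = [
  ValidRow.new("KDEN", 5000, 1970, 1990),
  ValidRow.new("KDEN", 5010, 1990, 2002),
]

# Return the value of a key as it was valid at a given time, or nil.
def valid_at(rows, key, time)
  row = rows.find { |r| r.key == key && r.valid_from <= time && time < r.valid_to }
  row && row.value
end

valid_at(ELEVATIONS, "KDEN", 1989)  # => 5000
valid_at(ELEVATIONS, "KDEN", 1995)  # => 5010
```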
Types of Lookup Tables
Valid Time State Table

The type of table discussed above, known as a valid time state table, tracks the state of the external world and the value associated with a key for a given valid time.
Transaction Time State Table

Another, equally important type of temporal table, the transaction time state table, tracks the state of the metadata table itself, recording its history of insertions, updates, and deletions. For example, it may be recorded that the unit for GRIB parameter 42, wind speed, is meters per second, and only later discovered that parameter 42 is actually associated with the unit feet per second. This difference could obviously affect the quality of decoded data.
The ideal situation, manifested by transaction time state tables, is to be able both to make the correction and to record the past error. This very useful property allows scientists and software engineers to understand why a value might have changed, for example, by a factor of nearly three overnight. Was it due to a change made in a decoder program? Were the data themselves corrupt? Or was it, considering the change would only have been around a factor of three, simply a very windy day? Transaction time state tables cannot answer these questions directly, but they can track the history of updates to metadata so that one may, at least, rule out or identify those updates as the source of an error.
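This correct-but-preserve behavior can be sketched as follows, using the GRIB parameter 42 example. It is a minimal in-memory illustration with invented dates and helper names (`correct!`, `as_known_at`); the actual FSL tables live in PostgreSQL.

```ruby
require "date"

# Each row records when the database itself believed a fact:
# tx_from is when the row was inserted, tx_to is when it was
# logically superseded (FOREVER is a sentinel for "still current").
TxRow = Struct.new(:key, :value, :tx_from, :tx_to)

FOREVER = Date.new(9999, 1, 1)

# A correction never overwrites history: the old row's transaction
# period is closed and a new current row is appended.
def correct!(rows, key, new_value, now)
  old = rows.find { |r| r.key == key && r.tx_to == FOREVER }
  old.tx_to = now if old
  rows << TxRow.new(key, new_value, now, FOREVER)
end

# The value as the database believed it at a given transaction time.
def as_known_at(rows, key, time)
  row = rows.find { |r| r.key == key && r.tx_from <= time && time < r.tx_to }
  row && row.value
end

# GRIB parameter 42 was recorded as m/s, then corrected to ft/s
# (the dates are illustrative assumptions).
units = [TxRow.new(42, "m/s", Date.new(1995, 1, 1), FOREVER)]
correct!(units, 42, "ft/s", Date.new(1998, 6, 1))

as_known_at(units, 42, Date.new(1996, 1, 1))  # => "m/s"  (the past error is preserved)
as_known_at(units, 42, Date.new(2000, 1, 1))  # => "ft/s" (the correction)
```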
Bitemporal State Table

A bitemporal state table is the union of a valid time state table and a transaction time state table, containing properties of both. Bitemporal tables can therefore answer both questions raised by the Denver station example above: the elevation of KDEN in 1985 as it is best known now, and the elevation in 1985 as it was known 11 years ago. (In the following text, "temporal" means "bitemporal" unless valid time or transaction time is otherwise specified.)
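The two-dimensional query a bitemporal table supports can be sketched as follows, again using the KDEN scenario; the row layout, dates, and helper name are illustrative assumptions, not the FSL schema.

```ruby
# Each row carries both a valid-time period (when the fact was true
# in the world) and a transaction-time period (when the database
# believed it). Half-open periods; NOW is a sentinel year.
BiRow = Struct.new(:key, :value, :valid_from, :valid_to, :tx_from, :tx_to)

NOW = 9999

# Suppose KDEN's elevation was recorded as 5000 ft in 1970, and in
# 1991 the record from 1990 onward was corrected to 5010 ft.
KDEN = [
  BiRow.new("KDEN", 5000, 1970, NOW,  1970, 1991),  # superseded belief
  BiRow.new("KDEN", 5000, 1970, 1990, 1991, NOW),   # current belief, pre-1990
  BiRow.new("KDEN", 5010, 1990, NOW,  1991, NOW),   # current belief, 1990 on
]

# The value valid at time `valid`, as the database knew it at
# transaction time `tx`.
def bitemporal_at(rows, key, valid:, tx:)
  row = rows.find do |r|
    r.key == key &&
      r.valid_from <= valid && valid < r.valid_to &&
      r.tx_from <= tx && tx < r.tx_to
  end
  row && row.value
end

bitemporal_at(KDEN, "KDEN", valid: 1985, tx: 2002)  # => 5000 (as best known now)
bitemporal_at(KDEN, "KDEN", valid: 1995, tx: 1990)  # => 5000 (as known back then)
bitemporal_at(KDEN, "KDEN", valid: 1995, tx: 2002)  # => 5010 (after the correction)
```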
Advantages of Using Relational Database Management Systems

A relational database management system (RDBMS) offers an attractive metadata storage technology for many reasons, including searchability, network availability, uniform access methods, and its transactional nature. One of the most convincing arguments for storing data in an RDBMS, however, is the ability to establish relationships among the data that ensure a high level of data quality. This is the relational model, which, if applied correctly, can enforce what is known as referential integrity – a term describing relationships that are self-consistent and accurately reflect the real situation. Unfortunately, the design of bitemporal, valid time state, and transaction time state tables precludes the use of the standard RDBMS facilities for enforcing referential integrity: primary and foreign keys. Moreover, no standard products exist in either the commercial or open source arenas that completely address this issue. Thus, metadata that are to be modeled bitemporally, with a relational model enforcing referential integrity, mandate the design and implementation of a custom data management system.
Managing Bitemporal Metadata

If one assumes bitemporal metadata managed from within a dedicated system, the challenge becomes one of efficiently accessing and using those metadata. The ODS architecture is distributed: decoders for various data types reside on physically disparate but networked processors. Therefore, one of two mechanisms must be employed to provide access at each of these processing nodes: the data must be either remotely available or replicated on each host. Remote access, the first possibility, was rejected on the grounds that many of the ODS processors were already pushing the limits of CPU usage, and the intranet on which they communicate was also under very heavy load. Under certain circumstances, the cost of simply connecting to a remote database would be too great a burden on these near real-time nodes. Replication, the alternative, is attractive because it does not require constant availability of a central database server if "lazy" replication is used. Minimal network traffic is generated when updates are relatively infrequent, since often all that must be conveyed to each of the remote nodes is "no change." Finally, with data residing locally on each node, lookups can be orders of magnitude faster than lookups against a remote server or servers.
The FSL metadata system is a two-part system: one part, running on a server (or servers), stores and manages a central repository of bitemporal metadata. The other part, running concurrently on multiple processing clients, retrieves and provides access methods to the now locally available metadata. The two parts were developed with no mutual dependencies – each can operate independently of the other in the event of a network or hardware failure.
Implementing an RDBMS

The overriding design philosophy behind the management and dissemination systems was to create lightweight, easy-to-understand tools to accomplish each task, rather than a "goliath" system that could handle all possible tasks. The emphasis was on rapidly prototyping a set of systems to handle a subset of the available metadata, namely the GRIB tables. This subset was chosen for the initial implementation because of its relatively complex set of relationships. Other FSL metadata sets, which are on average much simpler in nature, are expected to be readily integrated into both systems.
At FSL, an RDBMS can fulfill many of the requirements placed on metadata storage, such as data independence, efficient access, data integrity, security, and centralized data administration. Even without the ability to enforce referential integrity via the normal primary and foreign key constraints, the advantages of storing temporal data in an RDBMS are compelling. The capability to conduct complex searches, which requires none of the standard referential integrity constraints, is itself a motivating factor. Given that most RDBMSs also handle network connectivity and security, the decision to store the metadata tables, temporal and otherwise, in a relational database becomes clear.
We chose PostgreSQL for the FSL metadata RDBMS because of its special features: 100% ACID compliance, 100% ANSI SQL-92 compliance, extensive referential integrity support, replication, scalability, rules, views, triggers, outer joins, subselects, stored procedures, an MVCC design, a 50-MB install footprint (versus 500 MB for the leading vendors' RDBMSs), unlimited connectivity, a small resident memory footprint, and the ability to run on inexpensive workstations under the free Linux operating system. It also appears the best poised to adopt the SQL temporal extensions when they are released in the SQL3 standard. It is important to note that a commercial RDBMS could be purchased to fill the role of PostgreSQL, because the management system is not tied to a specific RDBMS. Other freely available RDBMSs could be modified to support the management system, but currently no other open source RDBMS implements all the features the system requires.
Preserving the semantics and correctness of bitemporal relationships across updates, insertions, and deletions can, as it turns out, become very complicated even when only a small number of relationships is considered. Their management was the subject of roughly 1600 doctoral theses last year alone, and the lack of temporal support in the current SQL version, used by all major RDBMSs, is being addressed for its next release. Extensive research on bitemporal data management has been published by Professor Richard T. Snodgrass of the University of Arizona's Computer Science Department in his book "Developing Time-Oriented Database Applications in SQL." He also initiated the SQL/Temporal section of the current ANSI/ISO SQL3 draft standard. Dr. Snodgrass generously donated his time and expertise in helping us address several particularly difficult issues encountered during the design phases of this project.
Programmatically Controlling Metadata Integrity

Once the difficulties that arise in managing bitemporal metadata are fully understood, it becomes evident that only deliberate programmatic management has any chance of maintaining data integrity. There are essentially two ways of programmatically managing bitemporal metadata stored in an RDBMS: internal to the database, via triggers or stored procedures, or external to the database, using a programming language API. The FSL Central Facility management system is implemented externally to the RDBMS for several reasons. First, only a few commercial and open source RDBMSs support triggers and stored procedures, meaning that any implementation using them is limited to a subset of the available SQL-compliant databases. Also, certain critical operations can only be achieved from outside the RDBMS, and triggers and stored procedures are both limited in functionality and, in the opinion of this author, difficult for developers to implement, debug, and maintain.
In order to support the variety of applications, ranging from Web interfaces to command-line utilities, that require programmatic control of the FSL metadata tables, we designed a very high-level object-oriented class library requiring little or no familiarity with the theory of bitemporal data management, and only the most rudimentary SQL. The purpose of this class library is, in essence, to model bitemporal tables as a class of objects capable of "doing the right thing" when sent messages such as insert(new), update(old, new), and delete(old). The library was built using the freely available object-oriented scripting language Ruby, with a flyweight design pattern modeling the database tables themselves. Ruby was chosen as the development language primarily for its unparalleled abilities as a rapid object-oriented prototyping tool. It also has native PostgreSQL bindings, runs on all platforms supported by PostgreSQL, and requires no license fees. More than most object-oriented languages, Ruby enables average developers to quickly produce and maintain reusable object-oriented code, reducing both development and maintenance costs.
The Ruby class library we designed, Bitemporal PostgreSQL (BTPGSQL), is used in applications to manage the bitemporal tables stored in PostgreSQL. This library is the sole interface to these metadata: all transactions on the data are accomplished via Ruby bitemporal table proxy objects that automatically enforce the relational model, so that the referential integrity of the tables is maintained in both the standard and the bitemporal sense. This layering offers what the standard referential integrity constraints, even where applicable, could not: it moves the management of metadata out of the hands of specialists and into the hands of the programmers and scientists who are most affected by the quality of these metadata.
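As a rough illustration of the proxy-object style described here, the following sketch mimics the insert/update/delete messages named in the text with an in-memory stand-in. The class name and method signatures are hypothetical; the real BTPGSQL library talks to PostgreSQL and also tracks valid time and referential integrity, all of which this sketch omits.

```ruby
# Hypothetical stand-in for a bitemporal table proxy: mutations never
# destroy rows, they only close transaction-time periods.
class BitemporalTableProxy
  Row = Struct.new(:key, :value, :tx_from, :tx_to)
  FOREVER = Float::INFINITY

  def initialize(name)
    @name  = name
    @rows  = []
    @clock = 0   # stand-in for a transaction timestamp
  end

  def insert(new)
    tick
    @rows << Row.new(new[:key], new[:value], @clock, FOREVER)
  end

  # An update is logically a delete of the old row plus an insert of
  # the new one, so no history is ever lost.
  def update(old, new)
    delete(old)
    @rows << Row.new(new[:key], new[:value], @clock, FOREVER)
  end

  def delete(old)
    tick
    row = @rows.find { |r| r.key == old[:key] && r.tx_to == FOREVER }
    row.tx_to = @clock if row
  end

  # Current value of a key (the row whose transaction period is open).
  def current(key)
    row = @rows.find { |r| r.key == key && r.tx_to == FOREVER }
    row && row.value
  end

  private

  def tick
    @clock += 1
  end
end

grib = BitemporalTableProxy.new("grib_parameters")
grib.insert(key: 42, value: "m/s")
grib.update({ key: 42 }, { key: 42, value: "ft/s" })
grib.current(42)  # => "ft/s"
```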
Disseminating Bitemporal Metadata

An important point to recognize is that once a centralized server is successfully managing a master copy of all bitemporal tables, accessibility by the Object Data System and other client nodes can be accomplished in many ways. Clients themselves never insert, update, or delete any metadata; all they require is fast lookups into externally managed tables. The tables are constant from the point of view of the clients; each decoder acquires the required tables and performs simple lookups on those tables based on keys valid at a certain time, and no ad hoc queries are ever performed.
These facts contributed to the selection of the freely available open source Constant Database (CDB) library for use on client nodes. CDB is a non-server-based, non-SQL, read-only embedded application-layer database supporting key/value lookups, a platform-independent binary database format, and atomic database replacement. Implemented in the C programming language, CDB performs several orders of magnitude faster than all tested embedded databases, including the Berkeley Database, SQLite, and GDBM. Its constant nature also allows CDB to use shared memory, so that all decoders running on a given processor share access to a given table, keeping hardware requirements to an absolute minimum. Since the CDB library does not handle temporal data at all, a lightweight C library, the Valid Time CDB (VTCDB) package, was created as an abstraction layer atop CDB, allowing temporally constrained lookups into CDB databases. As previously mentioned, CDB databases are stored in small platform-independent binary files generated from simple human-readable text files. These databases are generated on a server from PostgreSQL using the Ruby BTPGSQL class library at regular intervals, and then placed onto a Network File System (NFS) visible to all Object Data System nodes (Figure 1). A simple Unix makefile is then used voluntarily by clients to synchronize their locally stored CDB databases with the master copies visible on the NFS. This system is fast (complete database replacement in the case of a failover takes only a few moments), simple (a 100-line makefile comprises the majority of the system), and easy to understand; it integrates easily with current ODS protocols, is available to all of FSL, continues to work during network outages, and requires practically zero development or maintenance cost.
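The atomic replacement property that makes whole-database swaps safe is the classic write-then-rename pattern: the new file is built off to the side and then renamed over the old one, so a reader always sees either the complete old database or the complete new one. A minimal sketch in Ruby (file names are illustrative; CDB implements this in C):

```ruby
require "tmpdir"

# Write the new contents under a temporary name, then rename(2) it
# over the destination. On POSIX, rename within one filesystem is
# atomic, so readers never observe a partially written file.
def atomically_replace(path, contents)
  tmp = "#{path}.tmp.#{Process.pid}"
  File.binwrite(tmp, contents)
  File.rename(tmp, path)
end

Dir.mktmpdir do |dir|
  db = File.join(dir, "grib.cdb")
  atomically_replace(db, "old tables")
  atomically_replace(db, "new tables")
  File.read(db)  # => "new tables"
end
```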
Figure 1. Schematic of the Bitemporal Metadata System at FSL.
The Ruby library BTPGSQL, used in conjunction with PostgreSQL, maintains bitemporal data on a central server. Clients wishing to use these metadata need only install two small C libraries and voluntarily acquire copies of the CDB files generated on the server. The Network File System is the current means of making these databases available to the ODS nodes. Other means of acquiring FSL's Central Facility CDB databases may be developed. CDB's platform-independent nature greatly simplifies distribution, meaning that many types of "subscription" services could be provided to users inside and outside FSL wishing to remain synchronized with the latest maintained metadata. Protocols such as FTP, HTTP, SSL, CVS, and rsync could all easily be supported if demand arises.
Summary

In building an infrastructure for high-performance computing, one of FSL's goals is to develop data management capabilities that improve data access and metadata management. The FSL meteorological data processing system discussed here thus comprises two subsystems, a management system and a dissemination system. The two cooperate to provide referentially intact bitemporal metadata to remote Object Data System (see next article) client applications, which require accurate metadata in order to produce quality data products. Rapid prototyping tools and 100% open source technology were chosen for the design and implementation of this project in order to minimize cost and reduce development time. The result is a lightweight solution to a difficult problem: efficiently managing and disseminating metadata. While other solutions certainly exist, it is hoped that the simplicity and economy of this model might serve as a template, in part or in full, for similar problems in other atmospheric research laboratories.
Note: More information about FSL's Central Facility and other related topics is available at http://www.fsl.noaa.gov.
(Ara Howard is a Professional Research Assistant in FSL's Information and Technology Services, headed by Dr. Peter Mandics. Mr. Howard is also affiliated with the Cooperative Institute for Research in Environmental Sciences (CIRES) at the University of Colorado at Boulder. He can be reached at Ara.Howard@noaa.gov or by phone at 303-497-7238.)