UNSD Document

in terms of a journey identifier, distance, time and speed.

The speed, of course, is computed from distance and time, but the units of measurement are different. In a language-based system, a formula like the following could be specified:

Speed = (Distance * 1609.344) / (Time * 3600)

There are two problems with this approach: 1) the system does not know what the constants 1609.344 and 3600 are all about; 2) the system does not know why distance has to be divided by time. It is all in the designer’s head. Maybe it has been stated somewhere in a formal design document, or in a meta-information data base, but there is no operational link between those specifications and the use of the formula. If, for example, it is decided that the distance will be presented in kilometres instead of miles, it is the responsibility of the designer to change not only the unit label of the Distance column and its data, but also the first constant of the formula for Speed. Moreover, it would be difficult to support interactive visual definition and maintenance of this computation.

The approach we suggest allows the system to do the job of defining and maintaining this computation. Measurement units and scaling factors must be part of the metadata. The system must be aware of a measurable concept such as Distance, and it must know all about the units that can be used to express a value of a given concept (metres, kilometres, miles, inches, etc.). If told to convert a value from one unit to another, such as inches to kilometres, the system must be able to access the definition of the conversion formula. This can involve more than just a scaling factor, as is the case for conversion of a temperature value from Fahrenheit to Celsius. Derived concepts such as Speed are defined by reference to two or more other concepts: Speed is defined as the quotient of Distance by Time, and the measurement units of Speed are defined by reference to the measurement units of Distance and Time. Finally, each numeric field is specified by reference to a unit belonging to a predefined measurable concept.
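As an illustration only, unit definitions of this kind might be held as metadata along the following lines. The sketch assumes that every unit is related to the primary unit of its concept by a scale factor and an optional offset (the offset covers cases such as Fahrenheit to Celsius). The names `Unit`, `Concept` and `convert` are hypothetical, and the mile is assumed to be the international mile of 1609.344 m.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Unit:
    """A unit defined relative to its concept's primary unit.

    value_in_primary = value * scale + offset
    """
    symbol: str
    scale: float = 1.0
    offset: float = 0.0


@dataclass
class Concept:
    """A measurable concept together with the units that can express it."""
    name: str
    units: dict = field(default_factory=dict)

    def add_unit(self, symbol, scale=1.0, offset=0.0):
        self.units[symbol] = Unit(symbol, scale, offset)

    def convert(self, value, source, target):
        """Convert via the primary unit: source -> primary -> target."""
        src, dst = self.units[source], self.units[target]
        primary = value * src.scale + src.offset
        return (primary - dst.offset) / dst.scale


# Distance: the primary unit is the metre; the other units are pure scalings.
distance = Concept("Distance")
distance.add_unit("m")
distance.add_unit("km", scale=1000.0)
distance.add_unit("inch", scale=0.0254)
distance.add_unit("mile", scale=1609.344)

# Temperature: the primary unit is the degree Celsius; Fahrenheit needs an offset.
temperature = Concept("Temperature")
temperature.add_unit("C")
temperature.add_unit("F", scale=5.0 / 9.0, offset=-160.0 / 9.0)

inches_in_km = distance.convert(1.0, "km", "inch")
boiling_c = temperature.convert(212.0, "F", "C")
```

Because every conversion passes through the primary unit, adding a unit requires one definition rather than one formula per pair of units.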

Now it is possible to replace the previous formula by a much simpler one:

Speed = Distance / Time

The system knows all about the relationship between the three concepts referred to by these fields, so it can do the following things:

*) check that the quotient of values in Distance and Time units yields a value in a Speed unit: this can lead the system to reject a formula that is not consistent with the available unit definitions;
*) select the scaling factors to apply to the computation, in relation to the units used in the three fields;
*) automatically convert the values of a column if the unit scale is changed;
*) automatically adapt the scaling factors in the computation if the unit scale is changed in one of the columns.
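The capabilities listed above can be sketched as follows. This is a minimal illustration, not a description of an existing system: the dictionaries, the function names `check_formula`, `speed_factor` and `speed`, and the unit symbols are all assumptions, and the mile is taken as 1609.344 m.

```python
# Scale factors to each concept's primary unit (metre, second): assumed values.
DISTANCE_UNITS = {"m": 1.0, "km": 1000.0, "mile": 1609.344}
TIME_UNITS = {"s": 1.0, "min": 60.0, "h": 3600.0}

# Each speed unit is defined by reference to a distance unit and a time unit.
SPEED_UNITS = {"m/s": ("m", "s"), "kmh": ("km", "h"), "mph": ("mile", "h")}

# Derived concepts: name -> (operator, left concept, right concept).
DERIVED = {"Speed": ("/", "Distance", "Time")}


def check_formula(result, operator, left, right):
    """Reject a formula that is inconsistent with the concept definitions."""
    if DERIVED.get(result) != (operator, left, right):
        raise ValueError(f"{result} is not defined as {left} {operator} {right}")


def speed_factor(distance_unit, time_unit, speed_unit):
    """Scaling factor such that factor * distance / time is in speed_unit."""
    target_distance, target_time = SPEED_UNITS[speed_unit]
    distance_ratio = DISTANCE_UNITS[distance_unit] / DISTANCE_UNITS[target_distance]
    time_ratio = TIME_UNITS[time_unit] / TIME_UNITS[target_time]
    return distance_ratio / time_ratio


def speed(distance, distance_unit, time, time_unit, speed_unit):
    """Compute Speed = Distance / Time with the factor chosen from the metadata."""
    check_formula("Speed", "/", "Distance", "Time")
    return speed_factor(distance_unit, time_unit, speed_unit) * distance / time


# 60 miles in 1 hour, expressed in m/s.
metres_per_second = speed(60.0, "mile", 1.0, "h", "m/s")
```

If the Distance column is switched from miles to kilometres, only the `distance_unit` argument changes; the factor is looked up again instead of being edited by hand.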

This also makes it easy to support visual interactive specification of the Speed column. The following scenario becomes possible, supposing there is a table containing the first three columns of our example: Journey id, Distance and Time:

*) use the menu to add a column;
*) select columns 2 and 3 and drag them onto the new column;
*) the system looks up its list of derived concepts for one that is defined in terms of Distance and Time. It finds Speed, which is specified as the quotient of Distance by Time;
*) the system opens a list box with all defined units of speed;
*) select m/s;
*) the system fills the new column with computed values;
*) the system looks up the list of labels available as column titles for the Speed concept and presents a list box;
*) choose the label Speed.

This example illustrates the fact that the language and interactive specification approaches are not contradictory, and shows that a good operational metadata concept can help simplify work in either case.

2.5. Events

Processes take place at a certain point in time, as soon as predefined conditions are met. Such a condition can simply be the availability of all the required input. Reaching a certain date can also trigger a process. Other possibilities are a human decision to launch the process, a state reached by another process, etc. If process meta-information is to be automated, the system needs to know about events triggering processes. These are an integral part of the metadata.
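A minimal sketch of how triggering events might be recorded as metadata, assuming each process carries a predicate over the current state of the system. The process names, the state keys and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass
class Process:
    name: str
    trigger: Callable[[dict], bool]  # predicate over the current system state


def due(processes, state):
    """Return the names of the processes whose triggering event has occurred."""
    return [p.name for p in processes if p.trigger(state)]


processes = [
    # Triggered by the availability of all required input.
    Process("editing", lambda s: {"raw_data", "edit_rules"} <= s["available"]),
    # Triggered by reaching a certain date.
    Process("publication", lambda s: s["today"] >= date(1997, 3, 1)),
    # Triggered by a state reached by another process.
    Process("weighting", lambda s: "editing" in s["finished"]),
]

state = {
    "available": {"raw_data", "edit_rules"},
    "today": date(1997, 1, 15),
    "finished": set(),
}

ready_now = due(processes, state)
```

A human decision to launch a process fits the same pattern: the decision is recorded in the state and the predicate tests for it.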

2.6. From concept to documentation

The same concept can play various roles in different parts of a survey. A publication table, for instance, could be supplied with information referring to the non-response rate and the weighting method used to compensate for it. At this stage, it is documentary meta-information. However, both concepts have been present in an earlier phase: non-response has been captured as refused or unknown information and was aggregated to a non-response rate indicator, and the weighting method was implemented in a process that computed the weights. This has to be captured in operational meta-information. But first of all one needs to have a clear definition of the concepts of non-response, weighting and the available algorithms. This is conceptual meta-information.

The system has to know how to supply the individual cases of non-response to the weighting process, how to merge them into a non-response rate, and how to make this available for documentation. The information about the weighting algorithm must be stored only once, and accessed both by the weighting process and the documentation.
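The requirement of storing the weighting algorithm only once could look like this in a small sketch. The weighting method shown (inflating by the inverse response rate) is a hypothetical example chosen for brevity, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WeightingMethod:
    """One definition, used by both the weighting process and the documentation."""
    name: str
    description: str
    compute: Callable[[int, int], float]


# Hypothetical method: weight respondents by sample size / number of respondents.
INVERSE_RESPONSE_RATE = WeightingMethod(
    name="inverse response rate",
    description="Each respondent is weighted by sample size / number of respondents.",
    compute=lambda sample_size, respondents: sample_size / respondents,
)


def non_response_rate(answers):
    """Aggregate individual cases (None = refused or unknown) into a rate."""
    return sum(a is None for a in answers) / len(answers)


def weight(answers, method):
    """Operational use: compute the weight with the stored method."""
    respondents = sum(a is not None for a in answers)
    return method.compute(len(answers), respondents)


def footnote(answers, method):
    """Documentary use: describe the same method in a table footnote."""
    rate = non_response_rate(answers)
    return f"Non-response: {rate:.0%}. Weighting: {method.name}. {method.description}"


answers = [1, None, 3, 4, None]
```

Both `weight` and `footnote` read the same `WeightingMethod` object, so a change to the method cannot leave the documentation out of step with the computation.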

3. System design

As stated in the first part of this paper, the first step will be an abstract specification of the system to be built. The design will have to meet the following four requirements: implementation independence, unlimited coverage of all aspects of meta-information, unity of definition, and object orientation. In the third part we will now concentrate on these four points and conclude with some examples.

3.1. Implementation independence

We have already mentioned that some processes are implementation-resistant. This is partly due to the fact that they have never yet been analysed formally. It is also a question of culture. There are statistics that are generally recognised to be dependent on the specialist’s know-how, something that "could never, ever be replaced by software", a handy alibi for the status quo. So we want to integrate these processes into the system by describing them as black boxes. The system only knows they exist, what input they need, what output they produce, what the relationship is between input and output, and what events trigger them. Taking the human factor into account in this way implies building a system that runs on two platforms at a time: the computer and the human brain.

Another argument for implementation independence is the fact that some aspects of metadata management have already been implemented in systems that are available on the market. EMMA is one of them. The domain of metadata is so vast that it is not possible to cover it all with one universal package. However, by integrating third-party software into the system we can easily extend its functionality to what is already available. Blaise III has shown that this approach is very powerful. Blaise does not carry out tasks that are made available by other packages. One example is statistical analysis. Quite a few good packages do the job; SPSS is but one example. A Blaise application can offer statistical analysis by integrating SPSS. For instance, a menu entry can be defined as a call to SPSS: selecting this entry will trigger a data selection process and the creation of an SPSS set-up on the fly, after which control is transferred to SPSS to create the system file and interact with the user directly. When the user exits SPSS, the application will go on to whatever has been defined as the next step. Any third-party software can be used in this manner to process data managed by the Blaise system. This works for data, and we are convinced that it will also work for metadata. So our choice of platforms is not limited to data base management systems and general purpose programming languages: we also have third-party software to choose from.

Our own software is also a candidate for integration in the metadata system. In the case of Blaise, this could happen in various ways. The system could transfer operational metadata to Blaise and call upon the data management functionality of Blaise for such things as interviewing and data editing. Or we might end up building a metadata designing tool that is so general that Blaise could be rewritten in terms of it. Alternatively, the functionality of Blaise could be extended in such a way that Blaise itself would become the general metadata management system. We do not know yet. And we are not in a hurry to decide.

Presently we are witnessing a rapid evolution of the concept of inter-program communication. In the past, programs could communicate through ASCII files; later came the dBase format; today, they access each other’s data through ODBC drivers, or even communicate with each other in real time through OLE. Software is no longer by definition an executable program. There are a host of formats for software that does not run autonomously: there is DLL, VBX and OCX, and there are more to come. The day will come when those formats will be preferred to the heavy monolithic executables of today. The end-users will be able to build their own private applications by clicking diverse modules into place: a word processing module from Microsoft, a spelling checker from WordPerfect, a data management module from Borland, and maybe a metadata management module from Statistics Netherlands. So the system could be implemented as a loose collection of modules selectable by the user.

The system we have in mind might very well never be implemented completely. The first prototypes will be limited subsystems, such as a module for classification definition and maintenance or a tool for co-ordinating and standardising the legacy meta-information scattered over all the surveys and publications of an institution. The implementation of such loosely related applications could vary greatly, and this is another reason to wish to keep implementation aspects out of the general design.

Finally, the implementation landscape is so dynamic that switching from one implementation form to another between releases needs to be an easy and fast operation. This is not possible if implementation aspects are part of the design.

3.2. Unlimited coverage of all aspects

It is not easy to give a well-delimited definition of what metadata will be considered relevant, and what should be left out of the system. Metadata are used to manage data. What do we use to manage metadata? If they are to be operational, they need to be described, they need to fit their own model. The metadata of the metadata. The meta-metadata. This is also meta-information. Here are a couple of examples that will show that this is also relevant to our system.

The process of data editing has two objectives. Its first function is to identify and correct errors in the data set. The input of the editing process is a dirty data set, and the output is a clean (or less dirty) data set. The user of the data set considers this to be the primary function of data editing. But the designer has something else in mind: systematic errors can be symptomatic of design mistakes that have to be corrected before the same data are collected again. So the designer also wants data editing to produce an edit trail. This edit trail (which belongs to the output of the data editing process) will be used as input by the maintenance process. The output of the maintenance process is a new data entry process that produces better data and needs less editing. The new entry process cannot take place before the maintenance process has executed.
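The two outputs of the editing process can be sketched as follows, assuming edit rules are given as (field, validity check, correction) triples. The rule, the records and the function names are hypothetical.

```python
from collections import Counter


def edit(records, rules):
    """Apply correction rules; return the cleaned records and an edit trail."""
    clean, trail = [], []
    for record in records:
        fixed = dict(record)  # leave the dirty input untouched
        for field, is_valid, correct in rules:
            if not is_valid(fixed[field]):
                # Trail entry: which record, which field, which rejected value.
                trail.append((record["id"], field, fixed[field]))
                fixed[field] = correct(fixed[field])
        clean.append(fixed)
    return clean, trail


def systematic_errors(trail, threshold):
    """Fields corrected at least `threshold` times: candidates for redesign."""
    counts = Counter(field for _, field, _ in trail)
    return [field for field, n in counts.items() if n >= threshold]


# Hypothetical rule: ages must be non-negative; invalid values become missing.
rules = [("age", lambda v: v is not None and v >= 0, lambda v: None)]
records = [{"id": 1, "age": 34}, {"id": 2, "age": -1}, {"id": 3, "age": -7}]

clean, trail = edit(records, rules)
suspect_fields = systematic_errors(trail, threshold=2)
```

Here `clean` serves the user of the data set, while `trail` and `suspect_fields` are the input the maintenance process needs to improve the next data entry process.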

In this example, the survey design process has become an integral part of the designed survey process. There is no a-priori reason to exclude this type of reflexivity. This, of course, does not mean that we have the ambition of automating every single sub-process that we identify. But we want to see the whole picture before we start implementing some parts of it, and we do not want to exclude any relevant aspect because it might turn out to be overkill.

We would like to draw our next example from the problem of multilingualism. Some countries have two or more official languages, and maybe also one or more widely used non-official ones. In such cases good communication requires the use of more than one language, and sometimes the use of multilingual documents is imposed by law. This applies to every bit of natural-language meta-information: question texts of an interview, titles and footnotes of a publication table, etc. This poses a heavy maintenance problem if the task of warning all translators is the responsibility of anyone entering new texts or updating old ones, or if it is the responsibility of translators to check whether there are new or updated meta-information texts. The whole operation of maintaining texts in multiple languages is much easier if the system is designed to keep multilingual texts synchronised by sending a warning to translators at every change of the original set of texts. Here again, we have something that is marginal to the survey process and belongs to the design process, something, nevertheless, that we do not wish to exclude from the possible functionality of our future system.
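A sketch of how such synchronisation might work, assuming each text keeps a revision counter for its original version and records which revision each translation reflects. The class `Text`, its attributes and the `notify` callback are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    """A natural-language text kept in one source and several target languages."""
    key: str
    source_language: str
    target_languages: tuple
    versions: dict = field(default_factory=dict)   # language -> text
    revisions: dict = field(default_factory=dict)  # language -> source revision reflected
    source_revision: int = 0

    def update_source(self, text, notify):
        """Change the original text and warn the translator of every stale language."""
        self.source_revision += 1
        self.versions[self.source_language] = text
        self.revisions[self.source_language] = self.source_revision
        for language in self.stale():
            notify(language, self.key)

    def translate(self, language, text):
        """Record a translation of the current source revision."""
        self.versions[language] = text
        self.revisions[language] = self.source_revision

    def stale(self):
        """Target languages whose translation is missing or out of date."""
        return [
            language
            for language in self.target_languages
            if self.revisions.get(language, 0) < self.source_revision
        ]


warnings = []
question = Text("Q1", "en", ("fr", "nl"))
question.update_source("How far did you travel?", lambda lang, key: warnings.append((lang, key)))
question.translate("fr", "Quelle distance avez-vous parcourue ?")
```

After these calls only the Dutch translation is still outstanding; a later change to the English text would make both translations stale again and trigger new warnings.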

Another point that falls under this paragraph is the range of applications of a meta-information system. We want to be able to use it for all activities involving data and/or metadata. It must be able to control the co-ordination and standardisation of concepts, to run and monitor processes, to manage resources, to produce documentation, etc.

3.3. Unity of definition

A concept can be present at many different points of a project. Most concepts have a definition (conceptual meta-information), a set of operations (operational meta-information), and user-oriented documentation (documentary meta-information). We do not want these different aspects of one concept to be scattered all over the system. One must be able to access all aspects of a given concept through the unique object representing it. The rule is: one concept, one polyvalent definition.

3.4. Object orientation

In this paper we have expressed the idea that meta-information should not only be stored and retrieved, managed and maintained, but that it should have tasks to perform and know how and when to perform them. In the previous section we have also demanded that every concept should concentrate all elements of its definition. This combination of data and behaviour is a typical result of an object-oriented analysis. It is our conviction that object orientation can be a significant contribution to a powerful and intuitive metadata management system. This does not mean that we intend to use an object-oriented language or object-oriented data base management system: these implementation decisions will have to be taken later. By an object-oriented system we mean a system in which resources, data and processes are represented as objects combining information and behaviour.

The importance given here to objects may come as a surprise to the reader: object orientation is often seen as a programming style, an attitude towards implementation techniques, something that does not concern the user. It is also often considered to be something hard that needs a considerable time investment in order to be applied successfully. We do not agree with this point of view. Object orientation is first of all a design attitude, which in our view contributes to better control of system complexity. It is the next logical step after structured design. A system whose users are involved in designing surveys should help them to apply the most efficient design methodology.

By managing all information and behaviour of one concept, a unique object representing that concept is a contribution to the overall unity and coherence of the system. In this view of things there are no barriers between conceptual, operational and documentary metadata.

What is the behaviour of a metadata object? It is anything the object can do to supply information or to contribute to one or more processes. An object should be able to supply its name or identifying key, its definition, a list of all the objects it is related to; it should be able to write itself in Pascal or C code or in an SPSS set-up, to store itself in a relational data base, detect that its documentation in its original English form is more recent than its French translation and trigger a translation process (in which the human black box will be involved, of course), perform computations and selections, take decisions, and carry out any other relevant tasks.

3.5. Examples

To illustrate these ideas, let us follow the life cycle of the Speed column as represented by an object in the system. We will refer to the table presented in section 2.4.

Before we can use Speed in a survey, we need a definition of the concept of speed. This definition must be common to all surveys: we want to be able to relate speed information across surveys. So we need an object to represent the concept of speed. Speed is a concept that can be measured: it can have values expressed in units belonging exclusively to speed. The system we have in mind will offer a predefined class of objects, the Measurable Concepts, which is designed to model things like distances, times and speeds.

There are two sorts of measurable concepts: primitives, like distance and time, which are defined independently of any other measurable concept, and derived, which are defined in terms of one or more other measurable concepts. The two have in common the possibility of defining and using a list of units in which to express values. The different units belonging to one concept can be interrelated in order to provide both their definition and their conversion formulae.

Primitive and derived measurable concepts can be defined in terms of what they share and in terms of their differences. We model them with two different classes of objects, but when we are interested in their common features, we can pretend they belong to the same class and forget their differences. They are interrelated in the following way:

Any Primitive measurable concept is a Measurable concept, and so is any Derived measurable concept. The relationship, however, does not hold the other way round. The superclass of Measurable concepts defines all features common to both: a natural-language definition, a set of units, and an operational definition of the relationship between the available units.

Derived measurable concepts add to this a list of related measurable concepts and an operational definition linking them in a formula. In the case of Speed, the operational definition is Distance / Time. The Distance object, which, let us assume, has already been defined, provides the following units: mile, yard, inch, km, m, and cm. One of them, e.g. m, is defined as the primary unit. All others are defined by reference to the primary unit and a multiplication factor. The Time object is very similar. Its primary unit could be chosen as being the hour, which can be expressed in minutes, seconds, days and weeks (months and years address the issue of variable scaling factors, which we will not discuss here). Theoretically all combinations of Distance/Time units could yield a Speed unit. But in practice only a few are used. Only three of them will be taken in the definition of Speed: mph (miles / hours), kmh (km / hours) and m/s (m / seconds). Note that the symbol of a derived unit is not always identical to its operational definition: see mph vs m/s. A derived measurable concept like Speed has no primary unit: all units are derived from those of the concepts referred to.
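The class structure described here might be sketched as follows. The class names, the natural-language definitions and the method `factor` are assumptions made for the illustration; the unit lists follow the text, with the mile taken as 1609.344 m.

```python
class MeasurableConcept:
    """Features common to primitive and derived measurable concepts."""

    def __init__(self, name, definition):
        self.name = name
        self.definition = definition  # natural-language definition

    def factor(self, unit):
        """Size of `unit` on the concept's reference scale."""
        raise NotImplementedError

    def convert(self, value, source, target):
        return value * self.factor(source) / self.factor(target)


class PrimitiveConcept(MeasurableConcept):
    """Defined independently: a primary unit, other units by a multiplication factor."""

    def __init__(self, name, definition, primary_unit, factors):
        super().__init__(name, definition)
        self.units = {primary_unit: 1.0, **factors}

    def factor(self, unit):
        return self.units[unit]


class DerivedConcept(MeasurableConcept):
    """Defined as numerator / denominator; no primary unit of its own."""

    def __init__(self, name, definition, numerator, denominator, units):
        super().__init__(name, definition)
        self.numerator, self.denominator = numerator, denominator
        self.units = units  # symbol -> (numerator unit, denominator unit)

    def factor(self, unit):
        numerator_unit, denominator_unit = self.units[unit]
        return self.numerator.factor(numerator_unit) / self.denominator.factor(denominator_unit)


distance_concept = PrimitiveConcept(
    "Distance", "Length of a path between two points", "m",
    {"km": 1000.0, "cm": 0.01, "mile": 1609.344, "yard": 0.9144, "inch": 0.0254},
)
time_concept = PrimitiveConcept(
    "Time", "Duration of an interval", "hour",
    {"minute": 1 / 60, "second": 1 / 3600, "day": 24.0, "week": 168.0},
)
speed_concept = DerivedConcept(
    "Speed", "Distance covered per unit of time", distance_concept, time_concept,
    {"mph": ("mile", "hour"), "kmh": ("km", "hour"), "m/s": ("m", "second")},
)

kmh_per_ms = speed_concept.convert(1.0, "m/s", "kmh")
```

The Speed units carry no factors of their own: each is resolved through the Distance and Time units it refers to, so the symbol (`mph`) and the operational definition (`mile`, `hour`) are stored separately.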

In order to collect and publish the information shown in the table of section 2.4, we need to define a data model containing fields for a journey identifier, and for Distance, Time and Speed. The mechanism for defining fields is the same as the one for measurable concepts: the system provides a predefined class to model field objects. For the definition of the field object Distance, there is no need to rewrite or copy the definition of the concept object Distance and its units: the Field class provides a reference to a concept for its definition, and knows how to access the functionality of its concept. If the Distance field object is asked to convert its value from miles to kilometres, it will pass the request over to the Distance concept object along with the value to be converted. This mechanism, by which an object draws upon the functionality of another object, is called delegation, and is one of the most powerful features of object orientation. Note that the Measurable concepts have no values: they have only definitions and operations. Field objects have values, and they can use the Concept objects they refer to in order to perform the required operations.
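Delegation between a field and its concept can be illustrated with a short sketch. The classes `DistanceConcept` and `Field` and their methods are hypothetical; the point is only that the field holds the value while the concept holds the unit knowledge.

```python
class DistanceConcept:
    """Holds definitions and operations, but no values."""

    factors = {"m": 1.0, "km": 1000.0, "mile": 1609.344}  # metres per unit (assumed)

    def convert(self, value, source, target):
        return value * self.factors[source] / self.factors[target]


class Field:
    """Holds a value and a unit; delegates all unit knowledge to its concept."""

    def __init__(self, name, concept, value, unit):
        self.name, self.concept = name, concept
        self.value, self.unit = value, unit

    def value_in(self, unit):
        # Delegation: the field passes the request on to its concept,
        # along with the value to be converted.
        return self.concept.convert(self.value, self.unit, unit)

    def change_unit(self, unit):
        """Re-express the stored value in another unit of the same concept."""
        self.value, self.unit = self.value_in(unit), unit


trip = Field("Distance", DistanceConcept(), 10.0, "mile")
trip_in_km = trip.value_in("km")
```

Any number of Distance fields in any number of surveys can share the one concept object, so a new unit added to the concept becomes available to all of them at once.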

The functionality of the Speed object is a good illustration of the power of the delegation concept. This is what allows us to design the Speed column by dragging the other two onto it (see 2.4). Adding a column to the table creates a new field, but without any definition. Dropping the other two fields onto it triggers the automatic definition process, which works as follows: the new field asks the other two to supply information about what they mean together. They send the question back to their respective concepts. The two find out that they only collaborate in the definition of Speed. A reference to the Speed concept is sent back to the new field. The field inserts the reference to the Speed concept in its definition. From now on this is the Speed field. It asks the Speed concept for a list of available units and asks the user to choose. It passes the user’s choice, m/s, back to the Speed concept and asks it to return the Distance and Time units that support m/s. The Speed concept returns m for Distance and seconds for Time. The Speed field asks the Distance field to supply its value in m. Because the Distance field knows it is using the miles unit, it sends its value to the Distance concept and asks it to convert from miles to m. On receiving the result it passes it back to the Speed field. The same happens to the Time value, which the Time field will send to the Speed field after having it converted to seconds by the Time concept. Having the two required values, the Speed field can send them to the Speed concept and request the computation of a Speed value. It is the responsibility of the Speed concept to perform the division and send back the result.
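The exchange of messages described above can be sketched in a few lines. All class, method and variable names are hypothetical, the user's choice is simulated by a callback, and the sketch assumes the fields are dropped in the order Distance, Time.

```python
class Concept:
    """A primitive concept: unit -> factor to the primary unit."""

    def __init__(self, name, factors):
        self.name, self.factors = name, factors

    def convert(self, value, source, target):
        return value * self.factors[source] / self.factors[target]


class Speed:
    """A derived concept defined as Distance / Time."""

    name = "Speed"
    parts = ("Distance", "Time")
    units = {"mph": ("mile", "hour"), "kmh": ("km", "hour"), "m/s": ("m", "second")}

    def supporting_units(self, unit):
        return self.units[unit]

    def compute(self, distance, time):
        return distance / time


DERIVED = [Speed()]


def derived_from(*concepts):
    """Find the derived concept defined in terms of exactly these concepts."""
    names = tuple(c.name for c in concepts)
    matches = [d for d in DERIVED if d.parts == names]
    if len(matches) != 1:
        raise LookupError(f"no unique derived concept for {names}")
    return matches[0]


class Field:
    def __init__(self, concept=None, value=None, unit=None):
        self.concept, self.value, self.unit = concept, value, unit

    def value_in(self, unit):
        # The field delegates the conversion to its concept.
        return self.concept.convert(self.value, self.unit, unit)

    def define_from(self, left, right, choose_unit):
        """What happens when two fields are dropped onto an undefined one."""
        self.concept = derived_from(left.concept, right.concept)
        self.unit = choose_unit(sorted(self.concept.units))
        left_unit, right_unit = self.concept.supporting_units(self.unit)
        self.value = self.concept.compute(left.value_in(left_unit), right.value_in(right_unit))


distance = Field(Concept("Distance", {"m": 1.0, "km": 1000.0, "mile": 1609.344}), 60.0, "mile")
time = Field(Concept("Time", {"second": 1.0, "minute": 60.0, "hour": 3600.0}), 1.0, "hour")

new_column = Field()
new_column.define_from(distance, time, choose_unit=lambda units: "m/s")
```

None of the code in `Field` mentions Speed, miles or seconds: everything specific is supplied by the concept objects, which is why the scenario needs no programming by the survey designer.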

At first view this looks complicated, and it would be if it had to be implemented in a procedural way. Objects, however, allow the user to define this interaction as a series of messages and responses exchanged by the objects. Most of this behaviour can be implemented in the abstract objects supplied with the system. So the preceding scenario turns out to be quite simple to design. When the developer comes to defining specific measurable concepts and their units, the abstract classes defining measurable concepts and their features, including their interaction with field objects, are already present in the system, with all their functionality. Fields and concepts already know how to communicate with each other. All the designer has to do is supply definitions for new concepts. The task of defining Distance, Time and Speed will take only a few minutes, after which the objects are operational to define Distance, Time and Speed fields throughout the whole survey, or rather for all surveys.

What we want to offer our users is all the advantages of object orientation without the burden of object-oriented theory and rules. Blaise has shown that this is possible for structured design: Blaise offers a framework in which developers work in a structured way without ever having learned to do so the hard way. One of our users told us one day about a software specialist who was looking at his work and said he was doing structured programming: being told such a thing made him feel like Mr Jourdain, who had been producing prose for his whole life without knowing it. We are convinced that this can work for object-oriented design too. The only way to prove it is to build the system and see what happens. It is a bet, but we are confident that the odds are on our side.

Notice that the definition of fields by reference to concepts such as Distance and Time, and the definition of the concepts themselves, are two different design sub-processes. Keeping them separate promotes both modularity and standardisation: it makes the definition of concepts independent of their use. This approach makes it possible to separate tasks that are normally performed by different persons: defining concepts and maintaining classifications is usually not the responsibility of people in charge of defining and maintaining questionnaires or publishing tables. By managing those tasks within different sub-processes we ensure that they do not interfere with each other.

Conversion of Age to Age-class offers another example of the possible contribution of object orientation to making metadata more readily available both to software and to the user. Age information is usually present in microdata in the form of an age field. In cross tables, age information is often present in the form of age classes defining the rows or the columns. This could be modelled by the definition of an Age concept and an Age-class concept, which are interrelated. The Age-class concept should know what the relationship is between an Age value and an Age-class value. It would be expected to know the answer to questions such as: what age class does the age value 45 belong to? Is the age value 45 greater than (all the values belonging to) the age class 31-40? What is the highest age value belonging to age class 61-69? In other design methodologies, where processes are defined far away from the data, such questions would be implemented in separate functions. An object-oriented system will make it possible to manage all these aspects as part of one object: the Age-class object.
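The three questions above could be answered by one object along these lines. The class names and the class boundaries are hypothetical; only the classes 31-40 and 61-69 are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeClass:
    low: int
    high: int

    @property
    def label(self):
        return f"{self.low}-{self.high}"

    def __contains__(self, age):
        return self.low <= age <= self.high


class AgeClassification:
    """Knows how Age values relate to Age-class values."""

    def __init__(self, bounds):
        self.classes = [AgeClass(low, high) for low, high in bounds]

    def class_of(self, age):
        """What age class does this age value belong to?"""
        for age_class in self.classes:
            if age in age_class:
                return age_class
        raise ValueError(f"age {age} falls outside the classification")

    def by_label(self, label):
        return next(c for c in self.classes if c.label == label)

    def exceeds(self, age, label):
        """Is `age` greater than all the values belonging to the class?"""
        return age > self.by_label(label).high

    def highest(self, label):
        """What is the highest age value belonging to the class?"""
        return self.by_label(label).high


# Hypothetical classification containing the classes mentioned in the text.
ages = AgeClassification([(0, 30), (31, 40), (41, 50), (51, 60), (61, 69)])
class_of_45 = ages.class_of(45).label
```

Keeping the boundaries and the questions in one object means that a change of classification, say splitting 0-30, is made in one place and is immediately reflected in every answer.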

4. Conclusion

In this paper we have tried to present some very general ideas about metadata. The aim is not yet to provide solutions, but to define the field for further research. It has become clear that the concept of metadata is very diverse, and that a general approach is needed in order to provide generally applicable solutions.

We have learned some basic principles about metadata that will help us in further research:

*) Meta-information is interpreted in almost as many ways as there are implementations of meta-information concepts.
*) In order to talk about a system covering all meta-information aspects, one must observe meta-information from a broad perspective.
*) Meta-information should be defined in such a way as to become operational. That is: its usage should not be limited to one task, say documentation, but it should remain open for implementing other tasks, even some that may not yet be known at design time.
*) Design principles such as object orientation should play an important part in the functionality of a meta-information system.
*) The ultimate meta-information system will not be one system covering all meta-information aspects, but a loose combination of other systems and tools, sharing the same meta-information principles.

Guided by these principles we will go on searching for ways to handle metadata that will enable us to compare and, more importantly, combine existing metadata tools. Once we reach a metadata concept of a sufficient level of generality and abstraction, we can start designing a system that will contribute to easing metadata management.
