Obstacles To Capturing Context
If context is important to data integration, why is it not already captured and considered in current approaches? Several factors may help explain why context is not generally treated as a first-class concern in data integration. First, many data sources are not created with data integration in mind. The data source may have been created for a very specific use, and data integration may simply be an opportunity to leverage data that has already been captured. At the time the data was originally captured, however, much of the knowledge forming its context may have been implicit. For instance, if the goal is to capture information about students at a university, the data source may neglect to document which university the students attend, because the identity of the university is understood.
Another reason implicit or assumed information may not be captured in a data source is that it is not efficient to do so. Data sources are primarily designed to be fast and efficient data storage and retrieval solutions. One way to build more efficient data sources is to remove or reduce redundant information, which in turn may strip out some of the information that would be useful for determining context. Also, when designing a new data source, one may question the value of including information that has the same value for every tuple in the data source (such as the university in the above example). If included, information that is invariant across every tuple in the database, while useful for establishing context, wastes storage space. For instance, is it worthwhile to indicate that every student in a database attends a particular university if the database is being designed solely to track students at a single university? Such information is either never considered for inclusion in the data source because it is too obvious (implicit), or it is factored out of the data source because it is inefficient to store. Ideally, this information should be added to the data source metadata to provide context for what the data source designers intended to capture and store.
A third reason for not recording the context of a data source is that the people who created the database expect to perform any future data integration themselves. The implicit context information is so “obvious” to them (having designed and constructed the data source) that they may not think any of it is worth recording (or at least not worth much time and effort).
Capturing data source metadata is hardly a novel idea. Many data source solutions provide some capability to annotate data sources with designer comments. However, data integration remains a difficult task, and few if any systems make a pointed effort to establish a data source context. Perhaps the chore of integration would be made easier if context information were more readily available.
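To make the university example concrete, the following is a minimal sketch of how invariant context might be recorded once in source-level metadata and then made explicit only at integration time. It is an illustration under stated assumptions, not a method described in the article: the `DataSource` class, the `with_context` and `integrate` functions, and all sample names and values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Hypothetical data source: rows plus source-level context metadata."""
    name: str
    rows: list[dict]
    # Facts that hold for every tuple, stored once instead of once per row.
    context: dict = field(default_factory=dict)


def with_context(source: DataSource) -> list[dict]:
    """Return copies of the rows with the source-level context made explicit."""
    # Row values win on a key clash, so context never overwrites stored data.
    return [{**source.context, **row} for row in source.rows]


def integrate(*sources: DataSource) -> list[dict]:
    """Combine several sources into one list of context-qualified rows."""
    merged = []
    for source in sources:
        merged.extend(with_context(source))
    return merged


# Two illustrative single-university sources (all values are invented).
north = DataSource(
    name="north_students",
    rows=[{"student_id": 1, "name": "Ada"}],
    context={"university": "North University"},
)
south = DataSource(
    name="south_students",
    rows=[{"student_id": 1, "name": "Grace"}],
    context={"university": "South University"},
)

combined = integrate(north, south)
```

In this sketch, neither source pays the per-tuple storage cost of a university column, yet the two students who share `student_id` 1 remain distinguishable after integration because each row carries the context of the source it came from.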