Using metadata and taxonomies: An assessment for Enterprise Content Management


Isaac Quinn DuPont
October 2007


The research issues to be addressed in this essay are:

  • What are the benefits of controlled vocabularies and metadata sets to content management systems, and their users?
  • Why is this sort of information architecture (above and beyond full text indexing) important?
I will demonstrate that controlled vocabularies ought to be used in combination with free or full-text searching. The decision to use controlled or free indexing is not one of either/or. Because "representational predictability" is very low for general concepts, whereas individual concepts maintain high lexicality and require quick updating and precision, the choice should be to use both indexing methods [Fugmann, 1982]. Further, searchers will use both methods when available. When searching for practical matters or across multiple databases, searchers will use free text, but controlled vocabulary remains the main form of terminological control, because even expert searchers will not formulate free-text queries that constrain non-lexical concepts. The use of metadata and controlled vocabulary is important, especially for Enterprise Content Management, because it aids the retrieval of relevant documents (increases recall) and can support records management for regulatory compliance and administration.

Representational predictability

Considering Fugmann's [1982] revised Semantic Triangle, it is clear that the concept is typically what is sought by a searcher, and when it is not, the expression can be located easily with free-text searching. Concepts, however, are of two sorts: individual and general. Individual concepts are typically lexical [Fugmann, 1982, 395], i.e., they are highly specific and have sharply defined (univocal) meaning. General concepts, on the other hand, are rarely lexical [Fugmann, 1982, 395]. The difficulty in retrieving non-lexical concepts is considerable, since the meaning can be expressed in an infinite variety of lexical ways. Further, the searcher is unable to predict which of these expressions will map onto the terms actually used within the corpus. Thus, general concepts have very low "representational predictability" [Fugmann, 1982]. To address this issue, Fugmann [1982, 397] suggests that any "classification, thesaurus, authority list, controlled vocabulary, etc., serves" to increase representational predictability. These tools do so by reducing an infinite number of lexical expressions to a finite number, and often also by elucidating the logical connections between lexical terms. The elucidation of logical connections can also aid the speed of retrieval because searchers can select from a list of concepts [Cisco, 2005, 46].
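
The reduction of many lexical expressions to a finite set of preferred terms can be illustrated with a minimal Python sketch. The thesaurus entries, the broader-term links, and the function names below are all hypothetical and chosen only for illustration; they are not drawn from Fugmann or from any particular product.

```python
# Hypothetical mini-thesaurus: each preferred term lists the variant
# expressions (entry terms) that an indexer has mapped onto it.
THESAURUS = {
    "records retention": ["document retention", "retention schedule", "keeping records"],
    "employee compensation": ["salary", "wages", "pay", "remuneration"],
}

# Hypothetical broader-term links, making the logical connections
# between concepts explicit so a searcher can browse them.
BROADER = {
    "records retention": "records management",
    "employee compensation": "human resources",
}


def build_entry_index(thesaurus):
    """Invert the thesaurus so any known variant resolves to its preferred term."""
    index = {}
    for preferred, variants in thesaurus.items():
        index[preferred] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


ENTRY_INDEX = build_entry_index(THESAURUS)


def to_preferred(expression):
    """Return the preferred term for an expression, or None if it is uncontrolled."""
    return ENTRY_INDEX.get(expression.strip().lower())
```

Under these assumptions, a searcher who types "wages" and an author who wrote "remuneration" both arrive at the single term "employee compensation", whereas an expression missing from the vocabulary returns nothing and must be left to free-text searching.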

Complementarity

Using controlled vocabularies alone is not sufficient because natural language coins new terminology very quickly, and the conservative nature of controlled vocabularies means they cannot maintain lexical parity. The principal design point of a controlled vocabulary is that the vocabulary be controlled, and thus not change frequently. New terms of art, specific to a discipline or field, are coined with considerable frequency. While many of these words may be temporary buzzwords, others have considerable importance. Technology words are the most dramatic example of rapidly changing locutions, but many fields, academic and corporate, update and change their language to reflect advances in technology, economic or business trends [Cisco, 2005, 49], or socio-political trends. A controlled vocabulary would lose its value if all of the new terms were instantly reflected in the index; thus the conservative approach of slowly adding only the most crucial terms is to be adopted. Further, the rapid maintenance of controlled vocabularies is often expensive [Cisco, 2005, 49].

Controlled vocabularies are essentially a translation from natural language to an artificial language, and thus, as with any translation, problems arise. If a document is machine indexed, subtleties of locution and meaning will be lost; even human indexers, however, will err when attempting to understand difficult or ambiguous text. Indeed, sometimes the requirement to index as specifically as possible will change purposefully ambiguous or over-broad text into constrained and specific index terms. Examples of this translation problem arise in academic documents, specifically those of a theoretical tenor, but one could likewise imagine legal documents in corporate environments being subtly but importantly perverted from their original intent. Incorrect indexing alters only the metadata, not the document itself, but the impact on retrieval could be significant. For example, an incorrectly indexed legal document that is not returned when appropriate queries are sent is akin to a document that does not exist, or at the very least incurs the expense of additional retrieval effort.

Specificity of search meaning is typically considered a virtue of free-text indexing, since what could be more specific than searching for the exact words you want? This virtue, however, often rings false; Fugmann [1982, 398] argues that much of the specificity of natural languages comes from "syntactic devices such as prepositions, word sequence, pronouns, and grammatical cases", which would be impossible to accommodate in a free-text search. The near-infinite number of ways to express a concept results from the looseness and flexibility that a natural language offers a speaker. Even if pluralization and localization can be accommodated by software, suggestive meaning (such as rhetorical uses) may defeat attempts at free-text searching. Some of this extra-lexical meaning can be captured by particular techniques, such as quasi-natural-language processing, or by using external data in addition to the natural language. Google, e.g., uses the lexical data contained within the anchor text of inbound links to eliminate some of the difficulties with free-text searching; its ability to exploit this technique, however, is due to architectural features of the World Wide Web that most corpora lack.

As an empirical fact, searchers will use both free-text and controlled-vocabulary systems when given the option [Fidel, 1991, 511]. Reasons for avoiding the use of a controlled vocabulary (such as a thesaurus) include: lack of trust in index quality, the (temporally and psychologically) expensive vocabulary lookup that multiple-database searches require, and disciplinary differences that favour one method or another [Fidel, 1991, 511 ff]. If searchers lack confidence in the quality (thoroughness and extent) of a controlled-vocabulary index, they will tend to use free-text indexes. This result is unsurprising: thorough searchers require very good recall, but if indexing has been poorly done the searcher has no guarantee that any search query is comprehensive. Free-text searching is seen as an alternative when the controlled-vocabulary index is of poor quality. When searching across multiple databases, searchers will tend to favour free-text searching because of vocabulary eccentricities. It is speculated that if greater standardization were possible across databases, searchers would be more likely to use controlled-vocabulary indexes. The reluctance of searchers to use controlled-vocabulary indexes when searching multiple databases is due to the requirement that index terms be looked up for each separate database; the personal investment of time and energy is too high. The use of pre-built "taxonomies" (really, controlled vocabularies) can mitigate some of the impact of inter-database index discrepancies, since standardization would be common to a ready-made set [Cisco, 2005, 48]. However, the use of pre-built taxonomies cuts both ways, since the quality is likely to be poorer, or at least to require extensive domain-specific specialization [Cisco, 2005, 48], and thus searchers will be less inclined to use the controlled vocabulary if it is perceived to be of poor quality (as described above).

Some disciplines tend to favour one method of search over another; likewise, one could imagine that some corporate fields would prefer one type of search over another. There are many reasons why a disciplinary search preference might arise: high levels of homogenization of terms within the discipline (i.e., high lexicality), practical vs. theoretical searching, etc.
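
Offering searchers both methods within a single query can be sketched as follows. This is a deliberately simple illustration in Python, not a description of any actual retrieval system: the `Document` class, the `search` function, and the sample documents are hypothetical, and free-text matching is reduced to whole-word lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Controlled-vocabulary terms assigned by an indexer (may be empty).
    index_terms: set = field(default_factory=set)


def free_text_match(doc, words):
    """True if every query word occurs in the document text (case-insensitive)."""
    tokens = set(doc.text.lower().split())
    return all(w.lower() in tokens for w in words)


def search(docs, words=(), terms=()):
    """Return ids of documents matching all free-text words OR any controlled term."""
    hits = []
    for doc in docs:
        by_text = bool(words) and free_text_match(doc, words)
        by_term = bool(set(terms) & doc.index_terms)
        if by_text or by_term:
            hits.append(doc.doc_id)
    return hits


# Hypothetical corpus: d3 uses a newly coined term that has no index entry yet.
DOCS = [
    Document("d1", "Quarterly wages review for staff", {"employee compensation"}),
    Document("d2", "Remuneration policy update", {"employee compensation"}),
    Document("d3", "Podcasting guidelines for marketing"),
]
```

In this toy corpus a free-text query for "wages" finds only d1, because d2 expresses the same general concept as "remuneration"; the controlled term "employee compensation" retrieves both. Conversely, d3 carries no index terms and is reachable only through free text, which is the situation a newly coined term creates.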

Habitual search behaviour has been shown to be a less significant factor in the use of one search method or another; Fidel [1991] found that subject area, environment, and number of search steps required were the most significant factors (as described above). Interestingly, the number of free-text search words used to formulate a query does not increase with interactive systems (those that provide on-the-fly lookup of abstracts, vocabularies, and full text) [Fidel, 1991, 511].

Metadata

Index terms, whether free-text or controlled, are one aspect of metadata. Indexes are important for the retrieval of documents, which is one of the most difficult and time-consuming tasks to get right in an Enterprise Content Management system, but metadata can contain other information as well. Franks and Kunde [2006, 57] suggest that there are three types of metadata: descriptive, structural, and administrative. Index terms are a form of descriptive metadata, which usually also includes the document's title, author, date, and other bibliographic information, indexed as free text. Structural metadata tends to be machine readable, and is often opaque to the end user. It usually includes file format declarations, schema declarations, software and hardware specifications, and compression and encryption algorithm specifications (if applicable) [Franks, 2006, 57]. Administrative metadata is vitally important to the maintenance of an Enterprise Content System because it can specify rights management and data preservation. This metadata could, e.g., specify a Records Management policy for each document, or could contain information for automated workflow processes. Some of the administrative metadata could be used for document retrieval, but the complications involved with indexing at a conceptual level are not present when indexing for administration. For example, it might be helpful to retrieve documents at a certain position in a workflow, or documents pending destruction; the administrative metadata would contain this information and thus should be included in searchable database indexes.
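
The three types of metadata, and the two administrative queries mentioned above, might be modelled as in the following Python sketch. The field names (`destroy_after`, `workflow_state`, and so on), the record layout, and the sample values are assumptions made for illustration; they are not taken from Franks and Kunde or from any records-management standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Descriptive:
    title: str
    author: str
    created: date
    index_terms: tuple = ()  # controlled-vocabulary terms


@dataclass
class Structural:
    file_format: str
    schema: Optional[str] = None
    compression: Optional[str] = None
    encryption: Optional[str] = None


@dataclass
class Administrative:
    retention_policy: str
    destroy_after: Optional[date] = None  # None means no scheduled destruction
    workflow_state: str = "draft"


@dataclass
class Record:
    doc_id: str
    descriptive: Descriptive
    structural: Structural
    administrative: Administrative


def pending_destruction(records, today):
    """Ids of records whose scheduled destruction date has been reached."""
    return [
        r.doc_id
        for r in records
        if r.administrative.destroy_after is not None
        and r.administrative.destroy_after <= today
    ]


def in_workflow_state(records, state):
    """Ids of records currently at the given position in a workflow."""
    return [r.doc_id for r in records if r.administrative.workflow_state == state]


# Hypothetical sample records.
RECORDS = [
    Record(
        "r1",
        Descriptive("Supplier contract", "Legal", date(2000, 3, 1), ("contracts",)),
        Structural("application/pdf"),
        Administrative("retain 7 years", date(2007, 3, 1), "approved"),
    ),
    Record(
        "r2",
        Descriptive("Draft pay policy", "HR", date(2007, 9, 15)),
        Structural("text/xml", schema="policy.xsd"),
        Administrative("retain permanently", None, "in review"),
    ),
]
```

Neither query requires any conceptual indexing: both read fixed administrative fields directly, which is why this kind of metadata is comparatively easy to make searchable.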

I have shown that the most difficult conceptual problem facing taxonomy creation for Enterprise Content Management is indexing at the conceptual level, specifically for general concepts. The lack of representational predictability requires a skilled human indexer for the creation of a good controlled-vocabulary index. However, searchers should also be given the choice to use free-text searching, since some applications lend themselves well to this method. Finally, although document retrieval is the most challenging aspect, other forms of metadata ought not to be forgotten. The Enterprise Content System manager/maintainer ought to be especially interested in the proper maintenance of administrative metadata.

References

Cisco, S. L., & Jackson, W. K. (2005, May/Jun.). Creating order out of chaos with taxonomies. Information Management Journal, 39(3), 44-50.

Fidel, R. (1991). Searchers' selection of search keys: II. Controlled vocabulary or free-text searching. Journal of the American Society for Information Science, 42(7), 501-14.

Franks, P., & Kunde, N. (2006, Sept./Oct.). Why metadata matters. Information Management Journal, 40(5), 55-61.

Fugmann, R. (1982). The complementarity of natural and indexing languages. International Classification, 9(3), 140-44.