1. Introduction
This paper discusses our work on automatically verbalizing conceptual data models expressed in the graphical notation of Object-Role Modeling (ORM) into readable text expressed in two Asian languages: Bahasa Melayu (Malay) which has official status as a language in Malaysia, Brunei, Indonesia, and Singapore; and Mandarin (the most widely used Chinese language).
In developing information systems, the most critical aspect is to specify a conceptual model that correctly and completely captures the relevant semantics of the business domain of interest. This model can then be used to drive the later phases of the information systems engineering process. Industrial practice has shown that the most reliable way to validate the model with business domain experts is to communicate the model clearly in their native language.
Object-Role Modeling (Halpin, 2010, 2011, 2012, 2015, 2016; Halpin & Morgan, 2008) is an approach for conceptual data modeling specifically designed to optimize communication between data modelers and business users. Its graphical notation can capture many kinds of constraints that are not expressible in industrial versions of Entity Relationship (ER) models (Chen, 1976) or in class diagrams in the Unified Modeling Language (UML) notation (Object Management Group, 2013). Moreover, its fact-based models are semantically more stable than attribute-based models such as those of ER or UML, which require radical restructuring if a fact needs to be added about some feature modeled as an attribute (Halpin, 2011, 2015).
While ORM’s rich graphical notation helps modelers visualize fine details of their data models, these models should be validated with the domain experts who best understand the business requirements, even though those experts may be unfamiliar with the graphical notation. Hence, the models are best validated by verbalizing them in a controlled natural language (CNL), an unambiguous subset of natural language with a restricted grammar and vocabulary, and by populating the relevant fact types with concrete examples.
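As an illustration of this two-step validation process, the following sketch renders a binary fact type as a controlled-natural-language reading and then populates it with concrete example facts. The templates and names here are purely illustrative, not the actual verbalization patterns used by ORM tools:

```python
# Minimal sketch (hypothetical templates) of validation by verbalization
# and population: a binary fact type is rendered as a CNL sentence, then
# instantiated with concrete sample facts for the domain expert to check.

def verbalize(subject: str, predicate: str, obj: str) -> str:
    """Render a binary fact type as a CNL reading."""
    return f"{subject} {predicate} {obj}"

def populate(reading_template: str, facts: list) -> list:
    """Instantiate the fact type reading with concrete example rows."""
    return [reading_template.format(*fact) for fact in facts]

# Fact type reading shown to the domain expert:
fact_type = verbalize("Employee", "works for", "Department")

# Concrete population used to confirm the reading is intended:
examples = populate("{0} works for {1}",
                    [("Alice", "Sales"), ("Bob", "R&D")])

print(fact_type)        # Employee works for Department
for example in examples:
    print(example)
```

A domain expert unfamiliar with the graphical notation can confirm or reject such verbalized sentences and sample facts directly.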
This process of validation by verbalization and population is central to all fact-oriented modeling approaches, including ORM, Cognition enhanced Natural Information Analysis Method (CogNIAM) (Nijssen & Lemmens, 2008), and Fully Communication Oriented Information Modeling (FCO-IM) (Bakema, Zwart & van der Lek, 2000), which use fact types as their sole data structure.
While some fact-based modeling tools provide automated verbalization in English, comparatively little support exists for verbalizing fact-based models in other languages, especially Asian languages. This paper describes our work on verbalizing ORM models in Malay and Mandarin. We specify some typical transformation patterns; discuss features of these languages that require special treatment to produce natural verbalizations (e.g. noun classifiers, repositioning of modal operators, and different uses for terms equivalent to “who” and “that” in English); and describe our current implementation efforts, which involved creating both a prototype and an extension to the NORMA (Natural ORM Architect) tool (Curland & Halpin, 2010) that allows ORM models to be entered and verbalized in these two Asian languages.
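To give a flavor of one such language-specific transformation, the sketch below shows how noun classifiers (measure words) might be inserted when rendering a universal quantifier in Mandarin, where “each X” is expressed as 每 + classifier + X. The classifier table and object-type names are illustrative assumptions, not the lexicon actually used in our implementation:

```python
# Hypothetical sketch of Mandarin classifier insertion for the universal
# quantifier: "each X" is verbalized as 每 + classifier + X. The classifier
# lookup table below is illustrative, not the paper's actual lexicon.

CLASSIFIERS = {
    "书": "本",   # books take the classifier 本
    "车": "辆",   # vehicles take the classifier 辆
}
DEFAULT_CLASSIFIER = "个"  # generic classifier when no specific one applies

def each(noun: str) -> str:
    """Verbalize 'each <noun>' with the appropriate classifier."""
    return "每" + CLASSIFIERS.get(noun, DEFAULT_CLASSIFIER) + noun

print(each("书"))    # 每本书   (each book)
print(each("员工"))  # 每个员工 (each employee)
```

English verbalizations need no such step, which is one reason a direct word-for-word template substitution does not yield natural Mandarin output.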
The rest of this paper is structured as follows. Section 2 briefly overviews related work on high-level textual languages for data modeling, both within and outside the fact-oriented community. Section 3 discusses the verbalization of fact types as well as our overall strategy for dealing with noun classifiers. Section 4 provides details on how various logical constructs and ORM constraint patterns are mapped to Malay and Mandarin. Section 5 illustrates our prototype implementation for entering and verbalizing ORM models in Malay and Mandarin. Section 6 summarizes the main contributions and outlines future research directions.