Presented at ICCS'2000 in Darmstadt, Germany, on August 14, 2000. Published in B. Ganter & G. W. Mineau, eds., Conceptual Structures: Logical, Linguistic, and Computational Issues, Lecture Notes in AI #1867, Springer-Verlag, Berlin, 2000, pp. 55-81.
With existential graphs, Peirce set out to determine the simplest, most primitive forms for expressing the elements of logic. Although he developed a graphical notation for expressing those forms, they can be expressed equally well in a natural language, an algebraic notation, or many different linear, graphical, or even spoken representations. The following table lists Peirce's five semantic primitives, each illustrated with an English example and, where one applies, the corresponding feature of an SQL WHERE clause. Since these five elements are primitive, they cannot be formally defined in terms of anything more primitive; instead, the second column of the table briefly states their "informal meaning."
Primitive | Informal Meaning | English Example | SQL WHERE-clause Example |
---|---|---|---|
Existence | Something exists. | There is a dog. | |
Coreference | Something is the same as something. | The dog is my pet. | Use of variables and "=" |
Relation | Something is related to something. | The dog has fleas. | Table |
Conjunction | A and B. | The dog is running, and the dog is barking. | AND |
Negation | Not A. | The dog is not sleeping. | NOT( ) |
The five primitives in Table 1 are available in every natural language and in every version of first-order logic. They are called semantic primitives because they go beyond syntactic relations between signs to semantic relations between signs and the world. Any notation that is capable of expressing these five primitives in all possible combinations must include all of FOL as a subset. As an example, the WHERE clause of the SQL query language can express each of these primitives and combine them in all possible ways; therefore, first-order logic is a subset of SQL. Different languages may use different notations for representing the five primitives, and they may treat other operators as basic. Table 2 shows three common operators and their translations to the primitives:
Operator | English Example | Translation to Primitives |
---|---|---|
Universal | Every dog is barking. | not(there is a dog and not(it is barking)) |
Implication | If there is a dog, then it is barking. | not(there is a dog and not(it is barking)) |
Disjunction | A dog is barking, or a cat is eating. | not(not(a dog is barking) and not(a cat is eating)) |
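The translations in Table 2 can be checked mechanically over a small finite domain. The following is a minimal sketch, not part of the original paper: the domain `ANIMALS` and all function names are hypothetical. The "primitive" versions use only existence (`any`), conjunction (`and`), and negation (`not`), and they are compared against the direct forms for every possible assignment.

```python
from itertools import product

# Hypothetical finite domain; the names are illustrative only.
ANIMALS = ["rex", "fido", "tom"]

def every_dog_barks_primitives(is_dog, barks):
    # not(there is a dog and not(it is barking)):
    # only existence (any), conjunction (and), and negation (not).
    return not any(is_dog[x] and not barks[x] for x in ANIMALS)

def every_dog_barks_direct(is_dog, barks):
    # Universal quantifier plus implication, written with all() and "or".
    return all((not is_dog[x]) or barks[x] for x in ANIMALS)

def or_from_primitives(a, b):
    # not(not A and not B)
    return not ((not a) and (not b))

def check_equivalences():
    """Return True if the primitive forms agree with the direct forms
    for every assignment of the predicates over ANIMALS."""
    n = len(ANIMALS)
    for bits in product([False, True], repeat=2 * n):
        is_dog = dict(zip(ANIMALS, bits[:n]))
        barks = dict(zip(ANIMALS, bits[n:]))
        if (every_dog_barks_primitives(is_dog, barks)
                != every_dog_barks_direct(is_dog, barks)):
            return False
    for a, b in product([False, True], repeat=2):
        if or_from_primitives(a, b) != (a or b):
            return False
    return True
```

An exhaustive check over a three-element domain is not a proof for all domains, but it illustrates why the universal and implication rows of Table 2 share the same translation.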
Instead of choosing existence and conjunction as primitives, Frege chose the universal and implication as primitives. Then he defined existence and conjunction in terms of his primitives. The result was not as readable as Peirce's algebraic notation, but it was semantically equivalent. Peirce's existential graphs (EGs) were also semantically equivalent to both of the other notations, but they had the simplest of all mappings to the five primitives. SQL also uses existence, conjunction, and negation as its three basic primitives, but it provides the keyword OR as well. SQL has no universal quantifier, which must be represented by a paraphrase of the form NOT EXISTS... NOT. To add logical operators to RDF, Berners-Lee (1999) proposed the tags <not> and <exists>, which can be combined with the implicit conjunction of RDF to define the operators of Table 2.
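The NOT EXISTS ... NOT paraphrase can be illustrated with a small query. This is a sketch under stated assumptions: the table `animal`, its columns, and the function `dogs_all_barking` are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for whatever SQL system is in use.

```python
import sqlite3

def dogs_all_barking(rows):
    """Return True if every dog in rows is barking.

    rows is a list of (name, kind, barking) tuples, with barking as 0 or 1.
    The table and column names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animal (name TEXT, kind TEXT, barking INTEGER)")
    conn.executemany("INSERT INTO animal VALUES (?, ?, ?)", rows)
    # "Every dog is barking" is paraphrased as
    # not(there is a dog and not(it is barking)).
    query = """
        SELECT NOT EXISTS (
            SELECT 1 FROM animal
            WHERE kind = 'dog' AND NOT (barking = 1)
        )
    """
    (result,) = conn.execute(query).fetchone()
    conn.close()
    return bool(result)
```

For example, `dogs_all_barking([("rex", "dog", 1), ("fido", "dog", 0)])` is false because a non-barking dog exists, while an empty table makes the universal statement vacuously true.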
To illustrate ....................
For most people, no training is needed to read a controlled NL, but some training is needed to write it. For computers, it is easy to translate a controlled NL to or from logic, but fully automated understanding of unrestricted NL is still an unsolved research problem. To provide semiautomated tools for analyzing unrestricted language, Doug Skuce (1995, 1998, 2000) has designed an evolving series of knowledge extraction (KE) systems, which he called CODE, IKARUS, and DocKMan (Document-based Knowledge Management). The KE tools use a version of controlled English called ClearTalk, which is intelligible to both people and computers. As input, the KE tools take documents written in unrestricted NL, but they require assistance from a human editor to generate ClearTalk as output. Once the ClearTalk has been edited and approved, further processing by the KE tools is fully automated. The ClearTalk statements can either be stored in a knowledge base or be written as annotations to the original documents. Because of the way they're generated, the comments that people read are guaranteed to be logically equivalent to the computer implementation.
The oldest logic patterns expressed in controlled natural language are the four types of statements used in Aristotle's system of syllogisms. Each syllogistic rule combines a major premise and a minor premise to draw a conclusion. Following are examples of the four sentence patterns:
Other important logic patterns are the if-then rules used in expert systems. In some rule-based systems, the controlled language is about as English-like as COBOL, but others are much more natural. Attempto Controlled English (Fuchs et al. 1998; Schwitter 1998) is an example of a rich, but unambiguous language that uses a version of Kamp's theory for resolving indexicals. Following are two ACE rules used to specify operating procedures for a library database:
If a copy of a book is checked out to a borrower and a staff member returns the copy then the copy is available.
If a staff member adds a copy of a book to the library and no catalog entry of the book exists then the staff member creates a catalog entry that contains the author name of the book and the title of the book and the subject area of the book and the staff member enters the id of the copy and the copy is available.
Rules like these are translated automatically to the Horn-clause subset of FOL, which is the basis for Prolog and many expert system languages. The subset of FOL consisting of Horn-clause rules plus Aristotelian syllogisms can be executed efficiently, but it is powerful enough to specify a Turing machine.
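To make the execution of Horn-clause rules concrete, the following minimal sketch applies forward chaining to a ground (variable-free) version of the first library rule. It is an illustration only, not the output of the ACE translator: the predicate names, the individuals `copy1`, `ann`, and `bob`, and the function `forward_chain` are all hypothetical. A system such as Prolog would use rules with variables and unification instead of ground atoms.

```python
def forward_chain(facts, rules):
    """Apply Horn rules (body -> head) until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # A rule fires when every atom of its body is already derived.
            if head not in derived and all(atom in derived for atom in body):
                derived.add(head)
                changed = True
    return derived

# Hypothetical ground instance of the rule:
# if a copy is checked out to a borrower and a staff member returns the copy,
# then the copy is available.
RULES = [
    (
        (("checked_out", "copy1", "ann"), ("returns", "bob", "copy1")),
        ("available", "copy1"),
    ),
]

FACTS = {("checked_out", "copy1", "ann"), ("returns", "bob", "copy1")}
```

Calling `forward_chain(FACTS, RULES)` adds `("available", "copy1")` to the set of facts; if the `returns` fact is missing, the rule does not fire and the copy is not derived as available.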
For database queries and constraints, natural language statements with the full expressive power of FOL can be translated to SQL. Although many NL query systems have been developed, none of them have yet become commercially successful. The major stumbling block is the amount of effort required to define the vocabulary terms and map them to appropriate fields of the database. But if KE tools are used to design the database, the vocabulary needed for the query system can be generated as a by-product of the design process. As an example, the RÉCIT system (Rassinoux 1994; Rassinoux et al. 1998) uses KE tools to extract knowledge from medical documents written in English, French, or German and translates the results to a language-independent representation in conceptual graphs. The knowledge extraction process defines the appropriate vocabulary, specifies the database design, and adds new information to the database. The vocabulary generated by the KE process is sufficient for end users to ask questions and get answers in any of the three languages.
Design and specification languages have multiple metalevels. As an example, the Unified Modeling Language has four levels: the metametalanguage defines the syntax and semantics of the UML notations; the metalanguage defines the general-purpose UML types; a systems analyst defines application types as instances of the UML types; finally, the working data of an application program consists of instances of the application types. To provide a unified view of all these levels, Olivier Gerbé and his colleagues at the DMR Consulting Group implemented design tools that use conceptual graphs as the representation language at every level (Gerbé et al. 1995, 1996, 1997, 1998, 2000). For his PhD dissertation, Gerbé developed an ontology for using CGs as the metametalanguage for defining CGs themselves. He also applied it to other notations, including UML and the Common KADS system for designing expert systems. Using that theory, Gerbé and his colleagues developed the Method Repository System for defining, editing, and displaying the methods used by the DMR consultants. Internally, the knowledge base is stored in conceptual graphs, but externally, the graphs can be translated to web pages in either English or French. About 200 business processes have been modeled in a total of 80,000 CGs. Since DMR is a Canadian company, the language-independent nature of CGs is important because it allows the specifications to be stored in the neutral CG form. Then any manager, systems analyst, or programmer can read them in his or her native language.
Translating an informal diagram to a formal notation of any kind is as difficult as translating unrestricted NL to executable programs. But it is much easier to translate a formal representation in any version of logic to controlled natural languages, to various kinds of graphics, and to executable specifications. Walling Cyre and his students have developed KE tools for mapping both the text and the diagrams from patent applications and similar documents to conceptual graphs (Cyre et al. 1994, 1997, 1999). Then they implemented a scripting language for translating the CGs to circuit diagrams, block diagrams, and other graphic depictions. Their tools can also translate CGs to VHDL, a hardware design language used to specify very high speed integrated circuits (VHSIC).
No single system discussed in this paper incorporates all the features desired in a KE system, but the critical research has been done, and the remaining work requires more development effort than pure research. Figure 18 shows the flow of information from documents to logic and then to documents or to various computational representations. The dotted arrow from documents to controlled languages requires human assistance. The solid arrows represent fully automated translations that have been implemented in one or more systems.
Figure 18. Flow of information from documents to computer representations
For the KE tools, the unifying representation language is logic, which may be implemented in different subsets and notations for different tools. All the subsets, however, use the same vocabulary of natural-language terms, which map to the same ontology of concepts and relations. From the user's point of view, a KE system communicates in a subset of natural language, and the differences between tools appear to be task-related differences rather than differences in language.