Multimodal Architecture and Interfaces (original) (raw)

W3C Recommendation 25 October 2012

This version:

http://www.w3.org/TR/2012/REC-mmi-arch-20121025/

Latest version:

http://www.w3.org/TR/mmi-arch/

Previous version:

http://www.w3.org/TR/2012/PR-mmi-arch-20120814/

Editor:

Jim Barnett, Genesys Telecommunications Laboratories

Authors:

Michael Bodell (until 2012, while at Microsoft)

Deborah Dahl, Invited Expert

Ingmar Kliche, Deutsche Telekom AG

Jim Larson, Invited Expert

Brad Porter (until 2005, while at Tellme)

Dave Raggett (until 2007, while at W3C/Volantis)

T.V. Raman (until 2005, while at IBM)

Bertha Helena Rodriguez, Institut Telecom

Muthuselvam Selvaraj, (until 2009, while at HP)

Raj Tumuluri, Openstream

Andrew Wahbe (until 2006, while at VoiceGenie)

Piotr Wiechno, France Telecom

Moshe Yudkowsky, Invited Expert (until 2012)

Please refer to the erratafor this document, which may include normative corrections.

1 Conformance Requirements

An implementation is conformant with the MMI Architecture if it consists of one or more software constituents that are conformant with the MMI Life-Cycle Event specification.

A constituent is conformant with the MMI Life-Cycle Event specification if it supports the Life-Cycle Event interface between the Interaction Manager and the Modality Component defined in 6 Interface between the Interaction Manager and the Modality Components. To support the Life-Cycle Event interface, a constituent must be able to handle all Life-Cycle events defined in 6.2 Standard Life Cycle Events either as an Interaction Manager or as a Modality Component or as both.

Transport and format of Life-Cycle Event messages may be implemented in any manner, as long as their contents conform to the standard Life-Cycle Event definitions given in 6.2 Standard Life Cycle Events. Any implementation that uses XML format to represent the life-cycle events must comply with the normative MMI XML schemas contained in C Event Schemas.

The key words MUST, MUST NOT, REQUIRED,SHALL,SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and _OPTIONAL_in this specification are to be interpreted as described in [IETF RFC 2119].

The terms BASE URI and RELATIVE URI are used in this specification as they are defined in [RFC2396].

Any section that is not marked as 'informative' is normative.

3 Overview

This section is informative.

This document describes the architecture of the Multimodal Interaction (MMI) framework [MMIF] and the interfaces between its constituents. The MMI Working Group is aware that multimodal interfaces are an area of active research and that commercial implementations are only beginning to emerge. Therefore we do not view our goal as standardizing a hypothetical existing common practice, but rather providing a platform to facilitate innovation and technical development. Thus the aim of this design is to provide a general and flexible framework providing interoperability among modality-specific components from different vendors - for example, speech recognition from one vendor and handwriting recognition from another. This framework places very few restrictions on the individual components, but instead focuses on providing a general means for communication, plus basic infrastructure for application control and platform services.

Our framework is motivated by several basic design goals:

Encapsulation. The architecture should make no assumptions about the internal implementation of components, which will be treated as black boxes.
Distribution. The architecture should support both distributed and co-hosted implementations.
Extensibility. The architecture should facilitate the integration of new modality components. For example, given an existing implementation with voice and graphics components, it should be possible to add a new component (for example, a biometric security component) without modifying the existing components.
Recursiveness. The architecture should allow for nesting, so that an instance of the framework consisting of several components can be packaged up to appear as a single component to a higher-level instance of the architecture.
Modularity. The architecture should provide for the separation of data, control, and presentation.

Even though multimodal interfaces are not yet common, the software industry as a whole has considerable experience with architectures that can accomplish these goals. Since the 1980s, for example, distributed message-based systems have been common. They have been used for a wide range of tasks, including in particular high-end telephony systems. In this paradigm, the overall system is divided up into individual components which communicate by sending messages over the network. Since the messages are the only means of communication, the internals of components are hidden and the system may be deployed in a variety of topologies, either distributed or co-located. One specific instance of this type of system is the DARPA Hub Architecture, also known as the Galaxy Communicator Software Infrastructure [Galaxy]. This is a distributed, message-based, hub-and-spoke infrastructure designed for constructing spoken dialogue systems. It was developed in the late 1990's and early 2000's under funding from DARPA. This infrastructure includes a program called the Hub, together with servers which provide functions such as speech recognition, natural language processing, and dialogue management. The servers communicate with the Hub and with each other using key-value structures called frames.

Another recent architecture that is relevant to our concerns is the model-view-controller (MVC) paradigm. This is a well known design pattern for user interfaces in object oriented programming languages, and has been widely used with languages such as Java, Smalltalk, C, and C++. The design pattern proposes three main parts: a Data Model that represents the underlying logical structure of the data and associated integrity constraints, one or more Views which correspond to the objects that the user directly interacts with, and a Controller which sits between the data model and the views. The separation between data and user interface provides considerable flexibility in how the data is presented and how the user interacts with that data. While the MVC paradigm has been traditionally applied to graphical user interfaces, it lends itself to the broader context of multimodal interaction where the user is able to use a combination of visual, aural and tactile modalities.

4 Design versus Run-Time considerations

This section is informative.

In discussing the design of MMI systems, it is important to keep in mind the distinction between the design-time view (i.e., the markup) and the run-time view (the software that executes the markup). At the design level, we assume that multimodal applications will take the form of multiple documents from different namespaces. In many cases, the different namespaces and markup languages will correspond to different modalities, but we do not require this. A single language may cover multiple modalities and there may be multiple languages for a single modality.

At runtime, the MMI architecture features loosely coupled software constituents that may be either co-resident on a device or distributed across a network. In keeping with the loosely-coupled nature of the architecture, the constituents do not share context and communicate only by exchanging events. The nature of these constituents and the APIs between them is discussed in more detail in Sections 3-5, below. Though nothing in the MMI architecture requires that there be any particular correspondence between the design-time and run-time views, in many cases there will be a specific software component responsible for each different markup language (namespace).

4.1 Markup and The Design-Time View

At the markup level, an application consists of multiple documents. A single document may contain markup from different namespaces if the interaction of those namespaces has been defined. By the principle of encapsulation, however, the internal structure of documents is invisible at the MMI level, which defines only how the different documents communicate. One document has a special status, namely the Root or Controller Document, which contains markup defining the interaction between the other documents. Such markup is called Interaction Manager markup. The other documents are called Presentation Documents, since they contain markup to interact directly with the user. The Controller Document may consist solely of Interaction Manager markup (for example a state machine defined in CCXML [CCXML] or SCXML [SCXML]) or it may contain Interaction Manager markup combined with presentation or other markup. As an example of the latter design, consider a multimodal application in which a CCXML document provides call control functionality as well as the flow control for the various Presentation documents. Similarly, an SCXML flow control document could contain embedded presentation markup in addition to its native Interaction Management markup.

These relationships are recursive, so that any Presentation Document may serve as the Controller Document for another set of documents. This nested structure is similar to 'Russian Doll' model of Modality Components, described below in 4.2 Software Constituents and The Run-Time View.

The different documents are loosely coupled and co-exist without interacting directly. Note in particular that there are no shared variables that could be used to pass information between them. Instead, all runtime communication is handled by events, as described below in 6 Interface between the Interaction Manager and the Modality Components. Note, however, that this only applies to non-root documents. The IM, which loads the root document, interacts with "other components". I.e., the IM (having the root-document) interacts directly through life-cycle events with Modality Components (having different documents and/or namespaces).

Furthermore, it is important to note that the asynchronicity of the underlying communication mechanism does not impose the requirement that the markup languages present a purely asynchronous programming model to the developer. Given the principle of encapsulation, markup languages are not required to reflect directly the architecture and APIs defined here. As an example, consider an implementation containing a Modality Component providing Text-to-Speech (TTS) functionality. This Component must communicate with the Interaction Manager via asynchronous events (see 4.2 Software Constituents and The Run-Time View). In a typical implementation, there would likely be events to start a TTS play and to report the end of the play, etc. However, the markup and scripts that were used to author this system might well offer only a synchronous "play TTS" call, it being the job of the underlying implementation to convert that synchronous call into the appropriate sequence of asynchronous events. In fact, there is no requirement that the TTS resource be individually accessible at all. It would be quite possible for the markup to present only a single "play TTS and do speech recognition" call, which the underlying implementation would realize as a series of asynchronous events involving multiple Components.

Existing languages such as HTML may be used as either the Controller Documents or as Presentation Documents. Further examples of potential markup components are given in 5.2.7 Examples

4.2 Software Constituents and The Run-Time View

At the core of the MMI runtime architecture is the distinction between the Interaction Manager (IM) and the Modality Components, which is similar to the distinction between the Controller Document and the Presentation Documents. The Interaction Manager interprets the Controller Document while the individual Modality Components are responsible for specific tasks, particularly handling input and output in the various modalities, such as speech, pen, video, etc.

The Interaction Manager receives all the events that the various Modality Components generate. Those events may be commands or replies to commands, and it is up to the Interaction Manager to decide what to do with them, i.e., what events to generate in response to them. In general, the MMI architecture follows a 'targetless' event model. That is, the Component that raises an event does not specify its destination. Rather, it passes it up to the Runtime Framework, which will pass it to the Interaction Manager. The IM, in turn, decides whether to forward the event to other Components, or to generate a different event, etc.

Modality Components are black boxes, required only to implement the Modality Component Interface API which is described below. This API allows the Modality Components to communicate with the IM and thus indirectly with each other, since the IM is responsible for delivering events/messages among the Components. Since the internals of a Component are hidden, it is possible for an Interaction Manager and a set of Components to present themselves as a Component to a higher-level Interaction Manager. All that is required is that the IM implement the Component API. The result is a "Russian Doll" model in which Components may be nested inside other Components to an arbitrary depth. Nesting components in this manner is one way to produce a 'complex' Modality Component, namely one that handles multiple modalities simultaneously. However, it is also possible to produce complex Modality Components without nesting, as discussed in 5.2.3 The Modality Components.

In addition to the Interaction Manager and the modality components, there is a Runtime Framework that provides infrastructure support, in particular a transport layer which delivers events among the components.

Because we are using the term 'Component' to refer to a specific set of entities in our architecture, we will use the term 'Constituent' as a cover term for all the elements in our architecture which might normally be called 'software components'.

4.3 Relationship to EMMA

The Extended Multimodal Annotation Language [EMMA], is a set of specifications for multimodal systems, and provides details of an XML markup language for containing and annotating the interpretation of user input. For example, a user of a multimodal application might use both speech to express a command, and keystroke gesture to select or draw command parameters. The Speech Recognition Modality would express the user command using EMMA to indicate the input source (speech). The Pen Gesture Modality would express the command parameters using EMMA to indicate the input source (pen gestures). Both modalities may include timing information in the EMMA notation. Using the timing information, a fusion module combines the speech and pen gesture information into a single EMMA notation representing both the command and its parameters. The use of EMMA enables the separation of recognition process from the information fusion process, and thus enables reusable recognition modalities and general purpose information fusion algorithms.

5 Overview of Architecture

Here is a list of the Constituents of the MMI architecture. They are discussed in more detail below.

the Interaction Manager, which coordinates the different modalities. It is the Controller in the MVC paradigm.
the Data Component, which provides the common data model and represents the Model in the MVC paradigm.
the Modality Components, which provide modality-specific interaction capabilities. They are the Views in the MVC paradigm.
the Runtime Framework, which provides the basic infrastructure and enables communication among the other Constituents.

5.2 The Constituents

This section presents the responsibilities of the various constituents of the MMI architecture.

5.2.1 The Interaction Manager

All life-cycle events that the Modality Components generate MUST be delivered to the Interaction Manager. All life-cycle events that are delivered to Modality Components MUST be sent by the Interaction Manager.

Due to the Russian Doll model, Modality Components MAY contain their own Interaction Managers to handle their internal events. However these Interaction Managers are not visible to the top level Runtime Framework or Interaction Manager.

If the Interaction Manager does not contain an explicit handler for an event, it MUST respect any default behavior that has been established for the event. If there is no default behavior, the Interaction Manager MUST ignore the event. (In effect, the Interaction Manager's default handler for all events is to ignore them.)

The following paragraph is informative.

Normally there will be specific markup associated with the IM instructing it how to respond to events. This markup will thus contain a lot of the most basic interaction logic of an application. Existing languages such as SMIL, CCXML, SCXML, or ECMAScript can be used for IM markup as an alternative to defining special-purpose languages aimed specifically at multimodal applications. The IM fulfills multiple functions. For example, it is responsible for synchronization of data and focus, etc., across different Modality Components as well as the higher-level application flow that is independent of Modality Components. It also maintains the high-level application data model and may handle communication with external entities and back-end systems. Logically these functions could be separated into separate constituents and implementations may want to introduce internal structure to the IM. However, for the purposes of this standard, we leave the various functions rolled up in a single monolithic Interaction Manager component. We note that state machine languages such as SCXML are a good choice for authoring such a multi-function component, since state machines can be composed. Thus it is possible to define a high-level state machine representing the overall application flow, with lower-level state machines nested inside it handling the the cross-modality synchronization at each phase of the higher-level flow.

5.2.2 The Data Component

This section is informative.

The Data Component is responsible for storing application-level data. The Interaction Manager is a client of the Data Component and is able to access and update it as part of its control flow logic, but Modality Components do not have direct access to it. Since Modality Components are black boxes, they may have their own internal Data Components and may interact directly with backend servers. However, the only way that Modality Components can share data among themselves and maintain consistency is via the Interaction Manager. It is therefore a good application design practice to divide data into two logical classes: private data, which is of interest only to a given modality component, and public data, which is of interest to the Interaction Manager or to more than one Modality Component. Private data may be managed as the Modality Component sees fit, but all modification of public data, including submission to back end servers, should be entrusted to the Interaction Manager.

This specification does not define an interface between the Data Component and the Interaction Manager. This amounts to treating the Data Component as part of the Interaction Manager. (Note that this means that the data access language will be whatever one the IM provides.) The Data Component is shown with a dotted outline in the diagram above, however, because it is logically distinct and could be placed in a separate component.

6 Interface between the Interaction Manager and the Modality Components

The most important interface in this architecture is the one between the Modality Components and the Interaction Manager. Modality Components communicate with the IM via asynchronous events. Constituents MUST be able to send events and to handle events that are delivered to them asynchronously. It is not required that Constituents use these events internally since the implementation of a given Constituent is black box to the rest of the system. In general, it is expected that Constituents will send events both automatically (i.e., as part of their implementation) and under mark-up control.

The majority of the events defined here come in request/response pairs. That is, one party (either the IM or an MC) sends a request and the other returns a response. (The exceptions are the ExtensionNotification, StatusRequest and StatusResponse events, which can be sent by either party.) In each case it is specified which party sends the request and which party returns the response. If the wrong party sends a request or response, or if the request or response is sent under the wrong conditions (e.g. response without a previous request) the behavior of the receiving party is undefined. In the descriptions below, we say that the originating party "MAY" send the request, because it is up to the internal logic of the originating party to decide if it wants to invoke the behavior that the request would trigger. On the other hand, we say that the receiving party "MUST" send the response, because it is mandatory to send the response if and when the request is received.

6.1 Common Event Fields

The concept of 'context' is basic to these events described below. A context represents a single extended interaction with zero or more users across one or more modality components. In a simple unimodal case, a context can be as simple as a phone call or SSL session. Multimodal cases are more complex, however, since the various modalities may not be all used at the same time. For example, in a voice-plus-web interaction, e.g., web sharing with an associated VoIP call, it would be possible to terminate the web sharing and continue the voice call, or to drop the voice call and continue via web chat. In these cases, a single context persists across various modality configurations. In general, the 'context'_SHOULD_cover the longest period of interaction over which it would make sense for components to store information.

For examples of the concrete XML syntax for all these events, see B Examples of Life-Cycle Events

The following common fields are shared by multiple life-cycle events:

6.2 Standard Life Cycle Events

The Multimodal Architecture defines the following basic life-cycle events which the Interaction Manager and Modality Components MUST support. These events allow the Interaction Manager to invoke modality components and receive results from them. They thus form the basic interface between the IM and the Modality components. Note that the ExtensionNotification event offers extensibility since it contains arbitrary content and can be raised by either the IM or the Modality Components at any time once the context has been established. For example, an application relying on speech recognition could use the 'Extension' event to communicate recognition results or the fact that speech had started, etc.

In the definitions below, all fields are mandatory, unless explicitly stated to be optional.

6.2.4 DoneNotification

If the Modality Component reaches the end of its processing, it_MUST_return a DoneNotification to the IM that issued the StartRequest.

6.2.4.1 DoneNotification Properties

RequestID. See 6.1.4 RequestID. This_MUST_ match the RequestID of the StartRequest event.
Context See 6.1.1 Context. This MUST match the value in the StartRequest event.
Status See 6.1.5 Status.
StatusInfo Optional. See 6.1.6 StatusInfo.
Source See 6.1.2 Source.
Target See 6.1.3 Target.
Data Optional. See 6.1.7 Data.

The DoneNotification event is intended to indicate the completion of the processing that has been initiated by the Interaction Manager with a StartRequest. As an example a voice modality component might use the DoneNotification event to indicate the completion of a recognition task. In this case the DoneNotification event might carry the recognition result expressed using EMMA. However, there may be tasks which do not have a specific end. For example the Interaction Manager might send a StartRequest to a graphical modality component requesting it to display certain information. Such a task does not necessarily have a specific end and thus the graphical modality component might never send a DoneNotification event to the Interaction Manager. Thus the graphical modality component would display the screen until it received another StartRequest (or some other lifecycle event) from the Interaction Manager.

6.2.10 StatusRequest/StatusResponse

The StatusRequest message and the corresponding StatusResponse are intended to provide keep-alive functionality. Either the IM or the Modality Component MAY send the StatusRequest message. The recipient MUST respond with the StatusResponse message, unless the request specifies a context which is unknown to it, in which case the behavior is undefined.

6.2.10.1 Status Request Properties

RequestID. See 6.1.4 RequestID. A newly generated identifier used to identify this request.
Context See 6.1.1 Context. Optional specification of the context for which the status is requested. If it is present, the recipient MUST respond with a StatusResponse message indicating the status of the specified context. If it is not present, the recipient MUST send a StatusResponse message indicating the status of the underlying server, namely the software that would host a new context if one were created.
Source See 6.1.2 Source.
Target See 6.1.3 Target.
Data Optional. See 6.1.7 Data.

6.2.10.2 StatusResponse Properties

RequestID. See 6.1.4 RequestID. This_MUST_ match the RequestID in the StatusRequest event.
Context See 6.1.1 Context. An optional specification of the context for which the status is being returned. If it is present, the response MUST represent the status of the specified context. If it is not present, the response MUST represent the status of the underlying server.
Status An enumeration of 'Alive' or 'Dead'. The meaning of these values depends on whether the 'context' parameter is present. If it is, and the specified context is still active and capable of handling new life cycle events, the sender MUST set this field to 'Alive'. If the 'context' parameter is present and the context has terminated or is otherwise unable to process new life cycle events, the sender MUST set the status to 'Dead'. If the 'context' parameter is not provided, the status refers to the underlying server. If the sender is able to create new contexts, it MUST set the status to 'Alive', otherwise, it MUST set it to 'Dead'.
Source See 6.1.2 Source.
Target See 6.1.3 Target.
Data Optional. See 6.1.7 Data.