Voice Extensible Markup Language (VoiceXML) Version 2.0
W3C Recommendation 16 March 2004
This Version:
http://www.w3.org/TR/2004/REC-voicexml20-20040316/
Latest Version:
http://www.w3.org/TR/voicexml20/
Previous Version:
http://www.w3.org/TR/2004/PR-voicexml20-20040203/
Editors:
Scott McGlashan, Hewlett-Packard (Editor-in-Chief)
Daniel C. Burnett, Nuance Communications
Jerry Carter, Invited Expert
Peter Danielsen, Lucent (until October 2002)
Jim Ferrans, Motorola
Andrew Hunt, ScanSoft
Bruce Lucas, IBM
Brad Porter, Tellme Networks
Ken Rehor, Vocalocity
Steph Tryphonas, Tellme Networks
Please refer to the errata for this document, which may include some normative corrections.
See also translations.
Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
Abstract
This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
Status of this Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document has been reviewed by W3C Members and other interested parties, and it has been endorsed by the Director as a W3C Recommendation. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.
This specification is part of the W3C Speech Interface Framework and has been developed within the W3C Voice Browser Activity by participants in the Voice Browser Working Group (W3C Members only).
The design of VoiceXML 2.0 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the VoiceXML 2.0 implementation report, along with the associated test suite.
Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.
The W3C maintains a list of any patent disclosures related to this work.
Conventions of this Document
In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate requirement levels for compliant VoiceXML implementations.
Table of Contents
Abbreviated Contents
- 1. Overview
- 2. Dialog Constructs
- 3. User Input
- 4. System Output
- 5. Control flow and scripting
- 6. Environment and Resources
- Appendices
Full Contents
- 1. Overview
- 1.1 Introduction
- 1.2 Background
* 1.2.1 Architectural Model
* 1.2.2 Goals of VoiceXML
* 1.2.3 Scope of VoiceXML
* 1.2.4 Principles of Design
* 1.2.5 Implementation Platform Requirements
- 1.3 Concepts
* 1.3.1 Dialogs and Subdialogs
* 1.3.2 Sessions
* 1.3.3 Applications
* 1.3.4 Grammars
* 1.3.5 Events
* 1.3.6 Links
- 1.4 VoiceXML Elements
- 1.5 Document Structure and Execution
* 1.5.1 Execution within one Document
* 1.5.2 Executing a Multi-Document Application
* 1.5.3 Subdialogs
* 1.5.4 Final Processing
- 2. Dialog Constructs
- 2.1 Forms
* 2.1.1 Form Interpretation
* 2.1.2 Form Items
* 2.1.3 Form Item Variables and Conditions
* 2.1.4 Directed Forms
* 2.1.5 Mixed Initiative Forms
* 2.1.6 Form Interpretation Algorithm
- 2.2 Menus
* 2.2.1 menu element
* 2.2.2 choice element
* 2.2.3 DTMF in Menus
* 2.2.4 enumerate element
* 2.2.5 Grammar Generation
* 2.2.6 Interpretation Model
- 2.3 Form Items
* 2.3.1 field element
* 2.3.2 block element
* 2.3.3 initial element
* 2.3.4 subdialog element
* 2.3.5 object element
* 2.3.6 record element
* 2.3.7 transfer element
- 2.4 Filled
- 2.5 Links
- 3. User Input
- 3.1 Grammars
* 3.1.1 Speech Grammars
* 3.1.2 DTMF Grammars
* 3.1.3 Scope of Grammars
* 3.1.4 Activation of Grammars
* 3.1.5 Semantic Interpretation of Input
* 3.1.6 Mapping Semantic Interpretation Results to VoiceXML forms
- 4. System Output
- 4.1 Prompt
* 4.1.1 Speech Markup
* 4.1.2 Basic Prompts
* 4.1.3 Audio Prompting
* 4.1.4 value Element
* 4.1.5 Bargein
* 4.1.6 Prompt Selection
* 4.1.7 Timeout
* 4.1.8 Prompt Queueing and Input Collection
- 5. Control flow and scripting
- 5.1 Variables and Expressions
* 5.1.1 Declaring Variables
* 5.1.2 Variable Scopes
* 5.1.3 Referencing Variables
* 5.1.4 Standard Session Variables
* 5.1.5 Standard Application Variables
- 5.2 Event Handling
* 5.2.1 throw element
* 5.2.2 catch element
* 5.2.3 Shorthand Notation
* 5.2.4 catch Element Selection
* 5.2.5 Default catch elements
* 5.2.6 Event Types
- 5.3 Executable Content
* 5.3.1 var element
* 5.3.2 assign element
* 5.3.3 clear element
* 5.3.4 if, elseif, else elements
* 5.3.5 prompts
* 5.3.6 reprompt element
* 5.3.7 goto element
* 5.3.8 submit element
* 5.3.9 exit element
* 5.3.10 return element
* 5.3.11 disconnect element
* 5.3.12 script element
* 5.3.13 log element
- 6. Environment and Resources
- 6.1 Resource Fetching
* 6.1.1 Fetching
* 6.1.2 Caching
* 6.1.3 Prefetching
* 6.1.4 Protocols
- 6.2 Metadata Information
* 6.2.1 meta element
* 6.2.2 metadata element
- 6.3 property element
* 6.3.1 Platform-Specific Properties
* 6.3.2 Generic Speech Recognizer Properties
* 6.3.3 Generic DTMF Recognizer Properties
* 6.3.4 Prompt and Collect Properties
* 6.3.5 Fetching Properties
* 6.3.6 Miscellaneous Properties
- 6.4 param element
- 6.5 Value Designations
- Appendices
- Appendix A. Glossary of Terms
- Appendix B. VoiceXML Document Type Definition
- Appendix C. Form Interpretation Algorithm
- Appendix D. Timing Properties
- Appendix E. Audio File Formats
- Appendix F. Conformance
- Appendix G. Internationalization
- Appendix H. Accessibility
- Appendix I. Privacy
- Appendix J. Changes from VoiceXML 1.0
- Appendix K. Reusability
- Appendix L. Acknowledgements
- Appendix M. References
- Appendix N. Media Type and File Suffix
- Appendix O. VoiceXML XML Schema Definition
- Appendix P. Builtin Grammar Types
1. Overview
This document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted, are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties, as well as resource handling, are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy.
VoiceXML originated in 1995 as an XML-based dialog design language intended to simplify the speech recognition application development process within an AT&T project called Phone Markup Language (PML). As AT&T reorganized, teams at AT&T, Lucent and Motorola continued working on their own PML-like languages.
In 1998, W3C hosted a conference on voice browsers. By this time, AT&T and Lucent had different variants of their original PML, while Motorola had developed VoxML, and IBM was developing its own SpeechML. Many other attendees at the conference were also developing similar languages for dialog design, such as HP's TalkML and PipeBeach's VoiceHTML.
The VoiceXML Forum was then formed by AT&T, IBM, Lucent, and Motorola to pool their efforts. The mission of the VoiceXML Forum was to define a standard dialog design language that developers could use to build conversational applications. They chose XML as the basis for this effort because it was clear to them that this was the direction technology was going.
In 2000, the VoiceXML Forum released VoiceXML 1.0 to the public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C as the basis for the creation of a new international standard. VoiceXML 2.0 is the result of this work based on input from W3C Member companies, other W3C Working Groups, and the public.
Developers familiar with VoiceXML 1.0 are particularly directed to Changes from Previous Public Version, which summarizes how VoiceXML 2.0 differs from VoiceXML 1.0.
1.1 Introduction
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
Here are two short examples of VoiceXML. The first is the venerable "Hello World".
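As a complete document (following the specification's own minimal sample):

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml xmlns="http://www.w3.org/2001/vxml"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2001/vxml
            http://www.w3.org/TR/voicexml20/vxml.xsd"
          version="2.0">
      <form>
        <block>Hello World!</block>
      </form>
    </vxml>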
The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.
Our second example asks the user for a choice of drink ("Would you like coffee, tea, milk, or nothing?") and then submits it to a server script.
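A sketch of such a document, close to the specification's example (the grammar file drink.grxml and the target script URI are illustrative):

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form>
        <field name="drink">
          <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
          <grammar src="drink.grxml" type="application/srgs+xml"/>
        </field>
        <block>
          <!-- submit the collected value to the server script -->
          <submit next="http://www.drink.example.com/drink2.asp"/>
        </block>
      </form>
    </vxml>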
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
1.2 Background
This section contains a high-level architectural model, whose terminology is then used to describe the goals of VoiceXML, its scope, its design principles, and the requirements it places on the systems that support it.
1.2.1 Architectural Model
The architectural model assumed by this document has the following components:
Figure 1: Architectural Model
A document server (e.g. a Web server) processes requests from a client application, the VoiceXML interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.
The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.
1.2.2 Goals of VoiceXML
VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs.
VoiceXML is a markup language that:
- Minimizes client/server interactions by specifying multiple interactions per document.
- Shields application authors from low-level and platform-specific details.
- Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts).
- Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
- Is easy to use for simple interactions, and yet provides language features to support complex dialogs.
While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.
1.2.3 Scope of VoiceXML
The language describes the human-machine interaction provided by voice response systems, which includes:
- Output of synthesized speech (text-to-speech).
- Output of audio files.
- Recognition of spoken input.
- Recognition of DTMF input.
- Recording of spoken input.
- Control of dialog flow.
- Telephony features such as call transfer and disconnect.
The language provides means for collecting character and/or spoken input, assigning the input results to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).
1.2.4 Principles of Design
VoiceXML is an XML application [XML].
- The language promotes portability of services through abstraction of platform resources.
- The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes. While producers of platforms may support various grammar formats, the language requires a common grammar format, namely the XML Form of the W3C Speech Recognition Grammar Specification [SRGS], to facilitate interoperability. Similarly, while various audio formats for playback and recording may be supported, the audio formats described in Appendix E must be supported.
- The language supports ease of authoring for common types of interactions.
- The language has a well-defined semantics that preserves the author's intent regarding the behavior of interactions with the user. Client heuristics are not required to determine document element interpretation.
- The language recognizes semantic interpretations from grammars and makes this information available to the application.
- The language has a control flow mechanism.
- The language enables a separation of service logic from interaction behavior.
- It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.
- General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.
- The language provides ways to link documents using URIs, and also to submit data to server scripts using URIs.
- VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (GET or POST) to use in the submittal.
- The language does not require document authors to explicitly allocate and deallocate dialog resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be handled by the implementation platform.
1.2.5 Implementation Platform Requirements
This section outlines the requirements on the hardware/software platforms that will support a VoiceXML interpreter.
Document acquisition. The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on. The "http" URI scheme must be supported. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. When issuing document requests via http, the interpreter context identifies itself using the "User-Agent" header with the value "<name>/<version>", for example, "acme-browser/1.2".
Audio output. An implementation platform must support audio output using audio files and text-to-speech (TTS). The platform must be able to freely sequence TTS and audio output. If an audio output resource is not available, an error.noresource event must be thrown. Audio files are referred to by a URI. The language specifies a required set of audio file formats which must be supported (see Appendix E); additional audio file formats may also be supported.
Audio input. An implementation platform is required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document. If an audio input resource is not available, an error.noresource event must be thrown.
- It must report characters (for example, DTMF) entered by a user. Platforms must support the XML form of DTMF grammars described in the W3C Speech Recognition Grammar Specification [SRGS]; a minimal example follows this list. They should also support the Augmented BNF (ABNF) form of DTMF grammars described in the W3C Speech Recognition Grammar Specification [SRGS].
- It must be able to receive speech recognition grammar data dynamically. It must be able to use speech grammar data in the XML Form of the W3C Speech Recognition Grammar Specification [SRGS]. It should be able to receive speech recognition grammar data in the ABNF form of the W3C Speech Recognition Grammar Specification [SRGS], and may support other formats such as the JSpeech Grammar Format [JSGF] or proprietary formats. Some VoiceXML elements contain speech grammar data; others refer to speech grammar data through a URI. The speech recognizer must be able to accommodate dynamic update of the spoken input for which it is listening through either method of speech grammar data specification.
- It must be able to record audio received from the user. The implementation platform must be able to make the recording available to a request variable. The language specifies a required set of recorded audio file formats which must be supported (see Appendix E); additional formats may also be supported.
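For illustration, a minimal DTMF grammar in the XML form of [SRGS]; the rule name and the accepted digits are arbitrary:

    <grammar mode="dtmf" version="1.0" root="digit"
             xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="digit">
        <!-- accept exactly one of the keys 1, 2, or 3 -->
        <one-of>
          <item>1</item>
          <item>2</item>
          <item>3</item>
        </one-of>
      </rule>
    </grammar>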
Transfer. The platform should be able to support making a third party connection through a communications network, such as the telephone network.
1.3 Concepts
A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
1.3.1 Dialogs and Subdialogs
There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
1.3.2 Sessions
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.
1.3.3 Applications
An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars remain active for the duration of the application, subject to the grammar activation rules discussed in Section 3.1.4.
Figure 2 shows the transition of documents (D) in an application that share a common application root document (root).
Figure 2: Transitioning between documents in an application.
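A minimal sketch of a two-document application (filenames, variable name, and prompt text are assumptions):

    <!-- app-root.vxml: the application root document -->
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <var name="greeting" expr="'Hello'"/>
    </vxml>

    <!-- leaf.vxml: a leaf document naming its root via the application attribute -->
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
          application="app-root.vxml">
      <form>
        <block>
          <!-- root document variables are visible as application.* -->
          <prompt><value expr="application.greeting"/> from the leaf document.</prompt>
        </block>
      </form>
    </vxml>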
1.3.4 Grammars
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
1.3.5 Events
VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.
Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Furthermore, catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
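A sketch of field-level catch handlers (the event names are standard; the field, grammar URI, prompt wording, and count threshold are illustrative):

    <field name="city">
      <prompt>Which city?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <!-- after the third nomatch or noinput, apologize and re-prompt -->
      <catch event="nomatch noinput" count="3">
        <prompt>Sorry, I still did not get that.</prompt>
        <reprompt/>
      </catch>
    </field>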
1.3.6 Links
A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.
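A sketch of a link whose grammar is active throughout the link's scope (the destination URI and grammar content are illustrative):

    <link next="http://www.example.com/operator.vxml">
      <grammar mode="voice" version="1.0" root="root"
               xmlns="http://www.w3.org/2001/06/grammar">
        <!-- saying "operator" anywhere in scope transfers control -->
        <rule id="root" scope="public">operator</rule>
      </grammar>
    </link>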
1.4 VoiceXML Elements
Table 1: VoiceXML Elements
Element | Purpose | Section
---|---|---
<assign> | Assign a variable a value | 5.3.2
<audio> | Play an audio clip within a prompt | 4.1.3
<block> | A container of (non-interactive) executable code | 2.3.2
<catch> | Catch an event | 5.2.2
<choice> | Define a menu item | 2.2.2
<clear> | Clear one or more form item variables | 5.3.3
<disconnect> | Disconnect a session | 5.3.11
<else> | Used in <if> elements | 5.3.4
<elseif> | Used in <if> elements | 5.3.4
<enumerate> | Shorthand for enumerating the choices in a menu | 2.2.4
<error> | Catch an error event | 5.2.3
<exit> | Exit a session | 5.3.9
<field> | Declares an input field in a form | 2.3.1
<filled> | An action executed when fields are filled | 2.4
<form> | A dialog for presenting information and collecting data | 2.1
<goto> | Go to another dialog in the same or different document | 5.3.7
<grammar> | Specify a speech recognition or DTMF grammar | 3.1
<help> | Catch a help event | 5.2.3
<if> | Simple conditional logic | 5.3.4
<initial> | Declares initial logic upon entry into a (mixed initiative) form | 2.3.3
<link> | Specify a transition common to all dialogs in the link's scope | 2.5
<log> | Generate a debug message | 5.3.13
<menu> | A dialog for choosing amongst alternative destinations | 2.2.1
<meta> | Define a metadata item as a name/value pair | 6.2.1
<metadata> | Define metadata information using a metadata schema | 6.2.2
<noinput> | Catch a noinput event | 5.2.3
<nomatch> | Catch a nomatch event | 5.2.3
<object> | Interact with a custom extension | 2.3.5
<option> | Specify an option in a <field> | 2.3.1.3
<param> | Parameter in <object> or <subdialog> | 6.4
<prompt> | Queue speech synthesis and audio output to the user | 4.1
<property> | Control implementation platform settings | 6.3
<record> | Record an audio sample | 2.3.6
<reprompt> | Play a field prompt when a field is re-visited after an event | 5.3.6
<return> | Return from a subdialog | 5.3.10
<script> | Specify a block of ECMAScript client-side scripting logic | 5.3.12
<subdialog> | Invoke another dialog as a subdialog of the current one | 2.3.4
<submit> | Submit values to a document server | 5.3.8
<throw> | Throw an event | 5.2.1
<transfer> | Transfer the caller to another destination | 2.3.7
<value> | Insert the value of an expression in a prompt | 4.1.4
<var> | Declare a variable | 5.3.1
<vxml> | Top-level element in each VoiceXML document | 1.5
4.1.3 Audio Prompting

Attributes of <audio> defined in [SSML] are given in Table 36 (Attributes Inherited from SSML).
Attributes of <audio> defined only in VoiceXML are given in Table 37 (Attributes Added in VoiceXML).
Exactly one of "src" or "expr" must be specified; otherwise, an error.badfetch event is thrown. Note that it is a platform optimization to stream audio: i.e. the platform may begin processing audio content as it arrives and not to wait for full retrieval. The "prefetch" fetchhint can be used to request full audio retrieval prior to playback. 4.1.4 ElementThe element is used to insert the value of an expression into a prompt. It has one attribute: Table 38: Attributes
The manner in which the value attribute is played is controlled by the surrounding speech synthesis markup; for instance, the surrounding markup can direct that a value be played as a date.

The text inserted by the <value> element is not subject to any special interpretation; in particular, it is not parsed as an [SSML] document or document fragment. XML special characters (&, >, and <) are not treated specially and do not need to be escaped. The equivalent effect may be obtained by literally inserting the text computed by the <value> element in a CDATA section. For example, when a variable whose value is the string "AT&T" is referenced in a prompt element as "The price of ... is $1.", the following output is produced: "The price of AT&T is $1."

4.1.5 Bargein

If an implementation platform supports bargein, the application author can specify whether a user can interrupt, or "bargein" on, a prompt using speech or DTMF input. This speeds up conversations, but is not always desired. If the application author requires that the user hear all of a warning, legal notice, or advertisement, bargein should be disabled. This is done with the bargein attribute, as in the sketch below.
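A minimal sketch, with illustrative prompt wording:

    <!-- must be heard in full: bargein disabled -->
    <prompt bargein="false">
      This is a legal notice that must be heard in full.
    </prompt>
    <!-- the user may interrupt this prompt -->
    <prompt bargein="true">
      Say the name of the department you want.
    </prompt>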
Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If bargein occurs during any prompt in a sequence, all subsequent prompts are not played (even those whose bargein attribute is set to false). If the bargein attribute is not specified, then the value of the bargein property is used if set.

When the bargein attribute is false, input is not buffered while the prompt is playing, and any DTMF input buffered in a transition state is deleted from the buffer (Section 4.1.8 describes input collection during transition states).

Note that not all speech recognition engines or implementation platforms support bargein. For a platform to support bargein, it must support at least one of the bargein types described in Section 4.1.5.1.

4.1.5.1 Bargein type

When bargein is enabled, the bargeintype attribute can be used to suggest the type of bargein the platform will perform in response to voice or DTMF input. The possible values, given in Table 39 (bargeintype Values), are "speech" (the prompt is stopped as soon as input is detected) and "hotword" (the prompt is not stopped until a complete match of an active grammar is detected, and input that does not match an active grammar is ignored).

If the bargeintype attribute is not specified, then the value of the bargeintype property is used. Implementations that claim to support bargein are required to support at least one of these two types. Mixing these types within a single queue of prompts can result in unpredictable behavior and is discouraged. In the case of "speech" bargeintype, the exact meaning of "speech input" is necessarily implementation-dependent, due to the complexity of speech recognition technology. It is expected that the prompt will be stopped as soon as the platform is able to reliably determine that the input is speech. Stopping the prompt as early as possible is desirable because it avoids the "stutter" effect in which a user stops in mid-utterance and restarts if he does not believe that the system has heard him.

4.1.6 Prompt Selection

Tapered prompts are those that may change with each attempt. Information-requesting prompts may become more terse, under the assumption that the user is becoming more familiar with the task. Help messages may become more detailed, under the assumption that the user needs more help. Or, prompts can change just to make the interaction more interesting.

Each input item, <initial>, and menu has an internal prompt counter that is reset to one each time the form or menu is entered. Whenever the system selects a given input item in the select phase of the FIA, and the FIA does perform normal selection and queuing of prompts (i.e., as described in Section 5.3.6, the previous iteration of the FIA did not end with a catch handler that had no reprompt), the input item's associated prompt counter is incremented. This is the mechanism supporting tapered prompts. For instance, the sketch below shows a form with a form-level prompt and field-level prompts; in a conversation with this form, the field prompt tapers from "What is your favorite flavor?" to "Say chocolate, vanilla, or strawberry." as the prompt counter grows.
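A sketch reconstructed along the lines of the specification's ice-cream survey (the form id and the inline grammar framing are assumptions):

    <form id="ice_cream">
      <block>
        <prompt bargein="false">Welcome to the ice cream survey.</prompt>
      </block>
      <field name="flavor">
        <grammar mode="voice" version="1.0" root="flavor"
                 xmlns="http://www.w3.org/2001/06/grammar">
          <rule id="flavor" scope="public">
            <one-of>
              <item>vanilla</item>
              <item>chocolate</item>
              <item>strawberry</item>
            </one-of>
          </rule>
        </grammar>
        <!-- attempts 1-2 use the terse prompt; attempt 3 onward, the explicit one -->
        <prompt count="1">What is your favorite flavor?</prompt>
        <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
        <help>Sorry, no help is available.</help>
      </field>
    </form>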
This is just an example to illustrate the use of prompt counters. A polished form would need to offer a more extensive range of choices and to deal with out of range values in a more flexible way.

When it is time to select a prompt, the prompt counter is examined. The child prompt with the highest count attribute less than or equal to the prompt counter is used. If a prompt has no count attribute, a count of "1" is assumed.

A conditional prompt is one that is spoken only if its condition is satisfied. In the example sketched below, a prompt is varied on each visit to the enclosing form.
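A sketch in the spirit of the specification's example (the variable name, threshold, and field details are reconstructions):

    <form id="another_joke">
      <!-- a fresh random number on each visit to the form -->
      <var name="r" expr="Math.random()"/>
      <field name="another" type="boolean">
        <prompt cond="r &lt; 0.50">
          Would you like to hear another elephant joke?
        </prompt>
        <prompt cond="r &gt;= 0.50">
          For another joke say yes. To exit say no.
        </prompt>
      </field>
    </form>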
When a prompt must be chosen, a set of prompts to be queued is chosen according to the following algorithm:

- Form an ordered list of prompts consisting of all prompts in the enclosing element, in document order.
- Remove from this list any prompt whose cond attribute evaluates to false after conversion to boolean.
- Find the "correct count": the highest count value among the remaining prompts that is less than or equal to the current prompt counter value.
- Remove from the list all prompts whose count does not equal the "correct count".

All prompt elements that remain on the list will be queued for play.

4.1.7 Timeout

The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value. The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts: for example, the user may be given five seconds for the first input attempt, and ten seconds on the next. The prompt timeout attribute determines the noinput timeout for the following input, as in the sketch below.
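A reconstruction of the Model T example (the timeout values and the inline grammar are assumptions; the prompt wording follows the specification):

    <field name="color">
      <grammar mode="voice" version="1.0" root="color"
               xmlns="http://www.w3.org/2001/06/grammar">
        <rule id="color" scope="public">black</rule>
      </grammar>
      <!-- first attempt: five seconds of silence allowed -->
      <prompt count="1" timeout="5s">Pick a color for your new Model T.</prompt>
      <!-- later attempts: a longer, more explicit prompt and a longer timeout -->
      <prompt count="2" timeout="10s">
        Please choose the color of your new nineteen twenty four Ford
        Model T. Possible colors are black, black, or black.
        Please take your time.
      </prompt>
    </field>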
If several prompts are queued before a field input, the timeout of the last prompt is used.

4.1.8 Prompt Queueing and Input Collection

A VoiceXML interpreter is at all times in one of two states:

- waiting for input in an input item (such as a field), or
- transitioning between input items in response to input received while in an input item.

The waiting and transitioning states are related to the phases of the Form Interpretation Algorithm: broadly, the interpreter waits while the FIA's collect phase gathers input, and transitions while the FIA processes that input and selects the next form item.
This distinction of states is made in order to greatly simplify the programming model. In particular, an important consequence of this model is that the VoiceXML application designer can rely on all executable content (such as the content of <block> and <filled> elements) being run to completion, because it is executed while in the transitioning state, which may not be interrupted by input.

While in the transitioning state, various prompts are queued, either by the <prompt> element in executable content or by the <prompt> element in form items. In addition, audio may be queued by the fetchaudio attribute. The queued prompts and audio are played either when the interpreter reaches the waiting state, or when the interpreter begins a fetch for which fetchaudio is specified (prompts queued before the fetch are played to completion, and then the fetchaudio is played until the fetch completes).
Note that when a prompt's bargein attribute is false, input is not collected and DTMF input buffered in a transition state is deleted (see Section 4.1.5). When an ASR grammar is matched, if DTMF input was consumed by a simultaneously active DTMF grammar (but did not result in a complete match of the DTMF grammar), the DTMF input may, at processor discretion, be discarded.

Before the interpreter exits, all queued prompts are played to completion. The interpreter remains in the transitioning state, and no input is accepted while the interpreter is exiting.

It is a permissible optimization to begin playing prompts queued during the transitioning state before reaching the waiting state, provided that correct semantics are maintained regarding processing of the input audio received while the prompts are playing, for example with respect to bargein and grammar processing.

The following cases illustrate the operation of these rules in some common situations.

Case 1

Typical non-fetching case: a field, followed by executable content (such as <assign> and <prompt>), followed by another field, all in document d0. As a result of input received while waiting in field f0, the executable content runs in the transitioning state, and the prompts it queues are played once the interpreter reaches the waiting state in the next field.

Case 2

Typical fetching case: a field, followed by executable content (such as <assign> and <prompt>) ending with a <submit> that specifies fetchaudio, ending up in a field in a different document d1 that is fetched from a server. As a result of input received while waiting in field f0, the prompts queued during the transition are played to completion, after which the fetchaudio plays until d1 arrives and its field reaches the waiting state.

Case 3

As in Case 2, but with no fetchaudio specified. Because queued prompts are then played only upon reaching a waiting state, the user hears silence during the fetch of d1, and the queued prompts are played once the interpreter reaches the waiting state in d1's field.
5. Control flow and scripting

5.1 Variables and Expressions

VoiceXML variables are in all respects equivalent to ECMAScript variables: they are part of the same variable space. VoiceXML variables can be used in a <script> just as variables defined in a <script> can be used in VoiceXML. For example, a document can define a factorial function in a <script> and use it from a field that prompts "Tell me a number and I'll tell you its factorial." and then speaks the result ("... factorial is ..."); similarly, a script can compute the current time so that a prompt can say "The time is ... hours, ... minutes, and ... seconds." before asking "Do you want to hear another time?". The content of a <script> element is ECMAScript code. All variables must be declared before being referenced by ECMAScript scripts, or by VoiceXML elements, as described in Section 5.1.1.

5.3.13 log element

The <log> element allows an application to generate a logging or debug message which a developer can use to help in application development or post-execution analysis of application performance. The <log> element may contain any combination of text (CDATA) and <value> elements. The generated message consists of the concatenation of the text and the string form of the value of the "expr" attribute of the <value> elements. The manner in which the message is displayed or logged is platform-dependent, as is the usage of the label attribute. Platforms are not required to preserve white space. ECMAScript expressions in <log> must be evaluated in document order, and the use of the <log> element should have no other side-effects on interpretation. A sketch of a typical use follows the attribute table. The <log> element has the attributes given in Table 53 (log Attributes).
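A sketch of a typical use (the variable card_num is assumed to have been filled elsewhere):

    <log>The card number was <value expr="card_num"/></log>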
6. Environment and Resources

6.1 Resource Fetching

6.1.1 Fetching

A VoiceXML interpreter context needs to fetch VoiceXML documents and other resources, such as audio files, grammars, scripts, and objects. Each fetch of the content associated with a URI is governed by the fetch attributes in Table 54 (Fetch Attributes): fetchtimeout, fetchhint, maxage, and maxstale.
When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.

The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.

When transitioning from one dialog to another, through either a <choice>, <goto>, <link>, <submit>, or <subdialog> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or from an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or the first dialog if none is specified) is then initialized and execution of the dialog begins.

Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.

Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization, even if the URI reference contains an absolute or relative URI (see Section 1.5.2 and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.

Elements that fetch VoiceXML documents also support the additional attribute in Table 55 (Additional Fetch Attribute): fetchaudio.
The fetchaudio attribute is useful for enhancing the user experience when there may be noticeable delays while the next document is retrieved; it can be used to play background music or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.

6.1.2 Caching

The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document, through appropriate use of the maxage and maxstale attributes.

The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. These rules are summarized by an algorithm that, in outline, uses the cached copy when it is fresh enough (per the maxage attribute and resource expiration), performs the "maxstale check" when the cached copy has expired, and otherwise fetches the resource from the server.
The "maxstale check" is:
Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.

The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).

While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.

6.1.2.1 Controlling the Caching Policy

VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document). Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero. Using maxstale enables the author to state that an expired copy of a resource, provided it is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.

6.1.3 Prefetching

Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables. The expiration status of a resource must be checked on each use of the resource, and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in Section 6.1.2.

6.1.4 Protocols

The "http" URI scheme must be supported by VoiceXML platforms, the "https" protocol should be supported, and other URI protocols may be supported.

6.2 Metadata Information

Metadata information is information about the document rather than the document's content. VoiceXML 2.0 provides two elements in which metadata information can be expressed: <meta> and <metadata>. The <metadata> element provides more general and powerful treatment of metadata information than <meta>. VoiceXML does not specify required metadata information; however, it does recommend that metadata is expressed using the <metadata> element with information in Resource Description Framework (RDF) [RDF-SYNTAX] using the Dublin Core version 1.0 RDF schema [DC] (see Section 6.2.2).
6.2.1 meta element

The <meta> element specifies meta information, as in [HTML]. There are two types of <meta>. The first type specifies a metadata property of the document as a whole and is expressed by the pair of attributes name and content.
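For example, to specify the maintainer of a VoiceXML document (the address is illustrative):

    <meta name="maintainer" content="jpdoe@anycompany.example.com"/>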
The second type of <meta> specifies HTTP response headers and is expressed by the pair of attributes http-equiv and content. In the following example, the first <meta> element sets an expiration date that prevents caching of the document; the second <meta> element sets the Date header.
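A sketch of such a pair (the timestamp value is illustrative):

    <meta http-equiv="Expires" content="0"/>
    <meta http-equiv="Date" content="Thu, 12 Dec 2002 23:27:21 GMT"/>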
Attributes of <meta> are given in Table 56 (meta Attributes).
Exactly one of "name" or "http-equiv" must be specified; otherwise, an error.badfetch event is thrown. 6.2.2 metadata elementThe element is container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with , it is recommended that the RDF schema is used in conjunction with metadata properties defined in the Dublin Core Metadata Initiative. RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] as well as the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Copyrights, etc.). The following Dublin Core metadata properties are recommended in : Table 57: Recommended Dublin Core Metadata Properties
Here is an example of how <metadata> can be included in a VoiceXML document using the Dublin Core version 1.0 RDF schema [DC]:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <metadata>
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                 xmlns:rdfs="http://www.w3.org/TR/1999/PR-rdf-schema-19990303#"
                 xmlns:dc="http://purl.org/metadata/dublin_core#">
          <rdf:Description about="http://www.example.com/meta.vxml"
              dc:Title="Directory Enquiry Service"
              dc:Description="Directory Enquiry Service for London in VoiceXML"
              dc:Publisher="W3C"
              dc:Language="en"
              dc:Date="2002-02-12"
              dc:Rights="Copyright 2002 John Smith"
              dc:Format="application/voicexml+xml"/>
        </rdf:RDF>
      </metadata>
      <form>
        <block>Hello</block>
      </form>
    </vxml>

6.3 property element

The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.

Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.

If a platform detects that the value of a property is invalid, then it should throw an error.semantic event.

In some cases, <property> elements specify default values for element attributes, such as timeout or bargein. For example, bargein can be turned off by default for all the prompts in a particular form, as in the sketch following the attribute table.

The <property> element has the attributes given in Table 58 (property Attributes).
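A sketch of such a form (the form id and field name are assumptions):

    <form id="no_bargein_form">
      <!-- default for every prompt in this form -->
      <property name="bargein" value="false"/>
      <block>
        <prompt>This introductory prompt cannot be barged into.</prompt>
        <prompt>And neither can this prompt.</prompt>
        <!-- the attribute overrides the form-level property -->
        <prompt bargein="true">But this one can be barged into.</prompt>
      </block>
      <field name="yes_no" type="boolean">
        <prompt>Please say yes or no.</prompt>
      </field>
    </form>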
6.3.1 Platform-Specific Properties

An interpreter context is free to provide platform-specific properties; for example, a platform might let a document set a "multiplication factor" in the scope of the whole document, as sketched below. By definition, platform-specific properties introduce incompatibilities which reduce application portability. To minimize them, interpreter contexts are strongly recommended to follow guidelines for naming and scoping such properties.
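A sketch of the document-scoped platform-specific property mentioned above (the property name and value are assumptions; real platforms define their own names):

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <!-- hypothetical platform-specific property, document scope -->
      <property name="example.com.multiplicationfactor" value="42"/>
      <form>
        <block>Welcome</block>
      </form>
    </vxml>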
6.3.2 Generic Speech Recognizer Properties

The generic speech recognizer properties are mostly taken from the Java Speech API [JSAPI]; see Table 59 (Generic Speech Recognizer Properties).

6.3.3 Generic DTMF Recognizer Properties

Several generic properties pertain to DTMF grammar recognition; see Table 60 (Generic DTMF Recognizer Properties).

6.3.4 Prompt and Collect Properties

These properties apply to the fundamental platform prompt and collect cycle; see Table 61 (Prompt and Collect Properties).

6.3.5 Fetching Properties

These properties pertain to the fetching of new documents and resources (note that the maxage and maxstale properties may have no default value - see Section 6.1.2); see Table 62 (Fetching Properties).

6.3.6 Miscellaneous Properties

See Table 63 (Miscellaneous Properties).
Our last example shows several of these properties used at multiple levels: a voice address book that welcomes the caller ("Welcome to the Voice Address Book"), asks "Who would you like to call?", prompts separately for the name ("Say the name of the person you would like to call.") and the location ("Say the location of the person you would like to call.") of the callee, and then confirms: "You said to call ... at .... Is this correct?"

6.4 param element

The <param> element is used to specify values that are passed to subdialogs or objects. It is modeled on the [HTML] <param> element. Its attributes are given in Table 64 (param Attributes).
Exactly one of "expr" or "value" must be specified; otherwise, an error.badfetch event is thrown. The use of valuetype and type is optional in general, although they may be required by specific objects. When is contained in a element, the values specified by it are used to initialize dialog elements in the subdialog that is invoked. See Section 2.3.4 for details regarding initialization of variables in subdialogs using . When is contained in an , the use of the parameter data is specific to the object that is being invoked, and is outside the scope of the VoiceXML specification. Below is an example of used as part of an . In this case, the first two elements have expressions (implicitly of valuetype="data"), the third has an explicit value, and the fourth is a URI that returns a media type of text/plain. The meaning of this data is specific to the object.
The next example illustrates <param> used with <subdialog>. In this case, two expressions are used to initialize variables in the scope of the subdialog form: the calling dialog's form invokes a subdialog in a document at http://another.example.com, which prompts "Please say Social Security number." and returns the result to the caller. Using <param> in a <subdialog> is a convenient way of passing data to a subdialog without requiring the use of server side scripting.

6.5 Value Designations

Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2].

6.5.1 Integers and Real Numbers

Real numbers and integers are specified in decimal notation only. An integer consists of one or more digits "0" to "9". A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.

6.5.2 Times

Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are "ms" (milliseconds) and "s" (seconds).
Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s". AppendicesAppendix A — Glossary of Termsactive grammar A speech or DTMF grammar that is currently active. This is based on the currently executing element, and the scope elements of the currently defined grammars. application A collection of VoiceXML documents that are tagged with the same application name attribute. ASR Automatic speech recognition. author The creator of a VoiceXML document. catch element A block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter. control item A form item whose purpose is either to contain a block of procedural logics () or to allow initial prompts for a mixed initiative dialog (). CSS W3C Cascading Style Sheet specification. See [CSS2] dialog An interaction with the user specified in a VoiceXML document. Types of dialogs include forms and_menus_. DTMF (Dual Tone Multi-Frequency) Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency. ECMAScript A standard version of JavaScript backed by the European Computer Manufacturer's Association. See [ECMASCRIPT] event A notification "thrown" by the implementation platform, VoiceXML interpreter context, VoiceXML interpreter, or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events. executable content Procedural logic that occurs in , , and event handlers. form A dialog that interacts with the user in a highly flexible fashion with the computer and the _user_sharing the initiative. FIA (Form Interpretation Algorithm) An algorithm implemented in a _VoiceXML interpreter_which drives the interaction between the user and a VoiceXML form or menu. See Section 2.1.6and Appendix C. form item An element of that can be visited during form execution: , , , , , , and .form item variable A variable, either implicitly or explicitly defined, associated with each form item in a form. If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user. implementation platform A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML. input item A form item whose purpose is to input a input item variable. Input items include , , , , and . language identifier A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML], a legal language identifier is identified by an RFC 3066 [RFC3066]code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066. link A set of grammars that when matched by something the_user_ says or keys in, either transitions to a new dialog or document or throws an event in the current form item. menu A dialog presenting the user with a set of choices and takes action on the selected one. mixed initiative A computer-human interaction in which either the computer or the human can take initiative and decide what to do next. JSGF Java API Speech Grammar Format. A proposed standard for representing speech grammars. See [JSGF] object A platform-specific capability with an interface available via VoiceXML. 
- request: A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional).
- script: A fragment of logic written in a client-side scripting language, especially ECMAScript, which is a scripting language that must be supported by any VoiceXML interpreter.
- session: A connection between a user and an implementation platform, e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one VoiceXML document.
- SRGS (Speech Recognition Grammar Specification): A standard format for context-free speech recognition grammars being developed by the W3C Voice Browser group. Both ABNF and XML formats are defined [SRGS].
- SSML (Speech Synthesis Markup Language): A standard format for speech synthesis being developed by the W3C Voice Browser group [SSML].
- subdialog: A VoiceXML dialog (or document) invoked from the current dialog in a manner analogous to function calls.
- tapered prompts: A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts).
- throw: An element that fires an event.
- TTS: Text-to-speech; speech synthesis.
- user: A person whose interaction with an implementation platform is controlled by a VoiceXML interpreter.
- URI: Uniform Resource Identifier.
- URL: Uniform Resource Locator.
- VoiceXML document: An XML document conforming to the VoiceXML specification.
- VoiceXML interpreter: A computer program that interprets a VoiceXML document to control an implementation platform for the purpose of conducting an interaction with a user.
- VoiceXML interpreter context: A computer program that uses a VoiceXML interpreter to interpret a VoiceXML document and that may also interact with the implementation platform independently of the VoiceXML interpreter.
- W3C: World Wide Web Consortium, http://www.w3.org/.

Appendix B — VoiceXML Document Type Definition

The VoiceXML DTD is located at http://www.w3.org/TR/voicexml20/vxml.dtd.

Due to DTD limitations, the VoiceXML DTD does not correctly express that the <metadata> element can contain elements from other XML namespaces.

Note: the VoiceXML DTD includes modified elements from the DTDs of the Speech Recognition Grammar Specification 1.0 [SRGS] and the Speech Synthesis Markup Language 1.0 [SSML].

Appendix C — Form Interpretation Algorithm

The form interpretation algorithm (FIA) drives the interaction between the user and a VoiceXML form or menu. A menu can be viewed as a form containing a single field whose grammar and whose filled action are constructed from the choice elements. The FIA must handle, among other things, form initialization, prompting, grammar activation, and the processing of input and events.

First we define some terms and data structures used in the form interpretation algorithm:

- active grammar set: The set of grammars active during a VoiceXML interpreter context's input collection operation.
- utterance: A summary of what the user said or keyed in, including the specific grammar matched, and a semantic result consisting of an interpretation structure or, where there is no semantic interpretation, the raw text of the input (see Section 3.1.6). An example utterance might be: "grammar 123 was matched, and the semantic interpretation is {drink: "coke" pizza: {number: "3" size: "large"}}".
- execute: To execute executable content - either a block, a filled action, or a set of filled actions. If an event is thrown during execution, the execution of the executable content is aborted. The appropriate event handler is then executed, and this may cause control to resume in a form item, in the next iteration of the form's main loop, or outside of the form. If a <goto> is executed, the transfer takes place immediately, and the remaining executable content is not executed.

Here is the conceptual form interpretation algorithm. The FIA can start with no initial utterance, or with an initial utterance passed in from another dialog:

    //
    // Initialization Phase
    //
    foreach ( ...
First we define some terms and data structures used in the form interpretation algorithm: active grammar set The set of grammars active during a VoiceXML interpreter context's input collection operation. utterance A summary of what the user said or keyed in, including the specific grammar matched, and a semantic result consisting of an interpretation structure or, where there is no semantic interpretation, the raw text of the input (see Section 3.1.6). An example utterance might be: "grammar 123 was matched, and the semantic interpretation is {drink: "coke" pizza: {number: "3" size: "large"}}". execute To execute executable content – either a block, a filled action, or a set of filled actions. If an event is thrown during execution, the execution of the executable content is aborted. The appropriate event handler is then executed, and this may cause control to resume in a form item, in the next iteration of the form's main loop, or outside of the form. If a is executed, the transfer takes place immediately, and the remaining executable content is not executed. Here is the conceptual form interpretation algorithm. The FIA can start with no initial utterance, or with an initial utterance passed in from another dialog: // // Initialization Phase // foreach ( , |