Voice Extensible Markup Language (VoiceXML) Version

2.0 (original) (raw)

W3C Recommendation 16 March 2004

This Version:

http://www.w3.org/TR/2004/REC-voicexml20-20040316/

Latest Version:

http://www.w3.org/TR/voicexml20/

Previous Version:

http://www.w3.org/TR/2004/PR-voicexml20-20040203/

Editors:

Scott McGlashan, Hewlett-Packard (Editor-in-Chief)
Daniel C. Burnett, Nuance Communications
Jerry Carter, Invited Expert
Peter Danielsen, Lucent (until October 2002)
Jim Ferrans, Motorola
Andrew Hunt, ScanSoft
Bruce Lucas, IBM
Brad Porter, Tellme Networks
Ken Rehor, Vocalocity
Steph Tryphonas, Tellme Networks

Please refer to the erratafor this document, which may include some normative corrections.

Abstract

This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document has been reviewed by W3C Members and other interested parties, and it has been endorsed by the Director as a W3C Recommendation. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionaility and interoperability of the Web.

This specification is part of the W3C Speech Interface Framework and has been developed within the W3C Voice Browser Activity by participants in the Voice Browser Working Group (W3C Members only).

The design of VoiceXML 2.0 has been widely reviewed (see thedisposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the VoiceXML 2.0 implementation report, along with the associated test suite.

Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.

The W3C maintains a list of any patent disclosures related to this work.

Conventions of this Document

In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119]and indicate requirement levels for compliant VoiceXML implementations.

Full Contents

1. Overview
- 1.1 Introduction
- 1.2 Background
  * 1.2.1 Architectural Model
  * 1.2.2 Goals of VoiceXML
  * 1.2.3 Scope of VoiceXML
  * 1.2.4 Principles of Design
  * 1.2.5 Implementation Platform Requirements
- 1.3 Concepts
  * 1.3.1 Dialogs and Subdialogs
  * 1.3.2 Sessions
  * 1.3.3 Applications
  * 1.3.4 Grammars
  * 1.3.5 Events
  * 1.3.6 Links
- 1.4 VoiceXML Elements
- 1.5 Document Structure and Execution
  * 1.5.1 Execution within one Document
  * 1.5.2 Executing a Multi-Document Application
  * 1.5.3 Subdialogs
  * 1.5.4 Final Processing
2. Dialog Constructs
- 2.1 Forms
  * 2.1.1 Form Interpretation
  * 2.1.2 Form Items
  * 2.1.3 Form Item Variables and Conditions
  * 2.1.4 Directed Forms
  * 2.1.5 Mixed Initiative Forms
  * 2.1.6 Form Interpretation Algorithm
- 2.2 Menus
  * 2.2.1 menu element
  * 2.2.2 choice element
  * 2.2.3 DTMF in Menus
  * 2.2.4 enumerate element
  * 2.2.5 Grammar Generation
  * 2.2.6 Interpretation Model
- 2.3 Form Items
  * 2.3.1 field element
  * 2.3.2 block element
  * 2.3.3 initial element
  * 2.3.4 subdialog element
  * 2.3.5 object element
  * 2.3.6 record element
  * 2.3.7 transfer element
- 2.4 Filled
- 2.5 Links
3. User Input
- 3.1 Grammars
  * 3.1.1 Speech Grammars
  * 3.1.2 DTMF Grammars
  * 3.1.3 Scope of Grammars
  * 3.1.4 Activation of Grammars
  * 3.1.5 Semantic Interpretation of Input
  * 3.1.6 Mapping Semantic Interpretation Results to VoiceXML forms
4. System Output
- 4.1 Prompt
  * 4.1.1 Speech Markup
  * 4.1.2 Basic Prompts
  * 4.1.3 Audio Prompting
  * 4.1.4 Element
  * 4.1.5 Bargein
  * 4.1.6 Prompt Selection
  * 4.1.7 Timeout
  * 4.1.8 Prompt Queueing and Input Collection
5. Control flow and scripting
- 5.1 Variables and Expressions
  * 5.1.1 Declaring Variables
  * 5.1.2 Variable Scopes
  * 5.1.3 Referencing Variables
  * 5.1.4 Standard Session Variables
  * 5.1.5 Standard Application Variables
- 5.2 Event Handling
  * 5.2.1 throw element
  * 5.2.2 catch element
  * 5.2.3 Shorthand Notation
  * 5.2.4 catch Element Selection
  * 5.2.5 Default catch elements
  * 5.2.6 Event Types
- 5.3 Executable Content
  * 5.3.1 var element
  * 5.3.2 assign element
  * 5.3.3 clear element
  * 5.3.4 if, elseif, else elements
  * 5.3.5 prompts
  * 5.3.6 reprompt element
  * 5.3.7 goto element
  * 5.3.8 submit element
  * 5.3.9 exit element
  * 5.3.10 return element
  * 5.3.11 disconnect element
  * 5.3.12 script element
  * 5.3.13 log element
6. Environment and Resources
- 6.1 Resource Fetching
  * 6.1.1 Fetching
  * 6.1.2 Caching
  * 6.1.3 Prefetching
  * 6.1.4 Protocols
- 6.2 Metadata Information
  * 6.2.1 meta element
  * 6.2.2 metadata element
- 6.3 property element
  * 6.3.1 Platform-Specific Properties
  * 6.3.2 Generic Speech Recognizer Properties
  * 6.3.3 Generic DTMF Recognizer Properties
  * 6.3.4 Prompt and Collect Properties
  * 6.3.5 Fetching Properties
  * 6.3.6 Miscellaneous Properties
- 6.4 param element
- 6.5 Value Designations
Appendices
- Appendix A. Glossary of Terms
- Appendix B. VoiceXML Document Type Definition
- Appendix C. Form Interpretation Algorithm
- Appendix D. Timing Properties
- Appendix E. Audio File Formats
- Appendix F. Conformance
- Appendix G. Internationalization
- Appendix H. Accessibility
- Appendix I. Privacy
- Appendix J. Changes from VoiceXML 1.0
- Appendix K. Reusability
- Appendix L. Acknowledgements
- Appendix M. References
- Appendix N. Media Type and File Suffix
- Appendix O. VoiceXML XML Schema Definition
- Appendix P. Builtin Grammar Types

1. Overview

This document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented inSection 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML Schema, a detailed specification of theForm Interpretation Algorithmand timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy.

The origins of VoiceXML began in 1995 as an XML-based dialog design language intended to simplify the speech recognition application development process within an AT&T project called Phone Markup Language (PML). As AT&T reorganized, teams at AT&T, Lucent and Motorola continued working on their own PML-like languages.

In 1998, W3C hosted a conference on voice browsers. By this time, AT&T and Lucent had different variants of their original PML, while Motorola had developed VoxML, and IBM was developing its own SpeechML. Many other attendees at the conference were also developing similar languages for dialog design; for example, such as HP's TalkML and PipeBeach's VoiceHTML.

The VoiceXML Forum was then formed by AT&T, IBM, Lucent, and Motorola to pool their efforts. The mission of the VoiceXML Forum was to define a standard dialog design language that developers could use to build conversational applications. They chose XML as the basis for this effort because it was clear to them that this was the direction technology was going.

In 2000, the VoiceXML Forum released VoiceXML 1.0 to the public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C as the basis for the creation of a new international standard. VoiceXML 2.0 is the result of this work based on input from W3C Member companies, other W3C Working Groups, and the public.

Developers familiar with VoiceXML 1.0 are particularly directed to Changes from Previous Public Version which summarizes how VoiceXML 2.0 differs from VoiceXML 1.0.

1.1 Introduction

VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Here are two short examples of VoiceXML. The first is the venerable "Hello World":

Hello World!

The top-level element is , which is mainly a container for dialogs. There are two types of dialogs:forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.

Our second example asks the user for a choice of drink and then submits it to a server script:

Would you like coffee, tea, milk, or nothing?

A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:

C (computer): Would you like coffee, tea, milk, or nothing?

H (human): Orange juice.

C: I did not understand what you said. (a platform-specific default message.)

C: Would you like coffee, tea, milk, or nothing?

H: Tea

C: (continues in document drink2.asp)

1.2 Background

This section contains a high-level architectural model, whose terminology is then used to describe the goals of VoiceXML, its scope, its design principles, and the requirements it places on the systems that support it.

1.2.1 Architectural Model

The architectural model assumed by this document has the following components:

VoiceXML interpreter fits between document server and implementation platform
Figure 1: Architectural Model

A document server (e.g. a Web server) processes_requests_ from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document_,_and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.

1.2.2 Goals of VoiceXML

VoiceXML's main goal is to bring the full power of Web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user's session with other dialogs.

VoiceXML is a markup language that:

Minimizes client/server interactions by specifying multiple interactions per document.
Shields application authors from low-level, and platform-specific details.
Separates user interaction code (in VoiceXML) from service logic (e.g. CGI scripts).
Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
Is easy to use for simple interactions, and yet provides language features to support complex dialogs.

While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.

1.2.3 Scope of VoiceXML

The language describes the human-machine interaction provided by voice response systems, which includes:

Output of synthesized speech (text-to-speech).
Output of audio files.
Recognition of spoken input.
Recognition of DTMF input.
Recording of spoken input.
Control of dialog flow.
Telephony features such as call transfer and disconnect.

The language provides means for collecting character and/or spoken input, assigning the input results to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).

1.2.4 Principles of Design

VoiceXML is an XML application [XML].

The language promotes portability of services through abstraction of platform resources.
The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes. While producers of platforms may support various grammar formats the language requires a common grammar format, namely the XML Form of the W3C Speech Recognition Grammar Specification [SRGS], to facilitate interoperability. Similarly, while various audio formats for playback and recording may be supported, the audio formats described in Appendix E must be supported
The language supports ease of authoring for common types of interactions.
The language has well-defined semantics that preserves the author's intent regarding the behavior of interactions with the user. Client heuristics are not required to determine document element interpretation.
The language recognizes semantic interpretations from grammars and makes this information available to the application.
The language has a control flow mechanism.
The language enables a separation of service logic from interaction behavior.
It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.
General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.
The language provides ways to link documents using URIs, and also to submit data to server scripts using URIs.
VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (GET or POST) to use in the submittal.
The language does not require document authors to explicitly allocate and deallocate dialog resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be handled by the implementation platform.

1.2.5 Implementation Platform Requirements

This section outlines the requirements on the hardware/software platforms that will support a VoiceXML interpreter.

Document acquisition. The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on. The "http" URI scheme must be supported. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. When issuing document requests via http, the interpreter context identifies itself using the "User-Agent" header variable with the value "/", for example, "acme-browser/1.2"

Audio output. An implementation platform must support audio output using audio files and text-to-speech (TTS). The platform must be able to freely sequence TTS and audio output. If an audio output resource is not available, an error.noresource event must be thrown. Audio files are referred to by a URI. The language specifies a required set of audio file formats which must be supported (see Appendix E); additional audio file formats may also be supported.

Audio input. An implementation platform is required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document. If an audio input resource is not available, an error.noresource event must be thrown.

It must report characters (for example, DTMF) entered by a user. Platforms must support the XML form of DTMF grammars described in the W3C Speech Recognition Grammar Specification[SRGS]. They should also support the Augmented BNF (ABNF) form of DTMF grammars described in the W3C Speech Recognition Grammar Specification[SRGS].
It must be able to receive speech recognition grammar data dynamically. It must be able to use speech grammar data in the XML Form of the W3C Speech Recognition Grammar Specification[SRGS]. It should be able to receive speech recognition grammar data in the ABNF form of the W3C Speech Recognition Grammar Specification [SRGS], and may support other formats such as the JSpeech Grammar Format [JSGF] or proprietary formats. Some VoiceXML elements contain speech grammar data; others refer to speech grammar data through a URI. The speech recognizer must be able to accommodate dynamic update of the spoken input for which it is listening through either method of speech grammar data specification.
It must be able to record audio received from the user. The implementation platform must be able to make the recording available to a request variable. The language specifies a required set of recorded audio file formats which must be supported (see Appendix E); additional formats may also be supported.

Transfer The platform should be able to support making a third party connection through a communications network, such as the telephone.

1.3 Concepts

A VoiceXML document (or a set of related documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or_dialog_, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.

1.3.1 Dialogs and Subdialogs

There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of form item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.

1.3.2 Sessions

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

1.3.3 Applications

An application is a set of documents sharing the same_application root document_. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars remain active for the duration of the application, subject to the grammar activation rules discussed in Section 3.1.4.

Figure 2 shows the transition of documents (D) in an application that share a common application root document (root).

root over sequence of 3 documents
Figure 2: Transitioning between documents in an application.

1.3.4 Grammars

Each dialog has one or more speech and/or DTMF _grammars_associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active(i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.

1.3.5 Events

VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.

Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Furthermore, catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.

1.3.6 Links

A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A link can be used to throw an event or go to a destination URI.

1.4 VoiceXML Elements

Table 1: VoiceXML Elements

Element Purpose Section

Assign a variable a value 5.3.2

Play an audio clip within a prompt 4.1.3

A container of (non-interactive) executable code 2.3.2

Catch an event 5.2.2

Define a menu item 2.2.2

Clear one or more form item variables 5.3.3

Disconnect a session 5.3.11

Used in elements 5.3.4

Shorthand for enumerating the choices in a menu 2.2.4

Catch an error event 5.2.3

Exit a session 5.3.9

Declares an input field in a form 2.3.1

An action executed when fields are filled 2.4

A dialog for presenting information and collecting data 2.1

Go to another dialog in the same or different document 5.3.7

Specify a speech recognition or DTMF grammar 3.1

Catch a help event 5.2.3

Simple conditional logic 5.3.4

Declares initial logic upon entry into a (mixed initiative) form 2.3.3

Specify a transition common to all dialogs in the link's scope 2.5

Generate a debug message 5.3.13

A dialog for choosing amongst alternative destinations 2.2.1

Define a metadata item as a name/value pair 6.2.1

Define metadata information using a metadata schema 6.2.2

Catch a noinput event 5.2.3

Catch a nomatch event 5.2.3

Interact with a custom extension 2.3.5

Specify an option in a 2.3.1.3

Parameter in or 6.4

Queue speech synthesis and audio output to the user 4.1

Control implementation platform settings. 6.3

Record an audio sample 2.3.6

Play a field prompt when a field is re-visited after an event 5.3.6

Return from a subdialog. 5.3.10

              <clear/>
          <else/>
              <goto next="./make_bid.vxml"/>
          </if>
    </filled>

Attributes of defined in [SSML] are:

Table 36: Attributes Inherited from SSML

src	The URI of the audio prompt. See Appendix E for required audio file formats; additional formats may be used if supported by the platform.

Attributes of defined only in VoiceXML are:

Table 37: Attributes added in VoiceXML

fetchtimeout	See Section 6.1. This defaults to the fetchtimeout property.
fetchhint	See Section 6.1. This defaults to the audiofetchhint property.
maxage	See Section 6.1. This defaults to the audiomaxage property.
maxstale	See Section 6.1. This defaults to the audiomaxstale property.
expr	An ECMAScript expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded with the item or evaluate to the URI of an audio resource to fetch.

Exactly one of "src" or "expr" must be specified; otherwise, an error.badfetch event is thrown.

Note that it is a platform optimization to stream audio: i.e. the platform may begin processing audio content as it arrives and not to wait for full retrieval. The "prefetch" fetchhint can be used to request full audio retrieval prior to playback.

4.1.4 Element

The element is used to insert the value of an expression into a prompt. It has one attribute:

Table 38: Attributes

expr	The expression to render.

For example if n is 12, the prompt

is the square of .

will result in the text string "144 is the square of 12" being passed to the speech synthesis engine.

The manner in which the value attribute is played is controlled by the surrounding speech synthesis markup. For instance, a value can be played as a date in the following example:

The text inserted by the element is not subject to any special interpretation; in particular, it is not parsed as an [SSML] document or document fragment. XML special characters (&, >, and <) are not treated specially and do not need to be escaped. The equivalent effect may be obtained by literally inserting the text computed by the element in a CDATA section. For example, when the following variable assignment: is referenced in a prompt element as The price of is $1. the following output is produced. The price of AT&T is $1. 4.1.5 Bargein If an implementation platform supports bargein, the application author can specify whether a user can interrupt, or "bargein" on, a prompt using speech or DTMF input. This speeds up conversations, but is not always desired. If the application author requires that the user must hear all of a warning, legal notice, or advertisement, bargein should be disabled. This is done with the bargein attribute: Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If bargein occurs during any prompt in a sequence, all subsequent prompts are not played (even those whose bargein attribute is set to false). If the bargein attribute is not specified, then the value of the bargein property is used if set. When the bargein attribute is false, input is not buffered while the prompt is playing, and any DTMF input buffered in a transition state is deleted from the buffer (Section 4.1.8 describes input collection during transition states). Note that not all speech recognition engines or implementation platforms support bargein. For a platform to support bargein, it must support at least one of the bargein types described in Section 4.1.5.1. 4.1.5.1 Bargein type When bargein is enabled, the bargeintype attribute can be used to suggest the type of bargein the platform will perform in response to voice or DTMF input. Possible values for this attribute are: Table 39: bargeintype Values speech The prompt will be stopped as soon as speech or DTMF input is detected. The prompt is stopped irrespective of whether or not the input matches a grammar and irrespective of which grammars are active. hotword The prompt will not be stopped until a complete match of an active grammar is detected. Input that does not match a grammar is ignored (note that this even applies during the timeout period); as a consequence, a nomatch event will never be generated in the case of hotword bargein. If the bargeintype attribute is not specified, then the value of the bargeintype property is used. Implementations that claim to support bargein are required to support at least one of these two types. Mixing these types within a single queue of prompts can result in unpredictable behavior and is discouraged. In the case of "speech" bargeintype, the exact meaning of "speech input" is necessarily implementation-dependent, due to the complexity of speech recognition technology. It is expected that the prompt will be stopped as soon as the platform is able to reliably determine that the input is speech. Stopping the prompt as early as possible is desireable because it avoids the "stutter" effect in which a user stops in mid-utterance and re-starts if he does not believe that the system has heard him. 4.1.6 Prompt Selection Tapered prompts are those that may change with each attempt. Information-requesting prompts may become more terse under the assumption that the user is becoming more familiar with the task. Help messages become more detailed perhaps, under the assumption that the user needs more help. Or, prompts can change just to make the interaction more interesting. Each input item, , and menu has an internal prompt counter that is reset to one each time the form or menu is entered. Whenever the system selects a given input item in the select phase of FIA and FIA does perform normal selection and queuing of prompts (i.e., as described in Section 5.3.6, the previous iteration of FIA did not end with a catch handler that had no reprompt), the input item's associated prompt counter is incremented. This is the mechanism supporting tapered prompts. For instance, here is a form with a form level prompt and field level prompts: Welcome to the ice cream survey. vanilla chocolate strawberry What is your favorite flavor? Say chocolate, vanilla, or strawberry. Sorry, no help is available. A conversation using this form follows: C: Welcome to the ice cream survey. C: What is your favorite flavor? (the "flavor" field's prompt counter is 1) H: Pecan praline. C: I do not understand. C: What is your favorite flavor? (the prompt counter is now 2) H: Pecan praline. C: I do not understand. C: Say chocolate, vanilla, or strawberry. (prompt counter is 3) H: What if I hate those? C: I do not understand. C: Say chocolate, vanilla, or strawberry. (prompt counter is 4) H: ... This is just an example to illustrate the use of prompt counters. A polished form would need to offer a more extensive range of choices and to deal with out of range values in more flexible way. When it is time to select a prompt, the prompt counter is examined. The child prompt with the highest count attribute less than or equal to the prompt counter is used. If a prompt has no count attribute, a count of "1" is assumed. A conditional prompt is one that is spoken only if its condition is satisfied. In this example, a prompt is varied on each visit to the enclosing form. Would you like to hear another elephant joke? For another joke say yes. To exit say no. When a prompt must be chosen, a set of prompts to be queued is chosen according to the following algorithm: Form an ordered list of prompts consisting of all prompts in the enclosing element in document order. Remove from this list all prompts whose cond evaluates to false after conversion to boolean. Find the "correct count": the highest count among the prompt elements still on the list less than or equal to the current count value. Remove from the list all the elements that don't have the "correct count". All elements that remain on the list will be queued for play. 4.1.7 Timeout The timeout attribute specifies the interval of silence allowed while waiting for user input after the end of the last prompt. If this interval is exceeded, the platform will throw a noinput event. This attribute defaults to the value specified by the timeout property (see Section 6.3.4) at the time the prompt is queued. In other words, each prompt has its own timeout value. The reason for allowing timeouts to be specified as prompt attributes is to support tapered timeouts. For example, the user may be given five seconds for the first input attempt, and ten seconds on the next. The prompt timeout attribute determines the noinput timeout for the following input: Pick a color for your new Model T. Please choose color of your new nineteen twenty four Ford Model T. Possible colors are black, black, or black. Please take your time. If several prompts are queued before a field input, the timeout of the last prompt is used. 4.1.8 Prompt Queueing and Input Collection A VoiceXML interpreter is at all times in one of two states: waiting for input in an input item (such as , , or ),or transitioning between input items in response to an input (including spoken utterances, dtmf key presses, and input-related events such as a noinput or nomatch event) received while in the waiting state. While in the transitioning state no speech input is collected, accepted or interpreted. Consequently root and document level speech grammars (such as defined in s) may not be active at all times. However, DTMF input (including timing information) should be collected and buffered in the transition state. Similarly, asynchronously generated events not related directly to execution of the transition should also be buffered until the waiting state (e.g. connection.disconnect.hangup). The waiting and transitioning states are related to the phases of the Form Interpretation Algorithm as follows: the waiting state is eventually entered in the collect phase of an input item (at the point at which the interpreter waits for input), and the transitioning state encompasses the process and select phases, the collect phase for control items (such as s), and the collect phase for input items up until the point at which the interpreter waits for input. This distinction of states is made in order to greatly simplify the programming model. In particular, an important consequence of this model is that the VoiceXML application designer can rely on all executable content (such as the content of and elements) being run to completion, because it is executed while in the transitioning state, which may not be interrupted by input. While in the transitioning state various prompts are queued, either by the element in executable content or by the element in form items. In addition, audio may be queued by the fetchaudio attribute. The queued prompts and audio are played either when the interpreter reaches the waiting state, at which point the prompts are played and the interpreter listens for input that matches one of the active grammars, or when the interpreter begins fetching a resource (such as a document) for which fetchaudio was specified. In this case the prompts queued before the fetchaudio are played to completion, and then, if the resource actually needs to be fetched (i.e. it is not unexpired in the cache), the fetchaudio is played until the fetch completes. The interpreter remains in the transitioning state and no input is accepted during the fetch. Note that when a prompt's bargein attribute is false, input is not collected and DTMF input buffered in a transition state is deleted (see Section 4.1.5). When an ASR grammar is matched, if DTMF input was consumed by a simultaneously active DTMF grammar (but did not result in a complete match of the DTMF grammar), the DTMF input may, at processor discretion, be discarded. Before the interpreter exits all queued prompts are played to completion. The interpreter remains in the transitioning state and no input is accepted while the interpreter is exiting. It is a permissible optimization to begin playing prompts queued during the transitioning state before reaching the waiting state, provided that correct semantics are maintained regarding processing of the input audio received while the prompts are playing, for example with respect to bargein and grammar processing. The following examples illustrate the operation of these rules in some common cases. Case 1 Typical non-fetching case: field, followed by executable content (such as and ), followed by another field. in document d0 <field name="f0"/> <block> executable content e1 queues prompts {p1} </block> <field name="f2"> queues prompts {p2} enables grammars {g2} </field> As a result of input received while waiting in field f0 the following actions take place: in transitioning state execute e1 (without goto) queue prompts {p1} queue prompts {p2} in waiting state, simultaneously play prompts {p1,p2} enable grammars {g2} and wait for input Case 2 Typical fetching case: field, followed by executable content (such as and ) ending with a that specifies fetchaudio, ending up in a field in a different document that is fetched from a server. in document d0 <field name="f0"/> <block> executable content e1 queues prompts {p1} ends with goto f2 in d1 with fetchaudio fa </block> in document d1 <field name="f2"> queues prompts {p2} enables grammars {g2} </field> As a result of input received while waiting in field f0 the following actions take place: in transitioning state execute e1 queue prompts {p1} simultaneously * fetch d1 * play prompts {p1} to completion and then play fa until fetch completes queue prompts {p2} in waiting state, simultaneously play prompts {p2} enable grammars {g2} and wait for input Case 3 As in Case 2, but no fetchaudio is specified. in document d0 <field name="f0"/> <block> executable content e1 queues prompts {p1} ends with goto f2 in d1 (no fetchaudio specified) </block> in document d1 <field name="f2"> queues prompts {p2} enables grammars {g2} </field> As a result of input received while waiting in field f0 the following actions take place: in transitioning state execute e1 queue prompts {p1} fetch d1 queue prompts {p2} in waiting state, simultaneously play prompts {p1, p2} enable grammars {g2} and wait for input 5. Control flow and scripting 5.1 Variables and Expressions VoiceXML variables are in all respects equivalent to ECMAScript variables: they are part of the same variable space. VoiceXML variables can be used in a Tell me a number and I'll tell you its factorial. factorial is A The time is hours, minutes, and seconds. Do you want to hear another time? The content of a All variables must be declared before being referenced by ECMAScript scripts, or by VoiceXML elements as described in Section 5.1.1. 5.3.13 log element The element allows an application to generate a logging or debug message which a developer can use to help in application development or post-execution analysis of application performance. The element may contain any combination of text (CDATA) and elements. The generated message consists of the concatenation of the text and the string form of the value of the "expr" attribute of the elements. The manner in which the message is displayed or logged is platform-dependent. The usage of label is platform-dependent. Platforms are not required to preserve white space. ECMAScript expressions in must be evaluated in document order. The use of the element should have no other side-effects on interpretation. The card number was The element has the following attributes: Table 53: Attributes label An optional string which may be used, for example, to indicate the purpose of the log. expr An optional ECMAScript expression evaluating to a string. 6. Environment and Resources 6.1 Resource Fetching 6.1.1 Fetching A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as audio files, grammars, scripts, and objects. Each fetch of the content associated with a URI is governed by the following attributes: Table 54: Fetch Attributes fetchtimeout The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation (see Section 6.5). If not specified, a value derived from the innermost fetchtimeout property is used. fetchhint Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used. maxage Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used. maxstale Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used. When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below. The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint. When transitioning from one dialog to another, through either a , , , , or element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, intermediate cache, or from a origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins. Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch. Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see Section 1.5.2 and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched. Elements that fetch VoiceXML documents also support the following additional attribute: Table 55: Additional Fetch Attribute fetchaudio The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch. 6.1.2 Caching The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage, and maxstale attributes. The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource: If the resource is not present in the cache, fetch it from the server using get. If the resource is in the cache, If a maxage value is provided, * If age of the cached resource <= maxage, * If the resource has expired, * Perform maxstale check. * Otherwise, use the cached copy. * Otherwise, fetch it from the server using get. Otherwise, * If the resource has expired, * Perform maxstale check. * Otherwise, use the cached copy. The "maxstale check" is: If maxstale is provided, If cached copy has exceeded its expiration time by no more than maxstale seconds, then use the cached copy. Otherwise, fetch it from the server using get. Otherwise, fetch it from the server using get. Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server. The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform). While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt. 6.1.2.1 Controlling the Caching Policy VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the element's application attribute: there is no markup mechanism to control the caching policy for an application root document). Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero. Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files. 6.1.3 Prefetching Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a must not be moved prior to any assignments of the namelist variables. The expiration status of a resource must be checked on each use of the resource, and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in Section 6.1.2. 6.1.4 Protocols The "http" URI scheme must be supported by VoiceXML platforms, the "https" protocol should be supported and other URI protocols may be supported. 6.2 Metadata Information Metadata information is information about the document rather than the document's content. VoiceXML 2.0 provides two elements in which metadata information can be expressed: and . The element provides more general and powerful treatment of metadata information than . VoiceXML does not specify required metadata information. However, it does recommend that metadata is expressed using the element with information in Resource Description Framework (RDF) [RDF-SYNTAX] using the Dublin Core version 1.0 RDF schema [DC] (see Section 6.2.2). 6.2.1 meta element The element specifies meta information as in [HTML]. There are two types of . The first type specifies a metadata property of the document as a whole and is expressed by the pair of attributes, name and content. For example to specify the maintainer of a VoiceXML document: Hello The second type of specifies HTTP response headers and is expressed by the pair of attributes http-equiv and content. In the following example, the first element sets an expiration date that prevents caching of the document; the second element sets the Date header. Hello Attributes of are: Table 56: Attributes name The name of the metadata property. content The value of the metadata property. http-equiv The name of an HTTP response header. Exactly one of "name" or "http-equiv" must be specified; otherwise, an error.badfetch event is thrown. 6.2.2 metadata element The element is container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with , it is recommended that the RDF schema is used in conjunction with metadata properties defined in the Dublin Core Metadata Initiative. RDF is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] as well as the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Copyrights, etc.). The following Dublin Core metadata properties are recommended in : Table 57: Recommended Dublin Core Metadata Properties Creator An entity primarily responsible for making the content of the resource. Rights Information about rights held in and over the resource. Subject The topic of the content of the resource. Typically, a subject will be expressed as keywords, key phrases or classification codes. Recommended best practice is to select values from a controlled vocabulary or formal classification scheme. Here is an example of how can be included in a VoiceXML document using the Dublin Core version 1.0 RDF schema[DC]: <rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs = "http://www.w3.org/TR/1999/PR-rdf-schema-19990303#" xmlns:dc = "" title="undefined" rel="noopener noreferrer">http://purl.org/metadata/dublin_core#"> <rdf:Description about="http://www.example.com/meta.vxml" dc:Title="Directory Enquiry Service" dc:Description="Directory Enquiry Service for London in VoiceXML" dc:Publisher="W3C" dc:Language="en" dc:Date="2002-02-12" dc:Rights="Copyright 2002 John Smith" dc:Format="application/voicexml+xml" > dc:Creator <rdf:Seq ID="CreatorsAlphabeticalBySurname"> rdf:liJackie Crystal rdf:liWilliam Lee Hello 6.3 property element The element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc. Properties may be defined for the whole application, for the whole document at the level, for a particular dialog at the or level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document. If a platform detects that the value of a property is invalid, then it should throw an error.semantic. In some cases, elements specify default values for element attributes, such as timeout or bargein. For example, to turn off bargein by default for all the prompts in a particular form: This introductory prompt cannot be barged into. And neither can this prompt. But this one can be barged into. Please say yes or no. The element has the following attributes: Table 58: Attributes name The name of the property. value The value of the property. 6.3.1 Platform-Specific Properties An interpreter context is free to provide platform-specific properties. For example, to set the "multiplication factor" for this platform in the scope of this document: Welcome By definition, platform-specific properties introduce incompatibilities which reduce application portability. To minimize them, the following interpreter context guidelines are strongly recommended: Platform-specific properties should use reverse domain names to eliminate potential collisions as in: com.example.foo, which is clearly different from net.example.foo An interpreter context must not throw an error.unsupported.property event when encountering a property it cannot process; rather the interpreter context must just ignore that property. 6.3.2 Generic Speech Recognizer Properties The generic speech recognizer properties mostly are taken from the Java Speech API [JSAPI]: Table 59: Generic Speech Recognizer Properties confidencelevel The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see Section 6.5). The default value is 0.5. sensitivity Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see Section 6.5). The default value is 0.5. speedvsaccuracy A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see Section 6.5). The default is value 0.5. completetimeout The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D. Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below. incompletetimeout The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See Appendix D. Platforms choosing not to support the _completetimeout_property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see Section 6.5). maxspeechtimeout The maximum duration of user speech. If this time elapsed before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see Section 6.5). The default duration is platform-dependent. 6.3.3 Generic DTMF Recognizer Properties Several generic properties pertain to DTMF grammar recognition: Table 60: Generic DTMF Recognizer Properties interdigittimeout The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default is platform-dependent. See Appendix D. termtimeout The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see Section 6.5). The default value is "0s". Appendix D. termchar The terminating DTMF character for DTMF input recognition. The default value is "#". See Appendix D. 6.3.4 Prompt and Collect Properties These properties apply to the fundamental platform prompt and collect cycle: Table 61: Prompt and Collect Properties bargein The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. The default value is "true". bargeintype Sets the type of bargein to be speech or hotword. Default is platform-specific. See Section 4.1.5.1. timeout The time after which a noinput event is thrown by the platform. The value is a Time Designation (see Section 6.5). The default value is platform-dependent. See Appendix D. 6.3.5 Fetching Properties These properties pertain to the fetching of new documents and resources (note that maxage and maxstale properties may have no default value - see Section 6.1.2): Table 62: Fetching Properties audiofetchhint This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. The default value is prefetch. audiomaxage Tells the platform the maximum acceptable age, in seconds, of cached audio resources. The default is platform-specific. audiomaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. The default is platform-specific. documentfetchhint Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch. documentmaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific. documentmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. The default is platform-specific. grammarfetchhint Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe. grammarmaxage Tells the platform the maximum acceptable age, in seconds, of cached grammars. The default is platform-specific. grammarmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. The default is platform-specific. objectfetchhint Tells the platform whether the URI contents for may be pre-fetched or not. The values are prefetch (the default), or safe. objectmaxage Tells the platform the maximum acceptable age, in seconds, of cached objects. The default is platform-specific. objectmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. The default is platform-specific. scriptfetchhint Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe. scriptmaxage Tells the platform the maximum acceptable age, in seconds, of cached scripts. The default is platform-specific. scriptmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. The default is platform-specific. fetchaudio The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. fetchaudiodelay The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see Section 6.5). The default interval is platform-dependent, e.g. "2s". The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off. fetchaudiominimum The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see Section 6.5). The default is platform-dependent, e.g., "5s". The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly. fetchtimeout The timeout for fetches. The value is a Time Designation (see Section 6.5). The default value is platform-dependent. 6.3.6 Miscellaneous Properties Table 63: Miscellaneous Properties inputmodes This property determines which input modality to use. The input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active. universals Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see Section 3.1.4) and which generate specific events. Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". maxnbest This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. The default value is 1. Our last example shows several of these properties used at multiple levels. Welcome to the Voice Address Book Who would you like to call? Say the name of the person you would like to call. Say the location of the person you would like to call. You said to call at . Is this correct? 6.4 param element The element is used to specify values that are passed to subdialogs or objects. It is modeled on the [HTML] element. Its attributes are: Table 64: Attributes name The name to be associated with this parameter when the object or subdialog is invoked. expr An expression that computes the value associated with name. value Associates a literal string value with name. valuetype One of data or ref, by default data; used to indicate to an object if the value associated with name is data or a URI (ref). This is not used for since values are always data. type The media type of the result provided by a URI if the valuetype is ref; only relevant for uses of in . Exactly one of "expr" or "value" must be specified; otherwise, an error.badfetch event is thrown. The use of valuetype and type is optional in general, although they may be required by specific objects. When is contained in a element, the values specified by it are used to initialize dialog elements in the subdialog that is invoked. See Section 2.3.4 for details regarding initialization of variables in subdialogs using . When is contained in an , the use of the parameter data is specific to the object that is being invoked, and is outside the scope of the VoiceXML specification. Below is an example of used as part of an . In this case, the first two elements have expressions (implicitly of valuetype="data"), the third has an explicit value, and the fourth is a URI that returns a media type of text/plain. The meaning of this data is specific to the object. The next example illustrates used with . In this case, two expressions are used to initialize variables in the scope of the subdialog form. Form with calling dialog Subdialog in http://another.example.com Please say Social Securityy number. Using in a is a convenient way of passing data to a subdialog without requiring the use of server side scripting. 6.5 Value Designations Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2]. 6.5.1 Integers and Real Numbers Real numbers and integers are specified in decimal notation only. An integer consists of one or more digits "0" to "9". A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign. 6.5.2 Times Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are: ms: milliseconds s: seconds Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s". Appendices Appendix A — Glossary of Terms active grammar A speech or DTMF grammar that is currently active. This is based on the currently executing element, and the scope elements of the currently defined grammars. application A collection of VoiceXML documents that are tagged with the same application name attribute. ASR Automatic speech recognition. author The creator of a VoiceXML document. catch element A block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter. control item A form item whose purpose is either to contain a block of procedural logics () or to allow initial prompts for a mixed initiative dialog (). CSS W3C Cascading Style Sheet specification. See [CSS2] dialog An interaction with the user specified in a VoiceXML document. Types of dialogs include forms and_menus_. DTMF (Dual Tone Multi-Frequency) Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency. ECMAScript A standard version of JavaScript backed by the European Computer Manufacturer's Association. See [ECMASCRIPT] event A notification "thrown" by the implementation platform, VoiceXML interpreter context, VoiceXML interpreter, or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events. executable content Procedural logic that occurs in , , and event handlers. form A dialog that interacts with the user in a highly flexible fashion with the computer and the _user_sharing the initiative. FIA (Form Interpretation Algorithm) An algorithm implemented in a _VoiceXML interpreter_which drives the interaction between the user and a VoiceXML form or menu. See Section 2.1.6and Appendix C. form item An element of that can be visited during form execution: , , , , , , and . form item variable A variable, either implicitly or explicitly defined, associated with each form item in a form. If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user. implementation platform A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML. input item A form item whose purpose is to input a input item variable. Input items include , , , , and . language identifier A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML], a legal language identifier is identified by an RFC 3066 [RFC3066]code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066. link A set of grammars that when matched by something the_user_ says or keys in, either transitions to a new dialog or document or throws an event in the current form item. menu A dialog presenting the user with a set of choices and takes action on the selected one. mixed initiative A computer-human interaction in which either the computer or the human can take initiative and decide what to do next. JSGF Java API Speech Grammar Format. A proposed standard for representing speech grammars. See [JSGF] object A platform-specific capability with an interface available via VoiceXML. request A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional). script A fragment of logic written in a client-side scripting language, especially ECMAScript, which is a scripting language that must be supported by any VoiceXML interpreter. session A connection between a user and an implementation platform, e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one_VoiceXML document_. SRGS (Speech Recognition Grammar Specification) A standard format for context-free speech recognition grammars being developed by the W3C Voice Browser group. Both ABNF and XML formats are defined [SRGS]. SSML (Speech Synthesis Markup Language) A standard format for speech synthesis being developed by the W3C Voice Browser group [SSML]. subdialog A VoiceXML dialog (or document) invoked from the current_dialog_ in a manner analogous to function calls. tapered prompts A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts). throw An element that fires an event. TTS text-to-speech; speech synthesis. user A person whose interaction with an implementation platform is controlled by a VoiceXML interpreter. URI Uniform Resource Indicator. URL Uniform Resource Locator. VoiceXML document An XML document conforming to the VoiceXML specification. VoiceXML interpreter A computer program that interprets a _VoiceXML document_to control an implementation platform for the purpose of conducting an interaction with a user. VoiceXML interpreter context A computer program that uses a VoiceXML interpreter to interpret a VoiceXML Document and that may also interact with the implementation platform independently of the_VoiceXML interpreter_. W3C World Wide Web Consortium http://www.w3.org/ Appendix B — VoiceXML Document Type Definition The VoiceXML DTD is located at http://www.w3.org/TR/voicexml20/vxml.dtd. Due to DTD limitations, the VoiceXML DTD does not correctly express that the element can contain elements from other XML namespaces. Note: the VoiceXML DTD includes modified elements from the DTDs of the Speech Recognition Grammar Specification 1.0 [SRGS] and the Speech Synthesis Markup Language 1.0 [SSML]. Appendix C — Form Interpretation Algorithm The form interpretation algorithm (FIA) drives the interaction between the user and a VoiceXML form or menu. A menu can be viewed as a form containing a single field whose grammar and whose action are constructed from the elements. The FIA must handle: Form initialization. Prompting, including the management of the prompt counters needed for prompt tapering. Grammar activation and deactivation at the form and form item levels. Entering the form with an utterance that matched one of the form's document-scoped grammars while the user was visiting a different form or menu. Leaving the form because the user matched another form, menu, or link's document-scoped grammar. Processing multiple field fills from one utterance, including the execution of the relevant actions. Selecting the next form item to visit, and then processing that form item. Choosing the correct catch element to handle any events thrown while processing a form item. First we define some terms and data structures used in the form interpretation algorithm: active grammar set The set of grammars active during a VoiceXML interpreter context's input collection operation. utterance A summary of what the user said or keyed in, including the specific grammar matched, and a semantic result consisting of an interpretation structure or, where there is no semantic interpretation, the raw text of the input (see Section 3.1.6). An example utterance might be: "grammar 123 was matched, and the semantic interpretation is {drink: "coke" pizza: {number: "3" size: "large"}}". execute To execute executable content – either a block, a filled action, or a set of filled actions. If an event is thrown during execution, the execution of the executable content is aborted. The appropriate event handler is then executed, and this may cause control to resume in a form item, in the next iteration of the form's main loop, or outside of the form. If a is executed, the transfer takes place immediately, and the remaining executable content is not executed. Here is the conceptual form interpretation algorithm. The FIA can start with no initial utterance, or with an initial utterance passed in from another dialog: // // Initialization Phase // foreach ( ,

Voice Extensible Markup Language (VoiceXML) Version

W3C Recommendation 16 March 2004

Abstract

Status of this Document

Conventions of this Document

Table of Contents

Abbreviated Contents

Full Contents

1. Overview

1.1 Introduction

1.2 Background

1.2.1 Architectural Model

1.2.2 Goals of VoiceXML

1.2.3 Scope of VoiceXML

1.2.4 Principles of Design

1.2.5 Implementation Platform Requirements

1.3 Concepts

1.3.1 Dialogs and Subdialogs

1.3.2 Sessions

1.3.3 Applications

1.3.4 Grammars

1.3.5 Events

1.3.6 Links

1.4 VoiceXML Elements

4.1.4 Element

4.1.5 Bargein

4.1.5.1 Bargein type

4.1.6 Prompt Selection

4.1.7 Timeout

4.1.8 Prompt Queueing and Input Collection

Case 1

Case 2

Case 3

5. Control flow and scripting

5.1 Variables and Expressions

5.3.13 log element

6. Environment and Resources

6.1 Resource Fetching

6.1.1 Fetching

6.1.2 Caching

6.1.2.1 Controlling the Caching Policy

6.1.3 Prefetching

6.1.4 Protocols

6.2 Metadata Information

6.2.1 meta element

6.2.2 metadata element

6.3 property element

6.3.1 Platform-Specific Properties

6.3.2 Generic Speech Recognizer Properties

6.3.3 Generic DTMF Recognizer Properties

6.3.4 Prompt and Collect Properties

6.3.5 Fetching Properties

6.3.6 Miscellaneous Properties

6.4 param element

Form with calling dialog

Subdialog in http://another.example.com

6.5 Value Designations

6.5.1 Integers and Real Numbers

6.5.2 Times

Appendices

Appendix A — Glossary of Terms

Appendix B — VoiceXML Document Type Definition

Appendix C — Form Interpretation Algorithm