Further Development of the PhonicStick: The application of phonic-based acceleration methods to the speaking joystick

2010

The PhonicStick is a novel Augmentative and Alternative Communication (AAC) joystick-like device which enables individuals with severe speech and physical disorders to access forty-two sounds (i.e. phonics) and blend them together to create spoken words. The device aims to allow users to select phonics and produce speech without the need for a visual interface. One problem with the current prototype of the PhonicStick is that phonic entry is relatively slow and may involve many physical movements, which causes great difficulty for users with poor hand function. Therefore, in this research we are investigating whether natural language processing (NLP) technology can be used to facilitate the phonic retrieval and word creation processes. Our goal is to develop a set of phonic-based NLP acceleration methods, such as phonic disambiguation and phonic prediction, which reduce the user effort required to select the target phonics and improve the speed of producing words. This paper discusses the challenges of applying such methods to the PhonicStick and reports on the current state of development of the proposed techniques. The presentation will also include a live demonstration of the latest prototype of the PhonicStick.
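
The abstract names phonic prediction and phonic disambiguation without detailing them; the sketch below is a minimal, hypothetical illustration (not the authors' implementation) of how a phonic bigram model and a prefix lookup could drive those two steps. The toy lexicon, phonic symbols, and counts are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical toy lexicon mapping words to phonic sequences.
# The real PhonicStick uses a 42-phonic inventory; these entries are illustrative.
LEXICON = {
    "cat": ["k", "a", "t"],
    "can": ["k", "a", "n"],
    "cap": ["k", "a", "p"],
    "sit": ["s", "i", "t"],
}

# Bigram counts over phonic sequences ("#" marks the word start).
bigram_counts = defaultdict(lambda: defaultdict(int))
for phonics in LEXICON.values():
    seq = ["#"] + phonics
    for prev, nxt in zip(seq, seq[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next_phonics(prefix, top_n=3):
    """Rank candidate next phonics given the phonics selected so far."""
    prev = prefix[-1] if prefix else "#"
    candidates = bigram_counts.get(prev, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]

def complete_words(prefix):
    """Disambiguation step: lexicon words consistent with the selected prefix."""
    return [w for w, p in LEXICON.items() if p[:len(prefix)] == prefix]

if __name__ == "__main__":
    selected = ["k", "a"]                  # phonics chosen with the joystick so far
    print(predict_next_phonics(selected))  # e.g. ['t', 'n', 'p']
    print(complete_words(selected))        # ['cat', 'can', 'cap']
```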

The Application of Natural Language Processing to Augmentative and Alternative Communication

Assistive Technology, 2012

Significant progress has been made in the application of natural language processing (NLP) to augmentative and alternative communication (AAC), particularly in the areas of interface design and word prediction. This article surveys the current state of the science of NLP in AAC and discusses its future applications in the development of the next generation of AAC technology.

The Vocal Joystick: A Voice-Based Human-Computer Interface for Individuals with Motor Impairments

We present a novel voice-based human-computer interface designed to enable individuals with motor impairments to use vocal parameters for continuous control tasks. Since discrete spoken commands are ill-suited to such tasks, our interface exploits a large set of continuous acoustic-phonetic parameters such as pitch, loudness, and vowel quality. Their selection is optimized with respect to automatic recognizability, communication bandwidth, learnability, suitability, and ease of use. Parameters are extracted in real time, transformed via adaptation and acceleration, and converted into continuous control signals. This paper describes the basic engine, prototype applications (in particular, voice-based web browsing and a controlled trajectory-following task), and initial user studies confirming the feasibility of this technology.
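
As a rough illustration of the continuous mapping described above, the sketch below converts one frame of extracted vocal parameters into a 2-D cursor velocity. The specific mapping (vowel quality to direction, loudness above a floor to speed) and all centroids, thresholds, and gains are assumptions for illustration, not the published Vocal Joystick engine.

```python
# Hypothetical vowel-to-direction assignments and (F1, F2) centroids in Hz.
VOWEL_DIRECTIONS = {
    "iy": (0.0, 1.0),    # up
    "aa": (0.0, -1.0),   # down
    "uw": (-1.0, 0.0),   # left
    "ae": (1.0, 0.0),    # right
}
VOWEL_CENTROIDS = {
    "iy": (300.0, 2300.0),
    "aa": (700.0, 1200.0),
    "uw": (350.0, 900.0),
    "ae": (650.0, 1700.0),
}

def classify_vowel(f1, f2):
    """Nearest-centroid vowel classification for one acoustic frame."""
    return min(VOWEL_CENTROIDS,
               key=lambda v: (f1 - VOWEL_CENTROIDS[v][0]) ** 2
                             + (f2 - VOWEL_CENTROIDS[v][1]) ** 2)

def frame_to_velocity(f1, f2, loudness_db, floor_db=40.0, gain=2.0):
    """Map one frame of vocal parameters to an (x, y) cursor velocity:
    vowel quality sets the direction, loudness above a floor sets the speed."""
    if loudness_db < floor_db:          # too quiet: no movement
        return (0.0, 0.0)
    dx, dy = VOWEL_DIRECTIONS[classify_vowel(f1, f2)]
    speed = gain * (loudness_db - floor_db)
    return (dx * speed, dy * speed)

if __name__ == "__main__":
    # A sustained "iy" at 55 dB pushes the cursor upward.
    print(frame_to_velocity(f1=310.0, f2=2250.0, loudness_db=55.0))  # (0.0, 30.0)
```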

Virtual keyboard with the prediction of words for children with cerebral palsy

Computer Methods and Programs in Biomedicine, 2020

One in every 200 people worldwide cannot express themselves orally because of cognitive, motor, neurological, or emotional problems. Assistive technologies can help people with impairments use computers to perform their daily activities independently and to communicate with others. This paper presents a Hidden Markov Model-based word prediction method that allows keyboard emulation software to predict words so that children with disabilities can type texts more quickly. The proposed system involved the development of a keyboard emulator, the construction and processing of a corpus, and a word prediction algorithm. Children with different cognitive profiles had to produce a text and type it twice: first with free typing, then using the virtual keyboard's word prediction. Results indicated that the keyboard emulator's word prediction reduced typing effort, although the software initially increased typing time when the corpus was not well adapted to users. The total number of clicks with word prediction decreased by around 26.2%, and 61% of the children typed the text in less time when using prediction. Tests performed with literate volunteers indicated a reduction in the number of clicks of up to 51.3%, surpassing the 15% achieved in a previous study with the Free Virtual Keyboard, whose word prediction was based on pure statistics; moreover, all volunteers required fewer clicks to perform the task. People with impairments, especially children, could use the system and demonstrate their knowledge and abilities. The entire system is available on the Internet, and users have unrestricted and free access to it.
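
The abstract reports click savings from word prediction; the sketch below is a minimal, hypothetical illustration of prefix-based word completion and of how keystroke (click) savings can be computed. It uses a simple frequency-ranked lexicon standing in for the paper's Hidden Markov Model, and the word list is invented for illustration.

```python
# Hypothetical word frequencies; a deployed system would use a corpus-derived model.
WORD_FREQ = {"quero": 120, "queijo": 45, "quente": 30, "casa": 200, "carro": 90}

def suggest(prefix, k=3):
    """Return the k most frequent lexicon words starting with the typed prefix."""
    matches = [w for w in WORD_FREQ if w.startswith(prefix)]
    return sorted(matches, key=WORD_FREQ.get, reverse=True)[:k]

def clicks_with_prediction(word, k=3):
    """Clicks needed to enter a word: type letters until it appears in the
    suggestion list, then one extra click to select it."""
    for typed in range(1, len(word) + 1):
        if word in suggest(word[:typed], k):
            return typed + 1
    return len(word)  # never suggested: fall back to typing it in full

if __name__ == "__main__":
    text = ["quero", "queijo", "quente"]
    baseline = sum(len(w) for w in text)
    predicted = sum(clicks_with_prediction(w) for w in text)
    savings = 100 * (baseline - predicted) / baseline
    print(f"{baseline} clicks free typing, {predicted} with prediction "
          f"({savings:.1f}% fewer)")
```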

A self learning vocal interface for speech-impaired users

In this work we describe research aimed at developing an assistive vocal interface for users with a speech impairment. In contrast to existing approaches, the vocal interface is self-learning, which means it is maximally adapted to the end-user and can be used with any language, dialect, vocabulary and grammar. The paper describes the overall learning framework and the vocabulary acquisition technique, and proposes a novel grammar induction technique based on weakly supervised hidden Markov model learning. We evaluate early implementations of these vocabulary and grammar learning components on two datasets: recorded sessions of a vocally guided card game by non-impaired speakers and speech-impaired users engaging in a home automation task.
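
To make the vocabulary-acquisition and grammar-induction ideas concrete, here is a small, hypothetical sketch in which each utterance (a sequence of acoustic word-like units) is paired only with an unordered set of semantic labels. A crude co-occurrence count stands in for the paper's vocabulary acquisition, and a bigram transition estimate over the induced labels stands in for its weakly supervised HMM learning; the units, labels, and data are all invented.

```python
from collections import defaultdict
from itertools import product

# Hypothetical training data for a home-automation task: each utterance is a
# sequence of acoustic word-like units paired with an UNORDERED set of semantic
# labels (weak supervision: we know what was meant, not which unit carries it).
TRAINING = [
    (["u1", "u2"], {"ACTION_ON", "DEVICE_LIGHT"}),
    (["u3", "u2"], {"ACTION_OFF", "DEVICE_LIGHT"}),
    (["u1", "u4"], {"ACTION_ON", "DEVICE_RADIO"}),
    (["u3", "u4"], {"ACTION_OFF", "DEVICE_RADIO"}),
]

# Vocabulary acquisition (a crude co-occurrence stand-in): count how often each
# acoustic unit appears together with each semantic label.
cooc = defaultdict(lambda: defaultdict(int))
for units, labels in TRAINING:
    for u, lab in product(units, labels):
        cooc[u][lab] += 1
unit_to_label = {u: max(labs, key=labs.get) for u, labs in cooc.items()}

# Grammar induction (bigram transitions over induced labels, standing in for
# the paper's weakly supervised HMM learning).
trans = defaultdict(lambda: defaultdict(int))
for units, _ in TRAINING:
    labels = [unit_to_label[u] for u in units]
    for a, b in zip(["<s>"] + labels, labels + ["</s>"]):
        trans[a][b] += 1

if __name__ == "__main__":
    print(unit_to_label)                           # which unit means which label
    print({a: dict(b) for a, b in trans.items()})  # induced label transitions
```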

A Real-Time Oral Cavity Gesture Based Words Synthesizer Using Sensors

Computers, Materials & Continua, 2022

The present system experimentally demonstrates the synthesis of syllables and words from tongue manoeuvres in multiple languages, captured by only four oral sensors. For an experimental demonstration of the system used in the oral cavity, a prototype tooth model was used. Based on the principle developed in a previous publication by the author(s), the proposed system has been implemented using the oral cavity (tongue, teeth, and lips) features alone, without the glottis and the larynx. The positions of the sensors in the proposed system were optimized based on articulatory (oral cavity) gestures estimated by simulating the mechanism of human speech. The system has been tested on all letters of the English alphabet and several words with sensor-based input, along with an experimental demonstration of the developed algorithm using limit switches, a potentiometer, and flex sensors emulating the tongue in an artificial oral cavity. The system produces the sounds of vowels, consonants, and words in English, along with the pronunciation of the meanings of their translations in four major Indian languages, all from oral cavity mapping. The experimental setup also caters to gender mapping of voice. The sound produced by the hardware has been validated by a perceptual test in which listeners verified the gender and word of the speech sample, with ∼98% and ∼95% accuracy, respectively. Such a model may be useful for interpreting speech for those who are speech-disabled because of accidents, neurological disorders, spinal cord injury, or larynx disorders.
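
As a rough illustration of the sensor-to-speech mapping described above, the sketch below classifies one frame of four oral-sensor readings into a phoneme label by nearest-template lookup and collapses consecutive labels into a word string. All sensor names, templates, and values are invented for illustration and are not the authors' calibration; in the real system the resulting word would drive audio output rather than being printed.

```python
# Hypothetical articulatory templates: (lip_switch, tongue_tip_switch,
# tongue_height_norm, tongue_flex_norm) -> phoneme label.
# Values are invented; a real system would calibrate these per user.
TEMPLATES = {
    "p": (1, 0, 0.2, 0.1),
    "t": (0, 1, 0.6, 0.3),
    "a": (0, 0, 0.2, 0.2),
    "i": (0, 0, 0.9, 0.4),
}

def classify_frame(lip, tip, height, flex):
    """Nearest-template phoneme for one frame of sensor readings."""
    def dist(label):
        ref = TEMPLATES[label]
        return ((lip - ref[0]) ** 2 + (tip - ref[1]) ** 2
                + (height - ref[2]) ** 2 + (flex - ref[3]) ** 2)
    return min(TEMPLATES, key=dist)

def frames_to_word(frames):
    """Collapse consecutive identical phoneme labels into a word string."""
    phones = [classify_frame(*f) for f in frames]
    return "".join(p for i, p in enumerate(phones) if i == 0 or p != phones[i - 1])

if __name__ == "__main__":
    # Gesture sequence approximating "pat"; the string would then be handed
    # to a speech synthesizer or recorded-sound bank for audio output.
    frames = [(1, 0, 0.2, 0.1), (1, 0, 0.2, 0.1),
              (0, 0, 0.2, 0.2), (0, 1, 0.6, 0.3)]
    print(frames_to_word(frames))  # "pat"
```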

Speech and Language Processing for Multimodal Human-Computer Interaction (Invited Article)

The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology, 2004

In this paper, we describe our recent work at Microsoft Research, in the project codenamed Dr. Who, aimed at the development of enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present in detail MiPad, the first Dr. Who application, which specifically addresses the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs or smart phones. Despite the currently incomplete implementation, the user study reported in this paper shows that speech and pen together have the potential to significantly improve the user experience.

This system-oriented paper describes the main components of MiPad, with a focus on the robust speech processing and spoken language understanding aspects. The MiPad components discussed in detail include: distributed speech recognition considerations for the speech processing algorithm design; a stereo-based speech feature enhancement algorithm used for noise-robust front-end speech processing; Aurora2 evaluation results for this front-end processing; speech feature compression (source coding) and error protection (channel coding) for distributed speech recognition in MiPad; HMM-based acoustic modeling for continuous speech recognition decoding; a unified language model integrating a context-free grammar and an N-gram model for speech decoding; schema-based knowledge representation for MiPad's personal information management task; a unified statistical framework that integrates speech recognition, spoken language understanding, and dialogue management; the robust natural language parser used in MiPad to process the speech recognizer's output; machine-aided grammar learning and development used for spoken language understanding in the MiPad task; Tap & Talk multimodal interaction and user interface design; back-channel communication and MiPad's error repair strategy; and finally, user study results that demonstrate the superior throughput achieved by the Tap & Talk multimodal interaction over the existing pen-only PDA interface. These user study results highlight the crucial role played by speech in enhancing the overall user experience in MiPad-like human-computer interaction devices.

MiPad's collection of functions, including document reading and annotation, unifies the various devices that people carry around today into a single, comprehensive communication and productivity tool. While the entire functionality of MiPad can be accessed by pen alone, it is preferably accessed by speech and pen combined. The user can dictate to an input field by holding the pen down in it; alternatively, the user can select the speech field by using the roller to navigate and holding it down while speaking. This field selection, called Tap & Talk, not only indicates where the recognized text should go but also serves as a push-to-talk control. Tap & Talk narrows down the number of possible instructions for spoken language processing: for example, selecting the "To:" field on an e-mail application display indicates that the user is about to enter a name. This dramatically reduces the complexity of spoken language processing and cuts down speech recognition and understanding errors to the extent that MiPad can be made practically usable despite the current well-known limitations of speech recognition and natural language processing technology.
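
The Tap & Talk passage above explains how selecting a field constrains spoken language processing; the sketch below is a minimal, hypothetical illustration of that idea. The per-field vocabularies and the rescoring logic are invented, not MiPad's actual unified CFG/N-gram language model.

```python
# Hypothetical per-field vocabularies: tapping a field restricts what the
# recognizer is expected to hear (invented lists, not MiPad's grammars).
FIELD_VOCAB = {
    "To":      {"alex", "kim", "maria", "sanjay"},
    "Subject": None,          # None = unconstrained dictation
    "Date":    {"today", "tomorrow", "monday", "tuesday"},
}

def rescore(hypotheses, field):
    """Filter and rerank n-best recognition hypotheses using the tapped field.

    `hypotheses` is a list of (text, score) pairs from a generic recognizer;
    constraining to the field's vocabulary prunes out-of-domain results.
    """
    vocab = FIELD_VOCAB.get(field)
    if vocab is None:                      # unconstrained field: keep as is
        return sorted(hypotheses, key=lambda h: h[1], reverse=True)
    allowed = [(t, s) for t, s in hypotheses
               if all(w in vocab for w in t.lower().split())]
    return sorted(allowed, key=lambda h: h[1], reverse=True)

if __name__ == "__main__":
    nbest = [("Alex", 0.61), ("a lex", 0.58), ("Alec's", 0.55)]
    # Tapping the "To:" field acts as push-to-talk AND narrows the search:
    print(rescore(nbest, "To"))            # [('Alex', 0.61)]
```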

Application of Natural Language Processing Techniques to Augmentative Communication Systems

2012

Natural language processing (NLP) concerns the interaction between humans and computers: the computer takes input, extracts the meaningful information, and produces natural language as output. NLP falls under the category of computational linguistics and encompasses a range of techniques and applications. The goal of this paper is to identify NLP techniques that can be incorporated into augmentative communication systems. This research can improve the communication rate of children and other individuals with physical disabilities.

The vocal joystick data collection effort and vowel corpus

The Vocal Joystick (VJ) is a mechanism that enables individuals with motor impairments to use vocal parameters to control objects on a computer screen (buttons, sliders, etc.) and, ultimately, electro-mechanical instruments (e.g., robotic arms, wireless home automation devices). To train the VJ system, speech data from the TIMIT corpus was initially used. However, due to issues with co-articulation, we began a large data collection effort in a controlled environment that would not only address those issues but also yield a new vowel corpus representative of the utterances a user of the VJ system would produce. The data collection process evolved over the course of the effort as new parameters were added and as factors relating to the quality of the collected data, in terms of the specified parameters, were considered. The result is a vowel corpus of approximately 11 hours of recorded data, comprising approximately 23,500 sound files of the monophthongs and vowel combinations (e.g., diphthongs) chosen for the Vocal Joystick project, varying along the parameters of duration, intensity, and amplitude. This paper discusses how the data collection has evolved since its initiation and provides a brief summary of the resulting corpus.
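
As an illustration of how such a corpus might be organized for training, the sketch below builds an index of sound files by vowel and recording condition from a CSV manifest. The file-naming scheme, manifest columns, and condition values are assumptions for illustration, not the actual corpus layout.

```python
import csv
from collections import defaultdict
from pathlib import Path

def load_manifest(path):
    """Index corpus entries by (vowel, duration, intensity) from a CSV manifest.

    Assumed manifest columns: file,speaker,vowel,duration,intensity,amplitude
    (hypothetical layout, not the released corpus format).
    """
    index = defaultdict(list)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            key = (row["vowel"], row["duration"], row["intensity"])
            index[key].append(Path(row["file"]))
    return index

if __name__ == "__main__":
    index = load_manifest("vj_vowel_corpus.csv")   # hypothetical manifest file
    # e.g. all short, loud /ae/ tokens for training one direction of the VJ engine
    print(len(index.get(("ae", "short", "loud"), [])))
```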