Akinori Ito | Tohoku University
Papers by Akinori Ito
Applied Sciences, 2021
The development of robots that play with humans is a challenging topic for robotics. We are developing a robot that plays tag with human players. Such a robot needs to observe the players and obstacles around it, chase a target player, and touch the player without collision. To achieve this task, we propose two methods. The first is a player tracking method, by which the robot moves towards a virtual circle surrounding the target player. We used a laser range finder (LRF) as the sensor for player tracking. The second is a motion control method applied after approaching the player: the robot moves away from the player by moving towards the side opposite the player. We conducted a simulation experiment and an experiment using a real robot. Both experiments showed that with the proposed tracking method, the robot properly chased the player and moved away from the player without collision. The contribution of this paper is the development of a robot control metho...
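The idea of steering towards a virtual circle around the target player can be illustrated with a small geometric sketch. This is not the controller from the paper: the function name `circle_goal`, the choice of the nearest point on the circle as the goal, and the 2-D coordinate convention are all assumptions made for illustration.

```python
import math


def circle_goal(robot_xy, player_xy, radius):
    """Return the point on a circle around the player that is nearest the robot.

    Assumption: the robot aims at the closest point of the virtual circle.
    The paper's actual goal selection may differ.
    """
    rx, ry = robot_xy
    px, py = player_xy
    dx, dy = rx - px, ry - py
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        # Degenerate case: robot and player coincide, so pick an arbitrary direction.
        return (px + radius, py)
    scale = radius / dist
    return (px + dx * scale, py + dy * scale)


# Robot at the origin, player at (3, 4), circle radius 1 (arbitrary units).
goal = circle_goal((0.0, 0.0), (3.0, 4.0), 1.0)
```

Aiming at a point on the circle rather than at the player's position keeps the goal at a fixed stand-off distance, which is one plausible way to approach without colliding.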
J. Inf. Hiding Multim. Signal Process., 2017
This paper describes methods that add value to audio signals using side information. Many acoustic signal processing methods have been proposed for estimating lost information from the original signal. Using appropriate side information, we can enhance the estimation easily. In this paper, the principle of audio signal processing using side information is described first, and then three applications are described: packet loss concealment of audio signals, manipulation of mixed music signals, and frequency band extension of telephone speech.
2016 24th European Signal Processing Conference (EUSIPCO), 2016
A design method of a multiple description vector quantizer (VQ) is proposed. VQ is widely used for data compression, transmission and other processing. Here, we assume transmission channels with data erasure such as a packet-based network. Multiple description coding is a coding method used to achieve “graceful degradation” when transmitting signals through lossy channels. The proposed method is inspired by the vector quantizer design of Poggi et al., which combines VQ design based on the self-organizing map (SOM) and the multiple description scalar quantizer (MDSQ). The method also uses the SOM-based VQ; the difference is that the proposed method combines a bit-error-tolerant VQ designed by SOM and a novel scheme for cell arrangement of SOM based on Redundant Representation of Central Code (RRCC). The method is not only easy to design for any bit rate but is also more robust against data erasure compared with the conventional VQ.
Interdisciplinary Information Sciences, 2012
Acoustical Science and Technology, 2009
Interdisciplinary Information Sciences
Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing, 2018
During the feature extraction process for speech recognition, a window function is first applied to the input waveform to extract a temporally limited spectrum. By shifting the window function by a short time period, we can analyze the temporal change of the speech spectrum. This time period is called “the frame shift,” which is usually 5 to 10 ms. In this paper, the frame shift is reconsidered from two aspects. The first is the appropriateness of 10 ms as the frame shift. Frame-based processing assumes that the temporal change of the speech spectrum is slow enough compared with the frame shift, which does not hold for some kinds of consonants such as plosives. This paper shows experimentally that feature values fluctuate considerably depending on the starting position of the frame. A training method is then proposed that uses temporally shifted samples as independent samples to compensate for the fluctuation of features caused by differences in the starting position of a frame. The second aspect is that the frame shift could be longer if the fluctuation can be compensated. To test this, an experiment was conducted in which the frame shift was varied from 10 to 60 ms. The result with a 40 ms frame shift outperformed the result with a 10 ms frame shift, and recognition performance comparable to the 10 ms result was obtained with a 50 ms frame shift.
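The relation between frame shift, frame count, and temporally shifted training samples can be made concrete with a short sketch. The sampling rate (16 kHz), the window length (25 ms), and the helper names `frame_starts` and `shifted_starts` are assumptions for illustration and are not taken from the paper.

```python
def frame_starts(num_samples, frame_len, frame_shift, offset=0):
    """Start indices of every full frame, beginning at `offset` samples."""
    last_start = num_samples - frame_len
    return list(range(offset, last_start + 1, frame_shift))


def shifted_starts(num_samples, frame_len, frame_shift, num_variants):
    """Frame start lists for several copies of one utterance.

    Each copy begins at a different offset within one frame shift, so the
    same utterance yields several framings that can be treated as
    independent training samples.
    """
    step = frame_shift // num_variants
    return [
        frame_starts(num_samples, frame_len, frame_shift, offset=k * step)
        for k in range(num_variants)
    ]


# Assumed setup: 1 s of audio at 16 kHz, 25 ms window (400 samples).
SAMPLES = 16000
WINDOW = 400
frames_10ms = frame_starts(SAMPLES, WINDOW, 160)   # 10 ms shift
frames_40ms = frame_starts(SAMPLES, WINDOW, 640)   # 40 ms shift
variants = shifted_starts(SAMPLES, WINDOW, 160, 4)  # offsets 0, 40, 80, 120
```

Under these assumptions a 40 ms shift produces roughly a quarter of the frames of a 10 ms shift, which is why a longer shift is attractive if the position-dependent fluctuation can be compensated.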
Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing, 2018
Spoken dialog systems have become popular and are used in home environments, for example as smart speakers. A problem occurs when two or more smart speakers are in the same environment: a dialog system may misdetect another dialog system’s voice as a user’s voice. In this paper, a method to mute synthesized speech is proposed to prevent a speech recognizer from recognizing speech uttered by a machine. An audio watermark is used to indicate that a machine uttered the speech, and the speech recognizer attenuates the observed speech if it contains the watermark. The watermark is embedded in the high-frequency band so that humans cannot perceive it and so that it can be extracted robustly. From the experimental results, we found that the proposed method robustly determines the existence of the watermark when the SNR is no less than 0 dB.
Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing, 2018
Designing an example database is important for handling various users’ utterances in an example-based dialog system, and several approaches to constructing the database have been proposed. This paper focuses on a method for collecting example sentences through actual conversations with the system. Several studies employ this approach for constructing dialog systems, but conventional research lacks detailed analyses. In this study, we analyzed how many examples can be collected from the interactions, and investigated the characteristics of the collected examples. The experimental results show that the response accuracy improved as the number of interactions increased, and that the examined collection method is effective for collecting examples of consecutive utterances. In addition, a subjective evaluation comparing the databases constructed using actual conversation and the fully handcrafted databases was conducted through dialog experiments. The results showed that the exa...
We investigated the effect of the height of a robot on the comfortableness of verbal interaction with the robot. We created a robot whose height could be changed continuously, and carried out dialog experiments with humans at varying robot heights. We employed 19 participants to evaluate “comfortableness of dialog”, and investigated the height at which the participants felt the dialog was most comfortable. Next, we investigated differences in dialog comfortableness when the height of the robot was changed. Finally, we changed the distance between the participant and the robot and observed whether the dialog comfortableness changed. The experimental results yielded the following three guidelines for designing the height of a communication robot. First, the optimum height of a communication robot is about 300 mm lower than the eye height of the user. Second, the comfortableness of dialog with the robot degrades when the height of the robot is 200 mm lower or 300 mm higher than the optimum...
2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018
Many current spoken dialog systems can conduct non-task-oriented dialog. Systems that can improve the user’s impression are required for users to keep using them. This paper focuses on self-disclosure, the process by which a person reveals information about herself/himself to an interlocutor in human-human conversation. It is known that self-disclosure plays a vital role in developing an intimate relationship. However, it is still unclear how exchanging self-disclosures affects the user’s impression in human-machine dialog. In this paper, we conduct dialog experiments to investigate the effectiveness of mutual self-disclosure between the user and the system. To achieve this goal, we built a spoken dialog system that conducts a dialog in which the user and the system alternately disclose information about themselves. The dialog experiments revealed that the proposed system can improve the user’s impression regarding satisfaction and friendliness.
We investigate a method of detecting wrong lyrics from a singing voice. In the proposed method, we compare the input singing voice and the reference singing voice using dynamic time warping, and then observe the frame-by-frame distance to find the error location. However, the absolute value of the distance is affected by the individuality of the reference and input singers. Thus, we attempted to adapt the input singer’s individuality to that of the reference singer by a linear transformation. The results of the experiment showed that we could detect the wrong lyrics with high accuracy when the differing part of the lyrics was long. In addition, we investigated the effect of iterative linear transformation, and we could not find any benefit from the second or third linear transformations.
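The alignment step described in this abstract can be sketched with a textbook dynamic time warping (DTW) routine that returns the distance at each aligned frame pair. This is a generic DTW, not the authors' implementation: the function name `frame_distances`, the scalar toy features, the absolute-difference distance, and the threshold of 2 are all assumptions for illustration.

```python
def frame_distances(ref, inp, dist):
    """Align `ref` and `inp` with DTW; return (ref_idx, inp_idx, distance) per step."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    # cost[i][j] = minimum accumulated distance aligning ref[:i] with inp[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1, dist(ref[i - 1], inp[j - 1])))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Toy one-dimensional "features": the input deviates from the reference in frames 2-3.
reference = [0, 0, 1, 1, 0, 0]
sung = [0, 0, 5, 5, 0, 0]
aligned = frame_distances(reference, sung, lambda a, b: abs(a - b))
# Reference frames whose aligned distance exceeds an assumed threshold of 2.
suspect = sorted({i for i, _, d in aligned if d > 2})
```

In practice each frame would be a spectral feature vector and the threshold would need tuning. The abstract notes that raw distances also depend on singer individuality, which is why the authors add a linear transformation before comparing.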
There are high expectations for multimodal dialog systems that can make natural small talk with facial expressions, gestures, and gaze actions as next-generation dialog-based systems. Two important roles of the chat-talk system are keeping the user engaged and establishing rapport. Many studies have conducted user evaluations of such systems, some of which reported that considering the relationship with the user is an effective way to improve the subjective evaluation. To facilitate research of such dialog systems, we are currently constructing a large-scale multimodal dialog corpus focusing on the relationship between speakers. In this paper, we describe the data collection and annotation process, and analysis of the corpus collected in the early stage of the project. This corpus contains 19,303 utterances (10 hours) from 19 pairs of participants. A dialog act tag is annotated to each utterance by two annotators. We compare the frequency and the transition probability of the tags b...
When a teacher collects students’ assignments electronically, one big problem is plagiarism of reports from documents on a Web site or from other learners’ reports. This paper proposes a framework using data hiding technology to suppress plagiarism. In this framework, a teacher embeds the ID of a student into a template file and sends the template file to the student. The student writes a report using the template file and submits it. The teacher extracts the ID from the report file to validate the file’s originality. The Office Open XML (OOXML) format was chosen as the format of the template file because of its popularity. In the experiment, two methods were examined. The first method inserts small images with the ID into the template file. The second method embeds the ID into the fonts of the headings. According to the results of the experiments, the method using images was fragile against format conversion into PDF, and the method of font switching was more robust while the amount of embed...