Crossmodal Interaction Research Papers - Academia.edu
2025, Synthese
According to the decomposition thesis, perceptual experiences resolve without remainder into their different modality-specific components. Contrary to this view, I argue that certain cases of multisensory integration give rise to experiences representing features of a novel type. Through the coordinated use of bodily awareness-understood here as encompassing both proprioception and kinaesthesis-and the exteroceptive sensory modalities, one becomes perceptually responsive to spatial features whose instances couldn't be represented by any of the contributing modalities functioning in isolation. I develop an argument for this conclusion focusing on two cases: 3D shape perception in haptic touch and experiencing an object's egocentric location in crossmodally accessible, environmental space.
2025, Philosophy Compass
The first part of this survey article offered a cartography of some of the more extensively studied forms of multisensory processing. In this second part, I turn to examining some of the different possible ways in which the structure of conscious perceptual experience might also be characterized as multisensory. In addition, I discuss the significance of research on multisensory processing and multisensory consciousness for philosophical debates concerning the modularity of perception, cognitive penetration, and the individuation of the senses.
2025, Philosophy and the Mind Sciences
In this paper, we want to tackle the Molyneux question thoroughly, by addressing it in terms of both ordinary perception and pictorial perception: if a congenitally blind person recovered sight, could she visually recognize the 3D shapes she already recognized tactilely, both when such shapes are given to her directly and when they are given to her pictorially, i.e., as depicted shapes? We claim that empirical evidence suggests that the question can be answered positively in both cases. In the first case, such evidence shows that perception of 3D shapes is supramodal; namely, it can be equivalently achieved in different sense modalities, notably touch and vision, independently of the sensory input through which such shapes are accessed. In the second case, such evidence shows that, both in vision and in touch, one can satisfy the condition for depicted shapes, which are typically not where the perceiver is, to be grasped by that perceiver in a picture's subject, i.e., what the picture presents. This condition states that the picture's vehicle, i.e., the typically 2D physical basis of a picture, is enriched by adding to its properties the 3D grouping properties that allow a figure/ground segmentation to be performed on that vehicle's elements.
2025
The ability to understand others' emotional states from variations in speech prosody has an adaptive value and is crucial both for personal and social adjustment. This ability emerges early in development, and it is related to better socio-emotional skills in children. However, the neural basis of emotional prosody recognition in children remains poorly understood. The main goal of the current study was to investigate whether differences in brain morphology might explain individual differences in children's ability to recognize emotions through prosodic cues. A sample of 66 children (M = 8.30 years; SD = 0.35) completed both a behavioural task and a structural magnetic resonance imaging scan. In the behavioural task, children listened to semantically neutral sentences and had to perform two consecutive judgments for each stimulus, including a forced-choice categorization of the emotional tone (neutral, happy, sad, angry, scared) and an intensity judgment, rating the salience of the emotion in the stimulus. Results revealed that children achieved high recognition accuracy rates for all emotional categories, and happiness was the best recognized emotion. Moreover, in terms of neural structures, there were correlations between higher emotional prosody recognition and increased grey matter volume in the fusiform gyrus, cerebellum, motor/premotor and prefrontal regions, and decreased grey matter volume in parietal and occipital regions. Additionally, we found that some brain regions were correlated with higher recognition accuracy for specific emotions when directly compared to the others. Our findings suggest that individual differences in children's ability to recognize emotions through prosodic cues relate to differences in brain morphology, both for general emotional prosody recognition ability and for the recognition of specific emotional categories.
2025, arXiv
In recent decades, neuroscientific and psychological research has traced direct relationships between taste and auditory perceptions. This article explores multimodal generative models capable of converting taste information into music, building on this foundational research. We provide a brief review of the state of the art in this field, highlighting key findings and methodologies. We present an experiment in which a fine-tuned version of a generative music model (MusicGEN) is used to generate music based on detailed taste descriptions provided for each musical piece. The results are promising: according to the participants' (n = 111) evaluation, the fine-tuned model produces music that more coherently reflects the input taste descriptions compared to the non-fine-tuned model. This study represents a significant step towards understanding and developing embodied interactions between AI, sound, and taste, opening new possibilities in the field of generative AI. We release our dataset, code and pre-trained model at: https://osf.io/xs5jy/.
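A rough sense of what such a pipeline involves can be given in code. The sketch below is not the paper's fine-tuned model; it only shows, assuming the publicly available audiocraft implementation of MusicGen, how taste descriptions could be used as text prompts for generation. The model id, prompt texts, and output names are illustrative; the fine-tuning on taste-description/music pairs described by the authors would happen before this inference step.

```python
# Sketch only: prompting a MusicGen checkpoint with taste descriptions.
# Requires the audiocraft package; the model id, prompts, and file names
# below are illustrative and are not the fine-tuned model from the paper.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

taste_prompts = [
    "a sweet, soft taste with a hint of vanilla",   # hypothetical descriptions
    "a bitter, sharp taste like dark coffee",
]

model = MusicGen.get_pretrained("facebook/musicgen-small")  # public base model
model.set_generation_params(duration=8)                     # seconds of audio

wavs = model.generate(taste_prompts)       # one waveform per text prompt
for i, wav in enumerate(wavs):
    # audio_write appends the extension and applies loudness normalization.
    audio_write(f"taste_music_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```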
2025
POLISphone is software for music performance, inspired by the popular idea of the “soundmap”. Unlike most soundmaps, its main aim is to provide a way to easily create original soundmaps and perform with them. It also aims to be a versatile interface, both visually and sonically, and to induce a sense of instrumentality. In this paper, the authors describe its implementation and, in addition, offer considerations regarding its use and performativity potential, based on fieldwork.
2025
This thesis has described the development of an artificial listener model capable of predicting a number of different perceived spatial attributes at arbitrary locations in the listening area for reproduced sound. Previous research into modelling the perceived spatial attributes of sound reproduction systems has concentrated primarily on the sweet spot in the centre of the listening area. However, good audio reproduction is ideally required at multiple points in the listening area, for example for a family living room with a home cinema system. A framework for modelling the perception of reproduced audio was developed, including the capture of the original sound-field, modelling the signals at the ears and the translation of the binaural signals to the perceptual domain. Explicitly modelling the binaural signals meant that the same principal cues as used by human listeners were employed, and also allowed existing binaural models to be incorporated into the system. The three most widely researched perceived spatial attributes were investigated: directional localisation, source width and listener envelopment. The models for predicting each of the three spatial attributes were validated using the results from formal listening tests. The output of the model was highly correlated with the localisation and envelopment results from the listening tests. Lower correlations were obtained for the predicted source width and for some groups of stimuli for directional localisation, and a number of modifications which may improve the performance of the model were identified.
2024, 40. Jahrestagung der Deutschen Gesellschaft für Musikpsychologie (DGM), 6.-8.9.2024, Hochschule für Musik und Theater München.
Greiler-Weiß, P., Reuter, C., Ambros, S., Glasser, S., Jewanski, J. (2024). Ich sehe was, was Du nicht hörst – Farbe-Klangfarbe-Assoziationen bei Grundschulkindern der Klassen 1-4. Posterbeitrag, 40. Jahrestagung der Deutschen Gesellschaft für Musikpsychologie (DGM), 6.-8.9.2024, Hochschule für Musik und Theater München.
2024, 40. Jahrestagung der Deutschen Gesellschaft für Musikpsychologie (DGM), 6.-8.9.2024, Hochschule für Musik und Theater München.
Feller, G., Reuter, C. (2024). Bunte Synthesizer. Crossmodal Correspondences zwischen Farbtönen und synthetischen Klängen. Posterbeitrag, 40. Jahrestagung der Deutschen Gesellschaft für Musikpsychologie (DGM), 6.-8.9.2024, Hochschule für Musik und Theater München.
2024, Fortschritte der Akustik - DAGA 2024, 50. Jahrestagung der Deutschen Gesellschaft für Akustik
Feller, G., Reuter, C. (2024). Tonale Farbtöne - Crossmodal perception of (quasi) synthetic sounds. In: Fortschritte der Akustik - DAGA 2024, 50. Jahrestagung der Deutschen Gesellschaft für Akustik. (S. 1247-1250). Hannover.
2024, H.A. Nikolaeva, & C.B. Konanchuk (Eds.); Polylogue and Synthesis of Arts - Materials. St. Petersburg
Jewanski, J., Reuter, C., Czedig-Eysenberg, I., Siddiq, S., Saitis, C., Kruchten, S., Sakhabiev, R., & Oehler, M. (2019). Timbre and Colors. Features and Tendencies of timbre-color-mappings. In: H.A. Nikolaeva, & C.B. Konanchuk (Eds.); Polylogue and Synthesis of Arts - Materials (p. 50-51). St. Petersburg.
2024, Fortschritte der Akustik - DAGA 2021, 47. Jahrestagung der Deutschen Gesellschaft für Akustik
Ambros, S., Jewanski, J., Reuter, C., Schmidhofer, A. (2021). Sind crossmodal correspondences kulturell erlernt? Zuordnungen von Farben und Tönen in Madagaskar. In: Fortschritte der Akustik - DAGA 2021, 47. Jahrestagung der Deutschen Gesellschaft für Akustik (p. 1076-1079). Wien.
2024, Fortschritte der Akustik - DAGA 2022, 48. Jahrestagung der Deutschen Gesellschaft für Akustik
Ambros, S., Reuter, C. (2022). Crossmodal Correspondences bei Musiker:innen und Nichtmusiker:innen im empirischen Vergleich. In: Fortschritte der Akustik - DAGA 2022, 48. Jahrestagung der Deutschen Gesellschaft für Akustik (S. 993-996). Stuttgart.
2024, Fortschritte der Akustik - DAGA 2023, 49. Jahrestagung der Deutschen Gesellschaft für Akustik
Feller, G., Reuter, C. (2023). Klingt Sinus blau und Sägezahn rot? Eine Untersuchung zu Crossmodal Correspondences bei der Wahrnehmung von synthetischen Wellenformen. In: Fortschritte der Akustik - DAGA 2023, 49. Jahrestagung der Deutschen Gesellschaft für Akustik (S. 1163-1166). Hamburg.
2024
People can consistently match odors to colors, and within a culture, there are similarities in color-odor associations. These associations are forms of crossmodal correspondences. Recently, there has been discussion about the extent to which these correspondences arise for structural reasons (e.g., an inherent mapping between color and odor), statistical reasons (e.g., covariance in experience), and/or semantically-mediated reasons (e.g., stemming from language). The present study probed this question by testing color-odor correspondences in 6 different cultural groups (Dutch, Dutch-residing Chinese, German, Malay, Malaysian-Chinese, and US residents), using the same set of 14 odors and asking participants to make congruent and incongruent color choices for each odor. We found consistent patterns in color choices for each odor within each culture, and variation in the patterns of color-odor associations across cultures. Thus, culture plays a role in color-odor crossmodal associations, which likely arise, at least in part, through experience.
2024, Applied Acoustics
Restaurants are complex environments where all our senses are engaged. Physical and psychoacoustic factors have been shown to be associated with perceived environmental quality in restaurants. More or less designable sound sources such as background music, voices, and kitchen noises are believed to be important in relation to the overall perception of the soundscape. Previous research publications have suggested typologies and other structured descriptions of sound sources for some environmental contexts, such as urban parks and offices, but there is no detailed account that is relevant to restaurants. While existing classification schemes might be extendable, an empirical approach was taken in the present work. We collected on-site data in 40 restaurants (n = 393), including perceptual ratings, free-form annotations of characteristic sounds and whether they were liked or not, and free-form descriptive words for the environment as a whole. The annotations were subjected to analysis using a cladistic approach and yielded a multi-level taxonomy of perceived sound sources in restaurants. Ten different classification taxa were evaluated by comparing the respondents' Liking of sound sources, by categories defined in the taxonomy, and their Pleasantness rating of the environment as a whole. Correlation analysis revealed that a four-level clade was efficient and outperformed alternatives. Internal validation of the Pleasantness construct was made through separate ratings (n = 7) of on-site free-form descriptions of the environment. External validation was made with ratings from a separate listening experiment (n = 48). The two validations demonstrated that the four-level Sound Sources in Restaurants (SSR) clade had good construct validity and external robustness. Analysis of the data revealed two findings. Voice-related characteristic sounds including a 'people' specifier were more liked than those without such a specifier (d = 0.14 SD), possibly due to an emotional crossmodal association mechanism. Liking of characteristic sounds differed between the first and last annotations that the respondents had made (d = 0.21 SD), which might be due to an initially positive bias being countered by exposure to a task inducing a mode of critical listening. We believe that the SSR taxonomy will be useful for field research and simulation design. The empirical findings might inform theory, specifically research charting the perception of sound sources in multimodal environments.
2024
Expectation learning is a continuous unsupervised learning process which uses multisensory bindings to modulate unisensory perception. As humans, we learn to associate a barking sound with the visual appearance of a dog, and we continuously fine-tune this association over time as we learn, e.g., to associate high-pitched barking with small dogs. In this work, we address the problem of building a computational model that captures two important properties of expectation learning, namely continuity and the lack of any external supervision other than temporal co-occurrence. To this end, we present a novel hybrid neural model based on audio/visual autoencoders and a recurrent self-organizing network for stimulus reconstruction and multisensory binding. We demonstrate that the proposed model is capable of learning concept bindings, i.e. dog barking with dogs, by evaluating it on unisensory classification tasks for audio-visual stimuli using the 43,500 YouTube videos in the animal subset of the AudioSet corpus. In addition, our analysis and discussion explain how the expectation learning mechanism enforces the generation of high-level bindings and how they contribute to audiovisual recognition.
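The core idea, learning crossmodal bindings purely from temporal co-occurrence, can be illustrated with a much smaller stand-in than the authors' autoencoder-plus-self-organizing-network model. The sketch below uses a plain Hebbian association matrix between invented audio and visual feature vectors; all dimensions and the learning rate are assumptions for illustration.

```python
# Toy stand-in for expectation learning (not the authors' architecture):
# a Hebbian association matrix W links audio and visual feature vectors
# that are observed at the same time step, with no labels involved.
import numpy as np

rng = np.random.default_rng(0)
AUDIO_DIM, VISUAL_DIM, LR = 16, 24, 0.05   # hypothetical sizes and learning rate

W = np.zeros((AUDIO_DIM, VISUAL_DIM))      # crossmodal association weights

# A recurring pairing, e.g. 'barking' audio co-occurring with 'dog' visuals.
bark = rng.normal(size=AUDIO_DIM)
dog = rng.normal(size=VISUAL_DIM)

for _ in range(200):
    audio_obs = bark + 0.1 * rng.normal(size=AUDIO_DIM)    # noisy co-occurrence
    visual_obs = dog + 0.1 * rng.normal(size=VISUAL_DIM)
    W += LR * np.outer(audio_obs, visual_obs)              # Hebbian update

# After learning, the audio cue alone retrieves a visual 'expectation'
# that correlates with the visual pattern it was paired with.
expected_visual = bark @ W
print(round(np.corrcoef(expected_visual, dog)[0, 1], 3))
```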
2024, Frontiers in Neurorobotics
2024, 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob)
The brain integrates information from multiple sensory modalities to form a coherent and robust perceptual experience in complex environments. This ability is progressively acquired and fine-tuned during developmental stages in a multisensory environment. A rich set of neural mechanisms supports the integration and segregation of multimodal stimuli, providing the means to efficiently solve conflicts across modalities. Therefore, there is the motivation to develop efficient mechanisms for robotic platforms that process multisensory signals and trigger robust sensory-driven motor behavior. In this paper, we implement a computational model of crossmodal integration in a sound source localization task that accounts also for audiovisual conflict resolution. Our model consists of two layers of reciprocally connected visual and auditory neurons and a layer with crossmodal neurons that learns to integrate (or segregate) audiovisual stimuli on the basis of spatial disparity. To validate our architecture, we propose a spatial localization task in which 30 subjects had to determine the location of the sound source in a virtual scenario with four animated avatars. We measured their accuracy and reaction time under different conditions for congruent and incongruent audiovisual stimuli. We used this study as a baseline to model human-like behavioral responses with a neural network architecture exposed to the same experimental conditions.
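The integrate-or-segregate behaviour the model is meant to capture can be summarized numerically: fuse the two location estimates by reliability weighting when their spatial disparity is small, and keep them separate otherwise. The sketch below is a toy illustration of that logic, not the paper's neural architecture; the noise values and the disparity threshold are assumptions.

```python
# Sketch of disparity-gated audiovisual localization (illustrative values only).

SIGMA_A, SIGMA_V = 8.0, 2.0      # assumed localization noise (deg), audio vs. vision
DISPARITY_THRESHOLD = 15.0       # assumed gate for integration vs. segregation (deg)

def localize(audio_deg, visual_deg):
    """Return the perceived sound-source azimuth in degrees."""
    if abs(audio_deg - visual_deg) <= DISPARITY_THRESHOLD:
        # Integrate: reliability-weighted average (inverse-variance weights).
        w_a = 1.0 / SIGMA_A**2
        w_v = 1.0 / SIGMA_V**2
        return (w_a * audio_deg + w_v * visual_deg) / (w_a + w_v)
    # Segregate: the visual cue is treated as a different source; keep audio.
    return audio_deg

print(localize(10.0, 5.0))    # small conflict -> estimate pulled toward the visual cue
print(localize(40.0, 5.0))    # large conflict -> the auditory estimate is kept
```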
2024, Adaptive Behavior
A robot capable of understanding emotion expressions can increase its own capability of solving problems by using emotion expressions as part of its own decision-making, in a similar way to humans. Evidence shows that the perception of human interaction starts with an innate perception mechanism, where the interaction between different entities is perceived and categorized into two very clear directions: positive or negative. While the person is developing during childhood, the perception evolves and is shaped based on the observation of human interaction, creating the capability to learn different categories of expressions. In the context of human–robot interaction, we propose a model that simulates the innate perception of audio–visual emotion expressions with deep neural networks, that learns new expressions by categorizing them into emotional clusters with a self-organizing layer. The proposed model is evaluated with three different corpora: The Surrey Audio–Visual Expressed Emo...
2024, 2nd ACM SIGCHI International Workshop
We present three digital systems: the Augmented Glass, the Bone-Conduction Hookah, and the sound installation T2M, designed for displaying sound and taste stimuli, with applications in research on crossmodal taste-sound interactions, multisensory experiences and performances, entertainment and health.
2024
Some robots have been given emotional capacities (expression and recognition) to improve human-robot interaction (HRI) and robot-robot interaction (RRI). In this article we analyze what it means for a robot to have emotion, distinguishing emotional states used for communication from emotional states that serve as a mechanism for organizing its behavior with humans and other robots, using a convolutional neural network (CNN). We also discuss the relation between emotion and cognition and describe the biological aspects of emotion. We explain why a CNN can be more effective than other methods for giving robots better emotional capabilities. On this basis, we present an emotion-based architecture for robots built on a CNN.
2024, PeerJ
In this study, we examine different approaches to the presentation of Y coordinates in mobile auditory graphs, including the representation of negative numbers. The studies involved both normally sighted and visually impaired users, as there are applications where normally sighted users might employ auditory graphs, such as the unseen monitoring of stocks or of fuel consumption in a car. Multi-reference sonification schemes are investigated as a means of improving the performance of mobile non-visual point estimation tasks. The results demonstrated that both populations are able to carry out point estimation tasks with a good level of performance when presented with auditory graphs using multiple reference tones. Additionally, visually impaired participants performed better on graphs represented in this format than normally sighted participants. This work also implements the component representation approach for negative numbers, using the same positive mapping reference for the digit and adding a sign tone before it, which leads to better accuracy for the polarity sign. This work contributes to the design process for mobile auditory devices in human-computer interaction and proposes a methodological framework for improving auditory graph performance in graph reproduction.
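A minimal sketch of that mapping idea, under assumed frequencies and value ranges: the magnitude of a data point is mapped onto pitch with the same positive reference regardless of sign, and a separate low "sign" tone is prepended for negative values, after a set of reference tones that give the listener context. All constants below are invented for illustration, not taken from the study.

```python
# Sketch of point-estimation sonification with multiple reference tones and a
# 'component' representation of negative numbers (all frequencies assumed).

F_MIN, F_MAX = 220.0, 880.0      # pitch range for magnitudes 0..100 (assumed)
SIGN_TONE = 110.0                # low tone prepended to mark a negative value
REFERENCE_VALUES = [0, 50, 100]  # context tones played before the data point

def value_to_freq(magnitude, lo=0.0, hi=100.0):
    """Linearly map a magnitude in [lo, hi] onto the pitch range."""
    frac = (magnitude - lo) / (hi - lo)
    return F_MIN + frac * (F_MAX - F_MIN)

def sonify(y):
    """Return the tone sequence (Hz) for one data point: references, sign, value."""
    tones = [value_to_freq(v) for v in REFERENCE_VALUES]   # orientation context
    if y < 0:
        tones.append(SIGN_TONE)                            # polarity component
    tones.append(value_to_freq(abs(y)))                    # magnitude component
    return tones

print(sonify(37))    # reference tones, then the value tone
print(sonify(-37))   # same magnitude tone, preceded by the sign tone
```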
2024, Perception
Linear trend (slope) is important information conveyed by graphs. We investigated how sounds influenced slope detection in a visual search paradigm. Four bar graphs or scatter plots were presented on each trial. Participants looked for a positive-slope or a negative-slope target (in blocked trials), and responded to targets in a go/no-go fashion. For example, in a positive-slope-target block, the target graph displayed a positive slope while other graphs displayed negative slopes (a go trial), or all graphs displayed negative slopes (a no-go trial). When an ascending or descending sound was presented concurrently, ascending sounds slowed detection of negative-slope targets whereas descending sounds slowed detection of positive-slope targets. The sounds had no effect when they immediately preceded the visual search displays, suggesting that the results were due to crossmodal interaction rather than priming. The sounds also had no effect when targets were words describing slopes, such as "positive," "negative," "increasing," or "decreasing," suggesting that the results were unlikely due to semantic-level interactions. Manipulations of spatiotemporal similarity between sounds and graphs had little effect. These results suggest that ascending and descending sounds influence visual search for slope based on a general association between the direction of auditory pitch-change and visual linear trend.
2024
The HapticWave is a haptic audio waveform display device that has been developed in collaboration with a group of audio engineers and producers with visual impairments for use in real-world recording studio environments. This was not a project led by a designer or driven by a design brief: rather, the genesis of the HapticWave emerged from exchange and interaction between actors who brought to the table different practices, experiences, expertise and needs. By presenting the voices involved in this practice-based research project, we offer a comprehensive report to retrace step by step the development and deployment of a research prototype.
2024, Attention, Perception & Psychophysics
Schutz and Lipscomb (2007) reported a naturally occurring audiovisual illusion in which visual information changes the perceived duration of simultaneous auditory information. They demonstrated this by showing participants videos of a percussionist striking a marimba with either a long flowing gesture (labeled "long") that covered a large arc or with a short choppy gesture (labeled "short") that rebounded off of the bar and quickly stopped. Although the resultant sounds were acoustically indistinguishable and participants were asked to ignore visual information when judging tone duration, duration ratings were longer when presented with long rather than short gestures. In light of evidence that vision does not influence auditory judgments of tone duration (Walker & Scott, 1981), this illusion is unexpected. It is an exception to the rule that, with respect to a given task, the modality offering less accurate information does not appreciably influence the modality offering more accurate information. For example, the superior temporal precision of the auditory system generally translates into auditory dominance for temporal tasks such as the judgment of tone duration. Likewise, estimates of flash timings are more affected by temporally offset tones than estimates of tone timings are affected by temporally offset flashes (Fendrich & Corballis, 2001); and auditory flutter rate affects the perception of visual flicker rate, whereas the rate of visible flicker either fails to affect the perceived rate of concurrent auditory flutter (Shipley, 1964) or affects it minimally (Welch, DuttonHurt, & Warren, 1986). Understanding the Illusion: We believe that the perception of a causal link between auditory and visual information is crucial to explaining why the illusion reported by Schutz and Lipscomb (2007) conflicts so strongly with previous work on sensory integration. However, before presenting evidence in support of this view, we will first discuss two alternative explanations that have been previously dismissed by Schutz and Kubovy (in press). We will close this section by explaining our reasons for proposing that causality plays an important role and by discussing links between this illusion and previous work on the unity assumption. Post-perceptual processing cannot explain the illusion: As has been shown by Arieh and Marks (2008), certain patterns of cross-modal interactions may be explained by decisional changes, rather than by sensory shifts. Therefore, it is possible that longer gestures could have suggested longer durations, affecting ratings through a top-down process (i.e., a response bias), without any actual perceptual shift. To test this explanation, Schutz and Kubovy (in press) designed a series of experiments manipulating the causal relationship between the auditory and visual components of the stimuli.
2024
Complex systems abound in nature and are becoming increasingly important in artificial systems. The understanding and controlling of such systems is a major challenge. This paper tries to take a fresh approach to these issues by describing an interactive art project that involves cross-modal interaction with a complex system. By combining sound and vision, the temporal and spatial dynamics of the system are conveyed simultaneously. Users can influence its dynamics in real time by using acoustics. Preliminary experiments with this system show that the combination of sound and vision can help users to obtain an intuitive understanding of the system's behavior. In addition, usability profits from the fact that the same modality is employed for both interaction and feedback.
2024, Cortex
Introduction: Crossmodality (i.e., the integration of stimulations coming from different sensory modalities) is a crucial ability in everyday life and has been extensively explored in healthy adults. Still, it has not yet received much attention in psychiatry, and particularly in alcohol-dependence. The present study investigates the cerebral correlates of crossmodal integration deficits in alcohol-dependence to assess whether these deficits are due to the mere accumulation of unimodal impairments or rather to specific alterations in crossmodal areas. Methods: Twenty-eight subjects [14 alcohol-dependent subjects (ADS), 14 paired controls] were scanned using fMRI while performing a categorization task on faces (F), voices (V) and face-voice pairs (FV). A subtraction contrast [FV - (F + V)] and a conjunction analysis [(FV - F) ∩ (FV - V)] isolated the brain areas specifically involved in crossmodal face-voice integration. The functional connectivity between unimodal and crossmodal areas was explored using psychophysiological interactions (PPI). Results: ADS presented only moderate alterations during unimodal processing. More centrally, in the subtraction contrast and conjunction analysis, they did not show any specific crossmodal brain activation, while controls presented activations in specific crossmodal areas (inferior occipital gyrus, middle frontal gyrus, superior parietal lobule). Moreover, PPI analyses showed reduced connectivity between unimodal and crossmodal areas in alcohol-dependence. Conclusions: This first fMRI exploration of crossmodal processing in alcohol-dependence showed a specific face-voice integration deficit indexed by reduced activation of crossmodal areas and reduced connectivity in the crossmodal integration network. Using crossmodal paradigms is thus crucial to correctly evaluate the deficits presented by ADS in real-life situations.
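The two criteria used to isolate crossmodal areas can be written out explicitly. The toy sketch below applies them to simulated per-voxel activation estimates; the array names, random data, and zero threshold are illustrative only, since the actual analysis was performed voxel-wise with proper statistical inference in an fMRI package.

```python
# Toy, voxel-wise illustration of the two crossmodal criteria (simulated data).
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 1000
beta_F = rng.normal(size=n_voxels)    # activation estimates for faces
beta_V = rng.normal(size=n_voxels)    # activation estimates for voices
beta_FV = rng.normal(size=n_voxels)   # activation estimates for face-voice pairs

# Subtraction contrast: FV - (F + V) > 0 (superadditive response to the pair).
subtraction = beta_FV - (beta_F + beta_V) > 0

# Conjunction: (FV - F) and (FV - V) both positive, i.e. the pair exceeds
# each unimodal condition on its own.
conjunction = (beta_FV - beta_F > 0) & (beta_FV - beta_V > 0)

print(int(subtraction.sum()), "voxels pass the subtraction contrast")
print(int(conjunction.sum()), "voxels pass the conjunction analysis")
```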
2024, Experimental Brain Research
2024, Technologies
It is noteworthy that, nowadays, monitoring and understanding a human's emotional state plays a key role in current and forthcoming computational technologies. At the same time, this monitoring and analysis should be as unobtrusive as possible, since the digital world has been smoothly adopted into everyday life activities. In this framework, and within the domain of assessing humans' affective state during their educational training, the most popular approach is to use sensory equipment that allows observing them without any kind of direct contact. Thus, in this work, we focus on human emotion recognition from audio stimuli (i.e., human speech) using a novel approach based on a computer-vision-inspired methodology, namely the bag-of-visual-words method, applied to several audio segment spectrograms. The latter are considered to be the visual representation of the considered audio segment and may be analyzed by exploiting well-known traditional computer v...
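The pipeline sketched in the abstract (spectrogram patches, a learned visual vocabulary, and a histogram descriptor per audio segment) might look roughly like the following; the patch size, vocabulary size, synthetic audio, and absence of a classifier are assumptions rather than the paper's settings.

```python
# Sketch of a bag-of-visual-words descriptor computed from audio spectrograms.
# Patch size, vocabulary size, and the synthetic signals are illustrative only.
import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import KMeans

def spectrogram_patches(signal, fs, patch=(16, 16), step=8):
    """Slice a log-magnitude spectrogram into flattened square patches."""
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
    logmag = np.log1p(sxx)
    patches = []
    for i in range(0, logmag.shape[0] - patch[0], step):
        for j in range(0, logmag.shape[1] - patch[1], step):
            patches.append(logmag[i:i + patch[0], j:j + patch[1]].ravel())
    return np.array(patches)

def bovw_histogram(patches, vocabulary):
    """Assign each patch to its nearest 'visual word' and histogram the counts."""
    words = vocabulary.predict(patches)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()

# Toy usage with synthetic audio; real use would iterate over labeled segments.
fs = 16000
train_signals = [np.random.randn(fs * 2) for _ in range(4)]
all_patches = np.vstack([spectrogram_patches(s, fs) for s in train_signals])
vocabulary = KMeans(n_clusters=32, n_init=4, random_state=0).fit(all_patches)

descriptor = bovw_histogram(spectrogram_patches(train_signals[0], fs), vocabulary)
print(descriptor.shape)   # (32,) fixed-length feature for any segment length
```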
2023, HAL (Le Centre pour la Communication Scientifique Directe)
We report a series of experiments about a little-studied type of compatibility effect between a stimulus and a response: the priming of manual gestures via sounds associated with these gestures. The goal was to investigate the plasticity of the gesture-sound associations mediating this type of priming. Five experiments used a primed choice-reaction task. Participants were cued by a stimulus to perform response gestures that produced response sounds; those sounds were also used as primes before the response cues. We compared arbitrary associations between gestures and sounds (key lifts and pure tones) created during the experiment (i.e. no pre-existing knowledge) with ecological associations corresponding to the structure of the world (tapping gestures and sounds, scraping gestures and sounds) learned through the entire life of the participant (thus existing prior to the experiment). Two results were found. First, the priming effect exists for ecological as well as arbitrary associations between gestures and sounds. Second, the priming effect is greatly reduced for ecologically existing associations and is eliminated for arbitrary associations when the response gesture stops producing the associated sounds. These results provide evidence that auditory-motor priming is mainly created by rapid learning of the association between sounds and the gestures that produce them. Auditory-motor priming is therefore mediated by short-term associations between gestures and sounds that can be readily reconfigured regardless of prior knowledge.
2023, HAL (Le Centre pour la Communication Scientifique Directe)
2023, NeuroImage
Temporal regularities in the environment are thought to guide the allocation of attention in time. Here, we explored whether entrainment of neuronal oscillations underpins this phenomenon. Participants viewed a regular stream of images in silence, or in-synchrony or out-of-synchrony with an unmarked beat position of a slow (1.3Hz) auditory rhythm. Focusing on occipital recordings, we analyzed evoked oscillations shortly before and event-related potentials (ERPs) shortly after image onset. The phase of beta-band oscillations in the in-synchrony condition differed from that in the out-of-synchrony and silence conditions. Additionally, ERPs revealed rhythm effects for a stimulus onset potential (SOP) and the N1. Both were more negative for the in-synchrony as compared to the out-of-synchrony and silence conditions and their amplitudes positively correlated with the beta phase effects. Taken together, these findings indicate that rhythmic expectations are supported by a reorganization o...
2023, Attention, Perception, & Psychophysics
The present study aimed to investigate whether or not the so-called "bouba-kiki" effect is mediated by speech-specific representations. Sine-wave versions of naturally produced pseudowords were used as auditory stimuli in an implicit association task (IAT) and an explicit cross-modal matching (CMM) task to examine cross-modal shape-sound correspondences. A group of participants trained to hear the sine-wave stimuli as speech was compared to a group that heard them as non-speech sounds. Sound-shape correspondence effects were observed in both groups and tasks, indicating that speech-specific processing is not fundamental to the "bouba-kiki" phenomenon. Effects were similar across groups in the IAT, while in the CMM task the speech-mode group showed a stronger effect compared with the non-speech group. This indicates that, while both tasks reflect auditory-visual associations, only the CMM task is additionally sensitive to associations involving speech-specific representations.
2023, DAGA 2021 - 47. JAHRESTAGUNG FÜR AKUSTIK
Over the last ten years, several researchers have investigated the emotional aspect of soundscape perception responses. Some essential findings helped to highlight relevant dimensions of the emotional content of soundscape perceptual responses. Initially, Axelsson et al. (2010) examined the emotional dimensions with the circumplex model of affect developed by Russell in 1980 and, on this basis, developed an emotional model for soundscape studies containing three dimensions: pleasantness, arousal, and eventfulness. Afterwards, Cain et al. (2013) examined further emotional dimensions, such as calmness and vibrancy, with the help of works that described the emotional meaning of sounds. Our work aims to examine other emotional theories and taxonomies for classifying cognitive dimensions related to emotions evoked by the auditory responses collected in soundscape studies. The emotion wheels of Plutchik (1980) and Geneva (2013) are used to classify emotions reported during soundwalks in Aachen, Germany, and surveys in Goiania, Brazil. A principal component analysis helps to extract components that summarize the emotional content describing the sonic environment. By understanding how subjects feel about these environments, the findings can contribute to optimizing how places influence and change users' emotions, which can inform improvements in urban sound design.
2023, 2018 IEEE Winter Applications of Computer Vision Workshops (WACVW)
We describe our approach towards building an efficient predictive model to detect emotions for a group of people in an image. We propose that training a Convolutional Neural Network (CNN) model on the emotion heatmaps extracted from the image outperforms a CNN model trained entirely on the raw images. The comparison of the models was done on a recently published dataset from the Emotion Recognition in the Wild (EmotiW) challenge, 2017. The proposed method achieved a validation accuracy of 55.23%, which is 2.44% above the baseline accuracy provided by the EmotiW organizers.
2023, Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology
We present a novel signaling method for head-mounted displays, which bypasses the eye's pupil and delivers guiding light signals directly to the retina through tissue near the eyes. This method preserves full visual acuity on the display and does not block the view of the scene, while also delivering additional visual signals.
2023, International Journal of Human-Computer Studies
Human augmentation is a field of research that aims to enhance human abilities through medicine or technology. This has historically been achieved by consuming chemical substances that improve a selected ability or by installing implants which require medical operations. Both of these methods of augmentation can be invasive. Augmented abilities have also been achieved with external tools, such as eyeglasses, binoculars, microscopes or highly sensitive microphones. Lately, augmented reality and multimodal interaction technologies have enabled non-invasive ways to augment humans. In this article, we first discuss the field and related terms. We provide relevant definitions based on the present understanding of the field. This is followed by a summary of existing work in augmented senses, action, and cognition. Our contribution to the future includes a model for wearable augmentation. In addition, we present a call for research to realize this vision. Then, we discuss future human abilities. Wearable technologies may act as mediators for human augmentation, in the same manner as eyeglasses once revolutionized human vision. Non-invasive and easy-to-use wearable extensions will enable lengthening the active life of aging citizens or supporting the full inclusion of people with special needs in society, but there are also potential problems. Therefore, we conclude by discussing ethical and societal issues: privacy, social manipulation, autonomy and side effects, accessibility, safety and balance, and an unpredictable future.
2023, Autism
There is some evidence that disordered self-processing in autism spectrum disorders is linked to the social impairments characteristic of the condition. To investigate whether bodily self-consciousness is altered in autism spectrum disorders as a result of multisensory processing differences, we tested responses to the full body illusion and measured peripersonal space in 22 adults with autism spectrum disorders and 29 neurotypical adults. In the full body illusion set-up, participants wore a head-mounted display showing a view of their ‘virtual body’ being stroked synchronously or asynchronously with respect to felt stroking on their back. After stroking, we measured the drift in perceived self-location and self-identification with the virtual body. To assess the peripersonal space boundary we employed an audiotactile reaction time task. The results showed that participants with autism spectrum disorders are markedly less susceptible to the full body illusion, not demonstrating the...
2023, PLoS ONE
Auditory and visual signals generated by a single source tend to be temporally correlated, such as the synchronous sounds of footsteps and the limb movements of a walker. Continuous tracking and comparison of the dynamics of auditory-visual streams is thus useful for the perceptual binding of information arising from a common source. Although language-related mechanisms have been implicated in the tracking of speech-related auditory-visual signals (e.g., speech sounds and lip movements), it is not well known what sensory mechanisms generally track ongoing auditory-visual synchrony for nonspeech signals in a complex auditory-visual environment. To begin to address this question, we used music and visual displays that varied in the dynamics of multiple features (e.g., auditory loudness and pitch; visual luminance, color, size, motion, and organization) across multiple time scales. Auditory activity (monitored using auditory steady-state responses, ASSR) was selectively reduced in the left hemisphere when the music and dynamic visual displays were temporally misaligned. Importantly, ASSR was not affected when attentional engagement with the music was reduced, or when visual displays presented dynamics clearly dissimilar to the music. These results appear to suggest that left-lateralized auditory mechanisms are sensitive to auditory-visual temporal alignment, but perhaps only when the dynamics of auditory and visual streams are similar. These mechanisms may contribute to correct auditory-visual binding in a busy sensory environment.
2023, Cognition
While perceiving speech, people see mouth shapes that are systematically associated with sounds. In particular, a vertically stretched mouth produces a /woo/ sound, whereas a horizontally stretched mouth produces a /wee/ sound. We demonstrate that hearing these speech sounds alters how we see aspect ratio, a basic visual feature that contributes to perception of 3D space, objects and faces. Hearing a /woo/ sound increases the apparent vertical elongation of a shape, whereas hearing a /wee/ sound increases the apparent horizontal elongation. We further demonstrate that these sounds influence aspect ratio coding. Viewing and adapting to a tall (or flat) shape makes a subsequently presented symmetric shape appear flat (or tall). These aspect ratio aftereffects are enhanced when associated speech sounds are presented during the adaptation period, suggesting that the sounds influence visual population coding of aspect ratio. Taken together, these results extend previous demonstrations that visual information constrains auditory perception by showing the converse-speech sounds influence visual perception of a basic geometric feature.
2023, Proceedings of the Third International Conference on Timbre (Timbre 2023), dir. Marcelo Caetano, Zachary Wallmark, Asterios Zacharakis, Charalampos Saitis and Kai Siedenburg
The piece Partiels (Grisey 1976), from Gérard Grisey's cycle Les Espaces Acoustiques, is emblematic of the aesthetics of French spectral music. Considered and analyzed by several authors from a compositional perspective (Krier 2000, Féron 2010), it has given rise to very few studies centered on performance, and more specifically on the process of elaborating a performance, rather than on the performance considered as a finished product (in relation to this issue, see for example Cook 2001).
2023, Multimedia Tools and Applications
Emotion recognition from speech signals is an interesting research area with several applications like smart healthcare, autonomous voice response systems, assessing situational seriousness by caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on the features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited for 2D image data. However, in the case of spectrograms, the information is encoded in a slightly different manner. Time is represented along the x-axis and the y-axis shows the frequency of the speech signal, whereas the amplitude is indicated by the intensity value in the spectrogram at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when its performance is evaluated on the Emo-DB and Korean speech datasets.
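As a rough illustration of the rectangular-kernel idea (not the paper's exact architecture), the block below defines a small PyTorch network whose convolution and pooling windows are taller than they are wide, so features span many frequency bins but only a few time frames; all layer sizes and the number of emotion classes are assumptions.

```python
# Sketch of a spectrogram CNN with rectangular kernels and pooling windows.
# The layer sizes are illustrative, not the architecture from the paper.
import torch
import torch.nn as nn

class RectKernelCNN(nn.Module):
    def __init__(self, n_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            # Tall, narrow kernels: wide frequency context, short time context.
            nn.Conv2d(1, 16, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 2)),          # rectangular pooling
            nn.Conv2d(16, 32, kernel_size=(9, 3), padding=(4, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_emotions)

    def forward(self, x):                              # x: (batch, 1, freq, time)
        h = self.features(x)
        return self.classifier(h.flatten(1))

# One forward pass on a dummy log-mel spectrogram batch.
model = RectKernelCNN()
dummy = torch.randn(8, 1, 128, 200)                    # 128 mel bins, 200 frames
print(model(dummy).shape)                              # torch.Size([8, 7])
```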
2023, PLoS ONE
The phase reset hypothesis states that the phase of an ongoing neural oscillation, reflecting periodic fluctuations in neural activity between states of high and low excitability, can be shifted by the occurrence of a sensory stimulus so that the phase value becomes highly constant across trials (Schroeder et al., 2008). From EEG/MEG studies it has been hypothesized that coupled oscillatory activity in primary sensory cortices regulates multisensory processing (Senkowski et al., 2008). We follow up on a study in which evidence of phase reset was found using a purely behavioral paradigm, now also including EEG measures. In this paradigm, presentation of an auditory accessory stimulus was followed by a visual target with a stimulus-onset asynchrony (SOA) across a range from 0 to 404 ms in steps of 4 ms. This fine-grained stimulus presentation allowed us to perform a spectral analysis on the mean SRT as a function of the SOA, which revealed distinct peak spectral components within a frequency range of 6 to 11 Hz with a mode of 7 Hz. The EEG analysis showed that the auditory stimulus caused a phase reset in 7-Hz brain oscillations in a widespread set of channels. Moreover, there was a significant difference in the average phase at which the visual target stimulus appeared between slow and fast SRT trials. This effect was evident in three different analyses, and occurred primarily in frontal and central electrodes.
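The behavioural part of this analysis, looking for a periodic component in mean reaction time as a function of audio-visual SOA, can be sketched with a plain FFT. The simulated 7 Hz modulation, the linear detrending, and the use of synthetic data below are illustrative assumptions, not the study's data or analysis code.

```python
# Sketch: detecting a periodic component in mean SRT as a function of SOA.
# The simulated 7 Hz modulation and the detrending choice are illustrative.
import numpy as np

soa = np.arange(0, 0.404, 0.004)                  # SOAs from 0 to 400 ms in 4 ms steps
mean_srt = 0.250 - 0.050 * soa                    # generic slow trend across SOA
mean_srt += 0.008 * np.sin(2 * np.pi * 7.0 * soa) # embedded 7 Hz oscillation
mean_srt += 0.002 * np.random.default_rng(0).normal(size=soa.size)

detrended = mean_srt - np.polyval(np.polyfit(soa, mean_srt, 1), soa)

spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(soa.size, d=0.004)        # cycles per second of SOA

peak = freqs[np.argmax(spectrum[1:]) + 1]         # skip the DC bin
print(f"dominant behavioral oscillation near {peak:.1f} Hz")
```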
2023, Neuroscience of Consciousness
Over the last 30 years, our understanding of the neurocognitive bases of consciousness has improved, mostly through studies employing vision. While studying consciousness in the visual modality presents clear advantages, we believe that a comprehensive scientific account of subjective experience must not neglect other exteroceptive and interoceptive signals as well as the role of multisensory interactions for perceptual and self-consciousness. Here, we briefly review four distinct lines of work which converge in documenting how multisensory signals are processed across several levels and contents of consciousness. Namely, how multisensory interactions occur when consciousness is prevented because of perceptual manipulations (i.e. subliminal stimuli) or because of low vigilance states (i.e. sleep, anesthesia), how interactions between exteroceptive and interoceptive signals give rise to bodily self-consciousness, and how multisensory signals are combined to form metacognitive judgments. By describing the interactions between multisensory signals at the perceptual, cognitive, and metacognitive levels, we illustrate how stepping out of the visual comfort zone may help in deriving refined accounts of consciousness, and may allow cancelling out the idiosyncrasies of each sense to delineate supramodal mechanisms involved during consciousness.