Hyunkook Lee - Academia.edu

Papers by Hyunkook Lee

Feature Extraction of Binaural Recordings for Acoustic Scene Classification

2018 Federated Conference on Computer Science and Information Systems (FedCSIS), 2018

Binaural technology is becoming increasingly popular in multimedia systems. This paper identifies a set of features of binaural recordings suitable for the automatic classification of four basic spatial audio scenes representing the most typical patterns of audio content distribution around a listener. Moreover, it compares five artificial-intelligence-based methods applied to the classification of binaural recordings. The results show that both spatial and spectro-temporal features are essential for accurate classification of binaurally rendered acoustic scenes. The spectro-temporal features appear to have a stronger influence on the classification results than the spatial metrics. According to the obtained results, the method based on the support vector machine, exploiting the features identified in the study, yields a classification accuracy approaching 84%.
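As a rough illustration of the pipeline this abstract describes, the sketch below extracts one spatial cue (broadband interaural level difference) and one spectro-temporal cue (spectral centroid) from synthetic binaural signals and trains a support vector machine on them. The feature set and the toy scenes are my own assumptions for illustration, not the paper's actual features or data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def binaural_features(left, right, sr=48000):
    # Spatial cue: broadband interaural level difference (dB).
    ild = 10 * np.log10((np.sum(left**2) + 1e-12) / (np.sum(right**2) + 1e-12))
    # Spectro-temporal cue: spectral centroid (Hz) of the summed signal.
    spec = np.abs(np.fft.rfft(left + right))
    freqs = np.fft.rfftfreq(len(left), 1 / sr)
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
    return [ild, centroid]

def make_scene(side_bias):
    # Toy "scene": the same source weighted toward one ear.
    src = rng.standard_normal(4800)
    return src * (1.0 + side_bias), src * (1.0 - side_bias)

X, y = [], []
for _ in range(40):
    X.append(binaural_features(*make_scene(0.5))); y.append(0)  # side-heavy
    X.append(binaural_features(*make_scene(0.0))); y.append(1)  # frontal
X, y = np.array(X), np.array(y)

# Standardize first so the ILD (dB-scale) and centroid (Hz-scale)
# features contribute comparably to the RBF kernel.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print(clf.score(X, y))
```

The two toy classes are separable on the ILD feature alone, so training accuracy is essentially perfect here; the paper's 84% figure reflects the much harder real task.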

Psychoacoustic Considerations in Surround Sound with Height

Level and Time Panning of Phantom Images for Musical Sources

Journal of the Audio Engineering Society, 2013

This study investigates the independent influences of interchannel level difference (ICLD) and interchannel time difference (ICTD) on the panning of 2-channel stereo phantom images for various musical sources. The results indicate that level panning can perform robustly regardless of the spectral and temporal characteristics of source signals, whereas time panning is not suitable for a continuous source with a high fundamental frequency. Statistical differences between the data obtained for different sources are found to be insignificant, and from this a unified set of ICLD and ICTD values for 10°, 20°, and 30° image positions is derived. Linear level and time panning functions for the two separate panning regions of 0°–20° and 21°–30° are further proposed, and their applicability to arbitrary loudspeaker base angles is also considered. These perceptual panning functions are expected to be more accurate than the theoretical sine or tangent law in terms of matching between predicted and actually perceived image positions.

0 INTRODUCTION

The localization of a stereophonic phantom image is based on the principle of so-called "summing localization" [1]. In 2-channel loudspeaker reproduction, acoustic crosstalk of loudspeaker signals occurs at each ear of the listener; the signal from the contralateral loudspeaker is "summed" with that from the ipsilateral loudspeaker, with the former being attenuated in level at high frequencies due to head shadowing and delayed in time relative to the latter. If the signals are coherent, the listener will perceive a single phantom image in the median plane.
If an interchannel level difference (ICLD) or interchannel time difference (ICTD) is applied to the loudspeaker signals, some combination of interaural level difference (ILD) and interaural time difference (ITD) will be introduced between the ear input signals, and consequently the apparent position of the image will be "panned" from the middle toward the earlier or louder loudspeaker. Research suggests that phantom images panned using ICLD are localized mainly based on ITDs at low frequencies and on ILDs and envelope-based ITDs at high frequencies [2,3]. With regard to ICTD-based panning, the frequency dependency of interaural cues has not been studied extensively. However, it was shown in [1] that an ICTD produces only an ILD at a low frequency when it is assumed that there is no level difference between the loudspeaker signals arriving at each ear at low frequencies, whereas it leads to both an ILD and an ITD at a high frequency. Summing localization is valid only up to a certain threshold of ICTD (e.g., 1 ms as widely accepted), within which a trade-off between ICLD and ICTD is possible. Beyond this threshold, the localization of an auditory image largely relies on the precedence effect [4], where the image is perceived constantly at the earlier loudspeaker up to the echo threshold. Since 1940 a number of studies have been conducted to investigate the independent influence of ICLD or ICTD on the position of a phantom image perceived between two loudspeakers [2,5,6,7,8,9]. The data from these studies have had many practical applications. For example, Williams [10] analyzed the coverage angles of two-channel near-coincident microphone techniques based on the data obtained by Simonsen [8]. Wittek [11] developed a tool called "Image Assistant" to calculate localization curves for various microphone arrays based on the ICLD and ICTD data derived from the literature.
A reliable ICLD or ICTD data set obtained from perceptual experiments would also be useful for panning applications where accurate matching between target and perceived image positions is essential. The conventional sine and tangent level panning laws [2,12] have been claimed to be inaccurate in this type of application. They are based on ITD cues at low frequencies only and tend to result in a greater angular displacement than predicted for broadband sources [13]. With respect to ICTD-based panning (time panning), no general law has been proposed to date for practical applications.
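For reference, the conventional tangent law that the study contrasts its perceptual functions against can be sketched as follows. This is the generic textbook formulation, not the panning functions proposed in the paper:

```python
import numpy as np

def tangent_law_gains(target_deg, half_base_deg=30.0):
    """Constant-power gains (gL, gR) placing a phantom image at
    target_deg (positive toward the left loudspeaker) under the
    tangent law: tan(theta)/tan(theta0) = (gL - gR)/(gL + gR)."""
    t = np.tan(np.radians(target_deg)) / np.tan(np.radians(half_base_deg))
    gl, gr = 1.0 + t, 1.0 - t
    norm = np.hypot(gl, gr)  # normalise so gL^2 + gR^2 = 1
    return gl / norm, gr / norm

print([round(g, 3) for g in tangent_law_gains(0.0)])   # [0.707, 0.707]
print([round(g, 3) for g in tangent_law_gains(30.0)])  # [1.0, 0.0]
```

The ICLD implied by a gain pair is 20·log10(gL/gR), which makes it straightforward to compare the law's predictions against perceptually measured panning data such as that reported here.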

Effect of Vertical Microphone Layer Spacing for a 3D Microphone Array

Journal of the Audio Engineering Society, 2014

Subjective listening tests were conducted to investigate how the spacing between main (lower) and height (upper) microphone layers in a 3D main microphone array affects perceived spatial impression and overall preference. Four different layer spacings of 0 m, 0.5 m, 1 m, and 1.5 m were compared for the sound sources of trumpet, acoustic guitar, percussion quartet, and string quartet using a nine-channel loudspeaker setup. It was generally found that there was no significant difference between any of the spaced layer configurations, whereas the 0 m layer had slightly higher ratings than the more spaced layers in both spatial impression and preference. Acoustical properties of the original microphone channel signals, as well as those of the reproduced signals, which were binaurally recorded, were analyzed in order to find possible physical causes for the perceived results. It is suggested that the perceived results were mainly associated with vertical interchannel crosstalk in the signals of each height layer and with the magnitude and pattern of spectral change at the listener's ear caused by each layer.

2D-to-3D Ambience Upmixing Based on Perceptual Band Allocation

Journal of the Audio Engineering Society, 2015

Listening tests were conducted to evaluate the feasibility of a novel 2D-to-3D ambience upmixing technique named "perceptual band allocation" (PBA). Four-channel ambience signals captured in a reverberant concert hall were low-pass and high-pass filtered and then routed to the lower and upper loudspeaker layers of a 9-channel 3D configuration, respectively. The upmixed stimuli were compared against original 3D recordings made using an 8-channel ambience microphone array in terms of 3D listener envelopment and preference. The results suggest that the perceived quality of the proposed method could be at least comparable to that of an original 3D recording.

0 INTRODUCTION

Three-dimensional multichannel audio systems such as Auro-3D [1], Dolby Atmos [2], and 22.2 [3] employ additional height loudspeakers in order to provide the listener with a three-dimensional (3D) auditory experience. One of the perceptual attributes that could be enhanced by the use of height channels is listener envelopment (LEV). In the context of two-dimensional (2D) surround sound (e.g., 5.1), LEV is widely understood as the subjective impression of being enveloped by reverberant sound [4,5]. With 3D loudspeaker formats, the added height channels could be used to render the "vertical" spread of the reverberant sound image as well as the horizontal one, and ultimately an auditory impression of 3D LEV could be achieved. One of the key requirements for 3D multichannel audio applications would be a 2D-to-3D upmixing technique that can add a height dimension to 2D content; therefore, a new method that can render vertical image spread would be necessary. In the context of horizontal stereophony, horizontal image spread can be rendered by means of interchannel decorrelation, and many different decorrelation methods have been proposed over the years [6-10].
Such methods are based on the principle that as the degree of correlation between stereophonic channel signals decreases, the correlation between the ear-input signals (interaural cross-correlation), which has a direct relationship with perceived auditory image spread [4], also decreases. However, vertically reproduced stereophonic signals would have little or no influence on interaural cross-correlation. A recent study by Gribben and Lee [11] found that vertically applied interchannel decorrelation was not as effective as horizontal decorrelation in controlling image spread. The literature generally suggests that vertical localization relies on spectral cues. A number of researchers [12-14] have found that the higher the frequency of a pure tone, the higher the perceived image position, regardless of the physical height of the presenting loudspeaker; a phenomenon referred to as the "pitch-height" effect in [15]. In the case of band-pass filtered noise signals, however, this effect was reported to depend on the physical height of the loudspeaker presenting the signal. For example, Roffler and Butler [16] found from their experiments using loudspeakers vertically arranged at different heights that the perceived image height of a noise high-pass filtered at 2 kHz was similar to the physical height of the presenting loudspeaker. Conversely, a noise low-pass filtered at 2 kHz was localized around or below eye level regardless of the presenting loudspeaker's height. Similar results were obtained in Cabrera and Tiley's [15] experiment conducted with octave-band noise signals centered at 125 Hz, 500 Hz, 2 kHz, and 8 kHz, using vertically arranged loudspeakers; a higher frequency band was localized at a higher position than a lower frequency band, and this difference became larger as the loudspeaker height increased.
Cabrera and Tiley [15] and Ferguson and Cabrera [17] confirmed the validity of this phenomenon for low- and high-pass filtered noise stimuli (crossover of 1 kHz) that were simultaneously presented from different loudspeakers at different heights. The present study aims to explore the feasibility of a new 2D-to-3D upmixing method developed based on the above research findings, which is named "perceptual band allocation" (PBA). The method decomposes the spectrum of the
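The band-splitting-and-routing step at the heart of this upmixing idea can be sketched as follows. The crossover frequency, filter order, and simple two-way routing are illustrative assumptions on my part, not the parameters used in the paper:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def pba_upmix(ambience, sr=48000, crossover_hz=1000.0, order=4):
    """Route the band below the crossover to the lower (main) layer
    and the band above it to the upper (height) layer, exploiting the
    pitch-height effect described above."""
    sos_lo = butter(order, crossover_hz, btype="low", fs=sr, output="sos")
    sos_hi = butter(order, crossover_hz, btype="high", fs=sr, output="sos")
    return sosfilt(sos_lo, ambience), sosfilt(sos_hi, ambience)

rng = np.random.default_rng(1)
x = rng.standard_normal(48000)  # stand-in for one ambience channel
lower, upper = pba_upmix(x)

# Same-order, same-cutoff Butterworth low/high pairs satisfy
# |H_lp|^2 + |H_hp|^2 = 1, so the two layers together approximately
# preserve the input energy.
print(round((np.sum(lower**2) + np.sum(upper**2)) / np.sum(x**2), 2))
```

In practice each of the four ambience channels would be split this way and the two bands fed to the corresponding main- and height-layer loudspeakers.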

The Effect of Interchannel Time Difference on Localization in Vertical Stereophony

Journal of the Audio Engineering Society, 2015

Listening tests were conducted in order to analyze the localization of band-limited stimuli in vertical stereophony. The test stimuli were seven octave bands of pink noise, with center frequencies ranging from 125 Hz to 8 kHz, as well as broadband pink noise. Stimuli were presented from vertically arranged loudspeakers either monophonically or as vertical phantom images, created with the upper loudspeaker delayed with respect to the lower by 0, 0.5, 1, 5, and 10 ms (i.e., interchannel time difference). The experimental data showed that localization under the aforementioned conditions is generally governed by the so-called "pitch-height" effect, with the high-frequency stimuli generally being localized significantly higher than the low-frequency stimuli for all conditions. The effect of interchannel time difference on localization judgments was found to be significant for both the 1000-4000 Hz octave bands and the broadband pink noise; it is suggested that this was related to the effects of comb filtering. Additionally, no evidence could be found to support the existence of the precedence effect in vertical stereophony.

0 INTRODUCTION

The mechanisms used to localize sound sources incident from the median plane are fundamentally different from those used in horizontal plane localization. In the horizontal plane, localization relies on a combination of the time and level differences between a given source arriving at each ear (binaural cues) as well as on the directional filtering of the sound source by the pinnae (spectral cues) [1]. However, in the median plane binaural cues are absent, as sound sources arrive at each ear simultaneously. As a result, median plane localization relies solely on spectral cues [2].
Median plane localization is a topic that has received much attention in the literature, with numerous studies being particularly concerned with the localization of tonal and band-limited stimuli. In early experiments using tonal stimuli presented from vertically arranged loudspeakers, Pratt [3] concluded that localization is governed solely by frequency, with high tones being localized physically higher in space than low tones. A similar observation was made by Trimble [4], who presented tonal stimuli both singly and in succession to listeners via receiving phones positioned 15 cm from each ear. A more expansive study by Roffler and Butler [5], also using tonal stimuli presented from vertically arranged loudspeakers, affirmed the results presented in [3] and [4], with the authors noting that the effect was maintained irrespective of listener orientation, visual bias, and whether or not subjects had prior knowledge of the terms "high" and "low" in describing pitch. Subsequent experiments by Roffler and Butler [6] and Cabrera and Tiley [7] demonstrated that the relationship between pitch and height is maintained for the localization of band-passed noise signals, and moreover that the perceptual range of pitch-height depends on the physical height of the loudspeaker that presents the signal. In [7] the correlation between pitch and height was referred to as the "pitch-height effect." Following the Roffler and Butler study [5], Blauert [8] noted, from median plane localization experiments using loudspeakers placed in front of, directly above, and behind the listener, that frequency also governed the localization of 1/3-octave bands. Under these conditions certain frequency bands were related to specific locations on the median plane, irrespective of actual loudspeaker position. Blauert called these bands "directional bands." Subsequent studies by Hebrank and Wright [2] and Asano et al.
[9] have shown that directional bands are closely related to the spectral cues provided by the pinnae in vertical localization. Additionally, Itoh et al. [10] demonstrated that directional bands are maintained for 1/6-octave bands of noise and that directional bands differ between listeners. The aforementioned localization studies are similar in that they predominantly considered the localization of stimuli presented from single loudspeakers located on the median plane. However, with the emergence of

Perceptual Band Allocation (PBA) for the Rendering of Vertical Image Spread with a Vertical 2D Loudspeaker Array

Journal of the Audio Engineering Society, 2016

Two subjective experiments were conducted to examine a new vertical image rendering method named "Perceptual Band Allocation" (PBA), using octave bands of pink noise presented from main and height loudspeaker pairs. PBA attempts to control the perceived degree of vertical image spread (VIS) through a flexible mapping between frequency band and loudspeaker layer, based on the desired positioning of each band in the vertical plane. The first experiment measured the perceived vertical location of the phantom image of each octave-band stimulus for the main and height loudspeaker layers individually. Results showed significant differences among the frequency bands in perceived image location. Furthermore, the so-called "pitch-height" effect was found for two separate frequency regions, with most bands from the main loudspeaker layer perceived to be elevated above the physical height of the layer. Based on the localization data from the first experiment, six different PBA stimuli were created in such a way that each frequency band was mapped to either the main or height loudspeaker layer depending on the target degree of VIS. The second experiment was a listening test grading the perceived magnitude of VIS for the six stimuli. The results first indicated that PBA could significantly increase the perceived magnitude of VIS compared to that of a sound presented only from the main layer. It was also found that the different PBA schemes produced various degrees of perceived VIS with statistically significant differences. The paper discusses possible reasons for the obtained results in detail, based on the localization test results and the frequency-dependent energy weightings of ear-input signals. Implications of the proposed method for the vertical upmixing of horizontal surround content are also discussed.

Vertical Stereophonic Localization in the Presence of Interchannel Crosstalk: The Analysis of Frequency-Dependent Localization Thresholds

Journal of the Audio Engineering Society, 2016

Listening tests were conducted in order to investigate the frequency dependency of localization thresholds in relation to vertical interchannel crosstalk. Octave-band and broadband pink noise stimuli were presented to subjects as phantom images from vertically arranged stereophonic loudspeakers located directly in front of the listening position. With respect to the listening position, the lower loudspeaker was not elevated; the upper loudspeaker was elevated by 30°. Subjects completed a method-of-adjustment task in which they were required to reduce the amplitude of the upper loudspeaker until the resultant phantom image matched the position of the same stimulus presented from the lower loudspeaker alone. The upper loudspeaker was delayed with respect to the lower by 0, 0.5, 1, 5, and 10 ms. The experimental data demonstrated that the main effect of frequency on the localization threshold was significant, with the low-frequency stimuli (125 and 250 Hz) requiring significantly less level reduction (less than 6 dB) than the mid-high frequency stimuli (1, 2, and 8 kHz; 9-10.5 dB reduction). The main effect of interchannel time difference (ICTD) on the localization thresholds for each octave band was found to be non-significant. For all stimuli an interchannel level difference (ICLD) was always necessary, indicating that the precedence effect is not a feature of median plane localization.

A Comparison between Horizontal and Vertical Interchannel Decorrelation

Applied Sciences, 2017

Featured Application: 3D audio mixing and upmixing; creative sound design. Abstract: The perceptual effects of interchannel decorrelation on perceived image spread have been investigated subjectively in both horizontal and vertical stereophonic reproduction, looking specifically at the frequency dependency of decorrelation. Fourteen and thirteen subjects graded the horizontal and vertical image spread of a pink noise sample, respectively. The pink noise signal had been decorrelated by a complementary comb-filter decorrelation algorithm, varying the frequency band, time delay, and decorrelation factor for each sample. Results generally indicated that interchannel decorrelation had a significant effect on auditory image spread both horizontally and vertically, with spread increasing as correlation decreases. However, vertical decorrelation was found to be less effective than horizontal decorrelation. The results also suggest that the decorrelation effect was frequency-dependent; changes in horizontal image spread were more apparent in the high frequency band, whereas those in vertical image spread were in the low band. Furthermore, objective analysis suggests that the perception of vertical image spread for the low and middle frequency bands could be associated with a floor reflection, whereas for the high band the results appear to be related to spectral notches in the ear input signals.
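A minimal complementary comb-filter decorrelator of the kind this abstract refers to can be sketched as a sum/difference (Lauridsen-style) pair; the delay length below is an illustrative assumption, not the paper's parameter:

```python
import numpy as np

def comb_decorrelate(x, delay_samples=40):
    """One channel adds a delayed copy of the signal, the other
    subtracts it, yielding complementary (interleaved) spectral peaks
    and notches and a low cross-correlation between the channels."""
    d = np.zeros_like(x)
    d[delay_samples:] = x[:-delay_samples]
    return x + d, x - d

rng = np.random.default_rng(2)
x = rng.standard_normal(48000)
ch1, ch2 = comb_decorrelate(x)
print(round(float(np.corrcoef(ch1, ch2)[0, 1]), 2))  # near zero for white noise
```

Varying the delay moves the comb spacing, and mixing the delayed copy in at less than unit gain (a "decorrelation factor") yields intermediate correlation values, which is essentially the parameter space the experiment gridded over.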

The Reduction of Vertical Interchannel Crosstalk: The Analysis of Localisation Thresholds for Natural Sound Sources

Applied Sciences, 2017

In subjective listening tests, natural sound sources were presented to subjects as vertically-oriented phantom images from two layers of loudspeakers, 'height' and 'main'. Subjects were required to reduce the amplitude of the height layer until the position of the resultant sound source matched that of the same source presented from the main layer only (the localisation threshold). Delays of 0, 1, and 10 ms were applied to the height layer with respect to the main, with vertical stereophonic and quadraphonic conditions being tested. The results of the study showed that the localisation thresholds obtained were not significantly affected by sound source or presentation method. Instead, the only variable whose effect was significant was interchannel time difference (ICTD). For an ICTD of 0 ms, the median threshold was 9.5 dB, which was significantly lower than the 7 dB found for both 1 and 10 ms. The results of the study have implications both for the recording of sound sources for three-dimensional (3D) audio reproduction formats and for the rendering of 3D images.

Sound Source and Loudspeaker Base Angle Dependency of Phantom Image Elevation Effect

Journal of the Audio Engineering Society

Early studies found that, when identical signals were presented from two loudspeakers equidistant from the listener, the resulting phantom image was elevated in the median plane, and the degree of elevation increased with the loudspeaker base angle. However, the sound sources used in such studies were either unknown or limited to noise signals. In order to investigate the dependencies of the elevation effect on sound source and loudspeaker base angle in detail, the present study conducted listening tests using 11 natural sources and 4 noise sources with different spectral and temporal characteristics for 7 loudspeaker base angles between 0° and 360°. The elevation effect was found to be significantly dependent on both the sound source and the base angle. Results generally suggest that the effect is stronger for sources with a transient nature and a flat frequency spectrum than for continuous and low-frequency-dominant sources. Theoretical reasons for the effect are also discussed based on head-related transfer function measurements. It is proposed that the perceived degree of elevation is determined by a relative cue related to the spectral energy distribution at high frequencies, but by an absolute cue associated with acoustic crosstalk and torso reflections at low frequencies.

The Perception of Hyper-compression by Mastering Engineers

Journal of the Audio Engineering Society

Hyper-compressed popular music is the product of a behavior associated with the over-use of dynamic range processing in an effort to gain a competitive advantage in music production. This behavior is unnecessary given the introduction of loudness normalization algorithms across the industry and has been denounced by mastering engineers as generating audible sound quality artifacts. However, the audibility of these sound quality artifacts to mastering engineers has not been examined. This study probes this question using an ABX listening experiment with 20 mastering engineers. On average, mastering engineers correctly discriminated 17 out of 24 conditions, suggesting that the sound quality artifacts generated by hyper-compression are difficult to perceive. The findings suggest that audibility depends on the Crest Factor (CF) of the music rather than the amount of CF reduction, thus proposing the existence of a threshold of audibility. Further work focusing on education initiatives is proposed.

0 INTRODUCTION

To create a hyper-compressed popular music record, auteurs must engage in a behavior compelling them to over-use dynamic range processing. This behavior is influenced by stakeholders and typically occurs at the end of the creative process in an effort to fit the audio signal to the reproduction medium. However, whilst vinyl had physical limitations governing the mastering approach, the limitations of the digital medium are perceptual and psychological. Stakeholders worry that laborious creative details will not be perceived in different reproduction environments and that this will influence the success of the record.
As a result, the paradigm of loudness maximization has prevailed in the popular music industry in spite of the introduction of loudness normalization algorithms, such as ITU-R BS.1770 [1], and the committed work of proponents of dynamic music. Research on hyper-compression in popular music has been addressed from a number of perspectives: (i) technical [2-5], (ii) listener preference [5-7], (iii) sound quality [8-11], (iv) record sales [12,13], and (v) listener fatigue [14,15]. A central theme in these studies is their attempt to reconcile the sound quality judgments communicated by expert listeners with the behavior motivating the practice. In many respects, the loudness wars are a continuation of what Leventhal has termed the "great debate" in audio [17]. This debate relates primarily to the testing of hypotheses posited by audiophiles concerning the audibility of audio components, sampling rates, and numerous other factors. These hypotheses are typically investigated using discrimination methodologies. Similarly, mastering engineers and expert listeners criticize the sound quality of hyper-compressed music [18] and posit hypotheses concerning the resulting sound quality attributes and their audibility. However, these hypotheses have not been formally tested. The results of a recent ABX experiment suggest that untrained listeners are unable to perceive sound quality artifacts generated by Crest Factor (CF) reductions of up to 10 dB, and that this is a primary factor supporting the persistence of hyper-compression [19]. Ronan suggests that the prevalence of hyper-compressed music during the loudness wars has altered listeners' concept of sound quality, further encouraging the practice in spite of the introduction of loudness normalization [20]. Given the dissonance between the perceptions of mastering engineers and the behavior of music auteurs, there is a pressing need to examine the audibility of hyper-compression to mastering engineers.
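The Crest Factor the study hinges on is simply the peak-to-RMS ratio of the signal. A minimal sketch, using illustrative signals rather than the study's stimuli:

```python
import numpy as np

def crest_factor_db(x):
    """Crest Factor in dB: peak amplitude over RMS level.
    Hyper-compression (heavy limiting) reduces this value."""
    return 20 * np.log10(np.max(np.abs(x)) / np.sqrt(np.mean(x**2)))

t = np.arange(48000) / 48000.0
sine = np.sin(2 * np.pi * 1000 * t)
limited = np.clip(sine, -0.5, 0.5)  # crude stand-in for heavy limiting

print(round(crest_factor_db(sine), 2))     # 3.01 (peak/RMS of a sine)
print(round(crest_factor_db(limited), 2))  # lower: dynamic range reduced
```

A 10 dB CF reduction of the kind examined in [19] corresponds to flattening the peaks by roughly a factor of three relative to the average level.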

Vertical Interchannel Decorrelation on the Vertical Spread of an Auditory Image

Journal of the Audio Engineering Society

In horizontal stereophony, it is known that interchannel correlation relates to the horizontal spread of a phantom auditory image. However, little is known about the perceptual effect of interchannel correlation on vertical image spread (VIS) between two vertically-arranged loudspeakers. The present study investigates this through two subjective experiments: (i) a multiple comparison of relative VIS for stimuli with varying degrees of correlation; and (ii) the absolute measurement of upper and lower VIS boundaries for extreme stimulus conditions. Octave-band (center frequencies: 63 Hz to 16 kHz) and broadband pink noise signals were decorrelated using two techniques: all-pass filtering and complementary comb-filtering. These stimuli were presented from vertically-spaced loudspeaker pairs at three azimuth angles (0°, ±30°, and ±110°), with each angle assessed discretely. Both the relative and absolute test results show no significant effect of vertical correlation on VIS for the 63 Hz, 125 Hz, and 250 Hz bands. For the 500 Hz band and above, there is a general tendency for VIS to increase as correlation decreases, which is observed for both decorrelation methods. This association is strongest at 0° azimuth for the 500 Hz and 1 kHz bands; at ±30° for 8 kHz and broadband; and at ±110° for 2 kHz, 4 kHz, and 16 kHz. The 8 kHz band at ±30° has the strongest association of all conditions; post-hoc objective analysis indicates a potential relationship between HRTF localization cues (pinna filtering) and VIS perception within this frequency region. Furthermore, the absolute test results suggest that changes of VIS from interchannel decorrelation are fairly slight, with only the broadband and 16 kHz conditions showing a significant increase.
The deviations of the boundary scores also suggest difficulty in grading absolute VIS and/or potential disagreement among listeners.

Automatic Spatial Audio Scene Classification in Binaural Recordings of Music

Applied Sciences, 2019

The aim of the study was to develop a method for the automatic classification of three spatial audio scenes, differing in the horizontal distribution of foreground and background audio content around a listener, in binaurally rendered recordings of music. For the purpose of the study, audio recordings were synthesized using thirteen sets of binaural room impulse responses (BRIRs), representing the room acoustics of both semi-anechoic and reverberant venues. Head movements were not considered in the study. The proposed method was assumption-free with regard to the number and characteristics of the audio sources. A least absolute shrinkage and selection operator (LASSO) was employed as a classifier. According to the results, it is possible to automatically identify the spatial scenes using a combination of binaural and spectro-temporal features. The method exhibits satisfactory classification accuracy when it is trained and then tested on different stimuli synthesized using the same BRIRs (accuracy ranging from 74% to 98%), even in highly reverberant conditions. However, the generalizability of the method needs to be further improved. This study demonstrates that, in addition to binaural cues, Mel-frequency cepstral coefficients constitute an important carrier of spatial information, imperative for the classification of spatial audio scenes.
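A LASSO-style classifier of the kind named above can be sketched generically as an L1-penalized linear model. The synthetic features below merely stand in for the paper's binaural and MFCC features; the dimensions, penalty strength, and data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Many candidate features, only the first three actually informative --
# a toy stand-in for a binaural + spectro-temporal feature set.
n, d = 200, 30
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = (X @ w_true + 0.1 * rng.standard_normal(n) > 0).astype(int)

# The L1 (lasso-style) penalty shrinks uninformative coefficients to
# exactly zero, performing feature selection and classification jointly.
clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)
n_selected = int(np.sum(np.abs(clf.coef_) > 1e-6))
print(round(clf.score(X, y), 2), n_selected)
```

The sparsity is the point: inspecting which coefficients survive is how one can conclude, as the study does, that MFCC-type features carry spatial information alongside the binaural cues.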

Research paper thumbnail of Capturing 360° Audio Using an Equal Segment Microphone Array (ESMA)

Journal of the Audio Engineering Society, 2019

The equal segment microphone array (ESMA) is a multichannel microphone technique that attempts to capture a sound field in 360° without any overlap between the stereophonic recording angles of adjacent microphone pairs. This study investigated the optimal microphone spacing for a quadraphonic ESMA using cardioid microphones. Recordings of a speech source were made using ESMAs with four different microphone spacings of 0 cm, 24 cm, 30 cm, and 50 cm, based on different psychoacoustic models for microphone array design. Multichannel and binaural stimuli were created with the reproduced sound field rotated in 45° intervals. Listening tests were conducted to examine the accuracy of phantom image localization for each microphone spacing in both loudspeaker and binaural headphone reproduction. The results generally indicated that the 50 cm spacing, which was derived from an interchannel time and level trade-off model perceptually optimized for a 90° loudspeaker base angle, produced more accurate localization than the 24 cm and 30 cm spacings, which were based on conventional models derived from the standard 60° loudspeaker setup. The 0 cm spacing produced the worst accuracy, with the most frequent bimodal distributions of responses between the front and back regions. Analyses of the interaural time and level differences of the binaural stimuli supported the subjective results. In addition, two approaches for adding the vertical dimension to the ESMA (ESMA-3D) were devised. Findings from this study are considered to be useful for acoustic recording for virtual reality applications as well as for multichannel surround sound.
0 INTRODUCTION Microphone array techniques for surround sound recording can be broadly classified into two groups: those that attempt to produce continuous phantom imaging around 360° in the horizontal plane and those that treat the front and rear channels separately (i.e., source imaging in the front and environmental imaging in the rear) [1]. In conventional surround sound productions for home cinema settings, the front and rear separation approach tends to be used more widely due to its flexibility in controlling the amount of ambience feeding the rear channels. However, with the recent development of virtual reality (VR) technologies that allow the user to view visual images in 360°, the need for recording audio in 360° arises. Currently, the most popular method for capturing 360° audio for VR is arguably first order Ambisonics (FOA). FOA microphone systems are typically compact in size, thus convenient for location recording, and offer stable localization characteristics due to their coincident microphone arrangement [1]. Furthermore, FOA allows one to flexibly rotate the initially captured sound field in post-production. However, it is known that FOA has limitations in terms of perceived spaciousness and the size of the sweet spot in loudspeaker reproduction due to the high level of interchannel correlation [2]. Higher order Ambisonics (HOA) offers a higher spatial resolution than FOA and can therefore overcome these limitations to some extent, although it is more costly and requires a larger number of channels. An HOA recording can be made using a spherical microphone array (e.g., mh Acoustics Eigenmike). A system that supports a higher order typically requires a larger number of microphones on the sphere. A review of currently available Ambisonics microphone systems can be found in [3].
On the other hand, a near-coincident microphone array, which incorporates directional microphones that are spaced and angled outwards, can provide a greater balance between spaciousness and localizability than a purely coincident array. This is due to the fact that it relies on both interchannel time difference (ICTD) and interchannel level difference (ICLD) for phantom imaging [4]. The so-called "equal segment microphone arrays" (ESMAs), originally proposed by Williams [4, 5], are a group of multichannel near-coincident arrays that attempt to produce continuous 360° imaging in surround reproduction. The ESMAs follow the "critical
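The ICTD/ICLD trade-off that near-coincident arrays rely on can be approximated from the array geometry. The sketch below is a simplified, free-field illustration of our own (not one of the psychoacoustic models used in the study): for one adjacent cardioid pair of a quadraphonic ESMA, it predicts the interchannel time difference from the spacing and the level difference from the cardioid polar pattern.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def cardioid_gain(theta_deg, axis_deg):
    """Sensitivity of a cardioid microphone aimed at axis_deg."""
    return 0.5 * (1.0 + np.cos(np.radians(theta_deg - axis_deg)))

def esma_pair_cues(theta_deg, spacing_m, axis_l=-45.0, axis_r=45.0):
    """Far-field ICTD (ms) and ICLD (dB) for a plane wave arriving
    from theta_deg (0 = front, positive toward the right microphone)
    at one adjacent pair of a quadraphonic ESMA (mics angled +/-45)."""
    ictd_ms = 1e3 * spacing_m * np.sin(np.radians(theta_deg)) / C
    icld_db = 20.0 * np.log10(cardioid_gain(theta_deg, axis_r)
                              / cardioid_gain(theta_deg, axis_l))
    return ictd_ms, icld_db
```

For a 50 cm spacing and a source at 45° (on the right microphone's axis), this predicts roughly 1 ms of ICTD and 6 dB of ICLD, the region in which time-level trade-off models operate.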

Research paper thumbnail of A Perceptual Model of "Punch" Based on Weighted Transient Loudness

Journal of the Audio Engineering Society, 2019

This paper proposes and evaluates a perceptual model for the measurement of "punch" in musical signals based on a novel algorithm. Punch is an attribute often used to characterize music or sound sources that convey a sense of dynamic power or weight to the listener. A methodology is explored that combines signal separation, onset detection, and low-level feature measurement to produce a perceptually weighted punch score. The model weightings are derived through a series of listening tests using noise bursts, which investigate the perceptual relevance of the onset time and frequency components of the signal across octave bands. The punch score is determined by a weighted sum of these parameters using coefficients derived through regression analysis. The model outputs are evaluated against subjective scores obtained through a pairwise comparison listening test using a wide variety of musical stimuli, and against other computational models. The model output PM95 outperformed the other models, showing a "very strong" correlation with punch perception, with Pearson r and Spearman rho coefficients of 0.849 and 0.833 respectively, both significant at the 0.01 level (2-tailed).
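To make the idea of a weighted transient measure concrete, here is a deliberately crude sketch of our own: it splits a signal into a low and a high band, measures the half-wave-rectified frame-to-frame energy increase in each (a stand-in for transient loudness), and combines the two with placeholder weights. The published model's octave-band filtering, onset detection, and regression-derived coefficients are not reproduced here.

```python
import numpy as np

# Placeholder weights; the paper derives its coefficients from
# listening tests and regression analysis, not from these values.
BAND_WEIGHTS = {"low": 0.6, "high": 0.4}

def transient_strength(x, frame=256):
    """Mean positive frame-to-frame energy flux: a crude proxy for
    the loudness of transients in the signal."""
    n = len(x) // frame
    energy = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n)])
    flux = np.diff(energy)
    return float(np.mean(np.maximum(flux, 0.0)))

def punch_score(x, fs, split_hz=500.0):
    """Toy punch score: weighted sum of transient strength in a low
    and a high band (brick-wall FFT split for simplicity)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    low = np.fft.irfft(np.where(freqs < split_hz, X, 0), n=len(x))
    high = x - low
    return (BAND_WEIGHTS["low"] * transient_strength(low)
            + BAND_WEIGHTS["high"] * transient_strength(high))
```

Even this toy version ranks a click train far above a steady sine of similar level, matching the intuition that punch follows transient energy.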

Research paper thumbnail of Perceptual threshold of apparent source width in relation to the azimuth of a single reflection

Journal of the Acoustical Society of America, 2019

An investigation into the perceptual threshold of apparent source width (ASW) in relation to a single reflection azimuth was performed in binaural reproduction. In the presence of a direct sound, subjects compared the ASW produced by a single 90° reference reflection against the ASW produced by a test reflection with a varying angle, for four reflection delay times between 5 and 30 ms. Threshold angles were found to be approximately 40° and 130°, and did not appear to be dependent on delay time. It was also found that these threshold angles were associated with saturation in [1-IACC_E3] versus reflection azimuth.
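The [1-IACC_E3] measure referenced above is built on the interaural cross-correlation. As a rough illustration (a single broadband IACC, not the octave-band, early-windowed IACC_E3 of the literature), the coefficient can be computed as the maximum normalized cross-correlation over +/-1 ms of interaural lag:

```python
import numpy as np

def iacc(left, right, fs, max_lag_ms=1.0):
    """Maximum of the normalized interaural cross-correlation over
    +/-max_lag_ms; 1 - IACC is a common predictor of apparent
    source width."""
    max_lag = int(max_lag_ms * 1e-3 * fs)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    return max(np.sum(left * np.roll(right, k)) / norm
               for k in range(-max_lag, max_lag + 1))
```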

Research paper thumbnail of The perception of hyper-compression by mastering engineers

Hyper-compressed popular music is the product of a behavior associated with the over-use of dynamic range processing in an effort to gain a competitive advantage in music production. This behavior is unnecessary given the introduction of loudness normalization algorithms across the industry and has been denounced by mastering engineers as generating audible sound quality artifacts. However, the audibility of these sound quality artifacts to mastering engineers has not been examined. This study probes this question using an ABX listening experiment with 20 mastering engineers. On average, mastering engineers correctly discriminated 17 out of 24 conditions, suggesting that the sound quality artifacts generated by hyper-compression are difficult to perceive. The findings suggest that audibility depends on the Crest Factor (CF) of the music rather than the amount of CF reduction, thus proposing the existence of a threshold of audibility. Further work focusing on education initiatives is proposed.
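The Crest Factor on which the audibility finding hinges is simply the peak-to-RMS ratio of the waveform. A minimal implementation:

```python
import numpy as np

def crest_factor_db(x):
    """Crest factor: peak-to-RMS ratio in dB. Heavy limiting
    ("hyper-compression") lowers this value by pushing the average
    level up toward the peaks."""
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(peak / rms)
```

A pure sine has a crest factor of about 3 dB and a square wave 0 dB; dynamic program material typically sits well above both.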

Research paper thumbnail of The subjective effect of BRIR length on perceived headphone sound externalisation and tonal colouration

Binaural room impulse responses (BRIRs) of various lengths were convolved with stereophonic audio signals. Listening tests were conducted to assess how the length of the BRIRs affected the perceived externalisation and tonal colouration of the audio. The results showed statistically significant correlations between BRIR length and both externalisation and tonal colouration. Conclusions are drawn from this; in addition, reasoning, a critical evaluation, and suggestions for further work are offered. The experiment provides the basis for further development of an effective and efficient externalisation algorithm.
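The core operation in such an experiment, truncating a measured BRIR and convolving it with the programme material, is straightforward. A minimal one-ear sketch (the truncation point and any windowing choices here are ours, not the paper's):

```python
import numpy as np

def render_binaural(x, brir, length_ms, fs):
    """Convolve a signal with one ear's BRIR truncated to length_ms.
    Short truncations keep the direct sound and early reflections;
    longer ones add progressively more late reverberation."""
    n = int(length_ms * 1e-3 * fs)
    return np.convolve(x, brir[:n])
```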

Research paper thumbnail of Level and Time Panning of Phantom Images for Musical Sources

Journal of the Audio Engineering Society, 2013

This study investigates the independent influences of interchannel level difference (ICLD) and interchannel time difference (ICTD) on the panning of 2-channel stereo phantom images for various musical sources. The results indicate that level panning can perform robustly regardless of the spectral and temporal characteristics of source signals, whereas time panning is not suitable for a continuous source with a high fundamental frequency. Statistical differences between the data obtained for different sources are found to be insignificant, and from this a unified set of ICLD and ICTD values for 10°, 20°, and 30° image positions is derived. Linear level and time panning functions for the two separate panning regions of 0°-20° and 21°-30° are further proposed, and their applicability to arbitrary loudspeaker base angles is also considered. These perceptual panning functions are expected to be more accurate than the theoretical sine or tangent law in terms of matching between predicted and actually perceived image positions. 0 INTRODUCTION The localization of a stereophonic phantom image is based on the principle of so-called "summing localization" [1]. In 2-channel loudspeaker reproduction, acoustic crosstalk of the loudspeaker signals occurs at each ear of the listener; the signal from the contralateral loudspeaker is "summed" with that from the ipsilateral loudspeaker, with the former being attenuated in level at high frequencies due to head shadowing and delayed in time relative to the latter. If the signals are coherent, the listener will perceive a single phantom image in the median plane.
If an interchannel level difference (ICLD) or interchannel time difference (ICTD) is applied to the loudspeaker signals, some combination of interaural level difference (ILD) and interaural time difference (ITD) will be introduced between the ear input signals, and consequently the apparent position of the image will be "panned" from the middle toward the earlier or louder loudspeaker. Research suggests that phantom images panned using ICLD are localized mainly based on ITDs at low frequencies and on ILDs and envelope-based ITDs at high frequencies [2, 3]. With regard to ICTD-based panning, the frequency dependency of interaural cues has not been studied extensively. However, it was shown in [1] that an ICTD produces only an ILD at low frequencies when it is assumed that there is no level difference between the loudspeaker signals arriving at each ear at low frequencies, whereas it leads to both ILD and ITD at high frequencies. Summing localization is valid only up to a certain threshold of ICTD (e.g., 1 ms as widely accepted), within which a trade-off between ICLD and ICTD is possible. Beyond this threshold, the localization of an auditory image largely relies on the precedence effect [4], where the image is perceived constantly at the earlier loudspeaker up to the echo threshold. Since 1940 a number of studies have been conducted to investigate the independent influence of ICLD or ICTD on the position of a phantom image perceived between two loudspeakers [2, 5-9]. The data from these studies has had many practical applications. For example, Williams [10] analyzed the coverage angles of two-channel near-coincident microphone techniques based on the data obtained by Simonsen [8]. Wittek [11] developed a tool called "Image Assistant" to calculate localization curves for various microphone arrays based on ICLD and ICTD data derived from the literature.
A reliable ICLD or ICTD data set obtained from perceptual experiments would also be useful for panning applications where accurate matching between target and perceived image positions is essential. The conventional sine and tangent level panning laws [2, 12] have been claimed to be inaccurate in this type of application. They are based on ITD cues at low frequencies only and tend to result in a greater angular displacement than predicted for broadband sources [13]. With respect to ICTD-based panning (time panning), no global law has been proposed to date for practical applications.
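For reference, the tangent law mentioned above can be written in a few lines. This is the conventional law that the paper argues against, not the paper's proposed perceptual panning functions; positive angles pan toward the left loudspeaker, and the half base angle defaults to 30°:

```python
import numpy as np

def tangent_law_gains(target_deg, half_base_deg=30.0):
    """Constant-power gains (gL, gR) from the stereophonic tangent
    law: tan(target) / tan(half_base) = (gL - gR) / (gL + gR)."""
    r = np.tan(np.radians(target_deg)) / np.tan(np.radians(half_base_deg))
    gl, gr = 1.0 + r, 1.0 - r
    norm = np.sqrt(gl ** 2 + gr ** 2)   # constant-power normalization
    return gl / norm, gr / norm

def tangent_law_icld_db(target_deg, half_base_deg=30.0):
    gl, gr = tangent_law_gains(target_deg, half_base_deg)
    return 20.0 * np.log10(gl / gr)
```

Panning half-way (15° with a 30° half base angle) gives an ICLD of roughly 8.7 dB; the paper's point is that the angle actually perceived for such an ICLD can deviate from the 15° the law predicts, especially for broadband sources.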

Research paper thumbnail of Effect of Vertical Microphone Layer Spacing for a 3D Microphone Array

Journal of the Audio Engineering Society, 2014

Subjective listening tests were conducted to investigate how the spacing between the main (lower) and height (upper) microphone layers in a 3D main microphone array affects perceived spatial impression and overall preference. Four different layer spacings of 0 m, 0.5 m, 1 m, and 1.5 m were compared for trumpet, acoustic guitar, percussion quartet, and string quartet sources using a nine-channel loudspeaker setup. It was generally found that there was no significant difference between any of the spaced layer configurations, whereas the 0 m layer had slightly higher ratings than the more widely spaced layers in both spatial impression and preference. Acoustical properties of the original microphone channel signals, as well as those of the reproduced signals, which were binaurally recorded, were analyzed in order to find possible physical causes for the perceived results. It is suggested that the perceived results were mainly associated with vertical interchannel crosstalk in the signals of each height layer and with the magnitude and pattern of spectral change at the listener's ears caused by each layer.

Research paper thumbnail of 2D-to-3D Ambience Upmixing Based on Perceptual Band Allocation

Journal of the Audio Engineering Society, 2015

Listening tests were conducted to evaluate the feasibility of a novel 2D-to-3D ambience upmixing technique named "perceptual band allocation" (PBA). Four-channel ambience signals captured in a reverberant concert hall were low-pass and high-pass filtered, and the resulting bands were routed to the lower and upper loudspeaker layers of a 9-channel 3D configuration, respectively. The upmixed stimuli were compared against original 3D recordings made using an 8-channel ambience microphone array in terms of 3D listener envelopment and preference. The results suggest that the perceived quality of the proposed method can be at least comparable to that of an original 3D recording. 0 INTRODUCTION Three-dimensional multichannel audio systems such as Auro-3D [1], Dolby Atmos [2], and 22.2 [3] employ additional height loudspeakers in order to provide the listener with a three-dimensional (3D) auditory experience. One of the perceptual attributes that could be enhanced by the use of height channels is listener envelopment (LEV). In the context of two-dimensional (2D) surround sound (e.g., 5.1), LEV is widely understood as the subjective impression of being enveloped by reverberant sound [4, 5]. With 3D loudspeaker formats, the added height channels could be used to render the "vertical" spread of the reverberant sound image as well as the horizontal one, and ultimately an auditory impression of 3D LEV could be achieved. One of the key requirements for 3D multichannel audio applications would be a 2D-to-3D upmixing technique that can add a height dimension to 2D content. Therefore, a new method that can render vertical image spread would be necessary. In the context of horizontal stereophony, horizontal image spread can be rendered by means of interchannel decorrelation, and many different decorrelation methods have been proposed over the past years [6-10].
Such methods are based on the principle that as the degree of correlation between stereophonic channel signals decreases, the correlation between the ear-input signals (interaural cross-correlation), which has a direct relationship with perceived auditory image spread [4], also decreases. However, vertically reproduced stereophonic signals would have little or no influence on interaural cross-correlation. A recent study by Gribben and Lee [11] found that vertically applied interchannel decorrelation was not as effective as horizontal decorrelation in terms of controlling the spread of the image. The literature generally suggests that vertical localization relies on spectral cues. A number of researchers [12-14] have found that the higher the frequency of a pure tone, the higher the perceived image position, regardless of the physical height of the presenting loudspeaker; a phenomenon referred to as the "pitch-height" effect in [15]. In the case of band-pass filtered noise signals, however, this effect was reported to be dependent on the physical height of the loudspeaker presenting the signal. For example, Roffler and Butler [16] found, from experiments using loudspeakers vertically arranged at different heights, that the perceived image height of a noise high-pass filtered at 2 kHz was similar to the physical height of the presenting loudspeaker. Conversely, a noise low-passed at 2 kHz was localized around or below eye level regardless of the presenting loudspeaker height. Similar results were obtained from Cabrera and Tiley's [15] experiment conducted with octave-band noise signals centered at 125 Hz, 500 Hz, 2 kHz, and 8 kHz using vertically arranged loudspeakers; a higher frequency band was localized at a higher position than a lower frequency band, and this difference became larger as the loudspeaker height increased.
Cabrera and Tiley [15] and Ferguson and Cabrera [17] confirmed the validity of this phenomenon for low- and high-pass filtered noise stimuli (crossover of 1 kHz) that were simultaneously presented from different loudspeakers at different heights. The present study aims to explore the feasibility of a new 2D-to-3D upmixing method developed based on the above research findings, named "perceptual band allocation" (PBA). The method decomposes the spectrum of the
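The band-splitting at the heart of PBA-style upmixing can be sketched in a few lines. This toy version of ours uses a single brick-wall FFT crossover; the crossover frequency is illustrative, whereas the paper's PBA maps individual frequency bands to layers based on localization data.

```python
import numpy as np

def pba_upmix(ch, fs, crossover_hz=1000.0):
    """Split one ambience channel into a low band for the lower
    (main) loudspeaker layer and a high band for the height layer."""
    X = np.fft.rfft(ch)
    freqs = np.fft.rfftfreq(len(ch), 1.0 / fs)
    low = np.fft.irfft(np.where(freqs < crossover_hz, X, 0), n=len(ch))
    high = ch - low
    return low, high   # (main-layer feed, height-layer feed)
```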

Research paper thumbnail of The Effect of Interchannel Time Difference on Localization in Vertical Stereophony

Journal of the Audio Engineering Society, 2015

Listening tests were conducted in order to analyze the localization of band-limited stimuli in vertical stereophony. The test stimuli were seven octave bands of pink noise, with center frequencies ranging from 125 Hz to 8 kHz, as well as broadband pink noise. Stimuli were presented from vertically arranged loudspeakers either monophonically or as vertical phantom images, created with the upper loudspeaker delayed with respect to the lower by 0, 0.5, 1, 5, and 10 ms (i.e., interchannel time difference). The experimental data showed that localization under these conditions is generally governed by the so-called "pitch-height" effect, with high frequency stimuli generally being localized significantly higher than low frequency stimuli for all conditions. The effect of interchannel time difference was found to be significant on localization judgments for both the 1000-4000 Hz octave bands and the broadband pink noise; it is suggested that this was related to the effects of comb filtering. Additionally, no evidence could be found to support the existence of the precedence effect in vertical stereophony. 0 INTRODUCTION The mechanisms used to localize sound sources incident from the median plane are fundamentally different from those used in horizontal plane localization. In the horizontal plane, localization relies on a combination of the time and level differences between a given source arriving at each ear (binaural cues) as well as on the directional filtering of the sound source by the pinnae (spectral cues) [1]. However, in the median plane binaural cues are absent, as sound sources arrive at each ear simultaneously. As a result, median plane localization relies solely on spectral cues [2].
Median plane localization is a topic that has received much attention in the literature, with numerous studies being particularly concerned with the localization of tonal and band-limited stimuli. In early experiments using tonal stimuli presented from vertically arranged loudspeakers, Pratt [3] concluded that localization is governed solely by frequency, with high tones being localized physically higher in space than low tones. A similar observation was made by Trimble [4], who presented tonal stimuli both singly and in succession to listeners via receiving phones positioned 15 cm from each ear. A more expansive study by Roffler and Butler [5], also using tonal stimuli presented from vertically arranged loudspeakers, affirmed the results presented in [3] and [4], with the authors noting that the effect was maintained irrespective of listener orientation, visual bias, and whether or not subjects had prior knowledge of the terms "high" and "low" in describing pitch. Subsequent experiments by Roffler and Butler [6] and Cabrera and Tiley [7] demonstrated that the relationship between pitch and height is maintained for the localization of band-passed noise signals, and moreover that the perceptual range of pitch-height depends on the physical height of the loudspeaker presenting the signal. In [7] the correlation between pitch and height was referred to as the "pitch-height effect." Following the Roffler and Butler study [5], it was noted by Blauert [8], from median plane localization experiments using loudspeakers placed in front of, directly above, and behind the listener, that frequency also governed the localization of 1/3-octave bands. Under these conditions certain frequency bands were related to specific locations on the median plane, irrespective of the actual loudspeaker position. Blauert called these bands "directional bands." Subsequent studies by Hebrank and Wright [2] and Asano et al.
[9] have shown that directional bands are closely related to the spectral cues provided by the pinnae in vertical localization. Additionally, Itoh et al. [10] demonstrated that directional bands are maintained for 1/6-octave bands of noise and that there exist differences in directional bands depending on the listener. The aforementioned localization studies are similar in that they predominantly considered the localization of stimuli presented from single loudspeakers located on the median plane. However, with the emergence of

Research paper thumbnail of Perceptual Band Allocation (PBA) for the Rendering of Vertical Image Spread with a Vertical 2D Loudspeaker Array

Journal of the Audio Engineering Society, 2016

Two subjective experiments were conducted to examine a new vertical image rendering method named "Perceptual Band Allocation" (PBA), using octave bands of pink noise presented from main and height loudspeaker pairs. The PBA attempts to control the perceived degree of vertical image spread (VIS) through a flexible mapping between frequency band and loudspeaker layer, based on the desired positioning of the band in the vertical plane. The first experiment measured the perceived vertical location of the phantom image of each octave band stimulus for the main and height loudspeaker layers individually. Results showed significant differences among the frequency bands in perceived image location. Furthermore, the so-called "pitch-height" effect was found for two separate frequency regions, with most bands from the main loudspeaker layer perceived to be elevated above the physical height of the layer. Based on the localization data from the first experiment, six different PBA stimuli were created in such a way that each frequency band was mapped to either the main or height loudspeaker layer depending on the target degree of VIS. The second experiment used a listening test to grade the perceived magnitude of VIS for the six stimuli. The results first indicated that PBA could significantly increase the perceived magnitude of VIS compared to that of a sound presented only from the main layer. It was also found that the different PBA schemes produced various degrees of perceived VIS with statistically significant differences. The paper discusses possible reasons for the obtained results in detail, based on the localization test results and the frequency-dependent energy weightings of the ear-input signals. Implications of the proposed method for the vertical upmixing of horizontal surround content are also discussed.

Research paper thumbnail of Vertical Stereophonic Localization in the Presence of Interchannel Crosstalk: The Analysis of Frequency-Dependent Localization Thresholds

Journal of the Audio Engineering Society, 2016

Listening tests were conducted in order to investigate the frequency dependency of localization thresholds in relation to vertical interchannel crosstalk. Octave band and broadband pink noise stimuli were presented to subjects as phantom images from vertically arranged stereophonic loudspeakers located directly in front of the listening position. With respect to the listening position, the lower loudspeaker was not elevated; the upper loudspeaker was elevated by 30°. Subjects completed a method-of-adjustment task in which they were required to reduce the amplitude of the upper loudspeaker until the resultant phantom image matched the position of the same stimulus presented from the lower loudspeaker alone. The upper loudspeaker was delayed with respect to the lower by 0, 0.5, 1, 5, and 10 ms. The experimental data demonstrated that the main effect of frequency on the localization threshold was significant, with the low frequency stimuli (125 and 250 Hz) requiring significantly less level reduction (less than 6 dB) than the mid-high frequency stimuli (1, 2, and 8 kHz; 9-10.5 dB reduction). The main effect of interchannel time difference (ICTD) on the localization thresholds for each octave band was found to be non-significant. For all stimuli an interchannel level difference (ICLD) was always necessary, indicating that the precedence effect is not a feature of median plane localization.

Research paper thumbnail of A Comparison between Horizontal and Vertical Interchannel Decorrelation

Applied Sciences, 2017

Featured Application: 3D audio mixing and upmixing; creative sound design. Abstract: The perceptual effects of interchannel decorrelation on perceived image spread have been investigated subjectively in both horizontal and vertical stereophonic reproduction, looking specifically at the frequency dependency of decorrelation. Fourteen and thirteen subjects graded the horizontal and vertical image spread of a pink noise sample, respectively. The pink noise signal had been decorrelated by a complementary comb-filter decorrelation algorithm, varying the frequency band, time delay, and decorrelation factor for each sample. Results generally indicated that interchannel decorrelation had a significant effect on auditory image spread both horizontally and vertically, with spread increasing as correlation decreases. However, it was found that vertical decorrelation was less effective than horizontal decorrelation. The results also suggest that the decorrelation effect was frequency-dependent; changes in horizontal image spread were more apparent in the high frequency band, whereas those in vertical image spread were in the low band. Furthermore, objective analysis suggests that the perception of vertical image spread for the low and middle frequency bands could be associated with a floor reflection, whereas for the high band the results appear to be related to spectral notches in the ear input signals.
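The complementary comb-filter decorrelation used for the stimuli can be illustrated with the classic sum/difference (Lauridsen) construction, a minimal sketch rather than the exact algorithm parameters of the study: the two outputs have complementary comb-shaped magnitude spectra and a low cross-correlation.

```python
import numpy as np

def comb_decorrelate(x, delay_samples):
    """Complementary comb-filter decorrelation: the sum and the
    difference of a signal and its delayed copy have interleaved
    spectral peaks and notches, so their correlation is low while
    their sum still reconstructs 2x."""
    d = np.zeros_like(x)
    d[delay_samples:] = x[:-delay_samples]
    return x + d, x - d
```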

Research paper thumbnail of The Reduction of Vertical Interchannel Crosstalk: The Analysis of Localisation Thresholds for Natural Sound Sources

Applied Sciences, 2017

In subjective listening tests, natural sound sources were presented to subjects as vertically oriented phantom images from two layers of loudspeakers, 'height' and 'main'. Subjects were required to reduce the amplitude of the height layer until the position of the resultant sound source matched that of the same source presented from the main layer only (the localisation threshold). Delays of 0, 1, and 10 ms were applied to the height layer with respect to the main, with vertical stereophonic and quadraphonic conditions being tested. The results of the study showed that the localisation thresholds obtained were not significantly affected by sound source or presentation method. Instead, the only variable whose effect was significant was interchannel time difference (ICTD). For an ICTD of 0 ms, the median threshold was 9.5 dB, which was significantly lower than the 7 dB found for both 1 and 10 ms. The results of the study have implications both for the recording of sound sources for three-dimensional (3D) audio reproduction formats and for the rendering of 3D images.

Research paper thumbnail of Sound Source and Loudspeaker Base Angle Dependency of Phantom Image Elevation Effect

Journal of the Audio Engineering Society

Early studies found that, when identical signals were presented from two loudspeakers equidistant from the listener, the resulting phantom image was elevated in the median plane, and the degree of elevation increased with the loudspeaker base angle. However, the sound sources used in such studies were either unknown or limited to noise signals. In order to investigate the dependencies of the elevation effect on sound source and loudspeaker base angle in detail, the present study conducted listening tests using 11 natural sources and 4 noise sources with different spectral and temporal characteristics for 7 loudspeaker base angles between 0° and 360°. The elevation effect was found to be significantly dependent on the sound source and base angle. Results generally suggest that the effect is stronger for sources with a transient nature and a flat frequency spectrum than for continuous and low-frequency-dominant sources. Theoretical reasons for the effect are also discussed based on head-related transfer function measurements. It is proposed that the perceived degree of elevation is determined by a relative cue related to the spectral energy distribution at high frequencies, but by an absolute cue associated with acoustic crosstalk and torso reflections at low frequencies.
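The "relative cue related to the spectral energy distribution at high frequencies" suggests a simple objective correlate. As a speculative illustration only (the cutoff and the measure itself are our choices, not the paper's analysis), sources can be compared by the fraction of their energy above a high-frequency cutoff:

```python
import numpy as np

def hf_energy_ratio(x, fs, cutoff_hz=4000.0):
    """Fraction of signal energy above cutoff_hz: a crude proxy for
    how strongly a source excites the high-frequency pinna cues
    discussed in connection with phantom image elevation."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    return float(np.sum(power[freqs >= cutoff_hz]) / np.sum(power))
```

By this measure a flat-spectrum source such as white noise ranks far above a low-frequency-dominant tone, mirroring the direction of the perceptual results.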

Research paper thumbnail of The Perception of Hyper-compression by Mastering Engineers

Journal of the Audio Engineering Society

Hyper-compressed popular music is the product of a behavior associated with the over-use of dynamic range processing in an effort to gain a competitive advantage in music production. This behavior is unnecessary given the introduction of loudness normalization algorithms across the industry and has been denounced by mastering engineers as generating audible sound quality artifacts. However, the audibility of these sound quality artifacts to mastering engineers has not been examined. This study probes this question using an ABX listening experiment with 20 mastering engineers. On average, mastering engineers correctly discriminated 17 out of 24 conditions, suggesting that the sound quality artifacts generated by hyper-compression are difficult to perceive. The findings of the study suggest that audibility depends on the Crest Factor (CF) of the music rather than the amount of CF reduction, thus proposing the existence of a threshold of audibility. Further work focusing on education initiatives is offered.

0 INTRODUCTION To create a hyper-compressed popular music record, auteurs must engage in a behavior compelling them to over-use dynamic range processing. This behavior is influenced by stakeholders and typically occurs at the end of the creative process in an effort to fit the audio signal to the reproduction medium. However, whilst vinyl had physical limitations governing the mastering approach, the limitations of the digital medium are perceptual and psychological ones. Stakeholders worry that the laborious creative details will not be perceived in different reproduction environments and that this will influence the success of the record. As a result, the paradigm of loudness maximization has prevailed in the popular music industry in spite of the introduction of loudness normalization algorithms, such as ITU-R BS.1770 [1], and the committed work from proponents of dynamic music. Research on hyper-compression in popular music has been addressed from a number of perspectives: (i) technical [2]-[5], (ii) listener preference [5-7], (iii) sound quality [8-11], (iv) record sales [12,13], and (v) listener fatigue [14,15]. A central theme in these studies is their attempt to reconcile the sound quality judgments communicated by expert listeners with the behavior motivating the practice. In many respects, the loudness wars are a continuation of what Leventhal has termed the "great debate" in audio [17]. This debate relates primarily to the testing of hypotheses posited by audiophiles concerning the audibility of audio components, sampling rates, and numerous other factors. These hypotheses are typically investigated using discrimination methodologies. Similarly, mastering engineers and expert listeners criticize the sound quality of hyper-compressed music [18] and posit hypotheses concerning the resulting sound quality attributes and their audibility. However, these hypotheses have not been formally tested. The results of a recent ABX experiment suggest that untrained listeners are unable to perceive sound quality artifacts generated by Crest Factor (CF) reductions of up to 10 dB and that this is a primary factor supporting the persistence of hyper-compression [19]. Ronan suggests that the prevalence of hyper-compressed music during the loudness wars has altered listeners' concept of sound quality, thus further encouraging the practice in spite of the introduction of loudness normalization [20]. Given the dissonance between the perceptions of mastering engineers and the behavior of music auteurs, there is a pressing need to examine the audibility of hyper-compression to mastering engineers.
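The Crest Factor on which the study's audibility threshold hinges is simply the ratio of a signal's peak level to its RMS level, expressed in dB. A minimal sketch follows; it is not the authors' measurement pipeline (which may window or weight the signal), only the basic definition:

```python
import math

def crest_factor_db(samples):
    """Crest factor: ratio of peak amplitude to RMS level, in dB.

    A hyper-compressed master pushes the RMS level up towards the peak
    level, so its crest factor is low; dynamic material scores high.
    """
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(peak / rms)

# A full-scale sine wave has a crest factor of sqrt(2), i.e. about 3.01 dB.
sine = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(48000)]
print(round(crest_factor_db(sine), 2))  # ≈ 3.01
```

A 10 dB CF reduction, as tested in [19], means the limiter has brought the peaks 10 dB closer to the RMS level of the signal.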

Research paper thumbnail of Vertical Interchannel Decorrelation on the Vertical Spread of an Auditory Image

Journal of the Audio Engineering Society

In horizontal stereophony, it is known that interchannel correlation relates to the horizontal spread of a phantom auditory image. However, little is known about the perceptual effect of interchannel correlation on vertical image spread (VIS) between two vertically-arranged loudspeakers. The present study investigates this through two subjective experiments: (i) a multiple comparison of relative VIS for stimuli with varying degrees of correlation; and (ii) the absolute measurement of upper and lower VIS boundaries for extreme stimulus conditions. Octave-band (center frequencies: 63 Hz to 16 kHz) and broadband pink noise signals were decorrelated using two techniques: all-pass filtering and complementary comb-filtering. These stimuli were presented from vertically-spaced loudspeaker pairs at three azimuth angles (0°, ±30°, and ±110°), with each angle assessed discretely. Both the relative and absolute test results show no significant effect of vertical correlation on VIS for the 63 Hz, 125 Hz, and 250 Hz bands. For the 500 Hz band and above, there is a general tendency for VIS to increase as correlation decreases, which is observed for both decorrelation methods. This association is strongest at 0° azimuth for the 500 Hz and 1 kHz bands; at ±30° for 8 kHz and broadband; and at ±110° for 2 kHz, 4 kHz, and 16 kHz. The 8 kHz band at ±30° has the strongest association of all conditions; post-hoc objective analysis indicates a potential relationship between HRTF localization cues (pinna filtering) and VIS perception within this frequency region. Furthermore, the absolute test results suggest that changes of VIS from interchannel decorrelation are fairly slight, with only the broadband and 16 kHz bands showing a significant increase. The deviations of boundary scores also suggest a difficulty in grading absolute VIS and/or potential disagreements among listeners.
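Of the two decorrelation techniques named above, complementary comb-filtering is the simpler to sketch: one channel receives the signal plus a delayed copy, the other the signal minus the same delayed copy, so the spectral peaks of one channel fall on the notches of the other. The sketch below is illustrative only; the delay, gain, and zero-lag correlation measure are assumed values, not the exact filters used in the experiments:

```python
import math
import random

def comb_pair(x, delay, g=1.0):
    """Complementary comb-filtered pair: y1[n] = x[n] + g*x[n-d] and
    y2[n] = x[n] - g*x[n-d]. Interleaved peaks and notches lower the
    interchannel correlation while each channel keeps a flat-ish
    long-term power spectrum."""
    y1 = [x[n] + (g * x[n - delay] if n >= delay else 0.0) for n in range(len(x))]
    y2 = [x[n] - (g * x[n - delay] if n >= delay else 0.0) for n in range(len(x))]
    return y1, y2

def correlation(a, b):
    """Zero-lag normalised cross-correlation of two equal-length signals."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(48000)]  # 1 s of white noise
left, right = comb_pair(noise, delay=24)                # 0.5 ms delay at 48 kHz
print(round(correlation(left, right), 2))  # ≈ 0.0 (strongly decorrelated)
```

With unity gain the cross terms cancel almost exactly, which is why complementary comb-filtering can drive the interchannel correlation close to zero.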

Research paper thumbnail of Automatic Spatial Audio Scene Classification in Binaural Recordings of Music

Applied Sciences, 2019

The aim of the study was to develop a method for the automatic classification of three spatial audio scenes, differing in the horizontal distribution of foreground and background audio content around a listener, in binaurally rendered recordings of music. For the purpose of the study, audio recordings were synthesized using thirteen sets of binaural room impulse responses (BRIRs), representing the room acoustics of both semi-anechoic and reverberant venues. Head movements were not considered in the study. The proposed method was assumption-free with regard to the number and characteristics of the audio sources. A least absolute shrinkage and selection operator (LASSO) was employed as a classifier. According to the results, it is possible to automatically identify the spatial scenes using a combination of binaural and spectro-temporal features. The method exhibits a satisfactory classification accuracy when it is trained and then tested on different stimuli synthesized using the same BRIRs (accuracy ranging from 74% to 98%), even in highly reverberant conditions. However, the generalizability of the method needs to be further improved. This study demonstrates that, in addition to the binaural cues, the Mel-frequency cepstral coefficients constitute an important carrier of spatial information, imperative for the classification of spatial audio scenes.
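As an illustration of the kind of binaural cue such a classifier can draw on (the paper's actual feature set is richer and computed per frequency band), a broadband interaural level difference can be measured as the ratio of the RMS levels of the two ear signals; the function and its parameters below are illustrative, not the paper's implementation:

```python
import math

def ild_db(left, right, eps=1e-12):
    """Broadband interaural level difference (ILD): the ratio of the RMS
    levels of the two ear signals, in dB. Positive values mean the left
    ear signal is stronger. One of the simplest binaural features; real
    classifiers compute it per band and combine it with ITD and
    spectro-temporal features such as MFCCs."""
    rms_l = math.sqrt(sum(s * s for s in left) / len(left))
    rms_r = math.sqrt(sum(s * s for s in right) / len(right))
    return 20.0 * math.log10((rms_l + eps) / (rms_r + eps))

# A source level-panned to the left: left channel twice the right.
left = [math.sin(2 * math.pi * n / 100) for n in range(1000)]
right = [0.5 * s for s in left]
print(round(ild_db(left, right), 2))  # ≈ 6.02
```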

Research paper thumbnail of Capturing 360° Audio Using an Equal Segment Microphone Array (ESMA)

Journal of the Audio Engineering Society, 2019

The equal segment microphone array (ESMA) is a multichannel microphone technique that attempts to capture a sound field in 360° without any overlap between the stereophonic recording angles of adjacent pairs of microphones. This study investigated the optimal microphone spacing for a quadraphonic ESMA using cardioid microphones. Recordings of a speech source were made using ESMAs with four different microphone spacings of 0 cm, 24 cm, 30 cm, and 50 cm, based on different psychoacoustic models for microphone array design. Multichannel and binaural stimuli were created with the reproduced sound field rotated in 45° intervals. Listening tests were conducted to examine the accuracy of phantom image localization for each microphone spacing in both loudspeaker and binaural headphone reproduction. The results generally indicated that the 50 cm spacing, which was derived from an interchannel time and level trade-off model perceptually optimized for a 90° loudspeaker base angle, produced more accurate localization than the 24 cm and 30 cm spacings, which were based on conventional models derived from the standard 60° loudspeaker setup. The 0 cm spacing produced the worst accuracy, with the most frequent bimodal distributions of responses between the front and back regions. Analyses of the interaural time and level differences of the binaural stimuli supported the subjective results. In addition, two approaches for adding the vertical dimension to the ESMA (ESMA-3D) were devised. Findings from this study are considered useful for acoustic recording for virtual reality applications as well as for multichannel surround sound.
0 INTRODUCTION Microphone array techniques for surround sound recording can be broadly classified into two groups: those that attempt to produce continuous phantom imaging around 360° in the horizontal plane and those that treat the front and rear channels separately (i.e., source imaging in the front and environmental imaging in the rear) [1]. In conventional surround sound productions for home cinema settings, the front-and-rear separation approach tends to be used more widely due to its flexibility in controlling the amount of ambience feeding the rear channels. However, with the recent development of virtual reality (VR) technologies that allow the user to view visual images in 360°, the need for recording audio in 360° arises. Currently, the most popular method for capturing 360° audio for VR is arguably first-order Ambisonics (FOA). FOA microphone systems are typically compact in size, thus convenient for location recording, and offer stable localization characteristics due to their coincident microphone arrangement [1]. Furthermore, FOA allows one to flexibly rotate the initially captured sound field in post-production. However, it is known that FOA has limitations in terms of perceived spaciousness and the size of the sweet spot in loudspeaker reproduction due to the high level of interchannel correlation [2]. Higher-order Ambisonics (HOA) offers a higher spatial resolution than FOA and can therefore overcome the limitations of FOA to some extent, although it is more costly and requires a larger number of channels. An HOA recording can be made using a spherical microphone array (e.g., the mh Acoustics Eigenmike). A system that supports a higher order typically requires a larger number of microphones on the sphere. A review of currently available Ambisonics microphone systems can be found in [3].

On the other hand, a near-coincident microphone array, which incorporates directional microphones that are spaced and angled outwards, can provide a greater balance between spaciousness and localizability than a pure coincident array. This is due to the fact that it relies on both interchannel time difference (ICTD) and interchannel level difference (ICLD) for phantom imaging [4]. The so-called "equal segment microphone arrays" (ESMAs), originally proposed by Williams [4, 5], are a group of multichannel near-coincident arrays that attempt to produce continuous 360° imaging in surround reproduction. The ESMAs follow the "critical
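The sound-field rotation that makes FOA attractive for VR follows directly from its encoding: a mono source at a known direction maps onto four B-format channels through simple trigonometric gains, and rotation then becomes a matrix operation on those channels. A minimal sketch using the traditional W/X/Y/Z convention with W scaled by 1/sqrt(2); the function name and parameters are illustrative assumptions:

```python
import math

def foa_encode(sample, azimuth_deg, elevation_deg=0.0):
    """Encode a mono sample into first-order Ambisonics B-format
    channels (traditional W/X/Y/Z convention, W carrying the
    omnidirectional component scaled by 1/sqrt(2))."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                  # omnidirectional
    x = sample * math.cos(az) * math.cos(el)     # front-back
    y = sample * math.sin(az) * math.cos(el)     # left-right
    z = sample * math.sin(el)                    # up-down
    return w, x, y, z

# A source dead ahead (0 degrees) excites X but not Y or Z.
w, x, y, z = foa_encode(1.0, azimuth_deg=0.0)
print(round(w, 3), round(x, 3), round(y, 3), round(z, 3))  # 0.707 1.0 0.0 0.0
```

Because the directional information lives entirely in these gains, rotating the scene by some angle only requires re-mixing X and Y with a 2x2 rotation matrix, with no access to the original sources.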

Research paper thumbnail of A Perceptual Model of "Punch" Based on Weighted Transient Loudness

Journal of the Audio Engineering Society, 2019

This paper proposes and evaluates a perceptual model for the measurement of "punch" in musical signals based on a novel algorithm. Punch is an attribute often used to characterize music or sound sources that convey a sense of dynamic power or weight to the listener. A methodology is explored that combines signal separation, onset detection, and low-level feature measurement to produce a perceptually weighted punch score. The model weightings are derived through a series of listening tests using noise bursts, which investigate the perceptual relevance of the onset time and frequency components of the signal across octave bands. The punch score is determined by a weighted sum of these parameters using coefficients derived through regression analysis. The model outputs are evaluated against subjective scores obtained through a pairwise comparison listening test using a wide variety of musical stimuli, and against other computational models. The model output PM95 outperformed the other models, showing a "very strong" correlation with punch perception, with Pearson r and Spearman rho coefficients of 0.849 and 0.833 respectively, both significant at the 0.01 level (2-tailed).

Research paper thumbnail of Perceptual threshold of apparent source width in relation to the azimuth of a single reflection

Journal of the Acoustical Society of America, 2019

An investigation into the perceptual threshold of apparent source width (ASW) in relation to the azimuth of a single reflection was performed in binaural reproduction. In the presence of a direct sound, subjects compared the ASW produced by a single 90° reference reflection against the ASW produced by a test reflection with a varying angle, for four reflection delay times between 5 and 30 ms. Threshold angles were found to be approximately 40° and 130°, and did not appear to be dependent on delay time. It was also found that these threshold angles were associated with saturation in [1 - IACC_E3] versus reflection azimuth.
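The IACC_E3 metric referred to above is built on the interaural cross-correlation coefficient: the maximum of the normalised cross-correlation between the two ear signals over interaural lags of roughly ±1 ms (the E3 variant averages the coefficient over three octave bands of the early part of the response). The following is a simplified broadband sketch of the core coefficient, not the octave-band, early-windowed version used in the paper:

```python
import math

def iacc(left, right, max_lag):
    """Interaural cross-correlation coefficient: maximum of the
    normalised cross-correlation between the ear signals over lags of
    +/- max_lag samples (conventionally about +/-1 ms). [1 - IACC] is
    a common objective correlate of apparent source width: the more
    dissimilar the ear signals, the wider the source appears."""
    denom = math.sqrt(sum(s * s for s in left) * sum(s * s for s in right))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for n in range(len(left)):
            m = n + lag
            if 0 <= m < len(right):
                acc += left[n] * right[m]
        best = max(best, abs(acc) / denom)
    return best

# Identical ear signals are perfectly correlated: IACC = 1, ASW minimal.
sig = [math.sin(0.07 * n) for n in range(2000)]
print(round(iacc(sig, sig, max_lag=48), 6))  # 1.0
```

A strong lateral reflection decorrelates the ear signals, lowering IACC and raising [1 - IACC], which is the saturation behaviour the threshold angles were related to.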

Research paper thumbnail of The perception of hyper-compression by mastering engineers

Hyper-compressed popular music is the product of a behavior associated with the over-use of dynamic range processing in an effort to gain a competitive advantage in music production. This behavior is unnecessary given the introduction of loudness normalization algorithms across the industry and has been denounced by mastering engineers as generating audible sound quality artifacts. However, the audibility of these sound quality artifacts to mastering engineers has not been examined. This study probes this question using an ABX listening experiment with 20 mastering engineers. On average, mastering engineers correctly discriminated 17 out of 24 conditions, suggesting that the sound quality artifacts generated by hyper-compression are difficult to perceive. The findings of the study suggest that audibility depends on the Crest Factor (CF) of the music rather than the amount of CF reduction, thus proposing the existence of a threshold of audibility. Further work focusing on education initiatives is offered.

Research paper thumbnail of The subjective effect of BRIR length on perceived headphone sound externalisation and tonal colouration

Binaural room impulse responses (BRIRs) of various lengths were convolved with stereophonic audio signals. Listening tests were conducted to assess how the length of the BRIRs affected the perceived externalisation and tonal colouration of the audio. The results showed statistically significant correlations between BRIR length and both externalisation and tonal colouration. Conclusions are drawn from this; in addition, reasoning, a critical evaluation, and suggestions for further work are offered. The experiment provides the basis for further development of an effective and efficient externalisation algorithm.