Media Processing – Audio Part
Dr Wenwu Wang
Centre for Vision Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html
1
Approximate outline
 Week 6: Fundamentals of audio
 Week 7: Audio acquiring, recording, and standards
 Week 8: Audio processing, coding, and standards
 Week 9: Audio perception and audio quality assessment
 Week 10: Audio production and reproduction
2
Speech codec, audio coding quality
evaluation, and audio perception
Concepts and topics to be covered:
 Speech coding
 Waveform coder, vocoder, and hybrid coder
 Frequency domain and time domain coders
 Audio file formats
 Digital container format
 Audio quality measurement
 Subjective assessment: listening tests
 Objective assessment: perceptual objective measurement
 Objective perceptual measurements
 Masked threshold, internal representation,
 PEAQ, PESQ
 Audio Perception
 Loudness perception, pitch perception,
 space perception, timbre perception
3
Speech coding strategies
 Speech coding schemes can be broadly divided into the following three main categories:
vocoders, waveform coders, and hybrid coders.
 The aim is to analyse the signal, remove the redundancies, and efficiently code the nonredundant parts of the signal in a perceptually acceptable manner.

SBC = sub-band coding, ATC = adaptive transform coding, MBE = multiband excitation, APC =
adaptive predictive coding, RELP = residual excited linear predictive coding (LPC), MPLPC = multi-pulse LPC,
CELP = code-excited LPC, SELP = self-excitation LPC.
Source: Kondoz, 2001
4
Waveform coders
 Such coders attempt to preserve the general shape of the signal
waveform. Hence they are not speech specific.
 They generally operate on a sample-by-sample basis. Their performance
is usually measured by the signal-to-noise ratio (SNR), as quantisation is the major source of
distortion.
 They usually operate above 16 kb/s. For example, the first speech
coding standard, PCM, operates at 64 kb/s, and a later standard,
adaptive differential PCM (ADPCM), operates at 32 kb/s.
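The 64 kb/s figure and the role of quantisation noise in SNR can be illustrated with a minimal sketch. It assumes 8 kHz sampling with 8 bits per sample for the PCM rate, and uses a plain uniform quantiser as a stand-in (the PCM standard itself uses logarithmic companding, which is not modelled here). The function names are hypothetical.

```python
import numpy as np

def bit_rate_kbps(sample_rate_hz, bits_per_sample):
    """Bit rate of a sample-by-sample coder in kb/s."""
    return sample_rate_hz * bits_per_sample / 1000.0

def snr_db(reference, degraded):
    """SNR in dB, treating (reference - degraded) as the noise."""
    noise = reference - degraded
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def uniform_quantise(x, bits):
    """Mid-rise uniform quantiser for samples in [-1, 1) (an assumption, not companded PCM)."""
    levels = 2 ** bits
    step = 2.0 / levels
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

# One second of a 440 Hz tone sampled at 8 kHz
t = np.arange(8000) / 8000.0
x = 0.9 * np.sin(2 * np.pi * 440.0 * t)

pcm_rate = bit_rate_kbps(8000, 8)            # 8 kHz x 8 bits
snr_8bit = snr_db(x, uniform_quantise(x, 8))
snr_12bit = snr_db(x, uniform_quantise(x, 12))
```

Each extra bit of uniform quantisation buys roughly 6 dB of SNR, which is why the bit rate and the SNR of a waveform coder are so tightly linked.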
5
Voice coders (vocoders)
 A vocoder consists of an analyser and a synthesiser. The analyser
extracts from the original speech a set of parameters representing the
speech production model, which are then transmitted. The synthesiser
then reconstructs the speech from the transmitted parameters.
The synthesised speech is often crude.
 Vocoders are very speech specific and do not attempt to preserve
the waveform of speech.
 Vocoders often operate at rates below 4.8 kb/s. Their quality is usually
measured subjectively using the mean opinion score (MOS) test or the
diagnostic acceptability measure (DAM), which covers the perceptual
quality of both the signal and the background (e.g. intelligibility, pleasantness,
and overall acceptability).
 Such standards are mainly targeted at non-commercial applications, e.g.
secure military systems.
6
Hybrid coders
 The hybrid scheme attempts to combine the advantages of the waveform
coder and the vocoder.
 Hybrid coders can generally be categorised into frequency-domain and time-domain
methods.
 The basic idea of frequency-domain coding is to divide the speech
spectrum into frequency bands or components using a filter bank or a
block transform analysis. After encoding and decoding, these
components are used to resynthesise the original input waveform based
on either filter bank summation or an inverse block transform.
 Time-domain coding is usually motivated by linear prediction. The
statistical characteristics of speech signals can be very accurately
modelled by a source-filter model, which assumes speech is produced
by filtering an excitation signal with a linear time-varying filter. For
voiced speech, the excitation signal is a periodic impulse train, and for
unvoiced speech, a random noise signal.
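The source-filter model above can be sketched as follows. This is only an illustration under stated assumptions: the filter is held fixed over one short frame (the model in the slide is time-varying), it has a single resonance at 500 Hz chosen arbitrarily, and the function name and all parameter values are hypothetical.

```python
import numpy as np

def all_pole_filter(excitation, a):
    """Apply y[n] = x[n] - sum_k a[k-1] * y[n-k] (an all-pole synthesis filter)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

fs = 8000          # sampling rate in Hz (assumed)
n_samples = 800    # one 100 ms frame

# Assumed vocal-tract filter: one resonance at 500 Hz, pole radius 0.95
radius, f_res = 0.95, 500.0
a = [-2.0 * radius * np.cos(2.0 * np.pi * f_res / fs), radius ** 2]

# Voiced excitation: periodic impulse train at 100 Hz (one impulse every 80 samples)
voiced_exc = np.zeros(n_samples)
voiced_exc[:: fs // 100] = 1.0

# Unvoiced excitation: random noise
rng = np.random.default_rng(0)
unvoiced_exc = rng.standard_normal(n_samples)

voiced = all_pole_filter(voiced_exc, a)
unvoiced = all_pole_filter(unvoiced_exc, a)
```

The voiced output settles into a waveform that repeats every 80 samples, while the unvoiced output is noise coloured by the same resonance.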
7
Hybrid coders (cont)
 An example of a frequency-domain hybrid coder: a typical sub-band coder
(broad band analysis).
Source: Kondoz, 2001
8
Hybrid coders (cont)
 An example of a frequency-domain hybrid coder: an adaptive transform
coder (narrow band analysis), in which a different bit depth can be applied
to each sub-band.
Source: Kondoz, 2001
9
Hybrid coders (cont)
 An example of a time-domain hybrid coder: an adaptive predictive coder.
Source: Kondoz, 2001
10
Quality of speech coding
schemes
Source: Kondoz, 2001
11
Digital speech coding standards
12
Difference between audio codecs
and audio file formats
 A codec is an algorithm that performs the encoding and decoding of the
raw audio data. Audio data itself is usually stored in a file with a specific
audio file format.
 There are three major kinds of file formats:
 Uncompressed audio formats, such as WAV, AIFF, AU, or PCM.
 Lossless compressed audio formats, such as FLAC, WavPack,
Apple Lossless, MPEG-4 ALS, Windows Media Audio (WMA)
Lossless.
 Lossy compression audio formats, such as MP3, Ogg Vorbis, AAC,
WMA Lossy.
13
Difference between audio codecs
and audio file formats (cont)
 Most audio file formats support only one type of audio data (created with
an audio coder); however, there are multimedia digital container formats
(such as AVI) that may contain multiple types of audio and video data.
 A digital container format is a meta-file format where different types of
data elements and metadata exist together in a computer file.
 Formats exclusive to audio include, e.g., wav, xmf.
 Formats that contain multiple types of data include, e.g. Ogg, MP4.
14
Coding dilemma
 In practical audio codec design, it is always a trade-off between the
following two important factors:
 Data rate and system complexity limitation
 Audio quality
15
Objective quality
measurement of coded audio
 Traditional objective measure: The quality of audio is measured using
objective performance indices such as the following, in which psychoacoustic effects
are ignored.
 Signal to noise ratio (SNR)
 Total block distortion (TBD)
 Perceptual objective measure: The quality of audio is predicted based on a
specific model of hearing.
16
Subjective quality
measurement of coded audio
 Human listening tests: When a highly accurate assessment is needed,
formal listening tests will be required to judge the perceptual quality of
audio.
17
Experiment of “13 dB miracle”
 J. Johnston and K. Brandenburg, then at Bell Labs, presented examples of
two signals having the same SNR of 13 dB: one with added white
noise, and the other with injected noise that was perceptually shaped (so that the
distortion was partially or completely masked by the signal components).
Despite the same SNR, the perceived quality was very different:
the latter was judged to be a good-quality signal, and the former
a bit annoying.
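The point that SNR is blind to how the noise is distributed can be sketched in a few lines. This is not the Bell Labs experiment: the "shaped" noise here is only a moving-average-filtered noise, used as a crude stand-in with a different spectrum, and no masking model is involved. All names and values are assumptions.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB for a given signal and additive noise."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def scale_to_snr(signal, noise, target_db):
    """Rescale the noise so that snr_db(signal, result) equals target_db."""
    gain = np.sqrt(np.sum(signal ** 2) / (np.sum(noise ** 2) * 10 ** (target_db / 10.0)))
    return gain * noise

rng = np.random.default_rng(1)
fs = 16000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 1000.0 * t)

# White noise, and a spectrally different noise (5-tap moving average, low-pass character)
white = rng.standard_normal(fs)
shaped = np.convolve(rng.standard_normal(fs), np.ones(5) / 5.0, mode="same")

white_13 = scale_to_snr(signal, white, 13.0)
shaped_13 = scale_to_snr(signal, shaped, 13.0)
```

Both noises end up at exactly 13 dB SNR although their spectra differ, so the SNR figure alone cannot say which one a listener would find more annoying.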
18
Factors to consider in assessing
the audio coder quality
 Audio material
 Different material stresses different aspects of a coder. For example,
transient sounds can be used to test the coder’s ability to code
transient signals.
 Data rate
 Decreasing the data rate is likely to reduce the quality of a codec. The
data rate should therefore be taken into account when comparing
the quality of audio codecs.
19
Impairments versus transparency
 The audio quality of a coding system can be assessed in terms of
impairment, which is the perceived difference between the output of a
system under test and a known reference signal.
 The coding system under test is said to be transparent when even listeners
who are experts in identifying impairments cannot distinguish between the
reference and the test signals.
 To determine whether the coding system is transparent, or how
transparent it is, we can present both the test and reference
signals to the listeners in random order, and ask them to pick out the
test signal. If the listeners get it wrong roughly 50% of the time (i.e. they
perform at chance level), then the system is transparent.
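Whether a listener's hit rate is really distinguishable from the 50% chance level can be checked with a one-sided binomial test. The sketch below is one simple way to do this and is not taken from the lecture; the trial counts are made-up examples.

```python
from math import comb

def binomial_p_value(correct, trials, chance=0.5):
    """One-sided P(X >= correct) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical results: 11 of 20 correct (near chance) and 18 of 20 correct
p_guessing = binomial_p_value(11, 20)
p_detecting = binomial_p_value(18, 20)
```

A large p-value (as for 11 out of 20) is consistent with guessing, i.e. with a transparent system; a very small one (as for 18 out of 20) indicates that the listener can hear the difference.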
20
Coding margin
 Coding margin refers to a measure of how far the coder is from the onset of
audible impairments.
 It can be estimated using listening tests in which the data rate of a coder is
gradually reduced until listeners can detect the test signal with
statistically significant accuracy when the reference signals are also
present.
 In practice, people are interested in the degree of impairments when they
are below the region of transparency. In most cases, the coder in or near
the transparent region is preferred (i.e. the impairments are very small).
The well-known five-grade impairment scale and the formal listening test
process, to be discussed later, are designed (by the ITU-R) for such
situations.
21
Listening tests for audio codec
quality assessment
 Main features of carrying out a listening test for coded audio with small
impairments (more details described in the standard [ITU-R BS.526])
 Five-grade impairment scale
 Test method
 Training and grading
 Expert listeners and critical material
 Listening conditions
 Data analysis of listening results
22
Five-grade impairment scale
 According to the standard ITU-R BS.562-3, any perceived difference
between the reference signal and the output of the system under test
should be interpreted as a perceptual impairment, measured by the
following discrete five-grade scale:
Source: Bosi & Goldberg, 2002
23
Five-grade impairment scale (cont)
 Correspondence between the five-grade impairment scale and the five-grade quality scale:
Source: Bosi & Goldberg, 2002
24
Five-grade impairment scale (cont)
 For the convenience of data analysis, the subjective difference grade (SDG)
is usually used. The SDG is the difference between the listener’s grade for
the coded signal and the grade for the reference signal, i.e. SDG = grade of coded
signal – grade of reference signal.
 The SDG has a negative value when the listener successfully
distinguishes the reference from the coded signal and it has a positive
value when the listener erroneously identifies the coded signal as the
reference.
Source: Bosi & Goldberg, 2002
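The SDG definition above amounts to a single subtraction. A minimal sketch, assuming grades on a 1.0 to 5.0 scale and using made-up listener grades:

```python
def subjective_difference_grade(coded_grade, reference_grade):
    """SDG = grade of coded signal - grade of (hidden) reference signal."""
    return coded_grade - reference_grade

# Listener correctly identifies the coded signal and grades the reference 5.0
sdg_correct = subjective_difference_grade(3.8, 5.0)

# Listener mistakes the coded signal for the reference and grades it higher
sdg_mistake = subjective_difference_grade(5.0, 4.5)
```

The first case gives a negative SDG and the second a positive one, matching the sign convention described above.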
25
Test method
 The most widely accepted method for testing coding systems with small
impairments is the so-called “double-blind, triple-stimulus with hidden
reference” method.
 Triple stimulus. The listener is presented with three signals: the reference
signal and two test signals, A and B. One of the test signals is identical to the
reference signal.
 Double blind. Neither the listener nor the test administrator should know
beforehand which test signal is which. Test signals A and B are assigned
randomly by an entity other than the test administrator.
 Hidden reference. The hidden reference (the test signal that is
identical to the reference signal) provides an easy means of checking that the
listener does not consistently make mistakes.
 The above test method has been employed worldwide and it provides a
very sensitive, accurate, and stable way of assessing small impairments in
coded audio.
26
Training and grading
 A listening test usually consists of two phases: training and a formal
grading phase.
 Training phase. This is carried out prior to the formal grading phase. This
phase allows the listening panel to become familiar with the test
environment, the grading process, and the codec impairments. This phase
can substantially reduce the effect of so-called informational masking,
which refers to the phenomenon where the threshold of a complex maskee
masked by a complex masker can decrease by on the order of 40 dB after
training. Note that a small unfamiliar distortion is much more difficult to
assess than a small familiar distortion.
 Testing phase. In this phase, the listener is presented with a grading sheet,
such as the example used for the development of the MPEG AAC
coder; see the figure on the next page.
27
Training and grading (cont)
 Example of a grading sheet from a listening test
Source: Bosi & Goldberg, 2002
28
Expert listeners and critical material
 Expert listeners are listeners who have recent and extensive
experience of assessing impairments of the type being studied in the test.
The expert listener panel is typically selected by using pre-screening (e.g.
an audiometric test) and post-screening (e.g. to determine whether the
listener can consistently identify the hidden reference) procedures.
 Critical material should be sought for each codec to be tested, even though
it is impossible to create a complete list of the difficult material for
perceptual audio codecs. Such material can be synthetic signals that
deliberately break the system under test, or any potential broadcast material
that stresses the coding system under test.
29
Listening conditions
 The listening conditions and the
equipment need to be precisely
specified for others to be able to
reliably reproduce the test. The
listening conditions include the
characteristics of the listening
room (such as its geometric
properties, its reverberation time,
early reflections, background
noise, etc.), the characteristics
and the arrangement of the
loudspeakers in the listening
room, and the reference listening
area. (See the multichannel
loudspeakers configuration from
[ITU-R BS. 1116])
Source: Bosi & Goldberg, 2002
30
Data analysis
 The ANOVA (Analysis of Variance) method is most commonly used for the
analysis of the test results. The SDG (subjective difference grade) is an
appropriate basis for a detailed statistical analysis.
 The resolution achieved by the listening test is reflected in the confidence
interval, which contains the SDG values with a specific degree of
confidence, 1 − α, where α represents the probability that inaudible
differences are labelled as audible. The figure below shows an example of
formal listening test results from [ISO/IEC MPEG N1420].
Source: Bosi & Goldberg, 2002
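A confidence interval for the mean SDG can be sketched as below. This is a simplified illustration rather than the ANOVA procedure mentioned above: it uses a normal approximation with z = 1.96 (roughly α = 0.05), whereas a t-quantile would be more appropriate for a panel this small. The SDG scores are invented for the example.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Mean and normal-approximation confidence interval (z = 1.96 is about 95%)."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width

# Hypothetical SDG scores from ten listeners for one codec and one test item
sdg_scores = [-0.4, -0.1, 0.0, -0.6, -0.3, -0.2, 0.1, -0.5, -0.3, -0.2]

mean_sdg, ci_low, ci_high = mean_confidence_interval(sdg_scores)

# If the whole interval lies below zero, the coded signal is judged
# distinguishable from the reference at this confidence level
distinguishable = ci_high < 0.0
```

With these scores the interval lies entirely below zero, so the impairment would be reported as audible even though it is small.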
31
MUSHRA method
 MUSHRA (Multiple Stimulus with Hidden Reference and Anchors) is
recommended in [ITU-R BS. 1534] to provide guidelines for the
assessment of audio systems with intermediate quality, i.e. for the ranking
between two systems in the region far from transparency. In this case, the
seven-grade comparison scale is recommended.
 The anchors, which are low-pass-filtered versions of the
reference signal, are meant as an aid in weighting the relative annoyance of
the various artefacts.
Source: Bosi & Goldberg, 2002
32
Advantage and problems with
formal subjective listening tests
 Advantage
 Good reliability
 Disadvantage
 High cost
 Time consuming
33
Objective perceptual measurements
of audio quality
 Aim
 To predict the basic audio quality by using objective measurements
based on psychoacoustic principles.
 PEAQ (Perceptual Evaluation of Audio Quality)
 Adopted in [ITU-R BS.1387], it is based on a refinement of generally
accepted psychoacoustic models, together with new cognitive
components accounting for higher-level processes involved in the
judgement of audio quality.
34
Two basic approaches used in
objective perceptual measurements
 The masked threshold method (based on the estimation and accurate
model of masking)
 The internal representation method (based on the estimation of the
excitation patterns of the cochlea taking place in the human ear)
[Figure panels: masked threshold method; internal representation method]
35
Source: Bosi & Goldberg, 2002
PEAQ
 PEAQ has two versions: basic (only using DFT) and advanced (using both DFT
and filter bank). The basic model is fast and suitable for real-time applications,
while the advanced model is computationally more expensive but provides more
accurate results.
 In the advanced version, the peripheral ear is modelled both through a DFT and a
bank of forty pairs of linear-phase filters with centre frequencies and bandwidths
corresponding to the auditory filter bandwidths.
 The model output values (MOVs) are based partly on the masked threshold
method and partly on the internal representation method.
 MOVs include partial loudness of linear and nonlinear distortions, noise to mask
ratios, alteration of temporal envelopes, harmonic errors, probability of error
detection, and proportion of signal frames containing audible distortions.
 The selected MOVs are mapped to an objective difference grade (ODG) via an
artificial neural network. ODG is a prediction of SDG.
 The correlation between SDG and ODG proved to be very good, and there is no
significant statistical difference between them.
36
PEAQ (cont)
 Psychoacoustic model of the advanced version of PEAQ.
Source: Bosi & Goldberg, 2002
37
Coding artifacts
 Pre-echo
 For sharp transient signals, pre-echo is caused by the spreading of
quantisation noise into a time region where it is not masked. It can be
reduced by block switching.
 Aliasing
 It might occur when PQMF and MDCT are applied with coarse quantisation,
but it is not a problem in normal conditions.
 Birdies
 These can occur at low data rates, when the bit allocation changes from
block to block in the highest frequency bands, causing some spectral
coefficients to appear and disappear.
 Reverberation
 It can occur when a large block size is employed for the filter bank at low
data rates.
 Multichannel artefacts
 The loss or shift in the stereo image can introduce artefacts, relevant to
binaural masking.
38
PESQ
 PESQ (perceptual evaluation of speech quality), described in [ITU-T Rec.
P.862] and launched in 2000, is a family of algorithms for objective measurement of
speech quality that predict the results of subjective listening tests on telephony
systems.
 PESQ uses a sensory model to compare the original, unprocessed signal with
the degraded signal from the network or network element. The resulting quality
score is analogous to the subjective “Mean Opinion Score” (MOS) measured
using listening tests according to ITU-T P.800.
 PESQ takes into account coding distortions, errors, packet loss, delay and
variable delay, and filtering in analogue network components. The user interfaces
have been designed to provide simple access to the algorithm, either
directly from the analogue connection or from speech files recorded elsewhere.
39
Audio Perception
 Loudness perception
 Pitch perception
 Space perception
 Timbre perception
40
Inner Ear Function
 The inner ear consists of the cochlea, which has a snail-like structure.
o It transfers mechanical vibrations into movement of the basilar
membrane, which is then converted into nerve firings by the organ of Corti
(which consists of a number of hair cells).
o The basilar membrane carries out frequency analysis of input
sounds, and it responds best to high frequencies at the (narrow
and thin) base end, and to low frequencies at the (wide and thick)
apex end.
Inner Ear Function
[Figure: (a) the spiral nature of the cochlea; (b) the cochlea unrolled; (c) vertical cross-section through the cochlea; (d) detailed view of the cochlea tube. From: (Howard & Angus, 1996)]
Loudness Perception
 The ear’s sensitivity to sounds of different frequencies varies over a wide
range of sound pressures. The minimum sound pressure that can be
detected by the human hearing system around 4 kHz is approximately
10⁻⁵ Pa, while the maximum (i.e. the threshold of pain) is 20 Pa.
 For convenience, in practice, sound pressure level (SPL) is usually expressed in decibels (dB)
relative to 2 × 10⁻⁵ Pa (20 µPa).
$$\mathrm{dB(SPL)} = 20 \log_{10}\left(\frac{P_m}{P_r}\right)$$
where $P_m$ is the measured sound pressure.
 For example, the threshold of hearing at 1 kHz is, in fact, $P_r = 2 \times 10^{-5}$ Pa.
In dB, it equals
$$20 \log_{10}\left(\frac{2 \times 10^{-5}}{2 \times 10^{-5}}\right) = 0\ \mathrm{dB}$$
 The threshold of pain is 20 Pa, which in dB equals
$$20 \log_{10}\left(\frac{2 \times 10}{2 \times 10^{-5}}\right) = 120\ \mathrm{dB}$$
Loudness Perception (cont.)
 The perceived loudness of an acoustic sound is related to its
amplitude (but not a simple one-to-one relationship), as well as the
context and nature of the sound.
 As the sensitivity of our hearing system varies as the frequency
changes, it is possible for a sound with a larger pressure amplitude to
be heard as quieter than a sound with a lower pressure amplitude (for
example, if they are at different frequencies). [recall the equal
loudness contour of the human auditory system shown in the first
lecture]
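The decibel conversion used for SPL in the loudness slides can be sketched as a pair of helper functions. The reference pressure of 2 × 10⁻⁵ Pa and the 20 Pa threshold of pain come from the slides; the function names are hypothetical.

```python
import math

P_REF = 2e-5  # reference pressure in Pa (threshold of hearing at 1 kHz)

def db_spl(pressure_pa, p_ref=P_REF):
    """Convert a sound pressure in Pa to dB(SPL)."""
    return 20.0 * math.log10(pressure_pa / p_ref)

def pressure_from_db_spl(level_db, p_ref=P_REF):
    """Convert a level in dB(SPL) back to a sound pressure in Pa."""
    return p_ref * 10 ** (level_db / 20.0)

threshold_of_hearing_db = db_spl(2e-5)  # reference pressure itself
threshold_of_pain_db = db_spl(20.0)     # 20 Pa

# Pressure ratio corresponding to a 5 dB step
ratio_5db = pressure_from_db_spl(5.0) / pressure_from_db_spl(0.0)
```

The two thresholds come out at about 0 dB and 120 dB, and a 5 dB step corresponds to a pressure ratio of roughly 1.78.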
Demos for Loudness Perception
 Resources: Audio Box CD from Univ. of Victoria

Decibels vs Loudness
Starting with a 440Hz tone (i.e. note A4), then it is reduced
1dB each step
Starting with a 440Hz tone (i.e. note A4), then it is reduced
3dB each step
Starting with a 440Hz tone (i.e. note A4), then it is reduced
5dB each step

Intensity vs Loudness
Various frequencies played at a constant SPL
A reference tone is played and then the same tone is played
5dB higher; followed by the reference tone, and then the tone
8dB higher and finally the reference tone and then the one
10dB higher
Pitch Perception
 What is pitch? Pitch
• is “the attribute of auditory sensation in terms of which sounds may be
ordered on a musical scale extending from low to high” (American
Standards Association, 1960)
• is a “subjective” attribute, and cannot be measured directly. Therefore, a
specific pitch value usually refers to the frequency of a pure tone that
has the same subjective pitch as the sound. In other words, the
measurement of pitch requires a human listener (the “subject”) to make a
perceptual judgement. This is in contrast to the measurement in the
laboratory of, for example, the fundamental frequency of a complex tone,
which is an “objective” measurement. (Howard & Angus, 1996)
• is related to the repetition rate of the waveform of a sound; it therefore
corresponds to the frequency of a pure tone and to the fundamental
frequency of a complex tone. In general, sounds having a periodic
acoustic pressure variation with time are perceived as pitched sounds, and
sounds with a non-periodic acoustic pressure waveform as non-pitched sounds.
(Howard & Angus, 1996)
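The link between pitch and the repetition rate of the waveform can be illustrated with an autocorrelation-based estimate of the fundamental frequency. This is an objective measurement of repetition rate, not a model of perceived pitch, and it is only one of many possible estimators; the function name, search range and test tone are assumptions.

```python
import numpy as np

def estimate_f0_autocorr(x, fs, f_min=50.0, f_max=1000.0):
    """Estimate the repetition rate (Hz) from the strongest autocorrelation peak."""
    x = x - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # keep lags 0..N-1
    lag_min = int(fs / f_max)   # shortest period considered
    lag_max = int(fs / f_min)   # longest period considered
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / best_lag

fs = 16000
t = np.arange(2048) / fs

# Complex tone: 200 Hz fundamental plus two harmonics
tone = (np.sin(2 * np.pi * 200 * t)
        + 0.5 * np.sin(2 * np.pi * 400 * t)
        + 0.25 * np.sin(2 * np.pi * 600 * t))

f0 = estimate_f0_autocorr(tone, fs)
```

For this periodic complex tone the estimate lands on the 200 Hz fundamental; for a non-periodic signal such as noise there is no clear autocorrelation peak, in line with the pitched versus non-pitched distinction above.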
Existing Pitch Perception Theories
 ‘Place’ theory
 Spectral analysis is performed on the stimulus in the inner ear,
different frequency components of the input sound excite different
places or positions along the basilar membrane, and hence
neurones with different centre frequencies.
 ‘Temporal’ theory
 Pitch corresponds to the time pattern of the neural impulses
evoked by that stimulus. Nerve firings tend to occur at a particular
phase of the stimulating waveform, and thus the intervals between
successive neural impulses approximate integral multiples of the
period of the stimulating waveform.
 ‘Contemporary’ theory
 Neither of the theories is perfect for explaining the mechanism of
human pitch perception. A combination of both theories will benefit
the analysis of pitch perception.
Contemporary Theory
(Moore, 1982)
Demos for Pitch Perception
 Resources: Audio Box CD from Univ. of Victoria
These three demos show how pitch is perceived for signals of different
durations. In each track, bursts of sound of varying duration are played.
A different pitch is used in each of the three tracks.
Space Perception
 Sound localisation refers to judgements of the
direction and distance of a sound source, usually
achieved through the use of two ears (binaural
hearing).
 interaural time difference
 interaural intensity difference
 Although binaural hearing is crucial for sound
localisation, monaural perception is similarly effective
in some cases, such as in the detection of signals in
quiet, intensity discrimination, and frequency
discrimination.
Interaural Time Difference (ITD)
(Howard & Angus, 1996)
Interaural Time Difference (ITD)
 Based on the equation below, it can be shown that the maximum ITD
occurs at 90 degrees (considering the average head diameter), which
is $6.73 \times 10^{-4}$ s $= 673\ \mu$s.
$$\Delta t = \frac{r(\theta + \sin\theta)}{c}$$
where
$\Delta t$ - ITD (in s)
$r$ - Half the distance between the ears (in m)
$\theta$ - The angle of arrival of the sound from the median (in radians)
$c$ - Sound speed (in m/s)
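The maximum ITD quoted above can be checked numerically. The sketch assumes r = 0.09 m (half of the 18 cm head diameter used elsewhere in these slides) and c = 344 m/s; the function name is hypothetical.

```python
import math

def itd_seconds(theta_rad, head_radius_m=0.09, sound_speed=344.0):
    """ITD = r * (theta + sin(theta)) / c, with theta measured from the median."""
    return head_radius_m * (theta_rad + math.sin(theta_rad)) / sound_speed

# Source directly to one side (90 degrees from the median)
max_itd = itd_seconds(math.pi / 2)

# Source straight ahead (on the median plane)
itd_front = itd_seconds(0.0)
```

With these values the ITD at 90 degrees is about 673 µs, and it is zero for a source on the median plane.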
ITD and IPD
 The ear appears to use the interaural phase difference (IPD) caused
by the ITD in the two waves to resolve the sound direction.
 The phase difference is given by:
$$\Phi = 2\pi f \, \frac{r(\theta + \sin\theta)}{c}$$
where
$\Phi$ - The phase difference between the two ears (in radians)
$r$ - Half the distance between the ears (in m)
$\theta$ - The angle of arrival of the sound from the median (in radians)
$f$ - The frequency (in Hz)
$c$ - Sound speed (in m/s)
 When the phase difference is greater than 180 degrees, there is an
unresolvable ambiguity in the sound direction, as the source could be
to the left or to the right.
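The frequency at which the phase difference reaches 180 degrees follows from setting the IPD (2πf times the ITD) equal to π. A minimal sketch, again assuming r = 0.09 m and c = 344 m/s, with hypothetical function names:

```python
import math

def ipd_radians(freq_hz, theta_rad, head_radius_m=0.09, sound_speed=344.0):
    """IPD = 2 * pi * f * ITD, with ITD = r * (theta + sin(theta)) / c."""
    itd = head_radius_m * (theta_rad + math.sin(theta_rad)) / sound_speed
    return 2.0 * math.pi * freq_hz * itd

def ambiguity_frequency(theta_rad, head_radius_m=0.09, sound_speed=344.0):
    """Lowest frequency at which the IPD reaches pi radians (theta must be non-zero)."""
    itd = head_radius_m * (theta_rad + math.sin(theta_rad)) / sound_speed
    return 1.0 / (2.0 * itd)

# Worst case: source at 90 degrees, where the ITD is largest
f_amb = ambiguity_frequency(math.pi / 2)
```

Under these assumptions the ambiguity sets in at roughly 743 Hz for a source at 90 degrees; for smaller angles the ITD is shorter and the limit is higher.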
Interaural Intensity Difference (IID)
 Due to the shading effect of the head, the intensity of the sound
reaching each ear is also different. This difference is called the interaural
intensity difference (IID).
 When the sound source is on the median plane, the sound level at
each ear is equal, while the level at one ear progressively reduces,
and increases at the other, as the sources move away from the
median plane.
 The shading effect of the head is difficult to calculate; however,
experiments seem to show that the intensity ratio between the two
ears varies sinusoidally from 0 dB up to 20 dB with the sound direction
angle, for various frequencies.
 The shading effect is not significant unless the head is
about one third of a wavelength or more in size. For a head with a diameter of
18 cm, this corresponds to a minimum frequency (Howard & Angus,
1996) of:
$$f_{\min}(\theta = \pi/2) = \frac{1}{3}\left(\frac{c}{d}\right) = \frac{1}{3}\left(\frac{344\ \mathrm{m/s}}{0.18\ \mathrm{m}}\right) = 637\ \mathrm{Hz}$$
Shading Effect in IID
(Howard & Angus, 1996)
Timbre Perception
 According to the American Standards Association, timbre is defined as “that
attribute of sensation in terms of which a listener can judge that two
sounds having the same loudness and pitch are dissimilar”.
 Musically, it is “the quality of a musical note which distinguishes
different types of musical instruments.”
 It can be defined as “everything that is not loudness, pitch or spatial
perception”.
• Loudness <-> Amplitude (frequency dependent)
• Pitch <-> Fundamental Frequency
• Spatial perception <-> IID, IPD
• Timbre <-> ???
56
Physical Parameters
 Timbre relates to:
• Static spectrum (e.g. harmonic content of spectrum)
• Envelope of spectrum (e.g. the peaks in the LPC spectrum, which
correspond to formants)
• Dynamic spectrum (time evolving)
• Phase
• …
57
Static Spectrum
58
Spectrum Envelope
The spectral envelopes of the flute (upper figure)
and the piano (lower figure) show that the envelope
differs between musical instruments.
59
Dynamic Spectrum
This figure shows what the spectral envelope
of a trumpet sound looks like.
60
Demos for Timbre Perception
 Resources: Audio Box CD from Univ. of Victoria
Examples of differences in timbres
61
A Music Demo for Auditory
Transduction
 Perceptual mechanism of an auditory system
http://www.youtube.com/watch?v=46aNGGNPm7s
62
References
 M. Bosi and R.E. Goldberg, “Introduction to Digital Audio Coding and
Standards”, Springer, 2002.
 A. Kondoz, “Digital Speech Coding for Low Bit Rate Communication
Systems”, Wiley, 2001.
 D.M. Howard and J.A.S. Angus, “Acoustics and Psychoacoustics” (4th
Edition), Focal Press, 2009.
63