An automatic speech segmentation tool based on multiple acoustic parameters

Speech segmentation is required not only for linguistic research based on oral corpora, but has also become essential for natural language processing. Many researchers have developed different approaches to address the need for automatic segmentation of speech data. In this paper we discuss some of the prosodic parameters used as cues for boundary identification and present an ongoing project for the automatic segmentation of spontaneous speech developed for Brazilian Portuguese.


Introduction
Speech corpora are increasingly becoming important resources for different areas, not only in the field of theoretical and applied linguistics but also for the development of technologies such as text-to-speech/speech-to-text systems. However, information extraction from speech corpora requires the segmentation of the audio signal into discrete and meaningful linguistic units. The main goal of this paper is to propose the main features of a model for the automatic segmentation of spontaneous speech based on prosodic parameters in Brazilian Portuguese. We also discuss some of the literature concerning this topic.
The definition of the basic segmental unit of speech may vary according to the researcher's interests. It can be a word, or a linguistic structure smaller or larger than the word. Many tools have been developed for the segmentation of a speech signal into phonetic units smaller than the word, i.e. phones and syllables. Segmentation at the phonetic level is useful for several purposes, in particular for the extraction of parameters such as duration, fundamental frequency (F0) and intensity within each segment. However, that type of segmentation is not appropriate for information extraction at the semantic or morphosyntactic levels. Segmentation of the speech signal into words is also not ideal, since it does not allow a proper extraction and interpretation of various types of linguistic information, such as those related to scope and hierarchical relationships between the elements of phrases or other linguistic units that are relevant from a communicative point of view.
In this work we adopt the utterance and the tonal unit as the elementary linguistic units into which the speech flow should be segmented. The boundaries delimiting these units are signalled in the speech flow through prosodic parameters. In the following sections, we present the concepts of these units and discuss some of the literature concerning acoustic parameters related to speech segmentation. We also present a proposal for a model for an automatic speech segmentation tool based exclusively on the analysis of acoustic cues obtained from the audio signal.

Elementary linguistic units for spoken communication
The problem of identifying phrase and utterance boundaries in speech is not a new one. The idea that prosody is an important component of spoken discourse organization is acknowledged by a great number of scholars. Since the 1970s, studies of the nature of speech phrasing and of the relation between segmental and suprasegmental structures have given rise to different approaches (Halliday 1970; Lehiste 1972; Nespor & Vogel 1986; Chafe 1987; Moneglia & Cresti 1997). Although there is some consensus concerning a relation between prosodic parsing and syntactic structure, the precise nature of this relation is still not fully understood, even though an absolute isomorphism between the two domains is no longer advocated. Several studies on different languages have demonstrated that prosodic parsing of speech is a highly prominent perceptual phenomenon (Batliner et al. 1995; Cummins 1998; Moneglia et al. 2010). Listeners can not only detect the presence of prosodic boundaries, but are also able to differentiate discourse finality from continuation according to the perception of terminal and non-terminal boundaries. This is true even when the speech fragments are resynthesised in an unintelligible way by means of spectral filters (Swerts et al. 1992), or when the listener is not proficient in the language (Carlson et al. 2005).
Studies on different languages have also pointed to a strong connection between prosodic parsing and information/discourse structure (Swerts et al. 1992; Chafe 1993; Cresti & Moneglia 2010; Izre'el 2005; Kibrik 2012). There is sufficient evidence from corpus-driven and corpus-based research that the segmentation of speech should be based on prosodic criteria. In accordance with this evidence, we adopt the assumption that prosodic boundaries signal the segmentation of speech into meaningful communicative units of spoken discourse.
These elementary linguistic units of spoken discourse can be defined within the theoretical framework of the Language Into Act Theory (Cresti 2000; Moneglia & Raso 2014). According to this theory, the speech flow is parsed into utterances and smaller tonal units by means of prosodic boundaries (Crystal 1975) interpreted by the listener as having either a terminal (concluded/autonomous) or a non-terminal (non-concluded/non-autonomous) value. The term utterance is defined as every linguistic unit that has both pragmatic and prosodic autonomy in discourse, delimited within the speech flow by a prosodic boundary perceived as terminal. If the unit carries an illocutionary value (Austin 1962), then the unit is pragmatically (communicatively) autonomous.
Utterances can be produced as a single tonal unit or they can be parsed into two or more tonal units by means of non-terminal prosodic boundaries (Moneglia & Cresti 2006). Example (1) shows a sequence of three simple utterances (Figure 1) and Example (2) shows a compound utterance with two tonal units (Figure 2) (observe the single slash after sai in Example 2).
(1) é a terceira // vão lá // foi // (bpubdl03, 50-52)
'it's the third' 'let's go' 'go'

(2) quando sai / nũ é stop // (bfamdl32, 39)
'when (you're) out' 'it isn't stop'

These examples were selected from the C-ORAL-BRASIL I corpus (Raso & Mello 2012). This corpus comprises 139 informal spontaneous speech recordings and provides audio files, transcriptions and text-to-speech alignment. Double slashes indicate utterance boundaries and single slashes signal tonal unit boundaries. It is important to note that the segmentation of the speech flow into tonal units and utterances is based exclusively on the annotator's perception of terminal and non-terminal prosodic boundaries.
Different acoustic cues appear to be involved in the delimitation of prosodic boundaries. Among these, silent pauses and lengthening of the pre-boundary syllable appear to be the most salient. Examples (3) and (4) show terminal (Figure 3) and non-terminal boundaries (Figure 4) followed by silent pauses. Examples (5) and (6) show pre-boundary lengthening associated with non-terminal and terminal boundaries respectively. In Figures 5 and 6 the normalized durations1 of Vowel-to-Vowel (VV) units (Barbosa 2013) were plotted as continuous red lines. Rising lines indicate an increase in duration and yellow arrows indicate duration peaks associated with boundaries. VV units are transcribed using a broad phonetic transcription with ASCII characters. In (5), the stressed vowels in the segments az (from casa - [kaza]) and ig (from barriga - [baRiga]) are both lengthened. Both occur in pre-boundary position. In contrast, the non-terminal boundary after the word não is not accompanied by pre-boundary lengthening. This is indicated by the green arrow in Figure 5. In (6), there is a lengthening of the segment eNtR (from ventre - [veNtR]). This segment is in a pre-boundary position that corresponds to the end of the utterance; therefore, the lengthening here indicates a terminal boundary. This is indicated by the yellow arrow in Figure 6.
(6) você não nasceu do meu ventre // (bfammn05, 63)
'you weren't born of my womb'

As can be noted from examples (3-6), silent pauses and syllabic lengthening are prosodic cues to a boundary. Nevertheless, these parameters alone do not account for the identification of all boundaries. Silent pauses may or may not occur at a perceived boundary: an estimated 33% of utterance boundaries and 62% of tonal unit boundaries do not coincide with a silent pause in the C-ORAL-BRASIL corpus (Raso et al. 2015). Silent pauses and syllabic lengthening are also not good predictors of the boundary type, since they can co-occur with terminal and non-terminal boundaries alike, as exemplified in (3-6).
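To make the duration cue concrete, here is a minimal sketch of how lengthened VV units can be located by normalizing durations as z-scores and picking local peaks. This is only a simplified stand-in for Barbosa's smoothed z-score method: it normalizes against the excerpt's own mean and standard deviation rather than segment-type reference values, and the peak threshold is an arbitrary assumption.

```python
import statistics

def vv_z_scores(durations):
    """Normalize VV-unit durations (in seconds) as z-scores over the excerpt."""
    mu = statistics.mean(durations)
    sd = statistics.stdev(durations)
    return [(d - mu) / sd for d in durations]

def duration_peaks(z, threshold=1.0):
    """Indices of local z-score maxima above threshold:
    candidate pre-boundary lengthening sites."""
    return [i for i in range(1, len(z) - 1)
            if z[i] > threshold and z[i] > z[i - 1] and z[i] > z[i + 1]]

# The fourth VV unit is markedly longer than the rest,
# so it surfaces as the only duration peak.
durs = [0.12, 0.13, 0.11, 0.25, 0.12, 0.13]
print(duration_peaks(vv_z_scores(durs)))  # prints [3]
```

In a real system the peaks would then be matched against perceived boundary positions, as done with the yellow arrows in Figures 5 and 6.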

Acoustic parameters for the segmentation of speech into discrete units
The acoustic correlates of prosodic boundaries have been studied for some time.
The development of powerful and accessible information technologies (personal computers, digital recorders, digital storage media) opened the possibility of studying naturally occurring speech fragments using statistical models. Since the second half of the 1980s, many studies have investigated specific aspects of individual prosodic parameters, as well as groupings of them, that signal boundaries.
From these studies, it becomes clear that the perception of boundaries depends on the occurrence of a set of different prosodic features, such as a silent pause, lengthening of the pre-boundary syllable, a rise or fall in F0, changes in intensity across the boundary, and creaky voice over the pre-boundary syllables. Among these, silent pauses and lengthening of the pre-boundary syllable have been regarded as the most important predictors of boundary perception and will be further discussed in the following sections.

Silent pause
Silent pause is the most studied parameter of speech segmentation (Martin 1970; Swerts 1997; Shriberg et al. 2000; Tseng & Chang 2008; Mo & Cole 2010; Tyler 2013). The analysis of this parameter points to two main results. On the one hand, long pauses are cues related to strong edge marking. On the other hand, attempts to correlate boundary type (either terminal or non-terminal) with silent pauses show inconsistent results.
Overall, extra-long pauses are associated with the completion of speech or a change of topic (paragraphing). In that respect, they are a useful and important parameter for automatic speech processing. However, pauses are not always present at utterance boundaries, and many long pauses do occur between tonal units belonging to the same utterance, as exemplified in (4). Thus, the presence or absence of a silent pause does not provide detailed information about the prosodic parsing of spoken discourse.
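As a rough illustration of how silent pauses can be located automatically, the sketch below scans a frame-by-frame intensity contour for stretches below an intensity floor. The frame step, floor and minimum pause duration are illustrative assumptions, not values established in the segmentation literature.

```python
def find_silent_pauses(intensity_db, frame_s=0.01, floor_db=40.0, min_dur_s=0.10):
    """Return (start_s, end_s) spans where intensity stays below floor_db
    for at least min_dur_s. One intensity_db value per analysis frame."""
    pauses, start = [], None
    for i, db in enumerate(intensity_db):
        if db < floor_db:
            if start is None:
                start = i          # silence begins
        else:
            if start is not None and (i - start) * frame_s >= min_dur_s:
                pauses.append((start * frame_s, i * frame_s))
            start = None
    # close a pause that runs to the end of the contour
    if start is not None and (len(intensity_db) - start) * frame_s >= min_dur_s:
        pauses.append((start * frame_s, len(intensity_db) * frame_s))
    return pauses

# 0.20 s of speech, a 0.15 s silent stretch, then speech again.
contour = [60.0] * 20 + [30.0] * 15 + [60.0] * 20
print(find_silent_pauses(contour))  # one pause, roughly (0.20, 0.35)
```

Note that a detector like this finds pauses but, as discussed above, cannot by itself tell terminal from non-terminal boundaries.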

Pre-boundary lengthening
Syllabic lengthening has also received much attention in studies on boundary detection. Syllables in pre-boundary position are often much longer than syllables in other positions (Wightman et al. 1992; Barbosa 2008; Mo et al. 2008; Fuchs et al. 2010; Fon et al. 2011; Tyler 2013). This parameter has proved to be quite relevant for automatic speech segmentation.
Syllabic lengthening, however, does not signal speech boundaries only; it can also signal emphasis. Although the portion of the syllable that is lengthened differs between pre-boundary position and emphatic use (Campbell 1993; Barbosa 2008), delimiting utterances and tonal units in spoken discourse based on duration would require a very fine analysis (at the phone level) for the appropriate training of an automatic segmentation system. Furthermore, there is so far no evidence that the duration of the pre-boundary syllable could distinguish between the two boundary types.

Other parameters
Traditionally, the reset of the fundamental frequency (F0) is considered a cue of an intonational phrase boundary. Intonational phrasing can be defined as a structured hierarchy of the intonational constituents in natural speech, dominated by boundary tones (Crystal 2008). The first and last stressed syllables of an intonational phrase delimit a span where there is a gradual declination throughout the whole unit (Couper-Kuhlen 2006). An intonational phrase is usually coextensive with an utterance, but that is not always the case. In fact, some studies have shown that the reset of F0 does not seem to be a sufficient parameter to differentiate boundaries that occur between intonational phrases within an utterance from those delimiting the utterances themselves (Schuetze-Coburn et al. 1991; Couper-Kuhlen 2006).
Furthermore, variations in speech rate often signal boundaries between units, as observed in various studies. Generally, a change in speech rate is observed between the end of a unit and the beginning of the subsequent one (Amir et al. 2004; Tyler 2013). However, this parameter is closely related to the style of speech and the particular characteristics of the speaker.
Intensity is also used as an auxiliary parameter in boundary identification, since it exhibits a declination line similar to F0 declination. Moreover, an increase in intensity can be related to the beginning of a prosodic unit (Swerts et al. 1994; Tseng & Fu 2005; Mo 2008).
Finally, laryngealization (creaky voice) has also been pointed out as an acoustic cue for the identification of prosodic boundaries. Studies on different languages indicate that laryngealization occurs mainly at prosodic boundaries, that is, the final portion of utterances or intonational phrases is often creaky (Kohler 1994; Redi & Shattuck-Hufnagel 2001; Ogden 2001; Garellek 2015). It also seems to be related to fragmentation and disfluency phenomena (Kohler 1994; Kohler et al. 2001).

A model for an automatic speech segmentation tool
From the literature reviewed above, it is clear that an automatic speech segmentation task will not be accurate if it is based on a single prosodic parameter. In particular, the task of differentiating between terminal and non-terminal prosodic boundaries cannot be achieved without a better understanding of the possible sets of parameters that correlate with each boundary type. Therefore, accurate measures of the acoustic correlates of terminal and non-terminal boundaries are crucial to perform the segmentation of speech into utterances and tonal units.
We consider that an automatic segmentation system for spontaneous speech should:
- be able to identify and differentiate terminal and non-terminal boundaries with a minimal margin of error;
- be based on acoustic data only, and not depend on syntactic parsing or any other level of previous linguistic analysis;
- require the least possible amount of human annotation for segmentation training.
To achieve these goals, we adopt the following procedures. First, we prepare a speech sample of audio files. The files correspond to fragments of 100-200 words from texts of different speech styles and speakers (male and female). Speech samples are selected from the C-ORAL-BRASIL reference corpus for spontaneous Brazilian Portuguese (Raso & Mello 2012; Raso & Mello 2014). Excerpts are taken from spontaneous monologues in informal and formal natural contexts and also from media (news). Each audio transcript is then annotated by a team of 14 expert prosodic boundary annotators. Annotators work independently of one another. While they listen to the audio, they insert tags for non-terminal and terminal prosodic boundaries on the transcript according to their perception. All boundary tags are counted according to type (non-terminal and terminal) and position (pre-boundary phonological word). This information is then transferred to point tiers in a Praat TextGrid object2.

Next, the audio files are annotated with Praat TextGrid objects with five tiers:
- Vowel-to-Vowel interval tier: interval tier with all phonetic syllables delimited by two consecutive vowel onsets, accompanied by a broad phonetic transcription;
- Phonological Word point tier (non-terminal boundary): point tier with points at every phonological word boundary (potential tonal unit boundary locations), each accompanied by a label stating how many annotators (0-14) signaled that point as a non-terminal boundary;
- Phonological Word point tier (terminal boundary): point tier with points at every phonological word boundary, each accompanied by a label stating how many annotators (0-14) signaled that point as a terminal boundary;
- Silence: interval tier delimiting silent pauses;
- Text: textual transcription of utterances.

In order to generate a model for the automatic annotation of prosodic boundaries, the Praat script ProsodyDescriptor (Barbosa 2013) is being adapted for the extraction of prosodic parameters at each point indicated in the Praat annotation object. The script uses the corresponding audio file and the annotated tiers to extract and calculate the following parameters:

Measures of speech rate and rhythm:
a. speech rate in VV units per second;
b. rate of non-salient VV units per second;

Measures of segment duration and duration normalization3:
c. mean, standard deviation and skewness of smoothed z-score peaks;
d. smoothed z-score local peak rate in peaks per second;

Measures of fundamental frequency (F0) and F0 normalization:
e. F0 median in semitones (re 1 Hz);
f. F0 standard deviation in semitones;
g. F0 Pearson skewness4;
h. 1st-derivative F0 mean in Hz/second (the value is multiplied by 1000 for scaling purposes);
i. 1st-derivative5 F0 standard deviation in Hz/second;
j. 1st-derivative F0 skewness;
k. smoothed F0 peak rate in peaks per second;

Measure of intensity:
l. spectral emphasis in dB.

3 See footnote 1.
4 Skewness is a statistical measure of symmetry (or lack of it) in a distribution.
5 The first derivative indicates whether, and by how much, a function is increasing or decreasing. All statistical measures extracted here are used to determine the direction of the pitch movement (rising or falling) and the shape of a smoothed and normalized pitch curve.
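The F0 measures listed under (e)-(j) can be approximated outside Praat as well. The sketch below is not the ProsodyDescriptor script itself, just a plain-Python illustration of the semitone conversion, Pearson skewness and first-derivative statistics, assuming an F0 track sampled at a fixed step dt.

```python
import math
import statistics

def f0_measures(f0_hz, dt=0.01):
    """Summary statistics over an F0 track sampled every dt seconds."""
    # (e, f, g): F0 in semitones re 1 Hz, i.e. st = 12 * log2(f0 / 1 Hz)
    st = [12 * math.log2(f) for f in f0_hz]
    # (h, i, j): first derivative of F0 in Hz/second
    deriv = [(f0_hz[i + 1] - f0_hz[i]) / dt for i in range(len(f0_hz) - 1)]
    mean_st = statistics.mean(st)
    median_st = statistics.median(st)
    sd_st = statistics.stdev(st)
    return {
        "median_st": median_st,                              # (e)
        "sd_st": sd_st,                                      # (f)
        "pearson_skew": 3 * (mean_st - median_st) / sd_st,   # (g)
        "deriv_mean": statistics.mean(deriv) * 1000,         # (h), scaled
        "deriv_sd": statistics.stdev(deriv),                 # (i)
    }

# A symmetric rise-fall: the mean first derivative cancels out to zero.
m = f0_measures([100.0, 110.0, 120.0, 110.0, 100.0])
```

The derivative statistics capture the direction and steepness of the pitch movement around a candidate boundary, which is what the script uses them for.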
The script extracts these parameters from a window of 10 VV units to the left and 10 VV units to the right of each potential boundary. Figure 7 shows an example of the audio and annotation grid in Praat with the analysis windows (shaded in yellow) for the boundary point indicated by the red arrow. With the acoustic parameter values and the inter-annotator agreement on boundary perception, a logistic regression model will be used to predict the likelihood of boundary realization from the acoustic parameters in the sample, separately for the two types of boundary.
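The regression step can be sketched as follows. This toy version, written from scratch rather than with a statistics package, uses only two made-up predictors (pause duration and a duration z-score peak) and hand-built labels; the actual model will be fit on the full parameter set and the annotator agreement counts.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # prediction error for this example
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def boundary_prob(w, b, x):
    """Predicted probability that point x is a realized boundary."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy training data: [pause duration (s), duration z-score peak],
# labelled 1 where annotators agreed on a boundary.
X = [[0.00, 0.1], [0.05, 0.3], [0.02, 0.2],
     [0.40, 1.8], [0.60, 2.2], [0.35, 1.5]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
```

One such model would be fit for terminal boundaries and another for non-terminal ones, each returning a probability at every phonological word boundary.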

Final remarks
Observations of spontaneous speech corpora such as C-ORAL-BRASIL I (Raso & Mello 2012), C-ORAL-ROM (Cresti & Moneglia 2005) and the Santa Barbara Corpus (Du Bois et al. 2000-2005) show that final boundaries, i.e. boundaries that delimit utterances (prosodically/pragmatically autonomous linguistic units), can be either perceptually strong or weak, and the same is also true for continuative/non-final boundaries. This means that boundary strength (perceptually weak vs strong boundaries) does not necessarily overlap with boundary type (terminal vs non-terminal boundaries), especially in spontaneous speech.
Silent pause and pre-boundary syllable lengthening have been successfully used as cues to automatic segmentation of speech.However, these parameters seem to be better correlates of boundary strength (perception of weak vs. strong boundaries) than of boundary type (terminal vs. non-terminal), since neither has proved to guarantee the distinction between final and non-final boundaries.Also, a system based on pre-boundary syllable lengthening for recognition of tonal unit boundaries requires the manual syllabic segmentation and annotation of a large volume of data, which takes a great amount of time and skilled human resources.
So far, there is no model that correlates different sets of prosodic parameters with terminal and non-terminal boundaries, as proposed in this paper. For the reasons pointed out above, we believe that the extraction of multiple acoustic parameters could provide a more complete probabilistic model for automatic boundary identification in spontaneous speech.

Figure 7. Illustration of audio file, TextGrid and analysis windows for data extraction.