Modeling the prosody of Vietnamese language for speech synthesis

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY
-------------------------------
Thesis for the degree of MASTER OF SCIENCE
Modeling the prosody of Vietnamese language for speech synthesis
Speciality: “Information processing and Communication”
Code: 23.04.3898
MẠC ĐĂNG KHOA
Supervisor: Prof. PHẠM THỊ NGỌC YẾN
Hanoi, 2007
Faculty of Information Technology
International research center of Multimedia Information, Communication and Application


Master thesis Mạc Đăng Khoa

Acknowledgment

Many people gave me generous help and inspiration during my time as a master's student. First, I would like to express my deep respect and gratitude to my supervisors, Dr. Eric Castelli and Prof. Phạm Thị Ngọc Yến. Thank you very much for orienting and guiding my research in the speech processing domain, and for all your useful advice, your honest criticism and your patience during my master's research.

Special thanks also go to Mrs. Geneviève Caelen-Haumont, to PhD students Trần Đỗ Đạt and Vũ Minh Quang, and to all members of MICA’s speech group. I could not have completed this thesis without your support. Thank you all for your suggestions and your sincere remarks on every part of my research.

I would like to thank Ms. Đoàn Thị Ngọc Hiền, who guided me in recording the corpus. I would also like to thank the many MICA members who spent much of their time recording and testing for my research.

I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me with the most favorable conditions while I was working at the International Research Center MICA.

Finally, I owe a great deal to my parents and my sister for their continued support. I also give very special thanks to my girlfriend for her constant encouragement, which gave me strength and motivation in my work and in my life.

Abstract

A Text-To-Speech (TTS) system is a computer system that produces speech from text. In a TTS system, the naturalness of the produced speech depends greatly on the variation of pitch, duration and energy during speaking. We call this the “prosody controlling ability”. A TTS system with good prosody controlling ability can simulate human speech prosody appropriate to the speaking context.
For tonal languages such as Vietnamese, the prosody of an utterance results from the combination of two components: "micro-prosody", corresponding to the tone of each syllable in a sentence, and "macro-prosody", corresponding to the whole sentence. The main goal of this thesis is to model the characteristics of Vietnamese prosody for speech synthesis. It focuses on the influence of macro-prosody on micro-prosody in three types of sentence: assertive, interrogative and imperative. The first task is to set up a “prosody corpus” and extract all possible prosody parameters. Based on the extracted data, we defined seventy-two simple prosody patterns for Vietnamese syllables in the three types of sentence. These patterns were then applied to synthesize some simple sentences. Finally, perception experiments were carried out to evaluate the synthesized sentences. The results showed that the proposed patterns can be applied successfully to generate the prosody of simple sentences. This is our preliminary work on Vietnamese prosody and considers only the sentence type and the position of the syllable in the sentence. In the future, we expect to continue this research with more factors of Vietnamese prosody, improve our patterns and apply them to a Vietnamese TTS system.
List of Figures

Figure 1-1: Category of methods for predicting syllable duration [6]
Figure 2-1: Example of the contours of six tones, as described in [21]
Figure 2-2: The shape of Tone 1 with female and male voice [18]
Figure 2-3: The shape of Tone 2 with female and male voice [18]
Figure 2-4: The shape of Tone 3 with female and male voice [18]
Figure 2-5: The shape of Tone 4 with female and male voice [18]
Figure 2-6: The shape of Tone 5 with female and male voice [18]
Figure 2-7: The shape of Tone 5b with female and male voice [18]
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Figure 2-9: The shape of Tone 6b with female and male voice [18]
Figure 2-10: Sentence classification by structure [20]
Figure 2-11: The sentences “Lan thích ăn cơm không” in ...
Figure 2-12: The sentences “Bảo cố gắng tập đi” in ...
Figure 2-13: The sentences “Tân bỏ đi chứ” in ...
Figure 2-14: The differences of F0 contour between Assertive and Interrogative sentence [16]
Figure 3-1: A general function diagram of TTS system [13]
Figure 3-2: Fujisaki model
Figure 3-3: Fujisaki model for tonal language [19]
Figure 3-4: Function diagram of proposal TTS system
Figure 3-5: Prosody generation module
Figure 4-1: Key-syllable segmentation
Figure 4-2: Extracting F0 contour using PRAAT
Figure 4-3: An example of prosody pattern
Figure 5-1: An example of synthesized non-sense phrase
Figure 5-2: Perception test 1
Figure 5-3: An example of synthesized multi-type sentences
Figure 5-4: Interface for Perception test 2
Figure 5-5: Correct recognition rate with 8 tones of last syllable
Figure 5-6: Correct recognition rate (%) with other types of sentences
Figure 5-7: Result comparison of three experiments

List of Tables

Table 1.1: Prosody functions
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Table 1.3: Intonation model classification
Table 2.1: Vietnamese vowels.
Table 2.2: Vietnamese consonants.
Table 2.3: Arrangement of Vietnamese consonants.
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14].
Table 2.5: The six Vietnamese tones
Table 3.1: Comparison between direct pattern and model pattern
Table 4.1: Prosody corpus structure
Table 4.2: Prosody corpus text information
Table 4.3: Recording information of Prosody corpus
Table 5.1: Confusion matrix (in %) for 8 tones with male voice
Table 5.2: Confusion matrix (in %) for 8 tones with female voice
Table 5.3: Confusion matrix (%) of sentence types with male voice
Table 5.4: Confusion matrix (%) of sentence types with female voice
Table 5.5: Test data for Experiment 2
Table 5.6: Confusion matrix (in %) of sentence types (with male voice)
Table 5.7: Confusion matrix (in %) of sentence types (with female voice)
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Table 5.9: Correct recognition rate (%) with other types of sentences
Table 5.10: Result of three experiments

Table of contents

Acknowledgment
Abstract
List of Figures
List of Tables
Table of contents
0 INTRODUCTION
1 PROSODY AND PROSODIC MODEL
1.1. Overview of prosody
1.1.1. The concept of prosody
1.1.2. Major components of prosody
1.1.3. The functions of prosody
1.1.4. Levels of representation of prosodic phenomena
1.2. Prosody modeling
1.2.1. Intonation models
1.2.2. Duration modeling
1.2.3. This thesis work approach
2 VIETNAMESE LANGUAGE AND PROSODY
2.1. Vietnamese language
2.1.1. Vietnamese characteristics
2.1.2. Vietnamese phoneme system
2.1.3. Syllable structure
2.2. Vietnamese prosody
2.2.1. Micro-prosody and tones system in Vietnamese
2.2.2. Macro-prosody and sentence types in Vietnamese
2.2.3. Some special phenomena in Vietnamese prosody
3 TTS SYSTEM AND PROSODY GENERATION
3.1. An overview of TTS system
3.2. Prosody generation
3.2.1. Overview of prosody generation
3.2.2. From text to prosody
3.3. Other researches and our proposal
4 PROSODY PATTERNS EXTRACTION
4.1. Prosody corpus
4.1.1. Objectives
4.1.2. Define the corpus text
4.1.3. Recording
4.1.4. Sentence segmentation
4.2. Analysis and extracting prosody parameters
4.2.1. Segmentation
4.2.2. Extracting prosody parameters of key-syllable
4.3. Proposal the patterns for Vietnamese prosody
4.3.1. Methodology
4.3.2. Prosody patterns
4.3.3. Some visual remarks on extracted patterns
5 EXPERIMENTS AND EVALUATION
5.1. Experiment 1: Tone and non-sense phrase
5.1.1. Objectives
5.1.2. Method and Implementation
5.1.3. Results and discussion
5.2. Experiment 2: Multi-type sentences
5.2.1. Objectives
5.2.2. Method and Implementation
5.2.3. Results and discussion
5.3. Comparison and conclusion
6 CONCLUSION AND PERSPECTIVES
REFERENCES
APPENDIX
A. Text for prosody corpus
B: Datasheet of prosody patterns

0 Introduction

Speech is the primary means of communication between people. Speech synthesis, the automatic generation of speech waveforms, has been under development for several decades. Recent progress has produced synthesizers with very high intelligibility, but sound quality and naturalness remain a major problem. Most recent research attempts to improve the naturalness of synthesized speech so that it approaches human speech. In Vietnam, there are currently some Vietnamese synthesis systems, such as VnVoice (developed by the Institute of Information Technology) and HoaSung (developed by the International Research Center MICA). These research efforts have obtained some encouraging results.
However, before these systems can be released to the market, the quality of the produced speech must be improved, especially the naturalness of its prosody. This thesis therefore studies the characteristics of Vietnamese prosody so that they can be applied to speech synthesis.

This work was carried out at the International research center of Multimedia Information, Communication and Application (MICA) and is part of MICA’s VN-Synthesis project. Together with the research of PhD student Tran Do Dat at MICA, we have already developed a speech synthesis system based on the concatenation of sound samples. The first version can now produce sound from a detailed text description, which consists of:

• The sequence of phonemes composing the utterance: this can be obtained automatically from the raw text using a “phonetization” module, whose development is currently underway.
• All information related to voice modulations: mostly the pitch, energy and duration variations that constitute the intonation or prosody of the uttered statement. We call it the “prosody description”.

For tonal languages such as Vietnamese, the prosody of speech is composed of two components, which we call “micro-prosody” and “macro-prosody”:

• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is essential for distinguishing the syllable’s tone. Thus, the meaning of the synthesized sound depends greatly on the quality of the micro-prosody.
• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence, the speaker's intentions, emotions, etc. Therefore, the "naturalness" of synthesized speech depends on the ability to control macro-prosody during the speech synthesis process.
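The detailed text description introduced above (a phoneme sequence plus a “prosody description”) can be pictured as a small data structure. The following Python sketch is only an illustration under our own assumptions: the class and field names (SyllableProsody, ProsodyDescription, f0_hz, duration_ms, energy_db) are hypothetical and are not the input format of the MICA synthesizer, and the numeric values are arbitrary placeholders, not measurements.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyllableProsody:
    """Micro-prosody of one syllable (hypothetical field names)."""
    phonemes: List[str]   # phoneme sequence of the syllable
    f0_hz: List[float]    # sampled pitch contour, in Hz
    duration_ms: float    # syllable duration, in milliseconds
    energy_db: float      # mean intensity, in dB


@dataclass
class ProsodyDescription:
    """Sentence-level container (hypothetical): macro-prosody label plus syllables."""
    sentence_type: str    # e.g. "assertive", "interrogative", "imperative"
    syllables: List[SyllableProsody] = field(default_factory=list)

    def total_duration_ms(self) -> float:
        # Sum of the syllable durations; pauses are ignored in this sketch.
        return sum(s.duration_ms for s in self.syllables)


# Placeholder values only, chosen to show the shape of the data.
description = ProsodyDescription(
    sentence_type="assertive",
    syllables=[
        SyllableProsody(["t", "a"], [120.0, 118.0], 200.0, 65.0),
        SyllableProsody(["t", "a"], [120.0, 90.0], 250.0, 63.0),
    ],
)
print(description.total_duration_ms())
```

In this picture, the per-syllable fields carry the micro-prosody, while the sentence type is the macro-prosody factor that the rest of the thesis relates to them.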
Objectives and Tasks

This thesis is part of MICA's speech synthesis research, and its main goal is to extract the characteristics of Vietnamese prosody in order to generate the “prosody description” for speech synthesis. In this thesis, we focus only on the differences between Vietnamese tones in different positions in the sentence and in different types of sentence. In other words, we study the influence of macro-prosody on micro-prosody.

The first task is to set up a corpus for research on Vietnamese prosody. With this corpus, we extract and analyze the fundamental frequency, duration and intensity of syllables carrying the eight Vietnamese tones, in three positions and in three types of sentence.

After that, using these prosody parameters, we define simple prosody patterns for Vietnamese tones, corresponding to the syllable cases in the three types of sentence: assertive, interrogative and imperative. By applying these patterns to re-synthesize some simple sentences and carrying out perception experiments, we can examine the appropriateness of these prosody patterns.

Thesis outline

This thesis is structured as follows:

• Chapter 1 starts with Section 1.1, which gives some background on prosody together with definitions of the terms used in this thesis. Section 1.2 briefly presents prosody modeling and some prosodic models.
• Chapter 2 gives an overview of the Vietnamese language and Vietnamese prosody.
• Chapter 3 starts with an introduction to Text-to-Speech systems, the general structure of a TTS system, and prosody generation. In the last section of this chapter, we present some related work and propose a simple structure for the prosody generation module of a TTS system.
• Chapter 4: Sections 4.1 and 4.2 describe our work on setting up and analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a set of prosody patterns for Vietnamese syllables.
• Chapter 5 presents a series of perception experiments for evaluating the proposed patterns.
• Chapter 6 concludes the thesis and gives suggestions for further work.

1 Prosody and Prosodic model

In this chapter, we give an overview of prosody and explain some terms used in this thesis. The concept of prosody modeling and some prosodic models are then briefly presented.

1.1. Overview of prosody

1.1.1. The concept of prosody

There is no exact definition of the term “prosody”. We can use the term broadly, meaning “a time series of speech-related information that is not predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to the phoneme sequence” [1]. Viewed in the large, prosody is a parallel channel for communication, carrying some information that cannot be simply deduced from the lexical channel. All aspects of prosody are transmitted by muscle motions, and in most of them, the recipient can perceive, fairly directly, the motions of the speaker. Clearly, with that broad definition, hand gestures and eyebrow and face motions can be considered prosody, because they carry information that modifies and can even reverse the meaning of the lexical channel.

However, in the domain of speech processing, we concentrate on the speech aspect of prosody. Thus, prosody could include “Pitch”, “Duration” and “Stress”. In terms of the speech signal, prosody is represented by three components: “Fundamental frequency (F0)”, “Duration” and “Intensity”.

“Prosody” and “Intonation”

The term prosody refers to certain properties of the speech signal such as audible changes in pitch, loudness, and syllable length. For some authors the set of prosodic features also includes other aspects related to speech timing such as rhythm and speech rate.
[13] Some authors use the term intonation as a synonym for prosody; others restrict it to the tonal (melodic) aspects of prosody. In this thesis, intonation refers to pitch variation in speech production and is part of prosody [13]. In other words, we have:

Prosody = Intonation + Duration

1.1.2. Major components of prosody

As discussed above, prosody consists of:

• Pitch (Fundamental frequency): Among prosodic events, the most overt are changes in pitch, which together constitute the pitch contour of the utterance (the F0 contour of the speech signal). Analyses of sentence-level pitch contours show that the pitch contour of longer utterances can be broken down into a sequence of elementary contours, which can further be divided into syllabic contours. [13]
• Duration: duration in prosody concerns the length of a sentence, phrase, word, syllable, voiced part of a syllable, syllabic nucleus, and so on. The duration of syllables and speech sounds depends on several (dependent or interdependent) factors such as speech rate, rhythm, phonetic nature, etc. In most cases, the absolute duration of an event is easily measured. Sometimes, however, the boundary of an event is not obvious to define.
• Stress (Intensity): stress is a prosodic property that has been described since the very first work on prosody in phonetics. It was said to be related to loudness and phonology force. Both these characterizations refer to the perceptual form of prosody: the syllable carrying stress is prominent with respect to the surrounding syllables, either due to its loudness or to its dynamic properties.

1.1.3. The functions of prosody

Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-linguistic information. Linguistic functions such as stress and tone tend to be expressed as local excursions of pitch movement.
Intonation types and para-linguistic functions may affect the global pitch setting, in addition to characteristic local pitch excursions near the edge of the sentence (i.e. boundary tones). [1]

Prosody used to convey lexical meaning: stress, accentual and tone languages.

• Stress language: English is an example of a stress language. Stress location is part of the lexical entry of each English word. For example, "apple" and "orange" both have stress on the first syllable, while "banana" has stress on the second syllable. When an English word is spoken in isolation with declarative intonation, F0 typically peaks on the stressed syllable.
• Accentual language: Japanese is an example of an accentual language. A word is lexically marked as accented (on a particular syllable) or unaccented. A simplified description is that pitch rises near the beginning of an accentual phrase and falls on the accented syllable. For a detailed analysis, see Beckman and Pierrehumbert (1988).
• Tone language: Mandarin and Vietnamese are examples of lexical tone languages. Each syllable is lexically marked with one lexical tone. Tones have distinctive pitch contours. Altering the pitch contour may change the lexical meaning of a word, and perhaps the meaning of a sentence. For example, in Vietnamese the syllables “ta” (we), “tà” (lap of dress), “tã” (nappy), “tả” (to describe), “tá” (twelve) and “tạ” (quintal) have different meanings.

Prosody used to convey non-lexical information: intonation type (question vs. declarative sentences). Languages may employ prosody in different ways to differentiate declarative sentences from questions. A general trend is that questions are associated with higher pitch somewhere in the sentence, most commonly near the end. This may be manifested as a final rising contour, or as a higher or expanded pitch range near the end of the sentence.
In English, declarative intonation is marked by a falling ending, while yes-no question intonation is marked by a rising one. Russian questions, on the other hand, use strong emphasis on a key word instead of a rising tail. Chinese questions are manifested by an expanded pitch range near the end of the sentence, while the speaker preserves the lexical tone shapes. [1]

Prosody used to convey discourse functions: focus, prominence, discourse segments, etc. Topic initialization is typically associated with high pitch. Pitch is typically raised in the discourse-initial section and lowered in the discourse-final section. Also, new information in the discourse structure is typically accented, while old information is de-accented. [1]

Prosody used to convey emotion. Most experiments on emotional speech study stylized emotion, as delivered by actors and actresses. In these acted-out emotions, a few categories of emotion can be reliably identified by listeners, and one can find consistent acoustic correlates of these categories. For example, excitement is expressed by high pitch and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is characterized by over-articulation, fast, downward pitch movement, and overall elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range is set lower. The study of emotion in natural speech is a lot more complicated.
It is generally recognized that speakers show mixed feelings and ambiguous states of mind, and that emotions do not fall into clear-cut categories. [1]

The functions of prosody are summarized in Table 1.1:

Table 1.1: Prosody functions

Effect on meaning | Function | Examples
Modifying meaning | Linguistic (lexicon information) | Tone, Accent
Modifying meaning | Paralinguistic (non-lexicon information) | Sentence type: Assertive, Interrogative, Imperative
Not modifying meaning | Discourse function | Focus, Prominence, …
Not modifying meaning | Extra linguistic | Emotion, Sex of speaker, …

In this thesis, we focus only on the functions of prosody that modify meaning, namely tones and sentence types in Vietnamese prosody.

1.1.4. Levels of representation of prosodic phenomena

As with other properties of the speech signal, prosodic events can be studied at various levels of representation (see Table 1.2) [13]:

• First, the acoustic level: the acoustic manifestations of prosody (fundamental frequency, amplitude, and duration) can be measured directly, using specialized hardware or algorithms (such as pitch determination algorithms).
• Second, the perceptual level represents prosodic events as heard by the listener. As with the spectral properties of speech sounds, acoustic characteristics that can be measured are not always perceptible. The perceptual representation is accessible to the individual listener, but this mental representation can hardly be measured. Alternatively, it can be computed with a fair amount of precision on the basis of our knowledge of psychoacoustics.
• Finally, the linguistic level represents the prosody of an utterance as a sequence of abstract units (signs, symbols), some of which have a communicative function in speech, while others may just fulfill syntactic requirements. The linguistic structure of prosody is not some hidden code that can simply be revealed using some standard procedure.
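To make the acoustic level concrete, the sketch below measures the three acoustic quantities on a synthetic signal. It is a minimal illustration under stated assumptions, not the procedure of this thesis (which extracts F0 with PRAAT, see Figure 4-2): the pitch determination step is a plain autocorrelation peak search, the function names estimate_f0 and rms_energy are hypothetical, and the input is a pure 125 Hz sine rather than recorded speech.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 (Hz) of a voiced frame from the autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)   # shortest period searched
    lag_max = int(sample_rate / f0_min)   # longest period searched
    if len(frame) <= lag_max:
        raise ValueError("frame too short for the requested F0 range")
    window = len(frame) - lag_max         # same window length for every lag
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def rms_energy(frame):
    """Root-mean-square amplitude, a simple acoustic correlate of intensity."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


sample_rate = 8000
# Synthetic 80 ms "voiced" frame: a pure 125 Hz tone (assumed test signal).
frame = [math.sin(2 * math.pi * 125.0 * n / sample_rate) for n in range(640)]

f0 = estimate_f0(frame, sample_rate)
intensity = rms_energy(frame)
duration_s = len(frame) / sample_rate
print(f0, intensity, duration_s)
```

On real speech, a raw autocorrelation peak is prone to octave errors and needs a voicing decision, which is one reason dedicated pitch determination algorithms are used in practice. The three numbers printed here are acoustic measurements only; as discussed in this section, they do not by themselves give the perceived pitch, loudness or length.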
Table 1.2:Links between levels of representation of prosodic phenomena [13] Acoustic Perceptual Linguistic Fundamental frequency (F0) Pitch Tone, intonation, aspect of stress Intensity Loudness Aspect of stress Duration Length Aspect of stress Given the different nature of these representations, it is important to keep them apart. It can be helpful to have the terminology reflect the lever of representation. For instance, measuring loudness does not equal measuring signal energy. It is obvious that the perception of loudness is not exclusively related to the amplitude at one point of the signal, but also dependent on the duration of a speech fragment (the loudness of which we are measuring), and relative to the loudness of other parts in the signal. As one moves away from acoustic level towards the perceptual and/or linguistic levels, the measurement of some given prosodic property will progressively involve segmentation (for example, into syllables), context (such as relative prominence), and structural information (the linguistic interpretation of a syllabic tone, for example, often depends on whether the related syllable is stressed or not, which requires a prior analysis of the segmental layer). 1.2. Prosody modeling Prosodic models serve two purposes: On one hand, they can be scientific hypotheses that explain how we communicate with each other, and what we communicate. On the other hand, they can be engineered software systems that are part of a dialog system or speech synthesizer. To a lesser extent - and this is mostly potential - a prosodic model can be the background for a system to recognize prosody in human speech. - 18 - Chapter 1: Prosody and Prosodic model Mạc Đăng Khoa In general, a prosodic model is combined of two component, they are: intonation model and duration model. In this section, we word like to give an overview of some methods for prediction intonation (F0 contours) and duration which have actually been applied in speech synthesis. 1.2.1. 
Intonation models

1.2.1.1 Intonation model classification

The primary goal of intonation research is to model natural F0 contours of speech, preferably in relation to a transcription and a description of the prosodic intent of the speaker. The starting point of intonation research is the time series of F0, but the interpretation of the F0 information diverges widely among intonation models. Table 1.3 shows one way of classifying the various intonation models.

Table 1.3: Intonation model classification. Intonation models are classified by the way they describe prosody.

                      Under-specified   -          -           Fully specified
Single component      INTSINT           ToBI, Xu   Tilt, IPO   Olive, Machine learning
Two components        Grønnum           -          Fujisaki    -
Multiple components   -                 -          -           Van Santen

Under-specified or fully specified

The shape of an accent may be fully specified (i.e. defined without gaps) or under-specified (defined by disconnected regions or isolated points). Along another dimension, F0 values at any given time may be treated as a single component or as the combination of multiple components.

The advantage of an under-specified accent shape is that it leaves sufficient distance between specified accent targets for a smooth F0 transition, typically obtained by interpolation. The drawback is that it ignores changes of shape between specified targets.

On the other hand, a system with fully specified accents leaves little room to resolve conflicting targets. A simple concatenation of fully specified accents results in a pitch curve with unnatural jumps at the concatenation joints. Many systems, such as Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively, van Santen (1997, 2000) requires each accent to begin and end at zero to ensure smooth connections between accents.

Single component or many components?
Many intonation models treat surface intonation contours as the superposition of a phrase component and an accent component. Grønnum (1992) and Fujisaki (1983, 1988) are representatives of this view. A well-defined model that fully specifies accent shape and uses multiple components is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al., 1998), where accents are represented by densely populated points, providing a mechanism to describe highly complex accent shapes in detail. We characterize van Santen's system as having multiple components because, in addition to the phrase component, each accent in the phrase also adds a phrase-length component that contributes to the surface F0 contour.

The advantage of multiple components is that they provide a mechanism to separate individual accents from long-term effects. However, if one allows multiple components, one necessarily faces the problem that the decomposition of a single F0 time series into multiple components has no unique solution [1]. Any such decomposition depends on a model of the speech process, and is only as good as the underlying model.

In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a phrase curve and represent intonation contours as a single component. The advantage of representing F0 information as a single component is that the representation of accent heights is then transparent, which lends itself to convenient automatic labeling [1].

1.2.1.2 Some prosody models

The following gives an overview of the intonation models in Table 1.3:

• INTSINT (Hirst et al., 2000) is an under-specified intonation system that defines an accent by a single point. Surface F0 is generated by fitting quadratic spline curves through these points.
• ToBI: The most widely used under-specified accent shape is represented by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992), which developed from earlier works such as Pierrehumbert (1980), Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman (1988). Each accent is represented by no more than two points, which specify abstractly the relative contrast of high (H) and low (L). One goal of the ToBI system is to specify a minimal set of categorical labels for intonation. These labels are usually interpreted as phonological distinctions between accent types.

• Xu: Xu et al. (1999) represent Chinese tones with under-specified, static or dynamic targets. The surface F0 contours are generated with a model that approaches these targets asymptotically within the domain of a syllable.

• Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near the peak of an accent and leaves the other regions unspecified, hence its status halfway to a fully specified system. Tilt considers all accent types to be continuous variations of a single class. Surface variations are accounted for by changes in the continuous parameters.

• IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the pitch contour. The slope and height of these lines are then associated with various types of accents.

• Olive: Olive (1975) described a very early fully specified system, following work by Levitt and Rabiner (1970). His model stored the surface pitch vs. time contour as a function of the grammatical structure of the sentence. The contour was then approximated by polynomial splines attached to words, to allow for duration variations.

• Machine-learning: Several works using machine-learning techniques generate densely sampled F0 values, including Chen et al. (1992) and Malfrère et al. (1998). We classify these works as fully specified systems even though in some cases the concept of accent may not be clear.
Ross and Ostendorf (1999) described an interesting machine-learning system in which a discrete learning system predicts vectors attached to phonemes and syllables, and these vectors in turn drive a (learned) dynamical system that predicts F0.

• Fujisaki: Fujisaki's phonetic intonation model (Fujisaki and Kawai, 1982) was developed from the filter method first proposed by Öhman (1967). Fujisaki states that intonation contours are composed of two types of components, the phrase and the accent. The production process is represented by a glottal oscillation mechanism which takes phrase and accent information as input and produces a continuous F0 contour as output. The input to the mechanism takes the form of impulses, which produce phrase shapes, and step functions, which produce accent shapes [10]. The Fujisaki model has been successfully applied to decompose F0 contours in many languages, such as Japanese, German and Finnish, and in some tonal languages, such as Chinese and Thai. Research on applying the Fujisaki model to Vietnamese is currently under way [11]. We will return to this model in Chapter 3.

1.2.2. Duration modeling

We now give a general overview of modeling the duration component of prosody. Common methods to predict duration in speech synthesis differ in the following aspects [6]:

• Durational Unit Predicted: the temporal unit predicted by most current systems is either the phone (phoneme), often referred to as "segment", or the syllable. Since phone durations are eventually required for the acoustic synthesis, all syllable-based models include some mechanism for calculating segment durations from the syllable duration. For example, in Barbosa and Bailly's model, the basic unit is delimited by the onset of the nuclear vowel and the onset of the following vowel.
They are computed by a sequential network constrained by an internal clock (basically the speaking rate).

• Predictor factors: Every model uses a particular vector of input features, which are extracted on the linguistic and phonetic levels. The most commonly employed factors include:
  - on the syllabic level: the degree of accentuation and the position in a higher-level unit, such as the foot or accent group;
  - on the segmental level: the properties of the phone to be synthesized and of its neighboring phones;
  - on the phrase level: the location of a segment with respect to a minor or major boundary and the position of the phrase in a sentence.

• The Prediction Method: The algorithms used for calculating a numerical duration value from the vector of input features can be roughly divided into rule systems and statistical approaches. In Figure 1-1, the statistical approaches are subdivided into parametric and non-parametric regression models. Whereas the structure of a parametric regression model, in terms of how it processes the input factors, is determined a priori, non-parametric regression models are developed by automatic training and their structure is determined from the data (multi-layer perceptrons, CARTs). The main difference between rule-based and statistical models is that a rule system can be built on relatively little speech data. The formulation of the rules, however, requires a high amount of expert knowledge and considerable optimization effort by trial and error. In contrast, building a statistical model is a relatively effortless process. Furthermore, the importance of individual factors can be easily assessed from the way the statistical models prioritize them.

Figure 1-1: Category of methods for predicting syllable duration [6]

• Pause Prediction.
Some current approaches incorporate the prediction of speech pauses as part of the model; others treat pauses strictly separately.

• Speech Rate: Many current TTS systems produce different speech rates by linearly scaling the durations output by the duration model. As the speech rate affects not only the duration of individual segments but also the overall prosodic structure of an utterance, this kind of modification should take place at an earlier processing step, when the phrasal structure of the utterance is determined.

1.2.3. This thesis work approach

Modeling intonation and duration in prosody is a complex field, related to both linguistics and acoustics. There are many different methods to predict the intonation and duration of speech. However, there is currently no method that applies completely to Vietnamese. Within the scope of this thesis, we use a statistical approach to extract some basic patterns, in order to model some basic cases of Vietnamese prosody. Our approach to modeling Vietnamese prosody can be summarized as follows:

• Method: statistical approach; calculate the average values of F0, intensity and duration from a corpus.
• Intonation modeling: under-specified (using 20 isolated points), single component.
• Phrase-level factors: three types of phrase (assertive, interrogative and imperative).
• Syllable-level factors: tone of the syllable.
• Extra-linguistic factors: male/female voice.

This approach is described in more detail in Chapter 4.

2 Vietnamese language and prosody

Understanding the phonetic and phonological characteristics of a language plays an important role in studies on speech processing in general and on prosody analysis in particular. Thus, in this chapter, we give a review of the Vietnamese language and of Vietnamese prosody.

2.1. Vietnamese language

2.1.1.
Vietnamese characteristics

Vietnamese is an amorphous language and a tonal/musical language. It has the following characteristics [21]:

1. Vietnamese words are amorphous: they do not change form to show grammatical categories. French, by contrast, has masculine and feminine forms (étudiant/étudiante, nouveau/nouvelle) and singular and plural forms (amie, amies).

2. Vietnamese word structure does not use affixes (prefixes, suffixes and infixes); Vietnamese is a non-affix language. In French or in English, for instance, the antonym of a word can be formed by adding a prefix such as "im-", "ir-" or "un-": impolite, unreadable, irregular…

3. Vietnamese word structure uses very few morphemes. Vietnamese has at most twenty thousand syllables with which to create morphemes, and thus does not have the features of inflectional languages. The morpheme index of Vietnamese (number of morphemes M / number of words W) is about 1.06 [13], which is the lowest index among the 5000 languages of the world [13]. A language whose morpheme index is less than 2 is an analytic language. The amorphous feature is an essential characteristic of Vietnamese, which influences its other characteristics.

4. Vietnamese is a tonal/musical language. It has six tones, and each tone can contribute to creating the morpheme and the meaning of a word, e.g. ba, bá, bà, bả, bã, bạ; me, mé, mè, mẻ, mẽ, mẹ. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.

5. A Vietnamese syllable (isolated word) in full structure has five parts: initial sound (consonant), medial sound (semi-vowel), nucleus sound (vowel or diphthong), final sound (consonant or semi-vowel) and tone. Within a word, the consonant and the vowel play an essential role; they are the core of the word. They can create one syllable by themselves.
Apart from the initial consonant, the rest of a word is called a final (vần). Vietnamese has 155 basic finals [13].

6. In Vietnamese, syllable boundaries and morpheme boundaries coincide: one syllable is one morpheme. In French, partir (to leave) has two syllables, par-tir, and two morphemes, part-ir; vendeur (seller) has two syllables, ven-deur, and two morphemes, vend-eur. In English, "words" has one syllable and two morphemes. In Vietnamese, the sentence "Đẹp vô cùng tổ quốc ta ơi!" (Tố Hữu) has seven morphemes, seven syllables and five words (three simple words: đẹp, ta, ơi, and two compound words: vô cùng, tổ quốc). In conclusion, one Vietnamese word unit is one syllable, one morpheme and one real word.

7. Almost all of the Vietnamese vocabulary is formed from one or two morphemes, and is monosyllabic or bi-syllabic, sometimes polysyllabic. 80% of words are bi-syllabic.

8. The difference between written language and spoken language in grammatical and phonetic rules is not large.

9. Throughout its formation and development, Vietnamese has received many words from foreign languages. Han (Chinese) words are the most numerous, followed by French words, and part of them have been fully converted into Vietnamese. For example, đấu tranh, giai cấp, hòa bình, độc lập, tự do and hạnh phúc are Han words; nhà ga (gare), xà phòng (savon) and cà phê (café) are French words.

2.1.2. Vietnamese phoneme system

The Vietnamese phoneme system includes 14 vowels or vowel combinations and 22 consonants. The Vietnamese vowels comprise 11 vowels and three diphthongs [21]. All vowels are voiced sounds.

Table 2.1: Vietnamese vowels.
Transcription   Reading            Letters          Example
/a/             a                  a                a ha
/ă/             ă                  ă                con mắt
/ɤ̆/             ấ                  â                ân cần
/ɛ/             e                  e                e dè
/e/             ê                  ê                ê chề
/i/             short i & long y   i, y             ỉ eo, ý chí
/ɔ/             o                  o                co ro
/o/             ô                  ô                hồ đồ
/ɤ/             ơ                  ơ                bơ phờ
/u/             u                  u                tù mù
/ɯ/             ư                  ư                lừ đừ
/ie/            ia, yê             ia, iê, ya, yê   kia kìa, yêu kiều
/uo/            ua                 ua, uô           tua rua, luôn luôn
/ɯɤ/            ưa                 ưa, ươ           lưa thưa, lượt thượt

Vietnamese includes 22 consonants [21], listed in Table 2.2.

Table 2.2: Vietnamese consonants.

Transcription   Reading               Letter     Example
/b/             bê and bờ             b          bồng bềnh
/p/             pê and pờ             p          ốp ép
/v/             vê and vờ             v          vẩn vơ
/f/             phờ                   ph         phơi pha
/m/             em-mờ and mờ          m          mơ màng
/d/             đê and đờ             đ          đất đai
/t/             tê and tờ             t          tin tưởng
/t’/            thờ                   th         thơ thẩn
/s/             ich-xì and xờ         x          xa xôi
/z/             dê, giê and dờ        d, gi      duyên dáng, giữ gìn
/n/             en-nờ and nờ          n          nôn nóng
/l/             e-lờ and lờ           l          long lanh
/ʈ/             trờ                   tr         trồng trọt
/ʂ/             ét-sì and sờ          s          sung sướng
/ʐ/             e-rờ and rờ           r          róc rách
/c/             chờ                   ch         chông chênh
/ɲ/             nhờ                   nh         nhọc nhằn
/ŋ/             ngờ                   ng, ngh    ngô nghê
/k/             xê, ca, quy and cờ    c, k, q    con cá, kĩu kịt, qua quít
/x/             khờ                   kh         khúc khích
/ɣ/             gê, giê and gờ        g, gh      gồ ghề
/h/             hát and hờ            h          hả hê

The consonants of Vietnamese can be distinguished by the following features: manner and place of articulation, stop or fricative, voiced or unvoiced, aspirated or non-aspirated, nasal or non-nasal, and the position of the consonant in the syllable. Based on these features, Vietnamese consonants can be arranged as in Table 2.3.

Table 2.3: Arrangement of Vietnamese consonants. Within each row, consonants are ordered by articulation position: labial, apical (dental, laminal), palatal, dorsal, glottal.

Articulation method                        Consonants
Stop, noise, aspirate                      t’
Stop, noise, non-aspirate, unvoiced        p, t, ʈ, c, k
Stop, noise, non-aspirate, voiced          b, d
Stop, sonant                               m, n, ɲ, ŋ
Fricative, noise, unvoiced                 f, s, ʂ, x, h
Fricative, noise, voiced                   v, z, ʐ, ɣ
Fricative, beside sonant                   l

2.1.3. Syllable structure

Vietnamese grammarians and linguists have long considered the syllable in Vietnamese as a fundamental unit.
A syllable in full structure (a tonal syllable) has five parts: initial sound, medial sound, nucleus sound, final sound and tone [21]. For instance, the syllable "toán" has the following components: initial sound /t/, medial sound /o/, nucleus sound /a/, final sound /n/, and tone "sắc" (rising tone). A syllable must have a nucleus sound; the other components are optional. A nucleus sound can form a syllable on its own, for instance a, ơ, ê… Apart from the initial sound (called the INITIAL part), the rest of the syllable is called the FINAL part.

A tone is a fundamental frequency variation spreading over the whole syllable. A tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. There are a few constraints: if a syllable ends with an unvoiced consonant /p, t, k/, only the "sắc" and "nặng" tones are possible; otherwise, in all varieties of Vietnamese, the whole tonal paradigm can occur.

Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14].

TONAL SYLLABLE (6492)
    BASE SYLLABLE (2376)
        Initial (22)
        Final (155)
            Medial (1)
            Nucleus (16)
            Ending (8)
    TONE (6)

2.2. Vietnamese prosody

Because Vietnamese is a tonal language, its prosody is composed of two components, which we call "micro-prosody" and "macro-prosody":

• Micro-prosody is the variation of pitch, duration and intensity of an individual word or syllable. For a tonal language, micro-prosody is very important for distinguishing the tone of the syllable. Thus, the lexical meaning of the synthesized sound depends heavily on the quality of the micro-prosody.

• Macro-prosody is the application of prosody to a whole phrase or sentence. It depends on the type of sentence and on the speaker's intentions or emotions. Therefore, the "naturalness" of synthesized sentences depends heavily on the ability to control macro-prosody during the speech synthesis process.

2.2.1.
Micro-prosody and tones system in Vietnamese

In Vietnamese, micro-prosody depends heavily on the tone of the syllable. Each tone can contribute to constructing the morpheme and the meaning of a word, and it is also a distinguishing signal. The tone has the same function as a phoneme: it is always assigned to a syllable and its influence covers the entire syllable. The tones give Vietnamese a musical character and make sentences rhythmic and melodious.

There are six tones in Vietnamese; they are shown in Table 2.5.

Table 2.5: The six Vietnamese tones.

Tone 1   Tone 2      Tone 3    Tone 4    Tone 5    Tone 6
ngang    huyền ‘\’   ngã ‘~’   hỏi ‘?’   sắc ‘/’   nặng ‘.’

Figure 2-1: Example of the contours of six tones, as described in [21].

• Tone 1 - Level tone ("ngang"): a high tone. At the beginning of the syllable, it is the highest tone. The steady state of the level contour is observed consistently. Figure 2-2 shows the shape of tone 1 for male and female voices (the two lines represent the maximum and minimum F0 values).

Figure 2-2: The shape of Tone 1 with female and male voice [18]

• Tone 2 - Falling tone ("huyền"): the onset of the falling tone is lower than that of tone 1, tone 5 and tone 3. The low F0 at the onset gradually falls toward the end.

Figure 2-3: The shape of Tone 2 with female and male voice [18]

• Tone 3 - Broken tone ("ngã"): the onset is as high as that of tone 5 and higher than that of the falling tone. The second third of the contour of this tone is characterized by an abrupt dip caused by glottalization. In most cases, the bottom of the dip occurs between the mid-point and the point two-thirds from the onset. A creaky voice is heard during this dip.

Figure 2-4: The shape of Tone 3 with female and male voice [18]

• Tone 4 - Curve tone ("hỏi"): the onset is the lowest among the six tones.
The low onset falls further, gradually, until the point two-thirds from the onset. From this point, the extremely low F0 starts to rise toward the end.

Figure 2-5: The shape of Tone 4 with female and male voice [18]

• Tone 5 - Rising tone ("sắc"): the onset is also high. Starting from the high onset, the F0 rises gradually for the first two thirds of the duration. After this point, the rise becomes more rapid.

Figure 2-6: The shape of Tone 5 with female and male voice [18]

• Tone 5b: when a tone 5 syllable ends with a stop consonant (t, p, c, k), the onset is higher than that of the plain tone 5 (tone 5a) and the F0 rises rapidly over a short duration. We call this tone 5b.

Figure 2-7: The shape of Tone 5b with female and male voice [18]

• Tone 6 - Drop tone ("nặng"): the onset is usually higher than that of the falling or curve tone but considerably lower than that of tone 1, tone 5 and tone 3. This tone is characterized by glottalization at the end and by a considerably shorter duration than the other tones, approximately two thirds of theirs. The main body of this tone is almost level or slightly falling.

Figure 2-8: The shape of Tone 6 with female and male voice [18]

• Tone 6b (tone 6 ending with a stop consonant): the onset is nearly equal to that of tone 2. The F0 falls toward the end over a short duration.

Figure 2-9: The shape of Tone 6b with female and male voice [18]

These descriptions hold only for the Northern dialect, in particular the Hanoi dialect, which is the standard dialect of Vietnamese. They change for the dialects of the South and the Center of Vietnam. In these regions, there are only 5 tones instead of the 6 of the Hanoi dialect, because tone 3 and tone 4 are pronounced identically.

In continuous speech, tones seldom reach their target values. They are generally affected by context: stressed vs.
unstressed syllable, influence of neighbouring tones, tempo… and by the effect of some phenomena in Vietnamese prosody, which we will discuss later.

2.2.2. Macro-prosody and sentence types in Vietnamese

As mentioned above, macro-prosody depends on the type of sentence and on the speaker's intentions or emotions. In this thesis, we discuss only the role of sentence types in Vietnamese prosody.

The classification of Vietnamese sentences

• Classification by purpose: sentences can be classified based on their purpose [8]:
  - An assertive sentence or declaration: the most common type, which commonly makes a statement. Ex: Tôi sẽ về nhà. (I am going home.)
  - An interrogative sentence or question: commonly used to request information. Ex: Khi nào anh sẽ làm việc? (When are you going to work?)
  - An imperative sentence or command: ordinarily used to make a demand or request. Ex: Mở cửa ra! (Open the door!)
  - An exclamatory sentence or exclamation: generally a more emphatic form of statement. Ex: Ngày hôm nay tuyệt quá! (What a wonderful day this is!)

• Classification by structure: sentences can also be classified based on their structure (by the number and types of finite clauses), as shown in Figure 2-10.

Figure 2-10: Sentence classification by structure [20]

Within the scope of this thesis, we have studied only the macro-prosody of assertive, interrogative and imperative sentences with a single structure.

In their research, Nguyễn and Boulakia [8] gave the following prosodic characteristics of three types of sentence (assertive, interrogative and imperative):

• Duration (Tempo). Interrogative sentences (Q) are shorter than assertive sentences (S), and this difference is significant. Imperatives (I) are even shorter, but the differences with Q and S are not significant.
• Intensity. The difference is significant for the assertive/imperative (S/I) pair, but not for the S/Q and Q/I pairs.

• Fundamental frequency. The mean F0 value of interrogative and imperative sentences is higher than that of statements, while there is no difference between interrogative and imperative sentences. There is an obvious difference in the last syllable. The phonologically "level" (high) tone falls in statements and is much higher and rising in questions, while its mean value and movement are halfway between the two for imperatives. The rising tones rise even more in interrogative and imperative sentences than in statements. This means that the intonation influences the tone of the final syllable of the sentence.

Figure 2-11: The sentences "Lan thích ăn cơm không" in Assertive (S) and Interrogative (Q) mode [8]

Figure 2-12: The sentences "Bảo cố gắng tập đi" in Assertive (S) and Imperative (I) mode [8]

Figure 2-13: The sentences "Tân bỏ đi chứ" in Interrogative (Q) and Imperative (I) mode [8]

In the research of Vu M.
Q. et al. [16], they found that the main part of differences in intonation [...]

Tone of
last syllable   Intended type   Assertive   Interrogative   Imperative   Total
                Imperative      7           43              50           100
6               Assertive       67          10              23           100
                Interrogative   20          60              20           100
                Imperative      17          30              53           100
6b              Assertive       67          20              13           100
                Interrogative   10          53              37           100
                Imperative      17          23              60           100

Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)

Tone of                         Perceived type of sentence (%)
last syllable   Intended type   Assertive   Interrogative   Imperative   Total
1               Assertive       77          20              3            100
                Interrogative   20          60              20           100
                Imperative      10          33              57           100
2               Assertive       75          20              5            100
                Interrogative   23          65              12           100
                Imperative      42          17              42           100
3               Assertive       70          27              3            100
                Interrogative   32          48              20           100
                Imperative      52          23              25           100
4               Assertive       68          20              12           100
                Interrogative   23          72              5            100
                Imperative      25          25              50           100
5               Assertive       60          35              5            100
                Interrogative   32          50              18           100
                Imperative      15          28              57           100
5b              Assertive       48          45              7            100
                Interrogative   12          80              8            100
                Imperative      10          40              50           100
6               Assertive       50          12              38           100
                Interrogative   27          63              10           100
                Imperative      15          25              60           100
6b              Assertive       63          13              23           100
                Interrogative   30          43              27           100
                Imperative      22          15              63           100

Figure 5-5: Correct recognition rate with 8 tones of last syllable (horizontal axis: tone of last syllable, 1 to 6b; vertical axis: correct recognition rate in %; series: Assertive, Interrogative, Imperative)

As can be seen, assertive sentences ending with tones 1, 2, 3 and 4 were correctly identified in about 70% to 80% of judgments. With tones 5, 5b, 6 and 6b, the correct recognition rate was fair, about 50% to 60%. With tone 5b, 45% of assertive sentences were confused with interrogative sentences. The results were fairly good for interrogative sentences ending with tones 1, 2, 4, 5b and 6 (60% or more correct, and even 80% with tone 5b). With the other tones, it was 40% to 50%.
For imperative sentences ending with tones 1, 4, 5, 5b, 6 and 6b, the correct recognition rate is fair, from 50% to about 60%. With tone 2 and tone 3, 42% and 52% of imperative sentences, respectively, were confused with assertive sentences. In summary, the average correct recognition rates are given in Table 5.9 and Figure 5-6.

Table 5.9: Correct recognition rate (%) with other types of sentences

Sentence type   Male voice   Female voice   Both
Assertive       60           68             64
Interrogative   57           63             60
Imperative      53           48             50
Global          56           60             58

Figure 5-6: Correct recognition rate (%) with other types of sentences (horizontal axis: Male, Female, Both voices; vertical axis: correct recognition rate in %; series: Assertive, Interrogative, Imperative)

The correct recognition rate of assertive and interrogative sentences is better with the female voice than with the male voice. However, for imperative sentences, the male voice gives a better result. Overall, the average correct recognition rate over both voices is approximately 64% for assertive sentences, 60% for interrogative sentences and 50% for imperative sentences.

5.3. Comparison and conclusion

For comparison, we used the result of another perception test on pseudo-sentences¹, which was carried out in other research at the MICA center [17]. In that test, the listener had to choose between two types, "interrogative" and "assertive", for each pseudo-sentence they listened to.
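As a cross-check before the comparison, the "Both" column of Table 5.9 can be recomputed from the diagonal of Table 5.8. The Python sketch below assumes that every tone of the last syllable contributes equally to the average (an unweighted mean; this is an assumption of the sketch, not something the text states), and the names TONES, CORRECT and mean_rate are illustrative only.

```python
# Diagonal of Table 5.8: percentage of sentences whose perceived type equals
# the intended type, listed per tone of the last syllable (order as in TONES).
TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]
CORRECT = {
    "Assertive":     [77, 75, 70, 68, 60, 48, 50, 63],
    "Interrogative": [60, 65, 48, 72, 50, 80, 63, 43],
    "Imperative":    [57, 42, 25, 50, 57, 50, 60, 63],
}


def mean_rate(rates):
    """Unweighted mean of a list of percentages (assumed equal weight per tone)."""
    return sum(rates) / len(rates)


# Average correct recognition rate per sentence type, over the 8 tones.
averages = {stype: mean_rate(rates) for stype, rates in CORRECT.items()}

# Global rate: mean over the three sentence types.
global_rate = mean_rate(list(averages.values()))

for stype, value in averages.items():
    print(f"{stype}: {value:.1f}%")
print(f"Global: {global_rate:.1f}%")
```

Under this assumption the averages are 63.875, 60.125 and 50.5, and the global rate is about 58.2, which is in line with the rounded values 64, 60, 50 and 58 reported in Table 5.9.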
Table 5.10 and Figure 5-7 compare the results of our two experiments (with sentences synthesized from the patterns) and of the other experiment with pseudo-sentences (which we call Experiment M):

Table 5.10: Result of three experiments (correct recognition rate in %, by voice of synthesis)

                Experiment 1            Experiment 2            Experiment M
                (non-sense sentence)    (multi-type sentence)   (pseudo-sentence)
Sentence type   Male   Female   Both    Male   Female   Both    Male   Female   Both
Assertive       68     55       61      60     68       64      58     69       64
Interrogative   49     50       50      57     63       60      61     74       68
Imperative      36     42       39      53     48       50      -      -        -
Global          51     49       50      56     60       58      60     72       66

Figure 5-7: Result comparison of three experiments (horizontal axis: Experiment 1, Experiment 2, Experiment M; vertical axis: correct recognition rate in %; series: Assertive, Interrogative, Imperative)

¹ A pseudo-sentence is a re-synthesized sentence without semantic information, which is re-synthesized from the vowel /a/ and simulates the prosody of an actual sentence [17].

Overall, the result of Experiment 2 is better than that of Experiment 1, which used non-sense sentences: the global correct recognition rate rises from 50% to 58%. The gain is about 10 percentage points for interrogative and imperative sentences and 3 points for assertive sentences. This can be explained as follows: in Experiment 1, we used the same tone for all syllables in the sentence but did not simulate co-articulation phenomena, so the synthesized sentences were very different from natural ones. In the test sentences of Experiment 2, all syllables except the final one were in tone 1 (or non-tone). Co-articulation is thus minimized, and the sentences are more natural.

Compared with Experiment M, the result of Experiment 2 is the same for assertive sentences (64% over both voices) but worse for interrogative sentences (60% against 68%). The test sentences in Experiment M were synthesized by simulating the prosody of actual sentences. That is why they could be more natural than the test sentences in Experiment 2, which used our proposed prosody patterns.
Therefore, the worse result in Experiment 2 is understandable and acceptable.

In this chapter, we have presented some experiments to evaluate our prosody patterns. These patterns are currently very simple, based only on the position of the syllable in the sentence and on the type of the containing sentence. However, the results of the experiments show that our proposed patterns can be applied to predict the prosody of simple sentences.

6 Conclusion and Perspectives

The subject of this thesis is "Modeling the prosody of Vietnamese language for speech synthesis". Finding a complete model of Vietnamese prosody is currently a large and complex problem, which requires much research in linguistics, acoustics and speech processing. Therefore, within the scope of a master thesis, we have studied some basic factors of Vietnamese prosody, characterized them and tried to apply them to speech synthesis.

In chapter 3, we proposed a method and a structure for a prosody generation module. The simplest way to generate the prosody of a whole sentence is to concatenate the prosody patterns of its syllables directly. Hence, in chapter 4, we set up a prosody corpus, analyzed it and proposed 72 prosody patterns for syllables, corresponding to the initial, middle and final positions of the syllable in three types of sentence (assertive, interrogative and imperative). Although these patterns are not enough to model all cases of Vietnamese prosody, they can be applied to generate the prosody of simple sentences. This was shown by the results of the experiments in chapter 5. In the perception tests, the listeners correctly identified 64% of assertive sentences, 60% of interrogative sentences and 50% of imperative sentences.

These patterns are our first prosody patterns for Vietnamese syllables. They were extracted from a small prosody corpus and take only three factors into account: tone, position of the syllable and sentence type.
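The pattern selection summarized above can be pictured as a table lookup followed by concatenation. The Python sketch below assumes that the 72 patterns are indexed by 8 tone labels (1, 2, 3, 4, 5, 5b, 6, 6b, as used in the perception tests), 3 syllable positions and 3 sentence types, which gives 8 × 3 × 3 = 72 keys. This indexing, the function names, the tone labelling of the example sentence and the rule that the syllable of a one-syllable sentence counts as "final" are all assumptions made for illustration, not the implementation used in the thesis. The pattern contents themselves (F0, duration and intensity values) are not reproduced.

```python
from itertools import product

# Assumed index sets: 8 tone labels x 3 positions x 3 sentence types = 72 keys.
TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]
POSITIONS = ["initial", "middle", "final"]
SENTENCE_TYPES = ["assertive", "interrogative", "imperative"]

# One key per prosody pattern; the pattern data is deliberately left out.
PATTERN_KEYS = list(product(TONES, POSITIONS, SENTENCE_TYPES))


def position_of(index, n_syllables):
    """Classify a syllable by its place in the sentence.

    Assumption: the last syllable is always "final", even when the
    sentence has a single syllable.
    """
    if index == n_syllables - 1:
        return "final"
    if index == 0:
        return "initial"
    return "middle"


def select_patterns(tones, sentence_type):
    """Return one pattern key per syllable, in sentence order."""
    if sentence_type not in SENTENCE_TYPES:
        raise ValueError(f"unknown sentence type: {sentence_type}")
    keys = []
    for i, tone in enumerate(tones):
        if tone not in TONES:
            raise ValueError(f"unknown tone label: {tone}")
        keys.append((tone, position_of(i, len(tones)), sentence_type))
    return keys


# Hypothetical tone labelling of "Lan thích ăn cơm không" (see Figure 2-11).
question_keys = select_patterns(["1", "5b", "1", "1", "1"], "interrogative")
```

Each returned key would then be used to fetch the stored pattern of that syllable, and the fetched patterns would be concatenated in order to form the prosody description of the sentence.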
In future work, we expect to improve these patterns by setting up and analyzing a larger corpus. We will also study other factors, such as syntactic and lexical-meaning factors or the glottalization and co-articulation phenomena, and integrate them into our patterns. Moreover, these prosody patterns can be represented not only as absolute values of F0, duration and intensity but also as a set of parameters of another prosody model (such as the Fujisaki model). In that way, we will be able to generate the prosody description in other types of prosodic model and apply it to other types of TTS system.

The following is a summary of the work done in this master thesis, its limitations and the future approaches:

• Work done
  - Proposed a simple method of prosody generation and a structure for the prosody generation module of a TTS system
  - Set up a corpus for prosody research
  - Proposed 72 prosody patterns for Vietnamese syllables
  - Applied the proposed prosody patterns to synthesize some simple sentences and evaluated these sentences with perception tests

• Limitations
  - The corpus is small
  - The proposed patterns for syllables are simple and take into account only three factors: tone, position of the syllable and sentence type
  - Intensity control in speech synthesis was done manually and is not very accurate
  - The listeners in the perception tests were few (5 males and 5 females)

• Future approaches
  - Set up a larger corpus that covers other factors and phenomena of Vietnamese prosody
  - Define prosody patterns from that corpus
  - Study other prosodic models and convert these patterns into the parameters of those models
  - Develop a complete prosody generation module and integrate it into a TTS system

The last words

This work was carried out at the MICA center, and I expect that its results can be applied to the MICA speech synthesis system.
This work is also my first work in the domain of speech processing. With the guidance of my supervisors and the instruction and support of the other members of MICA’s speech processing group, my knowledge and skills in speech processing have steadily improved. That knowledge is the necessary background for my future studies and research. Once again, thank you all very much!

References

[1]. Chilin Shih, Greg Kochanski, “Prosody and Prosodic Models”, www.prosodies.org.
[2]. Do T.D., Tran T.H., et al. (1998), “Intonation system - A survey of twenty languages”, chap. 22, Cambridge University Press.
[3]. Dung Tien Nguyen, Hansjörg Mixdorff, Mai Chi Luong, Huy Hoang Ngo, Bang Kim Vu (2005), “Fujisaki Model based F0 contours in Vietnamese TTS”, Eurospeech proceeding.
[4]. H. Fujisaki, S. Ohno, C. Wang (1974), “A command-response model for F0 contour generation in multilingual speech synthesis”, Journal of Phonetics, vol. 2, pp. 223-232.
[5]. Mixdorff H. (1998), “Intonation patterns of German - Model-based quantitative analysis and synthesis of F0 contours”, PhD thesis, TU Dresden.
[6]. Mixdorff H. (2001), “An Integrated Approach to Modeling German Prosody”, TU Dresden.
[7]. Mixdorff H., Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong (2003), “Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese”, Eurospeech proceeding.
[8]. Nguyen T.T.H. and Boulakia G. (1999), “Another look at Vietnamese intonation”, ICPhS'99.
[9]. Ninh Khanh Duy (2005), “Characterization of Vietnamese intonation for questions”, Master thesis, Hanoi University of Technology.
[10]. Paul Alexander Taylor (1992), “A Phonetic Model of English Intonation”, PhD thesis, University of Edinburgh.
[11]. Sami Lemmetty (1999), “Review of Speech Synthesis Technology”, MSc thesis, Helsinki University of Technology.
[12]. Thierry Dutoit (1993), “High Quality Text-To-Speech Synthesis of the French Language”, PhD thesis, Faculté Polytechnique de Mons, TCTS Lab, Belgium.
[13]. Thierry Dutoit (1997), “An Introduction to Text-to-Speech Synthesis”, Kluwer Academic Publishers.
[14]. Tran D.D., Castelli E., et al. (2005), “Influence of F0 on Vietnamese syllable perception”, Interspeech.
[15]. Tran Do Dat (2003), “Building a large Vietnamese Speech Database”, Master thesis, Hanoi University of Technology.
[16]. Vu M.Q., Tran D.D., Castelli E. (2006), “Prosody of Interrogative and Affirmative Sentences in Vietnamese Language: Analysis and Perceptive Results”.
[17]. Vu M.Q., Tran D.D. & Castelli E. (2006), “Intonation des phrases interrogatives et affirmatives en langue vietnamienne”, JEP2006, XXVIes Journées d’Etude sur la Parole, Manoir de la Vicomté - Dinard, France.
[18]. Nguyen Quoc Cuong (2002), “Reconnaissance de la parole en langue Vietnamienne”, Thèse, INPG, UJF Grenoble, France, Juin 2002.
[19]. Bạch Hưng Nguyên, Nguyễn Tiến Dũng (2005), “Mô hình Fujisaki và áp dụng trong phân tích thanh điệu tiếng Việt”.
[20]. Mai Ngọc Chừ, Vũ Đức Nghiệu, Hoàng Trọng Phiến (2005), “Cơ sở ngôn ngữ học và tiếng Việt”, NXB Giáo dục.
[21]. Nguyễn Hữu Quỳnh (2001), “Ngữ Pháp Tiếng Việt”, Nhà xuất bản từ điển Bách Khoa, pp. 11-86, Hà Nội.

Appendix

A. Text for prosody corpus

Code          Role  Sentence
1_As_In_A     A     Bên ta theo địch đến tận căn cứ
1_As_Mid_A    A     Bên địch bị bên ta theo đến tận căn cứ
1_As_Fi_A     A     Bên địch bị đánh bật khỏi căn cứ bên ta
Context: Kết thúc trận đánh, thủ trưởng hỏi:
1_Int_In_A    A     Này cậu. Bên ta theo địch đến tận căn cứ à?
1_As_In_B     B     Vâng. Bên ta theo bên địch đến tận căn cứ.
1_Int_Mid_A   A     Này cậu. Bên địch bị bên ta theo đến tận căn cứ à?
1_As_Mid_B    B     Đúng ạ. Bên địch bị bên ta theo đến tận căn cứ.
1_Int_Fi_A    A     Này cậu. Bên địch đã bị đánh bật khỏi căn cứ bên ta?
1_As_Fi_B     B     Đúng ạ. Bên địch bị đánh bật khỏi căn cứ bên ta.
Context: Trong trận đánh, cấp dưới hỏi thủ trưởng
1_Int_In_B    B     Bên ta theo anh có cần bám theo địch không?
1_Imp_In_A    A     Bên ta theo địch ngay cho tôi!
1_Int_Mid_B   B     Địch rút thì bên ta theo có được không anh?
1_Imp_Mid_A   A     À. Thế thì bên ta theo ngay cho tôi!
Context: Trong trận đánh, thủ trưởng ra lệnh:
1_Imp_Fi_A    A     Gọi ngay quân cứu viện bên ta! Nhanh lên!

Code          Role  Sentence
2_As_In_A     A     Trên tà thêu một bông hoa màu đỏ.
2_As_Mid_A    A     Áo của chị trên tà thêu một bông hoa.
2_As_Fi_A     A     Áo của chị có thêu một bông hoa trên tà.
Context: Một chị đi đặt may áo dài. Thợ may hỏi:
2_Int_In_A    A     Trên tà thêu gì không hả chị?
2_As_In_B     B     À... Trên tà thêu một bông hoa màu đỏ.
2_Int_Mid_A   A     Áo chị trên tà thêu gì không thế?
2_As_Mid_B    B     À... Cái áo đấy trên tà thêu hoa màu đỏ.
2_Int_Fi_A    A     Áo chị thêu gì trên tà?
2_As_Fi_B     B     À. Cái áo đấy thêu một bông hoa đỏ trên tà.
2_Int_In_B    B     Trên tà thêu gì không hả chị?
2_Imp_In_A    A     Giống cái kia kìa. Trên tà thêu y như thế cho chị!
2_Int_Mid_B   B     Áo chị đặt trên tà thêu gì không chị?
2_Imp_Mid_A   A     Giống cái lần trước ý. Cái áo này trên tà thêu cũng thế nhé em!
2_Int_Fi_B    B     Áo chị đặt thêu gì trên tà?
2_Imp_Fi_A    A     Giống lần trước. Cứ thêu cho chị một bông hoa trên tà!

Code          Role  Sentence
3_As_In_A     A     Trên tã thêu hình một quả bóng.
3_As_Mid_A    A     Chị nhìn thấy trên tã thêu hình quả bóng.
3_As_Fi_A     A     Chị nhìn thấy hình một quả bóng được thêu trên tã.
Context: Chồng và vợ nói về cái tã mới của em bé:
3_Int_In_A    A     Trên tã thêu hình gì thế em?
3_As_In_B     B     À. Trên tã thêu hình một quả bóng anh ạ.
3_Int_Mid_A   A     Em có thấy trên tã thêu hình gì không?
3_As_Mid_B    B     À. Em thấy trên tã thêu hình quả bóng anh ạ.
3_Int_Fi_A    A     Có cả hình quả bóng trên tã?
3_As_Fi_B     B     Vâng. Em thấy có hình một quả bóng trên tã.
Context: Vợ đang thêu tã cho con, hỏi chồng
3_Int_In_B    B     Trên tã thêu hình gì đây hả anh?
3_Imp_In_A    A     Con nó thích bóng. Trên tã thêu hình quả bóng đi!
3_Int_Mid_B   B     Không biết trên tã thêu cái gì hả anh?
3_Imp_Mid_A   A     Con có vẻ thích bóng. Theo anh trên tã thêu bóng đi em!
3_Int_Fi_B    B     Anh ơi! Thế mình thêu gì trên tã?
3_Imp_Fi_A    A     Con thích bóng. Thế thì em thêu cho con một quả bóng trên tã!

Code          Role  Sentence
4_As_In_A     A     Bên tả theo khuynh hướng bảo thủ.
4_As_Mid_A    A     Đảng chính trị bên tả theo khuynh hướng bảo thủ.
4_As_Fi_A     A     Bảo thủ là khuynh hướng của đảng chính trị bên tả.
Context: Hai người bàn về chính trị
4_Int_In_A    A     Bên tả theo khuynh hướng nào cậu nhỉ?
4_As_In_B     B     À. Bên tả theo khuynh hướng bảo thủ.
4_Int_Mid_A   A     Đảng chính trị bên tả theo khuynh hướng nào thế?
4_As_Mid_B    B     À. Đảng chính trị bên tả theo khuynh hướng bảo thủ.
4_Int_Fi_A    A     Anh ủng hộ đảng chính trị bên tả?
4_As_Fi_B     B     À vâng. Tôi ủng hộ đảng bên tả.
Context: Trong buổi tổng duyệt diễu hành. Người chỉ huy hô:
4_Imp_In_A    A     Bên tả theo ngay sau đội trống! Nhanh lên!
4_Imp_Mid_A   A     Chú ý. Toàn đội bên tả theo ngay sau đội trống!
4_Imp_Fi_A    A     Tất cả theo sau đội bên tả! Nhanh lên.

Code          Role  Sentence
5_As_In_A     A     Đợt phong cấp lần này, lên tá theo anh cũng chẳng khó khăn gì.
5_As_Mid_A    A     Anh ta được thăng lên tá theo cách nào không ai biết.
5_As_Fi_A     A     Cuối cùng thì anh cũng được thăng lên tá.
Context: Trong cuộc họp bàn về phong cấp trong một đơn vị quân đội:
5_Int_In_A    A     Đợt phong cấp lần này, lên tá theo anh có khó lắm không?
5_As_In_B     B     À. Lên tá theo tôi cũng chẳng khó khăn gì.
Context: Thủ trưởng hỏi về một trường hợp vừa được phong lên cấp tá
5_Int_Mid_A   A     Anh ta lên tá theo quyết định nào thế?
5_As_Mid_B    B     À. Anh ta được lên tá theo cách nào chẳng ai biết.
5_Int_Fi_A    A     Sao anh ta lại được lên tá?
5_As_Fi_B     B     Thưa anh, cũng không ai biết sao anh ta lại được lên tá.
Context: Cấp dưới hỏi thủ trưởng
5_Int_In_B    B     Còn đồng chí Nam, lên tá theo anh có nên không?
5_Imp_In_A    A     Nên chứ. Lên tá theo mọi người luôn đi!
5_Int_Mid_B   B     Đồng chí Nam mà phong lên tá theo mọi người có được không?
5_Imp_Mid_A   A     Được. Phong lên tá theo họ luôn đi!
5_Int_Fi_B    B     Anh không ủng hộ việc đồng chí đó sẽ được phong lên tá?
5_Imp_Fi_A    A     Đúng. Các anh đừng có phong anh ta lên tá!

Code          Role  Sentence
6_As_In_A     A     Lên tạ theo đúng hướng dẫn là cách tập luyện rất tốt.
6_As_Mid_A    A     Cách tốt nhất là tập lên tạ theo đúng hướng dẫn.
6_As_Fi_A     A     Tất cả các vận động viên đều phải tập lên tạ.
Context: Vận động viên hỏi huấn luyện viên về tập lên tạ:
6_Int_In_A    A     Lên tạ theo theo anh có tốt không?
6_As_In_B     B     Có chứ. Lên tạ theo đúng hướng dẫn là cách tập luyện rất tốt.
6_Int_Mid_A   A     Theo anh thì tập lên tạ theo hướng dẫn có tốt không?
6_As_Mid_B    B     Có chứ. Cách tốt nhất là tập lên tạ theo đúng hướng dẫn.
6_Int_Fi_A    A     Liệu em có phải tập lên tạ?
6_As_Fi_B     B     Có. Mọi vận động viên đều phải tập lên tạ.
6_Int_In_B    B     Lên tạ theo cách nào hả anh?
6_Imp_In_A    A     À. Lên tạ theo đúng hướng dẫn cho tôi!
6_Int_Mid_B   B     Thế em có phải tập lên tạ bây giờ không?
6_Imp_Mid_A   A     Có. Cậu tập lên tạ theo tôi ngay!
6_Int_Fi_B    B     Thế em phải tập cả lên tạ?
6_Imp_Fi_A    A     Chứ còn gì nữa. Cứ theo hướng dẫn lên tạ!

Code          Role  Sentence
5b_As_In_A    A     Trong cuộc thi của nhà nông, bên tát theo cách mới đã giành thắng lợi.
5b_As_Mid_A   A     Chiến thắng thuộc về bên tát theo cách mới.
5b_As_Fi_A    A     Cuối cùng, phần thắng đã thuộc về bên tát.
Context: Trong một cuộc thi của nhà nông. Hai khán giả hỏi nhau:
5b_Int_In_A   A     Bên tát theo cách mới là bên nào thế anh?
5b_As_In_B    B     À. Bên tát theo cách mới là bên mặc áo đỏ.
5b_Int_Mid_A  A     Bên nào là bên tát theo cách mới hả anh?
5b_As_Mid_B   B     À. Bên đỏ là bên tát theo cách mới.
5b_Int_Fi_A   A     Trong lần thi này, liệu phần thắng có thuộc về bên tát?
5b_As_Fi_B    B     Có. Tôi nghĩ phần thắng sẽ thuộc về bên tát.
Context: Hai anh em đang ở dưới ruộng, nói chuyện với nhau:
5b_Int_In_B   B     Ơ bố mẹ đang tát nước kìa. Lên tát theo bố mẹ không anh?
5b_Imp_In_A   A     Có. Lên tát theo bố mẹ đi!
5b_Int_Mid_B  B     Ơ, bố mẹ đang tát nước trên kia kìa. Mình có lên tát theo bố mẹ không?
5b_Imp_Mid_A  A     Có, dưới này cũng làm xong rồi. Mình lên tát theo bố mẹ đi!
5b_Int_Fi_B   B     Bố mẹ tát nước trên kia kìa. Hay là anh em mình cũng lên tát?
5b_Imp_Fi_A   A     Ừ, dưới này cũng xong rồi. Nào anh em mình cùng lên tát!

Code          Role  Sentence
6b_As_In_A    A     Trong cuộc thi điêu khắc, bên tạc theo cách mới đã hoàn thành bức tượng sớm hơn.
6b_As_Mid_A   A     Chiến thắng thuộc về bên tạc theo cách mới.
6b_As_Fi_A    A     Cuộc thi làm tượng nhanh diễn ra giữa bên đúc và bên tạc.
Context: Trong một cuộc thi điêu khắc, hai khán giả trò chuyện với nhau:
6b_Int_In_A   A     Bên tạc theo cách mới là bên nào thế anh?
6b_As_In_B    B     À. Bên tạc theo cách mới là bên áo đỏ.
6b_Int_Mid_A  A     Bên nào là bên tạc theo cách mới thế?
6b_As_Mid_B   B     À. Bên áo đỏ là bên tạc theo cách mới.
6b_Int_Fi_A   A     Theo anh, liệu phần thắng có thuộc về bên tạc?
6b_As_Fi_B    B     Có. Tôi nghĩ phần thắng sẽ thuộc về bên tạc.
Context: Trong một xưởng điêu khắc, thợ tạc hỏi người thợ cả:
6b_Int_In_B   B     Có hai mẫu mới và cũ. Tạc theo mẫu nào hả anh?
6b_Imp_In_A   A     À. Tạc theo mẫu mới cho tôi!
6b_Int_Mid_B  B     Anh ơi! Lô tượng bên trên tạc theo mẫu nào hả anh?
6b_Imp_Mid_A  A     À. Lô bên trên tạc theo mẫu mới đi!
6b_Int_Fi_B   B     Vụ tạc tượng phật trên núi đã đủ thợ chưa hả anh? Liệu em có phải lên tạc?
6b_Imp_Fi_A   A     Có. Cả cậu cũng phải lên tạc!
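The sentence codes in the corpus text above appear to pack four fields into one label. A minimal sketch of a parser is given below, assuming that the leading label is the tone of the target syllable (1 to 6, 5b, 6b, matching ta, tà, tã, tả, tá, tạ, tát, tạc), that As/Int/Imp is the sentence type, that In/Mid/Fi is the position of the target phrase, and that the last letter is the speaker role. The names `CorpusCode` and `parse_code` are hypothetical.

```python
import re
from typing import NamedTuple


class CorpusCode(NamedTuple):
    tone: str           # "1" .. "6", "5b" or "6b" (assumed to be the tone label)
    sentence_type: str  # assertive / interrogative / imperative
    position: str       # initial / middle / final
    role: str           # speaker role, "A" or "B"


_TYPES = {"As": "assertive", "Int": "interrogative", "Imp": "imperative"}
_POSITIONS = {"In": "initial", "Mid": "middle", "Fi": "final"}

# The separator before the position field is accepted as "_" or a space,
# because some codes are printed as "1_Imp Fi_A".
_CODE = re.compile(r"^(\d+b?)_(As|Int|Imp)[_ ](In|Mid|Fi)_([AB])$")


def parse_code(code: str) -> CorpusCode:
    """Split a code such as '5b_Imp_Mid_A' into its four fields."""
    match = _CODE.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognized corpus code: {code!r}")
    tone, sentence_type, position, role = match.groups()
    return CorpusCode(tone, _TYPES[sentence_type], _POSITIONS[position], role)
```

Parsing the codes this way makes it easy to group the recordings by tone, sentence type and position before extracting patterns.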
B. Datasheet of prosody patterns

Each pattern lists 20 successive values of F0 in Hz, F0 in semitones (st) and intensity in dB, followed by the durations of the syllable and of its voiced (F0) part in ms.

Assertive sentence, male voice

Initial part               Middle part                Final part
F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)
162.65    8.23   70.99     180.21    9.54   69.90     153.85    7.28   71.28
163.03    8.27   71.53     180.42    9.52   70.80     153.42    7.23   71.94
162.96    8.25   71.82     178.30    9.30   71.26     153.01    7.17   72.14
162.76    8.23   71.92     177.88    9.27   71.43     152.51    7.11   72.12
162.37    8.19   71.87     177.40    9.23   71.35     152.42    7.09   71.99
162.07    8.15   71.70     177.01    9.20   71.07     152.38    7.08   71.84
161.80    8.13   71.48     176.65    9.16   70.70     152.45    7.09   71.66
161.84    8.14   71.25     176.48    9.15   70.34     152.22    7.06   71.37
161.78    8.13   71.04     176.58    9.15   70.04     151.97    7.03   71.01
161.42    8.09   70.84     176.85    9.16   69.77     151.27    6.96   70.61
161.55    8.11   70.62     177.00    9.16   69.46     150.76    6.90   70.14
161.49    8.10   70.33     176.94    9.14   69.05     150.17    6.84   69.45
161.27    8.08   69.94     176.84    9.11   68.53     149.58    6.77   68.36
161.07    8.06   69.40     176.89    9.10   67.91     149.10    6.73   66.81
160.69    8.03   68.65     176.78    9.09   67.15     149.37    6.80   65.00
160.42    8.00   67.65     176.53    9.07   66.18     147.78    6.62   63.27
160.02    7.96   66.34     174.97    8.97   65.04     146.89    6.52   61.54
159.48    7.91   64.73     172.82    8.81   63.64     145.75    6.40   59.80
158.63    7.82   62.83     171.33    8.68   61.75     146.23    6.45   57.75
157.04    7.65   60.75     169.78    8.54   59.26     146.34    6.47   55.54
Duration (ms), syllable:      149     177     320
Duration (ms), voiced part:    99     121     210

Assertive sentence, female voice

Initial part               Middle part                Final part
F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)
295.52   18.65   69.11     264.90   16.55   70.87     240.01   14.65   70.26
292.33   18.47   69.54     261.16   16.32   71.71     236.88   14.42   72.06
289.19   18.28   69.77     258.42   16.14   72.10     235.57   14.32   72.57
287.26   18.16   69.89     257.24   16.07   72.21     235.05   14.28   72.53
285.73   18.07   69.93     256.46   16.02   72.14     235.11   14.28   72.32
284.65   18.01   69.92     255.35   15.95   71.98     235.47   14.31   72.10
283.93   17.97   69.85     254.52   15.89   71.78     235.45   14.32   71.83
283.23   17.92   69.72     253.64   15.83   71.53     235.21   14.30   71.37
282.77   17.90   69.51     253.43   15.82   71.24     234.78   14.27   70.83
282.00   17.85   69.22     253.37   15.82   70.96     233.65   14.20   70.27
281.85   17.84   68.85     253.47   15.82   70.71     232.86   14.13   69.62
282.13   17.85   68.48     254.04   15.86   70.43     232.18   14.09   68.81
282.72   17.89   68.13     254.42   15.89   70.02     231.20   14.02   67.77
283.01   17.90   67.75     254.88   15.92   69.41     229.98   13.93   66.65
283.29   17.92   67.25     255.35   15.94   68.53     229.68   13.91   65.43
283.33   17.92   66.54     255.78   15.97   67.29     229.24   13.88   64.07
283.32   17.92   65.53     255.10   15.90   65.62     228.43   13.82   62.26
282.34   17.85   64.11     254.54   15.85   63.47     225.83   13.63   60.17
281.63   17.81   62.21     252.85   15.74   60.98     224.13   13.51   57.80
278.94   17.65   59.87     249.09   15.47   58.81     225.79   13.63   55.54
Duration (ms), syllable:      168     196     361
Duration (ms), F0 part:       100     126     229

Interrogative sentence, male voice

Initial part               Middle part                Final part
F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)
181.86   10.18   73.56     172.43    9.23   72.50     169.79    8.85   71.48
180.69   10.06   74.16     172.60    9.26   73.09     168.78    8.78   72.34
179.54    9.93   74.50     171.89    9.16   73.25     167.25    8.63   72.38
179.11    9.89   74.69     170.61    9.00   73.25     166.22    8.53   71.95
178.00    9.78   74.75     170.23    8.96   73.17     165.54    8.47   71.32
177.18    9.69   74.69     169.71    8.91   73.00     165.42    8.46   70.92
176.87    9.65   74.51     169.35    8.87   72.76     165.44    8.47   70.64
176.28    9.59   74.25     169.06    8.85   72.52     165.22    8.45   70.34
175.96    9.55   73.92     168.96    8.84   72.29     165.00    8.43   70.14
175.81    9.53   73.55     168.82    8.82   72.07     164.76    8.41   69.99
175.40    9.48   73.12     168.66    8.80   71.82     164.93    8.42   69.74
175.06    9.45   72.63     168.54    8.79   71.52     165.48    8.47   69.22
174.67    9.41   72.05     168.47    8.78   71.08     165.51    8.47   68.35
174.38    9.38   71.34     168.31    8.77   70.46     166.50    8.56   67.23
174.17    9.36   70.47     167.87    8.72   69.66     166.36    8.53   66.13
173.86    9.33   69.36     167.13    8.66   68.66     166.20    8.51   65.13
173.25    9.26   67.95     166.43    8.59   67.33     166.44    8.52   64.03
171.56    9.10   66.20     166.45    8.59   65.51     164.40    8.25   62.55
170.71    9.00   64.18     165.36    8.47   63.02     164.01    8.20   60.67
169.21    8.85   62.07     163.67    8.29   59.91     164.35    8.24   58.35
Duration (ms), syllable:      132     185     281
Duration (ms), voiced part:    86     127     187

Interrogative sentence, female voice

Initial part               Middle part                Final part
F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)
312.38   19.63   72.34     298.25   18.82   70.57     291.53   18.42   70.55
307.61   19.39   73.23     294.71   18.63   71.83     286.30   18.12   72.38
305.55   19.27   73.80     292.32   18.49   72.73     283.53   17.95   73.19
302.99   19.14   74.14     290.69   18.40   73.36     281.86   17.86   73.44
301.12   19.03   74.34     289.01   18.29   73.78     281.04   17.81   73.45
299.87   18.96   74.42     287.49   18.21   74.01     280.33   17.77   73.27
298.81   18.90   74.44     286.66   18.16   74.09     280.58   17.79   73.00
298.09   18.86   74.36     285.86   18.11   74.06     280.23   17.77   72.61
297.63   18.83   74.20     285.43   18.09   73.93     280.17   17.76   72.13
297.28   18.81   73.94     285.56   18.09   73.72     280.29   17.77   71.63
297.09   18.80   73.57     285.83   18.11   73.42     281.01   17.80   71.08
297.07   18.80   73.10     286.29   18.14   73.04     282.34   17.88   70.35
297.01   18.80   72.51     286.90   18.18   72.59     283.69   17.96   69.49
297.24   18.81   71.80     287.53   18.22   72.07     285.85   18.08   68.61
297.68   18.84   70.93     288.06   18.25   71.40     288.17   18.21   67.63
298.50   18.88   69.83     288.71   18.29   70.46     288.25   18.22   66.31
298.50   18.88   68.42     289.32   18.33   69.14     291.93   18.42   64.50
297.57   18.83   66.62     289.54   18.34   67.35     291.64   18.40   62.22
296.96   18.79   64.38     288.86   18.30   65.07     292.02   18.41   59.60
294.52   18.66   61.75     286.81   18.18   62.32     290.68   18.33   57.14
Duration (ms), syllable:      145     168     281
Duration (ms), F0 part:        87     106     199

Imperative sentence, male voice

Initial part               Middle part                Final part
F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)
178.07    9.72   74.22     166.28    8.52   72.52     178.57    9.71   75.21
177.27    9.64   74.47     165.50    8.43   73.52     178.89    9.77   77.01
176.55    9.58   74.61     165.40    8.41   74.14     178.16    9.70   77.78
175.76    9.50   74.67     164.87    8.35   74.47     177.91    9.67   77.91
174.98    9.42   74.69     164.13    8.26   74.65     177.80    9.66   77.65
174.36    9.35   74.69     162.93    8.11   74.74     177.57    9.63   77.28
174.08    9.33   74.65     161.65    7.95   74.76     177.27    9.60   76.86
173.94    9.32   74.55     161.41    7.94   74.68     177.12    9.59   76.38
173.88    9.32   74.38     162.59    8.14   74.53     177.00    9.59   75.80
173.76    9.31   74.12     162.85    8.20   74.28     176.24    9.51   75.05
173.48    9.28   73.77     162.68    8.18   73.97     174.95    9.39   74.21
173.27    9.26   73.36     162.43    8.15   73.59     173.77    9.28   73.36
173.16    9.24   72.90     162.09    8.11   73.15     172.74    9.19   72.41
173.06    9.23   72.36     162.02    8.10   72.63     171.43    9.07   71.22
172.97    9.23   71.72     161.85    8.09   71.97     169.59    8.89   69.79
172.81    9.21   70.97     161.67    8.07   71.11     167.57    8.68   68.28
172.68    9.20   70.08     161.44    8.04   69.96     166.24    8.55   66.72
172.54    9.18   68.98     161.17    8.01   68.41     165.10    8.43   65.00
172.31    9.16   67.64     160.71    7.96   66.46     162.74    8.15   62.88
171.42    9.07   66.00     159.67    7.86   64.15     162.07    8.04   60.61
Duration (ms), syllable:      129     142     291
Duration (ms), voiced part:    75      96     195

Imperative sentence, female voice

Initial part               Middle part                Final part
F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)    F0(Hz)  F0(st)  Int(dB)
269.43   16.50   71.84     293.55   18.56   71.79     311.46   19.61   72.53
266.25   16.34   72.66     289.61   18.34   72.91     306.69   19.35   74.05
263.43   16.18   73.17     287.29   18.21   73.65     304.61   19.23   74.48
261.32   16.05   73.44     285.88   18.12   74.12     303.23   19.16   74.62
259.70   15.95   73.54     284.37   18.03   74.35     302.40   19.11   74.50
258.46   15.87   73.50     283.76   17.99   74.36     302.06   19.09   74.19
257.49   15.81   73.37     282.70   17.93   74.22     301.84   19.08   73.80
256.97   15.78   73.16     281.91   17.88   74.01     301.71   19.08   73.49
256.61   15.75   72.89     281.38   17.84   73.75     301.29   19.05   73.20
256.85   15.77   72.54     281.21   17.83   73.46     300.63   19.01   72.74
257.22   15.79   72.11     281.18   17.83   73.15     299.39   18.94   72.22
257.56   15.81   71.62     281.21   17.82   72.81     298.05   18.86   71.69
258.24   15.85   71.05     280.92   17.80   72.39     297.06   18.81   71.19
258.67   15.88   70.38     281.13   17.81   71.83     295.21   18.70   70.59
259.16   15.90   69.54     281.27   17.82   71.10     295.00   18.69   69.79
259.81   15.95   68.46     281.49   17.83   70.16     293.48   18.60   68.61
259.58   15.93   67.05     281.07   17.80   68.92     294.19   18.65   66.97
259.13   15.90   65.35     281.60   17.83   67.31     292.11   18.52   64.72
254.95   15.66   63.37     280.33   17.75   65.32     285.62   18.12   61.72
250.14   15.36   61.20     280.52   17.76   63.02     282.03   17.91   58.59
Duration (ms), syllable:      154     182     317
Duration (ms), F0 part:        92     110     205
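To show how datasheet entries like these could feed the concatenation approach proposed in the thesis, here is a minimal sketch that chains syllable patterns into one time-stamped F0 contour. It rests on two assumptions that the datasheet does not state: the F0 samples of a pattern are equally spaced over the voiced part, and the voiced part occupies the end of the syllable (after an unvoiced onset). The names `SyllablePattern` and `concatenate` are hypothetical, and each example pattern keeps only samples 1, 10 and 20 of the male assertive columns for brevity.

```python
from typing import List, NamedTuple, Tuple


class SyllablePattern(NamedTuple):
    f0_hz: List[float]   # F0 samples over the voiced part
    syllable_ms: float   # total syllable duration
    voiced_ms: float     # duration of the voiced part


def concatenate(patterns: List[SyllablePattern]) -> List[Tuple[float, float]]:
    """Chain syllable patterns into (time_ms, f0_hz) points for a sentence."""
    contour = []
    start = 0.0
    for p in patterns:
        if len(p.f0_hz) < 2 or p.voiced_ms > p.syllable_ms:
            raise ValueError("invalid syllable pattern")
        # Assumption: the voiced part ends exactly at the syllable end.
        voiced_start = start + (p.syllable_ms - p.voiced_ms)
        # Assumption: samples are equally spaced over the voiced part.
        step = p.voiced_ms / (len(p.f0_hz) - 1)
        for i, f0 in enumerate(p.f0_hz):
            contour.append((voiced_start + i * step, f0))
        start += p.syllable_ms
    return contour


# Male assertive patterns, reduced to samples 1, 10 and 20 of the datasheet.
INITIAL = SyllablePattern([162.65, 161.42, 157.04], 149, 99)
MIDDLE = SyllablePattern([180.21, 176.85, 169.78], 177, 121)
FINAL = SyllablePattern([153.85, 151.27, 146.34], 320, 210)

if __name__ == "__main__":
    for time_ms, f0 in concatenate([INITIAL, MIDDLE, FINAL]):
        print(f"{time_ms:7.1f} ms  {f0:6.2f} Hz")
```

The same loop could carry the intensity values alongside F0; they are left out here to keep the sketch short.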
