Vol. 7 (2020)

Topic unit detection in spontaneous speech: Measuring reliability using the Kappa statistic

Frederico Amorim Cavalcante
Giulia Bossaglia
Maryualê Mittmann
Bruno Rocha
Published September 28, 2020

Keywords:

interrater agreement, information structure, topic, spontaneous speech, prosody
How to cite
Cavalcante, F. A., Raso, T., Bossaglia, G., Mittmann, M., & Rocha, B. (2020). Topic unit detection in spontaneous speech: Measuring reliability using the Kappa statistic. CHIMERA: Revista De Corpus De Lenguas Romances Y Estudios Lingüísticos, 7, 69–106. https://doi.org/10.15366/chimera2020.7.004


This paper reports an inter-annotator agreement test on the identification of the Topic information unit, as defined within the framework of the Language into Act Theory (L-AcT). Fleiss's kappa statistic was used to measure agreement among the four annotators who took part in the test. The data were sampled from C-ORAL-BRASIL II, a spontaneous speech corpus of Brazilian Portuguese. The paper begins by outlining the theoretical underpinnings of L-AcT, with special attention to aspects directly related to the notion of Topic. Section 2 presents the pilot test and discusses the methodological and theoretical issues that shaped the protocol eventually used in the actual test. Sections 3 and 4 deal with the test itself, its protocol and its results (the kappa coefficient for general agreement was 0.79, which by usual standards represents substantial agreement). Section 5 first briefly reviews a few studies, conducted within other frameworks, that have dealt with inter-rater agreement on the annotation of information structure categories. Finally, the errors observed in the test are analyzed qualitatively.
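For readers unfamiliar with the statistic, the following is a minimal Python sketch of how Fleiss's kappa can be computed for a fixed number of annotators. It is an illustration only, not the authors' procedure (the reference list cites the R package irr). The function name `fleiss_kappa`, the labels "TOP" and "OTHER", and the toy ratings are all hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss's kappa for a list of items, each rated by the same number of raters.

    ratings:    list of items; each item is a list with one category label per rater
    categories: list of all possible category labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must be rated by the same number of raters")

    # How many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement by chance, from the overall category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in p_cat)

    if p_expected == 1.0:
        # Only one category was ever used: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 5 units, 4 annotators, two labels (illustrative only).
toy_ratings = [
    ["TOP", "TOP", "TOP", "TOP"],
    ["OTHER", "OTHER", "OTHER", "OTHER"],
    ["TOP", "TOP", "TOP", "OTHER"],
    ["OTHER", "OTHER", "OTHER", "OTHER"],
    ["TOP", "TOP", "TOP", "TOP"],
]
print(fleiss_kappa(toy_ratings, ["TOP", "OTHER"]))
```

The statistic compares the observed proportion of agreeing annotator pairs with the proportion expected if labels were assigned at random according to their overall frequencies. A value of 1 indicates perfect agreement, and values at or below 0 indicate agreement no better than chance.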




Beck, K. 2012. Tübinger Baumbank des Deutschen/Zeitungskorpus (TüBa-D/Z). Seminar for Linguistics. Eberhard Karls University of Tübingen, Tübingen. https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/corpora/tueba-dz.html (accessed April 30, 2020).

Boersma, P. & Weenink, D. 2019. Praat: doing phonetics by computer: computer program. Version 5.4.21. Amsterdam: University of Amsterdam. http://www.praat.org/ (accessed February 2020).

Bossaglia, G. & Ferrari, L. A. 2019. The C-ORAL-BRASIL project: varied resources for the study of spoken Brazilian Portuguese. The Journal Of Speech Sciences 7(2), 65-77. https://doi.org/10.20396/joss.v7i2.15000

Cavalcante, F.A. 2015. The topic unit in spontaneous American English: a corpus-based study. M.A. thesis, Faculdade de Letras, Universidade Federal de Minas Gerais.

Cavalcante, F.A. 2020. The information unit of Topic: a crosslinguistic, statistical study based on spontaneous speech corpora. PhD diss., Faculdade de Letras, Universidade Federal de Minas Gerais.

Chafe, W.L. 1976. Givenness, contrastiveness, definiteness, subjects, topics, and point of view. In C.N. Li (ed). Subject and topic. New York: Academic Press.

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement 20(1), 37-46. https://doi.org/10.1177/001316446002000104

Cresti, E. 2000. Corpus di italiano parlato. Firenze: Accademia della Crusca.

Cresti, E. 2009. La Stanza: un'unità di costruzione testuale del parlato. X Congresso SILFI, 2009, Basileia. Atti del X Congresso SILFI: Sintassi storica e sincronica dell'italiano. Subordinazione, coordinazione e giustapposizione. Basileia, 1-25.

Cresti, E. 2011. The definition of focus in Language into Act Theory (LAcT). In: Mello, H., Panunzi, A., Raso, T. (org.). Pragmatics and Prosody: Illocution Modality, Attitude, Information Patterning and Speech Annotation. Firenze: Firenze University Press.

Cresti, E. 2014. Syntactic properties of spontaneous speech in the Language into Act Theory: data on Italian complements and relative clauses. In T. Raso, H.R. Mello (eds), Spoken corpora and linguistic studies. Amsterdam: John Benjamins, 365-410. https://doi.org/10.1075/scl.61.13cre

Cresti, E. 2018. The illocution-prosody relationship and the Information Pattern in spontaneous speech according to the Language into Act Theory (L-AcT). Linguistik Online, 88(1). https://doi.org/10.13092/lo.88.4189

Cresti, E. & Moneglia, M. (eds) 2005. C-ORAL-ROM. Integrated reference corpora for spoken Romance languages. Amsterdam: John Benjamins. https://doi.org/10.1075/scl.15

Cresti, E. & Moneglia, M. 2019. The Discourse Connector according to the Language into Act Theory: data from IPIC Italian. In: Bidese, E., Casalicchio, J. & Moroni, M. (eds.): La linguistica vista dalle Alpi. Teoria, lessicografia e multilinguismo / Linguistic views from the Alps. Language Theory, Lexicography and Multilingualism. Frankfurt am Main, Peter Lang, 99-126.

Cresti, E. & Moneglia, M. forthcoming. Il Connettore discorsivo secondo la Teoria sulla lingua in atto. In A. De Meo & F. Dovetto (eds.), Atti del Congresso GSCP "La comunicazione parlata" Università degli Studi di Napoli "L'Orientale" - Università degli Studi di Napoli Federico II (Napoli, 12-14 dicembre 2018), Napoli: Aracne.

Cook, P. & Bildhauer, F. 2013. Identifying "aboutness topic": two annotation experiments. Dialogue and Discourse 4(3), 118-141. https://doi.org/10.5087/dad.2013.206

Fleiss, Joseph L. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin 76(5). https://doi.org/10.1037/h0031619

Frosali, F. 2008. L'unità di informazione di ausilio dialogico: Valori percentuali, caratteri intonativi, lessicali e morfo-sintattici in un corpus di italiano parlato (C-ORAL-ROM). In E. Cresti (ed.), Prospettive nello studio del lessico italiano. Florence: Firenze University Press, 417-424.

Gamer, M., Lemon, J., Singh, I.F.P. 2019. irr: Various Coefficients of Interrater Reliability and Agreement. R package, version 0.84.1. https://CRAN.R-project.org/package=irr (accessed March 10, 2020).

Grønnum, N. 2006. DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Tapias, D. Proceedings of the 5th International Conference on Language Resources and Evaluation, Genova 24-26 May 2006. European Language Resources Association, Genova.

Krifka, M. 2008. Basic notions of information structure. Acta Linguistica Hungarica, 55(3-4), 243-276. https://doi.org/10.1556/ALing.55.2008.3-4.2

Kupietz, M., Keibel, H. 2009. The Mannheim German Reference Corpus (DeReKo) as a basis for empirical linguistic research. In Minegishi, M., Kawaguchi, Y. (eds.), Working Papers in Corpus-based Linguistics and Language Education, 3. Tokyo: Tokyo University of Foreign Studies (TUFS), 53-59. https://doi.org/10.1075/tufs.1.02min

Lambrecht, K. 1994. Information structure and sentence form: topic, focus and the mental representations of discourse referents. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511620607

Landis, J.R. & Koch, G.G. 1977. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics: 363-374. https://doi.org/10.2307/2529786

Li, C.N., Thompson, S.A. 1976. Subject and topic: a new typology of language. In C.N. Li (ed). Subject and topic. New York: Academic Press.

Lüdeling, A., Ritz, J., Stede, M., Amir, Z. 2016. Corpus linguistics and information structure research. In C. Féry, S. Ishihara (eds). The Oxford Handbook of Information Structure. Oxford: Oxford University Press, 599-620. https://doi.org/10.1093/oxfordhb/9780199642670.013.013

Mello, H.R. 2014. Methodological issues for spontaneous speech corpora compilation: the case of the C-ORAL-BRASIL. In T. Raso, H.R. Mello (eds), Spoken corpora and linguistic studies. Amsterdam: John Benjamins, 29-68.

Mittmann, M. M. 2012. O C-ORAL-BRASIL e o estudo da fala informal: um novo olhar sobre o tópico no português brasileiro. PhD diss., Faculdade de Letras, Universidade Federal de Minas Gerais, Belo Horizonte.

Moneglia, M., Raso, T. 2014. Notes on Language into Act Theory (L-AcT). In T. Raso, H.R. Mello (eds), Spoken corpora and linguistic studies. Amsterdam: John Benjamins, 469-495.

Paggio, P. 2006. Annotating information structure in a corpus of spoken Danish. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), 1606-1609. http://www.lrec-conf.org/proceedings/lrec2006/pdf/639_pdf.pdf (accessed April 2020). https://doi.org/10.3115/1608974.1609006

Raso, T. 2012. O corpus C-ORAL-BRASIL. In T. Raso, H.R. Mello (eds), C-ORAL-BRASIL I: corpus de referência de português brasileiro falado informal. Belo Horizonte: UFMG, 55-90.

Raso, T. 2014. Prosodic constraints for discourse markers. In T. Raso, H.R. Mello (eds), Spoken corpora and linguistic studies. Amsterdam: John Benjamins, 411-467. https://doi.org/10.1075/scl.61.14ras

Raso, T. & Ferrari, L.A. forthcoming. Uso dei Segnali Discorsivi in corpora di parlato spontaneo italiano e brasiliano. In: Ferroni, R., Birello, M. (eds.) 2020. La competenza discorsiva a lezione di lingua straniera. Roma: Aracne.

Raso, T., Mello, H.R. (eds) 2012. C-ORAL-BRASIL I: corpus de referência de português brasileiro falado informal. Belo Horizonte: UFMG.

Raso, T., Mello, H.R., Ferrari, L.A. forthcoming. C-ORAL-BRASIL II: corpora of Brazilian Portuguese speech in formal, media, and telephonic interactions.

Raso, T., Cavalcante, F., Mittmann, M. 2017. Prosodic forms of the Topic information unit in a cross-linguistic perspective: a first survey. Proceedings of the SLI-GSCP International Conference, 13-15 June, 2016. A. de Meo & F. M. Dovetto (eds), Rome: Aracne editrice, 473-498.

Raso, T. & Rocha, B. 2017. Illocution and attitude: on the complex interaction between prosody and pragmatic parameters. JOSS Journal Of Speech Science, 5, 5-27. https://doi.org/10.20396/joss.v5i2.15062

Raso, T., Vieira, M. 2016. A description of Dialogic Units/Discourse Markers in spontaneous speech corpora based on phonetic parameters. Chimera: Romance Corpora and Linguistic Studies, 3, 221-249.

R Development Core Team. 2019. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org (accessed February 2020).

Ritz, J., Dipper, S., Michael, G. 2008. Annotation of information structure: an evaluation across different types of texts. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). http://www.lrec-conf.org/proceedings/lrec2008/ (accessed April 2020).

Skopeteas, S., Fiedler, I., Hellmuth, S., Schwarz, A., Stoel, R., Fanselow, G., Féry, C., Krifka, M. 2006. Questionnaire on Information Structure (QUIS). In Working Papers of the SFB632, Interdisciplinary Studies on Information Structure (ISIS) 4. Potsdam: Universitätsverlag Potsdam.

Stede, M. 2004. The Potsdam Commentary Corpus. Proceedings of the ACL 2004 Workshop on Discourse Annotation, 96-102. https://doi.org/10.3115/1608938.1608951

Viera, A.J. & Garrett, J.M. 2005. Understanding interobserver agreement: the kappa statistic. Fam med 37(5), 360-363.