Brasolin, P., & Bienati, A. (2025). Phraseology meets information theory: Going beyond the bag-of-words approach in complexity measures. Journal of the European Second Language Association, 9(1), 103–123. DOI: https://doi.org/10.22599/
Abstract
This article investigates how phraseological diversity measures behave across two axes of variation, expertise and order, with the aim of determining how well they discriminate between levels of writing expertise and how sensitive they are to the orderliness of texts. Through a scoping review of phraseological complexity studies, we identify the conceptualizations and operationalizations underlying the phraseological complexity construct. Phraseological complexity relies on modelling interrelationships between words, so that the complexity of a word is assessed not in isolation but in relation to the words it co-occurs with. After testing extensions of classical lexical diversity measures (e.g., those based on the type-token ratio [TTR]), we borrow from information theory the measure of information fluctuation complexity, which models the interrelationships between consecutive tokens via token pair frequencies. Using a controlled simulation setting, we apply these measures to four corpora of Italian spanning the spectrum from more expert to less expert writers, as well as to synthetic corpora that represent more orderly and more disorderly variants of the same texts. Although TTR-based measures computed on bigrams also capture changes in text structure, fluctuation complexity is the only measure that exhibits a bell-shaped curve, peaking for original texts and decreasing for both orderly and disorderly variants, thus capturing an intuitive notion of complexity.
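
To make the two families of measures named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace-tokenized input, a plain (unnormalized) TTR over adjacent token pairs, and one commonly used formulation of information fluctuation complexity: the pair-probability-weighted mean of the squared log-ratio of the unigram probabilities of consecutive tokens. The function names `bigram_ttr` and `fluctuation_complexity` are hypothetical, and the article's exact operationalization (e.g., normalization, estimation of probabilities, text-length control) may differ.

```python
from collections import Counter
from math import log2


def bigram_ttr(tokens):
    """Type-token ratio over adjacent token pairs (bigrams).

    Assumption: plain TTR with no length correction.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)


def fluctuation_complexity(tokens):
    """Information fluctuation complexity, under the assumed formulation:

        sum over observed pairs (a, b) of  p(a, b) * log2(p(a) / p(b)) ** 2

    where p(a, b) is the relative frequency of the consecutive pair and
    p(a), p(b) are unigram relative frequencies estimated from the same text.
    """
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    token_counts = Counter(tokens)
    total = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / len(pairs)
        # The corpus size cancels in p(a) / p(b), so raw counts suffice.
        net_gain = log2(token_counts[a] / token_counts[b])
        total += p_ab * net_gain ** 2
    return total


if __name__ == "__main__":
    text = "the cat sat on the mat".split()
    print(bigram_ttr(text))
    print(fluctuation_complexity(text))
```

Under this formulation, a text that repeats a single token has zero fluctuation complexity, because every transition links tokens of equal probability. Whether reordered variants of a real text score lower than the original, as the abstract reports, depends on the corpus and on the exact estimator used.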