List of languages
Click on the ‘link’ near the name of a language to open the description. Summary data for languages being prepared for publication can be found in the ‘Patterns overview’ section.
-
glottocode is the language’s code provided by Glottolog.
-
language is the name of the language. Note that in some cases there is no one-to-one correspondence between language names used in this project and their glottocodes. For example, Finnish and Ingrian Finnish are two different languages for the purposes of this project, but they share the same glottocode ‘finn1318’.
-
macroarea identifies the macro-area (typically subcontinent size) where the language is spoken. The following partition of the world is used in this project: Australia; East and Southeast Asia; Europe; Mesoamerica; North Africa; North America; North and Central Asia; Papunesia; South America; South Asia; Sub-Saharan Africa; West Asia and the Caucasus.
-
family (WALS) and genus (WALS) contain information on the genealogical affiliation of the language as provided in the World Atlas of Language Structures Online. Although imperfect in many respects, the system employed by WALS is convenient in that it provides a uniform two-level affiliation for each language, where “family” corresponds to a taxon with a time-depth comparable to that of the Indo-European languages, and “genus” to a taxon with a time-depth comparable to that of the major branches of the Indo-European family, such as Germanic or Celtic.
-
number of nominal cases is the total number of different cases in the language’s nouns according to the description employed in the project.
-
overall N is the total number of patterns that meet the acceptance criteria; see How to read the data for more detail.
-
transitives and intransitives are the total number of transitive and intransitive patterns respectively. Their sum always equals ‘overall N’.
-
transitivity ratio and intransitivity ratio are coefficients calculated by dividing the number of transitive and intransitive patterns, respectively, by the ‘overall N’. The sum of these two ratios always equals 1.
-
X-locus, Y-locus, and XY-locus are the numbers of patterns that display oblique encoding of the first argument (X), the second argument (Y), or both predefined arguments of the verb (X and Y), respectively. The sum of these three numbers always equals the total number of intransitives. See How to read the data for more detail on X-, Y- and XY-locus.
-
number of classes is the total number of different valency patterns observed in the data.
-
entropy (nat) measures the degree of diversity observed in the language’s valency class system. Shannon’s entropy (measured in nats) is calculated as follows: \[ \displaystyle H = - \sum^{n}_{i=1} \left( \dfrac{|C_i|}{|P|} \right) \log \left( \dfrac{|C_i|}{|P|} \right) \] where \(n\) is the number of different valency patterns observed in the data (‘number of classes’), \(|C_i|\) is the number of patterns in the \(i\)-th valency class, \(|P|\) is the total number of patterns that meet the acceptance criteria (‘overall N’), and \(\log\) is the natural logarithm. The theoretical minimum for \(H\) is 0, which would be observed in a hypothetical language where all bivalent verbs belong to the same valency class. Higher entropy values correspond to greater levels of diversity.
-
normalised entropy is entropy corrected for the size of the collected sample of verbs. Entropy is positively correlated with the size of the support of the distribution for which it is calculated. The number of data points obtained for different languages varies between 55 and 130, and we may assume that the entropy estimates for languages with a small overall N are skewed downwards because smaller samples underrepresent the range of variation found in these languages. To correct for this, we fix a sample size of 100 and for each language compute the expected value of its entropy for this sample size. The expected entropy is computed according to the formula \( H k_n \), where \(H\) is the actual entropy value for the language and \(k_n\) is the correction coefficient dependent on the sample size \(n\).
The values for the coefficient were computed using a variant of bootstrap resampling: for each sample of verbs \(k\) of size \(s \ge 100 \), we first sampled 100 subsamples of size 100 without replacement and computed their average entropy \(e_{k,100}\). We then took 100 subsamples \(a_{k,i,j}, i \in \{1, \dots, 100\}\) of sizes \(j \in \{55, 56, \dots , s\}\) and computed the average of \( \dfrac{e_{k,100}}{e_{k,i,j}} \) over them. These bootstrapped per-language and per-sample-size coefficients were then averaged over languages to derive \(k_n\). To make them more robust, we divided sample sizes into several bins and associated a pooled correction coefficient with each bin.
from    to     Correction
55      59     1.04
60      69     1.03
70      78     1.02
79      92     1.01
93      109    1
110     130    0.99
-
entropy of intransitives is the observed Shannon’s entropy calculated for intransitive patterns only. This measure estimates the degree of diversity in bivalent intransitive classes.
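The entropy-related columns described above can be illustrated with a minimal Python sketch. It is not the project's own code: the function names and the class labels ("TR", "DAT", "INS") are hypothetical, and the sketch assumes that each accepted pattern is represented by the label of its valency class and that 'overall N' is the length of that list. The correction coefficients are copied from the table of bins above.

```python
import math
from collections import Counter

# Pooled correction coefficients copied from the table of bins:
# (smallest sample size, largest sample size, coefficient).
CORRECTION_BINS = [
    (55, 59, 1.04),
    (60, 69, 1.03),
    (70, 78, 1.02),
    (79, 92, 1.01),
    (93, 109, 1.00),
    (110, 130, 0.99),
]


def entropy_nats(class_labels):
    """Shannon entropy (in nats) of a list of valency class labels."""
    counts = Counter(class_labels)
    total = sum(counts.values())
    # -p * log(p) is written as p * log(1 / p), which is the same quantity
    # and avoids returning a negative zero for a single-class sample.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def correction_coefficient(sample_size):
    """Look up the pooled coefficient k_n for a given sample size."""
    for low, high, coefficient in CORRECTION_BINS:
        if low <= sample_size <= high:
            return coefficient
    raise ValueError(f"sample size {sample_size} is outside the range 55-130")


def normalised_entropy(class_labels):
    """Observed entropy multiplied by the coefficient for the sample size."""
    return entropy_nats(class_labels) * correction_coefficient(len(class_labels))


if __name__ == "__main__":
    # Hypothetical toy sample: 60 transitive patterns and two oblique classes.
    labels = ["TR"] * 60 + ["DAT"] * 30 + ["INS"] * 10
    print(entropy_nats(labels))        # 'entropy (nat)'
    print(normalised_entropy(labels))  # 'normalised entropy'
    # Entropy of intransitives: the same measure over non-transitive patterns.
    intransitive = [label for label in labels if label != "TR"]
    print(entropy_nats(intransitive))
```

The lookup raises an error for sample sizes outside 55 to 130 rather than extrapolating, since the table gives no coefficient for such sizes.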