News

Deutsch

Using AI to explore language learning

[26.09.2024]

How can AI improve the quality of language testing? A CATALPA project is doing the basic research here. An insight on “European Day of Languages” on September 26.


Two students in the university library Photo: Torsten Silz
Before starting their studies, prospective students from outside Germany must first pass a language test.

Students from outside Germany are required to take a language test before the start of their studies. To be admitted to the university, applicants must demonstrate at least level C1, or “proficient language skills,” according to the Common European Framework of Reference for Languages (CEFR).

Large data sets create a basis

But how well do the current methods really reflect the learning progress of the tested persons? With large data sets, computational linguist Dr. Josef Ruppenhofer from CATALPA research center wants to create a basis for improving the validity of test procedures in the future.

“So far, parts of the research community have assumed that language development takes place in stages, regardless of the age or educational background of learners,” explains Ruppenhofer. ”There have already been many studies on this concept – but always only with rather small groups of learners.” The developmental stages can be understood, for example, by the word order of verbs in a sentence. First, simple sentences are learned, such as "Ich brauch eine neue Wohnung" (I need a new apartment). Step by step, the sentences become more complex (see figure), until finally, subordinate clauses in which the verb must be at the end can also be formed correctly. The intermediate stages include, among other things, typical errors in sentence structure.

1. canonical word order: Ich suche eine neue Wohnung. 2. adverb preposing: Darum ich suche eine neue Wohnung. 3. verb separation: Darum ich muss eine neue Wohnung suchen. 4. inversion: Darum muss ich eine neue Wohnung suchen.5. verb-end: Weil ich eine neue Wohnung suche. Figure: CATALPA
Example of developmental stages in the positioning of verbs in German.

Researchers from Hagen and Leipzig are working together

Current language tests are based on this concept of developmental stages. However, whether the concept actually applies uniformly to all learners is disputed in recent linguistic research. The DAKODA research project, funded by the BMBF and short for “Data Competencies in German as a Foreign Language/German as a Second Language”, aims to provide more clarity here. Ruppenhofer is working on the project together with the computational linguist Prof. Dr. Torsten Zesch from Hagen and a team from the University of Leipzig led by Prof. Dr. Katrin Wisniewski. The researchers want to use artificial intelligence to enable a more differentiated analysis and thus be able to provide more precise conclusions about the quality of the developmental stages approach.

Different text forms as a basis

A lot of preparatory work is required for this: “When learners' language has been collected in the past, the case numbers have usually not been very large,” explains Ruppenhofer. In order to be able to use AI to make statements about learners' language skills, he must therefore first combine numerous data sets of collected texts – so-called text corpora.

Josef Ruppenhofer Photo: Hardy Welsch
Computational linguist Dr. Josef Ruppenhofer

simple information, such as the age of the learners, was not recorded consistently. “Sometimes we only have the year of birth, sometimes the age in years, sometimes the age was noted in years and months.”

Ruppenhofer is now merging these text corpora – to examine them himself, but also to make them widely available to other researchers. “This is not always possible,” he explains. “When the data was collected in the 1980s, of course no one thought that it might be made accessible online. As a result, we don't have sufficient consent forms from the participants.” These parts of the data cannot be shared and are analyzed only within the project.

International interest in the data

Nevertheless, a large corpus of text remains that can be published and searched using different criteria. There is already a great deal of interest in Ruppenhofer's data from the international research community. The interdisciplinary project offers numerous workshops for young researchers who want to work with the text corpora. Josef Ruppenhofer and Torsten Zesch take on the role of data experts in these workshops. “We have workshop participants from China and other European countries, for example,” says Ruppenhofer. “All in all, it's a really exciting exchange.”

Ruppenhofer is currently presenting some of his research findings at the Learner Corpus Research Conference in Tartu, Estonia. The date of the conference launch is appropriate for the topic: it starts on September 26, which is also the European Day of Languages.