Keynote Speakers

Gaëtanelle Gilquin (PhD UCLouvain) is Professor of English Language and Linguistics at UCLouvain. She is the director of several corpus projects such as the Louvain International Database of Spoken English Interlanguage (LINDSEI), a collaborative project between some 25 universities internationally, and the Process Corpus of English in Education (PROCEED), a new type of learner corpus which reproduces the writing process through screencasting and keylogging. She is the co-editor-in-chief of the Corpora and Language in Use series and an associate editor of the Cambridge Elements in Corpus Linguistics series. She is one of the editors of the Cambridge Handbook of Learner Corpus Research and the author of several publications dealing with corpus linguistics, and in particular learner corpus research. Her research interests include corpus linguistics and learner corpus research, cognitive linguistics and construction grammar, the combination of corpora and experimentation, varieties of English and the link between New Englishes and Learner Englishes, as well as writing processes and writing fluency in L1 and L2.

Following the LCR 2024 organizers’ call to “champion corpus research involving learning of smaller languages and by learners with a small L1, as well as language learning in plurilingual settings”, this presentation aims to move the debate beyond the L1s and L2s studied in learner corpus research to ask whether the field as a whole is inclusive enough. In doing so, it also seeks to contribute to the broader trend of discussing and promoting inclusion in linguistics, a trend that has been growing lately (see, e.g., Charity Hudley et al. 2024).

I will start by showing how learner corpus research has been inclusive in some respects and how recent developments in the field have led to enhanced inclusion, for example through the consideration of a wider range of languages or the adoption of a “Diversity and Inclusion Statement” by the Learner Corpus Association. I will then review different aspects of learner corpus research for which some of our practices may have been less than optimal in fostering inclusion. These aspects will include the learner populations investigated, the choice of norms in learner corpus studies (see also Gilquin 2022), the terminology used in the field and the pedagogical applications based on learner corpora. For each of these aspects, the importance of more inclusive approaches will be underlined, with examples taken from the literature, and suggestions will be made on how to further enhance inclusion. To conclude, it will be emphasized that developing inclusive learner corpus research is a slow process, but one to which we should devote the attention it deserves, both as individuals and as a community.

References
Charity Hudley, Anne H., Mallinson, Christine & Bucholtz, Mary. 2024. Inclusion in Linguistics. Oxford: Oxford University Press.
Gilquin, Gaëtanelle. 2022. ‘One norm to rule them all? Corpus-derived norms in learner corpus research and foreign language teaching’. Language Teaching 55(1): 87-99.

Cristóbal Lozano (PhD Essex) is currently Associate Professor in English Applied Linguistics at the Universidad de Granada (Spain). His main research interests are Second Language Acquisition, Bilingualism, and Learner Corpus Research. He heads BilinguaLab (Laboratory of language Acquisition and Processing in bilinguals) and directs two learner corpora: CEDEL2 (Corpus de Español como L2) since 2006, a freely-available and large corpus of L2 Spanish learners coming from eleven different L1 backgrounds (English, Greek, Japanese, Arabic, Italian, etc) and also COREFL (Corpus of English as a Foreign Language). His workgroup also manages http://learnercorpora.com, a website where learners and natives from different and diverse L1s can participate in the corpora. Dr. Lozano is currently the Principal Investigator of ANACOREX, a research project that focuses on the acquisition of anaphora and reference in L2 Spanish and L2 English. He advocates for the triangulation of corpus methods and psycholinguistic experimental methods (reaction time, eye tracking) to better understand the language of bilinguals and L2 learners.

This talk will discuss the principles behind the design, data collection and compilation of any freely-available learner corpus that intends to be maximally informative not only for SLA researchers but also for a wide range of foreign language users and practitioners. The talk is ultimately intended to problematise the challenges that a solid L2 corpus design has to face with a view to signposting L2 corpus designers to foresee challenges of learner corpora that will likely grow and expand across time.

We will do so by assessing the lessons learnt from the evolution (and errors!) in the design and compilation of CEDEL2. Since its inception in 2004, CEDEL2 started as an L1 English-L2 Spanish written corpus. Over the past 20 years, many challenging decisions (which ultimately affect the overall design and usability of the corpus) had to be taken along the way. CEDEDL2 has grown into a large, multi-L1 corpus of L2 Spanish. In its current online web search & download interface (version 2.0), it contains mainly written (and some spoken) data from 11 typologically (un)related mother tongues (English, German, Dutch, French, Portuguese, Italian, Greek, Russian, Japanese, Chinese, Arabic). Version 3.0 is under development and will incorporate more written and spoken data, new L1s (Estonian, Swedish, Polish, Turkish and Vietnamese) and a new automatized data-collection interface.

The principles we will discuss relate to theoretical and technical decisions that have to be made if the corpus is intended to be used by a wide range of users. We will consider whether SLA-relevant questions should be asked first before deciding which linguistic-profile variables to collect for the corpus to be maximally informative for SLA researchers. We will discuss the convenience (and feasibility) of collecting other (non-)linguistic and cognitive variables that are central in SLA research (e.g., aptitude, motivation, dominance, attrition) and in pscyholinguistic approaches to SLA (e.g., working memory), which would require using additional standardized tests. However, a balance must be struck between the amount of linguistic information to be collected and the length of time participants need to complete the profiles and the actual linguistic data: the task. Task-related variables (from the actual task prompt to the resources, conditions and timing used to do the task) require special attention since these will affect the final output and may often mask the linguistic knowledge the learner is capable of attaining. We will also consider other data-related issues, like the benefits of collecting a native ‘control’ subcorpus (for every learner subcorpus), written and spoken data (from the same participant), different tasks (produced by the same participant), etc.

Technical aspects may be determined by the previous decisions and will determine the usability of the corpus: the transcription conventions for spoken data (if any), the protocols for massive data collection (and whether to automatize it), and the computational aspects behind an attractive (yet useful) online corpus search and download interface. These (and other) decisions are essential for the design of a valuable L2 corpus, but are vital if the corpus designer wants to go one step ahead and design a twin corpus (or mirror-image corpus). This was the case of COREFL, an L2 English corpus that was designed after CEDEL2 so as to be able to do mirror-image comparisons (e.g., L1 English-L2 Spanish vs L1 Spanish-L2 English, amongst others).

We will end up with a discussion of whether triangulation (i.e., the combination of naturalistic corpus-production data with less naturalistic but more controlled experimental data) is desirable to investigate a given linguistic phenomenon in sciences like SLA and bilingualism.

Resources:
LearnerCorpora website (for data collection and participation): http://learnercorpora.com
CEDEL2 corpus, L2 Spanish (freely searchable and downloadable): http://learnercorpora.com
COREFL corpus, L2 English (freely searchable and downloadable): http://corefl.learnercorpora.com
BilinguaLab (for an overview of the corpora and research aims): http://bilingualab.ugr.es

Ilmari Ivaska (PhD Turku) is Assistant Professor in the Department of Finnish and Finno-Ugric Languages at the University of Turku. His research interests revolve around methods in corpus linguistics, particularly crosslinguistic and quantitative contrastive research designs in the context of learner, mediated, and heritage language varieties. Most of Ivaska’s work addresses the forms linguistic variation takes in the outcome, the sources to which it can or has been attributed, as well as the effects it may have in language reception. He has published numerous articles on Finnish as a learner and heritage language, and on the commonalities and differences between learner language and translated language focusing on English, Finnish, and Italian. An important strand in Ivaska’s research stems from the typological nature of Finnish as a language with complex inflectional paradigms, and the resulting fact that it often diverges drastically from L2 Finnish learners’ earlier language repertoire. This has often led him to approach learner language from the point of view of cross-linguistic influences and to explore typologically motivated patterning in learner language production. In addition, Ivaska is interested in academic learner language and especially situations where the learned language is not, contrary to the majority of research in the field, a lingua franca. Such situations provide additional layers of complexity, both in terms of learners’ individual motivation for learning, and in terms of the overall language repertoires these learners possess and make use of.

Cross-linguistic influences (CLIs) between an individual’s second language (L2) and their first language (L1) have been one of the most central topics within learner corpus research. While the methodological approaches to addressing CLIs have been discussed extensively, yielding widely used comparable methodological frameworks (Jarvis 2000; Jarvis 2010; see also Jarvis & Crossley 2012; Jarvis & Paquot 2015), the conceptual nature of these influences has received less attention. Building in part on the work of Håkan Ringbom (2007), this presentation discusses construction level CLIs and system level CLIs as profoundly different types of phenomena. Consider examples (1)–(4) that all express more or less the same meaning ‘you run in the forest’.

The four languages belong genealogically to two different genera: Finnish and Estonian in the Finnic genus of the Uralic language family, and Italian and French in the Romance genus of the Indo-European language family, respectively. The linguistic subsystems that underlie the examples follow this categorization in many ways both lexically (e.g. juokset and jooksed as opposed to corri and cours) and grammatically (e.g. locative expression using case endings as opposed to prepositions). In some ways, however, these languages group differently: the overtly expressed grammatical subject (sinä, sa, tu) is obligatory only in French. While both system level and construction level CLIs have been widely accepted and extensively studied within learner corpus research, there has so far been only little discussion on their relationship, and even attempts to systematically tease the two apart are few and far between (however, see Murakami 2016). Crucially to theorizing the generalizability of CLIs, the two types of CLIs differ in that the system level is typically an open-ended categorization with a large or unknown number of possible values, as opposed to the construction level, which is typically a closed-class categorization with only few possible values.

In this paper, I argue that the system level similarities and differences (in the above example, Finnish and Estonian in contrast to Italian and French), and those specific to individual constructions (in the above example, the obligatoriness of the overt subject marking) can both induce cross-linguistic influences. What is more, I argue that they can combine in multiple ways: so that system level CLIs take place while construction level CLIs do not, so that construction level CLIs take place and system level do not, and so that they both play a role. I propose that learner corpus studies addressing CLIs should account for the difference in the type of CLI, and that this should optimally be visible both in the data and in the method. I will address this argument from the point-of-view of Finnish as an L2, and looking at several different linguistic phenomena (e.g. the above-described subject marking, case marking of grammatical objects, constructional strategies involved in tense expressions, as well as those involved in locative expressions). The data come from various corpora of L2 Finnish (ICLFI, see Jantunen 2011; LAS2, see Ivaska 2014; TOPLING, see University of Jyväskylä 2016) to allow for comparing L2 learners of Finnish with a range of both typologically and genealogically diverging L1s. I will also sketch some methodological options for capturing this fundamental difference of different types of CLIs, but would like to invite the LCR community to take up on the task of finding new appropriate yet feasible solutions.

References
Ivaska, Ilmari. 2014. The Corpus of Advanced Learner Finnish (LAS2): Database and toolkit to study academic learner Finnish. Apples – Journal of Applied Language Studies 8(3). 21–38.
Jantunen, Jarmo. 2011. Kansainvälinen oppijansuomen korpus (ICLFI): typologia, taustamuuttujat ja annotointi. Lähivõrdlusi. Lähivertailuja 21. 86–105. https://doi.org/10.5128/LV21.04.
Jarvis, Scott. 2000. Methodological rigor in the study of transfer: identifying L1 influence in the interlanguage lexicon. Language Learning 50(2). 245–309. https://doi.org/10.1111/0023-8333.00118.
Jarvis, Scott. 2010. Comparison-based and detection-based approaches to transfer research. EUROSLA Yearbook 10. 169–192.
Jarvis, Scott & Scott Crossley (eds.). 2012. Approaching Language Transfer through Text Classification: Explorations in the detection-based approach. Bristol, Buffalo, Toronto: Multilingual Matters.
Jarvis, Scott & Magali Paquot. 2015. Learner corpora and native language identification. In Sylviane Granger, Gaëtanelle Gilquin & FannyEditors Meunier (eds.), The Cambridge Handbook of Learner Corpus Research (Cambridge Handbooks in Language and Linguistics), 605–628. Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.027.
Murakami, Akira. 2016. Modeling Systematicity and Individuality in Nonlinear Second Language Development: The Case of English Grammatical Morphemes. Language Learning 66(4). 834–871. https://doi.org/10.1111/lang.12166.
Ringbom, Håkan. 2007. Cross-linguistic Similarity in Foreign Language Learning. Clevedon: Multilingual Matters. https://doi.org/10.21832/9781853599361.
University of Jyväskylä. 2016. The Finnish Subcorpus of Topling – Paths in Second Language Acquisition. Data set. Kielipankki. http://urn.fi/urn:nbn:fi:lb-2016111802.

Keynote Speakers

Read the abstract of Gaëtanelle Gilquin’s presentation: “Opening Pandora’s box: Is learner corpus research inclusive enough?”

Read the abstract of Cristóbal Lozano’s presentation: “Designing and compiling a multi-L1 corpus of L2 Spanish: Lessons learnt from CEDEL2”

Read the abstract of Ilmari Ivaska’s presentation: “Cross-linguistic influences in different levels of granularity: How are they different and why does it matter for LCR?”