ARTICLE
Getting started with digital literacy using pedagogic  
corpora in young learners’ English classes  
Iniciando o letramento digital usando corpora nas aulas de Inglês de alunos  
das séries iniciais  
Universidade Federal de Minas Gerais, MG, Brasil.
SUBMISSION TIMELINE
Submitted: 26/04/23
Accepted: 11/11/23
Published: 24/11/23
Abstract  
This paper addresses the provision of instruction to teachers and learners on
becoming digitally literate and skilled with corpus-based tools. It proposes
the compilation of specific pedagogic corpora, in this case in Geography and
Science, to be used in English classes for young learners in elementary
education. The paper also illustrates the use of the user-friendly
concordancing tools of #LancsBox 6.0 to perform basic analysis of the
language in the corpora through a more accessible technology. It concludes by
summarizing the possible benefits of an alternative approach that exposes
learners to customized corpus-informed language in English.
DOI: 10.23925/2318-7115.2023v44i2e61842
Keywords: Pedagogic corpora, Digital literacy, Young learners, Concordancer  
tools.  
Resumo  
Este artigo discorre sobre a necessidade de professores e alunos se tornarem  
letrados digitalmente e habilitados a usar ferramentas de análise do conteúdo de  
corpora. Propõe-se a compilação de dados específicos, neste caso de Geografia e  
Ciências, para uso pedagógico em aulas de inglês do ensino fundamental I. O  
artigo ilustra a utilização básica das ferramentas digitais do software #LancsBox  
6.0 que analisam o conteúdo através de uma tecnologia mais acessível. Conclui-se  
apresentando possíveis benefícios do uso de uma abordagem de ensino  
alternativa que exponha os jovens aprendizes ao conteúdo linguístico em inglês  
baseado em corpora.  
Palavras-chave: Corpora pedagógicos, Letramento digital, Jovens aprendizes,  
Ferramentas de concordância.  
Distributed under a Creative Commons License
1. Introduction  
The most recent generation of young learners, currently in elementary school, challenges the
learning and teaching boundaries of the past even further, while demanding teachers’ mentoring
in new ways in the classroom. To date, teachers’ roles have been multifaceted: providing learners
with the subject-matter contents of their lessons and also guiding them towards meeting learning
goals. However, most of those roles are now being disputed, as technology and portable devices,
available to a large portion of the population, instantly offer learners the information required
for their day-to-day lives. This is a turning point regarding the source of knowledge: the tools
available can supply the present generation of learners with the right answers at their fingertips.
Teachers are still invaluable as curators of the information readily available, but their
roles have been changing fast as the new generation gradually takes agency of its own learning
path. It is a long-held belief among educators that contemporary education should enable learners
to be more engaged in and committed to their own learning process and responsible for its results
(Chambers, 2010). This is the window of opportunity teachers have to motivate learners to make
effective use of the digital tools available to improve learning, and to lighten the weight
teachers have traditionally carried as linguistic authorities (Aston, 2007).
The rapid advance of technology has also changed the educational scenario: learners now need
English to navigate websites, as it is the most commonly used language on the internet. Young
learners therefore need to start learning English earlier and, in tandem, to acquire the
information and communication technology (ICT) skills demanded by the new digital learning mode.
All those involved in their literacy process should recognize that these changes have to take
place urgently.
Thus, to broaden young learners’ linguistic scope, we started to speculate about a way to
accelerate English learning in subjects like Geography and Science at a much younger age
(Marinova-Todd et al., 2000), while attempting to integrate corpus-informed pedagogy in our
Fundamental I¹ schools. If, as a first step, second language (L2) teachers could resort to
topicalized corpora, i.e., corpora in subjects other than general English, to devise activities
for cross-curricular projects, they would be able to increase young learners’ exposure to L2
vocabulary in their early school years. If learners could take part in the change, it would
motivate and engage them in the process of discovering the intricacies of the language patterns
by themselves (Schmidt, 1990; Johns, 1990; Sinclair, 1991). The corpora would offer authentic
language (McCarten, 2007) at the correct levels of proficiency and on the topics they would
otherwise be exposed to only in their first language (L1). Such exposure could trigger deliberate
learning, for example, with learners working with vocabulary where the primary aim of the
activity is to learn target words intentionally and explicitly (Webb; Chang, 2012).

¹ Fundamental I (Brazil) and Elementary (US) years are equivalent and will be used interchangeably.

São Paulo (SP), v. 44 n.2, ago./dez.2023
ISSN 2318-7115
Ana Lúcia Surerus Pitanguy MARQUES
In response to the claims above, this paper addresses the demands of contemporary
society for a more inclusive classroom. We focus on the advantages of using pedagogic corpora
and the digital tools that facilitate their use. Section 2 briefly discusses relevant principles
underlying second language acquisition which resonate with the approaches now proposed for
classwork. Section 3 claims that both teachers and learners should be better oriented in
educational digital environments. It then proposes one way of improving the learning of
vocabulary and language patterns: using pedagogic corpora and a concordancer which analyzes
language to be used in class activities. Section 4 outlines the compilation of a pedagogic corpus
with language at an appropriate level for elementary school students (Pérez-Paredes, 2020), and
Section 5 describes how the contents are accessed and analyzed with a very user-friendly
concordancing software, #LancsBox 6.0². Section 6 describes how digital tools are used to examine
some language combinations and patterns, in the hope of motivating teachers and young learners to
make meaningful use of them. Section 7 briefly compares the pedagogic corpus with a traditional
adult corpus, in this case the BNC2014-baby. Section 8 presents the authors’ remarks and a
conclusion showing the positive aspects of using technology with teachers and young learners.
2. Revisiting second language acquisition (SLA)
Research in the depth and breadth of SLA to date has reiterated the advantages of learners
being exposed to contextualized L2 in the classroom. One widely recommended approach to
vocabulary learning is the selection and use of different activities on the same topic, where
words are grouped semantically into lexical sets to increase the potential for recurrent exposure
to target vocabulary (Nation, 2020; Schmidt, 1990). Investigations have pointed out that noticing
and discovering the relationships between the words that are presented together have a great
impact on language learning (Ellis, 2012). Many of those multi-word clusters, words that
frequently occur together in a specific corpus, also known as formulaic language, are constantly
being scrutinized by researchers. From the decade-old usage-based model (Ellis, 2012), which
advocates the importance of frequency, context and recency³ of formulaic language in enabling
learners to transform it into intake, to the more recent Formulaicity Principle⁴ (O’Keeffe; Mark,
2023), it seems that language noticing, frequency of occurrence and recurrence of exposure have
pivotal roles in the L2 learning literature.
In summary, the combination of explicit instruction (Ellis, 2002) with recurrent exposure to
the target language (TL) (Gabrielatos, 2005) can promote noticing (Schmidt, 1990) of words and
multi-word sequences (Cortes, 2004). These can be made salient through learners’ manipulation
of concordance lines (Johns, 1990) in activities which can foster language retention. Concordance
lines are micro contexts of specific keywords yielded by concordancer tools, explained in detail
in Section 6.

² http://corpora.lancs.ac.uk/lancsbox/download.php. However, there is a new version of the software at
https://lancsbox.lancs.ac.uk/. Accessed on February 20, 2023.
3. Digital literacy for teachers and learners
Since 2018, the guidelines in the Base Nacional Comum Curricular (BNCC)⁵ have stated
that teachers need to be skilled and equipped to help young learners acquire and develop, among
other competencies, digital competency in Fundamental I:

5. Understand, use and create digital information and communication technologies in a
critical, meaningful, reflective and ethical way, in the various social practices (including
school ones), to communicate, access and share information, create knowledge, solve problems and
exercise agency and authorship in personal and collective life (BRASIL, 2018: 9)⁶
³ Recency claims that the more recently we experience a construction, the stronger our memory of it is (O’Keeffe;
Mark, 2023, adapted from Ellis, 2012).
⁴ Formulaicity develops across levels of language proficiency and is a marker of an advanced learner (O’Keeffe; Mark,
2023).
⁵ Brazilian National Syllabus Core.
⁶ Authors’ translation of: “5. Compreender, utilizar e criar tecnologias digitais de informação e comunicação de forma
crítica, significativa, reflexiva e ética nas diversas práticas sociais (incluindo as escolares) para se comunicar, acessar
e disseminar informações, produzir conhecimentos, resolver problemas e exercer protagonismo e autoria na vida
pessoal e coletiva” (BRASIL, 2018: 9).
Therefore, it is of paramount importance to consider, first, the extent to which
teachers have access to the hardware and, second, whether they have the technical skills to
navigate the online medium. They also need to learn to curate the appropriate information before
they can implement innovations in the classroom. According to Meunier (2020), teachers
themselves may still not always see the added value of integrating such innovations into their
lessons. More recently, Crosthwaite (2022) corroborated this by positing that many teachers of
young learners still lack both the technical and pedagogical knowledge to integrate Computer
Assisted Language Learning (CALL)⁷ applications into teaching practice.
The scenario described above reflects just some of the difficulties teachers have to
overcome to transform their classrooms into 21st-century educational environments. First, they
need to learn to curate the information available online. Second, they need to learn to use the
digital tools to search for language patterns. Only then will they be ready to help learners
interpret the results and make effective use of the findings in the classroom. That is, when
working with concordancing software and data analysis, teachers need to realize and accept the
fact that their role changes fundamentally,

as s/he is no longer the sole source of knowledge about the target language, but rather a
facilitator of the learning process, helping the learners to interpret the data, and giving
them advice on how best to search the corpus and analyze their search results
(Chambers, 2019, p.354).
Learners also need to be digitally literate to look for information on the web. According to
the BNCC, they need to be exposed to the digital medium, learn to navigate safely and identify
suitable and trustworthy sites which suit their learning goals. They also need to learn to analyze
critically the information they receive or send. They need to be able to use web search engines
appropriately, to understand how to find what they need, and to understand and interpret the
output and its particular discursive functions in context (Hafner; Candlin, 2007). To empower
learners with digital skills, Redecker (2017, p.22) insists that learning resources and
activities, among other aims, should “open up learning to new, real-world contexts, which involve
learners themselves in hands-on activities, scientific investigation or complex problem solving,
or in other ways increase learners’ active involvement in complex subject matters”.
⁷ The term Computer Assisted Language Learning (CALL) was coined by Hardisty and Windeatt (1989).
4. Pedagogic corpora for young learners
To change the teaching approach and empower learners with hands-on activities, this
section outlines the compilation of pedagogic corpora which are intended for use in language
teaching in a future study and have, therefore, been designed with pedagogic purposes in mind
(Pérez-Paredes, 2020; Willis, 2011). The balanced corpora, COREL-GEO+SCI⁸, composed of oral and
written texts from textbooks and websites, represent the language variety in Geography and
Science relevant to the specific teaching context (Friginal, 2018) of elementary school grades.
Hence, in line with Willis’s (1998) definition of a pedagogic corpus as a body of texts to
be used in the classroom to support teaching (texts from the learners’ coursebooks), together
with any additional texts that the teacher may bring into the classroom (Gilquin; Granger, 2010),
the pedagogic corpora were compiled to suit the needs and interests of elementary young learners
aged 9 to 12.
Small corpora would seem to be useful both as instruments of language learning in their
own right and as a means of training learners to use corpora appropriately (Aston, 2007). Aston
also mentions other benefits of small corpora: they are easier to construct, to interpret and to
become familiar with, and they allow the language to be more fully analysable. While size is
usually an issue, it should be considered hand-in-hand with the appropriateness of corpus design.
In terms of suitability, it is often the design of a corpus, as opposed to its size, which is the
determining factor (O’Keeffe et al., 2007, p.4).
Regarding corpora contents, besides the intended audience of young learners in
elementary school, the authors considered their regular school syllabi, selecting subjects with an
overarching reach: Geography and Science. That is, the corpora should serve the pedagogic aims
of the project they had been designed for (Reppen, 2010; Xiao, 2010; Jablonkai, 2022). Topics
(Table 1) were selected based on how useful and essential they are for the learners’ school
grades and on their connection with the syllabi, so as to make the experience of working with
concordance lines not only useful but also meaningful and beneficial.
⁸ COREL-GEO+SCI - Corpora for Elementary levels, a conflation of two corpora: a Geography corpus and a Science
corpus.
Table 1. Themes and topics for the pedagogic corpora.  
Source: the authors.  
Texts from six different authentic sources within the A1-A2⁹ framework of language were
then collected: printed workbooks A¹⁰, printed workbooks B¹¹, web texts / video transcriptions and
articles from assorted sources on the web.
Table 2. Subsets of corpora information.  
Source: #LancsBox.  
The pedagogic corpus for this study, COREL-GEO+SCI, is a conflation of two corpora¹² with
a total of 895 texts, balanced between 437 texts in Science (COREL-SCI) and 458 texts in
Geography (COREL-GEO) and separated by topic / theme (Table 2). The texts can also be identified
by school grade if necessary. The corpus contents also make it possible to expose learners to
authentic language and to present them with a large number of instances of a particular
linguistic item to work with all at once (Cobb, 1997).

⁹ According to the CEFR (Common European Framework of Reference for Languages).
¹⁰ 180 Days of Science and Geography, Shell Education, K to 5th grade, 2014.
¹¹ DK WORKBOOKS, Penguin Random House, Pre-K to 4th grade, 2016.
¹² The GEO and SCI corpora can be split in two and teachers would use them according to the syllabi.
5. The concordancer #LancsBox 6.0  
To access the corpora data, it is paramount that teachers have the digital skills to work
with a concordancer. Concordancers are common corpus analysis tools that search texts for a word
or phrase provided by the user and either display the hits in contexts called concordance lines
or rank them according to their frequency in that corpus. This section explains how the
concordancer was chosen and what its tools can accomplish at the user’s basic level of
understanding. The literature suggests that if digital tools are (1) hard to use or (2) perceived
to be hard to use, then widespread adoption of the tools is not likely (Hendry; Sheepy, 2022).
These authors mention the importance of the multidimensional construct of usability in
identifying and selecting the most appropriate concordancer to use in the classroom.
According to Hendry and Sheepy (2022), in a recent study comparing concordancing
software, #LancsBox, a freely available online concordancer with a unique graphical interface,
was found by some users to be the easiest to use. The 6.0 version has a straightforward interface
and accompanying tutorials and is very suitable as a first step for those teachers trying to get
acquainted with current technological tools. Those factors prompted us to choose it, also taking
into account the aims of its creators (Brezina; Gablasova, 2018). At the time, they declared
that they were interested in improving learner vocabulary instruction through corpus analysis,
mainly keyword and collocation analysis.
Once the software has been chosen, the first requirement for teachers and learners to
output data is to know its tools. The software tutorials are short, the explanations are
objective, and the repetition of procedures, associated with a little curiosity about
experimenting with other tools, may help teachers overcome any initial barriers. The software
already has many corpora embedded in its system, such as American English, British English, BNC,
Brown, LOB, English Literature, etc. Additionally, one can upload one’s own specialized corpus
and use it, as with the pedagogic corpus mentioned here and in the next sections. By using a
concordancer with a readily understood interface that meets the criterion of being user-friendly,
our main aim was to address the ‘user-friendliness’ aspect mentioned by Frankenberg-Garcia (2012)
and to show teachers a way to start developing their digital literacy.
6. Pedagogic corpora and the concordancer analysis tools
Hendry and Sheepy (2022, p.439) put forward the idea that “learners can use corpus
analysis tools to support vocabulary acquisition (1) as a reference to identify important words to
study, (2) as a reference to check for patterns in typical usage in authentic texts”, as well as
to support language improvement and the development of autonomous work. Table 3 illustrates the
interface of the software #LancsBox 6.0, showing its basic tools in the black bar at the top. The
tools were used to extract, first, lists of the most frequent words and, then, content words as
well as multi-word 3-gram clusters from COREL-GEO (Corpus for Elementary Levels on Geography) and
COREL-SCI (Corpus for Elementary Levels on Science), as described in Sections 6.1 to 6.4.
The next step was to take the most frequent items in the lists and select some of them as
KWICs (key words in context) to obtain the concordance lines, i.e., their micro contexts. Findings
in concordance lines (KWIC tool) and collocation visualization tools (GraphColl tool) can help
learners recognize and remember collocations (Hendry; Sheepy, 2022). This condensed
exposure (Gabrielatos, 2005) can contribute to heightened awareness of language patterns,
vocabulary expansion and retention (Granger, 1998).
Table 3. #LancsBox 6.0 interface.  
Source: #LancsBox 6.0.  
6.1 Words tool
First, the ‘Words’ tool (in the black bar in Table 3) was used to generate the lists of the  
most frequent words in the corpus COREL-GEO (Table 4) and COREL-SCI (Table 5).  
Table 4. Most frequent words in COREL-GEO.  
Source: #LancsBox 6.0.  
Table 5. Most frequent words in COREL-SCI.  
Source: #LancsBox 6.0.  
At a glance, it is possible to see that the results are somewhat similar, as the most frequent
words are function words. The content words, though, which constitute the core vocabulary of
each subject, come up somewhat differently (e.g., people, plants). Therefore, it is necessary to
make some changes in the header to obtain the lists of the most frequent nouns and verbs and thus
reveal the differences in content between the corpora. The same procedure can be used for
adverbs, adjectives and other word classes of interest. First, left-click ‘Type’ in the blue
header at the top (Table 5) and change Type to Lemma by clicking the arrow. After that,
right-click on the black bar, next to the word Type, and a pop-up window will open. Enter *_v,
*_n, *_adj or *_adv to obtain the most frequent words of each word class, one at a time.
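The wildcard filters described above can be mimicked outside the software. The following is a minimal Python sketch, not the #LancsBox implementation: it assumes a tiny, hypothetical mini-corpus whose tokens have already been lemmatized and hand-tagged with suffixes (_n, _v, and so on) that merely imitate the search patterns, and `frequency_list` is an illustrative name.

```python
from collections import Counter
from fnmatch import fnmatch

# Hypothetical mini-corpus, lemmatized and tagged by hand as lemma_class.
# The suffixes only imitate the *_n / *_v search patterns; they are an assumption.
tagged_tokens = (
    "the_det plant_n need_v water_n and_conj the_det plant_n grow_v "
    "in_prep the_det forest_n people_n live_v in_prep the_det city_n"
).split()

def frequency_list(tokens, pattern="*"):
    """Count the tokens that match a wildcard pattern, most frequent first."""
    counts = Counter(tok for tok in tokens if fnmatch(tok, pattern))
    return counts.most_common()

all_words = frequency_list(tagged_tokens)      # function words dominate
nouns = frequency_list(tagged_tokens, "*_n")   # content words: nouns only
verbs = frequency_list(tagged_tokens, "*_v")   # content words: verbs only

print(all_words[:3])
print(nouns)
print(verbs)
```

As in the tables, the unfiltered list is headed by a function word, while the filtered lists bring the subject vocabulary to the top.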
The resulting lists (Tables 6 to 9) show nouns and verbs, which are KWIC options to be
selected by the teachers according to the relevance of the vocabulary in the syllabus and the
connection with the lessons’ contents. The data should be mediated by the teacher (McCarthy,
2004), so that learners can read and handle the concordance lines without difficulty to discover
language patterns (Johns, 1990; Schmidt, 1990) and notice their meaning (Rutherford, 1987).
Table 6. Most frequent nouns in COREL-GEO.  
Source: #LancsBox 6.0.  
Table 7. Most frequent verbs in COREL-GEO.  
Source: #LancsBox 6.0.  
Table 8. Most frequent nouns in COREL-SCI.  
Source: #LancsBox 6.0.  
Table 9. Most frequent verbs in COREL-SCI.  
Source: #LancsBox 6.0.  
6.2 KWIC tool
The concordance lines are produced using the KWIC tool, the easiest of all the tools to
handle. Once teachers have selected the content word to be dealt with in the classroom, they
left-click the KWIC tool on the header and type the word in the slot on the left, next to Search.
On clicking Search, the lines are yielded instantly, showing the KWIC in red at the center.
Screenshots of two samples, ‘Earth’ and ‘live’, are shown below in Tables 10 and 11. The
concordance lines can be handled in assorted ways in the classroom. They suit learners mainly
from 3rd and 4th grades onwards, when most of them are probably already familiar with digital
gadgets like tablets and phones. The teacher has at her disposal an array of ways to explore the
KWIC tool for pedagogical purposes. For example, she can enlarge and print the list, cut out the
lines and distribute them to the learners, so that they can work out a definition of the planet
Earth in small groups; or individual learners can look for other words they already know and
create new sentences related to planet Earth. Learners can also look for adjectives on the left
side which qualify Earth, and extract information for a description of Earth, names of planets,
and so on. The choice will depend on the learners’ grades and their syllabus. To round off the
work, the teacher can show some previously selected concordance lines which have meaningful
information for the group.
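For readers curious about what a KWIC search does behind the interface, the sketch below builds concordance lines in plain Python. It is only an illustration under stated assumptions: the two sample sentences are invented rather than taken from COREL-GEO, the function name `kwic` and the `width` parameter are hypothetical, and #LancsBox itself may tokenize and align text differently.

```python
import re

def kwic(texts, keyword, width=30):
    """Return one concordance line per hit, with the keyword aligned in a column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            # Micro context: up to `width` characters on each side of the keyword
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so that every keyword lines up
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample sentences (not from the corpus)
sample_texts = [
    "The Earth is the third planet from the Sun.",
    "Most of the water on Earth is in the oceans.",
]

concordance = kwic(sample_texts, "Earth")
for line in concordance:
    print(line)
```

Printed on paper, lines like these can be cut out and handed to learners, much as in the classroom activities described above.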
Table 10. Screenshot of KWIC Earth from COREL-GEO.  
Source: #LancsBox 6.0.  
Table 11. Screenshot of KWIC live from COREL-GEO.  
Source: #LancsBox 6.0.  
Concordance line #19 in Table 11 above has the word ‘capybaras’, an animal well known to
Brazilians, which could be explored in a class about animals and their behavior. The most
meaningful advantage of using concordance lines from a pedagogic corpus is that the language is
not only authentic but also at the appropriate level for learners. The concordance lines in
Table 12 below have information about the Amazon biome and can be used to prompt a discussion
about the Amazon region with 3rd and 4th graders, for example. The learners can look for the
meaning of ‘biome’, a word very similar to its Portuguese equivalent, and guess a definition.
Other words scattered in the lines can also be made salient (Rutherford, 1987), with students
looking for their definitions. Alternatively, they can speculate about the different animals
which live in different biomes.
Table 12. Excerpt with the top five concordance lines about the Amazon biome.  
Source: #LancsBox 6.0.  
6.3 GraphColl tool
A third tool, GraphColl, usually catches the attention of #LancsBox users. It enables them to
visualize the collocates of the chosen node and the strength of each association, indicated by
the collocates’ positions in the graph. The stronger the link, the closer the collocate is to the
node in the graph.
One example is the graph of ‘live’ below (Table 13). The visualization helps the user to
identify the closest collocates (e.g. ‘animals live’, ‘live in forests’), which are listed in
Table 14 below. The table also displays the value of the selected association measure in the
Stats column, while Freq (coll) displays the frequency of the collocation (combination of node +
collocate) and Freq (corpus) the frequency of the collocate anywhere in the corpus.
Table 13. GraphColl of ‘live’ with the 64 strongest collocations in the corpus COREL-GEO.
Source: #LancsBox 6.0.  
The node ‘live’ was selected and the Span was changed to 3<>3 (in the header of Table 14), i.e.,
three words to each side of the node, so that only the strongest collocations with it are shown.
R (right) and L (left) indicate the position in relation to the node in the concordance lines.
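The columns of a collocation table can be approximated with a few lines of Python. The sketch below is a simplified illustration, not the procedure #LancsBox follows: it assumes an invented token list, uses a textbook mutual-information score (log2 of observed over expected co-occurrence) as a stand-in for whichever association measure is selected in the software, and the name `collocates` is hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, span=3):
    """List (collocate, score, freq_coll, freq_corpus) for words near the node."""
    corpus_freq = Counter(tokens)
    coll_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Up to `span` words to the left (L) and to the right (R) of the node
            coll_freq.update(tokens[max(0, i - span):i])
            coll_freq.update(tokens[i + 1:i + 1 + span])
    total = len(tokens)
    rows = []
    for word, f_coll in coll_freq.items():
        # Textbook mutual information: log2(observed / expected co-occurrence)
        expected = corpus_freq[node] * corpus_freq[word] / total
        score = math.log2(f_coll / expected)
        rows.append((word, round(score, 2), f_coll, corpus_freq[word]))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented token list (not from COREL-GEO)
tokens = "animals live in forests many people live in cities fish live in water".split()
table = collocates(tokens, "live", span=3)
for row in table:
    print(row)
```

In such a short text the windows around neighbouring nodes overlap, so a collocate can be counted more than once; larger corpora make this effect less visible.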
Table 14. The 20 strongest collocations of ‘live’ in the corpus COREL-GEO.  
Source: #LancsBox 6.0.  
6.4 Ngrams tool
A fourth tool used was Ngrams, which generates word clusters (Table 15). In this example,
we chose to have the program generate 3-gram sequences. Since an n-gram is a contiguous
sequence of n items from a text or a corpus, some n-grams, though frequent, may not be
pedagogically relevant. Even if the clusters deemed most useful for the learners are not the
most frequent, teachers should choose at their discretion those most meaningful to their class.
In Table 15, the two most frequent n-grams, ‘answer the questions’ and ‘read the text’, are used
for instructions; they are quite often used in class and may not need to be highlighted.
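The definition of an n-gram as a contiguous sequence of n items translates directly into code. The following minimal Python sketch counts 3-grams in an invented token list; the sample sentences and the name `ngrams` are assumptions for illustration, and the counts say nothing about the actual corpora.

```python
from collections import Counter

def ngrams(tokens, n=3):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted copies of the token list yields each window of n items
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Invented sample, mixing instruction language with subject language
tokens = (
    "read the text and answer the questions . "
    "rocks are made up of minerals . "
    "read the text and answer the questions . "
    "water is made up of tiny particles ."
).split()

trigram_counts = ngrams(tokens, 3)
for gram, count in trigram_counts.most_common(5):
    print(count, gram)
```

The most frequent clusters in this toy example include instruction phrases, which illustrates why frequency alone cannot decide what is worth teaching and why the teacher’s selection remains necessary.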
Table 15. Screenshot of the most frequent 3-grams multi-word clusters.  
Source: COREL-GEO + COREL-SCI.  
Other n-grams, such as ‘the text and’ and ‘part of the’, are phrase fragments that would
not be relevant to teach. However, the prepositional phrases ‘in the world’ and ‘of the world’,
as well as the verb phrase ‘made up of’ (Table 16), would be useful for young learners. Groups of
words which contain prepositions are usually those that present difficulties later, when
learners are speaking. If learners are exposed to them in context, the prepositions can be
internalized appropriately from the start.
Table 16. 3-gram ‘made up of’ in the corpus COREL-GEO + COREL-SCI.
Source: COREL-GEO+SCI.
Young learners will certainly benefit from handling the lines in Table 16, distributed in small batches to
each one, to identify what comes after ‘made up of’, for example. Is it always a noun? Can they
categorize what kind of noun it is? Once they identify the noun, learners can create lists under
categories such as abstract, concrete, related to people, related to things, and so on. Once done,
they can create sentences using items in the classroom, in their backpacks, or even in their
homes.
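A teacher preparing this activity might first want a quick overview of what follows ‘made up of’ before handing out the lines. The Python sketch below is one possible way to do this, under stated assumptions: the function name is hypothetical, the sample lines are invented rather than taken from the corpus, and the pattern simply skips an article (a, an, the) before capturing the next word. It does not decide whether that word is a noun; that judgement is left to the learners.

```python
import re
from collections import Counter


def words_after(phrase, lines):
    # Capture the first word after the phrase, skipping one optional article.
    pattern = re.compile(
        re.escape(phrase) + r"\s+(?:(?:a|an|the)\s+)?([a-z]+)",
        re.IGNORECASE,
    )
    found = Counter()
    for line in lines:
        for match in pattern.finditer(line):
            found[match.group(1).lower()] += 1
    return found


# Invented lines (assumption: stand-ins for the lines shown in Table 16).
sample_lines = [
    "Water is made up of tiny drops.",
    "The solar system is made up of the Sun and the planets.",
    "Rocks are made up of minerals.",
    "Soil is made up of minerals and dead plants.",
]

followers = words_after("made up of", sample_lines)
for word, freq in followers.most_common():
    print(freq, word)
```

In this toy sample the phrase is followed once by an adjective (‘tiny’), which is the kind of observation learners can be guided to make when they ask whether a noun always comes next.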
7. Comparison between the BNC2014-baby corpus and COREL-GEO+SCI
Many researchers acknowledge that adult corpora are too difficult for young learners to use
(Anthony, 2007). In the author’s view, most of their contents are above the language level of
young learners and also cover a more diverse range of topics. Queries would therefore probably yield
results that are unrelated to learners’ questions. Table 17 below shows the difference between
the contents of BNC2014-baby, one of the smallest adult corpora already embedded in #LancsBox 6.0,
and the pedagogic corpus COREL-GEO+SCI. Even though BNC2014-baby has fewer files, its mean
number of words per file indicates that its texts are longer and probably more complex.
Table 17. Comparison between two corpora.  
Source: #LancsBox 6.0.  
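The kind of summary reported in Table 17 (number of files, number of words, mean words per file) can be approximated with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: the function name and the two toy corpora are invented, and words are counted by splitting on whitespace, which will not match the token counts #LancsBox reports.

```python
def corpus_summary(files):
    """Return file count, word count and mean words per file.

    `files` maps a file name to its raw text (assumption: plain text,
    words separated by whitespace).
    """
    lengths = [len(text.split()) for text in files.values()]
    if not lengths:
        raise ValueError("the corpus contains no files")
    total = sum(lengths)
    return {
        "files": len(lengths),
        "words": total,
        "mean_words_per_file": total / len(lengths),
    }


# Invented toy corpora: one long 'general' text versus three short
# 'pedagogic' texts, mirroring the contrast discussed above.
general_corpus = {
    "talk_01.txt": "well I think that it was the one that we saw "
                   "on the way to the station",
}
pedagogic_corpus = {
    "geo_01.txt": "Rivers flow to the sea.",
    "sci_01.txt": "Plants need water and light.",
    "sci_02.txt": "Animals eat plants.",
}

general = corpus_summary(general_corpus)
pedagogic = corpus_summary(pedagogic_corpus)
print(general)
print(pedagogic)
```

A corpus with fewer files but a higher mean number of words per file, as in this toy contrast, contains longer texts on average, which is the pattern the comparison above points to.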
In Table 18, the difference between the lists of most frequent words generated from the two corpora
is clear. Although function words (the, of, to, etc.) appear in both lists,
other words differ. This is an indication of the wider scope of topics in BNC2014-baby, which yields
a general frequency list with no content words (nouns, verbs or adjectives)
among the 23 most frequent words in the corpus. On the other hand, the COREL-GEO+SCI
frequency list includes nouns such as ‘animals’ and ‘plants’.
Table 18. Comparison of most frequent words in both corpora.
Source: #LancsBox 6.0.  
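The contrast between the two frequency lists can also be explored outside the concordancer. The following Python sketch builds a top-n word list for two texts and checks which of the top words are content words. It rests on several assumptions: the function names are hypothetical, the two sample texts are invented, and the small set of function words is hand-made for this example rather than a standard stop list.

```python
import re
from collections import Counter

# Hand-made list for this illustration only (assumption), not a standard list.
FUNCTION_WORDS = {
    "the", "of", "to", "and", "a", "in", "is", "it",
    "that", "on", "are", "for", "you", "i",
}


def top_words(text, n):
    # Return the n most frequent lowercase word forms in the text.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def content_words(words):
    # Keep only the words that are not in the function-word list.
    return [word for word in words if word not in FUNCTION_WORDS]


# Invented sample texts standing in for a general and a pedagogic corpus.
general_text = "I think that the film is on and you know it is the one that I like"
pedagogic_text = (
    "Plants need water and the sun. Animals eat plants. "
    "Animals and plants live in the forest."
)

top_general = top_words(general_text, 4)
top_pedagogic = top_words(pedagogic_text, 4)

print("general:", top_general, "->", content_words(top_general))
print("pedagogic:", top_pedagogic, "->", content_words(top_pedagogic))
print("shared:", set(top_general) & set(top_pedagogic))
```

Even in this tiny sample, the top of the general list is made up of function words only, while the subject-specific list brings nouns such as ‘plants’ and ‘animals’ to the surface.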
In Table 19, the difference between the two lists of most common nouns is considerable, as
COREL-GEO+SCI focuses specifically on Geography and Science lexis for elementary grades,
making this corpus much more appropriate and meaningful to the young learners’ goals in the
classroom. The situation changes when we analyze the most common verbs in both corpora, as
many verbs appear in both lists (Table 20). Once teachers are more skilled with the tools,
they can compare lists from different corpora to enhance the contents of their lessons.
After all, learners will probably be exposed to all of them in their general English classes.
Table 19. Screenshot of most frequent nouns in both corpora.  
Source: #LancsBox 6.0.  
Table 20. Screenshot of most frequent verbs in both corpora.  
Source: #LancsBox 6.0.  
Remarks and conclusion  
This paper has sought to raise readers’ awareness of the meaningful use of pedagogic
corpora and #LancsBox, a freely available concordancer, to expose young learners to authentic
subject-informed English in the classroom. It described the steps for getting started with
user-friendly concordancer tools to access language from the corpora, exposing learners to one of the
21st century digital media that address language learning challenges.
It also showed the relevance of creating a pedagogic corpus targeted at a specific  
discourse community of young learners. The language output presented in Section 6 should  
encourage teachers to work with authentic level-appropriate subject-specific subsets of corpora  
that can be used in materials design (McCarthy, 2004). To succeed, Jablonkai (2022, p.474)  
recommends the involvement of “subject-specialist informants in the corpus building process
especially for future pedagogically motivated specialized corpora” to inform teaching on a wider  
scale.  
Until then, this paper has been an attempt to motivate teachers to
start getting acquainted with the web environment in their teaching, just as they already use
mobile devices for different purposes in their daily lives. The digital tools and websites available can make
education more inclusive by reaching much broader student populations, helping
students with learning difficulties, etc. (Meunier, 2020). Moreover, with content of various subjects
being taught in English to young learners, the use of pedagogic corpora with the support of
specific software, such as #LancsBox, can speed up both content and language learning.
In times of great cultural awareness and the need for social equity and inclusion in regular
schools, English and digital literacy can be key elements in reducing the educational
differences our youngsters experience during the primary school years. This
empowerment in the use of digital technologies can “enhance accessibility and inclusion,
differentiation and personalisation, and learners’ active engagement” (Meunier, 2022, p.350). A
better prepared and skilled young adult will certainly stand a better chance to succeed and be
prepared to seek further opportunities in the future.
References  
LancsBox - Version 6.0. Lancaster University corpus toolbox. Available online at #LancsBox:  
Lancaster University corpus toolbox. Last accessed: Jan 31st 2023.
CHUJO, Kiyomi; ANTHONY, Lawrence; OGHIGIAN, Kathryn. DDL for the EFL classroom - Effective  
uses of a Japanese-English parallel corpus and the development of a learner-friendly, online  
parallel concordancer. Tokyo: Waseda University, 2009.  
ASTON, Guy. Small and large corpora in language learning. In: LEWANDOWSKA-TOMASZCZYK,  
Barbara & MELIA, Patrick (Orgs.) PALC ’97 Proceedings of the first annual conference. Łodz:  
Łodz University Press, 1997. p. 51-62.  
BASE NACIONAL COMUM CURRICULAR (BNCC). Ministério da Educação e Cultura. Available  
online at http://basenacionalcomum.mec.gov.br. Last accessed: Feb 23rd 2023.
BREZINA, Vaclav; GABLASOVA, Dana. #LancsBox. Lancaster: Lancaster University, 2018.  
CHAMBERS, Angela. Towards the corpus revolution? Bridging the research practice gap.  
Language Teaching, Cambridge, v.52, n.04, p. 460-475, 2019.  
CHAMBERS, Angela. What is data-driven learning? In: O’KEEFFE, Anne & MCCARTHY, Michael
(Eds.) The Routledge handbook of Corpus Linguistics. London: Routledge, 2010. p. 345-358.
COBB, Thomas. Is there any measurable learning from hands-on concordancing? System,
Elsevier, v.25, n.3, p. 301-315, 1997.
CORTES, Viviana. Lexical bundles in published and student disciplinary writing: examples from
History and Biology. English for Specific Purposes, v.23, n.04, p. 397-423, 2004.
CROSTHWAITE, Peter; STELL, Annita. It helps me get ideas on how to use my words - Primary  
school students’ initial reactions to corpus use in a private tutoring setting. In: CROSTHWAITE,  
Peter (Ed.) Data-Driven Learning for the Next Generation - Corpora and DDL for Pre-tertiary  
Learners. London: Routledge, 2020. Kindle Edition, Kindle Locations: 3837-3838.  
ELLIS, Nick. Formulaic Language and Second Language Acquisition: Zipf and the Phrasal Teddy
Bear. Annual Review of Applied Linguistics, Cambridge, v. 32, p. 17-44, 2012.
ELLIS, Nick. Frequency effects in language processing: a review with implications for theories of
implicit and explicit language acquisition. Studies in Second Language Acquisition, v.24, p. 143-188, 2002.
FRANKENBERG-GARCIA, Ana. Raising teachers’ awareness of corpora. Language Teaching, v.45,
n.04, p. 475-489, 2012.  
FRIGINAL, Eric. Corpus linguistics for English teachers: New tools, online resources, and  
classroom activities. London: Routledge, 2018.  
GABRIELATOS, Costas. Corpora and Language Teaching: Just a fling or wedding bells? TESL-EJ,
v.8, n.04, p. 1-37, 2005.
GRANGER, Sylviane. The computer learner corpus: a versatile new source of data for SLA  
research. In: GRANGER, Sylviane (Org.) Learner English on Computer. London: Addison Wesley  
Longman, 1998. p. 3-18.  
GILQUIN, Gaëtanelle; GRANGER, Sylviane. How can data-driven learning be used in language
teaching? In: O’KEEFFE, Anne & MCCARTHY, Michael (Orgs.) The Routledge Handbook of
Corpus Linguistics. London: Routledge, 2010. p. 359-371.  
HAFNER, Christoph; CANDLIN, Christopher. Corpus tools as an affordance to learning in
professional legal education. Journal of English for Academic Purposes, v.6, n.04, p. 303-318,
2007. Available online at https://doi.org/10.1016/j.jeap.2007.09.005. Last accessed: Feb 23rd 2023.
HENDRY, Clinton; SHEEPY, Emily. Evaluating corpus analysis tools for the classroom. In:
JABLONKAI, Reka & CSOMAY, Eniko (Orgs.) The Routledge Handbook of Corpora and English
Language Teaching and Learning. London: Routledge, 2022. p. 437-459.
JABLONKAI, Reka. Building Corpora for ELT. In: JABLONKAI, Reka & CSOMAY, Eniko (Orgs.) The  
Routledge Handbook of Corpora and English Language Teaching and Learning. London:  
Routledge, 2022. p. 460-477.  
JOHNS, Tim. Should you be persuaded: Two examples of data driven learning. In: JOHNS, Tim &
KING, Philip (Orgs.) Classroom concordancing. ELR Journal, Birmingham, v.4, p. 1-16, 1991.
JOHNS, Tim. From printout to handout: Grammar and vocabulary teaching in the context of data
driven learning. CALL Austria, v.10, p. 14-34, 1990.
MARINOVA-TODD, Stefka; MARSHALL, Bradford; SNOW, Catherine. Three Misconceptions about
Age and L2 Learning. TESOL Quarterly, v. 34, n.01, p. 9-34, 2000.
MCCARTEN, Jeanne. Teaching vocabulary. Lessons from the Corpus, Lessons for the Classroom.  
Cambridge: Cambridge University Press, 2007.  
MCCARTHY, Michael. Touchstone: from Corpus to Coursebook. Cambridge: Cambridge  
University Press, 2004.  
MEUNIER, Fanny. Revamping DDL: Affordances of Digital Technology. In: JABLONKAI, Reka &  
CSOMAY, Eniko (Orgs.) The Routledge Handbook of Corpora and English Language Teaching  
and Learning. London: Routledge, 2022. p. 344-360.  
MEUNIER, Fanny. A case for constructive alignment in DDL - Rethinking outcomes, practices,  
and assessment in (data-driven) language learning. In: CROSTHWAITE, Peter (Org.) Data-Driven  
Learning for the Next Generation. New York: Routledge, 2020. Kindle Edition, p. 757-759.  
NATION, Paul. What matters in vocabulary learning? LALS, 2020, Victoria University of  
Wellington, New Zealand. Webinar.  
O’KEEFFE, Anne; MARK, Geraldine. Principled pattern curation to guide data-driven learning.
2023.
O’KEEFFE, Anne; MCCARTHY, Michael; CARTER, Ronald. From corpus to classroom: language
use and language teaching. Cambridge: Cambridge University Press, 2007.  
PÉREZ-PAREDES, Pascual. The pedagogic advantage of teenage corpora for secondary school
learners. In: CROSTHWAITE, Peter (Org.) Data driven learning for the next generation: Corpora
and DDL for pre-tertiary learners. London: Routledge, 2020. p. 67-87.
REDECKER, Christine. European framework for the digital competence of educators:  
DigCompEdu. In: PUNIE, Yves (Org.) EUR 28775 EN. Publications Office of the European Union,  
JRC107466, 2017. Available online at https://doi.org/10.2760/159770. Last accessed: Jan 31st
2023.
REPPEN, Randi. Building a Corpus – What are the key considerations? In: O’KEEFFE, Anne &
MCCARTHY, Michael (Org.) Routledge Handbook of Corpus Linguistics. London: Routledge,
2010. p. 31-37.
RUTHERFORD, William. Second language grammar: Learning and teaching. London: Longman,
1987.
SCHMIDT, Richard. The role of consciousness in second language learning. Applied Linguistics,
v.11, n.02, p. 129-158, 1990.
SINCLAIR, John. Corpus, Concordance, Collocation. Oxford: Oxford University Press, 1991.  
XIAO, Richard. Corpus creation. In: INDURKHYA, Nitin & DAMERAU, Fred (Orgs.) The Handbook  
of Natural Language Processing. London: Taylor and Francis, 2010. p. 147-165.
WEBB, Stuart; CHANG, Anna. Vocabulary learning through assisted and unassisted repeated
reading. Canadian Modern Language Review, v.68, n.03, p. 124, 2012.  
WILLIS, Dave. The language syllabus: building language study into a task-based approach. In:  
WILLIS, Jane. Concordances in the classroom without a computer: Assembling and exploiting  
concordances of common words. In: TOMLINSON, Brian (Org.) Materials development in  
language teaching. Cambridge: Cambridge University Press, 1998. p. 51-77.