Samuli Kaislaniemi | University of Eastern Finland

Papers by Samuli Kaislaniemi

Patterns of Change in 18th-century English

First, the idea was to provide diachronic continuity to the first part of the CEEC (finished in 1998, called CEEC-1998), in order to follow through the linguistic changes that were still ongoing in 1681, where the CEEC-1998 terminates (Nevalainen & Raumolin-Brunberg 2003); and second, to track linguistic change in the long eighteenth century. The CEEC-1998 and CEECE allow tracking linguistic features over nearly four centuries from Late Middle English into Late Modern English (the entire CEEC family is consequently referred to as CEEC-400; see Kaislaniemi 2006; Nevala & Nurmi 2013). The compilation of the CEECE followed the completion of the CEEC-1998, which has its roots in the compilation of the Early Modern sections of the Helsinki Corpus of English Texts (or Helsinki Corpus for short; HC). After the completion and release of the HC, the compilers of the Early Modern section, Terttu Nevalainen and Helena Raumolin-Brunberg, decided to investigate language change in Early Modern English in more detail. To this end, they would compile a separate corpus of Early Modern English; and because they wanted to test the applicability of modern sociolinguistic methods to historical materials, they decided to compile the corpus from personal correspondence. The project was initiated in 1993, and with the help of student assistants, the corpus was finished by 1998. The CEEC-1998 covers the period from about 1410 to 1681 and contains some 2.6 million words in nearly 6,000 letters from almost 800 informants. For copyright reasons (the CEEC corpora are compiled from published editions), the CEEC-1998 could not be released without acquiring permissions from the publishers. Unfortunately, not all publishers granted permission to publish their texts as part of the corpus, and the version released publicly in 2006, the Parsed Corpus of Early English Correspondence

Advances in Historical Sociolinguistics

Patterns of Change in 18th-century English

The Corpus of Early English Correspondence Extension Sampler part 1 (CEECES 1) is the first public release of the 18th-century part of the Corpora of Early English Correspondence (CEEC-400). The full CEECES will be released in 3 parts over the course of 2021. See the accompanying manual for more. See https://varieng.helsinki.fi/CoRD/corpora/CEEC/ for more on the CEEC-400. Citation: CEECES 1 = Corpus of Early English Correspondence Extension Sampler part 1. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Languages, University of Helsinki. XML conversion and encoding by Lassi Saario. ETA: Some of the hyperlinks in the manual don't work; use the link above instead.

The ERRATAS database is the primary output of the ERRATAS project, which surveyed all the sources of the Corpora of Early English Correspondence (CEEC-400) in order to investigate their editorial principles and practices. These sources are almost exclusively printed editions of English historical correspondence. The data in the ERRATAS database consists of surveys of: editorial principles (explicit statements or anything resembling such); editorial practices (features that can be found in the edited texts); and editorial work (evidence of contributors other than the stated editor(s)). The database is in the form of an MS Access database file, which includes separate data entry forms for editorial principles and practices (i.e. the texts). A manual is under compilation. Related material: ERRATAS checklist of editorial practices with audit word lists (working document); Corpus of Editorial Principles (in CEEC-400 Sources)

The Corpus of Editorial Principles (in CEEC-400 Sources), or CEP, contains scanned images of title pages and editorial principles (or any text interpretable as such) from the printed sources of the Corpora of Early English Correspondence (CEEC-400). CEP was built as part of the ERRATAS project, which surveyed all the sources of CEEC-400 in order to investigate their editorial practices. CEP consists of: a manual; a list of sources (bibliographical details); and PDFs of scanned and photographed images from printed editions, zipped in folders grouped by CEEC-400 subcorpus. As most of the sources of CEEC-400 are under copyright, this resource is only available for use upon request. Version history: 26.5.2020, Version 0.8: unfinished (due to coronavirus restrictions), data checked once. 2.11.2020, Version 1.0: data rechecked and corrected, images added and fixed, files renamed.

On December 13th 1937, the celebrated children's author Arthur Ransome wrote to J. R. R. Tolkien with a few comments on Tolkien's newly published book The Hobbit. Tolkien lost no time in replying, and his letter, held in the Brotherton Library of the University of Leeds, provides one of his earliest comments on his published fiction, and a relatively early explicit commentary on his mythic writing. This article publishes for the first time Tolkien's response to Ransome in its entirety, and answers some of the questions that arise regarding the chronology of Tolkien's correspondence. An analysis of the letter reveals that while, as many scholars have shown, the 'sources' and 'inspirations' of The Hobbit include the likes of Beowulf and the Poetic Edda, already in 1937 (and contrary to his own later claims) Tolkien's principal primary source for fleshing out his prose stories with characters, places, and references to historical events was the vast legendarium he had created himself.

This is a list of the sources used in the compilation of the Corpora of Early English Correspondence (CEEC-400). Details to be updated shortly.

Merchants of Innovation, 2017

Research in Corpus Linguistics, 2021

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa ...
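The abstract above mentions calculating tagging accuracy in a way that acknowledges tokenisation errors. The following is a minimal sketch of one way such a measure could work, not the method used in the paper: it assumes a gold token counts as correct only if the tagger produced a token covering exactly the same characters and carrying the same tag. The function names, the span-alignment approach, and the example sentence are all hypothetical; the tags are meant only as illustrative C7-style labels.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word form, tag)


def spans(tokens: List[Token]) -> Dict[Tuple[int, int], str]:
    """Map each token's (start, end) character span to its tag.

    Offsets are counted over the concatenated word forms, ignoring
    whitespace, so two tokenisations of the same text are comparable.
    """
    out: Dict[Tuple[int, int], str] = {}
    pos = 0
    for form, tag in tokens:
        out[(pos, pos + len(form))] = tag
        pos += len(form)
    return out


def tagging_accuracy(gold: List[Token], system: List[Token]) -> float:
    """Share of gold tokens that are both segmented and tagged correctly."""
    if not gold:
        raise ValueError("gold standard must contain at least one token")
    gold_spans = spans(gold)
    system_spans = spans(system)
    # A gold token whose span is missing from the system output is a
    # tokenisation error and is counted as incorrect.
    correct = sum(
        1 for span, tag in gold_spans.items() if system_spans.get(span) == tag
    )
    return correct / len(gold_spans)


# Hypothetical example: the system fails to split "cannot" into two tokens.
gold = [("I", "PPIS1"), ("can", "VM"), ("not", "XX"), ("come", "VVI")]
system = [("I", "PPIS1"), ("cannot", "VM"), ("come", "VVI")]
accuracy = tagging_accuracy(gold, system)
```

Under this assumption, a single missed token boundary penalises every gold token it affects, so the score cannot be inflated by comparing only the tokens on which the two segmentations happen to agree.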

Verbal and Visual Communication in Early English Texts, 2017

Journal for Early Modern Cultural Studies, 2017

Abstract: This article looks at the multilingual environment of the English East India Company (EIC) trading post in Japan, 1613–23. Reconstructing the linguistic world of early EIC merchants in Southeast Asia may appear straightforward: the primary lingua francas across the Asian seaboard were Portuguese and Malay. But knowledge of these two languages alone was not enough to conduct business at trading posts in dozens of locations across a region where more than two thousand languages were spoken. To form a complete picture of the linguistic environment of EIC merchants in the East Indies, together with an understanding of the linguistic competence of the English merchants, direct references to language use must be supplemented with indirect evidence drawn from the linguistic record. EIC merchants' letters are full of loanwords and borrowed phrases from foreign languages. An analysis of these words and phrases using quantitative methods borrowed from corpus linguistics bolsters ...

This dissertation investigates the multilingual practices of 17th-century English East India Company merchants, as revealed by the vocabulary they used in the texts they produced while stationed in the East Indies. The English East India Company (EIC) was founded in 1600. At first it was a moderately successful trading company, but in the 18th and 19th centuries the EIC rose to dominate the European trade in Indian goods, and then in Chinese tea and porcelain. During this period, it also came to control large territories in India, paving the way for the British Empire to take over the entire subcontinent. Despite the immense economic, social and cultural impact of the EIC on world history, its records have not previously been studied for their language. This thesis breaks fresh ground in this respect, by investigating letters written by early EIC employees stationed at a trading post in Japan, 1613–1623. The five studies forming the nucleus of this dissertation focus on lexis in these letters. Through a study of foreign words (lexical borrowings from and code-switches into languages like Japanese, Spanish, Malay and Portuguese) it is shown that by charting the use of foreign words in correspondence, we can identify discourse communities of writers with shared practices. Moreover, foreign words can also be used to reveal the linguistic competence of the writers. Two of the studies use methods of historical lexicography and lexicology to look at native English vocabulary. They show that EIC records can be used to trace change over time in lexical fields, which in turn reveals that the EIC had direct influence on the development of the English lexicon. They also show that investigations of hapax legomena can yield insights into the intimate connections between early modern English merchants and contemporary literature on the one hand, and lexical ecologies of early dictionaries on the other. A central finding of this dissertation is that historical linguistics in general, and lexical studies in particular, not only can benefit from multidisciplinary methodologies, but should adopt them as a matter of course. This dissertation shows that a blend of quantitative and qualitative methods is not only convenient, but in fact necessary if we want to draw reliable conclusions about historical multilingualism.
Finnish title: "Kauppiaiden monikielisyyttä kartoittamassa: Sanastotutkimuksia Englannin Itä-Intian kauppakomppanian varhaisesta kirjeenvaihdosta" (Mapping merchants' multilingualism: lexical studies of the early correspondence of the English East India Company).

ICAME Journal

Research into orthography in the history of English is not a simple venture. The history of English spelling is primarily based on printed texts, which fail to capture the range of variation inherent in the language; many manuscript phenomena are simply not found in printed texts. Manuscript-based corpora would be the ideal research data, but as compiling them is resource-intensive, linguists use editions that have been produced by non-linguists. Many editions claim to retain original spellings, but in practice text is always normalized at the graph level and possibly more so. This does not preclude using such a corpus for orthographical research, but there has been no systematic way to determine the philological reliability of an edited text. In this paper we present a typological methodology we are developing for the evaluation of orthographical quality of edition-based corpora, with the aim of making the best use of bad data in the context of editions and manuscript practices. As a case st...

Pragmatics Beyond New Series, 2009

I would also like to thank the Centre for Editing Lives and Letters at Queen Mary, University of London, for access to the full-text version of EEBO, and Teo Juvonen and my anonymous reviewer for their invaluable comments and suggestions for this paper.
2. The "East Indies" had vague borders, but usually referred to all of South, Southeast and East Asia.
3. The India Office Records at the British Library contain more than 14 shelf kilometres of both published and unpublished works spanning the years 1600–1948.

Proceedings of the Digital Humanities in the Nordic Countries 5th Conference, 2020

Research in the digital humanities and computational social sciences requires overcoming complexity in research data, methodology, and research questions. In this article, we show through case studies of three different digital humanities and computational social science projects that these problems are prevalent, multiform, and laborious to counter. Yet, without facilities for acknowledging, detecting, handling and correcting for such bias, any results based on the material will be faulty. Therefore, we argue for the need for a wider recognition and acknowledgement of the problematic nature of many DH/CSS datasets, and correspondingly of the amount of work required to render such data usable for research. These arguments have implications both for evaluating feasibility and allocation of funding with respect to project proposals, and for assigning academic value and credit to the labour of cleaning up and documenting datasets of interest.

Pragmatics & Beyond New Series, 2009

Terminology and Lexicography Research and Practice, 2011
