Rui Sousa-Silva | Universidade do Porto

Talks by Rui Sousa-Silva

Forensic linguists, amongst others, have a strong interest in plagiarism detection (Angélil-Carter, 2000; Coulthard & Johnson, 2007; Hänlein, 1998; Lobo, 2003; Semple, Kenkre, & Achilles, 2004), but bilingual plagiarism has received relatively little research attention. The borderline of plagiarism depends on its definition and on the author’s intention as much as on the text genre: journalists' use of large amounts of text with little or no attribution, for instance, is not usually regarded as plagiarism (Coulthard & Johnson, 2007). However, although the conventions and regulations governing the use of newswire copy are not universal, agencies require that the source be credited, and forbid the use of ‘authored articles’.
Detecting verbatim copying of news agencies' words is straightforward. However, plagiarism detection requires more sophisticated techniques when news items are plagiarised in languages other than English (e.g. Portuguese), where journalists tend to translate the text intuitively into their mother tongue and make adjustments, while retaining a structure that is closer to the English counterpart than to the other news sections.
To investigate which mechanisms journalists use to write ‘their own’ texts from news agency texts (and how they use them), we selected news pieces from the ‘World’ section of Portuguese quality newspapers and compared them to possible English sources. To carry out a suitable contrastive analysis, we created a comparable/translation corpus ("LREC 2008 Workshop on Comparable Corpora," 2008; McEnery & Wilson, 1996) using the Corpógrafo, a web-based environment for the creation and analysis of personal corpora (Sarmento, Maia, & Santos, 2004).
We then investigated how journalists usually translate and how (and when) authorship attribution is made explicit, and asked how much unacknowledged journalistic text can be accepted before it is called plagiarism, challenged by the news agencies and taken to trial. The results obtained so far show that, even though quality papers may cite their sources (usually well-known international agencies), attribution is often inadequate, and there is no one-to-one match between the Portuguese and the English versions, i.e. the same piece of news often draws on different releases from the foreign press and websites. Applications of this investigation to more forensic contexts will be discussed.

References
Angélil-Carter, S. (2000). Stolen Language? Plagiarism in Writing. Harlow: Longman.
Coulthard, M., & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence. London and New York: Routledge.
Hänlein, H. (1998). Studies in Authorship Recognition - A Corpus-based Approach. Frankfurt: Peter Lang.
Lobo, R. A. (2003). Plagiarism Revisited. Journal of the Society for Gynecologic Investigation, 10, 389-389.
LREC 2008 Workshop on Comparable Corpora (2008). Retrieved 02/11/2008, from http://www.limsi.fr/~pz/lrec2008-comparable-corpora
McEnery, T., & Wilson, A. (1996). Corpus Linguistics: An Introduction (2nd ed.). Edinburgh: Edinburgh University Press.
Sarmento, L., Maia, B., & Santos, D. (2004). The Corpógrafo - a Web-based environment for corpora research.
Semple, M., Kenkre, J., & Achilles, J. (2004). Student fraud: The need for clear regulations for dismissal or transfer from healthcare training programmes for students who are not of good character. Nursing Times Research, 9(4), 272-280.

In recent years, several authors have proposed different theories and approaches to the study and analysis of discourse (Coulthard, 1977; Dijk, 1997; Fairclough & Wodak, 1997; Sinclair, 1991). While some of these theories analyse the interaction between discourse and society (Dijk, 1997; Fairclough & Wodak, 1997), others focus mainly on discourse as linguistic realisation (Coulthard, 1977; Sinclair, 1991) or even on the relationship between linguistics and the law, as a form of forensic linguistics (forensic discourse analysis) (Coulthard & Johnson, 2007). Authorship studies, from authorship disputes to authorship identification and attribution, belong in this context. Forensic linguistics, in its various applications (including authorship identification, mode identification, legal translation and interpreting, the transcription of statements and testimony, courtroom language and discourse, linguistic rights, statement analysis, forensic phonetics and textual status), has helped to determine the true author in cases of disputed authorship, to prove instances of plagiarism, and to uncover the authors of ransom notes or threats. The methodology used includes linguistic aspects such as average text or sentence length, forensic stylistics (McMenamin, 2002), the frequency of particular linguistic patterns, hapax legomena and hapax dislegomena, among others.
In this study we aim to establish whether lexical richness, one of the discourse markers that forensic linguistics considers valid for determining authorship in English (Tim Grant, cited in Coulthard & Johnson, 2007), also remains valid as a determinant of authorship in Portuguese. We therefore drew on corpus linguistics (Biber, Conrad, Reppen, & Aitchison, 2000; McEnery & Wilson, 2001) to build a corpus of about 100,000 words of texts written by two different columnists (António Barreto and José Pacheco Pereira) and published in the newspaper Público between January and December 2007. Using the “Corpógrafo” (Sarmento, Maia, & Santos, 2004), we study lexical density, (average) word length and the density of words used only once (hapax legomena) to determine the lexical richness of the texts produced by the two authors. To verify the results, we analyse two texts published in Público in 2008, written by the same two authors, whose authorship was concealed for the purposes of this study. After the analysis, the authors' identities will be revealed, and the texts will be compared against the conclusions drawn from the corpus study.
In conclusion, this model of analysis should allow us to confirm lexical richness as a valid criterion for authorship identification in Portuguese, as is the case in English. The results of the study will show that texts produced by different authors draw on distinct, idiosyncratic linguistic patterns; that is, each author has an idiolect of their own (Coulthard & Johnson, 2007), with distinct authorship markers that set that linguistic production apart from all others.
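The three measures named in this abstract (lexical density, average word length and the proportion of hapax legomena) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the procedure implemented in the Corpógrafo: the tokeniser, the tiny `FUNCTION_WORDS` stop list and the function names are all hypothetical, and lexical density is approximated here as the share of tokens that are not in that stop list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list of Portuguese function words.
# A real study would use a fuller list or a part-of-speech tagger.
FUNCTION_WORDS = {
    "a", "o", "as", "os", "de", "do", "da", "e", "que",
    "em", "um", "uma", "para", "com", "não",
}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters only, accents kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lexical_richness(text):
    """Return simple lexical richness measures for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    content_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    hapaxes = [word for word, count in counts.items() if count == 1]
    return {
        "tokens": len(tokens),
        # Share of tokens that are not function words (an approximation).
        "lexical_density": len(content_tokens) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Words occurring exactly once, relative to the number of tokens.
        "hapax_ratio": len(hapaxes) / len(tokens),
    }
```

Computing these values separately for each columnist's texts, and then for a text of concealed authorship, would give the kind of comparison the abstract describes; how the measures are weighted against one another is left open here.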

References
Biber, D., Conrad, S., Reppen, R., & Aitchison, J. (2000). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Coulthard, M. (1977). An Introduction to Discourse Analysis. London: Longman.
Coulthard, M., & Johnson, A. (2007). An Introduction to Forensic Linguistics: Language in Evidence. London and New York: Routledge.
Dijk, T. A. v. (1997). Discourse as Interaction in Society. In T. A. v. Dijk (Ed.), Discourse Studies: A Multidisciplinary Introduction - Discourse as Social Interaction (Vol. 2, pp. 1-37). London: SAGE Publications Ltd.
Fairclough, N., & Wodak, R. (1997). Critical Discourse Analysis. In T. A. v. Dijk (Ed.), Discourse Studies: A Multidisciplinary Introduction - Discourse as Social Interaction (Vol. 2, pp. 258-284). London: SAGE Publications Ltd.
McEnery, T., & Wilson, A. (2001). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press.
McMenamin, G. (2002). Forensic Linguistics: Advances in Forensic Stylistics. Boca Raton and New York: CRC Press.
Sarmento, L., Maia, B., & Santos, D. (2004). The Corpógrafo - a Web-based environment for corpora research.
Sinclair, J. M. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Authors and their texts are not fully independent from their socio-cultural contexts and the extra-textual reality. However, as authorship attribution studies (e.g. on the Bible and Shakespeare's texts) and forensic linguistics have been trying to demonstrate, each text has individual traits. Although it is yet to be proven that texts have a “personal fingerprint”, such traits help us distinguish between texts. To identify some of these traits and show how they are used differently across texts, we performed a systematic, contrastive analysis of naturally occurring texts written by two Portuguese authors (José Pacheco Pereira, right-wing; and António Barreto, left-wing), published over the last year in the Portuguese benchmark quality paper Público. Using corpus linguistics, we extracted the information via a web-based environment for the creation and analysis of personal corpora (the ‘Corpógrafo’), and obtained the ‘concordance’ and its context. We then analyzed the most recurrent n-grams or multi-word units to identify the particular semantic relations used most often by each writer. Starting from functional categories of meaning, we could determine the dominant semantic classes in the texts of each author, and hence identify some of their discourse conventions. It is not surprising, we concluded, that the choice of a specific function, like the choice of words, is essential to profile both the authors and their group ‘membership’.
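Extracting the most recurrent n-grams from a text can be sketched in a few lines of Python. The sketch below is a generic illustration and makes no claim about how the Corpógrafo computes its multi-word units; the function names and the simple letter-based tokeniser are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters only, accents kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-grams in text as (ngram, count) pairs."""
    tokens = tokenize(text)
    # Align n shifted copies of the token list to form consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)
```

Running `top_ngrams` over each author's texts and comparing the two ranked lists is one simple way to see which word combinations one writer favours over the other; the semantic classification described in the abstract is a manual step that this sketch does not attempt.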

Several theories have sought to analyse the discourse structure of various linguistic genres. Relevance Theory (Sperber & Wilson 1995), for example, proposes an analysis in which propositions are interrelated according to their relevance. Grosz & Sidner (1986), on the other hand, propose a method of analysis based on the author's intention. Polanyi (1988) proposes a theory of discourse structure analysis in which discourse is made up of discourse constituent units (DCUs). RST (“Rhetorical Structure Theory”) (Taboada & Mann 2006a), in turn, holds that a text has an associated rhetorical structure, made up of elementary propositions interrelated by rhetorical relations.
However, none of these theories has dealt specifically with the analysis of artistic discourse. In this study, which seeks to describe and represent the genre of the art text and to explain how it works and how its coherence is built, we propose a method of analysis that combines data from discourse analysis (Dijk 1985) with data from general rhetoric and rhetorical organisation (cf. Kittredge et al. 1991). We thus draw on corpus linguistics (Biber et al. 2000; McEnery & Wilson 2001) to create COLLAGE, a comparable corpus of texts published in contemporary art exhibition catalogues, in English and in Portuguese. Using the “Corpógrafo” (Sarmento et al. 2004), we obtained the key expressions in “concordance” and in context, and identified the statistically most representative units of analysis. We then classified them manually according to a defined taxonomy of relations. Unlike other models of analysis (cf. Taboada & Mann 2006b), which require the definition of specific rules such as the language of the text, this model can be applied regardless of the language under study and of the processes of syntactic formation. The units extracted from the Corpógrafo, automatically readjusted according to their collocation, are not subject to formation criteria (for example, syntactic ones).
The future development of this methodology does, however, require some adjustments. First, the classification of the relations applicable in a specific context starts from “plausibility judgements”, so the results obtained in this preliminary phase need to be validated by other “agents”. Second, future versions of the Corpógrafo should make it possible to identify other details in the text, such as titles, italics and other elements whose rhetorical relevance depends on qualitative rather than quantitative factors. Third, we will seek to apply other taxonomies (e.g. RST) to this model of analysis and cross-check the results obtained.
In conclusion, this model of analysis allows us to explain the coherence of a text by considering the relations between its different parts, each of which has a role and a function in the text. The relations resulting from this analysis, whose results we will present in detail, show that English and Portuguese use clearly distinct conventions in artistic discourse: whereas English shows a clear, linear structure, Portuguese shows a more entangled, complex and even hesitant argumentation.
References
Biber, D., Conrad, S., Reppen, R., & Aitchison, J. (2000). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Dijk, T. A. v. (1985). 'Discourse Analysis as a New Cross-Discipline'. In Dijk, T. A. v. (Ed.), Handbook of Discourse Analysis (Vol. 1: Disciplines of Discourse, pp. 1-10). London: Academic Press Inc. (London) Ltd.
Grosz, B. J., & Sidner, C. L. (1986). 'Attention, Intentions, and the Structure of Discourse'. Computational Linguistics, 12, 175-204.
Kittredge, R., Korelsky, T., & Rambow, O. (1991). 'On the Need for Domain Communication Language'. Computational Intelligence, 7, 305-314.
McEnery, T., & Wilson, A. (2001). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press.
Polanyi, L. (1988). 'A Formal Model of the Structure of Discourse'. Journal of Pragmatics, 12, 601-638.
Sarmento, L., Maia, B., & Santos, D. (2004). The Corpógrafo - a Web-based environment for corpora research.
Sperber, D., & Wilson, D. (1995). Relevance: Communication and Cognition (2nd ed.). Oxford: Blackwell.
Taboada, M., & Mann, W. C. (2006a). 'Applications of Rhetorical Structure Theory'. Discourse Studies, 8.
Taboada, M., & Mann, W. C. (2006b). 'Rhetorical Structure Theory: looking back and moving ahead'. Discourse Studies, 8(423).

Papers by Rui Sousa-Silva

The number of computational approaches to forensic linguistics has increased significantly over the last decades, as a result not only of increasing computer processing power, but also of the growing interest of computer scientists in natural language processing and in forensic applications. At the same time, forensic linguists faced the need to use computer resources in both their research and their casework – especially when dealing with large volumes of data. This article presents a brief, non-systematic survey of computational linguistics research in forensic contexts. Given the very large body of research conducted over the years, as well as the speed at which new research is regularly published, a systematic survey is virtually impossible. Therefore, this survey focuses on some of the studies that are relevant in the field of computational forensic linguistics. The research cited is discussed in relation to the aims and objectives of the linguistic analysis in forensic context...

Proceedings of the 6th International …, Jan 1, 2008

'N-grams' and 'Multi-word Units' are expressions used by engineers to describe significant word combinations in text of the kind that linguists refer to as 'collocations', 'lexical bundles' and similar terms. This paper explores the different perspectives of engineers and linguists and describes research that has been done on word combinations as terminology, discourse markers and paraphrases.

In recent years, several cases of plagiarism attracted media attention worldwide, due to the high profile of the suspected plagiarists. The highest-profile cases involved politicians, such as the German Defence Minister Guttenberg (2011), the Romanian Prime Minister Victor Ponta (2012), and the German Education Minister Schavan (2013). The two German ministers had their doctoral titles rescinded and eventually resigned. In both cases, the instances of plagiarism were detected and made public by whistleblowers, who publicly demonstrated that the two ministers had plagiarised (substantial) parts of their theses. The serious impact of these cases led to discussion of the possibility that anonymous allegations of plagiarism would no longer be investigated, and moreover that a statute of limitations on the investigation of suspected plagiarism cases could be introduced. The possible implications of a decision of this nature, if adopted, raise some questions. This paper presents, on the one ha...

International Journal of Speech Language and the Law, 2014

Language in Society, 2011

Linguagem e Direito: os eixos temáticos, 2015

Journalistic plagiarism is one of the most challenging topics in plagiarism research. Unlike student plagiarism, textual reuse by journalists with no indication (or only a very limited indication) of the original sources is often not considered plagiarism, not even when the reuse is substantial. Since the boundaries that mark a text as plagiarising depend as much on the applicable definition of plagiarism and on the author's intention as on the text genre, journalists' use of large volumes of text without adequate attribution of authorship tends to be played down or even dismissed. This tendency rests on the assumption that news pieces supposedly report facts and events from the "real world". And since these facts and events cannot (or, for inherently ethical reasons, should not) be reported at the expense of journalistic accuracy, the more faithful the description of the facts, the less freedom the journalist has for creative writing and, in theory, the higher the permissible degree of textual overlap. In this context, textual overlap can easily be associated with professionalism, which makes it difficult to regard any journalistic text as plagiarism.

Plagiarism detection methods have improved significantly over the last decades, and as a result of the advanced research conducted by computational and mostly forensic linguists, simple and sophisticated textual borrowing strategies can now be identified more easily. In particular, simple text comparison algorithms developed by computational linguists allow literal, word-for-word plagiarism (i.e. where identical strings of text are reused across different documents) to be easily detected (semi-)automatically (e.g. Turnitin or SafeAssign), although these methods tend to perform less well when the borrowing is obfuscated by introducing edits to the original text. In this case, more sophisticated linguistic techniques, such as an analysis of lexical overlap, are required to detect the borrowing. However, these have limited applicability in cases of 'translingual' plagiarism, where a text is translated and borrowed without acknowledgment from an original in another language. Considering that (a) traditionally non-professional translation (e.g. literal or free machine translation) is the method used to plagiarise; (b) the plagiarist usually edits the text for grammar and syntax, especially when machine-translated; and (c) lexical items are those that tend to be translated more correctly, and carried over to the derivative text, this paper proposes a method for 'translingual' plagiarism detection that is grounded on translation and interlanguage theories, as well as on the principle of 'linguistic uniqueness'. Empirical evidence from the CorRUPT corpus (Corpus of Reused and Plagiarised Texts), a corpus of real academic and non-academic texts that were investigated and accused of plagiarising originals in other languages, is used to illustrate the applicability of the methodology proposed for 'translingual' plagiarism detection.
Finally, applications of the method as an investigative tool in forensic contexts are discussed.
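The idea of lexical overlap mentioned in this abstract can be illustrated with a small Python sketch. It is not the detection method the paper proposes: it only computes a Jaccard-style overlap between the word sets of two texts, and it assumes that the suspect text has already been translated into the language of the possible source by some means outside the sketch. The function names and the optional `stopwords` parameter are hypothetical.

```python
import re


def content_words(text, stopwords=frozenset()):
    """Return the set of lowercased word types in text, minus any stop words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {w for w in words if w not in stopwords}


def lexical_overlap(suspect, source, stopwords=frozenset()):
    """Jaccard overlap (0.0 to 1.0) between the word types of two texts.

    Assumes both texts are in the same language, e.g. after the suspect
    text has been translated back into the language of the source.
    """
    a = content_words(suspect, stopwords)
    b = content_words(source, stopwords)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A high score only flags a pair of texts for closer reading; what counts as suspiciously high depends on genre and text length, and no threshold is suggested here.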

International Journal for Educational Integrity, 10(1)

In this paper we propose a set of stylistic markers for automatically attributing authorship to micro-blogging messages. The proposed markers include highly personal and idiosyncratic editing options, such as ‘emoticons’, interjections, punctuation, abbreviations and other low-level features. We evaluate the ability of these features to help discriminate the authorship of Twitter messages among three authors. For that purpose, we train SVM classifiers to learn stylometric models for each author based on different combinations of the groups of stylistic features that we propose. Results show relatively good performance in attributing authorship of micro-blogging messages (F = 0.63) using this set of features, even when training the classifiers with as few as 60 examples from each author (F = 0.54). Additionally, we conclude that emoticons are the most discriminating features in these groups.
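As a rough illustration of what low-level stylistic markers of this kind might look like in practice, the Python sketch below counts a few of them in a single message. It covers feature extraction only; the SVM training step is not shown. The feature set, the `EMOTICON` pattern and the function name are assumptions made for the example and are not the paper's actual feature groups.

```python
import re

# Assumed pattern for a handful of Western-style emoticons such as :) ;-) :D :(
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")


def stylistic_features(message):
    """Count a few low-level stylistic markers in one short message."""
    return {
        "emoticons": len(EMOTICON.findall(message)),
        "exclamations": message.count("!"),
        "questions": message.count("?"),
        # Runs of three or more full stops count as one ellipsis.
        "ellipses": len(re.findall(r"\.{3,}", message)),
        "length": len(message),
    }
```

Dictionaries like this one could be turned into numeric vectors and passed to a classifier of the reader's choice; which features to include and how to scale them are design decisions this sketch leaves open.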

In this paper we compare the robustness of several types of stylistic markers to help discriminate authorship at sentence level. We train an SVM-based classifier using each set of features separately and perform sentence-level authorship analysis over a corpus of editorials published in a Portuguese quality newspaper. Results show that features based on POS information, punctuation and word / sentence length contribute to a more robust sentence-level authorship analysis.

Recent developments in forensic discourse analysis (i.e., forensic linguistics) have made authorship studies and authorship recognition more reliable in determining the real author in cases of plagiarism or even criminal offense where linguistic proof is involved. In this paper, we discuss the usefulness of discourse markers such as word length, sentence length and lexical density in authorship recognition in Portuguese. We show that these markers remain valid in Portuguese, as in English, and that texts written by different authors use distinct linguistic patterns, and show different authorship markers that differentiate them from all other texts.