Ariane Pinche - Academia.edu

Uploads

Conference papers by Ariane Pinche

Stylometry for Noisy Medieval Data: Evaluating Paul Meyer’s Hagiographic Hypothesis

Digital Humanities Conference (DH2019), 2019

Stylometric analysis of medieval vernacular texts remains a challenge: the importance of scribal variation, be it graphic or more substantial, as well as the variants and errors introduced in the tradition, complicates the task of the would-be stylometrist. Basing the analysis on copies of several texts made by a single hand can partially work around this issue (Camps & Cafiero, 2012), but the limited availability of complete diplomatic transcriptions can make this difficult. In this paper, we use a workflow combining handwritten text recognition and stylometric analysis, and apply it to the hagiographic works contained in MS BnF, fr. 412. We seek to evaluate Paul Meyer's hypothesis about the constitution of groups of hagiographic works, as well as to examine potential authorial groupings in a largely anonymous corpus.

Papers by Ariane Pinche

Chaînes d’acquisition et de pré-éditorialisation du texte

Historical Documents and Automatic Text Recognition: Introduction

Journal of data mining and digital humanities, Mar 19, 2024

Artificial colorization of digitized microfilms: a preliminary study

Journal of Data Mining & Digital Humanities

Many of the digitized manuscripts available online are actually digitized microfilms, a technology dating back to the 1930s. Given the progress of artificial colorization, we hypothesize that microfilms could be colored with these recent technologies, and test InstColorization. We train a model on an ad hoc dataset of 18 788 color images that are artificially grayscaled for this purpose. The colorization results are promising but show clear limitations due to the difference between artificially grayscaled images and "naturally" grayscaled microfilms. We evaluate the impact of this artificial colorization on two downstream tasks using Kraken: layout analysis and text recognition. Unfortunately, the results show little to no improvement, which limits the interest of artificial colorization of manuscripts in the computer vision domain.

Océriser les imprimés du XVIe siècle en langue française

In recent years, computational philology has opened the way to new approaches to the study of medieval and modern texts. These approaches, however, require large quantities of data, which can only be obtained by extracting text from digital facsimiles. To do so, research needs effective tools, supported by guidelines that ensure maximum interoperability between the different states of a language (Old French, Middle French, etc.) and the different types of text (manuscripts, printed books, etc.). This article focuses on sixteenth-century printed material in French set in Gothic type, taking as a case study a corpus from French-speaking Switzerland. We propose two models that improve on the current state of the art: one for layout analysis and the other for OCR. These models rely on a controlled vocabulary for describing pages and on a transcription guide for texts in Gothic type.

SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles

Our initiative aims to design a controlled vocabulary for describing the layout of textual sources: SegmOnto. Following a codicological approach rather than a semantic one, it is designed as a generic typology that covers as many cases as possible rather than answering specific needs. Harmonising layout description has a double objective: on the one hand, it facilitates the sharing of annotated data and therefore the training of better models for image segmentation (a crucial preliminary step for text recognition); on the other hand, it allows the development of a shared post-processing workflow and pipeline for transforming ALTO or PAGE files into DH standard formats, which preserves as much as possible the link between the extracted information and the digital facsimile.

Data Diversity in handwritten text recognition. Challenge or opportunity?

HAL (Le Centre pour la Communication Scientifique Directe), Jul 25, 2022

SegmOnto: common vocabulary and practices for analysing the layout of manuscripts (and more)

HAL (Le Centre pour la Communication Scientifique Directe), Sep 5, 2021

Artificial colorization of digitized microfilms: a preliminary study

Journal of Data Mining and Digital Humanities, Apr 12, 2023

Many of the digitized manuscripts available online are actually digitized microfilms, a technology dating back to the 1930s. Given the progress of artificial colorization, we hypothesize that microfilms could be colored with these recent technologies, and test InstColorization. We train a model on a new dataset of 18 788 color images that are artificially grayscaled for this purpose. The colorization results are promising but show clear limitations due to the difference between artificially grayscaled images and "naturally" grayscaled microfilms. We evaluate the impact of this artificial colorization on two downstream tasks using Kraken: layout analysis and text recognition. The results show little to no improvement, which limits the interest of artificial colorization of manuscripts in the computer vision domain. "Many low resolution digital scans of microfilms exist. These are surrogates of surrogates. They can still be (and are) profitably used, for example to corroborate a particular reading. I am however skeptical of using them as a single source for making an edition. Perhaps, indeed, 99% of a manuscript can still be deciphered by using them, but it is about that 1% of cases in which the scribe fumbled a bit with his pen and it is unclear what the word reads. In those 1% cases, you do not wish to have a low-resolution, black and white reproduction of a reproduction as your sole witness" (L. W. C. van Lit [2019]).

Gallic(orpor)a: Traitement des sources textuelles en diachronie longue de Gallica

HAL (Le Centre pour la Communication Scientifique Directe), Jun 17, 2022

Reconnaissance automatique d’écriture et documents historiques

HAL (Le Centre pour la Communication Scientifique Directe), Mar 8, 2023

L’édition à l’ère numérique

HAL (Le Centre pour la Communication Scientifique Directe), Mar 8, 2023

Exploitations et valorisations des données numériques connexes à l’édition

HAL (Le Centre pour la Communication Scientifique Directe), Mar 1, 2023

Comprendre la composition des légendiers à l’aide des méthodes numériques

HAL (Le Centre pour la Communication Scientifique Directe), Jan 24, 2023

Comprendre la composition des premiers légendiers en langue vernaculaire à l’aide des méthodes numériques : acquisition automatique du texte (HTR) et analyses stylométriques

CEM, Nov 29, 2022

Publisher: Centre d'études médiévales Saint-Germain d'Auxerre. Electronic reference: Ariane Pinche, « Comprendre la composition des premiers légendiers en langue vernaculaire à l'aide des méthodes numériques : acquisition automatique du texte (HTR) et analyses stylométriques », Bulletin du centre d'études médiévales d'

SegmOnto : Vocabulaire contrôlé pour décrire les manuscrits et les imprimés

HAL (Le Centre pour la Communication Scientifique Directe), Nov 15, 2022

HTR Models and genericity for Medieval Manuscripts

HAL (Le Centre pour la Communication Scientifique Directe), Jul 22, 2022

Within the infrastructure of the CREMMA project (Consortium for Handwriting Recognition of Ancient Materials), supported by the DIM (research funded by the Île-de-France Region) MAP (Ancient and Heritage Materials), the CREMMALab project combines research questions with the creation and release of data from medieval French literary manuscripts for HTR. The objective of the CREMMALab project is to provide open training data and HTR models for medieval documents. All data and models produced by the project are already available in the CREMMA Medieval repository (Pinche 2022) on the HTR-united catalogue (Chagué, Clérice, and Chiffoleau, 2021). In accordance with this objective, the project implements transcription protocols to optimise the training of HTR models and to produce homogeneous and shareable data and models.

Gallic(orpor)a: Traitement des sources textuelles en diachronie longue de Gallica

Guide de transcription pour les manuscrits du Xe au XVe siècle

HAL (Le Centre pour la Communication Scientifique Directe), Jun 16, 2022

Data can always remain interoperable, provided that all divergent choices are documented. However, converting data often means reducing it to the lowest common denominator and causes a substantial loss of information, which one must try to limit.

Handwritten Text Recognition and Medieval manuscripts

HAL (Le Centre pour la Communication Scientifique Directe), Feb 23, 2023

Gallic(orpor)a : Extraction, annotation et diffusion de l’information textuelle et visuelle en diachronie longue

HAL (Le Centre pour la Communication Scientifique Directe), Dec 9, 2022