Ariane Pinche - Academia.edu
Uploads
Conference papers by Ariane Pinche
Digital Humanities Conference (DH2019), 2019
Stylometric analysis of medieval vernacular texts remains a challenge: the extent of scribal variation, whether graphic or more substantial, together with the variants and errors introduced in the course of transmission, complicates the task of the would-be stylometrist. Basing the analysis on several texts copied by a single hand can partially circumvent this issue (Camps & Cafiero, 2012), but the limited availability of complete diplomatic transcriptions can make this difficult. In this paper, we use a workflow combining handwritten text recognition and stylometric analysis, and apply it to the hagiographic works contained in MS BnF, fr. 412. We seek to evaluate Paul Meyer's hypothesis about the constitution of groups of hagiographic works, as well as to examine potential authorial groupings in a largely anonymous corpus.
Papers by Ariane Pinche
Journal of Data Mining and Digital Humanities, Mar 19, 2024
Journal of Data Mining & Digital Humanities
Many of the digitized manuscripts available online are actually digitized microfilms, a technology dating back to the 1930s. Given recent progress in artificial colorization, we hypothesize that microfilms could be colorized with these technologies, and we test InstColorization. We train a model on an ad hoc dataset of 18,788 color images that are artificially grayscaled for this purpose. The colorization results are promising but show clear limitations due to the difference between artificially grayscaled images and "naturally" grayscaled microfilms. We then evaluate the impact of this artificial colorization on two downstream tasks using Kraken: layout analysis and text recognition. Unfortunately, the results show little to no improvement, which limits the interest of artificial colorization of manuscripts in the computer vision domain.
In recent years, computational philology has opened the way to new approaches to the study of medieval and modern texts. These approaches, however, require large quantities of data, which can only be obtained by extracting texts from digital facsimiles. To do so, research needs efficient tools built on guidelines that guarantee maximal interoperability between the different states of a language (Old French, Middle French, etc.) and the different types of texts (manuscripts, printed books, etc.). This article focuses on sixteenth-century printed production in French and in Gothic type, taking as a case study a corpus from French-speaking Switzerland. We propose two models that improve on the current state of the art: one for layout analysis and the other for OCR. These models rely on a controlled vocabulary for page description and on transcription guidelines for texts in Gothic type.
Our initiative aims to design a controlled vocabulary for describing the layout of textual sources: SegmOnto. Following a codicological approach rather than a semantic one, it is designed as a generic typology that covers as many cases as possible rather than answering specific needs. Harmonising layout description has a double objective: on the one hand, it facilitates the pooling of annotated data and therefore the training of better models for image segmentation (a crucial preliminary step for text recognition); on the other hand, it allows the development of a shared post-processing workflow and pipeline for transforming ALTO or PAGE files into DH standard formats, preserving as much as possible the link between the extracted information and the digital facsimile.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 25, 2022
HAL (Le Centre pour la Communication Scientifique Directe), Sep 5, 2021
Journal of Data Mining and Digital Humanities, Apr 12, 2023
Many of the digitized manuscripts available online are actually digitized microfilms, a technology dating back to the 1930s. Given recent progress in artificial colorization, we hypothesize that microfilms could be colorized with these technologies, and we test InstColorization. We train a model on a new dataset of 18,788 color images that are artificially grayscaled for this purpose. The colorization results are promising but show clear limitations due to the difference between artificially grayscaled images and "naturally" grayscaled microfilms. We then evaluate the impact of this artificial colorization on two downstream tasks using Kraken: layout analysis and text recognition. The results show little to no improvement, which limits the interest of artificial colorization of manuscripts in the computer vision domain.
"Many low resolution digital scans of microfilms exist. These are surrogates of surrogates. They can still be (and are) profitably used, for example to corroborate a particular reading. I am however skeptical of using them as a single source for making an edition. Perhaps, indeed, 99% of a manuscript can still be deciphered by using them, but it is about that 1% of cases in which the scribe fumbled a bit with his pen and it is unclear what the word reads. In those 1% cases, you do not wish to have a low-resolution, black and white reproduction of a reproduction as your sole witness." (L. W. C. van Lit [2019])
HAL (Le Centre pour la Communication Scientifique Directe), Jun 17, 2022
HAL (Le Centre pour la Communication Scientifique Directe), Mar 8, 2023
HAL (Le Centre pour la Communication Scientifique Directe), Mar 8, 2023
HAL (Le Centre pour la Communication Scientifique Directe), Mar 1, 2023
HAL (Le Centre pour la Communication Scientifique Directe), Jan 24, 2023
CEM, Nov 29, 2022
Publisher: Centre d'études médiévales Saint-Germain d'Auxerre. Electronic reference: Ariane Pinche, « Comprendre la composition des premiers légendiers en langue vernaculaire à l'aide des méthodes numériques : acquisition automatique du texte (HTR) et analyses stylométriques », Bulletin du centre d'études médiévales d'
HAL (Le Centre pour la Communication Scientifique Directe), Nov 15, 2022
HAL (Le Centre pour la Communication Scientifique Directe), Jul 22, 2022
Within the infrastructure of the CREMMA project (Consortium for Handwriting Recognition of Ancient Materials), supported by the DIM (research funded by the Île-de-France Region) MAP (Ancient and Heritage Materials), the CREMMALab project combines research questions with the creation and release of data from medieval French literary manuscripts for HTR. The objective of the CREMMALab project is to provide open training data and HTR models for medieval documents. All data and models produced by the project are already available in the CREMMA Medieval repository (Pinche 2022), listed in the HTR-united catalogue (Chagué, Clérice, and Chiffoleau, 2021). In accordance with this objective, the project implements transcription protocols to optimise the training of HTR models and to produce homogeneous and shareable data and models.
HAL (Le Centre pour la Communication Scientifique Directe), Jun 16, 2022
Data can always remain interoperable provided that all divergent choices are documented. However, data conversion often means reduction to the lowest common denominator and entails a great loss of information, which one must try to limit.
HAL (Le Centre pour la Communication Scientifique Directe), Feb 23, 2023
HAL (Le Centre pour la Communication Scientifique Directe), Dec 9, 2022