Clément Sage - Academia.edu
Papers by Clément Sage
First, I am deeply grateful to Esker for funding this PhD thesis and to all its collaborators who helped me during these three intense years. In particular, I would like to thank Jean-Jacques Bérard and Cédric Viste for having proposed this thesis and for letting me conduct my research with complete freedom. I also thank Thibault Douzon for our scientific exchanges and for his help in conducting the experiments of the last chapter. Last but not least, I would like to offer my special thanks to my industrial supervisor, Jérémy Espinas, for having been my travel companion when presenting our work in Sydney, for his continuous support, and for proofreading my scientific production. I would like to extend my sincere thanks to Alexandre Aussem, Véronique Eglin, and Haytham Elghazel for being my academic supervisors. Their invaluable advice and plentiful experience allowed me to greatly improve my research. I would also like to thank Antoine Doucet and Aurélie Lemaitre for accepting to be the rapporteurs of my PhD manuscript. Additionally, I would like to express my gratitude to Thierry Paquet, Noura Faci, and Yolande Belaïd for accepting to sit on my defense jury, the first two of whom also served on my monitoring committee. I also thank Olivier Commowick for publicly providing this beautiful thesis template 1. My appreciation finally goes out to my family, girlfriend, and friends for their encouragement and support throughout my studies. 1 https://olivier.commowick.org/thesis_template.php
... are significantly more data-efficient than models that learn the extraction task from scratch.
We also reveal valuable knowledge transfer abilities for this language model, since performance improves when it first learns to extract information on another dataset, even if the targeted fields differ from those of the initial task.
The predominant approaches for extracting key information from documents resort to classifiers predicting the information type of each word. However, the word-level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of the information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. The ability of our end-to-end method to retrieve structured information is assessed on a large set of business documents. We show that it performs competitively with a standard word classifier without requiring costly word-level supervision.
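The target representation described above can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's code: the field names and document words are invented, and only the serialization idea is shown, with tags produced ("generated") by the decoder and values taken ("copied") from the document.

```python
# Hypothetical sketch of the XML target sequence for a pointer-generator:
# tags come from the decoder vocabulary, words are copied from the document.
# Field names and values below are illustrative, not from the paper.

def to_xml_tokens(fields):
    """Flatten {field_type: [words]} into a flat target token sequence."""
    tokens = []
    for field, words in fields.items():
        tokens.append(f"<{field}>")   # generated: opening tag for the field type
        tokens.extend(words)          # copied: document words carrying the value
        tokens.append(f"</{field}>")  # generated: closing tag
    return tokens

extracted = {"date": ["2019-03-14"], "total": ["1,250.00"]}
print(to_xml_tokens(extracted))
# ['<date>', '2019-03-14', '</date>', '<total>', '1,250.00', '</total>']
```

Such a target can be produced from the extracted values alone, which is what lets the method avoid word-level annotation.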
Springer eBooks, 2021
As for many text understanding and generation tasks, pretrained language models have emerged as a powerful approach for extracting information from business documents. However, their performance has not been properly studied in the data-constrained settings that are often encountered in industrial applications. In this paper, we show that LayoutLM, a pre-trained model recently proposed for encoding 2D documents, reveals high sample efficiency when fine-tuned on public and real-world Information Extraction (IE) datasets. Indeed, LayoutLM reaches more than 80% of its full performance with as few as 32 documents for fine-tuning. When compared with a strong baseline learning IE from scratch, the pre-trained model needs between 4 and 30 times fewer annotated documents in the toughest data conditions. Finally, LayoutLM performs better on the real-world dataset when it has first been fine-tuned on the full public dataset, thus indicating valuable knowledge transfer abilities. We therefore advocate the use of pre-trained language models for tackling practical extraction problems.
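The data-constrained protocol described above can be sketched as fine-tuning on nested subsets of increasing size (32 documents, then 64, and so on) drawn from the annotated pool. The helper below is an illustrative sketch of that subsampling step only, not the paper's code; the document names and budgets are placeholders.

```python
import random

# Illustrative sketch of a data-constrained evaluation protocol:
# build reproducible, nested training subsets of increasing size,
# so each larger budget strictly extends the smaller one.

def nested_subsets(documents, budgets, seed=0):
    """Return one training subset per budget; subsets are nested."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = documents[:]
    rng.shuffle(shuffled)
    return {b: shuffled[:b] for b in budgets if b <= len(shuffled)}

docs = [f"doc_{i}" for i in range(1024)]       # placeholder document IDs
subsets = nested_subsets(docs, budgets=[32, 64, 128, 256])
print({b: len(s) for b, s in subsets.items()})
# {32: 32, 64: 64, 128: 128, 256: 256}
```

Nesting the subsets keeps comparisons across budgets clean: any score gain at a larger budget comes from the added documents, not a different sample.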
Efficiently extracting information from documents issued by their partners is crucial for companies that face huge daily document flows. In particular, tables contain most of the valuable information in business documents. However, their contents are challenging to parse automatically, as tables from industrial contexts may have complex and ambiguous physical structures. Bypassing structure recognition, we propose a generic method for end-to-end table field extraction that starts with the sequence of document tokens segmented by an OCR engine and directly tags each token with one of the possible field types. Similar to the state-of-the-art methods for non-tabular field extraction, our approach resorts to a token-level recurrent neural network combining spatial and textual features. We empirically assess the effectiveness of recurrent connections for our task by comparing our method with a baseline feedforward network that has local context knowledge added to its inputs. We train and evaluate both approaches on a dataset of 28,570 purchase orders to retrieve the ID numbers and quantities of the ordered products. Our method outperforms the baseline, with a micro F1 score of 0.821 on unknown document layouts compared to 0.764.
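The token-level evaluation implied above can be sketched with a small micro F1 computation over predicted field tags. This is a minimal illustrative implementation, not the paper's evaluation code; the tag names ("ID", "QTY") and the background tag "O" are assumptions.

```python
# Minimal sketch of token-level micro F1 over field tags, ignoring the
# background tag. Tag names are illustrative, not from the paper.

def micro_f1(gold, pred, ignore="O"):
    """Micro-averaged F1 over per-token tags, treating `ignore` as background."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore and g == p:
            tp += 1                    # correctly tagged field token
        elif p != ignore and g != p:
            fp += 1                    # spurious or wrong field tag
            if g != ignore:
                fn += 1                # the true field tag was also missed
        elif p == ignore and g != ignore:
            fn += 1                    # field token left untagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["O", "ID", "ID", "QTY", "O"]
pred = ["O", "ID", "O",  "QTY", "QTY"]
print(round(micro_f1(gold, pred), 3))
# 0.667
```

Scoring only on documents whose layouts were unseen during training, as the abstract does, is what makes the reported 0.821 vs. 0.764 a generalization comparison rather than a memorization one.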
2019 International Conference on Document Analysis and Recognition (ICDAR)
Document Analysis and Recognition – ICDAR 2021 Workshops
Proceedings of the Fourth Workshop on Structured Prediction for NLP