Erno Liukkonen - Academia.edu

Papers by Erno Liukkonen

Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771-1929: Early Results Using the PIVAJ Software

Datech 2019, 2019

This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material (https://digi.kansalliskirjasto.fi/etusivu?set_language=en) of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software (https://extranet.content-conversion.com). The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been searching for an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy [11-13, 16, 17]. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set of 56 issues of the newspaper from the years 1869-1898 with 4 pages each, i.e. 224 pages in total. In the first annotation and experiment phase, we annotated a subset of 28 issues (112 pages) and conducted preliminary experiments. Once the preliminary annotation and experimentation had resulted in a consistent practice, we fixed the annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford [6]. The results show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 on the 56-page evaluation set under the three evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar over the time span of the data.

CCS CONCEPTS
• CCS → Applied computing → Document management and text processing → Document capture → Document analysis
• CCS → Information systems → Information retrieval → Document representation → Document structure
• CCS → Information systems → Information systems applications → Digital libraries and archives

KEYWORDS
Document layout analysis, article extraction, historical digitized Finnish newspaper archives, PIVAJ software

