Improving the representation and conversion of mathematical formulae by considering their textual context

Moritz Schubotz et al. TUGboat (Provid). 2018 May.

Abstract

Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial for communicating information, e.g., in scientific papers, and for performing computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.

Figures

Figure 1: Graphical user interface to support the creation of our gold standard. The interface provides several TeX input fields (left) and a mathematical expression tree rendered by the VMEXT visualization tool (right).

Figure 2: Overview of the structural tree edit distances (using r = 0, i = d = 1) between the MathML trees generated by the conversion tools and the gold standard MathML trees.

Figure 3: Time in seconds (logarithmic scale) required by each tool to parse the 305 gold standard LaTeX expressions.

Figure 4: Mathematical language processing is the task of mapping textual descriptions to components of mathematical formulae (Part-of-Math tagging).
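
The cost settings quoted in the Figure 2 caption (r = 0, i = d = 1) can be illustrated with a small sketch. The caption does not define the symbols, so the sketch assumes that r, i, and d are the costs of renaming, inserting, and deleting a node; with r = 0, relabeling is free and only differences in tree shape are counted. The nested-tuple tree encoding and the name `tree_edit_distance` are hypothetical, and the code is a textbook ordered-forest recursion with memoization, not the implementation used by the authors.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple of trees.

def tree_edit_distance(a, b, r=0, i=1, d=1):
    """Ordered tree edit distance with rename cost r, insert cost i, delete cost d."""

    @lru_cache(maxsize=None)
    def forest_dist(f, g):
        # f and g are ordered forests (tuples of trees).
        if not f and not g:
            return 0
        if not g:
            # Delete the root of the rightmost tree in f; its children move up.
            _, kids = f[-1]
            return forest_dist(f[:-1] + kids, g) + d
        if not f:
            # Insert the root of the rightmost tree in g.
            _, kids = g[-1]
            return forest_dist(f, g[:-1] + kids) + i
        (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
        rename = 0 if f_label == g_label else r
        return min(
            forest_dist(f[:-1] + f_kids, g) + d,
            forest_dist(f, g[:-1] + g_kids) + i,
            # Match the two rightmost roots, then solve children and siblings separately.
            forest_dist(f_kids, g_kids) + forest_dist(f[:-1], g[:-1]) + rename,
        )

    return forest_dist((a,), (b,))


# Two small MathML-like trees: "a + b" versus "a + b^2" (element names only).
flat = ("mrow", (("mi", ()), ("mo", ()), ("mi", ())))
powered = ("mrow", (("mi", ()), ("mo", ()), ("msup", (("mi", ()), ("mn", ())))))

print(tree_edit_distance(flat, powered))               # two extra nodes in `powered`
print(tree_edit_distance(("mi", ()), ("mn", ())))      # relabeling alone is free when r = 0
print(tree_edit_distance(("mi", ()), ("mn", ()), r=1)) # a nonzero r would count the relabeling
```

Under these assumptions, the measure ignores which element names a tool emits and counts only the nodes that must be inserted or deleted to match the shape of the gold standard tree.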
