URLs where the regex captures less than the annotation are not consolidated with the clickable links from the PDF document

In this document, the abundance of spaces in the middle of the extracted URL causes our regex to fall short: it only captures the first fragment of the URL. The PDF annotations are correct, but we do not extend the match beyond the URL initially extracted by the regex.

The result is quite messy:

        <div type="acknowledgement">
            <div>
                <head>Acknowledgements</head>
                [...] We thank 
                    <rs type="person">Mr. Tetsuo Kishi</rs> from the 
                    <rs type="affiliation">Department of Medicine, Kyushu University School of Medicine</rs> for the immunohistochemical analysis. We thank 
                    <rs type="person">J. Ludovic Croxford, PhD</rs>, from Edanz (
                    <ref type="url" target="https://jp">https:// jp</ref>. edanz. com/ ac) for editing a draft of this manuscript.
                </p>
            </div>
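
For illustration, the missing extension step could look roughly like the sketch below. It is not GROBID's code and not necessarily what the PR does. The function name `extend_url_match`, the character-offset interface, and the annotation target `https://jp.edanz.com/ac` are all assumptions made for the example. The idea is to keep growing the regex match for as long as the text, with whitespace removed, remains a prefix of the annotation's destination.

```python
from typing import Tuple


def strip_ws(s: str) -> str:
    """Remove every whitespace character from s."""
    return "".join(s.split())


def extend_url_match(text: str, start: int, end: int,
                     annotation_target: str) -> Tuple[int, int]:
    """Hypothetical helper: extend the regex match text[start:end] to the
    right while the whitespace-stripped text is still a prefix of the
    annotation's destination URL."""
    target = strip_ws(annotation_target)
    # If the regex match already disagrees with the annotation, leave it alone.
    if not target.startswith(strip_ws(text[start:end])):
        return start, end

    best_end = end
    pos = end
    while pos < len(text):
        pos += 1
        if not target.startswith(strip_ws(text[start:pos])):
            break
        # Only move the boundary on non-space characters, so the extended
        # match never ends on trailing whitespace.
        if not text[pos - 1].isspace():
            best_end = pos
    return start, best_end


# Example based on the sentence above; the annotation target is assumed.
text = "from Edanz (https:// jp. edanz. com/ ac) for editing a draft"
start = text.index("https")
end = start + len("https:// jp")  # what the regex captured
new_start, new_end = extend_url_match(text, start, end, "https://jp.edanz.com/ac")
extended = text[new_start:new_end]
```

With this input, `extended` covers the whole fragmented URL up to, but not including, the closing parenthesis, and the `target` attribute could then be taken from the annotation rather than from the text.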

PDF: 10_10.1038_s41598-021-96064-6.pdf

I've already drafted a fix in PR #1190.