A Framework for Duplicate Detection from Online Job Postings

IEEE/WIC/ACM International Conference on Web Intelligence

Online job boards have greatly improved the efficiency of job searching and have also provided valuable data for labour market research. However, most (if not all) job boards contain a high proportion of duplicate job postings, because recruiters and job boards seek to improve their coverage of the market by integrating job postings from many different sources. These duplicate postings undermine the usability of job boards and the quality of labour market analytics derived from them. In this paper, we tackle the challenging problem of duplicate detection from online job postings. Specifically, we design a framework for this task and, under the framework, implement and test 24 methods built with four different tokenisers, three vectorisers and six similarity measures. We conduct a comparative study and experimental evaluation of the 24 methods and compare their performance with a baseline approach. All methods are tested on a real-world dataset from a job board platform and are evaluated with six performance metrics. The experiment reveals that the top two methods are Overlap with skip-gram (OS) and Overlap with n-gram (OG), followed by TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS), and that all four of these methods outperform the baseline approach in detecting duplicates.

CCS CONCEPTS • Applied computing → Document analysis; • Computing methodologies → Information extraction; • Information systems → Data cleaning.
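To make the tokeniser-plus-similarity pipeline concrete, the following is a minimal sketch of one combination named in the abstract, Overlap with n-gram (OG). The abstract does not define the tokenisers or the Overlap measure, so this sketch assumes word-level n-grams and the standard overlap coefficient |A ∩ B| / min(|A|, |B|). The function names, the choice of n = 2 and the 0.8 threshold are hypothetical illustrations, not values from the paper.

```python
import re


def word_ngrams(text, n=2):
    """Return the set of word n-grams in `text` (assumed tokeniser, not the paper's)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_coefficient(a, b):
    """Standard overlap coefficient: |A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_duplicate(posting_a, posting_b, n=2, threshold=0.8):
    """Flag two postings as duplicates when their n-gram overlap reaches `threshold`.

    The threshold is a hypothetical value chosen for illustration only.
    """
    score = overlap_coefficient(word_ngrams(posting_a, n), word_ngrams(posting_b, n))
    return score >= threshold


if __name__ == "__main__":
    original = "Senior Python developer wanted in Leeds"
    reposted = "Senior Python developer wanted in Leeds. Apply now!"
    unrelated = "Part-time nurse required in Cardiff"
    print(is_duplicate(original, reposted))
    print(is_duplicate(original, unrelated))
```

Because the overlap coefficient normalises by the smaller set, a posting that has been copied and then extended with extra text can still score highly against its source, which is one plausible reason such a measure suits aggregated job postings.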
