Comparative Analysis on Academic Paper Similarity using Jaccard and Levenshtein and Blocking

Muhammad Rizqi Nur
Gandhi Surya Buana
Nur Aini Rakhmawati

Abstract

Paper search engines have made it easier for academics to conduct literature reviews, but easy does not mean accurate. For certain niche topics, search results are often poor. Snowballing can help, but it is limited by the initial set of articles, and in particular by what their authors had access to at the time of writing. As an alternative, paper databases recommend articles related to a given article, but those recommendations are limited to that database. A tool that finds similar articles without relying on a specific database would be very helpful, but an appropriate method for measuring article similarity must be determined first. This research measures article similarity based on title, author, and keywords using the Weighted Jaccard Measure and Levenshtein distance, and evaluates the results. It also compares performance when overlap blocking and stop word removal are added. Jaccard alone evaluates quite poorly, whereas Levenshtein combined with Jaccard gives decent results. In addition, placing the most weight on the title produces the best results. Overlap blocking and stop word removal increase processing time rather than reduce it. Overlap blocking with a minimum overlap of 1 cuts the number of pairwise measurements by almost half, but a minimum overlap above 1 discards many pairs that should be considered similar. Removing stop words improves Jaccard and Levenshtein performance but requires the similarity threshold to be adjusted.
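To make the measures named in the abstract concrete, the following Python sketch illustrates token-set Jaccard similarity, Levenshtein distance, a weighted combination over title, authors, and keywords, and overlap blocking on title tokens. It is a minimal illustration under stated assumptions, not the authors' implementation: the stop word list, the field names, the weights, and the choice to block on title tokens are all hypothetical, and the weighted sum of per-field Jaccard scores is only one plausible reading of the Weighted Jaccard Measure.

```python
from itertools import combinations

# Illustrative stop word list (an assumption, not the list used in the study).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "using"}


def tokens(text, remove_stop_words=False):
    """Lowercase, split on whitespace, and return the set of tokens."""
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return set(words)


def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    """Normalize edit distance to a similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def paper_similarity(p, q, weights):
    """Weighted sum of per-field Jaccard scores (assumed formulation)."""
    return sum(w * jaccard(tokens(p[f]), tokens(q[f]))
               for f, w in weights.items())


def overlap_blocking(papers, min_overlap=1):
    """Keep only pairs whose titles share at least min_overlap tokens."""
    title_tokens = [tokens(p["title"], remove_stop_words=True) for p in papers]
    return [(i, j) for i, j in combinations(range(len(papers)), 2)
            if len(title_tokens[i] & title_tokens[j]) >= min_overlap]
```

With a hypothetical weighting such as `{"title": 0.6, "authors": 0.2, "keywords": 0.2}`, only the pairs returned by `overlap_blocking` would be scored with `paper_similarity`. Raising `min_overlap` shrinks the candidate list further, which mirrors the trade-off described above: fewer measurements, but a higher risk of dropping pairs that are in fact similar.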

How to Cite
[1]
M. R. Nur, G. S. Buana, and N. A. Rakhmawati, “Comparative Analysis on Academic Paper Similarity using Jaccard and Levenshtein and Blocking”, JuTISI, vol. 9, no. 2, pp. 272 –, Aug. 2023.