T. C. Rajapakse, R. D. Nawarathna
Abstract
Often dismissed as a harmless pastime for the gullible, pseudoscience nonetheless has devastating effects, including the loss of life. Its menace is amplified by the difficulty of differentiating pseudoscience from science, especially for the untrained eye. We propose a method to automatically detect pseudoscience in text using a fine-tuned RoBERTa (Robustly Optimized BERT Pretraining Approach) model. The dataset of 112,720 full-text articles used in this work is made publicly available to remedy the lack of datasets related to pseudoscience. We present a novel technique, Intelligent ReLabeling (IRL), to minimize mislabeled data, enabling the rapid creation of high-quality textual datasets. We show that IRL eliminates the need for expensive manual verification processes and minimizes domain expertise requirements in many applications. The final model trained with IRL achieves an F1 score of 0.929 on a separate, manually labeled test dataset.
Keywords: Natural Language Processing, Classification, Data Cleaning, Pseudoscience Dataset, Transformer Models
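For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score is computed from true and predicted labels. It assumes a binary labeling in which 1 marks pseudoscience and 0 marks science; the abstract does not state the label encoding or the averaging mode, so both are assumptions here, and the example labels are made up for illustration rather than taken from the paper's data.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (1 = pseudoscience, 0 = science), for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(round(score, 3))
```

Because F1 combines precision and recall, it penalizes both science articles flagged as pseudoscience and pseudoscience articles that go undetected.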
View Full Paper in the Springer Source