Title: How sensitive are the term-weighting models of information retrieval to spam Web pages?
Authors: Arslan, Ahmet
Keywords: Information retrieval; Spam sensitivity; Term weighting; Web search; Web spam
Issue Date: 2019
Publisher: Elsevier Science BV
Abstract: Many term-weighting models have been proposed for information retrieval. In this paper we investigate the extent to which the retrieval effectiveness of term-weighting models is affected by the presence of spam Web pages. We perform retrieval experiments on the ClueWeb09-English (Category A) dataset, a substantial fraction of which consists of spam pages deliberately designed to manipulate commercial search engines, as well as on the ClueWeb12 (Category A) dataset. The ad hoc tasks of the TREC Web tracks 2009 through 2012 are used to examine the spam sensitivity of state-of-the-art retrieval models, with Apache Lucene as the retrieval engine. In addition, the ad hoc tasks of two Web tracks and two Tasks tracks, 2013 through 2016, are included in one part of the experiment, which inspects the number of documents explicitly judged as spam in the search results returned by each retrieval model. Our experimental results show that hypergeometric models of information retrieval are more immune to spam content than other models. All the results presented in this article are fully repeatable and reproducible, with data and code available online at a public GitHub repository. © 2018 Elsevier B.V. All rights reserved.
ISSN: 0020-0190
Appears in Collections: Department of Electrical and Electronics Engineering Collection
Scopus Indexed Publications Collection
WoS Indexed Publications Collection

Items in GCRIS Repository are protected by copyright, with all rights reserved, unless otherwise indicated.