TY - GEN
T1 - PageRank optimization applied to spam detection
AU - Fercoq, Olivier
PY - 2012/12/1
Y1 - 2012/12/1
N2 - We give a new link spam detection and PageRank demotion algorithm called MaxRank. Like TrustRank and Anti-TrustRank, it starts with a seed of hand-picked trusted and spam pages. We define the MaxRank of a page as the frequency of visit of this page by a random surfer minimizing an average cost per time unit. On a given page, the random surfer selects a set of hyperlinks and clicks with uniform probability on any of these hyperlinks. The cost function penalizes spam pages and hyperlink removals. The goal is to determine a hyperlink deletion policy that minimizes this score. The MaxRank is interpreted as a modified PageRank vector, used to sort web pages instead of the usual PageRank vector. We show that the bias vector of the associated ergodic control problem, which is unique up to an additive constant, is a measure of the 'spamicity' of each page, used to detect spam pages. We give a scalable algorithm for MaxRank computation that allowed us to perform numerical experiments on the WEBSPAM-UK2007 dataset. We show that our algorithm outperforms both TrustRank and Anti-TrustRank for spam and nonspam page detection.
AB - We give a new link spam detection and PageRank demotion algorithm called MaxRank. Like TrustRank and Anti-TrustRank, it starts with a seed of hand-picked trusted and spam pages. We define the MaxRank of a page as the frequency of visit of this page by a random surfer minimizing an average cost per time unit. On a given page, the random surfer selects a set of hyperlinks and clicks with uniform probability on any of these hyperlinks. The cost function penalizes spam pages and hyperlink removals. The goal is to determine a hyperlink deletion policy that minimizes this score. The MaxRank is interpreted as a modified PageRank vector, used to sort web pages instead of the usual PageRank vector. We show that the bias vector of the associated ergodic control problem, which is unique up to an additive constant, is a measure of the 'spamicity' of each page, used to detect spam pages. We give a scalable algorithm for MaxRank computation that allowed us to perform numerical experiments on the WEBSPAM-UK2007 dataset. We show that our algorithm outperforms both TrustRank and Anti-TrustRank for spam and nonspam page detection.
M3 - Conference contribution
AN - SCOPUS:84876104834
SN - 9782357680357
T3 - NetGCoop 2012 - 6th International Conference on Network Games, Control and Optimization
SP - 127
EP - 134
BT - NetGCoop 2012 - 6th International Conference on Network Games, Control and Optimization
T2 - 6th International Conference on Network Games, Control and Optimization, NetGCoop 2012
Y2 - 28 November 2012 through 30 November 2012
ER -