Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted or mass reproduced without the explicit permission of the copyright holder.
Abstract
Automated Genre Identification (AGI) of web pages is a
problem of increasing importance, since web genre information (e.g., blog, news,
e-shop) can enhance modern Information Retrieval (IR)
systems. The state of the art in this field treats AGI as a closed-set
classification problem, for which a variety of web page representations and machine
learning models have been intensively studied. In this paper, we study
AGI as an open-set classification problem, which better reflects the
real-world conditions under which AGI is applied in practice. Focusing on the use
of content information, different text representation methods (words and
character n-grams) are tested. Moreover, two classification methods are
examined: one-class SVM learners, used as a baseline, and an ensemble
of classifiers based on random feature subspacing, originally proposed for
author identification. It is demonstrated that very high precision can be
achieved in open-set AGI while recall remains relatively high.
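
To make the open-set idea concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes character 4-gram frequency profiles, cosine similarity, and a vote threshold below which a page is left unlabeled; the function names (char_ngrams, open_set_predict), the parameter values, and the similarity measure are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Return the relative frequency of each character n-gram in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}


def cosine(a, b, features):
    """Cosine similarity of two profiles restricted to the given features."""
    dot = sum(a.get(f, 0.0) * b.get(f, 0.0) for f in features)
    norm_a = sum(a.get(f, 0.0) ** 2 for f in features) ** 0.5
    norm_b = sum(b.get(f, 0.0) ** 2 for f in features) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def open_set_predict(page, genre_profiles, iterations=100,
                     subset_ratio=0.5, threshold=0.7, seed=0):
    """Vote over random feature subsets; return a genre or None (unknown).

    Assumption: a page is assigned to a genre only if that genre is the
    nearest profile in at least `threshold` of the iterations.
    """
    rng = random.Random(seed)
    vocab = sorted(set().union(*genre_profiles.values()))
    if not vocab:
        return None
    k = max(1, int(len(vocab) * subset_ratio))
    wins = Counter()
    for _ in range(iterations):
        subset = rng.sample(vocab, k)  # random feature subspace
        scores = {g: cosine(page, p, subset)
                  for g, p in genre_profiles.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:  # no vote when nothing matches at all
            wins[best] += 1
    if not wins:
        return None
    genre, count = wins.most_common(1)[0]
    return genre if count / iterations >= threshold else None


profiles = {
    "blog": char_ngrams("today i wrote a post about my day and my thoughts"),
    "shop": char_ngrams("add to cart buy now price discount checkout shipping"),
}
label = open_set_predict(char_ngrams("today i wrote a post about my thoughts"),
                         profiles)
```

Returning None instead of forcing the nearest genre is what distinguishes this open-set sketch from a closed-set classifier: raising the threshold trades recall for precision.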