%0 Journal Article %A LIU Han %T Spam Web Detection Based on Hybrid-Sampling and Genetic Algorithm %D 2019 %R 10.13190/j.jbupt.2019-147 %J Journal of Beijing University of Posts and Telecommunications %P 111-117 %V 42 %N 6 %X Spam web detection is often hampered by imbalanced data and a high-dimensional feature space. To address these two problems, an ensemble classification algorithm based on random hybrid sampling and a genetic algorithm is proposed. First, a number of balanced training data subsets are obtained by reducing the number of majority-class samples through random sampling and generating minority-class samples with the synthetic minority over-sampling technique (SMOTE). Then, an improved genetic algorithm is used to reduce the dimensionality of the training data, yielding multiple training data subsets with optimal features. Extreme gradient boosting (XGBoost) is used as the base classifier and trained on each of the balanced data subsets, and the resulting classifiers are combined into a single ensemble classifier by simple voting. Finally, the ensemble is applied to the test set to obtain the final prediction. Experiments show that, compared with XGBoost, the proposed algorithm improves accuracy by about 19.25%, reduces the time needed to build the learning model, and improves classification performance. %U https://journal.bupt.edu.cn/EN/10.13190/j.jbupt.2019-147