VNeSafe: Machine Learning-assisted System for Detecting Malicious URLs and Spam Calls
Abstract
Spam calls and malicious Uniform Resource Locators (URLs) have become major concerns for Internet users. Malicious URLs can initiate phishing, spam, and drive-by-download attacks, while spam calls are a persistent nuisance for ordinary users. To address these problems, this paper presents VNeSafe, a machine learning-assisted system. VNeSafe identifies spam phone numbers by leveraging user feedback: it tracks how many times each phone number has been reported as spam and, once that count exceeds a predetermined threshold, automatically adds the number to a blacklist and blocks it. To detect malicious URLs, VNeSafe extracts features from each URL using TF-IDF (term frequency-inverse document frequency), a weighting technique from natural language processing. The Random Forest algorithm then uses these features to determine whether the URL is malicious. Our experiments show that Random Forest supports real-time detection with an F1-score of 0.9298. The algorithm is ready to be deployed in VNeSafe and used on general-purpose mobile devices.
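The two mechanisms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name SpamReportTracker, the threshold value, the URL tokenizer, and the unsmoothed TF-IDF formula (term frequency times log(N / document frequency)) are all assumptions made for illustration. The Random Forest step is left out; the TF-IDF vectors produced here would serve as its input features.

```python
import math
import re
from collections import Counter


class SpamReportTracker:
    """Counts spam reports per phone number and blacklists numbers whose
    report count exceeds a threshold (hypothetical helper, not from the paper)."""

    def __init__(self, threshold):
        self.threshold = threshold      # assumed to be configured by the operator
        self.report_counts = Counter()  # phone number -> number of spam reports
        self.blacklist = set()

    def report(self, number):
        """Record one spam report; return True if the number is blacklisted."""
        self.report_counts[number] += 1
        # "Over the threshold" is read here as strictly greater than.
        if self.report_counts[number] > self.threshold:
            self.blacklist.add(number)
        return number in self.blacklist


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens (assumed tokenizer)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def tfidf_vectors(urls):
    """Return one {token: weight} dict per URL using unsmoothed TF-IDF."""
    docs = [tokenize_url(u) for u in urls]
    n_docs = len(docs)
    # Document frequency: in how many URLs each token appears at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return vectors


# Example usage with made-up data.
tracker = SpamReportTracker(threshold=2)
for _ in range(3):
    blocked = tracker.report("+84900000000")

features = tfidf_vectors([
    "http://example.com/login",
    "http://example.com/home",
])
```

In this example the number is blacklisted on the third report, since only then does its count exceed the threshold of 2. Tokens shared by every URL (such as "http") receive a weight of zero, while tokens that distinguish one URL from the others (such as "login") receive a positive weight.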
Keywords
VNeSafe, malicious URL, spam call, Random Forest
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.