Ngo Kien Tuan, Vo Dinh Hieu, Bui Ngoc Thang, Pham Le Viet Anh, Pham Khanh Ly, Phan Hai

Abstract

Bibliometric databases are now indispensable to researchers and research institutions, who use them to find research articles and to assess the performance of researchers and institutions. When evaluating the research performance of an organization, correctly identifying the institutions of an article's authors is decisive. However, popular bibliometric databases such as Scopus and Web of Science have not addressed this point effectively. We therefore propose an approach to revise the authors' affiliation information of articles in bibliometric databases. We build a model that classifies articles to institutions with high accuracy by combining bag-of-words and n-gram techniques to extract features from affiliation strings. These features are then weighted according to their importance to each institution. Affiliation strings are transformed into a new feature space by integrating the feature weights with local characteristics of the words and phrases that make up each string. Finally, a support vector classifier is trained on this feature space to learn a predictive model. Our experiments show that the proposed model achieves an accuracy of about 99.1%.
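
The feature steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the weighting scheme or the transformed feature space, so the TF-IDF-style weight below, the one-coordinate-per-institution vector, the function names, and the toy affiliation strings are all assumptions made for illustration. The classifier step is left out; the resulting vectors would be the input to a support vector classifier.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the string and split on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def extract_features(text, max_n=2):
    """Count single words (bag of words) and word n-grams up to length max_n."""
    tokens = tokenize(text)
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


def feature_weights(labeled):
    """Assumed stand-in weighting (TF-IDF style), not the paper's formula.

    weight(f, inst) = share of inst's strings containing f
                      * log(1 + number of institutions / institutions using f)
    """
    doc_freq = defaultdict(Counter)  # institution -> feature -> strings containing it
    n_strings = Counter()            # institution -> number of training strings
    for text, inst in labeled:
        n_strings[inst] += 1
        doc_freq[inst].update(extract_features(text).keys())
    spread = Counter()               # feature -> number of institutions using it
    for inst in doc_freq:
        spread.update(doc_freq[inst].keys())
    n_inst = len(n_strings)
    weights = {}
    for inst, counts in doc_freq.items():
        for feat, df in counts.items():
            share = df / n_strings[inst]
            weights[(feat, inst)] = share * math.log(1 + n_inst / spread[feat])
    return weights


def transform(text, weights, institutions):
    """Map a string to one score per institution (assumed feature space)."""
    feats = extract_features(text)
    return [
        sum(count * weights.get((feat, inst), 0.0) for feat, count in feats.items())
        for inst in institutions
    ]


# Toy, fictional training data: (affiliation string, institution label).
labeled = [
    ("Dept. of Physics, Alpha University, Springfield", "alpha"),
    ("Alpha Univ., Faculty of Physics", "alpha"),
    ("Beta Institute of Technology, Springfield", "beta"),
    ("Inst. of Technology Beta, Dept. of Computing", "beta"),
]
institutions = sorted({inst for _, inst in labeled})
weights = feature_weights(labeled)
vector = transform("Physics Department, Alpha University", weights, institutions)
```

Here `vector` holds one score per institution, in the order given by `institutions`. In a full pipeline, vectors like this for every training string, together with their labels, would be passed to a support vector classifier to learn the predictive model.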


Keywords:
Affiliation, Disambiguation, Data cleaning, Classification, Supervised learning, if-iif, Support vector machine, Support vector classifier

