Research Article Open Access

SemSimp: A Parametric Method for Evaluating the Semantic Similarity of Digital Resources

Antonio De Nicola1, Anna Formica2, Ida Mele3 and Francesco Taglino3
  • 1 Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), Casaccia Research Centre, Via Anguillarese 301, Rome, Italy
  • 2 Institute of Systems Analysis and Informatics (IASI) “Antonio Ruberti”, National Research Council, Via dei Taurini 19, Rome, Italy
  • 3 Institute of Systems Analysis and Informatics (IASI) “Antonio Ruberti”, National Research Council, Via dei Taurini 19, Rome, Italy

Abstract

SemSimp is a parametric method for evaluating the semantic similarity of digital resources that is based on the notion of information content. It exploits a weighted reference ontology of concepts and requires resources to be semantically annotated, each by means of a set of concepts from the ontology. Specifically, the weights of the concepts can be calculated either by considering the available annotations or only the structure of the ontology. SemSimp was evaluated against six representative semantic similarity methods proposed in the literature. Experiments were run on a large real-world dataset based on the Association for Computing Machinery (ACM) digital library, including both a statistical analysis and an expert judgment assessment. The main result shows that the SemSimp annotation frequency configuration, when combined with the geometric average normalization factor, outperforms the other methods.
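
As a rough illustration of the ideas named in the abstract, the following Python sketch weights concepts by annotation frequency, derives an information content value from each weight, and compares two annotated resources using a geometric-average normalization. It is not the SemSimp formulation, which this page does not spell out. The toy taxonomy, the resource names, the add-one smoothing, the Lin-style concept similarity, and the brute-force matching are all assumptions made for the example.

```python
import math
from itertools import permutations

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "Thing": None,
    "Software": "Thing",
    "Hardware": "Thing",
    "Database": "Software",
    "Compiler": "Software",
}

# Hypothetical resources, each annotated with a set of ontology concepts.
ANNOTATIONS = {
    "paper1": {"Database", "Compiler"},
    "paper2": {"Database"},
    "paper3": {"Hardware"},
}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def frequency_weights(annotations):
    """Weight = smoothed share of annotation occurrences that mention the
    concept or one of its descendants. Add-one smoothing is an assumption
    that keeps unused concepts away from log(0); the root still gets 1."""
    counts = {c: 0 for c in PARENT}
    total = 0
    for concepts in annotations.values():
        for c in concepts:
            total += 1
            for a in ancestors(c):
                counts[a] += 1
    return {c: (counts[c] + 1) / (total + 1) for c in PARENT}


def information_content(concept, weights):
    """Information content as the negative log of the concept weight."""
    return -math.log(weights[concept])


def concept_sim(a, b, weights):
    """Lin-style similarity through the least common ancestor (assumed)."""
    if a == b:
        return 1.0
    anc_a = ancestors(a)
    lca = next(x for x in ancestors(b) if x in anc_a)
    denom = information_content(a, weights) + information_content(b, weights)
    return 2 * information_content(lca, weights) / denom if denom > 0 else 0.0


def resource_sim(set_a, set_b, weights):
    """Best one-to-one pairing of concepts, divided by the geometric mean
    of the two annotation-set sizes. Brute force: small sets only."""
    if not set_a or not set_b:
        return 0.0
    small, large = sorted((list(set_a), list(set_b)), key=len)
    best = max(
        sum(concept_sim(s, l, weights) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / math.sqrt(len(small) * len(large))


weights = frequency_weights(ANNOTATIONS)
for x, y in [("paper1", "paper2"), ("paper2", "paper3")]:
    print(x, y, round(resource_sim(ANNOTATIONS[x], ANNOTATIONS[y], weights), 3))
```

A structure-only configuration could be sketched by replacing `frequency_weights` with a function that derives each weight from the taxonomy alone, for example from the number of descendants of a concept. That choice is likewise an assumption, not something this page states.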

Journal of Computer Science
Volume 20 No. 8, 2024, 841-849

DOI: https://doi.org/10.3844/jcssp.2024.841.849

Submitted On: 12 January 2024 Published On: 27 May 2024

How to Cite: De Nicola, A., Formica, A., Mele, I., & Taglino, F. (2024). SemSimp: A Parametric Method for Evaluating the Semantic Similarity of Digital Resources. Journal of Computer Science, 20(8), 841–849. https://doi.org/10.3844/jcssp.2024.841.849

Keywords

  • Semantic Similarity
  • Information Content
  • Weighted Reference Ontology
  • Semantic Annotation