SEMANTIC SIMILARITY MEASUREMENT FOR MALAY WORDS USING WORDNET BAHASA AND WIKIPEDIA BAHASA MELAYU: ISSUES AND PROPOSED SOLUTIONS
What is the similarity between ‘car’ and ‘automobile’? How much similarity do these two words share? Humans can answer this question easily, but computers cannot. Human language is complicated and ambiguous, and this ambiguity is a barrier that separates human understanding from computer comprehension. Measuring semantic similarity between words is an important and widely practiced task in the field of natural language processing. In this article, we discuss some issues regarding semantic similarity measurement for the Malay language using two Malay lexical resources (WordNet Bahasa and Wikipedia Bahasa Melayu) and then propose solutions to these issues. An experiment was conducted to evaluate the coverage of semantic information in WordNet Bahasa and Wikipedia Bahasa Melayu for 150 translated Malay words (75 word-pairs). The results showed that both WordNet Bahasa and Wikipedia Bahasa Melayu can be adapted to existing techniques from the literature. For WordNet Bahasa, we tested coverage at three word levels (stem level, root level and mix level) to find the most applicable word level for our dataset. This was necessary because WordNet Bahasa is a limited resource and some compound words cannot be matched with the lemmas in its database. The test indicated that the mix level of translated words gave the highest coverage (86.7%), outperforming the stem level (78.7%) and the root level (68.0%). For Wikipedia Bahasa Melayu, we evaluated the coverage of three main features of its articles (gloss definitions, hyperlinks and categories), which are important in several previous techniques. The result of this test was used to choose the best technique based on the coverage of these features. The experiment revealed that the gloss definition feature gave full coverage (100%) for our 75 input word-pairs, compared with hyperlinks and categories (88.0%).
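The coverage figures reported above can be read as simple proportions. The following is a minimal Python sketch, assuming that a word-pair counts as covered only when both of its words are found in the resource. The function name `coverage`, the toy lexicon and the sample word-pairs are hypothetical illustrations and are not taken from the dataset used in the experiment.

```python
def coverage(word_pairs, lexicon):
    """Return the percentage of word-pairs whose two words both occur in the lexicon.

    Assumption: a pair is covered only if BOTH words are present.
    """
    if not word_pairs:
        return 0.0
    covered = sum(1 for first, second in word_pairs
                  if first in lexicon and second in lexicon)
    return 100.0 * covered / len(word_pairs)


# Hypothetical toy data, for illustration only.
toy_lexicon = {"kereta", "automobil", "permata", "pantai", "pemuda"}
toy_pairs = [
    ("kereta", "automobil"),      # both words found, so the pair is covered
    ("permata", "batu permata"),  # compound word missing from the toy lexicon
    ("budak lelaki", "pemuda"),   # compound word missing from the toy lexicon
    ("pantai", "tepi laut"),      # compound word missing from the toy lexicon
]

print(f"Coverage: {coverage(toy_pairs, toy_lexicon):.1f}%")
```

The same function could be applied once per word level (for WordNet Bahasa) or once per article feature (for Wikipedia Bahasa Melayu), by swapping in the corresponding set of available entries.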