COMPARATIVE ANALYSIS OF BACK-TRANSLATION MODELS FOR NORMALIZATION OF MOBILE APP USER REVIEWS

Authors

  • Amran Salleh, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia
  • Mohd Hafeez Osman, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia
  • Sa’adah Hassan, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia
  • Mar Yah Said, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia

DOI:

https://doi.org/10.15282/ijsecs.11.2.2025.10.0142

Keywords:

Back-translation, Mobile app reviews, Machine learning, Natural language processing, Evaluation metrics

Abstract

The proliferation of mobile apps has led to exponential growth in user-generated reviews, which are often noisy, informal, and linguistically diverse, posing significant challenges for automated analysis in requirements engineering. This study evaluates whether back-translation (BT) can normalize informal reviews while preserving meaning, and which model (Google Translate vs. Facebook M2M100_418M) offers better semantic preservation, grammatical quality, and lexical alignment. We collected 323 Google Play reviews (667 sentences) from three Malaysian government apps. Texts were cleaned and colloquial forms expanded, after which BT was applied using Malay as the intermediate language. Evaluation used four metrics: semantic similarity (Sentence-BERT), grammar error count (LanguageTool), BLEU (NLTK), and perplexity (GPT-2). Model differences were tested with paired t-tests and Wilcoxon signed-rank tests, while paired scatterplots showed distributional patterns. Google was significantly better on semantic similarity (t(322)=5.38, p<.001), grammar errors (t(322)=3.66, p<.001), and BLEU (t(322)=2.99, p=.003); effect sizes were small to moderate. Perplexity differences were not significant, indicating comparable sentence-level fluency. Visualizations confirmed Google’s steadier performance with fewer extreme outliers. BT is therefore a practical normalization step for noisy reviews. For the English–Malay pipeline studied here, Google provides more reliable semantic preservation and grammatical quality, while both systems are similar in fluency. However, the generalizability of these results is constrained by the relatively modest sample size (323 reviews, 667 sentences); future work should validate the findings on larger datasets and explore hybrid strategies combining the strengths of both models.
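As a rough illustration of the comparison described in the abstract, the sketch below computes sentence-level BLEU (via NLTK, with smoothing) for two back-translation outputs against the original review sentences, then applies a paired t-test and a Wilcoxon signed-rank test with SciPy. The review sentences and system outputs here are invented placeholders, not data from the study, and the other metrics reported in the paper (Sentence-BERT similarity, LanguageTool grammar counts, GPT-2 perplexity) are omitted for brevity.

```python
# Sketch of the paired-metric comparison, using toy data in place of
# the study's 667 review sentences.
from statistics import mean

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical examples: each original review sentence paired with the
# two systems' back-translations (placeholders, not the study's data).
originals = [
    "the app keeps crashing when i open it",
    "login page is very slow today",
    "please add dark mode to this app",
]
google_bt = [
    "the app keeps crashing when i open it",
    "the login page is very slow today",
    "please add a dark mode to this app",
]
m2m_bt = [
    "the application always crashes when opened",
    "login page very slow today",
    "please add dark mode in this app",
]

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU; smoothing is needed for short sentences."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

google_scores = [bleu(o, c) for o, c in zip(originals, google_bt)]
m2m_scores = [bleu(o, c) for o, c in zip(originals, m2m_bt)]

# Paired tests over the same sentences, as in the study's design.
t_stat, t_p = ttest_rel(google_scores, m2m_scores)
w_stat, w_p = wilcoxon(google_scores, m2m_scores)

print(f"mean BLEU  Google={mean(google_scores):.3f}  M2M={mean(m2m_scores):.3f}")
print(f"paired t-test: t={t_stat:.2f}, p={t_p:.3f}")
print(f"Wilcoxon signed-rank: W={w_stat:.2f}, p={w_p:.3f}")
```

With only three toy pairs the p-values are not meaningful; the point is the pairing, where each sentence contributes one score per system so the tests compare distributions of per-sentence differences, as done in the paper over 323 reviews.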


Published

2025-12-18

How to Cite

[1]
Amran Salleh, Mohd Hafeez Osman, Sa’adah Hassan, and Mar Yah Said, “COMPARATIVE ANALYSIS OF BACK-TRANSLATION MODELS FOR NORMALIZATION MOBILE APP USER REVIEWS”, IJSECS, vol. 11, no. 2, pp. 124–136, Dec. 2025, doi: 10.15282/ijsecs.11.2.2025.10.0142.
