Roman Urdu sentiment analysis using Machine Learning with best parameters and comparative study of Machine Learning algorithms

  • Sameen Aziz, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan
  • Saleem Ullah, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan
  • Bushra Mughal, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan
  • Faheem Mushtaq, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan
  • Sabih Zahra, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan
Keywords: Machine Learning, TFIDF, Kaggle, SVM, RF, Logistic Regression, Naïve Bayes, AdaBoost, RANSAC, Hyper parameter

Abstract

People use social media and e-commerce websites as an easy and comfortable way to express their feelings about a topic, post, or product. In Asia, people mostly use the Roman Urdu script to express their opinions. Sentiment analysis of Roman Urdu (Bilal et al. 2016) is a challenging task for researchers because of the lack of resources and the unstructured, non-standard syntax and script of the language. We collected a manually annotated dataset of 21000 values from Kaggle and prepared it for machine learning. We then applied different machine learning algorithms (SVM, Logistic Regression, Random Forest, Naïve Bayes, AdaBoost, KNN) (Bowers et al. 2018) with different parameters and kernels, using TFIDF features (unigram, bigram, uni-bigram) (Pereira et al. 2018), to find the best-fitting algorithm. From the best-performing algorithms we chose four and combined them to apply to the dataset; however, after tuning the hyperparameters with Grid Search CV (Ezpeleta et al. 2018) using 5-fold cross-validation, the best model was the Support Vector Machine with a linear kernel, which achieved 80% accuracy, an F1 score of 0.79, precision of 0.79, and recall of 0.78. We then performed experiments on robust linear regression model estimation using RANSAC (Random Sample Consensus) (Huang, Gao, and Zhou 2018) (Chum and Matas 2008), which gave the best estimators with 82.19%.
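
The tuning step described in the abstract (TFIDF n-gram features, an SVM classifier, and a 5-fold grid search) might look roughly like the following sketch. It assumes scikit-learn, which the abstract does not name; the ten toy reviews, the parameter grid values, and all variable names are hypothetical and are not taken from the Kaggle dataset or from the authors' experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy reviews (NOT the Kaggle data): 1 = positive, 0 = negative.
texts = [
    "yeh phone bohat acha hai",
    "mujhe yeh product pasand aya",
    "service bohat achi thi",
    "khana bohat mazedar tha",
    "quality bohat zabardast hai",
    "yeh phone bilkul bekar hai",
    "mujhe yeh product pasand nahi aya",
    "service bohat buri thi",
    "khana bilkul kharab tha",
    "quality bohat ghatiya hai",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# TFIDF features feed directly into the SVM classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Assumed search space: unigram, bigram, and uni-bigram features,
# two kernels, and a few values of the regularisation strength C.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (2, 2), (1, 2)],
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10],
}

# cv=5 gives the 5-fold cross-validation mentioned in the abstract.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 2))
```

On a real corpus the same grid could be scored with F1, precision, and recall by changing the `scoring` argument; the toy data here is far too small for its scores to mean anything.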

Author Biographies

Saleem Ullah, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Head of the Computer Science Department, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Bushra Mughal, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Lecturer at Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Faheem Mushtaq, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Head of the Information Technology Department, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Sabih Zahra, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Ph.D. scholar in the I.T. Department, Khwaja Freed University of Engineering and Information Technology Rahim Yar Khan, Pakistan

Published
2020-10-22
How to Cite
[1]
S. Aziz, S. Ullah, B. Mughal, F. Mushtaq, and S. Zahra, “Roman Urdu sentiment analysis using Machine Learning with best parameters and comparative study of Machine Learning algorithms”, PakJET, vol. 3, no. 2, pp. 172-177, Oct. 2020.