Sentiment Analysis about the Sinovac Vaccine on Twitter Using RStudio
Assalamualaikum Wr. Wb.
Hi data enthusiast!
On this occasion, we will learn about Twitter sentiment analysis using RStudio. But first, what is sentiment analysis?
Sentiment analysis, also known as opinion mining, is a branch of text mining that aims to determine how the public (audience) perceives a topic, event, or problem. It is a classification task that assigns a text a positive or negative orientation. Technically, sentiment analysis can be divided into four approaches: machine learning, lexicon-based, rule-based, and statistical modelling. It has many uses, for example social media monitoring, brand monitoring, customer feedback analysis, and market research. Okay, let’s start the discussion.
First, open RStudio, install the required packages, and then load them.
# Install the required packages (only needed once); package names must be quoted
install.packages(c("SnowballC", "twitteR", "tm", "NLP", "SentimentAnalysis",
                   "plyr", "stringr", "ggplot2", "RColorBrewer", "wordcloud2",
                   "sentimentr", "e1071", "caret", "syuzhet"))

# Load the packages
library(SnowballC)
library(twitteR)
library(tm)
library(NLP)
library(SentimentAnalysis)
library(plyr)
library(stringr)  # used later by the scoring function
library(ggplot2)
library(RColorBrewer)
library(wordcloud2)
library(sentimentr)
library(e1071)
library(caret)
library(syuzhet)
Then, sign in to the Twitter API at https://developer.twitter.com/en/apps and create an app. If this succeeds, you will get a consumer key, consumer secret, access token, and access token secret.
# Replace these values with the credentials of your own app
consumer_key <- "1R8C5Qn0lbXj5mRIQxM3JQJly"
consumer_secret <- "QaYPcQTd9ewciY5kY7WCeMyWjKtzSgZP9RTFIEdtB06ZHkugjQ"
access_token <- "1243007371862925312-49s5RvjdhuKX6YgvuksG1wEHftPr4V"
access_secret <- "hfhNKURnpC5DlDc9peVNeyQnlTSn1gii14gCik8UvOIXJ"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
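Since these keys give access to your account, you may prefer not to paste them into a script. Below is a minimal sketch of one alternative, reading them from environment variables with base R. The function name and the four variable names are hypothetical choices of mine, not something twitteR requires.

```r
# Sketch only: the function name and variable names below are assumptions.
# Set the variables beforehand, e.g. in ~/.Renviron or in the shell.
read_twitter_credentials <- function() {
  keys <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")
  values <- Sys.getenv(keys, unset = NA)
  # Stop early with a readable message if anything is missing or empty
  missing <- keys[is.na(values) | values == ""]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(values)  # named list, one element per variable
}

# Usage (assuming twitteR is loaded and the variables are set):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds$TWITTER_CONSUMER_KEY, creds$TWITTER_CONSUMER_SECRET,
#                     creds$TWITTER_ACCESS_TOKEN, creds$TWITTER_ACCESS_SECRET)
```

This way the script can be shared without exposing the keys.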
Then, collect tweets on the chosen topic. Here I take tweets about the Sinovac vaccine, requesting 1000 tweets, which are stored in an .rds file.
# Search for tweets and keep a raw copy on disk
tw <- searchTwitter('vaksin sinovac', n = 1000, since = "2021-07-06", until = "2021-07-20")
View(tw)
saveRDS(tw, file = 'tweet-mentah.rds')

# Reload the raw tweets and build a corpus from the tweet text
tw <- readRDS('tweet-mentah.rds')
d <- twListToDF(tw)
komen <- d$text
komenc <- Corpus(VectorSource(komen))
The next step is to clean the data by deleting symbols, links, and emoticons from the tweets.
# Custom cleaners are wrapped in content_transformer() so that tm_map() keeps the corpus structure
removeURL <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))
twitclean <- tm_map(komenc, removeURL)
removeNL <- content_transformer(function(y) gsub("\n", " ", y))
twitclean <- tm_map(twitclean, removeNL)
replacecomma <- content_transformer(function(y) gsub(",", "", y))
twitclean <- tm_map(twitclean, replacecomma)
removeRT <- content_transformer(function(y) gsub("RT ", "", y))
twitclean <- tm_map(twitclean, removeRT)
removetitik2 <- content_transformer(function(y) gsub(":", "", y))
twitclean <- tm_map(twitclean, removetitik2)
removetitikkoma <- content_transformer(function(y) gsub(";", " ", y))
twitclean <- tm_map(twitclean, removetitikkoma)
# The original pattern "p." is a regex that deletes every "p" plus the character after it.
# Assumption: an ellipsis was intended (titik3 = three dots), so that is what is removed here.
removetitik3 <- content_transformer(function(y) gsub("\\.{3}|\u2026", "", y))
twitclean <- tm_map(twitclean, removetitik3)
removeamp <- content_transformer(function(y) gsub("&", "", y))
twitclean <- tm_map(twitclean, removeamp)
removeUN <- content_transformer(function(z) gsub("@\\w+", "", z))
twitclean <- tm_map(twitclean, removeUN)
# Keep letters and whitespace only
remove.all <- content_transformer(function(xy) gsub("[^[:alpha:][:space:]]*", "", xy))
twitclean <- tm_map(twitclean, remove.all)
View(twitclean)
twitclean <- tm_map(twitclean, stripWhitespace)
inspect(twitclean[1:10])
twitclean <- tm_map(twitclean, removePunctuation)            # punctuation
twitclean <- tm_map(twitclean, content_transformer(tolower)) # convert to lowercase
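To see what these cleaning steps do to a single tweet, here is a small sketch in base R, without tm. The sample tweet is made up for illustration, and `clean_tweet` is a hypothetical helper that chains a subset of the patterns used above.

```r
# Hypothetical helper: applies a subset of the cleaning patterns to plain strings
clean_tweet <- function(x) {
  x <- gsub("http[^[:space:]]*", "", x)      # drop URLs
  x <- gsub("\n", " ", x)                    # newlines become spaces
  x <- gsub("RT ", "", x)                    # drop the retweet marker
  x <- gsub("@\\w+", "", x)                  # drop user mentions
  x <- gsub("[^[:alpha:][:space:]]", "", x)  # keep letters and whitespace only
  x <- tolower(x)
  x <- gsub("\\s+", " ", x)                  # collapse repeated whitespace
  trimws(x)
}

# Made-up example tweet, not taken from the collected data
sample_tweet <- "RT @user123: Vaksin Sinovac aman, kata dokter!\nhttps://t.co/abc123 #vaksin"
cleaned <- clean_tweet(sample_tweet)
print(cleaned)
# → "vaksin sinovac aman kata dokter vaksin"
```

Trying the patterns on one string like this is a quick way to check a regex before applying it to the whole corpus.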
Next, remove stop words, that is, common words that appear in large numbers and are considered to carry no meaning. A list of stop words can be downloaded; look for the stopword-id.txt file, then save it in your working directory.
myStopwords <- readLines("C:\\SEM 6\\DATVIS\\UAS\\stop.txt", warn = FALSE)
twitclean <- tm_map(twitclean,removeWords,myStopwords)
twitclean <- tm_map(twitclean, removeWords,
                    c('kalo','gak','org',"apa","atau","dan","dari","dengan","yang","dg","nya","sih","deh","aku","saja","ini",
                      "karena","krn","dri","disini","dsni","juga","bgt","banget","dgn","jdi","tpi","jadi","for","mak",
                      "nih","agar","kalian","gitu","yg","tu","dah","lam","kau","kalau","sal","rtama","lbh","temtekno"))
The next step is to create a term-document matrix (TDM), a matrix that holds the number of occurrences (frequency) of each word in each document.
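Before building the matrix with tm, a toy version in base R may help to show what it contains. This is only a sketch with three made-up documents; it is not how tm builds the matrix internally.

```r
# Three made-up, already cleaned documents
docs <- c("vaksin sinovac aman", "vaksin sinovac halal", "vaksin gratis")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are occurrence counts
tdm <- vapply(tokens,
              function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
              integer(length(vocab)))
dimnames(tdm) <- list(vocab, paste0("doc", seq_along(docs)))
print(tdm)

# Total frequency of each term over all documents, most frequent first
freq <- sort(rowSums(tdm), decreasing = TRUE)
print(head(freq, 2))
```

Here "vaksin" appears in all three documents and "sinovac" in two, so they come first in the frequency list.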
dtm <- TermDocumentMatrix(twitclean)
m <- as.matrix(dtm)
# Total frequency of each word, sorted from most to least frequent
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, n = 10)
From the output, we can see the words that appear most often in tweets related to the topic “sinovac vaccine”.
Next, a word cloud is displayed, a visual representation of text data in which each word is drawn at a size that reflects how often it occurs.
wordcloud2(d, shape = "circle",
           backgroundColor = "white",
           color = 'random-light', size = 1)
In the word cloud, the largest words are the ones that appear most frequently in the tweets.
Then, compute word associations, which measure how strongly words occur together. The findAssocs function reports the correlation between terms; the higher the correlation (the closer to 1), the stronger the relationship between the words.
# One term is queried, so a single correlation limit is enough
v <- as.list(findAssocs(dtm,
                        terms = c('vaksin'),
                        corlimit = 0.50))
v
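To make the association value more concrete, here is a small sketch that computes the correlation between term counts by hand on a made-up term-document matrix. It only illustrates the idea of correlating one term with the others and is not tm's own implementation.

```r
# Made-up counts: rows are terms, columns are four documents
tdm <- rbind(
  vaksin  = c(1, 1, 1, 0),
  sinovac = c(1, 1, 0, 0),
  halal   = c(0, 1, 0, 1)
)

# Pearson correlation of every term with "vaksin" across the documents
assoc <- cor(t(tdm))["vaksin", ]
assoc <- assoc[names(assoc) != "vaksin"]

# Keep only the terms at or above the correlation limit
strong <- assoc[assoc >= 0.5]
print(round(strong, 2))
```

In this toy data "sinovac" tends to appear in the same documents as "vaksin", so it passes the limit, while "halal" is negatively correlated and is dropped.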
After cleaning the data, save the cleaned data using the following syntax.
dataframe <- data.frame(text = unlist(sapply(twitclean, as.character)), stringsAsFactors = FALSE)
View(dataframe)
write.csv(dataframe, file = "C:\\SEM 6\\DATVIS\\UAS\\twitclean-vaksin.csv")
dataframe[1000, ]

# Read the cleaned CSV file
vaksin_dataset <- read.csv("C:\\SEM 6\\DATVIS\\UAS\\twitclean-vaksin.csv", stringsAsFactors = FALSE)
# Make sure the text column is of type character
review <- as.character(vaksin_dataset$text)
Next, emotion classification is performed using the following syntax.
syuzhet_vector <- get_sentiment(review, method = "syuzhet")
head(syuzhet_vector)
summary(syuzhet_vector)

bing_vector <- get_sentiment(review, method = "bing")
head(bing_vector)
summary(bing_vector)

afinn_vector <- get_sentiment(review, method = "afinn")
head(afinn_vector)
summary(afinn_vector)

# Compare the first rows of each vector using the sign function
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)

d <- get_nrc_sentiment(review)
head(d, 10)
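The sign comparison above reduces every score to -1, 0, or 1, so that methods with different scales can be compared. A minimal sketch with made-up scores (not real output from syuzhet) shows the idea:

```r
# Made-up scores for three sentences; the numbers are illustrative only
scores <- rbind(
  syuzhet = c(0.75, -0.40, 0.00),
  bing    = c(1, -1, 0),
  afinn   = c(3, -2, 1)
)

# Reduce each score to its polarity: -1 negative, 0 neutral, 1 positive
signs <- sign(scores)
print(signs)

# TRUE where all three methods agree on the polarity of a sentence
agree <- apply(signs, 2, function(column) length(unique(column)) == 1)
print(agree)
```

In this toy case the methods agree on the first two sentences and disagree on the third.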
Then, a plot containing the classification of emotions is displayed.
td<-data.frame(t(d))
td_new <- data.frame(rowSums(td[1:1000]))
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2<-td_new[1:8,]
quickplot(sentiment, data = td_new2, weight = count, geom = "bar", fill = sentiment, ylab = "count") +
  ggtitle("Survey sentiments")

barplot(
  sort(colSums(prop.table(d[, 1:8]))),
  horiz = TRUE,
  cex.names = 0.7,
  las = 1,
  main = "Emotions in Text", xlab = "Percentage"
)
From the plot, it can be seen that most Twitter users express fear about the Sinovac vaccine, as shown by the number of words associated with “fear” in the tweets.
Then, score each text and compare the numbers of positive, negative, and neutral tweets.
kalimat2 <- read.csv("C:\\SEM 6\\DATVIS\\UAS\\twitclean-vaksin.csv", header = TRUE)

# Load the positive and negative word lists used for scoring
positif <- scan("C:\\SEM 6\\DATVIS\\UAS\\s-pos.txt", what = "character", comment.char = ";")
negatif <- scan("C:\\SEM 6\\DATVIS\\UAS\\s-neg.txt", what = "character", comment.char = ";")
kata.positif = c(positif, "mencegah", "percaya", "aman")
kata.negatif = c(negatif, "bahaya", "hoax")

score.sentiment = function(kalimat2, kata.positif, kata.negatif, .progress = 'none')
{
  require(plyr)
  require(stringr)
  scores = laply(kalimat2, function(kalimat, kata.positif, kata.negatif) {
    # Strip punctuation, control characters, and digits, then lowercase
    kalimat = gsub('[[:punct:]]', '', kalimat)
    kalimat = gsub('[[:cntrl:]]', '', kalimat)
    kalimat = gsub('\\d+', '', kalimat)
    kalimat = tolower(kalimat)
    # Split the sentence into words
    list.kata = str_split(kalimat, '\\s+')
    kata2 = unlist(list.kata)
    # Flag the words found in each word list
    positif.matches = match(kata2, kata.positif)
    negatif.matches = match(kata2, kata.negatif)
    positif.matches = !is.na(positif.matches)
    negatif.matches = !is.na(negatif.matches)
    # Score = number of positive words minus number of negative words
    score = sum(positif.matches) - (sum(negatif.matches))
    return(score)
  }, kata.positif, kata.negatif, .progress = .progress)
  scores.df = data.frame(score = scores, text = kalimat2)
  return(scores.df)
}
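The arithmetic behind the score is simple: count the positive words, count the negative words, and subtract. Here is a small base R sketch of that idea with made-up sentences and very short word lists. `score_sentence` is a hypothetical helper for illustration and not a replacement for the function above.

```r
# Hypothetical helper: score = positive word count - negative word count
score_sentence <- function(sentence, pos_words, neg_words) {
  cleaned <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- strsplit(cleaned, "\\s+")[[1]]
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

# Tiny illustrative word lists and made-up sentences
pos_words <- c("aman", "percaya", "mencegah")
neg_words <- c("bahaya", "hoax")
sentences <- c("vaksin sinovac aman dan saya percaya",
               "katanya vaksin itu bahaya",
               "vaksin sinovac tiba hari ini")

scores <- vapply(sentences, score_sentence, numeric(1),
                 pos_words = pos_words, neg_words = neg_words,
                 USE.NAMES = FALSE)
print(scores)
# → 2 -1 0
```

The first sentence contains two positive words, the second one negative word, and the third none from either list, so a score above zero reads as positive, below zero as negative, and zero as neutral.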
Then, the resulting scores are classified.
# Score the text
hasil = score.sentiment(kalimat2$text, kata.positif, kata.negatif)
head(hasil)

# Convert the score to a sentiment label
hasil$klasifikasi <- ifelse(hasil$score < 0, "Negatif",
                            ifelse(hasil$score > 0, "Positif", "Netral"))
hasil$klasifikasi
View(hasil)
This results in the following classification.
Next, a plot of the classification results is made.
# Plot the distribution of polarity
ggplot(hasil, aes(x = klasifikasi)) +
  geom_bar(aes(fill = klasifikasi)) +
  scale_fill_brewer(palette = "RdYlBu") +
  labs(title = "Sentiment Analysis of Public Opinion about the Vaksin Sinovac",
       x = "classification", y = "number of tweets") +
  theme(plot.title = element_text(size = 12))  # the title size belongs in theme(), not labs()
plotSentiments2 <- function(hasil, title)
{
  library(ggplot2)
  ggplot(hasil, aes(x = klasifikasi)) +
    geom_bar(aes(fill = klasifikasi)) +
    scale_fill_brewer(palette = "RdGy") +
    ggtitle(title) +
    theme(legend.position = "right") +
    ylab("Number of Comments") +
    xlab("Classification")
}

# Reorder the columns: classification, score, text
data <- hasil[c(3, 1, 2)]
View(data)
write.csv(data, file = "C:\\SEM 6\\DATVIS\\UAS\\data-label-vaksin.csv")
The following plots are generated. Based on the plot, out of 1000 tweets related to the Sinovac vaccine, negative tweets are the most common, followed by neutral tweets, with positive tweets the least common.
That’s all I can explain. Thank you and have a nice day! :)
Wassalamualaikum Wr. Wb.
Reference:
https://medium.com/@17611063/analisis-sentimen-twitter-menggunakan-rstudio-162c8195603d