It has been nearly seven months since I last posted anything on my blog, thanks to some life- and time-devouring projects in my professional life. While blog writing was de-prioritized, I managed to pick up some text analysis tricks through bite-sized online learning. Today, I am going to make an amateur attempt at putting them to use.
We need two packages for this: tidyverse for all kinds of data wrangling and tidytext for all kinds of text analysis. I also load kableExtra to print tables.
The dataset consists of all the news articles that appeared in the online edition of The Daily Star, the leading English newspaper in Bangladesh, during February 2021. You can download the dataset from my Kaggle page -
    ## Rows: 2,138
    ## Columns: 4
    ## $ path     <chr> "https://www.thedailystar.net///frontpage/news/university-...
    ## $ date     <date> 2021-02-01, 2021-02-01, 2021-02-01, 2021-02-01, 2021-02-0...
    ## $ headline <chr> "University student dies after rape", "50 lakh shots sent ...
    ## $ text     <chr> "A private university student in the capital died yesterda...
The dataset has four columns:
- path -> Absolute URL. We won’t need it for this analysis
- date -> Date the news appeared
- headline -> Headline of the news
- text -> News content. We won’t be using it for this analysis either
There are 2,138 news articles in total.
Saturdays seem to be the slowest news day
Saturdays appear to be the slowest news day, with 55-60 articles each, whereas the other days bring 70+ articles.
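The per-day counts behind this observation can be reproduced with a simple `count()`. Here is a minimal sketch, assuming one row per article with a `date` column as in the dataset; the `toy` tibble and its dates are made up for illustration, and `weekdays()` returns names in your system locale.

```r
library(dplyr)

# Toy stand-in for the real data: one row per article (dates are invented)
toy <- tibble(
  date = as.Date(c("2021-02-05", "2021-02-05", "2021-02-06",
                   "2021-02-07", "2021-02-07", "2021-02-07"))
)

# Number of articles per date, labelled with the day of the week
articles_per_day <- toy %>%
  count(date, name = "articles") %>%
  mutate(weekday = weekdays(date))

articles_per_day
```

On the real data, replacing `toy` with `ds` should give one row per day of February, which is enough to spot the quieter days.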
Before going any further, we need to break the headlines into smaller fragments, that is, tokenize them. We can tokenize by character, word or sentence. Here, I am more interested in individual words, so I am going to break the headlines into words. The result for the first headline is below.
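As a small illustration of word tokenization, here is a sketch using `unnest_tokens()` from tidytext on the first headline of the dataset. The one-row tibble `first_headline` is my own stand-in, not the full data.

```r
library(dplyr)
library(tidytext)

# One-row stand-in holding the first headline from the dataset
first_headline <- tibble(
  id = 1,
  headline = "University student dies after rape"
)

# One row per word; unnest_tokens() lowercases the words by default
tokens <- first_headline %>%
  unnest_tokens(word, headline)

tokens
```

Each word ends up on its own row while `id` keeps track of which headline it came from.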
Most frequent words
After removing stop words, we can see that Covid still dominates, as expected, and with the upward trend in recent times I expect that to continue. It is followed by Bangladesh, which is also expected since the paper mostly covers news from here. Myanmar features heavily, with the recent crisis catapulting the country into our headlines.
For sentiment scoring, I am going to use the Bing lexicon, which classifies words as either positive or negative.
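Stop-word removal and counting can be sketched as below. The headlines in `toy` are invented for illustration, and I am assuming the default English list returned by `get_stopwords()` (which needs the stopwords package installed).

```r
library(dplyr)
library(tidytext)

# Toy headlines invented for illustration (not from the dataset)
toy <- tibble(
  id = 1:3,
  headline = c("Vaccine arrives in the city",
               "The vaccine drive begins",
               "Rain in the city")
)

# Tokenize, drop stop words such as "the" and "in", then count what is left
word_counts <- toy %>%
  unnest_tokens(word, headline) %>%
  anti_join(get_stopwords(), by = "word") %>%
  count(word, sort = TRUE)

word_counts
```

Without the `anti_join()`, words like "the" would top the list and crowd out the informative ones.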
It was a grim month indeed. Of the 1,347 words that matched the Bing lexicon, 65% were negative.
Apart from 11th February, the net sentiment score was negative every day. Hopefully, things will look up in the near future.
That’s it for today, folks, and thanks to those who have read through my ramblings this far. Language processing is an amazing tool that can help us pick up subtle signals from large amounts of text without actually reading it. For any queries or new challenges, give me a knock by mail or on Twitter. Until then, let’s hope for a positive March in the news.
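The matching and the percentage work roughly as follows. This is a sketch on a handful of made-up tokens rather than the real headlines; `get_sentiments("bing")` ships with tidytext, and only words present in the lexicon survive the `inner_join()`.

```r
library(dplyr)
library(tidytext)

# Made-up tokens for illustration; "the" is not a sentiment word
toy_tokens <- tibble(word = c("good", "bad", "terrible", "sad", "the"))

# Keep only words found in the Bing lexicon, with their sentiment label
matched <- toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word")

# Share of each sentiment among the matched words, in percent
sentiment_share <- matched %>%
  count(sentiment) %>%
  mutate(percentage = round(n / sum(n), 3) * 100)

sentiment_share
```

Note that the percentage is taken over matched words only, so headlines full of neutral words do not move it.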
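The net score per day is simply the number of positive words minus the number of negative words. A minimal sketch, assuming a table with one row per matched word; the `toy_sentiment` rows are invented and do not reflect the real counts.

```r
library(dplyr)
library(tidyr)

# Invented rows: one per matched word, with its date and sentiment label
toy_sentiment <- tibble(
  date = as.Date(c("2021-02-10", "2021-02-10", "2021-02-10",
                   "2021-02-11", "2021-02-11", "2021-02-12")),
  sentiment = c("negative", "negative", "positive",
                "positive", "positive", "negative")
)

# Count words per date and sentiment, spread to columns, then subtract
net_by_date <- toy_sentiment %>%
  count(date, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

net_by_date
```

The `values_fill = 0` matters: without it, a day with no positive (or no negative) words would produce `NA` instead of a score.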
    # Load libraries
    library(tidyverse)   # For data wrangling
    library(tidytext)    # For text analysis
    library(kableExtra)  # For beautiful tables

    # Read data (fields are separated by "^")
    ds <- read_delim("Daily_Star_February_2021.txt", delim = "^")
    ds %>%
      glimpse()

    # Tokenize by words
    tidy_ds <- ds %>%
      select(date, headline) %>%
      mutate(id = row_number()) %>%
      unnest_tokens(word, headline)

    # List the first headline in tokenized form
    tidy_ds %>%
      filter(id == 1) %>%
      kable() %>%
      kable_paper()

    # Visualizing most frequent words (stop words removed first)
    tidy_ds %>%
      anti_join(get_stopwords(), by = "word") %>%
      count(word, sort = TRUE) %>%
      top_n(10, n) %>%
      ggplot(aes(reorder(word, -n), n, fill = '#c30c3d', label = n)) +
      geom_col() +
      geom_text(nudge_y = 4, size = 3) +
      scale_fill_identity() +
      labs(
        title = "Most frequent words in headlines of February 2021",
        x = "Word",
        y = "# of appearance"
      ) +
      theme_minimal() +
      theme(
        plot.title = element_text(color = '#838383', hjust = 0.5, size = 18, face = 'bold'),
        plot.caption = element_text(color = '#BD1D10', face = 'italic'),
        legend.position = "none",
        axis.text.x = element_text(angle = 90, size = 8),
        axis.text.y = element_blank(),
        panel.grid = element_blank()
      )

    # Making the donut of positive and negative
    # Keep only the words that appear in the Bing lexicon
    ds_sentiment <- tidy_ds %>%
      inner_join(get_sentiments("bing"), by = "word")

    ds_sentiment %>%
      count(sentiment) %>%
      arrange(desc(sentiment)) %>%
      mutate(
        percentage = round(n / sum(n), 3) * 100,
        lab.pos = cumsum(percentage) - 0.5 * percentage  # label at the middle of each slice
      ) %>%
      ggplot(aes(4, percentage, fill = sentiment)) +
      geom_bar(stat = 'identity') +
      geom_text(aes(y = lab.pos, label = paste0(percentage, "%")), size = 3, color = 'white') +
      coord_polar("y", start = 0) +
      theme_void() +
      xlim(1, 5) +
      scale_fill_manual(values = c('#c30c3d', '#00776f'))

    # Visualizing net sentiment by date
    ds_sentiment %>%
      count(date, sentiment) %>%
      pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
      mutate(
        sentiment = positive - negative,
        fill = if_else(sentiment > 0, '#00776f', '#c30c3d')
      ) %>%
      ggplot(aes(as.Date(date), sentiment, fill = fill, label = sentiment)) +
      geom_col(show.legend = FALSE) +
      geom_text(nudge_y = -1, size = 3) +
      scale_fill_identity() +
      scale_x_date(date_labels = "%d-%b", date_breaks = "1 day", expand = c(0, 0)) +
      labs(
        title = "Net Sentiment of News - Feb'21",
        subtitle = 'based on Bing Lexicon',
        x = "Date",
        y = "Net sentiment (Positive-Negative)"
      ) +
      theme_minimal() +
      theme(
        plot.title = element_text(color = '#838383', hjust = 0.5, size = 18, face = 'bold'),
        plot.subtitle = element_text(color = '#c30c3d', hjust = 0.5, size = 12),
        plot.caption = element_text(color = '#BD1D10', face = 'italic'),
        legend.position = "none",
        axis.text.x = element_text(angle = 90, size = 8),
        axis.text.y = element_blank(),
        panel.grid = element_blank()
      )