
It has been nearly seven months since I put up anything on my blog, owing to some life- and time-devouring projects in my professional life. While my blog writing was de-prioritized, I managed to pick up some tricks for text analysis through bite-sized online learning. Today, I am going to make an amateur attempt to put them to use.

Loading libraries

We need two packages for this - tidyverse for all types of data wrangling and tidytext for all types of text analysis.

Data

The dataset consists of all the news that appeared in the online edition of The Daily Star, the leading English newspaper in Bangladesh, over the month of February 2021. You can download the dataset from my Kaggle page -

Download dataset

## Rows: 2,138
## Columns: 4
## $ path     <chr> "https://www.thedailystar.net///frontpage/news/university-...
## $ date     <date> 2021-02-01, 2021-02-01, 2021-02-01, 2021-02-01, 2021-02-0...
## $ headline <chr> "University student dies after rape", "50 lakh shots sent ...
## $ text     <chr> "A private university student in the capital died yesterda...

The dataset has four columns -

  1. Path -> Absolute URL. We won’t need it for this analysis
  2. Date -> Date the news appeared
  3. Headline -> Headline of the news
  4. Text -> News content. We won’t be using this either for this analysis

There are 2,138 news articles in total.

Saturdays seem to be the slowest news days

It seems Saturdays are the slowest news days, with 55-60 articles, whereas on other days we get 70+ articles.
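The full code at the end of the post does not include this step, so here is a minimal sketch of how the per-weekday averages can be computed, assuming the ds data frame created in the full code below (weekdays() is base R):

ds%>%
    count(date,name="articles")%>% # number of articles published per day
    mutate(weekday=weekdays(date))%>% # label each date with its day of the week
    group_by(weekday)%>%
    summarise(avg_articles=mean(articles))%>% # average daily article count per weekday
    arrange(avg_articles) # slowest days first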

Let’s Tokenize

Before going further with the analysis, we need to break the headlines into smaller fragments, i.e. tokenize them. We can tokenize by character, word, or sentence. Here, I am more interested in individual words, so I am going to break the headlines into words. The result for the first headline is shown below.

date         id  word
2021-02-01   1   university
2021-02-01   1   student
2021-02-01   1   dies
2021-02-01   1   after
2021-02-01   1   rape
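The tokenization itself is the unnest_tokens() call in the full code below. For reference, the token argument of tidytext::unnest_tokens() controls the unit - a quick sketch:

# Different tokenization units with tidytext
ds%>%unnest_tokens(word,headline) # by word (the default)
ds%>%unnest_tokens(sentence,headline,token="sentences") # by sentence
ds%>%unnest_tokens(char,headline,token="characters") # by character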

Most frequent words

After removing stop words, we can see that Covid still dominates, as expected, and given the upward trend in recent times I expect it to continue. It is followed by Bangladesh - also expected, as the paper mostly covers news from here. Myanmar features heavily, with the recent crisis catapulting the country into our headlines.
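The plot comes from the frequent-words block in the full code below; the counting step itself boils down to this sketch, using the tidy_ds tokens created there:

tidy_ds%>%
    anti_join(get_stopwords(),by="word")%>% # drop stop words (Snowball lexicon by default)
    count(word,sort=TRUE)%>% # word frequencies, most frequent first
    head(10) # the ten most frequent words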

Sentiment

For sentiment scoring, I am going to use the Bing lexicon, which categorizes words into positive and negative categories only.
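The lexicon ships with tidytext as a plain two-column word/sentiment table; a quick way to peek at it:

get_sentiments("bing")%>% # two columns: word, sentiment
    count(sentiment) # number of positive vs negative entries in the lexicon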

It was a grim month indeed. Out of the 1,347 words matched with the Bing lexicon, 65% were negative.

Apart from 11th February, the net sentiment score was negative every day. Hopefully, things will look up in the near future.

Summary

That’s it for today, folks who have read through my ramblings this far. Language processing is an amazing tool and can help us pick up subtle signals from large amounts of text without actually reading it. For any queries or new challenges, give me a knock via mail or Twitter. Until then, let’s hope for a positive March in the news.

Full Code

# Load libraries

library(tidyverse) # For Data wrangling
library(tidytext) # For text analysis 
library(kableExtra) # For beautiful tables


# Read Data

ds<-read_delim("Daily_Star_February_2021.txt",delim="^") # fields are "^"-separated

ds%>%
    glimpse()


# Tokenize by words

tidy_ds<-ds%>%
    select(date,headline)%>%
    mutate(id=row_number())%>%
    unnest_tokens(word,headline)

# List the first headline in tokenized form

tidy_ds%>%
  filter(id==1)%>%
    kable()%>%
    kable_paper()


# Visualizing most frequent words

tidy_ds%>%
    anti_join(get_stopwords())%>% # remove common stop words
    count(word,sort=TRUE)%>% # word frequencies, most frequent first
    top_n(10)%>% # keep the ten most frequent words
    ggplot(aes(reorder(word,-n),n,fill='#c30c3d',label=n))+
    geom_col()+
    geom_text(nudge_y = 4,size=3)+
    scale_fill_identity()+
    labs(
        title="Most frequent words in headlines of February 2021",
        x="Word",
        y="# of appearance"
    )+
    theme_minimal()+
    theme(
        plot.title=element_text(color='#838383',hjust=0.5,size=18,face='bold'),
        plot.caption=element_text(color='#BD1D10',face='italic'),
        legend.position = "none",
        axis.text.x=element_text(angle=90,size=8),
        axis.text.y=element_blank(),
        panel.grid=element_blank()
    )



# Making the donut of positive and negative

ds_sentiment<-tidy_ds%>%
    inner_join(get_sentiments("bing")) # keep only words present in the Bing lexicon

ds_sentiment%>%
    count(sentiment)%>%
    arrange(desc(sentiment))%>%
    mutate(
        percentage=round(n/sum(n),3)*100, # share of each sentiment
        lab.pos=cumsum(percentage)-0.5*percentage # midpoint of each slice, for label placement
    )%>%
    ggplot(aes(4,percentage,fill=sentiment))+
    geom_bar(stat='identity')+
    geom_text(aes(y=lab.pos,label=paste0(percentage,"%")),size=3,color='white')+
    coord_polar("y",start=0)+
    theme_void()+
    xlim(1,5)+
    scale_fill_manual(values=c('#c30c3d','#00776f'))




# Visualizing net sentiment by date

ds_sentiment%>%
    count(date,sentiment)%>%
    pivot_wider(names_from = sentiment,values_from = n,values_fill=0)%>% # one column each for positive/negative counts
    mutate(sentiment=positive-negative, # net sentiment per day
           fill=if_else(sentiment>0,'#00776f','#c30c3d') # green for net positive, red for net negative
           )%>%
    ggplot(aes(as.Date(date),sentiment,fill=fill,label=sentiment))+
    geom_col(show.legend=FALSE)+
    geom_text(nudge_y = -1,size=3)+
    scale_fill_identity()+
    scale_x_date(date_labels = "%d-%b",date_breaks="1 day",expand=c(0,0))+
    labs(
        title="Net Sentiment of News - Feb'21",
        subtitle='based on Bing Lexicon',
        x="Date",
        y="Net sentiment (Positve-Negative)"
    )+
    theme_minimal()+
    theme(
        plot.title=element_text(color='#838383',hjust=0.5,size=18,face='bold'),
        plot.subtitle=element_text(color='#c30c3d',hjust=0.5,size=12),
        plot.caption=element_text(color='#BD1D10',face='italic'),
        legend.position = "none",
        axis.text.x=element_text(angle=90,size=8),
        axis.text.y=element_blank(),
        panel.grid=element_blank()
    )