In this post, I want to share the algorithm that I learnt in the Stylometry with R course at the Digital Humanities Summer Institute summer school in Victoria.
So, first of all, what is stylometry? One of the formal definitions says “Style is a property of texts constituted by an ensemble of formal features which can be observed quantitatively or qualitatively” (for more see here). Stylometry can be used for authorship attribution, genre recognition, or distant-reading classification of a collection of texts. Put simply, it estimates how alike texts are by looking at the similarities and differences between them. For example, let’s say there is a book whose chapters were written by different authors and one wants to learn who wrote which chapter. A stylometry algorithm can help to figure that out by looking at the previous writings of those authors.
stylo is an R package developed by Maciej Eder and his colleagues for computational stylistic analyses; more detail is here.
I will use the oppose function of the stylo package. Given two sets of texts, oppose runs a contrastive analysis and returns the lists of words that are strongly preferred and strongly avoided by one set relative to the other.
The data I will use in this post is available here. It contains news headlines published over a period of 15 years in Australia. As a person with no background in the Australian news agenda, this will be a blind analysis for me, but let’s see if we can find something interesting.
Set Working Directory
As a very first step, set the working directory to the folder where you stored the data.
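A minimal sketch of this step, assuming the CSV was saved in a folder of your choice. The path below is hypothetical and only there for illustration; replace it with your own location.

```r
# Hypothetical folder; replace it with the location of abcnews-date-text.csv.
data_dir <- "~/data/abc-headlines"

# Only switch directories if the folder exists, so the sketch does not
# raise an error on a machine where the path is different.
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

# Confirm where R is looking and whether the file is visible from there.
getwd()
file.exists("abcnews-date-text.csv")
```

If the last line returns FALSE, the read.csv call that follows will not find the file.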
# Load data
df <- read.csv(file = "abcnews-date-text.csv", encoding = "UTF-8", header = TRUE)

# View data
head(df)
  publish_date                                      headline_text
1     20030219 aba decides against community broadcasting licence
2     20030219     act fire witnesses must be aware of defamation
3     20030219     a g calls for infrastructure protection summit
4     20030219           air nz staff in aust strike for pay rise
5     20030219      air nz strike to affect australian travellers
6     20030219                  ambitious olsson wins triple jump
I want to compare an equal number of headlines so that the comparison is fair, so I first look at how the headlines are distributed over the years. I will take the first ten percent of the data set, the oldest headlines, as my first set of texts, and the last ten percent, the most recent headlines, as my second set. The two dates 2004-08-20 and 2015-11-08 are the cut points: I will look at how the headlines before 2004-08-20 differ from those after 2015-11-08.
quantile(df$publish_date, probs = seq(0, 1, 0.1))
##       0%      10%      20%      30%      40%      50%      60%
## 20030219 20040820 20060225 20070915 20090205 20100721 20111223
##      70%      80%      90%     100%
## 20130321 20140616 20151108 20171231
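As a sanity check on the idea of comparing roughly equal-sized groups, the sketch below counts how many rows fall on each side of the 10% and 90% cut points. It uses made-up dates rather than the ABC data, and type = 1 is my own choice here so that the cut points are dates that actually occur in the data; the call above uses the default.

```r
# Toy stand-in for df$publish_date (integers in yyyymmdd form); not the real data.
set.seed(1)
days <- seq(as.Date("2003-02-19"), as.Date("2017-12-31"), by = "day")
publish_date <- as.integer(format(sample(days, 5000, replace = TRUE), "%Y%m%d"))

# type = 1 returns observed values, so each cut point is a valid date.
cuts <- quantile(publish_date, probs = c(0.1, 0.9), type = 1)

# Number of rows in the oldest and the most recent tail.
n_old <- sum(publish_date <= cuts[["10%"]])
n_new <- sum(publish_date >= cuts[["90%"]])
c(old = n_old, new = n_new)

# The cut points as proper Date objects, easier to read than raw integers.
cut_dates <- as.Date(as.character(as.integer(cuts)), format = "%Y%m%d")
cut_dates
```

The two counts will not be exactly equal when many rows share the cut-point date, but they should both be close to ten percent of the rows.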
I created two data frames by filtering the data at the cut points, and split each of them into two equal-width date bins with cut().
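library(dplyr)

# Keep the oldest and the most recent ten percent of the headlines
old <- df %>% filter(publish_date <= 20040820)
new <- df %>% filter(publish_date >= 20151108)

# Split each subset into two date bins
old$t = cut(old$publish_date, 2)
new$t = cut(new$publish_date, 2)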
The headlines within the same bin are collapsed together to obtain longer texts, which are then split into words.
# Collapse all headlines within each bin into one long text
old2 = old %>%
  group_by(t) %>%
  mutate(collection = paste0(headline_text, collapse = " ")) %>%
  distinct(t, .keep_all = TRUE)
new2 = new %>%
  group_by(t) %>%
  mutate(collection = paste0(headline_text, collapse = " ")) %>%
  distinct(t, .keep_all = TRUE)

first = old2$collection
second = new2$collection

# Split each long text into a vector of words
first.parsed = as.list(strsplit(first, " "))
second.parsed = as.list(strsplit(second, " "))

names(first.parsed) = paste("before", c(1:2))
names(second.parsed) = paste("after", c(1:2))
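To see what this step produces, here is a small base-R sketch of the same idea on four made-up headlines. It is not the code used above (which relies on dplyr), and the toy data frame is an assumption for illustration only.

```r
# Four made-up headlines standing in for the `old` data frame.
toy <- data.frame(
  publish_date  = c(20030219L, 20030301L, 20040105L, 20040820L),
  headline_text = c("air nz strike", "fire witnesses",
                    "council plan", "rain hits farmers"),
  stringsAsFactors = FALSE
)

# Two equal-width date bins, as in the post.
toy$t <- cut(toy$publish_date, 2)

# One long text per bin: paste the headlines of each bin together.
collapsed <- vapply(split(toy$headline_text, toy$t),
                    paste, character(1), collapse = " ")

# One word vector per bin, named the same way as in the post.
parsed <- strsplit(unname(collapsed), " ")
names(parsed) <- paste("before", seq_along(parsed))
str(parsed)
```

The result is a named list with one character vector of words per bin, which is the shape passed to oppose later on.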
The oppose function is called with the following arguments:
primary.corpus, secondary.corpus: The headlines before 2004-08-20 are assigned to primary.corpus and the headlines after 2015-11-08 are assigned to secondary.corpus.
slice.size: The default value is 10,000, but because we have short headlines here, a smaller number makes more sense.
gui: The default value is TRUE; when it is switched on, the parameters can be set through a graphical interface.
library(stylo)

oppose(primary.corpus = first.parsed,
       secondary.corpus = second.parsed,
       slice.size = 200,
       gui = FALSE)
A possible interpretation
In the output, the preferred word list contains the words that were strongly preferred by the first set of texts compared to the second set, and the avoided list contains the words that were used by the second set of texts but avoided by the first one.
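To make "preferred" and "avoided" more concrete, here is a simplified, Zeta-style score on toy data. This is only a sketch of the general idea, not stylo's implementation: the helper names make_slices and slice_share and the word vectors are all hypothetical. Each text is cut into slices, and a word scores high when it appears in many slices of the primary set and few slices of the secondary set.

```r
# Cut a word vector into consecutive slices of `size` words
# (a trailing partial slice is dropped).
make_slices <- function(words, size) {
  n <- length(words) %/% size
  lapply(seq_len(n), function(i) words[((i - 1) * size + 1):(i * size)])
}

# Share of slices in which each word of `vocab` occurs at least once.
slice_share <- function(slices, vocab) {
  vapply(vocab, function(w) {
    mean(vapply(slices, function(s) w %in% s, logical(1)))
  }, numeric(1))
}

# Made-up word vectors standing in for the two parsed corpora.
primary   <- c("iraq", "war", "talks", "iraq", "troops", "rain",
               "iraq", "summit", "rain")
secondary <- c("facebook", "rain", "vote", "facebook", "budget", "poll",
               "rain", "vote", "app")

vocab <- union(primary, secondary)
score <- slice_share(make_slices(primary, 3), vocab) -
         slice_share(make_slices(secondary, 3), vocab)

# Positive scores: preferred by the primary set; negative: avoided by it.
sort(score, decreasing = TRUE)
```

In this toy example "iraq" ends up at the top because it occurs in every primary slice and in no secondary slice, while "rain" sits in the middle because both sets use it equally often.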
It seems that the recent headlines are more related to domestic issues. Words such as government, queensland, island, farmer, and malcolm (Malcolm Turnbull, the Australian prime minister at the time) were popular. Facebook, which barely existed during the earlier period, also takes its place in the recent Australian news headlines. On the other hand, the older headlines refer strongly to more global issues, as we can see from the popularity of words such as iraqi, iraq, baghdad, saddam, terror, palestinian, israel, spanish, and british.
I hope this post will be helpful in your research. Please contact me if you have any comments or questions.