## Background

This analysis came about because the dataset was the topic of the Stockholm R User Group (SRUG) Meetup Group Hackathon 2017 (and took me a while to get back to). The task was simply to do something interesting using the dataset.

Text data can be analysed using a variety of different methods: topic modelling, network analysis, sentiment analysis, word frequencies etc. One less commonly applied metric is that of readability, or in other words, linguistic complexity. The thought was that this might reveal an interesting dimension of the data that might be missed by other tools. It also made for a nice case for demonstrating how readability scores can be applied.

Readability formulas were developed as early as the first half of the twentieth century, and used to be calculated by hand. ‘True’ readability depends on all sorts of factors: the complexity of the ideas expressed, the logical coherence of the text, the words used, and so on. What readability formulas measure is usually primarily a function of the most easily quantifiable aspects of readability: words per sentence, syllables per word, and the like. These components are then combined in a formula that weights each of them appropriately to arrive at a readability score. Many different readability scores exist, which differ primarily in how much weight they give to one concept or another (e.g. sentence length vs word length), or in how aspects of complexity are assessed (e.g. word length vs membership of an easy-word list).
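To make this concrete, here is a minimal sketch of one such formula, the classic Flesch-Kincaid grade level (the weights are the published ones, but the word, sentence and syllable counts below are made-up toy numbers):

# Classic Flesch-Kincaid grade level: weights average sentence length
# (words per sentence) against average word length (syllables per word)
fk_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# Toy example: 120 words, 8 sentences, 180 syllables
fk_grade(120, 8, 180) # ~8, i.e. roughly US grade 8; a reading age is often taken as grade + 5

The koRpus package used below computes this score (and a corresponding reading age), along with many other indices, so we won't need to count syllables ourselves.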

As such, readability formulas tend to be rather crude tools for assessing readability. However, while these measures do not perfectly capture the true readability of a text, they can be especially informative for examining relative changes across large sets of texts. For example, some friends and I applied readability formulas to scientific abstracts as a hobby project, and found a very strong trend indicating that scientific writing has been growing increasingly complex. Another great example of their use is an analysis of US State of the Union addresses, showing that they have become simpler over time.

The idea here was to apply readability formulas to TED talk transcripts, to examine whether there have been any changes over time, and to see whether the complexity of the language of the talks bears any relation to their popularity.

## The Data Set

The data set is a Kaggle data set available here. The description is as follows:

These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. The TED main dataset contains information about all talks including number of views, number of comments, descriptions, speakers and titles. The TED transcripts dataset contains the transcripts for all talks available on TED.com.

## Setup

### Packages

library(tidyverse)
library(koRpus)
library(stringi)
library(viridis)

The readability package I’ll be using is called koRpus. While it has its quirks, and tends to be a little slower than some equivalent tools in Python, it is quite easy to use and offers a very comprehensive set of tools. First, we need to install the English language support. We install it using the commented code below, and then load it like a usual library.

# install.koRpus.lang("en")
library(koRpus.lang.en)

First, I read in the transcripts and the talk information, join the two, and throw out every row with any missing data.

talks <- read_csv('../RawData/ted-talks/ted_main.csv')
Parsed with column specification:
cols(
  description = col_character(),
  duration = col_double(),
  event = col_character(),
  film_date = col_double(),
  languages = col_double(),
  main_speaker = col_character(),
  name = col_character(),
  num_speaker = col_double(),
  published_date = col_double(),
  ratings = col_character(),
  related_talks = col_character(),
  speaker_occupation = col_character(),
  tags = col_character(),
  title = col_character(),
  url = col_character(),
  views = col_double()
)
transcripts <- read_csv('../RawData/ted-talks/transcripts.csv')
Parsed with column specification:
cols(
  transcript = col_character(),
  url = col_character()
)
alldat <- full_join(talks, transcripts) %>%
  filter(complete.cases(.))
Joining, by = "url"

It’s always a good idea to have a bit of a look at your data before running functions that take a long time. I sound like I’m being smart here. I was less smart at the time and had to re-do some steps here and there to my great chagrin. One discovers that there are little surprises hidden in the data, such as these gems:

alldat$transcript[1171] # Shortened this one a bit for brevity
[1] "(Music)(Applause)(Music)(Applause)(Music)(Applause)(Music)(Applause)"

str_sub(alldat$transcript[1176], start = 2742, 2816)
[1] "\"sh\" in Spanish. (Laughter) And I thought that was worth sharing.(Applause)"

So, there seem to be various audience reactions included in the transcripts wrapped in brackets. Let’s make sure to remove these, as well as save them for later in case they come in handy.

## Audience reactions

Here I first remove and save the bracketed audience actions. I then use stringi to get word counts, so that we can drop the cases where there is nothing, or not much, left once the bracketed text is removed (like the first example above).

alldat <- alldat %>%
  mutate(bracketedthings = str_extract_all(transcript, pattern = "\\(\\w+\\)"),
         transcript = str_replace_all(transcript, "\\(\\w+\\)", " "),
         word_count = map_dbl(transcript, ~stri_stats_latex(.)[["Words"]]))

Let’s examine these actions a little further.

actions <- c(unlist(alldat$bracketedthings))
actions <- gsub(pattern = '[\\(\\)]', replacement = '', x = actions)
length(unique(actions))
[1] 202

202 unique actions is a lot… But what are the main ones?

head(sort(table(actions), decreasing = T), 20)
actions
   Laughter    Applause       Music       Video       Audio      Laughs     Singing      Cheers
      10224        5429         614         354          56          49          45          40
 Beatboxing    Cheering     English   Whistling           k       Sighs      Guitar       Sings
         24          17          17          17          16          16          14          13
   Audience        Beep    Clapping      Arabic
         12          12          10           9

So, most of these actions are too rare to be useful. But the first three could well be helpful later.

alldat <- alldat %>%
  # Laughter
  mutate(nlolz = stringr::str_count(bracketedthings, '(Laughter)')) %>%
  mutate(lolzpermin = nlolz/(duration/60)) %>%
  # Applause
  mutate(napplause = stringr::str_count(bracketedthings, '(Applause)')) %>%
  mutate(applausepermin = napplause/(duration/60)) %>%
  # Has Music
  mutate(hasmusic = stringr::str_detect(bracketedthings, '(Music)'))
argument is not an atomic vector; coercing
argument is not an atomic vector; coercing
argument is not an atomic vector; coercing

## Calculating readability scores

### Tokenisation, POS tagging and calculating readability scores

The koRpus package requires that one first applies TreeTagger to a text. This software tokenises the text and classifies words by part of speech. It requires that you install TreeTagger first and set it up accordingly. So, first we set up a function for treetagging the text. Remember to check the path for your computer.

treetag_text <- function(text) {
  treetagged_text <- treetag(
    textConnection(text),
    lang="en",
    treetagger="manual",
    TT.options=list(path="C:/TreeTagger", preset="en"))
  return(treetagged_text)
}

Once this is done, one can simply run koRpus::readability() on the treetagged text to get all the readability outcomes. This all takes quite a long time though, so let’s be sure to save the results after each of the two steps.

# Tagging
alldat_read <- alldat %>%
  filter(word_count > 50) %>% # Not enough data with <50 words
  mutate(text_tt = map(transcript, ~(treetag_text(.))))

saveRDS(alldat_read, '../../static/data/20190321_ReadabilityTED/alldat_tagged.rds')

# Readability
alldat_read <- alldat_read %>%
  mutate(readability = map(text_tt, ~readability(.)))

saveRDS(alldat_read, '../../static/data/20190321_ReadabilityTED/alldat_read.rds')

On the other side of that, we can simply pull out whichever readability score we want to work with. I will use the Flesch-Kincaid age in this analysis: the formula yields both a score and an age for which that score is considered appropriate, and I’ll use the latter, as it provides a more intuitive way to understand the results than the raw scores. At the same time, the more easily interpretable scores are also more dangerous: their interpretability makes it easy to forget that readability scores are best interpreted relative to the rest of the texts, rather than as absolute measures of a text’s readability. This is because differences in topic or format (e.g. speeches, novels, scientific articles) can produce large differences in the estimated age for which the readability is considered appropriate. But we’ll just keep this in mind as we press on.
alldat <- readRDS('../DerivedData/alldat.rds')
alldat_read <- readRDS('../DerivedData/alldat_read.rds')

alldat_read <- alldat_read %>%
  mutate(FKA = map_dbl(readability, c("Flesch.Kincaid", "age")))

## Cleaning Again

When we look at the distribution of the readability scores, we can see that not everything looks so dandy.

theme_set(theme_bw())
ggplot(alldat_read, aes(x=FKA)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Hmmmm...")

So what are these super complex talks? And what do the simplest talks look like?

alldat_read <- alldat_read %>%
  arrange(FKA)

# Most Simple
head(alldat_read$transcript, 1)
[1] "Daffodil Hudson: Hello? Yeah, this is she. What? Oh, yeah, yeah, yeah, yeah, of course I accept. What are the dates again? Pen. Pen. Pen. March 17 through 21. Okay, all right, great. Thanks.Lab Partner: Who was that?DH: It was TED.LP: Who's TED?DH: I've got to prepare.[\"Give Your Talk: A Musical\"]  [\"My Talk\"]♪ Procrastination. ♪ What do you think? Can I help you? Speaker Coach 1: ♪ Let's prepare for main stage. ♪ ♪ It's your time to shine. ♪ ♪ If you want to succeed then ♪ ♪ you must be primed. ♪Speaker Coach 2: ♪ Your slides are bad ♪ ♪ but your idea is good ♪ ♪ so you can bet before we're through, ♪ ♪ speaker, we'll make a TED Talk out of you. ♪Speaker Coach 3: ♪ We know about climate change, ♪ ♪ but what can you say that's new? ♪♪ SC 1: Once you find your focus ♪ ♪ then the talk comes into view. ♪SC 2: ♪ Don't ever try to sell something ♪ ♪ from up on that stage ♪ ♪ or we won't post your talk online. ♪All: ♪ Somehow we'll make a TED Talk out of you. ♪ SC 1: Ready to practice one more time?DH: Right now?Stagehand: Break a leg.DH: ♪ I'll never remember all this. ♪ ♪ Will the clicker work when I press it? ♪ ♪ Why must Al Gore go right before me? ♪ ♪ Oh man, I'm scared to death. ♪ ♪ I hope I don't pass out onstage ♪ ♪ and now I really wish I wasn't wearing green. ♪All: ♪ Give your talk. ♪SC 1: ♪ You must be be sweet like Brené Brown. ♪All: ♪ Give your talk. ♪SC 2: ♪ You must be funny like Ken Robinson. ♪All: ♪ Give your talk. ♪SC 3: ♪ You must be cool like Reggie Watts ♪All: ♪ and bring out a prop like Jill Bolte Taylor. ♪DH: ♪ My time is running over. The clock now says nil. ♪ ♪ I'm saying my words faster. Understand me still. ♪ ♪ I'm too nervous to give this TED Talk. ♪All: ♪ Don't give up. Rehearse. You're good. ♪ ♪ We'll edit out the mistakes that you make. ♪ ♪ Give your talk. ♪DH: ♪ I will be big like Amy Cuddy. ♪All: ♪ Give your talk. ♪DH: ♪ I will inspire like Liz Gilbert. ♪All: ♪ Give your talk. ♪DH: ♪ I will engage like Hans Rosling ♪ ♪ and release mosquitos ♪ ♪ like Bill Gates. ♪SC 2: ♪ I'll make a TED Talk out of you. ♪ ♪ I'll make a TED Talk out of you. ♪ ♪ I'll make a TED Talk out of you. ♪ ♪ I'll make a TED Talk out of you. ♪ ♪ I'll make a TED Talk out of you. ♪ [\"Brought to you by TED staff and friends\"] "
# Most Complex
tail(alldat_read$transcript, 1)
[1] " I went down to St. James InfirmaryTo see my baby thereShe was lying on a long wooden tableSo cold, so still, so fairI went up to see the doctor\"She's very low,\" he saidI went back to see my babyGood God she's lying there deadI went down to old Joe's bar roomOn the corner of the squareThey were serving drinks as per usualAnd the usual crowd was thereTo my left stood Old Joe McKennedyHis eyes were bloodshot redHe turned to the crowd around himAnd these are the words he said\"Let her go, let her go, God bless herWherever she may beShe can search this whole wide world all overBut she'll never find another man like meShe can search this whole wide world all overAnd she'll never find another man like meWhen I die, please God, bury meIn my ten-dollar Stetson hatPut a twenty-dollar gold piece on my watch chainSo my friends know I died standing patGet six gamblers to carry my coffinAnd six choir girls to sing me a songStick a jazz band on my hearse wagonTo raise hell as I go alongNow that's the end of my storyLet's have another round of boozeAnd if anyone should ask youJust tell them I got the St. James Infirmary blues "

They’re songs! This is an obvious failing of readability scores: they’re made for full sentences. Songs don’t have the usual sentence structure, so the formulas see them as one long sentence. Let’s take a look.

alldat_read <- alldat_read %>%
  mutate(textstats = map(text_tt, ~koRpus::describe(.))) %>%
  mutate(sentences = map_dbl(textstats, 'sentences'),
         sentencelength = map_dbl(textstats, 'avg.sentc.length'))

ggplot(alldat_read, aes(x=sentencelength)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Average Sentence Lengths")

So let’s try to remove the bad cases to the right, choosing a limit of 40 words per sentence on average, and also remove the talks that include music, using the variable we created earlier from the audience reactions, and then see how everything looks again.

readdat <- alldat_read %>%
  select(-bracketedthings) %>%
  filter(sentencelength < 40,
         hasmusic==FALSE)

ggplot(readdat, aes(x=FKA)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Muuuuch better!")

Now that’s a normal distribution if ever I saw one! This data looks pretty ripe for digging into now!

# Exploration

In order to look at trends over time, let’s first fix up the dates. The dates are UNIX timestamps, so I’ll convert these to dates.

readdat <- readdat %>%
  mutate(published_date = as.POSIXct(published_date, origin="1970-01-01"),
         published_date = as.Date(published_date))

## Viewership

As a crude indicator of interest in each video, I’ll calculate the number of views per day elapsed since the video was published. The dataset describes videos on the TED.com website published before September 21st, 2017, with the dataset itself created on September 25th, according to one of the comments on the page.

readdat <- readdat %>%
  mutate(days = as.Date("2017-09-25") - published_date,
         days = as.numeric(days),
         viewsperday = views/days)

ggplot(readdat, aes(x=viewsperday)) +
  geom_histogram(fill="grey", colour="black")

Looks like this might be more useful with a log transformation.

readdat <- readdat %>%
  mutate(log_vpd = log10(viewsperday))

ggplot(readdat, aes(x=log_vpd)) +
  geom_histogram(fill="grey", colour="black")

That looks much better! Let’s take a look at the trend over time.

ggplot(readdat, aes(x=published_date, y=log_vpd)) +
  geom_point()

Hmmm… I suspect that the date the data was published is probably wrong. Or something else is funny here.
I’ll just use the raw views data instead, but remove the data points from the last few months.

readdat$too_recent <- readdat$published_date > as.Date("2017-04-01")

ggplot(readdat, aes(x=published_date, y=log10(views))) +
  geom_point(aes(colour=too_recent))

That looks ok. Let’s use those.

readdat <- readdat %>%
  filter(published_date < as.Date("2017-04-01"))

## Changes over Time

As a first step, let’s take a look at whether there’s any trend in readability over time.

corstrength <- cor(readdat$FKA, as.numeric(readdat$published_date))

readability_trend <- ggplot(readdat, aes(x=published_date, y=FKA)) +
  geom_point(aes(colour=log10(views))) +
  scale_colour_viridis(option = 'D', 'Log(Views)') +
  geom_smooth(method="lm") +
  labs(title='Readability over Time',
       subtitle=paste0("Linguistic complexity has increased over time: R = ", round(corstrength, 2)),
       y='Flesch-Kincaid Readability (Age)',
       x='Published Date')

readability_trend

This is actually rather stronger than I anticipated, and seems to be a pretty clear result: talks have become more complex over time. Let’s also take a look at how readability relates to views.

readability_popularity <- ggplot(readdat, aes(x=FKA, y=log10(views))) +
  geom_point() +
  labs(title='Readability Score and Views',
       subtitle="No strong relation, but notice the gap top right",
       x='Flesch-Kincaid Readability (Age)',
       y='Log(Views)')

readability_popularity

We can see here that there are many talks which are highly readable (left) and have many views, but very few which are complex and also popular. Let’s take a closer look at this. For this, I will divide the readability scores into deciles and compare the distributions of the views.

readdat <- readdat %>%
  mutate(read_percentile = cut(FKA, 10),
         read_percentile_num = as.numeric(read_percentile)*10,
         read_percentile_num_mid = read_percentile_num-5) %>%
  group_by(read_percentile_num) %>%
  mutate(meanRead = mean(FKA)) %>%
  ungroup()

readability_quantile <- ggplot(readdat, aes(x=read_percentile_num_mid, y=log10(views),
                                            fill=meanRead, group=read_percentile_num)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nReadability \nAge', direction = -1) +
  labs(title='Readability Percentile and Views',
       subtitle="Especially simple talks are most popular, \nand especially complex talks have uniformly few views",
       x='Readability Percentile (Simplest to Hardest)',
       y='Views (Log10 Scale)')

readability_quantile

Next, let’s take a look at how funny talks were relative to their complexity. We saved the laughs per minute earlier, so we can use that data. Let’s first filter to the talks for which laughter was recorded, though.

lolpermin <- readdat %>%
  filter(lolzpermin > 0) %>%
  ggplot(aes(x=FKA, y=log10(lolzpermin), colour=log10(views))) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(title='Readability and Laughs',
       subtitle='Simpler talks get more laughs',
       x='Flesch-Kincaid Readability (Age)',
       y='Lolz/minute (Log10)') +
  scale_colour_viridis(option = 'D', 'Log(Views)')

lolpermin

And engagement. Let’s make a crude marker of engagement by taking the log of the number of ratings per view. First, we need to extract the number of ratings, and then we can calculate engagement.

readdat$ratings[[1]]
[1] "[{'id': 23, 'name': 'Jaw-dropping', 'count': 402}, {'id': 7, 'name': 'Funny', 'count': 1637}, {'id': 1, 'name': 'Beautiful', 'count': 59}, {'id': 22, 'name': 'Fascinating', 'count': 267}, {'id': 9, 'name': 'Ingenious', 'count': 116}, {'id': 21, 'name': 'Unconvincing', 'count': 15}, {'id': 10, 'name': 'Inspiring', 'count': 57}, {'id': 25, 'name': 'OK', 'count': 126}, {'id': 3, 'name': 'Courageous', 'count': 72}, {'id': 24, 'name': 'Persuasive', 'count': 14}, {'id': 26, 'name': 'Obnoxious', 'count': 52}, {'id': 11, 'name': 'Longwinded', 'count': 56}, {'id': 8, 'name': 'Informative', 'count': 9}, {'id': 2, 'name': 'Confusing', 'count': 6}]"
get_ratingcount <- function(ratingtext) {
  # Pull out every 'count' value from the ratings string and sum them
  ratingcount <- stringr::str_match_all(ratingtext, "'count': (\\d*)")[[1]][,2]
  sum(as.numeric(ratingcount))
}
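
For instance, applying it to the rating string shown above should just sum all of its 'count' fields:

get_ratingcount(readdat$ratings[[1]])
# For the first talk this adds up the fourteen counts (402 + 1637 + ... + 6), giving 2888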

readdat <- readdat %>%
  mutate(ratingcount = map_dbl(ratings, ~get_ratingcount(.x)),
         engagement = log10(ratingcount/views))

Right, now let’s see how it looks.

engagement <- readdat %>%
  ggplot(aes(x=FKA, y=engagement, colour=log10(views))) +
  geom_point() +
  geom_smooth(method="lm") +
  scale_colour_viridis('Log Views')

engagement