Using Readability Formulas to Assess the Linguistic Complexity of TED Talks with koRpus

a.k.a. Readability is related to everything

Posted by Granville Matheson on Thursday, March 21, 2019


Background

This analysis came about because the dataset was the topic of the Stockholm R User Group (SRUG) Meetup Group Hackathon 2017 (and took me a while to get back to). The task was simply to do something interesting using the dataset.

Text data can be analysed using a variety of different methods: topic modelling, network analysis, sentiment analysis, word frequencies etc. One less commonly applied approach is that of readability, or in other words, linguistic complexity. The thought was that this might reveal an interesting dimension of the data that might be missed by other approaches. It also made for a nice case for demonstrating how readability scores can be applied.

Readability Formulas

Readability formulas were developed as early as the first half of the twentieth century, and therefore used to be calculated by hand. ‘True’ readability depends on all sorts of factors: the complexity of the ideas expressed, the logical coherence of the text, the words used, and so on. What readability formulas measure is usually primarily a function of the most easily quantifiable aspects of readability: words per sentence, syllables per word, etc. These quantities are then combined in a formula which weights the different components appropriately, to arrive at a readability score. There exist many different readability scores, which differ primarily in how much weight they give to one or another concept (e.g. sentence length vs word length), or in how the different components of complexity are assessed (e.g. word length vs presence on an easy-word list).
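
As a concrete example, the Flesch-Kincaid grade level combines exactly these quantities. Here is a minimal sketch of the calculation using the standard published coefficients (koRpus handles the fiddly parts, like proper syllable counting, for us later on):

# Flesch-Kincaid grade level from raw counts (standard coefficients)
flesch_kincaid_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# e.g. a 100-word text with 8 sentences and 140 syllables
flesch_kincaid_grade(100, 8, 140)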

As such, readability formulas tend to be rather crude tools for assessing readability. However, while they do not perfectly capture the true readability of a text, they can be especially informative for examining relative changes across large sets of texts. For example, some friends and I applied readability formulas to scientific abstracts as a hobby project, and found a very strong trend indicating that scientific writing has been growing increasingly complex. Another nice example of their application is an analysis of US State of the Union addresses, showing them becoming simpler over time.

The idea here was to apply readability formulas to TED talk transcripts, to examine whether there have been any changes over time, and to see whether the linguistic complexity of the talks bears any relation to their popularity.

The Data Set

The data set is a Kaggle data set available here. The description is as follows:

These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. The TED main dataset contains information about all talks including number of views, number of comments, descriptions, speakers and titles. The TED transcripts dataset contains the transcripts for all talks available on TED.com.

Setup

Packages

library(tidyverse)
library(koRpus)
library(stringi)
library(viridis)

The readability package I’ll be using is called koRpus. While it has its quirks, and tends to be a little slower than some equivalent tools in Python, it is quite easy to use and provides a very comprehensive set of tools. First, we need to install the English language package using the commented code below, and then load it like a usual library.

# install.koRpus.lang("en")
library(koRpus.lang.en)

Reading in the data

First, I read in the transcripts and the talk information, join the two, and throw out any talks with missing data.

talks <- read_csv('../../static/data/20190321_ReadabilityTED/ted_main.csv')
## Parsed with column specification:
## cols(
##   comments = col_double(),
##   description = col_character(),
##   duration = col_double(),
##   event = col_character(),
##   film_date = col_double(),
##   languages = col_double(),
##   main_speaker = col_character(),
##   name = col_character(),
##   num_speaker = col_double(),
##   published_date = col_double(),
##   ratings = col_character(),
##   related_talks = col_character(),
##   speaker_occupation = col_character(),
##   tags = col_character(),
##   title = col_character(),
##   url = col_character(),
##   views = col_double()
## )
transcripts <- read_csv('../../static/data/20190321_ReadabilityTED/transcripts.csv')
## Parsed with column specification:
## cols(
##   transcript = col_character(),
##   url = col_character()
## )
alldat <- full_join(talks, transcripts) %>%
  filter(complete.cases(.))
## Joining, by = "url"

It’s always a good idea to have a bit of a look at your data before running functions that take a long time. I sound like I’m being smart here; I was less smart at the time, and had to re-do some steps here and there, to my great chagrin. One discovers little surprises hidden in the data, such as these gems:

alldat$transcript[1171]
## [1] "(Music)(Applause)(Music)(Applause)(Music)(Applause)(Music)(Applause)"
# Shortened this one a bit for brevity
str_sub(alldat$transcript[1176], start = 2742, 2816)
## [1] "\"sh\" in Spanish. (Laughter) And I thought that was worth sharing.(Applause)"

So, there seem to be various audience reactions included in the transcripts, wrapped in brackets. Let’s remove these, but also save them for later in case they come in handy.

Audience reactions

Here I first removed and saved the bracketed audience reactions. I then used stringi to get word counts, so that we can drop the cases for which there’s nothing, or not much, left after removing the bracketed parts, like the first example above.

alldat <- alldat %>%
  mutate(bracketedthings = str_extract_all(transcript, pattern = "\\(\\w+\\)"),
         transcript = str_replace_all(transcript, "\\(\\w+\\)", " "),
         word_count = map_dbl(transcript, ~stri_stats_latex(.)[["Words"]]))
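
A quick check that the first problematic transcript from above now has essentially nothing left after stripping the bracketed parts:

# The all-(Music)(Applause) transcript from earlier should now have a word count of (almost) zero
alldat$word_count[1171]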

Let’s examine these actions a little further.

actions <- c(unlist(alldat$bracketedthings))
actions <- gsub(pattern = '[\\(\\)]', replacement = '', x = actions)
length(unique(actions))
## [1] 202

202 unique actions is a lot… But what are the main ones?

head( sort(table(actions), decreasing = T), 20)
## actions
##   Laughter   Applause      Music      Video      Audio     Laughs 
##      10224       5429        614        354         56         49 
##    Singing     Cheers Beatboxing   Cheering    English  Whistling 
##         45         40         24         17         17         17 
##          k      Sighs     Guitar      Sings   Audience       Beep 
##         16         16         14         13         12         12 
##   Clapping     Arabic 
##         10          9

So, most of these actions are too rare to be useful. But the first three could well be helpful for later.

alldat <- alldat %>%
  # Laughter
  mutate(nlolz = stringr::str_count(bracketedthings, '(Laughter)')) %>%
  mutate(lolzpermin = nlolz/(duration/60)) %>%
  
  # Applause
  mutate(napplause = stringr::str_count(bracketedthings, '(Applause)')) %>%
  mutate(applausepermin = napplause/(duration/60)) %>%
  
  # Has Music
  mutate(hasmusic = stringr::str_detect(bracketedthings, '(Music)'))
## Warning in stri_count_regex(string, pattern, opts_regex = opts(pattern)):
## argument is not an atomic vector; coercing

## Warning in stri_count_regex(string, pattern, opts_regex = opts(pattern)):
## argument is not an atomic vector; coercing
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

Calculating readability scores

The koRpus package requires that you install TreeTagger first, and set it up accordingly (in its README). This software tokenises the text, as well as classifies words by parts of speech.

So, first we set up a function for treetagging the text. Remember to check the path for your computer.

treetag_text <- function(text) {
  treetagged_text <- treetag( textConnection(text) , 
                               lang="en", treetagger="manual", 
                               TT.options=list(path="C:/TreeTagger", 
                                               preset="en"))
  return(treetagged_text)
}
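
Before running anything big, it can help to tag a single transcript and peek at what TreeTagger produces (a quick sketch, assuming TreeTagger is installed at the path above; taggedText() is koRpus’s accessor for the token table):

# Tag one transcript to check the setup and inspect the token / POS tag / lemma table
test_tagged <- treetag_text(alldat$transcript[2])
head(taggedText(test_tagged))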

Once this is done, one can just run koRpus::readability() on the treetagged text to get all of the readability outcomes.

This all takes quite a long time though, so I’ll save the results after each of the two steps.
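
Before kicking off the full run over all of the transcripts, a quick sanity check on the single tagged example from above:

# Sanity check: all readability measures for one tagged transcript
readability(test_tagged)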

# Tagging
alldat_read <- alldat %>%
  filter(word_count > 50) %>%   # Not enough data with <50 words
  mutate(text_tt = map(transcript, ~(treetag_text(.))))

saveRDS(alldat_read, '../../static/data/20190321_ReadabilityTED/alldat_tagged.rds')


# Readability (also keep the descriptive stats here, since text_tt is dropped afterwards)
alldat_read <- alldat_read %>% 
  mutate(readability = map(text_tt, ~readability(.)),
         textstats = map(text_tt, ~koRpus::describe(.))) %>% 
  select(-text_tt)

saveRDS(alldat_read, '../../static/data/20190321_ReadabilityTED/alldat_read.rds')

On the other side of that, we can simply pull out whichever readability score we want to work with. I will use the Flesch-Kincaid age in this analysis: the measure provides both a raw score and an age for which text scoring at that level is considered appropriate, and I’ll use the latter since it is more intuitive to interpret. At the same time, more easily interpretable scores are also more dangerous: their intuitiveness makes it easy to forget that readability scores are best interpreted relative to the rest of the texts, rather than as absolute measures of a text’s readability. This is because differences in topic or format (e.g. speeches, novels, scientific articles) can shift the estimated age for which a text is considered appropriate. But we’ll just keep this in mind as we press on.

alldat_read <- readRDS('../../static/data/20190321_ReadabilityTED/alldat_read.rds')

alldat_read <- alldat_read %>% 
  mutate(FKA = map_dbl(readability, c("Flesch.Kincaid", "age")))

Cleaning Again

When we look at the distribution of the readability scores, we can see that not everything looks so dandy.

theme_set(theme_bw())

ggplot(alldat_read, aes(x=FKA)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Hmmmm...")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So what are these super-complex talks? And what do the simplest talks look like?

alldat_read <- alldat_read %>% 
  arrange(FKA)

# Most Simple
head(alldat_read$transcript, 1)
## [1] "Daffodil Hudson: Hello? Yeah, this is she. What? Oh, yeah, yeah, yeah, yeah, of course I accept. What are the dates again? Pen. Pen. Pen. March 17 through 21. Okay, all right, great. Thanks.Lab Partner: Who was that?DH: It was TED.LP: Who's TED?DH: I've got to prepare.[\"Give Your Talk: A Musical\"]  [\"My Talk\"]<U+266A> Procrastination. <U+266A> What do you think? Can I help you? Speaker Coach 1: <U+266A> Let's prepare for main stage. <U+266A> <U+266A> It's your time to shine. <U+266A> <U+266A> If you want to succeed then <U+266A> <U+266A> you must be primed. <U+266A>Speaker Coach 2: <U+266A> Your slides are bad <U+266A> <U+266A> but your idea is good <U+266A> <U+266A> so you can bet before we're through, <U+266A> <U+266A> speaker, we'll make a TED Talk out of you. <U+266A>Speaker Coach 3: <U+266A> We know about climate change, <U+266A> <U+266A> but what can you say that's new? <U+266A><U+266A> SC 1: Once you find your focus <U+266A> <U+266A> then the talk comes into view. <U+266A>SC 2: <U+266A> Don't ever try to sell something <U+266A> <U+266A> from up on that stage <U+266A> <U+266A> or we won't post your talk online. <U+266A>All: <U+266A> Somehow we'll make a TED Talk out of you. <U+266A> SC 1: Ready to practice one more time?DH: Right now?Stagehand: Break a leg.DH: <U+266A> I'll never remember all this. <U+266A> <U+266A> Will the clicker work when I press it? <U+266A> <U+266A> Why must Al Gore go right before me? <U+266A> <U+266A> Oh man, I'm scared to death. <U+266A> <U+266A> I hope I don't pass out onstage <U+266A> <U+266A> and now I really wish I wasn't wearing green. <U+266A>All: <U+266A> Give your talk. <U+266A>SC 1: <U+266A> You must be be sweet like Brené Brown. <U+266A>All: <U+266A> Give your talk. <U+266A>SC 2: <U+266A> You must be funny like Ken Robinson. <U+266A>All: <U+266A> Give your talk. <U+266A>SC 3: <U+266A> You must be cool like Reggie Watts <U+266A>All: <U+266A> and bring out a prop like Jill Bolte Taylor. <U+266A>DH: <U+266A> My time is running over. The clock now says nil. <U+266A> <U+266A> I'm saying my words faster. Understand me still. <U+266A> <U+266A> I'm too nervous to give this TED Talk. <U+266A>All: <U+266A> Don't give up. Rehearse. You're good. <U+266A> <U+266A> We'll edit out the mistakes that you make. <U+266A> <U+266A> Give your talk. <U+266A>DH: <U+266A> I will be big like Amy Cuddy. <U+266A>All: <U+266A> Give your talk. <U+266A>DH: <U+266A> I will inspire like Liz Gilbert. <U+266A>All: <U+266A> Give your talk. <U+266A>DH: <U+266A> I will engage like Hans Rosling <U+266A> <U+266A> and release mosquitos <U+266A> <U+266A> like Bill Gates. <U+266A>SC 2: <U+266A> I'll make a TED Talk out of you. <U+266A> <U+266A> I'll make a TED Talk out of you. <U+266A> <U+266A> I'll make a TED Talk out of you. <U+266A> <U+266A> I'll make a TED Talk out of you. <U+266A> <U+266A> I'll make a TED Talk out of you. <U+266A> [\"Brought to you by TED staff and friends\"] "
# Most Complex
tail(alldat_read$transcript, 1)
## [1] " I went down to St. James InfirmaryTo see my baby thereShe was lying on a long wooden tableSo cold, so still, so fairI went up to see the doctor\"She's very low,\" he saidI went back to see my babyGood God she's lying there deadI went down to old Joe's bar roomOn the corner of the squareThey were serving drinks as per usualAnd the usual crowd was thereTo my left stood Old Joe McKennedyHis eyes were bloodshot redHe turned to the crowd around himAnd these are the words he said\"Let her go, let her go, God bless herWherever she may beShe can search this whole wide world all overBut she'll never find another man like meShe can search this whole wide world all overAnd she'll never find another man like meWhen I die, please God, bury meIn my ten-dollar Stetson hatPut a twenty-dollar gold piece on my watch chainSo my friends know I died standing patGet six gamblers to carry my coffinAnd six choir girls to sing me a songStick a jazz band on my hearse wagonTo raise hell as I go alongNow that's the end of my storyLet's have another round of boozeAnd if anyone should ask youJust tell them I got the St. James Infirmary blues "

They’re both songs! This is something that will trip up readability scores: the formulas are made for full sentences. Songs don’t have the usual sentence structure, and, at least in the second case, the whole text is treated as one long sentence. Let’s take a look.

alldat_read <- alldat_read %>%
  mutate(sentences = map_dbl(textstats, 'sentences'),
         sentencelength = map_dbl(textstats, 'avg.sentc.length'))
 
ggplot(alldat_read, aes(x=sentencelength)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Average Sentence Lengths")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

So let’s try to remove the bad cases on the right, choosing a limit of 40 words per sentence on average, as well as removing the talks that include music (using the variable we created earlier from the audience reactions), and see how everything looks again.

readdat <- alldat_read %>%
  select(-bracketedthings) %>%
  filter(sentencelength < 40,
         hasmusic==FALSE)

ggplot(readdat, aes(x=FKA)) +
  geom_histogram(fill="grey", colour="black") +
  labs(title="Muuuuch better!")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now that’s a normal distribution if ever I saw one! This data looks pretty ripe for digging into now!

Exploration

In order to look at trends over time, let’s first fix up the dates. The dates are UNIX timestamps, so I’ll convert them to normal dates.

readdat <- readdat %>% 
  mutate(published_date = as.POSIXct(published_date, origin="1970-01-01"),
         published_date = as.Date(published_date))

Views per day

As a crude indicator of interest in each video, I’ll calculate the number of views per day elapsed since the video was published, so as not to be biased by how long the video has been available to be viewed. The dataset describes videos published on the TED.com website before September 21st 2017, and was created on September 25th, according to one of the comments on the page.

readdat <- readdat %>% 
  mutate(days = as.Date("2017-09-25") - published_date,
         days = as.numeric(days),
         viewsperday = views/days)

ggplot(readdat, aes(x=viewsperday)) +
  geom_histogram(fill="grey", colour="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looks like this might be more useful with a log transformation.

readdat <- readdat %>% 
  mutate(log_vpd = log10(viewsperday))

ggplot(readdat, aes(x=log_vpd)) +
  geom_histogram(fill="grey", colour="black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

That looks much better! Let’s take a look at the trend over time.

ggplot(readdat, aes(x=published_date, y=log_vpd)) +
  geom_point() +
  labs(y="Log views per day")

[Image: My reaction]

I suspect that the date on which the data were frozen is probably wrong, or something else is funny here. Maybe we can just use the raw views instead, but then I would remove the datapoints from the last few months, which haven’t been around long enough to go viral.
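
One quick diagnostic is simply to check how recent the newest talks are, and how many days of viewing time they have had (just a sketch of the kind of check I mean):

# How recent is the newest talk, and how many days have the talks had to accumulate views?
max(readdat$published_date)
summary(readdat$days)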

readdat$too_recent <- readdat$published_date > as.Date("2017-04-01")

ggplot(readdat, aes(x=published_date, y=log10(views))) +
  geom_point(aes(colour=too_recent))

That looks ok. Views also appear to be affected by published date to a surprisingly small extent. So I think it makes sense to just use the raw view counts, and to cut out the most recent talks to avoid their bias.

readdat <- readdat %>% 
  filter(published_date < as.Date("2017-04-01"))

Changes over Time

As a first step, let’s take a look at whether there is any trend in readability over time.

corstrength <- cor(readdat$FKA, as.numeric(readdat$published_date))

readability_trend <- ggplot(readdat, aes(x=published_date, y=FKA)) +
  geom_point(aes(colour=log10(views))) +
  scale_colour_viridis(option = 'D', 'Log(Views)') +
  geom_smooth(colour="red") +
  geom_smooth(method="lm") +
  labs(title='Readability over Time',
       subtitle=paste0("Linguistic complexity has increased over time: R = ",round(corstrength, 2)),
       y='Flesch-Kincaid Readability (Age)', x='Published Date')

readability_trend
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This is actually rather stronger than I anticipated, and seems to be a pretty clear indication of talks becoming more complex over time. The linear trend goes in the same direction as the one we saw in the scientific literature. Interestingly, from the smooth fit to the data, there appears to have been a peak in complexity around the beginning of 2014 which has perhaps tapered off since, but I wouldn’t even begin to speculate about what might have caused that, so it could well be nothing.

Viewership

Let’s also take a look then at how readability relates to views.

readability_popularity <- ggplot(readdat, aes(x=FKA, y=log10(views))) +
  geom_point() +
  labs(title='Readability Score and Views',
       subtitle="No strong relation, but notice the gap top right",
       x='Flesch-Kincaid Readability (Age)',
       y='Log(Views)')

readability_popularity

We can see here that there are many talks which are highly readable (left) and have many views, but very few which are complex and also popular. Let’s take a closer look at this.

For this, I will divide the readability scores into ten equal-width bins, and compare the distributions of the views across them.

readdat <- readdat %>%
  mutate(read_percentile = cut(FKA, 10),
         read_percentile_num = as.numeric(read_percentile)*10,
         read_percentile_num_mid = read_percentile_num-5) %>%
  group_by(read_percentile_num) %>%
  mutate(meanRead = mean(FKA)) %>%
  ungroup()
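
As an aside: cut() gives ten equal-width bins rather than true deciles. If one wanted actual deciles (equal numbers of talks per bin), dplyr::ntile() would be the drop-in alternative (a sketch only; the plots below use the equal-width bins):

# Alternative binning: true deciles, with equal numbers of talks in each bin
readdat_deciles <- readdat %>%
  mutate(read_decile = ntile(FKA, 10))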

readability_quantile <- ggplot(readdat, aes(x=read_percentile_num_mid,
                                            y=log10(views), fill=meanRead,
                                            group=read_percentile_num)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nReadability \nAge', direction = -1) +
  labs(title='Readability Percentile and Views',
       subtitle="Especially simple talks are most popular, \nand especially complex talks have uniformly few views",
       x='Readability Percentile (Simplest to Hardest)',
       y='Views (Log10)')

readability_quantile

Topics

It would have been nice to separate the data into different topics or sections, but unfortunately the data aren’t quite so clear-cut. What we do have is a set of tags. Let’s take a little look at that data and see which topics are the most complex and which are the simplest.

readdat$tags[[1]]
## [1] "['collaboration', 'entertainment', 'humor', 'physics']"
topicdat <- readdat %>%
  mutate(tags = str_match_all(tags, pattern = "\\'(\\w+)\\'"),
         tags = map(tags, ~.x[,2]))

Let’s examine these tags a little further.

tags <- c(unlist(topicdat$tags))
length(unique(tags))
## [1] 328

That looks like too many - this data will be too sparse. But let’s see what the most common are.

head( sort(table(tags), decreasing = T), 30)
## tags
##    technology       science       culture          TEDx        design 
##           613           501           433           358           350 
##      business        health    innovation entertainment       society 
##           311           203           192           189           166 
##        future       biology           art communication     economics 
##           164           161           160           158           148 
##         brain      medicine   environment collaboration    creativity 
##           144           143           140           138           135 
##      humanity      activism     education     invention     community 
##           134           132           132           125           121 
##       history      children    psychology      politics         women 
##           120           112           110           109           108

These all have reasonable numbers of talks. Let’s compare them. Keep in mind that the same talks might belong to multiple categories, so there will be overlap.

topics <- names(head( sort(table(tags), decreasing = T), 30))

topic_read <- topicdat %>% 
  select(FKA, tags, views)

selectbytopic <- function(topic, data) {
  filter(data, map_lgl(tags, ~topic %in% .x))
}

topic_readgroups <- map(topics, ~selectbytopic(.x, topic_read))
names(topic_readgroups) <- topics

topic_readgroups <- bind_rows(topic_readgroups, .id="Tag") %>% 
  select(-tags) %>% 
  group_by(Tag) %>% 
  mutate(meanRead = mean(FKA),
         meanViews = mean(views)) %>% 
  ungroup()

So let’s take a look at how the different topics compare. Let’s first look at views.

views_topics <- topic_readgroups %>% 
  arrange(-meanViews) %>% 
  mutate(Tag = fct_inorder(Tag)) %>% 
  ggplot(aes(x=Tag, y=log10(views), fill=meanRead, group=Tag)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nReadability \nAge', direction = -1) +
  labs(title='Views by Tags',
       subtitle="Psychology and brain talks get more views, but the tags don't differ greatly otherwise",
       x='Tag',
       y='Views (Log10)') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

views_topics

Interesting to see such a strong preference for psychology and brain talks. We can also already see that the entertainment tag appears to be associated with more readable transcripts. But let’s look at the readability distributions directly.

readability_topics <- topic_readgroups %>% 
  arrange(-meanRead) %>% 
  mutate(Tag = fct_inorder(Tag)) %>% 
  ggplot(aes(x=Tag, y=FKA, fill=log10(meanViews), group=Tag)) +
  geom_boxplot() +
  scale_fill_viridis('Mean \nViews \n(Log10)') +
  labs(title='Readability by Tags',
       subtitle="Politics are complex, entertainment is simple, but the others don't differ much",
       x='Tag',
       y='Flesch-Kincaid Readability Age') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

readability_topics

That politics and economics should be most complex, and that education, children and entertainment should be simplest, makes intuitive sense. Despite being quite crude instruments, the readability formulas do seem to do a pretty good job of capturing real differences in complexity.

It’s also interesting to note from the plots above that topic doesn’t appear to completely drive the relationship between readability and popularity: brain talks are very popular but reasonably complicated, while entertainment talks are very simple but not massively popular.

Audience Laughter

Next, let’s take a look at how funny the talks were relative to their complexity. We saved the laughs per minute earlier, so we can use that data. Let’s first filter down to the talks in which laughter was recorded, though.

lolpermin <- readdat %>%
  filter(lolzpermin > 0) %>%
  ggplot(aes(x=FKA, y=log10(lolzpermin), colour=log10(views))) +
    geom_point() +
    geom_smooth(method="lm") +
    labs(title='Readability and Laughs',
         subtitle='Simpler talks get more laughs',
         x='Flesch-Kincaid Readability (Age)',
         y='Lolz/minute (Log10)') +
  scale_colour_viridis(option = 'D', 'Log(Views)')

lolpermin

Engagement

And engagement. Let’s make a crude marker of engagement by taking the log of the number of ratings per view. First, we need to extract the number of ratings, and then we can calculate engagement.

readdat$ratings[[1]]
## [1] "[{'id': 23, 'name': 'Jaw-dropping', 'count': 402}, {'id': 7, 'name': 'Funny', 'count': 1637}, {'id': 1, 'name': 'Beautiful', 'count': 59}, {'id': 22, 'name': 'Fascinating', 'count': 267}, {'id': 9, 'name': 'Ingenious', 'count': 116}, {'id': 21, 'name': 'Unconvincing', 'count': 15}, {'id': 10, 'name': 'Inspiring', 'count': 57}, {'id': 25, 'name': 'OK', 'count': 126}, {'id': 3, 'name': 'Courageous', 'count': 72}, {'id': 24, 'name': 'Persuasive', 'count': 14}, {'id': 26, 'name': 'Obnoxious', 'count': 52}, {'id': 11, 'name': 'Longwinded', 'count': 56}, {'id': 8, 'name': 'Informative', 'count': 9}, {'id': 2, 'name': 'Confusing', 'count': 6}]"
get_ratingcount <- function(ratingtext) {
  ratingcount <- stringr::str_match_all(ratingtext, "'count': (\\d*)")[[1]][,2]
  sum(as.numeric(ratingcount))
}

readdat <- readdat %>% 
  mutate(ratingcount = map_dbl(ratings, ~get_ratingcount(.x)),
    engagement = log10(ratingcount/views))

Right, now let’s see how it looks.

engagement <- readdat %>%
  ggplot(aes(x=FKA, y=engagement, colour=log10(views))) +
    geom_point() +
    geom_smooth(method="lm") +
  scale_colour_viridis('Log Views') +
    labs(title='Readability and Engagement',
         subtitle='Simpler talks get more engagement',
         x='Flesch-Kincaid Readability (Age)',
         y='Engagement (Log(Ratings / Views))')

engagement

Conclusions

In this analysis, and much to my surprise I must admit, we found that linguistic complexity, measured using readability formulas, appears to be related to pretty much everything we looked at. It seems to have changed over time, with talks growing more complex. We showed that talks with high complexity had uniformly low viewership, and that simpler talks are not only more viewed, but also get more laughs and more engagement. We further showed that politics and economics talks tend to be more complicated, while talks about education, children and entertainment tend to be simpler.

And, lastly, if you should ever be preparing to do a TED talk: keep it simple!

