Chapter 2 Set-up and Data preparation

2.1 Load libraries

library("readxl")
library("lme4")
library("emmeans")
library("broom.mixed")
library("kableExtra")
library("tidyverse")
library("optimx") #optimx package for modelling

2.2 Prepare GECO Material

The GECO materials were downloaded from Cop et al. (2017): https://expsy.ugent.be/downloads/geco/

EnglishMaterial_raw <- read_excel("EnglishMaterial.xlsx")
DutchMaterial_raw <- read_excel("DutchMaterials.xlsx")

I created the material tibbles (EnglishMaterial & DutchMaterial) by selecting only the necessary columns (Language, WORD_ID, WORD, PART_OF_SPEECH, CONTENT_WORD, WORD_LENGTH) and keeping only content words.

EnglishMaterial <- EnglishMaterial_raw %>%
  mutate(Language = "English") %>%
  select(Language, WORD_ID, WORD, PART_OF_SPEECH, CONTENT_WORD, WORD_LENGTH) %>%
  filter(CONTENT_WORD == "1") %>% 
  unique()

DutchMaterial <- DutchMaterial_raw %>%
  mutate(Language = "Dutch",
         WORD_ID = IA_ID) %>%
  select(Language, WORD_ID, WORD, PART_OF_SPEECH, CONTENT_WORD, WORD_LENGTH) %>%
  filter(CONTENT_WORD == "1") %>%
  unique()

2.3 Prepare Valence dataset

Valence rating datasets are downloaded from http://crr.ugent.be/programs-data/word-ratings. Because the two valence datasets use different rating scales, I transformed the valence ratings (V_Mean) to V_Mean_Percent, which ranges from 0 to 1: values close to 0 are negative and values close to 1 are positive.
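As a quick illustration of this min-max style rescaling (a sketch; the same formula is applied directly in the code below, and rescale_valence is a helper defined only for this illustration):

rescale_valence <- function(v, min_rating, max_rating) {
  (v - min_rating) / (max_rating - min_rating)
}
rescale_valence(5, 1, 9) # 0.5: the neutral midpoint of the 9-point English scale
rescale_valence(4, 1, 7) # 0.5: the neutral midpoint of the 7-point Dutch scale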

English
The original rating scale in Warriner et al. (2013) runs from 1 (happy) to 9 (unhappy). In the published dataset the valence ratings were already reversed post hoc to the more intuitive low-to-high scale of 1 (unhappy) to 9 (happy), with 5 = neutral, so there is no need to reverse them again.

EnglishValence <- read.csv("Ratings_Warriner_et_al.csv")

In this dataset, Percent_known was not collected. The number of contributions per word is stored as V.Rat.Sum in the original dataset.

EnglishValence <- EnglishValence %>%
  select(Word, V.Mean.Sum, V.SD.Sum) %>%
    rename(WORD = Word,
         V_Mean = V.Mean.Sum,
         V_SD = V.SD.Sum) %>%
  mutate(V_Mean_Percent = (V_Mean-1)/(9-1))

Dutch
The original rating scale in Moors et al. (2013) runs from 1 (unhappy) to 7 (happy), with 4 = neutral.

DutchValence <- read_excel("WordNorms Moors et al.xlsx", skip = 1)


DutchValence <- DutchValence %>%
  select(Words, Translation, 'M V...3', 'SD V...4', 'N (%)') %>%
  rename(WORD = Words,
         V_Mean = 'M V...3',
         V_SD = 'SD V...4',
         UnknownRatio = 'N (%)') %>%
  mutate(V_Percent_known = (100 - UnknownRatio)/100,
         V_Mean_Percent = (V_Mean-1)/(7-1))

For the Dutch words, we planned to remove any word scoring more than 30 on UnknownRatio (i.e., known by fewer than 70% of participants). The maximum UnknownRatio is 26.9% (for zweem), so all words are kept in the analysis.
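This maximum can be verified with a quick check (a sketch using dplyr's slice_max(); not part of the original pipeline):

#Sketch: the word with the highest UnknownRatio (expected: zweem, 26.9)
DutchValence %>%
  slice_max(UnknownRatio, n = 1) %>%
  select(WORD, Translation, UnknownRatio)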

2.4 Prepare concreteness rating

Two published concreteness ratings (Brysbaert, Stevens, et al., 2014 for Dutch; Brysbaert, Warriner, et al., 2014 for English) are included in the model as a control variable.

English

EnglishConcreteness <- read_excel("Concreteness_ratings_Brysbaert_et_al_BRM.xlsx") 

EnglishConcreteness <- EnglishConcreteness %>%
  select(Word, Bigram, Conc.M, Conc.SD, Percent_known) %>%
  rename(WORD = Word,
         Conc_Mean = Conc.M,
         Conc_SD = Conc.SD,
         C_Percent_known = Percent_known) %>%
  filter(Bigram == "0") %>% #filter out two-word expressions (2896 out of 39954 words)
  select(-Bigram)

Dutch
read_excel() returned a warning for this file, so I converted the Excel file to a CSV file and used read_csv() instead.

DutchConcreteness <- read_csv("Concreteness ratings Brysbaert et al.csv")

DutchConcreteness <- DutchConcreteness %>%
  select(stimulus, Concrete_m, Concrete_sd, `Number_of_ratings`, Number_of_subjects) %>%
  mutate(C_Percent_known = `Number_of_ratings`/ Number_of_subjects) %>%
  select(stimulus, Concrete_m, Concrete_sd, C_Percent_known) %>%
  rename(WORD = stimulus,
         Conc_Mean = Concrete_m,
         Conc_SD = Concrete_sd)

In the original Dutch dataset, I found 17 duplicated words with different concreteness ratings, bringing the total number of rows to 30,070.

DutchConcreteness_dup <- DutchConcreteness %>%
  subset(duplicated(WORD)) %>%
  select(WORD) %>%
  unique() %>%
  mutate(Duplicate = "1") #1 means Yes

DutchConcreteness_dup2 <- left_join(DutchConcreteness, DutchConcreteness_dup, "WORD") %>%
  mutate(Duplicate = replace_na(Duplicate, "0")) %>%
  filter(Duplicate == "1")

Now DutchConcreteness_dup2 contains the 73 rows belonging to these duplicated words. Here are the steps I followed: 1) calculate the mean Conc_Mean for each duplicated word, 2) make a list of the 17 words with their mean Conc_Mean, and 3) substitute these rows into the final DutchConcreteness dataset.

#1) calculate Mean of Conc_Mean for each duplicate word
DutchConcreteness_dup2 <- DutchConcreteness_dup2 %>%
  group_by(WORD) %>%
  mutate(Conc_Mean_M = mean(Conc_Mean)) %>%
  ungroup()
  
#2) make a list of 17 words
DutchConcreteness_dup2 <- DutchConcreteness_dup2 %>%
  select(WORD, Conc_Mean_M) %>%
  unique() %>%
  rename(Conc_Mean = Conc_Mean_M) %>%
  # add back Conc_SD and C_Percent_known columns to match up with the DutchConcreteness dataset
  mutate(Conc_SD = 1,          # placeholder so the columns match DutchConcreteness
         C_Percent_known = 1,  # placeholder so the columns match DutchConcreteness
         Duplicate = "0")      # placeholder flag; the Duplicate column is dropped after bind_rows() below

#3) Remove the duplicated words from DutchConcreteness (the merged rows are added back below)
DutchConcreteness_F <- left_join(DutchConcreteness, DutchConcreteness_dup, "WORD") %>%
  mutate(Duplicate = replace_na(Duplicate, "0")) %>%
  filter(Duplicate == "0") #0 means No

#29997 words

29,997 + 17 = 30,014 words remain in the final tibble, named DutchConcreteness.

DutchConcreteness <- bind_rows(DutchConcreteness_F, DutchConcreteness_dup2) %>%
  select(-Duplicate)
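As a quick sanity check on the arithmetic above, the row count of the rebuilt tibble can be verified directly (a sketch):

#Sanity check: the rebuilt tibble should contain 29,997 + 17 = 30,014 rows
stopifnot(nrow(DutchConcreteness) == 30014)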

2.5 Prepare list of Identical cognates

We decided to exclude identical cognates because research shows that cognates are recognised and processed faster (i.e., the cognate effect; Dijkstra et al., 2010). The data were retrieved from Poort & Rodd (2019) (https://osf.io/tcdxb/).

DutchEnglishCognates <- read_excel("PoortRodd.DatabaseOf58IdenticalCognates76Non-IdenticalCognates72InterlingualHomographs78TranslationEquivalents.xlsx", 'identical cognates')

DutchEnglishCognates <- DutchEnglishCognates %>%
  select(word_NL, word_EN) %>%
  mutate(WORD = word_EN,
         Cognate = "1") %>% # 1 means Yes
  select(WORD, Cognate)

2.6 Combine prepared datasets to create target word lists

English
I used inner_join() to combine the English material, the valence ratings (independent variable), and the concreteness ratings (control variable). Then I used left_join() to add the cognate information.

EnglishMaterialValence <- inner_join(EnglishMaterial, EnglishValence, "WORD")
EnglishMaterialValConc <- inner_join(EnglishMaterialValence, EnglishConcreteness, "WORD")

##Cognates
EnglishWordList_w_Cognate <- left_join(EnglishMaterialValConc, DutchEnglishCognates, "WORD")
#EnglishCognateList <- EnglishWordList_w_Cognate %>%
#  filter(Cognate == "1")
#23 out of 2200 words are cognates to remove

To decide whether to use inner_join() or left_join() when adding the concreteness ratings (control variable), I checked the number of words with no concreteness rating. With the code below I found that only 24 out of 2,224 words lack a concreteness rating, so removing them would not affect the quality of the final word list.

#EnglishMaterialValenceConcreteness <- left_join(EnglishMaterialValence, EnglishConcreteness, "WORD")
#24 out of 2224 words do not have concreteness rating -> removing them would not affect the final word list
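For completeness, here is a sketch of how that missing-rating count could be obtained (the joined tibble is only needed for this check; the object name follows the commented line above):

EnglishMaterialValenceConcreteness <- left_join(EnglishMaterialValence, EnglishConcreteness, "WORD")
sum(is.na(EnglishMaterialValenceConcreteness$Conc_Mean)) #24 words lack a concreteness rating (per the count above)
nrow(EnglishMaterialValenceConcreteness) #2224 words in total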

Finally, I created the EnglishWordList tibble by removing the cognates.

EnglishWordList <- EnglishWordList_w_Cognate %>%
  replace(is.na(.),"0") %>% # words not in the cognate list get Cognate = "0" (i.e., not a cognate)
  filter(Cognate == "0")

Dutch
As for English, I checked the number of words without a concreteness rating to decide whether to use inner_join() or left_join() when adding the concreteness ratings. With the check below (run after DutchMaterialValence had been created in the next code block) I found that 8 out of 1,294 words lack a concreteness rating, so removing them would not affect the quality of the final word list.

#DutchMaterialValenceConcreteness <- left_join(DutchMaterialValence, DutchConcreteness, "WORD")
#8 out of 1294 words do not have concreteness rating -> removing them would not affect the final word list

The code below shows how I combined DutchMaterial, the valence ratings (independent variable), and the concreteness ratings (control variable).

DutchMaterialValence <- inner_join(DutchMaterial, DutchValence, "WORD")
DutchMaterialValConc <- inner_join(DutchMaterialValence, DutchConcreteness, "WORD")

##Cognates
DutchWordList_w_Cognate <-left_join(DutchMaterialValConc, DutchEnglishCognates, "WORD")
#DutchCognateList <- DutchWordList_w_Cognate %>%
#  filter(Cognate == "1")
#14 out of 1286 words are cognates to remove

Finally, I created the DutchWordList tibble by removing the cognates.

DutchWordList <- DutchWordList_w_Cognate %>%
  replace(is.na(.),"0") %>% # words not in the cognate list get Cognate = "0" (i.e., not a cognate)
  filter(Cognate == "0")

2.7 Categorise target words and remove unknown words

In Section 2.3, I transformed the two valence ratings (English & Dutch) to a normalised 0-1 scale (V_Mean_Percent) so that the two different Likert scales (7-point & 9-point) can be compared.

I followed Toivo & Scheepers (2019) in categorising words into three valence groups (Positive/Negative/Neutral): based on the normalised scale, words with a valence rating below 0.33 were categorised as Negative, words with a valence rating above 0.66 were categorised as Positive, and the rest were categorised as Neutral.

Also, words with a known ratio below 75% are removed from the word lists. No English words fell below this threshold, but some Dutch words did and were therefore removed from the final dataset.

English

EnglishWordList <- EnglishWordList %>% 
  mutate(V_Category = case_when(V_Mean_Percent > 0.66 ~ "Positive",
                                V_Mean_Percent < 0.33 ~ "Negative",
                                TRUE ~ "Neutral")) %>% 
  select(-CONTENT_WORD, -Cognate)

Dutch

DutchWordList <- DutchWordList %>% 
  mutate(V_Category = case_when(V_Mean_Percent > 0.66 ~ "Positive",
                                V_Mean_Percent < 0.33 ~ "Negative",
                                TRUE ~ "Neutral")) %>%
  select(-CONTENT_WORD, -UnknownRatio, -Cognate)

DutchWordList <- DutchWordList %>%
  filter(!(C_Percent_known < 0.75))
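As a sketch (not part of the original script), the 75% threshold can be double-checked on both final word lists:

#Sketch: no remaining words should fall below the 75% known-ratio threshold
sum(EnglishWordList$C_Percent_known < 0.75, na.rm = TRUE) #expected: 0
sum(DutchWordList$C_Percent_known < 0.75, na.rm = TRUE) #expected: 0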

2.8 Create summary tables for WordLists

I excluded WORD_ID and then selected unique rows to summarise the characteristics of the unique words in the target word lists.

English

EnglishWordList_noDup <- EnglishWordList %>%
  select(-WORD_ID) %>%
  unique()

Sum_EnglishWordList <- EnglishWordList_noDup %>%
  group_by(V_Category) %>%
  summarise(N = n(),
            Mean_Valence_Percent = mean(V_Mean_Percent, na.rm = TRUE),
            SD_Valence_Percent = sd(V_Mean_Percent, na.rm = TRUE),
            Mean_WordLength = mean(WORD_LENGTH, na.rm = TRUE),
            SD_WordLength = sd(WORD_LENGTH, na.rm = TRUE),
            Mean_Conc = mean(Conc_Mean, na.rm = TRUE),
            SD_Conc = sd(Conc_Mean, na.rm = TRUE),
            Mean_KnownRatio = mean(C_Percent_known, na.rm = TRUE), #as there is no V_Percent_known data, using C_Percent_known
            SD_KnownRatio = sd(C_Percent_known, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Language = "English")

Dutch

DutchWordList_noDup <- DutchWordList %>%
  select(-WORD_ID) %>%
  unique()

Sum_DutchWordList <- DutchWordList_noDup %>%
  group_by(V_Category) %>%
  summarise(N = n(),
            Mean_Valence_Percent = mean(V_Mean_Percent, na.rm = TRUE),
            SD_Valence_Percent = sd(V_Mean_Percent, na.rm = TRUE),
            Mean_WordLength = mean(WORD_LENGTH, na.rm = TRUE),
            SD_WordLength = sd(WORD_LENGTH, na.rm = TRUE),
            Mean_Conc = mean(Conc_Mean, na.rm = TRUE),
            SD_Conc = sd(Conc_Mean, na.rm = TRUE),
            Mean_KnownRatio = mean(C_Percent_known, na.rm = TRUE), #C_Percent_known to match w/ English though V_Percent_known is available
            SD_KnownRatio = sd(C_Percent_known, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Language = "Dutch")

Summary
I combined the two summary tables. Note that this table describes the characteristics of the words in the target word lists, which were created from the reading material (i.e., the words in the novel). The target word lists will be compared with the ReadingData, which contains the eye-tracking data generated in the experiment.

Sum_WordList <- bind_rows(Sum_DutchWordList,Sum_EnglishWordList)

df_Sum_WordList <- as.data.frame(Sum_WordList)
wordsummary <- data.frame(df_Sum_WordList$Language,
                          df_Sum_WordList$V_Category,
                          df_Sum_WordList$N,
                          round(df_Sum_WordList$Mean_Valence_Percent,3),
                          round(df_Sum_WordList$SD_Valence_Percent,3),
                          round(df_Sum_WordList$Mean_WordLength,2),
                          round(df_Sum_WordList$SD_WordLength,2),
                          round(df_Sum_WordList$Mean_Conc,2),
                          round(df_Sum_WordList$SD_Conc,2),
                          round(df_Sum_WordList$Mean_KnownRatio,3),
                          round(df_Sum_WordList$SD_KnownRatio,3))
names(wordsummary) <- (c("Language", "Valence Category", "N", "Valence rating (Mean)", "Valence rating (SD)", "Word Length (Mean)", "Word Length (SD)", "Concreteness rating (Mean)", "Concreteness rating (SD)", "Word known ratio (Mean)", "Word known ratio (SD)"))

kable(wordsummary) %>% kable_styling() %>% scroll_box(width = "100%", height = "100%")
Language Valence Category N Valence rating (Mean) Valence rating (SD) Word Length (Mean) Word Length (SD) Concreteness rating (Mean) Concreteness rating (SD) Word known ratio (Mean) Word known ratio (SD)
Dutch Negative 210 0.235 0.065 6.86 2.32 2.60 0.79 0.997 0.013
Dutch Neutral 754 0.516 0.085 5.28 1.63 3.39 1.06 0.996 0.021
Dutch Positive 289 0.764 0.068 6.54 2.40 2.48 0.87 0.999 0.014
English Negative 319 0.234 0.063 7.03 2.23 2.73 0.83 0.994 0.019
English Neutral 1300 0.528 0.085 6.48 2.32 3.20 1.06 0.994 0.019
English Positive 558 0.744 0.060 6.81 2.37 2.89 1.04 0.998 0.010

2.9 Visualisation of target word lists

For reference, I created histograms of the valence ratings of the target words. From the histograms, you can see that 1) the English valence ratings appear negatively skewed, and 2) the Dutch valence ratings appear approximately normally distributed.

English

ggplot(EnglishWordList_noDup, aes(V_Mean_Percent)) + 
  geom_histogram(binwidth = .01,
                 colour = "black",
                 fill = "grey",
                 aes(y = ..density..)) + 
  scale_x_continuous(name = "English Mean Valence (0-1)") +
  stat_function(fun = dnorm, # this adds a normal density function curve
                colour = "red", # this makes it red
                args = list(mean = mean(EnglishWordList_noDup$V_Mean_Percent, na.rm = TRUE),
                           sd = sd(EnglishWordList_noDup$V_Mean_Percent, na.rm = TRUE)))

Dutch

ggplot(DutchWordList_noDup, aes(V_Mean_Percent)) + 
  geom_histogram(binwidth = .01,
                 colour = "black",
                 fill = "grey",
                 aes(y = ..density..)) + 
  scale_x_continuous(name = "Dutch Mean Valence (0-1)") +
  stat_function(fun = dnorm, # this adds a normal density function curve
                colour = "red", # this makes it red
                args = list(mean = mean(DutchWordList_noDup$V_Mean_Percent, na.rm = TRUE),
                           sd = sd(DutchWordList_noDup$V_Mean_Percent, na.rm = TRUE)))
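To put a number on the skew described above, a simple skewness estimate can be computed (a sketch; skewness_simple is a helper defined here and is not part of the original script):

#Sketch: standardised third moment as a rough skewness measure (negative = left-skewed)
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
skewness_simple(EnglishWordList_noDup$V_Mean_Percent)
skewness_simple(DutchWordList_noDup$V_Mean_Percent)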

2.10 Demographic information

First of all, I loaded the dataset and selected only the necessary information. One participant (pp18) was removed from the dataset because they only read the first half of the book in English (Cop et al., 2017).

#Load the dataset
Demographic <- read_excel("SubjectInformation.xlsx")

#Select necessary information.
Demographic <- Demographic %>%
  select(PP_NR, GROUP, AGE, SEX, AOA_ENG) %>%
  filter(GROUP == "bilingual") %>%
  mutate(PP_NR_N = recode(PP_NR, #Change PP_NR label to match with ReadingData
                          "1" = "pp01",
                          "2" = "pp02",
                          "3" = "pp03",
                          "4" = "pp04",
                          "5" = "pp05",
                          "6" = "pp06",
                          "7" = "pp07",
                          "8" = "pp08",
                          "9" = "pp09",
                          "10" = "pp10",
                          "11" = "pp11",
                          "12" = "pp12",
                          "13" = "pp13",
                          "14" = "pp14",
                          "15" = "pp15",
                          "16" = "pp16",
                          "17" = "pp17",
                          "18" = "pp18",
                          "19" = "pp19")) %>%
  select(PP_NR_N, GROUP, AGE, SEX, AOA_ENG) %>%
  rename(PP_NR = PP_NR_N) %>%
  filter(!(PP_NR == "pp18")) #remove pp18
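As an aside, the long recode() call could be replaced by zero-padded formatting with sprintf(); a minimal illustration on literal values (shown separately because PP_NR has already been relabelled in the pipeline above):

#Sketch: zero-padded participant labels from numeric IDs
sprintf("pp%02d", as.integer(c("1", "10", "19"))) #"pp01" "pp10" "pp19"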

Demographic information is summarised in the Demographic_table tibble.

Demographic_table <- Demographic %>%
  summarise(N = n(),
            MAge = mean(AGE, na.rm = TRUE),
            SDAge = sd(AGE, na.rm = TRUE),
            MAoA = mean(AOA_ENG, na.rm = TRUE),
            SDAoA = sd(AOA_ENG, na.rm = TRUE))

kable(Demographic_table) %>% kable_styling()
N MAge SDAge MAoA SDAoA
18 21.11111 2.138963 11.11111 2.44682

2.11 Data wrangling on ReadingData

Here I loaded the ReadingData from the GECO project (Cop et al., 2017), retrieved from https://expsy.ugent.be/downloads/geco/.

EnglishReadingData_raw <- read_excel("L2ReadingData.xlsx")
DutchReadingData_raw <- read_excel("L1ReadingData.xlsx")

English

I first worked on the English reading data. For the analyses, I selected PP_NR, PART, WORD_ID, WORD, WORD_FIXATION_COUNT, and WORD_FIRST_FIXATION_DURATION. Single Fixation Duration (SFD) is our dependent variable: the first fixation duration of words with WORD_FIXATION_COUNT = 1.

EnglishReadingData <- EnglishReadingData_raw %>% 
  select(PP_NR, PART, WORD_ID, WORD, WORD_FIXATION_COUNT, WORD_FIRST_FIXATION_DURATION) %>%
  filter(WORD_FIXATION_COUNT == "1",
         !(PP_NR == "pp18")) #pp18 is removed as this participant only completed half of the experiment

The reading data contain punctuation, which needs to be removed so that the reading data can be combined with the target word lists.

EnglishReadingData_Final <- EnglishReadingData %>%
  mutate(WORD2 = lapply(WORD, function(x) {str_replace_all(x,"[,.'?!:;-]","")})) # Remove ,.'?!:;- from WORD and keep the result in WORD2
EnglishReadingData_Final$WORD3 <- gsub(x=EnglishReadingData_Final$WORD2, pattern ="\"", "") # Remove double quotes from WORD2 and keep the result in WORD3

EnglishReadingData_Final <- EnglishReadingData_Final %>%
  select(PP_NR, PART, WORD_ID, WORD3, WORD_FIXATION_COUNT, WORD_FIRST_FIXATION_DURATION) %>%
  rename(WORD = WORD3) %>%
  mutate(WORD_FIRST_FIXATION_DURATION = as.numeric(WORD_FIRST_FIXATION_DURATION))
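As a side note, str_replace_all() is vectorised over its input, so the lapply() call and the follow-up gsub() step could be collapsed into a single mutate(). A sketch of an equivalent one-step clean-up (EnglishReadingData_clean is a hypothetical name, not used elsewhere):

#Sketch: remove the same punctuation plus double quotes in one vectorised step
EnglishReadingData_clean <- EnglishReadingData %>%
  mutate(WORD = str_replace_all(WORD, "[,.'?!:;\"-]", ""),
         WORD_FIRST_FIXATION_DURATION = as.numeric(WORD_FIRST_FIXATION_DURATION))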

Dutch
The same cleaning process as for English is applied to the Dutch reading data.

DutchReadingData <- DutchReadingData_raw %>%
  select(PP_NR, PART, WORD_ID, WORD, WORD_FIXATION_COUNT, WORD_FIRST_FIXATION_DURATION) %>%
  filter(WORD_FIXATION_COUNT == "1")
        
DutchReadingData_Final <- DutchReadingData %>%
  mutate(WORD2 = lapply(WORD, function(x) {str_replace_all(x,"[,.'?!:;-]","")}))
DutchReadingData_Final$WORD3 <- gsub(x=DutchReadingData_Final$WORD2, pattern ="\"", "")
DutchReadingData_Final <- DutchReadingData_Final %>%
  select(PP_NR, PART, WORD_ID, WORD3, WORD_FIXATION_COUNT, WORD_FIRST_FIXATION_DURATION) %>%
  rename(WORD = WORD3) %>%
  mutate(WORD_FIRST_FIXATION_DURATION = as.numeric(WORD_FIRST_FIXATION_DURATION))

2.12 Join ReadingData and Target Word Lists

We used WORD_ID to join the two tibbles.

English

There are 55,737 rows remaining after EnglishReadingData_Final is joined with the target word list.

EnglishReadingData_w_WordList <- inner_join(EnglishReadingData_Final, EnglishWordList, "WORD_ID") %>%
  select(-WORD.y) %>%
  rename(WORD = WORD.x)

Dutch

There are 47,333 rows remaining after DutchReadingData_Final is joined with the target word list.

DutchReadingData_w_WordList <- inner_join(DutchReadingData_Final,DutchWordList, "WORD_ID") %>%
  select(-WORD.y, -Translation, -V_Percent_known) %>%
  rename(WORD = WORD.x)

2.13 Detect outliers

Single Fixation Durations (SFDs) that differed by more than 2.5 standard deviations from the subject's mean were considered outliers and excluded from the dataset.

English

SubjectMeanSFD_EN <- EnglishReadingData_w_WordList %>%
  group_by(PP_NR) %>%
  summarise(SFDSubjectMean = mean(WORD_FIRST_FIXATION_DURATION, na.rm = TRUE),
            SFDSubjectSD = sd(WORD_FIRST_FIXATION_DURATION, na.rm = TRUE)) %>%
  ungroup()

EnglishReadingData_w_WordList <- inner_join(EnglishReadingData_w_WordList, SubjectMeanSFD_EN, "PP_NR")
EnglishReadingData_w_WordList_Outlier <- EnglishReadingData_w_WordList %>%
  mutate(Outlier = case_when(WORD_FIRST_FIXATION_DURATION > SFDSubjectMean + (SFDSubjectSD * 2.5) ~ "1",
                             WORD_FIRST_FIXATION_DURATION < SFDSubjectMean - (SFDSubjectSD * 2.5) ~ "1",
                             TRUE ~ "0")) # Outliers if 1

EnglishReadingData_w_WordList <- EnglishReadingData_w_WordList_Outlier %>%
  filter(Outlier == "0")

According to the summary created below, 1,189 of 55,737 items are identified as outliers in EnglishReadingData.

Sum_Outlier_EN <- EnglishReadingData_w_WordList_Outlier %>%
  group_by(Outlier) %>%
  summarise(n = n()) %>%
  ungroup()

Sum_Outlier_EN
## # A tibble: 2 x 2
##   Outlier     n
##   <chr>   <int>
## 1 0       54548
## 2 1        1189

Dutch

SubjectMeanSFD_NL <- DutchReadingData_w_WordList %>%
  group_by(PP_NR) %>%
  summarise(SFDSubjectMean = mean(WORD_FIRST_FIXATION_DURATION, na.rm = TRUE),
            SFDSubjectSD = sd(WORD_FIRST_FIXATION_DURATION, na.rm = TRUE)) %>%
  ungroup()

DutchReadingData_w_WordList <- inner_join(DutchReadingData_w_WordList, SubjectMeanSFD_NL, "PP_NR")
DutchReadingData_w_WordList_Outlier <- DutchReadingData_w_WordList %>%
  mutate(Outlier = case_when(WORD_FIRST_FIXATION_DURATION > SFDSubjectMean + (SFDSubjectSD * 2.5) ~ "1",
                             WORD_FIRST_FIXATION_DURATION < SFDSubjectMean - (SFDSubjectSD * 2.5) ~ "1",
                             TRUE ~ "0")) # Outliers if 1


DutchReadingData_w_WordList <- DutchReadingData_w_WordList_Outlier %>%
  filter(Outlier == "0")

According to the summary created below, 1,000 of 47,333 items are identified as outliers in DutchReadingData.

Sum_Outlier_NL <- DutchReadingData_w_WordList_Outlier %>%
  group_by(Outlier) %>%
  summarise(n = n()) %>%
  ungroup()

Sum_Outlier_NL
## # A tibble: 2 x 2
##   Outlier     n
##   <chr>   <int>
## 1 0       46333
## 2 1        1000

2.14 Account for the repeated words in the datasets

Repeated words are target words that occur multiple times in the ReadingData (e.g., "that" in English). Here, I added a new column that flags whether a target word is repeated for a given participant in the ReadingData dataset.

English
According to the summary below, there are 18,005 participant-by-word combinations in the English reading dataset, of which 8,037 involve a repeated word. That is, 44.64% of the words are repeated, occurring between 2 and 129 times.

Sum_EnglishReadingData_w_WordList <- EnglishReadingData_w_WordList %>%
  group_by(PP_NR, WORD) %>%
  summarise(N = n()) %>%
  ungroup() %>%
  mutate(Repeated = case_when(N > 1 ~ "1",
                              TRUE ~"0"))

count(Sum_EnglishReadingData_w_WordList, Repeated == "0")
## # A tibble: 2 x 2
##   `Repeated == "0"`     n
##   <lgl>             <int>
## 1 FALSE              8037
## 2 TRUE               9968
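The proportion and repetition range reported above can be read straight off this summary (a sketch):

Sum_EnglishReadingData_w_WordList %>%
  summarise(prop_repeated = mean(Repeated == "1"), #~0.4464
            min_repeats = min(N[Repeated == "1"]), #2
            max_repeats = max(N[Repeated == "1"])) #129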

I created a visualisation to explore the characteristics of these repeated words. You can see that most repeated words occur only two or three times.

Sum2_EnglishReadingData_w_WordList <- Sum_EnglishReadingData_w_WordList %>%
  filter(Repeated == "1")

ggplot(Sum2_EnglishReadingData_w_WordList, aes(N)) +
  geom_bar()

Dutch
According to the summary below, there are 11,487 participant-by-word combinations in the Dutch reading dataset, of which 5,933 involve a repeated word. That is, 51.65% of the words are repeated, occurring between 2 and 253 times.

Sum_DutchReadingData_w_WordList <- DutchReadingData_w_WordList %>%
  group_by(PP_NR, WORD) %>%
  summarise(N = n()) %>%
  ungroup() %>%
  mutate(Repeated = case_when(N > 1 ~ "1",
                              TRUE ~"0"))

count(Sum_DutchReadingData_w_WordList, Repeated == "0")
## # A tibble: 2 x 2
##   `Repeated == "0"`     n
##   <lgl>             <int>
## 1 FALSE              5933
## 2 TRUE               5554

Visualisations were created for the Dutch data as well. As with the English data, most repeated words occur only two or three times.

Sum2_DutchReadingData_w_WordList <- Sum_DutchReadingData_w_WordList %>%
  filter(Repeated == "1")

#Most repeated words occur only two or three times
ggplot(Sum2_DutchReadingData_w_WordList, aes(N)) +
  geom_bar()

2.15 Clean up repeating words in ReadingData

Based on the analysis in Section 2.14, we concluded that the proportion of repeated words is acceptable (less than 80%). Thus, we keep only the first occurrence of each repeated word per participant for our analysis. Here I used slice(1) to select the first row; slice_head(n = 1) would also work. If a random selection is preferred, slice_sample(n = 1) can be used instead of slice(1) or slice_head().

English

EnglishReadingData_w_WordList_NoRep <- EnglishReadingData_w_WordList %>%
  group_by(PP_NR, WORD) %>%
  slice(1) %>% 
  ungroup()

Dutch

DutchReadingData_w_WordList_NoRep <- DutchReadingData_w_WordList %>%
  group_by(PP_NR, WORD) %>%
  slice(1) %>%
  ungroup()

2.16 Summary of words for analysis

The raw ReadingData contains 549,290 words for Dutch and 534,154 words for English. After selecting single fixations and joining with our target word lists, the number of words is 47,333 (Dutch) and 55,737 (English), which is 9.51% of the raw data. We then kept only one occurrence per word per participant, leaving 11,487 (Dutch) and 18,005 (English) words for our analysis, 2.72% of the raw data.
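As a quick check, the percentages above can be reproduced from the word counts reported in the text (a sketch):

#Sketch: proportions of the raw data retained at each stage (counts taken from the text)
raw_total <- 549290 + 534154 #Dutch + English raw words
round((47333 + 55737) / raw_total * 100, 2) #9.51
round((11487 + 18005) / raw_total * 100, 2) #2.72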

I created a summary table of the words in the ReadingData dataset that are analysed in the current study.

ReadingData_w_WordList_NoRep <- bind_rows(EnglishReadingData_w_WordList_NoRep, DutchReadingData_w_WordList_NoRep)

Sum_ReadingData_ENandNL <- ReadingData_w_WordList_NoRep %>%
  select(Language, V_Category, V_Mean_Percent, WORD_LENGTH, Conc_Mean, C_Percent_known) %>%
  group_by(Language, V_Category) %>%
  summarise(N = n(),
            Mean_Valence_Percent = mean(V_Mean_Percent, na.rm = TRUE),
            SD_Valence_Percent = sd(V_Mean_Percent, na.rm = TRUE),
            Mean_WordLength = mean(WORD_LENGTH, na.rm = TRUE),
            SD_WordLength = sd(WORD_LENGTH, na.rm = TRUE),
            Mean_Conc = mean(Conc_Mean, na.rm = TRUE),
            SD_Conc = sd(Conc_Mean, na.rm = TRUE),
            Mean_KnownRatio = mean(C_Percent_known, na.rm = TRUE),
            SD_KnownRatio = sd(C_Percent_known, na.rm = TRUE)) %>%
  ungroup()

df_Sum_ReadingData <- as.data.frame(Sum_ReadingData_ENandNL)
ReadingDataSummary <- data.frame(df_Sum_ReadingData$Language,
                                 df_Sum_ReadingData$V_Category,
                                 df_Sum_ReadingData$N,
                                 round(df_Sum_ReadingData$Mean_Valence_Percent,2),
                                 round(df_Sum_ReadingData$SD_Valence_Percent,2),
                                 round(df_Sum_ReadingData$Mean_WordLength,2),
                                 round(df_Sum_ReadingData$SD_WordLength,2),
                                 round(df_Sum_ReadingData$Mean_Conc,2),
                                 round(df_Sum_ReadingData$SD_Conc,2),
                          round(df_Sum_ReadingData$Mean_KnownRatio,3),
                          round(df_Sum_ReadingData$SD_KnownRatio,3))
names(ReadingDataSummary) <- (c("Language", "Valence Category", "N", "Valence rating (Mean)", "Valence rating (SD)", "Word Length (Mean)", "Word Length (SD)", "Concreteness rating (Mean)", "Concreteness rating (SD)", "Word known ratio (Mean)", "Word known ratio (SD)"))

kable(ReadingDataSummary) %>% kable_styling() %>% scroll_box(width = "100%", height = "100%")
Language Valence Category N Valence rating (Mean) Valence rating (SD) Word Length (Mean) Word Length (SD) Concreteness rating (Mean) Concreteness rating (SD) Word known ratio (Mean) Word known ratio (SD)
Dutch Negative 1713 0.24 0.07 6.71 2.25 2.64 0.78 0.998 0.012
Dutch Neutral 7072 0.52 0.08 5.27 1.57 3.36 1.07 0.996 0.021
Dutch Positive 2702 0.76 0.07 6.31 2.29 2.48 0.89 0.999 0.013
English Negative 2433 0.23 0.06 6.59 2.14 2.81 0.85 0.995 0.015
English Neutral 10445 0.53 0.08 5.97 2.10 3.27 1.05 0.995 0.017
English Positive 5127 0.74 0.06 6.39 2.25 2.97 1.07 0.998 0.009

Here I also summarised the number of items per participant per language.

Sum_ReadingData_ENandNL_itemsperparticipants <- ReadingData_w_WordList_NoRep %>%
  group_by(PP_NR, Language) %>%
  summarise(N = n()) %>%
  ungroup()

kable(Sum_ReadingData_ENandNL_itemsperparticipants) %>% kable_styling() %>% scroll_box(width = "100%", height = "500px")
PP_NR Language N
pp01 Dutch 669
pp01 English 990
pp02 Dutch 627
pp02 English 1076
pp03 Dutch 642
pp03 English 1007
pp04 Dutch 617
pp04 English 1106
pp05 Dutch 663
pp05 English 981
pp06 Dutch 660
pp06 English 879
pp07 Dutch 669
pp07 English 1041
pp08 Dutch 651
pp08 English 1001
pp09 Dutch 643
pp09 English 952
pp10 Dutch 621
pp10 English 1053
pp11 Dutch 661
pp11 English 1097
pp12 Dutch 640
pp12 English 1110
pp13 Dutch 615
pp13 English 860
pp14 Dutch 620
pp14 English 1099
pp15 Dutch 629
pp15 English 885
pp16 Dutch 586
pp16 English 1036
pp17 Dutch 649
pp17 English 786
pp19 Dutch 625
pp19 English 1046

2.17 Mean centering

All continuous variables will be mean-centred to reduce collinearity between main effects and interactions: WORD_FIRST_FIXATION_DURATION, WORD_LENGTH, and Conc_Mean. When you have continuous variables in a regression, it is often sensible to mean-centre them: a predictor X is centred simply by subtracting its mean (X_centered = X - mean(X)). See https://psyteachr.github.io/msc-conv/multiple-regression.html for further discussion.

ReadingData_ENandNL <- ReadingData_w_WordList_NoRep %>%
  select(-WORD_FIXATION_COUNT, -V_SD, -Conc_SD,-C_Percent_known, -SFDSubjectMean, - SFDSubjectSD, -Outlier) %>% #Cleaning up the tibble by removing columns that are no longer required for analysis
  mutate(WORD_FIRST_FIXATION_DURATION_centered = WORD_FIRST_FIXATION_DURATION - mean(WORD_FIRST_FIXATION_DURATION),
         WORD_LENGTH_centered = WORD_LENGTH - mean(WORD_LENGTH),
         Conc_Mean_centered = Conc_Mean - mean(Conc_Mean))
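Equivalently, base R's scale() with scale = FALSE performs the same mean-centring; a quick sketch checking the equivalence (not part of the pipeline):

#Sketch: scale(x, scale = FALSE) is equivalent to x - mean(x)
all.equal(ReadingData_ENandNL$WORD_LENGTH_centered,
          as.numeric(scale(ReadingData_ENandNL$WORD_LENGTH, scale = FALSE)))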

2.18 Summary of SFD

Summaries of Single Fixation Duration (SFD) are created below, both by valence category and language, and collapsed across languages.

Sum_SFD <- ReadingData_ENandNL %>%
  group_by(V_Category, Language) %>%
  summarise(Mean = mean(WORD_FIRST_FIXATION_DURATION),
            SD = sd(WORD_FIRST_FIXATION_DURATION)) %>%
  ungroup()

df_Sum_SFD <- as.data.frame(Sum_SFD)
SFDsummary <- data.frame(df_Sum_SFD$Language,
                         df_Sum_SFD$V_Category,
                         round(df_Sum_SFD$Mean,2),
                         round(df_Sum_SFD$SD,2))
names(SFDsummary) <- (c("Language", "Valence Category", "SFD (Mean)", "SFD (SD)"))

Sum_SFDwoLang <- ReadingData_ENandNL %>%
  group_by(V_Category) %>%
  summarise(Mean = mean(WORD_FIRST_FIXATION_DURATION),
            SD = sd(WORD_FIRST_FIXATION_DURATION)) %>%
  ungroup()

df_Sum_SFDwoLang <- as.data.frame(Sum_SFDwoLang)
SFDsummary_woLang <- data.frame(df_Sum_SFDwoLang$V_Category,
                         round(df_Sum_SFDwoLang$Mean,2),
                         round(df_Sum_SFDwoLang$SD,2))
names(SFDsummary_woLang) <- (c("Valence Category", "SFD (Mean)", "SFD (SD)")) 

kable(SFDsummary) %>% kable_styling()
Language Valence Category SFD (Mean) SFD (SD)
Dutch Negative 209.06 70.60
English Negative 235.18 83.34
Dutch Neutral 204.38 69.85
English Neutral 232.73 82.78
Dutch Positive 205.37 70.08
English Positive 228.92 82.59
kable(SFDsummary_woLang) %>% kable_styling()
Valence Category SFD (Mean) SFD (SD)
Negative 224.39 79.37
Neutral 221.29 79.05
Positive 220.79 79.29

References

Brysbaert, M., Stevens, M., De Deyne, S., Voorspoels, W., & Storms, G. (2014). Norms of age of acquisition and concreteness for 30,000 Dutch words. Acta Psychologica, 150, 80–84. https://doi.org/10.1016/j.actpsy.2014.04.010
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. https://doi.org/10.3758/s13428-013-0403-5
Cop, U., Dirix, N., Drieghe, D., & Duyck, W. (2017). Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, 49(2), 602–615. https://doi.org/10.3758/s13428-016-0734-0
Dijkstra, T., Miwa, K., Brummelhuis, B., Sappelli, M., & Baayen, H. (2010). How cross-language similarity and task demands affect cognate recognition. Journal of Memory and Language, 62(3), 284–301. https://doi.org/10.1016/j.jml.2009.12.003
Moors, A., De Houwer, J., Hermans, D., Wanmaker, S., Schie, K. van, Van Harmelen, A.-L., De Schryver, M., De Winne, J., & Brysbaert, M. (2013). Norms of valence, arousal, dominance, and age of acquisition for 4,300 Dutch words. Behavior Research Methods, 45(1), 169–177. https://doi.org/10.3758/s13428-012-0243-8
Poort, E. D., & Rodd, J. M. (2019). A database of Dutch-English cognates, interlingual homographs and translation equivalents. Journal of Cognition, 2(1). https://doi.org/10.5334/joc.67
Toivo, W., & Scheepers, C. (2019). Pupillary responses to affective words in bilinguals' first versus second language. PLOS ONE, 14(4), e0210450. https://doi.org/10.1371/journal.pone.0210450
Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207. https://doi.org/10.3758/s13428-012-0314-x