
The entire R Notebook for the tutorial can be downloaded [**here**](https://slcladal.github.io/content/intror.Rmd). If you want to render the R Notebook on your machine, i.e. knitting the document to html or a pdf, you need to make sure that you have R and RStudio installed and you also need to download the [**bibliography file**](https://slcladal.github.io/content/bibliography.bib) and store it in the same folder where you store the Rmd or the Rproj file.

* Use the RStudio Cheat Sheets
* Use the RStudio Help viewer by typing `?` before a function or package
* Check out the keyboard shortcuts (`Keyboard Shortcuts Help` under `Tools` in RStudio) for some good tips

### Finding help online{-}

One great thing about R is that you can very often find an answer to your question online.

* Google your error! There are excellent online guides with suggestions on how to find help for a specific question.

# Working with tables{-}

We will now start working with data in R. As most of the data that we work with comes in tables, we will focus on this first before moving on to working with text data.

## Loading data from the web{-}

To show how data can be downloaded from the web, we will download a tab-separated txt-file. Translated to prose, the code below means *Create an object called `icebio` and, in that object, store the result of the `read.delim` function*. `read.delim` stands for *read delimited file* and it takes the URL from which to load the data (or the path to the data on your computer) as its first argument. The second argument, `sep`, stands for separator, and `\t` means tab-separated. The third argument, `header`, can take either T(RUE) or F(ALSE) and tells R whether the data has column names (headers) or not.

## Functions and Objects{-}

In R, functions always have the following form: `function(argument1, argument2, ..., argumentN)`. Typically, a function does something to an object (e.g. a table), so the first argument usually specifies the data to which the function is applied, while the other arguments add further information. Just as a side note, functions are also objects - objects that do not contain data but instructions.

To assign content to an object, we use `<-` or `=`: we provide a name for the object and then assign some content to it. For example, `MyObject <- 1:3` means *Create an object called `MyObject`; this object should contain the numbers 1 to 3* (a short demonstration follows at the end of this section).

```{r, message=FALSE, warning=FALSE}
# load data
icebio <- read.delim("https://slcladal.github.io/data/BiodataIceIreland.txt",
                     sep = "\t", header = T)
```

## Inspecting data{-}

There are many ways to inspect data. We will briefly go over the most common ones. The `head` function takes the data object as its first argument and automatically shows the first 6 elements of the object (or rows, if the data object has a table format).

```{r, message=FALSE, warning=FALSE}
head(icebio)
```

We can also use the `head` function to inspect more or fewer elements; we specify the number of elements (or rows) that we want to inspect as a second argument. In the example below, the `4` tells R that we only want to see the first 4 rows of the data.

```{r, message=FALSE, warning=FALSE}
head(icebio, 4)
```
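Coming back to functions and objects for a moment: to see assignment and function application in action, here is a minimal sketch that uses only base R (the toy object `MyObject` is taken from the explanation above).

```{r, message=FALSE, warning=FALSE}
# create an object containing the numbers 1 to 3
MyObject <- 1:3
# calling the object by its name prints its content
MyObject
# apply a function to the object, e.g. sum
sum(MyObject)
```

***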

EXERCISE TIME!

1. Download and inspect the first 7 rows of the data set that you can find under this URL: `https://slcladal.github.io/data/lmmdata.txt`. Can you guess what the data is about?

Answer

```{r loadtext}
ex1data <- read.delim("https://slcladal.github.io/data/lmmdata.txt", sep = "\t")
head(ex1data, 7)
```

The data is about texts and the different columns provide information about the texts, such as when the texts were written (`Date`), the genre the texts represent (`Genre`), the name of the texts (`Text`), the relative frequencies of prepositions the texts contain (`Prepositions`), and the region where the author was from (`Region`).

***

## Accessing individual cells in a table{-}

If you want to access specific cells in a table, you can do so by typing the name of the object and then specifying the rows and columns in square brackets (i.e. **data[row, column]**). For example, `icebio[2, 4]` would show the value of the cell in the second row and fourth column of the object `icebio`. We can also use a colon to define a range (as shown below, where 1:5 means from 1 to 5 and 1:3 means from 1 to 3). The command `icebio[1:5, 1:3]` thus means: *Show me the first 5 rows and the first 3 columns of the data object called icebio*.

```{r, message=FALSE, warning=FALSE}
icebio[1:5, 1:3]
```
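For instance, the single-cell access described above looks like this:

```{r, message=FALSE, warning=FALSE}
# value of the cell in the second row and fourth column
icebio[2, 4]
```

***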

EXERCISE TIME!

1. How would you inspect the content of the cells in the 4^th^ column, rows 3 to 5, of the `icebio` data set?

Answer

```{r}
icebio[3:5, 4]
```

***

**Inspecting the structure of data**

You can use the `str` function to inspect the structure of a data set. This function shows the number of observations (rows) and variables (columns) and tells you what type of variables the data consists of:

- **int** = integer
- **chr** = character string
- **num** = numeric
- **fct** = factor

```{r, message=FALSE, warning=FALSE}
str(icebio)
```

The `summary` function summarizes the data.

```{r, message=FALSE, warning=FALSE}
summary(icebio)
```

## Tabulating data{-}

We can use the `table` function to create basic tables that extract raw frequency information. The following command tells us how many instances there are of each level of the variable `date` in `icebio`.

***

TIP

In order to access specific columns of a data frame, you can first type the name of the data set followed by a `$` symbol and then the name of the column (or variable).
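For example, the following line shows the first six values of the `date` column of `icebio`:

```{r, message=FALSE, warning=FALSE}
head(icebio$date)
```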

***

```{r, message=FALSE, warning=FALSE}
table(icebio$date)
```

Alternatively, you could, of course, index the column by using its position in the data set, like this: `icebio[, 6]` - the results of `table(icebio[, 6])` and `table(icebio$date)` are the same! Also note that here we leave out the index for the rows to tell R that we want all rows.

When you want to cross-tabulate columns, it is often better to use the `ftable` function (`ftable` stands for *frequency table*).

```{r, message=FALSE, warning=FALSE}
ftable(icebio$age, icebio$sex)
```

***

EXERCISE TIME!

1. Using the `table` function, how many women are in the data collected between 2002 and 2005?

Answer

```{r}
table(icebio$date, icebio$sex)
```

2. Using the `ftable` function, how many men are from Northern Ireland in the data collected between 1990 and 1994?

Answer

```{r}
ftable(icebio$date, icebio$zone, icebio$sex)
```

***

## Saving data to your computer{-}

To save tabular data on your computer, you can use the `write.table` function. This function requires the data that you want to save as its first argument, the location where you want to save the data as its second argument, and the type of delimiter as its third argument.

```{r savedisc, eval = T, message=FALSE, warning=FALSE}
write.table(icebio, here::here("data", "icebio.txt"), sep = "\t")
```

**A word about paths**

In the code chunk above, the sequence `here::here("data", "icebio.txt")` is a handy way to define a path. A path is simply the location where a file is stored on your computer or on the internet (which typically is a server - which is just a fancy term for a computer - somewhere on the globe). The `here` function from the `here` package allows you to simply state in which folder a certain file is and which file you are talking about. In this case, we want to access the file `icebio` (which is a `txt` file and thus has the extension `.txt`) in the `data` folder. R will always start looking in the folder in which your project is stored. If you want to access a file that is stored somewhere else on your computer, you can also define the full path to the folder in which the file is. In my case, this would be `D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data`. However, as the `data` folder is in the folder where my Rproj file is located, I only need to specify that the file is in the `data` folder within the project folder.

**A word about package naming**

Another thing that is notable in the sequence `here::here("data", "icebio.txt")` is that I specified that the `here` function is part of the `here` package. This is what I meant by writing `here::here`, which simply means *use the `here` function from the `here` package* (`package::function`). This may appear somewhat redundant, but it happens quite frequently that different packages have functions with the same names. In such cases, R will simply choose the function from the package that was loaded last. To prevent R from using the wrong function, it makes sense to specify both the package AND the function (as I did in the sequence `here::here`). I only use functions without specifying the package if the function is part of base R.

## Loading data from your computer{-}

To load tabular data from within your project folder (if it is in a tab-separated txt-file), you can also use the `read.delim` function. The only difference to loading from the web is that you use a path instead of a URL. If the txt-file is in the folder called *data* in your project folder, you would load the data as shown below.

```{r, message=FALSE, warning=FALSE}
icebio <- read.delim(here::here("data", "icebio.txt"),
                     sep = "\t", header = T)
```

However, you can always just use the full path (and you must do this if the data is not in your project folder).
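For illustration, here is a sketch of such a full-path call. The path below is the example path from above and will differ on your machine, which is why this chunk is not evaluated.

```{r fullpath, eval = F, message=FALSE, warning=FALSE}
# hypothetical full path - adapt this to your own machine
icebio <- read.delim("D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data/icebio.txt",
                     sep = "\t", header = T)
```

***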

NOTE
You may have to change the path to the data!

***

```{r, echo=T, eval=T, message=F, warning=F}
icebio <- read.delim(here::here("data", "icebio.txt"),
                     sep = "\t", header = T)
```

To check whether this has worked, we will use the `head` function to see the first 6 rows of the data.

```{r, message=F, warning=F}
head(icebio)
```

## Loading Excel data{-}

To load Excel spreadsheets, you can use the `read_excel` function from the `readxl` package as shown below. However, it may be necessary to install and activate the `readxl` package first.

```{r, echo=T, eval=T, message=F, warning=F}
icebio <- readxl::read_excel(here::here("data", "ICEdata.xlsx"))
```

We now briefly check the column names to see if loading the data has worked.

```{r, message=F, warning=F}
colnames(icebio)
```

## Loading text data{-}

There are many functions that we can use to load text data into R. For example, we can use the `readLines` function as shown below.

```{r, echo=T, eval=T, message=F, warning=F}
text <- readLines(here::here("data", "text2.txt"))
# inspect first text element
text[1]
```

To load many texts, we can use a loop to read all texts in a folder as shown below. In a first step, we define the paths of the texts and then we use the `sapply` function to loop over the paths and read them into R.

```{r load_texts, eval=T, echo=T, message=F, warning=F}
# define paths
paths <- list.files(here::here("data/testcorpus"), full.names = T)
# load texts
texts <- sapply(paths, function(x){
  readLines(x)
})
# inspect first text element
texts[1]
```

A method achieving the same result which uses piping (more on what that is below) and tidyverse R code is shown below.

```{r load_texts_tidy, eval=T, echo=T, message=F, warning=F}
# define paths
texts <- list.files(here::here("data/testcorpus"), full.names = T, pattern = ".*txt") %>%
  purrr::map_chr(~ readr::read_file(.))
# inspect first text element
texts[1]
```

## Renaming, Piping, and Filtering {-}

To rename existing columns in a table, you can use the `rename` command, which takes the table as its first argument, then the new name, an equal sign (=), and finally the old name. For example, renaming a column *OldName* as *NewName* in a table called *MyTable* would look like this: `rename(MyTable, NewName = OldName)`.

Piping is done using the `%>%` sequence and it can be translated as **and then**. In the example below, we create a new object (icebio_edit) from the existing object (icebio) *and then* we rename the columns in the new object. When we use piping, we do not need to name the data we are using, as this is provided by the previous step.

```{r, message=FALSE, warning=FALSE}
icebio_edit <- icebio %>%
  dplyr::rename(Id = id,
                FileSpeakerId = file.speaker.id,
                File = colnames(icebio)[3],
                Speaker = colnames(icebio)[4])
# inspect data
icebio_edit[1:5, 1:6]
```

A very handy way to rename many columns simultaneously is to use the `str_to_title` function, which capitalizes the first letter of a word. In the example below, we capitalize all first letters of the column names of our current data.

```{r, message=FALSE, warning=FALSE}
colnames(icebio_edit) <- stringr::str_to_title(colnames(icebio_edit))
# inspect data
icebio_edit[1:5, 1:6]
```

To remove rows based on values in columns you can use the `filter` function.

```{r, message=FALSE, warning=FALSE}
icebio_edit2 <- icebio_edit %>%
  dplyr::filter(Speaker != "?",
                # keep only rows with non-missing Zone
                !is.na(Zone),
                Date == "2002-2005",
                Word.count > 5)
# inspect data
head(icebio_edit2)
```

To select specific columns you can use the `select` function.
```{r, message=FALSE, warning=FALSE}
icebio_selection <- icebio_edit2 %>%
  dplyr::select(File, Speaker, Word.count)
# inspect data
head(icebio_selection)
```

You can also use the `select` function to remove specific columns.

```{r, message=FALSE, warning=FALSE}
icebio_selection2 <- icebio_edit2 %>%
  dplyr::select(-Id, -File, -Speaker, -Date, -Zone, -Age)
# inspect data
head(icebio_selection2)
```

## Ordering data{-}

To order data, for instance, in ascending order according to a specific column, you can use the `arrange` function.

```{r, message=FALSE, warning=FALSE}
icebio_ordered_asc <- icebio_selection2 %>%
  dplyr::arrange(Word.count)
# inspect data
head(icebio_ordered_asc)
```

To order data in descending order, you can also use the `arrange` function and simply add a `-` before the column according to which you want to order the data.

```{r, message=FALSE, warning=FALSE}
icebio_ordered_desc <- icebio_selection2 %>%
  dplyr::arrange(-Word.count)
# inspect data
head(icebio_ordered_desc)
```

The output shows that the female speaker in file S2A-005 with the speaker identity A has the highest word count with 2,355 words.

***

EXERCISE TIME!

1. Using the data called `icebio`, create a new data set called `ICE_Ire_ordered` and arrange the data in descending order by the number of words that each speaker has uttered. Who is the speaker with the highest word count?

Answer

```{r}
ICE_Ire_ordered <- icebio %>%
  dplyr::arrange(-word.count)
# inspect data
head(ICE_Ire_ordered)
```

***

## Creating and changing variables{-}

New columns are created, and existing columns can be changed, by using the `mutate` function. The `mutate` function takes two arguments (if the data does not have to be specified): the first argument is the (new) name of the column that you want to create and the second is what you want to store in that column. The `=` tells R that the new column will contain the result of the second argument.

In the example below, we create a new column called *Texttype*. This new column should contain

+ the value *PrivateDialogue* if *Filespeakerid* contains the sequence *S1A*,
+ the value *PublicDialogue* if *Filespeakerid* contains the sequence *S1B*,
+ the value *UnscriptedMonologue* if *Filespeakerid* contains the sequence *S2A*,
+ the value *ScriptedMonologue* if *Filespeakerid* contains the sequence *S2B*,
+ the value of *Filespeakerid* if *Filespeakerid* contains neither *S1A*, *S1B*, *S2A*, nor *S2B*.

```{r, message=FALSE, warning=FALSE}
icebio_texttype <- icebio_selection2 %>%
  dplyr::mutate(Texttype = dplyr::case_when(stringr::str_detect(Filespeakerid, "S1A") ~ "PrivateDialogue",
                                            stringr::str_detect(Filespeakerid, "S1B") ~ "PublicDialogue",
                                            stringr::str_detect(Filespeakerid, "S2A") ~ "UnscriptedMonologue",
                                            stringr::str_detect(Filespeakerid, "S2B") ~ "ScriptedMonologue",
                                            TRUE ~ Filespeakerid))
# inspect data
head(icebio_texttype)
```

## If-statements{-}

We should briefly talk about if-statements (or `case_when` in the present case). The `case_when` function is both very powerful and extremely helpful as it allows you to assign values based on a test. As such, `case_when`-statements can be read as: *When/If X is the case, then do A, and if X is not the case, do B!* (When/If -> Then -> Else)

The nice thing about `ifelse` or `case_when`-statements is that they can be used in succession, as we have done above. This can then be read as: *If X is the case, then do A; if Y is the case, then do B; else do Z.*
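To see this logic in isolation, here is a minimal sketch with a made-up numeric vector `x`:

```{r, message=FALSE, warning=FALSE}
# made-up vector for illustration
x <- c(1, 5, 10)
# if x is smaller than 3, assign "A"; else, if x is smaller than 7, assign "B"; else assign "Z"
dplyr::case_when(x < 3 ~ "A",
                 x < 7 ~ "B",
                 TRUE ~ "Z")
```

***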

EXERCISE TIME!

1. Using the data called `icebio`, create a new data set called `ICE_Ire_AgeGroup` in which you create a column called `AgeGroup` where all speakers who are younger than 42 have the value *young* and all speakers aged 42 and over have the value *old*. **Tip**: use if-statements to assign the *old* and *young* values.

Answer

```{r}
ICE_Ire_AgeGroup <- icebio %>%
  dplyr::mutate(AgeGroup = dplyr::case_when(age == "42-49" ~ "old",
                                            age == "50+" ~ "old",
                                            age == "0-18" ~ "young",
                                            age == "19-25" ~ "young",
                                            age == "26-33" ~ "young",
                                            age == "34-41" ~ "young",
                                            TRUE ~ age))
# inspect data
head(ICE_Ire_AgeGroup); table(ICE_Ire_AgeGroup$AgeGroup)
```

***

## Summarizing data{-}

Summarizing is really helpful and can be done using the `summarise` function.

```{r, message=FALSE, warning=FALSE}
icebio_summary1 <- icebio_texttype %>%
  dplyr::summarise(Words = sum(Word.count))
# inspect data
head(icebio_summary1)
```

To get summaries of sub-groups or by variable level, we can use the `group_by` function and then use the `summarise` function.

```{r, warning=F, message=F}
icebio_summary2 <- icebio_texttype %>%
  dplyr::group_by(Texttype, Sex) %>%
  dplyr::summarise(Speakers = n(),
                   Words = sum(Word.count))
# inspect data
head(icebio_summary2)
```

***

EXERCISE TIME!

1. Use the `icebio` data and determine the number of words uttered by female speakers from Northern Ireland above the age of 50.

Answer

```{r}
words_fni50 <- icebio %>%
  dplyr::select(zone, sex, age, word.count) %>%
  dplyr::group_by(zone, sex, age) %>%
  dplyr::summarize(Words = sum(word.count)) %>%
  dplyr::filter(sex == "female",
                age == "50+",
                zone == "northern ireland")
# inspect data
words_fni50
```

2. Load the file *exercisedata.txt* and determine the mean scores of groups A and B. **Tip**: to extract the mean, combine the `summarise` function with the `mean` function.

Answer

```{r}
exercisedata <- read.delim(here::here("data", "exercisedata.txt"),
                           sep = "\t", header = T) %>%
  dplyr::group_by(Group) %>%
  dplyr::summarize(Mean = mean(Score))
# inspect data
exercisedata
```

***

## Gathering and spreading data{-}

The `tidyr` package has two very useful functions for gathering and spreading data that can be used to transform data to long and wide formats (you will see what this means below). The functions are called `gather` and `spread`.

We will use the data set called `icebio_summary2`, which we created above, to demonstrate how this works. We will first check out the `spread`-function to create different columns for women and men that show how many of them are represented in the different text types.

```{r, message=FALSE, warning=FALSE}
icebio_summary_wide <- icebio_summary2 %>%
  dplyr::select(-Words) %>%
  tidyr::spread(Sex, Speakers)
# inspect
icebio_summary_wide
```

The data is now in what is called a `wide`-format, as values are distributed across columns. To reformat this back to a `long`-format where each column represents exactly one variable, we use the `gather`-function:

```{r, message=FALSE, warning=FALSE}
icebio_summary_long <- icebio_summary_wide %>%
  tidyr::gather(Sex, Speakers, female:male)
# inspect
icebio_summary_long
```

# More on working with text{-}

We have now worked through how to load, save, and edit tabulated data. However, R is also perfectly equipped for working with textual data, which is what we are going to concentrate on now.

## Loading text data{-}

Text data can be loaded directly from the web. In this case, we will load the 2016 rally speeches of Donald Trump, which are stored as an `rda` file that we read with the `readRDS` function.

```{r, message=FALSE, warning=FALSE}
Trump <- base::readRDS(url("https://slcladal.github.io/data/Trump.rda", "rb"))
# inspect data
str(Trump)
```

It is very easy to extract frequency information and to create frequency lists. We can do this by first using the `unnest_tokens` function, which splits texts into individual words, and then using the `count` function to get the raw frequencies of all word types in a text.

```{r, message=FALSE, warning=FALSE}
Trump %>%
  tibble(text = SPEECH) %>%
  unnest_tokens(word, text) %>%
  dplyr::count(word, sort=T)
```

Extracting N-grams is also very easy, as the `unnest_tokens` function can take an argument called `token` in which we can specify that we want to extract n-grams. If we do this, then we need to specify the `n` as a separate argument. Below, we specify that we want the frequencies of all 4-grams.

```{r, message=FALSE, warning=FALSE}
Trump %>%
  tibble(text = SPEECH) %>%
  unnest_tokens(word, text, token="ngrams", n=4) %>%
  dplyr::count(word, sort=T) %>%
  head(10)
```

## Splitting-up texts{-}

We can use the `str_split` function to split texts. However, there are two issues when using this (very useful) function:

+ the pattern that we want to split on disappears
+ the output is a list (a special type of data format)

To remedy these issues, we

+ combine the `str_split` function with the `unlist` function
+ add something right at the beginning of the pattern that we use to split the text

To add something to the beginning of the pattern that we want to split the text by, we use the `str_replace_all` function. The `str_replace_all` function takes three arguments:

1. the **text**,
2. the **pattern** that should be replaced,
3. the **replacement**.

In the example below, we add `~~~` to the sequence `SPEECH` and then split on the `~~~` rather than on the sequence "SPEECH" (in other words, we replace `SPEECH` with `~~~SPEECH` and then split on `~~~`).
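Before applying this to the speeches, a minimal sketch on a made-up string shows both issues and the remedy (the string `toy` is invented for illustration):

```{r, message=FALSE, warning=FALSE}
# made-up string for illustration
toy <- "SPEECH one blah SPEECH two blah"
# issue: the split pattern disappears and the output is a list
stringr::str_split(toy, "SPEECH")
# remedy: mark the split points first, then split on the marker and unlist
unlist(stringr::str_split(stringr::str_replace_all(toy, "SPEECH", "~~~SPEECH"), "~~~"))
```

We now apply the same logic to the rally speeches.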
```{r, message=FALSE, warning=FALSE}
Trump_split <- unlist(str_split(
  stringr::str_replace_all(Trump, "SPEECH", "~~~SPEECH"),
  pattern = "~~~"))
# inspect data
nchar(Trump_split)#; str(Trump_split)
```

## Cleaning texts{-}

When working with texts, we usually need to clean the data. Below, we do some very basic cleaning using a pipeline.

```{r, message=FALSE, warning=FALSE}
Trump_split_clean <- Trump_split %>%
  # replace elements
  stringr::str_replace_all(fixed("\n"), " ") %>%
  # remove strange symbols
  stringr::str_replace_all("[^[:alnum:][:punct:]]+", " ") %>%
  # combine contractions
  stringr::str_replace_all(" re ", "'re ") %>%
  stringr::str_replace_all(" ll ", "'ll ") %>%
  stringr::str_replace_all(" d ", "'d ") %>%
  stringr::str_replace_all(" m ", "'m ") %>%
  stringr::str_replace_all(" s ", "'s ") %>%
  stringr::str_replace_all("n t ", "n't ") %>%
  # remove \"
  stringr::str_remove_all("\"") %>%
  # remove superfluous white spaces
  stringr::str_squish()
# remove very short elements
Trump_split_clean <- Trump_split_clean[nchar(Trump_split_clean) > 5]
# inspect data
nchar(Trump_split_clean)
```

We can also inspect individual text elements:

```{r, eval = F, echo=T, message=FALSE, warning=FALSE}
Trump_split_clean[5]
```

## Concordancing and KWICs{-}

Creating concordances or key-word-in-context (KWIC) displays is one of the most common practices when dealing with text data. Fortunately, there are ready-made functions that make this a very easy task in R. We will use the `kwic` function from the `quanteda` package to create KWICs here.

```{r, message=FALSE, warning=FALSE}
kwic_multiple <- quanteda::kwic(Trump_split_clean,
                                pattern = phrase("great again"),
                                window = 3,
                                valuetype = "regex") %>%
  as.data.frame()
# inspect data
head(kwic_multiple)
```

We can now also select concordances based on specific features. For example, we only want those instances of "great again" where the preceding word is "america".

```{r, message=FALSE, warning=FALSE}
kwic_multiple_select <- kwic_multiple %>%
  # last element before search term is "america"
  dplyr::filter(str_detect(pre, "america$"))
# inspect data
head(kwic_multiple_select)
```

Again, we can use the `write.table` function to save our KWICs to disc.

```{r, echo = T, eval = F, message=FALSE, warning=FALSE}
write.table(kwic_multiple_select, here::here("data", "kwic_multiple_select.txt"), sep = "\t")
```

As most of the data that we use is on our computers (rather than somewhere on the web), we now load files with text from your computer. It is important to note that you need to use `\\` when you want to load data from a Windows PC (rather than a single `\`). To load many files, we first create a list of all files in the directory that we want to load data from and then use the `sapply` function (which works just like a loop). The `sapply` function takes a vector of elements and then performs a sequence of steps on each of these elements. In the example below, we feed the file locations to the `sapply` function, scan each text (i.e. read it into R), and then paste all the content of one file together.

***

NOTE
You may have to change the path to the data!

***

```{r, message=FALSE, warning=FALSE}
files <- list.files(here::here("data", "ICEIrelandSample"),
                    pattern = ".txt",
                    full.names = T)
ICE_Ire_sample <- sapply(files, function(x) {
  x <- scan(x, what = "char")
  x <- paste(x, sep = " ", collapse = " ")
})
# inspect data
str(ICE_Ire_sample)
```

As the texts do not have column names (but simply names), we can clean these by removing everything before a `/` and by removing the `.txt`.

```{r, message=FALSE, warning=FALSE}
names(ICE_Ire_sample) <- names(ICE_Ire_sample) %>%
  stringr::str_remove_all(".*/") %>%
  stringr::str_remove_all(".txt")
# inspect
names(ICE_Ire_sample)
```

## Further splitting of texts{-}

To split the texts into speech units where each speech unit begins with the speaker that has uttered it, we again use the `sapply` function.

```{r, message=FALSE, warning=FALSE}
ICE_Ire_split <- as.vector(unlist(sapply(ICE_Ire_sample, function(x){
  x <- as.vector(str_split(str_replace_all(x, "(<S1A-)", "~~~\\1"), "~~~"))
})))
# inspect
head(ICE_Ire_split)
```

## Basics of regular expressions{-}

Next, we extract the file and speaker and combine the text, file, and speaker in a table. We use this to show the power of **regular expressions** (to learn more about regular expressions, have a look at this very recommendable [tutorial](https://stringr.tidyverse.org/articles/regular-expressions.html)). Regular expressions are symbols or sequences of symbols that stand for

+ patterns (e.g. `[a-z]` stands for any lowercase character)
+ frequency (e.g. `{1,3}` stands for between 1 and 3 instances)
+ classes (e.g. `[:punct:]` stands for any punctuation symbol)
+ structural properties (e.g. `[^[:blank:]]` stands for any non-space character, `\t` for a tab-stop, or `\n` for a line break)

We can not go into detail here and only touch upon regular expressions. The symbol `.` is one of the most powerful and universal regular expressions, as it represents (literally) any character. The `*` refers to any number of instances of the preceding pattern. Thus, `.*` stands for any number of any character. You can find an overview of regular expressions in R [here](https://slcladal.github.io/regex.html).

Also, if you put a sequence in round brackets, R will remember the sequence within the brackets, and you can paste the sequence back into the string from memory when you replace something (as shown in the sketch below). When referring to symbols that are used as regular expressions, such as `\` or `$`, we need to inform R that we actually mean the real symbol; we do this by typing `\\` before the symbol in question. In the example below, try to see what the regular expressions (`.*(S1A-[0-9]{3,3}).*`, `\n`, `.*\\$([A-Z]{1,2}\\?{0,1})>.*`) stand for.
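As a warm-up, a minimal sketch on a made-up speech unit shows what the capture group and the back-reference `\\1` do (the string `unit` is invented for illustration):

```{r, message=FALSE, warning=FALSE}
# made-up speech unit for illustration
unit <- "<S1A-001$A> well that is true"
# the round brackets capture S1A-001; "\\1" pastes the captured sequence back
stringr::str_replace_all(unit, ".*(S1A-[0-9]{3,3}).*", "\\1")
```

The full extraction of file and speaker follows below.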
```{r, message=FALSE, warning=FALSE}
ICE_Ire_split_tb <- ICE_Ire_split %>%
  as.data.frame()
# add column names
colnames(ICE_Ire_split_tb)[1] <- "Text"
# add file and speaker
ICE_Ire_split_tb <- ICE_Ire_split_tb %>%
  dplyr::filter(!str_detect(Text, ""),
                Text != "") %>%
  dplyr::mutate(File = str_replace_all(Text, ".*(S1A-[0-9]{3,3}).*", "\\1"),
                File = str_remove_all(File, "\\\n"),
                Speaker = str_replace_all(Text, ".*\\$([A-Z]{1,2}\\?{0,1})>.*", "\\1"),
                Speaker = str_remove_all(Speaker, "\\\n"))
```

```{r tc12b, echo = F, message=FALSE, warning=FALSE}
# inspect data
ICE_Ire_split_tb %>%
  as.data.frame() %>%
  head(10) %>%
  flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "") %>%
  flextable::border_outer()
```

## Combining tables{-}

We often want to combine different tables. This is very easy in R, and we will show how it can be done by combining our bio data about the speakers represented in the ICE Ireland corpus with the texts themselves, so that we get a table which holds both the text and the speaker information.

Thus, we now join the text data with the bio data by using the `left_join` function. We join the text with the bio data based on the contents of the File and the Speaker columns. In contrast to `right_join` and `full_join`, `left_join` will drop all rows from the *right* table that are not present in the *left* table (and vice versa for `right_join`). In contrast, `full_join` will retain all rows from both the left and the right table.

```{r, message=FALSE, warning=FALSE}
ICE_Ire <- dplyr::left_join(ICE_Ire_split_tb, icebio_edit,
                            by = c("File", "Speaker"))
```

```{r cb12b, echo = F, message=FALSE, warning=FALSE}
# inspect data
ICE_Ire %>%
  as.data.frame() %>%
  head(10) %>%
  flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "") %>%
  flextable::border_outer()
```

You can then perform concordancing on the Text column in the table.

```{r, message=FALSE, warning=FALSE}
kwic_iceire <- quanteda::kwic(ICE_Ire$Text,
                              pattern = phrase("Irish"),
                              window = 5,
                              valuetype = "regex") %>%
  as.data.frame()
```

```{r cb12c, echo = F, message=FALSE, warning=FALSE}
# inspect data
kwic_iceire %>%
  as.data.frame() %>%
  head(10) %>%
  flextable() %>%
  flextable::set_table_properties(width = .95, layout = "autofit") %>%
  flextable::theme_zebra() %>%
  flextable::fontsize(size = 12) %>%
  flextable::fontsize(size = 12, part = "header") %>%
  flextable::align_text_col(align = "center") %>%
  flextable::set_caption(caption = "") %>%
  flextable::border_outer()
```

## Tokenization and counting words{-}

We will now use the `tokenize_words` function from the `tokenizers` package to find out how many words each element of the cleaned Trump data contains. Before we count the words, however, we remove numbers and punctuation.

```{r, message=FALSE, warning=FALSE}
words <- as.vector(sapply(Trump_split_clean, function(x){
  x <- tm::removeNumbers(x)
  x <- tm::removePunctuation(x)
  x <- unlist(tokenize_words(x))
  x <- length(x)}))
words
```

The nice thing about the `tokenizers` package is that it also allows us to split texts into sentences.
To show this, we return to the rally speeches by Donald Trump and split one of the rally speeches into sentences.

```{r, message=FALSE, warning=FALSE}
Sentences <- unlist(tokenize_sentences(Trump_split_clean[6]))
# inspect
head(Sentences)
```

We now want to find associations between words. To do this, we convert all characters to lower case, remove (some) non-lexical words (also called stop words), remove punctuation and numbers as well as superfluous white spaces, and then create a document-term matrix (DTM) which shows how often any word occurs in any of the sentences (in this case, the sentences are treated as documents). Once we have a DTM, we can use the `findAssocs` function to see which words associate most strongly with target words that we want to investigate. We can use the argument `corlimit` to show the terms that are most strongly associated with our target words.

```{r, message=FALSE, warning=FALSE}
# clean sentences
Sentences <- Sentences %>%
  # convert to lowercase
  tolower() %>%
  # remove stop words
  tm::removeWords(stopwords("english")) %>%
  # remove punctuation
  tm::removePunctuation() %>%
  # remove numbers
  tm::removeNumbers() %>%
  # remove superfluous white spaces
  stringr::str_squish()
# create DTM
DTM <- DocumentTermMatrix(VCorpus(VectorSource(Sentences)))
findAssocs(DTM, c("problems", "america"), corlimit = c(.5, .5))
```

We now turn to data visualization basics.

# Working with figures{-}

There are numerous functions in R that we can use to visualize data. We will use the `ggplot` function from the `ggplot2` package here to visualize the data. The `ggplot2` package was developed by Hadley Wickham in 2005 and it implements the graphics scheme described in the book *The Grammar of Graphics* by Leland Wilkinson. The idea behind the *Grammar of Graphics* can be boiled down to 5 bullet points (see Wickham 2016: 4):

- a statistical graphic is a mapping from data to **aes**thetic attributes (location, color, shape, size) of **geom**etric objects (points, lines, bars).
- the geometric objects are drawn in a specific **coord**inate system.
- **scale**s control the mapping from data to aesthetics and provide tools to read the plot (i.e., axes and legends).
- the plot may also contain **stat**istical transformations of the data (means, medians, bins of data, trend lines).
- **facet**ing can be used to generate the same plot for different subsets of the data.

## Basics of ggplot2 syntax{-}

**Specify data, aesthetics and geometric shapes**

`ggplot(data, aes(x=, y=, color=, shape=, size=)) +`

`geom_point()`, or `geom_histogram()`, or `geom_boxplot()`, etc.

- This combination is very effective for exploratory graphs.
- The data must be a data frame.
- The `aes()` function maps columns of the data frame to aesthetic properties of the geometric shapes to be plotted.
- `ggplot()` defines the plot; the `geoms` show the data; each component is added with `+`.
- Some examples should make this clear.

## Practical examples{-}

We will now create some basic visualizations or plots. Before we start plotting, we will create the data that we want to plot. In this case, we will extract the mean word counts by gender and age.
```{r, message=FALSE, warning=FALSE}
plotdata <- ICE_Ire %>%
  # only private dialogue
  dplyr::filter(stringr::str_detect(File, "S1A"),
                # without speakers younger than 19
                Age != "0-18",
                Age != "NA") %>%
  dplyr::group_by(Sex, Age) %>%
  dplyr::summarise(Words = mean(Word.count))
# inspect
head(plotdata)
```

In the example below, we specify that we want to visualize the `plotdata` and that the x-axis should represent `Age` and the y-axis `Words` (the mean frequency of words). We also tell R that we want to group the data by `Sex` (i.e. that we want to distinguish between men and women). Then, we add `geom_line`, which tells R that we want a line graph. The result is shown below.

```{r, message=FALSE, warning=FALSE}
ggplot(plotdata, aes(x = Age, y = Words, color = Sex, group = Sex)) +
  geom_line()
```

Once you have a basic plot like the one above, you can prettify it. For example, you can

+ change the width of the lines (`size = 1.25`)
+ change the y-axis limits (`coord_cartesian(ylim = c(0, 1500))`)
+ use a different theme (`theme_bw()` means black and white theme)
+ move the legend to the top
+ change the default colors to colors you like (`scale_color_manual ...`)
+ change the linetype (`scale_linetype_manual ...`)

```{r, message=FALSE, warning=FALSE}
ggplot(plotdata, aes(x = Age, y = Words,
                     color = Sex, group = Sex, linetype = Sex)) +
  geom_line(size = 1.25) +
  coord_cartesian(ylim = c(0, 1500)) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_manual(breaks = c("female", "male"),
                     values = c("gray20", "gray50")) +
  scale_linetype_manual(breaks = c("female", "male"),
                        values = c("solid", "dotted"))
```

An additional and very handy feature of this way of producing graphs is that you

+ can integrate them into pipes
+ can easily combine plots.

```{r, message=FALSE, warning=FALSE}
ICE_Ire %>%
  dplyr::filter(Sex != "NA",
                Age != "NA",
                is.na(Sex) == F,
                is.na(Age) == F) %>%
  dplyr::mutate(Age = factor(Age),
                Sex = factor(Sex)) %>%
  ggplot(aes(x = Age, y = Word.count, color = Sex, linetype = Sex)) +
  geom_point() +
  stat_summary(fun=mean, geom="line", aes(group=Sex)) +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_color_manual(breaks = c("female", "male"),
                     values = c("indianred", "darkblue")) +
  scale_linetype_manual(breaks = c("female", "male"),
                        values = c("solid", "dotted"))
```

You can also create different types of graphs very easily and split them into different facets.

```{r, message=FALSE, warning=FALSE}
ICE_Ire %>%
  drop_na() %>%
  dplyr::filter(Age != "NA") %>%
  dplyr::mutate(Date = factor(Date)) %>%
  ggplot(aes(x = Age, y = Word.count, fill = Sex)) +
  facet_grid(vars(Date)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_fill_manual(breaks = c("female", "male"),
                    values = c("#E69F00", "#56B4E9"))
```

***

EXERCISE TIME!

1. Create a box plot showing the `Date` on the x-axis and the words uttered by speakers on the y-axis, and group by `Sex`.

Answer

```{r}
ICE_Ire %>%
  drop_na() %>%
  dplyr::filter(Sex != "NA") %>%
  dplyr::mutate(Date = factor(Date)) %>%
  ggplot(aes(x = Date, y = Word.count, fill = Sex)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_bw() +
  theme(legend.position = "top") +
  scale_fill_manual(breaks = c("female", "male"),
                    values = c("#E69F00", "#56B4E9"))
```

2. Create a scatter plot showing the `Date` on the x-axis and the words uttered by speakers on the y-axis and create different facets for `Sex`.

Answer

```{r}
ICE_Ire %>%
  drop_na() %>%
  dplyr::filter(Sex != "NA",
                Date != "NA") %>%
  dplyr::mutate(Date = factor(Date),
                Sex = factor(Sex)) %>%
  ggplot(aes(Date, Word.count, color = Date)) +
  facet_wrap(vars(Sex), ncol = 2) +
  geom_point() +
  coord_cartesian(ylim = c(0, 2000)) +
  theme_bw() +
  scale_color_manual(breaks = c("1990-1994", "2002-2005"),
                     values = c("#E69F00", "#56B4E9"))
```

***

**Advanced**

Create a bar plot showing the number of men and women by `Date`.

Solution

```{r, echo = F, eval = F, message=FALSE, warning=FALSE}
ICE_Ire %>%
  drop_na() %>%
  dplyr::select(Date, Sex, File) %>%
  unique() %>%
  dplyr::filter(Sex != "NA",
                Date != "NA") %>%
  dplyr::group_by(Date, Sex) %>%
  dplyr::summarize(NumberOfSpeakers = n()) %>%
  ggplot(aes(Date, NumberOfSpeakers, fill = Date)) +
  facet_wrap(vars(Sex), ncol = 2) +
  geom_bar(stat = "identity") +
  coord_cartesian(ylim = c(0, 20)) +
  theme_bw() +
  scale_fill_manual(breaks = c("1990-1994", "2002-2005"),
                    values = c("#E69F00", "#56B4E9"))
```

***

# Ending R sessions{-}

At the end of each session, you can extract information about the session itself (e.g. which R version you used and which versions of packages). This can help others (or even your future self) to reproduce the analysis that you have done.

## Extracting session information{-}

You can extract the session information by running the `sessionInfo` function (without any arguments).

```{r}
sessionInfo()
```

# Going further{-}

If you want to know more, would like to get some more practice, or would like to try another approach to R, please check out the workshops and resources on R provided by the [UQ library](https://web.library.uq.edu.au/library-services/training). In addition, there are various online resources available to learn R (you can check out a very recommendable introduction [here](https://uvastatlab.github.io/phdplus/intror.html)). Here are some additional resources that you may find helpful:

* Grolemund, G., and Wickham, H., [*R 4 Data Science*](http://r4ds.had.co.nz/), 2017.
  + Highly recommended! (especially chapters 1, 2, 4, 6, and 8)
* Stat545 - Data wrangling, exploration, and analysis with R. University of British Columbia.
* Swirlstats, a package that teaches you R and statistics within R.
* DataCamp's (free) *Intro to R* interactive tutorial.
  + DataCamp's advanced R tutorials require a subscription.
* Twitter:
  + Explore RStudio Tips https://twitter.com/rstudiotips
  + Explore #rstats, #rstudioconf

# Citation & Session Info {-}

Schweinberger, Martin. 2022. *Getting started with R*. Brisbane: The University of Queensland. url: https://ladal.edu.au/intror.html (Version 2022.11.15).

```
@manual{schweinberger2022intror,
  author = {Schweinberger, Martin},
  title = {Getting started with R},
  note = {https://ladal.edu.au/intror.html},
  year = {2022},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2022.11.15}
}
```

```{r fin}
sessionInfo()
```

***

[Back to top](#introduction)

[Back to LADAL home](https://ladal.edu.au)

***

# References {-}