Take Home Exercise #1: Participants in Engagement Ohio

We study a sample of 1011 residents from Engagement, Ohio USA. The information collected is as follows:

participantId: unique ID assigned to each participant.

householdSize: the number of people in the participant’s household

haveKids: whether there are children living in the participant’s household

age: participant’s age in years at the start of the study

educationLevel: the participant’s educ2ation level, namely Low, HighSchoolorCollege, Bachelors or Graduate

interestGroup: a char representing the participant’s stated primary interest-A, B, C, D, E, F, G, H, I, J.

joviality: a value ranging from 0 to 1 indicating the participant’s overall happiness level at the start of the study.

Author

Affiliation

David Kwok

Published

April 23, 2022

DOI

Challenges

There is not an even distribution of demographics by interest group, age or education. This could be representative of the larger population, however it makes identifying specific demographic indicators on happiness and coorelation of demographic traits a little more challenging

Installing and launching R packages

In order to run generate the data, we need to first install and run the below packages with the below script:

packages <- c('tidyverse','rvest','reshape2','ggtern','viridis','ggrepel','CGPfunctions','ggpubr')

for (p in packages){
  if (!require(p,character.only=T)){
    install.packages(p)
  }
  library(p, character.only=T)
}

Data Preparation and tidying

The data is imported from a local copy of the file Participants.csv using the function read_csv from the readr package, then inspected using the str() function using the below code:

Participant_data <- read_csv("Data/Participants.csv")
str(Participant_data)

spec_tbl_df [1,011 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
 $ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
 $ haveKids      : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ age           : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
 $ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
 $ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
 $ joviality     : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   householdSize = col_double(),
  ..   haveKids = col_logical(),
  ..   age = col_double(),
  ..   educationLevel = col_character(),
  ..   interestGroup = col_character(),
  ..   joviality = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

To facilitate clarity and good house keeping, we renamed the columns to “Participant_ID, Household_Size,”Kids”, “Age”, Education”, “Interest” and “Joviality” using the code below:

 Participant_data <- Participant_data %>% rename(`Participant_ID` = `participantId`,`Household_Size` = `householdSize`, `Kids` = `haveKids`,`Age`=`age`,`Education` =`educationLevel` , `Interest`= `interestGroup`, `Joviality`=`joviality`)
colnames(Participant_data)

[1] "Participant_ID" "Household_Size" "Kids"          
[4] "Age"            "Education"      "Interest"      
[7] "Joviality"

Visualizations

We proceed to do preliminary analysis on the data provided by trying to establish demographic details and some things that impact the happiness (or joviality) of the participants ## Demographic Details

We endevor to identify key demographics of the population, we plot a set of charts illustrating the number of people with a given age with a given level of education. We also try to identify the interest in the respective activities broken down by education level. Results are as follows:

ggplot(data=Participant_data, 
       aes(x=Age)) +
    geom_histogram(bins=20) +
facet_wrap(~ Education)

The above charts shows a higher incidence of high school or college grads and a significant blip in the population around age 30. Also , there is a relatively small number of lowly educated people.

ggplot(data=Participant_data, aes(x=Education)) + 
     geom_bar()+facet_wrap(~ Interest) +
coord_flip()

The above charts show a even distribution of high school or college grads across the activities. There is a notable uptick in interest by those with a bachelors in activity D and J. There is also a low interest in activity D and F by the lowly educated. Graduates seem more interested in activity G. Noteworthy is there is no one activity that holds a more than 20% share of interest from any single education level, so interests seems very balanced.

Happiness by Interest Group and Education

We proceed to see the happiness level of the sample. We are doing this through box plots of Joviality against activity, diagrammed by education level to try comparing averages and extremes (100th and 1st percentiles).

ggplot(data=Participant_data, aes(y = Joviality, x= Interest))+ 
 geom_boxplot() +
 facet_grid(~Education)

The Bachelors educated have the largest inter-quartile distribution for most of the activities. They also have a lower happiness for the typical user for activity J. Graduates are quite unhappy in activity A. High school or college tend to have a very even inter-quartile distribution independent of activity. We find that the happiness of the low education group is very heavily dependent on the activity. F is by far the worst performing group, with G having a notably low rate of happiness for the typical user. For interest A we notice a distinctly poor 2nd quartile performance.

Happiness by education

We explore the impact or possible coorelation of education with happiness through charting a box plot of Joviality against education

ggplot(data=Participant_data, aes(y = Joviality, x= Education))+
     geom_boxplot()

We find that the typical low education is less than 0.05 unhappier than the best scoring graduates. There appears to be a slight coorelation between higher education and happiness.