Feedback and redone LIU Zhenglin’s a takehome exercise.
There is the addition of a preliminary analysis done on the raw data to understand the details and possible constraints of data. This also scopes out possible errors in formatting, clarity and potential data cleaning needed. Many different plots are used to provide different prospectives to the end users. There is good use of colours in the plots which ensure clear segregation of different components of data.
The overview of the study is abit brief and lacks an introduction to the data captured such as origin or background behind the data. With 6 different plots provided, there is a significant chance of ‘paralysis by analysis’. We will endevor to streamline the focus to try and draw clearer observations. Also,constraints of the data could have been called out.
We have collected data from the city of Engagement, Ohio USA. The information collected is as follows:
participantId: unique ID assigned to each participant.
householdSize: the number of people in the participant’s household
haveKids: whether there are children living in the participant’s household
age: participant’s age in years at the start of the study
educationLevel: the participant’s educ2ation level, namely Low, HighSchoolorCollege, Bachelors or Graduate
interestGroup: a char representing the participant’s stated primary interest-A, B, C, D, E, F, G, H, I, J.
joviality: a value ranging from 0 to 1 indicating the participant’s overall happiness level at the start of the study.
We will endeavor to find the coorelation between some of these demographic characteristics and joviality. Specificially for this, we will be looking at age, presence of kids, household size and possible impacts on happiness.
The packages required are ggplot2, readr and dplyr. These are all captured in the package tidyverse and the devtool introdataviz function, which we will load using the below script:
devtools::install_github("psyteachr/introdataviz")
library(introdataviz)
We place the source file Participants.csv in a local data folder and use the read_csv function from the readr package and save it as participants using the following code chunk:
participants <- read_csv("Data/Participants.csv")
glimpse(participants)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
We find that the data consists of 1011 entries of 7 fields.
To have a general understand of the data we have and what might have coorelations, we develop plots of: 1) Household size 2) Age distribution 3) Presence of kids 4) Education level 5) Interest 6) Distribution of happiness
This is done with the geom_bar() function in the ggplot2 package
p_householdsize <- ggplot(data = participants,
aes(x = householdSize))+
geom_bar(color="grey25",
fill="grey90") +
ggtitle("Household Sizes")
p_havekids <- ggplot(data=participants,
aes(x = haveKids)) +
geom_bar(color="grey25",
fill="grey90") +
ggtitle("Distribution of Kids")
p_age <- ggplot(data=participants,
aes(x = age)) +
geom_histogram(boundary = 100,
color="grey25",
fill="grey90") +
coord_cartesian(xlim=c(16, 61)) +
ggtitle("Age Distribution")
p_edu <- ggplot(data=participants,
aes(x = educationLevel)) +
geom_bar(boundary = 100,
color="grey25",
fill="grey90") +
ggtitle("Distribution of Education Level")
p_interest <- ggplot(data=participants,
aes(x = interestGroup)) +
geom_bar(boundary = 100,
color="grey25",
fill="grey90") +
ggtitle("Distribution of Interests")
p_joviality<- ggplot(data=participants,
aes(x = joviality)) +
geom_density() +
ggtitle("Distribution of Happiness")
(p_householdsize/p_havekids/p_interest)|(p_age/p_edu/p_joviality)
This preliminary look shows that most do not have kids, with about half the population being high school or college educated. There is no clearly dominant interest or level of happiness. Age seems to have alternating bands, which could suggest a seasonality to the birthrates of individuals in the area.
To facilitate clarity and good house keeping, we renamed the columns to “Participant_ID”, “Household_Size”,“Kids”, “Age”, “Education”, “Interest” and “Joviality” using the code below:
participants <- participants %>%rename('Participant_ID' = 'participantId','Household_Size' = 'householdSize', 'Kids' = 'haveKids','Age'='age','Education' ='educationLevel' , 'Interest'= 'interestGroup', 'Joviality'='joviality')
colnames(participants)
[1] "Participant_ID" "Household_Size" "Kids"
[4] "Age" "Education" "Interest"
[7] "Joviality"
Also, in order to create clean breaks in the ages, we binned the ages by significant legal milestones (Age 21) and subsequently evenly by 5 year intervals from age 25. We did this with the mutate function with the following code chunk:
We run the below code chunk to provide insight into the status of kids in the various households:
ggplot(data = participants_Age,
aes(x = Household_Size, fill = Kids))+
geom_bar()+
ggtitle("Household sizes with (or without) kids")
Interestingly enough, there are no households with less than 3 people with kids. This could suggest a low presence of single parenthood in Engagement. A potential leading indicator of lower rates of delinquency (with an assumed lower happiness) compared to places with a higher presence of single parent households.
We also endeavor to understand happiness against household sizes by plotting the happiness against household size with the following code chunk:
ggplot(data=participants,
aes(y = Joviality))+
geom_boxplot()+
facet_wrap(~ Household_Size)+
ggtitle("Happiness by household size")
Based on this plot, we can see that it seems that living alone tends to correlate with a lower average happiness, having more than 2 does not seem to significantly improve happiness of a typical person.This, when combined with the earlier plot, we can observe that children do not significantly impact happiness.
It stands to reason that the observations in the previous section are could be a result of larger households tending to have older people who are at a more stable (and hence happier)stage of life. We investigate this with the overlaying of age against household sizes.
ggplot(data = participants_Age,
aes(x = Age_group))+
geom_bar(width=0.5) +
facet_wrap(~Household_Size)+
theme(axis.text.x = element_text(angle = 90))
ggtitle("Household size by Age")
$title
[1] "Household size by Age"
attr(,"class")
[1] "labels"
We find that our earlier hypothesis is not accurate. We also notice that there is a significant frequency of young parents due to the comparable number of 17 to 25 year olds with kids. We then proceed to see if happiness potentially has any influence by age and presence of kids with a box and violin plot from the below code chunk:
ggplot(participants_Age, aes(x = Age_group, y = Joviality, fill = Kids)) +
introdataviz::geom_split_violin(alpha = .4, trim = FALSE) +
geom_boxplot(width = .2, alpha = .6, fatten = NULL, show.legend = FALSE) +
stat_summary(fun.data = "mean_se", geom = "pointrange", show.legend = F,
position = position_dodge(.175)) +
scale_y_continuous(breaks = seq(0, 1, 0.2),
limits = c(0, 1)) +
scale_fill_brewer(palette = "Dark2", name = "Have Kids")
Given the population sizes, it is very difficult to conclude any concrete trends on this analysis, apart from the observation that typical happiness tends to hover around around 0.5 and does not appear to be strong coorelated with age and race or presence of kids.
Based on our study, the community seems to be of average happiness, a larger sample would be needed to draw more concrete observation with relation to demographic factors and happiness.
More study can also be done on the education status and other factors, which was not covered in the scope of this study.
Demographically, it does seem that all children are from households with more than 1 other person. Which seems to be a good indicator of societal health, due to lower potential for single parent households.
Also, there is a noticeable proportion of young parents (ages 17-25) of almost a third of the population of that age group.
There is also a even balance of adults, with a slight drop in young adults, which could be an indicator of future problems with an aging population.