This is Part II in Scout from University of California’s three-part series, Gradebook analysis with R. If you missed Part I, read it here. You can also read Part III here.
In Part I, we worked through loading, cleaning, tidying and summarizing gradebooks exported from an LMS. While it’s true that approximately 80% of data analysis is spent wrangling and cleaning data, it’s only 20% of the fun. In our humble opinion, the real excitement and payoff come with plotting, inspecting and sharing visual patterns in your data. That’s what this post is all about.
Toward that end, there are numerous ways to construct plots in R. There is, however, a loose consensus within the R community that the ggplot2 package both offers the greatest flexibility and produces the most publication-ready graphics. True, the syntax takes some getting used to, but the results are rewarding. Here I cosign with David Robinson, data scientist and staunch advocate of ggplot: “Don’t teach built-in plotting; teach ggplot2.”
Installation is easy:
install.packages("ggplot2")
library(ggplot2)
There are dozens of quality ggplot2 tutorials around the web, and we won’t add to the surplus here. Rather, we hope to share some ideas for extracting insight from your gradebooks through histograms, scatter plots and box plots. The R code will be provided in each instance, but with somewhat limited commentary. This post will also demonstrate the creation of custom functions which can save a great deal of unnecessary copying and pasting.
In Part II we’re going to sift through a new gradebook from a recent AP Calculus BC course. We’ve already cleaned, subsetted, and tidied this gradebook, as well as joined additional data from our SIS using the same steps from Part I:
library(readr)
library(dplyr)
library(stringr)
library(tidyr)
library(magrittr)
apcalcbc <- read_csv("./gradebooks/apcalcbc111006.csv", skip = 7) %>%
  select(`Student Name`,
         `Calculated Grade`,
         contains("Final"),
         contains("Homework"),
         contains("Quiz"),
         contains("Comment"))
names(apcalcbc)[3:ncol(apcalcbc)] %<>%
  str_sub(start = 1, end = -8) %>% #Drop the trailing seven characters from each item name
  str_trim()
apcalcbc %<>%
  rename(Final = `Semester 1 Final Exam`,
         Student = `Student Name`,
         Grade = `Calculated Grade`) %>%
  gather(Item, Score, 4:ncol(.)) %>% #Gather every column after Final so that Final stays a column of its own
  mutate(Type =
           ifelse(grepl("Homework", Item), "Homework",
                  ifelse(grepl("Open Response", Item), "Open Response Quiz", "Quiz")),
         Result =
           ifelse(Grade > 60, "Passed", "Failed"),
         Unit =
           ifelse(grepl("Unit 1", Item), "Unit 1",
                  ifelse(grepl("Unit 2", Item), "Unit 2",
                         ifelse(grepl("Unit 3", Item), "Unit 3", "Unit 4")))) %>%
  left_join(sis_data, by = "Student")
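The nested ifelse() calls above get harder to read as the number of categories grows. As a minimal sketch of an alternative, the same labeling can be written with dplyr’s case_when(). The item names below are invented for illustration and are not taken from the real gradebook:

```r
library(dplyr)

# Hypothetical item names, invented for illustration
items <- data.frame(
  Item = c("Unit 1: Homework 01",
           "Unit 2: Open Response Quiz 03",
           "Unit 4: Quiz 02"),
  stringsAsFactors = FALSE
)

classified <- items %>%
  mutate(
    # Conditions are checked top to bottom; the first match wins
    Type = case_when(
      grepl("Homework", Item)      ~ "Homework",
      grepl("Open Response", Item) ~ "Open Response Quiz",
      TRUE                         ~ "Quiz"
    ),
    Unit = case_when(
      grepl("Unit 1", Item) ~ "Unit 1",
      grepl("Unit 2", Item) ~ "Unit 2",
      grepl("Unit 3", Item) ~ "Unit 3",
      TRUE                  ~ "Unit 4"
    )
  )

print(classified)
```

As with the nested version, order matters: “Open Response Quiz” also contains the word “Quiz”, so the more specific pattern has to come first.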
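Some explanation of the new variables within apcalcbc: Type labels each item as Homework, Open Response Quiz or Quiz; Result marks each student as Passed or Failed based on the calculated Grade; and Unit records the unit each item belongs to. Additional student variables, such as the Gender and Ethnicity used later in this post, come from the joined SIS data.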
Viewing Distributions with Histograms
Histograms are useful for observing the class’s overall performance on a particular exam, quiz or homework assignment. The shape of the distribution is important: for most assignments, we would hope to see a unimodal, bell curve shape with scores clustering around the mean and median. Using the geom_histogram() “geom” (plots in ggplot2 are created with various geoms), here we create a histogram of the final exam scores:
library(ggplot2)
plot_data <- apcalcbc %>%
  distinct(Student, .keep_all = TRUE) %>% #One row per student; .keep_all retains the other columns
  filter(Final > 0) #Removes students who did not take the final
ggplot(plot_data, #The first argument of ggplot() is the name of the dataset
       aes(Final)) + #The second argument is the aesthetic mapping where you specify the x and y axes
  #Here the "Final" variable will be plotted on the x axis
  #We add "geoms" and elements to the plot with "+"
  geom_histogram(binwidth = 5, #Optional argument to adjust histogram bin width
                 fill = "lightgreen", #Optional argument to adjust fill color
                 color = "black") + #Optional argument to adjust border color
  ggtitle("Histogram of Final Exam Scores") + #Optional plot title
  theme_minimal() #Optional theme that removes extra grid lines and background color
There is nothing overly troubling about the shape of this distribution. It is loosely normal and a tad left-skewed, and we might question the distribution’s center, but the deviation in scores is about what we would expect. Furthermore, if you consider the course’s online, AP context, the concentration of high scores is less surprising. Online AP courses probably attract higher achieving students.
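A quick numerical companion to eyeballing the histogram is to compare the mean and the median: when a handful of low scores drags the mean below the median, the distribution leans left. This is a rough heuristic rather than a formal test, and the scores below are invented for illustration:

```r
# Hypothetical final exam scores, invented for illustration
final_scores <- c(52, 68, 74, 80, 83, 85, 88, 90, 92, 95)

score_mean   <- mean(final_scores)
score_median <- median(final_scores)

# A mean below the median suggests a longer tail of low scores (left skew)
left_skewed <- score_mean < score_median

print(c(mean = score_mean, median = score_median))
print(left_skewed)
```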
A standalone histogram is of limited use when studying gradebooks, and retyping or copying the code for each exam, quiz or assignment would take forever. Suppose, for example, we want to compare the distributions from multiple assignments, such as all quizzes or homework. To do so, we can create a facet grid with + facet_wrap(~Item).
Now let’s plot the histogram facet grid using only the quizzes from Unit 3:
plot_data <- apcalcbc %>%
  filter(Type == "Quiz",
         Unit == "Unit 3",
         Score > 0)
ggplot(plot_data,
       aes(Score)) +
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  ggtitle("Unit 3 Quiz Scores") +
  facet_wrap(~Item) +
  theme_minimal()
Here it might be worth noting that the center of Unit 3: Quiz 06 is different from the other quizzes, but that could be due to a simple change in maximum score. We could have plotted a histogram for every quiz by removing the Unit == "Unit 3" argument, but that would have flooded our plot viewer with dozens of tiny, illegible plots. Filtering by Unit is preferable in this respect.
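If a shifted center really is just a change in maximum score, one way to check is to put every quiz on a common percentage scale before plotting. The sketch below assumes a MaxScore column holding each item’s maximum possible score. That column is hypothetical and is not part of the gradebook built above, and the data is invented for illustration:

```r
library(dplyr)

# Hypothetical quiz scores; MaxScore is an assumed column, not part of the original gradebook
quiz_scores <- data.frame(
  Item     = c("Quiz 05", "Quiz 05", "Quiz 06", "Quiz 06"),
  Score    = c(8, 10, 16, 20),
  MaxScore = c(10, 10, 20, 20),
  stringsAsFactors = FALSE
)

# Express each score as a percentage of the item's maximum
quiz_percent <- quiz_scores %>%
  mutate(Percent = 100 * Score / MaxScore)

print(quiz_percent)
```

Plotting Percent instead of Score would then make the facets directly comparable.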
So now we’re interested in the quiz and homework distributions from other units, but there’s a problem: the thought of copying and pasting the same plot code and changing some filters and plot titles along the way is repellent. The process is not only cumbersome and time-consuming, but it lacks panache, an unspoken, essential quality of all data analysis. Fortunately, we can save a great deal of time and typing by creating our own custom function that plots our desired histograms with minimal effort.
User-defined functions are one of R’s signature advantages. The rule of thumb is that if you find yourself copying and pasting your code more than once, you should create your own function.
Our new function, gradebook_histograms(), will have three arguments: the tidy dataset, the gradebook type and a specified unit. The function will then use those three inputs to create a facet grid with the desired histograms that includes a red line marking the distribution’s median:
gradebook_histograms <- function(tidy_gradebook, type, unit) {
  plot_data <- tidy_gradebook %>%
    filter(Score > 0,
           Type == type, #Filter by type input
           Unit == unit) %>% #Filter by unit input
    group_by(Item) %>%
    summarize(Median = median(Score)) %>% #Calculate median score
    left_join(tidy_gradebook, by = "Item") #Join the summarized, grouped medians back onto the dataset
  print(
    ggplot(plot_data, aes(Score)) +
      geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
      geom_vline(data = plot_data, aes(xintercept = Median), #Plot the vertical red line
                 color = "red", linetype = "dashed") + #Optional aesthetic arguments
      facet_wrap(~Item) +
      labs(x = paste(type, "Score"), y = "Count") + #Axes labels determined by function input
      ggtitle(paste(unit, type, "Scores")) +
      theme_minimal()
  )
}
Now we tell R: “Show us the AP Calculus BC score distributions for all quizzes in Unit 1:”
gradebook_histograms(apcalcbc, "Quiz", "Unit 1")
Voila! Thanks to our new function, the line between man and machine has been further blurred. If we want to view the distributions of homework scores or quiz scores from another unit or even another gradebook, we just have to change a few arguments. And if we know in advance that we want to see all the quiz plots for each unit anyway, we can call gradebook_histograms() iteratively within a for loop. We won’t run the following code, but it creates the histograms for every quiz from each Unit:
for (unit in unique(apcalcbc$Unit)) { #This specifies four iterations, one for each unit
  gradebook_histograms(apcalcbc, "Quiz", unit)
}
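One caveat about gradebook_histograms() as written: joining the medians back onto the full tidy_gradebook brings back every row for those items, including the zero scores that the filter had removed. A sketch of one way around this is to compute the median with a grouped mutate(), which keeps only the filtered rows. The toy gradebook below is invented for illustration:

```r
library(dplyr)

# Hypothetical tidy gradebook, invented for illustration
toy_gradebook <- data.frame(
  Student = c("A", "B", "C", "A", "B", "C"),
  Item    = c("Unit 1: Quiz 01", "Unit 1: Quiz 01", "Unit 1: Quiz 01",
              "Unit 1: Quiz 02", "Unit 1: Quiz 02", "Unit 1: Quiz 02"),
  Type    = "Quiz",
  Unit    = "Unit 1",
  Score   = c(0, 8, 10, 6, 7, 9),
  stringsAsFactors = FALSE
)

plot_data <- toy_gradebook %>%
  filter(Score > 0, Type == "Quiz", Unit == "Unit 1") %>%
  group_by(Item) %>%
  mutate(Median = median(Score)) %>% # Adds each item's median to its remaining rows
  ungroup()

print(plot_data)
```

Swapping this pipeline into the function body in place of the summarize() and left_join() steps should leave the rest of the plotting code unchanged.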
Scatter Plots with Two Continuous Variables
A curricular review is a Gordian knot of students and scores, teachers and technology, but another promising entry point is to consider how individual gradebook items relate to summative assessments such as a final exam. Correlation is not causation, but a series of scatter plots can flag potential issues and act as a springboard into other inquiries. An activity with little to no relationship to final exam performance is perhaps cause for concern.
Let’s create another function called gradebook_scatter_plots() that plots each student’s Item score on the x-axis and their Final score on the y-axis. We’ll also add a smoother line with geom_smooth(method = "lm") that captures the line of best fit for each plot. For additional customization, we’ll pass an ellipsis (...) into the function that provides room for additional, optional arguments. For example, if I want the color of the scatter plot points to identify each student’s gender, I can add color = "Gender" to my call of gradebook_scatter_plots(). And if I want the shape of the scatter plot points to identify each student’s ethnicity, I can add shape = "Ethnicity". Here we create the function using the geom geom_point():
gradebook_scatter_plots <- function(tidy_gradebook, type, unit, ...) {
  plot_data <- tidy_gradebook %>%
    filter(Type == type,
           Unit == unit,
           Score > 0,
           Final > 0)
  print(
    ggplot(plot_data, aes_string("Score", "Final", ...)) + #The ellipsis goes where additional arguments are provided
      #Because an additional mapping can be provided as a string, we use aes_string()
      geom_point() +
      facet_wrap(~Item) +
      geom_smooth(method = "lm") + #Specifies line of best fit with shaded area representing 95% confidence interval
      labs(y = "Final Exam Score", x = paste(type, "Score")) +
      ggtitle(paste(unit, type, "Scores ~ Final Exam")) +
      theme_minimal()
  )
}
With our new function, let’s view the scatter plots for all quizzes in Unit 1:
gradebook_scatter_plots(apcalcbc, "Quiz", "Unit 1")
We observe a consistent positive relationship between quiz scores and final exam scores. The assumption here is that there is at least some overlap in the skills and knowledge required to succeed on the quizzes, and the skills and knowledge required to succeed on the final exam. An assignment that showed little to no positive correlation would raise some eyebrows, as would a concentration of points located outside the shaded confidence intervals. Here the relatively small sample size and the small variation in quiz scores probably explain the phenomenon, but these are oddities worth investigating, and the purpose of the plots (along with the rest of our exploratory data analysis) is to flag potential issues within the curriculum. The scatter plots accomplish precisely this with remarkable speed and ease.
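To put a number on what the scatter plots show, the correlation between item scores and final exam scores can be computed per item. The following is a minimal sketch on invented data, not the real gradebook. With only a few students per item, the values should be read as rough flags rather than conclusions:

```r
library(dplyr)

# Hypothetical scores, invented for illustration
toy_scores <- data.frame(
  Item  = rep(c("Quiz 01", "Quiz 02"), each = 4),
  Score = c(6, 7, 8, 9, 9, 7, 8, 6),
  Final = c(60, 70, 80, 90, 60, 70, 80, 90),
  stringsAsFactors = FALSE
)

# Pearson correlation between item score and final exam score, one row per item
item_correlations <- toy_scores %>%
  group_by(Item) %>%
  summarize(Correlation = cor(Score, Final))

print(item_correlations)
```

In this toy data, the second quiz has a negative correlation with the final, which is exactly the kind of item we would want to inspect more closely.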
Let’s pass an additional argument to our function (color = "Gender") and view the plots with the color of each point identifying each student’s Gender:
gradebook_scatter_plots(apcalcbc, "Quiz", "Unit 1", color = "Gender")
Again, the sample size is too small to make any generalization about gender performance. What is important to see here is the process and method of viewing your data across multiple variables, some continuous, some categorical.
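A note for readers on recent ggplot2 releases: aes_string() is deprecated there, and aes() can forward the ellipsis directly, so the extra mapping is passed unquoted (color = Gender rather than color = "Gender"). The sketch below shows the idea on invented data. The function name scatter_sketch() is hypothetical and is not part of the original post:

```r
library(ggplot2)

# Hypothetical data, invented for illustration
toy_scores <- data.frame(
  Score  = c(6, 7, 8, 9, 10, 5),
  Final  = c(62, 70, 78, 91, 95, 58),
  Gender = c("F", "M", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# scatter_sketch() is a hypothetical name used only for this example
scatter_sketch <- function(data, ...) {
  ggplot(data, aes(Score, Final, ...)) + # Extra mappings are forwarded unquoted
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    theme_minimal()
}

p <- scatter_sketch(toy_scores, color = Gender)
```

Calling print(p) draws the plot. The rest of gradebook_scatter_plots() (filtering, faceting, labels) would carry over unchanged.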
Comparing Distributions across Categorical Variables with Box Plots
A box plot is another useful visual aid when comparing distributions of a continuous variable across levels (groups) of a categorical variable. We can, for example, compare each ethnic group’s overall performance at a glance with geom_boxplot(). Here we’ll also calculate the class median and overlay the plot with a dashed red line indicating the value.
plot_data <- apcalcbc %>%
  filter(Final > 0) %>%
  distinct(Student, .keep_all = TRUE) #One row per student; .keep_all retains Grade and Ethnicity
class_median <- median(plot_data$Grade)
ggplot(plot_data,
       aes(Ethnicity, Grade, fill = Ethnicity)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) + #Specifies color and shape of outliers
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3) + #Plots the mean, indicated by diamonds on the plots (use fun.y in ggplot2 versions before 3.3.0)
  geom_hline(yintercept = class_median, color = "red", linetype = "dashed") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) #Optional adjustment to labels
Of note: the median score for each ethnic group is comfortably close to the overall class median (indicated by the dashed red line), although some are higher than others. And if you’re curious, outliers in geom_boxplot() are points that fall more than 1.5 * IQR (interquartile range) below the first quartile or above the third quartile.
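The 1.5 * IQR rule is easy to reproduce by hand. Here is a small sketch in base R that computes the two cutoffs and picks out the points beyond them. The grades are invented for illustration:

```r
# Hypothetical grades, invented for illustration
grades <- c(55, 78, 80, 82, 85, 88, 90, 91, 93, 97)

q1  <- quantile(grades, 0.25, names = FALSE) # First quartile
q3  <- quantile(grades, 0.75, names = FALSE) # Third quartile
iqr <- q3 - q1                               # Interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points beyond either fence would be drawn as outliers
outliers <- grades[grades < lower_fence | grades > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

In this toy example only the grade of 55 falls outside the fences.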
Other Plotting Options
Far be it from me to drone through this entire post without mentioning some of R’s more dynamic graphing capabilities. Packages like googleVis, rCharts, and plotly allow you to create interactive visualizations, but many users still get usage out of base R and lattice. Perhaps the most exciting recent development in R plotting is the gganimate package which can “gifafy” your plots into slick animations. In short, in the world of open source data visualization, R is probably king.
Have a question about R or this tutorial? Feel free to email me at [email protected].
About the Author
David wrangles, slices and dices data for Scout, but he also likes to pal around with the English department reciting Julius Caesar, extolling the innumerable merits of East of Eden, and arguing about the hegemonic canon.
About Scout
Learning is synonymous with empowerment at Scout. We are teachers, instructional designers and technologists working to deliver University of California-quality interactive online classes, curriculum and supplemental education materials to middle school and high school students and teachers across the U.S. and beyond. Our course materials are designed to inspire lifelong curiosity and prepare pupils of all backgrounds and education levels for an increasingly technological world where training and job skills are mobile, asynchronous and self-directed. Explicitly created to bridge achievement gaps, we believe that using technology effectively can remove traditional obstacles to education.