Gradebook analysis with R – Part II

This is Part II in Scout from University of California’s three-part series, Gradebook analysis with R. If you missed Part I, read it here. You can also read Part III here.

In Part I, we worked through loading, cleaning, tidying and summarizing gradebooks exported from an LMS. While it’s true that approximately 80% of data analysis is spent wrangling and cleaning data, it’s only 20% of the fun. In our humble opinion, the real excitement and payoff come with plotting, inspecting and sharing visual patterns in your data. That’s what this post is all about.

Toward that end, there are numerous ways to construct plots in R. There is, however, a loose consensus within the R community that the ggplot2 package both offers the greatest flexibility and produces the most publication-ready graphics. True, the syntax takes some getting used to, but the results are rewarding. Here I cosign with David Robinson, data scientist and staunch advocate of ggplot2: “Don’t teach built-in plotting; teach ggplot2.”

Installation is easy:
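
One common way to install and load the package from CRAN is sketched below. It assumes an internet connection and a configured CRAN mirror; the requireNamespace() guard is an addition of ours that skips the download when ggplot2 is already present.

```r
# Install ggplot2 from CRAN only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so ggplot(), aes() and the geoms can be called directly
library(ggplot2)
```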


There are dozens of quality ggplot2 tutorials around the web, and we won’t add to the surplus here. Rather, we hope to share some ideas for extracting insight from your gradebooks through histograms, scatter plots and box plots. The R code will be provided in each instance, but with somewhat limited commentary. This post will also demonstrate how to create custom functions, which can save a great deal of unnecessary copying and pasting.

In Part II we’re going to sift through a new gradebook from a recent AP Calculus BC course. We’ve already cleaned, subsetted, and tidied this gradebook, as well as joined additional data from our SIS using the same steps from Part I:


library(tidyverse) #Loads readr, dplyr, tidyr and stringr
library(magrittr) #Provides the %<>% assignment pipe

apcalcbc <- read_csv("./gradebooks/apcalcbc111006.csv", skip = 7) %>%
  select(`Student Name`,
         `Calculated Grade`,
         #... (the remaining column selections are not shown)

names(apcalcbc)[3:ncol(apcalcbc)] %<>%
  str_sub(start = 1, end = -8) %>%
  #... (the remaining step is not shown)

apcalcbc %<>%
  rename(Final = `Semester 1 Final Exam`,
         Student = `Student Name`,
         Grade = `Calculated Grade`) %>%
  gather(Item, Score, 3:ncol(.)) %>%
  mutate(Type =
           ifelse(grepl("Homework", Item), "Homework",
                  ifelse(grepl("Open Response", Item), "Open Response Quiz", "Quiz")),
         Result =
           ifelse(Grade > 60, "Passed", "Failed"),
         Unit =
           ifelse(grepl("Unit 1", Item), "Unit 1",
                  ifelse(grepl("Unit 2", Item), "Unit 2",
                         ifelse(grepl("Unit 3", Item), "Unit 3", "Unit 4")))) %>%
  left_join(sis_data, by = "Student")
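
The nested ifelse() calls used to derive Type and Unit can also be written with dplyr's case_when(), which some readers find easier to scan. The sketch below is an alternative phrasing, not the code used for this gradebook, and the item names in it are made up for illustration.

```r
library(dplyr)

# Hypothetical item names, standing in for the real gradebook columns
items <- tibble(Item = c("Unit 1: Homework 01",
                         "Unit 2: Open Response Quiz 01",
                         "Unit 3: Quiz 06"))

labelled <- items %>%
  mutate(
    # Conditions are checked in order; the first match wins
    Type = case_when(
      grepl("Homework", Item)      ~ "Homework",
      grepl("Open Response", Item) ~ "Open Response Quiz",
      TRUE                         ~ "Quiz"
    ),
    Unit = case_when(
      grepl("Unit 1", Item) ~ "Unit 1",
      grepl("Unit 2", Item) ~ "Unit 2",
      grepl("Unit 3", Item) ~ "Unit 3",
      TRUE                  ~ "Unit 4"
    )
  )
```

As in the nested ifelse() version, the final TRUE branch acts as the fallback, so any item that matches none of the patterns is labelled "Quiz" or "Unit 4".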

Some explanation of the new variables within apcalcbc:

Item: the name of each gradebook item (a quiz or homework assignment), gathered from the original column names.
Score: the student’s score on that item.
Type: the kind of item (“Homework”, “Open Response Quiz” or “Quiz”), derived from the item name.
Result: “Passed” if the student’s calculated Grade is above 60, otherwise “Failed”.
Unit: the unit the item belongs to (“Unit 1” through “Unit 4”), derived from the item name.
Gender, Ethnicity and other student attributes: joined from sis_data by Student.

Viewing Distributions with Histograms

Histograms are useful for observing the class’s overall performance on a particular exam, quiz or homework assignment. The shape of the distribution is important: for most assignments, we would hope to see a unimodal, bell curve shape with scores clustering around the mean and median. Using the geom_histogram() “geom” (plots in ggplot2 are created with various geoms), here we create a histogram of the final exam scores:


plot_data <- apcalcbc %>%
  distinct(Student, .keep_all = TRUE) %>% #Keeps one row per student without dropping the other columns
  filter(Final > 0) #Removes students who did not take the final

ggplot(plot_data, #The first argument of ggplot() is the name of the dataset
       aes(Final)) + #The second argument is the aesthetic mapping where you specify the x and y axes
  #Here the "Final" variable will be plotted on the x axis
  #We add "geoms" and elements to the plot with "+"
  geom_histogram(binwidth = 5, #Optional argument to adjust histogram bin width
                 fill = "lightgreen", #Optional argument to adjust fill color
                 color = "black") + #Optional argument to adjust border color
  ggtitle("Histogram of Final Exam Scores") + #Optional plot title
  theme_minimal() #Optional theme that removes extra grid lines and background color

histogram of final exam scores

There is nothing overly troubling about the shape of this distribution. It is loosely normal and a tad left-skewed, and we might question the distribution’s center, but the spread of scores is about what we would expect. Furthermore, if you consider the course’s online, AP context, the concentration of high scores is less surprising. Online AP courses probably attract higher-achieving students.
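
To put numbers next to a visual impression of center, spread and skew, a few summary statistics can help. The sketch below uses made-up scores (not the AP Calculus BC data) and a simple moment-based skewness formula; both the vector and the variable names are illustrative assumptions.

```r
# Made-up final exam scores, for illustration only
final_scores <- c(52, 61, 68, 72, 75, 78, 81, 84, 86, 88, 90, 92, 95)

score_mean   <- mean(final_scores)    # Center, sensitive to the low tail
score_median <- median(final_scores)  # Center, robust to the low tail
score_sd     <- sd(final_scores)      # Spread

# Moment-based skewness: negative values indicate a longer left tail
deviations <- final_scores - score_mean
score_skew <- mean(deviations^3) / mean(deviations^2)^1.5
```

A mean that sits below the median, together with a negative skewness value, is consistent with a left-skewed distribution.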

A standalone histogram is of limited use when studying gradebooks, and retyping or copying the code for each exam, quiz or assignment would take forever. Suppose, for example, we want to compare the distributions from multiple assignments, such as all quizzes or homework. To do so, we can create a facet grid with + facet_wrap(~Item).

Now let’s plot the histogram facet grid using only the quizzes from Unit 3:

plot_data <- apcalcbc %>%
  filter(Type == "Quiz",
         Unit == "Unit 3",
         Score > 0)

ggplot(plot_data,
       aes(Score)) +
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  ggtitle("Unit 3 Quiz Scores") +
  facet_wrap(~Item) +
  theme_minimal()

unit 3 quiz scores chart

Here it might be worth noting that the center of Unit 3: Quiz 06 is different from the other quizzes, but that could be due to a simple change in maximum score. We could have plotted a histogram for every quiz by removing the Unit == "Unit 3" argument, but that would have flooded our plot viewer with dozens of tiny, illegible plots. Filtering by Unit is preferable in this respect.

So now we’re interested in the quiz and homework distributions from other units, but there’s a problem: the thought of copying and pasting the same plot code and changing some filters and plot titles along the way is repellant. The process is not only cumbersome and time-consuming, but it lacks panache — an unspoken, essential quality of all data analysis. Fortunately, we can save a great deal of time and typing by creating our own custom function that plots our desired histograms with minimal effort.

User-defined functions are one of R’s signature advantages. The rule of thumb is that if you find yourself copying and pasting your code more than once, you should create your own function.

Our new function, gradebook_histograms(), will have three arguments: the tidy dataset, the gradebook type and a specified unit. The function will then use those three inputs to create a facet grid with the desired histograms that includes a red line marking the distribution’s median:

gradebook_histograms <- function(tidy_gradebook, type, unit) {

  filtered <- tidy_gradebook %>%
    filter(Score > 0,
           Type == type, #Filter by type input
           Unit == unit) #Filter by unit input

  plot_data <- filtered %>%
    group_by(Item) %>%
    summarize(Median = median(Score)) %>% #Calculate median score
    left_join(filtered, by = "Item") #Join the medians back onto the filtered rows so zero scores stay excluded

  ggplot(plot_data, aes(Score)) +
    geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
    geom_vline(data = plot_data, aes(xintercept = Median), #Plot the vertical red line
               color = "red", linetype = "dashed") + #Optional aesthetic arguments
    facet_wrap(~Item) +
    labs(x = paste(type, "Score"), y = "Count") + #Axes labels determined by function input
    ggtitle(paste(unit, type, "Scores")) +
    theme_minimal()
}

Now we tell R, “Show us the AP Calculus BC score distributions for all quizzes in Unit 1”:

gradebook_histograms(apcalcbc, "Quiz", "Unit 1")

unit 1 quiz scores chart

Voila! Thanks to our new function, the line between man and machine has been further blurred. If we want to view the distributions of homework scores or quiz scores from another unit or even another gradebook, we just have to change a few arguments. And if we know in advance that we want to see all the quiz plots for each unit anyway, we can call gradebook_histograms() iteratively within a for-loop. We won’t run the following code, but it creates the histograms for every quiz from each Unit:

for (unit in unique(apcalcbc$Unit)) { #This specifies four iterations, one for each unit
  print(gradebook_histograms(apcalcbc, "Quiz", unit)) #print() is needed to display a ggplot from inside a loop
}
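
As an alternative to looping, the plots can be collected in a named list and displayed later, one at a time. The sketch below is self-contained: toy_gradebook and plot_unit() are hypothetical stand-ins for apcalcbc and gradebook_histograms(), and the scores are made up.

```r
library(ggplot2)

# Toy tidy gradebook (made-up scores, for illustration only)
toy_gradebook <- data.frame(
  Unit  = rep(c("Unit 1", "Unit 2"), each = 6),
  Item  = rep(c("Quiz 01", "Quiz 02"), times = 6),
  Score = c(8, 9, 7, 10, 6, 9, 5, 8, 9, 7, 10, 6)
)

# Hypothetical, simplified stand-in for gradebook_histograms()
plot_unit <- function(gradebook, unit) {
  ggplot(gradebook[gradebook$Unit == unit, ], aes(Score)) +
    geom_histogram(binwidth = 1) +
    facet_wrap(~Item) +
    ggtitle(paste(unit, "Quiz Scores"))
}

# Build one plot per unit and name each list element after its unit
units <- unique(toy_gradebook$Unit)
plots <- lapply(units, function(u) plot_unit(toy_gradebook, u))
names(plots) <- units
```

A single plot can then be shown on demand with print(plots[["Unit 1"]]), which avoids flooding the plot viewer.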

Scatter Plots with Two Continuous Variables

A curricular review is a Gordian knot of students and scores, teachers and technology, but another promising entry point is to consider how individual gradebook items relate to summative assessments such as a final exam. Correlation is not causation, but a series of scatter plots can flag potential issues and act as a springboard into other inquiries. An activity with little to no relationship to final exam performance is perhaps cause for concern.
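
Before plotting, the strength of each item's relationship to the final exam can be summarized with a correlation coefficient. The following is a minimal base R sketch on made-up data; the toy data frame only mimics the Student, Item, Score and Final columns of the tidy gradebook.

```r
# Toy tidy gradebook: one row per student per item (made-up numbers)
toy <- data.frame(
  Student = rep(c("A", "B", "C", "D", "E"), times = 2),
  Item    = rep(c("Quiz 01", "Quiz 02"), each = 5),
  Score   = c(6, 7, 8, 9, 10,
              9, 7, 8, 10, 6),
  Final   = rep(c(60, 70, 75, 90, 95), times = 2)
)

# Pearson correlation between Score and Final, computed separately for each item
item_cor <- sapply(split(toy, toy$Item), function(d) cor(d$Score, d$Final))
```

In this toy data, Quiz 01 tracks the final closely while Quiz 02 does not, which is the kind of contrast that would prompt a closer look at an item.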

Let’s create another function called gradebook_scatter_plots() that plots each student’s Item score on the x-axis, and their Final score on the y-axis. We’ll also add a smoother line with geom_smooth(method = "lm") that captures the line of best fit for each plot. For additional customization, we’ll pass an ellipsis (...) into the function that provides room for additional, optional arguments. For example, if I want the color of the scatter plot points to identify each student’s gender, I can add color = "Gender" to my call of gradebook_scatter_plots(). And if I want the shape of the scatter plot points to identify each student’s ethnicity, I can add shape = "Ethnicity". Here we create the function using the geom geom_point():

gradebook_scatter_plots <- function(tidy_gradebook, type, unit, ...) {

  plot_data <- tidy_gradebook %>%
    filter(Type == type,
           Unit == unit,
           Score > 0,
           Final > 0)

  ggplot(plot_data, aes_string("Score", "Final", ...)) + #The ellipsis goes where additional arguments are provided
    #Because an additional mapping can be provided as a string, we use aes_string()
    geom_point() +
    facet_wrap(~Item) +
    geom_smooth(method = "lm") + #Specifies line of best fit with shaded area representing 95% confidence interval
    labs(y = "Final Exam Score", x = paste(type, "Score")) +
    ggtitle(paste(unit, type, "Scores ~ Final Exam")) +
    theme_minimal()
}
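
As an aside on the ellipsis itself: any arguments captured by ... are simply forwarded to whichever inner call includes ... in its argument list. A minimal base R sketch, using a hypothetical helper named average_score():

```r
# Hypothetical helper: forwards any extra arguments on to mean()
average_score <- function(scores, ...) {
  mean(scores, ...)
}

quiz <- c(8, 9, NA, 7)

without_extra <- average_score(quiz)                # NA, because of the missing value
with_extra    <- average_score(quiz, na.rm = TRUE)  # na.rm is passed through to mean()
```

The scatter plot function relies on the same mechanism, with the extra arguments forwarded to the aesthetic mapping instead of mean().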

With our new function, let’s view the scatter plots for all quizzes in Unit 1:

gradebook_scatter_plots(apcalcbc, "Quiz", "Unit 1")

quiz scores - final 1

We observe a consistent positive relationship between quiz scores and final exam scores. The assumption here is that there is at least some overlap in the skills and knowledge required to succeed on the quizzes, and the skills and knowledge required to succeed on the final exam. An assignment that showed little to no positive correlation would raise some eyebrows, as would a concentration of points located outside the shaded confidence intervals. Here the relatively small sample size and the small variation in quiz scores probably explain the phenomenon, but these are oddities worth investigating. The purpose of the plots (along with the rest of our exploratory data analysis) is to flag potential issues within the curriculum, and the scatter plots accomplish precisely this with remarkable speed and ease.
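
The line of best fit and its shaded band can also be inspected numerically with lm() and predict(). The sketch below uses made-up scores; it illustrates the kind of fit that geom_smooth(method = "lm") draws and is not a reproduction of the plots in this post.

```r
# Made-up quiz and final exam scores for seven students
scores <- data.frame(
  Score = c(4, 5, 6, 7, 8, 9, 10),
  Final = c(55, 62, 60, 71, 78, 80, 91)
)

# Ordinary least squares fit of Final on Score
fit <- lm(Final ~ Score, data = scores)

# Fitted values with a 95% confidence interval for the mean response
new_scores <- data.frame(Score = c(5, 8))
ci <- predict(fit, newdata = new_scores, interval = "confidence", level = 0.95)
```

The result is a matrix with the columns fit, lwr and upr, giving the fitted value and the lower and upper bounds of the interval at each quiz score.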

Let’s pass an additional argument to our function (color = “Gender”) and view the plots with the color of each point identifying each student’s Gender:

gradebook_scatter_plots(apcalcbc, "Quiz", "Unit 1", color = "Gender")

quiz scores - final 2

Again, the sample size is too small to make any generalization about gender performance. What is important to see here is the process and method of viewing your data across multiple variables, some continuous, some categorical.

Comparing Distributions across Categorical Variables with Box Plots

A box plot is another useful visual aid when comparing distributions of a continuous variable across levels (groups) of a categorical variable. We can, for example, compare each ethnic group’s overall performance at a glance with geom_boxplot(). Here we’ll also calculate the class median and overlay the plot with a dashed red line indicating the value.

plot_data <- apcalcbc %>%
  filter(Final > 0) %>%
  distinct(Student, .keep_all = TRUE) #Assumed step: one row per student, since Grade is a per-student value

class_median <- median(plot_data$Grade)

ggplot(plot_data,
       aes(Ethnicity, Grade, fill = Ethnicity)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) + #Specifies color and shape of outliers
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3) + #Plots the mean, indicated by diamonds on the plots (use fun.y in ggplot2 versions before 3.3.0)
  geom_hline(yintercept = class_median, color = "red", linetype = "dashed") + #Dashed red line at the class median
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) #Optional adjustment to labels

Ethnicity graph

Of note: the median score for each ethnic group is comfortably close to the overall class median (indicated by the dashed red line), although some are higher than others. And if you’re curious, outliers in geom_boxplot() are points that fall more than 1.5 * IQR (inter-quartile range) below the first quartile or above the third quartile.
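
The 1.5 * IQR rule is easy to check by hand. The sketch below applies it to made-up grades, assuming the quartiles are computed with R's default quantile() method; the variable names are illustrative.

```r
# Made-up course grades, for illustration only
grades <- c(35, 72, 75, 78, 80, 82, 85, 88, 90, 93)

# First and third quartiles, and the inter-quartile range between them
quartiles <- quantile(grades, c(0.25, 0.75))
iqr <- quartiles[[2]] - quartiles[[1]]

# Fences: anything beyond them is flagged as an outlier
lower_fence <- quartiles[[1]] - 1.5 * iqr
upper_fence <- quartiles[[2]] + 1.5 * iqr

outliers <- grades[grades < lower_fence | grades > upper_fence]
```

Here only the lowest grade falls outside the fences, so it is the one point a box plot of these values would draw separately.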

Other Plotting Options

Far be it from me to drone through this entire post without mentioning some of R’s more dynamic graphing capabilities. Packages like googleVis, rCharts, and plotly allow you to create interactive visualizations, and many users still get plenty of use out of base R graphics and lattice. Perhaps the most exciting recent development in R plotting is the gganimate package, which can “gif-afy” your plots into slick animations. In short, in the world of open source data visualization, R is probably king.

Have a question about R or this tutorial? Feel free to email me at [email protected].

About the Author

David wrangles, slices and dices data for Scout, but he also likes to pal around with the English department reciting Julius Caesar, extolling the innumerable merits of East of Eden, and arguing about the hegemonic canon.

About Scout

Learning is synonymous with empowerment at Scout. We are teachers, instructional designers and technologists working to deliver University of California-quality interactive online classes, curriculum and supplemental education materials to middle school and high school students and teachers across the U.S. and beyond. Our course materials are designed to inspire life-long curiosity and prepare pupils of all backgrounds and education levels for an increasingly technological world where training and job skills are mobile, asynchronous and self-directed. Explicitly created to bridge achievement gaps, we believe that using technology effectively can remove traditional obstacles to education.

Michigan Virtual Learning Research Institute

The Michigan Virtual Learning Research Institute (MVLRI) is a non-biased organization that exists to expand Michigan’s ability to support new learning models, engage in active research to inform new policies in online and blended learning, and strengthen the state’s infrastructures for sharing best practices. MVLRI works with all online learning environments to develop the best practices for the industry as a whole.