Gradebook analysis with R – Part I

Assorted Electronics on table
Share on facebook
Share on linkedin
Share on twitter
Share on pinterest

This is Part I in Scout from University of California’s three-part series, Gradebook Analysis with R. You can read Part II here, and Part III here.

OR logo

Gradebooks are rich sources of information. Hidden among the test, quiz and homework scores often lie important insights into the curriculum and student performance. While assessing a curriculum’s overall effectiveness and integrity is an enormous task that requires much more data, a gradebook’s ability to flag potential issues makes it a useful entry point.

The analytics tools and interfaces provided by learning management systems (LMS), however, are sometimes inadequate. This is not meant as criticism. Data analysis is an imaginative exercise whose questions no LMS provider can fully anticipate. Furthermore, the best questions often require data from additional sources, such as your student information system (SIS). As we hope to demonstrate in this series, joining datasets from a variety of sources, issuing complex queries, and producing compelling data visualizations are tasks for which the R programming language and the environment are well-suited. R is a popular tool among statisticians and data scientists, but recent improvements have made it more accessible to the general public. This tutorial does not cover installing R, but an easy guide can be found here. R also has a fantastic integrative development environment (IDE) called RStudio, but it’s not required for any of the data analysis performed here.

Loading and Viewing Your Gradebook

First, we need our data. Every LMS allows you to export your gradebooks to a comma-separated values (CSV) file, so once you’ve downloaded the desired gradebook CSV onto your computer, you’re ready to open R. A screen shot of our raw data is below:

Raw data screenshot

Open R, and at the command line use the setwd() function to move your working directory wherever the gradebook is located. For example, we exported our Geometry gradebook to a directory called “Gradebooks” on our Desktop, so we enter:

setwd(“Desktop/Gradebooks”)#The path to the directory must be in quotation marks.

We can confirm the file is in the directory with the list.files() function. To read the file into R, we’ll use the read_csv() function from the readr package, which, like all R packages, must be installed prior to use with install_packages(“package_name”). In the example below, we’re creating an R object and naming it geometry. The <- operator tells R to assign to geometry whatever is returned by the function on right, which in this case is a table from the CSV file.

library (readr) #We need to load the readr package where read_csv() resides
geometry <- read_csv(“geometry_gradebook.csv”, skip = 7)

Note that in the image above, the data we need begins on the eighth row. Thus, the second argument (skip = 7) tells R to skip seven lines before reading the data. Part of the hidden beauty of read_csv() is its speed and default settings. We could have read the gradebook into R with the read.csv() function from base R, but we would then have had to supply additional arguments, such as header = TRUE and stringsAsFactors = FALSE. Without going into detail, the dataset object created by read_csv is superior to what is created by read.csv().

geometry is now an R object, and we can begin to examine it. Towards that end, R has several useful functions:

  • dim() prints the dimensions of an R object (e.g., number of rows and columns).
  • str() prints the structure of an R object, as well as the column names and the type of each column (e.g., character, numeric, integer, etc.).
  • names() prints the names of an R object.
  • head() prints the first few parts of an R object.
  • tail() prints the last few parts of an R object.

Here are some examples of these functions called on our geometry dataset:

dim(geometry); kable(head(geometry[,11:16])) #kable() is an optional function to format the display

[1] 32 96

Elements of the SCED Coding Structure

The table below shows the SCED coding structure elements along with a description and an example of each.
Coding ElementDescriptionExample
Course DescriptionThe first two digits of a code represent a subject area, corresponding to one of twenty-two general categories. For example, Subject Area Mathematics is coded as (02). The next three digits represent a course identifier, essentially what distinguishes one course from another within the same subject area.For example, Algebra I is coded as (052). Taken together, the SCED code for Algebra I is 02052. The documents referred to above contain a full list of all recognized SCED courses, coding and description.
Course LevelThe second element of a course's SCED code refers to its course level. One of four possible letters is assigned to a course to convey its level of rigor:
· B - Basic or Remedial
· E - Enriched or Advanced
· G - General or Regular
· H - Honors
The majority of courses that schools offer are general.
Available CreditFor secondary courses this refers to the amount of Carnegie unit credits available to a student who successfully completes the course. Credit is assigned as either 1.0 or .50, varying between courses and from district to district.
Note: Prior-to-secondary courses, this element is coded differently as it indicates the grade span for which the course is appropriate, represented as a four-character code with no decimals. A course may be appropriate across multiple grade levels or a singular grade.
For example, a course appropriate for grades 7 and 8 would receive the coding 0708, while a course appropriate for only grade 3 would be coded as 0303.
SequenceThis identifier refers to whether courses are" broken up" into multiple terms but consist of the same course description.For example, a district may offer Algebra 1A and Algebra IB, in the first and second semesters respectively, but the course will maintain the same coding and description for each term. The catalog requires that the position of the specific course in the sequence and the total number of courses in the sequence be identified. This can be thought of in terms of a course being "number x out of number y." For example, Algebra 1A would have a position in the sequence of, and the total number of courses in the sequence would be 2.
Subsetting and Cleaning Your Gradebook

This dataset has 32 rows and 96 columns (i.e., 32 students and 96 variables). For our purposes, we want to focus on the gradebook scores, so we can drop the general information about the students and the columns containing teacher comments. To do so, we’ll use the select() and contains() functions from the handy dplyr package to specify which columns we want to keep. The contains (“Homework”) clause, for example, indicates that we want all columns whose header contains “Homework.”

geometry <- select(geometry, #The first argument of select() is the dataset `Student Name`, #Column names with spaces must be closed by backticks
`Calculated Grade`,
-contains(“Comment”)) #”-” tells R to drop columns with “Comment”

Checking the dimensions and column names of our new, smaller dataset, we see that the number of columns has dropped to 42:

dim(geometry); head(names(geometry), 20)

[1] 32 42

[1] “Student Name” “Calculated Grade”
[3] “Final Exam, Form A (R) (0)” “Lesson 1 Homework (R) (0)”
[5] “Lesson 2 Homework (R) (0)” “Lesson 3 Homework (R) (0)”
[7] “Lesson 4 Homework (R) (0)” “Lesson 5 Homework (R) (0)”
[9] “Lesson 6 Homework (R) (0)” “Lesson 8 Homework (R) (0)”
[11] “Lesson 9 Homework (R) (0)” “Lesson 10 Homework (R) (0)”
[13] “Lesson 11 Homework (R) (0)” “Lesson 12 Homework (R) (0)”
[15] “Lesson 13 Homework (R) (0)” “Lesson 14 Homework (R) (0)”
[17] “Lesson 15 Homework (R) (0)” “Lesson 16 Homework (R) (0)”
[19] “Lesson 01 Quiz (R) (0)” “Lesson 02 Quiz (R) (0)”

If you’re picky about column names like us, the extraneous “(R) (o)” at the end of each column name is annoying and should be removed. Furthermore, when we start creating plots across facets and grids in Part II, we’ll want cleaner plot titles. Here we’ll use the str_sub() and str_trim() functions from the stringr package to remove the final eight characters from each column name and trim the remaining white space:

names(geometry)[3:42] <- str_sub(names(geometry)[3:42], start = 1, end = -8)
#”(R) (0)” only occurs in columns 3-42
#The third argument (end = -8) tells R to remove the final eight characters
names(geometry) <- str_trim(names(geometry))
head(names(geometry), 20)

[1] “Student Name” “Calculated Grade” “Final Exam, Form A”
[4] “Lesson 1 Homework” “Lesson 2 Homework” “Lesson 3 Homework”
[7] “Lesson 4 Homework” “Lesson 5 Homework” “Lesson 6 Homework”
[10] “Lesson 8 Homework” “Lesson 9 Homework” “Lesson 10 Homework”
[13] “Lesson 11 Homework” “Lesson 12 Homework” “Lesson 13 Homework”
[16] “Lesson 14 Homework” “Lesson 15 Homework” “Lesson 16 Homework”
[19] “Lesson 01 Quiz” “Lesson 02 Quiz”

Much better! Our column names are clearer and free of extraneous white space. Here we should note that it’s generally preferable to remove spaces from your column names, and we could do so by calling make.names(names(geometry)). We intend, however, to reshape our data into new columns below, so we can leave the spaces for the time being.

Tidying Your Gradebook

Our column names are clean, but the dataset is not tidy. Tidy data has two key characteristics:

  • Each column is a variable.
  • Each row is an observation.

While meeting the second condition, geometry violates the first. We have 42 columns, but really only six variables: Student Name, Final Exam Grade, Calculated Grade, Item and Score. You could argue that the Calculated Grade and Final Exam Grade scores could be rolled into the Score variable and that there are only three variables (Student Name, Item, Score), but for our purposes, we want to treat the Calculated Grade and Final Exam Scores as separate variables.

Tidying data is easy with the tidyr package. gather() takes multiple columns and collapses them into what are called key-value pairs. In the example below, we’re taking the 39 assignment columns from geometry (columns 4-42) and collapsing them into two new columns, Item (the key) and Score (the value). The column names will fall into rows under Item, and the scores will fall into rows under Score.

tidy_geometry <- gather(geometry, Item, Score, 4:42)

Checking the dimensions of tidy_geometry, we can see that our dataset is now 1248 rows tall and only five columns wide:

dim(tidy_geometry); kable(head(tidy_geometry))

[1] 1248 5

[table “8” not found /]

We’re almost ready to begin querying our data. Creating some additional variables will help us issue more incisive queries and can be accomplished with a combination of the mutate(), ifelse() and grepl() functions. mutate() adds a new variable to a dataset, ifelse() tests a logical condition and returns TRUE or FALSE, and grepl() searches for matches within a character vector. In the example below, we’re creating two new variables, Type and Result. If “Homework” appears within the Item name of a particular row, its Type becomes “Homework.” If “Quiz” appears within the Item name of a particular row, its Type becomes “Quiz,” and so on and so forth. Result is the second variable created, and is dependent on the student’s Calculated Grade:

tidy_geometry <- mutate(tidy_geometry,
Type =
ifelse(grepl(“Homework”, Item), “Homework”,
ifelse(grepl(“Quiz”, Item), “Quiz”,
ifelse(grepl(“Project”, Item), “Project”,
ifelse(grepl(“Exam”, Item), “Exam”, “Other”)))),
Result =
ifelse(`Calculated Grade` > 60, “Passed”, “Failed”))

Stepping back and reviewing our code, we could have rewritten the script in a more succinct fashion with the %>% operator from the magrittr package. %>% “pipes” the R object to its left into the function on its right, allowing you to chain multiple functions together. The technique not only reduces the amount of code you need to write but also improves its readability. It may help to think of %>% as a synonym for “then,” as in, first take geometry, then do this, then do this, etc.

The %<>% operator is a handy combination of <- and %>%: it both assigns and pipes.


geometry <- read_csv(“geom111004.csv”, skip = 7) %>%
select(`Student Name`, `Calculated Grade`, contains(“Final”), contains(“Homework”),
contains(“Quiz”), contains(“Project”), contains(“Exam”),

names(geometry)[3:42] %<>%
str_sub(1, -8) %>%

tidy_geometry <- geometry %>%
gather(Item, Score, 4:ncol(.)) %>%
mutate(Type =
ifelse(grepl(“Homework”, Item), “Homework”,
ifelse(grepl(“Quiz”, Item), “Quiz”,
ifelse(grepl(“Project”, Item), “Project”,
ifelse(grepl(“Exam”, Item), “Exam”, “Other”)))),
Result =
ifelse(`Calculated Grade` > 60, “Passed”, “Failed”)) %>%
rename(Final = `Final Exam, Form A`, #Retaining spaces in column names is unadvisable
Grade = `Calculated Grade`,
Student = `Student Name`)

Joining Other Datasets to Your Gradebook

We could start analyzing our gradebook now, but joining data from our SIS will enrich the dataset and further sharpen our analysis. At Scout we obtain information about a student’s gender, their parents’ education levels, whether they have an IEP or 504 plan, and whether they attend a public or private school, among other variables. This mix of categorical and quantitative variables makes for a rich and interesting dataset.

In the example below, we’ve downloaded another CSV file from our SIS onto our desktop. Using the setwd() and read_csv() functions again, we’ll move our working directory back to the desktop, read the file into R, and join the new dataset, sis_data, with geometry:

setwd(“..”) #Shorthand meaning “move back one directory”
sis_data <- read_csv(“sis_data.csv”)
rich_geometry <- left_join(tidy_geometry, sis_data, by = “Student”) #sis_data must also have a matching Student column

There are many different ways to join tables. By calling left_join(), we’re matching all the rows from geometry with corresponding rows from sis_data. The third argument (by = “Student”), indicates which column to join by. For more about joins and relational data in R, Garrett Grolemond and Hadley Wickham created a terrific overview here.

Querying and Summarizing Your Gradebook

To recap: we’ve (1) loaded, subsetted and cleaned our gradebook, (2) reshaped our gradebook from long to tall, (3) created two additional variables, (4) renamed some variables to remove the spaces, and (5) added variables from our SIS. A brief summary of our variables in the cheesily named rich_geometry can be obtained with str():


Classes ‘tbl_df’, ‘tbl’ and ‘data.frame’: 1248 obs. of 9 variables:
$ Student : chr “Student 1” “Student 2” “Student 3” “Student 4” …
$ Grade : num 1.4 60.8 68.6 60.2 66.5 …
$ Final : int 0 90 69 63 71 122 84 135 101 60 …
$ Item : Factor w/ 39 levels “Lesson 1 Homework”,..: 1 1 1 1 1 1 1 1 1 1 …
$ Score : num 0 16.2 19.4 15 14.9 20 14.8 18.6 20 13.3 …
$ Type : chr “Homework” “Homework” “Homework” “Homework” …
$ Gender : chr “Female” “Male” “Female” “Male” …
$ Parent_Education: chr “Non-College Grad” “Non-College Grad” “Non-College Grad” “Non-College Grad” …
$ Result : chr “Failed” “Passed” “Passed” “Passed” …

We’re now ready to begin querying and summarizing our dataset. Here there is no limit to the questions we can ask, and some additional functions from the dplyr package help us create quick and easy queries:

  • filter() returns rows with matching conditions (e.g. Score > 70).
  • group_by() groups data into rows with the same values.
  • summarize() summarizes multiple values to a single value
  • arrange() arranges rows by variable in either ascending or descending order
  • distinct() retains only unique rows of a variable
Sample Questions and Answers

Question: which students performed best on quizzes?

rich_geometry %>%
filter(Score > 0, Type == “Quiz”) %>% #We want to filter 0s, which indicate missing assignments
group_by(Student) %>%
summarize(Quiz_Mean = round(mean(Score), 2)) %>% #round() rounds the values to two decimal places
arrange(desc(Quiz_Mean)) %>% #Display Quiz_Mean in descending order
head() %>% #Display only the first six rows
kable() #Format the table display

PO InvoiceSemester TypeDrop Date(s)Anticipated Invoice Date
1Semester 1 (20 weeks)10/6/202010/23/2020
1Trimester 1 (13 Weeks)10/13/202010/23/2020
2Trimester 2 (13 weeks)1/26/20211/29/2021
3Semester 2 (20 weeks)2/23/20213/26/2021
4Trimester 3 (13 weeks)4/13/20214/30/2021

Answer: Students 27, 8, 30, 11, 6, and 26

Question: What are the summary statistics of each exam?

rich_geometry %>%
filter(Score > 0, Type == “Exam”) %>%
group_by(Item) %>%
summarize(Mean = mean(Score),
Median = median(Score),
Std_Dev = sd(Score)) %>% #sd() computes the standard deviation

CapabilityAffiliation UserMentor
Associate a Mentor to a StudentX
Create a Mentor AccountX
Create a Affiliation User AccountX
Create Student AccountX
Drop EnrollmentX
Enroll Multiple Students in a Single CourseX
Enroll Student in CourseX
Export List of Student AccountsX
Go to LMSX
Find Student Account InformationXX
Keep Record of Student InteractionsX
Locate Instructor PICX
Locate Student Guardian InformationXX
Message a StudentXX
Monitor Student ProgressXX
Record Student/Mentor InteractionsX
Reset Student Account PasswordXX
Review Activity LogsXX
Review Student's Course InformationXX
Search Course CatalogX
Send Message to Multiple Students at OnceXX
Send Messages to User AccountsX
View Enrollment LedgerX
View Student EnrollmentsXX
View Student Login HistoryXX
View Student's Start and End DatesXX

Answer: (See above)

Question: Did either gender outperform the other on quizzes or exams?

rich_geometry %>%
filter(Score > 0, Type == “Exam” | Type == “Quiz”) %>% #”|” indicates all rows with exams OR quizzes
group_by(Gender, Type) %>%
summarize(Mean_Score = round(mean(Score), 2)) %>%

Reference TypeFields
Book ChapterAbstract
Book Title
Chapter Title
End Page
Publication Date
Publication Place
Start Page
Edited BookAbstract
Book Title
Publication Date
Publication Place
End Page
Publication Date
Start Page
URL for Full Text
Publication Date
URL for Full Text
Book WholeAbstract
Book Title
Publication Date
Publication Place
End Page
Proceedings Title
Publication Date
Publication Place
Start Page
URL for Full Text
Publication Date
Publication Place
Series Title
URL for Full Text
StatuteAct Title
Publication Date
URL for Full Text

Answer: Yes, males outperformed females on both quizzes and exams.

Question: Which gradebook items are least correlated to final exam performance?

rich_geometry %>%
group_by(Item) %>%
summarize(Correlation_to_Final = round(cor(Score, Final), 2)) %>% #cor() computes the correlation
arrange(Correlation_to_Final) %>%
head() %>%

[table “12” not found /]

Answer: The projects, as well as the homework from lessons 5, 15, and 12.

Question: Did students whose parents graduated from college pass the course at a higher rate?

rich_geometry %>%
distinct(Student) %>%
group_by(Parent_Education) %>%
summarize(Pass_Percentage = round(mean(Result == “Passed”), 2)) %>%

StatusTask Description
Create an online course selection sheet (requiring Parent signature).
Assist with enrollment. Ensure the course meets graduation requirements and NCAA, if applicable.
Get classroom/lab setup (if applicable).
Create Syllabus to hand out to students so they know your expectations of them.
Create a learning environment that is welcoming, supportive and flexible.
Make sure students have necessary books and/or resources for their online course. School is responsible for purchasing resources.
Explore the following resources available to mentors:

Answer: Yes, students whose parents graduated from college passed the course at a higher rate.

Again, it would be impossible for any LMS to anticipate all of these questions in advance. With just a few lines of code, we can explore our data with imaginative freedom. In Part II, we’ll continue our exploratory data analysis by building scatter plots, box plots and histograms. R has fantastic graphing capabilities that help us create compelling visualizations and dashboards.

Have a question about R or this tutorial? Feel free to email me at [email protected].

University of California-Scout logo



About the Author

David wrangles, slices and dices data for Scout, but he also likes to pal around with the English department reciting Julius Caesar, extolling the innumerable merits of East of Eden and arguing about the hegemonic canon.

About Scout

Learning is synonymous with empowerment at Scout. We are teachers, instructional designers and technologists working to deliver University of California-quality interactive online classes, curriculum and supplemental education materials to middle school and high school students and teachers across the U.S. and beyond. Our course materials are designed to inspire lifelong curiosity and prepare pupils of all backgrounds and education levels for an increasingly technological world where training and job skills are mobile, asynchronous and self-directed. Explicitly created to bridge achievement gaps, we believe that using technology effectively can remove traditional obstacles to education.

Share on facebook
Share on linkedin
Share on twitter
Share on pinterest
Michigan Virtual Learning Research Institute

Michigan Virtual Learning Research Institute

The Michigan Virtual Learning Research Institute (MVLRI) is a non-biased organization that exists to expand Michigan’s ability to support new learning models, engage in active research to inform new policies in online and blended learning, and strengthen the state’s infrastructures for sharing best practices. MVLRI works with all online learning environments to develop the best practices for the industry as a whole.

Keep up with the latest Michigan Virtual has to offer

The Digital Backpack

Get our latest articles sent straight to your inbox every Thursday!

  • This field is for validation purposes and should be left unchanged.

Limited Course Capacity

We’re sorry to inform you that we have reached capacity for several of our Semester 1 and Trimester 1 courses. You’ll notice when attempting to enroll students in our Student Learning Portal that some courses are unavailable. While we are no longer accepting new enrollments for these courses at this time, many courses continue to remain open for enrollment.

With many students across the state 100% remote, demand for our online courses is greater than ever before. Because every course we offer is taught by a Michigan-certified teacher, this high volume of enrollments has created capacity issues for our teachers who provide each and every student with individual feedback.

While the Michigan Virtual team anticipated and planned for significant increases in student enrollments this Fall, the increased demand we’ve experienced has been unprecedented. As a result, we are taking steps to hire even more part-and full-time teachers to support larger numbers of student enrollments for Semester 2 as well as for Trimester 2 and 3. 

For schools that still need online learning options this year, please fill out the form at the bottom of our virtual pathways page to meet with someone to discuss other solutions. While some of our teacher-led courses are full, we may still have the capacity to help you in upcoming terms or can discuss timing to implement a whole-school or collaborative program in which local teachers from your school/district use our online course content to teach students. We also have free course content and resources available for you to use.

We know this is an incredibly stressful time for all, and we’re sorry if the courses you’re looking for are unavailable. We never want to turn away a student who wants to learn from us. Our top concern, however, is student success, and we have a policy to not take on additional enrollments if we cannot guarantee that all students will have a quality online learning experience. 

We appreciate your patience and understanding as we navigate the unusually high volume of enrollments we are receiving.