<p dir="ltr">Code comments are human-readable text that a compiler or interpreter ignores when executing a program. They serve multiple purposes, including describing a program’s functionality, explaining bugs or pending updates, and communicating with other developers. Although comments are crucial artifacts for understanding computer programs, and writing good comments is considered a best practice in software engineering, relatively few studies examine their form, frequency, authorship, or style, particularly for non-English comments. The Russian Comment Corpus (RCC) was born out of a desire to understand how Russian-speaking programmers write comments in code. This project introduces an original methodology for constructing code comment corpora, implemented as a Python program that processes, filters, and stores files containing Russian comments, and applies it to create the RCC. The RCC contains 95,538 code comments from programs written in C#, Java, JavaScript, Kotlin, PHP, Python, Ruby, and SQL. The methodology serves as a blueprint for developing future comment corpora to support studies of code comments, developer cognition, and natural language usage in programming. As a dataset, the RCC is a foundational work for studying the Russian language as used in computer programming.</p>
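The abstract describes a Python program that processes and filters files for Russian comments but does not give its details. As a rough illustration only, and not the thesis's actual implementation, the filtering idea might be sketched as follows. The function names `extract_comments` and `russian_comments` are hypothetical, the sketch handles only Python source, and it assumes that a comment counts as Russian if it contains any character from the basic Cyrillic Unicode block.

```python
import io
import re
import tokenize

# Assumption: "Russian" is approximated by the presence of any character
# in the basic Cyrillic Unicode block (U+0400 to U+04FF).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def extract_comments(source: str) -> list[str]:
    """Return the text of every '#' comment in a Python source string."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tok.string.lstrip("#").strip()
        for tok in tokens
        if tok.type == tokenize.COMMENT
    ]


def russian_comments(source: str) -> list[str]:
    """Keep only the comments that contain at least one Cyrillic character."""
    return [c for c in extract_comments(source) if CYRILLIC.search(c)]


# Hypothetical sample input: one Russian comment, one English comment,
# and one string literal that merely looks like a comment.
sample = (
    "x = 1  # счётчик попыток\n"
    "y = 2  # retry limit\n"
    "s = '# не комментарий'\n"
)
print(russian_comments(sample))
```

Using the tokenizer rather than a plain text search keeps string literals that happen to contain `#` out of the results. The Cyrillic check is deliberately crude: other languages such as Ukrainian and Bulgarian share the script, so a real corpus pipeline would likely need a stricter language-identification step, and each of the other languages named in the abstract would need its own comment extractor.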
Publisher
ProQuest
Language
English
Committee chair
Mark Nelson
Committee member(s)
Kathleen Riley; Luis Cerezo Ceballos
Degree discipline
Computer Science
Degree grantor
American University. College of Arts and Sciences
Degree level
Masters
Degree name
M.S. in Computer Science, American University, May 2025
Local identifier
Taylor_american_0008N_12335
Media type
application/pdf
Pagination
54 pages
Access statement
Electronic thesis is restricted to authorized American University users only, per author's request.