
The Russian Comment Corpus

thesis
posted on 2025-07-15, 17:08 authored by Emma Taylor
Even though comments are crucial artifacts for understanding computer programs, relatively few studies examine their form, frequency, or authorship. Code comments are human-readable text that a compiler or interpreter ignores when executing the program. They serve multiple purposes, including describing a program’s functionality, explaining bugs or pending updates, and communicating with other developers. Although writing good comments is considered a best practice in software engineering, the style and practice of comment writing remain understudied, especially for non-English comments. The Russian Comment Corpus (RCC) was born out of a desire to understand how Russian-speaking programmers write comments in programming code. This project proposes a new methodology for code comment corpus construction, implemented as a Python program that processes, filters, and stores files containing Russian comments, and applies it to create the RCC. The RCC contains 95,538 code comments from programs written in C#, Java, JavaScript, Kotlin, PHP, Python, Ruby, and SQL. The RCC methodology serves as a blueprint for developing future comment corpora to support studies in code comments, developer cognition, and natural language usage in programming. As a dataset, the Russian Comment Corpus is a foundational work for studying the Russian language as used in computer programming.
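
The filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the program described in the thesis, whose details are not given in this record. The function names `extract_comments` and `cyrillic_comments`, the use of Python's standard `tokenize` module, and the Cyrillic-range heuristic are all assumptions made for illustration, and the sketch handles Python source only.

```python
import io
import re
import tokenize

# Assumption: any character in the basic Cyrillic block marks a comment as a
# candidate. This also matches Ukrainian, Bulgarian, and other Cyrillic-script
# languages, so a real pipeline would need a language-identification step.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def extract_comments(source: str) -> list[str]:
    """Return the text of every # comment in a Python source string."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        tok.string.lstrip("#").strip()
        for tok in tokens
        if tok.type == tokenize.COMMENT
    ]


def cyrillic_comments(source: str) -> list[str]:
    """Keep only the comments that contain at least one Cyrillic character."""
    return [c for c in extract_comments(source) if CYRILLIC.search(c)]


# Hypothetical sample input: one Russian comment, one English comment, and a
# string literal that merely looks like a comment.
sample = (
    "x = 1  # счётчик попыток\n"
    "y = 2  # retry limit\n"
    's = "# не комментарий"\n'
)
print(cyrillic_comments(sample))
```

Using the tokenizer instead of a plain text search for `#` keeps string literals that contain a hash sign out of the results. Other languages in the corpus, such as Java or SQL, use different comment syntax and would each need their own extraction step.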

History

Publisher

ProQuest

Language

English

Committee chair

Mark Nelson

Committee member(s)

Kathleen Riley; Luis Cerezo Ceballos

Degree discipline

Computer Science

Degree grantor

American University. College of Arts and Sciences

Degree level

  • Masters

Degree name

M.S. in Computer Science, American University, May 2025

Local identifier

Taylor_american_0008N_12335

Media type

application/pdf

Pagination

54 pages

Access statement

Electronic thesis is restricted to authorized American University users only, per author's request.

Call number

Thesis 11674

MMS ID

99187054692204102

Submission ID

12335
