American University

COCITATION CLUSTERING: DEVELOPMENT OF A HYBRID COMPUTER TECHNIQUE FOR SEARCHING FOR RELATED CHEMICAL SUBSTANCES

thesis
posted on 2023-09-06, 02:55 authored by George Franklin Hazard, Jr.

A set of methods generically known as 'Substructure Search' has been developed over the years to search for related structural classes in the chemical literature. These methods require two types of expertise. First, chemical subject matter expertise is needed to determine the basic skeleton, or 'substructure', characteristic of a class of substances. Second, a search intermediary is needed to code this substructure into the conventions of a given search system. This thesis describes a hybrid method of searching for related chemicals using citation frequency data and similarity measures. Given the CAS Registry Number of a substance of interest, this method discovers a variety of structurally related substances with very little input on the part of the searcher. All articles citing a parent substance of interest in the TOXLINE file of the National Library of Medicine are retrieved, and the substances cited in these articles are ordered by the strength of their 'cocitation' with the parent substance. Cocitation strength is defined as the number of citations in which two substances of interest are cited together. These substances are then further grouped by the similarity of their systematic names to form cocitation clusters. Similarity is calculated from the number of two-character segments (digrams) generated from the systematic names of the parent substance and of the substances cited with it. Digrams are stored in a bit string using the method of superimposed coding. Compared to substructure searches run as controls, cocitation clustering always retrieved relevant substances not found in the controls, including related chemicals without fully defined structures that are difficult to retrieve in standard substructure search systems. The method did not retrieve all substances found in the control searches, and the reasons for this are discussed.
When used together, cocitation and clustering lead to the discovery of chemicals related both structurally and biologically. These substances are useful in defining the scope of further substructure searches, or as data to be used immediately in a search of a citation file such as TOXLINE.
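
The two steps described in the abstract (ranking by cocitation strength, then comparing systematic names by their digrams) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function names, the 256-bit string width, the use of one CRC-32-derived bit per digram, and the Tanimoto coefficient as the similarity measure are all assumptions made for illustration; the abstract does not specify them. The identifiers in the sample data are placeholders, not TOXLINE records.

```python
import zlib
from collections import Counter


def cocitation_counts(parent, articles):
    """Count how often each substance is cited together with `parent`.

    `articles` is an iterable of sets; each set holds the substance
    identifiers (e.g. CAS Registry Numbers) cited by one article.
    """
    counts = Counter()
    for cited in articles:
        if parent in cited:
            counts.update(s for s in cited if s != parent)
    return counts


def digrams(name):
    """Return the set of overlapping two-character segments of a name."""
    text = name.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def digram_bits(name, width=256):
    """Superimpose all digrams of `name` onto one `width`-bit integer.

    Assumption of this sketch: each digram is hashed to a single bit
    position, so distinct digrams may collide on the same bit.
    """
    bits = 0
    for d in digrams(name):
        bits |= 1 << (zlib.crc32(d.encode("utf-8")) % width)
    return bits


def name_similarity(a, b, width=256):
    """Tanimoto coefficient of the two bit strings (an assumed measure)."""
    x, y = digram_bits(a, width), digram_bits(b, width)
    union = bin(x | y).count("1")
    return bin(x & y).count("1") / union if union else 0.0


# Hypothetical data: placeholder identifiers, not real citation records.
articles = [{"P", "S1", "S2"}, {"P", "S1"}, {"S1", "S2"}, {"P", "S3"}]
ranked = cocitation_counts("P", articles).most_common()
```

In this sketch, `ranked` lists the substances cited with the placeholder parent `"P"` in descending order of cocitation strength. A clustering step could then group the ranked substances by calling `name_similarity` on their systematic names; the grouping threshold would be a further assumption.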

Publisher

ProQuest

Language

English

Notes

Ph.D. American University 1982.

Handle

http://hdl.handle.net/1961/thesesdissertations:1998

Media type

application/pdf

Access statement

Part of thesis digitization project, awaiting processing.
