The following nomenclature may be used for the description of both Regulatory Fusions and Chimeric Transcript Fusions in the context of Categorical Gene Fusions or Assayed Gene Fusions as applicable. The nomenclature components are organized into three categories: Gene Components, Transcript Sequence Components, and Regulatory Nomenclature. These may be used interchangeably, in accordance with the below General Rules.
All components are joined together by the double-colon (::) operator.
A hyphen (-) operator may be used instead of a double-colon when describing a read-through transcript at the gene level (see Gene Components).
When describing Chimeric Transcript Fusions, structural components are ordered in 5’ to 3’ orientation with respect to the transcribed gene product.
When describing Regulatory Fusions, the regulatory element is indicated first (e.g. reg_e@GATA2::EVI1).
When describing Chimeric Transcript Fusions by Junction Components (in lieu of full Transcript Segment Components), the 5’ fusion partner junction must be the first component, and the 3’ fusion partner junction must be the last component.
Throughout the nomenclature components, some information may be provided optionally. In these cases, the optional text is colored orange and may be omitted.
Some fusions are inferred from an assayed genomic rearrangement, typically in the context of a phenotypic presentation that is associated with the inferred gene fusion event. In these cases, the nomenclature may indicate that the fusion was inferred through the use of parentheticals surrounding the double-colon operator (shown in red):
<Gene Symbol>(::)<Gene Symbol>
An example of this is provided in the Unknown Gene Component section.
The use of inferred fusions is expected to ALWAYS be contextualized by supporting evidence in a clinical report or research study. Inferred fusions should not be recorded or evaluated in isolation without additional context, as this can lead to potential misinterpretation and/or affect clinical management.
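The joining rules above can be sketched as a small helper. This is a minimal illustration, not a reference implementation: the function name `join_components` and the constant names are hypothetical, and the helper assumes the caller has already formatted each component and placed it in the required order (5’ to 3’, or regulatory element first).

```python
# Operators described in the General Rules; the constant names are illustrative.
FUSION = "::"        # standard join between components
INFERRED = "(::)"    # fusion inferred from an assayed genomic rearrangement
READ_THROUGH = "-"   # gene-level read-through transcript


def join_components(components, operator=FUSION):
    """Join already-formatted components in the order they are given."""
    if operator not in (FUSION, INFERRED, READ_THROUGH):
        raise ValueError(f"unsupported operator: {operator!r}")
    if len(components) < 2:
        raise ValueError("a fusion needs at least two components")
    return operator.join(components)


print(join_components(["BCR", "ABL1"]))                          # BCR::ABL1
print(join_components(["reg_e@GATA2", "EVI1"]))                  # reg_e@GATA2::EVI1
print(join_components(["INS", "IGF2"], operator=READ_THROUGH))   # INS-IGF2
```

The helper does not check ordering or component types; those checks depend on the component definitions in the sections that follow.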
Gene components are used in coarse representation of gene fusions by constituent gene partners, and are generally aligned with previous recommendations on gene-gene fusion descriptions as provided by HGNC [Bruford2021], with attention paid to the additional considerations raised in [Wagner2021].
The most commonly used component is the Named Gene Component, which is complemented by the Unknown Gene Component (for Assayed Gene Fusions) and the Multiple Possible Gene Component (for Categorical Gene Fusions).
In addition, read-through fusion transcripts at the gene level may be described with a hyphen instead of a double-colon, also in alignment with HGNC recommendations [Bruford2021]. For example, a read-through of the INS gene to the IGF2 gene may be described as INS-IGF2 in lieu of INS::IGF2, indicating it as a read-through.
Rearranged genes can have newly adjacent partner genes with which they produce read-through transcripts. Gene-level description of these read-through transcripts must use the standard double-colon syntax. See the special case of read-through fusions for more.
Named Gene Component
Named Gene Components are most often described by an assigned gene symbol from a gene naming authority such as HGNC. An example fusion described as two Named Gene Components may look like: BCR::ABL1. This is a convenient shorthand syntax for describing fusions at the gene level, but should be accompanied by references to stable gene IDs associated with each used symbol.
Gene symbols (e.g. KMT2A, previously known as MLL) are less stable than their associated gene identifiers (e.g. hgnc:7132). Named Gene Components SHOULD ALWAYS be accompanied by a persistent gene identifier elsewhere within the document or resource where the fusion is described, aligned with prior recommendations from the HGNC [Bruford2021].
Alternatively, Named Gene Components may use the optional Identified Symbol Syntax to identify gene symbols directly within the fusion description if an application would benefit from doing so, though use of this optional syntax will not be compliant with the HGNC recommendations.
Identified Symbol Syntax
In some circumstances it may be preferable to identify the gene symbol used to describe a named gene component directly in the description of the gene fusion. In those cases, the following optional syntax may be used for Named Gene Components:
<Gene Symbol>(<Gene ID>)
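As a rough sketch of how the `<Gene Symbol>(<Gene ID>)` form could be produced and read back, the helpers below format and parse a Named Gene Component. The function names and the regular expression are assumptions made for illustration; in particular, the pattern treats any run of characters without parentheses, whitespace, or colons as a gene symbol, which is looser than a naming authority's actual rules. The KMT2A identifier is the one quoted earlier in this section.

```python
import re

# Illustrative pattern: a symbol, optionally followed by "(<gene id>)".
_NAMED_GENE = re.compile(r"^(?P<symbol>[^()\s:]+)(?:\((?P<gene_id>[^()\s]+)\))?$")


def format_named_gene(symbol, gene_id=None):
    """Return the symbol alone, or the symbol with its identifier in parentheses."""
    return symbol if gene_id is None else f"{symbol}({gene_id})"


def parse_named_gene(text):
    """Split a Named Gene Component into (symbol, gene_id); gene_id may be None."""
    match = _NAMED_GENE.match(text)
    if match is None:
        raise ValueError(f"not a named gene component: {text!r}")
    return match.group("symbol"), match.group("gene_id")


print(format_named_gene("KMT2A", "hgnc:7132"))   # KMT2A(hgnc:7132)
print(parse_named_gene("KMT2A(hgnc:7132)"))      # ('KMT2A', 'hgnc:7132')
print(parse_named_gene("BCR"))                   # ('BCR', None)
```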
An example fusion described with this syntax may look like:
Unknown Gene Component
The syntax for an unknown (typically inferred) gene component (used for Assayed Gene Fusions) is a
An example fusion using an unknown gene component may be inferred from an ALK break-apart assay:
Multiple Possible Gene Component
The syntax for a multiple possible gene component (used for Categorical Gene Fusions) is a
An example fusion using a multiple possible gene component is the “ALK Fusions” concept as seen in biomedical knowledgebases (e.g. CIViC ALK Fusion, OncoKB ALK Fusions):
Transcript Sequence Components
Transcript sequence components are used in precise representation of gene fusions by sequence representations, and are designed for compatibility with the Human Genome Variation Society (HGVS) variant nomenclature. Primary among these components is the Transcript Segment Component, and the closely-related 5’ and 3’ Junction Components. Additional components are used to represent intervening sequences, provided as a stand-alone literal sequence (Linker Sequence Component) or as a sequence derived from a Genomic Location (Templated Linker Sequence Component).
Transcript Segment Component
The Transcript Segment Component explicitly describes a transcript sequence segment by start and end exons, and is represented using the following syntax:
<Transcript ID>(<Gene Symbol>):e.<start exon><+/- offset>_<end exon><+/- offset>
Offsets, if omitted, indicate that there is no offset from the segment boundary (which is often the case in gene fusions). For a full description of the use of exon coordinates and offsets, see Structural Elements.
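The segment syntax can be illustrated with a small data class that renders the string form. This is a sketch under stated assumptions: the class name `TranscriptSegment`, its field names, and the placeholder transcript ID and gene symbol (`TX_0001.1`, `GENE1`) are hypothetical and do not correspond to real records. A zero offset is omitted, following the rule above.

```python
from dataclasses import dataclass


def _exon_with_offset(exon, offset):
    """Render an exon number, appending a signed offset only when it is non-zero."""
    return f"{exon}{offset:+d}" if offset else str(exon)


@dataclass(frozen=True)
class TranscriptSegment:
    transcript_id: str
    gene_symbol: str
    start_exon: int
    end_exon: int
    start_offset: int = 0
    end_offset: int = 0

    def __str__(self):
        start = _exon_with_offset(self.start_exon, self.start_offset)
        end = _exon_with_offset(self.end_exon, self.end_offset)
        return f"{self.transcript_id}({self.gene_symbol}):e.{start}_{end}"


# Placeholder identifiers, for illustration only.
print(TranscriptSegment("TX_0001.1", "GENE1", 1, 3))                 # TX_0001.1(GENE1):e.1_3
print(TranscriptSegment("TX_0001.1", "GENE1", 1, 3, end_offset=12))  # TX_0001.1(GENE1):e.1_3+12
```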
Transcript segment components would be used, for example, to represent COSMIC Fusion 165 (COSF165) under the gene fusion nomenclature as follows:
The 5’ and 3’ Junction Components represent only 5’ and 3’ junction locations, respectively, for Chimeric Transcript Fusions. These components contrast with the Transcript Segment Component which represents a full segment. As noted in the General Rules, these components must be used only as the beginning or ending components, respectively, for a fusion.
The syntax for these components follows:
5’ Junction Component: <Transcript ID>(<Gene Symbol>):e.<end exon><+/- offset>
3’ Junction Component: <Transcript ID>(<Gene Symbol>):e.<start exon><+/- offset>
The optional offsets have the same meaning as in the Transcript Segment Component.
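A minimal sketch of the junction syntax follows. Both junction types share one string form and differ only in which exon they name and where they may appear, so a single hypothetical helper `junction` renders either, and `fusion_by_junctions` enforces the rule that the 5’ junction comes first and the 3’ junction last. All names and the placeholder identifiers (`TX_A.1`, `GENE_A`, and so on) are assumptions for illustration.

```python
def junction(transcript_id, gene_symbol, exon, offset=0):
    """Render a junction; for a 5' partner pass its end exon, for a 3' partner its start exon."""
    exon_text = f"{exon}{offset:+d}" if offset else str(exon)
    return f"{transcript_id}({gene_symbol}):e.{exon_text}"


def fusion_by_junctions(five_prime, three_prime, middle=()):
    """Place the 5' junction first and the 3' junction last, with any other components between."""
    return "::".join([five_prime, *middle, three_prime])


# Placeholder identifiers, for illustration only.
five = junction("TX_A.1", "GENE_A", 2)
three = junction("TX_B.1", "GENE_B", 5, offset=-3)
print(fusion_by_junctions(five, three))
# TX_A.1(GENE_A):e.2::TX_B.1(GENE_B):e.5-3
```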
Linker Sequence Component
The Linker Sequence Component is represented literally by DNA characters (
Linker Sequence Components would be used, for example, to represent COSMIC Fusion 1780 (COSF1780) under the gene fusion nomenclature as follows:
Using Transcript Segment Component:
Using Junction Components:
Templated Linker Sequence Component
The Templated Linker Sequence Component is represented by a genomic location and strand using the following syntax:
<Chromosome ID>(chr <1-22, X, Y>):g.<start coordinate>_<end coordinate>(+/-)
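The following sketch renders a Templated Linker Sequence Component from its parts, reading the syntax above literally (including the space after `chr`). The function name `templated_linker` and its parameter names are hypothetical, and the coordinates in the usage line are arbitrary values chosen only to show the output shape.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def templated_linker(chromosome_id, chromosome, start, end, strand):
    """Render <Chromosome ID>(chr <name>):g.<start>_<end>(<strand>)."""
    if str(chromosome) not in VALID_CHROMOSOMES:
        raise ValueError(f"chromosome must be 1-22, X, or Y, got {chromosome!r}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return f"{chromosome_id}(chr {chromosome}):g.{start}_{end}({strand})"


# Arbitrary coordinates, for illustration only.
print(templated_linker("NC_000001.11", 1, 1000, 1020, "+"))
# NC_000001.11(chr 1):g.1000_1020(+)
```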
In the description of gene fusions, at most one regulatory element component may be used to describe the fusion, and it must be designated first (see General Rules). However, regulatory components are complex data objects themselves, and may comprise multiple subcomponents which collectively describe the regulatory element of interest. This section specifies the nomenclature for defining regulatory elements, which may be used as a component in the broader description of Regulatory Fusions.
Every regulatory element component begins with a description of the regulatory element class, which is typically an enhancer or promoter. This is designated as reg_e or reg_p, respectively. In rare cases, it may be
necessary to represent other classes of regulatory elements within the INSDC regulatory class vocabulary, which
may be specified using this syntax by appending the regulatory class name to
reg_ as applicable (e.g.
Feature ID subcomponent
A regulatory element may be described by reference to a registered identifier, such as the registered cis-regulatory elements from ENCODE. These are represented using the syntax:
An example registered enhancer element is reg_e_EH38E1516972.
Only one of a Feature ID OR a Feature location subcomponent may be specified.
Feature location subcomponent
A regulatory element may be described by reference to a Genomic Location. These are represented using the syntax:
<Chromosome ID>(chr <1-22, X, Y>):g.<start coordinate>_<end coordinate>
Only one of a Feature Location OR a Feature ID subcomponent may be specified.
Associated gene subcomponent
A regulatory element may also be described by reference to an associated gene. An associated gene is represented using the syntax:
First use of a gene in a document: @<associated gene symbol>(<associated gene ID>)
Subsequent use in a document: @<associated gene symbol>(<associated gene ID>)
An associated gene may be indicated in addition to, or in lieu of, a Feature ID subcomponent or Feature location subcomponent. If representing a regulatory element without an associated feature ID or feature location subcomponent, an associated gene subcomponent MUST be used. The associated gene subcomponent is always placed at the end of the regulatory element description.
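The subcomponent rules above can be sketched as a single builder function. Several details here are assumptions rather than part of the specification as quoted: the function and parameter names are hypothetical; the underscore separator before the feature is taken from the `reg_e_EH38E1516972` example and is assumed to apply to a Feature location as well; and the associated gene ID is treated as optional because the `reg_e@GATA2` example in the General Rules omits it.

```python
def regulatory_element(element_class, feature_id=None, feature_location=None,
                       gene_symbol=None, gene_id=None):
    """Build a regulatory element description such as reg_e@GATA2."""
    if feature_id is not None and feature_location is not None:
        raise ValueError("specify a Feature ID or a Feature location, not both")
    if feature_id is None and feature_location is None and gene_symbol is None:
        raise ValueError("an associated gene is required when no feature is given")
    if gene_id is not None and gene_symbol is None:
        raise ValueError("a gene ID needs an associated gene symbol")

    text = f"reg_{element_class}"
    feature = feature_id if feature_id is not None else feature_location
    if feature is not None:
        text += f"_{feature}"          # separator assumed from the Feature ID example
    if gene_symbol is not None:        # the associated gene always comes last
        text += f"@{gene_symbol}"
        if gene_id is not None:
            text += f"({gene_id})"
    return text


print(regulatory_element("e", gene_symbol="GATA2"))          # reg_e@GATA2
print(regulatory_element("e", feature_id="EH38E1516972"))    # reg_e_EH38E1516972
```

The result is a single component string, which would be placed first when joined with the rest of a Regulatory Fusion description.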
Bruford EA, et al., HUGO Gene Nomenclature Committee (HGNC) recommendations for the designation of gene fusions. Leukemia (October 2021). doi:10.1038/s41375-021-01436-6
Wagner AH, et al., Recommendations for future extensions to the HGNC gene fusion nomenclature. Leukemia (December 2021). doi:10.1038/s41375-021-01493-x