Gene Fusion Guidelines
Warning
These guidelines are in a draft state, assembled by consensus through a cross-consortia initiative with representatives from multiple professional societies. However, this draft has not yet been evaluated for formal endorsement by any professional society. Community alignment status is organized on GitHub and summarized at Community Feedback and Endorsements.
The Gene Fusion Guidelines are a collection of recommendations for the precise representation of gene fusions, assembled by a cross-consortia initiative between members of the Clinical Genome (ClinGen) Somatic Cancer Clinical Domain Working Group, the Cancer Genomics Consortium, the Cytogenetics Committee of the College of American Pathologists (CAP) and the American College of Medical Genetics and Genomics (ACMG), and the Variant Interpretation for Cancer Consortium.
The guidelines provide a precise definition, minimal information model, and nomenclature for both assayed and categorical gene fusions. These guidelines incorporate, extend, and refine related standards, including the HUGO Gene Nomenclature Committee (HGNC) recommendations for the designation of gene fusions.
Introduction
Maximizing the personal, public, research, and clinical value of genomic information will require that clinicians, researchers, and testing laboratories capture and report genetic variation data reliably. The Gene Fusion Guidelines — written by a partnership among experts from clinical laboratory testing and informatics societies — is an open set of guidelines to standardize the representation of gene fusion data and knowledge.
Here we document the primary contributions of this specification for variation representation:
Terminology. We provide definitions for gene fusions and disambiguate several classes, including chimeric product and regulatory fusions, from related but distinct concepts such as genomic rearrangements. We also elaborate on the distinction between gene fusions represented as aggregated concepts in cohort studies and biomedical knowledgebases and individual fusions observed in a sample.
Minimum information model. We provide recommendations on the salient data elements for the representation of assayed and categorical gene fusions. These data elements capture key information used in the evaluation of gene fusions in biomedical research and clinical applications.
Human-readable nomenclature. We provide recommendations for a human-readable nomenclature for the designation of gene fusions and associated regulatory elements, at the sequence or gene level.
Gene fusion information capture workflows. We provide recommendations for gathering information about gene fusions in bioinformatics pipelines and knowledge curation efforts.
Supporting tools. We provide a Python library (fusor) that enforces data objects containing the salient elements of gene fusions, for use in informatics pipelines. We also provide an educational web tool (fusion-curation) that implements our recommendations to train gene fusion curators.
Terminology
Gene Fusions
Gene fusions are a complex class of variation that may be characterized by a broad range of relevant attributes with varying specificity. The defining characteristic of gene fusions is the interaction of two or more genes to drive aberrant activity of a gene product, through formation of a chimeric transcript or interaction of rearranged gene regulatory elements. Similar genetic variations, such as rearrangements within the same gene (e.g. internal tandem duplications) and transcript alterations due to splice site variants, are biologically meaningful but distinct from gene fusions. Importantly, gene fusions are also distinct from the underlying genomic rearrangements that drive them, though these concepts have been conflated due to the historical use of genomic assays for inferring the presence of specific gene fusions.
The two primary classes of gene fusions–Chimeric Transcript Fusions and Regulatory Fusions–are not mutually exclusive classes, as some fusions (such as promoter-swap fusions) may be defined either in the context of their regulatory elements or by their chimeric gene product.
Gene products that are considered loss-of-function or are not expressed should not be described as gene fusions, even when they result from a genomic rearrangement.

(TOP) Gene fusions may be regulatory in nature, where a rearranged promoter or nearby enhancer element drives overexpression of the partner gene. (BOTTOM) Gene fusions typically result in chimeric transcripts between two genes, which (for coding transcripts) often result in novel protein sequences.
Chimeric Transcript Fusions
Chimeric transcript gene fusions are often driven by genomic rearrangements involving two gene loci, resulting in the concatenation of exons from each into a single chimeric transcript. This class of fusions is exemplified by well-known clinically-relevant gene fusions such as BCR(hgnc:1014)::ABL1(hgnc:76). Other clinically-relevant gene fusions of this type may be driven by RNA processing events in lieu of genomic rearrangements, including read-through derived fusions such as CTSD(hgnc:2529)::IFITM10(hgnc:40022) and trans-splicing derived fusions such as JAZF1(hgnc:28917)::JJAZ1(hgnc:17101). These alternative mechanisms for creating fusions are specified in these guidelines, but it should be noted that most read-through and trans-splicing events are artifactual and have little to no known clinical relevance.
Regulatory Fusions
In contrast to chimeric transcript fusions, regulatory fusions are primarily characterized by the rearrangement of regulatory elements from one gene near a second gene, resulting in increased expression of the second gene's product. This class of gene fusions includes promoter-swapping gene fusions such as TMPRSS2(hgnc:11876)::ERG(hgnc:3446), as well as enhancer-driven gene fusions such as reg_e@IGH(hgnc:5477)::MYC(hgnc:7553). Gene products rendered unexpressed or non-functional should not be described as gene fusions, even when they result from a genomic rearrangement.
Gene Fusion Contexts
Determining the salient elements for a gene fusion is dependent upon the context in which the gene fusion is being described, whether it describes an assayed fusion event from a sample (Assayed Gene Fusions) or an aggregate context described in biomedical literature or knowledgebases (Categorical Gene Fusions). These guidelines provide recommendations for characterizing gene fusions in each context.
Assayed Gene Fusions
Assayed gene fusions from biological specimens are directly detected using RNA-based gene fusion assays, or alternatively may be inferred from genomic rearrangements detected by whole genome sequencing or by coarser-scale cytogenomic assays in the context of informative phenotypic biomarkers. For example, an EWSR1 fusion is often inferred by break-apart FISH assay when a neoplasm is diagnosed or suspected to be Ewing sarcoma/primitive neuroectodermal tumor by immunohistochemical and/or morphological analysis.
Categorical Gene Fusions
In contrast, categorical gene fusions are generalized concepts representing a class of fusions by their shared attributes, such as retained or lost regulatory elements and/or functional domains, and are typically curated from the biomedical literature for use in genomic knowledgebases. An example categorical gene fusion is EWSR1 as a known 5’ gene fusion partner that binds 3’ partner genes with DNA-binding domains.
Minimum Information Model
To accurately characterize gene fusions, a set of data elements comprising a minimum information model has been defined. These elements are selectively used in accordance with the type of gene fusion (Chimeric Transcript Fusions and/or Regulatory Fusions) and the gene fusion context (Assayed Gene Fusions or Categorical Gene Fusions).
Common Elements
Some data elements (e.g. genes) are complex entities with their own information model that are reused across multiple sections of the gene fusion information model. We call these common data elements, which we describe here.
Gene
A gene is defined by a gene symbol and stable gene identifier. For describing gene fusions in humans, we recommend using HUGO Gene Nomenclature Committee (HGNC) genes.
Field | Limits | Description |
---|---|---|
Gene symbol | 1..1 | A registered symbol for a gene, e.g. |
Gene identifier | 1..1 | A registered identifier for a gene, e.g. |
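As a minimal sketch of the Gene element, the table above could be modeled as a small data class. The class and field names are illustrative assumptions and are not taken from any library; the example values reuse the ABL1 symbol and identifier quoted elsewhere in these guidelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical model of the Gene common element (both fields are 1..1)."""

    symbol: str      # registered gene symbol
    identifier: str  # stable, namespaced gene identifier

    def __post_init__(self) -> None:
        # Both fields are mandatory, so empty values are rejected.
        if not self.symbol or not self.identifier:
            raise ValueError("Gene requires both a symbol and an identifier")


abl1 = Gene(symbol="ABL1", identifier="hgnc:76")
```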
Genomic Location
A genomic location is a specialized case of a Sequence Location, with the reference sequence identifier constrained to those representing chromosomal reference sequences associated with a genome assembly. In gene fusions, genomic locations are often used to represent the inter-residue location at which a fusion junction occurs. They may also be used to specify the location of regulatory elements or templated linker sequence.
Sequence Location
A sequence location is defined by a reference sequence, a start coordinate, and an end coordinate. Reference sequences should be versioned.
Note
The coordinates indicated here are not described inherently as residue or inter-residue, 0-based or 1-based. Omission on this point is intentional, see the associated Discussion at GitHub.
Field | Limits | Description |
---|---|---|
Reference sequence identifier | 1..1 | A registered identifier for the reference sequence, e.g. |
Start coordinate | 1..1 | A coordinate representing the start of a genomic location. |
End coordinate | 1..1 | A coordinate representing the end of a genomic location. |
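The following sketch shows one way a Sequence Location might be held in code. It is not a definitive implementation: the class name, the version check (a trailing ".<digits>" suffix, as seen in RefSeq and Ensembl accessions), and the rule that the end must not precede the start are all assumptions of this example. The guidelines only say that reference sequences should be versioned, and they deliberately leave the coordinate convention open, so the sketch stores plain integers without interpreting them.

```python
import re
from dataclasses import dataclass

# Assumption: a versioned accession ends in ".<digits>", e.g. "NC_000022.11".
_VERSIONED = re.compile(r".+\.\d+")


@dataclass(frozen=True)
class SequenceLocation:
    """Hypothetical model of the Sequence Location common element."""

    reference_sequence_id: str
    start: int
    end: int

    def __post_init__(self) -> None:
        if not _VERSIONED.fullmatch(self.reference_sequence_id):
            raise ValueError("reference sequence identifier should be versioned")
        # Assumption of this sketch: coordinates are non-negative and ordered.
        if self.start < 0 or self.end < self.start:
            raise ValueError("expected 0 <= start <= end")


# Placeholder coordinates, chosen only to exercise the checks.
example_location = SequenceLocation("NC_000022.11", 100, 200)
```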
Structural Elements
The structural elements of a gene fusion represent the expressed gene product, and are typically characterized at the gene level or the transcript level. Chimeric Transcript Fusions must be represented by at least two structural elements, and Regulatory Fusions must be represented by at least one structural element and one Regulatory Element.
The order of structural elements is important, and by convention representations of structural components for gene fusions follow a 5’ -> 3’ ordering. If describing a regulatory fusion, the regulatory element is listed first.
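The element count and ordering rules can be checked mechanically. The sketch below is an illustration under stated assumptions: each element is reduced to a label ("structural" or "regulatory"), the function name is hypothetical, and the limit of at most one regulatory element is taken from the Regulatory Nomenclature section of these guidelines.

```python
from typing import Sequence

STRUCTURAL = "structural"
REGULATORY = "regulatory"


def check_element_order(kinds: Sequence[str]) -> None:
    """Raise ValueError if the ordered element kinds break the rules above."""
    n_regulatory = sum(1 for kind in kinds if kind == REGULATORY)
    n_structural = sum(1 for kind in kinds if kind == STRUCTURAL)
    if n_regulatory > 1:
        raise ValueError("at most one regulatory element may be used")
    if n_regulatory == 1:
        # Regulatory fusion: the regulatory element comes first,
        # followed by at least one structural element.
        if kinds[0] != REGULATORY:
            raise ValueError("the regulatory element must be listed first")
        if n_structural < 1:
            raise ValueError("a regulatory fusion needs a structural element")
    elif n_structural < 2:
        # Chimeric transcript fusion: at least two structural elements.
        raise ValueError("a chimeric transcript fusion needs two structural elements")


check_element_order([STRUCTURAL, STRUCTURAL])  # e.g. a two-partner chimeric fusion
check_element_order([REGULATORY, STRUCTURAL])  # e.g. an enhancer-driven fusion
```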
The minimal information for characterizing gene fusions is context-dependent, with components necessary for representing assayed fusions (blue-green boxes), categorical fusions (yellow boxes), or both (white boxes). (A) Structural Elements represent the expressed gene product, and are typically characterized at the gene level or the transcript level. Segments of transcripts should be represented by a transcript ID and associated 5’ and/or 3’ Segment Boundary. (B) Segment Boundaries are characterized by the exon number and offset from the corresponding 5’ or 3’ end. Segment Boundaries also include an aligned Genomic Coordinate with a versioned reference sequence identifier (e.g. a RefSeq NC_ chromosome sequence accession) and position for data fidelity. Importantly, segment boundary Genomic Coordinates represent the aligned positions of fusion junctions, and NOT breakpoints for an associated rearrangement.
Gene (as Structural Element)
A gene (see the Gene common element above for information model) may be used as a structural element, in which case it refers to an unspecified transcript of that gene. For Categorical Gene Fusions, this means any transcript meeting other parameters of the specified fusion. For Assayed Gene Fusions, this means that the exact transcript is not known.
Transcript Segment
A transcript segment is a representation of a transcribed sequence denoted by a 5-prime and 3-prime segment boundary. Typically, transcript segments are used when the gene fusion junction boundary is known or when representing full-length Chimeric Transcript Fusions. In the case where only the fusion junction is reported, only one boundary of a given transcript segment will be represented.
We recommend that representative transcript sequences, when needed, are preferentially selected using the following criteria:
1. A compatible transcript from MANE Select
2. A compatible transcript from MANE Plus Clinical
3. The longest compatible transcript cDNA sequence
4. The first-published transcript among those transcripts meeting criterion #3
Transcript compatibility should be determined from what is known about the gene fusion structure. If the gene fusion junction sequence is known, compatible transcripts are those that most accurately reflect the junction, with selection among those transcripts prioritized by the above criteria. If the breakends for an underlying rearrangement are known, those data may also help identify the most compatible transcript selection.
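The selection criteria above amount to a priority ordering, which can be sketched as a sort key. Everything named here is hypothetical: the record type, its fields, the status labels, and the use of a publication year as a stand-in for "first-published". The sketch also assumes that the candidates passed in have already been judged compatible with the fusion structure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CandidateTranscript:
    """Hypothetical record describing one compatible transcript."""

    accession: str
    mane_status: Optional[str]  # "select", "plus_clinical", or None
    cdna_length: int
    published_year: int


def select_representative(candidates: Sequence[CandidateTranscript]) -> CandidateTranscript:
    """Pick a transcript by the four criteria, in order of priority."""
    if not candidates:
        raise ValueError("no compatible transcripts to choose from")

    def priority(t: CandidateTranscript) -> Tuple[int, int, int]:
        # Criteria 1 and 2: MANE Select, then MANE Plus Clinical.
        rank = {"select": 0, "plus_clinical": 1}.get(t.mane_status, 2)
        # Criterion 3: longest cDNA first. Criterion 4: earliest published.
        return (rank, -t.cdna_length, t.published_year)

    return min(candidates, key=priority)


# Made-up accessions and values, for illustration only.
a = CandidateTranscript("TX_A.1", None, 3000, 2005)
b = CandidateTranscript("TX_B.1", "plus_clinical", 2000, 2010)
c = CandidateTranscript("TX_C.1", "select", 1500, 2015)
chosen = select_representative([a, b, c])  # the MANE Select candidate
```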
Field | Limits | Description |
---|---|---|
Transcript sequence identifier | 1..1 | A registered identifier for the reference transcript sequence, e.g. |
5’ segment boundary | 0..1 | A Segment Boundary representing the 5-prime end of the transcript segment |
3’ segment boundary | 0..1 | A Segment Boundary representing the 3-prime end of the transcript segment |
Segment Boundary
A segment boundary describes the exon-anchored coordinate (and corresponding genomic coordinate) defining a boundary of a transcript segment.
Field | Limits | Description |
---|---|---|
Exon number | 1..1 | The exon number counted from the 5-prime end of the transcript. |
Exon offset | 1..1 | A value representing the offset from the segment boundary, with positive values offset towards the 5-prime end of the transcript and negative values offset towards the 3-prime end of the transcript. Offsets can reference sequence in the intronic space. |
Genomic location | 1..1 | A Genomic Location aligned to the transcript segment boundary. |
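To illustrate how a Transcript Segment and its Segment Boundaries fit together, here is a minimal sketch with hypothetical class and field names. It assumes exon numbers start at 1 and that at least one of the two boundaries is present, following the "5’ and/or 3’" wording used for transcript segments. The genomic location in the example is a placeholder and not a real aligned coordinate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomicLocation:
    reference_sequence_id: str
    start: int
    end: int


@dataclass(frozen=True)
class SegmentBoundary:
    exon_number: int   # counted from the 5-prime end of the transcript
    exon_offset: int   # 0 means the boundary sits exactly on the exon edge
    genomic_location: GenomicLocation

    def __post_init__(self) -> None:
        # Assumption of this sketch: exon numbering is 1-based.
        if self.exon_number < 1:
            raise ValueError("exon number must be 1 or greater")


@dataclass(frozen=True)
class TranscriptSegment:
    transcript_id: str
    five_prime: Optional[SegmentBoundary] = None
    three_prime: Optional[SegmentBoundary] = None

    def __post_init__(self) -> None:
        # Each boundary is 0..1, but a segment with neither carries no position.
        if self.five_prime is None and self.three_prime is None:
            raise ValueError("at least one segment boundary is required")


# Junction-only case: only the 3-prime boundary of the 5-prime partner is known.
placeholder = GenomicLocation("NC_PLACEHOLDER.1", 100, 100)  # not a real accession
junction_only = TranscriptSegment(
    "ENST00000305877.12",
    three_prime=SegmentBoundary(exon_number=2, exon_offset=0, genomic_location=placeholder),
)
```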
Linker Sequence
A linker sequence is an observed sequence in the gene fusion that typically occurs between transcript segments, and where the sequence origin is unknown or ambiguous. In cases where the linker sequence is a known intronic or intergenic region, it should be represented as a Templated Linker Sequence instead.
Field | Limits | Description |
---|---|---|
Sequence | 1..1 | A literal sequence expressed as cDNA. |
Templated Linker Sequence
A templated linker sequence is an observed sequence in the gene fusion that typically occurs between transcript segments, and where the sequence origin is a known intronic or intergenic region.
Field | Limits | Description |
---|---|---|
Genomic location | 1..1 | A Genomic Location from which the linker sequence is derived. |
Genomic strand | 1..1 | MUST be one of |
Sequence | 0..1 | An optional literal sequence derived from the genomic location. |
Regulatory Elements
Regulatory elements include a Regulatory Feature used to describe an enhancer, promoter, or other regulatory elements that constitute Regulatory Fusions. Regulatory features may also be defined by a gene with which the feature is associated (e.g. an IGH-associated enhancer element).
Regulatory Feature
Our definitions of regulatory features follow the definitions provided by the INSDC regulatory class vocabulary. In gene fusions, these are typically either enhancer or promoter features. These features may be represented as stand-alone entities with their own conceptual identifier (e.g. ENCODE cis-Regulatory Elements) or by a Genomic Location. Regulatory features may also be represented by their association with a nearby gene (e.g. regulatory fusion between MYC and IGH-associated enhancer elements). It is expected that a regulatory feature will be described by at least (and often exactly) one of a Feature ID, Genomic location, or associated gene.
Field | Limits | Description |
---|---|---|
Regulatory class | 1..1 | MUST be |
Feature ID | 0..1 | An optional identifier for the regulatory feature, e.g. registered cis-regulatory elements from ENCODE. |
Feature location | 0..1 | An optional Genomic Location for the regulatory feature. |
Associated gene | 0..1 | A Gene associated with the regulatory feature. |
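The expectation that a regulatory feature carries at least one of a Feature ID, a location, or an associated gene can be expressed as a simple check. The sketch below uses hypothetical names and plain tuples for the location and gene to stay self-contained. It treats the regulatory class as a free-text string and does not enforce the permitted values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RegulatoryFeature:
    """Hypothetical model of the Regulatory Feature element."""

    regulatory_class: str                                     # e.g. "enhancer" or "promoter"
    feature_id: Optional[str] = None
    feature_location: Optional[Tuple[str, int, int]] = None   # (reference id, start, end)
    associated_gene: Optional[Tuple[str, str]] = None         # (symbol, identifier)

    def __post_init__(self) -> None:
        descriptors = (self.feature_id, self.feature_location, self.associated_gene)
        # At least one descriptor is expected; more than one is allowed.
        if all(value is None for value in descriptors):
            raise ValueError(
                "describe the feature by a feature ID, a location, or an associated gene"
            )


igh_enhancer = RegulatoryFeature("enhancer", associated_gene=("IGH", "hgnc:5477"))
```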
Categorical Elements
Categorical data elements are specifically used for the representation of Categorical Gene Fusions. These data elements define the key criteria for matching Assayed Gene Fusions.
Functional Domains
Categorical Gene Fusions are often characterized by the presence or absence of critical functional domains within a gene fusion.
Field | Limits | Description |
---|---|---|
Label | 0..1 | An optional name for the functional domain, e.g. |
ID | 0..1 | An optional namespaced identifier for the domain, e.g. interpro:IPR000719. |
Sequence location | 0..1 | An optional Sequence Location for the domain. |
Status | 1..1 | MUST be one of [ |
Associated gene | 1..1 | The Gene associated with the domain. |
Reading Frame
A common attribute of a categorical gene fusion is whether the reading frame is preserved in the expressed gene product. This is typical of protein-coding gene fusions.
Field | Limits | Description |
---|---|---|
Reading frame preserved | 0..1 | Boolean indicating whether the reading frame must be preserved or not. |
Assayed Elements
Assayed data elements are specifically used for the representation of Assayed Gene Fusions. These data elements provide important context for downstream evaluation of Chimeric Transcript Fusions and Regulatory Fusions detected by biomedical assays.
Causative Event
The evaluation of a fusion may be influenced by the underlying mechanism that generated the fusion. Often this will be a DNA rearrangement, but it could also be a read-through or trans-splicing event.
Field | Limits | Description |
---|---|---|
Type | 1..1 | The type of event that generated the fusion. May be |
Description | 0..1 | For rearrangements, this field is useful for characterizing the rearrangement. This could be a string describing the rearrangement with an appropriate nomenclature (e.g. ISCN or HGVS), or an equivalent data structure. |
Assay
Metadata about the assay that detected the fusion–and whether that fusion was directly detected by the assay or inferred–is useful to preserve for downstream evaluation.
Field | Limits | Description |
---|---|---|
Name | 1..1 | A human-readable name for the assay. Should match the label for the assay ID, e.g. |
ID | 1..1 | An ID for the assay concept, e.g. obi:OBI_0003094 from the Ontology for Biomedical Investigations. |
Fusion detection | 1..1 | MUST be one of [direct, inferred]. Direct detection methods (e.g. RNA-seq, RT-PCR) directly interrogate chimeric transcript junctions. Inferred detection methods (e.g. WGS, FISH) infer the existence of a fusion in the presence of compatible biomarkers (e.g. ALK rearrangements in non-small cell lung cancers). |
Method URI | 1..1 | A URI pointing to the methodological details of the assay. |
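As an illustration of the Assay element, the sketch below restricts the fusion detection field to the two permitted values using an enumeration. The class names are hypothetical, and the assay name, ID, and URI in the example are placeholders and not real ontology terms or links.

```python
from dataclasses import dataclass
from enum import Enum


class FusionDetection(str, Enum):
    DIRECT = "direct"      # the assay interrogates the chimeric transcript junction
    INFERRED = "inferred"  # the fusion is inferred from compatible biomarkers


@dataclass(frozen=True)
class Assay:
    """Hypothetical model of the Assay element (all fields are 1..1)."""

    name: str
    assay_id: str
    fusion_detection: FusionDetection
    method_uri: str

    def __post_init__(self) -> None:
        # Accept plain strings and reject anything other than direct/inferred.
        object.__setattr__(self, "fusion_detection", FusionDetection(self.fusion_detection))


example_assay = Assay(
    name="example assay label",                      # placeholder
    assay_id="example:0000001",                      # placeholder, not a real term
    fusion_detection="inferred",
    method_uri="https://example.org/methods/assay",  # placeholder
)
```

Passing any other detection value, such as "guessed", raises a ValueError from the enumeration lookup.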
Nomenclature
The following nomenclature may be used for the description of both Regulatory Fusions and Chimeric Transcript Fusions in the context of Categorical Gene Fusions or Assayed Gene Fusions as applicable. The nomenclature components are organized into three categories: Gene Components, Transcript Sequence Components, and Regulatory Nomenclature. These may be used interchangeably, in accordance with the below General Rules.
General Rules
All components are joined together by the double-colon (::) operator. Additional rules apply for sub-components of Regulatory Nomenclature.
When describing Chimeric Transcript Fusions, structural components are ordered in 5’ to 3’ orientation with respect to the transcribed gene product.
When describing Regulatory Fusions, the regulatory element is indicated first (e.g. reg_e@IGH::MYC).
When describing Chimeric Transcript Fusions by Junction Components (in lieu of full Transcript Segment Components), the 5’ fusion partner junction must be the first component, and the 3’ fusion partner junction must be the last component.
Throughout the nomenclature components, some information may be provided optionally. In these cases, the optional text is colored orange and may be omitted.
Gene Components
Gene components are used in coarse representation of gene fusions by constituent gene partners, and are generally aligned with previous recommendations on gene-gene fusion nomenclature as provided by HGNC [Bruford2021]. The most common of these is the Specific Gene Component, which is complemented by the Multiple Possible Gene Component (for Categorical Gene Fusions) and the Unknown Gene Component (for Assayed Gene Fusions).
Specific Gene Component
The syntax for a specific gene is as follows:
First use of a gene in a document: <Gene Symbol>(<Gene ID>)
Subsequent use in a document: <Gene Symbol>(<Gene ID>), where the (<Gene ID>) part is optional and may be omitted
An example fusion using two Specific Gene Components:
BCR(hgnc:1014)::ABL1(hgnc:76)
Unknown Gene Component
The syntax for an unknown (typically inferred) gene component (used for Assayed Gene Fusions) is a ?.
An example fusion using an unknown gene component may be inferred from an ALK break-apart assay:
?::ALK(hgnc:427)
Multiple Possible Gene Component
The syntax for a multiple possible gene component (used for Categorical Gene Fusions) is a v.
An example fusion using a multiple possible gene component is the “ALK Fusions” concept as seen in biomedical knowledgebases (e.g. CIViC ALK Fusion, OncoKB ALK Fusions):
v::ALK(hgnc:427)
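The three gene components and the double-colon rule can be combined in a few lines. This is a sketch with hypothetical helper names. It assumes the gene ID may be omitted, as in the reg_e@IGH::MYC example under General Rules, and it requires at least two components because every fusion described here has at least two elements.

```python
from typing import Optional

UNKNOWN_GENE = "?"    # Unknown Gene Component (Assayed Gene Fusions)
MULTIPLE_GENES = "v"  # Multiple Possible Gene Component (Categorical Gene Fusions)


def gene_component(symbol: str, gene_id: Optional[str] = None) -> str:
    """Format a Specific Gene Component, with or without its gene ID."""
    return f"{symbol}({gene_id})" if gene_id else symbol


def join_components(*components: str) -> str:
    """Join nomenclature components with the double-colon operator."""
    if len(components) < 2:
        raise ValueError("a fusion needs at least two components")
    return "::".join(components)


bcr_abl1 = join_components(
    gene_component("BCR", "hgnc:1014"), gene_component("ABL1", "hgnc:76")
)
assayed_alk = join_components(UNKNOWN_GENE, gene_component("ALK", "hgnc:427"))
categorical_alk = join_components(MULTIPLE_GENES, gene_component("ALK", "hgnc:427"))
```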
Transcript Sequence Components
Transcript sequence components are used in precise representation of gene fusions by sequence representations, and are designed for compatibility with the Human Genome Variation Society (HGVS) variant nomenclature. Primary among these components is the Transcript Segment Component, and the closely-related 5-prime and 3-prime Junction Components. Additional components are used to represent intervening sequences, provided as a stand-alone literal sequence (Linker Sequence Component) or as a sequence derived from a Genomic Location (Templated Linker Sequence Component).
Transcript Segment Component
The Transcript Segment Component explicitly describes a segment of a transcript sequence by its start and end exons, and is represented using the following syntax:
<Transcript ID>(<Gene Symbol>):e.<start exon><+/- offset>_<end exon><+/- offset>
Offsets, if omitted, indicate that there is no offset from the segment boundary (which is often the case in gene fusions). For a full description of the use of exon coordinates and offsets, see Structural Elements.
Transcript segment components would be used, for example, to represent COSMIC Fusion 165 (COSF165) under the gene fusion nomenclature as follows:
ENST00000397938.6(EWSR1):e.1_7::ENST00000527786.6(FLI1):e.6_9
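The Transcript Segment Component syntax can be rendered with a small formatter. The function names below are hypothetical. A zero offset is omitted and a non-zero offset is written with its sign, which is this sketch's reading of the "<+/- offset>" placeholder.

```python
def exon_coordinate(exon: int, offset: int = 0) -> str:
    """Render an exon number, appending a signed offset only when non-zero."""
    return f"{exon}{offset:+d}" if offset else str(exon)


def transcript_segment_component(
    transcript_id: str,
    gene_symbol: str,
    start_exon: int,
    end_exon: int,
    start_offset: int = 0,
    end_offset: int = 0,
) -> str:
    """Format <Transcript ID>(<Gene Symbol>):e.<start exon>_<end exon>."""
    start = exon_coordinate(start_exon, start_offset)
    end = exon_coordinate(end_exon, end_offset)
    return f"{transcript_id}({gene_symbol}):e.{start}_{end}"


# Reproduces the COSF165 example from its two transcript segments.
cosf165 = "::".join(
    [
        transcript_segment_component("ENST00000397938.6", "EWSR1", 1, 7),
        transcript_segment_component("ENST00000527786.6", "FLI1", 6, 9),
    ]
)
```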
Junction Components
The 5-prime and 3-prime Junction Components represent only 5-prime and 3-prime junction locations, respectively, for Chimeric Transcript Fusions. These components contrast with the Transcript Segment Component which represents a full segment. As noted in the General Rules, these components must be used only as the beginning or ending components, respectively, for a fusion.
The syntax for these components follows:
5-prime Junction Component: <Transcript ID>(<Gene Symbol>):e.<end exon><+/- offset>
3-prime Junction Component: <Transcript ID>(<Gene Symbol>):e.<start exon><+/- offset>
Optional use of offsets has the same meaning as in the Transcript Segment Component.
Linker Sequence Component
The Linker Sequence Component is represented literally by DNA characters (A, C, G, T).
Linker Sequence Components would be used, for example, to represent COSMIC Fusion 1780 (COSF1780) under the gene fusion nomenclature as follows:
Using Transcript Segment Component:
ENST00000305877.12(BCR):e.1_2::ACTAAAGCG::ENST00000318560.5(ABL1):e.2_11
Using Junction Components:
ENST00000305877.12(BCR):e.2::ACTAAAGCG::ENST00000318560.5(ABL1):e.2
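As a sketch of the junction form, the helpers below format a Junction Component and validate a Linker Sequence Component against the four DNA characters. The function names are hypothetical. The 5-prime and 3-prime junction components share the same textual form and differ only in which exon (end or start) they refer to, so one formatter serves both.

```python
import re


def junction_component(transcript_id: str, gene_symbol: str, exon: int, offset: int = 0) -> str:
    """Format <Transcript ID>(<Gene Symbol>):e.<exon>, with an optional signed offset."""
    coordinate = f"{exon}{offset:+d}" if offset else str(exon)
    return f"{transcript_id}({gene_symbol}):e.{coordinate}"


def linker_component(sequence: str) -> str:
    """Return the literal linker sequence after checking it is A/C/G/T only."""
    if not re.fullmatch(r"[ACGT]+", sequence):
        raise ValueError("linker sequence must consist of A, C, G, and T")
    return sequence


# The 5-prime junction must come first and the 3-prime junction last.
cosf1780 = "::".join(
    [
        junction_component("ENST00000305877.12", "BCR", 2),
        linker_component("ACTAAAGCG"),
        junction_component("ENST00000318560.5", "ABL1", 2),
    ]
)
```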
Templated Linker Sequence Component
The Templated Linker Sequence Component is represented by a genomic location and strand using the following syntax:
<Chromosome ID>(chr <1-22, X, Y>):g.<start coordinate>_<end coordinate>(+/-)
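A formatter for this syntax might look like the sketch below. The function name is hypothetical, the space after "chr" follows the syntax line literally, and the coordinates in the example are placeholders and not a real linker location.

```python
_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def templated_linker_component(
    chromosome_id: str, chromosome: str, start: int, end: int, strand: str
) -> str:
    """Format <Chromosome ID>(chr <name>):g.<start>_<end>(<strand>)."""
    if chromosome not in _CHROMOSOMES:
        raise ValueError("chromosome must be one of 1-22, X, or Y")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return f"{chromosome_id}(chr {chromosome}):g.{start}_{end}({strand})"


# Placeholder coordinates on a versioned chromosome accession.
example_linker = templated_linker_component("NC_000022.11", "22", 100, 200, "+")
```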
Regulatory Nomenclature
In the description of gene fusions, at most one regulatory element component may be used to describe the fusion, and it must be designated first (see General Rules). However, regulatory components are complex data objects themselves, and may be comprised of multiple subcomponents which collectively describe the regulatory element of interest. This section specifies the nomenclature for defining regulatory elements, which may be used as a component in the broader description of Regulatory Fusions.
Class Subcomponent
Every regulatory element component begins with a description of the regulatory element class, which is typically an enhancer or promoter. This is designated as reg_e or reg_p, respectively. In rare cases, it may be necessary to represent other classes of regulatory elements within the INSDC regulatory class vocabulary, which may be specified using this syntax by appending the regulatory class name to reg_ as applicable (e.g. reg_response_element).
Feature ID subcomponent
A regulatory element may be described by reference to a registered identifier, such as the registered cis-regulatory elements from ENCODE. These are represented using the syntax:
_<reference id>
An example registered enhancer element is reg_e_EH38E1516972.
Only one of a Feature ID OR a Feature location subcomponent may be specified.
Feature location subcomponent
A regulatory element may be described by reference to a Genomic Location. These are represented using the syntax:
<Chromosome ID>(chr <1-22, X, Y>):g.<start coordinate>_<end coordinate>
Only one of a Feature Location OR a Feature ID subcomponent may be specified.
Associated gene subcomponent
A regulatory element may also be described by reference to an associated gene. An associated gene is represented using the syntax:
First use of a gene in a document: @<associated gene symbol>(<associated gene ID>)
Subsequent use in a document: @<associated gene symbol>(<associated gene ID>), where the (<associated gene ID>) part is optional and may be omitted
An associated gene may be indicated in addition to, or in lieu of, a Feature ID subcomponent or Feature location subcomponent. If representing a regulatory element without an associated feature ID or feature location subcomponent, an associated gene subcomponent MUST be used. The associated gene subcomponent is always placed at the end of the regulatory element description.
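The subcomponents can be assembled in order (class, then feature ID, then associated gene) as in the sketch below. The function name and parameters are hypothetical. The Feature location subcomponent is left out because the text above does not show how it is attached to the class prefix, so this sketch requires a feature ID or an associated gene.

```python
from typing import Optional

_CLASS_ABBREVIATIONS = {"enhancer": "e", "promoter": "p"}


def regulatory_component(
    regulatory_class: str,
    feature_id: Optional[str] = None,
    gene_symbol: Optional[str] = None,
    gene_id: Optional[str] = None,
) -> str:
    """Build a regulatory element description from its subcomponents."""
    if feature_id is None and gene_symbol is None:
        raise ValueError("a feature ID or an associated gene is required")
    # enhancer -> reg_e, promoter -> reg_p, other classes -> reg_<class name>
    text = "reg_" + _CLASS_ABBREVIATIONS.get(regulatory_class, regulatory_class)
    if feature_id is not None:
        text += f"_{feature_id}"
    if gene_symbol is not None:
        # The associated gene always goes at the end of the description.
        text += f"@{gene_symbol}"
        if gene_id is not None:
            text += f"({gene_id})"
    return text


registered_enhancer = regulatory_component("enhancer", feature_id="EH38E1516972")
igh_myc = "::".join([regulatory_component("enhancer", gene_symbol="IGH"), "MYC"])
```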
References
- Bruford2021
Bruford EA, et al., HUGO Gene Nomenclature Committee (HGNC) recommendations for the designation of gene fusions. Leukemia (October 2021). doi:10.1038/s41375-021-01436-6
Curation Workflow

A workflow for the curation of gene fusions
This is a recommended workflow for the curation of gene fusions from the biomedical literature or for use in biomedical knowledgebases. Dashed lines indicate notes specific to certain decisions or actions taken in the workflow. Solid lines ending in open circles represent automatable tasks using software, such as demonstrated in the Gene Fusion Curation Tool.
Supporting Tools
Note
These tools assist in the curation and representation of gene fusion data. To do this, they must choose conventions that are not defined in these guidelines, specifically around data exchange. For example, these implementations choose to use SequenceLocation from the Global Alliance for Genomics and Health (GA4GH) Variation Representation Specification (VRS), due to its use of inter-residue coordinates and extensible design. Other implementations may choose different conventions for representation of gene fusion data in system exchange.
FUSOR
The FUSOR data validation / translation Python package provides data classes and constructor tools to create valid gene fusion messages for use in downstream applications. The package is publicly available on the Python Package Index (PyPI).
Gene Fusion Curation Tool
The Gene Fusion Curation Tool is an educational web tool that provides a user interface supporting gene fusion curation. It is primarily an educational resource to demonstrate the computable structure and associated nomenclature for gene fusions constructed in the application.
Community Feedback and Endorsements
These guidelines were initially developed through consensus building among key stakeholders with expertise in clinical variant diagnostics and informatics.
However, for widespread adoption, it is important that broader community input is considered and included in the continued development of these guidelines and alignment across the many stakeholders interested in the representation of gene fusions. If you have feedback to provide on the specification, please leave feedback on our GitHub issue tracker or Google form.
Below is a summary of the alignment efforts for this version of the specification (draft-1):
Open for community review