Minimum Information Model
To accurately characterize gene fusions, a set of data elements comprising a minimum information model has been defined. These elements are selectively used in accordance with the type of gene fusion (Chimeric Transcript Fusions and/or Regulatory Fusions) and the gene fusion context (Assayed Gene Fusions or Categorical Gene Fusions).
Common Elements
Some data elements (e.g. genes) are complex entities with their own information model that are reused across multiple sections of the gene fusion information model. We call these common data elements, which we describe here.
Gene
A gene is defined by a gene symbol and stable gene identifier. For describing gene fusions in humans, we recommend using HUGO Gene Nomenclature Committee (HGNC) genes.
Field | Limits | Description |
---|---|---|
Gene symbol | 1..1 | A registered symbol for a gene, e.g. |
Gene identifier | 1..1 | A registered identifier for a gene, e.g. |
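As a minimal sketch, the Gene element could be modeled as a small immutable record. The class name `Gene`, its attribute names, and the example values (the symbol `ALK` with the HGNC identifier `hgnc:427`) are illustrative assumptions and not part of the information model itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical container for the Gene common element."""

    symbol: str      # Gene symbol, limits 1..1
    identifier: str  # Gene identifier, limits 1..1

    def __post_init__(self) -> None:
        # Both fields have limits 1..1, so neither may be empty.
        if not self.symbol or not self.identifier:
            raise ValueError("Gene requires both a symbol and an identifier")


# Illustrative instance; the values are examples chosen for this sketch.
alk = Gene(symbol="ALK", identifier="hgnc:427")
```

Freezing the dataclass keeps a Gene hashable, which is convenient when the same gene is reused across several sections of a fusion record.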
Genomic Location
Formally, a genomic location is a specialized case of a Sequence Location, with the reference sequence identifier constrained to those representing chromosomal reference sequences associated with a genome assembly. A Genomic Location may be informally described as a position on a chromosome sequence. In gene fusions, genomic locations are often used to represent the inter-residue location at which a fusion junction occurs. They may also be used to specify the location of regulatory elements or templated linker sequence.
Sequence Location
A sequence location is a position on a sequence, defined by a reference sequence, a start coordinate, and an end coordinate. Reference sequences used to describe Sequence Locations should be versioned.
Note
The coordinates indicated here are not described inherently as residue or inter-residue, 0-based or 1-based. The omission on this point is intentional; see the associated Discussion at GitHub.
Field | Limits | Description |
---|---|---|
Reference sequence identifier | 1..1 | A registered identifier for the reference sequence, e.g. |
Start coordinate | 1..1 | A coordinate representing the start of a sequence location. |
End coordinate | 1..1 | A coordinate representing the end of a sequence location. |
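The relationship between a Sequence Location and a Genomic Location can be sketched as a record plus a check on the reference sequence identifier. This is only a sketch under stated assumptions: the names `SequenceLocation` and `is_genomic_location` are hypothetical, identifiers are assumed to be bare versioned RefSeq accessions, and the `NC_` prefix test is a heuristic for chromosomal sequences rather than a rule from the model. The coordinates are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceLocation:
    """Hypothetical container for the Sequence Location common element."""

    sequence_id: str  # Versioned reference sequence identifier, limits 1..1
    start: int        # Start coordinate, limits 1..1
    end: int          # End coordinate, limits 1..1

    def __post_init__(self) -> None:
        # The coordinate system is deliberately left open by the model,
        # so this sketch only checks ordering (an assumption made here).
        if self.start > self.end:
            raise ValueError("start must not exceed end")


def is_genomic_location(location: SequenceLocation) -> bool:
    """Heuristic: versioned RefSeq NC_ accessions are treated as chromosomal."""
    accession, _, version = location.sequence_id.rpartition(".")
    return accession.startswith("NC_") and version.isdigit()


# Placeholder coordinates; start == end mimics a zero-width junction position.
junction = SequenceLocation("NC_000002.12", 1000, 1000)
```

An unversioned identifier fails the check, which reflects the recommendation that reference sequences be versioned.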
Structural Elements
The structural elements of a gene fusion represent the expressed gene product, and are typically characterized at the gene level or the transcript level. Chimeric Transcript Fusions must be represented by at least two structural elements, and Regulatory Fusions must be represented by at least one structural element and one Regulatory Element.
The order of structural elements is important, and by convention representations of structural components for gene fusions follow a 5’ -> 3’ ordering. If describing a regulatory fusion, the regulatory element is listed first.
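The element-count and ordering rules above can be sketched as a simple validator. The function name, the `"chimeric"` and `"regulatory"` type labels, and the `REGULATORY` kind label are hypothetical, and the sketch treats the two fusion types as mutually exclusive for simplicity.

```python
from typing import Sequence

REGULATORY = "regulatory_element"  # hypothetical kind label


def validate_element_order(kinds: Sequence[str], fusion_type: str) -> None:
    """Check count and ordering rules for elements listed 5' -> 3'.

    `kinds` holds one label per element; any label other than REGULATORY
    is treated as a structural element.
    """
    structural = [k for k in kinds if k != REGULATORY]
    regulatory = [k for k in kinds if k == REGULATORY]
    if fusion_type == "chimeric":
        if len(structural) < 2:
            raise ValueError("chimeric transcript fusions need at least two structural elements")
    elif fusion_type == "regulatory":
        if not structural or not regulatory:
            raise ValueError("regulatory fusions need a structural and a regulatory element")
        if kinds[0] != REGULATORY:
            raise ValueError("the regulatory element must be listed first")
    else:
        raise ValueError(f"unknown fusion type: {fusion_type!r}")


# Both calls pass silently because the rules are satisfied.
validate_element_order(["gene", "transcript_segment"], "chimeric")
validate_element_order([REGULATORY, "gene"], "regulatory")
```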
The minimal information for characterizing gene fusions is context-dependent, with components necessary for representing assayed fusions (blue-green boxes), categorical fusions (yellow boxes), or both (white boxes). (A) Structural Elements represent the expressed gene product, and are typically characterized at the gene level or the transcript level. Segments of transcripts should be represented by a transcript ID and associated 5’ and/or 3’ Segment Boundary. (B) Segment Boundaries are characterized by the exon number and offset from the corresponding 5’ or 3’ end. Segment Boundaries also include an aligned Genomic Coordinate with a versioned reference sequence identifier (e.g. a RefSeq NC_ chromosome sequence accession) and position for data fidelity. Importantly, segment boundary Genomic Coordinates represent the aligned positions of fusion junctions, and NOT breakpoints for an associated rearrangement.
Gene (as Structural Element)
A gene (see the Gene common element above for information model) may be used as a structural element, in which case it refers to an unspecified transcript of that gene. For Categorical Gene Fusions, this means any transcript meeting other parameters of the specified fusion. For Assayed Gene Fusions, this means that the exact transcript is not known.
Transcript Segment
A transcript segment is a segment of transcribed sequence denoted by a 5’ and 3’ segment boundary. Typically, transcript segments are used when the gene fusion junction boundary is known or when representing full-length Chimeric Transcript Fusions. In the case where only the fusion junction is reported, only one boundary of a given transcript segment will be represented.
We recommend that representative transcript sequences, when needed, are preferentially selected using the following criteria:
1. A compatible transcript from MANE Select
2. A compatible transcript from MANE Plus Clinical
3. The longest compatible transcript cDNA sequence
4. The first-published transcript among those transcripts meeting criterion #3
Transcript compatibility should be determined from what is known about the gene fusion structure. If the gene fusion junction sequence is known, compatible transcripts are those that most accurately reflect the junction, with selection among those transcripts prioritized by the above criteria. If the breakends for an underlying rearrangement are known, those data may also help identify the most compatible transcript selection.
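The prioritization above could be sketched as follows, assuming the set of compatible transcripts has already been determined. The `Transcript` record, its attribute names, and the use of a publication year to stand in for "first-published" are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record describing one compatible transcript."""

    accession: str
    cdna_length: int
    mane_select: bool = False
    mane_plus_clinical: bool = False
    year_published: int = 9999  # unknown dates sort last


def select_representative(compatible: Iterable[Transcript]) -> Transcript:
    """Pick a representative transcript using the four ordered criteria."""
    candidates = list(compatible)
    if not candidates:
        raise ValueError("no compatible transcripts supplied")
    # Criteria 1 and 2: prefer MANE Select, then MANE Plus Clinical.
    for flag in ("mane_select", "mane_plus_clinical"):
        flagged = [t for t in candidates if getattr(t, flag)]
        if flagged:
            return flagged[0]
    # Criterion 3: longest cDNA; criterion 4: earliest published among ties.
    return min(candidates, key=lambda t: (-t.cdna_length, t.year_published))
```

Sorting on the pair `(-cdna_length, year_published)` applies criterion 4 only as a tie-breaker among the longest transcripts, matching the wording of the list.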
Field | Limits | Description |
---|---|---|
Transcript sequence identifier | 1..1 | A registered identifier for the reference transcript sequence, e.g. |
5’ segment boundary | 0..1 | A Segment Boundary representing the 5’ end of the transcript segment |
3’ segment boundary | 0..1 | A Segment Boundary representing the 3’ end of the transcript segment |
Segment Boundary
A segment boundary describes the exon-anchored coordinate (and corresponding genomic coordinate) defining a boundary of a transcript segment.
Field | Limits | Description |
---|---|---|
Exon number | 1..1 | The exon number counted from the 5’ end of the transcript. |
Exon offset | 1..1 | A value representing the offset from the segment boundary, with negative values offset towards the 5’ end of the transcript and positive values offset towards the 3’ end of the transcript. Offsets can reference sequence in the intronic space. |
Genomic location | 1..1 | A Genomic Location aligned to the transcript segment boundary. |
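A Segment Boundary might be sketched as the record below. The class and attribute names are hypothetical, the genomic location is simplified to a `(sequence id, start, end)` tuple, and the assumption that exon numbering starts at 1 is made for this sketch only. All values in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SegmentBoundary:
    """Hypothetical container for the Segment Boundary element."""

    exon_number: int                       # Counted from the 5' end
    exon_offset: int                       # Negative: towards 5'; positive: towards 3'
    genomic_location: Tuple[str, int, int]  # (sequence id, start, end)

    def __post_init__(self) -> None:
        # Assumption of this sketch: the first exon is numbered 1.
        if self.exon_number < 1:
            raise ValueError("exon_number must be a positive integer")

    def offset_direction(self) -> str:
        """Report which transcript end the offset points towards."""
        if self.exon_offset < 0:
            return "5'"
        if self.exon_offset > 0:
            return "3'"
        return "none"


# Placeholder values: an offset of +7 points towards the 3' end.
boundary = SegmentBoundary(3, 7, ("NC_000002.12", 1000, 1000))
```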
Linker Sequence
A linker sequence is an observed sequence in the gene fusion that typically occurs between transcript segments, and where the sequence origin is unknown or ambiguous. In cases where the linker sequence is a known intronic or intergenic region, it should be represented as a Templated Linker Sequence instead.
Field | Limits | Description |
---|---|---|
Sequence | 1..1 | A literal sequence expressed as cDNA. |
Templated Linker Sequence
A templated linker sequence is an observed sequence in the gene fusion that typically occurs between transcript segments, and where the sequence origin is a known intronic or intergenic region.
Field | Limits | Description |
---|---|---|
Genomic location | 1..1 | A Genomic Location from which the linker sequence is derived. |
Genomic strand | 1..1 | MUST be one of |
Sequence | 0..1 | An optional literal sequence derived from the genomic location. |
Regulatory Elements
Regulatory elements include a Regulatory Feature used to describe an enhancer, promoter, or other regulatory elements that constitute Regulatory Fusions. Regulatory features may also be defined by a gene with which the feature is associated (e.g. an IGH-associated enhancer element).
Regulatory Feature
Our definitions of regulatory features follow the definitions provided by the INSDC regulatory class vocabulary. In gene fusions, these are typically either enhancer or promoter features. These features may be represented as stand-alone entities with their own conceptual identifier (e.g. ENCODE cis-Regulatory Elements) or by a Genomic Location. Regulatory features may also be represented by their association with a nearby gene (e.g. regulatory fusion between MYC and IGH-associated enhancer elements). It is expected that a regulatory feature will be described by at least (and often exactly) one of a Feature ID, Genomic location, or associated gene.
Field | Limits | Description |
---|---|---|
Regulatory class | 1..1 | MUST be |
Feature ID | 0..1 | An optional identifier for the regulatory feature, e.g. registered cis-regulatory elements from ENCODE. |
Feature location | 0..1 | An optional Genomic Location for the regulatory feature. |
Associated gene | 0..1 | A Gene associated with the regulatory feature. |
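The expectation that a regulatory feature carries at least one of a Feature ID, a location, or an associated gene can be sketched as a constructor check. The class and attribute names are hypothetical, the optional fields are simplified to plain strings, and the regulatory class is left as a free string because the permitted values are not enumerated in this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegulatoryFeature:
    """Hypothetical container for the Regulatory Feature element."""

    regulatory_class: str                   # Limits 1..1
    feature_id: Optional[str] = None        # Limits 0..1
    feature_location: Optional[str] = None  # Limits 0..1 (simplified to a string)
    associated_gene: Optional[str] = None   # Limits 0..1 (simplified to a symbol)

    def __post_init__(self) -> None:
        if not self.regulatory_class:
            raise ValueError("regulatory_class is required")
        # At least one of the three optional descriptors is expected.
        if not any((self.feature_id, self.feature_location, self.associated_gene)):
            raise ValueError(
                "expected a feature ID, a feature location, or an associated gene"
            )


# An IGH-associated enhancer, described only by its associated gene.
igh_enhancer = RegulatoryFeature("enhancer", associated_gene="IGH")
```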
Categorical Elements
Categorical data elements are specifically used for the representation of Categorical Gene Fusions. These data elements define the key criteria for matching Assayed Gene Fusions.
Functional Domains
Categorical Gene Fusions are often characterized by the presence or absence of critical functional domains within a gene fusion.
Field | Limits | Description |
---|---|---|
Label | 0..1 | An optional name for the functional domain, e.g. |
ID | 0..1 | An optional namespaced identifier for the domain, e.g. interpro:IPR000719. |
Sequence location | 0..1 | An optional Sequence Location for the domain. |
Status | 1..1 | MUST be one of [ |
Associated gene | 1..1 | The Gene associated with the domain. |
Reading Frame
A common attribute of a categorical gene fusion is whether the reading frame is preserved in the expressed gene product. This is typical of protein-coding gene fusions.
Field | Limits | Description |
---|---|---|
Reading frame preserved | 0..1 | Boolean indicating whether the reading frame must be preserved or not. |
Assayed Elements
Assayed data elements are specifically used for the representation of Assayed Gene Fusions. These data elements provide important context for downstream evaluation of Chimeric Transcript Fusions and Regulatory Fusions detected by biomedical assays.
Causative Event
The evaluation of a fusion may be influenced by the underlying mechanism that generated the fusion. Often this will be a DNA rearrangement, but it could also be a read-through or trans-splicing event.
Field | Limits | Description |
---|---|---|
Type | 1..1 | The type of event that generated the fusion. May be |
Description | 0..1 | For rearrangements, this field is useful for characterizing the rearrangement. This could be a string describing the rearrangement with an appropriate nomenclature (e.g. ISCN or HGVS), or an equivalent data structure. |
Assay
Metadata about the assay that detected the fusion, including whether that fusion was directly detected by the assay or inferred, is useful to preserve for downstream evaluation.
Field | Limits | Description |
---|---|---|
Name | 1..1 | A human-readable name for the assay. Should match the label for the assay ID, e.g. |
ID | 1..1 | An ID for the assay concept, e.g. obi:OBI_0003094 from the Ontology for Biomedical Investigations. |
Fusion detection | 1..1 | MUST be one of [direct, inferred]. Direct detection methods (e.g. RNA-seq, RT-PCR) directly interrogate chimeric transcript junctions. Inferred detection methods (e.g. WGS, FISH) infer the existence of a fusion in the presence of compatible biomarkers (e.g. ALK rearrangements in non-small cell lung cancers). |
Method URI | 1..1 | A URI pointing to the methodological details of the assay. |
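The Assay element, with its two permitted fusion detection values, could be sketched as an enumeration plus a record. The class, attribute, and function names are hypothetical; the assay name and method URI in the example are placeholders, and only the assay ID is taken from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class FusionDetection(Enum):
    """The two permitted values for the Fusion detection field."""

    DIRECT = "direct"      # e.g. RNA-seq, RT-PCR
    INFERRED = "inferred"  # e.g. WGS, FISH


@dataclass(frozen=True)
class Assay:
    """Hypothetical container for the Assay element (all fields 1..1)."""

    name: str
    assay_id: str
    fusion_detection: FusionDetection
    method_uri: str


def parse_fusion_detection(value: str) -> FusionDetection:
    """Convert free text to the enum; raises ValueError for other values."""
    return FusionDetection(value.strip().lower())


# Placeholder name and URI; the ID is the example given in the table.
assay = Assay(
    name="placeholder assay label",
    assay_id="obi:OBI_0003094",
    fusion_detection=parse_fusion_detection("Inferred"),
    method_uri="https://example.org/assay-methods",
)
```

Using an enumeration rather than a bare string means that any value outside `direct` and `inferred` is rejected when the record is built.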