Minimum Information Model

To accurately characterize gene fusions, a set of data elements comprising a minimum information model has been defined. These elements are selectively used in accordance with the type of gene fusion (Chimeric Transcript Fusions and/or Regulatory Fusions) and the gene fusion context (Assayed Gene Fusions or Categorical Gene Fusions).

Common Elements

Some data elements (e.g. genes) are complex entities with their own information model that are reused across multiple sections of the gene fusion information model. We call these common data elements, which we describe here.

Gene

A gene is defined by a gene symbol and stable gene identifier. For describing gene fusions in humans, we recommend using HUGO Gene Nomenclature Committee (HGNC) genes.

Field	Limits	Description
Gene symbol	1..1	A registered symbol for a gene, e.g. `ABL1`.
Gene identifier	1..1	A registered identifier for a gene, e.g. `hgnc:76`.
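
As a minimal sketch, the two required fields could be carried in a small immutable record. The class name `Gene` and the attribute names are assumptions made for illustration; the model itself only prescribes the fields and their 1..1 cardinality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Gene common element; both fields have cardinality 1..1."""

    symbol: str      # registered gene symbol, e.g. "ABL1"
    identifier: str  # registered identifier, e.g. "hgnc:76"

    def __post_init__(self) -> None:
        # Cardinality 1..1: reject missing or empty values.
        if not self.symbol or not self.identifier:
            raise ValueError("a gene needs both a symbol and an identifier")


abl1 = Gene(symbol="ABL1", identifier="hgnc:76")
```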

Genomic Location

A genomic location is a specialized case of a Sequence Location, with the reference sequence identifier constrained to those representing chromosomal reference sequences associated with a genome assembly. In gene fusions, genomic locations are often used to represent the inter-residue location at which a fusion junction occurs. They may also be used to specify the location of regulatory elements or templated linker sequence.

Sequence Location

A sequence location is defined by a reference sequence, a start coordinate, and an end coordinate. Reference sequences should be versioned.

Note

The coordinates indicated here are not inherently described as residue or inter-residue, or as 0-based or 1-based. This omission is intentional; see the associated Discussion at GitHub.

Field	Limits	Description
Reference sequence identifier	1..1	A registered identifier for the reference sequence, e.g. `NC_000001.11` for chr1 on GRCh38.p14.
Start coordinate	1..1	A coordinate representing the start of a sequence location.
End coordinate	1..1	A coordinate representing the end of a sequence location.
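
The sketch below shows one way the three fields might be validated. It rests on two assumptions that are not part of the model: that a versioned identifier ends in a RefSeq-style `.N` suffix, and that the start coordinate never exceeds the end coordinate. The coordinate system is left unmodeled, in keeping with the note above. The class and attribute names are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: a versioned identifier ends in ".<digits>", as RefSeq accessions do.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


@dataclass(frozen=True)
class SequenceLocation:
    reference_sequence_id: str  # e.g. "NC_000001.11"
    start: int
    end: int

    def __post_init__(self) -> None:
        if not _VERSION_SUFFIX.search(self.reference_sequence_id):
            raise ValueError("reference sequence identifier should be versioned")
        # Assumption: start never exceeds end. Whether coordinates are 0- or
        # 1-based, residue or inter-residue, is deliberately not checked here.
        if self.start > self.end:
            raise ValueError("start must not exceed end")


# Placeholder coordinates, not a real fusion junction.
location = SequenceLocation("NC_000001.11", 1000, 1000)
```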

Structural Elements

The structural elements of a gene fusion represent the expressed gene product, and are typically characterized at the gene level or the transcript level. Chimeric Transcript Fusions must be represented by at least two structural elements, and Regulatory Fusions must be represented by at least one structural element and one Regulatory Element.

The order of structural elements is important. By convention, representations of structural elements for gene fusions follow a 5’ -> 3’ ordering. If describing a regulatory fusion, the regulatory element is listed first.

Figure (_images/structural-elements.svg): The minimal information for characterizing gene fusions is context-dependent, with components necessary for representing assayed fusions (blue-green boxes), categorical fusions (yellow boxes), or both (white boxes). **(A)** Structural Elements represent the expressed gene product, and are typically characterized at the gene level or the transcript level. Segments of transcripts should be represented by a transcript ID and associated 5’ and/or 3’ Segment Boundary. **(B)** Segment Boundaries are characterized by the exon number and offset from the corresponding 5’ or 3’ end. Segment Boundaries also include an aligned Genomic Coordinate with a versioned reference sequence identifier (e.g. a RefSeq NC_ chromosome sequence accession) and position for data fidelity. Importantly, segment boundary Genomic Coordinates represent the aligned positions of fusion junctions, and NOT breakpoints for an associated rearrangement.
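
The element-count and ordering rules stated above can be sketched as a small check over an ordered list of element kinds. The kind labels, the fusion type strings, and the function name are hypothetical, and the sketch assumes every element that is not a regulatory element counts as a structural element. A fusion that is both chimeric and regulatory is not handled.

```python
from typing import Sequence

REGULATORY = "regulatory_element"  # hypothetical label for this sketch


def validate_elements(fusion_type: str, kinds: Sequence[str]) -> None:
    """Raise ValueError if the 5' -> 3' ordered element kinds break the rules."""
    n_regulatory = sum(kind == REGULATORY for kind in kinds)
    n_structural = len(kinds) - n_regulatory

    if fusion_type == "chimeric_transcript":
        if n_structural < 2:
            raise ValueError("chimeric transcript fusions need >= 2 structural elements")
    elif fusion_type == "regulatory":
        if n_structural < 1 or n_regulatory < 1:
            raise ValueError("regulatory fusions need a structural and a regulatory element")
        if kinds[0] != REGULATORY:
            raise ValueError("the regulatory element is listed first")
    else:
        raise ValueError(f"unknown fusion type: {fusion_type!r}")


validate_elements("chimeric_transcript", ["transcript_segment", "gene"])
validate_elements("regulatory", [REGULATORY, "gene"])
```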

Gene (as Structural Element)

A gene (see the Gene common element above for information model) may be used as a structural element, in which case it refers to an unspecified transcript of that gene. For Categorical Gene Fusions, this means any transcript meeting other parameters of the specified fusion. For Assayed Gene Fusions, this means that the exact transcript is not known.

Transcript Segment

A transcript segment is a representation of a transcribed sequence denoted by a 5-prime and 3-prime segment boundary. Typically, transcript segments are used when the gene fusion junction boundary is known or when representing full-length Chimeric Transcript Fusions. In the case where only the fusion junction is reported, only one boundary of a given transcript segment will be represented.

We recommend that representative transcript sequences, when needed, are preferentially selected using the following criteria:

1. A compatible transcript from MANE Select
2. A compatible transcript from MANE Plus Clinical
3. The longest compatible transcript cDNA sequence
4. The first-published transcript among those transcripts meeting criterion #3

Transcript compatibility should be determined from what is known about the gene fusion structure. If the gene fusion junction sequence is known, compatible transcripts are those that most accurately reflect the junction, with selection among those transcripts prioritized by the above criteria. If the breakends for an underlying rearrangement are known, those data may also help identify the most compatible transcript selection.
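
One way to express the selection criteria is as a single sort key applied to transcripts that have already been judged compatible. This is a sketch only: the `Candidate` record, its attribute names, and the `TX_*` accessions are hypothetical, and how compatibility is determined is left outside the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    accession: str            # hypothetical transcript accession
    mane_select: bool
    mane_plus_clinical: bool
    cdna_length: int
    first_published: date


def select_transcript(compatible: Sequence[Candidate]) -> Candidate:
    """Pick a representative from transcripts already known to be compatible."""
    if not compatible:
        raise ValueError("no compatible transcripts to choose from")
    return min(
        compatible,
        key=lambda t: (
            not t.mane_select,         # criterion 1: MANE Select first
            not t.mane_plus_clinical,  # criterion 2: then MANE Plus Clinical
            -t.cdna_length,            # criterion 3: then longest cDNA
            t.first_published,         # criterion 4: then earliest publication
        ),
    )


candidates = [
    Candidate("TX_A.1", False, False, 5000, date(2001, 1, 1)),
    Candidate("TX_B.1", False, True, 3000, date(2005, 1, 1)),
    Candidate("TX_C.1", False, False, 5000, date(1999, 1, 1)),
]
print(select_transcript(candidates).accession)  # TX_B.1
```

Because the key is a tuple, each criterion only breaks ties left by the one before it, which mirrors the stated order of preference.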

Field	Limits	Description
Transcript sequence identifier	1..1	A registered identifier for the reference transcript sequence, e.g. `NM_005157.6` as a MANE Select transcript identifier for the ABL1 gene.
5’ segment boundary	0..1	A Segment Boundary representing the 5-prime end of the transcript segment.
3’ segment boundary	0..1	A Segment Boundary representing the 3-prime end of the transcript segment.

Segment Boundary

A segment boundary describes the exon-anchored coordinate (and corresponding genomic coordinate) defining a boundary of a transcript segment.

Field	Limits	Description
Exon number	1..1	The exon number counted from the 5-prime end of the transcript.
Exon offset	1..1	A value representing the offset from the segment boundary, with positive values offset towards the 5-prime end of the transcript and negative values offset towards the 3-prime end of the transcript. Offsets can reference sequence in the intronic space.
Genomic location	1..1	A Genomic Location aligned to the transcript segment boundary.
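
A transcript segment and its boundaries might be modeled as nested records, as sketched below. Several points are assumptions rather than part of the model: the class and attribute names, exon numbering starting at 1, and the check that at least one of the two optional boundaries is present. The genomic coordinates and the `NC_EXAMPLE.1` identifier are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomicLocation:
    reference_sequence_id: str
    start: int
    end: int


@dataclass(frozen=True)
class SegmentBoundary:
    exon_number: int   # counted from the 5-prime end of the transcript
    exon_offset: int   # may point into intronic space
    genomic_location: GenomicLocation

    def __post_init__(self) -> None:
        # Assumption: exons are numbered starting from 1.
        if self.exon_number < 1:
            raise ValueError("exon number must be a positive integer")


@dataclass(frozen=True)
class TranscriptSegment:
    transcript_id: str
    five_prime_boundary: Optional[SegmentBoundary] = None   # cardinality 0..1
    three_prime_boundary: Optional[SegmentBoundary] = None  # cardinality 0..1

    def __post_init__(self) -> None:
        # Assumption: a segment with neither boundary carries no information.
        if self.five_prime_boundary is None and self.three_prime_boundary is None:
            raise ValueError("at least one segment boundary is expected")


# Only the fusion junction is reported, so a single boundary is given.
# Exon number, offset, and coordinates are placeholders.
junction = SegmentBoundary(2, 0, GenomicLocation("NC_EXAMPLE.1", 1000, 1000))
abl1_segment = TranscriptSegment("NM_005157.6", five_prime_boundary=junction)
```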

Linker Sequence

A linker sequence is an observed sequence in the gene fusion that typically occurs between transcript segments, and where the sequence origin is unknown or ambiguous. In cases where the linker sequence is a known intronic or intergenic region, it should be represented as a Templated Linker Sequence instead.

Field	Limits	Description
Sequence	1..1	A literal sequence expressed as cDNA.

Templated Linker Sequence

A templated linker sequence is an observed sequence in the gene fusion that typically occurs between transcript segments, and where the sequence origin is a known intronic or intergenic region.

Field	Limits	Description
Genomic location	1..1	A Genomic Location from which the linker sequence is derived.
Genomic strand	1..1	MUST be one of `+` or `-`. Used to indicate the coding strand at the genomic location from which the linker sequence is derived.
Sequence	0..1	An optional literal sequence derived from the genomic location.
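
The choice between the two linker representations can be sketched as a small factory: when the genomic origin is known the linker is templated, otherwise it is a literal sequence. The function name `make_linker`, the tuple form of the genomic location, and the example coordinates are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Hypothetical shorthand: (reference sequence id, start, end)
GenomicLocation = Tuple[str, int, int]


@dataclass(frozen=True)
class LinkerSequence:
    sequence: str  # literal sequence expressed as cDNA


@dataclass(frozen=True)
class TemplatedLinkerSequence:
    genomic_location: GenomicLocation
    strand: str                      # MUST be "+" or "-"
    sequence: Optional[str] = None   # optional literal sequence

    def __post_init__(self) -> None:
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")


def make_linker(
    sequence: Optional[str] = None,
    origin: Optional[GenomicLocation] = None,
    strand: Optional[str] = None,
) -> Union[LinkerSequence, TemplatedLinkerSequence]:
    """Use the templated form whenever the genomic origin is known."""
    if origin is None:
        if not sequence:
            raise ValueError("an untemplated linker needs a literal sequence")
        return LinkerSequence(sequence)
    return TemplatedLinkerSequence(origin, strand, sequence)


unknown_origin = make_linker("ACGT")
# Placeholder coordinates on chr1, for illustration only.
known_origin = make_linker(origin=("NC_000001.11", 100, 110), strand="-")
```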

Regulatory Elements

Regulatory elements include a Regulatory Feature used to describe an enhancer, promoter, or other regulatory elements that constitute Regulatory Fusions. Regulatory features may also be defined by a gene with which the feature is associated (e.g. an IGH-associated enhancer element).

Regulatory Feature

Our definitions of regulatory features follow the definitions provided by the INSDC regulatory class vocabulary. In gene fusions, these are typically either enhancer or promoter features. These features may be represented as stand-alone entities with their own conceptual identifier (e.g. ENCODE cis-Regulatory Elements) or by a Genomic Location. Regulatory features may also be represented by their association with a nearby gene (e.g. regulatory fusion between MYC and IGH-associated enhancer elements).

It is expected that a regulatory feature will be described by at least (and often exactly) one of a Feature ID, Genomic location, or associated gene.

Field	Limits	Description
Regulatory class	1..1	MUST be `enhancer`, `promoter`, or another term from the INSDC regulatory class vocabulary.
Feature ID	0..1	An optional identifier for the regulatory feature, e.g. registered cis-regulatory elements from ENCODE.
Feature location	0..1	An optional Genomic Location for the regulatory feature.
Associated gene	0..1	A Gene associated with the regulatory feature.
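
The expectation that a regulatory feature carries at least one of the three optional descriptors can be sketched as a check at construction time. The class and attribute names are hypothetical. The set of accepted classes here is a deliberately partial stand-in for the INSDC regulatory class vocabulary, and the associated gene is reduced to a symbol string for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Partial list only; the INSDC regulatory class vocabulary defines more terms.
KNOWN_CLASSES = {"enhancer", "promoter", "silencer"}


@dataclass(frozen=True)
class RegulatoryFeature:
    regulatory_class: str
    feature_id: Optional[str] = None
    feature_location: Optional[Tuple[str, int, int]] = None  # (ref id, start, end)
    associated_gene: Optional[str] = None  # simplified to a gene symbol here

    def __post_init__(self) -> None:
        if self.regulatory_class not in KNOWN_CLASSES:
            raise ValueError(f"unrecognized regulatory class: {self.regulatory_class!r}")
        descriptors = (self.feature_id, self.feature_location, self.associated_gene)
        if not any(d is not None for d in descriptors):
            raise ValueError("expected a feature ID, location, or associated gene")


igh_enhancer = RegulatoryFeature("enhancer", associated_gene="IGH")
```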

Categorical Elements

Categorical data elements are specifically used for the representation of Categorical Gene Fusions. These data elements define the key criteria for matching Assayed Gene Fusions.

Functional Domains

Categorical Gene Fusions are often characterized by the presence or absence of critical functional domains within a gene fusion.

Field	Limits	Description
Label	0..1	An optional name for the functional domain, e.g. `Protein kinase domain`.
ID	0..1	An optional namespaced identifier for the domain, e.g. `interpro:IPR000719`.
Sequence location	0..1	An optional Sequence Location for the domain.
Status	1..1	MUST be one of [`preserved`, `lost`].
Associated gene	1..1	The Gene associated with the domain.
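
A functional domain record might restrict its status with an enumeration, as in the sketch below. The names `DomainStatus` and `FunctionalDomain` are hypothetical, the associated gene is reduced to a symbol string, and the optional sequence location is omitted for brevity. Pairing the example domain with ABL1 is illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DomainStatus(Enum):
    PRESERVED = "preserved"
    LOST = "lost"


@dataclass(frozen=True)
class FunctionalDomain:
    status: DomainStatus             # cardinality 1..1
    associated_gene: str             # cardinality 1..1, simplified to a symbol
    label: Optional[str] = None      # cardinality 0..1
    domain_id: Optional[str] = None  # cardinality 0..1


kinase_domain = FunctionalDomain(
    status=DomainStatus("preserved"),  # any other string raises ValueError
    associated_gene="ABL1",
    label="Protein kinase domain",
    domain_id="interpro:IPR000719",
)
```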

Reading Frame

A common attribute of a categorical gene fusion is whether the reading frame is preserved in the expressed gene product. This is typical of protein-coding gene fusions.

Field	Limits	Description
Reading frame preserved	0..1	Boolean indicating whether the reading frame must be preserved or not.

Assayed Elements

Assayed data elements are specifically used for the representation of Assayed Gene Fusions. These data elements provide important context for downstream evaluation of Chimeric Transcript Fusions and Regulatory Fusions detected by biomedical assays.

Causative Event

The evaluation of a fusion may be influenced by the underlying mechanism that generated the fusion. Often this will be a DNA rearrangement, but it could also be a read-through or trans-splicing event.

Field	Limits	Description
Type	1..1	The type of event that generated the fusion. May be `rearrangement`, `read-through`, or `trans-splicing`.
Description	0..1	For rearrangements, this field is useful for characterizing the rearrangement. This could be a string describing the rearrangement with an appropriate nomenclature (e.g. ISCN or HGVS), or an equivalent data structure.
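
As a sketch, the event type can be constrained to the three values named above while the description stays free-form. The `CausativeEvent` class and the use of `typing.Literal` are choices made for this example, not part of the model, and the ISCN-style description string is illustrative.

```python
from dataclasses import dataclass
from typing import Literal, Optional, get_args

EventType = Literal["rearrangement", "read-through", "trans-splicing"]


@dataclass(frozen=True)
class CausativeEvent:
    event_type: str                    # cardinality 1..1
    description: Optional[str] = None  # cardinality 0..1, free-form

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check the value explicitly.
        if self.event_type not in get_args(EventType):
            raise ValueError(f"unsupported event type: {self.event_type!r}")


event = CausativeEvent("rearrangement", description="t(9;22)(q34;q11.2)")
```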

Assay

Metadata about the assay that detected the fusion, and whether that fusion was directly detected by the assay or inferred, is useful to preserve for downstream evaluation.

Field	Limits	Description
Name	1..1	A human-readable name for the assay. Should match the label for the assay ID, e.g. `fluorescence in-situ hybridization assay` for `obi:OBI_0003094`.
ID	1..1	An ID for the assay concept, e.g. `obi:OBI_0003094` from the Ontology for Biomedical Investigations.
Fusion detection	1..1	MUST be one of [`direct`, `inferred`]. Direct detection methods (e.g. RNA-seq, RT-PCR) directly interrogate chimeric transcript junctions. Inferred detection methods (e.g. WGS, FISH) infer the existence of a fusion in the presence of compatible biomarkers (e.g. ALK rearrangements in non-small cell lung cancers).
Method URI	1..1	A URI pointing to the methodological details of the assay.
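
An assay record with its four required fields could be checked as sketched here. The `Assay` class, its attribute names, and the `https://example.org/...` method URI are hypothetical. Treating any string with a URI scheme as an acceptable method URI is a simplifying assumption.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Assay:
    name: str
    assay_id: str
    fusion_detection: str  # MUST be "direct" or "inferred"
    method_uri: str

    def __post_init__(self) -> None:
        if self.fusion_detection not in ("direct", "inferred"):
            raise ValueError("fusion detection must be 'direct' or 'inferred'")
        # Assumption: a usable URI has at least a scheme component.
        if not urlparse(self.method_uri).scheme:
            raise ValueError("method URI must include a scheme")


fish = Assay(
    name="fluorescence in-situ hybridization assay",
    assay_id="obi:OBI_0003094",
    fusion_detection="inferred",  # FISH infers the fusion from a rearrangement
    method_uri="https://example.org/methods/fish",  # hypothetical URI
)
```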