Clustering

The Clustering module groups samples and features by similarity in feature space. Results can be coloured by any metadata variable to reveal structure in the data.


Dimensionality reduction

High-dimensional omics data is first reduced to 2–3 dimensions for visualisation.

PCA — Principal Component Analysis

Linear method. Projects samples onto orthogonal axes of maximum variance. Fast and interpretable: the loadings show which features drive each component.

| Parameter | Default | Description |
|---|---|---|
| Components | 2 | Number of principal components to compute and plot |

The score plot shows samples in PC space. The loading plot shows which features contribute most to each component. The scree plot shows variance explained per component.
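As a rough illustration of how the three plots relate, the sketch below computes scores, loadings and explained variance with a plain SVD in NumPy. It is not the module's implementation; the function name `pca` and the toy data are assumptions made for the example.

```python
import numpy as np


def pca(X, n_components=2):
    """Illustrative PCA via SVD. X has shape (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)                            # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # sample coordinates (score plot)
    loadings = Vt[:n_components].T                     # feature weights (loading plot)
    explained = S**2 / np.sum(S**2)                    # variance fraction per component (scree plot)
    return scores, loadings, explained


# Toy data: 20 samples x 100 features; five features are shifted in half the samples
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
X[:10, :5] += 6.0

scores, loadings, explained = pca(X, n_components=2)

# Features with the largest absolute loading on PC1 drive that component
top_features = np.argsort(-np.abs(loadings[:, 0]))[:5]
```

In this toy case the five shifted features should dominate the PC1 loadings, which is the kind of feature-level reading the loading plot supports.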

UMAP — Uniform Manifold Approximation and Projection

Non-linear method. Preserves local neighbourhood structure. Often better than PCA at revealing clusters, especially in complex datasets. Slower and stochastic: results vary slightly between runs.

| Parameter | Default | Description |
|---|---|---|
| n_neighbors | 15 | Controls local vs. global structure. Lower = more local detail |
| min_dist | 0.1 | Minimum distance between points in the embedding. Lower = tighter clusters |
| metric | euclidean | Distance metric in the original feature space |

UMAP results are not directly interpretable in terms of individual features — use PCA loadings for feature-level interpretation.


Sample clustering

Groups samples into discrete clusters based on their feature profiles.

Hierarchical clustering

Builds a tree (dendrogram) of samples by iteratively merging the most similar pairs.

| Parameter | Options | Description |
|---|---|---|
| Linkage | ward, complete, average, single | How distance between clusters is computed. Ward minimises within-cluster variance and is usually preferred; it assumes Euclidean distance |
| Distance metric | euclidean, correlation, cosine | How similarity between samples is measured |

Results are shown as a dendrogram with an optional heatmap of feature values.
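The merge-and-cut procedure can be sketched with SciPy's hierarchical clustering routines. This is only an illustration under the assumption that the options map onto SciPy's `method` and `metric` arguments; the toy matrix and the choice of two clusters are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy matrix: 12 samples x 30 features, two groups offset by 5 in every feature
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(6, 30)),
    rng.normal(5.0, 1.0, size=(6, 30)),
])

# Build the tree; Ward linkage is paired with Euclidean distance
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")

# Leaf order along the dendrogram axis, computed without drawing anything
leaf_order = dendrogram(Z, no_plot=True)["leaves"]
```

Each row of `Z` records one merge, so a matrix of n samples yields n - 1 rows. The same leaf order is what would typically be used to arrange heatmap columns.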

k-means

Partitions samples into k clusters by minimising within-cluster sum of squares. Requires specifying k in advance.

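| Parameter | Default | Description |
|---|---|---|
| k | 3 | Number of clusters |
| Init | k-means++ | Initialisation method. k-means++ reduces sensitivity to starting conditions |
| Iterations | 10 | Number of random restarts (independent initialisations); the result with the lowest within-cluster sum of squares is kept |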

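Use the elbow plot (inertia vs. k) to help choose k. Inertia is the within-cluster sum of squares; look for the value of k beyond which it stops dropping sharply.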
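The values behind an elbow plot can be sketched with scikit-learn's `KMeans`. Whether the module uses scikit-learn internally is an assumption, as is the mapping of the Iterations setting onto `n_init`; the three-blob toy data is made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated 2-D blobs of 30 samples each
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(c, 1.0, size=(30, 2)) for c in centres])

# Fit k-means for a range of k and record the inertia of the best restart
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

With data like this, inertia falls steeply up to k = 3 and only marginally afterwards, which is the bend the elbow plot is meant to expose. Real omics data rarely shows such a clean elbow, so treat it as a guide rather than a rule.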


Feature clustering

Groups features by co-expression or co-abundance patterns across samples. Useful for identifying metabolite modules or gene programmes.

Run from the Clustering module by selecting Features instead of Samples. The same hierarchical and k-means methods are available.

Clustered features can be exported as a list for downstream enrichment analysis.
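Conceptually, clustering features instead of samples amounts to clustering the transposed matrix. The sketch below shows this with SciPy, using correlation distance and average linkage, then collects the feature names per cluster as an export might. The feature names, the two latent "programmes" and the export format are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_samples = 40

# Two hypothetical latent programmes, each driving four noisy features
p1 = rng.normal(size=n_samples)
p2 = rng.normal(size=n_samples)
X = np.column_stack(
    [p1 + 0.2 * rng.normal(size=n_samples) for _ in range(4)]
    + [p2 + 0.2 * rng.normal(size=n_samples) for _ in range(4)]
)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]  # placeholder names

# Transpose so that features, not samples, are the rows being clustered
Z = linkage(X.T, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")

# Group feature names by cluster, e.g. as input for enrichment analysis
modules = {}
for name, label in zip(feature_names, labels):
    modules.setdefault(int(label), []).append(name)
```

Correlation distance groups features that rise and fall together regardless of their absolute abundance, which is usually what a co-expression module means. Note that it treats anti-correlated features as distant.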


Heatmap

The heatmap shows samples (columns) and features (rows) with colour representing scaled abundance. Both axes can be clustered independently.

| Option | Description |
|---|---|
| Row clustering | Cluster features by similarity across samples |
| Column clustering | Cluster samples by similarity across features |
| Colour scale | Diverging (for centred data) or sequential |
| Annotation bar | Colour samples by a metadata variable |

By default, only the top 50 most variable features are shown. Increase this in the settings or filter to a feature set of interest.
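A minimal sketch of how such a heatmap matrix could be prepared: rank features by variance, keep the top 50, and z-score each one across samples. That the module scales by z-score is an assumption, and the function name `top_variable_zscores` is hypothetical.

```python
import numpy as np


def top_variable_zscores(X, n_top=50):
    """X has shape (n_samples, n_features).

    Returns the indices of the n_top most variable features and a
    (features x samples) matrix of per-feature z-scores.
    """
    variances = X.var(axis=0, ddof=1)
    top = np.argsort(variances)[::-1][:n_top]      # most variable first
    sub = X[:, top]
    # Assumes no selected feature has zero variance
    z = (sub - sub.mean(axis=0)) / sub.std(axis=0, ddof=1)
    return top, z.T                                # rows = features, columns = samples


# Toy data: 30 samples x 200 features with increasing spread
rng = np.random.default_rng(0)
scales = np.linspace(0.5, 5.0, 200)
X = rng.normal(size=(30, 200)) * scales

top, H = top_variable_zscores(X, n_top=50)
```

Z-scored rows are centred on zero, which is why a diverging colour scale suits this kind of input.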


Colouring by metadata

All dimensionality reduction plots and heatmap annotation bars can be coloured by any column in your sample metadata. Use the Colour by dropdown in the plot toolbar. Categorical variables produce discrete colour palettes; continuous variables produce a gradient.

Patterns in the embedding that align with metadata variables (e.g. disease group, batch, age) suggest that those variables are associated with major sources of variation in the data.