Clustering

The Clustering module groups samples and features by similarity in feature space. Results can be coloured by any metadata variable to reveal structure in the data.

Dimensionality reduction

High-dimensional omics data is first reduced to 2–3 dimensions for visualisation.

PCA — Principal Component Analysis

Linear method. Projects samples onto axes of maximum variance. Fast and interpretable — the loadings show which features drive each component.

Parameter	Default	Description
Components	2	Number of principal components to compute and plot

The score plot shows samples in PC space. The loading plot shows which features contribute most to each component. The scree plot shows variance explained per component.
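As a rough illustration of how the three plots relate to one another, the sketch below computes scores, loadings and explained variance with NumPy's SVD. It is not the module's implementation: the function name `pca`, the random test matrix, and the choice to centre the data without scaling it are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA of a samples x features matrix via SVD of the centred data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                            # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # sample coordinates (score plot)
    loadings = Vt[:n_components].T                     # feature weights (loading plot)
    explained = S**2 / np.sum(S**2)                    # variance fraction per component (scree plot)
    return scores, loadings, explained

# Hypothetical data: 20 samples x 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
scores, loadings, explained = pca(X, n_components=2)
```

Features with the largest absolute values in a column of `loadings` are the ones that contribute most to that component.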

UMAP — Uniform Manifold Approximation and Projection

Non-linear method. Preserves local neighbourhood structure. Better at revealing clusters than PCA, especially in complex datasets. Slower and stochastic — results vary slightly between runs.

Parameter	Default	Description
n_neighbors	15	Controls local vs. global structure. Lower = more local detail
min_dist	0.1	Minimum distance between points in the embedding. Lower = tighter clusters
metric	euclidean	Distance metric in the original feature space

UMAP results are not directly interpretable in terms of individual features — use PCA loadings for feature-level interpretation.

Sample clustering

Groups samples into discrete clusters based on their feature profiles.

Hierarchical clustering

Builds a tree (dendrogram) of samples by iteratively merging the most similar pairs.

Parameter	Options	Description
Linkage	ward, complete, average, single	How distance between clusters is computed. Ward minimises within-cluster variance and is usually preferred; it is defined for Euclidean distance, so pair it with the euclidean metric
Distance metric	euclidean, correlation, cosine	How similarity between samples is measured

Results are shown as a dendrogram with an optional heatmap of feature values.
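The merging procedure can be sketched outside the application with SciPy. This is only an illustration under assumed settings (Ward linkage, Euclidean distance, a cut into two clusters) on made-up data; it does not show how the module itself is implemented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, leaves_list

# Hypothetical data: two groups of 5 samples each, 10 features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 10)), rng.normal(8, 1, (5, 10))])

# Each row of Z records one merge: the two clusters joined, their distance,
# and the size of the new cluster
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree into at most 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")

# Left-to-right order of samples along the dendrogram (useful for ordering a heatmap)
order = leaves_list(Z)
```

Cutting the same tree at a different `t` gives a coarser or finer grouping without re-running the clustering.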

k-means

Partitions samples into k clusters by minimising within-cluster sum of squares. Requires specifying k in advance.

Parameter	Default	Description
k	3	Number of clusters
Init	k-means++	Initialisation method. k-means++ reduces sensitivity to starting conditions
Iterations	10	Number of random restarts (independent initialisations); the run with the lowest within-cluster sum of squares is kept

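Use the elbow plot (inertia, i.e. the within-cluster sum of squares, vs. k) to help choose k: look for the value of k beyond which inertia stops dropping sharply.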
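To make the roles of k, the k-means++ initialisation and the restarts concrete, here is a minimal NumPy sketch. The function `kmeans`, its argument names and the three synthetic blobs are assumptions for illustration, not the module's code.

```python
import numpy as np

def kmeans(X, k, n_init=10, max_iter=100, seed=0):
    """k-means with k-means++ seeding; returns (labels, centres, inertia) of the best restart."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # k-means++ seeding: first centre uniformly at random, later ones with
        # probability proportional to squared distance from the nearest chosen centre
        centres = X[[rng.integers(len(X))]]
        while len(centres) < k:
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            centres = np.vstack([centres, X[rng.choice(len(X), p=d2 / d2.sum())]])
        # Alternate assignment and centre updates until the centres stop moving
        for _ in range(max_iter):
            d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                                for j in range(k)])
            if np.allclose(updated, centres):
                break
            centres = updated
        # Inertia = within-cluster sum of squares for the final centres
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels, inertia = d2.argmin(axis=1), d2.min(axis=1).sum()
        if best is None or inertia < best[2]:
            best = (labels, centres, inertia)
    return best

# Hypothetical data: three well-separated blobs of 30 samples each
rng = np.random.default_rng(42)
blob_centres = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in blob_centres])

# Values for an elbow plot: inertia for k = 1..6
inertias = [kmeans(X, k)[2] for k in range(1, 7)]
labels, centres, inertia = kmeans(X, k=3)
```

For data like this, the inertia falls steeply up to k = 3 and only slowly afterwards, which is the elbow to look for.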

Feature clustering

Groups features by co-expression or co-abundance patterns across samples. Useful for identifying metabolite modules or gene programmes.

Run from the Clustering module by selecting Features instead of Samples. The same hierarchical and k-means methods are available.

Clustered features can be exported as a list for downstream enrichment analysis.
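Conceptually, feature clustering is the same procedure applied to the transposed matrix. The sketch below uses SciPy with assumed settings (average linkage, correlation distance, two clusters) and invented feature names to show how a per-cluster feature list might be assembled; the module's actual export format is not shown here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 30 samples, two latent patterns each driving 4 features
rng = np.random.default_rng(1)
n_samples = 30
a, b = rng.normal(size=n_samples), rng.normal(size=n_samples)
X = np.column_stack(
    [a + 0.1 * rng.normal(size=n_samples) for _ in range(4)]
    + [b + 0.1 * rng.normal(size=n_samples) for _ in range(4)]
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Transpose so that each row is a feature; correlation distance groups
# features with similar profiles across samples regardless of magnitude
Z = linkage(X.T, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")

# Collect feature names per cluster, e.g. as input for enrichment analysis
modules = {}
for name, lab in zip(names, labels):
    modules.setdefault(int(lab), []).append(name)
```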

Heatmap

The heatmap shows samples (columns) and features (rows) with colour representing scaled abundance. Both axes can be clustered independently.

Option	Description
Row clustering	Cluster features by similarity across samples
Column clustering	Cluster samples by similarity across features
Colour scale	Diverging (for centred data) or sequential
Annotation bar	Colour samples by a metadata variable

By default, only the top 50 most variable features are shown. Increase this in the settings or filter to a feature set of interest.
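The feature selection and scaling behind such a heatmap can be approximated as follows. This sketch assumes that "most variable" means highest sample variance and that "scaled" means per-feature z-scores; both are assumptions, as are the function name `select_and_scale` and the random data.

```python
import numpy as np

def select_and_scale(X, n_top=50):
    """Pick the n_top highest-variance features and z-score each one.

    X is samples x features; the result has features as rows and samples
    as columns, matching the heatmap layout.
    """
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0, ddof=1)
    top = np.argsort(variances)[::-1][:n_top]        # indices, most variable first
    sub = X[:, top]
    # z-score per feature (assumes the selected features have non-zero variance)
    z = (sub - sub.mean(axis=0)) / sub.std(axis=0, ddof=1)
    return top, z.T

# Hypothetical data: 12 samples x 200 features with differing spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 200)) * rng.uniform(0.5, 5.0, size=200)
top, heat = select_and_scale(X, n_top=50)
```

Because each row is centred on zero, a diverging colour scale is the natural choice for values prepared this way.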

Colouring by metadata

All dimensionality reduction plots and heatmap annotation bars can be coloured by any column in your sample metadata. Use the Colour by dropdown in the plot toolbar. Categorical variables produce discrete colour palettes; continuous variables produce a gradient.

Patterns in the embedding that align with a metadata variable (e.g. disease group, batch, age) suggest that the variable is associated with a major source of variation in the data.