Preprocessing¶

Preprocessing prepares your raw data for analysis by handling missing values, normalising across samples, and optionally log-transforming and scaling. GrAndMA runs these steps in a fixed order via a configurable pipeline.

Pipeline order¶

Missingness filtering → Imputation → Normalisation → Log transform → Scaling → Batch correction

Each step is optional. The default for a new dataset is: no filtering, no imputation, log2 transform, no scaling, no batch correction.

Missingness filtering¶

Removes samples or features that have too many missing values before imputation.

Parameter	Default	Description
Sample threshold	0.5	Remove samples missing more than this fraction of features
Feature threshold	0.5	Remove features missing from more than this fraction of samples

Set both to 1.0 to skip filtering entirely.

Imputation¶

Fills remaining missing values after filtering.

Method	When to use
None	Data has no missing values, or you want to handle them downstream
Minimum	Replace each missing value with the half-minimum of that feature. Appropriate for metabolomics data where missing = below detection limit
KNN	Replace using k nearest neighbour imputation (default k=5). Works well for metabolomics with <20% missingness and structured covariation between features
Mean	Replace with the column mean. Fast but ignores feature covariance

For microbiome abundance data, use none — zeros are structural (truly absent) not missing.

Normalisation¶

Adjusts for differences in total signal between samples (e.g. dilution, extraction efficiency). Applied before log transformation.

Method	When to use
None	Data is already normalised, or normalisation is inappropriate (e.g. scRNA-seq CPM already normalised)
Probabilistic Quotient (PQN)	Recommended for metabolomics. Divides each sample by its median ratio to a reference sample, robustly handling dilution differences
Total Ion Count (TIC)	Divides each sample by its total signal. Simpler than PQN; appropriate when total abundance is expected to be equal

Log transformation¶

Reduces the influence of extreme values and makes the data more normally distributed.

Option	When to use
None	Data is already log-transformed (check "already log-transformed" during configuration)
log2	Standard for metabolomics and transcriptomics. Fold changes become additive
log10	Less common; useful when reporting in log10 units is preferred
ln	Natural log

If you ticked "already log-transformed" during dataset configuration, this step is skipped automatically.

Scaling¶

Centers and/or scales each feature so that all features contribute equally regardless of their absolute magnitude.

Method	When to use
None	Features are already on comparable scales, or you want to preserve magnitude differences
Autoscaling (z-score)	Subtract mean and divide by standard deviation. All features have mean 0, variance 1. Use when you want equal weight for all features regardless of variance
Pareto	Divide by the square root of the standard deviation. Less aggressive than autoscaling; preserves some magnitude information

Batch correction¶

Removes systematic technical variation between batches of samples measured at different times or by different operators. Requires a batch column in your sample metadata.

Method	When to use
None	No batch variable, or batches are balanced across groups
ComBat	Empirical Bayes method. Robust for small batch sizes. Can preserve biological covariates (specify in the covariate field)

Batch correction is applied last, after all other preprocessing steps.

Re-running preprocessing¶

You can re-run preprocessing with different settings at any time. Go to the preprocessing page for your dataset and adjust the parameters. Previous runs are preserved — you can switch between them using the run selector. Only one run is active at a time and all downstream analyses use the active run.

Viewing results¶

After preprocessing, the View Data page shows:

Sample count and feature count after filtering
Missing value heatmap before and after imputation
Sample distribution (box plot) before and after normalisation
PCA score plot coloured by any metadata column
Feature variance distribution