Preprocessing

Preprocessing prepares your raw data for analysis: it handles missing values, normalises across samples, and can log-transform, scale, and batch-correct the result. GrAndMA runs these steps in a fixed order through a configurable pipeline.

Pipeline order

Missingness filtering → Imputation → Normalisation → Log transform → Scaling → Batch correction

Each step is optional. The default for a new dataset is: no filtering, no imputation, log2 transform, no scaling, no batch correction.
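
The fixed order and the defaults can be pictured as a small configuration lookup. This is only an illustrative sketch: the step keys, the `DEFAULTS` dictionary, and the `planned_steps` helper are hypothetical names chosen here, not GrAndMA's actual configuration format.

```python
# Hypothetical step keys, listed in the fixed pipeline order described above.
PIPELINE_ORDER = [
    "missingness_filtering",
    "imputation",
    "normalisation",
    "log_transform",
    "scaling",
    "batch_correction",
]

# Defaults for a new dataset: only the log2 transform is enabled.
# Normalisation is not listed among the defaults, so None is assumed here.
DEFAULTS = {
    "missingness_filtering": None,
    "imputation": None,
    "normalisation": None,
    "log_transform": "log2",
    "scaling": None,
    "batch_correction": None,
}


def planned_steps(settings=None):
    """Return the enabled (step, method) pairs in pipeline order."""
    config = {**DEFAULTS, **(settings or {})}
    return [(step, config[step]) for step in PIPELINE_ORDER if config[step] is not None]


# The order of the keys in the settings does not matter; the pipeline order wins.
print(planned_steps({"scaling": "pareto", "imputation": "knn"}))
```

Whatever order the settings are supplied in, the enabled steps come back in pipeline order.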


Missingness filtering

Runs before imputation and removes samples or features that have too many missing values.

| Parameter | Default | Description |
|---|---|---|
| Sample threshold | 0.5 | Remove samples missing more than this fraction of features |
| Feature threshold | 0.5 | Remove features missing from more than this fraction of samples |

Set both to 1.0 to skip filtering entirely.
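
The two thresholds can be sketched in plain Python as follows. The sketch assumes rows are samples, columns are features, and missing values are stored as `None`. Filtering samples before features is also an assumption, since the order of the two checks is not stated here. The function names are hypothetical.

```python
def missing_fraction(values):
    """Fraction of entries in a sequence that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def filter_missingness(data, sample_threshold=0.5, feature_threshold=0.5):
    """Drop samples (rows), then features (columns), whose missing fraction
    is strictly greater than the corresponding threshold.

    Returns (filtered_data, kept_row_indices, kept_column_indices).
    """
    kept_rows = [i for i, row in enumerate(data)
                 if missing_fraction(row) <= sample_threshold]
    rows = [data[i] for i in kept_rows]

    n_features = len(data[0]) if data else 0
    kept_cols = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        if column and missing_fraction(column) <= feature_threshold:
            kept_cols.append(j)

    filtered = [[row[j] for j in kept_cols] for row in rows]
    return filtered, kept_rows, kept_cols


example = [
    [1.0, None, 3.0],
    [None, None, None],   # sample with every feature missing
    [2.0, None, 4.0],
    [1.5, 2.5, None],
]
filtered, rows_kept, cols_kept = filter_missingness(example)
print(rows_kept, cols_kept)
```

Because the comparison is "more than", a threshold of 1.0 can never be exceeded, which is why setting both thresholds to 1.0 removes nothing.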


Imputation

Fills remaining missing values after filtering.

| Method | When to use |
|---|---|
| None | Data has no missing values, or you want to handle them downstream |
| Minimum | Replaces each missing value with half of that feature's minimum observed value. Appropriate for metabolomics data where a missing value means the signal was below the detection limit |
| KNN | Replaces missing values using k-nearest-neighbour imputation (default k=5). Works well for metabolomics data with less than 20% missingness and structured covariation between features |
| Mean | Replaces missing values with the column mean. Fast, but ignores covariance between features |

For microbiome abundance data, use None: zeros there are structural (the taxon is truly absent), not missing.
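
As an illustration of the Minimum method, here is a minimal half-minimum sketch in plain Python. It assumes rows are samples, columns are features, missing values are `None`, and observed intensities are positive. The function name is hypothetical and this is not GrAndMA's own code.

```python
def impute_half_minimum(data):
    """Replace None in each feature (column) with half of that feature's
    smallest observed value. A feature with no observed values is left as-is."""
    n_features = len(data[0])
    fill_values = []
    for j in range(n_features):
        observed = [row[j] for row in data if row[j] is not None]
        fill_values.append(min(observed) / 2 if observed else None)
    return [
        [fill_values[j] if value is None else value for j, value in enumerate(row)]
        for row in data
    ]


example = [
    [4.0, None],
    [None, 10.0],
    [2.0, 6.0],
]
# Feature 0 has minimum 2.0, so its gap is filled with 1.0;
# feature 1 has minimum 6.0, so its gap is filled with 3.0.
print(impute_half_minimum(example))
```

The fill value is computed per feature, so a low-abundance feature and a high-abundance feature get different substitutes.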


Normalisation

Adjusts for differences in total signal between samples (e.g. dilution, extraction efficiency). Applied before log transformation.

| Method | When to use |
|---|---|
| None | Data is already normalised, or normalisation is inappropriate (e.g. scRNA-seq data already expressed as CPM) |
| Probabilistic Quotient (PQN) | Recommended for metabolomics. Divides each sample by the median of its feature-wise ratios to a reference sample, which handles dilution differences robustly |
| Total Ion Count (TIC) | Divides each sample by its total signal. Simpler than PQN; appropriate when total abundance is expected to be equal across samples |
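
The difference between the two methods is easier to see in code. The sketch below assumes complete, positive data with rows as samples. For PQN it uses the feature-wise median across all samples as the reference, which is a common choice but an assumption here, since how GrAndMA picks its reference sample is not described. Function names are hypothetical.

```python
from statistics import median


def tic_normalise(data):
    """Divide each sample (row) by its total signal."""
    return [[value / sum(row) for value in row] for row in data]


def pqn_normalise(data, reference=None):
    """Divide each sample by the median of its feature-wise quotients
    relative to a reference profile."""
    n_features = len(data[0])
    if reference is None:
        # Assumed reference: the median of each feature across all samples.
        reference = [median(row[j] for row in data) for j in range(n_features)]
    normalised = []
    for row in data:
        quotients = [value / ref for value, ref in zip(row, reference) if ref > 0]
        factor = median(quotients)  # estimated dilution factor for this sample
        normalised.append([value / factor for value in row])
    return normalised


# Three samples with the same composition at different dilutions.
example = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [4.0, 8.0, 16.0],
]
print(pqn_normalise(example))
```

In this example the samples differ only by dilution, so PQN maps all three onto the same profile. Because the dilution factor is a median over features, a handful of features with real biological changes has little effect on it, whereas those features would shift a TIC total directly.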

Log transformation

Reduces the influence of extreme values and brings right-skewed data closer to a normal distribution.

| Option | When to use |
|---|---|
| None | Data is already log-transformed (check "already log-transformed" during configuration) |
| log2 | Standard for metabolomics and transcriptomics. Fold changes become additive |
| log10 | Less common; useful when reporting in log10 units is preferred |
| ln | Natural log |

If you ticked "already log-transformed" during dataset configuration, this step is skipped automatically.
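
A minimal sketch of the options above, using only the Python standard library. It assumes every value is strictly positive (a zero would raise `ValueError`); how GrAndMA treats zeros at this step is not stated here. The function and parameter names are hypothetical.

```python
import math

LOG_FUNCTIONS = {"log2": math.log2, "log10": math.log10, "ln": math.log}


def log_transform(data, option="log2", already_log_transformed=False):
    """Apply the chosen log to every value, or return a copy of the data
    unchanged when the step is disabled or the data is already on a log scale."""
    if option is None or already_log_transformed:
        return [list(row) for row in data]
    log = LOG_FUNCTIONS[option]
    return [[log(value) for value in row] for row in data]


# A four-fold difference (2.0 vs 8.0) becomes a difference of 2 on the log2 scale.
transformed = log_transform([[2.0, 8.0]])
print(transformed[0][1] - transformed[0][0])
```

This is what "fold changes become additive" means in practice: a ratio between raw values turns into a difference between log values.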


Scaling

Centres and/or scales each feature so that features contribute more evenly to the analysis, regardless of their absolute magnitude.

| Method | When to use |
|---|---|
| None | Features are already on comparable scales, or you want to preserve magnitude differences |
| Autoscaling (z-score) | Subtract the mean and divide by the standard deviation. All features end up with mean 0 and variance 1. Use when you want equal weight for all features regardless of variance |
| Pareto | Subtract the mean and divide by the square root of the standard deviation. Less aggressive than autoscaling; preserves some magnitude information |
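
The two scaling methods differ only in the divisor, as this sketch shows. Both are written here in their usual mean-centred form. The sketch assumes complete data with rows as samples, uses the sample standard deviation (n - 1 denominator), and assumes no feature is constant (a zero standard deviation would cause a division by zero). The function name and method strings are hypothetical.

```python
import math
from statistics import mean, stdev


def scale_features(data, method="autoscale"):
    """Mean-centre each feature (column), then divide by its standard deviation
    ("autoscale") or by the square root of its standard deviation ("pareto")."""
    if method not in ("autoscale", "pareto"):
        raise ValueError(f"unknown scaling method: {method}")
    columns = list(zip(*data))  # one tuple per feature
    scaled_columns = []
    for column in columns:
        centre, spread = mean(column), stdev(column)
        divisor = spread if method == "autoscale" else math.sqrt(spread)
        scaled_columns.append([(value - centre) / divisor for value in column])
    return [list(row) for row in zip(*scaled_columns)]


example = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
]
print(scale_features(example, "autoscale"))
print(scale_features(example, "pareto"))
```

After autoscaling, both features span the same range. After Pareto scaling, the second feature, which has the larger spread, still spans a wider range than the first, which is the sense in which some magnitude information is preserved.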

Batch correction

Removes systematic technical variation between batches of samples measured at different times or by different operators. Requires a batch column in your sample metadata.

| Method | When to use |
|---|---|
| None | No batch variable, or batches are balanced across groups |
| ComBat | Empirical Bayes method. Robust for small batch sizes. Can preserve biological covariates (specify in the covariate field) |

Batch correction is applied last, after all other preprocessing steps.


Re-running preprocessing

You can re-run preprocessing with different settings at any time. Go to the preprocessing page for your dataset and adjust the parameters. Previous runs are preserved, and you can switch between them using the run selector. Only one run is active at a time, and all downstream analyses use the active run.


Viewing results

After preprocessing, the View Data page shows:

  • Sample count and feature count after filtering
  • Missing value heatmap before and after imputation
  • Sample distribution (box plot) before and after normalisation
  • PCA score plot coloured by any metadata column
  • Feature variance distribution