<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://insilijo.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://insilijo.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-31T16:33:56+00:00</updated><id>https://insilijo.github.io/feed.xml</id><title type="html">blank</title><subtitle>A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design. </subtitle><entry><title type="html">Forge: A Multiomics Analytical Platform</title><link href="https://insilijo.github.io/blog/2026/forge/" rel="alternate" type="text/html" title="Forge: A Multiomics Analytical Platform"/><published>2026-03-23T12:00:00+00:00</published><updated>2026-03-23T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/forge</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/forge/"><![CDATA[<p>Biology’s really hard. It’s hard to measure, it’s impossible to predict, and those long-held skeletal rules we believe about it – the central dogma, Mendelian genetics, what goes in must come out – are limited abstractions at best. 
It’s also incredibly necessary to understand the full impact of that heterogeneity: internal regulation, genetic differences, the accumulation of substrates, and many other outside-the-blueprint biological phenomena are what actually inform why diseases occur, why people respond differently to the same medication, and even how we might engineer bacteria to fit artificial scenarios that benefit the planet.</p> <figure class="inline-figure"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/documents_2x-480.webp 480w,/assets/img/blogs/2026-03-23/documents_2x-800.webp 800w,/assets/img/blogs/2026-03-23/documents_2x-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/documents_2x.png" width="100%" height="auto" alt="Documents - XKCD 1459, https://xkcd.com/1459/." title="Documents - XKCD 1459, https://xkcd.com/1459/." loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Copy of Untitled.doc.</figcaption> </figure> <p>I have been building computational platforms for multiomics since graduate school, integrating linear optimization, transport phenomena, and canonical knowledge to interrogate heterogeneous and emergent behavior in cellular populations. This work is convenient and slightly bibliographic: given what we know, what can we understand about what we cannot easily measure? It’s an interrogation of the indicative present vs. the subjunctive: what <em>could</em> we know based on what we <em>do</em> know? George E.P.
Box’s famous statistics quote gets thrown around here a lot: “All models are wrong, but some are useful.”</p> <p>But how did we even get to the point where we could model?</p> <p>At the core of all canonical models is a series of experiments that survive and are codified into understanding. With the advent of biobanks, big data, and AI, we’re running <em>a lot</em> of experiments. But those experiments don’t passively result in biological understanding. Data arrives as a bundled mess: formats that do not agree with each other, preprocessing decisions made invisibly and inconsistently, statistical results reported without the context needed to interpret them, and an analytical archaeology held together by a combination of R scripts and institutional memory. Having more data, more AI, more hardware hasn’t solved this; it’s amplified it: every operation is more immediate and larger in scale, and every choice leaves a larger, more legible imprint on the final product. In that paradigm, every organization is making its own <a href="/blog/2026/catalog-of-catalogs/">Library of Babel</a> at higher scale and much more rapidly; each siloed in its own way, each segmented and operationalized separately, and now equipped with a very willing robot partner to accelerate both the good and the bad.</p> <p>My work is in that funnel: a process that takes new, multimodal data and refines it into something that fits within canonical systems or extends them reliably. The struggle is harnessing the benefits of the signal while staying conscious of the amplified noise, especially in multiomic contexts where multiple measurements drawn from the same system can elucidate those complex biological features we need to understand.</p> <p><a href="http://www.insilijo.science">Forge is the current iteration of that work.</a> It is a production-deployed multiomics analytical platform, built and maintained under Insilijo Science, my independent consulting and advisory practice.
This post is an introduction to what it does and why it is designed the way it is.</p> <hr/> <h2 id="the-problem">The Problem</h2> <p>Omics studies generate data at a scale and complexity that strain conventional analytical workflows. For example, a typical untargeted metabolomics experiment produces tens of thousands of features – many uncharacterized – distributed across samples with varying missingness, batch effects, and normalization requirements. Even widely targeted metabolomics, the most common kind of metabolomics data, still yields over a thousand features. Transcriptomic studies can easily have tens of thousands of features, several thousand of them differential, and affinity-based proteomics experiments typically measure multiple thousands of analytes. Getting from these raw feature matrices to a biologically interpretable result requires a sequence of decisions, each of which shapes the downstream analysis: which imputation strategy fits the missing-data mechanism, whether normalization should be sample-level, feature-level, or both, how aggressively to correct for run-order drift, which statistical framework fits the study design, and how scaling should be applied to non-normal distributions.</p> <p>These decisions are not interchangeable; they interact. But in most workflows, they are made once, in an R or Python session, with no systematic record of the configuration, no interface for a collaborator to inspect or modify them, and no mechanism for reproducing the result on a different dataset without manual reconstruction. The upstream and downstream processes are decoupled and uncollaborative, meaning that downstream analyses inherit decisions that might not be appropriate for their interpretation.
Transparency is not only nice; it is necessary: you can’t build an effective scientific argument on decisions you don’t know about or don’t understand.</p> <p>Forge is built around the premise that the analytical pipeline itself is a scientific object, something that should be configurable, documented, persistent, comprehensible, and shareable in the same way the data is. Science is a collaborative ecosystem, and the decisions embedded in data analysis must be as portable and composable as the data itself. Lastly, it should have a low barrier to entry: a first-year undergraduate studying biology should be able to run a PCA for the first time and understand what it shows, while a seasoned clinician shouldn’t have to learn Python to see expected group separation in A/B testing.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_homepage-480.webp 480w,/assets/img/blogs/2026-03-23/forge_homepage-800.webp 800w,/assets/img/blogs/2026-03-23/forge_homepage-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_homepage.png" width="100%" height="auto" alt="Forge project homepage" title="Forge project homepage" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Forge homepage.
Projects enable users to build and aggregate datasets from public data, demos, or personal data.</figcaption> </figure> <hr/> <h2 id="what-forge-does">What Forge Does</h2> <h3 id="data-ingestion">Data Ingestion</h3> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_public_download-480.webp 480w,/assets/img/blogs/2026-03-23/forge_public_download-800.webp 800w,/assets/img/blogs/2026-03-23/forge_public_download-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_public_download.png" width="100%" height="auto" alt="Forge public data ingestion interface" title="Forge public data ingestion interface" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Public download configuration. Forge maintains integrations with public datasets for easy access to published literature.</figcaption> </figure> <p>Forge accepts all tabular-style data outputs, with integrations for four major data repositories: MetaboLights and Metabolomics Workbench for metabolomics, PRIDE for proteomics, and GEO for gene-based experiments. It allows uploading of tabular data with a specification of the data type, and it currently contains 9 synthetic and 14 public datasets for exploration. Users, once registered, can create projects with multiple datasets to stay organized.</p> <h3 id="qc-and-preprocessing-pipeline">QC and Preprocessing Pipeline</h3> <p>The preprocessing pipeline is a configurable sequence of steps exposed as a live control panel, specific to the data type. For example, for metabolomics, it enables missingness filtering, imputation (none, mean, median, half-minimum, KNN), normalization (total-area, median, quantile, internal standard, total protein), log transformation, scaling (standard, Pareto, min-max, robust), and batch correction.
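The shape of such a configurable pipeline can be sketched as a small declarative profile applied step by step. This is an illustrative sketch in plain pandas, not Forge's actual API; the key names, defaults, and step choices here are assumptions:

```python
# Hypothetical preprocessing profile as data, not code (illustrative only).
import numpy as np
import pandas as pd

profile = {
    "missingness_filter": 0.5,   # drop features missing in >50% of samples
    "imputation": "half_minimum",
    "normalization": "median",
    "log_transform": True,
    "scaling": "pareto",
}

def apply_profile(X: pd.DataFrame, profile: dict) -> pd.DataFrame:
    """Apply each configured step in order; X is samples x features."""
    # Missingness filter: keep features observed often enough.
    keep = X.isna().mean() <= profile["missingness_filter"]
    X = X.loc[:, keep]
    if profile["imputation"] == "half_minimum":
        # Common choice for left-censored intensities.
        X = X.fillna(X.min() / 2)
    if profile["normalization"] == "median":
        # Per-sample median normalization.
        X = X.div(X.median(axis=1), axis=0)
    if profile["log_transform"]:
        X = np.log2(X + 1)
    if profile["scaling"] == "pareto":
        # Pareto scaling: center, then divide by sqrt of the SD.
        X = (X - X.mean()) / np.sqrt(X.std(ddof=0))
    return X
```

Because the profile is data rather than code, it can be saved, diffed, and reapplied to a new dataset unchanged, which is the point of named profiles.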
Similarly, specific pipelines exist for metagenomics, proteomics, and transcriptomics. Each parameter is a labeled control with a documented effect. Configurations are saved as named profiles and can be reloaded across datasets, which means a core facility or team can enforce consistent preprocessing across a study without re-specifying the pipeline by hand each time. Finally, QC plots – run order, TIC/MIC, Q-Q plots, correlation – are directly visible, enabling evaluation of data quality against metadata, run order, and other configurable parameters.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_qc_qq-480.webp 480w,/assets/img/blogs/2026-03-23/forge_qc_qq-800.webp 800w,/assets/img/blogs/2026-03-23/forge_qc_qq-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_qc_qq.png" width="100%" height="auto" alt="Forge preprocessing pipeline control panel with QC" title="Forge preprocessing pipeline control panel with QC" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Preprocessing control panel. The full pipeline configuration is captured as a named profile, reproducible across datasets and shareable with collaborators.
Visualizations match data type.</figcaption> </figure> <h3 id="statistical-analysis">Statistical Analysis</h3> <figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_pca-480.webp 480w,/assets/img/blogs/2026-03-23/forge_pca-800.webp 800w,/assets/img/blogs/2026-03-23/forge_pca-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_pca.png" width="100%" height="auto" alt="Forge PCA plot with scaling and group colors" title="Forge PCA plot with scaling and group colors" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Principal component analysis. Variance-maximizing axes demonstrating the contributions of variables to group separation.</figcaption> </figure> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_volcano-480.webp 480w,/assets/img/blogs/2026-03-23/forge_volcano-800.webp 800w,/assets/img/blogs/2026-03-23/forge_volcano-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_volcano.png" width="100%" height="auto" alt="Forge volcano plot with effect size and multiple testing correction" title="Forge volcano plot with effect size and multiple testing correction" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Volcano plot for differential abundance. Fold-change and significance thresholds are interactively adjustable. Hover text includes effect size and globally corrected q-values.</figcaption> </figure> <p>Differential abundance testing covers pairwise tests (Welch’s and Student’s t-tests, the Mann-Whitney U), one-way ANOVA, and ANCOVA with covariate support. Effect sizes are reported alongside p-values for every test: Cohen’s d for pairwise comparisons, η² for ANOVA, R² for ANCOVA.
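Pairing a p-value with an effect size takes only a few lines; this is an illustrative scipy sketch, not Forge's implementation, and the pooled-SD convention for Cohen's d is an assumption:

```python
# Illustrative sketch: report an effect size next to every p-value.
import numpy as np
from scipy import stats

def welch_with_cohens_d(x: np.ndarray, y: np.ndarray):
    """Welch's t-test plus Cohen's d (pooled-SD convention)."""
    t, p = stats.ttest_ind(x, y, equal_var=False)
    nx, ny = len(x), len(y)
    # Pooled standard deviation across both groups.
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    d = (x.mean() - y.mean()) / pooled_sd
    return t, p, d
```

A tiny p-value with a negligible d is a different scientific story than the same p-value with d near 1, which is why the two are shown together.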
Multiple testing correction is applied globally across all comparisons in an analysis using BH FDR, Bonferroni, or Holm-Šídák. PCA and PLS-DA handle dimensionality reduction and group separation; clicking any point on the scatter opens a feature contribution panel decomposing the score by loading per axis.</p> <h3 id="longitudinal-analysis">Longitudinal Analysis</h3> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_longitudinal-480.webp 480w,/assets/img/blogs/2026-03-23/forge_longitudinal-800.webp 800w,/assets/img/blogs/2026-03-23/forge_longitudinal-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_longitudinal.png" width="100%" height="auto" alt="Forge longitudinal analysis showing feature trajectories over time" title="Forge longitudinal analysis showing feature trajectories over time" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Longitudinal analysis module. Feature trajectories are tracked across time points; the module identifies features with significant within-subject change patterns across the study timeline.</figcaption> </figure> <p>Support for longitudinal study designs is underrepresented in most analytical platforms, which are built around cross-sectional comparisons. Forge includes a dedicated longitudinal module that tracks feature trajectories across time points, models within-subject change, and identifies features with statistically significant time-course patterns. This matters for intervention studies, clinical trials, and any design where the relevant signal is the shape of change rather than a single-timepoint difference. 
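One way such time-course screening can work is per-feature model comparison; the following is a hypothetical sketch using an AIC comparison over polynomial fits, not Forge's actual model set:

```python
# Illustrative sketch: pick a per-feature trajectory model by AIC.
import numpy as np

def best_trajectory_model(t: np.ndarray, y: np.ndarray) -> str:
    """Fit candidate trends over time t; return the AIC-preferred model name."""
    n = len(y)
    aic = {}
    for name, degree in [("linear", 1), ("quadratic", 2)]:
        coeffs = np.polyfit(t, y, degree)
        resid = y - np.polyval(coeffs, t)
        rss = float(resid @ resid)
        k = degree + 2                      # coefficients plus error variance
        aic[name] = n * np.log(rss / n) + 2 * k
    return min(aic, key=aic.get)
```

The same loop extends to other candidate forms (e.g. a logistic curve fit by nonlinear least squares) as long as each contributes a comparable likelihood term.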
The module also scans the full feature set, fits each feature’s trajectory against time (or another continuous parameter, such as space), and reports the best-fitting model form (linear, quadratic, logistic, or other), enabling investigation of alternative longitudinal responses.</p> <h3 id="correlation-network-visualization">Correlation Network Visualization</h3> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_network-480.webp 480w,/assets/img/blogs/2026-03-23/forge_network-800.webp 800w,/assets/img/blogs/2026-03-23/forge_network-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_network.png" width="100%" height="auto" alt="Forge correlation network visualization" title="Forge correlation network visualization" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Correlation network visualization. Edge significance is tested with Fisher z-transform + BH correction before any threshold is applied. The interactive graph is explorable by feature, cluster, and pathway annotation.</figcaption> </figure> <p>Standard metabolomics statistics are feature-centric: they ask which individual metabolites are different between conditions. Network analysis asks a different question — which features move together, and what does the structure of those co-variation relationships reveal about the underlying biology? Forge builds correlation networks from the preprocessed feature matrix, applies Fisher z-transform significance testing with BH correction on all candidate edges before threshold filtering, and renders the result as an interactive graph.
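The edge-testing idea can be sketched as follows; this is an illustrative reimplementation of Fisher z testing plus Benjamini-Hochberg filtering, not Forge's internals:

```python
# Illustrative sketch: BH-significant correlation edges via Fisher z.
import numpy as np
from scipy import stats

def significant_edges(X: np.ndarray, alpha: float = 0.05):
    """X is samples x features; return (i, j, r) for BH-significant edges."""
    n, p = X.shape
    r = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices(p, k=1)                  # candidate edges i < j
    # Fisher z-transform: arctanh(r)*sqrt(n-3) is approximately N(0, 1).
    z = np.arctanh(r[iu]) * np.sqrt(n - 3)
    pvals = 2 * stats.norm.sf(np.abs(z))
    # Benjamini-Hochberg step-up across all candidate edges.
    order = np.argsort(pvals)
    m = len(pvals)
    adjusted = pvals[order] * m / (np.arange(m) + 1)
    q = np.minimum.accumulate(adjusted[::-1])[::-1]
    sig = order[q <= alpha]
    return [(int(iu[0][k]), int(iu[1][k]), float(r[iu][k])) for k in sig]
```

Testing every candidate edge before thresholding is what keeps the rendered graph from being a forest of chance correlations.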
The network is a systems-level interpretive layer that sits on top of the standard statistical output, not a replacement for it.</p> <h3 id="knowledge-graph-and-pathway-context">Knowledge Graph and Pathway Context</h3> <figure class="inline-figure"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-03-23/forge_gizmo-480.webp 480w,/assets/img/blogs/2026-03-23/forge_gizmo-800.webp 800w,/assets/img/blogs/2026-03-23/forge_gizmo-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-03-23/forge_gizmo.png" width="100%" height="auto" alt="Forge GIZMO knowledge graph computation" title="Forge GIZMO knowledge graph computation" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Biological knowledge graph computation. Pathway enrichment, causal-chain identification, and other graph computations over overlaid data help investigate mode of action and mechanisms.</figcaption> </figure> <p>Forge integrates pathway context through a knowledge graph layer that maps features to biological pathways and provides enrichment-level interpretation alongside the feature-level statistics. This is where multiomics becomes interpretable to a biologist: not “these 47 features are significant” but “these features converge on these pathways, with this evidence, and with these other supporting features.” This also solves the <a href="/blog/2026/scientific-babylon/">ontology issue</a>; teams collaborate at the biological, not the analytical, level. By integrating multiple open-source, public databases organizing different omics layers, phenotypes, and clinical data, Forge maintains a computation-ready knowledge graph that can be directly applied to different contexts.
Insilijo Science provides multiomics analysis, pipeline development, and data strategy advisory to life sciences teams: biotech, pharma, and academic groups working on metabolomics, proteomics, and integrated omics studies. If you are building a multiomics analytical capability and want to talk about what that looks like in practice, <a href="mailto:joseph.j.gardner@gmail.com">reach out</a>.</p> <hr/> <h2 id="whats-next">What’s Next</h2> <p>The immediate roadmap includes pathway-centric visualization – enrichment drilldown and canonical pathway overlays rather than just feature-level output – and multi-omics integration proper: cross-assay sample alignment, late-fusion approaches, and coordinated visualization across omics layers. The knowledge graph integration will deepen as the companion tools in the Insilijo stack mature. Several software papers are in preparation for Forge and several other tools I’ve been working on.</p> <p>The longer arc is the one I have been working on since graduate school: making the full path from measurement to biological meaning something that a research team can traverse reproducibly, collaboratively, and without losing the scientific accountability for every analytical decision along the way. 
Forge is the current best version of that.</p>]]></content><author><name></name></author><category term="software"/><category term="science"/><category term="bioinformatics"/><category term="multiomics"/><category term="forge"/><category term="collaboration"/><summary type="html"><![CDATA[Introducing Forge: a production-deployed multiomics analytical platform built for the full arc from raw data ingestion to interactive biological interpretation.]]></summary></entry><entry><title type="html">When the levees break</title><link href="https://insilijo.github.io/blog/2026/artificial-boundaries/" rel="alternate" type="text/html" title="When the levees break"/><published>2026-02-15T12:00:00+00:00</published><updated>2026-02-15T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/artificial-boundaries</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/artificial-boundaries/"><![CDATA[<figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-15/olympics_opening_ceremony.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-15/olympics_opening_ceremony.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Still from Milano Cortina Winter Olympics 2026, NBC/Peacock.</figcaption> </figure> <p>The 2026 Winter Olympics in Milan opened with AI-assisted imagery. The Super Bowl featured 15 AI-centered commercials: nearly $120 million in ad spend. This has led <a href="https://www.youtube.com/watch?v=Z68ncMsEgsI">some analysts, like Carl Brown and the folks he cites</a>, to note the eerie similarities to the 2000 dot-com bubble, when 14 of 61 spots were dot-com related.
At $2.2 million per spot, that’s $30.8 million, or nearly $60 million in today’s dollars.</p> <p>The word now circulating is “bubble.” The comparison is obvious: it looks increasingly like a high-risk game of hot potato, played among seven companies. Bloomberg’s Odd Lots <a href="https://www.youtube.com/watch?v=z4ct_eDYx2A">recently suggested</a> this cycle combines dot-com excess, real estate leverage, and aggressive financing models. We’re seeing massive investments in a technology with real applications but without the promised productivity boon.</p> <p>AI will almost certainly endure. It compresses cognition the way the internet compressed distribution. The capability expansion is real: the question is not whether it persists, but how institutions metabolize it.</p> <figure> <video src="/assets/video/circular_data_economy.mp4" width="100%" controls="" playsinline=""></video> <figcaption class="caption">Projected money flow between major AI companies by user jcceagle on <a href="https://www.reddit.com/r/dataisbeautiful/comments/1ppla7o/oc_mapping_the_flow_of_revenue_and_investment/">reddit.com/r/dataisbeautiful</a>, using PlotSet.</figcaption> </figure> <h3 id="dissolution-of-self">Dissolution of self</h3> <p>The public reaction hasn’t been awe. It’s been ridicule. Instead of being wowed by the technology, the discourse is tearing it apart: on the world’s largest stages, we’ve seen unconscionable brand mistakes like incorrect logos, gibberish words, and eerie <em>I, Robot</em>-style images.</p> <p>What I think the public is responding to is not the innate benefit or limitation of the technology; that story is yet to be written. It’s the misalignment of the technology with reality. AI is a boundary-eroding force that redistributes cost, responsibility, and meaning faster than institutions can re-contain them.
We’re operating in the gap between promise and realization.</p> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-15/tasks-480.webp 480w,/assets/img/blogs/2026-02-15/tasks-800.webp 800w,/assets/img/blogs/2026-02-15/tasks-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-15/tasks.png" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">In CS, it can be hard to explain the difference between the easy and the virtually impossible. xkcd.com/1425, <em>Tasks</em>, by Randall Munroe.</figcaption> </figure> <p>As I argued in my previous post, a common institutional conceptualization of boundaries is core to mission focus and execution. Our existence in AI limbo indicates we’re in an unbounded, unrealized territory. Institutions rely on boundaries to assign cost, responsibility, and deferral. When those boundaries fail, valuation detaches from structural reliability. Perceived value detaches when execution lags expectation.</p> <p>In the dot-com bubble, we saw a new technology (one that has since become fundamental!) that lacked viable revenue models and profitability. What we gained was a completely novel industrial sector and productivity gains, but only after a devastating market crash that compromised the whole technological sector. Valuation outran integration, and then integration became its own viable industry.</p> <p>In the dot-com era, that integration became its own industry — DevOps, cloud infrastructure, cybersecurity. AI will follow a similar path. The winners will build internal AI governance and review systems as products, not policies.</p> <p>As before, so now: the gibberish, the hallucinations, the energy expenditure – these are potentially just engineering defects. Engineering defects get solved.
But the defects are not the real signal: what is harder to solve is boundary dissolution.</p> <p>AI collapses distinctions that institutions depend on:</p> <ul> <li>Brand vs. Designer — Who is responsible when the logo is wrong?</li> <li>Author vs. Model — Who owns the sentence and its meaning?</li> <li>Expert vs. Prompt Engineer — Who is accountable for the analysis and its communication?</li> <li>Junior vs. Senior — Who absorbs review cost and ensures architectural compatibility?</li> <li>Tool vs. Decision-Maker — Who signs off and under what conditions?</li> <li>Labor vs. Capital — Who captures productivity gains?</li> </ul> <p>These distinctions are not philosophical. They are how we assign cost, accountability, and compensation. When AI compresses them, output accelerates but responsibility diffuses. And diffusion feels like efficiency until the bill arrives. Or until you have to do something with all that output.</p> <h3 id="there-is-no-singular-ai-economy">There is no singular AI economy</h3> <p>AI represents a broad, infrastructural change, but what both proponents and detractors identify is not what it does on the macroscopic level, but what happens on the individual level. Whether people argue that it extends an individual’s faculties or replaces them entirely, the decisions that AI enables are primarily scalable microscopic decisions. A backend engineer can rapidly prototype a front-end that they can pass on to their peer. A hobbyist trader can pull data and apply machine learning algorithms to predict the right time to buy stocks, consistent with non-proprietary quantitative traders.</p> <p>Individually, none of this destabilizes. At scale, it compounds. Each AI-assisted decision externalizes cost to infrastructure, oversight, energy, liability. Individually trivial. Systemically compounding.</p> <p>There is no singular “AI economy.” There are millions of boundary crossings embedded in marketing teams, codebases, research labs, legal drafts, and art studios.
Each reduces local friction while exporting integration cost elsewhere.</p> <p>Systems can absorb small deferrals. At scale, those deferrals synchronize. Infrastructure costs rise. Trust erodes. Accountability diffuses. Capital reallocates before integration catches up. Eventually, no one fully understands the system they depend on. Output accumulates faster than integration capacity.</p> <p>The durable institutions will not eliminate boundaries; they will redesign them. Instead of “AI replaces junior engineers,” they will define new review layers, new authorship standards, new liability checkpoints. Boundary collapse is destabilizing. Boundary redesign is competitive advantage.</p> <p>AI is boundary arbitrage at scale. It allows individuals to capture upside while exporting integration cost into shared infrastructure.</p> <h3 id="the-paradox-of-governance">The paradox of governance</h3> <p>The sanitized executive narrative is empowerment: higher productivity per employee. The unsanitized version is compression: fewer employees per function.</p> <p>AI lowers the cost of crossing boundaries. But institutions were built around those boundaries. When crossing becomes cheap, containment becomes expensive. Oversight, review, and coordination were not inefficiencies. They were containment infrastructure.</p> <p>AI does not integrate itself. It requires deliberate boundary redesign, cross-functional communication, and explicit ownership of review, infrastructure, and liability cost. Without that, what looks like productivity is simply deferred integration expense.</p> <p>AI without explicit integration design is not transformation. It is cost redistribution.</p> <p>While AI promises access to upskilling, it’s worth contending with Chesterton’s Fence here. Responsibilities are encoded institutional memory; they are not just skill, they are context. In some cases it’s useful to break and revisit these systems, but in others it can be catastrophic. 
We come by these containment structures honestly; they were direct responses to gaps in technical capability.</p> <p>When disrupted, context bleeds out first. Authorship diffuses, labor is conducted elsewhere, and liability is distributed beyond the person best equipped to handle it. Pathways previously created to effect formal oversight, audit layers, internal controls, and legal containment break because the technical function has moved elsewhere.</p> <p>AI creates the appearance of productivity gains by compressing labor boundaries. In exchange, it increases governance volatility. This is not a tooling problem. It is an accounting problem, a liability problem, and a governance problem. Institutions that do not redesign their containment structures will discover that their efficiency gains were cost deferrals.</p> <h3 id="why-bubbles-form">Why bubbles form</h3> <p>A bubble forms when expectation outpaces containment.</p> <p>The internet ultimately transformed the economy, but only after valuation collapsed to match integration capacity. Infrastructure, tooling, governance, and discipline lagged promise.</p> <p>The internet endured. Valuation did not.</p> <p>The mistake was temporal. Valuation preceded reintegration. The technology and infrastructure were disruptive but immature, leading to boundary erosion of existing systems without reintegration into the new world.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-15/stages_of_a_bubble.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-15/stages_of_a_bubble.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Stages in a Bubble. Dr. Jean-Paul Rodrigue, Dept. of Global Studies &amp; Geography, Hofstra University.</figcaption> </figure> <p>AI is facing a similar climate.
We’re in a race to support the technology, but it’s unclear how we do so at scale. Institutional absorption is immature, and the displacements and deferrals are not yet priced in or addressed. Cost reallocation is opaque. Governance and regulation are underdeveloped. Labor displacement is unpriced. Brand and liability containment are unclear.</p> <p>The technology has disrupted core economic and technical infrastructure. Crucial variables are now unclaimed. The membrane of the system is broken: responsibility is unassigned, liability is unpriced, and labor is displaced faster than governance can reconstitute it. For an economy that prioritizes capital flow and certainty, those deferrals must be addressed.</p> <h3 id="who-wins-when-the-bubble-pops">Who wins when the bubble pops?</h3> <p>Losses follow when integration lags valuation. Winners invest in reintegration before markets force them to. They price governance as infrastructure and don’t overindex on the early hyped applications and spectacles. They treat workforce integration as asset development. They use AI for system augmentation, not output acceleration.</p> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-15/ai_slip_opening_ceremony.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-15/ai_slip_opening_ceremony.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Still from Milano Cortina Winter Olympics 2026, NBC/Peacock.</figcaption> </figure> <p>Reintegration requires explicit cost attribution. If AI reduces labor in one function but increases review, infrastructure, and liability exposure elsewhere, those costs must be surfaced rather than absorbed invisibly.
Organizations that treat AI as net labor compression without integrated cost accounting will misprice their gains.</p> <p>The winners will treat workers at risk of displacement as people the technology should invest in and improve, not replace. The winners will not be those who maximize output. They will be those who intentionally manage and internalize displacement before the market forces them to.</p> <p>A bubble is not excessive investment. It is enthusiasm expanding faster than containment.</p> <p>That’s where we are: AI appears mature because it is visible, but visibility is not integration. It’s spectacle. Until then, we’ll see JECK OUTIbIyMDES D’AIIVER at the Milan Winter Olympics.</p> <p>Markets price acceleration. Reality prices containment.</p>]]></content><author><name></name></author><category term="organizations"/><category term="science"/><category term="ai"/><category term="AI"/><category term="integration"/><category term="structure"/><category term="boundaries"/><category term="science"/><category term="integration"/><summary type="html"><![CDATA[AI without containment guarantees systemic cost reallocation and markets will eventually correct]]></summary></entry><entry><title type="html">Dunkies, Data, and Defensive Equilibrium</title><link href="https://insilijo.github.io/blog/2026/biotechs-beach-days/" rel="alternate" type="text/html" title="Dunkies, Data, and Defensive Equilibrium"/><published>2026-02-15T12:00:00+00:00</published><updated>2026-02-15T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/biotechs-beach-days</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/biotechs-beach-days/"><![CDATA[<figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-22/dunkies_map.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-22/dunkies_map.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null;
$('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Google Maps screenshot of Dunkin' locations in downtown Boston. This is the thumbnail image from https://www.boston.com/news/wickedpedia/2023/02/06/closest-two-dunkins-massachusetts/.</figcaption> </figure> <p>Humans like to believe in their own rationality. Markets, we are told, allocate efficiently. Firms optimize for productivity. Capital flows, inevitably, toward utility.</p> <p>Yet we repeatedly observe clustering that appears inefficient: Dunkin’ stores across the street from one another. Mattress retailers on the same corner. Biotech companies converging on nearly identical AI positioning.</p> <p>These are not market failures. They do not mean that Bostonians are buying two mattresses a day, or <a href="https://www.youtube.com/watch?v=FSvNhxKJJyU">grabbing a cruller then an extra large and then three Parliaments</a>. These configurations are equilibria.</p> <p>Hotelling’s Law describes why. Imagine two vendors on a 100-yard beach. The socially optimal configuration is one at 25 yards and one at 75; that way, consumers never walk more than 25 yards. Utility is maximized for both producers and consumers.</p> <p>But that configuration – while <em>efficient</em> for utility – is unstable. One vendor can move toward the center and capture more territory. The other must follow. Eventually both stand, back-to-back, at exactly 50 yards: inefficient for the beach, but stable for the vendors.</p> <p>Nash equilibrium is often inefficient. It is stable because unilateral deviation from the competitive incentive structure is punished.</p> <p>We mistake inefficiency for irrationality because we perceive a loss of utility. Competitive systems, however, frequently optimize for defensive stability rather than socially optimal scenarios.</p> <p>In biotech, the center of the beach is not geographic.
It is narrative.</p> <h2 id="narrative-as-boundary-selection">Narrative as Boundary Selection</h2> <p>In <a href="/blog/2026/boundary-illusions/">Boundary Illusions</a>, I argued that framing determines what a system can solve. In <a href="/blog/2026/scientific-babylon/">When the Levees Break</a>, I argued that AI erodes institutional containment faster than governance can redesign it.</p> <p>Our boundaries don’t operate in isolation. They’re all fighting for scarce resources under heavy uncertainty, especially now. Therefore, there’s another force shaping boundaries: competition under capital constraint.</p> <p>Drug development remains long-cycle and capital intensive, averaging roughly <a href="https://www.rand.org/news/press/2025/01/07.html">$950 million</a> over <a href="https://www.phrma.org/policy-issues/research-development">12-15 years</a> with single-digit-to-low-double-digit success probabilities. In restrictive funding environments, firms compete on two currencies: data and signal.</p> <p>When robust data are available, the frame is grounded in measurable outcomes. But in early-stage discovery – where proof is sparse, continuing means staking tens of millions more dollars, and ROI might still be a decade away – signal density becomes a strategic asset.</p> <p>In 2026, AI has become the highest-density signal available.</p> <p>A proprietary reasoning engine implies scale, speed, defensibility, platform leverage, optionality. Paired with public or large-scale private repositories of biobank data, AI appears risk-mitigating and proactive. However, whether those optimistic implications are realized is secondary in the <em>short term</em>.
What matters is that absence of AI carries <em>immediate</em> cost.</p> <p>The incentives are straightforward:</p> <ul> <li>The reputational cost of appearing technologically obsolete is immediate.</li> <li>The partnership cost of lacking AI positioning is immediate.</li> <li>The capital-raising cost of narrative deviation is immediate.</li> <li>The integration, governance, and opportunity costs of adoption are deferred.</li> </ul> <p>Under these conditions, clustering around a hyped technology is rational. In other words, no firm wants to be the vendor at 25 yards when capital markets are crowding the center.</p> <h2 id="we-have-seen-this-before">We Have Seen This Before</h2> <p>This is not the first time biotech has clustered around a narrative center.</p> <p>CRISPR created a similar gravitational pull. The underlying science was – and remains – transformative. But once Jennifer Doudna and others demonstrated the platform potential of the technology, firms rapidly expanded scope into multi-indication pipelines, rapid programmability, universal editability.</p> <p>CRISPR has endured. However, we’re not freely editing organisms everywhere to cure disease and advance synthetic biology. The platform narratives did not all scale.</p> <p>Capital eventually differentiated between:</p> <ol> <li>Firms that translated editing capability into validated therapeutic assets, and</li> <li>Firms whose surface area expanded faster than their clinical maturation.</li> </ol> <p>The mid-2010s “platform biotech” wave followed a similar arc. 
Horizontal applicability across therapeutic areas, automation to facilitate scale, end-to-end discovery engines, integrated omics stacks, and synthetic biology toolkits like CRISPR together promised compounding advantage.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-22/gartner_hype_cycle-480.webp 480w,/assets/img/blogs/2026-02-22/gartner_hype_cycle-800.webp 800w,/assets/img/blogs/2026-02-22/gartner_hype_cycle-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-22/gartner_hype_cycle.png" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Top AI Innovations in 2025, Gartner Hype Cycle. From https://bankingfrontiers.com/top-ai-innovations-in-2025-gartner/</figcaption> </figure> <p>Some platforms survived by narrowing scope and integrating deliberately. Others discovered that declared platform breadth required managerial coherence, compute infrastructure, hardware, and capital reserves that outpaced their root systems. They found that their main selling point – end-to-end generalizability – was what ultimately became untenable to maintain.</p> <p>In each cycle, the technology was real. 
What failed was not the capability of the technology; it was surface-area discipline, governance, and boundary maintenance under capital constraint.</p> <p>AI is following a similar trajectory.</p> <h2 id="surface-area-and-selection">Surface Area and Selection</h2> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-22/american_chestnut.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-22/american_chestnut.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">The American chestnut tree (Castanea dentata) was once a dominant species in the Appalachian forests, making up an estimated 25% of the hardwood canopy. These majestic trees could grow over 100 feet tall and live for centuries. (Appelo Archives Museum)</figcaption> </figure> <p>Narrative convergence has a structural consequence: surface area expansion. Each major hype cycle promises abstraction and generalizability, but fails to live up to its most idealized potential.</p> <p>In our current system, AI-first organizations often widen their declared scope, and these moves feel quite familiar: multi-indication platforms, horizontal data ingestion, end-to-end automation, cross-therapeutic applicability. Each additional claim presents both déjà vu and an expansion of the boundary of responsibility.</p> <p>Surface area is not free.</p> <p>A clinical-stage biotech with a $40–80M annual burn and 18–24 months of runway does not have indefinite integration capacity. Every additional therapeutic area, data vertical, or automation layer compounds managerial complexity, review cost, compute infrastructure, and regulatory exposure. That’s as true now as it was then.</p> <p>Surface area expands geometrically. Revenue does not.</p> <p>In capital-abundant environments, surface area can be subsidized by signal.
In constrained markets, maintenance outpaces replenishment. Biotech is structurally capital dependent, and current conditions amplify that dependency.</p> <p>This means that the failure mode will not be technical incapacity. It will be capital starvation induced by surface-area inflation.</p> <p>High-surface-area firms without validated assets will discover that their narrative canopy expanded faster than their root systems. Capital markets tolerate delayed yield because of collective signal mechanisms. Those same markets do not tolerate indefinite canopy without harvest.</p> <p>When signal density no longer compensates for burn, selection pressure intensifies.</p> <p>The result will not be a broad rejection of AI. It will be a Darwinian shakeout.</p> <p>Firms that optimized for height over yield will struggle to maintain themselves. Firms that converted narrative into durable assets – validated molecules, disciplined pipelines, explicit governance structures, scientific findings, resonant commercial narratives – will survive.</p> <p>The correction won’t obliterate the surface, it will compress it.</p> <h2 id="after-the-center">After the Center</h2> <p>The durable organizations will not avoid the center of the beach. They will understand why they are standing there.</p> <p>They will:</p> <ul> <li>Treat AI as boundary redesign, not boundary dissolution.</li> <li>Price governance as infrastructure, not overhead.</li> <li>Narrow declared scope until integration capacity matches ambition.</li> <li>Convert signal into fruit before expanding canopy.</li> </ul> <p>In practice, this means narrowing before expanding. Converting signal into one validated program before declaring five. Using AI to kill programs earlier rather than to justify new ones. Surface area that grows without proportional asset maturation is not platform strength. It is balance sheet brittleness.</p> <p>Markets reward acceleration. Selection rewards containment. The system is not irrational. 
It is optimizing defensively, until capital reintroduces discipline.</p> <p>And when it does, the tallest trees with the fewest fruit will fall first, leaving the firms that stage-gate surface area expansion against validated asset conversion ratios.</p>]]></content><author><name></name></author><category term="organizations"/><category term="science"/><category term="ai"/><category term="AI"/><category term="integration"/><category term="structure"/><category term="boundaries"/><category term="science"/><category term="integration"/><summary type="html"><![CDATA[John Nash smiles wanly on our capital-constrained, AI-rich environment in biotech's beach days.]]></summary></entry><entry><title type="html">Boundary Illusions</title><link href="https://insilijo.github.io/blog/2026/boundary-illusions/" rel="alternate" type="text/html" title="Boundary Illusions"/><published>2026-02-07T12:00:00+00:00</published><updated>2026-02-07T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/boundary-illusions</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/boundary-illusions/"><![CDATA[<figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-07/gare-crop.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-07/gare-crop.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">One of only two photographs Cartier-Bresson was known to have cropped. Behind the Gare Saint-Lazare (Uncropped) (1932) by Henri Cartier-Bresson.</figcaption> </figure> <p>The photographer Henri Cartier-Bresson was famous for arguing that composition must take place in the viewfinder, rather than the darkroom. 
“If you start cutting or cropping a good photograph, it means death to the geometrically correct interplay of proportions,” he claimed.</p> <p>This was a courageous take, especially at the time, when 35mm was rare and expensive. But it hinted at a universal truth: the photograph doesn’t exist within the boundaries of the frame, but includes the frame itself. The frame operates and organizes the entities within, not functioning as a cliff but as a guide.</p> <p>The phenomenon of boundary selection isn’t relegated to art. One of the core responsibilities of an engineer is not to think in global laws but to apply boundaries to a system. We are confined to certain physical realities – Newton’s laws, $E=mc^2$, thermodynamics, conservation – but we’re not trying to <strong>describe</strong> the world, we’re trying to make a decision on a system. A reactor failing makes <em>physical</em> sense because entropy is created, heat is lost, material is converted to energy predictably. But it does not make sense to us, the designers, who are responsible for engineering a system that operates productively <em>by our definition</em> but also within the boundaries of physics.</p> <p>The goal, then, is not to manipulate physical reality, but to create a manageable one. The goal is to frame.</p> <h3 id="how-do-we-manage">How Do We Manage?</h3> <p>The challenge is frame selection: what do we include to selectively manage and manipulate? This can look like a lot of different things: which products we select for market, which market segments we format for, what design choices we make.</p> <p>In commercial terms, we optimize for technical resources, timelines, capital efficiency, and measurable outcomes. In engineering terms, we optimize for stability, throughput, and control. In policy terms, we optimize for legibility: what can be counted, governed, and reported.</p> <p>In each case, optimization requires omission.
We select objectives and formalize them, and everything else becomes externalities. What we include consumes capital, attention, and legitimacy. What we exclude becomes opportunity cost: not erased, but displaced beyond the system’s accounting.</p> <p>But optimization is never free. Every boundary excludes variables that still act and still accumulate pressure.</p> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-07/pareto-480.webp 480w,/assets/img/blogs/2026-02-07/pareto-800.webp 800w,/assets/img/blogs/2026-02-07/pareto-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-07/pareto.jpg" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">We operate in a Pareto Front, where there is no 'perfect' choice, but a selected set of optima with tradeoffs. We can move the curve and move along it, but it doesn't make the choice for us. Kesireddy, A., &amp; Medrano, F. A. (2024). Elite Multi-Criteria Decision Making—Pareto Front Optimization in Multi-Objective Optimization. Algorithms, 17(5), 206. https://doi.org/10.3390/a17050206.</figcaption> </figure> <p>This is the quiet illusion of framing: that exclusion is equivalent to irrelevance. In reality, it is merely displacement. Risk, uncertainty, opportunity, labor, and failure are not eliminated by being excluded from the system. They are reassigned to another team, another market, another community, or another moment in time.</p> <h3 id="what-do-frames-look-like">What do frames look like?</h3> <p>In chemical engineering, system boundaries are drawn to make equations solvable. Heat loss becomes a term. Side reactions are ignored. The model converges. In production, those ignored reactions foul catalysts, warp vessels, and shorten lifetimes. 
The physics was never wrong, the frame was incomplete.</p> <p>In biological research, we select model organisms and controlled environments because they are tractable. The intervention works in vitro. The signal is clear in mice. What remains excluded are environmental variability, long-term adaptation, and human context. Translation becomes the rediscovery of what the frame left out.</p> <p>In organizations, we define success through metrics that can be counted: quarterly growth, customer acquisition, cost per unit. Culture, long-term resilience, and latent risk remain harder to quantify. The system optimizes for what is legible. Over time, the measurable becomes the mission.</p> <figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-07/integrative_biology-480.webp 480w,/assets/img/blogs/2026-02-07/integrative_biology-800.webp 800w,/assets/img/blogs/2026-02-07/integrative_biology-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-07/integrative_biology.jpg" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">The complexity and sensitivity of biology requires we assemble knowledge from discrete, controlled frames. Urbanski AH, Araujo JD, Creighton R, et al. Integrative Biology Approaches Applied to Human Diseases. In: Husi H, editor. Computational Biology. Brisbane (AU): Codon Publications; 2019 Nov 21. Figure 1, [A framework for integrative biology...]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK550336/figure/Ch2-f0001/ doi: 10.15586/computationalbiology.2019.ch2.</figcaption> </figure> <p>Framing determines not only what a system can solve, but what it is allowed to know. Early choices harden into reporting structures, metrics, and incentives. Metrics become dashboards. Dashboards become strategy. Strategy becomes goals. 
Over time, a provisional frame calcifies into institutional knowledge, and what began as a modeling convenience becomes “how the system works.”</p> <p>The system becomes increasingly efficient at solving the problem it defined and increasingly blind to the problems it excluded. Systems rarely fail because reality violates the model. They fail because the model trained the organization to ignore what did not fit.</p> <p>Boundary selection is therefore not a technical prelude to “real work.” It is the work. How we formulate a system determines not just how we solve it, but who benefits, who absorbs the cost, and who remains invisible. The illusion is that the frame is neutral. The reality is that it is decisive.</p> <p>Cartier-Bresson insisted that composition must happen in the viewfinder because once the frame is set, the photograph is already determined. The same is true of systems. By the time we are debating outcomes, budgets, or reforms, the geometry has already been chosen. The proportions were fixed when we drew the boundary, and finalized when we defended it.</p>]]></content><author><name></name></author><category term="organizations"/><category term="science"/><category term="problems"/><category term="organizations"/><category term="boundaries"/><category term="exclusion"/><category term="pareto"/><summary type="html"><![CDATA[How we formulate is how we solve]]></summary></entry><entry><title type="html">Anti-Governance, Original Sin, and Conway’s Law</title><link href="https://insilijo.github.io/blog/2026/divergence-and-failure/" rel="alternate" type="text/html" title="Anti-Governance, Original Sin, and Conway’s Law"/><published>2026-01-27T12:00:00+00:00</published><updated>2026-01-27T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/divergence-and-failure</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/divergence-and-failure/"><![CDATA[<p>Loss through human divergence is a recurring motif across mythologies and 
modern narratives. Whether it’s the Tower of Babel, <em>Anna Karenina</em>, or Apple TV’s <em>Pluribus</em>, coherence is portrayed as fragile: something briefly held within a community before fragmenting under competing perspectives, incentives, or truths.</p> <p>Organizations face the same problem. We are increasingly aware of our fractured knowledge (<a href="/blog/2026/scientific-babylon/">as explored here</a>) and we even possess idealized models for addressing it (<a href="/blog/2026/catalog-of-catalogs/">outlined here</a>). And yet, more often than not, we find ourselves in a disequilibrium between data in and meaning out.</p> <p>Functional organizations are alike: clear ownership, shared purpose, and mechanisms for resolving conflicts or operational differences as they evolve. Dysfunctional organizations, by contrast, fail in many different ways. Those failures are not random. They tend to fracture along the same structural seams, where decisions were deferred, authority was ambiguous, or tradeoffs were never made explicit.</p> <p>In scientific organizations, these fractures are especially visible because messy, expensive, large, complex, and regulated data magnifies them. When coherence breaks down, it does so predictably by eroding structure, dissolving focus, or overwhelming orchestration. What follows are not edge cases or cultural quirks, but recurring governance failures that turn abundance into noise.</p> <p>In the context of scientific data systems, I’ll outline five such failure modes.</p> <h2 id="original-sin">Original Sin</h2> <p>Every data system inherits its earliest assumptions. Those assumptions are almost never wrong, but they are always incomplete.</p> <p>Early success hardens into structure before its implications are understood or felt. In scientific organizations, this usually happens at moments of genuine discovery that translates to early commercial success or funding. 
A postdoc identifies a compound with higher efficacy and lower toxicity. A mid-career scientist devises a scalable manufacturing process for a difficult protein. The work is real, urgent, and valuable.</p> <p>At that moment, no one asks how these results should be schematized, governed, or made durable. They shouldn’t. The goal is progress, not architecture.</p> <p>But early decisions – from file formats to identifiers, naming conventions, database structures, knowledge limits, and ownership boundaries – quietly fix the future shape of the system. What begins as a pragmatic shortcut becomes a moral commitment. Downstream teams inherit constraints they did not choose, and over time those constraints, now inscrutable, are mistaken for deified intent.</p> <p>Original Sin is not bad design. It is unexamined design, preserved long past the conditions that justified it.</p> <h2 id="underinvestment-in-stewardship">Underinvestment in Stewardship</h2> <p>As organizations evolve, they must become increasingly selective with resources. Early success often encourages investment in technology: more scientists, more direct support, more output. During periods of strategic repositioning, those same organizations may assume that existing systems are robust enough to persist with less attention. Even in uneventful drift, as stewards move on to new roles or opportunities, institutional protections quietly decay.</p> <p>The underlying assumption is that something built correctly (or even fit for purpose) will last indefinitely. It won’t.</p> <p>Stewardship is an unassuming function. Its greatest yields are negative space: fewer disruptions, smoother access, less friction. These outcomes rarely excite investors or justify headcount. Failure is also slow. Platforms corrode gradually; they almost never collapse outright.</p> <p>Stewardship is not a technical role. It is an organizational function and an institutional posture.
When it is underinvested, the closest contributors to the pain points often compensate through yeomanlike labor, building informal workarounds or cottage industries of reconciliation. To leadership, this can appear efficient. In reality, it externalizes cost.</p> <p>The burden of maintaining coherence shifts downstream. New hires have to play games of telephone and archaeology, adjacent teams must support domain knowledge they don’t have, and end users must reconstruct context that was never preserved. Over time, the organization becomes dependent on memory rather than structure.</p> <p>Stewardship is the work of maintaining meaning across time. When it is absent, systems do not fail loudly; they rot quietly.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-01-27/garfield-existentialism.avif" sizes="95vw"/> <img src="/assets/img/blogs/2026-01-27/garfield-existentialism.avif" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Garfield Minus Garfield: Existentialism. Jim Davis and Dan Walsh</figcaption> </figure> <h2 id="organic-growth">Organic Growth</h2> <p>Organizations are never born just once. They pass through a series of transitions: promising platforms give way to new candidates, technologies are adopted, programs are cut, teams re-form, and key personnel move on. Throughout these changes, data continues to accumulate. What changes is not volume, but direction.</p> <p>It is natural to assume that methods which worked previously will continue to work. During transitions, however, pressure mounts. Deadlines tighten, scope shifts, and small deviations become necessary. A workaround is introduced to accommodate an edge case or unblock progress.
The change functions, delivers value, and – crucially – appears harmless.</p> <p>At this point, a choice exists: reconcile the exception back into the system by extending the schema, revising constraints, or explicitly redefining scope. Or leave it alone.</p> <p>Organic overgrowth occurs when reconciliation is deferred. The workaround takes root outside the original structure, not as an intentional fork, but as a practical accommodation. Over time, these local optimizations proliferate. They remain functional, even useful, but increasingly distinct. What began as a single exception becomes a parallel record of truth.</p> <p>This is how competing ontologies emerge in good faith. A table is created without key relationships. Clinical trial metadata lives in a PowerPoint because it is visible and fast. A secondary identifier system is introduced to satisfy an urgent need. Each choice is rational in isolation. Collectively, they fracture coherence.</p> <p>The danger is not that these systems fail. It is that they succeed. Parallel frameworks demand duplicate effort, require constant translation, and resist consolidation because they are already embedded in workflows. Overgrowth is difficult to reverse precisely because nothing is broken.</p> <p>Organic overgrowth is entropy mistaken for flexibility. 
Without deliberate reconciliation, systems expand outward instead of upward, accumulating exceptions until the cost of unification outweighs the will to attempt it.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-01-27/myst-480.webp 480w,/assets/img/blogs/2026-01-27/myst-800.webp 800w,/assets/img/blogs/2026-01-27/myst-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-01-27/myst.jpg" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Myst library from https://www.trueachievements.com/game/Myst/walkthrough/3. Rand and Robyn Miller, Cyan, 1993. I will not be taking further questions.</figcaption> </figure> <h2 id="conways-law">Conway’s Law</h2> <p>“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” — Melvin Conway, 1968</p> <p>Melvin Conway’s observation is often treated as a technical curiosity: a two-person team builds a two-pass compiler, a four-person team builds a four-pass one. In practice, it describes far more than software architecture. It applies equally to organizations, data systems, and the knowledge structures that emerge from them.</p> <p>At its core, Conway’s Law is neutral. Structure follows communication. Decisions follow authority. Boundaries propagate. X begets Y.</p> <p>It becomes a failure mode when X is already compromised.</p> <p>Knowledge systems are not abstract representations of truth; they are accumulations of decisions, incentives, and justifications made by an organization at a specific moment in time. They encode not only what is known, but who was allowed to decide, who needed to be consulted, and which tradeoffs were acceptable. When communication paths fragment, so does meaning.</p> <p>Because decisions are cumulative, small disruptions matter. 
A temporary reporting line becomes permanent. A workaround becomes precedent. A silenced stakeholder becomes an absent domain. Over time, the structure of the organization becomes its intent, and that intent determines the structure of the system.</p> <p>This is why Conway’s Law is not merely descriptive but predictive. Systems do not fail because they are poorly designed. They fail because they faithfully mirror organizations that are themselves in flux, misaligned, or fragmented.</p> <p>Conway’s Law explains why coherence cannot be fixed downstream. When communication breaks, architecture follows. When authority diffuses, meaning fractures. The system does not resist this outcome: it records it.</p> <h2 id="unwillingness-to-revise">Unwillingness to Revise</h2> <p>All of these failure modes carry an element of inevitability. The future cannot be predicted, so the present cannot be perfectly constructed around it. Our flexibility can be a strength but can limit focus and our best attempts at orchestration can become outdated or misdirected.</p> <p>Left unmitigated, these dynamics can degrade systems. But none of them represent terminal failure. What <strong>is</strong> terminal is the unwillingness to confront them.</p> <p>Revision is a fact of scientific life. Until 1944, proteins were thought to carry genetic information. Until the late nineteenth century, disease was attributed to miasma rather than microbes. Science isn’t conducted by predicting, preserving, or even finding correct answers, but by revising frameworks when evidence demands it.</p> <p>As scientists, we accept that hypotheses must change. The same principle must apply to how we record, structure, and govern our knowledge. When existing systems obstruct interpretation or decision-making, they must be revised, regardless of how successful or familiar they once were.</p> <p>This is rarely easy. Revision is both technical and political. 
It requires deep institutional understanding and legitimate authority, a combination that is seldom concentrated in a single role. As a result, revision often occurs only when failure becomes visible enough to bridge the gap between technical insight and executive action. That can be too late in a competitive landscape.</p> <p>But without revision, failure modes compound. Original assumptions harden. Stewardship erodes. Exceptions proliferate. Organizational fractures imprint themselves onto systems. What begins as manageable drift becomes structural decay.</p> <p>Revision is the only counterforce. It must be deliberate, authorized, and decisive. Without it, systems do not merely age; they accumulate incoherence and fracture into individual, incompatible truths.</p>]]></content><author><name></name></author><category term="software"/><category term="science"/><category term="ontologies"/><category term="standards"/><category term="omics"/><category term="multiomics"/><summary type="html"><![CDATA["All happy families are alike; each unhappy family is unhappy in its own way." 
- Leo Tolstoy]]></summary></entry><entry><title type="html">Why it’s so hard to feed people</title><link href="https://insilijo.github.io/blog/2026/why-its-hard-to-feed/" rel="alternate" type="text/html" title="Why it’s so hard to feed people"/><published>2026-01-27T12:00:00+00:00</published><updated>2026-01-27T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/why-its-hard-to-feed</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/why-its-hard-to-feed/"><![CDATA[<figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-02/the-grapes-of-wrath-book-cover-480.webp 480w,/assets/img/blogs/2026-02-02/the-grapes-of-wrath-book-cover-800.webp 800w,/assets/img/blogs/2026-02-02/the-grapes-of-wrath-book-cover-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-02/the-grapes-of-wrath-book-cover.jpg" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small"><i>The Grapes of Wrath</i> by John Steinbeck. Cover art by Elmer Hader. 1939.</figcaption> </figure> <p>In 1939, John Steinbeck wrote a haunting passage about the real source of hunger in the Dust Bowl and Great Depression:</p> <blockquote> <p>The works of the roots of the vines, of the trees, must be destroyed to keep up the price, and this is the saddest, bitterest thing of all. Carloads of oranges dumped on the ground. The people came for miles to take the fruit, but this could not be. How would they buy oranges at twenty cents a dozen if they could drive out and pick them up? And men with hoses squirt kerosene on the oranges, and they are angry at the crime, angry at the people who have come to take the fruit. A million people hungry, needing the fruit - and kerosene sprayed over the golden mountains. And the smell of rot fills the country. Burn coffee for fuel in the ships. 
Burn corn to keep warm, it makes a hot fire. Dump potatoes in the rivers and place guards along the banks to keep the hungry people from fishing them out. Slaughter the pigs and bury them, and let the putrescence drip down into the earth. <br/> <br/> There is a crime here that goes beyond denunciation. <br/> <br/> There is a sorrow here that weeping cannot symbolize. <br/> <br/> There is a failure here that topples all our success. <br/> <br/> The fertile earth, the straight tree rows, the sturdy trunks, and the ripe fruit. And children dying of pellagra must die because a profit cannot be taken from an orange. And coroners must fill in the certificate - died of malnutrition - because the food must rot, must be forced to rot. The people come with nets to fish for potatoes in the river, and the guards hold them back; they come in rattling cars to get the dumped oranges, but the kerosene is sprayed. And they stand still and watch the potatoes float by, listen to the screaming pigs being killed in a ditch and covered with quick-lime, watch the mountains of oranges slop down to a putrefying ooze; and in the eyes of the people there is the failure; and in the eyes of the hungry there is a growing wrath. In the souls of the people the grapes of wrath are filling and growing heavy, growing heavy for the vintage.</p> </blockquote> <p>John Steinbeck, <em>The Grapes of Wrath</em></p> <p>What Steinbeck identifies as most disturbing is not the hunger itself or even the diagnosis of a clear villain that leads to it. There is no cartoonish president directly promoting genocide, no subversive group of wealthy individuals advocating for hunger, not even a <em>Phytophthora infestans</em> blighting the crops as it did during the Irish Potato Famine. 
Instead, there’s the quiet and bleak assumption that – sometimes – many have to be sacrificed to keep the machine of industry running, even if it results in obvious inefficiency.</p> <p>What Steinbeck describes is not a historical aberration but a recurring coordination failure: abundance without access. We live in a world of logistical myopia where we record and feed the visibly needy but lose many we could help.</p> <p>Even in Boston — a city with high transit access, strong social programs, and deep institutional wealth — food insecurity remains widespread. 37% of Massachusetts families report facing food insecurity and nearly <a href="https://www.boston.gov/news/mayor-wu-boston-officials-share-local-response-looming-lapse-snap-benefits">20% rely</a> directly on SNAP assistance to make ends meet (<a href="https://www.gbfb.org/news/press-releases/2025-annual-statewide-food-access-report/">see the report</a>).</p> <p>So how can we be so wealthy and so poor at the same time? Because hunger is not primarily a problem of supply, but of context, coordination, and representation.</p> <h2 id="what-do-we-know">What Do We Know?</h2> <figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-02/food-insecurity-neighborhood-480.webp 480w,/assets/img/blogs/2026-02-02/food-insecurity-neighborhood-800.webp 800w,/assets/img/blogs/2026-02-02/food-insecurity-neighborhood-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-02/food-insecurity-neighborhood.png" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">SNAP Access by Neighborhood. City of Boston Food Access: Food Access and Insecurity.</figcaption> </figure> <p>General statistics tell a big part of the story. 
The same report cited above suggests that demographics are the strongest predictor, with 62% of Hispanic households, 46% of Black households, and 56% of LGBTQIA+ households facing insecurity. That kind of segregation persists even in a progressive, history-rich city like Boston.</p> <p>Indeed, neighborhood-level maps support this theory, with communities of color like Roxbury, Dorchester, and Mattapan disproportionately affected.</p> <p>The city of Boston, predominantly through the Greater Boston Food Bank, distributes nearly 90 million healthy meals to 190 cities and towns in Eastern Massachusetts while operating over 600 community-based pantries. Massachusetts, as a whole, distributes over <a href="https://www.cbpp.org/research/food-assistance/snap-state-by-state-data-fact-sheets-and-resources">$2.6 billion</a> in food assistance across the state.</p> <p>Despite this, <a href="https://www.boston.gov/news/mayor-wu-mass-general-brigham-and-ymca-greater-boston-announce-new-cold-storage-infrastructure">Boston’s Office of Food Justice (OFJ), along with the Natural Resources Defense Council (NRDC),</a> estimates that 21% of all of Boston’s waste – 130,000 tons – is food, with an additional 1,100 tons that could be recovered through further interventions.</p> <p>The demand is there, and the supply is there. Markets optimize for price and throughput, not for local constraints. This is the myopia: we cannot use centralized information and solutions to solve local failures.</p> <p>The problem, therefore, isn’t quantity; it’s logistics. Logistics determines who can access help without sacrificing time, income, or dignity.</p> <p>Our typical response to hunger isn’t to meet it where it is; it’s to meet it where it <em>visibly</em> is. 
Boston homelessness is at 2.4% and mostly concentrated in a multi-block radius around a few sites downtown near MGH at “Mass and Cass” or “Methadone Mile”, North Roxbury/the South End, and in Jamaica Plain near Southeast Franklin Park at the Pine Street Inn’s Shattuck Shelter.</p> <figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-02/boston-cost-per-room-480.webp 480w,/assets/img/blogs/2026-02-02/boston-cost-per-room-800.webp 800w,/assets/img/blogs/2026-02-02/boston-cost-per-room-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-02/boston-cost-per-room.png" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Boston cost per room based on 2020 census data. Work my own.</figcaption> </figure> <p>With the exception of Jamaica Plain, food pantry services are significantly concentrated in the denser, visibly homeless areas. This is great; these people need food – especially fresh food – and direct access.</p> <p>However, we’re neglecting a vast number of people who would meet that “food insecure” label and claim SNAP. 
Indeed, the locations with the <em>highest median cost per room</em> – downtown, Back Bay, Fenway, the South End, and South Boston – are precisely where the food pantries are.</p> <h2 id="what-exactly-are-we-doing-here">What Exactly Are We Doing Here?</h2> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-02/food-pantries-neighborhood-480.webp 480w,/assets/img/blogs/2026-02-02/food-pantries-neighborhood-800.webp 800w,/assets/img/blogs/2026-02-02/food-pantries-neighborhood-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-02/food-pantries-neighborhood.png" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Food Pantries by Neighborhood. City of Boston Food Access: Food Access and Insecurity.</figcaption> </figure> <p>We have a functional food delivery apparatus, a lot of food, and a lot of people willing to resolve hunger. But there remains a lot of hunger. We’re still in Steinbeck’s America. We’re still serving a singular, visible context.</p> <p>The practical solution is to centralize: use tax dollars and volunteer labor to aggregate resources in a utilitarian function. In a smaller city like this, if we serve downtown access hubs that are near the supply sites, the needy, and the transportation hubs, we’re doing the most with what we have, right?</p> <p>Unfortunately, the people most consistently missed by the emergency food system are not the unhoused, but the “working poor”. These people – parents and blue-collar workers with jobs, schedules, and constraints; in other words, folks doing it “the right way” – are too far from food pantries and too busy to get there. 
They’re stuck in a twilight zone of doing everything right but losing access as a result.</p> <p>Our current methods of resolving hunger are well-meaning and well-directed but miss a majority of the people we could serve. This is the disconnect, and we can do something about it.</p> <p>First of all, the context we have isn’t sufficient. We fall prey to the same modes of decision-making that I’ve discussed before: <a href="/blog/2026/scientific-babylon/">many people who need to work together</a>, <a href="/blog/2026/catalog-of-catalogs/">speak different professional languages</a>, and <a href="/blog/2026/divergence-and-failure/">repeatedly fail to translate between them</a>. Even records like the USDA’s food desert map are predicated on indirect measures of access – distance, vehicle access, and income – and miss local resolution: modes of transport, cost of living, and other context.</p> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-02-02/water_delivery_2x-480.webp 480w,/assets/img/blogs/2026-02-02/water_delivery_2x-800.webp 800w,/assets/img/blogs/2026-02-02/water_delivery_2x-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-02-02/water_delivery_2x.png" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Randall Munroe. https://xkcd.com/1599/. When I was a kid, I asked my parents why our houses didn't have toothpaste pipes in addition to water ones. I'm strangely pleased to see Amazon thinking the same way.</figcaption> </figure> <p>Hunger persists not because food is scarce, but because context is.</p> <p>In this case, we have people with money trying to find the right way to leverage it: buying food and labor from those who have it, then directing it to the communities where it will have the greatest impact. 
But if you ask a politician, they’re looking at the figures above or listening to disgruntled community members driving by Mass and Cass. If you’re the Greater Boston Food Bank or the YMCA or one of the many other wonderful organizations, you’re trying to balance volunteering, minimal monetary resources, and centralized logistics to maximize output. If you’re a grocery store, farmers’ market, or food importer, you’re trying to maximize revenue but also minimize waste. If you’re a community advocate, you see your neighbors and can say – directly – to whom the food should go. If you’re a scientist, you consume these data indirectly, trying to assemble questions from fragments rather than a unified record.</p> <p>The only reason I can see it is that I’ve been embedded with a good number of these experts for over three years and have been working in food justice and hunger prevention for over 10. There are a lot of experts with one goal, but it’s hard to compile this much context into a coherent, actionable framework without losing part of the story. To the outside observer, it appears to be a typical supply-constrained environment rather than an unrecorded and unactioned narrative. This is not a moral failure. It is a representational one, and representation determines who is seen, served, and remembered.</p> <p>This narrative is forgotten. These people are forgotten. We’ve found complacency and lost revision, the ability to update systems when reality shifts or is incompletely satisfied. We collect data, but acting on it is either decentralized or divorced from local needs. Local groups do really good work and cities are actively pursuing zero waste and food distribution initiatives. But the connective tissue – acting concurrently at the county and neighborhood levels – is missing, and so the working poor are forgotten. 
Hunger persists not because food and goodwill are scarce, but because distribution requires local context that centralized systems are structurally bad at representing.</p> <p>(I look forward to discussing solutions in future posts.)</p> <p>Maps taken from https://storymaps.arcgis.com/stories/956debdf80c0492bbceeedff9f6a4bac.</p>]]></content><author><name></name></author><category term="charity"/><category term="giving"/><category term="nonprofits"/><category term="context"/><category term="data"/><category term="ontologies"/><category term="food-justice"/><category term="food-waste"/><category term="logistics"/><summary type="html"><![CDATA[The logistical and contextual challenge of reducing food waste and giving it to our neediest]]></summary></entry><entry><title type="html">The Catalog of Catalogs</title><link href="https://insilijo.github.io/blog/2026/catalog-of-catalogs/" rel="alternate" type="text/html" title="The Catalog of Catalogs"/><published>2026-01-20T12:00:00+00:00</published><updated>2026-01-20T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/catalog-of-catalogs</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/catalog-of-catalogs/"><![CDATA[<p>“This much is known: for every rational line or forthright statement there are leagues of senseless cacophony, verbal nonsense, and incoherency.” — Jorge Luis Borges, The Library of Babel</p> <p>In 1941, Borges imagined a universe in which the limitation of knowledge was not scarcity but excess. Every possible book existed, yet nothing was meaningfully accessible. 
In this world, the librarian’s hope rests on a mythical object: a <em>Catalog of Catalogs</em>, a structure that might impose order on overwhelming noise, and the Book-Man, who has studied it.</p> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-01-20/Desmazieres-biblio-alphabet1-sm.webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-01-20/Desmazieres-biblio-alphabet1-sm.webp" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">Erik Desmazieres, 1997, <i>La Bibliothèque de Babel</i> (above and thumbnail) for Le Livre Contemporain and Les Bibliophiles Franco-Suisses.</figcaption> </figure> <p>In 2026, modern scientific organizations increasingly resemble Borges’ library. Data accumulates faster than meaning can be maintained, and organization becomes the difference between insight and noise. We’re not faced with a lack of quantitative or qualitative metrics; we’re faced with a lack of coherence.</p> <p><a href="/blog/2026/scientific-babylon/">Last week, I wrote about how the crucial endeavor in science is getting people with fundamentally different areas of expertise to talk to each other</a>. I focused a lot on ontologies as a mechanism of that record, and how both subject matter and translational governance are underestimated infrastructure. Borges’ allegory of the <em>Catalog of Catalogs</em> and the Book-Man are particularly prescient here; it’s not knowledge being sought but the demand for structured knowledge. We’re all looking for a template by which to organize the data and someone who can make sense of it and therefore lead us into a kind of distillation where the infinite becomes comprehensible to the finite.</p> <p>This structured knowledge doesn’t exist by accident. 
Instead, it’s a very active process of structure, focus, and then orchestration: a “what”, a “why”, and a “how”.</p> <h3 id="structure">Structure</h3> <p>The first concern is that data be reasonably segmented and relational. From a practical standpoint, data must be Findable, Accessible, Interoperable, and Reusable (FAIR). If not, data tend to be labyrinthine, bottlenecked, or plainly incoherent.</p> <p>We’ve all lived in organizations where we’ve had to wade through years of previous colleagues’ Excel files, sandbox analyses, hopes, and dreams. Correctly structured data maintain relevance and a paper trail for a broad swath of the organization without unnecessary bureaucracy.</p> <p>A lot of focus is put on the first two elements (findable and accessible), but interoperable and reusable data are equally important. It’s crucial that data maintain relation to other data while functioning broadly across contexts. Structure provides the “what” of coherence.</p> <h3 id="focus">Focus</h3> <p>Structure frames and orients knowledge, but it does not direct it. Most organizations do not lack data; they lack agreement about which questions matter. Focus provides the bridge between what is technically possible and what is operationally relevant.</p> <p>In the absence of focus, data systems become encyclopedic rather than instrumental: impressive in scope, but disconnected from decision-making. Everything is recorded, nothing is prioritized, and insight is deferred indefinitely.</p> <p>Focus requires constraint. It means explicitly defining:</p> <ul> <li>which problems are in scope,</li> <li>which audiences the data serve,</li> <li>and which uses are not supported.</li> </ul> <p>Focus is an acknowledgment that not all pursuits are equally valuable and that excluding irrelevance is a prerequisite for action. 
Focus is where knowledge ceases being merely reflective and starts becoming useful.</p> <h3 id="orchestration">Orchestration</h3> <p>Where structured, focused data enable valuable prototypes, orchestration is what allows those systems to scale. It is no accident that modern computational frameworks (whether data architecture, DevOps, or agentic systems) borrow musical terminology: orchestration is the work of getting multiple competent functions to operate coherently through change. The way we use our data must mirror what we expect from it.</p> <p>Orchestration requires stewardship and governance. It is the ongoing maintenance of alignment between structure, focus, and reality as systems grow. Initial models of data flow are necessary, but insufficient; organizations must adapt as methods change, incentives shift, and markets evolve. That adaptability does not emerge spontaneously. It requires deliberate ownership, revision authority, and sustained care.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-01-20/man_of_action-480.webp 480w,/assets/img/blogs/2026-01-20/man_of_action-800.webp 800w,/assets/img/blogs/2026-01-20/man_of_action-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-01-20/man_of_action.gif" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small"><i>Calvin and Hobbes: Man of Action</i> by Bill Watterson. September 21, 1993</figcaption> </figure> <p>When Borges wrote about his library, he did so appealing to the philosophical arrangement of knowledge and how we make sense of the world. 
In this perfect and infinite arrangement of all possible things that could be described, his best case was still the pursuit of a digested, comprehensible version of it.</p> <p>In scientific systems, we are forced to take pragmatic approaches to our expanding means of measuring and describing the world. Whether we’re collecting and making decisions in the current AI revolution, the “Big Data” boom of the last decade, or Borges’ 1941, we’re still restricted by the same conditions.</p> <p>No amount of technological advancement can replace our core need for coherence, not completeness, in data. The limiting factor has never been measurement, storage, or computation. It has always been our willingness to decide what matters, who decides it, and who is accountable when meaning breaks.</p>]]></content><author><name></name></author><category term="software"/><category term="science"/><category term="data"/><category term="ontologies"/><category term="standards"/><category term="omics"/><category term="multiomics"/><category term="coherence"/><category term="strategy"/><category term="governance"/><summary type="html"><![CDATA[Coherence, not completeness, is the limiting factor of data in modern scientific organizations.]]></summary></entry><entry><title type="html">Scientific Babylon</title><link href="https://insilijo.github.io/blog/2026/scientific-babylon/" rel="alternate" type="text/html" title="Scientific Babylon"/><published>2026-01-10T12:00:00+00:00</published><updated>2026-01-10T12:00:00+00:00</updated><id>https://insilijo.github.io/blog/2026/scientific-babylon</id><content type="html" xml:base="https://insilijo.github.io/blog/2026/scientific-babylon/"><![CDATA[<p>In accounts drawn from the Bible, the Torah, and later commentary, the Tower of Babel is built to prevent a second flood, only to result in the fragmentation of human language. 
Variations on the same theme appear across Greek, Estonian, Sumerian, and Aztec traditions: a once-unified humanity loses a shared understanding of the world through linguistic division. In these stories, collective power gives way to confusion: not through catastrophe, but through meaning itself.</p> <figure class="inline-figure inline-right"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-01-01/tower_of_babylon-480.webp 480w,/assets/img/blogs/2026-01-01/tower_of_babylon-800.webp 800w,/assets/img/blogs/2026-01-01/tower_of_babylon-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-01-01/tower_of_babylon.jpg" width="100%" height="auto" alt="The Tower of Babel By Pieter Brueghel the Elder - Levels adjusted from File:Pieter_Bruegel_the_Elder_-_The_Tower_of_Babel_(Vienna)_-_Google_Art_Project.jpg, originally from Google Art Project., Public Domain, https://commons.wikimedia.org/w/index.php?curid=22179117" title="The Tower of Babel By Pieter Brueghel the Elder - Levels adjusted from File:Pieter_Bruegel_the_Elder_-_The_Tower_of_Babel_(Vienna)_-_Google_Art_Project.jpg, originally from Google Art Project., Public Domain, https://commons.wikimedia.org/w/index.php?curid=22179117" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">The Tower of Babel By Pieter Brueghel the Elder - Levels adjusted from File:Pieter_Bruegel_the_Elder_-_The_Tower_of_Babel_(Vienna)_-_Google_Art_Project.jpg, originally from Google Art Project., Public Domain, https://commons.wikimedia.org/w/index.php?curid=22179117</figcaption> </figure> <p>Modern linguistics offers a less mythic explanation. Languages diverge naturally, shaped by ecology, geography, isolation, and social structure. Language is not merely a labeling system; it is a lens for interpreting a complex world. 
As perspectives diverge, so too must the structures used to describe them.</p> <p>Science faces a related, but sharper, problem. We attempt to describe systems that are not only complex, but mostly unobservable, probabilistic, and dynamic. The challenge extends past measurement and into representation: data are meaningless without a structured, shared, and preserved system. That system must follow the same principles as the data themselves – Findable, Accessible, Interoperable, and Reusable (FAIR) – because its value lies in how it interfaces with colleagues and other data.</p> <p>Ontologies attempt to resolve this fragmentation by enforcing shared meaning. Here, ontology is used in its applied sense: not as a claim about what exists, but as a practical expression of how knowledge is organized and exchanged. In practice, they often expose the cost of assuming meaning can be fixed at all. Applying an alternative standard to these processes – instead of resulting in meaning – adds another standard on top of the dozens already there. We confront our own Tower of Babel, then: we find ourselves playing a massive, expensive game of telephone where meaning is exchanged, lost, and mutated between experts.</p> <p>In metabolomics alone, biological, chemical, and analytical vocabularies coexist; each developed in different ecologies, each optimized for a different audience, and each only partially compatible with the others.</p> <p>These ontologies operate <strong>long before</strong> interpretation is even possible. Moreover, each individual discipline involved is nuanced, requiring years of experience or education to reach foundational understanding. These ontologies are the connective tissue that flattens these disciplinary nuances. For example, metabolomics data flows from Liquid Chromatography/Mass Spectrometry (LC/MS) through reference spectra (ideally developed on the same machine and method) and finally to a format that’s accessible to a biochemist (typically a table). 
Even in this simplified process, it’s clear that there’s a substantial amount of effort involved in developing the method, creating the infrastructure around managing/storing the data, selecting/synthesizing reference compounds, analyzing data, and engineering it for interpretation.</p> <p>Even getting to the point where we’re comfortable analyzing the data requires a tremendous amount of effort and coordination. Moreover, no scientist has the capacity or time to verify the quality of every upstream step, so each relies on the data originator to deliver precise, accurate, and relevant results. It’s crucial, for both internal and external purposes, to maintain a paper trail that makes it clear how each step integrates into the useful data. Otherwise, we’re left with a mess of isolated numbers.</p> <figure class="inline-figure inline-left"> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blogs/2026-01-01/xkcd_927_standards-480.webp 480w,/assets/img/blogs/2026-01-01/xkcd_927_standards-800.webp 800w,/assets/img/blogs/2026-01-01/xkcd_927_standards-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blogs/2026-01-01/xkcd_927_standards.png" width="360" height="auto" alt="xkcd 927 Standards comic" title="xkcd 927 Standards comic" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption small">xkcd 927 &quot;Standards&quot; by Randall Munroe, CC BY-NC 2.5, https://xkcd.com/927/</figcaption> </figure> <p>This isn’t just a factor in metabolomics. In fact, this specific analytical example can be applied directly to proteomics with a few small tweaks. Genomics, too, can claim to have done an excellent job of exploiting integrated analytics and ontologies to create reliable, consistent pipelines that are broadly interpretable and trustworthy. 
What we’re trying to do here – use extremely expensive, sensitive instruments on extremely expensive, sensitive biological samples – is difficult and important. The first step is being able to adequately describe what’s going on to other people.</p> <p>Ontologies are, at their most useful, a representation of shared purpose among diverse methods, all integrating into a cohesive representation of an intractable phenomenon. Getting to that point requires often-silent, detailed labor from a large group of people to map out and maintain a reliable pipeline. However, ontologies can fall prone to overly calcified standards, resulting in a labyrinthine, branched set of systems. These processes must reflect their application while remaining integrated with their partners. To do this, ontologies are most effective when they constrain interpretation without attempting to freeze meaning. In domains that evolve as quickly as medicine, chemistry, tech, and biology, representation must remain thoughtfully provisional without becoming unstable.</p> <p>What looks like a problem of standards is often a problem of governance. The failure mode here is rarely technical. It is organizational. Different groups optimize for different incentives, audiences, or realities – speed, precision, publication, novelty, regulatory defensibility, commercial relevance – and ontologies become the battleground where those incentives collide. Standardization does not remove ambiguity; it decides who bears the cost of resolving it.</p> <p>In practice, governance does not mean tighter standards or different hiring practices. It means deciding who can revise definitions, how translation is handled, and where ambiguity is tolerated. 
Meaning will change regardless; the real question is who is accountable for managing that change over time.</p>]]></content><author><name></name></author><category term="software"/><category term="science"/><category term="ontologies"/><category term="standards"/><category term="omics"/><category term="multiomics"/><summary type="html"><![CDATA[An essay on how meaning, incentives, and governance shape scientific systems across teams and institutions.]]></summary></entry></feed>