Cross-Referencing Supplement Identifiers: PubChem, UNII, and Why It Matters for Interoperability

Supplements do not have a universal identifier. There is no ISBN for creatine. No ISIN for vitamin D3. No single authority that assigns a globally unique code to every dietary supplement compound and guarantees that code is used consistently across databases, regulatory filings, research publications, and commercial product catalogs.

This creates a concrete problem for any health-tech platform that integrates data from multiple sources. Your clinical database calls it "Coenzyme Q10." Your supplier catalog calls it "CoQ10." PubChem identifies it as CID 5281915. The FDA's UNII registry assigns it the code "EJ27X76M46." They are all the same compound. Your system needs to know that.

Identifier mapping is the infrastructure that solves this problem. This article covers the major supplement identifier systems, why cross-referencing between them matters, and how to implement identifier resolution in a production system.

The identifier landscape

PubChem

PubChem is the National Library of Medicine's open chemistry database. It assigns a Compound ID (CID) to each unique chemical structure. PubChem covers most supplement compounds because they are defined chemical entities, but it does not cover proprietary blends, branded formulations, or multi-compound botanical extracts where the "active ingredient" is not a single molecule.

Strengths: Broad chemical coverage, stable identifiers, rich structural data, free and open.

Limitations: No concept of "supplement" as a category. A PubChem CID identifies a molecule, not a health product. Botanical extracts with multiple active compounds (like ashwagandha) may have multiple CIDs or none that captures the full extract profile.

UNII (FDA)

The Unique Ingredient Identifier (UNII) is issued through the FDA's Substance Registration System. Substances in FDA-regulated products, including drug and supplement ingredients, are registered there and receive a UNII code. Unlike PubChem, UNII explicitly covers botanicals, allergens, and complex mixtures because the FDA needs to identify them in regulatory submissions.

Strengths: Covers botanicals and complex mixtures. Used in FDA regulatory filings. Stable and authoritative for U.S. regulatory contexts.

Limitations: Narrower international adoption. Some UNII codes map to categories ("Ashwagandha root extract") rather than specific standardized preparations, which can create ambiguity when comparing products with different extraction methods.

CAS Registry

Chemical Abstracts Service assigns CAS Registry Numbers to chemical substances. CAS numbers are widely used in manufacturing and material safety contexts. Most supplement compounds have CAS numbers, but the registry is not freely accessible, which limits its utility for consumer-facing applications.

Proprietary identifiers

Every supplement database, e-commerce platform, and clinical system has its own internal identifier scheme. Amazon has ASINs. Your EHR vendor has internal codes. Your recommendation engine has entity IDs. These identifiers are necessary for internal operations but meaningless outside their originating system.

Why cross-referencing matters

Data integration

The most immediate use case is joining data from multiple sources. Your clinical reference database provides interaction data keyed by UNII. Your product catalog uses internal SKU codes. Your evidence graph uses its own entity identifiers. Without a cross-reference table, these datasets cannot be joined.

A cross-reference table maps between identifier systems:

| Entity | Internal ID | PubChem CID | UNII | CAS |
|---|---|---|---|---|
| Creatine Monohydrate | creatine | 586 | MU72812GK0 | 6020-87-7 |
| Coenzyme Q10 | coq10 | 5281915 | EJ27X76M46 | 303-98-0 |
| Ashwagandha | ashwagandha | none | 5R26Y48U3J | none |

With this table, a query to PubChem for CID 586 can be resolved to your internal entity "creatine," which can then be used to fetch evidence data, interaction flags, and product recommendations from your own system.
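
As a minimal sketch of that resolution step, the rows of the table can be held in memory and indexed by external code. The field names (`pubchem_cid`, `unii`, `cas`) and the `to_internal` helper are illustrative assumptions, not part of any particular library.

```python
from typing import Optional

# Rows copied from the cross-reference table; None marks a missing mapping.
XREF = [
    {"internal_id": "creatine", "pubchem_cid": "586",
     "unii": "MU72812GK0", "cas": "6020-87-7"},
    {"internal_id": "coq10", "pubchem_cid": "5281915",
     "unii": "EJ27X76M46", "cas": "303-98-0"},
    {"internal_id": "ashwagandha", "pubchem_cid": None,
     "unii": "5R26Y48U3J", "cas": None},
]

# Reverse index: (system, external code) -> internal ID.
BY_EXTERNAL = {
    (system, row[system]): row["internal_id"]
    for row in XREF
    for system in ("pubchem_cid", "unii", "cas")
    if row[system] is not None
}


def to_internal(system: str, code: str) -> Optional[str]:
    """Resolve an external code to the internal entity ID, or None if unmapped."""
    return BY_EXTERNAL.get((system, code))


print(to_internal("pubchem_cid", "586"))
print(to_internal("unii", "5R26Y48U3J"))
```

Codes are stored as strings here so that UNII, CAS, and CID values can share one index without type juggling.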

Research citation linkage

Published studies reference compounds by name, by CAS number, or by PubChem CID. When your system indexes a new study, it needs to resolve the compound reference to your internal entity. Without identifier mapping, this resolution requires fuzzy string matching on compound names, which is error-prone. "Magnesium L-threonate," "MgT," "Magtein" (a brand name), and "magnesium threonate" all refer to the same compound. Identifier mapping handles this by recording each name as an alias and resolving all four to the same canonical entity.
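
One way to sketch the name side of this is an alias table with light normalization in front of it. The canonical ID `magnesium_l_threonate` and the helper names below are hypothetical; a production system would likely load aliases from the identifier map rather than hard-code them.

```python
import re
from typing import Optional

# Hypothetical canonical ID; the four aliases are the names discussed above.
ALIASES = {
    "magnesium l-threonate": "magnesium_l_threonate",
    "mgt": "magnesium_l_threonate",
    "magtein": "magnesium_l_threonate",
    "magnesium threonate": "magnesium_l_threonate",
}


def normalize(name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def resolve_alias(name: str) -> Optional[str]:
    """Exact alias lookup after normalization; returns None rather than guessing."""
    return ALIASES.get(normalize(name))


for raw in ["Magnesium L-threonate", "MgT", "Magtein", "magnesium  threonate"]:
    print(raw, "->", resolve_alias(raw))
```

Returning `None` for unknown names, instead of falling back to fuzzy matching, keeps unresolved references visible so they can be reviewed and added as aliases.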

Regulatory alignment

If your product operates in a regulated context (clinical decision support, supplement manufacturing, FDA submissions), you need to reference supplements using the identifier systems that regulators expect. FDA submissions use UNII codes. European filings may use different identifiers. Identifier mapping lets your internal system use whatever identifiers are convenient while producing regulatory-compliant outputs.
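
A small sketch of that boundary translation, assuming an internal-ID-to-UNII map built from the codes used in this article. The function and exception names are illustrative. Failing loudly on a missing mapping is a design choice here, on the assumption that a regulatory export should never silently omit an ingredient.

```python
from typing import Dict, List

# Internal ID -> UNII, using the codes from the cross-reference table in this article.
UNII_BY_INTERNAL: Dict[str, str] = {
    "creatine": "MU72812GK0",
    "coq10": "EJ27X76M46",
    "ashwagandha": "5R26Y48U3J",
}


class MissingIdentifierError(LookupError):
    """Raised when an ingredient has no mapping in the required system."""


def export_with_unii(internal_ids: List[str]) -> List[Dict[str, str]]:
    """Translate internal IDs to UNII codes for an outbound document."""
    missing = [i for i in internal_ids if i not in UNII_BY_INTERNAL]
    if missing:
        raise MissingIdentifierError(f"no UNII mapped for: {', '.join(missing)}")
    return [{"internal_id": i, "unii": UNII_BY_INTERNAL[i]} for i in internal_ids]


print(export_with_unii(["creatine", "coq10"]))
```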

Deduplication

When ingesting data from multiple sources, identifier mapping is the most reliable deduplication mechanism. Two databases might both contain entries for "vitamin D" but with different names, different categorizations, and different metadata. If both entries map to the same UNII code, they are the same entity and should be merged rather than stored as duplicates.
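
The merge rule can be sketched as grouping incoming records by UNII. The record shape (`name`, `source`, `unii`) and the source labels are assumptions for illustration; the example reuses the CoQ10 code from earlier in this article.

```python
from typing import Dict, List, Tuple


def merge_by_unii(records: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Group records that share a UNII; records without one are returned separately."""
    merged: Dict[str, dict] = {}
    unmatched: List[dict] = []
    for rec in records:
        unii = rec.get("unii")
        if not unii:
            unmatched.append(rec)  # cannot be deduplicated by identifier
            continue
        entry = merged.setdefault(unii, {"unii": unii, "names": [], "sources": []})
        if rec["name"] not in entry["names"]:
            entry["names"].append(rec["name"])
        entry["sources"].append(rec["source"])
    return list(merged.values()), unmatched


incoming = [
    {"name": "Coenzyme Q10", "source": "clinical_db", "unii": "EJ27X76M46"},
    {"name": "CoQ10", "source": "supplier_catalog", "unii": "EJ27X76M46"},
    {"name": "Mystery Blend", "source": "supplier_catalog", "unii": None},
]
entities, needs_review = merge_by_unii(incoming)
print(entities)
print(needs_review)
```

Records with no UNII are set aside rather than merged by name, since name-based merging is exactly the error-prone path identifier mapping is meant to avoid.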

Implementation patterns

Pattern 1: Canonical entity with identifier map

The cleanest architecture designates one identifier as canonical (your internal entity ID) and maintains a mapping table to all external systems. All internal references use the canonical ID. External identifiers are resolved to canonical IDs at the system boundary.

External query (PubChem CID 586)
  → Resolve to canonical ID ("creatine")
  → Fetch from internal data store using canonical ID
  → Return response

The identifier map is a simple table with three columns: `(canonical_id, system, external_code)`. Lookups in both directions (canonical → external, external → canonical) should be indexed for performance.
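
A sketch of that three-column table using Python's built-in `sqlite3` module, with one index per lookup direction. The table and index names are assumptions, and the primary key assumes each external code maps to exactly one canonical entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE identifier_map (
    canonical_id  TEXT NOT NULL,
    system        TEXT NOT NULL,
    external_code TEXT NOT NULL,
    PRIMARY KEY (system, external_code)        -- indexes external -> canonical
);
CREATE INDEX idx_map_canonical
    ON identifier_map (canonical_id, system);  -- indexes canonical -> external
""")
conn.executemany(
    "INSERT INTO identifier_map VALUES (?, ?, ?)",
    [
        ("creatine", "pubchem", "586"),
        ("creatine", "unii", "MU72812GK0"),
        ("coq10", "pubchem", "5281915"),
        ("coq10", "unii", "EJ27X76M46"),
    ],
)


def external_to_canonical(system, code):
    row = conn.execute(
        "SELECT canonical_id FROM identifier_map WHERE system = ? AND external_code = ?",
        (system, code),
    ).fetchone()
    return row[0] if row else None


def canonical_to_external(canonical_id):
    rows = conn.execute(
        "SELECT system, external_code FROM identifier_map WHERE canonical_id = ?",
        (canonical_id,),
    ).fetchall()
    return dict(rows)  # assumes at most one code per system for this entity


print(external_to_canonical("pubchem", "586"))
print(canonical_to_external("coq10"))
```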

Pattern 2: Bidirectional lookup service

For systems that need to resolve identifiers dynamically — for example, a content ingestion pipeline that encounters a PubChem CID in a study abstract and needs to find the corresponding internal entity — a lookup service provides real-time resolution.

The lookup service accepts a system name and an external code, and returns the canonical entity (or a "not found" response if the mapping does not exist). This is essential for automated data ingestion pipelines where new external references appear continuously.
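
The contract described here might look like the following in-process sketch. `LookupService`, `LookupResult`, and the `unresolved` queue are hypothetical names; a real service would sit behind an API and a persistent store.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LookupResult:
    found: bool
    canonical_id: Optional[str] = None


class LookupService:
    def __init__(self, mappings: Dict[Tuple[str, str], str]):
        self._mappings = dict(mappings)
        # References that failed to resolve, kept for later review.
        self.unresolved: List[Tuple[str, str]] = []

    def lookup(self, system: str, code: str) -> LookupResult:
        """External code -> canonical entity, or an explicit not-found result."""
        key = (system.strip().lower(), code.strip())
        canonical = self._mappings.get(key)
        if canonical is None:
            self.unresolved.append(key)
            return LookupResult(found=False)
        return LookupResult(found=True, canonical_id=canonical)

    def codes_for(self, canonical_id: str) -> Dict[str, str]:
        """Canonical entity -> all known external codes, keyed by system."""
        return {s: c for (s, c), cid in self._mappings.items() if cid == canonical_id}


service = LookupService({
    ("pubchem", "586"): "creatine",
    ("unii", "MU72812GK0"): "creatine",
})
print(service.lookup("PubChem", "586"))
print(service.lookup("pubchem", "123456789"))
print(service.unresolved)
```

Recording misses instead of discarding them gives an ingestion pipeline a worklist of references that still need a mapping.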

Pattern 3: Federated resolution

For platforms that integrate deeply with multiple external databases, federated resolution queries the external systems themselves to find mappings. PubChem's API can resolve a compound name to a CID. The FDA's API can resolve a compound to a UNII. Your system chains these resolutions to build and maintain its identifier map automatically.

This pattern is powerful but requires careful rate limiting and caching, since external APIs have their own quotas and latency characteristics.
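
A minimal sketch of the caching and rate-limiting wrapper, assuming the actual network call is injected as a `fetch` function. In a real deployment `fetch` would wrap a request to the external API; here a stub stands in so the example runs offline. The class name and the 0.25-second default interval are assumptions, not documented quotas.

```python
import time
from typing import Callable, Dict, Optional


class CachedResolver:
    """Wraps a remote name -> code resolver with a cache and a minimum call interval."""

    def __init__(self, fetch: Callable[[str], Optional[str]],
                 min_interval_s: float = 0.25):
        self._fetch = fetch
        self._min_interval = min_interval_s
        self._cache: Dict[str, Optional[str]] = {}
        self._last_call: Optional[float] = None

    def resolve(self, name: str) -> Optional[str]:
        key = name.strip().lower()
        if key in self._cache:          # misses (None) are cached too
            return self._cache[key]
        if self._last_call is not None:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)        # stay under the external quota
        self._last_call = time.monotonic()
        result = self._fetch(key)
        self._cache[key] = result
        return result


# Stub standing in for a remote name -> CID lookup.
remote_calls = []

def stub_fetch(name: str) -> Optional[str]:
    remote_calls.append(name)
    return {"creatine": "586"}.get(name)


resolver = CachedResolver(stub_fetch, min_interval_s=0.0)
print(resolver.resolve("Creatine"))
print(resolver.resolve("creatine"))
print(remote_calls)
```

Caching negative results avoids re-querying the external API for names it has already failed to resolve; a production version would likely give those entries an expiry.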

Edge cases and challenges

Botanicals and extracts

Single-molecule supplements (creatine, caffeine, vitamin D3) map cleanly to PubChem CIDs. Botanical extracts (ashwagandha, turmeric, ginkgo) are more complex because the "supplement" is a mixture of compounds. PubChem may have CIDs for individual constituents (withanolides for ashwagandha, curcuminoids for turmeric) but not for the whole extract.

The practical solution is to maintain identifier mappings at the extract level using UNII (which handles complex mixtures) and supplement the mapping with constituent CIDs where available.
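
One possible entity shape for this, sketched with dataclasses. The class and field names are assumptions, and the constituent CID is deliberately left as `None` rather than filled with an unverified value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    name: str
    pubchem_cid: Optional[str] = None  # None until a CID has been verified


@dataclass
class SupplementEntity:
    canonical_id: str
    unii: Optional[str]
    pubchem_cid: Optional[str] = None  # set only for single-molecule supplements
    constituents: List[Constituent] = field(default_factory=list)

    @property
    def is_mixture(self) -> bool:
        return self.pubchem_cid is None and bool(self.constituents)

    def related_cids(self) -> List[str]:
        """CIDs associated with this entity: its own, plus any mapped constituents."""
        cids = [self.pubchem_cid] if self.pubchem_cid else []
        cids += [c.pubchem_cid for c in self.constituents if c.pubchem_cid]
        return cids


creatine = SupplementEntity("creatine", unii="MU72812GK0", pubchem_cid="586")
ashwagandha = SupplementEntity(
    "ashwagandha",
    unii="5R26Y48U3J",
    constituents=[Constituent("withanolides")],  # CID intentionally unmapped here
)
print(creatine.related_cids(), creatine.is_mixture)
print(ashwagandha.related_cids(), ashwagandha.is_mixture)
```

Constituent CIDs are kept as related links rather than identity mappings: a study on a single withanolide is relevant to ashwagandha, but the CID should not resolve back to the whole extract as if they were the same thing.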

Name ambiguity

"Vitamin E" can refer to alpha-tocopherol, a mixed tocopherol blend, tocotrienols, or a combination. Each has different PubChem CIDs and different evidence profiles. Your identifier mapping needs to handle these distinctions, which means your internal entity model needs to be specific enough to distinguish between them.

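A sketch of how a resolver might surface this ambiguity instead of silently picking one form. The internal entity IDs and the three-state result are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical internal entity IDs; "vitamin e" intentionally maps to several.
NAME_CANDIDATES: Dict[str, List[str]] = {
    "vitamin e": ["alpha_tocopherol", "mixed_tocopherols", "tocotrienols"],
    "alpha-tocopherol": ["alpha_tocopherol"],
}


def resolve_with_ambiguity(name: str) -> Tuple[str, List[str]]:
    """Return ("resolved" | "ambiguous" | "unknown", candidate entity IDs)."""
    candidates = NAME_CANDIDATES.get(name.strip().lower(), [])
    if len(candidates) == 1:
        return "resolved", candidates
    if candidates:
        return "ambiguous", candidates  # caller must disambiguate from context
    return "unknown", []


print(resolve_with_ambiguity("Vitamin E"))
print(resolve_with_ambiguity("alpha-tocopherol"))
```

An "ambiguous" result lets the caller decide what to do, for example by checking the study abstract or product label for the specific form.
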
Evolving registries

PubChem CIDs are stable, but new compounds are added and existing records are occasionally deprecated or merged. UNII codes are generally stable, but the FDA does update them. Your identifier mapping should include a last-verified timestamp and a periodic validation job that confirms external identifiers still resolve correctly.
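
The validation job could be sketched as follows. The `still_resolves` callback stands in for a real check against the external registry, and the 90-day window, field names, and dates are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, List


def find_stale(mappings: List[dict], now: datetime, max_age: timedelta) -> List[dict]:
    """Mappings whose last verification is older than max_age."""
    return [m for m in mappings if now - m["last_verified"] > max_age]


def revalidate(mappings: List[dict],
               still_resolves: Callable[[str, str], bool],
               now: datetime,
               max_age: timedelta = timedelta(days=90)) -> List[dict]:
    """Re-check stale mappings; refresh the timestamp or report the failure."""
    failures = []
    for m in find_stale(mappings, now, max_age):
        if still_resolves(m["system"], m["external_code"]):
            m["last_verified"] = now
        else:
            failures.append(m)  # left untouched for manual review
    return failures


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
mappings = [
    {"canonical_id": "creatine", "system": "pubchem", "external_code": "586",
     "last_verified": now - timedelta(days=200)},
    {"canonical_id": "coq10", "system": "unii", "external_code": "EJ27X76M46",
     "last_verified": now - timedelta(days=10)},
]
# Stub check that treats every code as still valid.
failures = revalidate(mappings, lambda system, code: True, now)
print(failures)
print(mappings[0]["last_verified"])
```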

The interoperability payoff

Identifier mapping is not a glamorous feature. It is plumbing. But it is the plumbing that determines whether your system can integrate with clinical databases, ingest published research automatically, produce regulatory-compliant outputs, and deduplicate data across sources.

Health-tech platforms that invest in identifier infrastructure early find that every subsequent integration (a new data source, a new regulatory requirement, a new partner API, or an evidence graph) is much easier because the resolution layer already exists. Platforms that skip it find themselves building ad-hoc string-matching hacks for every new integration, each one slightly different and slightly broken.

The Unfair Library API provides bidirectional identifier mapping across PubChem, UNII, and other systems for all 271 supplements in the dataset. The `/identifiers` endpoint resolves internal entities to external codes, and the `/lookup` endpoint resolves external codes back to internal entities. Explore the API docs or contact us to discuss your interoperability requirements.