ClinVar Variant Pathogenicity Lookup in Python — Programmatic Access for Hereditary Disease Screening (2026)
Step-by-step Python tutorial: query NCBI ClinVar via E-utilities API to get pathogenicity classification (Pathogenic, Likely Benign, VUS) for any SNP or variant. Useful for batch-screening 23andMe / WGS data against BRCA, Lynch syndrome, hereditary cancer panels, and any clinically annotated variant set.
The Workflow This Tutorial Replaces
You have 600,000 SNPs from 23andMe (or 5 million variants from WGS). You want to know which of them have known clinical significance — Pathogenic, Likely Benign, Variant of Uncertain Significance (VUS), etc. Manually entering each rsid into the ClinVar web search is not the answer.
ClinVar (NCBI's free database of clinical variant interpretations) has a programmatic API via the standard NCBI E-utilities. This tutorial shows the Python code to look up any variant — by rsid, by gene+HGVS, by genomic coordinates — and get back ACMG classification, condition associations, supporting evidence, and submitter information.
It's the practical complement to the gene-specific guides on this site (BRCA hereditary cancer, PGx guide, CYP2D6 from 23andMe, CYP2C19 for Plavix). Each of those handles one gene; this one is the general engine for any variant.
ClinVar Quick Background
- What it is: NCBI's centralized, curated database of clinical interpretations of human genetic variants
- Who submits: clinical labs (Invitae, Ambry, ARUP, etc.), academic institutions, single submitters
- Size: 2 million+ variants with classifications, 30,000+ genes covered
- Updated: website and FTP bulk files refresh weekly; an archived release is published monthly
- Free: no API key required (rate limits apply — see below)
Classification tiers (ACMG/AMP standards)
| Classification | Meaning |
|---|---|
| Pathogenic | Strong evidence variant causes disease |
| Likely Pathogenic | Strong but not definitive evidence |
| Uncertain Significance (VUS) | Insufficient evidence in either direction |
| Likely Benign | Suggests no clinical impact |
| Benign | Strong evidence no clinical impact |
| Conflicting interpretations | Different submitters disagree |
| Drug response | Variant affects medication response (PGx) |
| Risk factor | Modifies disease risk without directly causing it |
For screening, you usually care about Pathogenic + Likely Pathogenic (P/LP) and any conflicting interpretations involving them.
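As a minimal sketch of that filter, the helper below buckets a classification string into screening tiers. The function name `screening_tier` and the tier labels are hypothetical, chosen for illustration only. Matching is a plain case-insensitive substring check, and underscores are treated as spaces on the assumption that VCF-style values such as `Likely_pathogenic` should be handled too.

```python
def screening_tier(classification: str) -> str:
    """Bucket a ClinVar classification string for screening purposes.

    Returns 'P/LP', 'conflicting', 'VUS', 'B/LB', or 'other'.
    These tier names are illustrative labels, not ClinVar terminology.
    """
    text = classification.replace('_', ' ').lower()
    # Check 'conflicting' first: such strings also contain 'pathogenic'
    # (as part of 'pathogenicity') and must not be counted as P/LP.
    if 'conflicting' in text:
        return 'conflicting'
    if 'pathogenic' in text:  # Pathogenic, Likely pathogenic, or both
        return 'P/LP'
    if 'uncertain' in text:
        return 'VUS'
    if 'benign' in text:
        return 'B/LB'
    return 'other'


for label in ['Pathogenic', 'Likely_pathogenic', 'Uncertain significance']:
    print(label, '->', screening_tier(label))
```

Conflicting entries are kept as their own tier rather than discarded, so they can be reviewed by hand when one of the disagreeing submitters reports P/LP.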
NCBI E-utilities API — The Basics
The E-utilities are NCBI's REST API. ClinVar is one of many databases. Key endpoints:
- esearch.fcgi — search ClinVar, get a list of matching IDs
- esummary.fcgi — get summary metadata (faster, lighter)
- efetch.fcgi — get full XML/text record
Rate limit (without API key): 3 requests/second. With an API key (free, register on NCBI): 10 requests/second.
For Python, use the official Biopython Bio.Entrez module:
pip install biopython
from Bio import Entrez
Entrez.email = "your@email.com" # NCBI requires this for usage tracking
Entrez.api_key = "your_ncbi_api_key" # Optional but recommended for >3 req/sec
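The rate limits above can be respected with a small client-side throttle. This is a sketch, not part of Biopython: the `Throttle` class and `make_throttle` helper are hypothetical names, and the 2.5 and 8 requests/second targets are assumed safety margins below the documented 3 and 10.

```python
import time


class Throttle:
    """Space successive calls at least 1 / requests_per_second apart."""

    def __init__(self, requests_per_second: float):
        self.min_interval = 1.0 / requests_per_second
        self._last = None  # time of the previous call; None before the first

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


def make_throttle(has_api_key: bool) -> Throttle:
    # Assumed margins: a little under 3/s without a key, 10/s with one.
    return Throttle(8.0 if has_api_key else 2.5)


throttle = make_throttle(has_api_key=False)
throttle.wait()  # call this immediately before every E-utilities request
```

Calling `throttle.wait()` before each individual request (rather than once per variant) matters because a single lookup may issue more than one request.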
Lookup #1 — By rsid
Most useful for 23andMe-style data where you have a list of rsids.
from Bio import Entrez

Entrez.email = "you@example.com"

def clinvar_by_rsid(rsid):
    """Look up ClinVar entries for a dbSNP rsid (e.g., 'rs28897696')."""
    # Step 1: search for ClinVar IDs matching this rsid
    handle = Entrez.esearch(db='clinvar', term=f'{rsid}[Variant ID]')
    res = Entrez.read(handle)
    handle.close()
    ids = res['IdList']
    if not ids:
        return []

    # Step 2: fetch summaries for all IDs in one call
    handle = Entrez.esummary(db='clinvar', id=','.join(ids))
    summaries = Entrez.read(handle)['DocumentSummarySet']['DocumentSummary']
    handle.close()

    results = []
    for s in summaries:
        # Newer ClinVar summaries report germline calls under
        # 'germline_classification'; older ones use 'clinical_significance'.
        # Trying both is a hedge that assumes the inner keys are unchanged.
        cls = s.get('germline_classification') or s.get('clinical_significance', {})
        traits = cls.get('trait_set') or s.get('trait_set', [])
        results.append({
            'clinvar_id': s.attributes['uid'],
            'title': str(s.get('title', '')),
            'classification': str(cls.get('description', '')),
            'review_status': str(cls.get('review_status', '')),
            'genes': [g.get('symbol') for g in s.get('genes', [])],
            'conditions': [t.get('trait_name') for t in traits],
            'last_evaluated': str(cls.get('last_evaluated', '')),
        })
    return results

# Example
hits = clinvar_by_rsid('rs28897696')
for h in hits:
    print(f" {h['classification']:30} {h['genes']} {h['conditions']}")
    print(f" Review status: {h['review_status']}")
Output for rs28897696 (a known BRCA1 variant):
Pathogenic ['BRCA1'] ['Familial cancer of breast', 'Hereditary breast and ovarian cancer syndrome']
Review status: criteria provided, multiple submitters, no conflicts
Lookup #2 — By Gene + HGVS Nomenclature
When you have a specific variant in HGVS format (e.g., BRCA1:c.5266dupC).
def clinvar_by_hgvs(gene, hgvs):
    """Look up ClinVar by gene symbol + HGVS coding notation."""
    term = f'{gene}[gene] AND "{hgvs}"[variant name]'
    handle = Entrez.esearch(db='clinvar', term=term)
    res = Entrez.read(handle)
    handle.close()
    ids = res['IdList']
    if not ids:
        return []

    # Summarize at most the first 10 matches
    handle = Entrez.esummary(db='clinvar', id=','.join(ids[:10]))
    summaries = Entrez.read(handle)['DocumentSummarySet']['DocumentSummary']
    handle.close()

    results = []
    for s in summaries:
        # Newer summaries use 'germline_classification', older ones
        # 'clinical_significance'; try both.
        cls = s.get('germline_classification') or s.get('clinical_significance', {})
        results.append({
            'clinvar_id': s.attributes['uid'],
            'classification': str(cls.get('description', '')),
            'title': str(s.get('title', '')),
        })
    return results

hits = clinvar_by_hgvs('BRCA1', 'c.5266dupC')
for h in hits:
    print(f" {h['classification']:30} {h['title'][:80]}")
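Variants are often written in the combined `GENE:change` form used above (`BRCA1:c.5266dupC`). The sketch below splits that form and builds the same search term as `clinvar_by_hgvs`. The helper names `split_gene_hgvs` and `hgvs_search_term` are hypothetical, and the parsing only covers the simple `GENE:hgvs` shape; transcript-prefixed expressions such as `NM_...(GENE):c....` are not handled.

```python
def split_gene_hgvs(expr: str) -> tuple[str, str]:
    """Split 'GENE:c.change' into (gene, hgvs); raise ValueError if malformed."""
    gene, sep, hgvs = expr.strip().partition(':')
    if not sep or not gene or not hgvs:
        raise ValueError(f"expected 'GENE:hgvs', got {expr!r}")
    return gene, hgvs


def hgvs_search_term(expr: str) -> str:
    """Build the esearch term in the same form clinvar_by_hgvs uses."""
    gene, hgvs = split_gene_hgvs(expr)
    return f'{gene}[gene] AND "{hgvs}"[variant name]'


print(hgvs_search_term('BRCA1:c.5266dupC'))
```

Failing loudly on malformed input is a deliberate choice here: a silently mangled search term tends to return zero hits, which is easy to misread as "no ClinVar entry".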
Lookup #3 — By Genomic Coordinates
For WGS or VCF data where you have chromosome + position.
def clinvar_by_position(chromosome, start, end, build='GRCh37'):
    """Look up ClinVar variants overlapping a position range."""
    # ClinVar has build-specific position fields: [chrpos37] for GRCh37
    # and [chrpos38] for GRCh38.
    field = 'chrpos38' if build == 'GRCh38' else 'chrpos37'
    term = f'{chromosome}[chr] AND {start}:{end}[{field}]'
    # retmax caps the returned list at 100 IDs; res['Count'] has the full total
    handle = Entrez.esearch(db='clinvar', term=term, retmax=100)
    res = Entrez.read(handle)
    handle.close()
    return res['IdList']

# Example: BRCA1 region (GRCh37 coordinates)
ids = clinvar_by_position(17, 41196311, 41277500)
print(f"Variant IDs returned for BRCA1 region (max 100): {len(ids)}")
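VCF files frequently label chromosomes as `chr17` or `chrM`, while the query above uses a bare `17`. A small normalization step, sketched below, avoids empty results caused by the prefix. The helper names are hypothetical, and the assumption that the mitochondrial chromosome should be written `MT` follows the ClinVar VCF convention rather than anything stated in this tutorial.

```python
def normalize_chromosome(chrom) -> str:
    """Convert labels like 'chr17', 17 or 'chrM' to the bare form '17' / 'MT'."""
    name = str(chrom).strip()
    if name.lower().startswith('chr'):
        name = name[3:]
    name = name.upper()
    if name == 'M':
        name = 'MT'  # assumed naming for the mitochondrial chromosome
    valid = {str(i) for i in range(1, 23)} | {'X', 'Y', 'MT'}
    if name not in valid:
        raise ValueError(f'unrecognised chromosome: {chrom!r}')
    return name


def point_query_args(chrom, pos) -> tuple[str, int, int]:
    """Arguments (chromosome, start, end) for a single 1-based position."""
    return normalize_chromosome(chrom), int(pos), int(pos)


print(point_query_args('chr17', 41196311))
```

The returned tuple can be unpacked straight into `clinvar_by_position(*point_query_args(chrom, pos), build=...)`.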
Batch Screening — Apply to a 23andMe File
The real workflow: 600K rsids → which have ClinVar entries?
Two strategies:
Strategy A — Loop with rate limiting
import time
import pandas as pd

snps = pd.read_csv('genome.txt', sep='\t', comment='#', dtype=str,
                   names=['rsid', 'chromosome', 'position', 'genotype'])

# Most rsids in a healthy person's data carry no clinical annotation, so
# pre-filter against a list of rsids known to ClinVar before calling the API.
# clinvar_rsids.txt is a precomputed list (see Strategy B for how to build it).
clinvar_rsids = set(pd.read_csv('clinvar_rsids.txt', header=None, names=['rsid'])['rsid'])

# Intersect
relevant = snps[snps['rsid'].isin(clinvar_rsids)]
print(f"23andMe SNPs with ClinVar entries: {len(relevant)}")

# Look up only those, paced to respect rate limits
results = []
for _, row in relevant.iterrows():
    hits = clinvar_by_rsid(row['rsid'])
    for h in hits:
        h['user_genotype'] = row['genotype']
        results.append(h)
    # Each lookup makes two requests (esearch + esummary), so pause long
    # enough to stay under 3 requests/second overall.
    time.sleep(0.7)

df = pd.DataFrame(results)
df.to_csv('clinvar_matches.csv', index=False)
This typically returns 200-500 matches for a healthy individual's 23andMe file, most of them Benign or Likely Benign. The interesting filter:
# Case-insensitive so that "Likely pathogenic" matches; the (?!ity) lookahead
# skips "Conflicting ... of pathogenicity" entries.
clinically_relevant = df[df['classification'].str.contains(r'pathogenic(?!ity)', case=False, na=False)]
print(f"P/LP variants in user data: {len(clinically_relevant)}")
print(clinically_relevant[['classification', 'genes', 'conditions', 'user_genotype']])
Strategy B — Use the bulk download
NCBI publishes the full ClinVar database as regularly refreshed bulk files on its FTP site. For batch screening of many variants, download the VCF (a few hundred MB compressed) and query locally, with no API calls needed:
# Download VCF format: has rsid + classification per row
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz
# Or weekly bulk XML for full record detail
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/ClinVarVariationRelease_00-latest_weekly.xml.gz
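import pandas as pd
import pysam

vcf = pysam.VariantFile('clinvar.vcf.gz')

# Stream through all ClinVar variants, store rsid → classification mapping.
# Note: an rsid can cover several alternate alleles; this simple map keeps
# only the last record seen per rsid and ignores which allele it describes.
clinvar_db = {}
for rec in vcf:
    rsid = rec.info.get('RS')
    if not rsid:
        continue
    rsid_str = f"rs{rsid[0]}"
    clnsig = rec.info.get('CLNSIG', ('Unknown',))[0]
    clndn = rec.info.get('CLNDN', ('',))[0]
    clinvar_db[rsid_str] = {'significance': clnsig, 'condition': clndn}
print(f"ClinVar variants loaded: {len(clinvar_db):,}")

# Match user data (snps is the DataFrame loaded in Strategy A)
matched = []
for _, row in snps.iterrows():
    if row['rsid'] in clinvar_db:
        m = clinvar_db[row['rsid']].copy()
        m['rsid'] = row['rsid']
        m['user_genotype'] = row['genotype']
        matched.append(m)

# Explicit columns keep the filter below working even with zero matches
matched_df = pd.DataFrame(matched, columns=['significance', 'condition', 'rsid', 'user_genotype'])
print(f"Total matches: {len(matched_df)}")
# Case-insensitive so "Likely_pathogenic" counts; (?!ity) skips
# "Conflicting_..._of_pathogenicity".
is_plp = matched_df['significance'].str.contains(r'pathogenic(?!ity)', case=False, na=False)
print(f"P/LP matches: {is_plp.sum()}")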
This is faster (single download, no rate limits) and what most clinical pipelines use. The API approach is better for ad-hoc lookups of specific variants.
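To make the VCF layout concrete without needing pysam, here is a standard-library sketch that parses one data line into a dictionary. The function name `parse_clinvar_vcf_line` and the sample line are hypothetical; the line is synthetic (made-up position, ID, alleles and rs number) and only mirrors the INFO keys used above (`RS`, `CLNSIG`, `CLNDN`). Splitting `RS` on `|` or `,` is an assumption about how multiple values are delimited.

```python
import re


def parse_clinvar_vcf_line(line: str) -> dict:
    """Parse one tab-separated ClinVar-style VCF data line into a flat dict."""
    cols = line.rstrip('\n').split('\t')
    chrom, pos, var_id, ref, alt = cols[0], cols[1], cols[2], cols[3], cols[4]
    info = {}
    for item in cols[7].split(';'):
        key, _, value = item.partition('=')
        info[key] = value
    # RS may hold more than one rs number; the delimiter is assumed here.
    rs_numbers = [r for r in re.split(r'[|,]', info.get('RS', '')) if r]
    return {
        'chrom': chrom,
        'pos': int(pos),
        'variation_id': var_id,
        'ref': ref,
        'alt': alt,
        'rsids': [f'rs{r}' for r in rs_numbers],
        'significance': info.get('CLNSIG', 'Unknown'),
        'condition': info.get('CLNDN', ''),
    }


# Synthetic example line: every value below is made up for illustration.
SAMPLE = ('17\t12345\t99999\tG\tA\t.\t.\t'
          'ALLELEID=11111;CLNDN=Example_condition;CLNSIG=Likely_pathogenic;RS=123456')

print(parse_clinvar_vcf_line(SAMPLE))
```

Keeping `ref` and `alt` alongside the rsid makes it possible to key a lookup table by `(rsid, alt)` rather than by rsid alone, which avoids attributing one allele's classification to a different allele at the same site.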
Important — "Pathogenic in ClinVar" Doesn't Equal "You Have the Disease"
Three critical considerations:
1. Zygosity matters
Many ClinVar Pathogenic variants are recessive — having one copy means you're a carrier, not affected. The variant being labeled "Pathogenic for X disease" means it causes X in homozygotes (or compound heterozygotes), not in carriers.
Always check the inheritance pattern (autosomal dominant, recessive, X-linked) of the condition.
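A rough sketch of that zygosity check for single-nucleotide variants follows. The function names, the `inheritance` labels (`'recessive'`, `'dominant'`) and the returned phrases are hypothetical. It assumes a 23andMe-style two-letter genotype on the same strand as the ClinVar alternate allele, and it deliberately defers single-letter (hemizygous) calls and other inheritance modes to manual review.

```python
from typing import Optional


def count_alt_alleles(genotype: str, alt: str) -> Optional[int]:
    """Count copies of `alt` in a genotype such as 'AG'; None for no-calls."""
    g = genotype.strip().upper()
    if not g or '-' in g:
        return None
    return sum(1 for allele in g if allele == alt.upper())


def interpret_zygosity(genotype: str, alt: str, inheritance: str) -> str:
    """Turn a genotype plus inheritance mode into a cautious text label."""
    n = count_alt_alleles(genotype, alt)
    if n is None:
        return 'no call'
    if n == 0:
        return 'variant not present'
    if len(genotype.strip()) == 1:
        return 'hemizygous call: review manually'
    if inheritance == 'recessive':
        return 'carrier (one copy)' if n == 1 else 'two copies: potentially affected'
    if inheritance == 'dominant':
        return 'one or more copies: potentially affected'
    return 'inheritance unknown: review manually'


print(interpret_zygosity('AG', 'G', 'recessive'))
```

Compound heterozygosity (two different pathogenic variants in the same gene) is not detected by this per-variant check and needs a gene-level pass.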
2. Penetrance varies
Even for dominant Pathogenic variants, penetrance (the % of carriers who develop the disease) is often less than 100%. BRCA1 pathogenic variants confer ~65% lifetime breast cancer risk, not 100%. A "Pathogenic" classification doesn't mean certain disease.
3. Review status
ClinVar entries have a review status indicating how robust the classification is:
| Review status | Meaning |
|---|---|
| practice guideline | Highest confidence; incorporated into clinical guidelines |
| reviewed by expert panel | Expert panel curated |
| criteria provided, multiple submitters, no conflicts | Strong |
| criteria provided, conflicting classifications | Caution: submitters disagree |
| criteria provided, single submitter | Moderate confidence |
| no assertion criteria provided | Lowest; be skeptical |
For any clinical decision-making, filter for review status ≥ "criteria provided, multiple submitters" at minimum. Treat single-submitter and conflicting interpretations as preliminary.
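One way to apply that threshold in code is to map each review status to ClinVar's star rating and filter on a minimum. This is a sketch: `REVIEW_STARS`, `review_stars` and `meets_review_threshold` are hypothetical names, unknown statuses are conservatively scored as 0, and several spellings of the conflicting status are listed on the assumption that the wording differs between releases.

```python
REVIEW_STARS = {
    'practice guideline': 4,
    'reviewed by expert panel': 3,
    'criteria provided, multiple submitters, no conflicts': 2,
    'criteria provided, conflicting classifications': 1,
    'criteria provided, conflicting interpretations': 1,       # older wording
    'criteria provided, multiple submitters, conflicting': 1,  # informal wording
    'criteria provided, single submitter': 1,
    'no assertion criteria provided': 0,
}


def review_stars(status: str) -> int:
    """Star rating for a review status; unknown statuses count as 0."""
    # VCF values use underscores instead of spaces, so normalize first.
    key = status.replace('_', ' ').strip().lower()
    return REVIEW_STARS.get(key, 0)


def meets_review_threshold(status: str, min_stars: int = 2) -> bool:
    """True if the status is at least as strong as `min_stars`."""
    return review_stars(status) >= min_stars


print(meets_review_threshold('criteria provided, single submitter'))
```

With the default of two stars, single-submitter and conflicting entries are excluded, matching the guidance above.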
Practical Screening — Hereditary Cancer Panel Example
A common use case: scan a 23andMe file for variants in major hereditary cancer genes.
import time

HEREDITARY_CANCER_GENES = [
    'BRCA1', 'BRCA2',                          # Hereditary Breast and Ovarian Cancer
    'MLH1', 'MSH2', 'MSH6', 'PMS2', 'EPCAM',   # Lynch syndrome
    'TP53',                                    # Li-Fraumeni
    'PTEN',                                    # Cowden syndrome
    'STK11',                                   # Peutz-Jeghers
    'APC',                                     # Familial Adenomatous Polyposis
    'CDH1',                                    # Hereditary Diffuse Gastric Cancer
    'VHL',                                     # Von Hippel-Lindau
    'RB1',                                     # Retinoblastoma
    'NF1', 'NF2',                              # Neurofibromatosis
    'PALB2', 'CHEK2', 'ATM',                   # Moderate-penetrance breast cancer
]

# Restrict screening to known ClinVar Pathogenic variants in these genes
def get_pathogenic_variants_for_genes(genes):
    """Query ClinVar for Pathogenic / Likely pathogenic variant IDs per gene."""
    results = {}
    for gene in genes:
        term = f'{gene}[gene] AND (Pathogenic[clinsig] OR "Likely pathogenic"[clinsig])'
        # retmax caps the returned list at 500 IDs per gene; res['Count'] holds
        # the full number of matches if you need to page with retstart.
        handle = Entrez.esearch(db='clinvar', term=term, retmax=500)
        res = Entrez.read(handle)
        handle.close()
        results[gene] = res['IdList']
        time.sleep(0.4)  # one request per gene, so this stays under 3 req/sec
    return results

# Run once, cache the results
gene_to_clinvar_ids = get_pathogenic_variants_for_genes(HEREDITARY_CANCER_GENES)
total = sum(len(v) for v in gene_to_clinvar_ids.values())
print(f"Pathogenic/LP variant IDs retrieved (max 500 per gene): {total:,}")
# Then map ClinVar IDs to rsids and intersect with user data
# (Full implementation uses the bulk clinvar.vcf.gz for speed)
This gives you the universe of variants that, if found in your 23andMe data, deserve attention.
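The closing comment in the listing defers the final intersection to the bulk VCF. One piece of that step can be sketched as a gene filter on the VCF's `GENEINFO` field, which lists `symbol:gene_id` pairs separated by `|`. The helper names and the two-gene `PANEL` are hypothetical stand-ins for the full gene list, and the gene IDs in the example strings are placeholders.

```python
def genes_from_geneinfo(geneinfo: str) -> list[str]:
    """Extract gene symbols from a GENEINFO value like 'GENE1:111|GENE2:222'."""
    return [part.split(':', 1)[0] for part in geneinfo.split('|') if part]


def in_panel(geneinfo: str, panel: set[str]) -> bool:
    """True if any gene named in GENEINFO belongs to the panel."""
    return any(gene in panel for gene in genes_from_geneinfo(geneinfo))


# Illustrative subset; in practice pass set(HEREDITARY_CANCER_GENES).
PANEL = {'BRCA1', 'BRCA2'}

print(in_panel('BRCA1:111|OTHERGENE:222', PANEL))
print(in_panel('OTHERGENE:222', PANEL))
```

Filtering records by panel membership first, then by classification and review status, keeps the local table small before it is joined against the user's rsids.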
Limitations You Should Know
- 23andMe doesn't probe many rare clinically-actionable variants. Most hereditary cancer Pathogenic variants are rare (<1% frequency) and not on consumer SNP arrays. Negative DTC ≠ negative clinical screen.
- Structural variants (deletions, duplications, CNVs) are missed. Most ClinVar Pathogenic BRCA variants are SNVs and indels — both detectable. But some (BRCA1 BRCT large rearrangements) are CNVs and invisible to 23andMe.
- Imputation can fill some gaps — services like Sequencing.com use linkage to predict variants not directly typed. Quality varies.
- For clinical decision, work with a genetic counselor and order a clinical hereditary cancer panel (Invitae, Ambry, etc.). DTC + ClinVar lookup is screening, not diagnosis.
Rate Limit and Etiquette
NCBI's rate limits (without API key):
- 3 requests/second
- Recommended: throttle to 2-2.5/sec to avoid 429 errors
- Batch via id=id1,id2,id3 in a single esummary call (up to ~200 IDs)
With NCBI API key (free, get from https://www.ncbi.nlm.nih.gov/account/):
- 10 requests/second
- Should be enough for most scripts
For production-scale lookups (>10,000 variants in tight loops), use the bulk VCF download. Hammering the API for batch jobs will get you rate-limited at minimum.
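The batching advice above can be sketched as a chunking helper that turns a long ID list into comma-joined `id` parameters of at most 200 IDs each. The names `chunked` and `esummary_id_params` are hypothetical, and 200 is the approximate batch size quoted above rather than a hard limit confirmed here.

```python
from typing import Iterator, Sequence


def chunked(ids: Sequence[str], size: int = 200) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def esummary_id_params(ids: Sequence[str], size: int = 200) -> list[str]:
    """Comma-joined id strings, one per esummary request."""
    return [','.join(chunk) for chunk in chunked(list(ids), size)]


example_ids = [str(i) for i in range(450)]  # placeholder IDs for illustration
params = esummary_id_params(example_ids)
print(len(params), 'requests instead of', len(example_ids))
```

Each string in `params` would be passed as the `id` argument of one esummary call, with a pause between calls.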
FAQ
Q: ClinVar vs ClinGen vs OMIM vs HGMD — what's the difference?
- ClinVar: free, NCBI-hosted aggregator of submitted clinical interpretations
- ClinGen: NIH-funded curation effort that creates expert-panel classifications; many appear in ClinVar
- OMIM: catalog of human genes and genetic disorders (older, narrative format)
- HGMD: commercial database of mutation-disease associations; broader but paid subscription
For most workflows, ClinVar is the right default. ClinGen for expert-curated subsets. HGMD if you have institutional access.
Q: Can I get ACMG criteria details (PS1, PM2, BS3, etc.)?
Some ClinVar submissions include them. Look in the full record (efetch XML) under the criteria provided. Most use a subset; reconstructing complete ACMG framework requires VarSome or InterVar (separate tools).
Q: Is there a Python ClinVar API client (not via E-utilities)?
Several wrappers exist: clinvarweb, pyClinVar. They typically wrap the same E-utilities under the hood. Biopython's Bio.Entrez is the most maintained.
Q: How fresh is ClinVar?
The ClinVar website and the FTP bulk files (VCF, XML) are refreshed weekly, and an archived release is published monthly. API queries run against the current website data. The variant interpretations themselves are submitted by labs at various cadences: well-known genes like BRCA see updates monthly, rare-disease variants more sporadically.
Q: What about ClinVar VCF GRCh38 vs GRCh37?
23andMe v5 uses GRCh37. WGS files often use GRCh38. Download the matching VCF: the file is named clinvar.vcf.gz in both cases, under the vcf_GRCh37/ directory for GRCh37 and vcf_GRCh38/ for GRCh38. Or use rsid-based matching, which is build-agnostic.
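A small helper can make the build choice explicit so that the wrong file is not downloaded by accident. This is a sketch: `clinvar_vcf_url` is a hypothetical name, the GRCh37 URL is the one used earlier in this tutorial, the GRCh38 URL assumes the parallel `vcf_GRCh38` directory, and treating `hg19`/`hg38` as aliases is a simplification.

```python
CLINVAR_FTP = 'https://ftp.ncbi.nlm.nih.gov/pub/clinvar'

# Assumed aliases; hg19/hg38 are treated as equivalent to GRCh37/GRCh38.
_BUILD_ALIASES = {
    'grch37': 'GRCh37', 'hg19': 'GRCh37',
    'grch38': 'GRCh38', 'hg38': 'GRCh38',
}


def clinvar_vcf_url(build: str) -> str:
    """Return the ClinVar VCF URL for a genome build name."""
    try:
        name = _BUILD_ALIASES[build.strip().lower()]
    except KeyError:
        raise ValueError(f'unknown genome build: {build!r}') from None
    return f'{CLINVAR_FTP}/vcf_{name}/clinvar.vcf.gz'


print(clinvar_vcf_url('GRCh37'))
print(clinvar_vcf_url('hg38'))
```

Raising on an unrecognised build, instead of defaulting to one, prevents silently matching coordinates against the wrong assembly.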
Q: Can I submit to ClinVar?
Yes — clinical labs, research labs, and individuals can submit. There's a web submission portal. Useful if you find a novel variant and have evidence; helps the community.
Q: Why does the same rsid sometimes have multiple ClinVar entries?
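Because one rsid can map to more than one ClinVar record. A dbSNP rsid identifies a location and may cover several alternate alleles, each with its own ClinVar Variation ID; the same allele can also appear inside a haplotype or other multi-variant record. The IDs returned by esearch are Variation IDs (one per variant), while variant-condition pairs are tracked separately as RCV accessions.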
Closing — The Workflow
For batch screening of 23andMe / WGS data against clinical relevance:
- Download ClinVar bulk VCF (clinvar.vcf.gz, refreshed weekly) → local rsid → significance map
- Match user variants to this map (fast, no API calls)
- Filter to Pathogenic / Likely Pathogenic with strong review status
- Check zygosity + inheritance pattern for each hit
- Discuss with a genetic counselor before acting on any finding
The code in this guide is a starting point. For meaningful clinical screening, integrate with a hereditary disease panel approach (see BRCA + Hereditary Cancer Guide) and confirm any hits via clinical-grade testing.
Related posts:
- BRCA1/2 + Hereditary Cancer Genetic Testing Guide 2026
- Pharmacogenomics (PGx) Complete Guide 2026
- Reading 23andMe Raw Data for CYP2D6 Star Alleles in Python
- Extracting CYP2C19 Star Alleles from 23andMe — Plavix Response Prediction
- DTC Genetic Testing 2026 Complete Buyer's Guide
References:
- ClinVar database: https://www.ncbi.nlm.nih.gov/clinvar/
- NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25497/
- ACMG/AMP Standards (Richards et al., 2015): Genetics in Medicine, 17, 405-424.
- ClinGen expert panels: https://clinicalgenome.org/
- Landrum, M. J. et al. (2018). ClinVar: improving access to variant interpretations. Nucleic Acids Research, 46, D1062-D1067.