
ClinVar Variant Pathogenicity Lookup in Python — Programmatic Access for Hereditary Disease Screening (2026)

Step-by-step Python tutorial: query NCBI ClinVar via E-utilities API to get pathogenicity classification (Pathogenic, Likely Benign, VUS) for any SNP or variant. Useful for batch-screening 23andMe / WGS data against BRCA, Lynch syndrome, hereditary cancer panels, and any clinically annotated variant set.


The Workflow This Tutorial Replaces

You have 600,000 SNPs from 23andMe (or 5 million variants from WGS). You want to know which of them have known clinical significance — Pathogenic, Likely Benign, Variant of Uncertain Significance (VUS), etc. Manually entering each rsid into the ClinVar web search is not the answer.

ClinVar (NCBI's free database of clinical variant interpretations) has a programmatic API via the standard NCBI E-utilities. This tutorial shows the Python code to look up any variant — by rsid, by gene+HGVS, by genomic coordinates — and get back ACMG classification, condition associations, supporting evidence, and submitter information.

It's the practical complement to the gene-specific guides on this site (BRCA hereditary cancer, PGx guide, CYP2D6 from 23andMe, CYP2C19 for Plavix). Each of those handles one gene; this one is the general engine for any variant.

ClinVar Quick Background

  • What it is: NCBI's centralized, curated database of clinical interpretations of human genetic variants
  • Who submits: clinical labs (Invitae, Ambry, ARUP, etc.), academic institutions, single submitters
  • Size: 2 million+ variants with classifications, 30,000+ genes covered
  • Updated: web and FTP files refresh on a weekly cycle; monthly releases are archived
  • Free: no API key required (rate limits apply — see below)

Classification tiers (ACMG/AMP standards)

Classification                  | Meaning
Pathogenic                      | Strong evidence the variant causes disease
Likely Pathogenic               | Strong but not definitive evidence
Uncertain Significance (VUS)    | Insufficient evidence in either direction
Likely Benign                   | Evidence suggests no clinical impact
Benign                          | Strong evidence of no clinical impact
Conflicting classifications     | Different submitters disagree (older records say "Conflicting interpretations")
Drug response                   | Variant affects medication response (PGx)
Risk factor                     | Modifies disease risk without directly causing it

The first five rows are the ACMG/AMP tiers. The last three are additional labels that ClinVar uses alongside them.

For screening, you usually care about Pathogenic + Likely Pathogenic (P/LP) and any conflicting interpretations involving them.
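A naive substring test such as "contains Pathogenic" behaves badly on these labels: it misses "Likely pathogenic" (lowercase p) and, if made case-insensitive, also matches the word "pathogenicity" in conflicting records. The sketch below is one way to sort a classification string into the three buckets that matter for screening. The function name `triage` and the bucket labels are illustrative choices, not ClinVar terminology.

```python
import re

# "pathogenic" as a whole word, so "pathogenicity" is not a hit
_PLP = re.compile(r"(?<![a-z])pathogenic(?![a-z])", re.IGNORECASE)
_CONFLICT = re.compile(r"conflicting", re.IGNORECASE)

def triage(classification: str) -> str:
    """Bucket a ClinVar classification string for screening purposes."""
    # The bulk VCF writes underscores where the API writes spaces
    text = classification.replace("_", " ")
    if _CONFLICT.search(text):
        return "conflicting"      # needs a manual look at the submissions
    if _PLP.search(text):
        return "P/LP"
    return "other"

for label in ["Pathogenic", "Likely pathogenic", "Likely benign",
              "Conflicting classifications of pathogenicity"]:
    print(f"{label:50} -> {triage(label)}")
```

Conflicting records are checked first so that they never land in the P/LP bucket by accident.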

NCBI E-utilities API — The Basics

The E-utilities are NCBI's REST API. ClinVar is one of many databases. Key endpoints:

  • esearch.fcgi — search ClinVar, get a list of matching IDs
  • esummary.fcgi — get summary metadata (faster, lighter)
  • efetch.fcgi — get full XML/text record

Rate limit (without API key): 3 requests/second. With an API key (free, register on NCBI): 10 requests/second.

For Python, use Biopython's Bio.Entrez module. Install it from the shell:

pip install biopython

Then configure it once at the top of your script:

from Bio import Entrez
Entrez.email = "your@email.com"   # NCBI requires this for usage tracking
Entrez.api_key = "your_ncbi_api_key"   # Optional but recommended for >3 req/sec
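For orientation, every Bio.Entrez call ends up as a plain HTTPS request to one of the endpoints listed above. The sketch below only builds such a URL and makes no network call. `eutils_url` is a hypothetical helper name; the base address and the `db`, `term`, `retmode`, `retmax` and `api_key` parameters are standard E-utilities ones.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def eutils_url(endpoint, **params):
    """Build an E-utilities URL; endpoint is e.g. 'esearch' or 'esummary'."""
    # Drop unset parameters so optional ones (like api_key) can be passed as None
    query = {k: v for k, v in params.items() if v is not None}
    return f"{EUTILS_BASE}/{endpoint}.fcgi?{urlencode(query)}"

url = eutils_url("esearch", db="clinvar", term="BRCA1[gene]",
                 retmode="json", retmax=5, api_key=None)
print(url)
```

Pasting the printed URL into a browser is a quick way to see the raw response that Bio.Entrez parses for you.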

Lookup #1 — By rsid

Most useful for 23andMe-style data where you have a list of rsids.

from Bio import Entrez
import xml.etree.ElementTree as ET

Entrez.email = "you@example.com"

def clinvar_by_rsid(rsid):
    """Look up ClinVar entries for a dbSNP rsid (e.g., 'rs28897696')."""
    # Step 1: search for ClinVar IDs matching this rsid
    handle = Entrez.esearch(db='clinvar', term=f'{rsid}[Variant ID]')
    res = Entrez.read(handle)
    handle.close()
    ids = res['IdList']
    if not ids:
        return []
    # Step 2: fetch summaries
    handle = Entrez.esummary(db='clinvar', id=','.join(ids))
    summaries = Entrez.read(handle)['DocumentSummarySet']['DocumentSummary']
    handle.close()
    results = []
    for s in summaries:
        # Newer esummary output nests the germline call (and its trait_set)
        # under 'germline_classification'; older output used
        # 'clinical_significance' with a top-level trait_set. Accept either.
        cls = s.get('germline_classification') or s.get('clinical_significance') or {}
        traits = cls.get('trait_set') or s.get('trait_set') or []
        results.append({
            'clinvar_id': s.attributes['uid'],
            'title': str(s.get('title', '')),
            'classification': str(cls.get('description', '')),
            'review_status': str(cls.get('review_status', '')),
            'genes': [g.get('symbol') for g in s.get('genes', [])],
            'conditions': [t.get('trait_name') for t in traits],
            'last_evaluated': str(cls.get('last_evaluated', '')),
        })
    return results

# Example
hits = clinvar_by_rsid('rs28897696')
for h in hits:
    print(f"  {h['classification']:30} {h['genes']} {h['conditions']}")
    print(f"      Review status: {h['review_status']}")

Example output for rs28897696 (a known BRCA1 variant). The exact condition names and review status depend on the current ClinVar release:

  Pathogenic                     ['BRCA1'] ['Familial cancer of breast', 'Hereditary breast and ovarian cancer syndrome']
      Review status: criteria provided, multiple submitters, no conflicts

Lookup #2 — By Gene + HGVS Nomenclature

When you have a specific variant in HGVS format (e.g., BRCA1:c.5266dupC).

def clinvar_by_hgvs(gene, hgvs):
    """Look up ClinVar by gene symbol + HGVS coding notation."""
    term = f'{gene}[gene] AND "{hgvs}"[variant name]'
    handle = Entrez.esearch(db='clinvar', term=term)
    res = Entrez.read(handle)
    handle.close()
    ids = res['IdList']
    if not ids:
        return []
    handle = Entrez.esummary(db='clinvar', id=','.join(ids[:10]))
    summaries = Entrez.read(handle)['DocumentSummarySet']['DocumentSummary']
    handle.close()
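    results = []
    for s in summaries:
        # Newer esummary output uses 'germline_classification'; older output
        # used 'clinical_significance'. Accept either.
        cls = s.get('germline_classification') or s.get('clinical_significance') or {}
        results.append({
            'clinvar_id': s.attributes['uid'],
            'classification': str(cls.get('description', '')),
            'title': str(s.get('title', '')),
        })
    return results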

hits = clinvar_by_hgvs('BRCA1', 'c.5266dupC')
for h in hits:
    print(f"  {h['classification']:30} {h['title'][:80]}")
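Variants are often written as a single string such as BRCA1:c.5266dupC, while the lookup above takes the gene and the HGVS part separately. A minimal sketch of the split, assuming the simple GENE:change form only; transcript-prefixed names such as NM_...(GENE):c.... are not handled. Both helper names are hypothetical.

```python
def split_gene_hgvs(variant):
    """Split 'GENE:c.change' into (gene, hgvs)."""
    gene, sep, hgvs = variant.partition(":")
    gene, hgvs = gene.strip(), hgvs.strip()
    if not sep or not gene or not hgvs:
        raise ValueError(f"expected 'GENE:hgvs', got {variant!r}")
    return gene, hgvs

def hgvs_search_term(variant):
    """Build the same esearch term that clinvar_by_hgvs uses."""
    gene, hgvs = split_gene_hgvs(variant)
    return f'{gene}[gene] AND "{hgvs}"[variant name]'

print(hgvs_search_term("BRCA1:c.5266dupC"))
```

With this in place, `clinvar_by_hgvs(*split_gene_hgvs("BRCA1:c.5266dupC"))` is equivalent to the call shown above.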

Lookup #3 — By Genomic Coordinates

For WGS or VCF data where you have chromosome + position.

def clinvar_by_position(chromosome, start, end, build='GRCh37'):
    """Look up ClinVar variants overlapping a position range."""
    # Assumption worth verifying with Entrez.einfo(db='clinvar'): the
    # unqualified [chrpos] field follows the current assembly (GRCh38),
    # so GRCh37 coordinates need the build-specific [chrpos37] field.
    field = 'chrpos38' if build == 'GRCh38' else 'chrpos37'
    term = f'{chromosome}[chr] AND {start}:{end}[{field}]'
    handle = Entrez.esearch(db='clinvar', term=term, retmax=100)
    res = Entrez.read(handle)
    handle.close()
    return res['IdList']

# Example: BRCA1 region
ids = clinvar_by_position(17, 41196311, 41277500)
print(f"Variants in BRCA1 region: {len(ids)}")

Batch Screening — Apply to a 23andMe File

The real workflow: 600K rsids → which have ClinVar entries?

Two strategies:

Strategy A — Loop with rate limiting

import time
import pandas as pd

snps = pd.read_csv('genome.txt', sep='\t', comment='#',
                   names=['rsid','chromosome','position','genotype'],
                   dtype={'chromosome': str})

# Every rsid on the array gets a genotype, whether or not the user carries
# a non-reference allele, so pre-filter against a precomputed list of
# rsids that have ClinVar entries (e.g. extracted from the bulk VCF in
# Strategy B).
clinvar_rsids = set(pd.read_csv('clinvar_rsids.txt', header=None, names=['rsid'])['rsid'])

# Intersect
relevant = snps[snps['rsid'].isin(clinvar_rsids)]
print(f"23andMe SNPs with ClinVar entries: {len(relevant)}")

# Look up only those, throttled to respect rate limits
results = []
for _, row in relevant.iterrows():
    hits = clinvar_by_rsid(row['rsid'])
    for h in hits:
        h['user_genotype'] = row['genotype']
        results.append(h)
    # Each lookup makes two requests (esearch + esummary), so 0.8 s per
    # iteration keeps the rate at or below 2.5 req/sec
    time.sleep(0.8)

df = pd.DataFrame(results)
df.to_csv('clinvar_matches.csv', index=False)

This typically returns 200-500 matches for a healthy individual's 23andMe — most will be Benign or Likely Benign. The interesting filter:

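# Case-insensitive whole-word match: catches "Pathogenic" and
# "Likely pathogenic" but not "... of pathogenicity" (conflicting records)
clinically_relevant = df[df['classification'].str.contains(r'pathogenic\b', case=False, na=False)]
print(f"P/LP variants in user data: {len(clinically_relevant)}")
print(clinically_relevant[['classification', 'genes', 'conditions', 'user_genotype']])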

Strategy B — Use the bulk download

NCBI publishes the full ClinVar database as bulk files on its FTP site. For batch screening of many variants, download the bulk VCF and query it locally, with no API calls needed. The file grows with each release, so check the FTP listing for its current size:

# Shell: download the GRCh37 VCF (one row per variant, with rsid and classification)
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz

# Shell: or the weekly bulk XML for full record detail
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/ClinVarVariationRelease_00-latest_weekly.xml.gz

Then, in Python:

import pandas as pd
import pysam

vcf = pysam.VariantFile('clinvar.vcf.gz')

# Stream through all ClinVar variants, store rsid → classification mapping.
# Note: if several records share an rsid, the last one read wins.
clinvar_db = {}
for rec in vcf:
    rsid = rec.info.get('RS')
    if not rsid:
        continue
    rsid_str = f"rs{rsid[0]}"
    clnsig = rec.info.get('CLNSIG', ('Unknown',))[0]
    clndn = rec.info.get('CLNDN', ('',))[0]
    clinvar_db[rsid_str] = {'significance': clnsig, 'condition': clndn}
vcf.close()

print(f"ClinVar variants loaded: {len(clinvar_db):,}")

# Match user data (snps is the DataFrame loaded in Strategy A)
matched = []
for _, row in snps.iterrows():
    if row['rsid'] in clinvar_db:
        m = clinvar_db[row['rsid']].copy()
        m['rsid'] = row['rsid']
        m['user_genotype'] = row['genotype']
        matched.append(m)

matched_df = pd.DataFrame(matched)
print(f"Total matches: {len(matched_df)}")
# Whole-word, case-insensitive match: catches Pathogenic and
# Likely_pathogenic, skips "..._of_pathogenicity" (conflicting records)
is_plp = matched_df['significance'].str.contains(r'pathogenic\b', case=False, na=False)
print(f"P/LP matches: {is_plp.sum()}")

This is faster (single download, no rate limits) and what most clinical pipelines use. The API approach is better for ad-hoc lookups of specific variants.
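One caveat with an rsid-keyed dictionary: a single rsid can cover several alternate alleles with different classifications, and a plain dict keeps only one of them. A minimal sketch of an index that keeps every record per rsid, written against plain tuples so it runs without pysam. The rows in `demo` are made-up placeholders, not real ClinVar data, and `build_index` is a hypothetical name.

```python
from collections import defaultdict

def build_index(records):
    """Group (rsid, ref, alt, significance) tuples by rsid, keeping all of them."""
    index = defaultdict(list)
    for rsid, ref, alt, significance in records:
        index[rsid].append({"ref": ref, "alt": alt, "significance": significance})
    return dict(index)

# Made-up placeholder rows, not real ClinVar data
demo = [
    ("rsEXAMPLE1", "C", "T", "Benign"),
    ("rsEXAMPLE1", "C", "A", "Pathogenic"),
    ("rsEXAMPLE2", "G", "A", "Uncertain_significance"),
]
index = build_index(demo)
for entry in index["rsEXAMPLE1"]:
    print(entry["alt"], entry["significance"])
```

When reading the real VCF, the tuples would come from each record's RS, REF, ALT and CLNSIG values. Keeping the alternate allele alongside the classification is what later lets you check whether the user's genotype actually contains it.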

Important — "Pathogenic in ClinVar" Doesn't Equal "You Have the Disease"

Three critical considerations:

1. Zygosity matters

Many ClinVar Pathogenic variants are recessive: having one copy means you're a carrier, not affected. For a recessive condition, "Pathogenic for X disease" means the variant causes X in homozygotes (or compound heterozygotes), not in carriers.

Always check the inheritance pattern (autosomal dominant, recessive, X-linked) of the condition.
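Before zygosity can be judged at all, the genotype has to contain the classified allele: a 23andMe file reports a genotype at every typed rsid, including the many where the user is homozygous reference. The sketch below counts copies of the alternate allele in a genotype string. It assumes single-nucleotide variants and that the genotype and the ClinVar allele are reported on the same strand; both function names are hypothetical.

```python
def alt_allele_count(genotype, alt):
    """Count copies of `alt` in a 23andMe-style genotype string such as 'AG'.

    Returns None for no-calls ('--') and indel codes ('I'/'D'), which
    cannot be compared to a nucleotide allele directly.
    """
    calls = genotype.strip().upper()
    if not calls or any(c not in "ACGT" for c in calls):
        return None
    return sum(1 for c in calls if c == alt.upper())

def zygosity(genotype, alt):
    """Describe how many copies of the alternate allele the genotype carries."""
    n = alt_allele_count(genotype, alt)
    if n is None:
        return "no call / not comparable"
    if n == 0:
        return "alt allele not present"
    if len(genotype.strip()) == 1:
        return "hemizygous"       # single-allele call, e.g. male X or MT
    return "heterozygous" if n == 1 else "homozygous"

for gt in ["AA", "AG", "GG", "--", "G"]:
    print(gt, "->", zygosity(gt, "G"))
```

Only rows where the alternate allele is present are worth carrying into the inheritance-pattern check.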

2. Penetrance varies

Even for dominant Pathogenic variants, penetrance (the % of carriers who develop the disease) is often less than 100%. BRCA1 pathogenic variants confer roughly a 65% lifetime breast cancer risk (published estimates vary by study and population), not 100%. A "Pathogenic" classification doesn't mean certain disease.

3. Review status

ClinVar entries have a review status indicating how robust the classification is:

Review status                                          | Meaning
practice guideline                                     | Highest confidence; incorporated into clinical guidelines
reviewed by expert panel                               | Expert panel curated
criteria provided, multiple submitters, no conflicts   | Strong
criteria provided, conflicting classifications         | Caution: submitters disagree (older records say "conflicting interpretations")
criteria provided, single submitter                    | Moderate confidence
no assertion criteria provided                         | Lowest; be skeptical

For any clinical decision-making, filter for review status ≥ "criteria provided, multiple submitters" at minimum. Treat single-submitter and conflicting interpretations as preliminary.
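ClinVar displays these statuses as a zero-to-four star rating, which makes the "at least multiple submitters" rule easy to express as a numeric threshold. A minimal sketch, with the star values taken from ClinVar's rating scheme and the helper names (`review_stars`, `passes_review`) chosen for illustration. Unknown strings are treated as zero stars.

```python
REVIEW_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting classifications": 1,
    "criteria provided, conflicting interpretations": 1,   # older wording
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
}

def review_stars(status):
    """Map a review status string to its star count (0 if unrecognized)."""
    # The bulk VCF (CLNREVSTAT) writes underscores where the API writes spaces
    key = status.replace("_", " ").strip().lower()
    return REVIEW_STARS.get(key, 0)

def passes_review(status, min_stars=2):
    """True when the status meets the minimum star threshold."""
    return review_stars(status) >= min_stars

print(passes_review("criteria provided, multiple submitters, no conflicts"))
print(passes_review("criteria_provided,_single_submitter"))
```

Applied to the API results, this would be `df[df['review_status'].map(passes_review)]`.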

Practical Screening — Hereditary Cancer Panel Example

A common use case: scan a 23andMe file for variants in major hereditary cancer genes.

HEREDITARY_CANCER_GENES = [
    'BRCA1', 'BRCA2',         # Hereditary Breast and Ovarian Cancer
    'MLH1', 'MSH2', 'MSH6', 'PMS2', 'EPCAM',   # Lynch syndrome
    'TP53',                    # Li-Fraumeni
    'PTEN',                    # Cowden syndrome
    'STK11',                   # Peutz-Jeghers
    'APC',                     # Familial Adenomatous Polyposis
    'CDH1',                    # Hereditary Diffuse Gastric Cancer
    'VHL',                     # Von Hippel-Lindau
    'RB1',                     # Retinoblastoma
    'NF1', 'NF2',              # Neurofibromatosis
    'PALB2', 'CHEK2', 'ATM',   # Moderate-penetrance breast cancer
]

# Restrict screening to known ClinVar Pathogenic variants in these genes
def get_pathogenic_variants_for_genes(genes):
    """Query ClinVar for Pathogenic / Likely pathogenic variants in the given genes."""
    results = {}
    for gene in genes:
        term = f'{gene}[gene] AND (Pathogenic[clinsig] OR "Likely pathogenic"[clinsig])'
        # Large genes can have far more than 500 P/LP entries, so request a
        # bigger page and warn if the ID list is still truncated.
        handle = Entrez.esearch(db='clinvar', term=term, retmax=10000)
        res = Entrez.read(handle)
        handle.close()
        if int(res['Count']) > len(res['IdList']):
            print(f"Warning: {gene} has {res['Count']} matches, only {len(res['IdList'])} retrieved")
        results[gene] = res['IdList']
        time.sleep(0.4)
    return results

# Run once, cache the results
gene_to_clinvar_ids = get_pathogenic_variants_for_genes(HEREDITARY_CANCER_GENES)
total = sum(len(v) for v in gene_to_clinvar_ids.values())
print(f"Total Pathogenic/LP variants in cancer panel: {total:,}")

# Then map ClinVar IDs to rsids and intersect with user data
# (Full implementation uses the bulk clinvar.vcf.gz for speed)

This gives you the universe of variants that, if found in your 23andMe data, deserve attention.
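If you work from the bulk VCF instead of the API, the same gene restriction can be applied locally. Each VCF record carries a GENEINFO value of the form SYMBOL:GeneID, with multiple genes separated by a vertical bar. The sketch below parses that string and tests it against a panel; `gene_symbols` and `in_panel` are hypothetical helper names, and the short `PANEL` list stands in for HEREDITARY_CANCER_GENES.

```python
def gene_symbols(geneinfo):
    """Parse a GENEINFO value like 'BRCA1:672|NBR2:10230' into a set of symbols."""
    if not geneinfo:
        return set()
    return {pair.split(":", 1)[0] for pair in geneinfo.split("|") if pair}

def in_panel(geneinfo, panel):
    """True when any gene on the record belongs to the panel."""
    return bool(gene_symbols(geneinfo) & set(panel))

PANEL = ["BRCA1", "BRCA2", "MLH1"]   # stand-in for the full list above

print(gene_symbols("BRCA1:672|NBR2:10230"))
print(in_panel("BRCA1:672|NBR2:10230", PANEL))
print(in_panel("CFTR:1080", PANEL))
```

In the VCF loop this would be called on `rec.info.get('GENEINFO')` before storing the record.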

Limitations You Should Know

  • 23andMe doesn't probe many rare clinically-actionable variants. Most hereditary cancer Pathogenic variants are rare (<1% frequency) and not on consumer SNP arrays. Negative DTC ≠ negative clinical screen.
  • Structural variants (deletions, duplications, CNVs) are missed. Most ClinVar Pathogenic BRCA variants are SNVs and small indels, which an array can detect only if it carries a probe for them. Others (e.g., BRCA1 BRCT large rearrangements) are CNVs and invisible to 23andMe.
  • Imputation can fill some gaps: services like Sequencing.com use linkage disequilibrium with typed markers to predict variants not directly typed. Quality varies, especially for rare variants.
  • For clinical decision, work with a genetic counselor and order a clinical hereditary cancer panel (Invitae, Ambry, etc.). DTC + ClinVar lookup is screening, not diagnosis.

Rate Limit and Etiquette

NCBI's rate limits (without API key):

  • 3 requests/second
  • Recommended: throttle to 2-2.5/sec to avoid 429 errors
  • Batch via id=id1,id2,id3 in single esummary call (up to ~200 IDs)
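The batching advice above can be sketched as a small chunking helper: split a long ID list into groups of at most 200 and send one esummary call per group. `chunked` is a hypothetical name, and 200 is simply the approximate ceiling mentioned in the list.

```python
def chunked(ids, size=200):
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

ids = [str(n) for n in range(450)]
batches = list(chunked(ids))
print([len(b) for b in batches])

# Each batch then becomes a single request, for example:
#   handle = Entrez.esummary(db='clinvar', id=','.join(batch))
```

For 450 IDs this means three requests instead of 450, which matters far more for staying under the rate limit than any sleep interval.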

With NCBI API key (free, get from https://www.ncbi.nlm.nih.gov/account/):

  • 10 requests/second
  • Should be enough for most scripts

For production-scale lookups (>10,000 variants in tight loops), use the bulk VCF download. Hammering the API for batch jobs will get you rate-limited at minimum.
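For scripts that do stay on the API, a fixed `time.sleep` after every call wastes time when the call itself was slow and is easy to get wrong when one lookup makes several requests. One alternative is a small throttle that enforces a minimum interval between requests. This is a sketch, not part of Biopython; the class name and its injectable `clock` and `sleep` parameters are illustrative choices.

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None   # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(per_second=2.5)
# Call throttle.wait() immediately before every Entrez request, e.g.:
#   throttle.wait(); handle = Entrez.esearch(db='clinvar', term=term)
```

Because the wait happens per request rather than per loop iteration, a lookup that issues both esearch and esummary is paced correctly without extra arithmetic.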

FAQ

Q: ClinVar vs ClinGen vs OMIM vs HGMD — what's the difference?

  • ClinVar: free, NCBI-hosted aggregator of submitted clinical interpretations
  • ClinGen: NIH-funded curation effort that creates expert-panel classifications; many appear in ClinVar
  • OMIM: catalog of human genes and genetic disorders (older, narrative format)
  • HGMD: commercial database of mutation-disease associations; broader but paid subscription

For most workflows, ClinVar is the right default. ClinGen for expert-curated subsets. HGMD if you have institutional access.

Q: Can I get ACMG criteria details (PS1, PM2, BS3, etc.)?

Some ClinVar submissions include them. Look in the full record (efetch XML) under the criteria provided. Most use a subset; reconstructing complete ACMG framework requires VarSome or InterVar (separate tools).

Q: Is there a Python ClinVar API client (not via E-utilities)?

Several wrappers exist: clinvarweb, pyClinVar. They typically wrap the same E-utilities under the hood. Biopython's Bio.Entrez is the most maintained.

Q: How fresh is ClinVar?

The website and the FTP files, including the VCF, are refreshed on a weekly cycle, and a monthly release is archived. API queries reflect the current web data. The variant interpretations themselves are submitted by labs at various cadences: well-known genes like BRCA see frequent updates, rare-disease variants more sporadically.

Q: What about ClinVar VCF GRCh38 vs GRCh37?

23andMe v5 uses GRCh37. WGS files often use GRCh38. Download the VCF from the matching directory: both builds use the filename clinvar.vcf.gz, under pub/clinvar/vcf_GRCh37/ and pub/clinvar/vcf_GRCh38/ respectively. Or use rsid-based matching, which does not depend on coordinates.
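A small helper can keep the build choice explicit in a pipeline, so a GRCh38 VCF is never compared against GRCh37 coordinates by accident. A minimal sketch; `clinvar_vcf_url` is a hypothetical name, and the directory layout is the one used by the GRCh37 download URL earlier in this guide.

```python
CLINVAR_FTP = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar"

def clinvar_vcf_url(build="GRCh37"):
    """Return the bulk VCF URL for the given genome build."""
    if build not in ("GRCh37", "GRCh38"):
        raise ValueError(f"unsupported build: {build!r}")
    # Same filename for both builds; only the directory differs
    return f"{CLINVAR_FTP}/vcf_{build}/clinvar.vcf.gz"

print(clinvar_vcf_url("GRCh37"))
print(clinvar_vcf_url("GRCh38"))
```

Saving the download under a build-specific local name (for example clinvar_GRCh37.vcf.gz) avoids mixing the two files later.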

Q: Can I submit to ClinVar?

Yes — clinical labs, research labs, and individuals can submit. There's a web submission portal. Useful if you find a novel variant and have evidence; helps the community.

Q: Why does the same rsid sometimes have multiple ClinVar entries?

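Because one rsid can map to more than one ClinVar record. dbSNP may group several alternate alleles at the same position under a single rsid, and ClinVar also holds records for haplotypes or genotypes that include the variant. The IDs returned by esearch are ClinVar Variation IDs (one per variant or variant set); the variant-condition pairs are the RCV accessions listed inside each record.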

Closing — The Workflow

For batch screening of 23andMe / WGS data against clinical relevance:

  1. Download ClinVar bulk VCF (clinvar.vcf.gz, re-downloaded periodically) → local rsid → significance map
  2. Match user variants to this map (fast, no API calls)
  3. Filter to Pathogenic / Likely Pathogenic with strong review status
  4. Check zygosity + inheritance pattern for each hit
  5. Discuss with a genetic counselor before acting on any finding

The code in this guide is a starting point. For meaningful clinical screening, integrate with a hereditary disease panel approach (see BRCA + Hereditary Cancer Guide) and confirm any hits via clinical-grade testing.

