Reading 23andMe Raw Data for CYP2D6 Star Alleles in Python — Why DTC Often Misses *5 Deletion

TL;DR (Quick Answer)

Can you use 23andMe data for CYP2D6 pharmacogenomics? For common SNP-defined alleles (*2, *3, *4, *6, *10, *17), yes. For the clinically critical CYP2D6 *5 whole-gene deletion and for gene duplications, no, because the SNP genotype calls in the raw data file carry no copy-number information.

Why CYP2D6 matters — it metabolizes roughly a quarter of all prescribed drugs, including codeine and tramadol (prodrug activation), many antidepressants, and tamoxifen.
What you can call — SNP-based star alleles *2/*3/*4/*6/*10/*17 from 23andMe raw data, mapping to a Poor / Intermediate / Normal / Ultrarapid phenotype.
What you cannot call — *5 (whole-gene deletion) and *xN (duplications) are CNVs invisible to SNP arrays. This is the single most common DTC-PGx interpretation error.
Clinical bottom line — treat a DTC CYP2D6 result as a screen, not a diagnosis; confirm any actionable finding with a clinical PGx panel.

What CYP2D6 Star Alleles Mean

CYP2D6 is a highly polymorphic liver cytochrome-P450 enzyme that metabolizes roughly 20–25% of clinically used drugs. Each person inherits two alleles, named with "star" nomenclature (*1, *2, *4, *5 …), which combine into an activity score and a metabolizer phenotype. Star-allele definitions are maintained by PharmVar (CYP2D6), drug–gene evidence by PharmGKB, prescribing guidance by the CPIC guidelines, and label biomarkers by the FDA Table of Pharmacogenomic Biomarkers in Drug Labeling.
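To make the activity-score idea concrete, here is a minimal sketch of the genotype-to-phenotype step. The activity values and cut-offs follow the CPIC consensus translation (Caudle et al. 2020, listed in the references) for a small subset of alleles; the dictionary and function names are illustrative, and the current CPIC allele functionality table should be treated as the source of truth.

```python
# Activity values for a subset of alleles (CPIC consensus, Caudle et al. 2020).
# This is an illustrative subset, not the full CPIC table.
ACTIVITY = {
    "*1": 1.0, "*2": 1.0,                        # normal function
    "*3": 0.0, "*4": 0.0, "*5": 0.0, "*6": 0.0,  # no function
    "*10": 0.25, "*17": 0.5,                     # decreased function
}

def phenotype(allele_a, allele_b):
    """Sum the two allele activity values and map the score to a phenotype."""
    score = ACTIVITY[allele_a] + ACTIVITY[allele_b]
    if score == 0:
        label = "Poor"
    elif score < 1.25:
        label = "Intermediate"
    elif score <= 2.25:
        label = "Normal"
    else:
        label = "Ultrarapid"
    return score, label

print(phenotype("*1", "*1"))  # (2.0, 'Normal')
print(phenotype("*1", "*4"))  # (1.0, 'Intermediate')
print(phenotype("*4", "*4"))  # (0.0, 'Poor')
```

The sketch assumes exactly one gene copy per chromosome. That assumption is the one a SNP-only file cannot check, which is why the *5 and duplication caveats matter.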

"Can I Use My 23andMe Data for Pharmacogenomics?"

Short answer: for some CYP2D6 alleles, yes; for the most clinically important structural variant, CYP2D6 *5 (whole-gene deletion), no. Understanding which is which prevents the most common DTC-PGx interpretation mistake.

This tutorial walks through:

How to extract CYP2D6-related SNPs from a 23andMe raw data file in Python
How those SNPs map to common CYP2D6 star alleles (*2, *3, *4, *6, *10, *17)
Why *5 (gene deletion) is invisible to SNP arrays and what that means for your interpretation
Tools (Promethease, FoundMyFitness, Sequencing.com) that automate this, and their limits

It's a worked example of the broader principle covered in Pharmacogenomics (PGx) Complete Guide 2026: SNP-based DTC tests catch the common variants and miss structural variants.

What a 23andMe Raw Data File Looks Like

After downloading your raw data from 23andMe (Settings → Browse Raw Data → Download), you get a tab-separated text file. The first lines:

# rsid	chromosome	position	genotype
rs4477212	1	82154	AA
rs3094315	1	752566	AG
rs3131972	1	752721	AG
...

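The file holds about 600,000-700,000 markers, depending on chip version. Each row gives the SNP ID (rsid), chromosome, position, and genotype (two alleles, reported on the plus strand of the reference genome). No-calls appear as --, insertions and deletions are coded as I and D rather than as bases, and some markers carry a 23andMe-internal ID starting with i instead of an rsid.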

Loading the File in Python

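import pandas as pd

# The header row starts with '#', so comment='#' skips it together with the
# other comment lines; the column names are supplied explicitly instead.
snps = pd.read_csv(
    'genome_Firstname_Lastname_Full_v5_Full_20260101230000.txt',
    sep='\t',
    comment='#',
    names=['rsid', 'chromosome', 'position', 'genotype'],
    # chromosome stays a string because it also takes the values X, Y and MT
    dtype={'rsid': str, 'chromosome': str, 'position': int, 'genotype': str},
)

print(f"Total SNPs: {len(snps):,}")
print("Sample row:")
print(snps.head())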

Expected output:

Total SNPs: 638,524
Sample row:
        rsid chromosome  position genotype
0  rs4477212          1     82154       AA
1  rs3094315          1    752566       AG
...
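Before interpreting any genotype, it helps to know what kinds of values the genotype column contains. The sketch below sorts genotype strings into coarse categories; it assumes the 23andMe conventions of -- for no-calls, I/D for indels, and single-letter genotypes on chromosomes that are present in one copy. The function names are illustrative.

```python
from collections import Counter

def classify_genotype(gt):
    """Put one genotype string into a coarse category."""
    if not gt or "-" in gt:
        return "no_call"
    if set(gt) <= {"I", "D"}:
        return "indel"
    if set(gt) <= {"A", "C", "G", "T"}:
        return "single_copy" if len(gt) == 1 else "snp"
    return "other"

def summarize(genotypes):
    """Count how many genotypes fall into each category."""
    return Counter(classify_genotype(g) for g in genotypes)

example = ["AA", "AG", "--", "DI", "II", "C", "TT"]
counts = summarize(example)
print(counts["snp"], counts["indel"], counts["no_call"], counts["single_copy"])  # 3 2 1 1
```

On the loaded table, the same function can be applied with snps['genotype'].map(classify_genotype).value_counts().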

CYP2D6 — Where It Lives in the Genome

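CYP2D6 sits on chromosome 22 (22q13.2), on the minus strand. In GRCh37/hg19, the build 23andMe raw data uses, the gene spans roughly 42,522,500-42,526,900. In GRCh38/hg38 the same gene is at roughly 42,126,500-42,130,900, so a window written for one build finds nothing in a file that uses the other. The comment lines at the top of your file state which build it uses.

To extract CYP2D6-region SNPs (GRCh37 coordinates, with a few kb of padding on each side):

cyp2d6_snps = snps[
    (snps['chromosome'] == '22') &
    (snps['position'].between(42_520_000, 42_530_000))
].copy()

print(f"CYP2D6 region SNPs detected: {len(cyp2d6_snps)}")
print(cyp2d6_snps)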

Typical result: 8-15 SNPs in the CYP2D6 region. That sounds like a lot, but most are intronic. The clinically meaningful ones are fewer.
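Because a position-based filter only works with coordinates from the matching genome build, one option is to read the build from the file's comment header and pick the window accordingly. The sketch below is an assumption-laden example: the header wording is what 23andMe files have typically contained, the window coordinates are approximate, and the names CYP2D6_WINDOW and detect_build are hypothetical.

```python
import re

# Approximate CYP2D6 windows per build; the padding is an arbitrary choice.
CYP2D6_WINDOW = {
    37: ("22", 42_520_000, 42_530_000),
    38: ("22", 42_124_000, 42_134_000),
}

def detect_build(lines):
    """Return the build number mentioned in the leading '#' lines, or None."""
    for line in lines:
        if not line.startswith("#"):
            break  # the comment header is over
        match = re.search(r"build\s+(\d+)", line, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None

# Header lines assumed to resemble a real file
header = [
    "# This data file generated by 23andMe",
    "# We are using reference human assembly build 37 (also known as Annotation Release 104).",
    "# rsid\tchromosome\tposition\tgenotype",
]
build = detect_build(header)
chrom, start, end = CYP2D6_WINDOW[build]
print(build, chrom, start, end)  # 37 22 42520000 42530000
```

Looking markers up by rsid avoids the problem entirely, which is why the star-allele lookup later in this tutorial is keyed on rsid.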

Star Allele Lookup Table

CYP2D6 star alleles are named haplotypes (combinations of variants) that determine enzyme function. The table lists the most common defining variants. Alleles are given on the plus strand, which is how 23andMe reports genotypes. CYP2D6 lies on the minus strand, so gene-oriented notation such as PharmVar's shows the complementary base: the *4 variant, for example, is written G>A there and appears as C>T in the raw data.

Star allele	Function	Defining variants (rsid, variant allele on the plus strand)	Population frequency
*2	Normal function (variant)	rs16947 (A), rs1135840 (G)	Common in all populations
*3	No function	rs35742686 (single-base deletion)	European 1-2%
*4	No function	rs3892097 (T), rs1065852 (A)	European 12-20%
*5	No function (whole-gene deletion)	NOT DETECTABLE from SNP genotypes	Asian 5-7%, European 2%
*6	No function	rs5030655 (single-base deletion)	European 1%
*10	Decreased function	rs1065852 (A), without rs3892097 (T)	East Asian 40-50%
*17	Decreased function	rs28371706 (A)	African 20%
*41	Decreased function	rs28371725 (T)	European 8-9%

The key rsids, rs1065852 (*10), rs3892097 (*4), rs28371706 (*17), and rs28371725 (*41), are usually present on 23andMe arrays. rs5030655 (*6) and rs35742686 (*3) may or may not be, depending on chip version.
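One detail is easy to get wrong when filling in a table like this. 23andMe reports genotypes on the plus strand, while CYP2D6 is a minus-strand gene. An allele copied from a gene-oriented source therefore has to be complemented before it is compared with the raw data. A minimal sketch, with hypothetical helper names:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def to_plus_strand(gene_strand_allele):
    """Complement a single-base allele written relative to a minus-strand gene."""
    return COMPLEMENT[gene_strand_allele]

def is_strand_ambiguous(ref, alt):
    """A/T and C/G SNPs look the same on both strands, so the alleles
    alone cannot tell you which orientation a source is using."""
    return COMPLEMENT[ref] == alt

# The *4 marker rs3892097 is G>A in gene orientation
print(to_plus_strand("G"), to_plus_strand("A"))  # C T
print(is_strand_ambiguous("G", "A"))             # False
print(is_strand_ambiguous("G", "C"))             # True
```

For ambiguous SNPs, the only safe check is the source's documented strand convention, since the letters themselves do not reveal it.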

Python — Detect Common Star Alleles

# Build a lookup of clinically important CYP2D6 SNPs.
# 'variant' is the allele as it appears in the raw data: the plus-strand base
# for SNPs, and 'D' for the single-base deletions (23andMe codes indels as
# I/D; check how your own file codes them).
cyp2d6_lookup = {
    'rs3892097':  {'name': '*4',  'variant': 'T', 'effect': 'no function'},
    'rs1065852':  {'name': '*10', 'variant': 'A', 'effect': 'decreased function'},
    'rs5030655':  {'name': '*6',  'variant': 'D', 'effect': 'no function'},
    'rs35742686': {'name': '*3',  'variant': 'D', 'effect': 'no function'},
    'rs28371706': {'name': '*17', 'variant': 'A', 'effect': 'decreased function'},
    'rs28371725': {'name': '*41', 'variant': 'T', 'effect': 'decreased function'},
    'rs16947':    {'name': '*2',  'variant': 'A', 'effect': 'normal'},
}

COPY_LABELS = {0: 'no', 1: 'heterozygous (1 copy)', 2: 'homozygous (2 copies)'}

def check_star_alleles(snps_df, lookup):
    # Index by rsid once instead of filtering the whole frame for every marker
    genotypes = snps_df.drop_duplicates('rsid').set_index('rsid')['genotype']
    results = []
    for rsid, info in lookup.items():
        if rsid not in genotypes.index:
            # The marker is not on this chip version
            gt, carries = 'NOT_TESTED', '?'
        else:
            gt = genotypes[rsid]
            if '-' in gt:
                # '--' is a no-call: on the chip, but no usable result
                carries = 'no call'
            else:
                carries = COPY_LABELS.get(gt.count(info['variant']), '?')
        results.append({
            'rsid': rsid, 'allele': info['name'], 'effect': info['effect'],
            'genotype': gt, 'carries_variant': carries,
        })
    return pd.DataFrame(results)

# Look up by rsid in the full table, so the result does not depend on the
# coordinates used for the region filter
stars = check_star_alleles(snps, cyp2d6_lookup)
print(stars)

Sample output for a hypothetical Korean user:

   rsid          allele  effect              genotype    carries_variant
0  rs3892097     *4      no function         CC          no
1  rs1065852     *10     decreased function  AG          heterozygous (1 copy)
2  rs5030655     *6      no function         NOT_TESTED  ?
3  rs35742686    *3      no function         NOT_TESTED  ?
4  rs28371706    *17     decreased function  GG          no
5  rs28371725    *41     decreased function  CC          no
6  rs16947       *2      normal              AG          heterozygous (1 copy)

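Interpretation: this person carries one copy of the *10-defining variant (decreased function) and one copy of the rs16947 variant found in *2 (normal function). If the two variants sit on different chromosomes, the diplotype is *2/*10. Under the CPIC consensus translation (Caudle et al. 2020), *10 has an activity value of 0.25 and *2 a value of 1, so the activity score is 1.25, at the low end of the normal metabolizer range (1.25 to 2.25) rather than in the intermediate range. Two caveats apply. The raw data is unphased, so both variants could sit on the same chromosome. And rs16947 is shared with other haplotypes such as *17 and *41, so on its own it does not prove *2. Treat the call as a hypothesis to confirm, not as a basis for changing codeine or SSRI dosing.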

Why *5 Is Invisible to 23andMe (And Every Other SNP Array)

CYP2D6 *5 is a whole-gene deletion: the entire CYP2D6 gene is missing on that chromosome. No SNP exists to detect "absence." A SNP array probes specific positions; if the gene is deleted on one chromosome, the probes bind only the remaining copy, so every CYP2D6 SNP looks homozygous for whatever that copy carries. If both copies are deleted, the probes return no-calls.

Frequency:

East Asian populations: 5-7% carry at least one *5 allele
European populations: ~2%
African populations: ~3-7%

A Korean person with a *5/*10 genotype has a single decreased-function copy (activity score 0.25), but the raw data shows the *10 variant as homozygous, so it reads as *10/*10 (activity score 0.5). Likewise, *1/*5 (activity score 1, intermediate metabolizer) reads as *1/*1 (activity score 2, normal metabolizer). Clinically, that can be the difference between standard dosing and a dose adjustment or an alternative drug.

The DTC interpretation trap: if your raw data shows no variant alleles at all, you might assume you are a normal metabolizer. But if one chromosome carries *5 (undetected), you have only one working copy and are actually an intermediate metabolizer.
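The effect of a deleted copy on the reported genotype can be shown with a toy model. This is a simplified sketch, not a model of real array chemistry: each haplotype is a dict of bases at a hypothetical marker called snp1, a deleted gene copy is None, and the function name is illustrative.

```python
def array_genotype(hap_a, hap_b, marker):
    """Genotype string an array would typically show at one marker.
    Each haplotype is a dict {marker: base}, or None if that gene copy is deleted."""
    present = [h[marker] for h in (hap_a, hap_b) if h is not None]
    if not present:
        return "--"              # both copies deleted: nothing to bind to
    if len(present) == 1:
        return present[0] * 2    # one copy left: the call looks homozygous
    return "".join(sorted(present))

REFERENCE = {"snp1": "C"}   # copy without the variant
VARIANT = {"snp1": "T"}     # copy carrying the variant
DELETED = None              # *5: the whole gene copy is missing

print(array_genotype(REFERENCE, VARIANT, "snp1"))    # CT
print(array_genotype(REFERENCE, REFERENCE, "snp1"))  # CC
print(array_genotype(REFERENCE, DELETED, "snp1"))    # CC, same as two reference copies
print(array_genotype(DELETED, DELETED, "snp1"))      # --
```

The second and third calls print the same genotype, which is the whole problem: the file cannot distinguish two reference copies from one reference copy plus a deletion.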

What about CYP2D6 duplications (ultra-rapid metabolizers)?

Same problem in the opposite direction. One chromosome can carry two or more copies of CYP2D6 (up to 13 have been reported). When the extra copies are functional and the total activity score exceeds 2.25, the person is an ultrarapid metabolizer: codeine converts to morphine too fast, with a risk of respiratory depression.

SNP arrays detect alleles, not copy number. 23andMe does not detect duplications.

How Clinical PGx Labs Detect *5 and Duplications

Clinical-grade tests physically check whether the CYP2D6 gene is present and how many copies exist. Methods include:

AmpliChip CYP450 (older but still used)
TaqMan copy number assays
Long-range PCR with primers spanning the deletion breakpoints
Whole genome / long-read sequencing (PacBio HiFi or Oxford Nanopore) — gold standard

These are not what DTC services offer. For clinical decisions involving CYP2D6, a proper PGx test through a clinical lab is needed.

Comparison — DTC Tools That Automate This Workflow

You don't have to write Python — multiple tools parse 23andMe raw data for PGx:

Tool	What it does	Cost	Catches *5?
Promethease	Cross-references all SNPs with SNPedia	$12 one-time	❌ (SNP-based)
FoundMyFitness	Curated reports including PGx	$20/month	❌
Sequencing.com	PGx + ancestry + others	Free + paid	❌
Genetic Genie	Free PGx report	Free	❌
CodonPro PGx (formerly Codon)	Clinical-style PGx panel	$50-100	Some via imputation
Clinical PGx panel (university hospital)	Targeted sequencing	$200-500	✅

All DTC-based tools share the SNP-array limitation. If you're using DTC data for actual prescription decisions, work with a doctor who orders a proper clinical PGx panel for the questions DTC can't answer.

Related: DTC Genetic Testing 2026 Complete Buyer's Guide compares 10 DTC services on accuracy and coverage.

Beyond CYP2D6 — Other Important PGx Genes on 23andMe

Same approach applies to other CYP genes:

Gene	Star alleles on 23andMe?	Detect deletion?
CYP2C19	*2, *3, *17: common variants detected	No CNV detection
CYP2C9	*2, *3 detected; population-specific alleles less well covered	No CNV detection
CYP3A5	*3 detected	No CNV detection
DPYD	Common variants (*2A, *13) sometimes	No CNV detection
TPMT	Common variants detected	No CNV detection
VKORC1	rs9923231 detected (relevant for warfarin)	N/A
HLA-B*57:01 (abacavir)	Partial via tagging SNPs	Imputation only
HLA-B*15:02 (carbamazepine, Asian risk)	NOT reliably detectable by DTC	Needs clinical typing

HLA typing in particular is poorly served by DTC arrays. For HLA-B*15:02 (severe SJS/TEN risk with carbamazepine in Asian populations) or HLA-B*57:01 (abacavir hypersensitivity), clinical HLA typing is required.

Practical Workflow — From 23andMe to Useful PGx Info

Download your 23andMe raw data
Run Promethease or this Python code for an initial overview
Identify which medications in your future or current regimen might involve CYP2D6, CYP2C19, CYP2C9 metabolism (your doctor can confirm)
For those medications, check your detected variants — but treat them as a starting hypothesis, not a clinical conclusion
If a critical medication is involved (especially codeine/tramadol, clopidogrel, warfarin, abacavir, carbamazepine, certain SSRIs/antipsychotics), request a clinical PGx panel from your physician
Never stop or change a prescription based on DTC results alone
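The triage step in this workflow can be sketched as a simple lookup from medication to the genes discussed in this article. The drug-gene pairs below cover only the examples named here and are not a complete or clinical list; DRUG_GENES and triage are hypothetical names.

```python
# Drug-gene pairs for the medications named in this article (not exhaustive)
DRUG_GENES = {
    "codeine": ["CYP2D6"],
    "tramadol": ["CYP2D6"],
    "clopidogrel": ["CYP2C19"],
    "warfarin": ["CYP2C9", "VKORC1"],
    "abacavir": ["HLA-B*57:01"],
    "carbamazepine": ["HLA-B*15:02"],
}

def triage(medications):
    """Return a note per medication saying which genes to raise with a physician."""
    report = {}
    for med in medications:
        genes = DRUG_GENES.get(med.lower())
        if genes is None:
            report[med] = "not in this list; ask your prescriber"
        else:
            report[med] = "discuss a clinical PGx panel covering " + ", ".join(genes)
    return report

for med, note in triage(["Codeine", "warfarin", "ibuprofen"]).items():
    print(f"{med}: {note}")
```

The output is a list of questions to bring to a physician, in line with the last step above, and not a prescribing recommendation.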

FAQ

Q: Will the Python code above work on AncestryDNA raw data too? Different chip, slightly different SNP set. The lookup logic is the same, but check whether AncestryDNA includes rs3892097 and rs1065852 (usually yes for these well-known ones; check your specific file). Other DTC providers (MyHeritage, FamilyTreeDNA) similarly cover the common rsid set.
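Whichever provider the file comes from, a quick coverage check shows which of the markers you care about are actually present. A minimal sketch with a hypothetical helper name:

```python
def coverage(file_rsids, wanted):
    """Split the wanted rsids into those present in the file and those missing."""
    available = set(file_rsids)
    present = [r for r in wanted if r in available]
    missing = [r for r in wanted if r not in available]
    return present, missing

# Toy example: a file that contains two of three requested markers
file_rsids = ["rs4477212", "rs3892097", "rs1065852"]
wanted = ["rs3892097", "rs1065852", "rs5030655"]
present, missing = coverage(file_rsids, wanted)
print("present:", present)  # present: ['rs3892097', 'rs1065852']
print("missing:", missing)  # missing: ['rs5030655']
```

A missing marker means "not tested", which is different from "variant absent".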

Q: My 23andMe file uses hg19 vs hg38 coordinates — does that matter? For star allele detection via rsid, no — rsid is the same across builds. Coordinates differ between hg19 and hg38, so if you're querying by position rather than rsid, use the correct reference.

Q: How accurate is 23andMe's own "Pharmacogenetic Reports" feature? 23andMe's FDA-cleared PGx reports cover a limited set (CYP2C19, CYP2D6, CYP3A5, others) with documented variants. They explicitly note the *5 and duplication limitations. For the variants they report, accuracy is high. For what they don't report, you're back to clinical testing.

Q: Can I use 23andMe data to choose between antidepressants? Hypothesis-generating only. Many SSRIs/SNRIs are CYP2D6 substrates (paroxetine, fluoxetine, venlafaxine). If you carry *4 or are *10/*10 homozygous, you may be at higher risk for side effects — discuss with your psychiatrist. They may order a clinical PGx panel before prescribing.

Q: Why does my DTC report disagree with another DTC tool on the same raw data? Different tools use different variant→star-allele mapping rules, especially for combinations. PharmVar allele definitions are the standard; tools that deviate may be using older or proprietary mappings.
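A small example of how two mapping rules can disagree on identical input: rs1065852 appears in the definitions of both *4 and *10, so a tool that reports every matching marker gives a different answer from one that checks for *4 first. Both rules below are simplified illustrations with hypothetical names, not the logic of any specific tool.

```python
def naive_rule(variants):
    """Report every star allele whose marker SNP is present."""
    calls = []
    if "rs3892097" in variants:
        calls.append("*4")
    if "rs1065852" in variants:
        calls.append("*10")
    return calls

def haplotype_aware_rule(variants):
    """Treat rs1065852 as *10 only when the *4 marker is absent,
    because *4 haplotypes usually carry rs1065852 as well."""
    if "rs3892097" in variants:
        return ["*4"]
    if "rs1065852" in variants:
        return ["*10"]
    return []

carried = {"rs3892097", "rs1065852"}  # variants seen on one chromosome
print(naive_rule(carried))            # ['*4', '*10']
print(haplotype_aware_rule(carried))  # ['*4']
```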

Q: Are there tools that try to detect *5 by imputation? Some research-grade tools use linkage disequilibrium and surrounding SNP patterns to probabilistically infer CNVs. They're not validated for clinical use. The honest answer for clinical decisions: get a real PGx panel.

Q: Can I run this in R instead of Python? Yes — readr::read_tsv() + dplyr::filter() does the same thing. Look at packages like pgxgene if you want pre-built PGx scoring.

Q: Where do I find the latest star allele definitions? PharmVar (https://www.pharmvar.org) — the authoritative human pharmacogene variation database, free to access. Updated regularly with new alleles.

Closing — Key Takeaways

SNP-based DTC tests catch common CYP2D6 alleles like *4, *10, *17 — well enough to flag intermediate or poor metabolizer status for many medications
They miss structural variants, most importantly CYP2D6 *5 (whole-gene deletion) and gene duplications, which means the most clinically extreme phenotypes can be misclassified
For prescription decisions, especially for codeine/tramadol, clopidogrel, warfarin, abacavir, carbamazepine — get a clinical-grade PGx panel through a hospital or specialty lab
DTC PGx data is great for ancestry-style "interesting facts about my drug metabolism" exploration; it's not great for sole-basis prescription decisions

The Python code above gives you a complete starting view of your CYP2D6 SNP-detectable alleles. Combined with the PGx Complete Guide 2026, it should give you the framework to know when DTC data is enough and when it isn't.

References:

PharmVar — authoritative star allele definitions: https://www.pharmvar.org
PharmGKB — clinical pharmacogenomics knowledge base: https://www.pharmgkb.org
CPIC guidelines for CYP2D6: https://cpicpgx.org
Gaedigk, A. et al. (2008). The CYP2D6 activity score. Clinical Pharmacology & Therapeutics, 83, 234-242.
Caudle, K. E. et al. (2020). Standardizing CYP2D6 genotype to phenotype translation: consensus recommendations from CPIC and DPWG. Clinical and Translational Science, 13, 116-124.

⚠️ Medical disclaimer: This article is for educational purposes. DTC genetic data and the analyses shown are not a substitute for clinical PGx testing or physician advice. Do not change or stop medications based on DTC test results alone.