Reading 23andMe Raw Data for CYP2D6 Star Alleles in Python — Why DTC Often Misses *5 Deletion
Step-by-step Python tutorial: parse a 23andMe raw data file, look up CYP2D6 star alleles from SNPs, and understand why some clinically important variants (like CYP2D6 *5 whole-gene deletion) are invisible to SNP-based DTC testing. Includes code for *2, *3, *4, *6, *10, *17 detection and a guide to which alleles you can and can't catch.
"Can I Use My 23andMe Data for Pharmacogenomics?"
Short answer: for some CYP2D6 alleles, yes; for one of the most clinically important variants, CYP2D6 *5 (whole-gene deletion), no. Understanding which is which prevents the most common DTC-PGx interpretation mistake.
This tutorial walks through:
- How to extract CYP2D6-related SNPs from a 23andMe raw data file in Python
- How those SNPs map to common CYP2D6 star alleles (*2, *3, *4, *6, *10, *17)
- Why *5 (gene deletion) is invisible to SNP arrays and what that means for your interpretation
- Tools (Promethease, FoundMyFitness, Sequencing.com) that automate this, and their limits
It's a worked example of the broader principle covered in Pharmacogenomics (PGx) Complete Guide 2026: SNP-based DTC tests catch the common variants and miss structural variants.
What a 23andMe Raw Data File Looks Like
After downloading your raw data from 23andMe (Settings → Browse Raw Data → Download), you get a tab-separated text file. The first lines:
# rsid chromosome position genotype
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
...
The file holds about 600,000-700,000 SNPs, depending on chip version. Each row gives the SNP ID (rsid), chromosome, position, and genotype (two alleles, or -- when the chip produced no call).
Loading the File in Python
import pandas as pd

# Header lines start with '#', so comment='#' skips them
snps = pd.read_csv(
    'genome_Firstname_Lastname_Full_v5_Full_20260101230000.txt',
    sep='\t',
    comment='#',
    names=['rsid', 'chromosome', 'position', 'genotype'],
    dtype={'rsid': str, 'chromosome': str, 'position': int, 'genotype': str},
)
print(f"Total SNPs: {len(snps):,}")
print(snps.head())
Expected output:
Total SNPs: 638,524
rsid chromosome position genotype
0 rs4477212 1 82154 AA
1 rs3094315 1 752566 AG
...
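As a dependency-free cross-check on the pandas load, the sketch below reads the same four-column layout with the standard library and counts no-calls. The sample rows, `read_genotypes`, and `summarize_calls` are illustrative names, not part of any 23andMe tooling, and the sketch assumes no-calls are written as `--`.

```python
import csv
import io

# Tiny stand-in for a raw data file; a real export has hundreds of thousands of rows.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t82154\tAA\n"
    "rs3094315\t1\t752566\tAG\n"
    "rs3131972\t1\t752721\t--\n"
)

def read_genotypes(handle):
    """Map rsid -> genotype, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith('#'))
    rows = csv.reader(data_lines, delimiter='\t')
    return {row[0]: row[3] for row in rows if len(row) == 4}

def summarize_calls(genotypes):
    """Count total markers and how many of them are no-calls."""
    no_calls = sum(1 for gt in genotypes.values() if gt == '--')
    return {'total': len(genotypes), 'no_calls': no_calls}

genotypes = read_genotypes(io.StringIO(SAMPLE))
summary = summarize_calls(genotypes)
print(summary)  # {'total': 3, 'no_calls': 1}
```

With a real file, pass `open(path)` instead of the `StringIO` object; the total should match the pandas row count.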
CYP2D6 — Where It Lives in the Genome
CYP2D6 is on chromosome 22q13.2, at approximately position 42,522,500-42,527,000 in GRCh37/hg19 coordinates, the build 23andMe v5 raw data uses. In GRCh38/hg38 the same gene sits near 42,126,500-42,131,000, so check which build your file reports before filtering by position.
To extract CYP2D6-region SNPs:
cyp2d6_snps = snps[
    (snps['chromosome'] == '22') &
    (snps['position'].between(42_520_000, 42_530_000))  # GRCh37/hg19 window around CYP2D6
].copy()
print(f"CYP2D6 region SNPs detected: {len(cyp2d6_snps)}")
print(cyp2d6_snps)
Typical result: 8-15 SNPs in the CYP2D6 region. That sounds like a lot, but most are intronic. The clinically meaningful ones are fewer.
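Because a region filter depends on the coordinate build, looking markers up by rsid is a useful cross-check. The sketch below splits a list of CYP2D6 rsids (taken from the star allele table in this article) into called, no-call, and absent-from-chip groups. The `coverage` helper and the example genotypes are hypothetical.

```python
CYP2D6_RSIDS = [
    'rs3892097', 'rs1065852', 'rs5030655', 'rs35742686',
    'rs28371706', 'rs28371725', 'rs16947', 'rs1135840',
]

def coverage(genotypes, wanted):
    """Split wanted rsids into called, no-call, and absent-from-chip lists."""
    report = {'called': [], 'no_call': [], 'absent': []}
    for rsid in wanted:
        gt = genotypes.get(rsid)
        if gt is None:
            report['absent'].append(rsid)      # probe not on this chip version
        elif gt == '--':
            report['no_call'].append(rsid)     # probe present, no genotype produced
        else:
            report['called'].append(rsid)
    return report

# Hypothetical genotypes keyed by rsid, e.g. dict(zip(snps['rsid'], snps['genotype']))
example = {'rs3892097': 'CC', 'rs1065852': 'AG', 'rs16947': '--'}
report = coverage(example, CYP2D6_RSIDS)
print(report['called'])       # ['rs3892097', 'rs1065852']
print(report['no_call'])      # ['rs16947']
print(len(report['absent']))  # 5
```

An rsid that is absent or a no-call says nothing about the allele, so it should never be read as "variant not present".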
Star Allele Lookup Table
CYP2D6 star alleles are named haplotypes (combinations of variants) that determine enzyme function. The most common defining SNPs:
| Star allele | Function | Defining SNPs (rsid, variant allele) | Approx. frequency by population |
|---|---|---|---|
| *2 | Normal function (variant) | rs16947 (G), rs1135840 (C) | Common in all |
| *3 | No function | rs35742686 (-A deletion) | European 1-2% |
| *4 | No function | rs3892097 (A), rs1065852 (T) | European 12-20% |
| *5 | No function (whole gene deletion) | NOT DETECTABLE by SNP array | Asian 5-7%, Eur 2% |
| *6 | No function | rs5030655 (-T deletion) | European 1% |
| *10 | Decreased function | rs1065852 (T) | Asian 40-50% |
| *17 | Decreased function | rs28371706 (T) | African 20% |
| *41 | Decreased function | rs28371725 (T) | European 8-9% |
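The ones to look for: rs1065852 (*10), rs3892097 (*4), rs28371706 (*17), and rs28371725 (*41) are usually present on 23andMe arrays. rs5030655 (*6) and rs35742686 (*3) may or may not be, depending on chip version. Two caveats apply when reading them. First, rs1065852 (T) appears on both *4 and *10 haplotypes in the table above, so it points to *10 only when rs3892097 is not also variant. Second, CYP2D6 lies on the reverse strand of chromosome 22, while 23andMe reports genotypes on the forward strand, so a variant listed as A or T in a star-allele table may show up as its complement (T or A) in your file. Confirm each rsid's orientation in dbSNP before matching letters.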
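Matching allele letters is only safe when the table and the file use the same strand. CYP2D6 is transcribed from the reverse strand, and star-allele tables often quote gene-strand letters, whereas 23andMe raw files report the forward strand. The sketch below shows one way to handle that. `count_variant` and its `variant_on_reverse_strand` flag are hypothetical names, and which rsids need the flag is something to confirm per marker in dbSNP.

```python
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def count_variant(genotype, variant, variant_on_reverse_strand=False):
    """Count copies of a single-base variant allele in a two-letter genotype.

    Returns None for no-calls or indel codes, where letter matching does not apply.
    """
    if len(genotype) != 2 or any(base not in 'ACGT' for base in genotype):
        return None
    if variant_on_reverse_strand:
        # Convert the gene-strand letter to its forward-strand complement
        variant = variant.translate(COMPLEMENT)
    return genotype.count(variant)

print(count_variant('CT', 'A', variant_on_reverse_strand=True))  # 1
print(count_variant('CT', 'A'))                                   # 0
print(count_variant('--', 'A'))                                   # None
```

The second call shows the failure mode: without the strand conversion, a heterozygous carrier is silently counted as carrying zero copies.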
Python — Detect Common Star Alleles
# Build a lookup of clinically important CYP2D6 SNPs
cyp2d6_lookup = {
    'rs3892097': {'name': '*4', 'variant': 'A', 'effect': 'no function'},
    'rs1065852': {'name': '*10', 'variant': 'T', 'effect': 'decreased function'},
    'rs5030655': {'name': '*6', 'variant': '-', 'effect': 'no function'},
    'rs35742686': {'name': '*3', 'variant': '-', 'effect': 'no function'},
    'rs28371706': {'name': '*17', 'variant': 'T', 'effect': 'decreased function'},
    'rs28371725': {'name': '*41', 'variant': 'T', 'effect': 'decreased function'},
    'rs16947': {'name': '*2', 'variant': 'G', 'effect': 'normal'},
}

# Heterozygous = 1 copy, homozygous = 2 copies
COPIES = {0: 'no', 1: 'heterozygous (1 copy)', 2: 'homozygous (2 copies)'}

def check_star_alleles(snps_df, lookup):
    """For each rsid in lookup, report how many copies of the variant allele are present."""
    results = []
    for rsid, info in lookup.items():
        row = snps_df[snps_df['rsid'] == rsid]
        if len(row) == 0:
            results.append({
                'rsid': rsid, 'allele': info['name'], 'effect': info['effect'],
                'genotype': 'NOT_TESTED', 'carries_variant': '?',
            })
            continue
        gt = row.iloc[0]['genotype']
        variant = info['variant']
        if gt == '--':
            # No-call: the probe is on the chip but produced no genotype
            carries = 'no call'
        elif variant == '-':
            # Deletions: 23andMe files commonly encode indels as I/D instead of
            # bases; any other encoding stays unresolved
            count = gt.count('D') if set(gt) <= {'I', 'D'} else None
            carries = COPIES.get(count, '?')
        else:
            # Assumes the file reports this rsid on the same strand as 'variant';
            # complement the letters if your file uses the opposite strand
            carries = COPIES.get(gt.count(variant), '?')
        results.append({
            'rsid': rsid, 'allele': info['name'], 'effect': info['effect'],
            'genotype': gt, 'carries_variant': carries,
        })
    return pd.DataFrame(results)

# Look up by rsid in the full table so the result does not depend on the region filter
stars = check_star_alleles(snps, cyp2d6_lookup)
print(stars)
Sample output for a hypothetical Korean user:
rsid allele effect genotype carries_variant
0 rs3892097 *4 no function GG no
1 rs1065852 *10 decreased function CT heterozygous (1 copy)
2 rs5030655 *6 no function NOT_TESTED ?
3 rs35742686 *3 no function NOT_TESTED ?
4 rs28371706 *17 decreased function CC no
5 rs28371725 *41 decreased function CC no
6 rs16947 *2 normal AG heterozygous (1 copy)
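Interpretation: this person appears to carry one copy of *10 (decreased function) and one copy of *2 (normal function). Under the CPIC activity-score translation (Caudle et al., 2020, in the references), *10 counts as 0.25 and *2 as 1, for a total of 1.25. That sits at the very bottom of the normal metabolizer range (1.25-2.25), just above the intermediate range (0.25-1). Array data is unphased and blind to copy number, so treat this as a provisional call. For CYP2D6 substrates, codeine may be activated to morphine less efficiently than with two fully functional copies, and dosing of some SSRIs may warrant a closer look.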
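The phenotype label comes from summing a per-allele activity value across both chromosomes. The sketch below uses the values and cut-offs as I understand them from the CPIC/DPWG consensus (Caudle et al., 2020, in the references). Treat the numbers as assumptions to confirm against the current CPIC allele functionality table; `activity_score` and `phenotype` are illustrative helper names.

```python
# Assumed activity values per star allele (verify against CPIC before relying on them)
ACTIVITY = {
    '*1': 1.0, '*2': 1.0, '*3': 0.0, '*4': 0.0, '*5': 0.0,
    '*6': 0.0, '*10': 0.25, '*17': 0.5, '*41': 0.5,
}

def activity_score(diplotype):
    """Sum the activity values of the two alleles in a diplotype."""
    return sum(ACTIVITY[allele] for allele in diplotype)

def phenotype(score):
    """Translate an activity score into a metabolizer category."""
    if score == 0:
        return 'poor metabolizer'
    if score < 1.25:
        return 'intermediate metabolizer'
    if score <= 2.25:
        return 'normal metabolizer'
    return 'ultrarapid metabolizer'

for diplotype in [('*2', '*10'), ('*10', '*10'), ('*4', '*10'), ('*4', '*4')]:
    score = activity_score(diplotype)
    print('/'.join(diplotype), score, phenotype(score))
# *2/*10 1.25 normal metabolizer
# *10/*10 0.5 intermediate metabolizer
# *4/*10 0.25 intermediate metabolizer
# *4/*4 0.0 poor metabolizer
```

The score only reflects the alleles you managed to call, so undetected deletions or duplications would shift it.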
Why *5 Is Invisible to 23andMe (And Every Other SNP Array)
CYP2D6 *5 is a whole-gene deletion: the entire CYP2D6 gene is missing on that chromosome. No SNP exists to detect "absence." A SNP array probes specific positions, so when one copy of the gene is deleted the probes see only the other chromosome's copy and report the genotype as if it were homozygous for whatever that copy carries. If both copies are deleted, the probes return no-calls.
Frequency:
- East Asian populations: 5-7% carry at least one *5 allele
- European populations: ~2%
- African populations: ~3-7%
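A Korean person with a *5/*10 genotype has only one copy of the gene, so the array sees only the *10 allele and typically calls rs1065852 as if it were homozygous. SNP-only analysis therefore reads this person as *10/*10, and reads a *1/*5 carrier as *1/*1. Under the CPIC activity-score translation (Caudle et al., 2020), *1/*1 scores 2 (normal metabolizer) while *1/*5 scores 1 (intermediate metabolizer), and a shift of that size can change a dosing recommendation.
The DTC interpretation trap: if every CYP2D6 SNP in your file looks homozygous reference, you might assume normal function on both chromosomes. But if one chromosome carries *5 (undetected), you have a single working copy and sit a full category lower than the report implies.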
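One way to make the ambiguity concrete is to list every true diplotype that fits an array-based call. The sketch below assumes a deleted chromosome contributes no signal, so an apparently homozygous call is also consistent with one real copy plus *5. The `candidates` helper is hypothetical, and the activity values are the assumed CPIC figures for these four alleles.

```python
# Assumed activity values (verify against the current CPIC table)
ACTIVITY = {'*1': 1.0, '*4': 0.0, '*5': 0.0, '*10': 0.25}

def candidates(apparent_diplotype):
    """List (diplotype, activity score) pairs consistent with an array-based call.

    An apparently homozygous call (a, a) is also compatible with (a, *5),
    because the deleted chromosome gives the probes nothing to read.
    """
    first, second = apparent_diplotype
    options = [(first, second)]
    if first == second:
        options.append((first, '*5'))
    return [(d, sum(ACTIVITY[allele] for allele in d)) for d in options]

for apparent in [('*1', '*1'), ('*10', '*10')]:
    for diplotype, score in candidates(apparent):
        print(apparent, '->', diplotype, score)
# ('*1', '*1') -> ('*1', '*1') 2.0
# ('*1', '*1') -> ('*1', '*5') 1.0
# ('*10', '*10') -> ('*10', '*10') 0.5
# ('*10', '*10') -> ('*10', '*5') 0.25
```

SNP data alone cannot pick between the rows in each pair; only a copy-number assay can.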
What about CYP2D6 duplications (ultra-rapid metabolizers)?
Same problem in the opposite direction. A single chromosome can carry anywhere from 2 to 13 copies of CYP2D6 (gene duplication). When the total number of functional copies across both chromosomes exceeds two, the person is typically an ultrarapid metabolizer, and codeine is converted to morphine faster than intended (respiratory depression risk).
SNP arrays detect alleles, not copy number. 23andMe does not detect duplications.
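To see why copy number matters, the activity-score idea can be extended so each chromosome carries an allele and a copy count. This is a minimal sketch with assumed CPIC activity values and a hypothetical `score_with_copies` helper. The copy counts are exactly the information a SNP array does not provide.

```python
# Assumed activity values (verify against the current CPIC table)
ACTIVITY = {'*1': 1.0, '*2': 1.0, '*4': 0.0, '*10': 0.25}

def score_with_copies(haplotypes):
    """Sum activity over chromosomes given (star_allele, copy_number) pairs."""
    return sum(ACTIVITY[allele] * copies for allele, copies in haplotypes)

print(score_with_copies([('*1', 1), ('*1', 1)]))  # 2.0
print(score_with_copies([('*1', 2), ('*1', 1)]))  # 3.0
print(score_with_copies([('*4', 2), ('*1', 1)]))  # 1.0
```

The first two people would have identical SNP genotypes, yet the second is above the 2.25 cut-off usually used for ultrarapid status. The third shows that duplicating a no-function allele adds nothing.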
How Clinical PGx Labs Detect *5 and Duplications
Clinical-grade tests use copy-number assays, long-range PCR, or sequencing to physically check whether the CYP2D6 gene is present and how many copies exist:
- AmpliChip CYP450 (older but still used)
- TaqMan copy number assays
- Long-range PCR with primers spanning the deletion breakpoints
- Whole genome / long-read sequencing (PacBio HiFi or Oxford Nanopore) — gold standard
These are not what DTC services offer. For clinical decisions involving CYP2D6, a proper PGx test through a clinical lab is needed.
Comparison — DTC Tools That Automate This Workflow
You don't have to write Python — multiple tools parse 23andMe raw data for PGx:
| Tool | What it does | Cost | Catches *5? |
|---|---|---|---|
| Promethease | Cross-references all SNPs with SNPedia | $12 one-time | ❌ (SNP-based) |
| FoundMyFitness | Curated reports including PGx | $20/month | ❌ |
| Sequencing.com | PGx + ancestry + others | Free + paid | ❌ |
| Genetic Genie | Free PGx report | Free | ❌ |
| CodonPro PGx (formerly Codon) | Clinical-style PGx panel | $50-100 | Some via imputation |
| Clinical PGx panel (university hospital) | Targeted sequencing | $200-500 | ✅ |
All DTC-based tools share the SNP-array limitation. If you're using DTC data for actual prescription decisions, work with a doctor who orders a proper clinical PGx panel for the questions DTC can't answer.
Related: DTC Genetic Testing 2026 Complete Buyer's Guide compares 10 DTC services on accuracy and coverage.
Beyond CYP2D6 — Other Important PGx Genes on 23andMe
Same approach applies to other CYP genes:
| Gene | Star alleles on 23andMe? | Detect deletion? |
|---|---|---|
| CYP2C19 | *2, *3, *17 — common variants detected | No CNV detection |
| CYP2C9 | *2, *3 detected; ethnic-specific harder | No CNV detection |
| CYP3A5 | *3 detected | No CNV detection |
| DPYD | Common variants (*2A, *13) sometimes | No CNV detection |
| TPMT | Common variants detected | No CNV detection |
| VKORC1 | rs9923231 detected (relevant for warfarin) | N/A |
| HLA-B*57:01 (abacavir) | Partial via tagging SNPs | Imputation only |
| HLA-B*15:02 (carbamazepine, Asian risk) | NOT reliably detectable by DTC | Needs clinical typing |
HLA typing in particular is poorly served by DTC arrays. For HLA-B*15:02 (severe SJS/TEN risk with carbamazepine in people of Asian ancestry) or HLA-B*57:01 (abacavir hypersensitivity), clinical sequencing is required.
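The same rsid lookup can report, gene by gene, which markers your file is missing. In the sketch below only rs9923231 comes from the table above. The other rsids are the commonly cited defining SNPs for those star alleles and should be verified against PharmVar, and `panel_coverage` is a hypothetical helper.

```python
# Marker panel: rsids other than rs9923231 are assumptions to verify in PharmVar
PANEL = {
    'CYP2C19': {'*2': 'rs4244285', '*3': 'rs4986893', '*17': 'rs12248560'},
    'CYP2C9': {'*2': 'rs1799853', '*3': 'rs1057910'},
    'CYP3A5': {'*3': 'rs776746'},
    'VKORC1': {'-1639G>A': 'rs9923231'},
}

def panel_coverage(genotypes, panel):
    """For each gene, list the markers that are absent from the file or no-calls."""
    missing = {}
    for gene, markers in panel.items():
        missing[gene] = [
            allele for allele, rsid in markers.items()
            if genotypes.get(rsid, '--') == '--'
        ]
    return missing

# Hypothetical genotypes keyed by rsid
example = {'rs4244285': 'GG', 'rs12248560': 'CT', 'rs776746': 'CC', 'rs9923231': 'CT'}
print(panel_coverage(example, PANEL))
# {'CYP2C19': ['*3'], 'CYP2C9': ['*2', '*3'], 'CYP3A5': [], 'VKORC1': []}
```

A gene with missing markers cannot be called "normal" from this data; it can only be called "not fully tested".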
Practical Workflow — From 23andMe to Useful PGx Info
- Download your 23andMe raw data
- Run Promethease or this Python code for an initial overview
- Identify which medications in your future or current regimen might involve CYP2D6, CYP2C19, CYP2C9 metabolism (your doctor can confirm)
- For those medications, check your detected variants — but treat them as a starting hypothesis, not a clinical conclusion
- If a critical medication is involved (especially codeine/tramadol, clopidogrel, warfarin, abacavir, carbamazepine, certain SSRIs/antipsychotics), request a clinical PGx panel from your physician
- Never stop or change a prescription based on DTC results alone
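The triage step in this workflow can be sketched as a small lookup. The drug-to-gene pairs below cover only the medications named in this article and are a simplification, not a prescribing reference. `DRUG_GENES`, `NEEDS_CLINICAL_TYPING`, and `triage` are hypothetical names.

```python
# Simplified drug -> gene pairs for the medications named in this article
DRUG_GENES = {
    'codeine': ['CYP2D6'],
    'tramadol': ['CYP2D6'],
    'clopidogrel': ['CYP2C19'],
    'warfarin': ['CYP2C9', 'VKORC1'],
    'abacavir': ['HLA-B*57:01'],
    'carbamazepine': ['HLA-B*15:02'],
}
# Markers this article describes as needing clinical typing
NEEDS_CLINICAL_TYPING = {'HLA-B*57:01', 'HLA-B*15:02'}

def triage(medications):
    """Return a next-step note per medication; never a dosing decision."""
    plan = {}
    for drug in medications:
        genes = DRUG_GENES.get(drug.lower())
        if genes is None:
            plan[drug] = 'not in table; ask your pharmacist or physician'
        elif NEEDS_CLINICAL_TYPING & set(genes):
            plan[drug] = 'clinical typing needed: ' + ', '.join(genes)
        else:
            plan[drug] = 'review DTC variants, then confirm clinically: ' + ', '.join(genes)
    return plan

for drug, note in triage(['Codeine', 'carbamazepine', 'ibuprofen']).items():
    print(f'{drug}: {note}')
```

Every branch ends in a conversation with a clinician, which matches the last step of the list above.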
FAQ
Q: Will the Python code above work on AncestryDNA raw data too? Mostly. It is a different chip with a slightly different SNP set, and the file layout differs (AncestryDNA lists the two alleles in separate columns), so the loader needs adjusting. The lookup logic is the same, but check whether AncestryDNA includes rs3892097 and rs1065852 (usually yes for these well-known ones; check your specific file). Other DTC providers (MyHeritage, FamilyTreeDNA) similarly cover the common rsid set.
Q: My 23andMe file uses hg19 vs hg38 coordinates — does that matter? For star allele detection via rsid, no — rsid is the same across builds. Coordinates differ between hg19 and hg38, so if you're querying by position rather than rsid, use the correct reference.
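If you are unsure which build a file uses, the comment header usually says so. The sketch below scans the leading `#` lines for a build number. The header wording in the example is an assumption about what such files contain, and `detect_build` is a hypothetical helper, so fall back to checking a known rsid's position if nothing matches.

```python
import io
import re

def detect_build(handle, max_lines=50):
    """Scan leading '#' comment lines for a build number such as 'build 37'."""
    for index, line in enumerate(handle):
        if index >= max_lines or not line.startswith('#'):
            break  # header is over (or too long to be a header)
        match = re.search(r'(?:build|GRCh)\s*(\d{2})', line, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None

# Illustrative header; real wording varies by provider and export date
header = io.StringIO(
    "# This data file was generated by a DTC provider\n"
    "# We are using reference human assembly build 37\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t82154\tAA\n"
)
print(detect_build(header))  # 37
```

A return value of `None` means the header did not state a build, not that the file is build 38.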
Q: How accurate is 23andMe's own "Pharmacogenetic Reports" feature? 23andMe's FDA-cleared PGx reports cover a limited set (CYP2C19, CYP2D6, CYP3A5, others) with documented variants. They explicitly note the *5 and duplication limitations. For the variants they report, accuracy is high. For what they don't report, you're back to clinical testing.
Q: Can I use 23andMe data to choose between antidepressants? Hypothesis-generating only. Many SSRIs/SNRIs are CYP2D6 substrates (paroxetine, fluoxetine, venlafaxine). If you carry *4 or are *10/*10 homozygous, you may be at higher risk for side effects — discuss with your psychiatrist. They may order a clinical PGx panel before prescribing.
Q: Why does my DTC report disagree with another DTC tool on the same raw data? Different tools use different variant→star-allele mapping rules, especially for combinations. PharmGKB allele definitions are the standard; tools that deviate may be using older or proprietary mappings.
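A small example of how two mapping rules can disagree on identical data: in this article's star allele table, rs1065852 (T) is listed for both *4 and *10. The sketch below compares a rule that names every variant SNP separately with one that credits rs1065852 copies to *4 first. Both functions are hypothetical simplifications, and the second assumes the rs1065852 variant sits on the same chromosome as the rs3892097 variant.

```python
def naive_calls(counts):
    """Rule set A: every variant SNP names its own star allele."""
    calls = []
    if counts['rs3892097'] > 0:
        calls.append('*4')
    if counts['rs1065852'] > 0:
        calls.append('*10')
    return calls

def haplotype_aware_calls(counts):
    """Rule set B: rs1065852 copies are credited to *4 first, the remainder to *10."""
    calls = ['*4'] * counts['rs3892097']
    leftover = max(counts['rs1065852'] - counts['rs3892097'], 0)
    calls += ['*10'] * leftover
    return calls

# Variant-allele copy counts (0, 1 or 2) for one hypothetical person
person = {'rs3892097': 1, 'rs1065852': 1}
print(naive_calls(person))            # ['*4', '*10']
print(haplotype_aware_calls(person))  # ['*4']
```

Rule set A reports two non-reference alleles where rule set B reports one, which is enough to move a person between metabolizer categories.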
Q: Are there tools that try to detect *5 by imputation? Some research-grade tools use linkage disequilibrium and surrounding SNP patterns to probabilistically infer CNVs. They're not validated for clinical use. The honest answer for clinical decisions: get a real PGx panel.
Q: Can I run this in R instead of Python? Yes. readr::read_tsv() plus dplyr::filter() does the same thing. Look at packages like pgxgene if you want pre-built PGx scoring.
Q: Where do I find the latest star allele definitions? PharmVar (https://www.pharmvar.org) — the authoritative human pharmacogene variation database, free to access. Updated regularly with new alleles.
Closing — Key Takeaways
- SNP-based DTC tests catch common CYP2D6 alleles like *4, *10, *17 — well enough to flag intermediate or poor metabolizer status for many medications
- They miss structural variants, most importantly CYP2D6 *5 (whole-gene deletion) and gene duplications, which means the most clinically extreme phenotypes can be misclassified
- For prescription decisions, especially for codeine/tramadol, clopidogrel, warfarin, abacavir, carbamazepine — get a clinical-grade PGx panel through a hospital or specialty lab
- DTC PGx data is great for ancestry-style "interesting facts about my drug metabolism" exploration; it's not great for sole-basis prescription decisions
The Python code above gives you a starting view of your SNP-detectable CYP2D6 alleles. Combined with the PGx Complete Guide 2026, it should give you the framework to know when DTC data is enough and when it isn't.
Related posts:
- Pharmacogenomics (PGx) Complete Guide 2026
- DTC Genetic Testing 2026 Complete Buyer's Guide
- Complete Guide to Interpreting DTC Genetic Test Results
- BRCA1/2 + Hereditary Cancer Guide
References:
- PharmVar — authoritative star allele definitions: https://www.pharmvar.org
- PharmGKB — clinical pharmacogenomics knowledge base: https://www.pharmgkb.org
- CPIC guidelines for CYP2D6: https://cpicpgx.org
- Gaedigk, A. et al. (2008). The CYP2D6 activity score. Clinical Pharmacology & Therapeutics, 83, 234-242.
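- Caudle, K. E. et al. (2020). Standardizing CYP2D6 genotype to phenotype translation: consensus recommendations from the Clinical Pharmacogenetics Implementation Consortium and Dutch Pharmacogenetics Working Group. Clinical and Translational Science, 13(1), 116-124.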
⚠️ Medical disclaimer: This article is for educational purposes. DTC genetic data and the analyses shown are not a substitute for clinical PGx testing or physician advice. Do not change or stop medications based on DTC test results alone.