Understanding STR Analysis for Human Identification
Short Tandem Repeat (STR) analysis is a method used to identify individuals by examining specific regions in their DNA that vary greatly between people. While over 99.9% of human DNA is identical between individuals, STRs fall in the remaining fraction that does not code for proteins and can tolerate variation. These hypervariable STR regions consist of short DNA sequences (4 base-pair units) repeated back-to-back.
Different individuals may exhibit varying numbers of repeats at a specific locus (location on a chromosome). By analyzing a set of STR loci known for their high variability, forensic scientists using the ANDE Rapid DNA Instrument can identify a DNA profile unique to an individual.
For human identification, standard kits examine a core set of loci. The ANDE Rapid DNA system in question looks at 24 specific autosomal STR loci (plus three male-only Y loci) chosen because they reveal identity, but do not disclose personal traits or medical information. In other words, these markers reside in non-coding DNA, so they do not affect or indicate any physical characteristics, diseases, or traits; they simply provide a unique identifier.
Each person has two copies of each autosomal STR locus, one inherited from their mother and one from their father. The STR genotype at each locus is represented by two numbers (alleles), which correspond to the number of repeat units on each chromosome copy.
For example, if one chromosome has 10 repeats of a particular sequence at that locus and the other has 12 repeats, the genotype is “10,12.” Because STR alleles are passed down following standard Mendelian inheritance, a child receives one allele from each parent at each locus. A parent with genotype 10,12 could pass on either a 10 or a 12 to their child.
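The inheritance rule above can be sketched in a few lines of Python. This is only an illustration, not part of any instrument's software: the function name `possible_child_genotypes` is hypothetical, the father's genotype 11,12 is made up for the example, and mutation is ignored.

```python
from itertools import product


def possible_child_genotypes(mother, father):
    """Return every genotype a child could inherit at one autosomal locus.

    Each parent passes on exactly one of their two alleles, so every
    mother-allele/father-allele pairing is a candidate genotype.
    Genotypes are stored as sorted tuples so that (12, 10) and (10, 12)
    count as the same genotype.
    """
    return {tuple(sorted(pair)) for pair in product(mother, father)}


# Mother is 10,12 (from the text); father 11,12 is a made-up example.
for genotype in sorted(possible_child_genotypes((10, 12), (11, 12))):
    print(genotype)
```

With these inputs the child could be 10,11, 10,12, 11,12, or 12,12; a genotype such as 10,10 is excluded because the father carries no 10 allele.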
Combining results across many STR loci yields an extremely specific DNA profile for each person. The chance of two unrelated people sharing the same STR profile becomes vanishingly small as more loci are compared. (For context, using 13 STR loci yields random match probabilities on the order of 1 in 10 billion to 1 in 1 trillion. Using 20+ loci, as modern systems do, the likelihood of a coincidental match can be as low as around 1 in 10^24, i.e., one in a trillion trillion, making each profile virtually unique.)
Counting STR Alleles: The D3S1358 Example (AGAT repeats)
An allele at an STR locus is defined by how many repeat units it contains. Let’s use the STR called D3S1358 as an example. This locus is located on chromosome 3 and has a core repeat sequence “AGAT.” If the DNA sequence at D3S1358 is AGATAGATAGAT... repeated over and over, the allele is named by the count of AGAT repeats. Some people might have as few as 8 repeats of AGAT at D3S1358, while others have 20 or more.
Each allele is typically just a number indicating the repeat count. For instance, allele 14 at D3S1358 means the sequence AGAT is repeated 14 times in a row at that location. When a Rapid DNA instrument or forensic lab analyzes D3S1358, it reports the number of repeats observed on each copy of chromosome 3 for that person. So if the output says an individual is “11,12” at D3S1358, it means one chromosome carried 11 repeats of AGAT and the other carried 12 repeats.
To visualize this counting, imagine writing out the DNA sequence:
...AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT...(11)
...AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT-AGAT...(12)
If “AGAT” appears 11 times before the sequence stops, that chromosome has allele 11 at D3S1358. If the other chromosome has the motif 12 times, that’s allele 12. In practice, STR alleles are measured as lengths of the PCR-amplified DNA fragment: each full repeat adds 4 base pairs. Alleles at D3S1358 range in size from about 100 base pairs (8 repeats) up to about 148 base pairs (20 repeats, i.e., 12 more repeats of 4 base pairs each). The ANDE Rapid DNA instrument can distinguish these lengths by using fluorescent tags and capillary electrophoresis, thereby determining the repeat count.
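The counting idea can be illustrated with a short Python sketch that finds the longest uninterrupted run of a motif in a sequence. It is a teaching aid only: real instruments infer the repeat count from fragment length rather than by reading the sequence, and the function name `count_tandem_repeats` and the flanking bases `CCTG` and `GGCA` are invented for the example.

```python
def count_tandem_repeats(sequence, motif="AGAT"):
    """Return the longest run of back-to-back copies of `motif` in `sequence`."""
    step = len(motif)
    best = 0
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Walk forward one motif at a time while the motif keeps matching.
        while sequence[pos:pos + step] == motif:
            run += 1
            pos += step
        best = max(best, run)
    return best


# Hypothetical flanking bases around 11 and 12 copies of AGAT.
chromosome_a = "CCTG" + "AGAT" * 11 + "GGCA"
chromosome_b = "CCTG" + "AGAT" * 12 + "GGCA"
print(count_tandem_repeats(chromosome_a), count_tandem_repeats(chromosome_b))  # 11 12
```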
One important note: sometimes an allele includes a partial repeat, which is denoted with a decimal in the allele name. For example, an allele labeled 11.3 at D2S441 means the DNA has 11 full repeats of that locus’s four-base motif plus 3 extra bases (not a complete 12th repeat). These are called microvariant alleles. They are rarer, but the decimal nomenclature helps scientists represent alleles that aren’t an exact integer count of the motif. In our example, a genotype of 11.3,11.3 at D2S441 would indicate the person inherited this same microvariant allele from both parents (we will discuss the implications of that in a moment).
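A small Python sketch shows how such a label can be read: the number before the decimal point counts full repeats, and the digit after it counts the extra bases of the incomplete repeat. The function name `parse_allele` is hypothetical, and a four-base motif is assumed by default.

```python
def parse_allele(label, motif_length=4):
    """Split an allele label such as "11.3" into its parts.

    Returns (full_repeats, extra_bases, total_bases_in_repeat_region).
    Labels are handled as strings to avoid floating-point surprises.
    """
    whole, _, partial = str(label).partition(".")
    full_repeats = int(whole)
    extra_bases = int(partial) if partial else 0
    if not 0 <= extra_bases < motif_length:
        raise ValueError(f"extra bases must be below the motif length: {label}")
    total_bases = full_repeats * motif_length + extra_bases
    return full_repeats, extra_bases, total_bases


print(parse_allele("11.3"))  # (11, 3, 47)
print(parse_allele("14"))    # (14, 0, 56)
```

So an 11.3 allele is one base shorter than a 12 allele (47 versus 48 bases of repeat sequence), which is why the instrument has to resolve fragment lengths to within a single base.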
Using an Allelic Ladder to Identify STR Alleles
How do we match the number of repeats to the observed DNA fragment? Laboratories utilize an allelic ladder as a reference. This ladder is a mixture of DNA fragments that represent all common alleles at a specific STR locus, each with known repeat counts. During STR profiling, the allelic ladder runs through the electrophoresis instrument alongside the sample. It generates peaks corresponding to the known allele sizes. The analysis software then compares the DNA peaks from the sample to those of the ladder to assign a repeat number to each.
Think of the allelic ladder as a molecular ruler or reference chart. Each peak in the ladder is labeled with its allele number (e.g., 8, 9, 10… up to 20 for D3S1358). When your sample’s DNA is processed, suppose it shows a peak that migrates to the same position as the ladder’s “14” allele – then the sample is called allele 14 at that locus. If another peak lines up with the ladder’s “16” allele, then allele 16 is assigned.
This way, the ladder ensures that allele calling is accurate and standardized across runs and laboratories. It accounts for any slight run-to-run differences in how DNA fragments travel, because the ladder and sample DNA will be affected similarly. The result is that each STR locus in the sample gets reported as a number (or two numbers for the genotype) that corresponds to known repeat counts.
For example, if analyzing D3S1358, the allelic ladder will have peaks for allele 8, 9, 10, 11, 12, … up to 20. If a person’s sample shows two peaks that line up under 11 and 12 in the ladder, the software will read their genotype as 11,12 at D3S1358. Without the ladder, we’d only have raw fragment sizes in base pairs, which are harder to interpret – the ladder bridges the gap by providing labeled reference points. The ANDE Rapid DNA instrument will also provide raw data files for “off-box” analysis.
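The ladder comparison can be sketched as a nearest-match lookup. The following Python example is a simplified assumption of how such software might work, not ANDE's actual algorithm: the ladder sizes are made up (4 base pairs per repeat, starting from 100 base pairs at allele 8), and the ±0.5 base-pair tolerance and the name `call_allele` are illustrative choices.

```python
def call_allele(peak_size_bp, ladder, tolerance_bp=0.5):
    """Assign an allele to a sample peak by comparing it with ladder peaks.

    `ladder` maps allele number -> fragment size in base pairs as measured
    in the same run. Returns the matching allele, or None when no ladder
    peak lies within the tolerance (an "off-ladder" peak).
    """
    allele, ladder_size = min(
        ladder.items(), key=lambda item: abs(item[1] - peak_size_bp)
    )
    if abs(ladder_size - peak_size_bp) <= tolerance_bp:
        return allele
    return None


# Hypothetical D3S1358 ladder: alleles 8 through 20, illustrative sizes.
ladder = {allele: 100.0 + 4 * (allele - 8) for allele in range(8, 21)}

sample_peaks = [112.2, 115.9]  # made-up measured sizes
print([call_allele(size, ladder) for size in sample_peaks])  # [11, 12]
```

Because the sample and the ladder are sized in the same run, a small shift in migration affects both, and the comparison still lands on the right allele.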
Heterozygous vs. Homozygous STR Genotypes
Each STR genotype consists of two alleles – one from each parent. If both alleles differ, the genotype is called heterozygous at that locus. If both alleles are the same, it’s called homozygous. This distinction affects how the STR data look on an electropherogram (the graphical output of the analysis).
Illustration: A simplified comparison of STR electropherogram peaks for a heterozygous genotype (left, alleles 14 and 16) vs. a homozygous genotype (right, alleles 11.3 and 11.3). In a heterozygote, two separate peaks appear, one for each allele. In a homozygote, both copies are the same size, so they produce a single peak – often with roughly double the height since both alleles’ DNA overlap at the same position.
In the example above, the heterozygous genotype (14,16) produces two distinct peaks – one labeled “14” and one labeled “16.” These represent the two different allele lengths inherited from the parents. By contrast, the homozygous genotype (11.3,11.3) produces just one peak at the 11.3 allele position, because both parental contributions are identical in length.
The instrument still detects two copies, but since they are indistinguishable in size, the signal merges into a single peak. Notably, a homozygous peak is expected to be about twice as tall (in terms of fluorescence intensity) as a single allele peak in a heterozygote, because double the amount of DNA is present at that size.
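The one-peak versus two-peak logic can be written down as a short Python sketch. It assumes a clean single-source sample whose peaks have already been called; the function name `genotype_from_peaks` is hypothetical, and peak heights are ignored.

```python
def genotype_from_peaks(called_alleles):
    """Turn the called peaks at one locus into a genotype and a zygosity label.

    `called_alleles` is a list of allele labels (strings such as "14" or
    "11.3"). One distinct peak means both copies share that allele; two
    distinct peaks mean two different alleles.
    """
    distinct = sorted(set(called_alleles), key=float)
    if len(distinct) == 1:
        return (distinct[0], distinct[0]), "homozygous"
    if len(distinct) == 2:
        return (distinct[0], distinct[1]), "heterozygous"
    # Zero or 3+ peaks cannot be a single-source genotype at one locus.
    raise ValueError(f"expected 1 or 2 peaks, got {len(distinct)}")


print(genotype_from_peaks(["14", "16"]))  # (('14', '16'), 'heterozygous')
print(genotype_from_peaks(["11.3"]))      # (('11.3', '11.3'), 'homozygous')
```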
Let’s tie this to inheritance. Heterozygous (14,16) at D3S1358 means the person inherited two different allele sizes: perhaps one parent gave them a “14-repeat” allele and the other gave a “16-repeat” allele. Homozygous (11.3,11.3) at D2S441 means both parents happened to pass down the same allele (in this case the 11.3 microvariant). If we imagine a mother who has alleles 11.3 and 12 at D2S441 and a father who has 11.3 and 14, their child could end up receiving the 11.3 from each, resulting in 11.3,11.3 in the child (homozygous). More commonly, a child inherits two different alleles and is heterozygous.
The graphic below illustrates how STR alleles are inherited and how they appear on an electropherogram for one locus:
In the figure above, each panel is an electropherogram trace for the D7S820 locus. The mother shows two peaks (alleles 9 and 11) and the father shows two peaks (alleles 9 and 10). The child ended up with allele 10 from the father and allele 11 from the mother, so the child’s D7S820 result is 10,11 – clearly heterozygous, with two distinct peaks. Notice how the child has one allele in common with each parent (that’s how paternity/maternity can be confirmed via STRs). If instead the child had gotten the “9” allele from both parents, all the DNA fragments at D7S820 would be of the same length (9 repeats), and the electropherogram would show a single (but stronger) peak at allele 9, indicating a 9,9 homozygote.
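The D7S820 example can be checked with a minimal Python sketch that tests whether a child's genotype is explainable as one allele from each parent. The function name `consistent_with_parents` is hypothetical, and the check deliberately ignores mutation and other real-world complications that parentage testing has to consider.

```python
def consistent_with_parents(child, mother, father):
    """Return True if the child's two alleles can be split so that one
    comes from the mother and the other from the father."""
    first, second = child
    return (first in mother and second in father) or (
        second in mother and first in father
    )


mother = (9, 11)
father = (9, 10)

print(consistent_with_parents((10, 11), mother, father))  # True: 11 from mother, 10 from father
print(consistent_with_parents((9, 9), mother, father))    # True: 9 from each parent
print(consistent_with_parents((10, 10), mother, father))  # False: mother carries no 10
```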
To summarize this important point: two peaks = heterozygous, one peak = homozygous (for single-source DNA profiles). Analysts can quickly scan a profile and see which loci have one vs. two peaks to identify homozygotes vs. heterozygotes. This becomes one factor in statistical calculations as well, because a homozygous genotype’s frequency is p² (if allele frequency is p) whereas a heterozygous genotype’s frequency is 2pq (if allele frequencies are p and q).
Allele Frequencies and Random Match Probability
Each allele at each STR locus has a certain frequency in the population – some are common, some are rare. Allele frequency refers to how often a particular allele (say allele 14 at D3S1358) appears in a given population. By gathering population data, forensic scientists know the frequency of each allele in various ethnic groups. These frequencies are critical for calculating the statistical weight of a DNA match.
Random Match Probability (RMP) is the estimated probability that a person randomly selected from the population would coincidentally have the same DNA profile as the one in question. To calculate this, one typically multiplies together the probabilities of each individual genotype in the profile. For example, if at one locus a person has a genotype that about 1 in 20 people would have, and at another locus 1 in 50, having both by chance is 1/20 × 1/50 = 1/1000. The product becomes astronomically small when you do this across 20 or more loci. Because of the product rule, even alleles that are moderately common can yield an extremely low combined probability when many loci are used.
In practice, analysts use allele frequencies (p, q, etc.) to compute genotype frequencies (2pq for heterozygotes, p² for homozygotes), and then multiply across loci. They also apply a few standard statistical corrections, but the end result is a random match probability on the order of 1 in many billions or trillions for full profiles.
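The calculation described above can be sketched in Python. This is a bare product-rule illustration without the statistical corrections mentioned in the text; the function name `random_match_probability` is hypothetical, and the allele frequencies are invented so that the two loci reproduce the 1/20 × 1/50 = 1/1000 example.

```python
def random_match_probability(profile, allele_freqs):
    """Multiply genotype frequencies across loci (the product rule).

    `profile` maps locus -> (allele_a, allele_b).
    `allele_freqs` maps locus -> {allele: population frequency}.
    Homozygotes contribute p squared, heterozygotes contribute 2pq.
    """
    rmp = 1.0
    for locus, (allele_a, allele_b) in profile.items():
        p = allele_freqs[locus][allele_a]
        q = allele_freqs[locus][allele_b]
        rmp *= p * p if allele_a == allele_b else 2 * p * q
    return rmp


# Made-up frequencies chosen only to reproduce the worked example.
allele_freqs = {
    "D3S1358": {"14": 0.10, "16": 0.25},  # 2 * 0.10 * 0.25 = 0.05 (1 in 20)
    "D7S820": {"10": 0.10, "11": 0.10},   # 2 * 0.10 * 0.10 = 0.02 (1 in 50)
}
profile = {"D3S1358": ("14", "16"), "D7S820": ("10", "11")}

rmp = random_match_probability(profile, allele_freqs)
print(f"about 1 in {round(1 / rmp):,}")  # about 1 in 1,000
```

Each added locus multiplies in another factor well below 1, which is why the combined probability shrinks so quickly as the number of loci grows.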
Estimates for the original 13 loci vary by source: one educational resource puts the chance of a false match at around one in ten trillion (1 in 10^13). With the expanded sets of STRs (20+ loci in CODIS, 27 in Rapid DNA), the theoretical random match probability can be as low as 1 in 10^18 to 1 in 10^24 or even smaller, depending on allele frequencies in the relevant population. In plain language, this means that even if you profiled many trillions of individuals, you would not expect any two unrelated people to share the exact same DNA profile. This is why STR profiling is considered extremely reliable for identification: the odds of coincidental matches are effectively zero for practical purposes.
It’s important to remember that these probabilities assume the STR loci are independent and that population allele frequencies are well characterized. In forensic practice, statisticians account for subpopulation effects to ensure the numbers aren’t overstated. But even with conservative estimates, DNA profiles are incredibly discriminating. In effect, your STR profile is as unique as your fingerprint, only encoded in numbers. The extremely low random match probabilities (sometimes reported as, e.g., 1 in a quintillion) quantify just how powerful a full STR profile is in identifying a single individual out of the entire world.
STR Profiles Contain No Trait or Health Information
Because forensic STR loci originate from non-coding regions of the genome, they do not provide any information about a person’s appearance, health, or traits. The 20+ loci used for human identification were specifically chosen because they reside outside of genes. Changes in the number of repeats at these locations have no known effect on the individual, which is one reason for their considerable variation. For instance, whether you have 8 repeats or 15 repeats at D3S1358 has no impact on your physiology; it’s simply an inert difference in “junk DNA.”
This also means that an STR profile cannot tell you a person’s race, medical conditions, predispositions, or anything else about their phenotype. The one exception is Amelogenin, a sex-typing marker that simply indicates XX (female) or XY (male) chromosomes. In our context, none of the 24 autosomal STR loci on the Rapid DNA chip codes for a trait or disease. They are essentially biological barcodes.
To address privacy concerns: the ANDE Rapid DNA system outputs a DNA ID (the set of allele numbers) and nothing more. It does not sequence genes or assess any trait-related DNA. The data generated is only useful for comparison to other DNA profiles (for example, matching a suspect to a crime scene, or identifying a missing person by matching family members). It cannot be used to read a person’s genome or genetic secrets. In fact, once the run is completed, the swab and the internal reagents are typically exhausted or locked away, and the machine cannot produce any additional information from that sample. The focus is purely on identity markers.