Am I perhaps too big for my job?
Or, to put it another way: Is it really not possible to compare apples with oranges?
In this article we would like to look at the usefulness and necessity of test standards, because only those who draw fair comparisons will also receive meaningful assessments.
You find the question rather strange and wonder what this has to do with test norms? For many jobs, you are right to be skeptical. However, if you were planning to drive racing cars professionally, for example, being taller than 1.80 m would be rather unfavorable, partly because of the tight space inside the vehicle and partly because of the greater weight. Consequently, the context determines which question and which specific comparison make sense.
As an adult, you can usually answer the question of whether you are rather tall or short quite well. Through everyday experience with other people, you have built up something like a subjective "mental norm" for body height and can therefore estimate whether you are taller, shorter or about the same height as most other people. But if you want to answer this question more precisely, e.g., in the context of a medical examination, you have to 1) measure your height and 2) compare it with measurements of a suitable group. The answer to this seemingly trivial question will change depending on which scale is used and with which group the comparison is made (e.g., adults, children, professional basketball players). It is very similar with psychological tests.
Why test norms?
A psychological test is the ruler with which we can measure human characteristics. Unlike body height, however, these characteristics are difficult or impossible to observe directly. You may still have a relatively good feeling for estimating your body height. By contrast, estimating psychological characteristics such as "conscientiousness" or "logical reasoning" is far more difficult. One's own "mental norm" is completely inadequate in this case.
Test norms provide a remedy here and serve to place a person's test results in relation to a relevant comparison group. Unlike centimeters, for which we have built up a certain "feeling," a raw test score rarely says anything about a person. If a person has solved 7 out of 15 items on a test, that could be a good or bad score, depending on the test’s item difficulty. The raw test score only becomes meaningful when one knows how other people perform in the test. So-called norm samples serve as a comparison group, i.e., the largest possible group of people who are representative of the target population of the test and who have taken the test. Based on the test data from the norm sample, norm scores can be calculated. They provide direct information about the position the person occupies in comparison to the norm sample with respect to a mental characteristic. Thus, a result can be interpreted as an above-average, average or below-average test performance or as a high, moderate or low expression of personality traits, attitudes or interests.
What types of norm samples are there?
Psychological tests can be used for many different questions. In order to provide the appropriate comparison group for each question or person, many tests offer several norm samples. Often there is a norm sample that is representative of the general population. In this context, representative means that the distribution of relevant personal characteristics such as age, gender, or level of education in the sample is comparable to that in the general population. Based on these (large) population-representative norm samples, subgroup norms (e.g., age group 50-59 years) are usually also created, separated according to age, gender, and/or educational level. Depending on the test and question, it may also be helpful to use other group-specific norms (e.g., separated by occupation, type of school, or disease). Unlike population-representative norms, which are mostly stratified or quota samples, these group-specific norms are often so-called convenience samples. The smaller and more specific the population group (e.g., German U19 soccer players, 2nd division), the more likely it is that such a convenience sample is also representative of the target population.
What types of norm scores are there?
With regard to norm scores, mainly two groups are distinguished:
Percentile ranks (PR) can be derived based on the relative frequency of certain raw test scores in the norm sample (using area transformation). A percentile rank indicates what percentage of the norm sample achieved an equal or lower test result. For example, a PR = 87 means that 87% of the norm sample have the same or a lower test result or that 13% of the norm sample have achieved a higher one.
Standard norms, on the other hand, show how many standard deviations the test result is from the mean of the norm sample. All standard norms are based on z-values, which have a mean (M) of 0 and a standard deviation (SD) of 1. A z-value of -0.5 thus means that the test result is half a standard deviation below the mean of the norm sample. Since z-values are not very practical due to the decimal places and the changing sign, other standard norms have been developed.
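Both kinds of norm score can be computed directly from the raw scores of a norm sample. The following is only a minimal sketch: the function names `percentile_rank` and `z_score` are chosen here for illustration, the ten raw scores are made up, and real test manuals rely on far larger samples and often on smoothed norm tables.

```python
from statistics import mean, pstdev

def percentile_rank(raw_score, norm_sample):
    """Percentage of the norm sample with an equal or lower raw score."""
    at_or_below = sum(1 for score in norm_sample if score <= raw_score)
    return 100 * at_or_below / len(norm_sample)

def z_score(raw_score, norm_sample):
    """Distance of the raw score from the sample mean, in standard deviations."""
    # pstdev treats the norm sample as the full reference group (an assumption);
    # statistics.stdev would apply the sample correction instead.
    return (raw_score - mean(norm_sample)) / pstdev(norm_sample)

# Made-up raw scores (items solved out of 15) of a tiny illustrative norm sample.
norm_sample = [3, 5, 6, 7, 7, 8, 9, 10, 11, 14]

pr = percentile_rank(7, norm_sample)
z = z_score(7, norm_sample)
print(pr, round(z, 2))  # 50.0 -0.33
```

In this invented sample, 7 solved items correspond to PR = 50 and to a z-value of about -0.33, i.e., a third of a standard deviation below the sample mean of 8.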
Just like the conversion from Celsius to Fahrenheit, the conversion from z-values to other standard norms is merely a linear transformation. Standard norms can therefore also be converted into one another at will. Frequently used standard norms are, for example, T-scores (M=50, SD=10) or IQ-scores (M=100, SD=15). If the raw test scores are approximately normally distributed, these standard norms can be interpreted similarly to PR based on the standard normal distribution. An IQ = 130 represents a score that is two standard deviations (2 x 15) above the average (IQ=100) and would mean that only ~2.3% of the norm sample have a better score.
Good to know: Unlike percentile ranks (ordinal scale), standard norms (interval scale) can also be used to interpret differences between test values. For example, a drop in performance of 20 T-score points is twice as large as a drop of 10 T-score points. In the case of percentile ranks, this type of interpretation is not valid, since percentile ranks can only be used to assess whether test scores are larger, smaller or the same, not how much larger or smaller.
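The linear transformation and the reading of an IQ of 130 can be checked with a short Python sketch. The helper names `z_to_standard_norm` and `standard_norm_to_z` are hypothetical, and the last step assumes that the scores in the norm sample are normally distributed.

```python
from statistics import NormalDist

def z_to_standard_norm(z, norm_mean, norm_sd):
    """Linear transformation of a z-value to a scale with the given M and SD."""
    return norm_mean + norm_sd * z

def standard_norm_to_z(score, norm_mean, norm_sd):
    """Inverse transformation: back from a standard norm to the z-value."""
    return (score - norm_mean) / norm_sd

z = -0.5
t_score = z_to_standard_norm(z, 50, 10)     # T-score scale: M=50, SD=10
iq_score = z_to_standard_norm(z, 100, 15)   # IQ scale: M=100, SD=15
print(t_score, iq_score)  # 45.0 92.5

# Share of a normally distributed norm sample scoring above IQ = 130.
z_130 = standard_norm_to_z(130, 100, 15)
share_above = 1 - NormalDist().cdf(z_130)
print(z_130, round(100 * share_above, 1))  # 2.0 2.3
```

Under the normality assumption, roughly 2.3% of the norm sample score higher than IQ = 130, which is close to the figure quoted above.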
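Why differences between percentile ranks must not be interpreted can be illustrated with a small sketch. It assumes normally distributed scores, and `t_to_percentile_rank` is a name introduced here only for the example.

```python
from statistics import NormalDist

def t_to_percentile_rank(t_score):
    """Percentile rank of a T-score (M=50, SD=10) under a normal distribution."""
    z = (t_score - 50) / 10
    return 100 * NormalDist().cdf(z)

# Two equally large steps on the T-score scale ...
for t in (50, 60, 70):
    print(t, round(t_to_percentile_rank(t), 1))
# 50 50.0
# 60 84.1
# 70 97.7
```

The step from T = 50 to T = 60 covers about 34 percentile ranks, while the equally large step from T = 60 to T = 70 covers only about 14. Equal distances on the interval scale therefore do not correspond to equal distances in percentile ranks.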
How to decide which norm to use?
Choosing the right norm sample is of central importance for the interpretation of test results. Depending on the norm used, the context in which the results can be interpreted also changes. With a height of 177 cm, one clearly belongs to the above-average tall people in Japan, while one is "only" average in the Netherlands. If, on the other hand, one wants to make a gender-specific comparison, this height would be above average for a woman in both countries. Caution is therefore required when interpreting a norm score. On the one hand, the norm score changes depending on the norm sample, and on the other hand, one and the same norm score can mean something different depending on the choice of norm sample. A norm score of IQ = 130 could be an indication of giftedness in an intelligence test. However, this is not the case if the norm sample consists exclusively of persons with a diagnosed intellectual disability.
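How strongly the norm score depends on the chosen norm sample can be sketched with the height example. The two norm groups below, including their means and standard deviations, are invented purely for illustration and are not real population statistics; normally distributed heights are also an assumption.

```python
from statistics import NormalDist

# Hypothetical norm groups with assumed mean height and SD in cm.
# These numbers are made up for illustration, not real statistics.
norm_groups = {
    "norm group A": NormalDist(mu=171, sigma=6),
    "norm group B": NormalDist(mu=177, sigma=7),
}

def norm_scores(raw_value, norm):
    """Return (z-value, percentile rank) of raw_value within a normal norm."""
    z = (raw_value - norm.mean) / norm.stdev
    return z, 100 * norm.cdf(raw_value)

# The same raw value of 177 cm is scored against both norm groups.
results = {name: norm_scores(177, norm) for name, norm in norm_groups.items()}
for name, (z, pr) in results.items():
    print(f"{name}: z = {z:+.1f}, PR = {pr:.0f}")
```

With these assumed values, the same 177 cm lies one standard deviation above the mean of group A (PR of about 84) but exactly at the mean of group B (PR = 50).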
Basically, the norm that is most suitable for answering a specific diagnostic question for a specific person should be selected. This sounds very simple, but it is often not. It can result in different norms being used for the same person depending on the question being asked, or in different norms being used for different people despite the same question being asked.
An 84-year-old man suffers a stroke. During his rehabilitation, it should be clarified whether there are indications of cognitive impairments. Performance in many cognitive functional areas declines with age. Thus, if one wants to know whether a cognitive performance is age-appropriate or not, one should use age-specific norms from healthy individuals. This will then allow one to determine whether the man is cognitively impaired compared to other healthy individuals of the same age. If one would also like to estimate the severity of the impairment, one can in principle also use the age-specific norm for this purpose and determine how far the test person is below the average range. However, it is often the case that norms based on the general population differentiate less well in the extreme ranges of the trait spectrum and, depending on the focus of the test, floor or ceiling effects may also occur. Therefore, to assess severity, comparison with a clinical norm sample may be helpful. For example, if a norm sample of individuals who have suffered a mild to severe stroke is available for the test, it can be used to more accurately assess how severe cognitive impairments actually are.
After a 67-year-old woman has driven her car under the influence of alcohol, the authorities order a retraining course and a traffic psychological examination to assess her driving-specific cognitive ability. The use of an age-specific norm would be insufficient, since in this case it is not a matter of comparison with other persons of the same age or with the same condition, but of comparison with all drivers. Just like speed limits, these minimum cognitive requirements apply equally to everyone, regardless of age or health status. Accordingly, to answer the question, one should use an age-unspecific overall norm of the general adult population. Thus, the woman is compared not only to her peers, but to all healthy adults in general. This overall norm is presumably “stricter” than an age-specific subgroup norm: because it contains a higher proportion of younger individuals, its average cognitive performance level can be assumed to be higher, which raises the minimum level required for fitness to drive.
During a multi-stage selection process, an airline would like to determine, among other things, which applicants have the best cognitive performance. In order to keep the assessment fair, it is important that the same standard of comparison, i.e., the same (subgroup) norm, is used across all individuals. For this purpose, the overall norm of the adult general population is usually used. Since pilots can be assumed to have a high level of performance and consequently a good differentiation in this area is desirable (= no ceiling effects), it would be even better to use a specific norm sample with pilots.
As these three examples illustrate, norms are essential for the interpretation of psychological tests. However, their application and correct interpretation are not as trivial as they may seem at first glance. If one is aware of this, knows the differences between the types of norms and carefully selects the appropriate norm depending on the specific question, one has already done a lot of things right.
Convenience, quota and stratified sampling = different methods of norm sampling. In stratified and quota sampling, the population is divided into subgroups, from which persons are recruited either randomly (stratified sampling) or non-randomly (quota sampling). Convenience sampling, on the other hand, does not use stratification, and individuals are simply recruited according to their availability.
Standard deviation = a statistical measure of the dispersion of raw test values around the mean within a sample.
Standard normal distribution = a theoretical distribution with a mean of 0 and a standard deviation of 1, in which values cluster in the center and fall off symmetrically on either side.
Floor/ceiling effects = Floor effects occur when a test is so difficult that most people achieve only very low test scores. Conversely, ceiling effects occur when a test is so easy that most people achieve very high test scores. In both cases the variance of the test scores within the norm sample is restricted, so that the test cannot differentiate well in the lower range of the characteristic (floor effect) or in the upper range (ceiling effect).
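To make the three sampling methods from the glossary more concrete, here is a simplified Python sketch. The function names, the two age groups and the ordering of the population by availability are all assumptions made for this example; real norming studies use considerably more elaborate recruitment plans.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Divide the population into strata and draw randomly within each one."""
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sample = []
    for stratum, members in strata.items():
        sample.extend(rng.sample(members, n_per_stratum[stratum]))
    return sample

def quota_sample(arrivals, stratum_of, n_per_stratum):
    """Non-random: accept people in order of availability until each quota is full."""
    remaining = dict(n_per_stratum)
    sample = []
    for person in arrivals:
        stratum = stratum_of(person)
        if remaining.get(stratum, 0) > 0:
            sample.append(person)
            remaining[stratum] -= 1
    return sample

def convenience_sample(arrivals, n):
    """No stratification: simply take the first n available people."""
    return arrivals[:n]

# Made-up population of (id, age group); the list order stands for availability.
population = [(i, "18-39") for i in range(60)] + [(i, "40+") for i in range(60, 100)]
quotas = {"18-39": 5, "40+": 5}
age_group = lambda person: person[1]

stratified = stratified_sample(population, age_group, quotas, random.Random(0))
quota = quota_sample(population, age_group, quotas)
convenience = convenience_sample(population, 10)
```

In this toy setup, the stratified and the quota sample both contain five people per age group, but only the stratified sample selects them at random. The convenience sample consists entirely of the group that happened to be available first.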