Reliability
Reliability is the extent to which an experiment, test, or any measuring
procedure yields the same result on repeated trials. Without the agreement of
independent observers able to replicate research procedures, or the ability to
use research tools and procedures that yield consistent measurements,
researchers would be unable to satisfactorily draw conclusions, formulate
theories, or make claims about the generalizability of their research. In
addition to its important role in research, reliability is critical for many
parts of our lives, including manufacturing, medicine, and sports.
According to one dictionary definition, reliability means "yielding the
same or compatible results in different clinical experiments or statistical
trials."
The idea behind reliability is that any significant results
must be more than a one-off finding and be inherently repeatable.
Other researchers must be able to perform exactly the same experiment, under
the same conditions and generate the same results. This will reinforce the
findings and ensure that the wider scientific community will accept the hypothesis.
Without this replication of statistically significant results, the experiment
and research have not fulfilled all of the requirements of testability. This
prerequisite is essential to a hypothesis establishing itself as an accepted
scientific truth.
For example, if you are performing a time-critical experiment, you will be
using some type of stopwatch. Generally, it is reasonable to assume that such instruments
are reliable and will keep true and accurate time. However, diligent scientists
take measurements many times, to minimize the chances of malfunction and
maintain validity and reliability.
At the other extreme, any experiment that uses human judgment is always
going to come under question. This means that such experiments are more
difficult to repeat and are inherently less reliable. Reliability is a
necessary ingredient for determining the overall validity of a scientific
experiment and enhancing the strength of the results.
Reliability and science
Reliability is something that every scientist, especially in the social sciences
and biology, must be aware of. In science, the everyday meaning holds, but the
term requires a much narrower and more unequivocal definition.
Another way of looking at this is as maximizing the inherent repeatability
or consistency in an experiment. For maintaining reliability internally, a
researcher will use as many repeat sample groups as possible, to reduce the
chance of an abnormal sample group skewing the results. If you use three
replicate samples for each manipulation, and one generates completely different
results from the others, then there may be something wrong with the experiment.
- For many experiments, results
follow a ‘normal distribution’ and there is always a chance that your
sample group produces results lying at one of the extremes. Using multiple
sample groups will smooth out these extremes and generate a more accurate
spread of results.
- If your results continue to
be wildly different, then there is likely to be something very wrong with
your design; it is unreliable.
Reliability and cold fusion
Reliability is also extremely important externally, and
another researcher should be able to perform exactly the same experiment, with
similar equipment, under similar conditions, and achieve exactly the same
results. If they cannot, then the design is unreliable.
A good example of a failure to apply the definition of reliability correctly
is provided by the cold fusion case of 1989. Fleischmann and Pons announced to
the world that they had managed to generate excess heat from nuclear fusion at
normal temperatures, using simple bench-top apparatus instead of the huge and
expensive tori used in most research into nuclear fusion.
This announcement shook the world, but when researchers at many other
institutions attempted to replicate the experiment, none succeeded. Whether the
researchers lied or genuinely made a mistake is unclear,
but their results were clearly unreliable.
Reliability and Statistics
Physical scientists expect to obtain exactly the same
results every single time, due to the relative predictability of the physical
realms. If you are a nuclear physicist or an inorganic chemist, repeat
experiments should give exactly the same results, time after time.
Ecologists and social scientists, on the other hand, understand fully that
achieving exactly the same results is an exercise in futility. Research in
these disciplines incorporates random factors and natural fluctuations and,
whilst any experimental design must attempt to eliminate confounding variables
and natural variations, there will always be some disparities.
The key to performing a good experiment is to make sure that your results
are as reliable as is possible; if anybody repeats the experiment, powerful statistical
tests will be able to compare the results and the scientist can make a solid
estimate of statistical reliability.
Testing reliability for the social sciences and education
In the social sciences, testing reliability is a matter of
comparing two different versions of the instrument and ensuring that they are
similar. When we talk about instruments, it does not necessarily mean a
physical instrument, such as a mass-spectrometer or a pH-testing strip.
An educational test, a questionnaire, or a scheme for assigning quantitative
scores to behavior are also instruments, of a non-physical sort. The
reliability of such instruments can be measured in several ways.
Types of Reliability
There are four general classes of reliability estimates, each of which
estimates reliability in a different way. They are:
- Inter-Rater or Inter-Observer Reliability: used to assess the degree to which
different raters/observers give consistent estimates of the same phenomenon.
- Test-Retest Reliability: used to assess the consistency of a measure from one
time to another.
- Parallel-Forms Reliability: used to assess the consistency of the results of
two tests constructed in the same way from the same content domain.
- Internal Consistency Reliability: used to assess the consistency of results
across items within a test.
Inter-Rater or Inter-Observer Reliability
Whenever you use humans as a part of your measurement procedure, you have to
worry about whether the results you get are reliable or consistent. People are
notorious for their inconsistency. We are easily distractible. We get tired of
doing repetitive tasks. We daydream. We misinterpret.
So how do we determine whether two observers are being consistent in their
observations? You probably should establish inter-rater reliability outside of
the context of the measurement in your study. After all, if you use data from
your study to establish reliability, and you find that reliability is low,
you're kind of stuck. Probably it's best to do this as a side study or pilot
study. And, if your study goes on for a long time, you may want to reestablish
inter-rater reliability from time to time to assure that your raters aren't
changing.
There are two major ways to actually estimate inter-rater reliability. If
your measurement consists of categories -- the raters are checking off which
category each observation falls in -- you can calculate the percent of
agreement between the raters. For instance, let's say you had 100 observations
that were being rated by two raters. For each observation, the rater could
check one of three categories. Imagine that on 86 of the 100 observations the
raters checked the same category. In this case, the percent of agreement would
be 86%. OK, it's a crude measure, but it does give an idea of how much
agreement exists, and it works no matter how many categories are used for each
observation.
The other major way to estimate inter-rater reliability is appropriate when
the measure is a continuous one. There, all you need to do is calculate the
correlation between the ratings of the two observers. For instance, they might
be rating the overall level of activity in a classroom on a 1-to-7 scale. You
could have them give their rating at regular time intervals (e.g., every 30
seconds). The correlation between these ratings would give you an estimate of
the reliability or consistency between the raters.
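If you want to compute these two estimates yourself, a minimal Python/NumPy
sketch looks like this. The data below are simulated purely for illustration;
in practice the arrays would hold your raters' actual category codes and
ratings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical data: two raters assign each of 100 observations
# to one of three categories (1, 2 or 3). Rater B is resampled on 14
# observations so the two raters mostly, but not fully, agree.
rater_a = rng.integers(1, 4, size=100)
rater_b = rater_a.copy()
resampled = rng.choice(100, size=14, replace=False)
rater_b[resampled] = rng.integers(1, 4, size=14)

# Percent agreement: share of observations where both raters chose the same category.
percent_agreement = 100 * np.mean(rater_a == rater_b)
print(f"Percent agreement: {percent_agreement:.0f}%")

# Hypothetical continuous data: two observers rate classroom activity on a
# 1-to-7 scale every 30 seconds; reliability is their Pearson correlation.
obs_a = rng.uniform(1, 7, size=60)
obs_b = np.clip(obs_a + rng.normal(0, 0.5, size=60), 1, 7)
inter_rater_r = np.corrcoef(obs_a, obs_b)[0, 1]
print(f"Inter-rater correlation: {inter_rater_r:.2f}")
```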
You might think of this type of reliability as "calibrating" the
observers. There are other things you could do to encourage reliability between
observers, even if you don't estimate it. For instance, I used to work in a
psychiatric unit where every morning a nurse had to do a ten-item rating of
each patient on the unit. Of course, we couldn't count on the same nurse being
present every day, so we had to find a way to assure that any of the nurses
would give comparable ratings. The way we did it was to hold weekly
"calibration" meetings where we would go over all of the nurses' ratings
for several patients and discuss why they chose the specific values they did.
If there were disagreements, the nurses would discuss them and attempt to come
up with rules for deciding when they would give a "3" or a
"4" for a rating on a specific item. Although this was not an
estimate of reliability, it probably went a long way toward improving the
reliability between raters.
Qualitative assessments and inter-rater reliability
Any qualitative assessment using two or more researchers
must establish inter-rater reliability to ensure that the results generated will
be useful.
One good example is Bandura’s Bobo Doll experiment, which used a scale to
rate the levels of displayed aggression in young children. Apart from extensive
pre-testing, the observers constantly compared and calibrated their ratings,
adjusting their scales to ensure that they were as similar as possible.
Instrument reliability
Instrument reliability
is a way of ensuring that any instrument used for measuring experimental
variables gives the same results every time.
Martyn Shuttleworth
(2009)
In the physical sciences, the term is self-explanatory, and it is a matter
of making sure that every piece of hardware, from a mass spectrometer to a set
of weighing scales, is properly calibrated.
Instruments in research
As an example, a researcher will always test the instrument
reliability of weighing scales with a set of calibration weights, ensuring that
the results given are within an acceptable margin of error.
Some of the highly accurate balances can give false results if they are not
placed upon a completely level surface, so this calibration process is the best
way to avoid this.
In the non-physical sciences, the definition of an instrument is much
broader, encompassing everything from a set of survey questions to an
intelligence test. A survey to measure reading ability in children must produce
reliable and consistent results if it is to be taken seriously.
Political opinion polls, on the other hand, are notorious for producing
inaccurate results and delivering a near unworkable margin of error.
In the physical sciences, it is possible to isolate a measuring instrument
from external factors, such as environmental conditions and temporal factors.
In the social sciences, this is much more difficult, so any instrument must be
tested across a reasonable range of conditions to establish its reliability.
Test of stability
Any test of instrument reliability must test how stable the
test is over time, ensuring that the same test performed upon the same
individual gives exactly the same results.
The test-retest method is one way of ensuring that any instrument is stable
over time.
Of course, there is no such thing as perfection, and there will always be
some disparity and potential for regression, so statistical methods are used to
determine whether the stability of the instrument is within acceptable limits.
Test of equivalence
Testing equivalence involves ensuring that two similar tests
administered at the same time, or the same test administered by two different
people, give similar results.
Split-testing is one way of ensuring this, especially in tests or
observations where the results are expected to change over time. In a school
exam, for example, the same test upon the same subjects will generally result
in better results the second time around, so testing stability is not
practical.
Checking that two researchers observe similar results also falls within the
remit of the test of equivalence.
Test-retest reliability
The test-retest
reliability method is one of the simplest ways of testing the stability and
reliability of a research instrument over time.
Martyn Shuttleworth
(2009)
For example, if a group of students takes a test, you would
expect them to show very similar results if they take the same test a few weeks
later. This definition relies upon there being no confounding factor during the
intervening time interval.
Research instruments such as IQ tests and surveys are prime candidates
for test-retest methodology, because there is little chance of people
experiencing a sudden jump in IQ or suddenly changing their opinions.
On the other hand, educational tests are often not suitable for test-retest
methods, because students will learn much more information over the
intervening period and show better results in the second test.
We estimate test-retest reliability when we administer the same test to the
same sample on two different occasions. This approach assumes that there is no
substantial change in the construct being measured between the two occasions.
The amount of time allowed between measures is critical. We know that if we
measure the same thing twice, the correlation between the two observations
will depend in part on how much time elapses between the two measurement
occasions. The shorter the time gap, the higher the correlation; the longer the
time gap, the lower the correlation. This is because the two observations are
related over time -- the closer in time we get the more similar the factors
that contribute to error. Since this correlation is the test-retest estimate of
reliability, you can obtain considerably different estimates depending on the
interval.
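In practice, the calculation is a single correlation. Here is a minimal
Python/NumPy sketch, with invented scores standing in for the two
administrations of the same test:

```python
import numpy as np

# Hypothetical scores for the same ten students on two occasions,
# a few weeks apart (e.g., the same IQ test given twice).
time_1 = np.array([102, 95, 110, 121, 88, 99, 130, 105, 97, 115])
time_2 = np.array([104, 93, 112, 118, 90, 101, 127, 108, 95, 117])

# The test-retest reliability estimate is the correlation between
# the two administrations.
test_retest_r = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability estimate: {test_retest_r:.2f}")
```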
Parallel-Forms Reliability
In parallel forms reliability you first have to create two parallel forms.
One way to accomplish this is to create a large set of questions that address
the same construct and then randomly divide the questions into two sets. You
administer both instruments to the same sample of people. The correlation
between the two parallel forms is the estimate of reliability. One major
problem with this approach is that you have to be able to generate lots of
items that reflect the same construct. This is often no easy feat. Furthermore,
this approach makes the assumption that the randomly divided halves are
parallel or equivalent. Even by chance this will sometimes not be the case. The
parallel forms approach is very similar to the split-half reliability described
below. The major difference is that parallel forms are constructed so that the
two forms can be used independent of each other and considered equivalent
measures. For instance, we might be concerned about a testing threat to
internal validity. If we use Form A for the pretest and Form B for the
posttest, we minimize that problem. It would be even better if we randomly
assign individuals to receive Form A or B on the pretest and then switch them
on the posttest. With split-half reliability we have an instrument that we wish
to use as a single measurement instrument and only develop randomly split
halves for purposes of estimating reliability.
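A rough sketch of the procedure described above, using simulated 0/1 item
responses in place of a real item pool: the items are randomly split into two
forms, each form is scored, and the correlation between the form scores is the
parallel-forms estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical item pool: 40 items intended to measure one construct,
# scored 0/1 for 50 respondents. Each respondent has an underlying "ability"
# so the items actually hang together; real responses would replace this.
ability = rng.uniform(0.2, 0.9, size=(50, 1))
items = (rng.random((50, 40)) < ability).astype(int)

# Randomly divide the items into two parallel forms and total each form.
order = rng.permutation(items.shape[1])
form_a = items[:, order[:20]].sum(axis=1)
form_b = items[:, order[20:]].sum(axis=1)

# The correlation between the two form scores is the parallel-forms estimate.
parallel_r = np.corrcoef(form_a, form_b)[0, 1]
print(f"Parallel-forms reliability estimate: {parallel_r:.2f}")
```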
Internal Consistency Reliability
In internal consistency reliability estimation we use our single measurement
instrument administered to a group of people on one occasion to estimate
reliability. In effect we judge the reliability of the instrument by estimating
how well the items that reflect the same construct yield similar results. We
are looking at how consistent the results are for different items for the same
construct within the measure. There are a wide variety of internal consistency
measures that can be used.
Internal consistency
reliability defines the consistency of the results delivered in a test,
ensuring that the various items measuring the different constructs deliver
consistent scores.
Martyn Shuttleworth
(2009)
For example, an English test is divided into vocabulary, spelling,
punctuation and grammar. The internal consistency reliability test provides a
check that each of these particular aptitudes is measured consistently and
reliably.
One way of testing this is by using a test-retest method, where the same
test is administered some time after the initial test and the results compared.
However, this creates some problems and so many researchers prefer to
measure internal consistency by including two versions of the same instrument
within the same test. Our example of the English test might include two very
similar questions about comma use, two about spelling and so on.
The basic principle is that the student should give the same answer to both
– if they do not know how to use commas, they will get both questions wrong. A
few nifty statistical manipulations will give the internal consistency
reliability and allow the researcher to evaluate the reliability of the test.
There are three main techniques for measuring the internal consistency
reliability, depending upon the degree, complexity and scope of the test.
They all check that the results and constructs measured by a test are
correct, and the exact type used is dictated by subject, size of the data set
and resources.
Split-Halves test
The split halves test for internal consistency reliability
is the easiest type, and involves dividing a test into two halves.
For example, a questionnaire to measure extroversion could be divided into
odd and even questions. The results from both halves are statistically analyzed,
and if there is weak correlation between the two, then there is a reliability
problem with the test.
The split-halves test gives a correlation of between zero
and one, with one meaning a perfect correlation.
The division of the questions into two sets must be random. Split-halves testing was a
popular way to measure reliability, because of its simplicity and speed.
However, in an age where computers can take over the laborious number
crunching, scientists tend to use much more powerful tests.
In split-half reliability we randomly divide all items that purport to
measure the same construct into two sets. We administer the entire instrument
to a sample of people and calculate the total score for each randomly divided
half. The split-half reliability estimate is simply
the correlation between these two total scores.
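The calculation itself is short. The sketch below uses simulated questionnaire
responses; the final Spearman-Brown step-up, which corrects for each half being
only half the length of the full test, is a standard addition not covered in
the text above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical questionnaire: 20 extroversion items rated 1-5 by 80 people,
# built around a latent trait so the items correlate; real data would replace this.
trait = rng.integers(1, 6, size=(80, 1))
responses = np.clip(trait + rng.integers(-1, 2, size=(80, 20)), 1, 5)

# Split the items into odd- and even-numbered halves and total each half.
odd_total = responses[:, 0::2].sum(axis=1)
even_total = responses[:, 1::2].sum(axis=1)

# The split-half estimate is the correlation between the two half scores.
half_r = np.corrcoef(odd_total, even_total)[0, 1]

# Optional Spearman-Brown correction: steps the half-length correlation
# up to the reliability of the full-length test.
full_r = 2 * half_r / (1 + half_r)
print(f"Half-test correlation: {half_r:.2f}, corrected: {full_r:.2f}")
```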
Cronbach’s Alpha test
The Cronbach's Alpha test not only averages the correlation
between every possible combination of split halves, but also allows multi-level
responses.
For example, a series of questions might ask the subjects to rate their
response between one and five. Cronbach's Alpha gives a score of between zero
and one, with 0.7 generally accepted as a sign of acceptable reliability.
The test also takes into account both the number of items and the number
of potential responses. A 40-question test with possible ratings of 1 – 5 is
seen as having more accuracy than a ten-question test with three possible
levels of response.
Of course, even with Cronbach's clever methodology, which makes calculation
much simpler than crunching through every possible permutation, this is still a
test best left to computers and statistics spreadsheet programmes.
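For those who prefer code to spreadsheets, Cronbach's alpha reduces to a few
lines once the item responses are arranged in a respondents-by-items matrix.
The formula used below is the standard one based on item variances and
total-score variance; the Likert data are simulated for illustration only.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 ratings on a 10-item scale from 100 respondents,
# generated around a common latent score so the items correlate.
rng = np.random.default_rng(3)
latent = rng.integers(1, 6, size=(100, 1))
likert = np.clip(latent + rng.integers(-1, 2, size=(100, 10)), 1, 5)

print(f"Cronbach's alpha: {cronbach_alpha(likert):.2f}")  # 0.7+ is commonly taken as acceptable
```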
Kuder-Richardson test
The Kuder-Richardson test for internal consistency
reliability is a more advanced, and slightly more complex, version of the split
halves test.
In this version, the test works out the average correlation for all the
possible split-half combinations in a test. The Kuder-Richardson test also
generates a correlation of between zero and one, with a more accurate result
than the split halves test. The weakness of this approach, as with
split-halves, is that the answer for each question must be a simple right or wrong
answer, zero or one.
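For such dichotomous right/wrong items, the usual Kuder-Richardson formula
(KR-20) can be sketched as follows; the student scores are simulated purely
for illustration.

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for an (n_respondents, n_items) matrix of 0/1 item scores."""
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion answering each item correctly
    q = 1 - p
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total test score
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical right/wrong scores for 60 students on a 25-item test,
# driven by each student's underlying ability; real scores would replace this.
rng = np.random.default_rng(4)
ability = rng.uniform(0.3, 0.9, size=(60, 1))
scores = (rng.random((60, 25)) < ability).astype(int)

print(f"KR-20 reliability estimate: {kr20(scores):.2f}")
```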
For multi-scale responses, sophisticated techniques are needed to measure
internal consistency reliability.
Validity in Research
"Any research can be affected by different kinds of
factors which, while extraneous to the concerns of the research, can invalidate
the findings"
(Seliger & Shohamy 1989, 95).
Validity refers to the degree to which a study accurately
reflects or assesses the specific concept that the researcher is attempting to
measure. Validity is concerned with the study's success at measuring what the researchers set out to measure.
Researchers should be concerned with both external
and internal validity. External validity refers to the extent to which
the results of a study are generalizable or transferable.
Validity encompasses the entire experimental concept and
establishes whether the results obtained meet all of the requirements of the
scientific research method.
Types of validity
Scholars discuss several types of internal validity.
- Face Validity
- Criterion Related Validity
- Construct Validity
- Content Validity
Face Validity
Face validity is a measure of how representative a research project is ‘at
face value,’ and whether it appears to be a good project.
Face validity is concerned with how a measure or procedure appears. Does it
seem like a reasonable way to gain the information the researchers are
attempting to obtain? Does it seem well designed? Does it seem as though it
will work reliably? Unlike content validity, face validity does not depend on
established theories for support.
This is the least scientific method of validity as it is not quantified
using statistical methods. This is not validity in a technical sense of
the term. It is concerned with whether it seems like we measure what we claim.
Here we look at how valid a measure appears on the surface and make subjective
judgments based on that. For example, a survey might appear
valid to the respondent because the questions were selected to look valid to
the administrator. The administrator might then ask a group of random people,
untrained observers, whether the questions appear valid to them.
In research it is never sufficient to rely on face judgments alone, and more
quantifiable methods of validity are necessary in order to draw acceptable
conclusions. There are many instruments of measurement to consider, so
face validity is useful in cases where you need to distinguish one approach
from another. Face validity should never be trusted on its own merits.
Criterion Related Validity
Criterion related validity, also referred to as instrumental validity, is
used to demonstrate the accuracy of a measure or procedure by comparing it with
another measure or procedure which has been demonstrated to be valid.
For example, imagine a hands-on driving test has been shown to be an
accurate test of driving skills. By comparing the scores on the written driving
test with the scores from the hands-on driving test, the written test can be
validated by using a criterion related strategy in which the hands-on driving
test is compared to the written test.
The accuracy of a measure is demonstrated by comparing it with a measure
that has been demonstrated to be valid. In other words, correlation with other
measures that have known validity. For this to work, you must know that the
criterion has been measured well, and be aware that appropriate criteria
do not always exist. What you are doing is checking the performance of
your operationalization against criteria. The criterion you use as a
standard of judgment determines which of the following approaches you would use:
- Predictive Validity: the operationalization's ability to predict what it is
theoretically able to predict, i.e., the extent to which a measure predicts
expected outcomes.
- Concurrent Validity: the operationalization's ability to distinguish between
groups it theoretically should be able to distinguish between. This is where a
test correlates well with a measure that has been previously validated.
Construct Validity
Construct validity seeks agreement between a theoretical concept and a
specific measuring device or procedure. For example, a researcher inventing a
new IQ test might spend a great deal of time attempting to "define"
intelligence in order to reach an acceptable level of construct validity.
To understand whether a piece of research has construct validity, three
steps should be followed:
- First, the theoretical relationships must be specified.
- Second, the empirical relationships between the measures of the concepts
must be examined.
- Third, the empirical evidence must be interpreted in terms of how it
clarifies the construct validity of the particular measure being tested.
Construct validity defines how well a test or experiment
measures up to its claims. A test designed to measure depression must only
measure that particular construct, not closely related constructs such as
anxiety or stress.
- Convergent validity tests that constructs that are expected to be related
are, in fact, related.
- Discriminant validity (also referred to as divergent validity) tests that
constructs that should have no relationship do, in fact, have no relationship.
A construct represents a collection of behaviors that are associated in a
meaningful way to create an image or an idea invented for a research
purpose. Depression is a construct that represents a personality trait
which manifests itself in behaviors such as over sleeping, loss of appetite,
difficulty concentrating, etc. The existence of a construct is manifested
by observing the collection of related indicators. Any one sign may be
associated with several constructs. A person with difficulty
concentrating may have A.D.D. but not depression. Construct validity is
the degree to which inferences can be made from operationalizations (connecting
concepts to observations) in your study to the constructs on which those
operationalizations are based. To establish construct validity you must
first provide evidence that your data supports the theoretical structure.
You must also show that you control the operationalization of the construct, in
other words, show that your theory has some correspondence with reality.
Content Validity
Content validity is the estimate of how much a measure represents every
single element of a construct.
Content validity is illustrated using the following examples: Researchers
aim to study mathematical learning and create a survey to test for mathematical
skill. If these researchers only tested for multiplication and then drew
conclusions from that survey, their study would not show content validity
because it excludes other mathematical functions. Although the establishment of
content validity for placement-type exams seems relatively straight-forward,
the process becomes more complex as it moves into the more abstract domain of
socio-cultural studies. For example, a researcher needing to measure an
attitude like self-esteem must decide what constitutes a relevant domain of
content for that attitude. For socio-cultural studies, content validity forces
the researchers to define the very domains they are attempting to study.
This is also a subjective measure but unlike face validity we ask whether
the content of a measure covers the full domain of the content. If a researcher
wanted to measure introversion they would have to first decide what constitutes
a relevant domain of content for that trait. This is considered a
subjective form of measurement because it still relies on people’s perception
for measuring constructs that would otherwise be difficult to
measure. Where it distinguishes itself is through its use of experts
in the field or individuals belonging to a target population. This study
can be made more objective through the use of rigorous statistical tests.
For example you could have a content validity study that informs researchers
how items used in a survey represent their content domain, how clear they are,
and the extent to which they maintain the theoretical factor structure assessed
by the factor analysis.
Internal Validity
Internal validity refers to the rigor with which the study was conducted
(e.g., the study's design, the care taken to conduct measurements, and
decisions concerning what was and wasn't measured).
This refers to the extent to which the independent variable can
accurately be stated to produce the observed effect. If the change in the
dependent variable is due only to the independent variable(s), then internal
validity is achieved. This is the degree to which a result can be manipulated.
Internal validity is a measure which ensures that a
researcher’s experiment design closely follows the principle of cause and
effect.
“Could there be an alternative cause, or causes, that
explain my observations and results?”
Internal validity dictates how an experimental design is structured and
encompasses all of the steps of the scientific research method.
Even if your results are great, sloppy and inconsistent design will
compromise your integrity in the eyes of the scientific community. Internal
validity and reliability are at the core of any experimental design.
External Validity
The extent to which the designers of a study have taken into account
alternative explanations for any causal relationships they explore (Huitt,
1998). This refers to the extent to which the results of a study can be
generalized beyond the sample; that is, whether you can apply your findings
to other people and settings. Think of this as the degree to which
a result can be generalized.
External validity is usually split into two distinct types, population
validity and ecological validity, and both are essential elements in
judging the strength of an experimental design.
External validity is the process of examining the results and questioning
whether there are any other possible causal relationships.
In 1966, Campbell and Stanley proposed the commonly accepted
definition of external validity.
“External validity asks the question of generalizability:
To what populations, settings, treatment variables and measurement variables
can this effect be generalized?”
External validity is one of the most difficult of the validity
types to achieve, and is at the foundation of every good experimental design.
By Martyn Shuttleworth (2009)
Many scientific disciplines, especially the social sciences, face a long
battle to prove that their findings represent the wider population in real world
situations.
The main criterion of external validity is generalization: whether
results obtained from a small sample group, often in laboratory
surroundings, can be extended to make predictions about the entire population.
Reliability & Validity
Reliability and validity are often confused, but the terms
actually describe two completely different concepts, although they are often
closely inter-related. This distinct difference is best summed up with an
example:
A researcher devises a new test that measures IQ more quickly than the
standard IQ test:
- If the new test delivers
scores for a candidate of 87, 65, 143 and 102, then the test is not
reliable or valid, and it is fatally flawed.
- If the test consistently
delivers a score of 100 when checked, but the candidate's real IQ is 120,
then the test is reliable, but not valid.
- If the researcher’s test
delivers a consistent score of 118, then that is pretty close, and the
test can be considered both valid and reliable.
Reliability is an essential component of validity but, on its own, is not a
sufficient measure of validity. A test can be reliable but not valid, whereas a
test cannot be valid yet unreliable.
Reliability, in simple terms, describes the repeatability and consistency of
a test. Validity defines the strength of the final results and whether they can
be regarded as accurately describing the real world.
We often think of reliability and validity as separate ideas but, in fact,
they're related to each other. Here, I want to show you two ways you can think
about their relationship.
One of my favorite metaphors for the relationship between reliability and
validity is that of the target. Think of the center of the target as the concept that you
are trying to measure. Imagine that for each person you are measuring, you are
taking a shot at the target. If you measure the concept perfectly for a person,
you are hitting the center of the target. If you don't, you are missing the
center. The more you are off for that person, the further you are from the
center.
There are four possible situations. In the first one, you are
hitting the target consistently, but you are missing the center of the target.
That is, you are consistently and systematically measuring the wrong value for
all respondents. This measure is reliable, but not valid (that is, it's
consistent but wrong). The second shows hits that are randomly spread across
the target. You seldom hit the center of the target but, on average, you are
getting the right answer for the group (but not very well for individuals). In
this case, you get a valid group estimate, but you are inconsistent. Here, you
can clearly see that reliability is directly related to the variability of your
measure. The third scenario shows a case where your hits are spread across the
target and you are consistently missing the center. Your measure in this case
is neither reliable nor valid. Finally, we see the "Robin Hood"
scenario -- you consistently hit the center of the target. Your measure is both
reliable and valid (I bet you never thought of Robin Hood in those terms
before).
Another way we can think about the relationship between reliability and
validity is to set up a 2x2 table. The columns
of the table indicate whether you are trying to measure the same or different
concepts. The rows show whether you are using the same or different methods of
measurement. Imagine that we have two concepts we would like to measure,
student verbal and math ability. Furthermore, imagine that we can measure each
of these in two ways. First, we can use a written, paper-and-pencil exam (very
much like the SAT or GRE exams). Second, we can ask the student's classroom
teacher to give us a rating of the student's ability based on their own
classroom observation.
The first cell on the upper left shows the comparison of the verbal written
test score with the verbal written test score. But how can we compare the same
measure with itself? We could do this by estimating the reliability of the
written test through a test-retest correlation, parallel forms, or an internal
consistency measure. What we are estimating in this cell is the reliability of
the measure.
The cell on the lower left shows a comparison of the verbal written measure
with the verbal teacher observation rating. Because we are trying to measure
the same concept, we are looking at convergent validity.
The cell on the upper right shows the comparison of the verbal written exam
with the math written exam. Here, we are comparing two different concepts
(verbal versus math) and so we would expect the relationship to be lower than a
comparison of the same concept with itself (e.g., verbal versus verbal or math
versus math). Thus, we are trying to discriminate between two concepts and we
would consider this discriminant validity.
Finally, we have the cell on the lower right. Here, we are comparing the
verbal written exam with the math teacher observation rating. Like the cell on
the upper right, we are also trying to compare two different concepts (verbal
versus math) and so this is a discriminant validity estimate. But here, we are
also trying to compare two different methods of measurement (written exam
versus teacher observation rating). So, we'll call this "very discriminant"
to indicate that we would expect the relationship in this cell to be even lower
than in the one above it.
The four cells incorporate the different values that we examine in the multitrait-multimethod
approach to estimating construct validity.
When we look at reliability and validity in this way, we see that, rather
than being distinct, they actually form a continuum. On one end is the
situation where the concepts and methods of measurement are the same
(reliability) and on the other is the situation where concepts and methods of
measurement are different.
arranged by Sadaf Naz