Reliability
Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. Without the agreement of independent observers able to replicate research procedures, or the ability to use research tools and procedures that yield consistent measurements, researchers would be unable to satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. In addition to its important role in research, reliability is critical for many parts of our lives, including manufacturing, medicine, and sports.
One dictionary definition captures the idea: "yielding the same or compatible results in different clinical experiments or statistical trials."
The idea behind reliability is that any significant result must be more than a one-off finding and be inherently repeatable. Other researchers must be able to perform exactly the same experiment, under the same conditions, and generate the same results. This reinforces the findings and helps ensure that the wider scientific community will accept the hypothesis. Without this replication of statistically significant results, the experiment and research have not fulfilled all of the requirements of testability. This prerequisite is essential to a hypothesis establishing itself as an accepted scientific truth.

For example, if you are performing a time-critical experiment, you will be using some type of stopwatch. Generally, it is reasonable to assume that the instrument is reliable and will keep true and accurate time. However, diligent scientists take measurements many times to minimize the chance of malfunction and maintain validity and reliability.
At the other extreme, any experiment that uses human judgment is always going to come under question. This means that such experiments are more difficult to repeat and are inherently less reliable. Reliability is a necessary ingredient for determining the overall validity of a scientific experiment and enhancing the strength of the results.
Reliability and science
Reliability is something that every scientist, especially in the social sciences and biology, must be aware of. In science, the underlying idea is the same, but it needs a much narrower and more unequivocal definition. Another way of looking at reliability is as maximizing the inherent repeatability or consistency of an experiment. To maintain reliability internally, a researcher will use as many repeat sample groups as possible, to reduce the chance of an abnormal sample group skewing the results. If you use three replicate samples for each manipulation, and one generates completely different results from the others, then there may be something wrong with the experiment.
- For many experiments, results follow a ‘normal distribution’ and there is always a chance that your sample group produces results lying at one of the extremes. Using multiple sample groups will smooth out these extremes and generate a more accurate spread of results.
- If your results continue to be wildly different, then there is likely to be something very wrong with your design; it is unreliable.
Reliability and cold fusion
Reliability is also extremely important externally: another researcher should be able to perform exactly the same experiment, with similar equipment, under similar conditions, and achieve the same results. If they cannot, then the design is unreliable.

A good example of a failure to apply the definition of reliability correctly is provided by the cold fusion case of 1989. Fleischmann and Pons announced to the world that they had managed to generate heat from fusion at normal temperatures, rather than in the huge and expensive tori used in most research into nuclear fusion.

This announcement shook the world, but researchers at many other institutions attempted to replicate the experiment with no success. Whether the researchers lied or genuinely made a mistake is unclear, but their results were clearly unreliable.
Reliability and Statistics
Physical scientists expect to obtain exactly the same results every single time, owing to the relative predictability of the physical world. If you are a nuclear physicist or an inorganic chemist, repeat experiments should give exactly the same results, time after time.

Ecologists and social scientists, on the other hand, understand fully that achieving exactly the same results is an exercise in futility. Research in these disciplines incorporates random factors and natural fluctuations and, whilst any experimental design must attempt to eliminate confounding variables and natural variations, there will always be some disparities. The key to performing a good experiment is to make sure that your results are as reliable as possible; if anybody repeats the experiment, powerful statistical tests can compare the results and the scientist can make a solid estimate of statistical reliability.
Testing reliability for the social sciences and education
In the social sciences, testing reliability is a matter of comparing two different versions of an instrument and ensuring that they are similar. An instrument does not necessarily mean a physical device, such as a mass spectrometer or a pH-testing strip. An educational test, a questionnaire, or a scheme for assigning quantitative scores to behavior is also an instrument, of a non-physical sort. The reliability of instruments can be measured in several different ways.

Types of Reliability
There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:
- Inter-Rater or Inter-Observer Reliability: used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
- Test-Retest Reliability: used to assess the consistency of a measure from one time to another.
- Parallel-Forms Reliability: used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
- Internal Consistency Reliability: used to assess the consistency of results across items within a test.
Inter-Rater or Inter-Observer Reliability
Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. Probably it's best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren't changing.
There are two major ways to actually estimate inter-rater reliability. If your measurement consists of categories -- the raters are checking off which category each observation falls in -- you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.
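As a rough sketch of that calculation (the raters, categories, and ratings below are invented for illustration), percent agreement is simply the proportion of observations on which both raters check the same category:

```python
# Percent agreement between two raters assigning each observation to a category.
# The category labels and ratings are made-up illustration data.
rater_a = ["on-task", "off-task", "on-task", "on-task", "off-task", "on-task"]
rater_b = ["on-task", "off-task", "on-task", "off-task", "off-task", "on-task"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = 100 * agreements / len(rater_a)
print(f"Percent agreement: {percent_agreement:.0f}%")  # 5 of 6 -> 83%
```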
The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
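A minimal sketch of the continuous case (with invented 1-to-7 activity ratings) simply correlates the two observers' scores:

```python
# Inter-rater reliability for a continuous measure: the Pearson correlation
# between two observers rating classroom activity on a 1-to-7 scale.
# The ratings are invented illustration data.
import numpy as np

rater_a = np.array([3, 5, 4, 6, 2, 7, 5, 4])
rater_b = np.array([4, 5, 4, 6, 3, 6, 5, 5])

inter_rater_r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Inter-rater correlation: {inter_rater_r:.2f}")
```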
You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would have all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.
Qualitative assessments and inter-rater reliability
Any qualitative assessment using two or more researchers must establish inter-rater reliability to ensure that the results generated will be useful.

One good example is Bandura's Bobo Doll experiment, which used a scale to rate the levels of aggression displayed by young children. Apart from extensive pre-testing, the observers constantly compared and calibrated their ratings, adjusting their scales to ensure that they were as similar as possible.

Instrument reliability
Instrument reliability is a way of ensuring that any instrument used for measuring experimental variables gives the same results every time.
Martyn Shuttleworth (2009)

In the physical sciences, the term is self-explanatory: it is a matter of making sure that every piece of hardware, from a mass spectrometer to a set of weighing scales, is properly calibrated.

Instruments in research
As an example, a researcher will always test the reliability of weighing scales with a set of calibration weights, ensuring that the results given are within an acceptable margin of error. Some highly accurate balances can give false results if they are not placed upon a completely level surface, so this calibration process is the best way to catch such problems.

In the non-physical sciences, the definition of an instrument is much broader, encompassing everything from a set of survey questions to an intelligence test. A survey to measure reading ability in children must produce reliable and consistent results if it is to be taken seriously.
Political opinion polls, on the other hand, are notorious for producing inaccurate results and delivering a near unworkable margin of error.
In the physical sciences, it is possible to isolate a measuring instrument from external factors, such as environmental conditions and temporal factors. In the social sciences, this is much more difficult, so any instrument must be tested to establish a reasonable level of reliability.
Test of stability
Any test of instrument reliability must assess how stable the instrument is over time, ensuring that the same test performed upon the same individual gives the same results. The test-retest method is one way of ensuring that an instrument is stable over time.

Of course, there is no such thing as perfection; there will always be some disparity and potential for regression, so statistical methods are used to determine whether the stability of the instrument is within acceptable limits.
Test of equivalence
Testing equivalence involves ensuring that a test administered to two people, or two similar tests administered at the same time, gives similar results. Split-testing is one way of ensuring this, especially in tests or observations where the results are expected to change over time. In a school exam, for example, the same test given to the same subjects will generally produce better results the second time around, so testing stability is not practical.

Checking that two researchers observe similar results also falls within the remit of the test of equivalence.
Test-retest reliability
The test-retest reliability method is one of the simplest ways of testing the stability and reliability of a research instrument over time.
Martyn Shuttleworth (2009)
For example, if a group of students takes a test, you would expect them to show very similar results if they take the same test a few weeks later. This definition relies upon there being no confounding factor during the intervening time interval.

Instruments such as IQ tests and surveys are prime candidates for test-retest methodology, because there is little chance of people experiencing a sudden jump in IQ or suddenly changing their opinions.

On the other hand, educational achievement tests are often not suitable, because students will learn much more over the intervening period and show better results on the second test.
We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time: the closer in time we get, the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.
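As a minimal sketch (the scores below are invented), the test-retest estimate is just the correlation between the two administrations:

```python
# Test-retest reliability: correlate the same people's scores on the same
# test given on two occasions. The scores are invented illustration data.
import numpy as np

time_1 = np.array([98, 105, 112, 87, 120, 101, 95, 110])  # first administration
time_2 = np.array([101, 103, 115, 90, 118, 99, 97, 108])  # same people, a few weeks later

test_retest_r = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability estimate: {test_retest_r:.2f}")
```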
Parallel-Forms Reliability
In parallel-forms reliability you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability. One major problem with this approach is that you have to be able to generate lots of items that reflect the same construct. This is often no easy feat. Furthermore, this approach assumes that the randomly divided halves are parallel or equivalent; even by chance, this will sometimes not be the case.

The parallel-forms approach is very similar to the split-half reliability described below. The major difference is that parallel forms are constructed so that the two forms can be used independently of each other and considered equivalent measures. For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would be even better if we randomly assign individuals to receive Form A or B on the pretest and then switch them on the posttest. With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for the purpose of estimating reliability.
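A minimal sketch of the parallel-forms procedure, under the assumption of a pool of 20 right/wrong items and simulated respondents (all data invented), randomly splits the item pool into Form A and Form B and correlates the two total scores:

```python
# Parallel-forms reliability sketch: randomly divide an item pool into two
# forms, score both forms for the same respondents, and correlate the totals.
# Responses are simulated so that items share a common underlying trait.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 30, 20

ability = rng.normal(size=n_people)                      # latent trait per person
noise = rng.normal(size=(n_people, n_items))
responses = (ability[:, None] + noise > 0).astype(int)   # 1 = correct, 0 = incorrect

item_order = rng.permutation(n_items)
form_a, form_b = item_order[:10], item_order[10:]

scores_a = responses[:, form_a].sum(axis=1)
scores_b = responses[:, form_b].sum(axis=1)
parallel_forms_r = np.corrcoef(scores_a, scores_b)[0, 1]
print(f"Parallel-forms reliability estimate: {parallel_forms_r:.2f}")
```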
Internal Consistency Reliability
In internal consistency reliability estimation we use our single measurement instrument, administered to a group of people on one occasion, to estimate reliability. In effect, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are for different items for the same construct within the measure. There are a wide variety of internal consistency measures that can be used.
Internal consistency reliability defines the consistency of the results delivered in a test, ensuring that the various items measuring the different constructs deliver consistent scores.
Martyn Shuttleworth (2009)
For example, an English test might be divided into vocabulary, spelling, punctuation and grammar sections. An internal consistency reliability test provides a measure of whether each of these particular aptitudes is being measured correctly and reliably.

One way of testing this is by using a test-retest method, where the same test is administered some time after the initial test and the results compared.
However, this creates some problems and so many researchers prefer to measure internal consistency by including two versions of the same instrument within the same test. Our example of the English test might include two very similar questions about comma use, two about spelling and so on.
The basic principle is that the student should give the same answer to both – if they do not know how to use commas, they will get both questions wrong. A few nifty statistical manipulations will give the internal consistency reliability and allow the researcher to evaluate the reliability of the test.
There are three main techniques for measuring the internal consistency reliability, depending upon the degree, complexity and scope of the test.
They all check that the results and constructs measured by a test are correct, and the exact type used is dictated by subject, size of the data set and resources.
Split-Halves test
The split-halves test for internal consistency reliability is the simplest type, and involves dividing a test into two halves. For example, a questionnaire to measure extroversion could be divided into odd and even questions. The results from both halves are statistically analyzed, and if there is a weak correlation between the two, then there is a reliability problem with the test.

The split-halves test gives a measurement of between zero and one, with one meaning a perfect correlation. The division of the questions into two sets should be random. Split-halves testing was a popular way to measure reliability because of its simplicity and speed. However, in an age where computers can take over the laborious number crunching, scientists tend to use much more powerful tests.

In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate is simply the correlation between these two total scores; a correlation of, say, .87 would indicate strong consistency between the halves.
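A minimal sketch of the split-half calculation (using simulated right/wrong responses in place of real questionnaire data) totals the odd- and even-numbered items separately and correlates the two totals:

```python
# Split-half reliability sketch: divide a test's items into two halves
# (odd- and even-numbered items here), total each half for every respondent,
# and correlate the two totals. The responses are simulated illustration data.
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(size=50)                                     # latent trait per person
responses = (ability[:, None] + rng.normal(size=(50, 10)) > 0).astype(int)

half_1 = responses[:, 0::2].sum(axis=1)  # items 1, 3, 5, ...
half_2 = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
split_half_r = np.corrcoef(half_1, half_2)[0, 1]
print(f"Split-half correlation: {split_half_r:.2f}")
```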
Cronbach’s Alpha test
The Cronbach's Alpha test not only averages the correlation between every possible combination of split halves, but it also allows multi-level responses. For example, a series of questions might ask the subjects to rate their response between one and five. Cronbach's Alpha gives a score of between zero and one, with 0.7 generally accepted as a sign of acceptable reliability.

The test also takes into account both the number of questions and the number of potential responses. A 40-question test with possible ratings of 1 to 5 is seen as giving a more accurate estimate than a ten-question test with three possible levels of response.
Of course, even with Cronbach's clever methodology, which makes calculation much simpler than crunching through every possible permutation, this is still a test best left to computers and statistics spreadsheet programmes.
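For what the calculation looks like in practice, here is a minimal sketch of the standard Cronbach's Alpha formula, alpha = k/(k-1) x (1 - sum of item variances / variance of the total score), applied to a small set of invented 1-to-5 ratings:

```python
# Cronbach's Alpha: alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
# The 1-to-5 ratings below are invented; rows are respondents, columns are items.
import numpy as np

ratings = np.array([
    [4, 5, 4, 3, 5],
    [2, 3, 2, 2, 3],
    [5, 5, 4, 4, 5],
    [3, 3, 3, 2, 4],
    [4, 4, 5, 3, 4],
])

k = ratings.shape[1]
item_variances = ratings.var(axis=0, ddof=1)      # variance of each item
total_variance = ratings.sum(axis=1).var(ddof=1)  # variance of each person's total score
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")  # roughly 0.7 or above is usually taken as acceptable
```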
Kuder-Richardson test
The Kuder-Richardson test for internal consistency reliability is a more advanced, and slightly more complex, version of the split-halves test. In this version, the test works out the average correlation for all possible split-half combinations in a test. The Kuder-Richardson test also generates a correlation of between zero and one, giving a more accurate result than a single split-halves test. The weakness of this approach, as with split-halves, is that the answer to each question must be a simple right-or-wrong answer, scored as zero or one.

For multi-scale responses, more sophisticated techniques, such as Cronbach's Alpha above, are needed to measure internal consistency reliability.
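For the dichotomous case, here is a minimal sketch of the KR-20 formula, kr20 = k/(k-1) x (1 - sum(p_i * q_i) / variance of the total score), where p_i is the proportion answering item i correctly and q_i = 1 - p_i (the answer matrix is invented):

```python
# Kuder-Richardson (KR-20) reliability for right/wrong items.
# Rows are students, columns are items; 1 = correct, 0 = incorrect (invented data).
import numpy as np

answers = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
])

k = answers.shape[1]
p = answers.mean(axis=0)                          # proportion correct per item
q = 1 - p
total_variance = answers.sum(axis=1).var(ddof=1)  # variance of each student's total score
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_variance)
print(f"KR-20 reliability estimate: {kr20:.2f}")
```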
Validity in Research
"Any research can be affected by different kinds of
factors which, while extraneous to the concerns of the research, can invalidate
the findings"
(Seliger & Shohamy 1989, 95).
Validity refers to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure. Validity is concerned with the study's success at measuring what the researchers set out to measure.

Researchers should be concerned with both external and internal validity. External validity refers to the extent to which the results of a study are generalizable or transferable.

Validity encompasses the entire experimental concept and establishes whether the results obtained meet all of the requirements of the scientific research method.
Types of validity
Scholars discuss several types of validity:
- Face Validity
- Criterion Related Validity
- Construct Validity
- Content Validity
Face Validity
Face validity is a measure of how representative a research project is ‘at face value,’ and whether it appears to be a good project. Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support.
This is the least scientific form of validity, as it is not quantified using statistical methods. It is not validity in the technical sense of the term; it is concerned with whether it seems like we measure what we claim to measure. Here we look at how valid a measure appears on the surface and make subjective judgments based on that. For example, a survey might appear valid to the respondents, with the questions selected because they look valid to the administrator; the administrator might then ask a group of untrained observers whether the questions appear valid to them. In research it is never sufficient to rely on face judgments alone, and more quantifiable methods of validity are necessary in order to draw acceptable conclusions. Because there are many possible measurement instruments to consider, face validity can be useful for choosing among approaches, but it should never be trusted on its own merits.
Criterion Related Validity
Criterion-related validity, also referred to as instrumental validity, is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has already been demonstrated to be valid.

For example, imagine a hands-on driving test has been shown to be an accurate test of driving skills. A written driving test can then be validated using a criterion-related strategy, by comparing scores on the written test with scores from the hands-on test.
The accuracy of a measure is demonstrated by comparing it with a measure that is known to be valid; in other words, by correlating it with other measures of known validity. For this to work, you must know that the criterion has been measured well, and be aware that appropriate criteria do not always exist. What you are doing is checking the performance of your operationalization against criteria. The criterion you use as a standard of judgment determines which approach you take:
- Predictive Validity - operationalization’s ability to predict what it is theoretically able to predict. The extent to which a measure predicts expected outcomes.
- Concurrent Validity - operationalization’s ability to distinguish between groups it theoretically should be able to. This is where a test correlates well with a measure that has been previously validated.
Construct Validity
Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.

To understand whether a piece of research has construct validity, three steps should be followed:
- First, the theoretical relationships must be specified.
- Second, the empirical relationships between the measures of the concepts must be examined.
- Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested.
Construct validity defines how well a test or experiment measures up to its claims. A test designed to measure depression must measure only that particular construct, not closely related constructs such as anxiety or stress.
- Convergent validity tests that constructs that are expected to be related are, in fact, related.
- Discriminant validity (also referred to as divergent validity) tests that constructs that should have no relationship are, in fact, unrelated.
Content Validity
Content validity is the estimate of how well a measure represents every element of a construct.

Content validity is illustrated by the following examples. Suppose researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity, because it excludes other mathematical functions. Although establishing content validity for placement-type exams seems relatively straightforward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.
This is also a subjective measure, but unlike face validity we ask whether the content of a measure covers the full domain of the construct. If a researcher wanted to measure introversion, they would have to first decide what constitutes a relevant domain of content for that trait. This is considered a subjective form of measurement because it still relies on people's perceptions for measuring constructs that would otherwise be difficult to measure. Where it distinguishes itself is through its use of experts in the field or individuals belonging to a target population. Such a study can be made more objective through the use of rigorous statistical tests. For example, a content validity study could inform researchers how well the items used in a survey represent their content domain, how clear they are, and the extent to which they maintain the theoretical factor structure assessed by factor analysis.
Internal Validity
Internal validity refers to the rigor with which the study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and wasn't measured). It also refers to the extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explore (Huitt, 1998). In other words, it is the extent to which the independent variable can accurately be stated to produce the observed effect: if the change in the dependent variable is due only to the independent variable(s), then internal validity is achieved. Think of this as the degree to which a result can be attributed to the experimental manipulation.
Internal validity is a measure which ensures that a researcher's experimental design closely follows the principle of cause and effect: "Could there be an alternative cause, or causes, that explain my observations and results?"

Internal validity dictates how an experimental design is structured and encompasses all of the steps of the scientific research method. Even if your results are great, sloppy and inconsistent design will compromise your integrity in the eyes of the scientific community. Internal validity and reliability are at the core of any experimental design.
External Validity
External validity refers to the extent to which the results of a study can be generalized beyond the sample; that is, can you apply your findings to other people and settings? Think of this as the degree to which a result can be generalized.
External validity is usually split into two distinct types, population validity and ecological validity and they are both essential elements in judging the strength of an experimental design.
Assessing external validity is a matter of examining the results and questioning whether the relationships observed would hold beyond the specific conditions of the study.
In 1966, Campbell and Stanley proposed the commonly accepted definition of external validity:

"External validity asks the question of generalizability: To what populations, settings, treatment variables and measurement variables can this effect be generalized?"

External validity is one of the most difficult of the validity types to achieve, and it is at the foundation of every good experimental design.
By Martyn Shuttleworth (2009)
Many scientific disciplines, especially the social sciences, face a long battle to prove that their findings represent the wider population in real-world situations. The main criterion of external validity is generalization: whether results obtained from a small sample group, often in laboratory surroundings, can be extended to make predictions about the entire population.
Reliability & Validity
Reliability and validity are often confused, but the terms actually describe two different concepts, although they are often closely inter-related. The distinction is best summed up with an example. Suppose a researcher devises a new test that measures IQ more quickly than the standard IQ test:
- If the new test delivers scores for a candidate of 87, 65, 143 and 102, then the test is neither reliable nor valid; it is fatally flawed.
- If the test consistently delivers a score of 100 when checked, but the candidate's real IQ is 120, then the test is reliable, but not valid.
- If the researcher's test delivers a consistent score of 118, then that is pretty close, and the test can be considered both valid and reliable.
Reliability, in simple terms, describes the repeatability and consistency of a test. Validity defines the strength of the final results and whether they can be regarded as accurately describing the real world.
One of my favorite metaphors for the relationship between reliability and validity is that of the target. Think of the center of the target as the concept that you are trying to measure. Imagine that for each person you are measuring, you are taking a shot at the target. If you measure the concept perfectly for a person, you hit the center of the target. If you don't, you miss the center. The more you are off for that person, the further you are from the center.
Another way we can think about the relationship between reliability and validity is with a 2x2 table. The columns of the table indicate whether you are trying to measure the same or different concepts. The rows show whether you are using the same or different methods of measurement. Imagine that we have two concepts we would like to measure, student verbal and math ability. Furthermore, imagine that we can measure each of these in two ways. First, we can use a written, paper-and-pencil exam (very much like the SAT or GRE exams). Second, we can ask the student's classroom teacher to give us a rating of the student's ability based on their own classroom observation.
The cell on the upper left would compare the verbal written measure with itself: the same concept measured by the same method, which is where reliability sits. The cell on the lower left shows a comparison of the verbal written measure with the verbal teacher observation rating. Because we are trying to measure the same concept with different methods, we are looking at convergent validity.
The cell on the upper right shows the comparison of the verbal written exam with the math written exam. Here, we are comparing two different concepts (verbal versus math) and so we would expect the relationship to be lower than a comparison of the same concept with itself (e.g., verbal versus verbal or math versus math). Thus, we are trying to discriminate between two concepts and we would consider this discriminant validity.
Finally, we have the cell on the lower right. Here, we are comparing the verbal written exam with the math teacher observation rating. Like the cell on the upper right, we are also trying to compare two different concepts (verbal versus math) and so this is a discriminant validity estimate. But here, we are also trying to compare two different methods of measurement (written exam versus teacher observation rating). So, we'll call this very discriminant to indicate that we would expect the relationship in this cell to be even lower than in the one above it.
The four cells incorporate the different values that we examine in the multitrait-multimethod approach to estimating construct validity.
When we look at reliability and validity in this way, we see that, rather than being distinct, they actually form a continuum. On one end is the situation where the concepts and methods of measurement are the same (reliability) and on the other is the situation where concepts and methods of measurement are different.
arranged by Sadaf Naz