Measuring tools exist for a vast array of tasks, including the assessment of acceleration, particle concentration, brightness, and pH, but the data they provide are meaningless unless some analysis has been conducted to confirm that their readings are accurate.
Another consideration, in addition to accuracy, is consistency. If a scale reads the correct weight two weeks in a row and then starts producing wrong numbers, the consistency of that tool is in doubt. For example, a scale that accurately read 170 pounds should still read 170 pounds one month later if the person has neither lost nor gained weight.
Psychometric tools like survey assessments give us information, but without research, it is impossible to articulate with any degree of certainty what that information actually means. Tilt 365 has previously published a blog post that details the psychometric concepts of validity and reliability, and readers who are curious about those constructs can click here to read that article. For the purposes of this piece, it is enough to know that validity and reliability, in the context of survey assessments, mean accuracy and consistency, respectively.
All Tilt 365 assessments are anchored on the Tilt Model and Framework, which features twelve character strengths. In no particular order, those twelve are Likability, Openness, Inspiration, Creativity, Confidence, Boldness, Integrity, Diligence, Focus, Perspective, Trust, and Empathy. See Figure 1 below for a depiction of the Tilt Model.
Three assessments are currently in the Tilt 365 arsenal, one of which is the Team Agility Predictor™ (TAP). The TAP is a 360-degree tool that taps a potentially wide array of observers, such as team leaders and team members. Because the TAP was shortened earlier this year from 48 slide-bar items to 12, it was important to conduct a validation study on the abbreviated version.
To keep this article from becoming excessively long, we will explain only a subset of our TAP validity and reliability findings, beginning with content validity. As the details below show, validity and reliability can each be assessed in numerous ways.
Whereas four questions previously measured each of the 12 Tilt character strengths, we sought to measure each strength accurately with a single item. The four-person Tilt 365 Science Team generated a lengthy collection of candidate items addressing the 12 character strengths.
Next, the Science Team deliberated on the merits and flaws of each item. Once consensus was reached on the best item to retain for each of the 12 character strengths, that item was deployed in the TAP.
“Convergent validity” sounds imposing and complicated, but the essence of the construct is simply that responses to two assessments that both claim to cover the same concept should have a relationship: they should covary, or correlate. Someone who scores in a very high percentile on one widely accepted personality assessment would be expected to score in about as high a percentile on a different widely accepted personality assessment.
“Covary” and “correlate” mean that as responses to one assessment go in a given direction, the responses to the other, similar assessment should do approximately the same thing. More generally, correlation can be thought of as a relationship. For example, there is generally a correlation (relationship) between daily consumption of extra-large milkshakes and high triglycerides, and there is generally a correlation (relationship) between running five miles a day and a healthy cardiovascular system.
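To make the idea concrete, here is a minimal sketch in Python of how a correlation between two assessments claiming to measure the same concept might be computed. The assessment names and scores are invented for illustration and are not data from the TAP study.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical percentile scores for the same ten people on two
# assessments that claim to measure the same concept.
assessment_a = np.array([92, 75, 60, 88, 45, 70, 81, 55, 95, 67])
assessment_b = np.array([89, 78, 58, 90, 50, 66, 85, 52, 93, 70])

# A strong positive correlation (r close to +1) is evidence of
# convergent validity: as scores on one assessment rise, scores
# on the other rise with them.
r, p_value = pearsonr(assessment_a, assessment_b)
print(f"r = {r:.2f}, p = {p_value:.4f}")
```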
Factor analysis is an analytical tool that helps us understand whether items in a survey assessment are related. Picture a four-item survey in which the first two items tap one concept and the last two tap a different concept, and to which one must respond with strongly disagree, disagree, neutral, agree, or strongly agree.
We would expect that people who strongly agree with item 1 would also strongly agree with item 2 since they are tapping a similar conceptual domain. Similarly, someone who strongly agrees with item 3 would be expected (more or less) to strongly agree with item 4 because the two items are again tapping a similar conceptual domain.
Researchers employing factor analysis use a tool that analyzes responses to survey items and indicates how similar (or how different) two or more items in the survey are to one another: the more strongly people agree or disagree with one item, the more strongly we expect them to agree or disagree with a similar item.
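As a small illustration of the idea, the sketch below simulates responses to the hypothetical four-item survey described above (items 1 and 2 driven by one latent trait, items 3 and 4 by another) and fits a two-factor model with scikit-learn. The data and settings are invented for demonstration; this is not the analysis pipeline used in the TAP study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_respondents = 500

# Simulate two latent traits; items 1-2 reflect the first trait and
# items 3-4 reflect the second, plus some random noise.
trait_a = rng.normal(size=n_respondents)
trait_b = rng.normal(size=n_respondents)
noise = rng.normal(scale=0.4, size=(n_respondents, 4))
items = np.column_stack([
    trait_a, trait_a,   # items 1 and 2
    trait_b, trait_b,   # items 3 and 4
]) + noise

# Fit a two-factor model; each row of the loadings matrix shows how
# strongly one item corresponds to each factor.
fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(items)
loadings = fa.components_.T  # shape: (4 items, 2 factors)
print(np.round(loadings, 2))
# Items 1-2 should load strongly on one factor and items 3-4 on the other.
```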
Our intent with the TAP validation study, in terms of factor analysis, was to confirm that each of the 12 items in the assessment corresponded to one of the twelve distinct character strength domains in the Tilt Model. This is exactly what we found. Technically speaking, we found that all 12 character strength items had “strong factor loadings”.
Put simply, a strong factor loading means that an item corresponds strongly to the single character strength it was written to assess. The strong factor loadings across all 12 items provide additional evidence for the internal structure validity of the Tilt Model and Tilt Framework (consistent with previous work). This just means that the structure of the Tilt Model and Framework held up from an analytical standpoint.
If we hypothetically consider one assessment that evaluates intelligence and another assessment that evaluates favorite types of meals, we would expect that high scores on the intelligence test would have no bearing on the types of responses from the assessment that evaluates favorite meals.
When someone quickly reproduces patterns while completing an intelligence test, we have no reason to believe that they would have a penchant for pizza more than pancakes or steak more than lobster. Discriminant validity would thus be demonstrated, in this example, if we found that responses on the intelligence test had nothing to do with favorite meals.
In the TAP validation study, we compared responses to the 12-item TAP with three constructs that should not have a relationship with the TAP items: team size, business sector, and geographic region.
We found no relationship between TAP responses and team size, which is intuitive: larger (or smaller) teams are not expected to show more balance in character strengths than teams of other sizes. This also provides evidence that TAP responses are neither significantly related to nor contaminated by the (theoretically) unrelated construct of team size.
Similarly, we found no significant differences in mean TAP responses across the business sectors to which the various teams belonged, nor across geographic regions.
Again, this makes sense: teams in one business sector are not expected, on average, to be more balanced in character strengths than teams in another sector, and the same holds for teams in different geographic regions.
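For readers who like to see the mechanics, here is a rough sketch in Python of the kinds of checks described above: a correlation between TAP scores and team size, and a comparison of mean TAP scores across business sectors. All of the data below are simulated, and the study itself may have used different statistical procedures.

```python
import numpy as np
from scipy.stats import pearsonr, f_oneway

rng = np.random.default_rng(1)

# Hypothetical team-level TAP scores, team sizes, and business sectors.
tap_scores = rng.normal(loc=70, scale=10, size=90)
team_sizes = rng.integers(3, 20, size=90)
sectors = np.repeat(["tech", "finance", "healthcare"], 30)

# Discriminant evidence 1: TAP scores should be unrelated to team size.
r, p_size = pearsonr(tap_scores, team_sizes)
print(f"TAP vs. team size: r = {r:.2f}, p = {p_size:.3f}")

# Discriminant evidence 2: mean TAP scores should not differ by sector.
groups = [tap_scores[sectors == s] for s in np.unique(sectors)]
f_stat, p_sector = f_oneway(*groups)
print(f"TAP by sector: F = {f_stat:.2f}, p = {p_sector:.3f}")

# Non-significant results support discriminant validity: the TAP is not
# contaminated by these theoretically unrelated constructs.
```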
Criterion-related validity is all about meaningful predictions. We can picture someone who was tasked with creating an assessment that would provide hiring managers with an accurate notion of how well a particular applicant would perform on the job.
To determine how useful the assessment is from a criterion-related validity standpoint, it could be given to a number of job applicants; after some predetermined period, the job performance of the applicants who were hired would be compared with the assessment's predictions. If the test predicted strong performance and those employees truly performed well, the assessment would demonstrate evidence of criterion-related validity.
To gauge the criterion-related validity of the shortened TAP, we established the relationship between TAP responses and five constructs that the TAP items should, in theory, positively predict. Specifically, those five constructs were transparency, safety, cohesion, conflict resolution, and trust. We anticipated that better responses on the TAP would predict higher levels of all five.
As opposed to just being correlated with responses from the TAP, these constructs were expected to be predicted by the TAP. The difference lies in the direction of the relationship: we expect increases in, say, integrity (a TAP construct) to predict corresponding increases in transparency (not a TAP construct). Similarly, an increase in empathy (TAP) should predict an increase in conflict resolution (non-TAP). All five constructs showed a significant, positive relationship with the TAP responses.
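The kind of prediction involved can be sketched with a simple regression, shown below in Python with simulated data and one hypothetical criterion (transparency). The actual TAP analyses may have used different models, so treat this as an illustration only.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)

# Hypothetical TAP scores and, for the same teams, later ratings of
# transparency (one of the five criterion constructs).
tap_scores = rng.normal(loc=70, scale=10, size=80)
transparency = 0.5 * tap_scores + rng.normal(scale=5, size=80)

# A significant positive slope means higher TAP scores predict higher
# transparency, which is evidence of criterion-related validity.
result = linregress(tap_scores, transparency)
print(f"slope = {result.slope:.2f}, r = {result.rvalue:.2f}, "
      f"p = {result.pvalue:.4f}")
```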
Reliability is the degree to which an assessment is free of error, and it is often operationalized as whether the assessment measures the same construct consistently. There are different ways to evaluate the reliability of an assessment, including test-retest, parallel forms, internal consistency, and inter-rater. We focus first on test-retest reliability.
Test-retest reliability shows the extent to which an assessment remains consistent over time. Simply put, this means that we compare responses to a single assessment generated by a single User, with some predetermined span of time between completions. Ideally, teams develop and grow more cohesive and trusting over time, so ratings on a team developmental assessment like the TAP should NOT be consistent over moderate to long time periods.
During the TAP validation study, test-retest reliability was assessed as a dependability correlation, a more appropriate form of reliability for this type of instrument. The dependability analyses for the current study involved comparing each User's responses to the TAP generated very close together in time.
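Conceptually, a dependability coefficient of this kind can be sketched as a correlation between two administrations completed close together in time, as in the short Python example below. The numbers are simulated and are not the study's actual results.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)

# Hypothetical ratings from the same raters on two administrations of
# the TAP completed a short time apart.
first_administration = rng.normal(loc=70, scale=10, size=60)
second_administration = first_administration + rng.normal(scale=3, size=60)

# A high correlation indicates the instrument yields dependable scores
# when the underlying team has not had time to change.
r, p = pearsonr(first_administration, second_administration)
print(f"dependability correlation: r = {r:.2f}")
```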
Internal consistency reliability provides evidence that items on the same scale are measuring the same construct. Internal consistency reliability is important for any type of assessment with multiple items per construct.
An example involves three questions that all ask about job performance. If they are written so that the best possible performance is indicated by “Strongly agree,” then we would expect a rater to choose strongly agree (or close to it) on all three questions when rating a single employee’s stellar performance.
There are different methods available for evaluating an assessment’s internal consistency reliability, and our internal consistency findings for each of the 12 character strengths in the TAP validation study were mostly in the moderate or satisfactory ranges.
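One widely used index of internal consistency is Cronbach's alpha; the specific method used in the TAP study is not detailed here, so the sketch below is a general illustration with simulated ratings rather than a report of our procedure.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings on three job-performance items for 50 raters,
# all driven by the same underlying impression plus noise.
rng = np.random.default_rng(4)
impression = rng.normal(size=50)
ratings = np.column_stack([impression + rng.normal(scale=0.5, size=50)
                           for _ in range(3)])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```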
Our evaluation of the abbreviated TAP provided evidence that it has sound psychometric properties across several forms of both reliability and validity.