Reexamining the Construction of the Serious Leisure Inventory and Measure
A Reevaluation of the Serious Leisure Scale Using Item Response Theory
- Published in: Journal of Leisure and Recreation Studies, Volume 35, Issue 1, pp. 107-116, March 2011
A fundamental challenge in understanding the attitude of serious leisure lies in implementing reliable measures. Substantial effort has gone into demonstrating reliable psychometric properties for the items on the Serious Leisure Inventory and Measure (SLIM), but that work rests on classical test theory, which could lead to inaccurate inferences about the attitude of serious leisure. To determine whether items on SLIM have measurement issues, it is necessary to use a modern psychometric technique such as item response theory (IRT), which provides more insightful and helpful information about measures in leisure studies. This study introduces Samejima's (1969) graded response model, one of the most frequently used IRT models for Likert-type data, and suggests that some items on SLIM could be improved in a number of ways.
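The most fundamental problem in understanding the characteristics of serious leisure is implementing reliable measurement. In leisure studies, considerable effort leading up to the development of the Serious Leisure Inventory and Measure (SLIM) has gone into demonstrating reliable psychometric properties of items on the basis of classical test theory. Classical test theory, however, may yield inaccurate inferences about the characteristics of serious leisure. It is therefore necessary to use a modern measurement technique such as item response theory (IRT) to determine whether the items of the 72-item SLIM developed by Gould, Moore, McGuire, and Stebbins (2008) have measurement problems. IRT can provide more insightful and useful information for the accurate measurement of the various scales used in leisure studies. This study introduces Samejima's (1969) graded response model, one of the most widely used IRT models for Likert-type data, and argues that several items on SLIM should be revised and supplemented in a variety of ways.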
serious leisure, graded response model, item response theory, reliability
Serious leisure activities are strongly intertwined with attitudinal, economic, social, and psychological constructs. Thus, the measurement of serious leisure in leisure studies can be a very complicated and difficult issue (Stebbins, 1982, 1992, 2001a). This has prompted efforts to develop accurate and reliable measurement of serious leisure, which forms the foundation of serious leisure research.
Since the term serious leisure was coined (Stebbins, 1982, 1992, 2001a), the attitude of serious leisure has been increasingly studied in a variety of contexts such as masters swimming (Hastings, Kurth, Schloder, & Cyr, 1995), bass fishing (Yoder, 1997), adult amateur ice skating (McQuarrie & Jackson, 1996), dog sports (Baldwin & Norris, 1999), motorsport events (Harrington, Cuskelly, & Auld, 2000), football fandom (Jones, 2000), computer gaming (Bryce & Rutter, 2003), college football fandom (Gibson, Willming, & Holdnak, 2002), adventure tours (Kane & Zink, 2004), sport tourism (Green & Jones, 2005), and quilting (Stalp, 2006). Furthermore, the recent short and long versions of the Serious Leisure Inventory and Measure (SLIM) assess an individual's attitude toward serious leisure on 18 subdimensions (Gould, Moore, McGuire, & Stebbins, 2008). The long version of SLIM consists of 72 items (4 items for each of the 18 dimensions), and the short version consists of 54 items (3 items for each of the 18 dimensions).
Despite the large body of research in leisure studies, the psychometric properties of items on SLIM have never been studied thoroughly. One major limitation of previous research in leisure studies is that researchers draw inferences based on classical psychometric techniques (e.g., using total scores and means) without considering measurement theory. Classical psychometric techniques based on total scores depend on the number of items and the characteristics of the samples under study (Hambleton, Swaminathan, & Rogers, 1991). Furthermore, it is essential to investigate the psychometric quality of the items on the two currently available versions of SLIM, since even a simple modification of a measure can have a large impact on its properties (Keller & Dansereau, 2001). In order to show the importance of the psychometric properties of items measuring the attitude of serious leisure, this study introduces item response theory (IRT) and demonstrates some aspects of IRT that can be usefully applied in constructing measures of serious leisure.
Item response theory (IRT) has been widely used for analyzing psychometric characteristics of items and for estimating an individual's latent ability, θ, in educational and psychological measurement (Lord & Novick, 1968), performance assessment (Muraki, Hombo, & Lee, 2000), attachment research (Fraley, Waller, & Brennan, 2000), leadership research (Scherbaum, Finlinson, Barden, & Tamanini, 2006), and health research (Hays, Morales, & Reise, 2000; Hays & Lipscomb, 2007). IRT is a statistical model that reveals how an individual's item responses are related to the individual's θ and to item properties (Baker & Kim, 2004; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985; Lord, 1980), and it could allow insightful and accurate inferences about the attitude of serious leisure in serious leisure studies.
IRT provides a test information function and the standard error of measurement (SEM) to index the degree of measurement precision across the full range of latent abilities. Thus, by using IRT, SLIM can be evaluated in terms of the amount of information and precision it provides at specific ranges of latent ability. Item information for each item on a measure can be computed at any level of latent ability. Figure 1 shows that item 72 provides the most information for θ ranging from -2 to 2. Test information can also be computed by summing the item information functions to represent the relative precision of a measure across different levels of θ. Figure 3 shows that SLIM is most informative for θ ranging from -2 to 1.5. The red dotted line in the test information function in Figure 3 indicates SEM across latent abilities. SEM increases as the test information function decreases at the lower and upper ends of the latent ability continuum. Additional detail is provided elsewhere (Baker & Kim, 2004; Lord, 1980).
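In standard IRT notation, the relationship between item information, test information, and SEM described above can be written as follows. The symbols $I_j$, $I$, and $J$ are introduced here for illustration and are not taken from the original text.

```latex
I(\theta) = \sum_{j=1}^{J} I_j(\theta), \qquad \mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}}
```

Here $I_j(\theta)$ is the information function of item $j$ and $J$ is the number of items (72 for the long version of SLIM), so SEM is largest wherever the test information is smallest.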
Along with a test information function and SEM, it is helpful to reveal the relationship between item properties (e.g., item difficulty and item discrimination) and an individual's θ, because items provide the most information when item difficulty is matched to an individual's θ. The relationship between item properties and an individual's θ can be described by an item characteristic curve for dichotomously scored items and a category characteristic curve for polytomously scored items.
It is also important to choose a proper IRT model for a measure in order to obtain accurate estimates of item parameters and an individual's θ. Certain IRT models are preferred over others depending on the type of response. When responses are scored dichotomously, frequently used IRT models include the Rasch or one-parameter model (Rasch, 1960) and the two- and three-parameter models (Birnbaum, 1957, 1968). For Likert-type data, several IRT models, such as a graded response model or a generalized partial credit model (Muraki, 1992; Samejima, 1969), can be used.
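As a minimal sketch of the dichotomous models named above, the function below evaluates a three-parameter logistic item characteristic curve. Fixing the guessing parameter at 0 gives the two-parameter model, and additionally fixing the slope at 1 gives a Rasch-type curve. The function name `icc`, its argument names, and the example values are illustrative assumptions, not taken from the study, and the plain logistic metric (no 1.7 scaling constant) is assumed.

```python
import math


def icc(theta, alpha=1.0, beta=0.0, guess=0.0):
    """Probability of a keyed response under a three-parameter logistic model.

    theta: latent ability; alpha: slope (discrimination);
    beta: difficulty; guess: lower asymptote (pseudo-guessing).
    All names are hypothetical labels chosen for this sketch.
    """
    logistic = 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))
    return guess + (1.0 - guess) * logistic


# Hypothetical parameter values, for illustration only.
p_rasch = icc(theta=0.5, beta=0.5)                        # slope 1, no guessing
p_2pl = icc(theta=1.0, alpha=2.0, beta=0.0)               # free slope
p_3pl = icc(theta=0.0, alpha=2.0, beta=0.0, guess=0.2)    # adds lower asymptote
print(p_rasch, p_2pl, p_3pl)
```

When θ equals the difficulty parameter, the one- and two-parameter curves pass through .5, which is why items are most informative for respondents whose θ is near the item difficulty.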
The IRT approach to examining the psychometric properties of items on a measure is much more complex than the classical test theory (CTT) approach. However, IRT has a number of advantages over CTT for measuring an individual's attitude toward serious leisure. Dependence between items and samples emerges as a major limitation of CTT, even though its relatively weak theoretical assumptions (Hambleton & Jones, 1993) have made CTT-based tools such as coefficient alpha (Cronbach, 1951) and generalizability theory (Brennan, 2001) popular in various contexts. In the CTT approach, the commonly used item statistics in test development (e.g., item difficulty and item discrimination) depend on the particular characteristics of the sample (Lord & Novick, 1968), and estimates of an individual's θ depend on the particular test items administered. For instance, the test scores and reliability of SLIM could differ across samples with different characteristics. In IRT, however, the psychometric properties of items (e.g., item difficulty and item discrimination parameters) do not depend on the particular distribution of characteristics in the sample, and estimates of an individual's θ do not depend on the particular characteristics of the items on a measure (Hambleton et al., 1991).
IRT also provides richer information than CTT in terms of item-level information and SEM. CTT provides no item-level information about how an individual performs when confronted with the items on a test. CTT yields a single estimate of reliability and a corresponding standard error of measurement that is assumed to be the same for all individuals (Hambleton & Swaminathan, 1985). Unlike CTT, IRT provides an item-level information function, which indicates how well the item is working at each level of an individual's θ. IRT also estimates SEM for each level of an individual's θ (Embretson & Reise, 2000; Lord, 1980).
In spite of all the potential benefits, IRT must be applied with great care in serious leisure research. Unlike CTT, IRT makes a number of strong assumptions such as unidimensionality and local independence. The inferences from the application of IRT are valid only when the underlying assumptions are met. The assumption of local independence is that a response to any one item is unrelated to the responses to other items once an individual's θ is controlled for (Lord & Novick, 1968). In addition, the assumption of unidimensionality is that the construct of latent ability can be measured by a dominant underlying general factor rather than only one specific underlying factor (Drasgow & Parsons, 1983; Kirisci, Hsu, & Yu, 2001; Reckase, 1979). Despite considerable effort to resolve how unidimensionality should be assessed, there are no universally accepted statistical tests. However, it is recommended that the dimensionality of measures be tested by factor analysis (Reckase, 1979) and not by reliability and internal consistency, which cannot be used as an index of unidimensionality (Lord & Novick, 1968; Green, Lissitz, & Mulaik, 1977). Also, it is well known that any instrument that satisfies the unidimensionality assumption can also be considered to meet the assumption of local independence, but not vice versa (Hambleton et al., 1991).
One practical limitation is that IRT models in general require relatively large sample sizes to obtain robust and meaningful results (Hulin, Lissak, & Drasgow, 1982). There is no absolute guideline for the minimum sample size required for a particular IRT model. However, regarding sample size in Samejima's graded response model (GRM), a simulation study (Lautenschlager, Meade, & Kim, 2006) showed that, for a 20-item measure and a sample size of 200, the generated and estimated item and latent ability parameters were highly correlated (.90).
Construction and validation of psychological instruments involve analysis of the items by either traditional statistical methods based on CTT or IRT. Despite the advantages offered by IRT over traditional statistical methods of assessing performance, IRT has never been used in the leisure research field. This study applies GRM (Samejima, 1969) to evaluate the recently developed SLIM. The MULTILOG (Thissen, 2003) computer program was used to estimate item properties such as item parameters and test information functions for the 72 items. Next, the least desirable items, identified on the basis of their item information functions, were removed from the scale.
Leisure activity participants who held memberships at fitness centers in Seoul, Korea, were asked to complete a questionnaire consisting of 72 items measuring serious leisure. A cluster random sampling method was used (50 participants from each of six randomly selected fitness centers). Specifically, all fitness centers in Seoul were numbered alphabetically, and six centers were selected using a table of random numbers. Paper-and-pencil surveys were distributed to 300 leisure activity participants and returned directly to the survey administrators. Of the 300 questionnaires collected, 45 were excluded from the analyses because 50% or more of their data on two or more scales were missing not at random. A total of 255 surveys were therefore used in the analyses. The average age of the participants was 29.06 years (SD = 9.65), and all respondents were 18 years old or older. Of the 255 participants, the majority were male (61.2%). The average annual household income was $42,310 (SD = 32,660), and 66.7% of the respondents were married.
SLIM, developed by Gould et al. (2008), was translated into Korean. The questionnaire for this study included SLIM, consisting of 72 items, and demographic characteristics. The short and long versions of SLIM assess 18 subdimensions. All items were measured on a 5-point Likert-type scale (1 = strongly disagree to 5 = strongly agree).
Samejima's GRM was fitted with the MULTILOG (Thissen, 2003) computer program because the item responses are ordered categorical responses. GRM can be considered an extension of the two-parameter model to polytomous items: it computes conditional probabilities for successive steps and estimates the probability of each response category at each possible value of an individual's θ. In GRM, it is assumed that the value of θ is smaller for individuals who choose the first response option than for individuals who choose the second response option in ordered categorical responses.
GRM determines boundaries reflecting where a given individual is expected to obtain a certain category score (or higher) versus the categories below it. GRM computes threshold parameters, βi, with a common slope, α, for a given item and also allows for different spacing of categories across items. Since α is restricted to be the same for all categories within an item, the category order will always be the same. For example, with five response categories, GRM estimates one slope parameter, α, and four threshold parameters, βi. The item slope parameter indicates how well an item is able to discriminate between continuous trait levels near the inflection point and can be either fixed or free. A high value of α indicates that the item's response categories differentiate well among the ability levels of those who choose adjacent response categories (Baker & Kim, 2004). In addition, βi reflects the minimum level of θ needed to respond above that location with a probability of .5.
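$P_k(\theta)$, the probability that an individual with latent ability θ chooses response category k, is defined as the difference between successive boundary curves:

$$P_k(\theta) = P^*_{k-1}(\theta) - P^*_{k}(\theta), \qquad P^*_{i}(\theta) = \frac{\exp[\alpha(\theta - \beta_i)]}{1 + \exp[\alpha(\theta - \beta_i)]}$$

where $P^*_{i}(\theta)$ is the boundary curve giving the probability of responding above threshold βi. For the lowest and highest options, the lowest assumes the probability is 1.0 that the response is in or above the lowest category ($P^*_{0}(\theta) = 1$), whereas the highest assumes the probability is zero that the response is above the highest category ($P^*_{5}(\theta) = 0$ for five response categories).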
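The boundary-curve definition described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the estimation routine used in the study: the function names, the logistic metric without a scaling constant, and the example slope and thresholds are all hypothetical.

```python
import math


def boundary_prob(theta, alpha, beta):
    """Probability of responding above threshold beta (logistic boundary curve)."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))


def grm_category_probs(theta, alpha, betas):
    """Category probabilities for one graded-response item.

    betas: increasing list of thresholds; an item with K categories
    has K - 1 thresholds. The boundary below the lowest category is
    fixed at 1.0 and the boundary above the highest category at 0.0.
    """
    cumulative = [1.0] + [boundary_prob(theta, alpha, b) for b in betas] + [0.0]
    # Each category probability is the difference of adjacent boundary curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(betas) + 1)]


# Hypothetical five-category item (four thresholds), not an item from SLIM.
probs = grm_category_probs(theta=0.0, alpha=1.5, betas=[-2.0, -1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because the thresholds are ordered and share one slope, every category probability is non-negative and the five probabilities sum to 1 at any θ.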
The dimensionality of SLIM was tested by principal axis factor analysis with the varimax rotation method in order to check unidimensionality. The results indicated that the first eigenvalue was about eight times larger than the second, and the first factor explained 43.9% of the total variance. Comparing the ratio of the first to second eigenvalues with the ratio of the second to third eigenvalues yielded a value of 7.272, which provided additional evidence for unidimensionality. Thus, it is reasonable to conclude that SLIM sufficiently satisfies the assumption of unidimensionality, although SLIM was originally designed to measure more than one factor.
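The eigenvalue comparison can be illustrated with a short sketch. It assumes that the comparison means dividing the first-to-second eigenvalue ratio by the second-to-third ratio, which the text does not state explicitly. The function name and the eigenvalues below are hypothetical and are not the values obtained in this study.

```python
def eigenvalue_ratio_index(eigenvalues):
    """Return (first/second) divided by (second/third) for the largest eigenvalues.

    Assumed reading of the comparison described in the text; a large value
    suggests one dominant factor followed by much smaller, similar factors.
    """
    first, second, third = sorted(eigenvalues, reverse=True)[:3]
    return (first / second) / (second / third)


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [30.0, 4.0, 3.5, 2.1]
index = eigenvalue_ratio_index(example_eigenvalues)
print(round(index, 4))
```

A value well above 1 indicates that the drop from the first to the second eigenvalue is much steeper than the drop from the second to the third.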
The marginal reliability of SLIM was .9828 for the 72 items, and biserial correlations ranged from .297 to .750. Marginal reliability dropped only slightly (by .0004, from .9828 to .9824) once the 12 items identified as poorly constructed in terms of their item information functions were deleted, as shown in Table 1.
Difficulty parameters across the 72 items were mostly negative. Because lower scores reflect a less serious attitude, this indicates that a highly serious attitude was not required to endorse the items on SLIM. These results might imply that SLIM is more appropriate for measuring individuals who lie in the middle range of serious leisure.
Once the 72 items were analyzed by GRM, item and test information functions were calculated. The amount of item information at different levels of an individual's θ showed that most items provide rich information in the moderately low range of θ.
However, some items (1, 2, 14, 16, 17, 18, 19, 48, 49, 50, 51, and 52) provided relatively little information compared to other items.
The item information functions of those 12 items were relatively flat across the whole range of an individual's θ. For instance, item 51 was equally imprecise over the whole range of an individual's θ, whereas item 72 provided a large amount of information at a lower range of an individual's θ even though its precision fell dramatically for a higher level of an individual's θ, as shown earlier in Table 1. Furthermore, some items needed close attention (items 4, 5, 6, 7, 25, 34, 37, 41, 52, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, and 72) because these twenty-one items functioned like dichotomously scored items with only two categories. For instance, item 68 needed only the lowest and highest categories, as shown in Figure 2.
In addition, the item information functions of the 72 items were aggregated to form the test information function, which shows how precisely SLIM measures an individual's θ. The test information function for SLIM did not have a distinct peak but was relatively flat between the -2 and 2 levels of the θ continuum. It provided the highest levels of measurement precision in the middle range of θ, whereas precision declined at the high and low ends of θ. SLIM measured individuals at different levels of serious leisure traits, from low to moderately high, with almost equal precision. However, the standard error of measurement of SLIM increased sharply for individuals with high serious leisure traits.
The aim of this study is to assess and evaluate the precision and accuracy of SLIM in order to determine how useful SLIM is for research, self-evaluation, and self-improvement. Item response theory (IRT), especially Samejima's (1969) GRM, is used to achieve this end.
SLIM can provide strong measurement precision across most of the range of serious leisure latent traits. Item information functions for each of the 72 items revealed that SLIM performed well for individuals with low to moderate levels of serious leisure attitude. The test information function for the 72 items on SLIM was relatively high and flat.
However, measurement precision tended to decline somewhat at the extreme ends of the latent trait, and SLIM became increasingly unreliable in assessing individuals with high levels of serious leisure proficiency. This means that SLIM is not an appropriate instrument for selecting individuals with high serious leisure proficiency. Instead, SLIM is more applicable for identifying individuals lying between the low and moderately high serious leisure attitude levels on the continuum.
Furthermore, this study reveals that more accurate responses could be obtained while taking less time to collect data because SLIM could be shortened without a noticeable drop in its measurement precision and marginal reliability.
Items with low discrimination power, especially those with low item information, contribute little to the test information function and could be safely removed from SLIM. Hence, researchers can save the time and cost of data collection by using reliable items and can obtain better estimates of respondents' attitudes, with less measurement error, from their responses to items on SLIM. Furthermore, the findings from this study suggest that SLIM might need more difficult, highly discriminating items for individuals with a highly serious leisure attitude.
Despite the advantages offered by the IRT over the traditional statistical methods based on the classical test theory, IRT is rarely used in the serious leisure research field due to various difficulties (Hambleton & Jones, 1993; Magno, 2009). First of all, most researchers who have used classical test theory (ANOVA and factor analysis) need to become familiar with and understand the terminology used in educational testing because IRT has been developed within the educational and psychological testing framework. In addition, it is difficult for researchers to understand the mathematical complexity used in IRT and to select appropriate IRT models among various IRT models. The most difficult obstacle to applying IRT in leisure research is IRT computer programs, which are not user-friendly. Despite these limitations, practical applications of IRT in the field of serious leisure cannot be ignored because only an accurate instrument can enhance the understanding of seriousness in leisure.
The current study examined the psychometric properties of items and individuals. The results show that SLIM without the items having low item information retains essentially the same degree of internal consistency as SLIM using all 72 items (marginal reliability of .9824 versus .9828). The most important implication of using IRT is that an optimal number of items can be obtained while increasing reliability and validity. In addition, several interesting extensions are straightforward within the previously mentioned GRM. The more important ones are models with more item parameters (e.g., adding a pseudo-guessing parameter) and models with covariates.
Although these findings support the usefulness of SLIM for measuring seriousness of leisure, it is necessary to investigate measurement invariance across cultures because the concept of serious leisure might differ across cultures.
[Figure 1] Item information function (item 72)
[Table 1] Item parameters for the deleted 12 items
[Figure 3] Total test information