Topic Classification for Suicidology


    Computational techniques for topic classification can support qualitative research by automatically applying labels in preparation for qualitative analyses. This paper presents an evaluation of supervised learning techniques applied to one such use case, namely, that of labeling emotions, instructions and information in suicide notes. We train a collection of one-versus-all binary support vector machine classifiers, using cost-sensitive learning to deal with class imbalance. The features investigated range from a simple bag-of-words and n-grams over stems, to information drawn from syntactic dependency analysis and WordNet synonym sets. The experimental results are complemented by an analysis of systematic errors in both the output of our system and the gold-standard annotations.


    Affect recognition, Sentiment analysis, Skewed class distribution, Text classification


    Suicide is a major cause of death worldwide, with an annual global mortality rate of 16 per 100,000, and rates have increased by 60% over the last 45 years [1]. Researchers have recently called for more qualitative research in the fields of suicidology and suicide prevention [2]. Computational methods can expedite such analyses by labeling related texts with relevant topics.

    The work described in this paper was conducted in the context of track 2 of the 2011 Medical Natural Language Processing (NLP) Challenge on sentiment analysis in suicide notes [3]. This was a multi-label non-exclusive sentence classification task, where labels were applied to the notes left by people who died from suicide. This paper presents an evaluation of the utility of various types of features for supervised training of support vector machine (SVM) classifiers to assign labels representing topics including several types of emotion and indications of information and instructions. The information sources explored range from bag-of-words features and n-grams over stems, to features based on syntactic dependency analysis and WordNet synonym sets. We also describe how cost-sensitive learning can be used to mitigate the effect of class imbalance.

    We begin the remainder of the paper by providing some background on relevant work in Section II. We describe the data provided by the 2011 Medical NLP Challenge task organizers in Section III. Section IV details our approach, which involves training a collection of binary one-versus-all SVM sentence classifiers. Section V presents the performance of our approach, both under cross-validation of the development data and in final evaluation on held-out data. Section VI analyzes common types of errors, both in the gold-standard and the output produced by our system, while our conclusions and thoughts for future work are outlined in Section VII.


    We are not aware of any previous work on the automatic labeling of suicide notes. However, given the emphasis on emotion labels, the most similar previous work is perhaps the emotion labeling subtask of the SemEval-2007 affective text shared task [4], which involved scoring newswire headlines according to the strength of six so-called basic emotions stipulated by Ekman [5]: ANGER, DISGUST, FEAR, JOY, SADNESS and SURPRISE. There were three participating systems in the SemEval-2007 emotion labeling task. SWAT [6] employed an affective lexicon in which the relevance of a word to an emotion was scored as the average emotion score of every headline in which the word appears. UA [7] also used a lexicon, which was instead compiled by calculating the point-wise mutual information between headline words and an emotion, using counts obtained through information retrieval queries. UPAR7 [8] employed heuristics over dependency graphs in conjunction with lexical resources such as WordNet-Affect [9]. In subsequent work, the task organizers investigated the application of latent semantic analysis (LSA) and a Naive Bayes (NB) classifier that was trained using author-labelled blog posts [10]. As can perhaps be expected given the different approaches of the various systems, each performed best for different emotions. This highlights the need for emotion-labeling systems to draw from a variety of analyses and resources.


    The task organizers provided developmental data consisting of 600 suicide notes, comprising 4,241 (pre-segmented) sentences. Note that a “sentence” here is defined by the data, and can range from a single word or phrase to multiple sentences (in the case of segmentation errors). Each sentence is annotated with 0 to 15 labels (listed with their distribution in Table 1). For held-out evaluation, the organizers provided an additional set of 300 unlabelled notes, comprising 1,883 sentences. The task organizers report an inter-annotator agreement rate of 54.6% over all sentences. Fig. 1 provides excerpts from a note in the training data, with assigned labels.


    Our approach to the task of labeling suicide notes involves learning a collection of binary one-versus-all classifiers. One-versus-all classifiers are a common solution for multi-class problems [11], where the problem is reduced to multiple independent binary classifiers. In a typical one-versus-all setup, an item is assigned the label with the highest score among the classifiers. However, as items in this task can have multiple labels, we simply assign labels according to the decision of each binary classifier.
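
    The difference between the two decision rules can be illustrated with a minimal sketch. The function names, the example scores and the decision threshold of zero are assumptions made for illustration; they are not taken from the system described here.

```python
from typing import Dict, List


def assign_labels(scores: Dict[str, float], threshold: float = 0.0) -> List[str]:
    """Multi-label rule: keep every label whose binary classifier fires."""
    return sorted(label for label, score in scores.items() if score > threshold)


def assign_single_label(scores: Dict[str, float]) -> str:
    """Typical one-versus-all rule: keep only the top-scoring label."""
    return max(scores, key=scores.get)


# Hypothetical decision values for one sentence (illustrative only).
scores = {"GUILT": 0.8, "SORROW": 0.3, "INSTRUCTIONS": -1.2}
print(assign_labels(scores))        # ['GUILT', 'SORROW']
print(assign_single_label(scores))  # GUILT
```

    Under the multi-label rule a sentence may also receive no label at all, which matches the annotation scheme of 0 to 15 labels per sentence.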

    The classifiers are based on the framework of SVMs [12]. SVMs have been found to be very effective for text classification and tend to outperform other approaches such as NB [13]. For each label, we train a linear sentence classifier using the SVMlight toolkit [14]. The set of all sentences annotated with the label in question forms the positive examples for that classifier, with all remaining sentences used as negative examples. Section IV-B describes how the problem of imbalanced numbers of positive and negative examples in the data is alleviated by using unsymmetric cost factors during learning. First, however, Section IV-A below describes the feature functions that define the vector representation given to each sentence.
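
    As a rough sketch of how the positive and negative examples for one label could be prepared, the following renders sentences in the sparse text format read by SVMlight (a target followed by feature:value pairs with increasing feature ids). The helper name, the integer feature ids and the toy sentences are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (feature id -> value, gold labels) for one sentence
Sentence = Tuple[Dict[int, float], Set[str]]


def svmlight_lines(sentences: List[Sentence], label: str) -> List[str]:
    """Render one-versus-all training examples for `label`."""
    lines = []
    for features, gold in sentences:
        # Sentences carrying the label are positives, all others negatives.
        target = "+1" if label in gold else "-1"
        # Feature ids are written in increasing order.
        pairs = " ".join(f"{fid}:{val:g}" for fid, val in sorted(features.items()))
        lines.append(f"{target} {pairs}".rstrip())
    return lines


# Toy data (illustrative only).
sentences = [
    ({3: 1.0, 1: 2.0}, {"LOVE"}),
    ({2: 1.0}, {"INSTRUCTIONS", "INFORMATION"}),
    ({1: 1.0}, set()),
]
print(svmlight_lines(sentences, "LOVE"))
# ['+1 1:2 3:1', '-1 2:1', '-1 1:1']
```

    The same sentences are re-rendered once per label, so a sentence annotated with several labels is a positive example for several classifiers.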

      >  A. Features

    We explored a range of different feature types for our emotion classifiers. The most basic features we employ are obtained by reducing inflected and derived words to their stem or base form, e.g., happy, happiness, happily, etc., all activate the stem feature happi. Together, the stem features provide a bag-of-words type representation for a given sentence. The word stems themselves are determined using the implementation of the Porter Stemmer [15] in the Natural Language Toolkit [16].

    Another feature type records bigrams of stems (e.g., happy days activates the bigram feature happi day). We also investigated the use of longer n-grams in preliminary experiments, but found that they were counter-productive.
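
    A minimal sketch of the stem and stem-bigram features follows. The stemmer is passed in as a function so that, for example, the `stem` method of NLTK's `PorterStemmer` could be supplied; the toy stemmer below is only a stand-in covering the example words, and the feature-name prefixes are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List


def stem_features(tokens: List[str], stem: Callable[[str], str]) -> Dict[str, int]:
    """Count stems and bigrams of stems for one sentence."""
    stems = [stem(tok.lower()) for tok in tokens]
    feats = Counter(f"stem={s}" for s in stems)
    feats.update(f"bigram={a}_{b}" for a, b in zip(stems, stems[1:]))
    return dict(feats)


# Toy stand-in for a Porter stemmer, covering only the example words.
TOY_STEMS = {"happy": "happi", "days": "day"}


def toy_stem(word: str) -> str:
    return TOY_STEMS.get(word, word)


print(stem_features(["Happy", "days"], toy_stem))
# {'stem=happi': 1, 'stem=day': 1, 'bigram=happi_day': 1}
```

    The values are counts rather than booleans, so a stem that occurs twice in a sentence contributes a value of 2.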

    Lexicalized Part-of-Speech features are formed of word stems concatenated with their part-of-speech (PoS). PoS tags are assigned using TreeTagger [17], which is based on the Penn Treebank tagset.

    Features based on syntactic dependency analysis provide us with a method for abstracting over syntactic patterns in the data set. The data is parsed with Maltparser, a language-independent system for data-driven dependency parsing [18]. We train the parser on a PoS-tagged version of the Wall Street Journal sections 2-21 of the Penn Treebank, using the parser and learner settings optimized for the Maltparser in the CoNLL-2007 Shared Task. The data was converted to dependencies using the Pennconverter software [19].

    The parser was chosen partly due to its robustness to noise in the input data; it will not break down when confronted with incomplete sentences or misspelled words, but will always provide some output. While the amount of noise in the data will clearly affect the quality of the parses, we found that, in the context of this task, having at least some output is preferable to no output at all.

    Consider the dependency representation provided for the example sentence in Fig. 2. The features we extract from the parsed data aim to generalize over the main predication of the sentence, and hence center on the root of the dependency graph (usually the finite verb) and its dependents. In the given example, the root is an auxiliary, and we traverse the chain of verbal dependents to locate the lexical main verb, “leave,” which we assume is more indicative of the meaning of the sentence than the auxiliary, “will.” The extracted feature types are as follows, with example instantiations based on the representation in Fig. 2:

    Sentence dependency patterns: lexical features (wordform, lemma, PoS) of the root of the dependency graph, e.g., (leave, leave, VV), and patterns of dependents from the (derived) root, expressed by their dependency label, e.g., (VC-OBJ-OPRD), part-of-speech (VV-NN-VVD) or lemma (leave-door-unlock)

    Dependency triples: labelled relations between each head and dependent: will-SBJ-I, will-VC-leave, leave-OPRD-unlocked, etc.
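
    The two extraction steps can be sketched as follows, assuming a parse is available as a list of tokens with head indices and dependency labels. The `Token` type and the example sentence are hypothetical; the sentence is only a guess that is consistent with the triples listed above, not a reproduction of Fig. 2.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str


def dependency_triples(sent: List[Token]) -> List[str]:
    """head-LABEL-dependent for every non-root token."""
    return [f"{sent[t.head - 1].form}-{t.deprel}-{t.form}" for t in sent if t.head != 0]


def main_verb(sent: List[Token]) -> Token:
    """Start at the root and follow VC (verb chain) dependents downwards."""
    idx = next(i for i, t in enumerate(sent, 1) if t.head == 0)
    while True:
        chain = [i for i, t in enumerate(sent, 1) if t.head == idx and t.deprel == "VC"]
        if not chain:
            return sent[idx - 1]
        idx = chain[0]


# Hypothetical parse of "I will leave the door unlocked".
sent = [
    Token("I", 2, "SBJ"), Token("will", 0, "ROOT"), Token("leave", 2, "VC"),
    Token("the", 5, "NMOD"), Token("door", 3, "OBJ"), Token("unlocked", 3, "OPRD"),
]
print(dependency_triples(sent))
print(main_verb(sent).form)  # leave
```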

    We also include a class-based feature type recording the semantic relationships defined by WordNet synonym sets (synsets) [20]. These features are generated by mapping words and their PoS to the first synset identifier (WordNet synsets are sorted by frequency). For example, the adjectives distraught and overwrought both map to the synset id 00086555.

    WordNet-Affect [9] is an extension of WordNet with affective knowledge pertaining to information such as emotions, cognitive states, etc. We utilize this information by activating features representing emotion classes when member words are observed in sentences. For example, instances of the words wrath or irritation both activate the WordNet-Affect feature anger.
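
    The lookup can be sketched with a toy lexicon. The dictionary below contains only the two example words and is an assumption for illustration; the actual WordNet-Affect resource has its own format and a much larger vocabulary.

```python
from collections import Counter
from typing import Dict, List

# Toy word -> emotion class mapping (illustrative only).
TOY_AFFECT_LEXICON = {"wrath": "anger", "irritation": "anger"}


def affect_features(tokens: List[str], lexicon: Dict[str, str]) -> Dict[str, int]:
    """Activate one feature per emotion class whose member words are observed."""
    hits = (lexicon[t.lower()] for t in tokens if t.lower() in lexicon)
    return dict(Counter(f"affect={emotion}" for emotion in hits))


print(affect_features(["Wrath", "and", "irritation"], TOY_AFFECT_LEXICON))
# {'affect=anger': 2}
```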

    In preliminary experiments, we investigated the difference in performance when representing feature frequency versus presence, as previous experiments in sentiment classification [21] indicated that unigram presence (i.e., a boolean value of 0 or 1) is more informative than unigram frequency. For the suicide note analysis, however, we found that features encoding frequency rather than presence always performed better in our end-to-end experiments.

    The final type of feature that we will describe represents the degree to which each stem in a sentence is associated with each label, as estimated from the training data. While there is a range of standardly used lexical association measures that could potentially be used for this purpose (such as point-wise mutual information, the Dice coefficient, etc.), the particular measure we will be using here is the log odds ratio (log θ). After first computing the relevant co-occurrence probabilities for a given word w and a label l in the training data, the odds ratio is calculated as:
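
    Assuming the standard definition of the odds ratio, which is consistent with the properties described next, the measure can be written in terms of the conditional or, equivalently, the joint probabilities of the word and the label (with ¬ denoting absence):

```latex
\theta(w, l)
  = \frac{p(l \mid w) \,/\, \bigl(1 - p(l \mid w)\bigr)}
         {p(l \mid \neg w) \,/\, \bigl(1 - p(l \mid \neg w)\bigr)}
  = \frac{p(w, l)\; p(\neg w, \neg l)}
         {p(w, \neg l)\; p(\neg w, l)}
```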


    If the probability of having the label l increases when w is present, then θ(w, l) > 1. If θ(w, l) = 1 then w makes no difference in the probability of l, which means that the label and the word are distributionally independent. By taking the natural logarithm of the odds ratio, log θ, the score is made symmetric, with 0 being the neutral value that indicates independence. In order to incorporate this information in the classifier, we add features recording the association of all words in a given sentence with each label, in addition to boolean features indicating which label had the maximum association score.
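
    A minimal sketch of these association features is given below, assuming the standard odds ratio computed from a 2x2 table of sentence counts. The additive smoothing constant and the choice to sum the per-stem scores within a sentence are assumptions made for illustration, as are all names.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple

Example = Tuple[Set[str], Set[str]]  # (stems in sentence, gold labels)


def log_odds_ratio(data: List[Example], word: str, label: str,
                   smoothing: float = 0.5) -> float:
    # 2x2 contingency table over sentences; `smoothing` (an assumption)
    # keeps every cell non-zero so that the ratio is always defined.
    n = {(w, l): smoothing for w in (True, False) for l in (True, False)}
    for stems, labels in data:
        n[(word in stems, label in labels)] += 1
    # The normalising constant cancels, so counts can replace probabilities.
    theta = (n[(True, True)] * n[(False, False)]) / (n[(True, False)] * n[(False, True)])
    return math.log(theta)


def association_features(stems: Iterable[str], data: List[Example],
                         labels: List[str]) -> Dict[str, float]:
    stems = list(stems)
    # Summing over the stems of the sentence is an assumed aggregation.
    totals = {l: sum(log_odds_ratio(data, s, l) for s in stems) for l in labels}
    feats = {f"assoc={l}": v for l, v in totals.items()}
    feats[f"max_assoc={max(totals, key=totals.get)}"] = 1.0
    return feats


# Toy training data (illustrative only).
data = [({"sorri"}, {"GUILT"}), ({"sorri"}, {"GUILT"}),
        ({"door"}, set()), ({"door"}, {"INSTRUCTIONS"})]
print(log_odds_ratio(data, "sorri", "GUILT") > 0)  # True
```

    On this toy data the score of "door" for GUILT is the exact negative of the score of "sorri", which illustrates the symmetry around 0 obtained by taking the logarithm.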

      >  B. Cost-Sensitive Learning

    From the frequencies listed in Table 1, it is clear that the label distributions are rather different. Moreover, for each individual classifier, it is also clear that the class balance will be very skewed, with the negative examples (often vastly) outnumbering the positives. At the same time, it is the retrieval of the positive minority class that is our primary interest. A well-known approach for improving classifier performance in the face of such skewed class distributions is to incorporate the notion of cost-sensitive learning. While this is sometimes done by the use of so-called down-sampling or up-sampling techniques [22], the SVMlight toolkit comes with built-in support for estimating cost-sensitive models directly. Working within the context of intensive care patient monitoring, but facing a similar setting of very unbalanced numbers of positive and negative examples, Morik et al. [23] introduced a notion of unsymmetric cost factors in SVM learning. This means associating different cost penalties with false positives and false negatives. Using the SVMlight toolkit, it is possible to train such cost models by supplying a parameter (j) that specifies the degree to which training errors on positive examples outweigh errors on negative examples (the default being j = 1, i.e., equal cost). In practice, the unsymmetric cost factor essentially governs the balance between precision and recall. The next section includes results of tuning the SVM cost-balance parameter separately for each emotion label in the suicide data and relative to different feature configurations.
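
    The effect of the cost factor can be illustrated by the training loss alone. The sketch below computes a hinge loss in which the slack of positive examples is multiplied by j; it is an illustration of the idea, not the SVMlight optimizer, and the scores are hypothetical.

```python
from typing import Sequence


def weighted_hinge_loss(scores: Sequence[float], targets: Sequence[int],
                        j: float = 1.0) -> float:
    """Sum of hinge losses, with errors on positives weighted by j."""
    total = 0.0
    for score, y in zip(scores, targets):
        slack = max(0.0, 1.0 - y * score)
        total += (j if y > 0 else 1.0) * slack
    return total


# One misclassified positive and one misclassified negative (illustrative).
scores, targets = [-0.5, 0.2], [+1, -1]
print(weighted_hinge_loss(scores, targets, j=1.0))
print(weighted_hinge_loss(scores, targets, j=6.0))
```

    With j = 6 the missed positive dominates the loss, so a learner minimizing it is pushed towards higher recall on the minority class. In SVMlight itself the factor is passed on the command line, along the lines of `svm_learn -j 6 train.dat model`, where the file names are placeholders.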


    As specified by the shared task organizers, overall system performance is evaluated using micro-averaged F1. In addition, we also compute precision, recall and F1 for each label individually. We report two rounds of evaluation. The first was conducted solely on the development data using ten-fold cross-validation (partitioning on the note-level). The second corresponds to the system submission for the shared task, i.e., training classifiers on the full development data and predicting labels for the notes in the held-out set.
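
    For a multi-label task, micro-averaged scores pool the true positives, false positives and false negatives over all sentences and labels before computing precision, recall and F1. A minimal sketch, with hypothetical names and toy data:

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-sentence label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy gold and predicted labels for three sentences (illustrative only).
gold = [{"LOVE"}, {"GUILT", "SORROW"}, set()]
pred = [{"LOVE"}, {"GUILT"}, {"INFORMATION"}]
print(micro_prf(gold, pred))
```

    Here two predictions are correct, one is spurious and one gold label is missed, so precision, recall and F1 all equal 2/3.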

      >  A. Developmental Results

    Table 2 lists the performance of each feature type in isolation (using the same feature configuration for each binary classifier and the default symmetric cost-balance). We also include the score for a simple baseline method that naively assigns the majority label (INSTRUCTIONS) to all sentences. We note that stems are the most informative feature type in isolation and perform best overall (F1 = 39.43). Dependency Triples are most effective in terms of precision, and all feature types have less recall than the majority baseline.

    In further experiments that examined the effect of using several feature types in combination, we found that combining stems, bigrams, parts-of-speech and dependency analyses achieved the best performance overall (F1 = 41.82). However, these experiments also made it clear that different combinations of features were effective for different labels. Moreover, as our one-versus-all set-up means training distinct classifiers for each label, we are not limited to using one set of features for all labels. We therefore experimented with a grid search across different permutations of feature configurations, as further described below.

    We also tuned the cost-balance parameter described in Section IV above. The reason for introducing the cost-balance parameter in our setup is to alleviate the imbalance between positive and negative examples. For some labels, this imbalance is so extreme that our initial system was unable to make any positive predictions at all, neither true nor false. An example of such a label is FORGIVENESS, which has only six annotated examples among the 4,241 sentences in the training data. Naturally, any supervised learning strategy will have problems making reliable generalizations on the basis of so little evidence. However, even for the more frequently occurring labels, the ratio of positive to negative examples is still quite skewed.

    As we found that the optimal feature configuration was dependent on the value of the cost-balance parameter (and vice-versa), these parameters were optimized in parallel for each classifier. The results of this search are listed in Table 3, with the best feature combinations and cost-balance for each label. We note that the optimal configuration of features varies from label to label, but that stems and synonym sets are often in the optimal setup, while dependency triples and features from WordNet-Affect do not occur in any configuration.
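
    A joint search of this kind can be sketched as an exhaustive loop over feature subsets and cost values for a single label. The `evaluate` callback is hypothetical and stands in for a cross-validated, label-local F1; enumerating all non-empty subsets is also an assumption, as the exact search space is not specified here.

```python
from itertools import combinations, product
from typing import Callable, Optional, Sequence, Tuple

Config = Tuple[float, Tuple[str, ...], Optional[float]]


def tune_label(feature_types: Sequence[str], cost_values: Sequence[float],
               evaluate: Callable[[Tuple[str, ...], float], float]) -> Config:
    """Return (best F1, feature subset, j) for one binary classifier."""
    subsets = [c for r in range(1, len(feature_types) + 1)
               for c in combinations(feature_types, r)]
    best: Config = (-1.0, (), None)
    # Features and cost factor are searched jointly, not one after the other.
    for feats, j in product(subsets, cost_values):
        f1 = evaluate(feats, j)
        if f1 > best[0]:
            best = (f1, feats, j)
    return best


def toy_evaluate(feats: Tuple[str, ...], j: float) -> float:
    # Made-up scoring function used only to exercise the search.
    return (0.5 + 0.1 * ("stems" in feats) + 0.05 * ("synsets" in feats)
            - 0.03 * ("bigrams" in feats) - 0.02 * abs(j - 4))


print(tune_label(["stems", "bigrams", "synsets"], [1, 2, 4, 8], toy_evaluate))
```

    Running the search once per label yields a separate feature set and cost factor for each classifier, which the one-versus-all setup permits.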

    As discussed above, the unsymmetric cost factor essentially governs the balance between precision and recall. For many classes, increasing the cost of errors on positive examples during training allowed us to achieve a pronounced increase in recall, though often at a corresponding loss in precision. Although this could often lead to greatly increased F1 at the level of individual labels, the overall micro F1 was compromised due to the low precision of the classifiers for infrequent labels in particular. Therefore, our final system only attempts to classify the six labels that can be predicted with the most reliability - GUILT, HOPELESSNESS, INFORMATION, INSTRUCTIONS, LOVE and THANKFULNESS - and makes no attempt on the remaining labels.

    Testing by ten-fold cross-validation of the development data, this has the effect of an increased overall system performance in terms of the micro-average scores (compare micro-average (total) and micro-average in Table 3). A further point of comparison is the optimal result of using identical setups for all classifiers, where F1 = 54.68, precision = 60.57, and recall = 50.17 (using bag-of-stems, bigrams over stems, parts-of-speech, and sentence dependency patterns as features, and a cost-balance of 6). It should be noted that this rather radical design choice to only attempt classification of six labels is at least partially informed by the fact that micro-averaging (rather than macro-averaging) is used for the shared task evaluation. While micro-averaging is prone to emphasize larger classes, macro-averaging emphasizes smaller classes.
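
    The interaction between the averaging scheme and the decision to drop classifiers can be made concrete with a small worked example. The per-label counts below are invented for illustration and do not correspond to any result in this paper.

```python
from typing import Dict, Tuple

Counts = Dict[str, Tuple[int, int, int]]  # label -> (tp, fp, fn)


def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def micro_f1(counts: Counts) -> float:
    # Pool the counts over labels, then compute a single F1.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return f1(tp, fp, fn)


def macro_f1(counts: Counts) -> float:
    # Compute F1 per label, then take the unweighted mean.
    return sum(f1(*c) for c in counts.values()) / len(counts)


# A frequent label and a rare label predicted with high recall, low precision.
with_rare = {"INSTRUCTIONS": (60, 30, 40), "FORGIVENESS": (2, 30, 4)}
# Switching the rare classifier off turns its 6 gold instances into misses.
without_rare = {"INSTRUCTIONS": (60, 30, 40), "FORGIVENESS": (0, 0, 6)}

print(round(micro_f1(with_rare), 4), round(micro_f1(without_rare), 4))
print(round(macro_f1(with_rare), 4), round(macro_f1(without_rare), 4))
```

    In this example, discarding the rare-label classifier raises the micro-averaged F1 because its many false positives disappear, while the macro-averaged F1 falls because the rare label now scores zero.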

      >  B. Held-out Results

    Table 4 describes the performance on the held-out evaluation data set when training classifiers on the entire development data set, with details on each label attempted by our setup. As described above, we only apply classifiers for six of the labels in the data set (due to the low precision observed in the development results for the remaining nine labels). We find that the held-out results are quite consistent with those predicted by cross-validation on the development data. The final micro-averaged F1 is 54.36, a drop of only 1.45 compared to the development result (see Section VII for a comparison of our system and its results to those of other participants in the shared task).


    This section offers some analysis and reflections with respect to the prediction errors made by our classifiers. Given the multi-class nature of the task, much of the discussion will center on cases where the system confuses two or more labels. Note that all example sentences given in this section are taken from the shared task evaluation data and are reproduced verbatim.

    In order to uncover instances of systematic errors, we compiled contingency tables showing discrepancies between the decisions of the classifiers and the labels in the gold standard. Firstly, we note that BLAME and FORGIVENESS are often confused by our approach, and are closely related semantically. We consider these classes to be polar in nature; while both imply misconduct by some party, they elicit opposite reactions from the offended entity. Their similarity means that their instances often share features and are thus confused by our system.
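
    One simple way to tabulate such discrepancies is to count, for every sentence, each pairing of a gold label that was missed with a label that was wrongly predicted. This is a sketch of one possible construction, with hypothetical names and toy data; it is not necessarily how the tables used here were compiled.

```python
from collections import Counter
from typing import List, Set


def label_confusions(gold: List[Set[str]], pred: List[Set[str]]) -> Counter:
    """Count (missed gold label, spurious predicted label) pairs."""
    table: Counter = Counter()
    for g, p in zip(gold, pred):
        for missed in g - p:
            for spurious in p - g:
                table[(missed, spurious)] += 1
    return table


# Toy gold and predicted labels for three sentences (illustrative only).
gold = [{"BLAME"}, {"FORGIVENESS"}, {"GUILT"}]
pred = [{"FORGIVENESS"}, {"BLAME"}, {"GUILT", "SORROW"}]
for pair, count in label_confusions(gold, pred).most_common():
    print(pair, count)
```

    A sentence whose gold labels are all recovered, such as the third one above, contributes nothing to the table even if it also received a spurious label; counting those cases would require a separate tally.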

    We also note that the classes of GUILT and SORROW are hard to discern, not only for our system but also for the human annotators. For instance, Example 1 is annotated as SORROW, while Example 2 is annotated as GUILT. This makes features such as the stem of sorry prominent for both classes, hence our system often labels instances of either GUILT or SORROW with both labels. We also note some instances that are unlabelled but where the context is typically indicative of GUILT/SORROW, such as in Example 3. Furthermore, sorry appears to be a particularly ambiguous word; conceivably, it might also be associated with BLAME (e.g., you will be sorry).

    Example 1. Am sorry but I can’t stand it ...

    Example 2. I am truly sorry to leave ...

    Example 3. ... sorry for all the trouble .

    It is worth noting here that some of the apparent inconsistencies observed in the gold annotations are likely due to the way the annotation process was conducted. While three annotators separately assigned sentence-level labels, the final gold standard was created on the basis of majority vote between the annotators. This means that, unless two or more annotators agree on a label for a given sentence, the sentence is left unlabelled (with respect to the label in question).
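
    The consequence of this voting scheme can be seen in a short sketch. The function name and the example annotations are hypothetical; only the rule itself (a label needs at least two of three votes) is taken from the description above.

```python
from collections import Counter
from typing import List, Set


def majority_labels(annotations: List[Set[str]], min_votes: int = 2) -> Set[str]:
    """Keep the labels assigned by at least `min_votes` annotators."""
    votes = Counter(label for ann in annotations for label in ann)
    return {label for label, n in votes.items() if n >= min_votes}


# Two annotators agree on each label: both survive.
print(sorted(majority_labels([{"GUILT"}, {"SORROW"}, {"GUILT", "SORROW"}])))
# ['GUILT', 'SORROW']

# Every annotator sees an emotion, but no two pick the same one.
print(majority_labels([{"GUILT"}, {"SORROW"}, set()]))
# set()
```

    In the second case the sentence ends up unlabelled in the gold standard even though two of the three annotators considered it emotional, which is one way the apparent inconsistencies can arise.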

    Some of the labels in the data tend to co-occur. For instance, Example 1 above is actually annotated with both SORROW and HOPELESSNESS. However, these intuitively apply to two different sub-sentential units: Am sorry (SORROW) and I can’t stand it (HOPELESSNESS). A problem faced by any supervised learning approach here is the fact that the annotations are given at the sentence level, with no distinction between different sentence constituents or subsequences, and so the presence of a token like sorry can be deemed a positive feature for both SORROW and HOPELESSNESS by the learner. One possible avenue for improving results would therefore be to apply further annotation describing sub-sentential labels and the constituents to which they apply.

    Note that the problem discussed above is also compounded due to errors in the sentence segmentation. For instance, Example 4 is provided as a single sentence in the training data, with the labels THANKFULNESS and HOPELESSNESS. However, as the labels actually apply to different sentences, this will introduce additional noise in the learning process.

    Example 4. You have been good to me. I just cannot take it anymore.

    Some of the errors made by the learner seem to indicate that having features that are sensitive to a larger context might also be useful, such as taking the preceding sentences and/or previous predictions into account. Consider the following examples from the same note, where both sentences are annotated as INSTRUCTIONS:

    Example 5. In case of accident notify Jane.

    Example 6. J. Johnson 3333 Burnet Avenue.

    While Example 6 is simply an address, it is annotated as INSTRUCTIONS. Of course, predicting the correct label for this sentence in isolation from the preceding context will be near impossible. Other cases would seem to require information that is very different from that captured by our current features, such as pragmatic knowledge, before we could hope to get them right. For example, in several cases, the system will label something as INFORMATION when the correct label is INSTRUCTIONS. This is often because a sentence has communicated information which pragmatically implied an instruction. For example, we presume that Example 7 is annotated as INSTRUCTIONS because it is taken to imply an instruction to collect the clothes.

    Example 7. Some of my clothes are at 3333 Burnet Ave. Cincinnati - just off of Olympic .


    This paper has provided experimental results for a variety of feature types for use when learning to identify various fine-grained emotions, as well as information and instructions, in suicide notes. These feature types range from simple bags-of-words to syntactic dependency analyses and information from manually-compiled lexical-semantic resources. We explored these features using an array of binary SVM classifiers.

    A challenging property of this task is the fact that the classifiers are subject to extreme imbalances between positive and negative examples in the training data; the infrequency of positive examples can make the learning task intractable for supervised approaches. In this paper, we have shown how a cost-sensitive learning approach that separately optimizes the cost-balance parameter for each of the topic labels can be successfully applied to problems with such skewed distributions of training examples. For the less-frequent labels, however, the optimal F1 tended to arise from gains in recall at the great expense of precision. Thus, we found that discarding poorly-performing classifiers resulted in improvements overall. While arguably an ad hoc solution, this is motivated by the shared task evaluation scheme of maximizing micro-averaged F1.

    Of the twenty-five submissions to the shared task, our system was placed fifth (with a micro-averaged F1 of 54.36); the highest performer achieved an F1 of 61.39, while the lowest scored 29.67. The mean result was 48.75 (σ = 7.42), and the median was 50.27. The other top-ranked approaches differ from the system we describe primarily by: combining machine learning with heuristics [24, 25] and keyword-spotting for infrequent labels [25]; manually labeling training data from additional sources [24]; manually re-annotating training data to remove inconsistencies [26]; and extracting other features (such as character n-grams [27] and spanning n-grams that skip tokens [24]).

    An analysis of the errors made by our system has suggested possible instances of inter-annotator confusion, and has provided some indications for directions for future work. These include re-annotating data at the sub-sentential level, and drawing in the context and predictions of the rest of the note when labeling sentences. We also note that text in this domain tends to contain many typographical errors, and thus models might benefit from features generated using automatic spelling correction.

    In other future work, we will conduct a search of the parameter space to find optimal parameters for each label with respect to the overall F1 (rather than the label-local F1 we used in the current work). Finally, we will look to boost performance for labels with few examples by drawing information from large amounts of unlabelled text. For instance, inferring the semantic similarity of words from their distributional similarity has been effective for other emotion-labeling tasks [28].

  • 1. Suicide prevention (SUPRE) [Internet].
  • 2. Hjelmeland H., Knizek B. L. 2010 “Why we need qualitative research in suicidology” [Suicide and Life-Threatening Behavior] Vol.40 P.74-80
  • 3. Pestian J. P., Matykiewicz P., Linn-Gust M., South B., Uzuner O., Wiebe J., Cohen K. B., Hurdle J., Brew C. 2012 “Sentiment analysis of suicide notes: a shared task” [Biomedical Informatics Insights] Vol.5 P.3-16
  • 4. Strapparava C., Mihalcea R. 2007 “SemEval-2007 task 14: affective text” [Proceedings of the 4th International Workshop on Semantic Evaluations] P.70-74
  • 5. Ekman P. 1977 “Biological and cultural contributions to body and facial movement,” The Anthropology of the Body P.39-84
  • 6. Katz P., Singleton M., Wicentowski R. 2007 “SWAT-MP: the SemEval-2007 systems for task 5 and task 14” [Proceedings of the 4th International Workshop on Semantic Evaluations] P.308-313
  • 7. Kozareva Z., Navarro B., Vazquez S., Montoyo A. 2007 “UA-ZBSA: a headline emotion classification through web information” [Proceedings of the 4th International Workshop on Semantic Evaluations] P.334-337
  • 8. Chaumartin F. R. 2007 “UPAR7: a knowledge-based system for headline sentiment tagging” [Proceedings of the 4th International Workshop on Semantic Evaluations] P.422-425
  • 9. Strapparava C., Valitutti A. 2004 “WordNet-affect: an affective extension of WordNet” [Proceedings of the 4th International Conference on Language Resources and Evaluation] P.1083-1086
  • 10. Strapparava C., Mihalcea R. 2010 “Annotating and identifying emotions in text,” Intelligent Information Access. Studies in Computational Intelligence vol. 301 P.21-38
  • 11. Duan K. B., Keerthi S. S. 2005 “Which is the best multiclass SVM method? An empirical study” [Proceedings of the 6th International Workshop on Multiple Classifier Systems] P.278-285
  • 12. Vapnik V. N. 1995 The Nature of Statistical Learning Theory
  • 13. Joachims T. 1998 “Text categorization with support vector machines: learning with many relevant features” [Proceedings of the 10th European Conference on Machine Learning] P.137-142
  • 14. Joachims T. 1999 “Making large-scale support vector machine learning practical,” Advances in Kernel Methods: Support Vector Learning P.169-184
  • 15. Porter M. F. 1980 “An algorithm for suffix stripping” [Program: Electronic Library and Information Systems] Vol.14 P.130-137
  • 16. Bird S., Loper E. 2004 “NLTK: the natural language toolkit” [Proceedings of the ACL on Interactive Poster and Demonstration Sessions]
  • 17. Schmid H. 1994 “Probabilistic part-of-speech tagging using decision trees” [Proceedings of the International Conference on New Methods in Language Processing] P.44-49
  • 18. Nivre J., Hall J., Nilsson J., Eryigit G., Marinov S. 2006 “Labeled pseudo-projective dependency parsing with support vector machines” [Proceedings of the 10th Conference on Computational Natural Language Learning] P.221-225
  • 19. Johansson R., Nugues P. 2007 “Extended constituent-to-dependency conversion for English” [Proceedings of the 16th Nordic Conference of Computational Linguistics] P.105-112
  • 20. Fellbaum C. 1998 WordNet: An Electronic Lexical Database
  • 21. Pang B., Lee L., Vaithyanathan S. 2002 “Thumbs up? sentiment classification using machine learning techniques” [Proceedings of the Conference on Empirical Methods in Natural Language Processing] P.79-86
  • 22. McCarthy K., Zabar B., Weiss G. 2005 “Does cost-sensitive learning beat sampling for classifying rare classes?” [Proceedings of the 1st International Workshop on Utility-Based Data Mining] P.69-77
  • 23. Morik K., Brockhausen P., Joachims T. 1999 “Combining statistical learning with a knowledge-based approach: a case study in intensive care monitoring” [Proceedings of the 16th International Conference on Machine Learning] P.268-277
  • 24. Xu Y., Wang Y., Liu J., Tu Z., Sun J. T., Tsujii J., Chang E. 2012 “Suicide note sentiment classification: a supervised approach augmented by web data” [Biomedical Informatics Insights] Vol.5 P.31-41
  • 25. Yang H., Willis A., de Roeck A., Nuseibeh B. 2012 “A hybrid model for automatic emotion recognition in suicide notes” [Biomedical Informatics Insights] Vol.5 P.17-30
  • 26. Sohn S., Torii M., Li D., Wagholikar K., Wu S., Liu H. 2012 “A hybrid approach to sentiment sentence classification in suicide notes” [Biomedical Informatics Insights] Vol.5 P.43-50
  • 27. Cherry C., Mohammad S. M., de Bruijn B. 2012 “Binary classifiers and latent sequence models for emotion detection in suicide notes” [Biomedical Informatics Insights] Vol.5 P.147-154
  • 28. Read J., Carroll J. 2009 “Weakly supervised techniques for domain-independent sentiment classification” [Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion] P.45-52
  • [Fig. 1.] Example sentences from a suicide note in the shared task training data.
  • [Table 1.] The distribution of labels in the training data
  • [Fig. 2.] Example dependency representation.
  • [Table 2.] Developmental results of various feature types. The baseline corresponds to labeling all items as instructions, the majority class
  • [Table 3.] Labels in the suicide notes task with feature sets and cost-balance (j) optimized with respect to the local label F1
  • [Table 4.] Performance of our optimized classifiers trained using the development data and tested on the held-out evaluation data