CRFBased Figure/Ground Segmentation with PixelLevel Sparse Coding and Neighborhood Interactions
 Author: Zhang Lihe, Piao Yongri
 Publish: Journal of information and communication convergence engineering Volume 13, Issue3, p205~214, 30 Sep 2015

ABSTRACT
In this paper, we propose a new approach to learning a discriminative model for figure/ground segmentation by incorporating the bagoffeatures and conditional random field (CRF) techniques. We advocate the use of image patches instead of superpixels as the basic processing unit. The latter has a homogeneous appearance and adheres to object boundaries, while an image patch often contains more discriminative information (e.g., local image structure) to distinguish its categories. We use pixellevel sparse coding to represent an image patch. With the proposed feature representation, the unary classifier achieves a considerable binary segmentation performance. Further, we integrate unary and pairwise potentials into the CRF model to refine the segmentation results. The pairwise potentials include color and texture potentials with neighborhood interactions, and an edge potential. High segmentation accuracy is demonstrated on three benchmark datasets: the Weizmann horse dataset, the VOC2006 cow dataset, and the MSRC multiclass dataset. Extensive experiments show that the proposed approach performs favorably against the stateoftheart approaches.

KEYWORD
Conditional random field , Figure/ground segmentation , Neighborhood interaction , Sparse coding

I. INTRODUCTION
Figure/ground segmentation aims at partitioning an image into regions of coherent properties as a means for separating objects from their backgrounds. Considerable effort has been made to develop many advanced techniques in the recent years, where learning segmentation has attracted considerable attention of researchers because of its significant performance in classification applications. Further, probabilistic graphical models have also been remarkably successful in segmentation applications.
In any image system, feature representation is crucial to enhancing the system performance. How to manifest an image and how to capture salient properties of the object regions are still challenging problems. The bagoffeatures (BoF) model [13] has been widely used in the field of image processing. The model treats an image as a collection of unordered appearance descriptors extracted from local patches, quantizes them into discrete ‘visual words,’ and then computes a compact histogram representation. In this work, we propose a patchlevel BoF model to effectively represent an image patch from raw image data. By pixellevel dictionary learning, sparse coding, and spatial pyramid matching, the feature representation can capture the salient properties of the image patch, thus resulting in high patchwise segmentation accuracy.
Learning segmentation converts the image segmentation problem into a data clustering problem of image elements. One of the core challenges for machine learning is to discover what kind of information can be learned from the data sources and cluster this data into segments depicting the same object. Ren and Malik [4] proposed a classification model for segmentation, which feeds the Gestalt grouping cues into a linear classifier to discriminate between good and bad segmentation. Wu and Nevatia [5] developed a method to simultaneously detect and segment objects by boosting the edgelet feature classifiers. Duygulu et al. [6] modeled object recognition as a process of annotating image regions with words, and learning a mapping between region types and keywords by using an EM algorithm.
Probabilistic graphical models usually construct a cost function on the basis of some image constraints and formulate the image segmentation problem as a stochastic optimization problem. A condition random field (CRF) provides a principled approach to incorporating datadependent interactions; the complex joint probability distribution need not be modeled in this case. In this work, we use the CRF model to fuse multiple visual cues. For the CRF model, the definition of unary and pairwise potentials is very important. Previously, the unary potential was directly defined on feature spaces [7]. Lately, researchers have paid more attention on using a classifier to generate a unary potential [815], and most of them prefer using a pixel or a superpixel as the basic processing unit. In contrast, we use a regular image patch. Image patches on object boundaries contain rich local structure information of an object (see Fig. 1). While superpixels usually have a homogeneous appearance and an almost uniform size along with edge preservation, particularly for weak boundaries, these properties weaken the discriminative capability of the unary classifier when the superpixel is taken as the sampling unit.
Our main contributions in this paper are twofold. First, we use an image patch as a sample of a unary classifier and propose an upgrade patch feature representation based on pixellevel sparse coding, which can capture more structure information of the local contour of objects. Second, we propose the color and texture pairwise potentials with neighborhood interactions and an edge potential representing edge continuity, which are validated to be very effective in our experiments.
II. CONDITIONAL RANDOM FIELDS
CRFs are probabilistic models for segmenting data with structured labels [16], which are defined on a twodimensional discrete lattice, every site on which corresponds to a graph node. Let
G ( ) be an undirected graph with image patches as nodesV,E and the links between pairs of nodes as edgesV . CRFs directly model the distributionE P L I,w,v of node labelsL conditional on image dataI for node parametersw and edge parametersv . We are interested in finding the labels wherel_{i} denotes the label of thei ^{th} node. In this work, we are concerned with binary segmentation (foreground and background), i.e.,l_{i} ∈ {1,1}. The joint distribution over the labelsL given the observationsI can be expressed as follows:where
z represents a normalized constant known as the partition function,N_{i} denotes the set of neighbors of thei ^{th} node in graphG , andA_{i} andI_{ij} indicate the unary and pairwise potentials, respectively. The unary potential is modeled using a local discriminative model that decides the association of a given node to a certain class, ignoring the interaction of its neighbors. In contrast, the pairwise potential is regarded as a datadependent smoothing function that denotes the interaction between two nodes. Both terms explicitly depend on a predefined set of features fromI .III. MAIN WORK
We use a CRF model to learn the conditional distribution over figure/ground labeling given an image, which allows us to incorporate different levels and different types of features in a single unified model.
> A. Unary Potentials
In this work, the unary potentials are defined by the prediction probability obtained from a linear support vector machine (SVM) classifier. Different from the existing feature descriptions, we train a pixellevel overcomplete dictionary to sparsely represent image patches in a highdimensional space.
1) PixelLevel Texture Descriptor
Gabor wavelets have received considerable attention because of biological reasons and their optimal resolution in both frequency and spatial domains. The Gabor wavelet representation can capture the local structure corresponding to the spatial scale, spatial localization, and orientation selectivity. It can characterize the spatial frequency structure in the image, while preserving the information of spatial relations. However, many existing image representation approaches in the Gabor domain merely consider the magnitude information. In this work, we proposed a new pixellevel feature descriptor, which fuses the Gabor magnitude and the Gabor phase.
To eliminate local noise interference, a simple smoothing filter is used for removing image noise in advance. Then, we perform the Gabor transform in
D directions andS scales on a given gray image, and respectively, denote the magnitude response and the phase response in directionθ and scaleσ asρ_{θσ} andα_{θσ} . Further, the 2π phase space is uniformly quantized intoJ intervals asΦ _{j} = [ø _{j,} _{min},ø _{j,} _{min} +ς ] ,j = 1,...J .ς = 2π /J represents the quantization step, andø _{j,} _{min} denotes the margin value between two phase intervalsΦ _{j} _{1} andΦ _{j} . Suppose that phase responseα_{θσ} belongs to thej ^{th} intervalΦ _{j} . Then, we compute Eq. (2) to get aJ dimensional vectory _{θσ} in directionθ and scaleσ as follows:where
An example of diagram
J = 8 is shown in Fig. 2, in which the phase space is quantized into eight intervals and =ς /4. Assuming the phase responseπ α_{θσ} ∈Φ _{2} , we can update the margin value as theJ dimensional vectorWe concatenate all vectors
y _{θσ} of the givenD directions andS scales as the pixellevel descriptor = [y _{1,1},y _{1,2},...,y y _{D,S} ] . Theℓ =D ×S ×J dimensional feature vector not only describes the distribution of the phase response in each scale and direction but also reflects the magnitude response.2) PixelLevel Dictionary Learning and Coding
Sparse representations have demonstrated considerable success in numerous applications, and the sparse modeling of signals has been proven to be very effective in signal reconstruction and classification. We randomly sample some pixels from the training image set to learn an overcomplete pixellevel dictionary. Assuming that we collect N training samples, we define a matrix
Y ∈R ^{ℓ×N} as the columns of samples:where
y _{i} stands for theℓ dimensional texture feature of thei ^{th} sample.Using an overcomplete dictionary
D ∈R ^{ℓ×L} , which containsL atoms as column vectors, we can approximate the observed sample well by using a sparse linear combination of these atoms. In particular, there exists a sparse coefficient vectory such thatx can be approximated asy ≈y D , where the vector_{x} represents the weighted contribution of these atoms when reconstructing the observed sample. Given the training samples, we can learn the dictionaryx D by solving the following optimization problem:where
λ denotes a balance parameter and the second term enforces to have a small number of nonzero elements.x The optimization problem is convex in
D orX while fixing the other, but not in both simultaneously [17]. We solve it by alternating the optimization overD andX ; the dictionaryD can be initialized by randomly samplingL columns fromY or by Kmeans clustering. When fixingD , the optimization becomes a standard sparse coding problem, which can be solved very efficiently by using the featuresign search algorithm. When fixingX , the problem reduces to a least squares problem (as shown in Eq. (5)), which can be solved by using the Lagrange dual algorithm.Once the overcomplete dictionary
D is given, the texture feature of each pixel can be coded as Ldimensional sparse vectory by solving the following l1norm regularization problem:x 3) Patch Feature Representation
The pivotal role of the unary potential in the CRFbased segmentation model has been demonstrated. It can be taken as a local decision term, which decides the association of a given graph node to a certain class. Usually, the use of the unary classifier alone leads to high accuracy as compared to the full CRF model as it can segment most parts of an object and loses only some details of the object boundaries.
In this work, we integrate the texture feature and the color feature to represent an image patch. We partition a patch into 1×2, 2×2 segments in two different scales, and then, compute the max pooling vector of the sparse codes of pixels within each of the five segments. We finally concatenate all the vectors to form a vector representation of the texture feature. The socalled spatial pyramid matching has had remarkable success in image classification applications. Color information is very useful for identifying the classes of image patches. For example, backgrounds (e.g., sky, water, grass, and tree) are usually distinguishable from objects (e.g., cow, sheep, and bird) in color. For a patch, we compute 64bin histograms in each CIE Lab color channel as its color feature and then, concatenate the texture vector and the color vector to form the final feature representation. In our experiments, we fixed the size of dictionary
D as 2048; thus, the dimension of the patch feature is 2048×5+64×3 =10432 . We also find that max pooling outperforms the other alternative pooling methods.4) Unary Potential Computation
We train a binary linear SVM classifier to predict the figure/ground probabilities of an image patch, which are used for computing the unary potential. However, the boundaries of the foreground segmented by the CRF model using patches as graph nodes have a blocking effect. To generate results close to the ground truth, we split patches into some perceptually meaningful entities by using the oversegmentation boundary map, which is generated by an existing region merging method [18]. As shown in Fig. 3, a patch is split into several regions. These regions are considered the graph nodes, and their unary potential values are defined as the corresponding prediction probability of the host patch. That is, the regions split from a patch are assigned the same probability. In particular, the unary potential in Eq. (1) is defined as follows:
where
P (l_{i} lI ) denotes the local class posterior, that is, the prediction probability given by the unary classifier.> B. Pairwise Potentials
After the unary binary classification, we can already obtain good segmentation results, but the classifier separately processes each image patch. The mutual dependence among neighboring patches is ignored, which results in some neighboring patches with a similar appearance being possibly improperly labeled as opposite classes (see Fig. 4). Therefore, as contextual knowledge is necessary for image segmentation, we define the pairwise potential to address this problem. In this work, the pairwise penalty
I_{ij} is defined as the weightw_{ij} of a graph edge, that is,In image segmentation, the weights encode a graph affinity such that a pair of nodes with a high weight edge is considered to be strongly connected and edges with low weights represent the nearly disconnected nodes. We exploit the color, texture, and edge cues to model the connection between nodes and incorporate the three types of potentials in a unified CRF framework using prelearned parameters. Assuming that the superscripts
c, t , ande denote the color, texture, and edge, respectively, we can rewritew_{ij} (I ) as follows:where
g_{ij} (I ) represents a distance function defined over node pairs (i, j ).w_{ij} (I ) synthetically reflects the connectivity of nodesi andj on multiple feature spaces.1) Color Potential and Texture Potential
Color information is an essential and typical representation for images and is a key element for distinguishing objects. Mean and histogram are two common color descriptors. Mean only describes the average color component rather than the color distribution in a region. We use the CIE Lab histogram as a color descriptor for computing the color potential. Similarly, each channel is uniformly quantified into 64 levels, and then, three channels are concatenated to form a 192dimensional color vector. The experimental comparison demonstrates that the histogram descriptor is more effective than the mean descriptor, increasing the overall pixelwise labeling accuracy by 3.8%.
Every region in natural images is not isolated and is strongly connected with its adjacent regions. When computing it is unreasonable to only use the node pair (
i, j ) and ignore the neighboring nodes. Therefore, we consider the neighborhood interactions in the main steps summarized in Algorithm 1.where
D (i, j ) = [_{X2} (h _{i} +h _{mi} ,h _{j} ) + _{X2} (h _{i} ,h _{j} +h _{mi} )] / 2.Similarly, we can compute the texture potential Further,
h _{i} stands for a texture histogram, which is computed as follows:where
K denotes the number of pixels in the region nodei , and represents thex L dimensional sparse texture vector described in Section IIIA. Experiments demonstrate that the step of incorporating neighborhood interactions increases the overall pixelwise labeling accuracy by 2.3%. We also find that the color potential using theLab / _{X2} descriptor outperforms the GMM color model as used in [10].2) Edge Potential
Inside a very small region, edge information basically indicates local image shape priors; the regions belonging to the same object often have strong edge continuities, which are described as the edge potential in this paper. As shown in Fig. 5, we find that there are many edges (blue lines) going through neighboring nodes. If two neighboring nodes are crossed by an edge, they very possibly belong to a visual unit and have the same figure/ground label. Motivated by this observation, we define the edge potential to capture the cue of edge continuity.
Given an image, we compute its binary gradient magnitude
M . LetS be the node index matrix, which indicates that the graph node that a pixel belongs to. Assuming thatc_{n} = (x_{n} ,y_{n} ) denotes the coordinate of pixeln and (i, j ) represents a pair of neighboring nodes, we can denote their common boundaries as a pixel pair set, as follows:Then, the edge potential is computed as follows:
where and
> C. Parameter Estimation
The parameter vector
v in Eq. (7) is automatically learned from the training data. Given a set of training imagesT = {(L ^{(n)},I ^{(n)}),n = 1,...N }, we assume that all the training data are independent and identically distributed. We then use the conditional maximum likelihood (CML) criterion to estimatev . Its log likelihood is computed as follows:where the last term is the logpartition function. In general, the evaluation of the partition function is a NPhard problem. We could use either sampling techniques (e.g., the Markov chain Monte Carlo method [19]) or some approximations (e.g., those of the free energy [20], piecewise training [21], pseudolikelihood [22]) to estimate the parameters.
The optimal parameter maximizes the log conditional likelihood according to the CML estimation as follows:
This can be solved by using the gradient descent method. The derivative of the log likelihood
L (v ) is written as follows:where the second term
E_{P} ( · ) denotes the expectation with respect to the distributionP (l I ^{(n)},v ). That is,In general, the expectation cannot be computed analytically because of the combinatorial number of elements in the configuration space of labels. In this work, we use belief propagation [23] to approximate it.
IV. EXPERIMENT AND DISCUSSION
> A. Image Datasets
We evaluate the proposed approach using three datasets. The MSRC dataset [10] contains 591 images with 21 categories. The performance of the unary classifier on this dataset is measured by using the pixel precision. Furthermore, for comparison with a previous work [24], we select the following 13 classes of 231 images with a 7class foreground (cow, sheep, airplane, car, bird, cat, and dog) and a 6class background (grass, tree, sky, water, road, and building) as the data subset. The ground truth labeling of the dataset contains pixels labeled as ‘void’ (i.e., colorcoded as black), which implies that these pixels do not belong to any of the 21 classes. In our experiments, void pixels are ignored for both the training and the testing of the unary classifier. The dataset is randomly split into roughly 40% training and 60% test sets, while ensuring approximately proportional contributions from each class.
The second dataset is the Weizmann horse dataset [25], which includes the side views of many horses that have different appearances and poses. We have also used the VOC2006 cow database [26] in which ground truth segmentations are manually created. For the two datasets, the numbers of images in the training and test sets are exactly the same as in [27].
> B. Common Parameters
When extracting the pixellevel texture descriptor, we set the parameters of the Gabor filter as scales
σ = {1,1.2} and directionsθ = {0, / 4,π / 2, 3π / 4}. The phase response is uniformly quantified into eight regions. Hence, the size of the pixellevel feature vector is 4×2×8 = 64 . We randomly select roughly 60000 samples from all the training images to learn the dictionaryπ D and ensure approximately proportional contributions from each image. We set the dictionary size as 2048. Thus, a 64dimensional pixellevel vector is sparsely encoded as a 2048dimensional vector; then, by spatial pyramid matching and max pooling, we extract a patch feature from 16×16 pixel patches, which are densely sampled with a step size of 4 pixels.During the training of the unary classifier, a patch possibly contains multiple labels; however, we take the label that accounts for more than 75% of all the pixels in this patch as its label. We find that the number of patches of the training images in each class on average is in the order of 10000, and some classes have more than 100000 patches. Considering the memory and computational constraints, we randomly select 8000 patches from each class to construct a patch dataset for the evaluation of the unary classification, and each class sample is randomly split into 25% training and 75% test sets. For efficiency, we reduce the dimension of a patch feature from 10432 to 4000 by using the incremental principal component analysis (PCA) algorithm [28], in which we feed 20% samples to increment PCA in order to approximate the mean vector and the basis vectors.
> C. Unary Accuracy
To evaluate the performance of the proposed patch representation, we use a simple linear SVM classifier to conduct 21class classification experiments on the MSRC dataset. We select 1200 patches per class as training samples and the rest of the patches as the testing samples. We achieve patchwise labeling accuracy of 71.0%, while the stateoftheart approach [10] gives pixelwise accuracy of 69.6%. For a fair comparison, we further refine the patch precision segmentations to the pixel precision ones by simply postprocessing.
In particular, we first get split patches (i.e., graph nodes) by using an existing segmentation method as described in Section IIIA. The nodes are not always larger than the patches in size. Then, we take the nodes within the same segment as generated by [18] as the content consistent nodes. Finally, the label of each node is decided by a majority vote of the labels of its neighboring nodes, which must also be its contentconsistent nodes. After the above processing, we achieve pixelwise accuracy of 72.1%.
We also evaluate the binary classification performance on the 13class dataset. We select 2200 patches per class as the training samples and label the 7class foreground and the 6class background as positive samples and negative samples, respectively. Thus, we achieve patchwise labeling accuracy of 87.5%. After the postprocessing, we achieve pixelwise accuracy of 88.4%. The unary pixelwise accuracy on the Weizmann dataset and the VOC2006 dataset is 89.9% and 94.5%, respectively.
> D. Potentials Analysis
On the 13class dataset, we compare the unary potential accuracy with the full model accuracy. The latter is improved by 3.2% on average. This seemingly small numerical improvement corresponds to a large perceptual improvement (see Fig. 6), which shows that our pairwise potentials are effective.
> E. Comparison
We evaluate the performance of the proposed method against that of three stateoftheart methods [24,27,29]. The quantitative measure is the accuracy, namely segmentation cover, which is defined as the percentage of correctly classified pixels in the image (both foreground and background) [24,29].
On the MSRC dataset, since the performance varies substantially for different classes, we respectively give the accuracy of each class. We list the quantitative comparison of seven classes in Table 1, which shows that our method outperforms the two competitors except for the cat class.
In addition, the method proposed in [24] only selects 10 images for each class such that there is a single object in each image, while we compute the segmentation accuracy on the 13class subdataset of 231 images, and many images contain several object instances. The difference in testing data also indicates that our method is more robust than that proposed in [24]. Fig. 7 shows some visual examples of the same images as reported in [24]. Although the accuracies of some examples are lower than those in the case of the competitor methods, our overall accuracy is higher.
On the Weizmann and VOC2006 datasets, we compute the 2class confusion matrix, as shown in Table 2, which shows that the proposed method performs favorably against the method proposed in [27] on the first dataset and much better on the second one. Figs. 8 and 9 show the same examples as those considered in [27]. The reason that the results on the cow dataset are very goodies that the appearances of the foreground and background are respectively homogeneous and the spatial distribution of the foreground is very compact. Compared to the horse dataset, the foreground of many images are inhomogeneous; in particular, horse shanks are very slim and their colors are different from those of the body, which leads to horse shanks going missing from the final segmentation, as shown in Fig. 8. In addition, the similar appearance of the foreground and the shadow in the background possibly causes some errors.
V. CONCLUSIONS
In this paper, we propose a new discriminative model for figure/ground segmentation. First, a pixellevel dictionary is learnt from mass pixelwise Gabor descriptors; second, each pixel is mapped as a highdimensional sparse vector, and then, all the sparse vectors in a patch are fused to represent the patch by max pooling and spatial matching. The proposed unary features can simultaneously capture the appearance and context information, which significantly enhances the unary classification accuracy. The upgrade color and texture potentials with neighborhood interactions and the proposed edge potential weaken the interference of abnormal nodes during graph affinity computation. The experimental results demonstrate that the proposed approach is powerful with a comparison with three stateoftheart approaches. In the future, we hope to integrate explicit semantic context and salient information to make the algorithm more intelligent.

[Fig. 1.] Some image patches on object boundaries. Ignoring their color information, we find that there are many similar spatial structures, which can be considered the common characteristics of the objects.

[]

[]

[]

[Fig. 2.] An example diagram of J = 8 ; phase response αθσ belongs to the 2nd phase interval Φ2.

[]

[]

[]

[]

[]

[Fig. 3.] Split patches. (a) Patch partition, where each square stands for an image patch. (b) Overlaying a patch partition on the image segmentation. (c) Local magnifying map, where an irregular region stands for a split patch.

[]

[Fig. 4.] Unary binary segmentation results (above: input images, below: binary classification results).

[]

[Algorithm 1]

[]

[Fig. 5.] Edge continuity. (a) Split patches. (b) Overlying image edge map on a split patch partition. (c) Local magnifying map.

[]

[]

[]

[]

[]

[]

[]

[Fig. 6.] Contribution of pairwise potentials. (a) Input images. (b) Results of the unary classification with patchwise accuracy. (c) Postprocessing results with pixelwise accuracy. (d) Results of the full CRF model. The first two examples show an increase in accuracy of 0.1% and 3.1%, respectively, while the last example significantly improves the accuracy by 39.0%.

[Table 1.] Quantitative comparison on the MSRC dataset

[Fig. 7.] Qualitative comparison with [24] for the MSRC database. The first and the third rows show the results reported in [24]. The second and the fourth rows show our results.

[Table 2.] Quantitative comparison on the Weizmann and VOC2006 datasets

[Fig. 8.] Examples of representative segmentation results on the Weizmann horse dataset. From top to down: input images, results reported in [27], and our results.

[Fig. 9.] Examples of representative segmentation results on the VOC2006 cow images. From top to down: Input images, results reported in [27], and our results.