
3. Combining the complete data subset and the incomplete data subset after missing value estimation, the completed data set is derived.

Note that mean imputation is covered by this framework by setting the number of clusters to one.
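The special case noted above can be illustrated with a minimal sketch. It assumes, for illustration only, that a missing attribute is estimated by the mean of the observed values of that attribute within the sample's cluster; the function name `impute_by_cluster_mean` and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean

def impute_by_cluster_mean(rows, labels):
    """Fill None entries with the mean of the observed values of the same
    attribute among samples carrying the same cluster label.

    Assumption: every (cluster, attribute) pair has at least one observed
    value; otherwise a KeyError is raised.
    """
    # Collect the observed values for each (cluster, attribute) pair.
    observed = {}
    for row, lab in zip(rows, labels):
        for j, v in enumerate(row):
            if v is not None:
                observed.setdefault((lab, j), []).append(v)

    # Replace each missing entry by the corresponding cluster mean.
    completed = []
    for row, lab in zip(rows, labels):
        completed.append([
            v if v is not None else mean(observed[(lab, j)])
            for j, v in enumerate(row)
        ])
    return completed

# Toy data: None marks a missing attribute.
rows = [[1.0, 2.0], [3.0, None], [5.0, 6.0], [None, 10.0]]

# One cluster: every gap is filled with the global attribute mean,
# i.e. plain mean imputation.
one_cluster = impute_by_cluster_mean(rows, [0, 0, 0, 0])

# Two clusters: each gap is filled with the mean of its own cluster only.
two_clusters = impute_by_cluster_mean(rows, [0, 0, 1, 1])
```

With a single cluster label, the cluster mean coincides with the attribute mean over all observed samples, which is why mean imputation appears as a special case.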

4. Classifier design

With the data completed, the cervical cancer screening task can be treated as a classification task, i.e., the patients are divided into two categories: positive samples (likely to have cervical cancer) and negative samples (unlikely to have cervical cancer). At this stage, two major problems remain. First, the data set is generally class imbalanced, i.e., across all patients there are far fewer positive samples than negative samples. For example, in [18], the positive samples account for about 12% of the total samples. Directly training a classifier on such severely imbalanced data will not yield good results. Second, the completed data contain considerable noise and uncertainty, which may be introduced during the data collection and missing data estimation stages. Hence, how to handle uncertainty is another important issue in classifier design.

To solve the two problems above, a fuzzy ensemble learning approach is designed. It adopts a bagging strategy to generate class-balanced sub-datasets and a fuzzy logic based ensemble module that analyzes the classification results from all weak classifiers and makes the final decision. Denote the entire training set by X, where the subscript represents the data clustering algorithm adopted. The proposed classification approach runs as follows, and its flowchart is given in Figs. 4 and 5, where the red arrows in Fig. 5 indicate that every input at the tail is delivered to every receiver at the head.

1. Data sampling: As shown in Fig. 4, a series of class-balanced subsets is sampled from the entire training dataset X and denoted by D1, D2, · · · , DT (where T is the number of subsets and the subscript of X denotes the algorithm used to fix the missing attributes). In each subset Dt, 80% of the positive samples in X are randomly selected to construct the positive constituents, and a similar number of negatives is selected from X to construct the negative counterpart, which makes Dt a relatively balanced subset. Note that no negative sample appears in more than one subset.

Fig. 6. The antecedent fuzzy set membership function adopted.
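The data sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "a similar number of negatives" is taken to mean exactly as many negatives as positives, the non-overlap of negatives is enforced by slicing a shuffled list, and the names `sample_balanced_subsets`, `pos_fraction`, and `seed` are hypothetical.

```python
import random

def sample_balanced_subsets(positives, negatives, T, pos_fraction=0.8, seed=0):
    """Build T class-balanced subsets D_1, ..., D_T.

    Each subset holds a random pos_fraction of the positives and an equal
    number of negatives (assumption). Negatives are disjoint across subsets;
    positives may repeat across subsets.
    """
    rng = random.Random(seed)
    positives = list(positives)
    n_pos = round(pos_fraction * len(positives))
    if T * n_pos > len(negatives):
        raise ValueError("not enough negatives for T disjoint subsets")

    # Shuffle once, then hand each subset its own slice of negatives.
    shuffled_neg = list(negatives)
    rng.shuffle(shuffled_neg)

    subsets = []
    for t in range(T):
        pos_part = rng.sample(positives, n_pos)
        neg_part = shuffled_neg[t * n_pos:(t + 1) * n_pos]
        subsets.append((pos_part, neg_part))
    return subsets

# Toy example: 10 positive ids, 100 negative ids, T = 5 subsets.
subsets = sample_balanced_subsets(range(10), range(100, 200), T=5)
all_negatives = [n for _, neg in subsets for n in neg]
```

Because each subset draws its negatives from a separate slice, the number of subsets T is bounded by the ratio of negatives to selected positives.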

3. Meta-data generation: As shown in Fig. 5, for the kth weak learner, T classifiers are trained in total, one on each subset, and are denoted by Ck,1, Ck,2, · · · , Ck,T. The bagging algorithm [10] is adopted to obtain the final estimation by combining the classification results from all the classifiers based on a specific combination strategy. Different combination strategies lead to different results. In the proposed approach, three kinds of combination strategies, namely maximum, minimum and average combination, are adopted, denoted by σmax(·), σmin(·) and σave(·), respectively. The meta-data for the samples in the tth subset Dt based on the kth weak learner are given by the combined classification outputs under all three combination strategies, i.e., σmax, σmin, σave (Ck,1, · · · , Ck,t−1, Ck,t+1, · · · , Ck,T). Note that according to [39], when generating the meta-data for a specific subset, the classifier trained on that subset should be excluded from the set of classifiers being combined. Finally, for any observation in (D1, D2, · · · , DT), its meta-data contain 3 × K combined classification results.
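The meta-data generation step can be sketched as follows for a single sample. The sketch assumes that each classifier Ck,i outputs a numeric score for the sample (for instance, an estimated probability of the positive class); the paper excerpt does not state the output type, and the names `combine_excluding` and `meta_data` are hypothetical.

```python
def combine_excluding(scores, t):
    """Combine the outputs of C_{k,1..T} for one sample, leaving out C_{k,t}.

    scores[i] is the (assumed numeric) output of the classifier trained on
    subset i. Returns the maximum, minimum and average combinations.
    """
    others = [s for i, s in enumerate(scores) if i != t]
    return (max(others), min(others), sum(others) / len(others))

def meta_data(scores_per_learner, t):
    """Meta-data of one sample from subset t: 3 values per weak learner."""
    features = []
    for scores in scores_per_learner:
        features.extend(combine_excluding(scores, t))
    return features

# Toy example: K = 2 weak learners, T = 4 subsets, sample drawn from subset t = 1
# (zero-based index), so the second classifier of each learner is excluded.
scores_per_learner = [
    [0.9, 0.2, 0.6, 0.4],
    [0.1, 0.8, 0.5, 0.3],
]
features = meta_data(scores_per_learner, t=1)
```

The resulting feature vector has 3 × K entries, matching the meta-data size stated above, and never uses the classifier that was trained on the sample's own subset.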

4. Ensemble learning module based on fuzzy logic: Inspired by [27], the fuzzy IF-THEN rules filter is selected as the ensemble learner to handle the uncertainty in X. For any sample x in X, the fuzzy IF-THEN rule takes the form