GAOGB also achieves a 15%–28% average relative improvement in accuracy over its weak learner, which is higher than both OLRGB and OLRAB. According to Beygelzimer et al. (2015), OGB can achieve an improvement of up to 20% over the base learner. In our experiment, OGB shows a 9%–21% improvement. OLRAB improves the base learner by 6%–24%, which is a wider range.
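As a rough illustration of how such figures can be computed, the sketch below assumes that "relative improvement" means the accuracy gain of the boosted model divided by the accuracy of its base learner. This definition is an assumption, since the text does not state the formula, and the function name and the sample accuracies are hypothetical rather than results from the experiments.

```python
def relative_improvement(base_accuracy: float, boosted_accuracy: float) -> float:
    """Return the accuracy gain of the boosted model as a fraction of the base accuracy.

    Assumed definition: (boosted - base) / base.
    """
    if base_accuracy <= 0:
        raise ValueError("base_accuracy must be positive")
    return (boosted_accuracy - base_accuracy) / base_accuracy


# Illustrative numbers only; these are not results reported in the paper.
gain = relative_improvement(base_accuracy=0.80, boosted_accuracy=0.92)
print(f"relative improvement: {gain:.1%}")
```

Under this definition, a range such as 15%–28% would be obtained by averaging the per-dataset gains over repeated runs and reporting the smallest and largest averages.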
4.4. SEER breast cancer dataset
The Surveillance, Epidemiology, and End Results (SEER) dataset is adopted in the BC survivability prediction task to evaluate our model. As part of the Surveillance Research Program launched by the US National Cancer Institute (NCI) through a cancer statistics program, SEER is a population-based cancer incidence dataset that includes reliable patient information on anatomical sites such as the breast, colon, and urinary system. Since 1973, new data have been released each year, amounting to more than 80,000 detailed patient records in total. SEER has grown in scale in recent years. With increases in the number of both instances and attributes, BC prediction models implemented with SEER data promise better predictive accuracy. Meanwhile, retraining costs for offline models increase, and model efficiency becomes an added challenge in the prediction task. SEER thus demonstrates a need for efficient BC prognosis models.
The SEER data can be requested on the official SEER website by signing a research data agreement. The latest SEER data were released in 2016 and contain more than 960,000 instances of cancer patient information for the years 1973–2014. From these we extracted the newest BC data for our research purpose, which comprise approximately 800,000 observations. The BC dataset contains 133 features describing socio-demographic and cancer-specific information for each of the cancer incidences (Delen et al., 2005; SEER, 2017).
Fig. 2. Overall performances on 3 UCI datasets in terms of 4 metrics.
The input data are carefully processed and labeled in order to obtain satisfactory classification performance. The features of the SEER BC data record information about the Extent of Disease (EOD), the Site Specific Surgery (SSS), and personal information of patients such as age and race. The EOD is described by the tumor size, number of positive nodes, number of nodes, and number of primaries (Shambaugh et al., 1992). The SSS is represented by features such as the primary site code, lymph node involvement, and radiation. Notably, the features describing each aspect of the cancer incidences have changed several times since 1973. Thus the raw SEER BC data contain a large portion of missing information for features collected after 1973. To gather the available fields of information, we removed features for which more than half of the records are missing, and removed records with a large portion of missing data for the selected features. At the labeling stage, we define “survival” as the case in which a BC patient is still alive 5 years after the date of diagnosis, following the patient expectations estimated in previous research (Delen et al., 2005). After preprocessing, we obtained a dataset with 14 features as indicators of survivability, consisting of 82,707 records, of which 76,716 (92.8%) are cases in which the patient survived the cancer and 5,991 (7.2%) are cases in which the patient did not. The data are summarized in Table 6; there are 10 categorical features and 4 continuous features.
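The filtering and labeling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study. The field names (`survival_months`, `vital_status`), the use of `None` as the missing-value marker, and the record-level threshold are hypothetical. The handling of patients followed for less than 5 years is also an assumption, since the text does not say how such cases were treated.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def drop_sparse_features(records: List[Record], max_missing: float = 0.5) -> List[Record]:
    """Drop every feature whose share of missing (None) values exceeds max_missing."""
    if not records:
        return []
    n = len(records)
    kept = [
        name for name in records[0]
        if sum(r.get(name) is None for r in records) / n <= max_missing
    ]
    return [{name: r.get(name) for name in kept} for r in records]


def drop_incomplete_records(records: List[Record], max_missing: float) -> List[Record]:
    """Drop every record whose share of missing values exceeds max_missing.

    The paper only says "a large portion", so the threshold is left to the caller.
    """
    return [
        r for r in records
        if r and sum(v is None for v in r.values()) / len(r) <= max_missing
    ]


def label_survival(survival_months: Optional[int], vital_status: str,
                   horizon_months: int = 60) -> Optional[int]:
    """Return 1 (survived), 0 (did not survive), or None (outcome unknown)."""
    if survival_months is None:
        return None
    if survival_months >= horizon_months:
        return 1  # survived at least 5 years after diagnosis
    if vital_status == "dead":
        return 0  # died within the 5-year horizon
    return None  # assumption: alive but followed for under 5 years, so unlabeled


# Toy records with hypothetical fields; not SEER data.
raw = [
    {"age": 50, "race": "A", "x": None, "survival_months": 70, "vital_status": "alive"},
    {"age": 61, "race": None, "x": None, "survival_months": 12, "vital_status": "dead"},
    {"age": None, "race": "B", "x": 3, "survival_months": 20, "vital_status": "alive"},
]
features_kept = drop_sparse_features(raw)            # "x" is missing in 2 of 3 records
complete = drop_incomplete_records(features_kept, max_missing=0.2)
labels = [label_survival(r["survival_months"], r["vital_status"]) for r in raw]
```

The feature filter runs before the record filter, so the share of missing values per record is measured only over the retained features, which matches the order given in the text.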
This research focuses on devising a GAOGB algorithm and validating its effectiveness on breast cancer datasets. To gain a more unbiased evaluation of different online learning methods and prevent the large number of “survival” cases from influencing the inter-
Ranges of the attributes in the Spambase dataset.
Attribute(s) Attribute description Data type Range
Mean SD Max
55 Average length of uninterrupted sequences of capital letters