II
III Information source
IV Description of neoplasm
V First course of therapy
VI Follow up information
VII Recoded variables
List of variables used in the experiments.

No.  Variable      Description
1    PUBCSNUM      Patient identification
2    REG           SEER registry
3    MAR_STAT      Marital status at diagnosis
4    RACE1V        Race/ethnicity
5    NHIADE        NHIA derived hisp origin
6    SEX           Sex
7    AGE_DX        Age at diagnosis
8    YR_BRTH       Year of birth
9    SEQ_NUM       Sequence number
10   MDXRECMP      Month of diagnosis
11   YEAR_DX       Year of diagnosis
12   PRIMSITE      Primary site ICD-O-2
13   LATERAL       Laterality
14   HISTO3V       Histologic Type ICD-O-3
15   BEHO3V        Behavior code ICD-O-3
16   GRADE         Grade
17   DX_CONF       Diagnostic confirmation
18   REPT_SRC      Type of reporting source
19   EOD10_PN      EOD 10 - positive lymph
20   EOD10_NE      EOD 10 - number of lymph
21   CSTUMSIZ      CS Tumor size
22   CSEXTEN       CS Extension
23   CSLYMPHN      CS Lymph Nodes
24   CSMETSDX      CS Mets at DX
25   CS1SITE       CS Site-Specific Factor 1
26   CS2SITE       CS Site-Specific Factor 2
27   CS25SITE      CS Site-Specific Factor 25
28   DAJCCT        Derived AJCC T
29   DAJCCN        Derived AJCC N
30   DAJCCM        Derived AJCC M
31   DAJCCSTG      Derived AJCC Stage Group
32   DSS1977S      Derived SS1977
33   DSS2000S      Derived SS2000
34   SURGPRIF      RX Summ-surg prim site
35   SURGSCOF      RX Summ-scope reg LN sur
36   SURGSITF      RX Summ-surg oth reg/dis
37   NO_SURG       Reason no cancer-directed surgery
38   RADIATN       Radiation
39   RAD_SURG      Radiation sequence with surgery
40   REC_NO        Record number
41   cs0204schema  CS Schema v0204
42   HST_STGA      SEER historic stage A
43   FIRSTPRM      First malignant primary indicator
44   SUMM2K        Historic SSG 2000 Stage
target values. Thus, 44 attributes were kept, which are listed in Table 4. It is worth noting that one main difference from previous studies is that we used the patient identification number as one of the attributes in model construction; several feature analysis methods found it to be a significant factor. To support this choice, we performed a Pearson correlation analysis between the patient identification number and the other attributes in the SEER colorectal dataset. The results showed that collinearity indeed exists between the patient identification number and other attributes, such as SEER registry, age at diagnosis, positive lymph nodes examined, and diagnostic date. Table 5 presents the attributes significantly correlated with the patient identification number, together with the corresponding Pearson correlation coefficients. Although the specific coding rule for the patient identification number in the SEER program is unknown, we found in the literature that patient identification numbers are often assigned in a way that relates to certain characteristics or types of patients. In summary, the patient identification number does carry some additional information about patients, although it has no explicit medical meaning.
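The correlation check described above can be illustrated with a minimal sketch. It computes the Pearson coefficient between an identifier column and each other numeric attribute. The helper names (`pearson`, `correlations_with`) and the toy records are assumptions made for illustration only; they are not SEER data and not the code used in the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlations_with(records, id_field, other_fields):
    """Correlate one field (here the patient ID) with each of the other fields."""
    ids = [r[id_field] for r in records]
    return {f: pearson(ids, [r[f] for r in records]) for f in other_fields}


# Invented toy records (hypothetical values, not taken from SEER).
toy = [
    {"PUBCSNUM": 1001, "REG": 1, "AGE_DX": 70},
    {"PUBCSNUM": 1002, "REG": 1, "AGE_DX": 64},
    {"PUBCSNUM": 2001, "REG": 2, "AGE_DX": 58},
    {"PUBCSNUM": 2002, "REG": 2, "AGE_DX": 61},
]

result = correlations_with(toy, "PUBCSNUM", ["REG", "AGE_DX"])
for name, r in result.items():
    print(f"{name}: r = {r:.3f}")
```

In the toy data the identifier is constructed to track the registry code, so the coefficient for `REG` comes out close to 1. This is the kind of collinearity the text reports, shown here on made-up values.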
The colorectal cancer dataset of the 2016 submission contains about 110,000 cases from 1973 to 2013, but only cases diagnosed from 2004 to 2008 are used in this study. Some variables are null for cases diagnosed before 2004, and cases diagnosed after 2008 cannot provide a five-year outcome (five-year survivability). In addition, records without a determined diagnostic date and end-of-follow-up date, and records whose cause of death is not the primary cancer, are excluded. As stated in the introduction, only cases whose Derived AJCC Stage Group is stage IV are included. After preprocessing, 1568 records were retained for the classification experiments: 65 with survival months over 60 and 1503 with survival months less than 60. The 1503 records with survival months less than 60 were used for the experiments of the regression stage.
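The selection rules above can be sketched as a simple filter followed by a split into classification and regression sets. Several names here are assumptions: `SURV_MONTHS` and `DIED_OTHER_CAUSE` are hypothetical fields standing in for survival time and cause of death, the stage code `70` is a placeholder rather than the real SEER coding of stage IV, and missing dates are simplified to `None` values. Treating exactly 60 months as "survived" is also an assumption, since the text only distinguishes over 60 from less than 60.

```python
# Placeholder code for stage IV; the actual DAJCCSTG coding is not given here.
STAGE_IV_CODES = {70}


def select_cohort(records, stage_iv_codes, first_year=2004, last_year=2008):
    """Apply the exclusion rules: diagnosis window, known dates, stage IV, cause of death."""
    kept = []
    for r in records:
        # Drop records whose diagnosis year or follow-up length is undetermined.
        if r.get("YEAR_DX") is None or r.get("SURV_MONTHS") is None:
            continue
        # Keep only diagnoses inside the study window.
        if not (first_year <= r["YEAR_DX"] <= last_year):
            continue
        # Keep only stage IV cases.
        if r.get("DAJCCSTG") not in stage_iv_codes:
            continue
        # Drop deaths not attributed to the primary cancer.
        if r.get("DIED_OTHER_CAUSE"):
            continue
        kept.append(r)
    return kept


def split_for_experiments(cohort, threshold=60):
    """Return (record, survived) pairs for classification and non-survivors for regression."""
    labeled = [(r, r["SURV_MONTHS"] >= threshold) for r in cohort]
    regression = [r for r, survived in labeled if not survived]
    return labeled, regression


# Invented toy records (hypothetical values, not taken from SEER).
toy = [
    {"YEAR_DX": 2005, "DAJCCSTG": 70, "SURV_MONTHS": 72, "DIED_OTHER_CAUSE": False},
    {"YEAR_DX": 2006, "DAJCCSTG": 70, "SURV_MONTHS": 14, "DIED_OTHER_CAUSE": False},
    {"YEAR_DX": 2001, "DAJCCSTG": 70, "SURV_MONTHS": 30, "DIED_OTHER_CAUSE": False},
    {"YEAR_DX": 2007, "DAJCCSTG": 30, "SURV_MONTHS": 40, "DIED_OTHER_CAUSE": False},
    {"YEAR_DX": 2008, "DAJCCSTG": 70, "SURV_MONTHS": None, "DIED_OTHER_CAUSE": False},
    {"YEAR_DX": 2004, "DAJCCSTG": 70, "SURV_MONTHS": 20, "DIED_OTHER_CAUSE": True},
]

cohort = select_cohort(toy, STAGE_IV_CODES)
labeled, regression = split_for_experiments(cohort)
print(len(cohort), sum(1 for _, s in labeled if s), len(regression))
```

Keeping the filter and the split as separate functions mirrors the two uses in the text: the full cohort feeds the classification experiments, while only the records below the five-year threshold feed the regression stage.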