Further, we are dealing with a class-imbalanced dataset, a common phenomenon in medical diagnosis, since rare cases such as persons affected by esophageal cancer are harder to identify in the population than normal persons. Because life is at stake in medical diagnosis, the cost of misclassifying a true patient as a normal person (a false negative, FN) is vastly higher than the cost of the reverse error. Several techniques are available to tackle imbalanced classification, but they mainly target the SVM algorithm [30–32]. Studies on the prediction of breast and colon cancers using SVM with imbalanced data, and on post-operative life expectancy in lung cancer patients using boosted SVM for rule extraction, indicate that this practical problem can be handled efficiently. Asymmetric bagging-based SVM, although used for image retrieval, has helped to overcome imbalance with a small training set. However, this approach may not focus on minimizing FN, i.e., on achieving 100% sensitivity. We have also noted the use of ensemble methods for cost-sensitive decision trees with imbalanced datasets, but we are not able to report any study of imbalanced data with LR.
The above literature review highlights the following:
1. Although much literature is available on classifying patients based on EMR for Bortezomib (PS-341) disease, breast cancer and many other diseases, hardly any literature addresses the classification of esophageal cancer based on demographic, lifestyle and basic clinical data.
2. There is no prior work on personalized diagnostic test selection according to the choice of different stakeholders such as doctors, patients, health care service providers or insurance companies.
3. No single data mining method is best across all types of EMR data. However, the most used methods in the literature review for Linkage group study are NB, SVM, RF and LR. These basic methods outnumber modern or hybrid methods many times over.
4. Researchers have not explored kernel methods as a data mining tool for disease prediction with the help of EMR.
2. Dataset and model evaluation
This section describes the data preparation methodology and the statistics of the collected dataset (Section 2.1), and the metrics used to evaluate model performance (Section 2.2).
2.1. Data collection and preprocessing
For this research, we used data collected by a reputed hospital in Mumbai, India, using two mobile vehicles in remote areas of Maharashtra, India. The data were collected in eight different data collection forms, each populated by different health care professionals (doctors, paramedical personnel, field operators, laboratory technicians) and depending on the choice of patient (control or experiment). These datasets were joined using each patient's unique key to form the final electronic medical record (EMR).
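The joining step can be illustrated with a minimal sketch. The form names, the key name `patient_id`, and every field and value below are hypothetical placeholders chosen for illustration; they are not taken from the actual forms, and the helper `join_forms` is an assumed name, not the procedure used in this study.

```python
from collections import defaultdict


def join_forms(forms, key="patient_id"):
    """Merge rows from several forms into one record per patient key.

    A patient who is missing from a form simply lacks that form's
    fields in the merged record (outer-join behaviour).
    """
    merged = defaultdict(dict)
    for form in forms:
        for row in form:
            merged[row[key]].update(row)
    return dict(merged)


# Hypothetical forms, each filled in by a different professional.
demographics = [
    {"patient_id": "P001", "age": 54, "sex": "M"},
    {"patient_id": "P002", "age": 47, "sex": "F"},
]
lifestyle = [
    {"patient_id": "P001", "tobacco_use": True},
    {"patient_id": "P002", "tobacco_use": False},
]
clinical = [
    # Only P001 was sent on to the second stage in this toy example.
    {"patient_id": "P001", "bst_done": True},
]

emr = join_forms([demographics, lifestyle, clinical])
for patient_id, record in sorted(emr.items()):
    print(patient_id, record)
```

Keeping patients who are absent from some forms, rather than dropping them, is one possible design choice; it leaves the handling of missing fields to a later preprocessing step.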
The mobile vans collected the demographic and basic clinical data of randomly selected rural people for initial screening and sent only about 1 in 7 of them for a barium swallow test (BST) and a doctor's diagnosis. The vans originally collected data for approximately 22,000 patients at the initial screening stage and for 3689 patients at the next stage, i.e., the BST and doctor's diagnosis, depending on a screening criterion applied by the paramedical technicians accompanying the van. Finally, the BST and doctor's diagnosis determined oral cavity, which is our ground truth label. In the dataset, only 92 of the 3689 patients have oral cavity marked as present, and these were further recommended for esophageal cancer treatment. Hence, we note that while there was no self-selection bias in the data, since patients were chosen randomly, there was bias induced by the selection made by the paramedical staff.
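The degree of imbalance implied by these counts can be worked out directly. The sketch below uses only the two figures reported above (92 positives among 3689 second-stage patients); the "always negative" baseline is an illustrative assumption added here to show why accuracy alone is misleading on such data, not a model evaluated in this study.

```python
# Counts reported for the second-stage dataset.
n_total = 3689
n_positive = 92
n_negative = n_total - n_positive

prevalence = n_positive / n_total          # share of positive labels, about 2.5%
imbalance_ratio = n_negative / n_positive  # negatives per positive, about 39

# Illustrative baseline (an assumption): label every patient as negative.
tp, fn = 0, n_positive
tn, fp = n_negative, 0

accuracy = (tp + tn) / n_total
sensitivity = tp / (tp + fn)

print(f"prevalence:      {prevalence:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.1f} negatives per positive")
print(f"baseline accuracy:    {accuracy:.4f}")
print(f"baseline sensitivity: {sensitivity:.4f}")
```

A classifier that never flags a patient would be right for more than 97% of records while missing every true case, which is why sensitivity, rather than accuracy, is the quantity of interest here.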