Data mining

I need support with this Computer Science question so I can learn better.

Project topic: Data mining - Breast Cancer Diagnostic

Instructions for the Research Project

I. Data Analysis Project.

  • Identify the problem(s) to be solved or opportunities to be realized by mining the selected data set.
  • Consider the following data preparation questions and explain your answers. When appropriate, cite resources that support your answer. Explain how the answers, and the data preparation, differed when you chose a different data mining method.
      • Should instances with missing values be deleted?
      • Should missing values be specially coded and then retained in the data set?
      • Should numeric values be assigned predetermined ranges or left for the algorithm to split?
      • Should categorical variables be grouped or coded to reflect a hierarchy?
  • To explore the problem or opportunity, use two or more of the following data mining methods covered by this course:
      • regression: linear regression, discriminant analysis, or logistic regression,
      • decision trees,
      • neural networks,
      • hierarchical or k-means clustering,
      • association rules,
      • time series,
      • genetic algorithms.
  • Describe the algorithms chosen and indicate why you chose them.
      • Exploring a method of interest is a satisfactory reason for this course paper.
  • Explain how and why you used specific pruning parameters or other adjustments to create a sparser model.
  • Compare the alternative solutions using methods found in comparative studies in the literature.
  • Create a table showing the number of cases correctly identified and the numbers of Type I and Type II errors. In addition, a ROC curve is appropriate with discriminant analysis and logistic regression. For these methods, changing the parameters of the line separating the classes changes the percentages of Type I and Type II errors. Medical practitioners like ROC curves because they show the tradeoff between false positives and false negatives.
  • Which data mining method(s) seem superior for the chosen data set? Did the method that performed best in your study also dominate in similar comparative studies?
  • Compare the results or recommendations that would result from the use of the different methods.
  • Based on your analysis, justify a conclusion or recommendation.
  • Cite the relevant literature using APA formatting.
  • Organize the paper into the sections of a formal research paper: Introduction, Methods, Results, etc.
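The table of correct cases, Type I errors, and Type II errors asked for in the list above, and the ROC tradeoff it mentions, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the eight scores and labels are invented (they are not from the breast cancer data set), the function names are hypothetical, and "positive" is assumed to mean malignant.

```python
# Hypothetical model scores (predicted probability of malignancy) and true
# labels (1 = malignant, 0 = benign). Made-up values for illustration only.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def confusion_counts(scores, labels, threshold):
    """Count outcomes when cases scoring >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1  # Type I error (false positive)
        elif predicted == 0 and label == 1:
            fn += 1  # Type II error (false negative)
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for threshold in sorted(set(scores), reverse=True):
        c = confusion_counts(scores, labels, threshold)
        points.append((c["fp"] / negatives, c["tp"] / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


for threshold in (0.5, 0.25):
    c = confusion_counts(scores, labels, threshold)
    print(f"threshold={threshold}: correct={c['tp'] + c['tn']}, "
          f"Type I={c['fp']}, Type II={c['fn']}")
print("AUC:", auc(roc_points(scores, labels)))
```

On this toy data, lowering the threshold from 0.5 to 0.25 removes the one Type II error but adds a second Type I error, which is the tradeoff a ROC curve displays across all thresholds.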

For example, see “Data mining for network intrusion detection: A comparison of alternative methods,” Dan Zhu, G. Premkumar, Xiaoning Zhang, Chao-Hsien Chu. Decision Sciences. Atlanta: Fall 2001. Vol. 32.… Report the results of the accuracy measures available with the software. If the software used does not have built-in accuracy reporting, then manually test the model’s accuracy on a small hold-out test sample of the data. The hold-out method creates separate training and test sets. This is particularly useful when testing the model on data from a later time period.
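The hold-out method described above can be tried by hand along the following lines. This is a minimal sketch, not a prescribed procedure: `holdout_split` and `accuracy` are hypothetical helper names, the records are placeholder data, and the "model" is a deliberately simple stand-in (a cut-off halfway between the two class means). The random shuffle is also an assumption; for data ordered by time, the split would instead be made by date so that the test set comes from the later period.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_records):
    """Fraction of held-out (feature, label) pairs that predict() labels correctly."""
    correct = sum(1 for feature, label in test_records if predict(feature) == label)
    return correct / len(test_records)


# Hypothetical records: (single feature value, label). Placeholder data only.
records = [(x, 1 if x >= 10 else 0) for x in range(20)]
train, test = holdout_split(records, test_fraction=0.25, seed=42)

# Stand-in "model": a cut-off halfway between the class means,
# fitted on the training records only so the test set stays unseen.
class0 = [x for x, label in train if label == 0]
class1 = [x for x, label in train if label == 1]
cutoff = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def predict(x):
    return 1 if x >= cutoff else 0


print("train size:", len(train), "test size:", len(test))
print("hold-out accuracy:", accuracy(predict, test))
```

Fitting on `train` and scoring only on `test` is the point of the design: the reported accuracy then reflects cases the model has not seen.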

II. Writing Skills Research Paper

10 pages, excluding graphs

Use the APA style guide for your citations. After each title, add a short note evaluating its quality as a research tool (refer back to the evaluation criteria) and its quality relative to the other sources cited in your bibliography.

Here are some ideas of what the ideal project should look like:

1. Introduction – abstract

a. Describe the purpose of the project

b. Include the conclusion of the research as a summary


a. Include previous research used or cited

b. Explain data used

c. Share basic theory used and terminology

d. Possible inclusion of similar research and pre-existing results


a. Include graphs and tables explaining the preliminary results from the model

b. Assess data quality produced by implementation of analytic plan

c. Analyze data structure; evaluate how well the model worked

d. Identify any areas of concern

e. Describe statistics generated from the various stages of model building that describe the model’s fit and ability to accurately depict the data

f. Describe statistics generated from the various stages of model building that describe the model results. The submission should have no major errors related to citations, grammar, spelling, syntax, or organization.

g. Complete explanation of the processes

h. Graphs and Charts

i. Possibly include an analysis of the final choice of data structure, variables used, etc.

j. Possibly include flowcharts – complete and revised based on feedback.

5. Conclusions

1. Explain your findings. Did you have any findings? Be specific.

2. Is there any way you could advance your study?

3. Was your work inconclusive? If yes, what are the underlying factors, what other information is needed, and who could provide this information?

6. Raw Data

7. References