Service Learning Outcomes in an
Undergraduate Data Mining Course
Terry Letsche
Department of Mathematics, Computer Science, and Physics
Wartburg College
Waverly, IA 50677
terry.letsche@wartburg.edu
Abstract
Data mining is becoming increasingly important as the amount of data being stored
increases, and data miners are highly sought after in the job market. This paper describes
a pilot course-offering in data mining in a small, four-year liberal arts college as part of a
mixed Computer Science and Computer Information Systems curriculum. Several
curricular choices are examined, including text and software selection, and the design of
a final project to assess student performance. The final project offered three
alternatives: the student could implement or extend a data mining algorithm in the
language of their choice, prepare a major paper on the impact of data mining on their
discipline, or find an outside project (with instructor approval) to analyze using the data
mining techniques developed in the course. The role of service learning as an
intentional curricular choice is also examined, and two student projects are discussed.
1 Introduction
Data mining is gaining exposure as a way to make sense out of the increasing amounts of
data that are being collected and stored. The federal government recently announced that
new software, ADVISE, would be used in homeland security to link and cross-match
material between websites, government records, and personal data.[1] It has also been
argued that the business case for federal data mining efforts has not been made.[2] Netflix has
recently announced a contest to improve the accuracy of their current selection prediction
algorithms using data mining.[3,4] Data mining can even be applied to the NBA draft
process![5] Data miners are highly sought after in the job market. Data mining stands at
the confluence of a number of different disciplines.
Wartburg College is a four-year liberal arts college located in Waverly, Iowa. The college
offers majors in not only computer science (CS), but also in computer information
systems (CIS), a hybrid of computer science and business administration with a heavier
emphasis placed on computing than a typical management information systems major.
The initial offering of a data mining course occurred in the winter term of 2006 as a
rotating special topic, offered every three years. Since it would be some time until the
course could be offered again, the only prerequisite was CS1 or its equivalent. Also, since
students from many disciplines take CS1 as part of their major requirements (engineering
science, for example), the course would have to be accessible to a wide variety of
interests and computing backgrounds.
When designing the course, a number of previously published works were examined. For
example, Musicant’s [6] approach of basing an undergraduate data mining course around
research papers would be unworkable in an environment where the students had such
widely varying backgrounds. Additional sources were examined [7,8], but the closest
match to what was envisioned was Dian Lopez’s course, described in [9]. The University
of Minnesota, Morris, approach of creating an undergraduate-level course that not only
surveyed the breadth of the data mining field but also allowed students to formulate their
own research problems was extremely appealing.
This paper begins with a brief discussion of the texts that were examined, followed by the
software choice and a description of the final project. Two student projects are discussed,
followed by student reactions and thoughts on future data mining course offerings.
2 Book Selection
The text for the course was selected with two primary criteria in mind. First, it must be
accessible to the students while still being rigorous. One of the goals of the course was to
introduce students to several of the key algorithms in data mining, so it was natural that a
more computational approach was required that also used many examples. Second, it
would be a plus if the text was also linked to some data mining software that could be
used by the class since a second course goal was experience with a data mining tool.
Seven texts were considered.
2.1 Data Mining: A Tutorial-Based Primer by Roiger & Geatz
Roiger and Geatz [10] begin by covering data mining fundamentals with brief overviews
of the data mining process, classification, prediction, clustering, and market basket
analysis, then spend more time in later chapters on specific algorithms for decision trees,
association rules, the K-Means algorithm, and genetic algorithms. The book comes with a
180-day trial version of iData Analyzer, an Excel data mining add-in; both the text and
the software are used extensively in the development of data mining processes and
techniques in later chapters.
The book also has more advanced chapters that could be used as separate topics on neural
networks, statistical analysis techniques, time-series analysis, rule-based systems, and
fuzzy reasoning. The book comes with several data sets that are used in the book as
examples, and has questions at the end of chapters.
This book met the criteria above in that the core of the content was accessible to almost
any student, with opportunities for further exploration in the advanced topics for more
advanced students.
2.2 Principles of Data Mining by Hand et al.
Hand et al. [11] focus on bridging the ideas of statistical modeling and computational
data mining algorithms. The book is split into three sections. In the first section, a
foundational tutorial of data mining principles is presented at an intuitive level. The
second section builds on this, covering algorithms on trees, classification and regression
rule sets, association rules, statistical models, and neural networks, among others. The
final section shows how the preceding can be used together to solve real-world data
mining problems.
This book is marketed towards senior-level undergraduate or beginning graduate
students. As such, the level of statistics, in particular, seemed daunting for an offering
with a mix of potentially first- through fourth-year students of varying quantitative
experience. The book lacks exercises, but has extensive “Further Reading” sections at the
end of each chapter.
2.3 Data Mining by Adriaans & Zantinge
Adriaans and Zantinge [12] is a concise, management-level overview of data mining that
could be used as a springboard for a class made up of this overview and additional
readings from journals or other books on topics of interest to the instructor. The book
gives overviews of the data mining and knowledge discovery process, with more extensive
treatment of three real-life applications.
The book itself is a great resource as an introductory overview, but its lack of algorithmic
depth, examples, and exercises caused this book to be placed further down the list.
2.4 Data Mining: Introductory and Advanced Topics by Dunham
Dunham [13] is also a rather concise book, split into three sections. In the first section,
data mining tasks such as classification, regression, prediction, clustering, etc. are
described and the core topics are motivated. The second part provides more thorough
coverage of classification, clustering, and association rules, using pseudocode and
numerous examples to describe these techniques. The book concludes with additional
advanced topics including web, spatial, and temporal mining techniques and algorithms.
Each chapter concludes with exercises. However, the book is geared to advanced
undergraduates and beginning graduate students who have completed at least an
introductory database course. The book has an extensive appendix that surveys numerous
data mining software packages, although the book itself does not adopt any particular
software for its examples.
2.5 Data Mining Techniques: For Marketing, Sales, and Customer
Support by Berry and Linoff
Berry and Linoff [14] approach data mining from the standpoint of the business
practitioner. After a lengthy motivation, seven data mining techniques are discussed:
cluster detection, memory-based reasoning, market basket analysis, genetic algorithms,
link analysis, decision trees, and neural networks. One of the books strong points is its
use of case studies. Its focus on the business, rather than computational end of the
spectrum, plus its lack of exercises caused this book to not be considered.
2.6 Data Mining: Concepts and Techniques by Han and Kamber
Han and Kamber [15] have the reputation of being the gold standard by which other data
mining books are judged. It is a comprehensive book; its broad coverage and in-depth
development of algorithms make it an excellent resource for the instructor. It
begins with an overview of data mining, progressing to a description of the data mining
process, including data preprocessing and transformation. Since the authors take a
database view of data mining, there are two chapters on data warehousing and data cubes
that could be omitted without affecting the flow of the course. The book has an extensive
treatment of association, correlation, classification, prediction, and clustering algorithms,
with additional material highlighted in the bibliographic notes at the end of each chapter.
The book also contains a number of advanced chapters on time series, social network,
and spatial data mining, as well as a concluding chapter on trends in data mining and a
number of case studies.
Although this book is database-oriented, the information presented doesn’t rely on a
database interpretation. Each chapter concludes with exercises, and the publisher makes
extensive instructor resources available. The book does not focus on any particular
software package.
2.7 Data Mining: Practical Machine Learning Tools and Techniques by
Witten and Frank
Witten and Frank [16] have written a data mining textbook for use with the Java-based
Waikato Environment for Knowledge Analysis, or Weka. The objective of the book is to
introduce the tools and techniques for machine learning that are used in data mining. The
book’s approach lands it somewhere between the practical approach of many business-oriented
data mining books and the more theoretical approach used in other textbooks.
The book is presented in two sections. In the first section, data mining is introduced, and
the various algorithms are presented. Later chapters delve into algorithmic detail for each
of the main families of algorithms (decision trees, classification rules, linear models,
instance-based learning, numeric prediction, clustering, and Bayesian networks),
with a subsequent chapter on common transformations on the input and output and a
concluding chapter on extensions and applications of data mining.
The second part is devoted to the Weka software itself, from an introduction and tutorial,
to information on how the reader can extend or implement additional algorithms using
the Weka framework.
While the book itself doesn’t have exercises, the authors have made exercise sets and
other instructor materials available through the publisher.
2.8 And the winner is…
The KDnuggets website recently had a poll asking which book data mining practitioners
liked as an introduction or textbook to the field.[17] Twenty-three percent of the
respondents in the unscientific poll chose Han and Kamber, eighteen percent chose
Witten and Frank, and seventeen percent chose Hand et al. For this course, the final
choice came down to Han and Kamber versus Witten and Frank, with Witten and Frank
getting the nod because of the Weka software: it runs on virtually any platform, it gave
the students a state-of-the-art testbed for doing real data mining work, and it is used in
all of the examples in the book. Obviously, this list is hardly exhaustive; there are many
other data mining books and textbooks available that would suit many course needs.
3 Software Selection
Software selection was concurrent with book selection. The principal benefit of Weka
was not only its platform independence, but also its cost: free. There are a number of data
mining companies that make trial versions of their software available for academic use
for free or at a reduced cost, such as IBM’s Intelligent Miner[18]. The only other
seriously considered software was YALE (Yet Another Learning Environment)[19],
developed at the University of Dortmund. Like Weka, YALE is written in Java and uses a
building-block paradigm to allow rapid prototyping of data mining algorithms. It also has
a GUI and incorporates Weka’s machine learning library. YALE also supports the
paradigm of external modules that can be treated as plugins to allow additional
functionality to be incorporated.
Once the Witten and Frank textbook was selected, it seemed natural to use the Weka
software. Each student was invited to download a version for their own computer
(Windows or Mac), and a version was installed in the Linux lab adjacent to the classroom.
The Windows version comes with its own JRE, while the Mac and Linux users
downloaded only the Weka jar file. In the Linux lab, IBM’s free JRE was used since it
included just-in-time compilation (JIT), which can significantly reduce processing time
on large datasets.
4 Course Projects
The course followed the Witten and Frank book’s order of topics, with additional
examples done in class using active and collaborative learning. An invaluable resource
was the sample data mining curriculum created by Dr. Gregory Piatetsky-Shapiro and
Prof. Gary Parker located at [20]. A semester-long project was assigned and timed to
follow the content of the text. This project mirrored the efforts to use microarray data to
classify leukemia as acute myeloid leukemia or acute lymphoblastic leukemia, as
reported in Golub [21] and Piatetsky-Shapiro [22]. In the first part of the project, the microarray
data has been narrowed down to fifty relevant genes. The students use this data and a
number of suggested classification algorithms to determine which gene(s) is/are the best
predictor of the two classes of leukemia. In the second phase, the original data set is used
and students perform three tasks: separate training from test data sets, perform a variety
of data cleaning techniques on both data sets, and lastly, the cleaned training data is used
to build models using a variety of algorithms that are then used to evaluate predictions
against the test data set. Phase 3 covers feature set reduction using a number of different
methods. In phase 4, the reduced feature set is used to isolate gene U82759, the Human
homeodomain protein HoxA9 mRNA.
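The phase-1 exercise, finding which single gene best separates the two leukemia classes, amounts to scoring each gene as a one-attribute threshold classifier and ranking the scores. A minimal Python sketch of that idea follows; the data layout and toy values are hypothetical stand-ins for the 50-gene table, not the actual course data:

```python
# Rank genes by how well each one alone separates the two leukemia
# classes (AML vs. ALL) with a single-threshold rule.

def stump_accuracy(values, labels):
    """Best accuracy of a one-gene threshold classifier."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    # Try every split point; predict one class below it, the other above.
    for i in range(n + 1):
        below = [lab for _, lab in pairs[:i]]
        above = [lab for _, lab in pairs[i:]]
        for lo, hi in (("ALL", "AML"), ("AML", "ALL")):
            correct = below.count(lo) + above.count(hi)
            best = max(best, correct / n)
    return best

def rank_genes(samples, labels):
    """Return (gene, accuracy) pairs, best single predictor first."""
    genes = samples[0].keys()
    scored = [(g, stump_accuracy([s[g] for s in samples], labels))
              for g in genes]
    return sorted(scored, key=lambda t: -t[1])

# Toy data standing in for the microarray table (values are made up).
samples = [{"U82759": 9.1, "G2": 3.0}, {"U82759": 8.7, "G2": 2.1},
           {"U82759": 1.2, "G2": 2.9}, {"U82759": 0.8, "G2": 3.3}]
labels = ["AML", "AML", "ALL", "ALL"]
print(rank_genes(samples, labels)[0])  # ('U82759', 1.0)
```

Running every gene through the same scoring loop is also a crude form of the feature-set reduction done in phase 3: genes whose best stump barely beats guessing can be dropped.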
For the class’s final project, students were given the option of three different final
projects, all of which were to be presented to the class. In the first, students could code a
data mining algorithm in the language of their choice or extend a data mining algorithm
and demonstrate its effectiveness. The intent of this final project was to appeal to the
computer science majors who had learned Python in CS1 and Java in CS2. The simplest
data mining algorithms could easily be coded in Python, while extensions to an existing
algorithm could be done in Weka using Java.
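As an illustration of what the first option had in mind, one of the simplest algorithms in Witten and Frank is 1R (Holte's one-rule learner), which is well within reach of a CS1 Python background. A sketch with a toy weather-style data set (the attribute names and records are illustrative only):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """1R: for each attribute, predict the majority class for each of
    its values; keep the attribute whose rules make the fewest errors."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        # Majority class per attribute value, and total misclassifications.
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

# Toy data in the style of Witten and Frank's weather examples.
data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
attr, rules = one_r(data, ["outlook", "windy"], "play")
print(attr)  # outlook
```

Extending an existing algorithm inside Weka, by contrast, means subclassing one of its Java classifier classes, which is why that path suited the students who had gone on to CS2.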
The second project possibility was to write a formal paper on data mining and some
discipline-specific data mining issue, e.g. privacy, market basket analysis, etc. This paper
would be a major paper, with a minimum length of ten pages and annotated bibliography.
This option was suggested for those students who may not be as confident in their
programming skills, or who might have an interest in some of the non-technical aspects
of data mining.
The third project option was to complete a research activity with “real data” from an outside
source, spending a minimum of ten hours on the project. The presentation was required to
include a high-level description of the data, what cleaning procedures needed to be done
on the data, the results of mining, and a report on the activity to the “customer”. The
intent of this project was to pilot the use of service learning. Wartburg conducted a
service learning faculty development event in the fall of 2005, hosted by Campus
Compact.
Service learning consists of six key components. First, there should be curricular
connections between the project itself and the learning process in the class, preferably
building on existing disciplinary skills. Student voice is also important. Students should
have the opportunity to select, design, implement, and then evaluate the service learning
activity. Reflection is a third aspect, where students are urged to think, talk, and
optionally write about their service experience. Reciprocity is also important. For the
partnership to be successful, both the student and the customer should not only contribute
to the project, but both should also benefit. Ideally, the project should address an
authentic community need, and lastly, students should participate in some sort of
assessment where constructive feedback and reflection provide insight into the reciprocal
learning. The goal of service learning in this context was to enrich the learning
experience for the student, sustain interest in the topic, and provide expertise to
community partners. [23, 24]
4.1 Juvenile Court Services District 1
In the first case study, student A did a project for her mother, an employee of Juvenile
Court Services District 1 (JCS). JCS was becoming more aware of research-based data,
and was looking for ways to examine its outcome data to support better programming for
“at risk” youth who have been referred to JCS by the juvenile court and the Department of
Human Services. Prior to this study, the state of Iowa visited with JCS to explain the
importance of routinely examining the outcome data, but it is unclear whether this
subsequently occurred. The goal of the project was to take the available data and identify
areas where adjustments could be made in order to gain better outcomes, where the
primary outcome goal was to decrease program recidivism. [25]
Student A began work with the data for the 4 Oaks day treatment program in Dubuque,
Independence, and Black Hawk county, Iowa. Individuals were anonymized by replacing
names with JIJI numbers, an internal identifier, and birthdates were replaced with their
current age. Data also included the current criminal charge, e.g. none, simple
misdemeanor, misdemeanor, and felony, as well as length of stay, anticipated discharge
date, assigned risk factor (low, medium, high), age at admission, county of residence, and
referral source. It was hoped to find relationships between risk, gender, age, and length of
stay, or to be able to predict recidivism. However, data was limited to 207 individuals,
and the three programs had wildly different clientele. Various algorithms were used on
the data, but predictive accuracy averaged around 60% when using a training set.
The analysis of the data confirmed the belief that minority students were not succeeding
in the traditional day treatment program. As a result, JCS has added a culturally specific
program, Dare to be King, for minority males. This program was implemented in July,
and by this January it appears to have made an impact.
A second finding was that there were not enough data points available, particularly at any
individual program site. Programmatic data was therefore combined over multiple sites for
the study; further analysis of the combined data demonstrated that increased time in the
program is not correlated with higher individual success rates or with recidivism. As a
result, a more structured
research program has been established at the three sites, two in July of 2006, and one in
September of 2006. Early results indicate that the improved programmatic follow-through
has improved success rates. JCS also continues to use an existing risk/needs
assessment tool, but has recently more narrowly focused its efforts on the medium to
high-risk youth.
JCS continues to collect data and assess their program. JCS felt that the data mining
effort was immensely useful, and has an additional project to be completed with data
mining in the near future.
Student A reports that she feels the project was worthwhile:
“The fact that I could help make programs better for kids who need them was really
motivating. And knowing that the things I found would be used and helpful in “real life”
was very motivating as well. It did make me realize how important data collection is. I
was able to give my mom (and thus her whole department) some advice as to how they
could improve their data collection so that they can find some statistically significant
results.” [26]
4.2 Wartburg College Retention
Student B works in ITS at Wartburg as an application support specialist, primarily
supporting administrative staff using the college’s SQL Server-based administrative
system. This database contains over 800 tables that hold data for all campus offices, e.g.
Admissions, Financial Aid, Registrar, Controller, Alumni/Development, etc. Student B
was interested in applying data mining principles to study retention, with the goal of
developing a model that could accurately predict high-risk retention students.
Retention is a measure of academic progression of a group of students from one period of
time to the next. Last year, Wartburg published an 84% retention rate across first through
third year students. The office of enrollment management is charged with recruitment and
retention of students through ongoing analysis and academic support services. At
Wartburg, Admissions, Financial Aid, Registrar, Pathways Center, and ITS serve under
the direction of the Vice-President of Enrollment Management. [27] There is also a
thirteen-member standing committee of the faculty that recommends policies and
procedures that maximize student retention and monitors overall retention trends.
Student B met with the director of Institutional Research, who shared with her the
procedure that he uses to prepare the annual retention study. Student B learned that on the
tenth day of each fall term, a “frozen” copy of the database is created to allow the
processing of the retention report that is later reported by gender, class (1Y, 2Y, 3Y),
ethnicity, GPA, citizenship, and transfer status. The retention committee indicated that
they were also interested in additional possible predictors of retention, namely whether
the student was housed on campus, involved in activities, admitted by committee
(deviation from normal admittance), ACT scores, and high school rank.
A first cut at assembling the data for the study produced 1405 students, whereas the
official retention study was based on 1418 students; 10 students were in student B’s data
but not in the retention data, and 23 students were in the retention data but not in the
study data. It was discovered that student B was including deceased students, those on
church mission leave, and those who left on schedule for a cooperative degree program or
who had graduated even though they had not started the year with fourth-year status.
Once there was agreement between the two data sets, student B began by anonymizing
the data and discretizing various features, including religion, home state, class code,
ethnicity, gender, citizenship, entrance code (transfer), and majors. It was later decided
that major code might be too restrictive, so student B replaced major codes with CIP
(Classification of Instructional Programs) codes to indicate the department of the major.
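The anonymization and discretization steps student B describes can be sketched in a few lines of Python. The field names, the choice of a hash for the token, and the GPA bin edges below are all hypothetical, not the actual Jenzabar schema:

```python
import hashlib

def anonymize(record, id_field="student_id"):
    """Replace the identifier with a stable one-way token, so the same
    student always maps to the same opaque value across data sets."""
    rec = dict(record)
    raw = rec.pop(id_field)
    rec["token"] = hashlib.sha256(str(raw).encode()).hexdigest()[:8]
    return rec

def discretize_gpa(gpa):
    """Bin a numeric GPA into coarse bands a rule learner can split on."""
    if gpa < 2.0:
        return "<2.0"
    if gpa < 3.0:
        return "2.0-2.99"
    return "3.0+"

row = anonymize({"student_id": "W1234", "gpa": 1.8, "religion": "UND"})
row["gpa"] = discretize_gpa(row["gpa"])
print(row["gpa"])  # <2.0
```

A stable token rather than a random one matters here because the study data had to be reconciled against the official retention data record by record.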
Student B discovered that the best indicator of retention, using a variety of algorithms,
was GPA. Some algorithms provided more information than others; for example, J48, the
Weka algorithm that builds on Quinlan’s C4.5,[28] indicated that students with a GPA less than
2.0 were much less likely to be retained. A surprising result was that within the group
with GPA less than 2.0, religious preference can be viewed as a secondary predictor,
where students with an undeclared religious affiliation are much less likely to be retained
within this group. Further analysis with other algorithms demonstrated high predictive
accuracy with three primary attributes: incoming class code, GPA, and religion.
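J48, like C4.5, chooses each split by information gain, so a strong attribute such as GPA band dominates the root of the tree. A short Python sketch of that computation on hypothetical retention records (toy values, not Wartburg data):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def info_gain(instances, attr, class_attr):
    """Reduction in class entropy from splitting on attr."""
    base = entropy([i[class_attr] for i in instances])
    n = len(instances)
    remainder = 0.0
    for val in set(i[attr] for i in instances):
        subset = [i[class_attr] for i in instances if i[attr] == val]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

# Hypothetical discretized retention records.
data = [
    {"gpa": "<2.0", "religion": "UND", "retained": "no"},
    {"gpa": "<2.0", "religion": "LUTH", "retained": "no"},
    {"gpa": "2.0+", "religion": "UND", "retained": "yes"},
    {"gpa": "2.0+", "religion": "LUTH", "retained": "yes"},
    {"gpa": "2.0+", "religion": "UND", "retained": "yes"},
]
print(info_gain(data, "gpa", "retained") >
      info_gain(data, "religion", "retained"))  # True
```

In this toy table GPA band splits the classes perfectly while religion barely helps, mirroring the pattern student B observed: religion only emerges as a secondary predictor once the tree has already split on GPA.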
Student B reports:
“What did I learn? I suspected all along that retention would be based on GPA since if
the student’s GPA is under 2.0, they are very close to being put on probation or
suspension. Data clean-up is very tedious and time consuming. Domain knowledge is
also very important when refining the data to be mined.” [29]
Student B had a number of recommendations for enrollment management and the
retention committee. ITS has recently installed a SAN (Storage Area Network) to contain
the Jenzabar databases, allowing more years of historical retention data to be preserved. It
was also suggested that a disability identifier be added to the database so that those
students could be excluded from the analysis more easily. Thirdly, a new table should be
created to match major codes to CIP codes. Lastly, there should be increased effort put into
retaining cocurricular transcript data showing students’ activities, athletics, music, and
other involvement.
It is hoped that data mining retention data could be an ongoing effort between enrollment
management and ITS. The ability to store multiple years’ worth of data makes it
reasonable to assume that more highly predictive models could be developed.
5 Conclusion
Students found the inclusion of a service learning option to be a novel and exciting
prospect. Students who performed a service learning project reported a greater sense of
engagement with the course and relevant material. Students were also enthusiastic about
“making a difference”. One anonymous student reported on their course evaluation, “I
got to research something that really interested me in this field!” The only negative
comment from students overall was that they felt there should have been a statistics
prerequisite for the course, although a poll during the course showed that almost the
entire class had already taken an algebra-based statistics course.
6 Acknowledgements
The author gratefully acknowledges Cassandra Frush, Susan Higdon, and Dr. Edith
Waldstein, Vice President of Enrollment Management, for allowing him to share the
results of their research.
7 References
[1] J. Yaukey, "Feds test new data mining program," USA Today, Washington, D.C., 2007, p. 3A.
[2] B. Worthen, "IT Versus Terror," CIO, vol. 19, no. 20, p. 34, August 1, 2006.
[3] http://www.netflixprize.com
[4] K. Greene, "The $1 Million Netflix Challenge," Technology Review, October 6, 2006, http://www.technologyreview.com/Biztech/17587/page1/
[5] P. Gearan, "Predicting NBA Draft Success and Failure through Historical Trends," Draft Express, June 21, 2006, http://www.draftexpress.com/viewarticle.php?a=1362
[6] D. R. Musicant, "A data mining course for computer science: primary sources and implementations," in Proceedings of the 37th SIGCSE Technical Symposium on Computer Science Education, Houston, Texas, USA: ACM Press, 2006.
[7] R. Connelly, "Introducing data mining," J. Comput. Small Coll., vol. 19, pp. 87-96, 2004.
[8] Y. Lu and J. Bettine, "Data mining: an experimental undergraduate course," J. Comput. Small Coll., vol. 18, pp. 81-86, 2003.
[9] D. Lopez and L. Ludwig, "Data mining at the undergraduate level," in Midwest Instruction and Computing Symposium, Cedar Falls, IA, 2001.
[10] R. J. Roiger and M. W. Geatz, Data Mining: A Tutorial-Based Primer. Boston: Addison Wesley, 2003.
[11] D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining. Cambridge, Massachusetts: The MIT Press, 2001.
[12] P. Adriaans and D. Zantinge, Data Mining. Harlow, England: Addison Wesley Longman Limited, 1996.
[13] M. H. Dunham, Data Mining: Introductory and Advanced Topics. Upper Saddle River, New Jersey: Pearson Education, Inc., 2003.
[14] M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: John Wiley & Sons, Inc., 1997.
[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2006.
[16] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[17] http://www.kdnuggets.com/polls/2005/data_mining_textbooks.htm
[18] http://www.ibm.com/software/data/iminer/
[19] http://rapid-i.com/
[20] http://www.kdnuggets.com/data_mining_course/index.html
[21] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, pp. 531-537, October 15, 1999.
[22] G. Piatetsky-Shapiro, T. Khabaza, and S. Ramaswamy, "Capturing best practice for microarray gene expression data analysis," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C.: ACM Press, 2003.
[23] R. G. Bringle and J. A. Hatcher, "A Service-Learning Curriculum for Faculty," Michigan Journal of Community Service Learning, vol. 2, p. 112, 1995.
[24] "Introduction to Service-Learning Toolkit: Readings and Resources for Faculty," Campus Compact, 2nd edition, 2003.
[25] R. Frush, C. Frush, Ed., e-mail correspondence, 2007.
[26] C. Frush, T. Letsche, Ed., e-mail correspondence, 2007.
[27] http://www.wartburg.edu/academics/enrollment.html
[28] R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[29] S. Higdon, T. Letsche, Ed., e-mail correspondence, 2007.