International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 3, March 2012)
Incorporating Data Mining Techniques on Software Cost Estimation: Validation and Improvement
¹Narendra Sharma, ²Ratnesh Litoriya
Department of Computer Science and Engineering, Jaypee University of Engg & Technology, Guna, India
¹narendra_sharma88@yahoo.com
²ratnesh.litoriya@juet.ac.in
Abstract— Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. We use the data mining tool WEKA to identify the important and common cost drivers that are used to generate the estimate of a project. Cost drivers are multiplicative factors that determine the effort required to complete a software project. In analogy-based estimation models, the cost drivers are the basis of cost estimation: a new project is estimated by comparing it with past project data and setting the values of the cost drivers in the new project accordingly. The aim of this research work is to identify the important cost drivers in past project data with the help of the data mining tool WEKA.
This paper investigates the systemic cost estimation issues that have been identified and the best-performing machine learning techniques. We have found that Agile COCOMO II, a software estimation model with publicly available algorithms developed by Barry Boehm et al. [9], is a very robust model: it generates more accurate results when the past project data are very similar to the new project. However, these results were only internally validated, using leave-one-out cross validation, with the historical data within the data mining system. We seek to find the prediction accuracy of the new model developed by the data mining system against new external data, in order to evaluate the true effectiveness of these models in comparison to standard cost models that do not use machine learning techniques. In this research we use the data mining tool WEKA. The main aim of the research is to increase the efficiency of software cost estimation with the help of data mining techniques [1, 3].
Keywords— Data mining, Agile COCOMO, software estimation tools, WEKA data mining tool, software engineering.
I. INTRODUCTION
Cost estimation is the process of approximating the probable cost of a product, program, or project on the basis of available information. Accurate cost estimation is very important for every kind of project: if we do not estimate a project properly, its cost can run very high, sometimes reaching 150-200% more than the original estimate. It is therefore essential to estimate the project correctly. This research brings together two different fields: software engineering and data mining. Data mining helps us classify past project data and extract valuable information. This knowledge is applied in cost estimation models to generate an approximate estimate on the basis of past project data. In this research we try to identify the common cost drivers that affect the cost of a project; to estimate the cost of the new project we use the Agile COCOMO model [2].
II. INTRODUCTION TO DATA MINING AND THE WEKA TOOL
We know that existing software cost estimation models are often unable to produce accurate estimates: they can be off by more than 50% from the actual cost, and sometimes by as much as 150-200%. We therefore need new methods or models that can bring us closer to the actual costs, and their accuracy is being investigated. Even methods that show a small improvement are considered great in the field of software estimation [2].
With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly important,
if not necessary, to develop powerful means for analysis and
perhaps interpretation of such data and for the extraction of
interesting knowledge that could help in decision-making.
Data Mining, also popularly known as Knowledge Discovery
in Databases (KDD), refers to the nontrivial extraction of
implicit, previously unknown and potentially useful
information from data in databases. While data mining and
knowledge discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of the
knowledge discovery process [5,7].
Data mining, at its core, is the transformation of large amounts of data into meaningful patterns and rules. Further, it can be broken down into two types: directed and undirected. In directed data mining, you are trying to predict a particular data point: the sales price of a house given information about other houses for sale in the neighborhood, for example.
In undirected data mining, we are trying to create groups of data, or find patterns in existing data: creating the "Soccer Mom" demographic group, for example. In effect, every U.S. census is data mining, as the government looks to gather data about everyone in the country and turn it into useful information. Today data mining is used in every type of application: banking, insurance, medicine, education, and so on.
Working with categorical data, or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique can process a wider variety of data than regression and is growing in popularity. Its output is also much easier to interpret: instead of the complicated mathematical formula given by the regression technique, you receive a decision tree that requires a series of binary decisions. (The k-means algorithm used later in this paper is, by contrast, a clustering technique; a classifier sketch follows below.)
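As a concrete illustration of classification (our own sketch, not from the paper; projects.arff is a placeholder file name, and a WEKA 3.x jar is assumed on the classpath), the snippet below builds WEKA's J48 learner, an implementation of the C4.5 decision-tree algorithm, and prints the resulting tree.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("projects.arff"); // placeholder file
        data.setClassIndex(data.numAttributes() - 1); // last column = class label
        // (J48 requires the class attribute to be nominal.)

        J48 tree = new J48(); // WEKA's C4.5 decision-tree learner
        tree.buildClassifier(data);
        System.out.println(tree); // prints the tree in readable form
    }
}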
WEKA

Data mining isn't solely the domain of big companies and expensive software. In fact, there is a piece of software that does almost all the same things as these expensive packages: it is called WEKA. WEKA is the product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is distributed under the GNU General Public License (GPL); its front view is shown in Figure 1. The software is written in the Java™ language and contains a GUI for interacting with data files and producing visual results (think tables and curves). It also has a general API, so you can embed WEKA, like any other library, in your own applications for such things as automated server-side data mining tasks. We use the k-means clustering algorithm to group the data. Working with WEKA does not require deep knowledge of data mining, which is one reason it is a very popular data mining tool; it also provides a graphical user interface and many other facilities [4, 7].
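To show how little code the embedding takes, here is a minimal sketch (our own illustration; past_projects.arff is a placeholder file name) that loads a dataset through the WEKA API and runs the SimpleKMeans clusterer used later in this paper.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaEmbedDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (ARFF or CSV), exactly as the Explorer GUI would.
        Instances data = DataSource.read("past_projects.arff");

        // Configure WEKA's k-means implementation: k = 3 clusters here.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        kmeans.setSeed(10); // fixed seed so runs are repeatable
        kmeans.buildClusterer(data);

        // Print the cluster centroids ("means") the algorithm found.
        System.out.println(kmeans.getClusterCentroids());

        // Report which cluster each project record falls into.
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println("instance " + i + " -> cluster "
                    + kmeans.clusterInstance(data.instance(i)));
        }
    }
}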
A. Some basic operations of data mining

Regression: Regression is the oldest and most well-known statistical technique that the data mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you're ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula, and you have a prediction. The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed, or age). If you're working with categorical data where order is not significant (like color, name, or gender), you're better off choosing another technique [2, 7].

B. Classification

Classification, introduced above, handles categorical and mixed data and produces an easily interpreted decision tree.

C. The k-means Algorithm

K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques, and it is commonly used in medical imaging, biometrics, and related fields. It is an iterative algorithm that gains its name from its method of operation: the algorithm clusters observations into k groups, where k is provided as an input parameter.
It then assigns each observation to clusters based upon the observation's proximity to the mean of the cluster. The cluster's mean is then recomputed and the process begins again. Here is how the algorithm works [7] (a minimal code sketch follows the steps):

1. The algorithm arbitrarily selects k points as the initial cluster centres ("means").
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster centre.
3. Each cluster centre is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated, or that the changes do not make a material difference in the definition of the clusters.
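These four steps translate almost directly into code. The following self-contained sketch (plain Java with two-dimensional toy points and k = 2; our own illustration rather than WEKA's SimpleKMeans) implements them literally: pick initial centres, assign points by Euclidean distance, recompute means, and repeat until no assignment changes.

import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = {{1, 2}, {1.5, 1.8}, {5, 8}, {8, 8}, {1, 0.6}, {9, 11}};
        int k = 2;

        // Step 1: arbitrarily pick k points as the initial cluster centres.
        double[][] centres = new double[k][];
        for (int c = 0; c < k; c++) centres[c] = points[c].clone();

        int[] assign = new int[points.length];
        boolean changed = true;
        while (changed) { // Step 4: repeat steps 2 and 3 until convergence
            changed = false;
            // Step 2: assign each point to the closest centre (Euclidean distance).
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[i], centres[c]) < dist(points[i], centres[best])) best = c;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Step 3: recompute each centre as the mean of its assigned points.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++)
                    if (assign[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                if (n > 0) { centres[c][0] = sx / n; centres[c][1] = sy / n; }
            }
        }
        System.out.println("cluster assignments: " + Arrays.toString(assign));
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}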
III. INTRODUCTION OF COST ESTIMATION

In recent years, software has become the most expensive component of computer system projects. The bulk of the cost of software development is due to human effort, and most cost estimation methods focus on this aspect and give estimates in terms of person-months [9].
Accurate software cost estimates are critical to both developers and customers. They can be used for generating requests for proposals, contract negotiations, scheduling, monitoring, and control. Underestimating the costs may result in management approving proposed systems that then exceed their budgets, with underdeveloped functions and poor quality, and failure to complete on time. Overestimating may result in too many resources being committed to the project or, during contract bidding, in not winning the contract, which can lead to loss of jobs [6].
Figure 1: Front view of WEKA.
IV. WHY WE NEED THIS STUDY

There are many techniques available for software cost estimation, but they are not very effective. There is more work to be done on combining data mining and software engineering; we are trying to obtain better predictions by combining the two fields.
V. EXISTING METHODS FOR ESTIMATION
Estimation is a process of determining the amount of effort, money, resources, and time needed to build a software project with the help of the available quality information. Many estimation methods have been proposed in the last 30 years, and almost all of them require quantitative information about productivity, project size, and other important factors that affect the project. There are various practices of software estimation, such as analogy-based, expert-opinion, and empirically based practices [Jones, 2007]. Analogy-based practices require historical project data as input for comparison, whereas expert-opinion practices are intuition-based [Jorgenson and Sheppard, 2007]. The empirical way is the practice of deriving the cost of software using some mathematical/algorithmic model; examples of methods that use such practices are the FP-based method and the COCOMO II method in TEMs. Mostly, traditional software development methods follow either COCOMO II or FP-based estimation methods successfully, owing to a complete set of requirement specifications.
Data mining techniques are being used extensively in a variety of fields. They have frequently been applied in the business arena for customer relationship management and market analysis. In addition to the multitude of applications of data mining, there has been parallel research on improving data mining algorithms. Yet while data mining techniques have been applied across broad domains, they have rarely been applied in the field of software cost estimation, a subfield of software engineering [4].
The figure below shows the methodology of the research: we apply the k-means clustering algorithm and classify the data.

Figure 2: Functional diagram of the existing methodology.

VI. 2CEE COST ESTIMATION TOOL

2CEE (21st Century Effort Estimation) is a cost estimation tool that spans both the data mining and software engineering fields. It was developed for, and is copyrighted by, NASA. It uses a variety of data mining and machine learning techniques (nearest neighbour, feature subset selection, bootstrapping, local calibration) to propose the most accurate software cost model. It is designed to explore the uncertainty in the model and in the estimate, to allow estimates early in the lifecycle by representing new projects as ranges of values, and to provide numerous calibration options. 2CEE has been encoded in a Windows-based tool that can be used both to generate an estimate and to allow the model developer to calibrate and develop models using various machine learning, data mining, and statistical techniques. By automating many tasks for the user, it improves cost-analyst efficiency. 2CEE uses leave-one-out cross validation as a measure of model performance. 2CEE was the first model to combine machine learning algorithms with cost estimation algorithms for generating project costs; because it was designed specifically for NASA we cannot use it publicly, but it provides important guidelines for new researchers [2, 4].

Agile COCOMO model: a COCOMO™ tool that is very simple to use and easy to learn. It incorporates the full COCOMO™ parametric model and uses analogy-based estimation to generate accurate results for a new project. Estimation by analogy is one of the most popular ways to estimate software cost and effort. While comparing similarities between the new and old projects provides a great way to estimate, results can still be inaccurate if differences between the two projects are overlooked, especially when the grounds of dissimilarity are fairly important. To build on the estimation-by-analogy approach while accounting for differences between projects, USC-CSE created Agile COCOMO II, a cost estimation tool based on COCOMO II. It uses analogy-based estimation to generate accurate results while being very simple to use and easy to learn. It provides the facility to estimate a project in various ways, as shown in Figure 5: in terms of person-months, dollars, object points, function points, and so on. In this paper, we discuss the motivation for the program, the program's structure, the results of our research, and the future direction of this tool [10].

VII. AN INTRODUCTION TO SCALE FACTORS AND COST DRIVERS
A. The Scale Drivers
In the COCOMO II model, some of the most important
factors contributing to a project's duration and cost are the
Scale Drivers. You set each Scale Driver to describe your
project; these Scale Drivers determine the exponent used in the Effort Equation. There are five Scale Drivers in the COCOMO II model, and each plays an important role in the estimation [5, 9].
The five Scale Drivers are:
- Precedentedness
- Development Flexibility
- Architecture / Risk Resolution
- Team Cohesion
- Process Maturity
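For completeness (this detail comes from the COCOMO II.2000 calibration [1], not from the paper's own text): the five Scale Driver ratings are combined into the exponent E of the Effort Equation in Section VIII as E = B + 0.01 × Σ SF_i, with B = 0.91, where SF_i is the numeric value assigned to each Scale Driver. With all five drivers at Nominal the values sum to 18.97, giving E = 0.91 + 0.01 × 18.97 = 1.0997, the exponent used in the worked example later.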
B. Cost Drivers

COCOMO II has 17 cost drivers that describe the project, the development environment, and the team; you set each cost driver accordingly. The cost drivers are multiplicative factors that determine the effort required to complete your software project. For example, if your project will develop software that controls an airplane's flight, you would set the Required Software Reliability (RELY) cost driver to Very High. That rating corresponds to an effort multiplier of 1.26, meaning that your project will require 26% more effort than a typical software project. In the COCOMO model the cost drivers are divided into four groups, shown below, followed by a short introduction to some of the cost drivers [5].

Product factors:
1. Required Software Reliability
2. Data Base Size
3. Required Reusability
4. Documentation match to life-cycle needs, etc.

Platform factors:
1. Execution Time Constraint
2. Platform Volatility

Personnel factors:
1. Analyst Capability
2. Programmer Capability
3. Applications Experience
4. Platform Experience
5. Personnel Continuity
6. Use of Software Tools

Project factors:
1. Required Development Schedule
2. Multisite Development, etc.

C. Introduction of some cost drivers

1. Required Software Reliability (RELY)
This is the measure of the extent to which the software must perform its intended function over a period of time. If the effect of a software failure is only slight inconvenience, then RELY is low. If a failure would risk human life, then RELY is very high.

2. Data Base Size (DATA)
This measure attempts to capture the effect that large data requirements have on product development. The rating is determined by calculating D/P. The size of the database is important to consider because of the effort required to generate the test data that will be used to exercise the program.

3. Product Complexity (CPLX)
Complexity is divided into five areas: control operations, computational operations, device-dependent operations, data management operations, and user interface management operations. Select the area or combination of areas that characterize the product or a sub-system of the product. The complexity rating is the subjective weighted average of these areas.

4. Required Reusability (RUSE)
This cost driver accounts for the additional effort needed to construct components intended for reuse on the current or future projects. This effort is consumed in creating a more generic software design, more elaborate documentation, and more extensive testing to ensure components are ready for use in other applications.

5. Execution Time Constraint (TIME)
This is a measure of the execution time constraint imposed upon a software system. The rating is expressed in terms of the percentage of available execution time expected to be used by the system or subsystem consuming the execution time resource. The rating ranges from Nominal (less than 50% of the execution time resource used) to Extra High (95% of the execution time resource consumed).
6. Analyst Capability (ACAP)
Analysts are personnel who work on requirements, high-level design, and detailed design. The major attributes that should be considered in this rating are analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate. The rating should not consider the level of experience of the analyst; that is rated with AEXP. Analysts that fall in the 15th percentile are rated Very Low, and those that fall in the 95th percentile are rated Very High.

7. Programmer Capability (PCAP)
Current trends continue to emphasize the importance of highly capable analysts. However, the increasing role of complex COTS packages, and the significant productivity leverage associated with programmers' ability to deal with these COTS packages, indicate a trend toward higher importance of programmer capability as well. Evaluation should be based on the capability of the programmers as a team rather than as individuals. Major factors that should be considered in the rating are ability, efficiency and thoroughness, and the ability to communicate and cooperate. The experience of the programmer should not be considered here; it is rated with AEXP. A Very Low rated programmer team is in the 15th percentile and a Very High rated programmer team is in the 95th percentile.

8. Applications Experience (AEXP)
This rating is dependent on the level of applications experience of the project team developing the software system or subsystem. The ratings are defined in terms of the project team's equivalent level of experience with this type of application. A Very Low rating is for application experience of less than 2 months; a Very High rating is for experience of 6 years or more.

9. Platform Experience (PEXP)
The Post-Architecture model broadens the productivity influence of PEXP, recognizing the importance of understanding the use of more powerful platforms, including more graphical user interface, database, networking, and distributed middleware capabilities.

10. Use of Software Tools (TOOL)
Software tools have improved significantly since the 1970s projects used to calibrate COCOMO™. The tool rating ranges from simple edit and code (Very Low) to integrated lifecycle management tools (Very High) [5].
VIII. COCOMO II EFFORT EQUATION

The COCOMO II model makes its estimates of required effort (measured in Person-Months, PM) based primarily on your estimate of the software project's size (as measured in thousands of SLOC, KSLOC):

Effort = 2.94 × EAF × (KSLOC)^E

where EAF is the Effort Adjustment Factor derived from the Cost Drivers, and E is an exponent derived from the five Scale Drivers. As an example, a project with all Nominal Cost Drivers and Scale Drivers would have an EAF of 1.00 and an exponent E of 1.0997. Assuming the project is projected to consist of 9,000 source lines of code, COCOMO II estimates that about 32.9 Person-Months of effort are required to complete it [1, 9]:

Effort = 2.94 × (1.0) × (9)^1.0997 ≈ 32.9 Person-Months.
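As a quick illustration (our own sketch, not part of the original paper), the Effort Equation is a one-line computation; the snippet below reproduces the all-Nominal example and also shows how a single Very High RELY rating (effort multiplier 1.26, from Section VII) scales the estimate.

public class CocomoEffort {
    // COCOMO II Effort Equation: PM = 2.94 * EAF * KSLOC^E.
    static double effort(double ksloc, double eaf, double e) {
        return 2.94 * eaf * Math.pow(ksloc, e);
    }

    public static void main(String[] args) {
        double e = 1.0997; // exponent for all-Nominal Scale Drivers

        // All-Nominal project of 9 KSLOC: EAF = 1.00 -> about 32.9 PM.
        System.out.printf("nominal: %.1f PM%n", effort(9, 1.00, e));

        // Same project with RELY rated Very High: EAF = 1.26 -> about 41.5 PM.
        System.out.printf("RELY Very High: %.1f PM%n", effort(9, 1.26, e));
    }
}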
Methodology

Our methodology is simple: we combine two different fields, data mining and software engineering, and try to generate an accurate cost for the project with the help of past project data whose cost or effort is known, finding the common cost factors along the way. We use the WEKA tool for data mining and the Agile COCOMO tool for software estimation, with the PROMISE dataset as the data for analysis.
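To make the analogy step of this workflow concrete, the sketch below (an illustrative outline using our own names and toy numbers, not code from the paper or from the Agile COCOMO tool) finds the past project whose cost-driver vector is closest to the new project's vector and takes its known effort as the base value for the estimate.

public class AnalogyBase {
    // Index of the past project whose cost-driver vector is closest
    // (squared Euclidean distance) to the new project's vector.
    static int nearestPastProject(double[][] pastDrivers, double[] newDrivers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int p = 0; p < pastDrivers.length; p++) {
            double d = 0;
            for (int j = 0; j < newDrivers.length; j++) {
                double diff = pastDrivers[p][j] - newDrivers[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = p; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: each row holds one past project's effort multipliers
        // (e.g., RELY, DATA, CPLX); all values are illustrative only.
        double[][] past = {{1.15, 1.08, 1.30}, {0.88, 0.94, 1.00}, {1.00, 1.00, 1.15}};
        double[] knownEffortPM = {120.0, 45.0, 60.0}; // hypothetical known efforts
        double[] newProject = {1.00, 1.08, 1.15};

        int idx = nearestPastProject(past, newProject);
        System.out.println("base analogy effort: " + knownEffortPM[idx] + " PM");
    }
}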
IX. DATASET

This is a PROMISE Software Engineering Repository dataset, made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. The data files are in the .arff and .csv formats; these datasets can be loaded directly into WEKA and the various algorithms applied. The results from WEKA are then applied in the Agile COCOMO model.
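Loading a PROMISE file into WEKA programmatically is straightforward. The sketch below (file names are placeholders for whichever PROMISE file is used) reads a CSV export and writes the equivalent ARFF file, so either format can feed the clustering step.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class PromiseLoader {
    public static void main(String[] args) throws Exception {
        // Read the PROMISE data from CSV (placeholder file name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("cocomo_projects.csv"));
        Instances data = loader.getDataSet();
        System.out.println(data.numInstances() + " projects, "
                + data.numAttributes() + " attributes");

        // Write the same dataset back out as ARFF for the WEKA Explorer.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("cocomo_projects.arff"));
        saver.writeBatch();
    }
}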
Result: The Agile COCOMO model is an analogy model; we estimate the new project by comparing it with past project data, where the features of the new project are very similar to those of the past projects. With the help of WEKA and Agile COCOMO we predicted some useful results. In this research we took data from 60 past NASA projects whose efforts are already known; the list of projects is shown in Figure 3. We searched for the common cost drivers and scale factors that mainly affect project estimation. With the help of the Agile COCOMO model we changed one of the values of the cost drivers or scale factors and predicted the resulting estimate. With the help of clustering we grouped similar cost drivers together; these cost drivers are very helpful for estimating new projects. WEKA also provides facilities for classifying the data (we additionally used the Apriori algorithm) and offers both a graphical user interface and a command-line interface. Tables 1 and 2 show the cost drivers found after the analysis of the past project data; these cost drivers are used in every type of project.

Figure 3: Past-project dataset in WEKA.

Figure 3 shows the different cost drivers used in the various past projects. We used data from 60 past NASA projects and loaded them into WEKA; the figure shows the actual effort of each past project. We take these as base values in the Agile COCOMO model and set the new values of the cost drivers. After applying k-means clustering we obtain clusters that hold similar cost drivers; the result is shown in Figure 4. With the help of clustering we grouped instances with similar behaviour into clusters.

Figure 4: Clustering result.

The next figure shows the front view of the Agile COCOMO model. It provides the facility to estimate a project in various ways: in terms of cost in dollars, person-months, function points, object points, and so on.

Figure 5: Front view of the Agile COCOMO model.
The next figure shows the various cost factors. We set new values for the cost factors and change them with respect to the past-project cost drivers. We identified some important and useful cost drivers that can be used in every project and that are responsible for increasing or decreasing the cost of a project. These cost drivers are shown in Tables 1 and 2.

Figure 6: The various cost factors.

Table 1: Cost drivers whose values are increased (increase these to decrease effort)
ACAP: analyst capability
PCAP: programmer capability
AEXP: applications experience
MODP: modern programming practices
TOOL: use of software tools, etc.
LEXP: language experience

Table 2: Cost drivers whose values are decreased (decrease these to decrease the cost of the project)
STOR: main memory constraint
DATA: database size
TIME: execution time constraint (CPU)
VIRT: machine volatility
RELY: required software reliability, etc.
X. CONCLUSION
These results suggest that building data mining and machine learning techniques into existing software estimation techniques such as COCOMO can effectively improve the performance of a proven method. We used the WEKA tool for data mining because it contains many different machine learning algorithms that help us classify the data easily. We understand that there is a lack of serious research in this field; our main aim is to show that data mining is also very useful in the field of software engineering. Not all data mining techniques performed better than the traditional method of local calibration. However, a couple of techniques used in combination did provide more accurate software cost models than the traditional technique. While the best combination of data mining techniques was not consistent across the different stratifications of data, this shows that there are different populations of software projects and that rigorous data collection should be continued in order to improve the development of accurate cost estimation models.
On the basis of this research we can say that cost drivers and scale factors play an important role in estimation whenever an analogy model is used. We found some common cost drivers that can be used for all projects.
Future work will investigate additional data mining algorithms that can help improve the software cost estimation process and make it easier to use. The main reason for choosing the COCOMO model for this research is that it is one of the best-established software cost estimation models and is easily available to the public.
References
[1] "COCOMO II Model Definition Manual", version 1.4, University of Southern California.
[2] Karen T. Lum, Daniel R. Baker, and Jairus M. Hihn, "The Effects of Data Mining Techniques on Software Cost Estimation", IEEE, 2009.
[3] Zhihao Chen, Tim Menzies, and Dan Port, "Feature Subset Selection Can Improve Software Cost Estimation Accuracy", Center for Software Engineering, Univ. of Southern California.
[4] Jairus Hihn and Karen Lum, "2CEE, A Twenty First Century Effort Estimation Methodology", Lane Dept. CSEE, West Virginia University, ISPA / SCEA 2009 Joint International Conference.
[5] Oscar Marbán, Antonio de Amescua, Juan J. Cuadrado, and Luis García, "Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects", Notes, vol. 30, no. 4, pp. 1-6, 2005.
[6] Oscar Marbán, Antonio de Amescua, Juan J. Cuadrado, and Luis García, "A Cost Model to Estimate the Effort of Data Mining Projects", Universidad Carlos III de Madrid (UC3M).
[7] Dr. Alassane Ndiaye and Dr. Dominik Heckmann, "Weka: Practical Machine Learning Tools and Techniques with Java Implementations", AI Tools Seminar, University of Saarland, WS 06/07.
[8] S. Chandrasekaran, R. Lavanya, and V. Kanchana, "Multi-Criteria Approach for Agile Software Cost Estimation Model".
[9] Capers Jones, "Estimating Software Costs", Tata McGraw-Hill Edition, 2007.
[10] http://sunset.usc.edu/cse/pub/research/AgileCOCOMO/AgileCOCOMOII/Main.html