
History and Evolution of Software Metrics

Introduction


The software industry is almost 60 years old, which makes it a fairly mature industry. One would think that after 60 years the software industry would have well-established methods for measuring productivity and quality, and also a large volume of accurate benchmark data derived from thousands of measured projects. However, this is not quite the case.

There are a number of proprietary collections of software benchmark data, such as those collected by the Software Engineering Institute (SEI), the Gartner Group, Software Productivity Research (SPR), the David Consulting Group, Quantitative Software Management (QSM), and a number of others. Some of these collections are large and may top 10,000 projects. However, the details from these benchmark collections are provided to clients, but not to the general public other than the data that gets printed in books such as this one.

Only the non-profit International Software Benchmarking Standards Group (ISBSG) has data that is widely and commercially available. As this book is written in 2008, the ISBSG collection of benchmark data contains perhaps 4,000 projects, and it is growing at a rate of about 500 projects per year. The majority of ISBSG projects are measured using IFPUG function points, but some data is also available using COSMIC, NESMA, Mark II, and other common function point variants.

As of 2008 the software industry has dozens of metrics available, some of which have only a handful of users. Very few software metrics and measurement practices are supported by formal standards and training. Benchmarks vary from quite complete to so sparse in what is collected that their value is difficult to ascertain.

Major topics such as service-oriented metrics, database volumes, and quality have a severe shortage of both research and published results. Today in 2008, as this book is written, the total volume of reliable productivity and quality data for the entire industry is less than 1 percent of what is really needed.

In order to understand the lack of solid measurements in the software industry, it is useful to look at the history of software work from its origins just after World War II through 2008.

Evolution of the Software Industry and Evolution of Software Measurements


For the first 10 years or so of the software industry, from around 1947 through 1957, most applications were quite small: the great majority were less than 1,000 source code statements in size. All of these were written in low-level assembly languages, and some were patched in machine language, which is even more difficult to work with.

The first attempts to measure productivity and quality used "lines of code" measures, and at the time (circa 1950) that metric was fairly effective. Coding took about 50 percent of the effort to build an application; debugging and testing took about 40 percent; and everything else took only about 10 percent.

In this early era, productivity measures based on lines of code and quality measures based on bugs or defects per 1,000 lines of code (KLOC) were the standard metrics, and they were fairly effective. This is because in the early days of software, coding bugs were the most common and the most troublesome.

Between 1957 and 1967, the situation began to change dramatically. Low-level assembly languages started to be replaced by more powerful procedural languages such as COBOL, FORTRAN, and APL. As computers were applied to business issues such as banking and manufacturing, application sizes grew from 1,000 lines of code up past 100,000 lines of code.

These changes in both programming languages and application sizes began to cause problems for the lines of code metric. By 1967 coding itself was beginning to drop below 30 percent of application development effort, while production of requirements, specifications, plans, and paper documents began to approach 40 percent. Testing and debugging took about 30 percent. Adding to the problem, some applications were written in two or more different programming languages.

The lines of code metric continued to be used as a general productivity metric, but some weaknesses were being noted. For example, it was not possible to do direct measurement of design and documentation productivity and quality with LOC metrics, because these paper activities did not involve coding.

By the mid-1970s more serious problems with LOC metrics were noted. It is reasonable to assume that high-level programming languages improve development productivity and quality, which indeed they do. But attempts to measure these improvements using LOC metrics led to the discovery of an unexpected paradox: LOC metrics penalize high-level languages and give the best results for low-level languages.

Assume you have two identical applications, one written in assembly language and one written in COBOL. The assembly version required 10,000 lines of code, but the COBOL version required only 3,000 lines of code. When you measure coding rates, both languages were coded at a rate of 1,000 lines of code per month. But since the COBOL version was only one-third the size of the assembly version, the productivity gains can't be seen using LOC metrics.

Even worse, assume the specifications and documents for both versions took 5 months of effort. Thus, the assembly version required a total of 15 months of effort while the COBOL version required only 8 months. If you measure the entire project with LOC metrics, the assembly version had a productivity rate of about 667 lines of code per month, while the COBOL version had a rate of only 375 lines of code per month. Obviously, the economic benefits of high-level programming languages disappear when measured using LOC metrics.
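
Because the paradox is pure arithmetic, a few lines of code make it concrete. The following Python sketch is a minimal illustration; the function name is invented here, and the figures simply restate the example above:

```python
# Reproduces the assembly-vs-COBOL example from the text.
# Both versions are coded at 1,000 LOC per month and carry a fixed
# 5 months of specification and documentation work.

def apparent_loc_per_month(loc, coding_rate, fixed_months):
    """Whole-project 'productivity' in LOC per month."""
    total_months = loc / coding_rate + fixed_months
    return loc / total_months

assembly = apparent_loc_per_month(10_000, 1_000, 5)  # 15 months total
cobol = apparent_loc_per_month(3_000, 1_000, 5)      # 8 months total

print(f"Assembly: {assembly:.0f} LOC/month")  # ~667: looks more productive
print(f"COBOL:    {cobol:.0f} LOC/month")     # 375: looks worse, costs less
```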

These economic problems are what caused IBM to assign Allan Albrecht and his colleagues in White Plains the task of developing a useful software metric that was independent of code volumes, and that could measure both economic productivity and quality without distortion.

After several years of effort, what Albrecht and his colleagues came up with was a new kind of metric termed function points. Function point metrics are based on five external aspects of software applications: inputs, outputs, inquiries, logical files, and interfaces.

After being used internally within IBM for several years, function points were discussed publicly for the first time in October 1979 in a paper which Albrecht presented at a joint SHARE/GUIDE/IBM conference held at Monterey, California.

Once an application’s function point total is known, the metric can be used for a variety of useful economic purposes, including:

Studies of software production

• Function points per person-month
• Work hours per function point
• Development cost per function point


Studies of software consumption

• Function points owned by an enterprise
• Function points needed by various kinds of end users
• Build, lease, or purchase decision making
• Contract vs. in-house decision making
• Software project value analysis

Studies of software quality

• Test cases and runs required per function point
• Requirements and design defects discovered per function point
• Coding defects per function point
• Documentation defects per function point

When he invented function points, Albrecht was working for IBM's Data Processing Services group. He had been given the task of measuring the productivity of a number of software projects. Because IBM's DP Services group developed custom software for a variety of other organizations, the software projects were written in a wide variety of languages: COBOL, PL/I, RPG, APL, and assembly language, to name but a few, and some indeed were written in mixed languages.

Albrecht knew, as did many other productivity experts, that it was not technically possible to measure software production rates across projects written in languages of different levels with the traditional lines-of-code measures.

Other researchers knew the problems that existed with lines-of-code measures, but Albrecht deserves the credit for going beyond those traditional and imperfect metrics and developing a technique that can be used to explore the true economics of software production and consumption.

Albrecht's paper on function points was first published in 1979 in the conference proceedings, which had only a limited circulation of several hundred copies. In 1981, with both IBM's and the conference organization's permission, the paper was reprinted in the IEEE tutorial entitled Programming Productivity: Issues for the Eighties by the author.

This republication by the IEEE provided the first widespread circulation of the concept of function point metrics outside IBM.

The IEEE tutorial brought together two different threads of measurement research. In 1978, the author had published an analysis of the mathematical problems and paradoxes associated with lines-of-code measures. That article, also included in the 1981 IEEE tutorial, proved mathematically that lines of code were incapable of measuring productivity in the economic sense. Thus it provided strong justification for Albrecht’s work on function point metrics, which were the first in software history that could be used for measuring economic productivity.

It should be recalled that the standard economic definition of productivity is: "Goods or services produced per unit of labor or expense." A line of code is neither goods nor services in the economic sense. Customers do not buy lines of code directly, and they often do not even know how many lines of code exist in a software product. Also, lines of code are not the primary deliverables of software projects, so they cannot be used for serious studies of the production costs of software systems or programs.

The greatest bulk of what is actually produced and what gets delivered to users of software comprises words and paper documents. In the United States, sometimes as many as 400 English words will be produced for every line of source code in large systems. Often more than three times as much effort goes into word production as goes into coding. Words are obviously not economic units for software, since customers do not buy them directly, nor do they have any real control over the quality produced. Indeed in some cases, such as large military systems, far too many unnecessary words are produced.

As already mentioned, customers do not purchase lines of code either, so code quantities have no intrinsic value to users. In most instances, customers neither know nor care how much code was written or in what language an application is embodied. Indeed, if the same functionality could be provided to users with less code by means of a higher-level language, customers might benefit from the cost reductions.

If neither of the two primary production units of software (words and code) is of direct interest to software consumers, then what exactly constitutes the “goods or services” that make software a useful economic commodity? The answer, of course, is that users care about the functions of the application.

Prior to Albrecht’s publication of the function point metric, there were only hazy and inaccurate ways to study software production, and there was no way at all to explore the demand or consumption side of the software economic picture.

Thus, until 1979 the historical problem of measuring software productivity could be stated precisely: "The natural units of software production (words and code) were not the same as the units of software consumption (functions)." Economic studies require standard definitions of both what is produced and what is consumed.

Since neither words nor lines of code are of direct interest to software consumers, there was no tangible unit that matched the economic definition of goods or services that lent itself to studies of software’s economic productivity.

Recall from earlier that a function point is an abstract but workable surrogate for the goods that are produced by software projects. Function points are the weighted sums of five different factors that are of interest to users (a small sketch of the weighted-sum arithmetic follows the list):

• Inputs
• Outputs
• Logical files (also called user data groups)
• Inquiries
• Interfaces
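
As an illustration only, here is a minimal Python sketch of an unadjusted function point count using the standard IFPUG average-complexity weights (4, 5, 4, 10, and 7 for the five factors). A real count rates each element as low, average, or high complexity and then applies a value adjustment factor, so this is a deliberate simplification; the sample application counts are hypothetical:

```python
# Unadjusted function point count using IFPUG average-complexity
# weights. Real counts rate each element low/average/high instead
# of applying one flat weight per factor.
AVERAGE_WEIGHTS = {
    "inputs": 4,
    "outputs": 5,
    "inquiries": 4,
    "logical_files": 10,
    "interfaces": 7,
}

def unadjusted_function_points(counts):
    return sum(AVERAGE_WEIGHTS[factor] * n for factor, n in counts.items())

# Hypothetical small application
app = {"inputs": 20, "outputs": 15, "inquiries": 10,
       "logical_files": 8, "interfaces": 4}
print(unadjusted_function_points(app))  # 80 + 75 + 40 + 80 + 28 = 303
```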

Function points were defined by Albrecht to be "end-user benefits," and they are now serving as the economic units that customers wish to purchase or to have developed. That is, function points are beginning to be used in contract negotiations between software producers and their clients.

Clients and developers alike can discuss an application rationally in terms of its inputs, outputs, inquiries, files, and interfaces. Further, if requirements change, clients can request additional inputs or outputs after the initial agreement, and software providers can make rational predictions about the cost and schedule impact of such additions, which can then be discussed with clients in a reasonable manner.

Function points, unlike lines of code, can also be used for economic studies of both software production costs and software consumption. For production studies, function points can be applied usefully to effort, staffing, and cost-related studies. Thus, it is now known that the approximate U.S. average for software productivity at the project level is 5 function points per person-month. At the corporate level, where indirect personnel such as executives and administrators are included as well as effort expended on canceled projects, the U.S. average is about 1.5 function points per person-month. Function points can also be used to explore the volumes and costs of software paperwork production, a task for which lines of code were singularly inappropriate.

For consumption studies, function points are beginning to create an entirely new field of economic research that was never before possible. It is now possible to explore the utilization of software within industries and the utilization of software by the knowledge workers who use computers within those industries.

The Cost of Counting Function Point Metrics


There has been a long-standing problem with using function point metrics: manual counting by certified experts is fairly expensive. Assuming that the average daily fee for hiring a certified function point counter in 2008 is $2,500, and that manual counting using the IFPUG function point method proceeds at a rate of about 400 function points per day, the result is that manual counting costs about $6.25 per function point. Both the cost and the comparatively slow speed have been an economic barrier to the widespread adoption of functional metrics.
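
The cost arithmetic is a one-line formula; this tiny Python sketch uses the figures from the text (and, for contrast, the backfiring rate from Table 1 below) to make it explicit:

```python
# Cost per function point = counter's daily fee / daily counting rate.
def cost_per_function_point(daily_fee, fp_per_day):
    return daily_fee / fp_per_day

print(cost_per_function_point(2_500, 400))     # IFPUG manual: $6.25
print(cost_per_function_point(2_500, 10_000))  # backfiring:   $0.25
```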

These statements are true for IFPUG, COSMIC, Mark II, NESMA, and other major forms of function point metrics.

There are some alternate methods for deriving function point counts that are less expensive, although perhaps at the cost of reduced accuracy. Table 1 shows the current range of function point counts using several alternative methods. Note that Table 1 has a high margin of error. There are broad ranges in counting speeds and also in daily costs for every function point variation. Also, Table 1 is not an accurate depiction because each Agile "story point" reflects a larger unit of work than a normal function point. A story point may be equal to at least 2, and perhaps more, function points. A full Agile "story" may top 20 function points in size.

The term Agile stories refers to the method used by some Agile projects for deriving requirements. An Agile story point is usually somewhat larger than a function point, and is perhaps roughly equivalent to two IFPUG function points, or perhaps even more.

Manual counting implies analyzing and enumerating function points from requirements and specifications by a certified counter. The accuracy is good, but the costs are high. (Note that function point counts by uncertified counters are erratic and unreliable. However, counts by certified counters have been studied and achieve good accuracy.) Since COSMIC, IFPUG, Mark II, and NESMA function points all have certification procedures, this method works with all of the common function point variants. Counting use case points is not yet a certified activity as of 2008.

Automatic derivation refers to experimental tools that can derive function points from written requirements and specifications.

TABLE 1: Range of Costs for Calculating Function Point Metrics


Method of Counting          Function Points Counted per Day   Average Daily Compensation   Cost per Function Point   Margin of Error
Agile story points          50                                $2,500                       $50.00                    5%
Use case manual counting    250                               $2,500                       $10.00                    3%
Mark II manual counting     350                               $2,500                       $7.14                     3%
IFPUG manual counting       400                               $2,500                       $6.25                     3%
NESMA manual counting       450                               $2,500                       $5.55                     3%
COSMIC manual counting      500                               $2,500                       $5.00                     3%
Automatic derivation        1,000                             $2,500                       $2.50                     5%
"Light" function points     1,500                             $2,500                       $1.67                     10%
NESMA "indicative" counts   1,500                             $2,500                       $1.67                     10%
Backfiring from LOC         10,000                            $2,500                       $0.25                     50%
Pattern-matching            300,000                           $2,500                       $0.01                    15%


Such tools have been built experimentally, but they are not commercially available. IFPUG function points have been the major metric supported. These tools require formal requirements and/or design documents, such as those using structured design, use cases, and other standard methods. This method has good accuracy, but its speed is linked to the rate at which the requirements are created; it can go no faster. Even so, there is a great reduction in manual effort.

The phrase "light" function points refers to a method developed by David Herron of the David Consulting Group. This method simplifies the counts to a higher level and therefore is more rapid. The "light" method uses average values for the influential factors. As this book is written, the "light" method shows promise but is still somewhat experimental.

The phrase NESMA indicative refers to a high-speed method developed by the Netherlands Function Point Users Group (NESMA) that uses constant weights and concentrates on counts of data files.

Backfiring is the oldest alternative to manual counting, and actually was developed in the mid-1970s when A.J. Albrecht and his colleagues first developed function point metrics. During the trials of function points, both lines of code and function points were counted, which provided approximate ratios between LOC metrics and function point metrics.

However, due to variations in how code is counted and variations in individual programming "styles," the accuracy of backfiring is not high. At best, backfiring can come within 10 percent of manual counts, but at worst the difference can top 100 percent. Backfiring works best when logical statements are counted and worst when physical lines of code are counted. This means that backfiring is most effective when automated code counting tools are available. Note also that backfiring ratios are not the same for IFPUG, COSMIC, Mark II, or NESMA function points. Therefore, each function point method requires its own tables of backfiring values. For that matter, when counting rules change, all of the backfiring ratios would change at the same time. Backfiring works best when the code itself is counted automatically by a tool that supports formal counting rules.
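
To make the mechanics concrete, here is a minimal Python sketch of backfiring in both directions (the text later calls the function-points-to-code direction "reverse backfiring"). The ratios are illustrative stand-ins in the spirit of published LOC-per-function-point tables; real ratios vary with the language version, the counting rules, and the function point variant in use:

```python
# Backfiring sketch: convert between logical statements and IFPUG
# function points via per-language ratios. The ratios below are
# illustrative only; each function point variant needs its own table.
LOC_PER_FP = {
    "assembly": 320,
    "c": 128,
    "cobol": 107,
    "java": 53,
}

def backfire_fp(logical_loc, language):
    """Approximate function points from a logical statement count."""
    return logical_loc / LOC_PER_FP[language]

def reverse_backfire_loc(function_points, language):
    """Approximate logical statements from a function point total."""
    return function_points * LOC_PER_FP[language]

print(round(backfire_fp(10_700, "cobol")))       # ~100 function points
print(round(reverse_backfire_loc(100, "java")))  # ~5,300 logical statements
```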

The new pattern-matching approach is based on the fact that many thousands of projects have now been counted with function points. Using a formal taxonomy, and making some adjustments for complexity (problem, code, and data complexity specifically), pattern-matching works by aligning new projects with similar historical projects. It offers a good combination of speed and accuracy. The method is still experimental as this book is written in 2008, but the approach seems to offer solid economic advantages coupled with fairly good accuracy.

Pattern-matching also has some unique capabilities, such as being useful for legacy applications where there may be no written specifications. It can also be used for commercial software packages, such as large ERP applications, office suites, and operating systems, where the specifications are not available and where the vendors have not provided function point sizes.

However, as of 2008, the only two methods that have widespread utilization are normal counting and backfiring. Both have accumulated data on thousands of projects.

The cost and speed of counting function points is a significant barrier to usage for large applications. For example, a major ERP package such as SAP or Oracle is in the range of 275,000 function points in size. To calculate function points manually for such a large system would take up to six months by a team of certified counters, and would cost more than half a million dollars. There is an obvious need for quicker, cheaper, but still accurate methods of arriving at function point totals before really large applications will utilize these powerful metrics.

The high costs and low speed of manual counting explain why backfiring has been popular for so many years. Its costs and speed are quite good, but unfortunately accuracy has lagged. The pattern-matching approach, which can be tuned to fairly good precision, is a promising future method.

Problems with and Paradoxes of Lines-of-Code Metrics


One of the criticisms sometimes levied against function points is that they are subjective, whereas lines of code are considered to be objective. It is true that function point counting to date has included a measure of human judgment, and has therefore included subjectivity. (The emergence of a new class of automated function point tools is about to eliminate the current subjectivity of functional metrics.) However, it is not at all true that the lines-of-code metric is an objective metric. Indeed, a line of code has never had a standard definition; as a metric it is about as objective as the yard was in the days when it was based on the length of the arm of the king of England.

Manual counting of code is fairly expensive and not particularly accurate. In fact, for some "visual languages" such as Visual Basic, code counting is so difficult and unreliable that it almost never occurs. Code counting works best when it is automated by tools that can be programmed to follow specific rules for dealing with topics such as dead code, reused code, and blank lines between paragraphs.

Some compilers produce source code counts, and there are many commercial tools that can count source code statements and also calculate complexity values such as cyclomatic and essential complexity. Applications written in multiple programming languages, such as Java mixed with HTML, add a further measure of complexity. Also, topics such as comments, reused code, dead code, blank lines between paragraphs, and delimiters between logical statements tend to introduce ambiguity into code counts. In fact, the phrase "manual count" is something of a misnomer. What usually happens for large applications is that a small sample is actually counted, and then those values are applied to the rest of the application.

The phrase “reverse backfiring” merely indicates that the formulae for converting code statements into function points are bi-directional and work from either starting point.

The pattern-matching approach works the same way for code counts as for function points. It operates by comparing the application in question to older applications of the same class and type written in the same programming language or languages.

To understand the effectiveness of function points, it is necessary to understand the problems of the older lines of code metric. Regretfully, most users of lines of code have no idea at all of the subjectivity, randomness, and quirky deficiencies of this metric.

The first complete analysis of the problems of lines-of-code metrics was the previously mentioned study by the author in the IBM Systems Journal in 1978. In essence there are three serious deficiencies associated with lines of code:

• There has never been a national or international standard for a line of code that encompasses all procedural languages.

• Software can be produced by such methods as program generators, spreadsheets, graphic icons, reusable modules of unknown size, and inheritance, wherein entities such as lines of code are totally irrelevant.

• Lines-of-code metrics paradoxically move backward as the level of the language gets higher, so that the most powerful and advanced languages appear to be less productive than the more primitive low-level languages. This is due to an intrinsic defect in the lines-of-code metric. Some of the languages thus penalized include Ada, APL, C++, Java, Objective-C, SMALLTALK, and many more.

Let us consider these problems in turn.

Lack of a Standard Definition for Lines of Code: The software industry is now more than 60 years of age, and lines of code have been used ever since its start. It is surprising that, in all that time, the basic concept of a line of code has never been standardized.

Counting Physical or Logical Lines: The variation that can cause the greatest apparent difference in size is that of determining whether a line of code should be terminated physically or logically. A physical termination would be caused by the ENTER key of a computer keyboard, which completes the current line and moves the cursor to the next line of the screen. A logical termination would be a formal delimiter, such as a semicolon, colon, or period.

For languages such as Basic, which allow many logical statements per physical line, the size counted by means of logical delimiters can appear to be up to 500 percent larger than if lines are counted physically. On the other hand, for languages such as COBOL, which utilize conditional statements that encompass several physical lines, the physical method can cause the program to appear perhaps 200 percent larger than the logical method. From informal surveys of the clients of Software Productivity Research carried out by the author, it appears that about 35 percent of U.S. project managers count physical lines, 15 percent count logical lines, and 50 percent do not count by either method.
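
As a toy illustration of why the two conventions diverge, the following Python sketch counts a hypothetical C-style fragment both ways. It is deliberately naive: a real counter must also cope with comments, string literals, and multi-line conditional statements:

```python
# Physical lines are what the ENTER key produced; logical statements
# are delimited here by semicolons, as in C-family languages.
def count_physical_lines(source):
    return len(source.splitlines())

def count_logical_statements(source):
    # Naive rule: every semicolon terminates one logical statement.
    return source.count(";")

sample = "a = 1; b = 2; c = a + b;\nprint(c);\n"  # hypothetical fragment
print(count_physical_lines(sample))      # 2 physical lines
print(count_logical_statements(sample))  # 4 logical statements
```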

Counting Types of Lines: The next area of uncertainty is which of several possible kinds of lines should be counted. The first full explanation of the variations in counting code was perhaps that published by the author in 1986, which is surprisingly recent for a topic approaching 60 years of age.

Most procedural languages include five different kinds of source code statements:

• Executable lines (used for actions, such as addition)
• Data definitions (used to identify information types)
• Comments (used to inform readers of the code)
• Blank lines (used to separate sections visually)
• Dead code (code left in place after updates)

Again, there has never been a U.S. standard that defined whether all five, or only one or two, of these possibilities should be counted. In a typical business application, about 40 percent of the total statements are executable lines, 35 percent are data definitions, 10 percent are blank, and 15 percent are comments. For systems software such as operating systems, about 45 percent of the total statements are executable, 30 percent are data definitions, 10 percent are blank, and 15 percent are comments. However, as applications age, dead code begins to appear and steadily increases over the years.

Dead code consists of code segments that have been bypassed or replaced with newer code following a bug fix or an update. Rather than excising the code, it is often left in place in case the new changes don't work; it is also cheaper to leave dead code than to remove it. The volume of dead code increases with the age of software applications. For a typical legacy application that is now ten years old, the volume of dead code will probably approximate 15 percent of the total code in the application. Dead code became an economic issue during the Y2K era, when some Y2K repair companies were charging for every line of code. It was soon recognized that dead code was going to be an expensive liability if an application was turned over to a commercial Y2K shop for remediation based on per-line charges.

From informal surveys of the clients of Software Productivity Research carried out by the author, it appears that about 10 percent count only executable lines, 20 percent count executable lines and data definitions, 15 percent also include commentary lines, and 5 percent even include blank lines! About 50 percent do not count lines of code at all.

Counting Reusable Code: Yet another area of extreme uncertainty is that of counting reusable code within software applications. Informal code reuse by programmers is very common, and any professional programmer will routinely copy and reuse enough code to account for perhaps 20 to 30 percent of the code in an application when the programming is done in an ordinary procedural language such as C, COBOL, or FORTRAN. For object-oriented languages such as SMALLTALK, C++, and Objective-C, the volume of reuse tends to exceed 50 percent because of the facilities for inheritance that are intrinsic in the object-oriented family of languages. Finally, some corporations have established formal libraries of reusable modules, and many applications in those corporations may consist of more than 75 percent reused code.

The problem with measuring reusability centers around whether a reused module should be counted at all, counted only once, or counted each time it occurs. For example, if a reused module of 100 source statements is included five times in a program, there are three variations in counting (illustrated in the sketch after this list):

• Count the reused module at every occurrence.
• Count the reused module only once.
• Do not count the reused module at all, since it was not developed for the current project.
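
A trivial Python sketch shows how far apart the three "correct" answers can be for the 100-statement module described above:

```python
# A 100-statement reused module that appears five times in one program.
MODULE_SIZE = 100
OCCURRENCES = 5

count_every_occurrence = MODULE_SIZE * OCCURRENCES  # 500 LOC
count_only_once = MODULE_SIZE                       # 100 LOC
count_not_at_all = 0                                # 0 LOC

# Three defensible sizes for the same program, 500 LOC apart.
print(count_every_occurrence, count_only_once, count_not_at_all)
```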

From informal surveys of the clients of Software Productivity Research carried out by the author, about 25 percent would count the module every time it occurred, 20 percent would count the module only once, and 5 percent would not count the reused module at all. The remaining 50 percent do not count source code at all.

Applications Written in Multiple Languages: The next area of uncertainty, which is almost never discussed in the software engineering literature, is the problem of using lines-of-code metrics for multi-language applications. From informal surveys of the clients of Software Productivity Research, it appears that about a third of all U.S. applications include more than one language, and some may include a dozen or more languages. Some of the more common language mixtures include:

• Java mixed with HTML
• Java mixed with C
• COBOL mixed with a query language such as SQL
• COBOL mixed with a data definition language such as DL/1
• COBOL mixed with several other special purpose languages
• C mixed with Assembler
• Visual Basic mixed with HTML
• Ada mixed with Assembler
• Ada mixed with Jovial and other languages

Since there are no U.S. standards for line counting that govern even a single language, multi-language projects show a great increase in the number of random errors associated with lines of code data.

Additional Uncertainties Concerning Lines of Code: Many other possible counting variations can affect the apparent size of applications in which lines of code are used. For example:

• Including or excluding changed code for enhancements
• Including or excluding macro expansions
• Including or excluding job control language (JCL)
• Including or excluding deleted code
• Including or excluding scaffold or temporary code that is written but later discarded

The overall cumulative impact of all of these uncertainties spans more than an order of magnitude. That is, if the most verbose of the line-counting variations is compared to the most succinct, the apparent size of the application will be more than 10 times larger! That is an astonishing and even awe-inspiring range of uncertainty for a unit of measure approaching its 60th year of use!

Unfortunately, very few software authors bother to define which counting rules they used. The regrettable effect is that most of the literature on software productivity that expresses results in terms of lines of code is essentially worthless for serious research purposes.

Size Variations That Are Due to Individual Programming Style: A minor controlled study carried out within IBM illustrates yet another problem with lines of code. Eight programmers were given the same specification and were asked to write the code required to implement it. The amount of code produced for the same specification varied by about 5 to 1 between the largest and the smallest implementation. That was due not to deliberate attempts to make productivity seem high, but rather to the styles of the programmers and to the varying interpretations of what the specifications asked for.

Software Functions Delivered Without Producing Code: As software reuse and service-oriented architecture (SOA) become more common, it is possible to envision fairly large applications that consist of loosely coupled reusable components. In other words, some future applications can be constructed with little or no procedural code being written.

A large-scale study within ITT in which the author participated found that about 26 percent of the approximately 30,000 applications owned by the corporation had been leased or purchased from external vendors rather than developed internally. Functionality was being delivered to the ITT users of these packages, but ITT was obviously not producing the code. Specifically, about 140,000 function points out of the corporate total of 520,000 function points had been delivered to users in the form of packages rather than being developed by the ITT staff. The effective cost per function point of unmodified packages averaged about 35 percent of the cost per function point of custom development.

However, for packages requiring heavy modification, the cost per function point was about 105 percent of equivalent custom development. Lines-of-code metrics are essentially unusable for studying the economics of package acquisitions or for make-vs-buy productivity decisions. The vendors do not provide code counts, so unless a purchaser uses some kind of code-counting engine there is no convenient way of ascertaining the volume of code in purchased software.

The advent of the object-oriented languages and the deliberate pursuit of reusable modules by many corporations is leading to the phenomenon that the number of unique lines of code that must actually be hand-coded is shrinking, whereas the functional content of applications continues to expand. The lines-of-code metric is essentially useless in judging the productivity impact of this phenomenon. The use of inheritance and methods by object-oriented languages, the use of corporate reusable module libraries, and the use of application and program generators make the concept of lines of code almost irrelevant.

As the 21st century progresses, an increasing number of graphics- or icon-based "languages" will appear, and in them application development will proceed in a visual fashion quite different from that of conventional procedural programming. Lines of code, never defined adequately even for procedural languages, will be hopeless for graphics-based languages.

The Paradox of Reversed Productivity for High-Level Languages


Although this point has been discussed several times earlier, it cannot be overemphasized: LOC metrics penalize high-level languages and make low-level languages look artificially better than they really are.

Although lack of standardization is the most visible surface problem with lines of code, the deepest and most severe problem is a mathematical paradox that causes real economic productivity and apparent productivity to move in opposite directions!

The paradox manifests itself under these conditions: As real economic software productivity improves, metrics expressed in both lines of source code per time unit and cost per source line will tend to move backward and appear worse than previously. Thus, as real economic productivity improves, the apparent cost per source line will be higher and the apparent lines of source code per time unit will be lower than before, even though less effort and cost were required to complete the application.

Failure to understand the nature of this paradox has proved embarrassing to the industry as a whole and to many otherwise capable managers and consultants, who have been led to make erroneous recommendations based on apparent productivity data rather than on real economic productivity data. The fundamental reason for the paradox has actually been known since the industrial revolution, or for more than 200 years, by company owners and manufacturing engineers. The essence of the paradox is this: If a product's manufacturing cycle includes a significant proportion of fixed costs and there is a decline in the number of units produced, the cost per unit will naturally go up.

For software, a substantial number of development activities either include or behave like fixed costs. For example, the application's requirements, specifications, and user documents are likely to stay constant in size and cost regardless of the language used for coding. This means that when enterprises migrate from a low-level language such as assembly language to a higher-level language such as COBOL or Ada or Java, they do not have to write as many lines of source code, so the number of units produced declines in the presence of fixed costs.

Since so many development activities either include fixed costs or behave like fixed costs, the cost per source line naturally goes up. Examples of activities that behave like fixed costs, since they are independent of coding, include user requirements, analysis, functional design, design reviews, user documentation, and some forms of testing such as function testing.

Table 3 is an example of the paradox associated with lines-of-source-code metrics, in a comparison of assembly language and Java. Assume $5,000 per month is the fully burdened salary rate in both cases.

TABLE 3: The Paradox of Lines-of-Code Metrics and High-Level Languages

                              Assembler Version   Java Version   Difference
Source code size              100,000             25,000         -75,000
Activity, in person-months:
Requirements                  10                  10             0
Design                        25                  25             0
Coding                        100                 20             -80
Documentation                 15                  15             0
Integration and testing       25                  15             -10
Management                    25                  15             -10
Total effort                  200                 100            -100
Total cost                    $1,000,000          $500,000       -$500,000
Cost per line                 $10                 $20            +$10
Lines per month               500                 250            -250


Note that Table 3 is intended to illustrate the mathematical paradox, and it exaggerates the trends to make the point clearly visible.

As shown in Table 4, with function points the economic productivity improvements are clearly visible, and the true impact of a high-level language such as Java can be seen and understood. Thus, function points provide a better base for economic productivity studies than lines-of-code metrics.

To illustrate some additional findings vis-à-vis the economic advantages of high-level languages as explored by function points, a larger study covering ten different programming languages will be used.

Some years ago the author and his colleagues at Software Productivity Research were commissioned by a European telecommunications company to explore an interesting problem.

Many of this company’s products were written in the CHILL programming language. CHILL is a fairly powerful third-generation procedural language developed specifically for telecommunications applications by the CCITT, an international telecommunications association.

Software engineers and managers within the company were interested in moving to object-oriented programming using C++ as the primary language. Studies had been carried out by the company to compare the productivity rates of CHILL and C++ for similar kinds of applications. These studies concluded that CHILL projects had higher productivity rates than C++ when measured with the LOC-per-staff-month productivity metric.

We were asked to explore the results of these experiments, and either confirm or challenge the finding that CHILL was superior to C++. We were also asked to make recommendations about other possible languages such as Ada 83, Ada 95, C, PASCAL, or SMALLTALK.

TABLE 4: The Economic Validity of Function Point Metrics

                                   Assembler Version   Java Version   Difference
Source code size                   100,000             25,000         -75,000
Function points                    300                 300            0
Activity, in person-months:
Requirements                       10                  10             0
Design                             25                  25             0
Coding                             100                 20             -80
Documentation                      15                  15             0
Integration and testing            25                  15             -10
Management                         25                  15             -10
Total effort                       200                 100            -100
Total cost                         $1,000,000          $500,000       -$500,000
Cost per function point            $3,333              $1,667         -$1,666
Function points per person-month   1.5                 3.0            +1.5
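
A small Python sketch, using the numbers straight from Tables 3 and 4, makes the contrast explicit: the LOC-based rows move against the true economics, while the function point rows move with them:

```python
# Bottom rows of Tables 3 and 4 computed from the same project data.
def project_metrics(loc, person_months, cost, function_points):
    return {
        "cost_per_loc": cost / loc,
        "loc_per_month": loc / person_months,
        "cost_per_fp": cost / function_points,
        "fp_per_month": function_points / person_months,
    }

assembler = project_metrics(100_000, 200, 1_000_000, 300)
java = project_metrics(25_000, 100, 500_000, 300)

# LOC view: Java looks worse ($20 vs $10 per line, 250 vs 500 LOC/month).
# FP view: Java is twice as productive (3.0 vs 1.5 FP/month) at roughly
# half the cost per function point (~$1,667 vs ~$3,333).
for name, metrics in (("Assembler", assembler), ("Java", java)):
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```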


As background information, we also examined the results of using macro-assembly language. All eight of these languages were either being used for telecommunications software or were candidates for such use, as in the case of Ada 95, which was just being prepared for release.

Later, two additional languages were included in the analysis: PL/I and Objective-C. The PL/I language has been used for switching software applications and construction of PBX switches for many years. For example, several ITT switches were constructed in a PL/I variant called Electronic Switching PL/I (ESPL/I).

The Objective-C language actually originated as a telecommunications language within the ITT Corporation under Dr. Tom Love at the ITT Programming Technology Center in Stratford, Connecticut. However, the domestic ITT research facilities were closed after Alcatel bought ITT's telecommunications business, so the Objective-C language was brought to the commercial market by Dr. Love and the Stepstone Corporation. The data on Objective-C in this section was derived from mathematical modeling, and not from an actual product.

The basic conclusion of the study was that object-oriented languages did offer substantial economic productivity gains compared to third-generation procedural languages, but that these advantages were hidden when measured with the LOC metric.

However, object-oriented analysis and design is more troublesome and problematic. The Unified Modeling Language (UML) and the older Booch, Jacobsen, and Rumbaugh "flavors" of OO analysis and design had very steep learning curves and were often augmented or abandoned in order to complete projects that used them.

Basis of the Study: The kind of project selected for this study was the software for a private branch exchange (PBX) switch, similar to the kinds of private switching systems utilized by larger hotels and office buildings.

The original data on CHILL was derived from the client's results, and data for the other programming languages was derived from other telecommunications companies who are among our clients. (The Ada 95 results were originally modeled mathematically, since this language had no available compilers at the time of the original study. The Objective-C results were also modeled.)

To ensure consistent results, all versions were compared using the same sets of activities, and any activities that were unique for a particular project were removed. The data was normalized using the CHECKPOINT measurement and estimation tool. This tool facilitates comparisons between different programming languages and different sets of activities, since it can highlight and mask activities that are not common among all projects included in the comparison. The full set of activities that we studied included more than 20, but the final study used consolidated data based on six major activities:

• Requirements
• Design
• Coding
• Integration and testing
• Customer documentation
• Management

The consolidation of data to six major activities was primarily to simplify presenting the results. The more granular data actually utilized included activity and task-level information. For example, the cost bucket labeled “integration and testing” really comprised information derived from integration, unit testing, new function testing, regression testing, stress and performance testing, and field testing.

For each testing stage, data was available on test case preparation, test case execution, and defect repair costs. However, the specific details of each testing step are irrelevant to an overall economic study. So long as the aggregation is based on the same sets of activities, this approach does not degrade the overall accuracy.

Since the original study concentrated primarily on object-oriented programming languages as opposed to object-oriented analysis and design, data from OO requirements and analysis were not explored in depth in the earlier report.

Other SPR clients and other SPR studies that did explore various OO analysis and design methods found that they had steep learning curves and did not benefit productivity in the near term. In fact, the OO analysis and design approaches were abandoned or augmented by conventional analysis and design approaches in about 50 percent of the projects that attempted to use them initially.

The new Unified Modeling Language (UML), which consolidates the methods of Booch, Rumbaugh, and Jacobsen, now has formal training available, and the learning curve has been reduced.

Metrics Evaluated for the Study: Since the main purpose of the study was to compare object-oriented languages and methods against older procedural languages and methods, it was obvious that we needed measurements and metrics that could handle both the old and the new. Since the study was aimed at the economic impact associated with entire projects and not just pure coding, it was obvious that the metric needed to be useful for measuring non-coding activities such as requirements, design, documentation, and the like. The metrics that were explored for this study included:

• Physical lines of code, using the SEI counting rules
• Logical lines of code, using the SPR counting rules
• Feature points
• Function points
• MOOSE (metrics for object-oriented system environments)

The SEI approach of using physical lines (Park, 92) was eliminated first, since the variability is both random and high for studies that span multiple programming languages. Individual programming styles can affect the count of physical lines by several hundred percent. When multiple programming languages are included, the variance can approach an order of magnitude.

The SPR approach of using logical statements gives more consistent results than a count of physical lines and reduces the variations due to individual programming styles. The usage of logical statements also facilitates a technique called "backfiring," or the direct conversion of LOC metrics into functional metrics. The SPR counting rules for logical statements, published in the second edition of Applied Software Measurement (Jones, 1996) but still used in 2008 for the third edition, provide the basis of the LOC counts shown later in this report. However, LOC in any form is not a good choice for dealing with non-coding activities such as document creation.

The feature point metric was originally developed for telecommunication companies for software measurement work. This metric is not as well known in Europe as the function point metric, however, so it was not used in the final report. (For those unfamiliar with the feature point metric, it adds a sixth parameter, a count of algorithms, to the five parameters used by standard function point metrics. This metric was also described in the second edition of Applied Software Measurement.)

The IFPUG function point metric was selected for displaying the final results of this study, using a constant 1,500 function points as the size of all eight versions. The function point metric is now the most widely utilized software metric in both the United States and some 20 other countries, including much of Europe.

More data is now being published using this metric than any other, and the number of automated tools that facilitate counting function points is growing exponentially. Also, this metric was familiar to the European company for which the basic analysis was being performed.

Since a key purpose of the study was to explore the economics of object-oriented programming languages, it was natural to consider using the new “metrics for object- oriented system environments” (MOOSE) developed by Dr. Chris Kemerer of MIT (Kemerer and Chidamber, 1993).

The MOOSE metrics include a number of constructs that are only relevant to OO projects, such as Depth of the Inheritance Tree (DIT), Weighted Methods per Class (WMC), Coupling between Objects (CBO), and several others.

However, the basic purpose of the study was a comparison of ten programming languages, of which six are older procedural languages. Unfortunately, the MOOSE metrics do not lend themselves to cross-language comparisons between OO projects and procedural projects, and so they had to be excluded.

In some ways, the function point metric resembles the "horsepower" metric. In the year 1783 James Watt tested a strong horse by having it lift a weight, and found that it could raise a 150-pound weight almost four feet in one second. He created the ad hoc empirical metric "horsepower," defined as 550 foot-pounds per second. The horsepower metric has been in continuous usage for more than 200 years and has served to measure steam engines, gasoline engines, diesel engines, electric motors, turbines, and even jet engines and nuclear power plants.

The function point metric also had an ad hoc empirical origin, and it too is serving to measure types of projects that did not exist at the time of its original creation in the mid-1970s, such as client-server software, object-oriented projects, web projects, and multimedia applications.

Function point metrics originated as a unit for measuring size, but have also served effectively as a unit of measure for software quality, for software productivity, and even for exploring software value and return on investment. In all of these cases, function points are generally superior to the older LOC metric. Surprisingly, function points are also superior to some of the modern object-oriented (OO) metrics, which are often difficult to apply to quality or economic studies.

For those considering the selection of metrics for measuring software productivity and quality, eight practical criteria can be recommended:

• The metric should have a standard definition and be unambiguous.
• The metric should be unbiased and suited for large-scale statistical studies.
• The metric should have a formal user group and adequate published data.
• The metric should be supported by tools and automation.
• It is helpful to have conversion rules between the metric and other metrics.
• The metric should deal with all software deliverables, and not just code.
• The metric should support all kinds and types of software projects.
• The metric should support all kinds and types of programming languages.

It is interesting that the function point metric is currently the only metric that meets all eight criteria.

The feature point metric lacks a formal user group and has comparatively few published results but otherwise is equivalent to function points and hence meets seven of the eight criteria.

The LOC metric is highly ambiguous, lacks a user group, is not suited for large-scale statistical studies, is unsuited for measuring non-code deliverables such as documentation, and does not handle all kinds of programming languages such as visual languages or generators. In fact, the LOC metric does not really satisfy any of the criteria.

The MOOSE metrics are currently in evolution and may perhaps meet more criteria in the future. As this paper was written, the MOOSE metrics appeared unsuited for non-OO projects, did not deal with deliverables such as specifications or user documentation, and lacked conversion rules to other metrics.

Review Questions
  • 1. What are metrics? Write down their importance.
  • 2. Explain briefly about objective and subjective measurement.
  • 3. What are the attributes of good metrics?
  • 5. Explain the types of metrics in detail.
  • 6. Explain briefly the evolution of the software industry and the evolution of software measurements.
  • 7. Explain the cost of counting function point metrics.
  • 8. What is meant by counting reusable code?
  • 9. Explain in detail the paradox of reversed productivity.
