University Librarian Centre | University of Cagliari
UniCA Eprints

Power laws in software systems

Tonelli, Roberto (2012) Power laws in software systems. [Doctoral Thesis]



The main topic of my PhD has been the study of power laws in software systems within the perspective of describing software quality. My PhD research contributes to a recent stream of studies in software engineering, where the investigation of power laws in software systems has become widely popular in recent years, since they appear on an incredible variety of different software quantities and properties, like, for example, software metrics, software faults, refactoring, Java byte-code, module dependencies, software fractal dimension, lines of code, software packages and so on. The common presence of power laws suggests that software systems belong to the much larger category of complex systems, where typically self organization, fractality and emerging phenomena occur. Often my work involved the determination of a complex graph associated to the software system, defining the so called “complex software network”. For such complex software networks I analyzed different network metrics and I studied their relationships with software quality. In this PhD I took advantage of the theory of complex systems in order to study, to explain and sometimes to forecast properties and behavior of software systems. Thus my work involved the empirical study of many different statistical properties of software, in particular metrics, faults and refactorings, the construction and the application of statistical models for explaining such statistical properties, the implementation and the optimization of algorithms able to model their behavior, the introduction of metrics borrowed from Social Network Analysis (SNA) for describing relationships and dependencies among software modules. More specifically, my research activity regarded the followings topics: Bugs, power laws and software quality In [1] [7] [16] [20] [21] [22] module faultness and its implications on software quality are investigated. I studied data mining from CVS repositories of two large OO projects, Eclipse and Netbeans, focusing on “fixing- issue” commits, and compared static traditional approaches, like Knowledge Engineering, to dynamic approaches based on Machine Learning techniques. The work compares for the first time performances of Machine Learning (ML) techniques to automatic classify “fixing-issues” among message commits. Our study calculates precision and recall of different Machine Learning Classifiers for the correct classification of issue- reporting commits. The results show that some ML classifiers can correctly classify up to 99.9% of such commits. In [22] Java software systems are treated as complex graphs, where nodes represent a Java file - called compilation unit (CU) - and an edges represent a relations between them. The distribution of the number of bugs per CU, exhibits a power-law behavior in the tail, as well as the number of CUs influenced by a specific bug. The exam of the evolution of software metrics across different releases allows to understand how relationships among CUs metrics and CUs faultness change with time. In [1] module faultness is further discussed from a statistical perspective, using as case studies five versions of Eclipse, to show how log-normal, Double Pareto and Yule-Simon statistical distributions may fit the empirical bug distribution at least as well as the Weibull distribution proposed by Zhang. In particular, I discuss how some of these alternative distributions provide both a superior fit to empirical data and a theoretical motivation to be used for modeling the bug generation process. Further studies reported in [3] present a model based on the Yule process, able to explain the evolution of some properties of large object- oriented software systems. Four system properties related to code production of four large object-oriented software systems – Eclipse, Netbeans, JDK and Ant are analyzed. The properties analyzed, namely the naming of variables and methods, the call to methods and the inheritance hierarchies, show a power-law distribution. A software simulation allows to verify the goodness of the model, finding a very good correspondence between empirical data of subsequent software versions, and the prediction of the model presented. In [18], [19] and [23] three algorithms for an efficient implementation of the preferential attachment mechanism lying at the core of the Yule process are developed, and their efficiency in generating power- law distribution for different properties of Object Oriented (OO) software systems is discussed. Software metrics and SNA metrics In [2] [8] [13] [17] software metrics related to quality are analyzed and some metrics borrowed from the Social Network Analysis are applied to OO software graphs. In OO systems the modules are the classes, interconnected with each other by relationships like inheritance and dependency. It is possible to represent OO systems as software networks, where the classes are the network nodes and the relationships among classes are the network edges. Social Networks metrics, as for instance, the EGO metrics, allow to identify the role of each single node in the information flow through the network, being related to software modules and their dependencies. In [2] these metrics are compared with other traditional software metrics, like the Chidamber-Kemerer suite, and software graph metrics. The exam of the empirical distributions of all the metrics across the software modules of several releases of two large Java systems systematically shows fat-tails for all the metrics. Moreover, the various metric distributions look very similar and consistent across all system releases and are also very similar in both systems. Analytical distribution functions suitable for describing and studying the observed distributions are also provided. The work in [17] presents an extensive analysis of software metrics for 111 object-oriented systems written in Java. For each system, we considered 18 traditional metrics such as LOC and Chidamber and Kemerer metrics, as well as metrics derived from complex network theory and social network analysis, computed at class level. Most metrics follow a leptokurtotic distribution. Only a couple of metrics have patent normal behavior while some others are very irregular, and even bimodal. The statistics gathered allow to study and discuss the variability of metrics along different systems. In [8] a preliminary and exploratory analysis of the Eclipse subprojects is presented, using a joint application of SNA and traditional software metrics. The entire set of metrics has been summarized performing a Principal Component Analysis (PCA) and obtaining a very reduced number of independent principal components, which allow to represent the classes into a space where they show typical patterns. The preliminary results show how the joint application of traditional and network software metrics may be used to identify subprojects developed with similar functionalities and scopes. In [13] the software graphs of 96 systems of the Java Qualitas Corpus are anlyzed, parsing the source code and identifying the dependencies among classes. Twelve software metrics were analyzed, nine borrowed from Social Net- work Analysis (SNA), and three more traditional software metrics, such as Loc, Fan-in and Fan-out. The results show how the metrics can be partitioned in groups for almost the whole Java Qualitas Corpus, and that such grouping can provide insights on the topology of software networks. For two systems, Eclipse and Netbeans, we computed also the number of bugs, identifying the bugs affecting each class, and finding that some SNA metrics are highly correlated with bugs, while others are strongly anti-correlated. Software fractal dimension In [6] [12] [14] [15] the self similar structure of software networks is used to introduce the fractal dimension as a global software metric associated to software quality, at the system level and at the subproject level. In [6] the source code of various releases of two large OO Open Source (OS) Java software systems, Eclipse and Netbeans is analyzed, investigating the complexity of the whole release and of its subprojects. In all examined cases there exists a scaling region where it is possible to compute a self-similar coefficient, the fractal dimension, using “the box counting method”. Results show that this measure looks fairly related to software quality, acting as a global quality software metric. In particular, we computed the defects of each software system and we found a clear correlation among the number of defects in the system, or in a subproject, and its fractal dimension. This correlation exists across all the subprojects and also along the time evolution of the software systems, as new releases are delivered. In [14] software systems are considered as complex networks which have a self- similar structure under a length-scale transformation. On such complex software networks a self-similar coefficient is computed, also known as fractal dimension, using "the box counting method”. Several releases of the publically available Eclipse software system were analyzed, calculating the fractal dimension for twenty sub-projects, randomly chosen, for every release, as well as for each release as a whole. Our results display an overall consistency among the sub- projects and among all the analyzed releases. The study founds a very good correlation between the fractal dimension and the number of bugs for Eclipse and for twenty sub-projects. This result suggests that the fractal dimension could be considered as a global quality metric for large software systems. Works [12] and [15] propose an algorithm for computing the fractal dimension of a software network, and compare its performances with two other algorithms. Object of study are various large, object-oriented software systems. We built the associated graph for each system, also known as software network, analyzing the binary relationships (dependencies), among classes. We found that the structure of such software networks is self-similar under a length-scale transformation. The fractal dimension of these networks is computed using a Merge algorithm, first devised by the authors, a Greedy Coloring algorithm, based on the equivalence with the graph coloring problem, and a Simulated Annealing algorithm, largely used for efficiently determining minima in multi-dimensional problems. Our study examines both efficiency and accuracy, showing that the Merge algorithm is the most efficient, while the Simulated Annealing is the most accurate. The Greeding Coloring algorithm lays in between the two, having speed very close to the Merge algorithm, and accuracy comparable to the Simulated Annealing algorithm. 1.b Further research activity In [4] [9] [10] [11] the problem of software refactoring is analyzed. The study reported in [4] analyzes the effect of particular refactorings on class coupling for different releases of four Object Oriented (OO) Open Source (OS) Java software systems: Azureus, Jtopen, Jedit and Tomcat, as representative of general Java OS systems. Specifically, the “add parameter” to a method and “remove parameter” from a method refactorings, as defined according to Fowler’s dictionary, may influence class coupling changing fan-in and fan-out of classes they are applied to. The work investigates, both qualitatively and quantitatively, what is the global effect of the application of such refactorings, providing best fitting statistical distributions able to describe the changes in fan-in and fan-out couplings. A detailed analysis of the best fitting parameters and of their changes when refactoring occurs, has been performed, estimating the effect of refactoring on coupling before it is applied. Such estimates may help in determining refactoring costs and benefits . In [9] a study of the effect of fan-in and fan-out metrics is performed from the perspective of two refactorings, “add parameter to” and “remove parameter from” a method, collecting these two refactorings from multiple releases of the Tomcat open source system. Results show significant differences in the profiles of statistical distributions of fan-in and fan-out between refactored and not refactored classes. A strong over-arching theme emerged: developers seemed to focus on the refactoring of classes with relatively high fan-in and fan-out values rather than classes with high values in any one. In [10] is considered for the first time how a single refactoring modified these metric values, what happened when refactorings had been applied to a single class in unison and finally, what influence a set of refactorings had on the shape of FanIn and FanOut distributions. Results indicated that, on average, refactored classes tended to have larger FanIn and FanOut values when compared with non-refactored classes. Where evidence of multiple (different) refactorings applied to the same class was found, the net effect (in terms of FanIn and FanOut coupling values) was negligible. In [11] is shown how highly-coupled classes were more prone to refactoring, particularly through a set of ‘core’ refactorings. However, wide variations were found across systems for our chosen measures of coupling namely, fan-in and fan-out. Specific individual refactorings were also explored to gain an understanding of why these differences may have occurred. An exploration of open questions through the extraction of fifty-two of Fowler’s catalog of refactorings drawn from versions of four open-source systems is accomplished, comparing the coupling characteristics of each set of refactored classes with the corresponding set of non-refactored classes. In [7] I presented some preliminary studies also on the relationships about Micro- patterns, more specifically anti-patterns, and software quality, while in [5] and [21] I analyzed the role of Agile methodologies in software production and the relationships with software quality and the presence of bugs.

Item Type:Doctoral Thesis
Date:06 March 2012
Tutor:Marchesi, Michele
PhD classes:Ciclo 24 > Ingegneria elettronica e informatica
Coordinator:Giua, Alessandro
Institution:Universita' degli Studi di Cagliari
Divisions:Dipartimenti (fino a dicembre 2011) > Dipartimento di Ingegneria elettrica ed elettronica
Subjects:Area 09 - Ingegneria industriale e dell'informazione > ING-INF/05 Sistemi di elaborazione delle informazioni
Uncontrolled Keywords:Power laws, bugs, software quality
ID Code:685
Deposited On:19 Mar 2012 17:00

Repository Staff Only: item control page