Anais. Congresso Brasileiro de Software: Teoria e Prática. SBES 2013 XXVII Simpósio Brasileiro de Engenharia de Software


1 Congresso Brasileiro de Software: Teoria e Prática 29 de setembro a 04 de outubro de 2013 Brasília-DF Anais SBES 2013 XXVII Simpósio brasileiro de engenharia de software

2 SBES 2013 SBES 2013 XXVII Simpósio Brasileiro de Engenharia de Software 29 de setembro a 04 de outubro de 2013 Brasília-DF, Brasil ANAIS Volume 01 ISSN: COORDENADOR DO COMITÊ DE PROGRAMA Auri M. R. Vincenzi, Universidade Federal de Goiás COORDENAÇÃO DO CBSOFT 2013 Genaína Rodrigues UnB Rodrigo Bonifácio UnB Edna Dias Canedo - UnB Realização Universidade de Brasília (UnB) Departamento de Ciência da Computação (DIMAp/UFRN) Promoção Sociedade Brasileira de Computação (SBC) Patrocínio CAPES, CNPq, Google, INES, Ministério da Ciência, Tecnologia e Inovação, Ministério do Planejamento, Orçamento e Gestão e RNP Apoio Instituto Federal Brasília, Instituto Federal Goiás, Loop Engenharia de Computação, Secretaria de Turismo do GDF, Secretaria de Ciência Tecnologia e Inovação do GDF e Secretaria da Mulher do GDF 2

3 SBES 2013 SBES 2013 XXVII Brazilian Symposium on Software Engineering (SBES) September 29 to October 4, 2013 Brasília-DF, Brazil PROCEEDINGS Volume 01 ISSN: PROGRAM CHAIR Auri M. R. Vincenzi, Universidade Federal de Goiás, Brasil CBSOFT 2013 general CHAIRS Genaína Rodrigues UnB Rodrigo Bonifácio UnB Edna Dias Canedo - UnB ORGANIZATION Universidade de Brasília (UnB) Departamento de Ciência da Computação (DIMAp/UFRN) PROMOTION Brazilian Computing Society (SBC) SPONSORS CAPES, CNPq, Google, INES, Ministério da Ciência, Tecnologia e Inovação, Ministério do Planejamento, Orçamento e Gestão e RNP SUPPORT Instituto Federal Brasília, Instituto Federal Goiás, Loop Engenharia de Computação, Secretaria de Turismo do GDF, Secretaria de Ciência Tecnologia e Inovação do GDF e Secretaria da Mulher do GDF 3

4 SBES 2013 Autorizo a reprodução parcial ou total desta obra, para fins acadêmicos, desde que citada a fonte 4

5 SBES 2013 Apresentação Bem-vindo à XXVII edição do Simpósio Brasileiro de Engenharia de Software (SBES) que, este ano, é sediado na capital do Brasil, Brasília. Como tem acontecido desde 2010, o SBES 2013 faz parte do Congresso Brasileiro de Software: Teoria e Prática (CBSoft), que reúne o Simpósio Brasileiro de Linguagens de Programação (SBLP), o Simpósio Brasileiro de Métodos Formais (SBMF), o Simpósio Brasileiro de Componentes, Arquiteturas e Reutilização de Software (SBCARS) e a Miniconferência Latino-Americana de Linguagens de Padrões para Programação (MiniPLoP). Dentro do SBES, o participante encontra as seções técnicas, o Fórum de Educação em Engenharia de Software e três palestrantes convidados: dois internacionais e um nacional. Complementando este programa, o CBSoft oferece uma gama de atividades, incluindo cursos de curta duração, workshops, tutoriais, uma sessão de ferramentas, a trilha Industrial e o Workshop de Teses e Dissertações. Nas seções técnicas do SBES, trabalhos de pesquisa inéditos são apresentados, cobrindo uma variedade de temas sobre engenharia de software, mencionados na chamada de trabalhos, amplamente divulgada na comunidade brasileira e internacional. Um processo de revisão rigoroso permitiu a seleção criteriosa de artigos com a mais alta qualidade. O Comitê de Programa incluiu 76 membros da comunidade nacional e internacional de Engenharia de Software. Ao todo, 113 pesquisadores participaram da revisão dos 70 trabalhos submetidos. Desses, 17 artigos foram aceitos para apresentação e publicação nos anais do SBES. Pode-se observar que o processo de seleção foi competitivo, o que resultou numa taxa de aceitação de 24% dos artigos submetidos. Além da publicação dos artigos nos anais, disponíveis na Biblioteca Digital do IEEE, os oito melhores artigos, escolhidos por um comitê selecionado a partir do Comitê de Programa, são convidados a submeter uma versão estendida para o Journal of Software Engineering Research and Development (JSERD). Para o SBES 2013, os palestrantes convidados são: Jeff Offutt (George Mason University) - How the Web Brought Evolution Back Into Design; Sam Malek (George Mason University) - Toward the Making of Software that Learns to Manage Itself; e Thais Vasconcelos Batista (DIMAP-UFRN) - Arquitetura de Software: uma Disciplina Fundamental para a Construção de Software. Finalmente, gostaríamos de agradecer a todos aqueles que contribuíram com esta edição do SBES. Agradecemos aos membros do Comitê Gestor do SBES e do CBSoft, aos membros do comitê de programa, aos avaliadores dos trabalhos, às comissões organizadoras e a todos aqueles que de alguma forma tornaram possível a realização de mais um evento com o padrão de qualidade dos melhores eventos internacionais. Mais uma vez, bem-vindo ao SBES 2013. Brasília, DF, setembro/outubro de 2013. Auri Marcelo Rizzo Vincenzi (INF/UFG) Coordenador do Comitê de Programa da Trilha Principal 5

6 SBES 2013 Foreword Welcome to the XXVII edition of the Brazilian Symposium on Software Engineering (SBES), which this year takes place in the capital of Brazil, Brasília. As has happened since 2010, SBES 2013 is part of the Brazilian Conference on Software: Theory and Practice (CBSoft), which gathers the Brazilian Symposium on Programming Languages (SBLP), the Brazilian Symposium on Formal Methods (SBMF), the Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS) and the Latin American Miniconference on Pattern Languages of Programming (MiniPLoP). Within SBES, the participant finds two technical tracks, the Forum on Software Engineering Education and three invited speakers: two international and one national. Complementing this program, CBSoft provides a range of activities including short courses, workshops, tutorials, a tools session, the Industrial Track and the Workshop of Theses and Dissertations. In the main technical track of SBES, unpublished research papers are presented, covering a range of topics on Software Engineering mentioned in the call for papers, which was widely advertised in the Brazilian and international community. A rigorous peer review process enabled the careful selection of articles with the highest quality. The Program Committee included 76 members of the international Software Engineering community. In all, 113 researchers participated in the review of the 70 papers submitted. Of those, 17 articles were accepted for presentation and publication in the SBES proceedings. It can be seen from these figures that we had a very competitive process, which resulted in an acceptance rate of 24% of the submitted articles. Besides the publication of the articles in the proceedings, available in the IEEE Digital Library, the authors of the top eight articles, chosen by a committee selected from members of the Program Committee, are invited to submit an extended version to the Journal of Software Engineering Research and Development (JSERD). For SBES 2013 the invited speakers are: How the Web Brought Evolution Back Into Design - Jeff Offutt (George Mason University); Toward the Making of Software that Learns to Manage Itself - Sam Malek (George Mason University); and Software Architecture: a Core Discipline to Engineer Software - Thais Vasconcelos Batista (DIMAP-UFRN). Finally, we would like to thank all those who contributed to making this edition of SBES possible. We thank the members of the Steering Committees of SBES and CBSoft, the program committee members, the reviewers of papers, the organizing committees and all those who somehow made possible the realization of yet another event with the quality standard of the best international events. Once again, welcome to SBES 2013. Brasília, DF, September/October 2013. Auri Marcelo Rizzo Vincenzi (INF/UFG) Coordinator of the Program Committee of the Main Track 6

7 SBES 2013 Comitês Técnicos / Technical Committees SBES Steering Committee Alessandro Garcia, PUC-Rio Auri Marcelo Rizzo Vincenzi, UFG Marcio Delamaro, USP Sérgio Soares, UFPE Thais Batista, UFRN CBSoft General Committee Genaína Nunes Rodrigues, UnB Rodrigo Bonifácio, UnB Edna Dias Canedo, UnB CBSoft Local Committee Diego Aranha, UnB Edna Dias Canedo, UnB Fernanda Lima, UnB Guilherme Novaes Ramos, UnB Marcus Vinícius Lamar, UnB George Marsicano, UnB Giovanni Santos Almeida, UnB Hilmer Neri, UnB Luís Miyadaira, UnB Maria Helena Ximenis, UnB Comitê do programa / Program Committee Adenilso da Silva Simão, ICMC - Universidade de São Paulo, Brasil Alessandro Garcia, PUC-Rio, Brasil Alfredo Goldman, IME - Universidade de São Paulo, Brasil Antônio Tadeu Azevedo Gomes, LNCC, Brasil Antônio Francisco Prado, Universidade Federal de São Carlos, Brasil Arndt von Staa, PUC-Rio, Brasil Augusto Sampaio, Universidade Federal de Pernambuco, Brasil Carlos Lucena, PUC-Rio, Brasil Carolyn Seaman, Universidade de Maryland, EUA Cecilia Rubira, Unicamp, Brasil Christina Chavez, Universidade Federal da Bahia, Brasil Claudia Werner, COPPE /UFRJ, Brasil Claudio Sant Anna, Universidade Federal da Bahia, Brasil Daltro Nunes, UFRGS, Brasil, Daniel Berry, Universidade de Waterloo, Canadá Daniela Cruzes, Universidade Norueguesa de Ciência e Tecnologia, Noruega Eduardo Almeida, Universidade Federal da Bahia, Brasil Eduardo Aranha, Universidade Federal do Rio Grande do Norte, Brasil 7

8 SBES 2013 Eduardo Figueiredo, Universidade Federal de Minas Gerais, Brasil Ellen Francine Barbosa, ICMC - Universidade de São Paulo, Brasil Fabiano Ferrari, Universidade Federal de São Carlos, Brasil Fabio Queda Bueno da Silva, Universidade Federal de Pernambuco, Brasil Fernanda Alencar, Universidade Federal de Pernambuco, Brasil Fernando Castor, Universidade Federal de Pernambuco, Brasil Flavia Delicato, Universidade Federal do Rio Grande do Norte, Brasil Flavio Oquendo, Universidade Européia de Brittany - UBS/VALORIA, França Glauco Carneiro, Universidade de Salvador, Brasil Gledson Elias, Universidade Federal da Paraíba, Brasil Guilherme Travassos, COPPE/UFRJ, Brasil Gustavo Rossi, Universidade Nacional de La Plata, Argentina Itana Maria de Souza Gimenes, Universidade Estadual de Maringá, Brasil Jaelson Freire Brelaz de Castro, Universidade Federal de Pernambuco, Brasil Jair Leite, Universidade Federal do Rio Grande do Norte, Brasil João Araújo, Universidade Nova de Lisboa, Portugal José Carlos Maldonado, ICMC - Universidade de São Paulo, Brasil José Conejero, Universidade de Extremadura, Espanha Leila Silva, Universidade Federal de Sergipe, Brasil Leonardo Murta, UFF, Brasil Leonor Barroca, Open Un./UK, Great Britain Luciano Baresi, Politecnico di Milano, Itália Marcelo Fantinato, Universidade de São Paulo, Brasil Marcelo de Almeida Maia, Universidade Federal de Uberlândia, Brasil Marco Aurélio Gerosa, IME-USP, Brasil Marco Túlio Valente, Universidade Federal de Minas Gerais, Brasil Marcos Chaim, Universidade de São Paulo, Brasil Márcio Barros, Universidade Federal do Estado do Rio de Janeiro, Brasil Mehmet Aksit, Universidade de Twente, Holanda Nabor Mendonça, Universidade de Fortaleza, Brasil Nelio Cacho, Universidade Federal do Rio Grande do Norte, Brasil Nelson Rosa, Universidade Federal de Pernambuco, Brasil Oscar Pastor, Universidade Politécnica de Valência, Espanha Otávio Lemos, Universidade Federal de São Paulo, Brasil Patricia Machado, Universidade Federal de Campina Grande, Brasil Paulo Borba, Universidade Federal de Pernambuco, Brasil Paulo Masiero, ICMC - Universidade de São Paulo, Brasil Paulo Merson, Software Engineering Institute, EUA Paulo Pires, Universidade Federal do Rio de Janeiro, Brasil Rafael Bordini, PUCRS, Brasil Rafael Prikladnicki, PUCRS, Brasil Regina Braga, Universidade Federal de Juiz de Fora, Brasil Ricardo Choren, IME-Rio, Brasil Ricardo Falbo, Universidade Federal de Espírito Santo, Brasil Roberta Coelho, Universidade Federal do Rio Grande do Norte, Brasil Rogerio de Lemos, Universidade de Kent, Reino Unido Rosana Braga, ICMC - Universidade de São Paulo, Brasil Rosângela Penteado, Universidade Federal de São Carlos, Brasil Sandra Fabbri, Universidade Federal de São Carlos, Brasil Sérgio Soares, Universidade Federal de Pernambuco, Brasil 8

9 SBES 2013 Silvia Abrahão, Universidade Politécnica de Valencia, Espanha Silvia Vergilio, Universidade Federal do Paraná, Brasil Simone Souza, ICMC - Universidade de São Paulo, Brasil Thais Vasconcelos Batista, Universidade Federal do Rio Grande do Norte, Brasil Tiago Massoni, Universidade Federal de Campina Grande, Brasil Uirá Kulesza, Universidade Federal do Rio Grande do Norte, Brasil Valter Camargo, Universidade Federal de São Carlos, Brasil Vander Alves, Universidade de Brasília, Brasil revisores externos / External Reviewers A. César França, Federal University of Pernambuco, Brazil Americo Sampaio, Universidade de Fortaleza, Brazil Anderson Belgamo, Universidade Metodista de Piracicaba, Brazil Andre Endo, ICMC/USP, Brazil Breno França, UFRJ, Brazil Bruno Cafeo, Pontifícia Universidade Católica do Rio de Janeiro, Brazil Bruno Carreiro da Silva, Universidade Federal da Bahia, Brazil Célio Santana, Universidade Federal Rural de Pernambuco, Brazil César Couto, CEFET-MG, Brazil Cristiano Maffort, CEFET-MG, Brazil Draylson Souza, ICMC-USP, Brazil Edson Oliveira Junior, Universidade Estadual de Maringá, Brazil Fernando H. I. Borba Ferreira, Universidade Presbiteriana Mackenzie, Brasil Frank Affonso, UNESP - Universidade Estadual Paulista, Brazil Gustavo Henrique Lima Pinto, Federal University of Pernambuco, Brazil Heitor Costa, Federal University of Lavras, Brazil Higor Souza, University of São Paulo, Brazil Igor Steinmacher, Universidade Tecnológica Federal do Paraná, Brazil Igor Wiese, UTFPR -Universidade Tecnológica Federal do Parana, Brazil Ingrid Nunes, UFRGS, Brazil Juliana Saraiva, Federal University of Pernambuco, Brazil Lucas Bueno, University of São Paulo, Brazil Luiz Carlos Ribeiro Junior, Universidade de Brasilia UnB, Brazil Marcelo Eler, Universidade de São Paulo, Brazil Marcelo Gonçalves, Universidade de São Paulo, Brazil Marcelo Morandini, Universidade de São Paulo, Brazil Mauricio Arimoto, Universidade de São Paulo, Brazil Milena Guessi, Universidade de São Paulo, Brazil Paulo Afonso Parreira Júnior, Universidade Federal de São Carlos, Brazil Paulo Meirelles, IME USP, Brazil Pedro Santos Neto, Universidade Federal do Piauí, Brazil Ricardo Terra, UFMG, Brazil Roberto Araujo, EACH/USP, Brazil Sidney Nogueira Federal University of Pernambuco, Brazil Vanessa Braganholo, UFF, Brazil Viviane Santos Universidade de São Paulo, Brazil Yijun Yu Open University, Great Britain 9

10 SBES 2013 Comitê organizador / Organizing Committee COORDENAÇÃO GERAL Genaína Nunes Rodrigues, CIC, UnB Rodrigo Bonifácio, CIC, UnB Edna Dias Canedo, CIC, UnB COMITÊ LOCAL Diego Aranha, CIC, UnB Edna Dias Canedo, FGA, UnB Fernanda Lima, CIC, UnB Guilherme Novaes Ramos, CIC, UnB Marcus Vinícius Lamar, CIC, UnB George Marsicano, FGA, UnB Giovanni Santos Almeida, FGA, UnB Hilmer Neri, FGA, UnB Luís Miyadaira, FGA, UnB Maria Helena Ximenis, CIC, UnB COORDENADOR DO COMITÊ DE PROGRAMA SBES 2013 Auri M. R. Vincenzi, Universidade Federal de Goiás, Brasil 10

11 SBES 2013 palestras convidadas / invited keynotes TOWARD THE MAKING OF SOFTWARE THAT LEARNS TO MANAGE ITSELF SAM MALEK A self-managing software system is capable of adjusting its behavior at runtime in response to changes in the system, its requirements, or the environment in which it executes. Self-management capabilities are sought-after to automate the management of complex software in many computing domains, including service-oriented, mobile, cyber-physical and ubiquitous settings. While the benefits of such software are plenty, its development has shown to be much more challenging than that of conventional software. At the state of the art, it is not an impervious engineering problem in principle to develop a self-adaptation solution tailored to a given system, which can respond to a bounded set of conditions that are expected to require automated adaptation. However, any sufficiently complex software system once deployed in the field is subject to a broad range of conditions and many diverse stimuli. That may lead to the occurrence of behavioral patterns that have not been foreseen previously: in fact, those may be the ones that cause the most critical problems, since, by definition, they have not manifested themselves, and have not been accounted for during the previous phases of the engineering process. A truly self-managing system should be able to cope with such unexpected behaviors, by modifying or enriching its adaptation logic and provisions accordingly. In this talk, I will first provide an introduction to some of the challenges of making software systems self-managing. Afterwards, I will provide an overview of two research projects in my group that have tackled these challenges through the application of automated inference techniques (e.g., machine learning, data mining). The results have been promising, allowing software engineers to empower a software system with advanced self-management capabilities with minimal effort. I will conclude the talk with an outline of the future research agenda for the community. HOW THE WEB BROUGHT EVOLUTION BACK INTO DESIGN JEFF OFFUTT To truly understand the effect the Web is having on software engineering, we need to look to the past. Evolutionary design was near universal in the days before the industrial revolution. The production costs were very high, but craftsmen were able to implement continuous improvement: every new object could be better than the last. Software is different; it has a near-zero production cost, allowing millions of identical copies to be made. Unfortunately, near-zero production cost means software must be near-perfect out of the box. This fact has driven our research agenda for 50 years. But it is no longer true! This talk will discuss how near-zero production cost for near-perfect software has driven our research agenda. Then it will point out how the web has eliminated the need for near-perfect software out of the box. The talk will finish by describing how this shift is changing software development and research, and speculate on how this will change our future research agenda. 11

12 SBES 2013 SOFTWARE ARCHITECTURE: A CORE DISCIPLINE TO ENGINEER SOFTWARE THAIS BATISTA Software architecture has emerged in the last decades as an important discipline of software engineering, dealing with the design decisions to define the organization of the system that have a long-lasting impact on its quality attributes. The architectural description documents the decisions and it is used as a blueprint to other activities in the software engineering process, such as implementation, testing, and evaluation. In this talk we will discuss the role of software architecture as a core activity to engineer software, its influence on other activities of software development, and the new trends and challenges in this area. 12

13 SBES 2013 PALESTRANTES / keynotes Sam Malek (George Mason University) Sam Malek is an Associate Professor in the Department of Computer Science at George Mason University. He is also the director of the Software Design and Analysis Laboratory at GMU, a faculty associate of the C4I Center, and a member of DARPA's Computer Science Study Panel. Malek's general research interests are in the field of software engineering, and to date his focus has spanned the areas of software architecture, autonomic software, and software dependability. Malek received his PhD and MS degrees in Computer Science from the University of Southern California, and his BS degree in Information and Computer Science from the University of California, Irvine. He has received numerous awards for his research contributions, including the National Science Foundation CAREER award (2013) and the GMU Computer Science Department Outstanding Faculty Research Award (2011). He has managed research projects totaling more than three million dollars in funding received from NSF, DARPA, IARPA, ARO, FBI, and SAIC. He is a member of the ACM, ACM SIGSOFT, and IEEE. Jeff Offutt (George Mason University) Dr. Jeff Offutt is Professor of Software Engineering at George Mason University and holds part-time visiting faculty positions at the University of Skovde, Sweden, and at Linkoping University, Linkoping, Sweden. Offutt has invented numerous test strategies, has published over 150 refereed research papers (h-index of 51 on Google Scholar), and is co-author of Introduction to Software Testing. He is editor-in-chief of Wiley's journal of Software Testing, Verification and Reliability; co-founded the IEEE International Conference on Software Testing, Verification, and Validation; and was its founding steering committee chair. He was awarded the George Mason University Teaching Excellence Award, Teaching With Technology, in 2013, and was named a GMU Outstanding Faculty member in 2008 and For the last ten years he has led the 25-year-old MS program in Software Engineering, and led the efforts to create PhD and BS programs in Software Engineering. His current research interests include software testing, analysis and testing of web applications, secure software engineering, object-oriented program analysis, usable software security, and software evolution. Offutt received the PhD in computer science in 1988 from the Georgia Institute of Technology and is on the web at cs.gmu.edu/~offutt/. Thais Batista (UFRN) Thais Batista is an Associate Professor at the Federal University of Rio Grande do Norte (UFRN). She holds a Ph.D. in Computer Science from the Catholic University of Rio de Janeiro (PUC-Rio), Brazil, and was a post-doctoral researcher at Lancaster University, UK. Her main research areas are software architecture, distributed systems, middleware, and cloud computing. Índice 13

14 SBES 2013 Índice de Artigos / Table of Contents Criteria for Comparison of Aspect-Oriented Requirements Engineering Approaches: Critérios para Comparação de Abordagens para Engenharia de Requisitos Orientada a Aspectos Paulo Afonso Parreira Júnior, Rosângela Aparecida Dellosso Penteado Using Transformation Rules to Align Requirements and Architectural Models Monique Soares, Carla Silva, Gabriela Guedes, Jaelson Castro, Cleice Souza, Tarcisio Pereira An automatic approach to detect traceability links using fuzzy logic Andre Di Thommazo, Thiago Ribeiro, Guilherme Olivatto, Vera Werneck, Sandra Fabbri Determining Integration and Test Orders in the Presence of Modularization Restrictions Wesley Klewerton Guez Assunção, Thelma Elita Colanzi, Silvia Regina Vergilio, Aurora Pozo Functional Validation Driven by Automated Tests / Validação Funcional Dirigida por Testes Automatizados Thiago Delgado Pinto, Arndt von Staa Visualization, Analysis, and Testing of Java and AspectJ Programs with Multi-Level System Graphs Otavio Augusto Lazzarini Lemos, Felipe Capodifoglio Zanichelli, Robson Rigatto, Fabiano Ferrari, Sudipto Ghosh A Method for Model Checking Context-Aware Exception Handling Lincoln S. Rocha, Rossana M. C. Andrade, Alessandro F. Garcia Prioritization of Code Anomalies based on Architecture Sensitiveness Roberta Arcoverde, Everton Guimarães, Isela Macía, Alessandro Garcia, Yuanfang Cai

15 SBES 2013 Are domain-specific detection strategies for code anomalies reusable? An industry multi-project study: Reuso de Estratégias Sensíveis a Domínio para Detecção de Anomalias de Código: Um Estudo de Múltiplos Casos Alexandre Leite Silva, Alessandro Garcia, Elder José Reioli, Carlos José Pereira de Lucena F3T: From Features to Frameworks Tool Matheus Viana, Rosangela Penteado, Antônio do Prado, Rafael Durelli A Metric of Software Size as a Tool for IT Governance Marcus Vinícius Borela de Castro, Carlos Alberto Mamede Hernandes An Approach to Business Processes Decomposition for Cloud Deployment: Uma Abordagem para Decomposição de Processos de Negócio para Execução em Nuvens Computacionais Lucas Venezian Povoa, Wanderley Lopes de Souza, Antonio Francisco do Prado, Luís Ferreira Pires, Evert F. Duipmans On the Influence of Model Structure and Test Case Profile on the Prioritization of Test Cases in the Context of Model-based Testing Joao Felipe S. Ouriques, Emanuela G. Cartaxo, Patrícia D. L. Machado The Impact of Scrum on Customer Satisfaction: An Empirical Study Bruno Cartaxo, Allan Araujo, Antonio Sa Barreto, Sergio Soares Identifying a Subset of TMMi Practices to Establish a Streamlined Software Testing Process Kamilla Gomes Camargo, Fabiano Cutigi Ferrari, Sandra Camargo Pinto Ferraz Fabbri On the Relationship between Features Granularity and Non-conformities in Software Product Lines: An Exploratory Study Iuri Santos Souza, Rosemeire Fiaccone, Raphael Pereira de Oliveira, Eduardo Santana de Almeida An Extended Assessment of Data-driven Bayesian Networks in Software Effort Prediction Ivan A. P. Tierno, Daltro J. Nunes

16 Criteria for Comparison of Aspect-Oriented Requirements Engineering Approaches Critérios para Comparação de Abordagens para Engenharia de Requisitos Orientada a Aspectos Paulo Afonso Parreira Júnior 1, 2, Rosângela Aparecida Dellosso Penteado 2 1 Bacharelado em Ciência da Computação UFG (Câmpus Jataí) - Jataí Goiás, Brasil 2 Departamento de Computação - UFSCar - São Carlos - São Paulo, Brasil {paulo_junior, rosangela}@dc.ufscar.br Resumo Early-aspects referem-se a requisitos de software que se encontram espalhados ou entrelaçados com outros requisitos e são tratados pela Engenharia de Requisitos Orientada a Aspectos (EROA). Várias abordagens para EROA têm sido propostas nos últimos anos e possuem diferentes características, limitações e pontos fortes. Sendo assim, torna-se difícil a tomada de decisão por parte de: i) engenheiros de software, quanto à escolha da abordagem mais apropriada as suas necessidades; e ii) pesquisadores em EROA, quando o intuito for entenderem as diferenças existentes entre suas abordagens e as existentes na literatura. Este trabalho tem o objetivo de apresentar um conjunto de critérios para comparação de abordagens para EROA, criado com base nas variabilidades e características comuns dessas abordagens. Além disso, tais critérios são aplicados a seis abordagens e os resultados obtidos podem servir como um guia para que usuários escolham a abordagem que melhor atenda as suas necessidades, bem como facilite a realização de pesquisas na área de EROA. Palavras-chave Engenharia de Software Orientada a Aspectos, Critérios para Comparação, Avaliação Qualitativa, Early Aspects. Abstract Early-aspects consist of software requirements that are spread or tangled with other requirements and can be treated by Aspect-Oriented Requirements Engineering (AORE). Many AORE approaches have been proposed in recent years and have different features, strengths and limitations. Thus, it becomes difficult the decision making by: i) software engineers, regards to the choice of the most appropriate approach to your needs, and ii) AORE researchers, when the intent is to understand the differences between their own approaches and other ones in the literature. This paper aims to present a set of comparison criteria for AORE approaches, based on common features and variability of these approaches. Such criteria are applied on six of the main AORE approaches and the results can serve as a guide so that users can choose the approach that best meets their needs, and to facilitate the conduct of research in AORE. Keywords Aspect-Oriented Requirements Engineering, Comparison Criteria, Qualitative Evaluation, Early Aspects. I. INTRODUÇÃO O aumento da complexidade do software e a sua aplicabilidade nas mais diversas áreas requerem que a Engenharia de Requisitos (ER) seja realizada de modo abrangente e completo, a fim de: i) contemplar todas as necessidades dos stakeholders [1]; e ii) possibilitar que os engenheiros de software tenham o completo entendimento da funcionalidade do software, dos serviços e restrições existentes e do ambiente sobre o qual ele deve operar [2]. Um requisito de software define uma propriedade ou capacidade que atende às regras de negócio de um software [1]. Um conjunto de requisitos relacionados com um mesmo objetivo, durante o desenvolvimento do software, define o conceito de interesse (concern). Por exemplo, um interesse de segurança pode contemplar diversos requisitos relacionados a esse objetivo, que é garantir que o software seja seguro. 
Idealmente, cada interesse do software deveria estar alocado em um módulo específico do software, que satisfizesse aos seus requisitos. Quando isso ocorre, diz-se que o software é bem modularizado, pois todos os seus interesses estão claramente separados [2]. Entretanto, há alguns tipos de interesses (por exemplo, desempenho, segurança, persistência, entre outros) para os quais essa alocação não é possível apenas utilizando as abstrações usuais da engenharia de software, como casos de uso, classes e objetos, entre outros. Tais interesses são denominados interesses transversais ou early aspect e referem-se aos requisitos de software que se encontram espalhados ou entrelaçados com outros requisitos. A falta de modularização ocasionada pelos requisitos espalhados e entrelaçados tende a dificultar a manutenção e a evolução do software, pois prejudica a avaliação do engenheiro de software quanto aos efeitos provocados pela inclusão, remoção ou alteração de algum requisito sobre os demais [1]. A Engenharia de Requisitos Orientada a Aspectos (EROA) é uma área de pesquisa que objetiva promover melhorias com relação à Separação de Interesses (Separation of Concerns) [3] durante as fases iniciais do desenvolvimento do software, oferecendo estratégias mais adequadas para identificação, modularização e composição de interesses transversais. Várias abordagens para EROA têm sido desenvolvidas nos últimos anos [4][5][7][8][9][10][11][12][13][14], cada uma com diferentes características, limitações e pontos fortes. Além disso, avaliações qualitativas ou quantitativas dessas abordagens foram realizadas [1][2][15][16][17][19][20]. Mesmo com a grande variedade de estudos avaliativos, apenas alguns aspectos das abordagens para EROA são considerados. Assim, para se ter uma visão mais abrangente sobre uma determinada abordagem há necessidade de se recorrer a outros estudos. Por exemplo, as informações sobre as atividades da EROA contempladas em uma abordagem são obtidas na publicação na qual ela foi proposta ou em alguns estudos comparativos que a envolvem. Porém, nem sempre essas publicações apresentam informações precisas sobre a

17 escalabilidade e/ou cobertura e a precisão dessa abordagem, sendo necessário recorrer a estudos de avaliação quantitativa. Na literatura há escassez de estudos que realizam a comparação de abordagens para EROA por meio de um conjunto bem definido de critérios. Também é difícil encontrar, em um mesmo trabalho, a comparação de características qualitativas e quantitativas das abordagens. Esses fatos dificultam a tomada de decisão por parte de: i) engenheiros de software, quanto à escolha da abordagem mais apropriada as suas necessidades; e ii) pesquisadores em EROA, para entenderem as diferenças existentes entre suas abordagens e as demais existentes na literatura. Este trabalho apresenta um conjunto de oito critérios para facilitar a comparação de abordagens para EROA. Esses critérios foram desenvolvidos com base nas variabilidades e características comuns de diversas abordagens, bem como nos principais trabalhos relacionados à avaliação qualitativa e quantitativa dessas abordagens. Os critérios elaborados contemplam: (1) o tipo de simetria de cada abordagem; (2) as atividades da EROA e (3) interesses contemplados por ela; (4) as técnicas utilizadas para realização de suas atividades; (5) o nível de envolvimento necessário para sua aplicação, por parte do usuário; (6) sua escalabilidade; (7) nível de apoio computacional disponível; e (8) as avaliações já realizadas sobre tal abordagem. A fim de verificar a aplicabilidade dos critérios propostos, seis das principais abordagens para EROA disponíveis na literatura são comparadas: Separação Multidimensional de Interesses [8]; Theme [9][10]; EA-Miner [4][5]; Processo baseado em XML para Especificação e Composição de Interesses Transversais [7]; EROA baseada em Pontos de Vista [13][14]; e Aspect-Oriented Component Requirements Engineering (AOCRE) [11]. O resultado obtido com essa comparação pode servir como um guia para que usuários possam compreender de forma mais clara e abrangente as principais características, qualidades e limitações dessas abordagens para EROA, escolhendo assim, aquela que melhor atenda as suas necessidades. O restante deste artigo está organizado da seguinte forma. Na Seção 2 é apresentada uma breve descrição sobre EROA, com enfoque sobre suas principais atividades. Na Seção 3 é apresentada uma visão geral sobre as abordagens para EROA comparadas neste trabalho. O conjunto de critérios para comparação de abordagens para EROA está na Seção 4. A aplicação dos critérios sobre as abordagens apresentadas é exibida e uma discussão dessa aplicação é mostrada na Seção 5. Os trabalhos relacionados estão na Seção 6 e, por fim, as conclusões e trabalhos futuros são apresentados na Seção 7. II. ENGENHARIA DE REQUISITOS ORIENTADA A ASPECTOS O princípio da Separação de Interesses tem por premissa a identificação e modularização de partes do software relevantes a um determinado conceito, objetivo ou propósito [3]. Abordagens tradicionais para desenvolvimento de software, como a Orientação a Objetos (OO), foram criadas com base nesse princípio, porém, certos interesses de escopo amplo (por exemplo, segurança, sincronização e logging) não são fáceis de serem modularizados e mantidos separadamente durante o desenvolvimento do software. O software gerado pode conter representações entrelaçadas, que dificultam o seu entendimento e a sua evolução [7]. Uma abordagem efetiva para ER deve conciliar a separação de interesses com a necessidade de atender aos interesses de escopo amplo [8]. 
A EROA surge como uma tentativa de se contemplar esse objetivo por meio da utilização de estratégias específicas para modularização de interesses que são difíceis de serem isolados em módulos individuais. Um interesse encapsula um ou mais requisitos especificados pelos stakeholders e um interesse transversal ou early aspect é um interesse que se intercepta com outros interesses do software. A explícita modularização de interesses transversais em nível de requisitos permite que engenheiros de software raciocinem sobre tais interesses de forma isolada desde o início do ciclo de vida do software, o que pode facilitar a criação de estratégias para sua modularização. Na Figura 1 está ilustrado o esquema de um processo genérico para EROA, proposto Chitchyan et al. [4], que foi desenvolvido com base em outros processos existentes na literatura [8][9][12][14] (os retângulos de bordas arredondadas representam as atividades do processo). Figura 1. Processo genérico para EROA (adaptado de Chitchyan et al. [4]). A partir de um conjunto inicial de requisitos disponível, a atividade Identificação de Interesses identifica e classifica interesses do software como base ou transversais. Em seguida, a atividade Identificação de Relacionamento entre Interesses permite que o engenheiro de software conheça as influências e as restrições impostas pelos interesses transversais sobre os outros interesses do software. A atividade Triagem auxilia na decisão sobre quais desses interesses são pertinentes ao software e se há repetições na lista de interesses identificados. A atividade Refinamento de Interesses ocorre quando houver necessidade de se alterar o conjunto de interesses e relacionamentos já identificados. Os interesses classificados como pertinentes são então representados durante a atividade Representação de Interesses em um determinado formato (template), de acordo com a abordagem para EROA utilizada. Esse formato pode ser um texto, um modelo de casos de uso, pontos de vista, entre outros. Por exemplo, no trabalho de Rashid et al. [13][14], interesses são representados por meio de pontos de vista; no de Baniassad e Clarke [9][10] são utilizados temas. Durante a representação dos interesses, o engenheiro de software pode identificar a necessidade de refinamento, ou seja, de incluir/remover interesses e/ou relacionamentos. Isso ocorrendo, ele pode retornar para as atividades anteriores do processo da Figura 1. Finalmente, os interesses representados em um determinado template precisam ser compostos e analisados para a detecção dos conflitos entre interesses do

18 software. Essas análises são feitas durante as atividades de Composição de Interesses e de Análise, Identificação e Resolução de Conflitos. Em seguida, os conflitos identificados são resolvidos com o auxílio dos stakeholders. Em geral, as atividades descritas no processo da Figura 1 são agregadas em quatro atividades maiores, a saber: Identificação, Representação e Composição de interesses e Análise e Resolução de Conflitos. Essas atividades são utilizadas como base para apresentação das características das abordagens para EROA na Seção 3 deste trabalho. III. ABORDAGENS PARA EROA A escolha das seis abordagens para EROA analisadas neste trabalho foi realizada por meio de um processo de Revisão Sistemática (RS), cujo protocolo foi omitido neste trabalho, devido a restrições de espaço. Tais abordagens têm sido consideradas maduras por outros autores em seus estudos comparativos [2][15][17], bem como foram divulgadas em veículos e locais de publicação de qualidade e avaliadas de forma quantitativa com sistemas reais. Apenas os principais conceitos dessas abordagens são apresentados; mais detalhes podem ser encontrados nas referências aqui apresentadas. A. Separação Multidimensional de Interesses Esta abordagem propõe que requisitos devem ser decompostos de forma uniforme com relação a sua natureza funcional, não funcional ou transversal [8]. Tratando todos os interesses da mesma forma, pode-se então escolher qualquer conjunto de interesses como base para analisar a influência dos outros interesses sobre essa base. i) Identificação e Representação de Interesses. Tem por base a observação de que certos interesses, como por exemplo, mobilidade, recuperação de informação, persistência, entre outros aparecem frequentemente durante o desenvolvimento de software. Assim, os autores dividiram o espaço de interesses em dois: i) o dos metainteresses, que consiste em um conjunto abstrato de interesses típicos, como os que foram mencionados acima; e ii) o dos sistema, que contempla os interesses específicos do sistema do usuário. Para se utilizar esta abordagem, os requisitos do sistema devem ser analisados pelo engenheiro de requisitos e categorizados com base nos interesses existentes no espaço de metainteresses, gerando assim os interesses concretos. Para representação dos interesses, tanto os abstratos (metainteresses) quanto os concretos, são utilizados templates XML. ii) Composição de Interesses. Após a representação dos interesses, regras de composição são definidas para se especificar como um determinado interesse influencia outros requisitos ou interesses do sistema. As regras de composição também são especificadas por meio de templates XML. Na Figura 2 é apresentado um exemplo de regra de composição na qual o interesse Recuperação de Informações afeta todos os requisitos do interesse de Customização (especificado pelo atributo id = all ), o requisito 1 do interesse Navegação e o requisito 1 do interesse Mobilidade (especificados pelo atributo id = 1 ), incluindo seus subrequisitos (especificado pelo atributo children = include ). iii) Análise e Resolução de Conflitos. É realizada a partir da observação das interações de um interesse com os outros do sistema. Seja C 1, C 2, C 3,..., C n os interesses concretos de um determinado sistema e SC 1, SC 2, SC 3,..., SC n, os conjuntos de interesses que eles entrecortam, respectivamente. 
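Apenas a título de ilustração, a verificação de interseções de composição definida logo após a Figura 2 pode ser vista como uma operação simples sobre conjuntos. O esboço em Python abaixo é hipotético (não faz parte da abordagem original [8]) e assume que cada interesse concreto é descrito por um dicionário que mapeia os interesses entrecortados para os identificadores dos requisitos afetados:

# Esboço hipotético (não consta da abordagem original [8]).
def intersecao_de_composicao(afeta_c1, afeta_c2):
    # Retorna os interesses afetados por C1 e por C2 sobre o MESMO conjunto de requisitos.
    return {ca for ca in afeta_c1.keys() & afeta_c2.keys()
            if afeta_c1[ca] == afeta_c2[ca]}

# Exemplo baseado na Figura 2: Recuperação de Informações e Mobilidade
# afetam ambos o requisito 1 do interesse Navegação.
recuperacao_informacoes = {"Customização": {"all"}, "Navegação": {"1"}, "Mobilidade": {"1"}}
mobilidade = {"Navegação": {"1"}}
print(intersecao_de_composicao(recuperacao_informacoes, mobilidade))  # {'Navegação'}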
<?xml version="1.0"?> <Composition> <Requirement concern="informationretrieval" id="all"> <Constraint action="provide" operator="during"> <Requirement concern="customisability" id="all" /> <Requirement concern="navigation" id="1" /> <Requirement concern="mobility" id="1" children="include" /> </Constraint> <Outcome action="fulfilled" /> </Requirement> </Composition> Figura 2. Regras de composição para o interesse Recuperação da Informação (adaptado de Moreira et al. [8]). Para se identificar os conflitos entre C 1 e C 2 deve-se analisar a Interseção de Composição SC 1 ∩ SC 2. Uma Interseção de Composição é definida por: seja o interesse C a membro de SC 1 e SC 2. C a aparece na interseção de composição SC 1 ∩ SC 2 se e somente se, C 1 e C 2 afetarem o mesmo conjunto de requisitos presentes em C a. Por exemplo, na Figura 2, nota-se que o interesse Recuperação de Informações afeta o requisito 1 do interesse Navegação. Supondo que o interesse Mobilidade também afete esse requisito, então SC Recuperação de Informações ∩ SC Mobilidade = { Navegação }. Os conflitos são analisados com base no tipo de contribuição que um interesse pode exercer sobre outro com relação a uma base de interesses. Essas contribuições podem ser negativas (-), positivas (+) ou neutras. Uma matriz de contribuição é construída, de forma que cada célula apresenta o tipo da contribuição (+ ou -) dos interesses em questão com relação aos interesses do conjunto de interseções de composição localizado dentro da célula. Uma célula vazia denota a não existência de relacionamento entre os interesses. Se a contribuição é neutra, então apenas o conjunto de interseções de composição é apresentado. B. Theme A abordagem Theme [9][10] apoia EROA em dois níveis: a) de requisitos, por meio da Theme/Doc, que fornece visualizações para requisitos textuais que permitem expor o relacionamento entre comportamentos em um sistema; b) de projeto, por meio da Theme/UML, que permite ao desenvolvedor modelar os interesses base e transversais de um sistema e especificar como eles podem ser combinados. i) Identificação de Interesses. Para esta atividade o engenheiro de software dispõe da visualização de ações, um tipo de visualização dos requisitos do sistema proposto pelos autores. Duas entradas são obrigatórias para se gerar uma visualização de ações: i) uma lista de ações-chaves, isto é, verbos identificados pelo engenheiro de software ao analisar o documento de requisitos; e ii) o conjunto de requisitos do sistema. Na Figura 3 é apresentada a visualização de ações criada a partir de um conjunto de requisitos e de uma lista de ações-chaves de um pequeno sistema de gerenciamento de cursos [9]. As ações-chaves são representadas por losangos e os requisitos do texto, por caixas com bordas arredondadas. Se um requisito contém uma ação-chave em sua descrição, então ele é associado a essa ação-chave por meio de uma seta

19 da caixa com borda arredondada para o losango correspondente à ação. Figura 3. Exemplo de uma visualização de ações [9]. A ideia é utilizar essa visualização para separar e isolar ações e requisitos em dois grupos: 1) o grupo base que é autocontido, ou seja, não possui requisitos que se referem a ações do outro grupo; e 2) o grupo transversal que possui requisitos que se referem a ações do grupo base. Para atingir essa separação em grupos, o engenheiro de software deve examinar os requisitos para classificá-los em um dos grupos. Caso o engenheiro de software decida que uma ação principal entrecorta as demais ações do requisito em questão, então uma seta de cor cinza com um ponto em uma de suas extremidades é traçada da ação que entrecorta para a ação que é entrecortada. Na Figura 3, denota-se que a ação logged entrecorta as ações unregister, give e register. ii) Representação e Composição de Interesses. Para essas atividades utiliza-se Theme/UML, que trabalha com o conceito de temas - elementos utilizados para representar interesses e que podem ser do tipo base ou transversal. Os temas base encapsulam as funcionalidades do domínio do problema, enquanto que os transversais encapsulam os interesses que afetam os temas base. A representação gráfica de um tema é um pacote da UML denotado com o estereótipo <<theme>>. Os temas transversais são representados por meio de gabaritos da UML, que permitem encapsular o comportamento transversal independentemente do tema base, ou seja, sem considerar os pontos reais do sistema que serão afetados. Um gabarito é representado graficamente por um pacote da UML com um parâmetro no canto superior direito, um template. Após a especificação do sistema em temas base e transversais, é necessário realizar a composição deles. Para isso utiliza-se o relacionamento de ligação (bind), que descreve para quais eventos ocorridos nos temas base o comportamento do tema transversal deve ser disparado. Para auxiliar o engenheiro de software a descobrir e representar os temas e seus relacionamentos, visualizações de temas (theme view) são utilizadas. Elas diferem das visualizações de ações, pois não apresentam apenas requisitos e ações, mas também entidades do sistema (informadas pelo engenheiro de software) que serão utilizadas na modelagem dos temas. iii) Análise e Resolução de Conflitos. Os trabalhos analisados sobre a abordagem Theme não apresentaram detalhes sobre a realização dessa atividade. C. EA-Miner A abordagem EA-Miner segue o processo genérico apresentado na Figura 1, o qual foi definido pelos mesmos autores dessa abordagem. Além disso, os autores propuseram uma suíte de ferramentas que apoiam as atividades desse processo [4][5]. Essas ferramentas exercem dois tipos de papéis: i) gerador de informações: que analisa os documentos de entrada e os complementa com informações linguísticas, semânticas, estatísticas e com anotações; e ii) consumidor de informações: que utiliza as anotações e informações adicionais atribuídas ao conjunto de entrada para múltiplos tipos de análise. A principal geradora de informações da abordagem EA- Miner é a ferramenta WMATRIX [6], uma aplicação web para Processamento de Linguagem Natural (PLN), que é utilizada por essa abordagem para identificação de conceitos do domínio do sistema. i) Identificação de Interesses. É realizada pela ferramenta EA-Miner (Early Aspect Mining), que recebe o mesmo nome da abordagem. 
Para identificação de interesses transversais não funcionais, EA-Miner constrói uma árvore de requisitos não funcionais com base no catálogo de Chung e Leite [18]. Os interesses transversais são identificados pela equivalência semântica entre as palavras do documento de requisitos e as categorias desse catálogo. Para identificação de interesses transversais funcionais, EA-Miner utiliza uma estratégia semelhante à da abordagem Theme, detectando a ocorrência de verbos repetidos no documento de requisitos, o que pode sugerir a presença de interesses transversais funcionais. ii) Representação e Composição de Interesses. Para esta atividade, utiliza-se a ferramenta ARCADE (Aspectual Requirements Composition and Decision). Com ela, o engenheiro de software pode selecionar quais requisitos são afetados pelos interesses do sistema, escolher os relacionamentos existentes entre eles e, posteriormente, gerar regras de composição. ARCADE utiliza a mesma ideia de regra de composição da abordagem Separação Multidimensional de Interesses [8]. iii) Análise e Resolução de Conflitos. ARCADE possui também um componente analisador de conflitos, o qual identifica sobreposição entre aspectos com relação aos requisitos que eles afetam. O engenheiro de requisitos é alertado sobre essa sobreposição e decide se os aspectos sobrepostos prejudicam ou favorecem um ao outro. D. Processo baseado em XML para Especificação e Composição de Interesses Transversais O processo de Soeiro et al. [7] é composto das seguintes atividades: identificar, especificar e compor interesses. i) Identificação de Interesses. Ocorre por meio da análise da descrição do sistema feita por parte do engenheiro de software. Os autores indicam que a identificação dos interesses pode ser auxiliada pelo uso de catálogos de requisitos não funcionais, como o proposto por Chung e Leite [18]. Para cada entrada do catálogo, deve-se decidir se o interesse em questão existe ou não no sistema em análise. ii) Representação e Composição de Interesses. Para essas atividades foram criados templates XML com o intuito de coletar e organizar todas as informações a respeito de um interesse. A composição dos interesses do sistema ocorre por regras de composição, que consistem dos seguintes elementos: Term: pode ser um interesse ou outra regra de composição.

20 Operator: define o tipo de operação >>, [> ou ||. C1 >> C2 refere-se a uma composição sequencial e significa que o comportamento de C2 inicia-se se e somente se C1 tiver terminado com sucesso. C1 [> C2 significa que C2 interrompe o comportamento de C1 quando começa a executar. C1 || C2 significa que o comportamento de C1 está sincronizado com o de C2. Outcome: expressa o resultado das restrições impostas pelos operadores comentados anteriormente. iii) Análise e Resolução de Conflitos. Os trabalhos analisados sobre essa abordagem não apresentaram detalhes sobre a realização desta atividade. E. EROA baseada em Pontos de Vista Rashid et al. [13][14] propuseram uma abordagem para EROA baseada em pontos de vista (viewpoints). São utilizados templates XML para especificação dos pontos de vista, dos interesses transversais e das regras de composição entre pontos de vista e interesses transversais do sistema. Além disso, a ferramenta ARCADE automatiza a tarefa de representação dos conceitos mencionados anteriormente com base nos templates XML pré-definidos na abordagem. A primeira atividade dessa abordagem consiste na Identificação e Especificação dos Requisitos do Sistema e, para isso, pontos de vista são utilizados. i) Identificação e Representação de Interesses. É realizada por meio da análise dos requisitos iniciais do sistema pelo engenheiro de software. De modo análogo ao que é feito com os pontos de vista, interesses também são especificados em arquivos XML. Após a identificação dos pontos de vista e dos interesses, é necessário detectar quais desses interesses são candidatos a interesses transversais. Para isso cria-se uma matriz de relacionamento, na qual os interesses do sistema são colocados em suas linhas e os pontos de vista, nas colunas. Cada célula dessa matriz, quando marcada, representa que um determinado interesse exerce influência sobre os requisitos do ponto de vista da coluna correspondente daquela célula. Sendo assim, é possível observar quais pontos de vista são entrecortados pelos interesses do sistema. Segundo os autores, quando um interesse entrecorta os requisitos de vários pontos de vista do sistema, isso pode indicar que se trata de um interesse transversal. ii) Composição de Interesses e Análise e Resolução de Conflitos. Após a identificação dos candidatos a interesses transversais e dos pontos de vista do sistema, os mesmos devem ser compostos por meio de regras de composição e, posteriormente, a análise e resolução de conflitos deve ser realizada. A definição das regras de composição e da atividade de análise e resolução de conflitos segue a mesma ideia da abordagem Separação Multidimensional de Interesses [8]. F. Aspect-oriented Component Requirements Engineering (AOCRE) Whittle e Araújo [11] desenvolveram um processo de alto nível para criar e validar interesses transversais e não transversais. O processo se inicia com um conjunto de requisitos adquiridos pela aplicação de técnicas usuais para este fim. i) Identificação e Representação de Interesses. Os interesses funcionais e não funcionais são identificados a partir dos requisitos do sistema. Os interesses funcionais são representados por meio de casos de uso da UML e os interesses não funcionais, por um template específico com as informações: i) fonte do interesse (stakeholders, documentos, entre outros); ii) requisitos a partir dos quais ele foi identificado; iii) sua prioridade; iv) sua contribuição para outro interesse não funcional; e v) os casos de uso (interesses funcionais) afetados por ele.
Com base na análise do relacionamento entre interesses funcionais e não funcionais, os candidatos a interesses transversais são identificados e, posteriormente, refinados em um conjunto de cenários. Cenários transversais (derivados dos interesses transversais) são representados por IPSs (Interaction Pattern Specifications) e cenários não transversais são representados por diagramas de sequência da UML. IPS é um tipo de Pattern Specifications (PSs) [23], um modo de se representar formalmente características estruturais e comportamentais de um determinado padrão. PSs são definidas por um conjunto de papéis (roles) da UML e suas respectivas propriedades. Dado um modelo qualquer, diz-se que ele está em conformidade com uma PS se os elementos desse modelo, que desempenham os papéis definidos na PS, satisfazem a todas as propriedades definidas para esses papéis. IPSs servem para especificar formalmente a interação entre papéis de um software. ii) Composição de Interesses. Cenários transversais são compostos com cenários não transversais. A partir desse conjunto de cenários compostos e de um algoritmo desenvolvido pelos autores da abordagem, é gerado um conjunto de máquinas de estados executáveis que podem ser simuladas em ferramentas CASE para validar tal composição. iii) Análise e Resolução de Conflitos. Os trabalhos analisados sobre essa abordagem não apresentaram detalhes sobre a realização dessa atividade. IV. CRITÉRIOS PARA COMPARAÇÃO DE ABORDAGENS PARA EROA O conjunto de critérios apresentado nesta seção foi elaborado de acordo com: i) a experiência dos autores deste trabalho que conduziram o processo de RS; ii) os trabalhos relacionados à avaliação de abordagens para identificação de interesses transversais [1][2][15][16][17][19][20]; e iii) os trabalhos originais que descrevem as abordagens selecionadas para comparação [4][5][7][8][9][10][11][13][14]. A confecção desse conjunto de critérios seguiu o seguinte procedimento: i) a partir da leitura dos trabalhos relacionados à avaliação de abordagens para identificação de interesses transversais (obtidos por meio da RS) foi criado um conjunto inicial de critérios; ii) esse conjunto foi verificado pelos autores deste trabalho e aprimorado com novos critérios ou adaptado com os já elencados; e iii) os critérios elencados foram aplicados às abordagens apresentadas na Seção 3. A. Tipo de Simetria: Assimétrica ou Simétrica Abordagens para EROA podem ser classificadas como; a) assimétricas quando há distinção e tratamento explícitos para os interesses transversais e não transversais; b) simétricas

21 quando todos os interesses são tratados da mesma maneira. É importante conhecer tal característica das abordagens para EROA, pois ela fornece indícios sobre: a representatividade da abordagem em questão: em geral, abordagens assimétricas possuem melhor representatividade, uma vez que os modelos gerados por meio delas possuem elementos que fazem distinção explícita entre interesses transversais e não transversais. Isso pode favorecer o entendimento desses modelos e consequentemente do software sob análise; e a compatibilidade com outras abordagens para EROA: conhecer se uma abordagem é simétrica ou não pode auxiliar pesquisadores e profissionais a refletirem sobre o esforço necessário para adaptar essa abordagem as suas necessidades. Por exemplo, criando mecanismos para integrá-la com outras abordagens já existentes. Para cada abordagem analisada com esse critério as seguintes informações devem ser coletadas: nome da abordagem em questão, tipo de simetria (simétrica ou assimétrica) e descrição. Essa última informação especifica os elementos de abstração utilizados para tratar com interesses transversais e não transversais, o que explica a sua classificação como simétrica ou assimétrica. Para todos os critérios mencionados nas próximas subseções, o nome da abordagem em análise foi uma das informações coletadas e não será comentada. B. Cobertura: Completa ou Parcial Com esse critério pretende-se responder à seguinte questão: A abordagem contempla as principais atividades preconizadas pela EROA? Neste trabalho, considera-se como completa a abordagem que engloba as principais atividades descritas no processo genérico para EROA apresentado na Figura 1, isto é, Identificação, Representação e Composição de interesses e Análise e Resolução de Conflitos. Uma abordagem parcial é aquela que trata apenas com um subconjunto (não vazio) dessas atividades. Para cada abordagem analisada com esse critério deve-se obter o tipo de cobertura. Se for cobertura parcial, deve-se destacar as atividades contempladas pela abordagem. C. Propósito: Geral ou Específico Este critério tem a finalidade de avaliar uma abordagem quanto ao seu propósito, ou seja, se é específica para algum tipo de interesse (por exemplo, interesses transversais funcionais, interesses transversais não funcionais, interesses de persistência, segurança, entre outros) ou se é de propósito geral. Se o propósito da abordagem for específico, deve-se destacar os tipos de interesses contemplados por ela. D. Técnicas Utilizadas Este critério elenca as técnicas utilizadas pela abordagem para realização de suas atividades. Por exemplo, para a atividade de identificação de interesses transversais não funcionais, uma abordagem A pode utilizar técnicas de PLN, juntamente com um conjunto de palavras-chave, enquanto que outra abordagem B pode utilizar apenas catálogos de requisitos não funcionais e análise manual dos engenheiros de software. Para esse critério, as seguintes informações são obtidas: i) atividade da EROA contemplada pela abordagem; e ii) tipo de técnicas utilizadas para realização dessa atividade. E. Nível de Envolvimento do Usuário: Amplo ou Pontual O envolvimento do usuário é amplo quando há participação efetiva do usuário na maior parte das atividades propostas pela abordagem, sem que ele seja auxiliado por qualquer tipo de recurso ou artefato que vise a facilitar o seu trabalho. 
Essa participação efetiva pode ocorrer por meio da: i) inclusão de informações extras; ii) realização de análises sobre artefatos de entrada e/ou saída; e iii) tradução de informações de um formato para outro. Um exemplo de participação efetiva do usuário ocorre quando ele deve fornecer informações adicionais, além daquelas constantes no documento de requisitos do sistema para identificação dos interesses do sistema (por exemplo, um conjunto de palavras-chave a ser confrontado com o texto do documento de requisitos). Outro exemplo seria se a representação de interesses do sistema fosse feita manualmente, pelo usuário, de acordo com algum template pré-estabelecido (em um arquivo XML ou diagrama da UML). Um envolvimento pontual significa que o usuário pode intervir no processo da EROA para tomar certos tipos de decisões. Por exemplo, resolver um conflito entre dois interesses que se relacionam. Sua participação, porém, tem a finalidade de realizar atividades de níveis mais altos de abstração, que dificilmente poderiam ser automatizadas. É importante analisar tal critério, pois o tipo de envolvimento do usuário pode impactar diretamente na escalabilidade da abordagem e na produtividade proporcionada pela mesma. O envolvimento excessivo do usuário pode tornar a abordagem mais dependente da sua experiência e propensa a erros. Para comparação das abordagens com base neste critério, deve-se observar: i) o tipo de envolvimento do usuário exigido pela abordagem; e ii) a descrição das atividades que o usuário deve desempenhar. F. Escalabilidade Com esse critério, pretende-se conhecer qual é o porte dos sistemas para os quais a abordagem em análise tem sido aplicada. Embora algumas abordagens atendam satisfatoriamente a sistemas de pequeno porte, não há garantias que elas sejam eficientes para sistemas de médio e grande porte. Os problemas que podem surgir quando o tamanho do sistema cresce muito, em geral, estão relacionados: i) à complexidade dos algoritmos utilizados pela abordagem; ii) à necessidade de envolvimento do usuário, que dependendo do esforço requisitado, pode tornar impraticável a aplicação da abordagem em sistema de maior porte; e iii) à degradação da cobertura e precisão da abordagem; entre outros. Para esse critério as seguintes informações devem ser coletadas: i) o nome do sistema utilizado no estudo de caso em que a abordagem foi avaliada; ii) os tipos de documentos utilizados; iii) as medidas de tamanho/complexidade do sistema (em geral, quando se trata de documentos textuais, os tamanhos são apresentados em números de páginas e/ou palavras); e iv) a referência da publicação na qual foi relatada a aplicação desse sistema à abordagem em questão.

22 G. Apoio Computacional Para quais de suas atividades a abordagem em análise oferece apoio computacional? Essa informação é importante, principalmente, se o tipo de envolvimento dos usuários exigido pela abordagem for amplo. Em muitos casos, durante a avaliação de uma abordagem para EROA, percebe-se um relacionamento direto entre os critérios Tipo de Envolvimento do Usuário e Apoio Computacional. Se a abordagem exige envolvimento amplo do usuário, consequentemente, ele deve possuir fraco apoio computacional; se exige envolvimento pontual, possivelmente deve oferecer apoio computacional adequado. Porém, essa relação precisa ser observada com cuidado, pois pode haver casos em que o fato de uma atividade exigir envolvimento pontual do usuário para sua execução não esteja diretamente ligado à execução automática da mesma. Por exemplo, sejam A e B duas abordagens para EROA que exijam que o usuário informe um conjunto de palavraschave para identificação de interesses em um documento de requisitos. A abordagem A possui um apoio computacional que varre o texto do documento de requisitos, selecionando algumas palavras mais relevantes (utilizando-se de técnicas de PLN) que possam ser utilizadas pelo engenheiro de software como palavras-chave. A abordagem B não possui apoio computacional algum, porém disponibiliza uma ontologia com termos do domínio do sistema em análise e um dicionário de sinônimos desses termos, que podem ser utilizados pelo engenheiro de software como diretrizes para elencar o conjunto de palavras-chave exigido pela abordagem. Neste caso, as duas abordagens poderiam ser classificadas como pontuais com relação ao critério Tipo de Envolvimento do Usuário, mesmo B não possuindo apoio computacional. Entretanto, um engenheiro de software que esteja utilizando a abordagem A, provavelmente, terminará a tarefa de definição do conjunto de palavras-chave em menor tempo do que outro que esteja utilizando a abordagem B. Assim, deve-se conhecer quais atividades da abordagem para EROA são automatizadas. Para cada abordagem comparada com esse critério deve-se obter: i) as atividades da EROA contempladas pela abordagem em questão; ii) os nomes dos apoios computacionais utilizados para automatização dessas atividades; e iii) a referência da publicação, na qual o apoio computacional foi proposto/apresentado. Uma abordagem pode oferecer mais de um apoio computacional para uma mesma atividade. H. Tipo de Avaliação da Abordagem A quais tipos de avaliação a abordagem em questão para EROA têm sido submetida? Para as avaliações realizadas, há um relatório adequado sobre a acurácia da abordagem, ressaltando detalhes importantes como cobertura, precisão e tempo necessário para execução das atividades dessa abordagem? Para Wohlin et al. [22], a avaliação qualitativa está relacionada à pesquisa sobre o objeto de estudo, sendo os resultados apresentados por meio de informações descritas em linguagem natural, como neste artigo. A avaliação quantitativa, geralmente, é conduzida por meio de estudos de caso e experimentos controlados, e os dados obtidos podem ser comparados e analisados estatisticamente. Estudos de caso e experimentos visam a observar um atributo específico do objeto de estudo e estabelecer o relacionamento entre atributos diferentes, porém, em experimentos controlados o nível de controle é maior do que nos estudos de caso. 
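Para as métricas de cobertura e precisão mencionadas no início deste critério, pode-se assumir a formulação usual herdada da área de recuperação de informação (a notação abaixo é apenas ilustrativa): sendo $I_R$ o conjunto de interesses de referência (oráculo), $I_T$ o conjunto de interesses identificados pela abordagem e $I_C = I_T \cap I_R$ o subconjunto identificado corretamente, tem-se

\[ \text{cobertura} = \frac{|I_C|}{|I_R|} \qquad \text{e} \qquad \text{precisão} = \frac{|I_C|}{|I_T|}. \]

Valores próximos de 100% indicam, respectivamente, que poucos interesses do oráculo deixaram de ser identificados e que poucos falsos positivos foram reportados.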
Para este critério deve-se destacar: i) o(s) tipo(s) de avaliação(ões) realizada(s) sobre a abordagem, listando a referência da avaliação conduzida (os tipos de avaliação são: qualitativa, estudo de caso e experimento controlado); e ii) os resultados obtidos com essa(s) avaliação(ões) realizada(s). Para o item (ii) sugere-se a coleta dos valores médios obtidos para as seguintes métricas: cobertura, precisão e tempo de aplicação da abordagem. Tais métricas foram sugeridas, pois são amplamente utilizadas para medição da eficácia de produtos e processos em diversas áreas de pesquisa, tais como recuperação da informação e processamento de linguagem natural, entre outras. Na área de EROA, essas métricas têm sido utilizadas em trabalhos relacionados à identificação de interesses tanto em nível de código [21], quanto em nível de requisitos [2][15]. A análise conjunta dos dados deste critério com os do critério Escalabilidade pode revelar informações importantes sobre a eficácia e eficiência de uma abordagem para EROA. V. AVALIAÇÃO DAS ABORDAGENS PARA EROA As abordagens para EROA apresentadas na Seção 3 foram comparadas com base nos critérios apresentados na Seção 4. As siglas utilizadas para o nome das abordagens são: i) SMI - Separação Multidimensional de Interesses; ii) EA-Miner - Early-Aspect Mining; iii) Theme - Abordagem Theme; iv) EROA/XML - Processo baseado em XML para Especificação e Composição de Interesses Transversais; v) EROA/PV - EROA baseada em Pontos de Vista; e vi) AOCRE - Aspect- Oriented Component Requirements Engineering. Na Tabela 1 encontra-se a avaliação dessas abordagens quanto ao tipo de simetria, com breve justificativa para o tipo escolhido. Tabela 1. TIPO DE SIMETRIA DAS ABORDAGENS PARA EROA. Abordagem SMI EA-Miner Theme EROA/XML EROA/PV AOCRE Tipo de Simetria Simétrica Assimétrica Assimétrica Simétrica Assimétrica Assimétrica Descrição Tanto os interesses transversais quanto os não transversais são tratados de modo uniforme. Todos são denominados interesses e podem influenciar /restringir uns aos outros. Os interesses transversais são tratados como aspectos e os não transversais, como pontos de vista. Os interesses transversais são tratados como temas transversais e os não transversais, como temas base. Tanto os interesses transversais quanto os não transversais são tratados apenas como interesse (concerns). Os interesses transversais são tratados como aspectos e os não transversais, como pontos de vista. Os interesses transversais são tratados como IPSs e os não transversais, como diagramas de sequência. Todas as abordagens analisadas foram consideradas como de propósito geral, pois contemplam tanto interesses funcionais quanto não funcionais. Quanto à cobertura, SMI,

23 EA-Miner e EROA/PV são completas, uma vez que atendem às principais atividades da EROA definidas no processo da Figura 1. As abordagens Theme, EROA/XML e AOCRE foram consideradas parciais, uma vez que não apresentam apoio à atividade de Análise e Resolução de Conflitos. As técnicas utilizadas por cada atividade das abordagens comparadas são descritas na Tabela 2. Nota-se que as técnicas mais utilizadas para a atividade Identificação de Interesses são o uso de palavras-chave e catálogos para interesses não funcionais. A técnica baseada em palavras-chave é fortemente dependente da experiência dos engenheiros de software que a aplica. Por exemplo, um profissional com pouca experiência no domínio do software em análise ou sobre os conceitos de interesses transversais pode gerar conjuntos vagos de palavraschave e que podem gerar muitos falsos positivos/negativos. Além disso, técnicas como essas são ineficazes para detecção de interesses implícitos, isto é, interesses que não aparecem descritos no texto do documento de requisitos. Já para Representação e Composição de interesses, a maioria das abordagens optou por criar seus próprios modelos de representação e composição de interesses utilizando para isso a linguagem XML. O uso de XML é, muitas vezes, justificado pelos autores das abordagens por permitir a definição/representação de qualquer tipo de informação de forma estruturada e por ser uma linguagem robusta e flexível. Outra forma de representação, utilizada pelas abordagens Theme e AOCRE, ocorre por meio de modelos bem conhecidos da UML, como diagramas de sequência e estados, para realização dessas atividades. Para a atividade Análise e Resolução de Conflitos, também parece haver um consenso na utilização de matrizes de contribuição e templates XML. Tabela 2. TÉCNICAS UTILIZADAS PARA REALIZAÇÃO DAS ATIVIDADES CONTEMPLADAS PELAS ABORDAGENS PARA EROA Identificação de Interesses Palavras-chave e Técnicas de Visualização Palavras-chave e Catálogo de INF Catálogo de INF Estendido Representação de Interesses Temas e Diagramas UML Templates XML Template XML 4 Catálogo de INF Template XML 5 6 Matriz de Relacionamento Casos de Uso, Template específico para INF. Pontos de Vista e Templates XML Diagramas de Sequência e IPSs Composição de Interesses Temas e Templates UML Templates XML e Regras de Composição Templates XML e Regras de Composição Templates XML e Regras de Composição Templates XML e Regras de Composição Diagramas de Sequência, IPSs e Máquinas de Estado Análise & Resolução de Conflitos - Matriz de Contribuição e Templates XML Matriz de Contribuição e Templates XML - Matriz de Contribuição e Templates XML Legenda: 1) Theme; 2) EA-Miner; 3) SMI; 4) EROA/XML; 5) EROA/PV; 6) AOCRE; INF: Interesses não funcionais. O tipo de envolvimento do usuário requerido pelas abordagens analisadas é apresentado na Tabela 3. A maioria das abordagens (SMI, Theme, EROA/XML, EROA/PV e AOCRE) foi classificada como as que exigem envolvimento amplo de seus usuários. Isto pode ser um fator impactante na - escalabilidade e na acurácia da abordagem, quando sistemas de larga escala forem analisados com elas. A abordagem EA-Miner, entretanto, requer interferência pontual do usuário, sendo a sua participação em atividades mais estratégicas do que mecânicas. Isso se deve, em parte, à utilização de uma suíte de ferramentas computacionais de apoio à execução desta abordagem. Tabela 3. TIPO DE ENVOLVIMENTO DO USUÁRIO REQUERIDO PELAS ABORDAGENS PARA EROA. 
Abordagem Envolvimento A P SMI X - EA-Miner - X Theme X - EROA/XML EROA/PV AOCRE X X X Legenda: A (Amplo); P (Pontual) Atividades Desenvolvidas Especificação dos interesses concretos do sistema a partir de um conjunto de metainteresses; Representação dos interesses em templates de arquivos XML; Definição de regras de composição; Definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s). Tomada de decisão com relação às palavras ambíguas detectadas no documento de requisitos; Definição de regras de composição; Definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s). Definição de um conjunto de ações e entidades-chave; Análise manual das visualizações geradas pela abordagem com o objetivo de encontrar interesses base e transversais; Construção manual dos temas a partir das visualizações geradas pela abordagem; Definição de regras de composição. Identificação manual dos interesses do sistema; Representação dos interesses em templates de arquivos XML; Definição de regras de composição. Identificação manual dos interesses do sistema; Representação dos interesses em templates de arquivos XML; Definição de regras de composição; Definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s). Identificação manual dos interesses do sistema; Representação dos interesses em cenários; Definição de diagramas de sequência e IPSs. Na 0 estão descritas as ferramentas disponibilizadas por cada abordagem para automatização de suas atividades. Notase que a abordagem mais completa em termos de apoio computacional é a EA-Miner, pois todas as suas atividades são automatizadas em partes ou por completo. Por exemplo, a atividade de composição de interesses é totalmente automatizada pela ferramenta ARCADE. O usuário precisa apenas selecionar os interesses a serem compostos e toda regra de composição é gerada automaticamente. ARCADE trabalha com base nos conceitos da abordagem SMI, automatizando as suas atividades. Nota-se ainda que as atividades melhor contempladas com recursos computacionais

24 são Representação e Composição de Interesses. Dessa forma, as atividades para EROA que exigem maior atenção da comunidade científica para confecção de apoios computacionais são Identificação de Interesses e Análise e Resolução de Conflitos. A aplicação dos critérios escalabilidade e tipo de avaliação para as abordagens analisadas são apresentados na 0 e Tabela 6. Tabela 4. APOIO COMPUTACIONAL DAS ABORDAGENS PARA EROA. Abordagem Atividade Apoio Computacional Ref Theme Representação de Interesses Plugin Ecplise Theme/UML [9] EA-Miner SMI EROA/XML EROA/PV AOCRE Abordagem SMI, EROA/XML EROA/PV EA-Miner e Theme Identificação de Interesses Triagem Representação e Composição de Interesses e Análise e Resolução de Conflitos Representação e Composição de Interesses e Análise e Resolução de Conflitos Representação e Composição de Interesses Representação e Composição de Interesses e Análise e Resolução de Conflitos Composição de Interesses EA-Miner, WMATRIX e RAT(Requirement Analysis Tool) KWIP (Key Word In Phrase) ARCADE [4] ARCADE [4] APOR (AsPect-Oriented Requirements tool) [7] ARCADE [4] Algoritmo proposto pelos autores Tabela 5. ESCALABILIDADE DAS ABORDAGENS PARA EROA. Sistema Health Watcher Complaint System ATM System Documentos Utilizados Documento de Requisitos e Casos de Uso Documento de Requisitos Documento de Requisitos Tamanho 19 páginas; palavras. [11] Ref [2] 230 páginas. [15] 65 páginas. [15] As abordagens SMI, Theme e EROA/PV são as abordagens mais avaliadas, tanto qualitativamente quanto quantitativamente. Isso ocorre, pois essas foram algumas das primeiras abordagens para EROA. Outros pontos interessantes são que: i) EROA/XML não havia ainda sido avaliada qualitativamente, de acordo com a revisão de literatura realizada neste trabalho; e ii) não foram encontrados estudos quantitativos que contemplassem a abordagem AOCRE. Quanto à escalabilidade ressalta-se que a maioria delas, com exceção da AOCRE, foi avaliada com documentos de requisitos de médio e grande porte. EA-Miner e Theme foram avaliadas com documentos de requisitos mais robustos (295 páginas de documentos, no total). Com os valores presentes na Tabela 6 percebe-se que as abordagens SMI, Theme, EROA/XML e EROA/PV apresentaram os maiores tempos para execução das atividades da EROA em proporção ao tamanho do documento de requisitos. EA-Miner foi classificada neste trabalho como a única abordagem que exige envolvimento pontual de seus usuários. Infere-se que, pelo envolvimento pontual de seus usuários, ela apresentou os melhores resultados com relação ao tempo para realização das atividades da EROA. Com base nos estudos de casos realizados observa-se que quanto à cobertura e à precisão das abordagens, a identificação de interesses base é melhor do que a de interesses transversais. A justificativa para isso é que interesses base são mais conhecidos e entendidos pela comunidade científica [2]. Além disso, tais requisitos aparecem no documento de requisitos de forma explícita, mais bem localizada e isolada, facilitando sua identificação. Dessa forma, a atividade de identificação de interesses em documentos de requisitos configura-se ainda um problema de pesquisa relevante e desafiador e que merece a atenção da comunidade científica. VI. TRABALHOS RELACIONADOS A literatura contém diversos trabalhos com o objetivo de avaliar qualitativa ou quantitativamente as abordagens para EROA. Herrera et al. 
[15] apresentaram uma análise quanto à acurácia das abordagens EA-Miner e Theme, quando são utilizados documentos de requisitos de dois sistemas de software reais. As métricas relacionadas à eficácia e à eficiência das abordagens, como cobertura, precisão e tempo, foram as que receberam maior enfoque. Sendo assim, poucos aspectos qualitativos das abordagens analisadas foram levantados, como tipo de simetria, cobertura, entre outros. Nessa mesma linha, Sampaio et al. [2] apresentaram um estudo quantitativo para as abordagens: EROA/PV, SMI, EROA/XML e Goal-based AORE. Foi avaliada a acurácia e a eficiência dessas abordagens. Por se tratarem de abordagens com características bem distintas, os autores elaboraram também um mapeamento entre os principais conceitos delas e criaram um esquema de nomenclatura comum para EROA. Outros trabalhos, em formato de surveys [1][16][17] [19][20], foram propostos com o intuito de comparar abordagens para EROA, descrevendo as principais características de cada abordagem. Entretanto, cada um desses trabalhos considerou apenas um conjunto restrito e distinto de características dessas abordagens, criando assim, um gap que dificulta a compreensão mais abrangente das características comuns e específicas de cada abordagem. Singh e Gill [20] e Chitchyan et al. [1] fizeram a caracterização de algumas abordagens para EROA, sem utilizar um conjunto de critérios. Bakker et al. [19] compararam algumas abordagens com relação: i) ao objetivo da abordagem; ii) às atividades contempladas; iii) ao apoio computacional oferecido; iv) aos artefatos utilizados; e v) à rastreabilidade. Porém, não há informações sobre a acurácia dessas abordagens, nem sobre os estudos avaliativos realizados com elas. Bombonatti e Melnikoff [16] compararam as abordagens considerando apenas os tipos de interesses (funcionais ou não funcionais) e atividades da EROA contemplados por essas abordagens. Rashid et al. [17] comparam as abordagens para EROA sob o ponto de vista dos objetivos da Engenharia de Requisitos, separação de interesses, rastreabilidade, apoio à verificação de consistência, entre outros. A principal diferença deste trabalho em relação aos demais comentados anteriormente está no fato de que o conjunto de critérios proposto contempla não apenas os pontos qualitativos comuns e específicos das abordagens para EROA analisadas, mas proporciona um vínculo com informações quantitativas obtidas por outros pesquisadores em trabalhos relacionados.

25 Abordagem SMI EA-Miner Theme Tabela 6. TIPOS DE AVALIAÇÃO REALIZADOS COM AS ABORDAGENS PARA EROA. Tipo de Avaliação Relatório Q EC EXC Cobertura Precisão Tempo - [2] - IB: 100%; ITF: 50%; ITNF: 70% IB: 88%; ITF: 100%; ITNF: 77% 104 min [1][16][17][20] Complaint System Complaint System - [15] - IB: 64%; ITF: 64%; ITNF: 45% IB: 31%; ITF: 78%; ITNF: 71% Complaint System: 70 min ATM System ATM System ATM System: 140 min IB: 86%; ITF: 80%; ITNF: 100% IB: 35%; ITF: 63%; ITNF: 71% [1][17] [15] - Complaint System IB: 73%; ITF: 55%; ITNF: 73% ATM System IB: 86%; ITF: 73%; ITNF: 40% Complaint System IB: 48%; ITF: 86%; ITNF: 80% ATM System IB: 50%; ITF: 91%; ITNF: 50% Complaint System: 760 min ATM System: 214 min [1] [17][19] [20] EROA/XML - [2] - IB: 100%; ITF: 50%; ITNF: 55% IB: 88%; ITF: 100%; ITNF: 100% 173 min EROA/PV - [2] - IB: 100%; ITF: 0%; ITNF: 100% IB: 70%; ITF: 0%; ITNF: 83% 62 min [1][16][17][19][20] AOCRE [1][17] [20] Legenda: IB: Interesses Base; ITF: Interesses Transversais Funcionais; ITNF: Interesses Transversais Não Funcionais. Q: Qualitativa. EC: Estudo de Caso. EXC: Experimento Controlado. Além disso, tais critérios compreendem um framework comparativo que pode ser estendido para contemplar outros tipos de critérios relacionados à área de ER. VII. CONSIDERAÇÕES FINAIS A grande variedade de abordagens para EROA existentes na literatura, com características diferentes, tem tornado difícil a escolha da mais adequada às necessidades dos usuários. Este trabalho apresentou um conjunto de critérios para comparação de abordagens para EROA, concebidos com base nas características comuns e especificidades das principais abordagens disponíveis na literatura, bem como em trabalhos científicos que avaliaram algumas dessas abordagens. Além disso, realizou-se a aplicação desses critérios sobre três abordagens bem conhecidas. Essa comparação pode servir como guia para que o engenheiro de software escolha a abordagem para EROA mais adequada as suas necessidades. Também foram destacados alguns dos pontos fracos das abordagens analisadas, como por exemplo, a baixa precisão e cobertura para interesses transversais não funcionais. Como trabalhos futuros, pretende-se: i) expandir o conjunto de critérios aqui apresentado a fim de se contemplar características específicas para cada uma das fases da EROA; ii) aplicar o conjunto de critérios expandido às abordagens já analisadas com o intuito de se obter novas informações sobre elas, bem como a novos tipos de abordagens existentes na literatura; iii) desenvolver uma aplicação web que permita aos engenheiros de software e pesquisadores da área de EROA pesquisarem e/ou divulgarem seus trabalhos utilizando o conjunto de critérios elaborados; e iv) por último, propor uma nova abordagem que reutilize os pontos fortes e aprimore os pontos fracos de cada abordagem analisada. REFERÊNCIAS [1] Chitchyan, R.; Rashid, A; Sawyer, P.; Garcia, A.; Alarcon, M. P.; Bakker, J.; Tekinerdogan, B.; Clarke, S.; Jackson, A. Report synthesizing state-of-theart in aspect-oriented requirements engineering, architectures and design. Lancaster University: Lancaster, p , Technical Report. [2] Sampaio, A.; Greenwood P.; Garcia, A. F.; Rashid, A. A Comparative Study of Aspect-Oriented Requirements Engineering Approaches. In 1 st International Symposium on Empirical Software Engineering and Measurement (ESEM '07) p , [3] Dijkstra, E. W. A Discipline of Programming. Pearson Prentice Hall, 217 p., ISBN: , [4] Chitchyan, R.; Sampaio, A.; Rashid, A.; Rayson, P. 
A tool suite for aspectoriented requirements engineering. In International Workshop on Early Aspects at ICSE. ACM, p , [5] Sampaio, A.; Chitchyan, R.; Rashid, A.; Rayson, P. EA-Miner: a Tool for Automating Aspect-Oriented Requirements Identification, Int'l Conf. Automated Software Engineering (ASE), ACM, pp , [6] WMATRIX. Corpus Analysis and Comparison Tool. Disponível em: Acessado em: Abril de [7] Soeiro E.; Brito, I. S; Moreira, A. An XML-Based Language for Specification and Composition of Aspectual Concerns. In 8 th International Conference on Enterprise Information Systems (ICEIS) [8] Moreira, A.; Rashid, A.; Araújo, J. Multi-Dimensional Separation of Concerns in Requirements Engineering. 13 th International Conference on Requirements Engineering (RE). Proceedings p , [9] Baniassad, E.; Clarke, S. Theme: An approach for aspect-oriented analysis and design. In 26 th Int. Conf. on Software Engineering (ICSE 04) [10] Clarke, S.; Baniassad, E. Aspect-Oriented Analysis and Design : The Theme Approach: Addison-Wesley, [11] Whittle J.; Araújo, J. Scenario Modeling with Aspects. IEEE Software, v. 151(4), p , [12] Yijun Y.; Leite, J.C.S.P.; Mylopoulos, J. From Goals to Aspects: Discovering Aspects from Requirements Goal Models. In International Conference on Requirements Engineering (RE) [13] Rashid, A.; Moreira, A.; Araújo, J. Modularisation and composition of aspectual requirements. In 2 nd International Conference on Aspect-Oriented Software Development (AOSD 03). ACM, [14] Rashid, A.; Sawyer, P.; Moreira, A.; Araújo, J. Early Aspects: a Model for Aspect-Oriented Requirements Engineering. In International Conference on Requirements Engineering (RE) [15] Herrera, J. et al. Revealing Crosscutting Concerns in Textual Requirements Documents: An Exploratory Study with Industry Systems. In 26th Brazilian Symposium on Software Engineering. p , [16] Bombonatti, D. L. G.; Melnikoff, S. S. S. Survey on early aspects approaches: non-functional crosscutting concerns integration in software sytems. In 4 th World Scientific and Engineering Academy and Society (WSEAS). Wisconsin, USA, p , [17] Rashid, A.; Chitchyan, R. Aspect-oriented requirements engineering: a roadmap. In 13 th Int. Workshop on Early Aspects (EA). p , [18] Chung, L.; Leite, J. S. P. Non-Functional Requirements in Software Engineering. Springer, 441 p., [19] Bakker J.; Tekinerdoğan, B.; Akist, M. Characterization of Early Aspects Approaches. In Early Aspects: Aspect-Oriented Requirements Engineering and Architecture Design [20] Singh, N.; Gill, N. S. Aspect-Oriented Requirements Engineering for Advanced Separation of Concerns: A Review. International Journal of Computer Science Issues (IJCSI). v 8(5) [21] Kellens, A.; Mens, K., and Tonella, P. A survey of automated code-level aspect mining techniques. Transactions on Aspect-Oriented Software Development IV, v. 4640, p , [22] Wohlin, C.; Runeson, P.; Höst, M.; Regnell, B.; Wesslén, A. Experimentation in Software Engineering: an Introduction [23] France, R.; Kim, D.; Ghosh, S.; Song, E. A UML-based pattern specification technique. IEEE Trans. Software Engineering. v. 30 (3), pp

26 Using Tranformation Rules to Align Requirements and Archictectural Models Monique Soares, Carla Silva, Gabriela Guedes, Jaelson Castro, Cleice Souza, Tarcisio Pereira Centro de Informática Universidade Federal de Pernambuco UFPE Recife, Brasil {mcs4, ctlls, ggs, jbc, Abstract In previous works we have defined the STREAM strategy to align requirements and architectural models. It includes four activities and several transformations rules that could be used to support the systematic generation of a structural architectural model from goal oriented requirements models. The activities include the Preparation of Requirements Models, Generation of Architectural Solutions, Selection of Architectural Solution and Refinement of the Architecture. The first two activities are time consuming and rely on four horizontal and four vertical transformation rules which are current performed manually, requiring much attention from the analyst. For example, the first activity consists of the refactoring of the goal models, while the second one derives architectural models from the refactored i* (istar) models. In this paper we automate seven out of the eight transformation rules of the two first activities of STREAM approach. The transformation language used to implement the rules was QVTO. We rely on a running example to illustrate the use of the automated rules. Hence, our approach has the potential to improve the process productivity and the quality of the models produced. Keywords- Requirements Engineering, Software Architecture, Transformation Rules, Automation I. INTRODUCTION The STREAM (A STrategy for Transition between REquirements Models and Architectural Models) is a systematic approach to integrate requirements engineering and architectural design activities, based on model transformation, to generate architectural models from requirements models [1]. It generates structural architectural models, described in Acme [4] (the target language), from goal-oriented requirements models, expressed in i* (istar) [3] (i.e. the source language). This approach has four activities, namely: Prepare Requirements Models, Generate Architectural Solutions, Select Architectural Solution and Refine Architecture. The first two activities are time consuming and rely on horizontal and vertical transformation rules (HTRs and VTRs), respectively. Currently, these transformations rules are made manually, requiring much attention from the analyst. However, they seem likely to be automated, which could reduce not only the human effort required to generate the target models, but also minimize the number of errors produced during the process. Hence, our proposal is to use the QVT [2] transformation language to properly define the rules, and also to develop some tool support to execute them. Therefore, two research questions are addressed by this paper: Is it possible to automate the transformation rules defined in the first two STREAM activities, namely: Prepare Requirements Models, Generate Architectural Solutions? And, if so, how could these rules be automated? Henceforth, the main objective of this paper is to automate the transformation rules defined by the first two phases of STREAM process 1. To achieve this goal it is necessary to: describe the transformation rules using a suitable transformation language; make the vertical and horizontal transformation rules compatible with the modeling environment used to create the goal-oriented requirements models, i.e. 
the istartool [6]; make the vertical transformation rules compatible with the modeling environment used to create the structural architectural models, i.e. the AcmeStudio [4]. In order to automate the HTRs and VTRs proposed by the STREAM process, it was necessary to choose a language that would properly describe the transformation rules and transform the models used in STREAM approach. We opted for the QVTO (Query / View / Transformation Operational) language [2], a transformation language that is integrated with Eclipse environment [16] and that is better supported and maintained. Note that as input of the first activity of the STREAM process is based on an i* goal model. The istartool [6] is used to generate XMI file of the goal-oriented requirements model. This file is read by the Eclipse QVTO plugin, which generates the XMI file of the Acme architectural model. Note that this file is consistent with the metamodel created in accordance with the AcmeStudio tool. The rest of the paper is organized as follows. Section II presents the theoretical background. Section III describes the horizontal transformations rules in QVTO. In Section IV, we present the vertical transformation rules in QVTO. In order to illustrate our approach, in Section V we use the BTW example [10]. Section VI presents some related works. Finally, Section VII concludes the paper with a brief explanation of the contributions achieved and the proposal of future work. 1 Note that, it is out of scope of this paper to support the other two phases of the approach (Select Architectural Solution, Refine Architecture).

27 Hurt Help D II. BACKGROUND In this section we present the baseline of this research: the original rules from the STREAM approach and the model transformation language (QVT) used to implement HTRs and VTRs of STREAM. A. STREAM STREAM is a systematic approach to generate architectural models from requirements models based on model transformation [1]. The source and target modelling languages are i* for requirements modelling and Acme for architectural description, respectively. The STREAM process consists of the following activities: 1) Prepare requirements models, 2) Generate architectural solutions, 3) Choose an architectural solution and 4) Derive architecture. Horizontal Transformation Rules (HTRs) are part of the first activity. They are useful to increase the modularity of the i* requirements models. Vertical Transformation Rules (VTRs) are proposed in second activity. They are used to derive architectural models from the modularized i* requirements model. Non-functional requirements (NFRs) are used in the third activity to select one of the possible architectural descriptions obtained. Depending on the NFR to be satisfied, some architectural patterns can be applied, in activity 4. The first STREAM activity is concerned with improving the modularity of the expanded system actor. It allows delegation of different parts of a problem to different software actors (instead of having a unique software actor). In particular, it is sub-divided into three steps: (i) analysis of internal elements (identify which internal elements can be extracted from the original software actor and relocated to a new software actor); (ii) application of horizontal transformation rules (the actual extraction and relocation of the identified internal elements); and, (iii) evaluation of the i* model (checking if the model needs to be modularized again, i.e., return to the step 1). In order to develop these steps, it is necessary to use, respectively: Heuristics to guide the decomposition of the software's actor; A set of rules to transform i* models; Metrics for assessing the degree of modularization of both the initial and modularized i* models. This is a semi-automatic process, since not all the activities can be automated. For example, the step 1 of the first activity cannot be automated because the analyst is the one in charge to choose the sub-graph to be moved to another actor. The Horizontal Transformation Rule 1 (HTR1) moves a sub-graph previously selected. Hence, HTR1 cannot be fully automatized because it always depends on the sub-graph chosen by the analyst. Observe that after applying the HTR1, the resulting model may not be in compliance with the i* syntax. So, the next HTRs are to correct possible syntax errors. The Horizontal Transformation Rule 2 (HTR2) moves a means-end link crossing actor s boundary. HTR2 considers the situation where the sub-graph moved to another actor has the root element as a means in a means-end relationship. The Horizontal Transformation Rule 3 (HTR3) moves a contribution link crossing the actor s boundary. HTR3 considers the situation where the sub-graph moved to another actor has a contribution relationship with others elements that were not moved. The Horizontal Transformation Rule 4 (HTR4) moves a task-decomposition link crossing the actor s boundary. HTR4 considers the situation where the sub-graph moved has a task-decomposition relationship with other elements that were not moved. Table 1 shows examples of these rules. 
The graph to be moved in HTR1 is highlighted with a dashed line and labelled with G.

TABLE I. EXAMPLE OF HORIZONTAL TRANSFORMATION RULES ADAPTED FROM [8]
(For each rule, HTR1 to HTR4, the table shows (a) the original i* model fragment and (b) the resulting model after applying the rule; the graphical i* fragments themselves are not reproduced here.)

The transformation rules are intended to delegate internal elements of the software actor to other (new) software actors. This delegation must ensure that the new actors have a dependency relationship with the original actor. Thus, the original model and the final model are supposed to be semantically equivalent. At the end of the first activity, the actors representing the software are easier to understand and maintain, since there are more actors with fewer internal elements.

In the second STREAM activity (Derive Architectural Solutions), transformation rules are used to transform an i* requirements model into an initial Acme architectural model. In this case, we use the VTRs.

28 In order to facilitate the understanding, we have separated the vertical transformation rules into four rules. VTR1 maps the i* actors into Acme components. VTR2 maps the i* dependencies into Acme connectors. VTR3 maps a depender actor as a required port of Acme connector. And last but not least, VTR4 maps the dependee actor to a provided port of an Acme connector. Note the goal of this paper is to fully automate three HTRs (HTR2, HTR3 and HTR4) and all VTRs proposed by the STREAM. HTR1 is not amenable to automation. First, we specify them in QVTO [2]. It is worth noting that to create the i* models, we have relied on the istartool tool [6]. B. QVT The QVT language has a hybrid declarative/imperative nature. The declarative part is divided into a two-tier architecture, which forms the framework for the execution semantics of the imperative part [5]. It has the following layers: A user-friendly Relations metamodel and language that supports standard combination of complex object and create the template object. A Core metamodel and language defined using minimal extensions to EMOF and OCL. In addition to the declarative languages (Relations and Core), there are two mechanisms for invoking imperative implementations of Relations or Core transformations: a standard language (Operational Mappings) as well as nonstandard implementations (Black-box MOF Operation). The QVT Operational Mapping language allows both the definition of transformations using a complete imperative approach (i.e. operational transformations) or it lets hybrid approach in which the transformations are complemented with imperatives operations (which implements the relationships). The operational transformation represents the definition of a unidirectional transformation that is expressed imperatively. This defines a signature indicating the models involved in the transformation and defines an input operation for its implementation (called main). An operational transformation can be instantiated as an entity with properties and operations, such as a class. III. AUTOMATION OF HORIZONTAL TRANSFORMATION The first activity of the STREAM process presents some transformation rules that can be defined precisely using the QVT (Query / View / Transformation) transformation language [5], in conjunction with OCL (Object Constraint Language) [9] to represent constraints. The transformation process requires the definition of transformation rules and metamodels for the source and target languages. The first STREAM activity uses the HTRs, which aim to improve the modularity of the i* models and have the i* language as source and target language. The rules were defined in QVTO and executed through a plugin for the Eclipse platform. Transformations were specified based on the i* language metamodel considered by the istartool. In QVT, it is necessary to establish a reference to the metamodel to be used. As explained in section II, the steps of the first activity of the STREAM process (Refactor Models Requirements) are: Analysis of internal elements; Application of horizontal transformation rules, and Evaluation of i* models. The Horizontal Transformation Rules activity takes as input two artefacts: the i* model and the selection of internal elements. The former is the i* system model, and the latter is the choice of elements to be modularized made by Engineer Requirements. The output artefact produced by the activity is refactored and more modularized i* model. Modularization is performed by a set of horizontal transformation rules. 
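As a minimal illustration of this structure (the metamodel URI and the transformation name below are ours, not the ones shipped with the istartool), a QVTO file for this activity starts by declaring the i* metamodel, the transformation signature and the mandatory main() entry operation:

modeltype ISTAR uses 'http://www.cin.ufpe.br/istar'; -- illustrative metamodel URI
transformation RefactorIStar(inout srcModel : ISTAR); -- refactoring keeps the i* metamodel on both sides
main() {
    -- each HTR then traverses the root Model objects of the i* file
    srcModel.rootObjects()[Model]->forEach(m) {
        var boundaryActors := m.actors->size(); -- actors whose crossing links will be checked
    };
}

In the excerpts presented next, this traversal is inlined in the rule code itself (see, e.g., line 1 of Table 2).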
Each rule performs a small and located transformation that produces a new model that decomposes the original model. Both the original and the produced model are described in i*. Thus, the four horizontal transformation rules proposed by [8] are capable of implementation. First the analyst uses the istartool to produce the i* requirements model. Then the HTR1 can be performed manually by him/her also using the istartool. The analyst may choose to move the sub-graph for a new actor or an existing actor, and then moves the sub-graph. This delegation must ensure that new actors and the original actor have a relationship of dependency. Thus, the original model and the final model are supposed to be semantically equivalent. Upon completion of HTR1, the artefact generated is used in automated transformations that perform all other HTRs at once. This is the case if the obtained model is syntactically wrong. Table 1 describes the different types of relationship between the components that have been moved to another actor and a member of the actor to which it belonged. If the relationship is a means-end rule, HTR2 should be applied. While if the relationship is a contribution, HTR3 is used. In the situation where tasks decomposition is present, HTR4 is recommended. In the next section we detail how each of these HTRs was implemented in QVTO. A. HTR2- Move a means-end link across the actor's boundary If after applying the HTR1, there is a means-end link crossing the actors boundaries, the HTR2 corrects this syntax error since means-end links can exist only inside the actor s boundary. The means-end link is usually used to connect a task (means) to a goal (end). Thus, the HTR2 make a copy of the task inside the actor who has the goal, in such way that the means-end link is now inside of the actor s boundary that has the goal (Actor X in Table 1). After that, the rule establishes a dependency from that copied task to the task inside of the new actor (Actor Z in Table 1). To accomplish this rule, the HTR2 checks if there is at least a means-end link crossing the actors boundaries (line 7 of the code present in Table 2). If so, it then checks if this means-end link has as source and target attributes elements present in the boundary of different actors. If this condition holds (line 10), the HTR2 creates a copy of the source element inside the boundary of the actor which possesses the target element of the means-end link (atordahora variable

29 in line 19). A copy of the same source element is copied outside the actors boundaries to become a dependum (line 18). Then, a dependency is created from the element copied inside the actor to the dependum element (line 20) and from the dependum element to the original source element of the means-end link that remained inside the other actor (line 21). The result is the same presented in Table 1 for HTR2. The source code in QVTO for HTR2 is presented in Table 2. TABLE II. HTR2 DESCRIBED IN QVTO 1 actorresultamount := orimodel.rootobjects()[model].actors.name->size(); 2 while(actorresultamount > 0){ 3 if(self.actors- >at(actorresultamount).type.=(actortype::actorboun DARY)) then { 4 atoresboundary += self.actors- >at(actorresultamount); 5 var meansend := self.actors- >at(actorresultamount).meansend->size(); 6 var atordahora := self.actors- >at(actorresultamount); 7 while(meansend > 0) { 8 var sourcedahora := atordahora.meansend- >at(meansend).source.actor; 9 var targetdahora := atordahora.meansend- >at(meansend).target.actor; 10 if(sourcedahora <> targetdahora) then { 11 var atoresboundarysize := atoresboundary- >size(); 12 var otheractor : Actor; 13 while(atoresboundarysize > 0) { 14 if(atoresboundary- >at(atoresboundarysize).name <> atordahora.name) then { 15 otheractor := atoresboundary- >at(atoresboundarysize); } else { 16 otheractor := atordahora; } endif; 17 atoresboundarysize := atoresboundarysize - 1; }; 18 self.elements += object Element{ name := atordahora.meansend- >at(meansend).source.name; type := atordahora.meansend- >at(meansend).source.type; }; 19 atordahora.elements += object Element{ name := atordahora.meansend- >at(meansend).source.name; type := atordahora.meansend- >at(meansend).source.type; actor := atordahora.name; }; 20 self.links += object DependencyLink { }; source := atordahora.elements->last(); target := self.elements->last(); name := "M"; type := DependencyLinkType::COMMITED; 21 self.links += object DependencyLink { >at(meansend).source; }; } endif; source := self.elements->last(); target := otheractor.meansend- name := "M"; type := DependencyLinkType::COMMITED; 22 meansend := meansend - 1; }; } endif; 23 actorresultamount := actorresultamount - 1; }; B. HTR3- Move a contribution link across the actor's boundary HTR3 copies a softgoal that was moved to its source actor, if this softgoal is a target in a contribution link with some element in his initial actor. The target of the link is moved from the softgoal to its copy in the initial actor. This softgoal is still replicated as a dependum of a dependence link from the original softgoal to its copy. If an element of some actor has a contribution link with a softgoal that is within the actor that was created or received elements in HTR1, then this softgoal will be copied into the actor that has an element that has a contribution link with this softgoal. The target of the contribution link becomes that copy. This softgoal is also copied as a dependum of a softgoal dependency in its original copy. In order to accomplish this rule, we analyse if any contribution link has the source and target attributes with elements present in different actors. If the actor element present in the source or the target is different from the actor referenced in attribute of the element, then this element (softgoal) is copied to the actor present in source or target that has the different name of the actor analysed. The target attribute of the contribution link shall refer to this copy. 
This same softgoal is also copied to the modelling stage, and a dependency is created between the original softgoal and its copy, with the softgoal copied to the stage acting as the dependum. The target of this dependency is the copy and the source is the original softgoal. C. HTR4- Move a task-decomposition link across the actor's boundary HTR4 replicates the element that takes part in a task decomposition as the dependum of a dependency link between that element and the task, and removes the decomposition link.

30 If an some actor's element is task decomposition within the actor that was created or received elements in HTR1, then that decomposition link is removed, and a copy of this element will be created and placed during the modelling stage as a dependum of a dependency between the task in the actor created or received elements in HTR1 and the element present in another actor that was the decomposition of this task. The target of this dependence is the element and the source is the task. In order to accomplish this rule, we analyse if any decomposition task link has source and target attributes with elements present in different actors. If the actor of the element present in the source or target is different from the referenced actor in the moves attribute of element, then that element is copied during the modelling stage to create a dependency from the referenced element as the source of decomposition link to the element referenced as the target, i.e., a dependency from the original element of the task, with the copied element to the stage as a dependum. The target of this dependence is the task and the source is the element. The decomposition link is removed. IV. AUTOMATION OF VERTICAL TRANSFORMATIONS The second STREAM activity (Generate Architectural Solutions) uses transformation rules to map i* requirements models into an initial architecture in Acme. As these transformations have different source and target languages, they are called vertical transformations. In order to facilitate the understanding of de VTRs as well as the description of them, we separate the vertical transformation rules in four rules [14]. Below we detail how each of these VTRs was implemented. A. VTR1- Mapping the i* actors in Acme components In order to describe this first VTR it is necessary to obtain the quantity of actors present in the artefact developed in the first activity. From this, we create the same quantity of Acme components (line 3 of code present in Figure 1), giving the same actors name. The Figure 1 shows an excerpt of QVTO code for VTR1. 1 while(actorsamount > 0) { 2 result.acmeelements += object Component{ 3 name := self.actors.name->at(actorsamount); } 4 actorsamount := actorsamount - 1; } Figure 1. Excerpt of the QVTO code for VTR1 The XMI output file will contain the Acme components within the system represented by the acmeelements tag (line 1 of code present in Figure 2), an attribute of that tag, xsi:type (line 1), that contain information that is a component element and the attribute name the element name, as depicted in Figure 2. Figure 3 shows graphically a component. 1 <acmeelements xsi:type="acme:component" name="advice Receiver"> 2 </acmeelements> Figure 2. XMI tag of component Figure 3. Acme components linked by connector However, an Acme component has other attributes, not just the name, so it is also necessary to perform the VTR3 and VTR4 rules to obtain the other necessary component attributes. B. VTR2- Mapping the i* dependencies in Acme connectors Each i* dependency creates two XMI tags. One captures the link from depender to the dependum and the other defines the link from the dependum to the dependee. 
1 while(dependencyamount > 0) { 2 if(self.links.source- >includes(self.links.target->at(dependencyamount)) and self.actors.name->excludes(self.links.target- >at(dependencyamount).name)) then { 3 result.acmeelements += object Connector{ 4 name := self.links.target.name- >at(dependencyamount); 5 roles += object Role{ 6 name := "dependerrole"; }; 7 roles += object Role{ 8 name := "dependeerole"; }; }; } endif; 9 dependencyamount := dependencyamount - 1; }; Figure 4. Excerpt of the QVTO code for VTR2 As seen in Figure 4, for the second vertical rule, which transforms i* dependencies to Acme connectors (line 3 of code present in Figure 4), each i* dependency creates two tags in XMI, one captures the connection from the depender to the dependum (line 5) and another defines the connection from the dependee to the dependum (line 7). In order to map these dependencies into Acme connectors it is necessary to recover the two dependencies tags, observing that the have the same dependum, i.e., the target of a tag must be equal to the source of another tag, which can characterize the dependum. However, they should not consider the actor which plays the role of depender (source) in some dependency and dependee (target) in another. Once this is performed, there are only dependums (intentional elements) left. For each dependum, one Acme connector is created (line 1 of code present in Figure 5). The connector created receives the name of the intentional element that represents the dependum of the dependency link. Two roles are created within the connector, one named dependerrole and another named dependeerole. The XMI output file will contain the connectors represented by tags (see Figure 5).

31 1 <acmeelements xsi:type="acme:connector" name="connector"> 2 <roles name="dependerrole"/> 3 <roles name="dependeerole"/> 4 </acmeelements> Figure 5. Connector in XMI C. VTR3- Mapping depender actors as required port of Acme connector With the VTR3, we map all depender actors (source from some dependency) into a required port of an Acme connector. Thus, we list all model s actors that are source in some dependency (line 2 of code present in Figure 6). Furthermore, we create an Acme port for each depender actor (line 3). Each port created has a name and a property (lines 4 and 5), the name is assigned randomly, just to help to control them. The property must have a name and a value, the property name is Required once we are creating the required port, as figured in Figure 6. 1 while(dependencyamount > 0) { 2 if(self.actors.name- >includes(self.links.source- >at(dependencyamount).name) and self.actors.name- >at(actorsamount).=(self.links.source- >at(dependencyamount).name)) then { 3 ports += object Port{ 4 name := "port"+countport.tostring(); 5 properties := object Property { 6 name := "Required"; 7 value := "true" }; }; } endif; 8 dependencyamount := dependencyamount - 1; 9 countport := countport + 1; }; Figure 6. Excerpt of the QVTO code for VTR3 The XMI output file will contain within the component tag (line 1 of code present in Figure 7) the tags of the ports. Inside the port s tag there will be a property tag with the name attribute assigned as Required and the attribute value set true (lines 2 to 4). Figure 7 presents an example of a required port in XMI, while Figure 3 shows the graphic representation of the required port (white). 1 <acmeelements xsi:type="acme:component" name="component"> 2 <ports name="port8"> 3 <properties name="required" value="true"/> 4 </ports> 5 </acmeelements> Figure 7. Example of a required port in XMI D. VTR4- Mapping dependee actors as provided port of Acme connector VTR4 is similar to VTR3. We map all dependee actors (target from some dependency) as a provided Acme port of a connector. Thus, we list all model s actors that are target in some dependency (line 2 of code present in Figure 8). We create an Acme port for each dependee actor. It has a name and property, the name is assigned randomly, simply to control them (line 4). The property must have a name and a value. The property name is set to Provided once we are creating the provided port. Figure 8 presents an QVTO excerpt code for the provided port. The XMI output file will contain within the component a tag to capture the ports. Inside the port s tag that are provided there will be a property tag with the name attribute assigned as Provided (line 3 of code present in Figure 9). While the value attribute is set to true and the type attribute as boolean. 1 <acmeelements xsi:type="acme:component" name="advice Giver"> 2 <ports name="port17"> 3 <properties name="provided" value="true" type="boolean"/> 5 </ports> 6 </acmeelements> Figure 9. Provided port in XMI Figure 3 shows the graphic representation of the provided port (black). V. RUNNING EXAMPLE BTW (By The Way) [10] is a route planning system that helps users with advices on a specific route searched by them. 
1 while(dependencyamount > 0) {
2   if(self.actors.name->includes(self.links.target->at(dependencyamount).name) and self.actors.name->at(actorsamount).=(self.links.target->at(dependencyamount).name)) then {
3     ports += object Port{
4       name := "port"+countport.tostring();
5       properties := object Property{
6         name := "Provided";
7         value := "true";
8         type := "boolean"; }; }; } endif;
9   dependencyamount := dependencyamount - 1; countport := countport + 1; };
Figure 8. Creation of Provided Port

The information is posted by other users and can be filtered for the user, so as to provide only the relevant information about the place he wants to visit. BTW was an awarded project presented at the ICSE SCORE competition held in 2009 [11]. In order to apply the automated rules to the i* models of this example, it is necessary to perform the following steps:
1. Create the i* requirements model using the istartool;
2. Use the three heuristics defined by STREAM to guide the selection of the sub-graph to be moved from an actor to another;
3. Manually apply the HTR1, but with the support of the istartool. The result is an i* model with syntax errors that must be corrected using the automated transformation rules (see the sketch after this list);
4. Apply the automated HTR2, HTR3 and HTR4 rules.
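A small QVTO-style query makes the notion of "syntax error" in step 3 concrete: after HTR1, a means-end, contribution or task-decomposition link is ill-formed exactly when its two ends belong to different actors. The type name below is illustrative (the istartool metamodel may use another name); the source.actor and target.actor attributes follow the excerpts of Section III:

query MeansEnd::isCrossingBoundary() : Boolean {
    -- true when the link left by HTR1 crosses an actor boundary and
    -- must therefore be rewritten by HTR2, HTR3 or HTR4
    return self.source.actor <> self.target.actor;
}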

After step 1, we identified some elements inside the BTW software actor that are not entirely related to the application domain of the software and these elements can be placed on other (new) software actors. In fact, the sub-graphs that contain the "Map to be Handled", "User Access be Controlled", and "Information be published in map" elements can be viewed as independent of the application domain. To facilitate the reuse of the first sub-graph, it will be moved to a new actor named "Handler Mapping". The second sub-graph will be moved to a new actor named "User Access Controller", while the third sub-graph will be moved to a new actor called "Information Publisher". Steps 1 and 2 are performed using the istartool. This tool generates two types of files (extensions): the "istar_diagram" file has information related to the i* requirements model; the "istar" file has information related to the i* modelling language metamodel. Since the "istar" file is an XMI file, we changed its type (extension) to "xmi". XMI files are used as input and output files by the automated rules (HTR2, HTR3 and HTR4). The BTW i* model and the elements to be moved to other actors are shown in Figure 10. Figure 11 depicts the BTW i* model after applying HTR1. Note that there are some task-decomposition and contribution links crossing the actors' boundaries, meaning that the model is syntactically incorrect and must be corrected by the automated HTRs. In order to apply HTR2, HTR3 and HTR4, we only need to execute a single QVTO file. Thus, with Eclipse configured for QVT transformations, along with the i* language metamodel referenced in the project and the input files referenced in the run settings, the automated rules will be applied simultaneously by executing the QVTO project.

Figure 10. BTW i* diagram and selected elements
Figure 11. BTW i* SR diagram and selected elements

After applying the HTRs, a syntactically correct i* model is produced. In this model, the actors are expanded, but in order to apply the vertical transformation rules, it is necessary to contract all the actors (as shown in Figure 12) to be used as input in the second STREAM activity (Generate Architectural Solutions). Moreover, when applying the VTRs, we only need to execute a single QVTO file. The VTRs are executed sequentially and the analyst will visualize just the result model [15]. Figure 13 presents the graphical representation of the XMI model generated after the application of the VTRs. This XMI is compatible with the Acme metamodel.
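As a rough sketch (the metamodel URIs, the AcmeModel type name and the mapping name are ours, not part of the STREAM tooling), such a single QVTO file chains the four VTRs as follows, with the rule bodies corresponding to the code of Figures 1, 4, 6 and 8:

modeltype ISTAR uses 'http://www.cin.ufpe.br/istar'; -- illustrative URI
modeltype ACME uses 'http://www.cs.cmu.edu/acme';    -- illustrative URI
transformation IStar2Acme(in reqModel : ISTAR, out archModel : ACME);
main() {
    reqModel.rootObjects()[Model]->map toAcmeModel();
}
mapping Model::toAcmeModel() : AcmeModel {
    -- VTR1: one Acme component per i* actor
    self.actors->forEach(a) {
        result.acmeElements += object Component { name := a.name; };
    };
    -- VTR2 to VTR4 (Figures 4, 6 and 8) then add connectors and their
    -- required/provided ports to result.acmeElements
}

The serialized output of such a run is the XMI whose graphical form is shown in Figure 13.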

Figure 11. BTW i* model after performing HTR1
Figure 12. BTW i* model after applying all HTRs

VI. RELATED WORK

Coelho et al. propose an approach to relate aspect-oriented goal models (described in PL-AOV-Graph) and architectural models (described in PL-AspectualACME) [12]. It defines the mapping process between these models and a set of transformation rules between their elements. The MaRiPLA (Mapping Requirements to Product Line Architecture) tool automates this transformation, which is implemented using the Atlas Transformation Language (ATL). Medeiros et al. present MaRiSA-MDD, a model-based strategy that integrates aspect-oriented requirements, architectural and detailed design, using the languages AOV-graph, AspectualACME and asideml, respectively [13]. MaRiSA-MDD defines, for each activity, models (and metamodels) and a number of model transformation rules. These transformations were specified and implemented in ATL. However, neither of these works relied on i*, our source language, which has a much larger community of adopters than AOV-graph.

VII. CONCLUSION

This paper presented the automation of most of the transformation rules that support the first and second STREAM activities, namely Refactor Requirements Models and Derive Architectural Solutions [1].

Figure 13. BTW Acme Model obtained from the i* model

In order to decrease the time and effort required to perform these STREAM activities, as well as to minimize the errors introduced by the manual execution of the rules, we proposed the use of the QVTO language to automate the execution of seven out of the eight STREAM transformation rules. The input and output models of the Refactor Requirements Models activity are compatible with the istartool, while the ones generated by the Derive Architectural Solutions activity are compatible with the AcmeStudio tool. The istartool was used to create the i* model and to perform HTR1 manually. The result is the input file to be processed by the automated transformation rules (HTR2, HTR3 and HTR4). Both the input and output files handled by the transformation process are XMI files. The STREAM transformation rules were defined in QVTO and an Eclipse-based tool support was provided to enable their execution. To illustrate their use, the automated rules were applied to the BTW example [10]. The output of the execution of the VTRs is an XMI file with the initial Acme architectural model. Currently, the AcmeStudio tool is not capable of reading XMI files, since it was designed to process only files described using the Acme textual language. As a consequence, the XMI file produced by the VTRs currently cannot be graphically displayed. Hence, we still need to define new transformation

35 rules to generate a description in Acme textual language from the XMI file already produced. Moreover, more case studies are still required to assess the benefits and identify the limitations of our approach. For example we plan to run an experiment to compare the time necessary to perform the first two STREAM activities automatically against an ad-hoc way. VIII. ACKNOWLEDGEMENTS This work has been supported by CNPq, CAPES and FACEPE. REFERENCES [1] J. Castro, M. Lucena, C. Silva, F. Alencar, E. Santos, J. Pimentel, "Changing Attitudes Towards the Generation of Architectural Models", Journal of Systems and Software March 2012: Vol 85. pp [2] Object Management Group. (January 2011). QVT 1.1. Meta Object Facility (MOF) 2.0. Query/View/Transformation Specification. Available in: < Acessed: April [3] E. Yu, "Modelling Strategic Relationships for Process Reengineering". Tese (Doutorado). University of Toronto: Department of Computer Science, [4] ACME. Acme. Acme - The Acme Architectural Description Language and Design Environment., Available in: < Accessed: April [5] OMG. QVT 1.1. Meta Object Facility (MOF) 2.0 Query/View/Transformation Specifica-tion, 01 January Available em: < Accessed: April [6] A. Malta, M. Soares, E. Santos, J. Paes, F. Alencar and J. Castro, "istartool: Modeling requirements using the i* framework". IStar 11, August [7] ECLIPSE GMF. GMF - Graphical Modelling Framework. Available in: < >. Accessed: April [8] J. Pimentel, M. Lucena, J. Castro, C. Silva, E. Santos, and F. Alencar, Deriving software architectural models from requirements models for adaptive systems: the STREAM-A approach, Requirements Engineering, vol. 17, no. 4, pp , June [9] OMG. OCL 2.0. Object Constraint Language: OMG Available Specification, Available in: < Accessed: April [10] J. Pimentel, C. Borba and L. Xavier, "BTW: if you go, my advice to you Project", Available in: < Accessed: April [11] SCORE The Student Contest on Software Engineering - SCORE 2009, Available in: < Accessed: April [12] K. Coelho, From Requirements to Architecture for Software Product Lines: a strategy of models transformations (In Portuguese: Dos Requisitos à Arquitetura em Linhas de Produtos de Software: Uma Estratégia de Transformações entre Modelo). Dissertation (M.Sc.). Centro de Ciências Exatas e da Terra: UFRN, Brazil, [13] A. Medeiros, MARISA-MDD: An Approach to Transformations between Oriented Aspects Models: from requirements to Detailed Project (In Portuguese: MARISA-MDD: Uma Abordagem para Transformações entre Modelos Orientados a Aspectos: dos Requisitos ao Projeto Detalhado). Dissertation (M.S.c). Center for Science and Earth: UFRN, Brazil, [14] M. Soares, Automatization of the Transformation Rules on the STREAM process (In Portuguese: Automatização das Regras de Transformação do Processo STREAM). Dissertation (M.Sc.). Center of Informatic: UFPE, Brazil, [15] M. Soares, J. Pimentel, J. Castro, C. Silva, C. Talitha, G. Guedes, D. Dermeval, Automatic Generation of Architectural Models From Goals Models, SEKE 2012: [16] Eclipse. Available in: < Acessed: April 2013.

36 An automatic approach to detect traceability links using fuzzy logic Andre Di Thommazo Instituto Federal de São Paulo, IFSP Universidade Federal de São Carlos, UFSCar São Carlos, Brazil Thiago Ribeiro, Guilherme Olivatto Instituto Federal de São Paulo, IFSP São Carlos, Brazil {guilhermeribeiro.olivatto, Vera Werneck Universidade do Estado do Rio de Janeiro, UERJ Rio de Janeiro, Brazil SandraFabbri Universidade Federal de São Carlos, UFSCar São Carlos, Brazil Abstract Background: The Requirements Traceability Matrix (RTM) is one of the most commonly used ways to represent requirements traceability. Nevertheless, the difficulty of manually creating such a matrix motivates the investigation into alternatives to generate it automatically. Objective: This article presents one approach to automatically create the RTM based on fuzzy logic, called RTM-Fuzzy, which combines two other approaches, one based on functional requirements' entry data called RTM-E and the other based on natural language processing called RTM-NLP. Method: To create the RTM based on fuzzy logic, the RTM-E and RTM-NLP approaches were used as entry data for the fuzzy system rules. Aimed at evaluating these approaches, an experimental study was conducted where the RTMs created automatically were compared to the reference RTM (oracle) created manually based on stakeholder knowledge. Results: On average the approaches matched the following results in relation to the reference RTM: RTM-E achieved 78% effectiveness, RTM-NLP 76% effectiveness and the RTM-Fuzzy 83% effectiveness. Conclusions: The results show that using fuzzy logic to combine and generate a new RTM offered an enhanced effectiveness for determining the requirement s dependencies and consequently the requirement s traceability links. Keywords- component; requirements traceability; fuzzy logic; requirements traceability matrix. I. INTRODUCTION Nowadays, the software industry is still challenged to develop products that meet client expectations and yet respect delivery schedules, costs and quality criteria. Studies performed by the Standish Group [1] showed that the quantity of projects in 2010 which finished successfully whilst respecting the schedule, budget and, principally, meeting the client s expectations is only 37%. Another study performed previously by the same institute [2] found that the three most important factors to define whether a software project was successful or not are: user specification gaps, incomplete requirements, and constant changes in requirements. Duly noted, these factors are directly related to requirements management. According to Salem [3], the majority of software errors found are derived from errors in the requirements gathering and on keeping pace with their evolution throughout the software development process. One of the main features of requirements management is the requirements traceability matrix (RTM), which is able to record the existing relationship between the requirements on a system and, due to its importance, is the main focus of many studies. Sundaram, Hayes, Dekhtyar and Holbrook [4], for instance, consider traceability determination essential in many software engineering activities. Nevertheless, such determination is a time consuming and error prone task, which can be facilitated if computational support is provided. 
The authors claim that the use of such automatic tools can significantly reduce the effort and costs required to elaborate and maintain requirements traceability and the RTM, and they go further to state that such support is still very limited in existing tools. Among the ways to automate traceability, Wang, Lai and Liu [5] highlight that current studies make use of a spatial vector model, semantic indexing or probability network models. Regarding spatial vector modeling, the work of Hayes, Dekhtyar and Osborne [6] can be highlighted; it is presented in detail in Section IV-B. Related to semantic indexing, Hayes, Dekhtyar and Sundaram [7] used the ideas proposed by Deerwester, Dumais, Furnas, Landauer and Harshman on Latent Semantic Indexing (LSI) [8] in order to also automatically identify traceability. When LSI is in use, not only is the word frequency taken into consideration, but also the meaning and context in which the words are used. With respect to the network probability model, Baeza-Yates and Ribeiro-Neto [9] use ProbIR (Probabilistic Information Retrieval) to create a matrix in which the dependency between each term is mapped in relation to the other document terms. All the quoted proposals are also detailed by Cleland-Huang, Gotel and Zisman [10] as possible approaches for traceability detection. As traceability determination involves many uncertainties, this activity is not trivial, not even for the team involved in the requirements gathering. Therefore, it is possible to achieve better effectiveness in the traceability link identification if we can use a technique

37 that can handle uncertainties, like fuzzy logic. Given the aforementioned context, the focus of this paper is to present one approach to automatically create the RTM based on fuzzy logic, called RTM-Fuzzy. This approach combines two approaches: one based on functional requirements (FR) entry data called RTM-E that is effective on CRUD FR traceability and other based on natural language processing called RTM-NLP that is effective on more descriptive FRs. The motivation of the RTM-Fuzzy is to join the good features of the two others approaches. The main contribution of this paper is to present the fuzzy logic approach, once it has equal or better effectiveness than the other ones (RTM-E and RTM- NLP) singly. The three proposed approaches were evaluated by an experimental study to quantify the effectiveness of each. It is worth mentioning that the RTM-E and RTM-NLP approaches had already provided satisfactory results in a previous experimental study [11]. In the re-evaluation in this paper, RTM-E had similar results and RTM-NLP (that was modified and improved) had a better effectiveness than the results of the first evaluation [11]. To make the experiment possible, the three RTM automatic generation approaches were implemented in the COCAR tool [12]. This article is organized as follows: in Section II the requirements management, traceability and RMT are introduced; Section III presents a brief definition of fuzzy logic theory; in Section IV, the three RMT automatic creation approaches are presented and exemplified by use of the COCAR tool; Section V shows the experimental study performed to evaluate the effectiveness of the approaches; conclusions and future work are discussed in Section VI. II. REQUIREMENTS MANAGEMENT TECHNIQUES Requirements management is an activity that should be performed throughout the whole development process, with the main objective of organizing and storing all requirements as well as managing any changes to them [13][14]. As requirements are constantly changing, managing them usually becomes a laborious and extensive task, thus making relevant the use of support tools to conduct it [5]. According to the Standish Group [15], only 5% of all developed software makes use of any requirements management tool, which can partially explain the huge problems that large software companies face when implementing effective requirements management and maintaining its traceability. Various authors emphasize the importance of tools for this purpose [13][14][16][17]. Zisman and Spanoudakis [14], for instance, consider the use of requirements management tools to be the only way for successful requirements management. Two important concepts for requirements management are requirements traceability and a traceability matrix, which are explained next. A. Requirements traceability Requirements traceability concerns the ability to describe and monitor a requirement throughout its lifecycle [18]. Such requirement control must cover all its existence from its source when the requirement was identified, specified and validated through to the project phase, implementation and ending at the product s test phase. Thus traceability is a technique that allows identifying and visualizing the dependency relationship between the identified requirements and the other requirements and artifacts generated throughout the software s development process. 
The dependency concept does not mean, necessarily, a precedence relationship between requirements but, instead, how coupled they are to each other with respect to data, functionality, or any other perspective. According to Guo, Yang, Wang, Yang and Li [18], requirements traceability is an important requirements management activity as it can provide the basis to requirements evolutional changes, besides directly acting on the quality assurance of the software development process. Zisman and Spanadousk [14] consider two kinds of traceability: Horizontal: when the relationships occur between requirements from different artifacts. This kind of traceability links a FR to a model or a source code, for example. Vertical: when the traceability is analyzed within the same artifact, like the RD for instance. By analyzing the FRs of this artifact it is possible to identify their relationships and generate the RTM. This type of traceability is the focus of this paper. Munson and Nguyen [19] state that traceability techniques will only be better when supported by tools that diminish the effort required to execute them. B. Requirement Traceability Matrix - RTM According to Goknil, Kurtev, Van den Berg and Veldhuis [17], despite existing various estudies treating traceability between requirements and other artifacts (horizontal traceability), only minor attention is given to the requirements relationship between themselves, i.e. their vertical traceability. The authors also state that this relationship influences various activities within the software development process, such as requirements consistency verification and change management. A method of mapping such a relationship between requirements is RTM creation. In addition, Cuddeback, Dekhtyar and Hayes [20] state that a RTM supports many software engineering validation and verification activities, like change impact analysis, reverse engineering, reuse, and regression tests. In addition, they state that RTM generation is laborious and error prone, a fact that means, in general, it is not generated or updated. Overall, RTM is constructed as follows: each FR is represented in the i-eseme line and in the i-eseme column of the RTM, and the dependency between them is recorded in the cell corresponding to each FR intersection [13]. Several authors [13] [17] [18] [19] [21] debate the importance and need of the RTM in the software development process, once such matrix allows predicting the impact that a change (or the insertion of a new requirement) has on the system as a whole. Sommerville [13] emphasizes the difficulty of obtaining such a matrix and goes further by proposing a way to subjectively indicate not only whether the requirements are dependent but how strong such a dependency is.
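As a minimal sketch of the matrix layout just described, an n by n structure whose cell (i, j) records the dependency level between FRi and FRj, the representation below is ours, not COCAR's; the symmetric update mirrors how the matrices generated later in the paper behave.

    # Illustrative RTM: cell (i, j) holds the dependency level between FRi and FRj;
    # the main diagonal is not used.
    def empty_rtm(n):
        return [[None if i == j else "no dependence" for j in range(n)] for i in range(n)]

    def set_dependency(rtm, i, j, level):
        # level is one of "no dependence", "weak", "strong"
        rtm[i][j] = rtm[j][i] = level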

38 III. FUZZY LOGIC Fuzzy logic was developed by Zadeh [22], and proposes, instead of simply using true or false, the use of a variation of values between a complete false and an absolute true statement. In classic set theory there are only two pertinence possibilities for an element in relation to a set as a whole: the element pertains or does not pertain to a set [23]. In fuzzy logic the pertinence is given by a function whose values pertain to the real closed interval between 0 and 1. The process of converting a real number into its fuzzy representation is called Fuzzyfication. Another important concept in fuzzy logic is related to the rules that use linguistic variables in the execution of the decision support process. The linguistic variables are identified by names, have a variable content and assume linguistic values, which are the names of the fuzzy sets [23]. In the context of this work, the linguistic variables are the traceability obtained by the three RTM generation approaches and may assume the values (nebulous sets) non-existent, weak or strong, which will be represented by a pertinence function. This process is detailed in Section IV C. Fuzzy logic has been used in many software engineering areas and, specifically in the requirements engineering area, the work of Ramzan, Jaffar, Iqbal, Anwar, and Shahid [24] and Yen, Tiao and Yin [25] can be highlighted. The former conducts requirements prioritization based on fuzzy logic and the later uses fuzzy logic to aid the collected requirements precision analysis. In the metrics area, MacDonell, Gray and Calvet [26] also use fuzzy logic to propose metrics to the software development process and in the reuse area, Sandhu and Singh [27] likewise use fuzzy logic to analyze the quality of the created components. IV. APPROACHES TO RTM GENERATION The three approaches were developed aiming to generate the RTM automatically. The first one - RTM-E is effective to detect traceability links in FRs that have the same entry data, specially like CRUDs FRs. The second one RTM-NLP is appropriate to detect traceability links in FRs that have a lot of knowledge in the text description. The third one RTM-Fuzzy combines the previous approaches trying to extract the best of each one. These approaches only take into consideration the software FRs and establish the relationship degree between each pair of them. The RTM names were based on each approach taken. The approach called RTM-E had its traceability matrix named RTMe, the RTM-NLP s RTM was called RTMnlp, whereas the RTM-Fuzzy s RTM was called RTMfuzzy. The quoted approaches are implemented in the COCAR tool, which uses a template [28] for requirements data entry. The RD formed in the tool can provide all data necessary to evaluate the approaches. The main objective of such a template is to standardize the FR records, thus avoiding inconsistencies, omissions and ambiguities. One of the fields found in this template (which makes the RTM implementation feasible) is called Entry, and it records in a structured and organized way the data used in each FR. It is important noting that entry data should be included with the full description, i.e. client name or user name and not only name, avoiding ambiguity. Worth mentioning here is the work of Kannenberg and Saiedian [16], which considers the use of a tool to automate the requirements recording task highly desirable. Following, the approaches are presented. 
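As a concrete illustration of the fuzzification step described in Section III, the following minimal Python sketch maps a dependency percentage to membership degrees in the three fuzzy sets used in this work. The triangular shapes and breakpoints are illustrative only; the actual pertinence functions are those of Figures 6, 7 and 8.

    # Illustrative fuzzification: map a dependency percentage (0-100) to membership
    # degrees in the sets "no dependence", "weak" and "strong".
    def triangular(x, a, b, c):
        # Classic triangular membership function peaking at b (breakpoints assumed).
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def fuzzify(percentage):
        return {
            "no dependence": triangular(percentage, -1, 0, 45),
            "weak":          triangular(percentage, 35, 55, 75),
            "strong":        triangular(percentage, 65, 100, 101),
        }

    fuzzify(40)   # partly "no dependence", partly "weak": no hard cut-off between levels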
Aiming to exemplify the RTM generation according to them, a real application developed for a private aviation company was used as a case study. An example of how each approach calculates the dependence between a pair of FRs is presented at the end of each sub-section (IV-A, IV-B and IV-C). The system's purpose is to automate the company's stock control. As the company is located in several cities, the system manages the various stock locations and the products being inserted, retrieved or reallocated between each location. The system has a total of 14 FRs, which are referenced in the first line and first column of the RTMs generated. In Section V the results of this case study are compared with the results of the experimental study.

A. RTM-E Approach: RTM generation based on input data

In the RTM-E approach, the dependency relationship between FRs is determined by the percentage of common data between FR pairs. This value is obtained through the Jaccard Index calculation [29], which compares the similarity and/or diversity degree between the data sets of each pair. Equation 1 represents this calculation:

    J(A, B) = n(A ∩ B) / n(A ∪ B)    (1)

The equation numerator is given by the quantity of data intersecting both sets (A and B), whereas the denominator corresponds to the quantity associated with the union between those sets. The RTM-E approach defines the dependency between two FRs according to the following: considering FRa the set of entry data of a functional requirement A and FRb the set of entry data of a functional requirement B, their dependency level can be calculated by Equation 2:

    Dependency(FRa, FRb) = [ n(FRa ∩ FRb) / n(FRa ∪ FRb) ] x 100%    (2)

Thus, according to the RTM-E approach, each position (i, j) of the traceability matrix RTM(i, j) corresponds to the value given by Equation 3:

    RTM(i, j) = Dependency(FRi, FRj), for i ≠ j    (3)

Positions on the matrix's main diagonal are not calculated, since they would indicate the dependency of an FR on itself. Besides, the generated RTM is symmetrical, i.e. RTM(i, j) has the same value as RTM(j, i). Implementing such an approach in COCAR was possible because the requirements data are stored in an atomic and separated way, according to the template mentioned before. Each time entry data is inserted in a requirement data set it is automatically available and can be used as entry data for another FR. Such implementation avoids data ambiguity and data inconsistency.
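A minimal Python sketch of the RTM-E computation defined by Equations 1-3, assuming the entry data of each FR is available as a set of strings; the data structures and the threshold helper are illustrative and may differ from COCAR's implementation. The usage lines reproduce the FR3/FR5 example discussed in the sequence.

    # Equations 1-3: Jaccard index over FR entry-data sets, expressed as a percentage.
    def dependency(fr_a, fr_b):
        # fr_a, fr_b: sets of entry-data names of two functional requirements
        if not fr_a and not fr_b:
            return 0.0
        return 100.0 * len(fr_a & fr_b) / len(fr_a | fr_b)

    def build_rtm_e(entries):
        # entries: list of entry-data sets, one per FR
        n = len(entries)
        rtm = [[None] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):                 # symmetric matrix, diagonal skipped
                rtm[i][j] = rtm[j][i] = dependency(entries[i], entries[j])
        return rtm

    def level_rtm_e(value):
        # Dependency levels reported for RTM-E: 0% none, up to 50% weak, above 50% strong.
        if value == 0:
            return "no dependence"
        return "weak" if value <= 50 else "strong"

    # FR3 and FR5 of the stock-control example share 6 of 9 entry items:
    fr3 = {"Contact", "Transaction Date", "Warehouse", "Quantity", "Unit Price", "User"}
    fr5 = fr3 | {"Origin Warehouse", "Destination Warehouse", "Status"}
    dependency(fr3, fr5)    # 66.67, classified as a strong dependency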

39 It is worth noting that initiatives using FR data entries to automatically determine the RTM were not found in the literature. Similar initiatives do exist to help determine traceability links between other artifacts, mainly models (for UML diagrams) and source codes, like those found in Cysneiros and Zisman [30]. In Figure 1, the matrix cell highlighted by the rectangle indicates the level of dependency between FR3 and FR5. In this case, FR3 is related to products going in and out from a company s stock (warehouse) and FR5 is related to an item transfer from one stock location to another. As both FRs deal with stock items, it is expected that they present a relationship. The input data of FR3 are: Contact, Transaction Date, Warehouse, Quantity, Unit Price and User. The input data of FR5 are: Contact, Transaction Date, Warehouse, Quantity, Unit Price, User, Origin Warehouse, Destination warehouse and Status. As the quantity of elements of the intersection between the input data of FR3 and FR5 (n(fr3 FR5)) is equal to 6, and the quantity of elements of the union set (n(fr3 FR5)) is equal to 9, the value obtained from Equation 4, that establishes the dependency relationship between FR3 and FR5 is: (4) The 66.67% dependency rate is highlighted in Figure 1, which is the RTM-E built using the aforementioned approach. It is worth mentioning that the colors indicate each FR dependency level as follows: green for a weak dependency level and red for a strong dependency level. Where there is no relationship between the FRs, no color is used in the cell. Figure 2 illustrates the intersection and union sets when the RTM-E approach is applied to FR3 and FR5 used so far as example. Also worth mentioning is that the COCAR tool presents a list of all input data entries already in place, in order to minimize requirements input errors such as the same input data entry with different names. The determination of the dependency levels was carried out based on the application taken as an example (stock control), and from two further RDs from a different scope. Such a process was performed in an interactive and iterative way, adjusting the values according to the detected traceability between the three RD. The levels obtained were: no dependence where the calculated value was 0%; weak for values between 0% and 50%; and strong for values above 50%. B. RMT-NLP approach: RTM generation based on natural language processing Even though there are many initiatives that make use of NLP to determine traceability in the software development process, as mentioned previously few of them consider traceability inside the same artifact [17]. In addition, the proposals found in the literature do not use a requirements description template and do not determine dependency levels as in this work. According to Deeptimahanti and Sanyal [31], the use of NLP in requirements engineering does not aim at text comprehension itself but, instead, at extracting embedded RD concepts. This way, the second approach to establish the RTM uses concepts extracted from FRs using NLP techniques to determine the FR s traceability. Initially, NLP was only applied to the field that describes the processing (actions) of the FR, and such a proposal was evaluated using an experimental study [11]. It was noted that many times, as there is a field for entry data input in the template (as already pointed in the RTM- E proposal), the analysts did not record the entry data once again in the processing field, thus reducing the similarity value. 
With such a fact in mind, this approach has been improved and all text fields contained in the template were used. This way, the template forces the requirements engineer to gather and enter all required data, and all this information is used by the algorithm that performs the similarity calculation. As it will be shown in this work s sequence, this modification had a positive impact on the approach s effectiveness. To determine the dependency level between the processing fields of two FRs, the Frequency Vector and Cosine Similarity methods [32] are used. Such a method is able to return a similarity percentage between two text excerpts. Figure 1 Resultant RTM generated using the RTM-E approach.

Figure 2 An example of the union and intersection of entry data between two FRs.

With the intention of improving the process efficiency, text pre-processing is performed before applying the Frequency Vector and Cosine Similarity methods in order to eliminate all words that might be considered irrelevant, like articles, prepositions and conjunctions (also called stopwords) (Figure 3-A). Then, a process known as stemming (Figure 3-B) is applied to reduce all words to their originating radicals, leveling their weights in the text similarity determination. After the two aforementioned steps, the method calculates the similarity between two FR texts (Figure 3-C) using the template's processing fields, thus identifying, according to the technique, the similar ones. The first step of the Frequency Vector and Cosine Similarity calculation is to represent each sentence in a vector, with each position containing one word of the sentence. The cosine calculated between the vectors determines the similarity value. As an example, take two FRs (FR1 and FR2) described respectively by sentences S1 and S2. The similarity calculation is carried out as follows:
1) S1 is represented in vector x and S2 in vector y. Each word uses one position in each vector. If S1 has p words, vector x also initially has p positions. In the same way, if S2 has q words, vector y also has q positions.
2) As the vectors cannot have repeated words, occurrences are counted to determine each word's frequency to be included in the vector. At the end, each vector contains a single occurrence of each word followed by the frequency with which that word appears in the text.
3) Both vectors are alphabetically reordered.
4) The terms of each vector are searched for matches in the other and, when the search fails, the word is included in the vector that lacks it with 0 as its frequency. At the end of this step, both vectors have the unmatched words included and the same number of positions.
5) With the adjusted vectors, the similarity equation sim(x, y) (Equation 5) is applied between vectors x and y, considering n as the number of positions of the vectors:

    sim(x, y) = ( Σ i=1..n xi · yi ) / ( sqrt( Σ i=1..n xi² ) · sqrt( Σ i=1..n yi² ) )    (5)

Considering the same example used to illustrate the RTM-E approach (the private aviation stock system), the RTMnlp was generated (Figure 4), evaluating the similarity between the FR functionalities inserted into COCAR in the processing attribute of the already mentioned template. After applying pre-processing (stopword removal and stemming) and the steps depicted earlier for calculating the Frequency Vector and Cosine Similarity, the textual similarity between FR3 and FR5 (related to product receipt into stock and product transfer between stocks, respectively) was determined as 88.63% (Figure 3-D). This high value does make sense in this relationship, since the texts describing both requirements are indeed very similar.

Figure 3 Steps to apply the RTM-NLP approach: (A) removal of stopwords (articles, prepositions, conjunctions); (B) stemming, reducing the words to their radicals; (C) Frequency Vector and Cosine Similarity [32]; (D) resulting dependency between FR3 and FR5 (88.63%).
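A minimal Python sketch of steps 1 to 5 and Equation 5, assuming stop-word removal and stemming have already been applied and simplifying tokenisation to whitespace splitting; this is our illustration, not COCAR's implementation.

    from collections import Counter
    from math import sqrt

    def cosine_similarity(text_a, text_b):
        # Steps 1-4: term-frequency vectors over the shared, ordered vocabulary;
        # a term missing from one text receives frequency 0 (Counter returns 0).
        x, y = Counter(text_a.split()), Counter(text_b.split())
        vocab = sorted(set(x) | set(y))
        dot = sum(x[w] * y[w] for w in vocab)
        norm_x = sqrt(sum(x[w] ** 2 for w in vocab))
        norm_y = sqrt(sum(y[w] ** 2 for w in vocab))
        if norm_x == 0 or norm_y == 0:
            return 0.0
        # Step 5, Equation 5: cosine of the angle between the vectors, as a percentage.
        return 100.0 * dot / (norm_x * norm_y)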

41 Figure 4 Resultant RTM generated using the RTM-NLP approach. As in the RTM-E, the dependency level values had been chosen in an interactive and iterative way based on the data provided by the example application (stock control) and two more RDs from different scopes. The levels obtained were: no dependence where the value was between 0% and 40%; weak for values between 40% and 70%; and strong for values above 70%. C. RTM-Fuzzy approach: RTM generation based on fuzzy logic The purpose of this approach is to combine those detailed previously using fuzzy logic, so that we can consider both aspects explored so far the relationship between the entry data manipulated by the FRs (RTM-E) and the text informed in the FRs (RTM-NLP) to create the RTM. In the previously presented approaches, the dependency classification between two FRs of no dependence, weak, and strong is determined according to the approach s generated value related to the values set for each of the created levels. One of the problems with this approach is that the difference between the classification in one level and another can be miniscule. For instance, if the RTM-NLP approach generates a value of 39.5% for the dependency between 2 FRs, this would not indicate any dependency between the FRs, whereas a value of 40.5% would already indicate a weak dependency. Using the fuzzy logic, this problem is minimized as it is possible to work with a nebulous level between those intervals through the use of a pertinence function. As seen earlier, this conversion from absolute values to its fuzzy representation is called fuzzification, used for creating the pertinence functions. In the pertinence functions, the X axis represents the dependency percentage between FRs (from 0% to 100%), and the Y axis represents the pertinence level, i.e. the probability of belonging to a certain fuzzy set ( no dependence, weak or strong ), which can vary from 0 to 1. Figure 5 illustrates the fuzzy system adopted, with RTMe and RTMnlp as the entry data. Figures 6 and 7 present, respectively, the pertinence function of the RTM- E and RTM-NLP approaches, and the X axis indicates the dependency percentage calculated in each approach. The Y axis indicates the pertinence degree, ranging from 0 to 1. The higher the pertinence value, the bigger the chance of it being in one of the possible sets ( no dependence, weak, or strong ). There ranges of values exist in which the pertinence percentage can be higher for one set and low for the other (for example the range with a dependence percentage between 35% and 55% in Figure 6). Table I indicates the rules created for the fuzzy system. Such rules are used to calculate the output value, i.e. the RTMfuzzy determination. These rules were derived from the authors experience through an interative and iterative process. Figure 5 Fuzzy System Figure 6 Pertinence function for RTM-E no dependence weak strong Figure 8 shows the output pertinence function in the same way as shown in Figures 6 and 7, where the X axis

42 indicates the RTMfuzzy dependence percentage and the Y axis indicates the pertinence degree between 0 and 1. no dependence weak strong V. EXPERIMENTAL STUDY To evaluate the effectiveness of the proposed approaches, an experimental study has been conducted following the guidelines below: - Context: The experiment has been conducted in the context of the Software Engineering class at UFSCar, Federal University of São Carlos, as a volunteer extra activity. The experiment consisted of each pair of students conducting requirements gathering on a system involving a real stakeholder. The final RD had to be created in the COCAR tool. weak dependence Figure 7 Pertinence function for RTM-NLP if if if if if if if if if TABLE I RULES USED IN FUZZY SYSTEM Antecedent RTM-E = no dependence AND RTM-NLP = no dependence RTM-E = weak AND RTM- NLP = weak RTM-E = no dependence AND RTM-NLP = strong RTM-E = strong AND RTM- NLP = strong RTM-E = no dependence AND RTM-NLP = weak RTM-E = weak AND RTM- NLP = no dependence RTM-E = no dependence AND RTM-NLP = strong RTM-E = strong AND RTM- NLP = weak RTM-E = strong AND RTM- NLP= no dependence then then then then then then then then then Consequent no dependence weak dependence weak dependence strong dependence no dependence weak dependence weak dependence strong dependence weak dependence 42.5 Figure 8 Pertinence functions for the Fuzzy System output. Figure 9 RTM-Fuzzy calculation no dependence weak strong To exemplify the RTM-Fuzzy approach, the same aviation company stock system is used in the other approaches. The selected FRs to be used in the example are FR3, related to data insertion in a stock (and already used in the other examples), and FR7, related to the report generation on stock. Such a combination was due to the fact they do not have common entry data and, therefore, there is no dependency between them. Despite this, RTM- NLP indicates a strong dependency (75.3%) between these requirements. This occurs because both FRs deal with the same set of data (although they do not have common entry data) and a similar scope, thus explaining their textual similarity. It can be observed in Figure 8 that RTM-E shows no dependency, whereas RTM-NLP shows a strong dependency (treated in the third rule). In the fuzzy logic processing (presented in Figure 9) and after applying Mandami s inference technique, the resulting value for the entries is Looking at Figure 8, it can be concluded that this value corresponds to a weak dependence, with 1 as the pertinence level. In this way, the cell corresponding to the intersection of FR3 and FR7 of the RTMfuzzy has as the value weak. - Objective: Evaluate the effectiveness of the RTM-E, RTM-NLP, and RTM-Fuzzy approaches in comparison to a reference RTM (called RTM-Ref) and constructed by the detailed analysis of the RD. The RTM-Ref creation is detailed next. - Participants: 28 graduation students on the Bachelor Computer Sciences course at UFSCar - Artifacts utilized: RD, with the following characteristics: produced by a pair of students on their own; related to a real application, with the participation of a stakeholder with broad experience in the application domain; related to information systems domain with basic creation, retrieval, updating and deletion of data; inspected by a different pair of students in order to identify and eliminate possible defects; included in the COCAR tool after identified defects are removed. - RTM-Ref: created from RD input into the COCAR tool;

built based on the detailed reading and analysis of each FR pair, determining the dependency between them as no dependence, weak, or strong; recorded in a spreadsheet so that the RTM-Ref created beforehand could be compared to the RTMe, RTMnlp and RTMfuzzy of each system; built by this work's authors, who were always in touch with the RDs' authors whenever a doubt was found. Every dependency (data, functionality or precedence) was considered as a link.

- Metrics: the metric used was the effectiveness of the three approaches with regard to the coincidental dependencies found by each approach in relation to the RTM-Ref. The effectiveness is calculated by the relation between the quantity of dependencies correctly found by each approach and the total of all dependencies that can be found between the FRs. Considering a system consisting of n FRs, the total quantity of all possible dependencies (T) is given by Equation 6:

    T = n (n - 1) / 2    (6)

Therefore, the effectiveness rate is given by Equation 7:

    Effectiveness = (number of correctly determined dependencies) / T    (7)

- Threats to validity: The experimental study conducted poses some threats to validity, mainly in terms of the students' inexperience in identifying the requirements with the stakeholders. In an attempt to minimize this risk, systems from known domains were used, as well as RD inspection activities. The latter were conducted based on a defect taxonomy commonly adopted in this context, which considers inconsistencies, omissions, and ambiguities, among others. Another risk is the fact that the RTM-Ref was built by people who did not have direct contact with the stakeholder, and therefore this matrix could be influenced by eventual problems present in the RDs. To minimize this risk, whenever any doubt was found when determining whether a relationship occurred or not, the students' help was solicited. In some cases a better comprehension of the requirements along with the stakeholder was necessary, which certainly minimized errors when creating the RTM-Ref.

- Results: The results of the comparison between the data in RTMe, RTMnlp, and RTMfuzzy are presented in Table II. The first column contains the name of the specified system; the second column contains the FR quantity; the third provides the total number of possible dependencies between FRs that may exist (being strong, weak or no dependence), whose formula was given in Equation 6. The fourth, sixth and eighth columns contain the total number of coincidental dependencies between the RTMe, RTMnlp and RTMfuzzy matrices and the RTM-Ref. Exemplifying: if the RTM-Ref has determined a strong dependency in a cell and the RTM-E approach also registered the dependency as strong in the same position, a correct relationship is counted. The fifth, seventh and ninth columns represent the effectiveness of the RTM-E, RTM-NLP, and RTM-Fuzzy approaches, respectively, calculated by the relation between the quantity of correct dependencies found by the approach and the total number of dependencies that could be found (third column).

TABLE II EXPERIMENTAL STUDY RESULTS
System / Req Qty / # of possible dependencies / RTM-E: correct, effect. / RTM-NLP: correct, effect. / RTM-Fuzzy: correct, effect.
1 Zoo % % %
2 Habitation % % %
3 Student Flat % % %
4 Taxi % 77 73% 85 81%
5 Clothing Store % % %
6 Freight % 85 71% %
7 Court % % %
8 Financial Control % % %
9 Administration % % %
10 Book Store % % %
11 Ticket % 91 87% 94 90%
12 Movies % 82 68% 97 81%
13 Bus % 78 74% 81 77%
14 School % 77 73% 84 80%
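The effectiveness figures in Table II follow directly from Equations 6 and 7; a minimal sketch, using the 14-FR stock-control case study of Section IV as the example size:

    def total_dependencies(n):
        # Equation 6: number of FR pairs above the main diagonal of the RTM.
        return n * (n - 1) // 2

    def effectiveness(correct, n):
        # Equation 7: share of pairs whose level matches the reference RTM, as a percentage.
        return 100.0 * correct / total_dependencies(n)

    total_dependencies(14)    # 91 possible dependencies for a 14-FR specification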

44 - Analysis of Results: The statistical analysis has been conducted using SigmaPlot software. By applying the Shapiro-Wilk test it could be verified that the data were following a normal distribution, and the results shown next are in the format: average ± standard deviation. To compare the effectiveness of the proposed approaches (RTM-E, RTM-NLP and RTM-Fuzzy) variance analysis (ANOVA) has been used for post-test repeated measurements using the Holm-Sidak method. The significance level adopted is 5%. The RTM-Fuzzy approach was found to be the most effective with (82.57% ± 4.85), whereas the RTM-E approach offered (77.36% ± 5.05) and the RTM-NLP obtained an effectiveness level of (75.64% ± 6.57). These results are similar to the results of the real case study presented in Section IV (company s stock control). In the case study, the RTM-Fuzzy effectiveness was 81.69%, the RTM-E effectiveness was 78.06% and the RTM-NLP effectiveness was 71.72%. In this experimental study, the results found for the RTM-E approach were similar to those found in a previous study [11]. Despite that, in the previous study the RTM-NLP only presented an effectiveness level of 53%, which lead us to analyze and modify this approach and the improvements were already in place when this work was evaluated. Even with such improvements, the approach still generates false positive cases, i.e. non-existing dependencies between FRs. According to Sundaram, Hayes, Dekhtyar and Holbrook [4] the occurrence of false positive is an NLP characteristic, although this type of processing can easily retrieve the relationship between FRs. In the RTM-NLP approach, the reason for it generating such false positive cases is the fact that, many times, the same words are used to describe two different FRs, thus indicating a similarity between the FRs which is not a dependency between them. Examples of word that can generate false positives are set, fetch and list. Solutions to this kind of problem are being studied in order to improve this approach. One of the alternatives is the use of a Tagger to classify each natural language term in its own grammatical class (article, preposition, conjunction, verb, substantive, or adjective). In this way verbs could receive a different weight from substantives in similarity calculus. A preliminary evaluation of this alternative was manually executed, generating better effectiveness in true relationship determination. In the RTM-E data analysis, false positives did not occur. The dependencies found, even the weak ones, did exist. The errors influencing this approach were due to relationships that should have been counted as strong being counted as weak. This occurred because many times the dependency between two FRs was related to data manipulated by both, regardless of them being entry or output data. This way, the RTM-E approach is also being evaluated with the objective to incorporate improvements that can make it more effective. As previously mentioned, if a relation was found as strong in RTM-Ref and the proposed approach indicated that the relation was weak, an error in the experiment s traceability was counted. In the case relationships indicating only if there is or there is not a traceability link were generated, i.e. without using the weak or strong labels, the effectiveness determined would be higher. In such a case the Precision and Recall [10] metrics could be used, given that such metrics only take in account the fact that a dependency exists and not their level ( weak or strong ). 
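The statistical procedure described at the start of this analysis can be reproduced along the following lines. The sketch only covers the descriptive statistics and the Shapiro-Wilk normality check (SciPy); the repeated-measures ANOVA with the Holm-Sidak post-test reported here was run in SigmaPlot, and variable names are illustrative.

    import numpy as np
    from scipy.stats import shapiro

    def summarize(effectiveness_per_system):
        # effectiveness_per_system: one effectiveness value (%) per system for a given approach
        values = np.asarray(effectiveness_per_system, dtype=float)
        w_stat, p_value = shapiro(values)        # Shapiro-Wilk normality test
        return values.mean(), values.std(ddof=1), p_value

    # Results are reported as "average ± standard deviation", e.g. 82.57 ± 4.85 for RTM-Fuzzy.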
In relation to the RTM-Fuzzy approach, the results generated by it were always the same or higher than the results found by the RMT-E and RTM-NLP approaches alone. Nevertheless, with some adjustments in the fuzzy system pertinence functions, better results could be found. Such an adjustment is an iterative process, depending on an evaluation each time a change is done. A more broadened research could, for instance, introduce intermediary levels between linguistic variables as a way to map concepts that are hard to precisely consider in the RTM. To improve the results, genetic algorithms can make a more precise determination of the parameters involved in the pertinence functions. VI. CONCLUSIONS AND FUTURE WORK This paper presented an approach based on fuzzy logic RTM-Fuzzy to automatically generate the requirements traceability matrix. Fuzzy logic was used to treat those uncertainties that might negatively interfere in the requirements traceability determination. The RTM-Fuzzy approach was defined based on two other approaches also presented in this paper: RTM-E, which is based on the percentage of entry data that two FRs have in common, and RTM-NLP, which uses NLP to determine the level of dependency between requirements. From the three approaches presented, it is worth pointing that there are already some reported proposals in the literature using NLP for traceability link determination, mainly involving different artifacts (requirements and models, models and source-code, or requirements and test cases). Such a situation is not found in RTM-E, for which no similar attempt was found in the literature. All approaches were implemented in the COCAR environment, so that the experimental study could be performed to evaluate the effectiveness of each approach. The results showed that RTM-Fuzzy presented a superior effectiveness compared to the other two. This transpired because the RTM-Fuzzy uses the results presented in the other two approaches but adds a diffuse treatment in order to perform more flexible traceability matrix generation. Hence the consideration of traceability matrix determination is a difficult task, even for specialists, and using the uncertainties treatment provided by fuzzy logic has shown to be a good solution to automatically determine traceability links with enhanced effectiveness. The results motivate the continuity of this research, as well as further investigation into how better to combine the approaches for RTM creation using fuzzy logic. The main contributions of this particular work are the incorporation of the COCAR environment, and correspond to the automatic relationship determination between FRs. This facilitates the evaluation of the impact that a change in a requirement can generate on the others. New studies are being conducted to improve the effectiveness of the approaches. As future work, it is intended to improve the NLP techniques used by considering the use of a tagger and the incorporation of a glossary for synonym treatment. Another investigation to be done regards how an RTM can aid the software maintenance process, more specifically, offer support for regression test generation.

45 REFERENCES [1] Standish Group, CHAOS Report 2011, Available at: Last accessed March [2] Standish Group, CHAOS Report 1994, Available at: p Last accessed February [3] A.M. Salem, "Improving Software Quality through Requirements Traceability Models", 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2006), Dubai, Sharjah, UAE, [4] S.K.A. Sundaram, J.H.B. Hayes, A.C. Dekhtyar, E.A.D. Holbrook, "Assessing Traceability of Software Engineering Artifacts", 18th International IEEE Requirements Engineering Conference, Sydney, Australia, [5] X. Wang, G. Lai, C. Liu, "Recovering Relationships between Documentation and Source Code based on the Characteristics of Software Engineering", Electronic Notes in Theoretical Computer Science, [6] J.H. Hayes, A. Dekhtyar, J. Osborne, "Improving Requirements Tracing via Information Retrieval", Proceedings of 11th IEEE International Requirements Engineering Conference, IEEE CS Press, Monterey, CA, 2003, pp [7] J.H. Hayes, A. Dekhtyar, S. Sundaram, "Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods", IEEE Transactions on Software Engineering, vol. 32, no. 1, January 2006, [8] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, 1990, pp [9] R. Baeza-Yates, A..Berthier, A. Ribeiro-Neto, Modern Information Retrieval. ACM Press / Addison-Wesley, [10] J. Cleland-Huang, O. Gotel, A. Zisman, Software and Systems Traceability. Springer, 2012, 491 p. [11] A. Di Thommazo, G. Malimpensa, G. Olivatto, T. Ribeiro, S. Fabbri, "Requirements Traceability Matrix: Automatic Generation and Visualization", Proceedings of the 26th Brazilian Symposium on Software Engineering, Natal, Brazil, [12] A. Di Thommazo, M.D.C. Martins, S.C.P.F. Fabbri, Requirements Management in COCAR Enviroment (in Portuguese), WER 07: Workshop de Engenharia de Requisitos, Toronto, Canada, [13] I. Sommerville, Software Engineering. 9th ed. New York, Addison Wesley, [14] A. Zisman, G. Spanoudakis, "Software Traceability: Past, Present, and Future", Newsletter of the Requirements Engineering Specialist Group of the British Computer Society, September [15] Standish Group, CHAOS Report 2005, Available at: spotlight.pdf Last accessed February [16] A. Kannenberg, H. Saiedian, "Why Software Requirements Traceability Remains a Challenge", CrossTalk: The Journal of Defense Software Engineering, July/August [17] A. Goknil, I. Kurtev, K. Van den Berg, J.W. Veldhuis, "Semantics of Trace Relations in Requirements Models for Consistency Checking and Inferencing", Software and Systems Modeling, vol. 10, iss. 1, February [18] Y. Guo, M. Yang, J. Wang, P. Yang, F. Li, "An Ontology based Improved Software Requirement Traceability Matrix", 2nd International Symposium on Knowledge Acquisition and Modeling, KAM, Wuhan, China, [19] E.V. Munson, T.N. Nguyen, "Concordance, Conformance, Versions, and Traceability", Proceedings of the 3rd International Workshop on Traceability in Emerging Forms of Software Engineering, Long Beach, California, [20] D. Cuddeback, A. Dekhtyar, J.H. Hayes, "Automated Requirements Traceability: The Study of Human Analysts", Proceedings of the th IEEE International Requirements Engineering Conference (RE2010), Sydney, Australia, [21] IBM, Ten Steps to Better Requirements Management. Available at: W14059USEN.PDF Last accessed March [22] L.A. Zadeh, "Fuzzy Sets", Information Control, vol. 8, pp , [23] A.O. 
Artero, "Artificial Intelligence - Theory and Practice", Livraria Fisica, 2009, 230 p. [24] M. Ramzan, M.A. Jaffar, M.A. Iqbal, S. Anwar, A.A. Shahid, "Value based Fuzzy Requirement Prioritization and its Evaluation Framework", 4th International Conference on Innovative Computing, Information and Control (ICICIC), [25] J. Yen, W.A. Tiao, "Formal Framework for the Impacts of Design Strategies on Requirements", Proceedings of the Asian Fuzzy Systems Symposium, [26] S.G. MacDonell, A.R. Gray, J.M. Calvert, "FULSOME: A Fuzzy Logic Modeling Tool for Software Metricians", Annual Conference of the North American Fuzzy Information Processing Society (NAFIPS), [27] P.S. Sandhu, H. Singh, "A Fuzzy-Inference System based Approach for the Prediction of Quality of Reusable Software Components", Proceedings of the 14th International Conference on Advanced Computing and Communications (ADCOM), [28] K.K. Kawai, "Guidelines for Preparation of Requirements Document with Emphasis on the Functional Requirements" (in Portuguese), Master Thesis, Universidade Federal de São Carlos, São Carlos, [29] R. Real, J.M. Vargas, The Probabilistic Basis of Jaccard's Index of Similarity, Systematic Biology, vol. 45, no. 3, pp , Avalilable at: Last accessed November [30] G. Cysneiros, A. Zisman, "Traceability and Completeness Checking for Agent Oriented Systems", Proceedings of the 2008 ACM Symposium on Applied Computing, New York, USA, [31] D.K. Deeptimahanti, R. Sanyal, "Semi-automatic Generation of UML Models from Natural Language Requirements", Proceedings of the 4th India Software Engineering Conference 2011 (ISEC'11), Kerala, India, [32] G. Salton, J. Allan, "Text Retrieval Using the Vector Processing Model", 3rd Symposium on Document Analysis and Information Retrieval, University of Nevada, Las Vegas, 1994.

46 Determining Integration and Test Orders in the Presence of Modularization Restrictions Wesley Klewerton Guez Assunção 1,2, Thelma Elita Colanzi 1,3, Silvia Regina Vergilio 1, Aurora Pozo 1 1 DINF - Federal University of Paraná (UFPR), CP: 19081, CEP: , Curitiba, Brazil 2 COINF - Technological Federal University of Paraná (UTFPR), CEP: , Toledo, Brazil 3 DIN - Computer Science Department - State University of Maringá (UEM), CEP: , Maringá, Brazil {wesleyk, thelmae, silvia, aurora}@inf.ufpr.br Abstract The Integration and Test Order problem is very known in the software testing area. It is related to the determination of a test order of modules that minimizes stub creation effort, and consequently testing costs. A solution approach based on Multi-Objective and Evolutionary Algorithms (MOEAs) achieved promising results, since these algorithms allow the use of different factors and measures that can affect the stubbing process, such as number of attributes and operations to be simulated by the stub. However, works based on such approach do not consider different modularization restrictions related to the software development environment. For example, the fact that some modules can be grouped into clusters to be developed and tested by independent teams. This is a very common practice in most organizations, particularly in that ones that adopt a distributed development process. Considering this fact, this paper introduces an evolutionary and multi-objective strategy to deal with such restrictions. The strategy was implemented and evaluated with real systems and three MOEAs. The results are analysed in order to compare the algorithms performance, and to better understand the problem in the presence of modularization restrictions. We observe an impact in the costs and a more complex search, when restrictions are considered. The obtained solutions are very useful and the strategy is applicable in practice. Index Terms Software testing; multi-objective evolutionary algorithms; distributed development. I. INTRODUCTION The Integration and Test Order problem is concerning to the determination of a test sequence of modules that minimizes stubbing costs in the integration testing. The test is generally conducted in different phases. For example, the unit testing searches for faults in the smallest part to be tested, the module. In the integration test phase the goal is to find interaction faults between the units. In many cases, there are dependency relations between the modules, that is, to test a module A another module B needs to be available. When dependency cycles among modules exist it is necessary to break the cycle and to construct a stub for B. However, the stubbing process may be expensive and to reduce stubbing costs we can find in the literature several approaches. This is an active research topic that was recently addressed in a survey [1]. The most promising results were found by the search-based approach with Multi-Objective and Evolutionary Algorithms (MOEAs) [2] [7]. These algorithms offer a multi-objective treatment to the problem. They use Pareto s dominance concepts to provide the tester a set of good solutions (orders) that represent the best trade-off among different factors (objectives) to measure the stubbing costs, such as, the number of operations, attributes, methods parameters and outputs, which are necessary to emulate the stub behaviour. 
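To make the stub scenario concrete: it is the dependency cycles mentioned above that force stub construction, since no order can place every depended-on module first. A minimal sketch of how such a cycle can be detected in a module dependency graph; the graph representation is ours and purely illustrative.

    # Illustrative only: detect whether the module dependency graph contains a cycle.
    # deps maps each module to the modules it depends on.
    def has_cycle(deps):
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {m: WHITE for m in deps}

        def visit(m):
            color[m] = GRAY
            for d in deps.get(m, []):
                if color.get(d, WHITE) == GRAY:
                    return True                  # back edge found: dependency cycle
                if color.get(d, WHITE) == WHITE and visit(d):
                    return True
            color[m] = BLACK
            return False

        return any(color[m] == WHITE and visit(m) for m in deps)

    has_cycle({"A": ["B"], "B": ["A"]})   # True: testing A first requires a stub for B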
The use of MOEAs to solve the Integration and Test Order problem in the object-oriented context was introduced in our previous work [2]. After achieving satisfactory results, we applied MOEAs in the aspect-oriented context with different number of objectives [4], [6] and using different strategies to test aspects and classes [7] (based on the study of Ré and Masiero [8]). In [3], the approach was generalized and named MOCAITO (Multi-objective Optimization and Coupling-based Approach for the Integration and Test Order problem). MO- CAITO is an approach that solves the referred problem by using MOEAs and coupling measures. It is suitable to any type of unit to be integrated in different contexts, including object and aspect-oriented software, component-driven, software product line, and service-oriented contexts. The units to be tested can be components, classes, aspects, services and so on. The steps include: i) the construction of a model to represent the dependencies between the units; ii) the definition of a cost model related to the fitness functions and objectives; iii) the multi-objective optimization, i.e., the application of the algorithms; and iv) the selection of a test order to be used by the tester. MOCAITO was implemented and evaluated in the object and aspect-oriented contexts and presented better results when compared with other existing approaches. However, we observe a limitation for MOCAITO and all approaches found in the literature. In practice there may be different restrictions related to the software development that are not considered by existing approaches. For example, some relevant restrictions are related to software modularization. Modularity is an important design principle that allows the division of the software in modules. Modularity is useful for dealing with complexity, improves comprehension, eases reuse, and reduces development efforts. Furthermore, it facilitates the management in a distributed development [9], [10]. In this kind of development, generally clusters of related modules are developed and tested at separate locations by different teams. In a posteriori stage all the sets are then integrated. In some cases, these teams may be members of the same organization; in other cases, collaboration or outsourcing involving different organizations may exist.

47 The dependencies between modules across different clusters make the integration testing more difficult to perform. To determine an order that implies a minimum cost is in most situations a very hard task for software engineers without using an automated strategy. Considering this fact, this paper introduces a strategy to help in this task and to determine the best module orders to the Integration and Test Order problem in the presence of modularization restrictions. The strategy is based on evolutionary optimization algorithms and is implemented in the MOCAITO approach. We implemented the MOEAs NSGA-II, SPEA2 and PAES, traditionally used in related work, and introduce evolutionary operators to consider that some modules are developed and tested together, and thus these modules need to appear as a cluster in the solution (order). Moreover, four measures (objectives) are used. Determining the orders in the presence of restrictions imposes some limitations in the search space, and consequently impacts the performance of the MOEAs. So, it is important to evaluate the impact of modularization restrictions during the integration testing. To evaluate this impact, we conducted experiments by applying two strategies: a strategy with and another one without software modularization restrictions. The experiment uses eight real systems and two development contexts: object and aspect-oriented ones. The paper is organized as follows. Section II reviews previous related researches, including the approach MOCAITO. Section III introduces the proposed strategy for the integration problem in the presence of modularization restrictions and shows how the algorithms were implemented. Section IV contains the experimental evaluation setting. Section V presents and analyses the obtained results. Finally, Section VI concludes the paper and points out future research works. II. RELATED WORK The Integration and Test Order problem has been addressed in the literature by many works [1] in different software development contexts: object and aspect-oriented software, component-driven development, software product lines and service-oriented systems. The existing approaches are generally based on graphs where the nodes represent the units to be integrated and the edges the relationships between them [11]. The goal is to find an order for integrating and testing the units that minimizes stubbing costs. At this end, several optimization algorithms have been applied, as well as, different cost measures. The called traditional approaches [11] [15] are based on classical algorithms, which provides an exact solution, not necessarily optima. Metaheuristics, search-based techniques, such as Genetic Algorithms (GAs), provide better solutions since avoid local optima [16]. The multi-objective algorithms offer a better treatment to the problem that is in fact dependent on different and conflicting measures [2], [4]. However, we observe that all of the existing approaches and studies have a limitation. They do not consider and were not evaluated with real restrictions associated to the software development, such as modularization restrictions and groups of modules that are developed together. To introduce a strategy that consider the problem in the presence of such restrictions is the goal of this paper. To this end, the strategy was proposed and implemented to be used with the multi-objective approach. 
This choice is justified by studies conducted in the works described above, which show that, independently of the development context, multi-objective approaches present better results. Pareto's dominance concepts are used to determine a set of good, non-dominated solutions to the problem. A solution is considered non-dominated according to its associated objective values: with respect to every other solution, at least one of its values must be better than the corresponding value, and the remaining values must be at least equal.

The solution that deals with modularization restrictions is thus proposed to be used with MOCAITO [3]. The main reason for this is that the approach is generic and can be used in different software development contexts. MOCAITO is based on the multi-objective optimization of coupling measures, which are used as objectives by the algorithms. The steps of the approach are presented in Figure 1. First of all, a dependency model that represents the dependency relations among the units to be integrated is built. This allows the application of MOCAITO in different development contexts, with different kinds of units to be integrated. Examples of such models, used in our work, are the ORD (Object Relation Diagram [11]) and its extension for the aspect-oriented context [12]. In these diagrams, the vertices represent the modules (classes or aspects), and the edges represent the relations, which can be: association, cross-cutting association, use, aggregation, composition, inheritance, inter-type declarations, and so on. Another step is the definition of a cost model. This model is generally based on software measures, used as fitness functions (objectives) by the optimization algorithms. Such measures are related to the costs of the stubbing process. MOCAITO was evaluated with different numbers of objectives, traditionally considered in the literature, and based on four coupling measures. Then, considering that m_i and m_j are two coupled modules, the coupling measures used are defined as follows:

- Attribute Coupling (A): the maximum number of attributes to be emulated in stubs related to the broken dependencies [16]. A is represented by a matrix AM(i, j), where rows and columns are modules and i depends on j. For a given test order t with n modules and a set of d dependencies to be broken, considering that k is any module included before the module i, A is calculated according to: $A(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} AM(i, j), \; j \neq k$

- Operation Coupling (O): the maximum number of operations to be emulated in stubs related to the broken dependencies [16]. O is represented by a matrix OM(i, j), where rows and columns are modules and i depends on j. Then, for a given test order t with n modules and a set of d dependencies to be broken, considering that k is any module included before the module i, O is computed as defined by: $O(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} OM(i, j), \; j \neq k$

48 Fig. 1. MOCAITO Steps (extracted from [3]): dependency information and cost information feed the construction of the dependency model and the definition of the cost model; both models enter the multi-objective optimization, which produces test orders; order selection, guided by the tester's rules and constraints, yields the selected test order.

- Number of distinct return types (R): the number of distinct return types of the operations locally declared in the module m_j that are called by operations of the module m_i. Returns of type void are not counted, since they represent the absence of return. Similarly to the previous measures, this measure is given by: $R(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} RM(i, j), \; j \neq k$

- Number of distinct parameter types (P): the number of distinct parameters of the operations locally declared in m_j that are called by operations of m_i. When there are overloaded operations, the number of parameters equals the sum of all distinct parameter types among all implementations of each overloaded operation. So, the worst case is considered, represented by situations in which the coupling consists of calls to all implementations of a given operation. This measure is given by: $P(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} PM(i, j), \; j \neq k$

After this, the multi-objective algorithms are applied. They can work with constraints given by the tester, which can make an order invalid. Some constraints, adopted in the approach evaluation [3], are not to break inheritance and inter-type declaration dependencies. These dependencies are complex to simulate, so to deal with these types of constraints the dependent modules are preceded by the required modules. The treatment involves checking the test order from the first to the last module and, if a precedence constraint is broken, the module in question is placed at the end of the order. As output, the algorithms generate a set of solutions for the problem that have the best trade-off among the objectives. The orders can then be ranked according to some priorities (rules) of the tester.

In [3], MOCAITO was evaluated in the object and aspect-oriented contexts with different numbers of objectives and with three MOEAs: NSGA-II [17], SPEA2 [18] and PAES [19]. These three MOEAs were chosen because they are well known, largely applied, and adopt different evolution and diversification strategies [20]. Moreover, knowing which algorithm is more suitable to solve a particular problem is a question that needs to be answered by means of experimental results. The main results found are: there is no difference among the algorithms for simple systems and contexts; SPEA2 was the most expensive, with the greatest runtime; NSGA-II was the most suitable in the general case (considering different quality indicators and all systems); and PAES presented better performance for more complex systems.

However, the approach was not evaluated taking into account some real restrictions, generally associated with software development and mentioned in the previous section. Most organizations nowadays adopt distributed development but, in spite of this, we observe in the survey of the literature [1] that related restrictions are not considered by studies that deal with the Integration and Test Order problem. In this sense, the main contribution of this paper is to introduce and evaluate a strategy based on optimization algorithms to solve the referred problem in the presence of modularization restrictions.
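As a concrete reference for the cost model above, the sketch below (hypothetical names; not the authors' implementation, and assuming modules numbered 0..n-1) shows how the four coupling objectives A, O, R and P could be computed for a candidate order, by summing the coupling matrices over the dependencies that the order breaks:

```java
// Sketch of the four coupling objectives (A, O, R, P) for a candidate test order.
// AM, OM, RM and PM are the n-by-n coupling matrices described above;
// order[p] holds the module tested at position p. Illustrative code only.
public final class CouplingObjectives {

    /** Sums matrix entries for every dependency i -> j that the order breaks
     *  (j is not yet integrated when i is tested), i.e., a stub for j is required. */
    static int brokenCoupling(int[][] matrix, int[] order) {
        int n = order.length;
        int[] position = new int[n];
        for (int p = 0; p < n; p++) position[order[p]] = p;
        int total = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // matrix[i][j] > 0 means module i depends on module j.
                if (i != j && matrix[i][j] > 0 && position[j] > position[i]) {
                    total += matrix[i][j];
                }
            }
        }
        return total;
    }

    /** Returns the objective vector (A, O, R, P) for the given order. */
    static int[] evaluate(int[] order, int[][] am, int[][] om, int[][] rm, int[][] pm) {
        return new int[] {
            brokenCoupling(am, order),  // A: attributes to emulate in stubs
            brokenCoupling(om, order),  // O: operations to emulate in stubs
            brokenCoupling(rm, order),  // R: distinct return types
            brokenCoupling(pm, order)   // P: distinct parameter types
        };
    }
}
```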
In fact, in the presence of such restrictions a new problem (variant of the original one) emerges, which presents new challenges related to task allocation versus the establishment of an integration and test order. For this, a novel solution strategy is necessary and proposed, including a new representation and new genetic operators. This new solution strategy is evaluated with the MOCAITO approach, by using its cost and dependency models. However, it could be applied with other approaches. So, the next section describes implementation aspects of the strategy to deal with such modularization restrictions. III. WORKING WITH MODULARIZATION RESTRICTIONS A restriction is a condition that a solution is required to satisfy. In the context of this study, restrictions are related to modularization. We consider that some modules are grouped by the software engineer, forming a cluster. Modules in a cluster must be developed and tested together. Figure 2 presents an example of modularization restrictions. Considering a system with twelve modules identified from 1 to 12. Due to a distributed development environment, the software engineer determines three clusters (groups) identified by A1, A2 and A3. To mix modules of distinct clusters is not valid, such as it happens in Order C. Using Order C the developers of A1 need to wait developers of A3 to develop some modules to finish and test their modules. Orders A and B are examples of valid orders. These orders allow teams working independently. In order A the modules of cluster A1 are firstly developed, integrated and tested. Since the team responsible for the modules of A1 finish their work, the development of modules of cluster A3 can start having all the modules of A1 available to be used in the integration test. Similarly, when the team responsible for the cluster A2 starts its work, the modules of A1 and A3 are already available. The independence on the development of each cluster by different teams also occurs in order B, since the modules of each cluster are in sequence to be developed, integrated and tested. Although Figure 1 shows the modules in sequence, when there

49 are no dependencies between some clusters, the development may be performed in a parallel way. In this case, each team could develop and test the modules of its cluster according to the test order; later, the modules would be integrated, also in accordance with the same test order.

Fig. 2. Example of modularization restrictions: a system with twelve modules grouped into three clusters (A1, A2 and A3), and three integration and test orders (Order A, Order B and Order C).

A. Problem Representation
To implement a solution to the problem with restrictions, the first point refers to the problem representation. The traditional way to deal with the problem uses as representation an array of integers, where each number in the array corresponds to a module identifier [3]. However, a more elaborate representation is needed to consider the modules grouped into clusters. A class Cluster, as presented in Figure 3, was implemented. An object of the class Cluster is composed of two attributes: (i) a cluster identifier (id) of type integer; and (ii) an array of integers (modules) that represents the cluster modules. An individual is composed of n Cluster objects, where n is the number of clusters.

Fig. 3. Class Cluster (attributes: id: int; modules: ArrayList<Integer>).
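A minimal sketch, in Java, of this representation (Figure 3) and of the Inter and Intra Cluster mutation operators described next in Section III-B; the names and method signatures are illustrative assumptions, not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the cluster-based representation (Figure 3) and the Inter/Intra Cluster
// mutations described in Section III-B. Illustrative only.
class Cluster {
    int id;                                          // cluster identifier
    ArrayList<Integer> modules = new ArrayList<>();  // module identifiers in test order
}

class Individual {
    // An individual (solution) is a sequence of n Cluster objects.
    List<Cluster> clusters = new ArrayList<>();
    private static final Random RNG = new Random();

    /** Inter Cluster mutation: swaps the positions of two clusters in the order. */
    void interClusterMutation() {
        int a = RNG.nextInt(clusters.size());
        int b = RNG.nextInt(clusters.size());
        Collections.swap(clusters, a, b);
    }

    /** Intra Cluster mutation: swaps two module positions inside one random cluster. */
    void intraClusterMutation() {
        Cluster c = clusters.get(RNG.nextInt(clusters.size()));
        if (c.modules.size() < 2) return;
        int a = RNG.nextInt(c.modules.size());
        int b = RNG.nextInt(c.modules.size());
        Collections.swap(c.modules, a, b);
    }
}
```

The crossover operators described next follow the same pattern: the Inter Cluster crossover exchanges whole clusters between parents, while the Intra Cluster crossover applies a two-point crossover inside a single randomly selected cluster and copies the remaining clusters unchanged.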
B. Evolutionary Operators
The traditional way [2]–[5] to apply the evolutionary operators to the Integration and Test Order problem is the same adopted in a permutation problem. However, with the modularization restrictions, a new way to generate and deal with the individuals (solutions) is required. Next, we present the implemented crossover and mutation operators.

1) Crossover: The MOCAITO approach applies the two-point crossover [3]. However, a simple random selection of two points to perform the crossover could disperse the modules of a cluster across the order. So, considering the modularization restrictions, two types of crossover were implemented: (i) Inter Cluster; and (ii) Intra Cluster, which are depicted in Figure 4. The goal of the Inter Cluster crossover is to generate children receiving the full exchange of complete clusters between their parents. As illustrated in the example of Figure 4(a), after the random selection of the cluster to be exchanged, Child1 receives Cluster2 and Cluster3 from Parent1, and Cluster1 from Parent2. In the same way, Child2 receives Cluster2 and Cluster3 from Parent2, and Cluster1 from Parent1.

Fig. 4. Crossover Operator: (a) Inter Cluster; (b) Intra Cluster.

The Intra Cluster crossover aims at creating new solutions that receive clusters generated by a two-point crossover of a specific cluster. After the random selection of one cluster, the traditional two-point crossover for permutation problems is applied. The other clusters, which do not participate in the crossover, are copied from parents to children. As illustrated in Figure 4(b), Cluster1 was randomly selected and two crossover points were defined. Cluster2 and Cluster3 from Parent1 are just copied to Child1 and, in the same way, Cluster2 and Cluster3 from Parent2 are just copied to Child2. During the crossover, in the evolutionary process, two parents are selected and four children are generated: two using the Inter Cluster crossover and two using the Intra Cluster crossover.

2) Mutation: The MOCAITO approach applies the traditional swap mutation [3], swapping module positions in the order. But, again, the simple application of this mutation could disperse the modules of a cluster across the order. So, two different types of mutation are implemented: (i) Inter Cluster; and (ii) Intra Cluster, which are presented in Figure 5.

Fig. 5. Mutation Operator: (a) Inter Cluster; (b) Intra Cluster.

Both types of mutation are simple. While the Inter Cluster mutation swaps cluster positions in the order, the Intra Cluster mutation swaps module positions in a cluster. Figure 5(a) illustrates the Inter Cluster mutation, where the positions of Cluster1 and Cluster3 are swapped. Figure 5(b) presents an example of Intra Cluster mutation, where after the random selection

50 of Cluster1, the positions of Modules 1 and 3 are swapped. During the evolutionary process both kind of mutations have 50% of probability to be chosen. C. Repairing broken dependencies There are two types of repairing orders with broken dependencies constraints (inheritance and inter-type declarations). In the Intra Cluster treatment, the constraints between modules in the same cluster are checked and the precedence is corrected by placing the corresponding module at the cluster end. After all the precedences of modules inside the clusters are correct, the constraints between modules of different clusters are checked during the Inter Cluster treatment. The precedence is corrected by placing the cluster at the order end, thus the dependent cluster becomes the last one of the order. IV. EXPERIMENTAL SETUP The goal of the conducted experiment is to evaluate the solution of the Integration and Test Order problem in the presence of modularization restrictions and to answer some questions such as: How does the use of the modularization restrictions impact on the performance of the algorithms? and How are the usefulness and the applicability of the solutions obtained by the proposed strategy?. To the first research question we followed the GQM method [21] 1. In the case of the second question, a qualitative analysis was performed. The experiment was conducted using similar methodology and same systems of related work [3]. Two strategies were applied and compared: a strategy named here MC, which deals with modularization restrictions using clusters, according to the implementation described in the last section, and a strategy M applied according to [3] without using modularization restrictions. A. Used Systems The study was conducted with eight real systems; the same ones used in [3]. Table I presents some information about these systems, such as number of modules (classes for Java programs; classes and aspects for AspectJ), dependencies, LOC (Lines of Code) and clusters. TABLE I USED SYSTEMS System Language Modules Dependencies LOC Clusters BCEL Java JBoss Java JHotDraw Java MyBatis Java AJHotDraw AspectJ AJHSQLDB AspectJ HealthWatcher AspectJ Toll System AspectJ Due to lack of space, the GQM table is available at: B. Clusters Definition To define the clusters of modules, the separation of concerns principle [22] was followed. Considering this principle, the effort to develop a software, and consequently test it, became negligibly small. Following the separation of concerns the modules should be interconnected in a relatively simple manner presenting low coupling to other clusters. Hence, this procedure benefits the distributed development since decreases the interdependence between the teams. In this way, each system was divided into clusters according to the concerns that they realize. So each team should develop, integrate and test one cluster that deals with one concern present in the system. Aiming at confirming the interdependencies between the modules of the clusters, we check such division by constructing directed graphs and considering the inheritance and inter-type declarations dependencies, that ones that should not be broken. The number of clusters for each system is presented in the last column of Table I. C. Obtaining the Solutions Sets To analyze the result we will use different sets of solutions. These sets are found in different ways. 
Below, we describe how we obtained each solution set used:

- PF_approx: one set PF_approx for a system was obtained in each run of an algorithm. Each MOEA was executed 30 times for each system in order to observe the behavior of each algorithm when solving the problem. So, at the end, 30 sets PF_approx were obtained.
- PF_known: this set was obtained for each system through the union of the 30 sets PF_approx, removing dominated and repeated solutions. PF_known represents the best solutions found by each MOEA.
- PF_true: this set was obtained for each system through the union of the sets PF_known, removing dominated and repeated solutions. PF_true represents the best solutions known for the problem. This procedure to obtain the best solutions to a problem is recommended when the ideal set of solutions is not known [23].

D. Quality Indicators
To compare the results presented by the MOEAs, we used two quality indicators generally used in the MOEA literature: (i) Coverage and (ii) Euclidean Distance from an Ideal Solution (ED). The Coverage (C) [19] indicator calculates the proportion of solutions in a Pareto front that are dominated by the solutions of another front. The function C(PF_a, PF_b) maps the ordered pair (PF_a, PF_b) into the range [0,1] according to the proportion of solutions in PF_b that are dominated by PF_a. Similarly, we compute C(PF_b, PF_a) to obtain the proportion of solutions in PF_a that are dominated by PF_b. The value 0 for C indicates that the solutions of the former set do not dominate any element of the latter set; on the other hand, the value 1 indicates that all elements of the latter set are dominated by elements of the former set. The Euclidean Distance from an Ideal Solution (ED) is used to find the closest solutions to the best objectives. It is

51 based on Compromise Programming [24], a technique used to support decision maker when a set of good solutions is available. An Ideal Solution has the minimum value of each objective of P F true, considering a minimization problem. E. Parameters of the Algorithms The same methodology adopted in [2] [4] was adopted to configure the algorithms. The parameters are in Table II. The number of fitness evaluations was used as stop criterion for the algorithms, this allows comparing the solutions with similar computational cost. Moreover, they were executed in the same computer and the runtime was recorded. TABLE II MOEAS PARAMETERS Parameter NSGA-II PAES SPEA2 Strategy MC Population Size Fitness Evaluations Archive Size Crossover Rate 0,95-0,95 Inter Cluster Crossover rate 1,0-1,0 Intra Cluster Crossover rate 1,0-1,0 Mutation Rate 0,02 1 0,02 Inter Cluster Mutation Rate 0,5 0,5 0,5 Intra Cluster Mutation Rate 0,5 0,5 0,5 Strategy M Population Size Fitness Evaluations Archive Size Crossover Rate 0,95-0,95 Mutation Rate 0,02 1 0,02 F. Threats to Validity The main threats to our work are related to the evaluation of the proposed solution. In fact an ideal evaluation should consider similar strategies, and different kind of algorithms, including the traditional ones. However, we have not found a similar strategy in the literature. A random strategy could be used, however, this strategy is proven to present the worst results in the related literature, and the obtained results would be obvious. Besides, the traditional approaches, not based on evolutionary algorithms, are very difficult to adapt (some of them impossible) to consider the modularization restrictions and different cost measures. Hence, we think that addressing such restrictions with multi-objective and evolutionary approaches is more promising and practical. In addition to this, a comparison with a strategy that does not consider the restrictions can provide insights about the usage impact. Other threat is related to the clusters and systems used. An ideal scenario would be consider clusters used in real context of distributed development. To mitigate such threat we consider as a criterion to compose the clusters the separation of concerns, which we think it is implicitly considered in team allocations. Another threat is the number of systems used that is reduced and can influence in the generalization of the obtained results. To reduce this influence we selected object and aspect-oriented systems, with different sizes and complexities, given by the number of modules and dependencies. V. RESULTS AND ANALYSIS In this section the results are presented and evaluated to answer the research questions. The impact of using restrictions is analysed and the practical use of MC is addressed. A. On the impact of using modularization restrictions In this section the restrictions impact is analysed in two ways (subsections): (i) evaluating the MOEAs performance using MC, and (ii) comparing the strategies M and MC. At the end, a synthesis about the impact of modularization restrictions is presented. 1) Performance of the MOEAs using MC: The analysis conducted in this section allows evaluating the performance of the MOEAs when the modularization restrictions are considered. It is based on the quality indicators described previously. Table III presents the values of the indicator C for the sets P F known of each MOEA. The results show difference for five systems: BCEL, MyBatis, AJHotDraw, AJHSQLDB and Toll System. 
For BCEL, the NSGA-II solutions dominate 75% of PAES solutions and around 60% of the SPEA2 solutions. The SPEA2 solutions also dominate 75% of PAES solutions. For MyBatis, PAES solutions dominated 100% NSGA-II and SPEA2 solutions, and NSGA-II solutions dominated around 73% of SPEA2 solutions. For AJHotDraw, PAES was also better, but SPEA2 was better than NSGA-II. For AJHSQLDB, a similar behaviour was observed. For Toll System NSGA- II and SPEA2 solutions dominate 50% of PAES solutions. Hence, NSGA-II and SPEA2 presented the best results. TABLE III COVERAGES VALUES - STRATEGY MC System MOEA NSGA-II PAES SPEA2 BCEL JBoss JHotDraw MyBatis AJHotDraw AJHSQLDB Health Watcher Toll System NSGA-II - 0,75 0, PAES 0-0 SPEA2 0, ,75 - NSGA-II PAES 0-0 SPEA NSGA-II - 0, , PAES 0, ,45098 SPEA2 0, , NSGA-II - 0 0, PAES 1-1 SPEA2 0, NSGA-II - 0 0,16129 PAES 1-1 SPEA2 0, NSGA-II - 0 0, PAES 1-1 SPEA2 0, NSGA-II - 0, PAES 0-0 SPEA2 0 0, NSGA-II - 0,5 0 PAES 0-0 SPEA2 0 0,5 - Table IV contains the results obtained for indicator ED. The second column presents the cost of the ideal solutions. Such costs were obtained considering the lowest values of each objective from all solutions of the P F true of each system and independently from which solution they were achieved.
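For reference, a compact sketch (hypothetical code, not the authors' implementation) of how the ideal solution and the two indicators used in this comparison can be computed from a set of objective vectors:

```java
import java.util.List;

// Sketch of the ideal solution, Euclidean Distance (ED) and Coverage (C) indicators.
// Each solution is represented by its objective vector (A, O, R, P); minimization is assumed.
public final class Indicators {

    /** Ideal solution: the minimum value of each objective over all solutions of PFtrue. */
    static double[] idealSolution(List<double[]> pfTrue) {
        double[] ideal = pfTrue.get(0).clone();
        for (double[] s : pfTrue)
            for (int k = 0; k < ideal.length; k++) ideal[k] = Math.min(ideal[k], s[k]);
        return ideal;
    }

    /** Euclidean distance between a solution and the ideal solution. */
    static double ed(double[] solution, double[] ideal) {
        double sum = 0;
        for (int k = 0; k < solution.length; k++) {
            double d = solution[k] - ideal[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** True if solution a dominates b: no worse in all objectives, better in at least one. */
    static boolean dominates(double[] a, double[] b) {
        boolean better = false;
        for (int k = 0; k < a.length; k++) {
            if (a[k] > b[k]) return false;
            if (a[k] < b[k]) better = true;
        }
        return better;
    }

    /** Coverage C(PFa, PFb): proportion of solutions in PFb dominated by some solution of PFa. */
    static double coverage(List<double[]> pfA, List<double[]> pfB) {
        int dominated = 0;
        for (double[] b : pfB)
            for (double[] a : pfA)
                if (dominates(a, b)) { dominated++; break; }
        return (double) dominated / pfB.size();
    }
}
```

A lower ED indicates a solution closer to the ideal point, i.e., a better trade-off among the four measures.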

52 System Ideal Solution TABLE IV COST OF THE IDEAL SOLUTION AND LOWER ED FOUND - STRATEGY MC NSGA-II PAES SPEA2 Lowest ED Solution Cost Lowest ED Solution Cost Lowest ED Solution Cost BCEL (40,54,33,59) 24,5764 (57,59,50,60) 74,0000 (51,59,34,132) 23,4094 (45,63,52,68) JBoss (25,17,4,14) 2,0000 (25,17,6,14) 2,0000 (25,17,6,14) 2,0000 (25,17,6,14) JHotDraw (283,258,92,140) 63,2297 (301,274,105,197) 63,2297 (301,274,105,197) 63,2297 (301,274,105,197) MyBatis (259,148,57,145) 203,2855 (1709,204,81,191) 147,5263 (282,235,78,260) 221,4746 (386,267,97,276) AJHotDraw (190,100,40,62) 51,6817 (196,105,43,113) 49,1325 (197,106,45,110) 49,6488 (200,106,45,110) AJHSQLDB (3732,737,312,393) 526,5302 (4217,879,415,499) 167,2692 (3836,810,365,488) 403,7809 (4069,879,403,538) Health Watcher (115,149,49,52) 39,7869 (138,166,67,73) 39,7869 (138,166,67,73) 39,7869 (138,166,67,73) Toll System (68,41,18,16) 5,4772 (68,42,20,21) 5,4772 (68,42,20,21) 5,4772 (68,42,20,21) The other columns present the solution closest to the ideal solution and its cost in terms of each objective. For the systems JBoss, JHotDraw, Health Watcher and Toll System, all MOEAs found the same solution with the lowest value of ED. For BCEL, SPEA2 found the solution with the lowest ED. Finally, PAES obtained solutions with the lowest ED for MyBatis, AJHotDraw and AJHSQLDB. From the results of both indicators it is possible to see that, in the context of our study, PAES is the best MOEA, since it obtained the best results for six systems: JBoss, JHotDraw, MyBatis, AJHotDraw, AJHSQLDB and Health Watcher. Such systems have the greatest numbers of modules and clusters (Table I). NSGA-II is the second best MOEA, since it found the best results for five systems: BCEL, JBoss, JHotDraw, Health Watcher and Toll System. SPEA2 also obtained the best results for four systems: BCEL, JBoss, Health Watcher and Toll System. NSGA-II and SPEA2 have similar behavior, presenting satisfactory results for systems with few modules and few clusters (Table I). 2) Comparing the strategies M and MC: Aiming at analysing the impact of using restrictions, two pieces of information were collected for the strategies M and MC: number of obtained solutions and runtime. Such numbers are presented in Table V. The third and the sixth columns contain the cardinality of P F true. The fourth and the seventh columns present the mean quantity of solutions from the sets P F approx and the cardinality of P F known between parentheses. The fifth and eighth columns present the mean runtime (in seconds) used to obtain each P F approx and the standard deviation (between parentheses), respectively. Verifying the number of solutions of P F true, it can be noticed that for BCEL and MyBatis the number of solutions found by MC was lower than M. On the other hand, for JBoss and JHotDraw such number was greater in MC than in M. So, it can be observed that the systems with more solutions found by M have less solutions found by MC and vice-versa. In spite of the strategies M and MC involve the same effort related to the number of fitness evaluations, the runtime between them have great difference (Figure 6 and Table V). For all systems, NSGA-II, PAES and SPEA2 spent more runtime in strategy MC. The single exception was SPEA2 that spent less time with strategy MC for JHotDraw. From the three MOEAs, SPEA2 spent the greatest runtime. Such fact allows us to infer that in the presence of several restrictions in the search space the SPEA2 behavior may become random. 
Figure 7 presents the solutions in the objectives space. Due to graphics dimension limitation, only three measures were presented in the pictures. In the case of JHotDraw (Figure 7(a)), the solutions of M are closer to the minima objectives (A=0, O=0, R=0, P=0). These solutions are not feasible for the strategy MC due to the restrictions. They impose the MOEAs to find solutions in other places in the search space, where a greater number of solutions are feasible, but more expensive. MyBatis illustrates well this point. Figure 7(b) presents that the M solutions for MyBatis are in the same area, next to the minima objectives. The restrictions impose MOEAs to explore other areas in the search space, and in this case, a lower number of solutions is found. These solutions are more expensive. From the results, it is possible to state that the restrictions imply a more complex search, limiting the search space and imposing a greater stubbing cost. To better evaluate the impact on the cost of the solutions obtained by both strategies, we use the indicator ED. The solutions closest to the ideal solution are those ones that have the best trade-off among the objectives and are good candidates to be adopted by the tester. We compare the cost of the ideal solutions and the cost of the solutions obtained by a MOEA. In our comparison we chosen the PAES solutions, this algorithm presented the best performance, lower ED values for six systems. These costs are presented in Table VI. TABLE VI COST OF THE SOLUTIONS IN BOTH STRATEGIES System BCEL JBoss JHotDraw MyBatis AJHotDraw AJHSQLDB Health Watcher Toll System M MC Ideal PAES Ideal PAES Solution Solution Solution Solution (45,24, (64,39, (40,54, (51,59, 0,96) 15,111) 33,59) 34,132) (10,6, (10,6, (25,17, (25,17, 2,9) 2,9) 4,14) 6,14) (27,10, (30,12, (283,258, (301,274, 1,12) 1,18) 92,140) 105,197) (203,70, (265,172, (259,148, (282,235, 13,47) 49,184) 57,145) 78,260) (39,12, (46,19, (190,100, (197,106, 0,18) 1,34) 40,62) 45,110) ) (1263,203, (1314,316, (3732,737, (3836,810, 91,138) 138,236) 312,393) 365,488) (9,2, (9,2, (115,149, (138,166, 0,1) 0,1) 49,52) 67,73) (0,0, (0,0, (68,41, (68,42, 0,0) 0,0) 18,16) 20,21) We can observe that, except for BCEL, the cost of the MC solutions are notably greater than the M solutions cost. In most cases the MC cost is two or three times greater, depending on

53 JBoss TABLE V NUMBER OF SOLUTIONS AND RUNTIME System BCEL JBoss JHotDraw MyBatis AJHotDraw AJHSQLDB Health Watcher Toll System MOEA M MC # P F true Number of Solutions Execution Time # P F true Number of Solutions Execution Time NSGA-II 37,43 (37) 5,91 (0,05) 7,57 (11) 8,61 (0,11) PAES 37 39,30 (37) 6,58 (1,25) 15 3,40 (8) 29,89 (22,25) SPEA2 36,70 (37) 123,07 (18,84) 8,53 (19) 3786,79 (476,23) NSGA-II 1,00 (1) 18,73 (0,20) 1,97 (2) 42,50 (0,47) PAES 1 1,13 (1) 10,69 (0,62) 2 2,87 (2) 56,15 (12,50) SPEA2 1,00 (1) 2455,35 (612,18) 2,17 (2) 3536,01 (335,97) NSGA-II 8,40 (10) 29,85 (0,34) 45,80 (110) 71,90 (0,45) PAES 11 10,47 (19) 24,29 (1,50) ,47 (143) 51,18 (2,82) SPEA2 9,63 (9) 922,99 (373,98) 49,17 (102) 532,83 (81,93) NSGA-II 276,37 (941) 74,03 (0,87) 72,60 (103) 189,91 (0,83) PAES ,60 (679) 104,30 (7,91) ,43 (200) 132,37 (3,91) SPEA2 248,77 (690) 128,88 (2,65) 64,33 (144) 517,52 (67,52) NSGA-II 70,03 (79) 75,05 (0,57) 16,30 (36) 194,34 (0,83) PAES 94 40,73 (84) 62,07 (2,16) 31 26,57 (31) 115,12 (2,82) SPEA2 68,87 (78) 195,56 (28,22) 17,53 (31) 1005,36 (268,37) NSGA-II 156,63 (360) 62,34 (0,53) 62,07 (196) 160,38 (1,64) PAES ,97 (266) 75,62 (5,27) ,57 (240) 122,01 (4,92) SPEA2 119,10 (52) 104,29 (0,68) 58,30 (170) 505,11 (101,90) NSGA-II PAES 1 1,00 (1) 1,07 (1) 12,72 (0,15) 8,27 (0,58) 11 10,70 (11) 7,47 (12) 27,52 (0,10) 46,98 (5,34) SPEA2 1,00 (1) 2580,39 (596,29) 10,20 (11) 990,19 (95,94) NSGA-II 1,00 (1) 7,33 (0,09) 4,27 (4) 13,23 (0,09) PAES 1 1,07 (1) 4,10 (0,75) 4 3,50 (4) 31,13 (16,23) SPEA2 1,00 (1) 3516,71 (570,76) 4,00 (4) 2229,26 (271,47) Runtime (s) M MC Runtime (s) M MC Runtime (s) M MC 0 BCEL AJHotDraw Health Watcher AJHSQLDB 0 Toll System BCEL AJHotDraw Health Watcher AJHSQLDB 0 Toll System MyBatis BCEL JBoss JBoss JHotDraw JHotDraw JHotDraw AJHotDraw Health Watcher AJHSQLDB Toll System MyBatis MyBatis System (a) NSGA-II System (b) PAES Fig. 6. Runtime System (c) SPEA2 P M MC P M MC R (a) JHotDraw A R (b) MyBatis A 3000 Fig. 7. P F true with and without Modularization Restrictions the measure. The greatest difference was obtained by programs Health Watcher, Toll System and JHotDraw. In the two first cases optimal solutions were found by all the algorithms with the strategy M. These solutions are not feasible when the restrictions are considered. 3) Summarizing impact results: Based on the results, it is clear that the modularization restrictions increase the integration testing costs. Hence, the strategy MC can also be used in the modularization task as a simulation and decision supporting tool. For example, in a distributed software development,

54 the strategy MC can be used to allocate the modules to the different teams to ensure lower testing costs. Furthermore, all the implemented algorithms can be used and present good results, solving efficiently the problem. However we observe that for most complex systems PAES is the best choice. B. Practical Use of MC This section evaluates through an example the usefulness and applicability of the proposed strategy. We presented in the last section that the strategy MC implies in greater costs than M. However the automatic determination of such orders in the presence of restrictions is fundamental. When we consider the restrictions, a huge additional effort is necessary. The usefulness of the proposed strategy relies on the infeasibility of manually obtaining a satisfactory solution for the problem. To illustrate this, consider the smallest system used in the experiment BCEL, with 3 clusters and 45 modules. For it, there is a number of 1.22E+47 possibilities of different permutations among the clusters and modules inside clusters to be analysed. For the other systems such effort is even higher. Since the task of determining a test order is delegated to some MOEA, the tester only needs to concentrate his/her effort on choosing an order achieved by the algorithm, as it is explained in the example of how to use the proposed strategy presented next. 1) Example of Practical Use of MC: Table VII presents some solutions from the set of non-dominated solutions achieved by PAES for JHotDraw. The first column presents the cost of the solutions (metrics A,O,R,P) and the second column presents the order of modules in the cluster. JHotDraw is the fourth largest system (197 modules) and the third largest system considering the clusters (13 clusters). For this system PAES found 143 solutions. Therefore, it is necessary that the software engineer chooses which of these orders will be used. To demonstrate how each solution should be analysed, we use the first solution from the table, the solution cost is (A=283,O=292,R=102,P=206). The order shown in the second column; {87, 9, 196,...},..., {..., 120, 194, 141}; indicates the sequence in which the modules must be developed, integrated and tested. Using this order, to perform integration testing of the system will be needed the construction of stubs to simulate 283 attributes; 292 operations, that may be class methods, aspect methods or aspect advices; different 102 types of return and 206 distinct parameter types. To choose among the solutions presented in Table VII, it could be used the rule concerning to the lowest cost for a given measure. The lowest cost is highlighted in bold, therefore, the first solution has the lowest cost for the measure A, the second solution has the lowest cost measure for O, and so on. The fifth solution provides the best balance of cost among the four measures and was selected based on the indicator ED (Table IV). So, if the system under development presents complex attributes to construct, then the first solution should be used, or if the system presents parameters of the operations that are difficult to be simulated, the fourth solution should be used. However, if the software tester choose to prioritize all of the measures the third solution is the best option since it is closer to the minimum cost for all of the measures. 
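The selection rules just described can be automated. Below is a small sketch (hypothetical names, reusing the Indicators class sketched earlier) of picking an order from the non-dominated set either by the lowest value of one prioritized measure or by the lowest ED:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of selecting one solution from the non-dominated set, either by prioritizing a
// single measure (0=A, 1=O, 2=R, 3=P) or by the lowest ED to the ideal solution.
public final class OrderSelection {

    /** Returns the solution with the lowest value for the prioritized measure. */
    static double[] selectByMeasure(List<double[]> front, int measureIndex) {
        return front.stream()
                    .min(Comparator.comparingDouble((double[] s) -> s[measureIndex]))
                    .orElseThrow();
    }

    /** Returns the solution with the best balance among all measures (lowest ED). */
    static double[] selectByEd(List<double[]> front, double[] ideal) {
        return front.stream()
                    .min(Comparator.comparingDouble((double[] s) -> Indicators.ed(s, ideal)))
                    .orElseThrow();
    }
}
```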
This diversity of solutions with different trade-offs among the measures is one of the great advantages of using multiobjective optimization, easing the selection of an order of modules that meets the needs of the tester. VI. CONCLUDING REMARKS This work described a strategy to solve the Integration and Test Order problem in the presence of modularization restrictions. The strategy is based on multi-objective and evolutionary optimization algorithms and generates orders considering that some modules are grouped and need to be developed and tested together, due, for instance, to a distributed development. Specific evolutionary operators were proposed to allow mutation and crossover inside a cluster of modules. To evaluate the impact of such restrictions the strategy, named MC, was applied using three different multi-objective evolutionary algorithms and eight real systems. During the evaluation the results obtained from the application of MC were compared with another strategy without restrictions. With respect to our first research question, all the MOEAs achieved similar results, so they are able to satisfactorily solve the referred problem. The results point out that the modularization restrictions impact significantly on the optimization process. The search becomes more complex since the restrictions limit the search space, then the stubbing cost increases. Therefore, as the modularization restrictions impact on the costs, the proposed strategy can be used as a decision supporting tool during the cluster composition task, helping, for example, in the allocation of modules to the different teams in a distributed development, aiming at minimizing integration testing costs. Regarding to the second question, the usefulness of the strategy MC is supported by the difficulty of manually obtaining solutions with satisfactory trade-off among the objectives. The application of MC provides the tester a set of solutions allowing he/she to prioritize some coupling measures, reducing testing efforts and costs. MOCAITO adopts only coupling measures, despite such measures are the most used in the literature, we are aware that other factors can impact on the integration testing cost. So, such limitation could be eliminated by using other measures during the optimization process. This should be evaluated in future experiments. Other future work we intend to perform is to conduct experiments involving systems with a greater number of clusters and dependencies as well as using other algorithms. In further experiments, we also intend to use a way to group the modules of a system, such as a clustering algorithm. As MOCAITO is a generic approach, it is possible to explore other development contexts and kind of restrictions besides modularization. ACKNOWLEDGMENTS We would like to thank CNPq for financial support.

55 Solution Cost (283,292,102,206) (322,258,103,192) (2918,326,92,201) (3423,313,103,140) (301,274,105,197) TABLE VII SOME SOLUTIONS OF PAES FOR THE SYSTEM JHOTDRAW Order {87, 9, 196, 187, 67, 185}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {84, 85, 58, 0, 79, 66, 98, 7, 111}, {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 179, 83, 122, 16, 129, 182, 27, 45, 176, 159, 5, 31, 52, 156, 165, 166, 32, 4, 150, 192, 54, 110, 137, 1, 8, 151, 113, 65, 95, 135, 132, 130}, {128, 181, 147, 125, 169, 124, 61}, {138, 108, 15, 33, 29, 154, 153}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {10, 90, 195, 78, 88, 183, 81}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 112, 168, 142, 127, 149, 92, 143, 193, 162, 80}, {131, 13, 120, 194, 141} {87, 9, 196, 187, 67, 185}, {84, 85, 0, 58, 66, 98, 7, 111, 79}, {138, 108, 15, 33, 29, 154, 153}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 16, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 132, 110, 179, 83, 122, 32, 129, 182, 27, 45, 176, 159, 5, 31, 52, 156, 165, 166, 135, 4, 150, 192,54, 137, 1, 8, 151, 113, 130, 65, 95}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 112, 168, 142, 127, 149, 92, 143, 193, 162, 80}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {128, 181, 147, 125, 169, 124, 61}, {131, 13, 120, 194, 141}, {10, 90, 195, 78, 88, 183, 81} {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 179, 83, 122, 16, 129, 182, 27, 45, 176, 159, 5, 31, 52, 156, 165, 166, 32, 4, 150, 192, 54, 110, 137, 1, 8, 151, 113, 65, 95, 135, 132, 130}, {138, 108, 15, 33, 29, 154, 153}, {87, 9, 196, 187, 67, 185}, {84, 85, 58, 0, 79, 66, 98, 7, 111}, {128, 181, 147, 125, 169, 124, 61}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {10, 90, 195, 78, 88, 183, 81}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 143, 168, 142, 127, 149, 92, 112, 193, 162, 80}, {131, 13, 120, 194, 141} {138, 108, 15, 29, 154, 33, 153}, {56, 6, 42, 39, 139, 43, 40, 172, 44, 41}, {96, 103, 114, 123, 180, 101, 165, 12, 97, 47, 34, 36, 146, 140, 107, 77, 32, 45, 53, 75, 57, 145, 83, 11, 16, 156, 3, 95, 73, 152, 158, 192, 4, 28, 113, 144, 166, 110, 137, 27, 5, 159, 52, 62, 54, 2, 182, 179, 122, 31, 129, 150, 135, 132, 130, 1, 8, 65, 151, 176}, {187, 87, 9, 67, 196, 185}, {84, 85, 0, 58, 66, 7, 98, 79, 111}, {128, 181, 124, 147, 169, 61, 125}, {17, 19, 37, 168, 127, 161, 20, 163, 106, 22, 63, 112, 142, 143, 149, 193, 92, 80, 162}, {10, 88, 195, 78, 183, 81, 90}, {136, 171, 94, 60}, {133, 51, 14, 21, 148, 109}, {134, 157, 24, 25, 102, 191, 89, 26, 115, 173, 46, 104, 49, 30, 91, 
100, 170, 82, 116, 164, 105, 121, 68, 93, 38, 99, 190, 117, 50, 48, 69, 86, 189, 155, 118, 186, 188, 23}, {131, 141, 13, 194, 120}, {184, 119, 76, 160, 18, 35, 71, 178, 70, 55, 64, 59, 74, 175, 167, 177, 174, 72, 126} {138, 29, 108, 15, 33, 154, 153}, {187, 87, 9, 196, 185, 67}, {24, 116, 93, 82, 25, 190, 49, 91, 30, 99, 170, 104, 105, 26, 115, 173, 191, 164, 121, 86, 50, 189, 46, 69, 186, 134, 38, 48, 155, 102, 100, 188, 117, 89, 23, 118, 157, 68}, {101, 96, 114, 12, 53, 180, 123, 62, 182, 16, 156, 140, 107, 103, 145, 45, 75, 144, 34, 36, 146, 11, 97, 3, 152, 158, 95, 73, 2, 122, 179, 176, 47, 28, 27, 31, 5, 165, 54, 77, 4, 57, 159, 113, 150, 52, 129, 166, 192, 83, 110, 32, 137, 135, 1, 8, 151, 65, 132, 130}, {128, 147, 124, 125, 169, 181, 61}, {131, 13, 141, 120, 194}, {58, 84, 66, 0, 7, 98, 111, 85, 79}, {10, 81, 78, 88, 195, 90, 183}, {133, 51, 14, 148, 109, 21}, {56, 6, 139, 42, 40, 39, 44, 43, 172, 41}, {136, 60, 94, 171}, {184, 76, 160, 119, 18, 64, 55, 74, 59, 35, 70, 126, 178, 175, 177, 71, 174, 167, 72}, {17, 161, 163, 20, 63, 142, 168, 19, 149, 106, 112, 193, 22, 143, 37, 127, 92, 162, 80} REFERENCES [1] Z. Wang, B. Li, L. Wang, and Q. Li, A brief survey on automatic integration test order generation, in Software Engineering and Knowledge Engineering Conference (SEKE), 2011, pp [2] W. K. G. Assunção, T. E. Colanzi, A. T. R. Pozo, and S. R. Vergilio, Establishing integration test orders of classes with several coupling measures, in 13th Genetic and Evolutionary Computation Conference (GECCO), 2011, pp [3] W. K. G. Assunção, T. E. Colanzi, S. R. Vergilio, and A. T. R. Pozo, A multi-objective optimization approach for the integration and test order problem, Information Sciences, 2012, submitted. [4] T. E. Colanzi, W. K. G. Assunção, A. T. R. Pozo, and S. R. Vergilio, Integration testing of classes and aspects with a multi-evolutioanry and coupling-based approach, in 3th International Symposium on Search Based Software Engineering (SSBSE). Springer Verlag, 2011, pp [5] S. Vergilio, A. Pozo, J. Árias, R. Cabral, and T. Nobre, Multiobjective optimization algorithms applied to the class integration and test order problem, International Journal on Software Tools for Technology Transfer, vol. 14, no. 4, pp , [6] T. E. Colanzi, W. K. G. Assunção, S. R. Vergilio, and A. T. R. Pozo, Generating integration test orders for aspect oriented software with multi-objective algorithms, in Proceedings of the Latin-American Workshop on Aspect Oriented Software (LA-WASP), [7] W. Assunção, T. Colanzi, S. Vergilio, and A. Pozo, Evaluating different strategies for integration testing of aspect-oriented programs, in Proceedings of the Latin-American Workshop on Aspect Oriented Software (LA-WASP), [8] R. Ré and P. C. Masiero, Integration testing of aspect-oriented programs: a characterization study to evaluate how to minimize the number of stubs, in Brazilian Symposium on Software Engineering (SBES), 2007, pp [9] E. Carmel and R. Agarwal, Tactical approaches for alleviating distance in global software development, Software, IEEE, vol. 18, no. 2, pp , mar/apr [10] J. Noll, S. Beecham, and I. Richardson, Global software development and collaboration: barriers and solutions, ACM Inroads, vol. 1, no. 3, pp , Sep [11] D. C. Kung, J. Gao, P. Hsia, J. Lin, and Y. Toyoshima, Class firewall, test order and regression testing of object-oriented programs, Journal of Object-Oriented Program, vol. 8, no. 2, pp , [12] R. Ré, O. A. L. Lemos, and P. C. 
Masiero, Minimizing stub creation during integration test of aspect-oriented programs, in 3rd Workshop on Testing Aspect-Oriented Programs (WTAOP), Vancouver, British Columbia, Canada, 2007, pp [13] L. C. Briand, Y. Labiche, and Y. Wang, An investigation of graph-based class integration test order strategies, IEEE Transactions on Software Engineering, vol. 29, no. 7, pp , [14] Y. L. Traon, T. Jéron, J.-M. Jézéquel, and P. Morel, Efficient objectoriented integration and regression testing, IEEE Transactions on Reliability, pp , [15] A. Abdurazik and J. Offutt, Coupling-based class integration and test order, in International Workshop on Automation of Software Test (AST). Shanghai, China: ACM, [16] L. C. Briand, J. Feng, and Y. Labiche, Using genetic algorithms and coupling measures to devise optimal integration test orders, in Software Engineering and Knowledge Engineering Conference (SEKE), [17] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp , [18] E. Zitzler, M. Laumanns, and L. Thiele, SPEA2: Improving the Strength Pareto Evolutionary Algorithm, Swiss Federal Institute of Technology (ETH) Zurich, Gloriastrasse 35, CH-8092 Zurich, Switzerland, Tech. Rep. 103, [19] J. D. Knowles and D. W. Corne, Approximating the nondominated front using the pareto archived evolution strategy, Evolutionary Computation, vol. 8, pp , [20] C. A. C. Coello, G. B. Lamont, and D. A. van Veldhuizen, Evolutionary Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation). Secaucus, NJ, USA: Springer-Verlag New York, Inc., [21] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Norwell, MA, USA, [22] R. S. Pressman, Software Engineering : A Practitioner s Approach. NY: McGraw Hill, [23] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. da Fonseca, Performance assessment of multiobjective optimizers: An analysis and review, IEEE Transactions on Evolutionary Computation, vol. 7, pp , [24] J. L. Cochrane and M. Zeleny, Multiple Criteria Decision Making. University of South Carolina Press, Columbia, 1973.

56 Functional Validation Driven by Automated Tests Validação Funcional Dirigida por Testes Automatizados Thiago Delgado Pinto Departamento de Informática Centro Federal de Educação Tecnológica, CEFET/RJ Nova Friburgo, Brasil Arndt von Staa Departamento de Informática Pontifícia Universidade Católica, PUC-Rio Rio de Janeiro, Brasil Resumo A qualidade funcional de um software pode ser avaliada por quão bem ele atende aos seus requisitos funcionais. Estes requisitos são muitas vezes descritos por intermédio de casos de uso e verificados por testes funcionais que checam sua correspondência com as funcionalidades observadas pela interface com o usuário. Porém, a criação, a manutenção e a execução destes testes são trabalhosas e caras, enfatizando a necessidade de ferramentas que as apoiem e realizem esta forma de controle de qualidade. Neste contexto, o presente artigo apresenta uma abordagem totalmente automatizada para a geração, execução e análise de testes funcionais, a partir da descrição textual de casos de uso. A ferramenta construída para comprovar sua viabilidade, chamada de FunTester, é capaz de gerar casos de teste valorados junto com os correspondentes oráculos, transformá-los em código-fonte, executá-los, coletar os resultados e analisar se o software está de acordo com os requisitos funcionais definidos. Avaliações preliminares demonstraram que a ferramenta é capaz de eficazmente detectar desvios de implementação e descobrir defeitos no software sob teste. Abstract The functional quality of any software system can be evaluated by how well it conforms to its functional requirements. These requirements are often described as use cases and are verified by functional tests that check whether the system under test (SUT) runs as specified. There is a need for software tools to make these tests less laborious and more economical to create, maintain and execute. This paper presents a fully automated process for the generation, execution, and analysis of functional tests based on use cases within software systems. A software tool called FunTester has been created to perform this process and detect any discrepancies from the SUT. Also while performing this process it generates conditions to cause failures which can be analyzed and fixed. Keywords functional validation; automated functional tests; use cases; business rules; test data generation; test oracle generation; test case generation and execution; I. INTRODUÇÃO A fase de testes é sabidamente uma das mais caras da construção de um software, correspondendo a 35 a 50% de seu custo total quando feito da forma tradicional [1] e de 15 a 25% quando desenvolvido com uso de técnicas formais leves [2]. Quando feita de forma manual, a atividade de teste se torna ineficiente e tediosa [3], usualmente apoiada em práticas ad hoc e dependente da habilidade de seus criadores. Assim, torna-se valioso o uso de ferramentas que possam automatizar esta atividade, diminuindo os custos envolvidos e aumentando as chances de se entregar um software com menor quantidade de defeitos remanescentes. Em geral, é entendido que um software de qualidade atende exatamente aos requisitos definidos em sua especificação [4]. Para verificar este atendimento, geralmente são realizados testes funcionais que observam a interface (gráfica) do software visando determinar se este realmente executa tal como especificado. Evidentemente supõe-se que os requisitos estejam de acordo com as necessidades e expectativas dos usuários. 
Como isso nem sempre é verdade, torna-se necessária a possibilidade de redefinir a baixo custo os testes a serem realizados. Para simular a execução destes testes, é possível imitar a operação de um usuário sobre a interface, entrando com ações e dados, e verificar se o software se comporta da maneira especificada. Esta simulação pode ser realizada via código, com a combinação de arcabouços de teste unitário e arcabouços de teste de interface com o usuário. Entretanto, para gerar o código de teste de forma automática, é preciso que a especificação do software seja descrita com mais formalidade e de maneira estruturada ou, pelo menos, semiestruturada. Como casos de uso são largamente utilizados para documentar requisitos de um software, torna-se interessante adaptar sua descrição textual para este fim. A descrição textual de casos de uso, num estilo similar ao usado por Cockburn [5], pode ser descrita numa linguagem restrita e semiestruturada, como o adotado por Días, Losavio, Matteo e Pastor [6] para a língua espanhola. Esta formalidade reduz o número de variações na interpretação da descrição, facilitando a sua transformação em testes. Trabalhos como [7, 8, 9, 10, 11, 12], construíram soluções para apoiar processos automatizados ou semiautomatizados para a geração dos casos de teste. Entretanto, alguns aspectos importantes não foram abordados, deixando de lado, por exemplo, a geração dos valores utilizados nos testes, a geração dos oráculos e a combinação de cenários entre múltiplos casos de uso, que são essenciais para sua efetiva aplicação prática.

57 TABELA I. PANORAMA SOBRE AS FERRAMENTAS # Questão [9] [11] [12] [7] [10] Fun Tester 1 Usa somente sim sim sim sim sim sim 1 casos de uso como fonte para os testes? 2 Qual a forma de PRS IRS IRS IRS UCML VRS documentação dos casos de uso? 3 Controla a sim não não não não Sim declaração de casos de uso? 4 Dispensa a não não não não não Sim declaração de fluxos alternativos que avaliam regras de negócio? 5 Gera cenários sim sim sim sim sim Sim automaticamente? 6 Há um cenário sim sim sim sim sim sim cobrindo cada fluxo? 7 Há cenários que não sim não não não sim verifiquem regras de negócio para um mesmo fluxo? 8 Há cenários que não sim sim sim sim sim combinam fluxos? 9 Há cenários que não não não não não sim incluem mais de um caso de uso? 10 Há métricas para não não não sim sim sim cobertura dos cenários? 11 Gera casos de não sim não não sim sim teste semânticos? 12 Gera valores para não não não não não sim os casos de teste automaticamente? 13 Gera oráculos não não não não não sim automaticamente? 14 Casos de teste são não sim não não sim sim gerados para um formato independente de linguagem ou framework? 15 Gera código de sim não sim sim não sim teste? 16 Os resultados da execução do código gerado são rastreados? sim N/A não não N/A sim a. N/A=Não se aplica; PRS=Português Restrito Semiestruturado; IRS=Inglês Restrito Semiestruturado; UCML=Use Case Markup Language; VRS=Vocabulário Restrito Semiestruturado independente de idioma. No processo de geração de casos de teste automatizados criam-se primeiro os casos de teste abstratos. Estes determinam as condições que cada caso de teste deve satisfazer (por exemplo, os caminhos a serem percorridos). A partir deles determinam-se os casos de teste semânticos, isto é, casos de teste independentes de arcabouço de testes. Estes levam em conta as condições que os dados de entrada devem satisfazer de modo que os testes abstratos sejam realizados. A seguir selecionam-se os valores dos dados de entrada, gerando os casos de teste valorados. Aplicando a especificação aos casos de teste valorados determinam-se os oráculos, obtendo-se assim os casos de teste úteis. Estes, finalmente são traduzidos para scripts ou código a ser usado por ferramentas ou arcabouços de teste automatizado. Este artigo descreve um processo totalmente automatizado que trata muitos dos problemas não resolvidos por trabalhos anteriores (como a geração automática de dados de teste, oráculos e cenários que combinam mais de um caso de uso) e introduz novas abordagens para aumentar sua aplicação prática e realizar uma validação funcional de alta eficácia. As próximas seções são organizadas da seguinte forma: A Seção II apresenta trabalhos correlatos. A Seção III detalha o novo processo definido. A Seção IV expõe brevemente a arquitetura da solução. A Seção V retrata uma avaliação preliminar da ferramenta. Por fim, a Seção VI apresenta as conclusões do trabalho. II. TRABALHOS CORRELATOS Esta seção realiza uma avaliação de alguns trabalhos correlatos, com foco na descrição textual de casos de uso como principal fonte para a geração dos testes. A Tabela I apresenta um panorama sobre os trabalhos que construíram ferramentas para este propósito, incluindo a ferramenta que materializa a abordagem discutida neste artigo, chamada de FunTester (acrônimo para Funcional Tester). Nela, é possível observar que FunTester apresenta uma solução mais completa, implementando avanços que permitem sua aplicação prática em projetos reais. III. 
III. PROCESSO

A Figura 1 apresenta o processo realizado na abordagem apresentada e seguido pela ferramenta construída.

Fig. 1. Processo seguido pela ferramenta

Neste processo, o usuário participa apenas das etapas de descrição dos casos de uso e de suas regras de negócio, sendo as demais totalmente automatizadas. As etapas 1, 2, 3, 4, 5 e 9 são realizadas pela ferramenta em si, enquanto as etapas 6, 7 e 8 são realizadas por extensões da ferramenta, para a linguagem e arcabouço de testes alvo. A seguir, será realizada uma descrição de cada uma delas.

A. Descrição textual de casos de uso (Etapa 1)

Nesta etapa, o usuário realiza a especificação do software através de casos de uso, auxiliado pela ferramenta. A descrição textual segue um modelo similar ao de Cockburn [5]. A ferramenta usa somente alguns dos campos desta descrição para a geração de testes. As pré-condições e pós-condições são usadas para estabelecer as dependências entre casos de uso, numa espécie de máquina de estados. Dos fluxos (disparador, principal e alternativos) são obtidos os cenários de execução. De seus passos são extraídas as ações executadas pelo ator e pelo sistema, que, junto a outras informações do caso de uso (como a indicação se ele pode ser disparado somente através de outro caso de uso), são usadas para a geração dos testes úteis, na etapa 5.

A ferramenta permite definir um vocabulário composto pelos termos esperados por sua extensão para a transformação dos testes úteis em código-fonte e pelos termos correspondentes, usados na descrição textual dos casos de uso. Isto permite tanto documentar o software usando palavras ou até idiomas diferentes do vocabulário usado para geração dos testes quanto adaptar esse último para diferentes arcabouços de teste. A sintaxe de um passo, numa gramática livre de contexto (GLC) similar à Backus-Naur Form (BNF), é descrita a seguir:

<passo>        ::= <disparador> <ação> <alvo>+ | <disparador> <documentação>
<disparador>   ::= "ator" | "sistema"
<alvo>         ::= <elemento> | <caso-de-uso>
<elemento>     ::= <widget> | <URL> | <comando> | <tecla> | <tempo>
<ação>         ::= string
<documentação> ::= string
<caso-de-uso>  ::= string
<widget>       ::= string
<URL>          ::= string
<comando>      ::= string
<tecla>        ::= string
<tempo>        ::= integer

O ator ou o sistema dispara uma ação sobre um ou mais alvos ou sobre uma documentação. Cada alvo pode ser um elemento ou um caso de uso. Um elemento pode ser um widget, uma URL, um comando, uma tecla ou um tempo (em milissegundos, geralmente usado para aguardar um processamento). O tipo de alvo e o número de alvos possíveis para uma ação podem variar conforme a configuração do vocabulário usado. O tipo de elemento também pode variar conforme a ação escolhida.

B. Detalhamento das regras de negócio (Etapa 2)

Esta etapa é um dos importantes diferenciais da ferramenta e permite que o usuário detalhe os elementos descritos ao preencher os passos dos fluxos. Este detalhamento possibilita saber o que representa cada elemento, extrair as informações necessárias para sua conversão em widgets, inferir seus possíveis valores e formatos e extrair as informações necessárias para a geração de oráculos. A introdução das regras de negócio também permite reduzir o número de fluxos alternativos necessários para tratar erros de uso, uma vez que a ferramenta gerará automaticamente casos de teste para isto. Isto permite reduzir consideravelmente o número de caminhos no uso do software (introduzidos pela combinação dos fluxos), diminuindo o número de cenários, casos de teste e, consequentemente, o tempo de execução dos testes. A sintaxe definida para as regras de negócio permite determinar regras tanto para dados contínuos quanto para dados discretos.
O detalhamento de um elemento e de suas regras é exposto a seguir (em GLC):

<elemento>     ::= <nome> <tipo> <nome-interno> | <nome> <tipo> <nome-interno> <regra>+
<tipo>         ::= "widget" | "url" | "comando" | "teclas" | "tempo"
<nome>         ::= string
<nome-interno> ::= string
<regra>        ::= <tipo-dado> <espec-valor>+
<tipo-dado>    ::= "string" | "integer" | "double" | "date" | "time" | "datetime"
<espec-valor>  ::= <tipo-espec> <mensagem>
<tipo-espec>   ::= "valor-min" <ref-valor> | "valor-max" <ref-valor> | "comprimento-min" <ref-valor> | "comprimento-max" <ref-valor> | "formato" <ref-valor> | "igual-a" <ref-valor>+ | "diferente-de" <ref-valor>+
<mensagem>     ::= string
<ref-valor>    ::= <valor>+ | <elemento>

Um elemento, que seria equivalente ao widget da descrição do passo, possui um nome (que é o exibido para o usuário na documentação), um tipo (ex.: uma janela, um botão, uma caixa de texto, etc.) e um nome interno (que é usado internamente para identificar o widget no SST). Se for definido como editável (isto é, se recebe entradas de dados do usuário), pode conter uma ou mais regras de negócio. Cada regra define o tipo de dado admitido pelo elemento e uma ou mais especificações de valor, que permitem definir valores limítrofes, formatos, listas de valores admissíveis ou não admissíveis, além (opcionalmente) da mensagem esperada do SST caso alguma destas definições seja violada (ex.: valor acima do limítrofe).

Cada especificação de valor pode ser proveniente de definição manual, de definição a partir de outro elemento ou de consulta parametrizável a um banco de dados (obtendo seus parâmetros de outra definição, se preciso), fornecendo flexibilidade para a construção das regras. A definição de regras com valores obtidos através de consulta a banco de dados permite utilizar dados criados com o propósito de teste. Estes dados podem conter valores idênticos aos esperados pelo sistema, permitindo simular condições de uso real, o que é desejável em ferramentas de teste.

C. Geração de cenários para cada caso de uso (Etapa 3)

Nesta etapa, a ferramenta combina os fluxos de cada caso de uso, gerando cenários. Cada cenário parte do fluxo

principal, possivelmente passando por fluxos alternativos, retornando ao fluxo principal ou caindo em recursão (repetindo a passagem por um ou mais fluxos). Como os casos de recursão potencializam o número de combinações entre fluxos, o número de recursões deve ser mantido baixo, para não inviabilizar a aplicação prática da geração de cenários. Para isso, a ferramenta permite parametrizar o número máximo de recursões, limitando a quantidade de cenários gerados.

A geração de cenários realizada cobre a passagem por todos os fluxos do caso de uso, bem como a combinação entre todos eles, pelo menos uma vez. Em cada fluxo, todos os passos são cobertos. Isto garante observar defeitos relacionados à passagem por determinados fluxos ou ocasionados por sua combinação. Segundo o critério de cálculo de cobertura de caso de uso adotado por Hassan e Yousif [13], que divide o total de passos cobertos pelo teste pelo total de passos do caso de uso, a cobertura atingida é de 100%.

D. Combinação de cenários entre casos de uso (Etapa 4)

Esta etapa realiza a combinação entre cenários, levando em conta os estados definidos nas pré-condições e pós-condições, bem como as chamadas a casos de uso, que podem ocorrer em passos de certos fluxos. Quando uma pré-condição referencia uma pós-condição gerada por outro caso de uso, é estabelecida uma relação de dependência de estado. Logo, os cenários do caso de uso do qual se depende devem ser gerados primeiro, para então realizar a combinação. O mesmo ocorre quando existe uma chamada para outro caso de uso. Para representar a rede de dependências entre os casos de uso do SST e obter a ordem correta para a geração dos cenários, é gerado um grafo acíclico dirigido dos casos de uso e então aplicada uma ordenação topológica [14].

Antes de combinar dois cenários, entretanto, é preciso verificar se os fluxos de um geram os estados esperados pelo outro. Caso não gerem, os cenários não são combinados, uma vez que a combinação poderá gerar um novo cenário incompleto ou incorreto, podendo impedir que a execução alcance o caso de uso alvo do teste. Assim, na versão atual da ferramenta, para combinar dois casos de uso A e B, onde A depende de B, são selecionados de B somente os cenários que terminem com sucesso, isto é, que não impeçam a execução do caso de uso A conforme previsto. Para garantir a correta combinação dos cenários de um caso de uso, sem que se gerem cenários incompletos ou incorretos, realiza-se primeiro a combinação com os cenários de casos de uso chamados em passos; depois com os cenários de fluxos disparadores do próprio caso de uso; e só então com cenários de casos de uso de pré-condições.

Dados dois casos de uso quaisquer A e B do conjunto de casos de uso do software, sejam N(A) o número de cenários do caso de uso A, N(B) o número de cenários do caso de uso B e S(B) o número de cenários de sucesso de B. Se A depende de B, então a cobertura dos cenários de A, C(A), pode ser calculada como:

C(A) = (S(B) × N(A)) / (N(B) × N(A)) = S(B) / N(B)

Se, por exemplo, um caso de uso B tiver 5 cenários, sendo 3 de sucesso, e outro caso de uso A, que depende de B, tiver 8 cenários, a cobertura total seria de 40 combinações, enquanto a cobertura alcançada seria de 24 combinações, ou 60% do total. Apesar de esta cobertura não ser total (100%), acredita-se que ela seja eficaz para testes de casos de uso, uma vez que a geração de cenários incorretos ou incompletos pode impedir o teste do caso de uso alvo. É importante notar que a combinação entre cenários é multiplicativa, ou seja, a cada caso de uso adicionado, seu conjunto de cenários é multiplicado pelos atuais.
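Para ilustrar a ordenação topológica citada acima, segue um esboço mínimo em Java do algoritmo de Kahn [14] aplicado ao grafo de dependências entre casos de uso. Os nomes de classes e métodos são hipotéticos e não correspondem necessariamente ao código da ferramenta:

import java.util.*;

// Esboço ilustrativo: ordena os casos de uso de modo que todo caso de uso
// apareça depois daqueles dos quais depende (pré-condições e chamadas em passos).
public class OrdenacaoCasosDeUso {

    // "dependentes": para cada caso de uso, o conjunto dos que dependem dele.
    // "grauEntrada": número de dependências de cada caso de uso (0 para os independentes);
    // deve conter todos os casos de uso do SST.
    public static List<String> ordenar(Map<String, Set<String>> dependentes,
                                       Map<String, Integer> grauEntrada) {
        Deque<String> semDependencias = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : grauEntrada.entrySet())
            if (e.getValue() == 0) semDependencias.add(e.getKey());

        List<String> ordem = new ArrayList<>();
        while (!semDependencias.isEmpty()) {
            String atual = semDependencias.remove();
            ordem.add(atual);
            for (String dependente : dependentes.getOrDefault(atual, Set.of())) {
                int novoGrau = grauEntrada.merge(dependente, -1, Integer::sum);
                if (novoGrau == 0) semDependencias.add(dependente);
            }
        }
        if (ordem.size() != grauEntrada.size())
            throw new IllegalStateException("Ciclo de dependências entre casos de uso");
        return ordem; // os cenários são gerados e combinados nesta ordem
    }
}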
Foram consideradas algumas soluções para este problema, cuja implementação está entre os trabalhos futuros, discutidos na Seção VI.

E. Geração de casos de teste úteis (Etapa 5)

Nesta etapa, a ferramenta gera os casos de teste úteis (formados por comandos, dados valorados e oráculos) utilizando os cenários e as regras de negócio. Estas regras permitem inferir os valores válidos e não válidos para cada elemento e gerá-los automaticamente, de acordo com o tipo de teste. Os oráculos são gerados com uso da definição das mensagens esperadas para quando são fornecidos valores não válidos, de acordo com o tipo de verificação desejada para o teste. Como cada descrição de expectativa de comportamento do sistema é transformada em comandos semânticos e estes, posteriormente, em comandos na linguagem usada pelo arcabouço de testes, quando um comando não é executado corretamente por motivo de o SST não corresponder à sua expectativa, o teste automaticamente falhará. Assim, não é necessário haver oráculos que verifiquem a existência de elementos de interface, exceto a exibição de mensagens.

Segundo Myers [15], o teste de software torna-se mais eficaz se os valores de teste são gerados com base na análise das "condições de contorno" ou "condições limite". Ao utilizar valores acima e abaixo deste limite, os casos de teste exploram as condições que aumentam a chance de encontrar defeitos. De acordo com o British Standard [16], o teste de software torna-se mais eficaz se os valores são particionados ou divididos de alguma maneira, como, por exemplo, ao meio. Além disto, é interessante a inclusão do valor zero (0), que permite testar casos de divisão por zero, bem como o uso de valores aleatórios, que podem fazer o software atingir condições imprevistas.

Portanto, levando em consideração as regras de negócio definidas, para a geração de valores considerados válidos, independentemente do tipo, são adotados os critérios de: (a) valor mínimo; (b) valor imediatamente posterior ao mínimo; (c) valor máximo; (d) valor imediatamente anterior ao máximo; (e) valor intermediário; (f) zero, se dentro da faixa permitida; (g) valor aleatório, dentro da faixa permitida. E, para a geração de valores considerados não válidos, os critérios de: (a) valor imediatamente anterior ao mínimo; (b) valor aleatório anterior ao mínimo; (c) valor imediatamente posterior ao máximo; (d) valor aleatório posterior ao máximo; (e) formato de valor incorreto.
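A título de ilustração, um esboço simplificado em Java da geração de valores para uma regra numérica com limites mínimo e máximo poderia seguir os critérios acima. Os nomes e a estrutura são hipotéticos, não extraídos da ferramenta:

import java.util.*;

// Esboço: gera valores de teste válidos e não válidos para uma regra numérica
// com limites mínimo e máximo (análise de valor limite e particionamento).
public class GeradorDeValores {

    public static List<Long> valoresValidos(long min, long max, Random rnd) {
        List<Long> valores = new ArrayList<>(List.of(
                min,                         // valor mínimo
                min + 1,                     // imediatamente posterior ao mínimo
                max,                         // valor máximo
                max - 1,                     // imediatamente anterior ao máximo
                min + (max - min) / 2));     // valor intermediário
        if (min <= 0 && 0 <= max) valores.add(0L);                  // zero, se dentro da faixa
        valores.add(min + (long) (rnd.nextDouble() * (max - min))); // aleatório dentro da faixa
        return valores;
    }

    public static List<Long> valoresNaoValidos(long min, long max, Random rnd) {
        return List.of(
                min - 1,                      // imediatamente anterior ao mínimo
                min - 1 - rnd.nextInt(1000),  // aleatório anterior ao mínimo
                max + 1,                      // imediatamente posterior ao máximo
                max + 1 + rnd.nextInt(1000)); // aleatório posterior ao máximo
    }
}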

A Tabela II exibe os tipos de teste gerados na versão atual da ferramenta, que visam cobrir as regras de negócio definidas. Com base neles, pode-se calcular a quantidade mínima de casos de teste úteis gerados por cenário, QM, como:

QM = 8 + 4E + O + F

onde E é o número de elementos editáveis, O é o número de elementos editáveis obrigatórios e F é o número de elementos editáveis com formato definido. Para cada cenário de um caso de uso simples, por exemplo, com 5 elementos editáveis, sendo 3 obrigatórios e 1 com formatação, existirão 32 casos de teste.

TABELA II. TIPOS DE TESTE GERADOS

Descrição                                                                                            | Conclui o caso de uso | Testes
Somente obrigatórios                                                                                 | Sim | 1
Todos os obrigatórios, exceto um                                                                     | Não | 1 por elemento editável obrigatório
Todos com valor/tamanho mínimo                                                                       | Sim | 1
Todos com valor/tamanho posterior ao mínimo                                                          | Sim | 1
Todos com valor/tamanho máximo                                                                       | Sim | 1
Todos com valor/tamanho anterior ao máximo                                                           | Sim | 1
Todos com o valor intermediário, dentro da faixa                                                     | Sim | 1
Todos com zero, ou um valor aleatório dentro da faixa, se zero não for permitido                     | Sim | 1
Todos com valores aleatórios dentro da faixa                                                         | Sim | 1
Todos com valores aleatórios dentro da faixa, exceto um, com valor imediatamente anterior ao mínimo  | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor aleatório anterior ao mínimo      | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor imediatamente posterior ao máximo | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor aleatório posterior ao máximo     | Não | 1 por elemento editável
Todos com formato permitido, exceto um                                                               | Não | 1 por elemento com formato definido

Os testes gerados cobrem todas as regras de negócio definidas, explorando seus valores limítrofes e outros. Como o total de testes possíveis para um caso de uso não é conhecido, acredita-se que a cobertura acumulada, unindo todas as coberturas descritas até aqui, exercite consideravelmente o SST na busca por defeitos, obtendo alta eficácia. Atualmente os casos de teste úteis são exportados para a JavaScript Object Notation (JSON), por ser compacta, independente de linguagem de programação e fácil de analisar gramaticalmente.

F. Transformação em código-fonte (Etapa 6)

Esta etapa e as duas seguintes utilizam a extensão da ferramenta escolhida pelo usuário, de acordo com o software a ser testado. A extensão da ferramenta lê o arquivo JSON contendo os testes úteis e os transforma em código-fonte, para a linguagem e arcabouços de teste disponibilizados por ela. Atualmente, há uma extensão que os transforma em código Java para os arcabouços TestNG e FEST, visando o teste de aplicações com interface gráfica Swing. Para facilitar o rastreamento de falhas ou erros nos testes, a extensão construída realiza a instrumentação do código-fonte gerado, indicando, com comentários de linha, o passo semântico correspondente à linha de código. Esta instrumentação será utilizada na Etapa 8, para pré-análise dos resultados.

G. Execução do código-fonte (Etapa 7)

Nesta etapa, a extensão da ferramenta executa o código-fonte de testes gerado. Para isto, ela usa linhas de comando configuradas pelo usuário, que podem incluir a chamada a um compilador, ligador (linker), interpretador, navegador web, ou qualquer outra aplicação ou arquivo de script que dispare a execução dos testes. Durante a execução, o arcabouço de testes utilizado gera um arquivo com o log de execução dos testes. Este arquivo será lido e analisado na próxima etapa do processo.
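Como ilustração do tipo de código instrumentado produzido na Etapa 6 e executado na Etapa 7, segue um esboço hipotético de um caso de teste em Java com TestNG e FEST-Swing. Os nomes de classes, widgets, mensagens e o auxiliar TestHelper são meramente ilustrativos, e as chamadas FEST estão simplificadas; o trecho não corresponde à saída real da ferramenta:

import org.fest.swing.fixture.FrameFixture;
import org.testng.annotations.Test;

public class CadastroClienteTest {

    // Fixture hipotética apontando para a janela principal do SST em teste.
    private final FrameFixture janela = TestHelper.abrirJanelaPrincipal();

    @Test
    public void todosComValorImediatamentePosteriorAoMaximo_idade() {
        // passo: ator aciona o botão "Novo Cliente"
        janela.button("btnNovoCliente").click();
        // passo: ator informa "101" na caixa de texto "idade" (valor máximo definido: 100)
        janela.textBox("txtIdade").enterText("101");
        // passo: ator aciona o botão "Salvar"
        janela.button("btnSalvar").click();
        // oráculo: sistema exibe a mensagem definida na regra de negócio do elemento "idade"
        janela.optionPane().requireMessage("Idade deve ser menor ou igual a 100");
    }
}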
H. Conversão e pré-análise dos resultados de execução (Etapa 8)

Nesta etapa, a extensão da ferramenta lê o log de execução dos testes e analisa os testes que falharam ou obtiveram erro, investigando: (a) a mensagem de exceção gerada, para detectar o tipo de problema ocorrido; (b) o rastro da pilha de execução, para detectar o arquivo e a linha do código-fonte onde a exceção ocorreu e obter a identificação do passo semântico correspondente (definida pela instrumentação realizada na Etapa 6), possibilitando rastrear o passo, o fluxo e o cenário correspondentes; e (c) a comparação do resultado esperado pelo teste semântico com o obtido. O log de execução e as informações obtidas da pré-análise são convertidos para um formato independente de arcabouço de testes e exportados para um arquivo JSON, que será lido e analisado pela ferramenta na próxima etapa.

I. Análise e apresentação dos resultados (Etapa 9)

Por fim, a ferramenta realiza a leitura do arquivo com os resultados da execução dos testes e procura analisá-los para rastrear as causas de cada problema encontrado (se houver). Nesta etapa, o resultado da execução é confrontado com a especificação do software, visando identificar possíveis problemas.
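Um esboço simplificado, em Java, de como a pré-análise da Etapa 8 poderia mapear a linha apontada pelo rastro da pilha para o passo semântico anotado na instrumentação, supondo comentários no formato hipotético "// passo: ..." (o mecanismo real da ferramenta pode diferir):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Esboço: dado o arquivo-fonte e a linha indicados pelo rastro da pilha,
// procura, naquela linha ou nas imediatamente anteriores, o comentário
// de instrumentação que identifica o passo semântico correspondente.
public class PreAnalise {

    public static String passoSemantico(Path fonte, int linhaDaFalha) throws IOException {
        List<String> linhas = Files.readAllLines(fonte);
        for (int i = Math.min(linhaDaFalha, linhas.size()) - 1; i >= 0; i--) {
            String linha = linhas.get(i).trim();
            int pos = linha.indexOf("// passo:");
            if (pos >= 0) {
                return linha.substring(pos + "// passo:".length()).trim();
            }
        }
        return "passo não identificado";
    }
}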

Fig. 2. Arquitetura da solução

IV. ARQUITETURA

A Figura 2 apresenta a arquitetura da solução construída, indicando seus componentes e a ordem de interação entre eles, de forma a fornecer uma visão geral sobre como o processo descrito é praticado. É interessante observar que, em geral, o código gerado fará uso de dois arcabouços de teste: um para automatizar a execução dos testes e outro para testar especificamente o tipo de interface (com o usuário) desejada. É esse arcabouço de automação dos testes que irá gerar os resultados da execução dos testes lidos pela extensão da ferramenta.

V. AVALIAÇÃO

Uma avaliação preliminar da eficácia da ferramenta foi realizada com um software construído por terceiros, coletado da Internet. O software avaliado contém uma especificação de requisitos por descrição textual de casos de uso, realiza acesso a um banco de dados MySQL e possui interface Swing. A especificação encontrada no software avaliado estava incompleta, faltando, por exemplo, alguns fluxos alternativos e regras de negócio. Quando isso ocorre, dá-se margem para que a equipe de desenvolvedores do software complete a especificação conforme sua intuição (e criatividade), muitas vezes divergindo da intenção original do projetista, que tenta mapear o problema real. Como isto acabou ocorrendo no software avaliado, optou-se por coletar os detalhes não presentes na especificação a partir da implementação do software. Desta forma, são aumentadas as chances de a especificação estar próxima da implementação, acusando menos defeitos deste tipo.

Para testar a eficácia da ferramenta, foram geradas: (a) uma versão da especificação funcional do SST com divergências; (b) duas versões modificadas do SST, com emprego de mutantes. Para gerar as versões com emprego de mutantes, levou-se em consideração, assim como no trabalho de Gutiérrez et al. [7], o modelo de defeitos de caso de uso introduzido por Binder [17], que define operadores mutantes para casos de uso. O uso desses operadores é considerado mais apropriado para testes funcionais (do que os operadores "clássicos"), uma vez que seu objetivo não é testar a cobertura do código do SST em si, mas gerar mudanças no comportamento do SST que possam ser observadas por testes funcionais. Nesse contexto, o termo "mutante" possui uma conotação diferente do mesmo termo aplicado a código. Um mutante da especificação funcional tem o poder de gerar uma variedade de casos de teste que reportarão falhas. Da mesma forma, um mutante de código tem o poder de gerar falhas em uma variedade de casos de teste gerados a partir da especificação funcional.

Na versão da especificação do SST com divergências, foram introduzidas novas regras de negócio, visando verificar se os testes gerados pela ferramenta seriam capazes de identificar as respectivas diferenças em relação ao SST. Na primeira versão modificada do SST, foi usado o operador

mutante para casos de uso "substituição de regras de validação ou da admissão de um dado como correto" (SRV). Com ele, operadores condicionais usados no código de validação de dados do SST foram invertidos, de forma a não serem admitidos como corretos. E, na segunda versão do SST, foi utilizado o operador mutante de casos de uso "informação incompleta ou incorreta mostrada pelo sistema" (I3). Com ele, as mensagens mostradas pelo sistema quando detectado um dado inválido foram modificadas, de forma a divergirem da esperada pela especificação.

A Tabela III apresenta a quantidade de modificações realizadas nos três principais cenários analisados e a Tabela IV, o número de testes que obtiveram falha, aplicando-se estas modificações. Com o total de 7 modificações na especificação original, mais 9 testes obtiveram falha em relação ao resultado original. Com o emprego de 22 mutações com o primeiro mutante (SRV), mais 51 testes obtiveram falha. E com o emprego de 10 mutações com o segundo mutante (I3), mais 12 testes falharam.

TABELA III. MODIFICAÇÕES NOS TRÊS PRINCIPAIS CENÁRIOS

Cenário   | Modificações na especificação funcional | Mutações com mutante SRV | Mutações com mutante I3
Cenário 1 | -  | -  | -
Cenário 2 | -  | -  | -
Cenário 3 | -  | -  | -
TOTAL     | 7  | 22 | 10

TABELA IV. NÚMERO DE TESTES QUE FALHARAM

Cenário   | SST original | SST frente à especificação funcional com divergências | SST com mutante SRV | SST com mutante I3
Cenário 1 | - | - | - | -
Cenário 2 | - | - | - | -
Cenário 3 | - | - | - | -
TOTAL     | - | - | - | -

Assim, além de corretamente detectar defeitos na versão original, gerados por entradas de dados não válidas e por diferenças na especificação, os testes gerados pela ferramenta foram capazes de observar as mudanças realizadas, tanto em relação à especificação quanto em relação à implementação do SST.

VI. CONCLUSÕES

O presente artigo apresentou uma nova abordagem para a geração e execução automática de testes funcionais, baseada na especificação de requisitos através da descrição textual de casos de uso e do detalhamento de suas regras de negócio. Os resultados preliminares obtidos com o emprego da ferramenta foram bastante promissores, uma vez que se pôde perceber que ela é capaz de atingir alta eficácia em seus testes, encontrando corretamente diferenças entre a especificação e a implementação do SST. Além disso, seu uso permite estabelecer um meio de aplicar Test-Driven Development no nível de especificação, de forma que o software seja construído, incrementalmente, para passar nos testes gerados pela especificação construída. Ou seja, uma equipe de desenvolvimento pode criar a especificação funcional de uma parte do sistema, gerar os testes funcionais a partir desta especificação, implementar a funcionalidade correspondente e executar os testes gerados para verificar se a funcionalidade corresponde à especificação. Isto, inclusive, pode motivar a criação de especificações mais completas e corretas, uma vez que compensará fazê-lo.
A. Principais contribuições

Dentre as principais contribuições da abordagem construída destacam-se: (a) apresentação de um processo completo e totalmente automatizado; (b) uso da descrição de regras de negócio na especificação dos casos de uso, permitindo gerar os valores e oráculos dos testes, bem como tornar desnecessário descrever uma parcela significativa dos fluxos alternativos; (c) uso de fontes de dados externas (ex.: banco de dados) na composição das regras de negócio, permitindo a simulação de condições reais de uso; (d) geração de cenários que envolvem repetições de fluxos (loops) com número de repetições parametrizável; (e) geração de cenários que envolvem mais de um caso de uso; (f) geração de testes semânticos com nomes correlatos ao tipo de verificação a ser realizada, permitindo que o desenvolvedor entenda o que cada teste verifica, facilitando sua manutenção; e (g) uso de vocabulário configurável, permitindo realizar a descrição textual em qualquer idioma e adaptá-la a diferentes arcabouços de teste.

B. Principais restrições

As principais restrições da abordagem apresentada atualmente são: (a) não simulação de testes para fluxos que tratam exceções (ex.: falha de comunicação via rede, falha da mídia de armazenamento, etc.), uma vez que exceções tendem a ser de difícil (senão impossível) simulação através de testes funcionais realizados através da interface com o usuário; (b) as regras de negócio atualmente não suportam o uso de expressões que envolvam cálculos, fórmulas matemáticas ou expressões condicionais (ex.: if-then-else); (c) a abrangência dos tipos de interface gráfica passíveis de teste pela ferramenta é proporcional aos arcabouços de testes de interface utilizados. Assim, é possível que determinados tipos de interface com o usuário, como as criadas para games ou aplicações multimídia, não sejam testados por completo, se o arcabouço de testes escolhido não os suportar, ou não suportar determinadas operações necessárias para seu teste.

C. Trabalhos em andamento

Atualmente, outros tipos de teste estão sendo acrescentados à ferramenta, aumentando seu potencial de verificação. A capacidade de análise automática dos problemas ocorridos, de acordo com os resultados fornecidos pelo arcabouço de testes alvo, também está sendo ampliada. Testes mais rigorosos estão sendo elaborados para verificar a flexibilidade e a eficácia da ferramenta em diferentes situações de uso prático.

D. Trabalhos futuros

A atual geração de cenários combina todos os fluxos do caso de uso pelo menos uma vez, garantindo um bom nível de cobertura para a geração de testes. Isto é um atributo desejável para verificar o SST antes da liberação de uma versão para o usuário final. Durante seu desenvolvimento, entretanto, pode ser interessante diminuir a cobertura realizada, para que o processo de execução de testes ocorra em menor tempo. Para isto, propõe-se o uso de duas técnicas: (a) atribuir um valor de importância para cada fluxo, que servirá como filtro para a seleção dos cenários desejados para teste; e (b) realizar a indicação da não influência de certos fluxos em outros, evitando gerar cenários que os combinem. Esta última técnica deve ser empregada com cuidado, para evitar a indicação de falsos negativos (fluxos que se pensa não influenciarem o estado de outros, mas que na verdade influenciam).

Para diminuir a cobertura realizada pela combinação de cenários, com o intuito de acelerar a geração dos cenários para fins de testes rápidos, pode-se empregar a técnica de uso do valor de importância, descrita anteriormente, além de uma seleção aleatória e baseada em histórico. Nesta seleção, o combinador de cenários realizaria a combinação entre dois casos de uso escolhendo, pseudoaleatoriamente, um cenário compatível (de acordo com as regras de combinação descritas na Etapa 4) de cada um. Para esta seleção não se repetir, seria guardado um histórico das combinações anteriores. Dessa forma, a cobertura seria atingida gradualmente, alcançando a cobertura completa ao longo do tempo.

Para reduzir o número de testes gerados, também se pode atribuir um valor de importância para as regras de negócio, de forma a gerar testes apenas para as mais relevantes.
Também se poderia adotar a técnica de seleção gradual aleatória (descrita anteriormente) para as demais regras, a fim de que a cobertura total das regras de negócio fosse atingida ao longo do tempo. Por fim, pretende-se criar versões dos algoritmos construídos (não discutidos neste artigo) que executem em paralelo, acelerando o processo. Obviamente, também serão criadas extensões para outros arcabouços de teste, como Selenium ou JWebUnit (que visam aplicações web), aumentando a abrangência do uso da ferramenta.

REFERÊNCIAS

[1] MILLER, Keith W., MORELL, Larry J., NOONAN, Robert E., PARK, Stephen K., NICOL, David M., MURRILL, Branson W., VOAS, Jeffrey M., "Estimating the probability of failure when testing reveals no failures", IEEE Transactions on Software Engineering, vol. 18, 1992.
[2] BENDER, Richard, "Proposed software evaluation and test KPA", n. 4, 1996.
[3] MAGALHÃES, João Alfredo P., "Recovery oriented software", Tese de Doutorado, PUC-Rio, Rio de Janeiro.
[4] CROSBY, Philip B., "Quality is free", McGraw-Hill, New York.
[5] COCKBURN, Alistair, "Writing effective use cases", Addison-Wesley.
[6] DÍAS, Isabel, LOSAVIO, Francisca, MATTEO, Alfredo, PASTOR, Oscar, "A specification pattern for use cases", Information & Management, n. 41, 2004.
[7] GUTIÉRREZ, Javier J., ESCALONA, Maria J., MEJÍAS, Manuel, TORRES, Jesús, CENTENO, Arturo H., "A case study for generating test cases from use cases", University of Sevilla, Sevilla, Spain.
[8] CALDEIRA, Luiz Rodolfo N., "Geração semi-automática de massas de testes funcionais a partir da composição de casos de uso e tabelas de decisão", Dissertação de Mestrado, PUC-Rio, Rio de Janeiro.
[9] PESSOA, Marcos B., "Geração e execução automática de scripts de teste para aplicações web a partir de casos de uso direcionados por comportamento", Dissertação de Mestrado, PUC-Rio, Rio de Janeiro.
[10] KASSEL, Neil W., "An approach to automate test case generation from structured use cases", Tese de Doutorado, Clemson University.
[11] JIANG, Mingyue, DING, Zuohua, "Automation of test case generation from textual use cases", Zhejiang Sci-Tech University, Hangzhou, China.
[12] BERTOLINI, Cristiano, MOTA, Alexandre, "A framework for GUI testing based on use case design", Universidade Federal de Pernambuco, Recife, Brazil.
[13] HASSAN, Hesham A., YOUSIF, Zahraa E., "Generating test cases for platform independent model by using use case model", International Journal of Engineering Science and Technology, vol. 2.
[14] KAHN, Arthur B., "Topological sorting of large networks", Communications of the ACM, vol. 5, 1962.
[15] MYERS, Glenford J., "The art of software testing", John Wiley & Sons, New York.
[16] BRITISH STANDARD, "Software testing: vocabulary".
[17] BINDER, Robert V., "Testing object-oriented systems: models, patterns and tools", Addison-Wesley.

64 Visualization, Analysis, and Testing of Java and AspectJ Programs with Multi-Level System Graphs Otávio Augusto Lazzarini Lemos, Felipe Capodifoglio Zanichelli, Robson Rigatto, Fabiano Ferrari, and Sudipto Ghosh Science and Technology Department Federal University of São Paulo at S. J. dos Campos Brazil {otavio.lemos, felipe.zanichelli, Computing Department Federal University of Sao Carlos Brazil Department of Computer Science, Colorado State University, Fort Collins, CO, USA Abstract Several software development techniques involve the generation of graph-based representations of a program created via static analysis. Some tasks, such as integration testing, require the creation of models that represent several parts of the system, and not just a single component or unit (e.g., unit testing). Besides being a basis for testing and other analysis techniques, an interesting feature of these models is that they can be used for visual navigation and understanding of the software system. However, the generation of such models henceforth called system graphs is usually costly, because it involves the reverse engineering and analysis of the whole system, many times done upfront. A possible solution for such a problem is to generate the graph on demand, that is, to postpone detailed analyses to when the user really needs it. The main idea is to start from the package structure of the system, representing dependencies at a high level, and to make control flow and other detailed analysis interactively and on demand. In this paper we use this idea to define a model for the visualization, analysis, and structural testing of objectoriented (OO) and aspect-oriented (AO) programs. The model is called Multi-Level System Graph (MLSG), and is implemented in a prototype tool based on Java and AspectJ named SysGraph4AJ (for Multi-Level System Graphs for AspectJ). To evaluate the applicability of SysGraph4AJ with respect to performance, we performed a study with three AspectJ programs, comparing SysGraph4AJ with a similar tool. Results indicate the feasibility of the approach, and its potential in helping developers better understand and test OO and AO Java programs. In particular, SysGraph4AJ performed around an order of magnitude faster than the other tool. I. INTRODUCTION Several software engineering tasks require the representation of source code in models suitable for analysis, visualization, and testing [1]. For instance, control flow graphs can be used for structural testing [2], and call graphs can be used for compiler optimization [3]. Some techniques require the generation of graphs that represent the structure of multiple modules or whole systems. For instance, structural integration testing may require the generation of control flow graphs for several units that interact with each other in a program [2, 4]. Most testing tools focus only on the representation of local parts of the systems, outside their contexts, or do not even support the visualization of the underlying models. For instance, JaBUTi, a family of tools for testing Java Object Oriented (OO) and Aspect-Oriented (AO) programs, supports the visualization of models of single units [5], pairs of units [4], subsets of units [2], or, at most, a cluster of units of the system [6]. On the other hand, tools like Cobertura [7] and EMMA [8], which support branch and statement coverage analysis of Java programs, do not support visualization of the underlying control flow models. 
Being able to view these models in a broader context is important to improve understanding of the system as a whole, especially for testers and white-box testing researchers and educators. Nevertheless, the generation of such system models is usually costly because it requires the reverse engineering and analysis of whole systems. Such a task may affect the performance of tools, a feature very much valued by developers nowadays. For instance, recently Eclipse, the leading Java IDE, has been criticized for performance issues [9]. To make the construction of these models less expensive, analyses can be made at different levels of abstraction, on demand, and interactively. This strategy also supports a visual navigation of the system, where parts considered more relevant can be targeted. The visualization itself might also help discovering faults, since it is a form of inspection, but more visually appealing. The strategy of analyzing systems incrementally is also important because studies indicate that the distribution of faults in software systems follow the Pareto principle; that is, a small number of modules tend to present the majority of faults [10]. In this way, it makes more sense to be able to analyze systems in parts (but also within their contexts), providing a focused testing activity and thus saving time. In this paper we apply this strategy for the representation of Java and AspectJ programs. The type of model we propose called Multi-Level System Graph (MLSG) supports the visualization, analysis, and structural testing of systems written in those languages. Since researchers commonly argue that Aspect-Oriented Programming (AOP) introduces uncertainty about module interactions [11], we believe the MLSG is particularly useful in this context. The first level of the MLSG shows the system s package

65 structure. As the user chooses which packages to explore, classes are then analyzed and shown in a second level, and methods, fields, pieces of advice, and pointcuts can be viewed in a third level. From then on, the user can explore low level control-flow and call chain graphs built from specific units (methods or pieces of advice). At this level, dynamic coverage analysis can also be performed, by supporting the execution of test cases. It is important to note that the analysis itself is only done when the user selects to explore a particular structure of the system; that is, it is not only a matter of expanding or collapsing the view. To automate the visualization, analysis, and testing based on MLSGs, we implemented a tool called SysGraph4AJ (Multi- Level System Graphs for AspectJ). The tool implements the MLSG model and supports its visualization and navigation. Currently the tool also supports statement, branch, and cyclomatic complexity coverage analysis at unit level within MLSGs. We have also conducted an initial evaluation of the tool with three large AspectJ systems. Results show evidence of the feasibility of the tool for the analysis and testing of OO and AO programs. The remainder of this paper is structured as follows. Section II discuss background concepts related to the presented approach and Section III presents the proposed approach. Section IV introduces the SysGraph4AJ tool. Section V discusses an initial evaluation of the approach and Section VI presents related work. Finally, Section VII presents the conclusions and future work. II. BACKGROUND To help understanding the approach proposed in this paper, we briefly introduce key concepts related to Software testing and AOP. A. Software testing and White-box testing Software testing can be defined as the execution of a program against test cases with the intent of revealing faults [12]. The different testing techniques are defined based on the artifact used to derive test cases. Two of the most wellknown techniques are functional or black-box testing, and structural or white-box testing. Black-box testing derives test cases from the specification or description of a program, while white-box testing derives test cases from the internal representation of a program [12]. Some of the well-known structural testing criteria are statement, branch, or def-use [13] coverage, which require that all commands, decisions, or pairs of assignment and use locations of a variable be covered by test cases. In this paper we consider method and advice (see next section) as the smallest units to be tested, i.e., the units targeted by unit testing. In white-box testing, the control-flow graph (CFG) is used to represent the flow of control of a program, where nodes represent a block of statements executed sequentially, and edges represent the flow of control from one block to another [13]. White-box testing is usually supported by tools, as manually deriving CFGs and applying testing criteria is unreliable and uneconomical [14]. However, most open and professional coverage analysis tools do not support the visualization of the CFGs (such as Cobertura, commented in Section I, Emma 1, and Clover 2 ). Some prototypical academic tools do support the visualization of CFG, but mostly separated from its context (i.e., users select to view the model from code, and only the CFG for that unit is shown, apart from any other representation). For instance, JaBUTi supports the visualization of CFGs or CFG clusters apart from the system overall structure. 
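As a minimal illustration of the CFG notion discussed above (a sketch with hypothetical names, not taken from any of the cited tools), a control-flow graph can be represented as blocks of statements plus directed edges between them, over which an all-edges coverage measure can be computed:

import java.util.*;

// Sketch: a control-flow graph where each node is a block of statements
// executed sequentially and each edge is a possible transfer of control.
public class ControlFlowGraph {
    private final Map<Integer, List<String>> blocks = new LinkedHashMap<>();     // block id -> statements
    private final Map<Integer, Set<Integer>> successors = new LinkedHashMap<>(); // block id -> next blocks

    public void addBlock(int id, List<String> statements) {
        blocks.put(id, List.copyOf(statements));
        successors.putIfAbsent(id, new LinkedHashSet<>());
    }

    public void addEdge(int from, int to) {
        successors.get(from).add(to);
    }

    // All-edges coverage: fraction of CFG edges exercised by the executed edges
    // (edges are given as pairs created with Map.entry(from, to)).
    public double edgeCoverage(Set<Map.Entry<Integer, Integer>> executedEdges) {
        long total = successors.values().stream().mapToLong(Set::size).sum();
        long covered = executedEdges.stream()
                .filter(e -> successors.getOrDefault(e.getKey(), Set.of()).contains(e.getValue()))
                .count();
        return total == 0 ? 1.0 : (double) covered / total;
    }
}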
In this paper we propose a multi-level model for visualization, analysis, and testing of OO and AO programs that is built interactively. CFGs are shown within the larger model, so testing can be done with a broader understanding of the whole system.

B. AOP and AspectJ

AOP supports the implementation of separate modules called aspects that contribute to the implementation of other modules of the system. General-purpose AOP languages define four features: (1) a join point model that describes hooks in the program where additional behavior may be defined; (2) a mechanism for identifying these join points; (3) modules that encapsulate both join point specifications and behavior enhancement; and (4) a weaving process to combine both base code and aspects [15].

AspectJ [16], the most popular AOP language to date, is an extension of the Java language to support general-purpose AOP. In AspectJ, aspects are modules that combine join point specifications, that is, pointcuts or, more precisely, pointcut designators (PCDs; strictly speaking, a pointcut is the set of selected join points itself and the PCD is the language construct that defines it, but for simplicity we use these terms interchangeably); pieces of advice, which implement the desired behavior to be added at join points; and regular OO structures such as methods and fields. Advice can be executed before, after, or around join points selected by the corresponding pointcut, and is implemented as method-like constructs. Advice can also pick up context information from the join point that caused it to execute. Aspects can also declare members (fields and methods) to be owned by other types, i.e., inter-type declarations. AspectJ also supports declarations of warnings and errors that arise when certain join points are identified at compile time, or reached at execution.

Consider the partial AspectJ implementation of an Online Music Service shown in Figure 1, where songs from a database can be played and have their information shown to the user (adapted from an example presented by Bodkin and Laddad [17]). Each user has an account with credit to access songs available on the database. At a certain price, given for each song, users can play songs. The song price is debited from the user account whenever the user plays it. Reading the lyrics of a song should be available to users at no charge.

public class Song implements Playable {
  private String name;
  public Song(String name) { ... }
  public String getName() { ... }
  public void play() { ... }
  public void showLyrics() { ... }
  public boolean equals(Object o) { ... }
  public int hashCode() { ... }
}

public aspect BillingPolicy {
  public pointcut useTitle() : execution(* Playable.play(..)) ||
                               execution(* Song.showLyrics(..));
  public pointcut topLevelUseTitle() : useTitle() && !cflowbelow(useTitle());

  after(Playable playable) returning : topLevelUseTitle() && this(playable) {
    User user = (User) Session.instance().getValue("currentUser");
    int amount = playable.getName().length();
    user.getAccount().bill(amount);
    System.out.println("Charge: " + user + " " + amount);
  }
}

public aspect AccountSuspension {
  private boolean Account.suspended = false;
  public boolean Account.isSuspended() { ... }

  after(Account account) returning : set(int Account.owed) && this(account) { ... }

  before() : BillingPolicy.topLevelUseTitle() {
    User user = (User) Session.instance().getValue("currentUser");
    if (user.getAccount().isSuspended()) {
      throw new IllegalArgumentException();
    }
  }
}

Fig. 1. Partial source code of the Online Music Service [17].

If a user tries to play a song without enough credits, the system yields an adequate failure. The system also manages the user access sessions. In particular, Figure 1 shows the Song class, which represents songs that can be played; the BillingPolicy aspect, which implements the billing policy of the system; and the AccountSuspension aspect, which implements the account suspension behavior of the system. Note that the after returning advice of the BillingPolicy aspect and the before advice of the AccountSuspension aspect affect the execution of some of the system's methods, according to the topLevelUseTitle pointcut.

III. INCREMENTAL VISUALIZATION, ANALYSIS, AND TESTING OF OO AND AO PROGRAMS

To support the visualization, analysis, and structural testing of OO and AO programs, an interesting approach is to derive the underlying model by levels and interactively. Such a model could then be used as a means to navigate through the system incrementally and apply structural testing criteria to test the program as it is analyzed. The visualization of the system using this model would also serve itself as a visually appealing inspection of the system's structure. This type of inspection could help discover faults statically, while the structural testing functionality would support the dynamic verification of the system.

In the AO code example presented in Figure 1 there are two faults. The first is related to the useTitle pointcut, which selects the execution of the showLyrics method as a join point. This makes the user be charged when accessing song lyrics. However, according to the specification of the program, reading lyrics should not be charged. As commented in Section I, AOP tends to introduce uncertainty about module interactions when only the code is inspected. In particular, for this example scenario, the explicit visualization of where the pieces of advice affect the system is important to discover the fault (4). We believe the model we propose would help the tester in such a task. The second fault present in the AO code example in Figure 1 is the throwing of an inadequate exception in AccountSuspension's before advice: IllegalArgumentException instead of AccountSuspensionException. Such a fault would not be easily spotted via inspection: only by executing the true part of the branch would the fault most probably be revealed.
For this case, structural testing using branch coverage analysis would be more adequate than the system's inspection (5). Based on this motivating example, we believe the interplay between visualization and structural testing is a promising approach, especially for AO programs. Therefore, in this section we define a model to support such an approach, also keeping in mind the performance issue when dealing with large models. The incremental strategy is intended to keep an adequate performance while deriving the model.

A. The Multi-Level System Graph

The model we propose is called Multi-Level System Graph (MLSG), and it represents the high-level package structure of the system all the way down to the control flow of its units. The MLSG is a composition of known models such as call and control flow graphs that nevertheless, to the best of our knowledge, have not been combined in the way we propose in this paper. The MLSG can be formally defined as a directed graph MLSG = (N, E) where:

N = P ∪ C ∪ A ∪ M ∪ Ad ∪ F ∪ Pc ∪ Fl, where: P is the set of package nodes, C is the set of class nodes, A is the set of aspect nodes, M is the set of method nodes, Ad is the set of advice nodes, F is the set of field nodes, Pc is the set of pointcut nodes, and Fl is the set of control flow nodes that represent blocks of code statements.

(4) There are some features presented by modern IDEs that also help discovering such a fault. For instance, Eclipse AJDT [18] shows where each advice affects the system. This could help the developer notice that the showLyrics method should not be affected. However, we believe the model we present makes the join points more explicit, further facilitating the inspection of such types of fault.
(5) The example we present is only an illustration. It is clear that the application of graph inspections and structural testing could help reveal other types of faults. However, we believe the example serves as a motivation for the approach we propose in this paper, since it shows two instances of faults that could be found using an approach that supports both techniques.

Fig. 2. An example MLSG for the Music Online program (package, class/aspect, method/advice, and control flow levels; legend: P = package node, C = class node, A = aspect node, m = method node, f = field node, a = advice node, pc = pointcut node, fl = control flow node, + = unanalyzed node).

E = Co ∪ Ca ∪ I ∪ Fe, where: Co is the set of contains edges; a contains edge (N1, N2) represents that the structure represented by node N1 contains the structure represented by node N2. Ca is the set of call edges (N1, N2), N1 ∈ M, N2 ∈ M, which represent that the method represented by N1 calls the method represented by N2. I is the set of interception edges; an interception edge (N1, N2), N1 ∈ Ad, N2 ∈ (M ∪ Ad), represents that the method or advice represented by N2 is intercepted by the advice represented by N1. Fe is the set of flow edges (N1, N2), N1 ∈ Fl, N2 ∈ Fl, which represent that control may flow from the block of code represented by N1 to the block of code represented by N2.

The edge types are defined by the types of their source and target nodes (e.g., interception edges must have advice nodes as source nodes and advice or method nodes as target nodes). An example of a partial MLSG of the example AO system discussed in Section II is presented in Figure 2. Note that there are parts of the system that were not fully analyzed (unanalyzed nodes). This is because the model shows a scenario where the user chose to expand some of the packages, modules (classes/aspects), and units (methods/pieces of advice), but not all. By looking at the model we can quickly see that the BillingPolicy after returning advice and the AccountSuspension before advice are affecting the Song's play and showLyrics methods. As commented before, the interception of the showLyrics method, shown by crossed interception edges, is a fault, since only the play operation should be charged. This is an example of how the inspection of the MLSG can support discovering faults. In the same model we can also see the control-flow graph of the AccountSuspension before advice (note that the control flow graphs of the other units could also be expanded once the user selects this operation).

Coverage analysis can be performed by using the MLSG. For instance, by executing test cases against the program, with the model we could analyze how many of the statement blocks or branches of the AccountSuspension before advice would be covered. While doing this, the fault present in this unit could be uncovered.

Another type of model that is interesting for the visualization and testing of systems is the call chain graph (CCG), obtained from the analysis of the call hierarchy (such as done in [19]). The CCG shows the sequence of calls (and, in our case, advice interactions as well) that happen from a given unit. The same information is available in the MLSG; however, the CCG shows a more vertical view of the system, according to the call hierarchy, while the MLSG shows a more horizontal view of the system, according to its structure, with methods and pieces of advice at the same level. Figure 3 shows an example CCG built from the Playlist's play method.
Numbers indicate the order in which the methods are called or in which control flow is passed to pieces of advice. IV. IMPLEMENTATION: SYSGRAPH4AJ We developed a prototype tool named SysGraph4AJ (from Multi-Level System Graphs for AspectJ) that implements the proposed MLSG model. The tool is a standalone application written in Java. The main features we are keeping in mind while developing SysGraph4AJ is its visualization capability (we want the models to be shown by the tool to be intuitive

and useful at the same time) and its performance. In particular, we started developing SysGraph4AJ also because of some performance issues we have observed in other testing tools (such as JaBUTi). We believe this feature is much valued by developers nowadays, as commented in Section I.

Fig. 3. An example CCG for the Music Online program, built from the Playlist's play method.

The architecture of the tool is divided into the following five packages: model, analysis, gui, visualization, and graph. The model package contains classes that represent the MLSG constituents (each node and edge type); the analysis package is responsible for the analysis of the object code in order for it to be represented by a MLSG; the gui package is responsible for the Graphical User Interface; the visualization package is responsible for implementing the visualization of the MLSGs; and the graph package is responsible for the control flow graph implementation (we decided to separate it from the model package because this part of the MLSG is more complex than others).

For the analysis part of the tool, we used both the Java API (to load classes, for instance) and the Apache Byte Code Engineering Library (BCEL). BCEL was used to analyze methods (for instance, to check their visibility, parameter and return types) and aspects. We decided to make the analysis using bytecode for three reasons: (1) AspectJ classes and aspects are compiled into common Java bytecode and can be analyzed with BCEL, so no weaving of the source code is required (advice interactions can be more easily identified due to implementation conventions adopted by AspectJ [20]); (2) analysis can be made even in the absence of the source code; and (3) in comparison with source code, the bytecode represents more faithfully the interactions that actually occur in the system (so the model is more realistic).

The analysis package also contains a class responsible for the coverage analysis. Currently we are using the Java Code Coverage Library (JaCoCo) to implement class instrumentation and statement and branch coverage analysis. We decided to use JaCoCo because it is free and provided good performance in our initial experiments. For the visualization functionality, we used the Java Universal Network/Graph Framework (JUNG). We decided to use JUNG because it is open source, provides adequate documentation and is straightforward to use. Moreover, we made performance tests with JUNG and other graph libraries and noted that JUNG was the fastest.

The visualization package is the one that actually uses the JUNG API. This package is responsible for converting the underlying model represented in our system, built using the analysis package, to a graph in the JUNG representation. This package also implements the creation of the panel that will show the graphical representation of the graphs, besides coloring and mouse functionalities. To convert our model to a JUNG graph we begin by laying it out as a tree, where the root is the bin folder of the project being analyzed, and the methods, pieces of advice, fields and pointcuts are the leaves. This strategy makes the graph present the layered structure that we desire, such as in the example presented in Figure 2. Since the model is initially a tree, there are no call, interception, or flow edges, only contains edges. Information about the additional edges is stored in a table, and later added to the model.
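As a rough illustration of this conversion step, the sketch below builds a small JUNG directed graph with a few contains edges and hands it to a visualization viewer. Node names are hypothetical and the API usage is simplified; this is not the actual SysGraph4AJ code:

import edu.uci.ics.jung.algorithms.layout.CircleLayout;
import edu.uci.ics.jung.graph.DirectedSparseGraph;
import edu.uci.ics.jung.visualization.VisualizationViewer;
import javax.swing.JFrame;

// Sketch: a few MLSG-like nodes and contains edges rendered with JUNG.
public class MlsgViewSketch {
    public static void main(String[] args) {
        DirectedSparseGraph<String, String> graph = new DirectedSparseGraph<>();
        graph.addVertex("package:billing");
        graph.addVertex("aspect:BillingPolicy");
        graph.addVertex("advice:afterReturning");
        // contains edges (edge labels are just unique identifiers)
        graph.addEdge("contains-1", "package:billing", "aspect:BillingPolicy");
        graph.addEdge("contains-2", "aspect:BillingPolicy", "advice:afterReturning");

        VisualizationViewer<String, String> viewer =
                new VisualizationViewer<>(new CircleLayout<>(graph));

        JFrame frame = new JFrame("MLSG sketch");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.getContentPane().add(viewer);
        frame.pack();
        frame.setVisible(true);
    }
}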
The visualization package also contains a class to construct CCGs from MLSGs. The construction of the CCG is straightforward because all call and interception information is available at the MLSG. Control flow graphs are implemented inside the graph package. It contains a model subpackage that implements the underlying model, and classes to generate the graphs from a unit s bytecode. A. Tool s Usage When the user starts SysGraph4AJ, he must first decide which Java project to analyze. Currently this is done by selecting a bin directory which contains all bytecode classes of the system. Initially the tool shows the root package (which represents the bin directory) and all system packages. Each package is analyzed until a class or aspect is found (that is, if there are packages and subpackages with no classes, the lower level packages are analyzed until a module is reached). From there on, the user can double-click on the desired classes or aspects to see their structure constituents (methods, pieces of advice, fields, and pointcuts). When a user double-clicks a method, dependence analysis is performed, and call and interception edges are added to the MLSG, according to the systems interactions. Figure 4 shows a screenshot of SysGraph4AJ with a MLSG similar to the one presented in Figure 2, for the same example system. In SysGraph4AJ, we decided to use colors instead of letter labels to differentiate node types. We can see, for instance, that aspects are represented by pink nodes, methods are represented by light blue nodes, and pieces of advice are represented by white nodes. Contains edges are represented by dashed lines, call/interception edges are represented by dashed lines with larger dashes, and control-flow edges are represented by solid lines. Within CFGs, yellow nodes represent regular blocks while black nodes represent blocks that contain return statements (exit nodes). Control flow graphs can be shown by left-clicking on a unit and choosing the View control flow graph option. Call chain graphs can also be shown by choosing the View call chain option, or the View recursive call chain. The second option does the analysis recursively until the lowest level calls and advice interceptions, and shows the corresponding graph. The first option shows only the units that were already

analyzed in the corresponding MLSG. The CCG is shown in a separate window, as its view is essentially different from that of the MLSG (as commented in Section III). Figure 5 shows a screenshot of SysGraph4AJ with a CCG similar to the one presented in Figure 3. The sequence of calls skips number 2 because it refers to a library or synthetic method call. We decided to exclude those to provide a leaner model containing only interactions within the system. Such calls are not represented in the MLSG either.

Fig. 4. A screenshot of the SysGraph4AJ tool.

Fig. 5. An example CCG for the Music Online program, built from the Playlist's play method, generated with SysGraph4AJ.

Code coverage is performed by importing JUnit test cases. Currently, this is done via a menu option on SysGraph4AJ's main window menu bar. The user chooses which JUnit test class to execute, and the tool automatically runs the test cases and calculates instruction and branch coverage. Coverage is shown in a separate window, but we are currently implementing the visualization of coverage on the MLSG itself. For instance, when a JUnit test class is run, we want to show on the MLSG which classes were covered and the coverage percentages close to the corresponding class and units.

V. EVALUATION

As an initial evaluation of the SysGraph4AJ tool, we decided to measure its performance while analyzing realistic software systems. It is important to evaluate performance because this was one of our main expected quality attributes while developing the tool. For this study, we selected three medium-sized AspectJ applications. The first is an AspectJ version of a Java-based object-relational data mapping framework called ibatis. The second system is an aspect-oriented version of HealthWatcher [21], a typical Java web-based information system. The third target application is a software product line for mobile devices, called MobileMedia [22]. The three systems were used in several evaluation studies [2, 11, 23, 24]. To have an idea of size, the ibatis version used in our study has approximately 15 KLOC within 220 modules (classes and aspects) and 1330 units (methods and pieces of advice); HealthWatcher, 5 KLOC within 115 modules and 512 units; and MobileMedia, 3 KLOC within 60 modules and 280 units.

Besides measuring the performance of our tool by itself while analyzing the three systems, we also wanted to have a basis for comparison. In our case, we believe the JaBUTi tool [5] is the most similar in functionality to SysGraph4AJ. In particular, it also applies control-flow testing criteria and supports the visualization of CFGs (although not interactively and within a multi-level system graph such as in SysGraph4AJ). Moreover, JaBUTi is an open source tool, so we have access to its code. Therefore, we also evaluated JaBUTi while performing similar analyses of the same systems and selected modules. Since JaBUTi also performs data-flow analysis, to make a fair comparison, we removed this functionality from the system

before running the experiment.

The null hypothesis H1_0 of our experiment is that there is no difference in performance between SysGraph4AJ and JaBUTi while performing analyses of methods; and the alternative hypothesis H1_a is that SysGraph4AJ presents better performance than JaBUTi while performing analyses of methods. We randomly selected three units (method/advice) inside three modules (class/aspect) of each of the target systems. The idea was to simulate a situation where the tester would choose a single unit to be tested. We made sure that the analyzed structures were among the largest. The time taken to analyze and instrument each unit, including generating its CFG, was measured in milliseconds (ms). Since the generation of the model in SysGraph4AJ is interactive, to measure only the analysis and instrumentation performance, we registered the time taken by the specific operations that perform these functions (i.e., we recorded the time before and after each operation and summarized the results). With respect to JaBUTi, in order to test a method with this tool, the class that contains the method must be selected. When this is done, all methods contained in the class are analyzed. Therefore, in this evaluation we were in fact comparing the interactive analysis strategy adopted by SysGraph4AJ against the upfront analysis strategy adopted by JaBUTi.

TABLE I
RESULTS OF THE EXPLORATORY EVALUATION.
System         Class                            Method/Advice                       SG4AJ u   SG4AJ C   JBT C
ibatis         DynamicTagHandler                doStartFragment
               ScriptRunner                     runScript
               BeanProbe                        getReadablePropertyNames
HealthWatcher  HWServerDistribution             around execution(...)
               HWTimestamp                      updateTimestamp
               SearchComplaintData              executeCommand
MobileMedia    MediaUtil                        readMediaAsByteArray
               UnavailablePhotoAlbumException   getCause
               PersisteFavoritesAspect          around getBytesFromImageInfo(...)
Avg
Legend: SG4AJ = SysGraph4AJ; JBT = JaBUTi; u = analysis of a single unit; C = analysis of whole classes.

Table I shows the results of our study. Note that the analysis of methods and pieces of advice in SysGraph4AJ (column SG4AJ u) is very fast (it takes much less than a second, on average, to analyze and instrument each unit and generate the corresponding model). In JaBUTi, on the other hand, the performance is around 10 times worse. This is somewhat expected, since JaBUTi analyzes all methods within the class. To check whether the difference was statistically significant, we ran a Wilcoxon/Mann-Whitney paired test, since the observations did not seem to follow a normal distribution (according to a Shapiro-Wilk normality test). The statistical test revealed a significant difference at the 95% confidence level (p-value = ; with Bonferroni correction since we are performing two tests on the same data). Now it is clear that this evaluation is in fact measuring how analyzing methods interactively is faster than analyzing them upfront. To compare the performance of SysGraph4AJ with JaBUTi only with respect to their instrumentation and analysis methods, regardless of the adopted strategy, we also measured how long SysGraph4AJ took to analyze all methods within the target classes. Column SG4AJ C of Table I shows these observations. The difference was again statistically significant (p-value = , with Bonferroni correction). Both statistical tests support the alternative hypothesis that SysGraph4AJ is faster than JaBUTi in the analyses of methods.
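The paper does not say which statistics package was used to run these tests; the following is a minimal sketch, assuming Apache Commons Math 3, of how a paired Wilcoxon test with a Bonferroni correction for two tests could be reproduced in Java. The timing arrays are illustrative placeholders only, not the measurements reported in Table I.

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

// Sketch only: reproduces the kind of statistical comparison described above,
// not the authors' actual analysis script.
public class PerformanceComparisonSketch {
    public static void main(String[] args) {
        // Hypothetical per-unit analysis times in ms (placeholders, not Table I data).
        double[] sysGraph4AjTimes = {10, 12, 9, 11, 8, 10, 9, 13, 10};
        double[] jabutiTimes = {110, 95, 120, 100, 90, 105, 98, 115, 102};

        WilcoxonSignedRankTest wilcoxon = new WilcoxonSignedRankTest();
        // Paired, non-parametric test; an exact p-value is feasible for 9 pairs.
        double pValue = wilcoxon.wilcoxonSignedRankTest(sysGraph4AjTimes, jabutiTimes, true);
        // Bonferroni correction: two tests are run on the same data, so multiply by 2.
        double corrected = Math.min(1.0, 2 * pValue);

        System.out.printf("raw p = %.5f, Bonferroni-corrected p = %.5f%n", pValue, corrected);
        System.out.println(corrected < 0.05 ? "significant at the 95% level"
                                            : "not significant at the 95% level");
    }
}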
Although SysGraph4AJ appears to perform better than JaBUTi while analyzing and instrumenting modules and units, the figures shown even for JaBUTi are still small (i.e., waiting half a second for an analysis to be done might not affect the user's experience). However, we must note that these figures are for the analysis of single modules, and therefore account for only a part of the startup operations performed by JaBUTi. To have an idea of how long it would take for JaBUTi to analyze a whole system, we can estimate the total time to analyze ibatis, the largest system in our sample. Note that even if the tester were interested in testing a single method from each class, the whole system would have to be analyzed, because of the strategy adopted by the tool. ibatis contains around 220 modules (classes + aspects). Therefore, we could multiply the number of modules by the average time JaBUTi took to analyze the nine target modules ( ms). This would add up to 99, ms, i.e., more than 1.5 minutes. Having to wait more than a minute and a half before starting to use the tool might annoy users, considering, for instance, the recent performance critiques the Eclipse IDE has received [9]. Also note that ibatis is a medium-sized system; for larger systems the startup time could be even greater. It is important to note, however, that JaBUTi implements many other features that are not addressed by SysGraph4AJ (e.g., dataflow-based testing and metrics gathering). This might also explain why it performs worse than SysGraph4AJ: the design goals of the developers were broader than ours. Moreover, we believe that the use of consolidated libraries in the implementation of core functions of SysGraph4AJ, such as BCEL for analysis, JaCoCo for instrumentation and coverage analysis, and JUNG for visualization, helped improve its performance. JaBUTi's implementation also relies on some

libraries (such as BCEL), but many other parts of the tool were implemented by the original developers themselves (such as class instrumentation), which might explain in part its weaker performance (i.e., it is hard to be an expert in the implementation of a diversity of features, and preserve their good performance, especially when the developers are academics more interested in proof-of-concept prototypes). With respect to the coverage analysis performance, we have not yet been able to measure it for the target systems, because they require a configured environment that was unavailable at the time. However, since the observed execution time overhead for applications instrumented with JaCoCo is typically less than 10% [25], we believe the coverage analysis performance will also be adequate. In any case, to have an initial idea of performance for the coverage analysis part of SysGraph4AJ, we recorded the time taken to execute 12 test cases against the Music Online example application shown in Section II. It took only 71 ms to run the tests and collect coverage analysis information for the BillingPolicy aspect.

VI. RELATED WORK

Research related to this work addresses: (i) challenges and approaches for testing AO programs, with focus on structural testing; and (ii) tools to support structural testing of Java and AspectJ programs. Both categories are described next.

A. Challenges and Approaches for AO Testing

Alexander et al. [26] were the first to discuss the challenges of testing AO programs. They described potential sources of faults and fault types which are directly related to the control and data flow of such programs (e.g., incorrect aspect precedence, incorrect focus on control flow, and incorrect changes in control dependencies). Ever since, several refinements and amendments to Alexander et al.'s taxonomy have been proposed [24, 27, 28, 29]. To deal with challenges such as the ones described by Alexander et al., researchers have been investigating varied testing approaches. With respect to structural testing, Lemos et al. [30] devised a graph-based representation, named AODU, which includes crosscutting nodes, that is, nodes that represent control flow and data flow information about the advised join points in the base code. Evolutions of Lemos et al.'s work comprise the integration of unit graphs to support pairwise [4], pointcut-based [2] and multi-level integration testing of Java and AspectJ programs [6]. These approaches are supported by the JaBUTi family of tools, which is described in the next section. The main difference between the AODU graph and the MLSG introduced in this paper is that the former is constructed only for a single unit and is displayed out of its context. The MLSG shows the CFGs of units within the system's context. On the other hand, the AODU contains data-flow information, which is not yet present in our approach. Other structural testing approaches for AO programs at the integration level can also be found in the literature. For example, Zhao [31] proposes integrated graphs that include particular groups of communicating class methods and pieces of advice. Another example is the approach of Wedyan and Ghosh [32], who represent a whole AO system through a data flow graph, named ICFG, upon which test requirements are derived. Our approach differs from Zhao's and Wedyan and Ghosh's approaches again with respect to the broader system context in which the MLSG is built.
Moreover, to the best of our knowledge, none of the described related approaches were implemented.

B. Tools

JaBUTi [5] was developed to support unit-level, control flow- and data flow-based testing of Java programs. The tool is capable of deriving and tracking the coverage of test requirements for single units (i.e., class methods) based on bytecode-level instrumentation. In subsequent versions, JaBUTi was extended to support the testing of AO programs written in AspectJ at the unit [30] and integration levels [2, 4, 6]. The main difference between the JaBUTi tools and SysGraph4AJ (described in this paper) lies in the flexibility the latter offers to the user. SysGraph4AJ enables the user to expand the program graph up to a full view of the system in terms of packages, modules, units and internal unit CFGs. That is, SysGraph4AJ provides a comprehensive representation of the system, and derives test requirements from this overall representation. The members of the JaBUTi family, on the other hand, provide more restricted views of the system structure. For example, the latest JaBUTi/AJ version automates multi-level integration testing [6]. In this case, the tester selects a unit from a preselected set of modules, then an integrated CFG is built up to a prespecified depth level. Once the CFG is built, the test requirements are derived and the tester cannot modify the set of modules and units under testing. DCT-AJ [32] is another recent tool that automates data flow-based criteria for Java and AspectJ programs. Differently from JaBUTi/AJ and SysGraph4AJ, DCT-AJ builds an integrated CFG to represent all interacting system modules, which however is only used as the underlying model to derive test requirements. That is, the CFG created by DCT-AJ cannot be visualized by the user. Other open source and professional coverage analysis tools such as Cobertura [7], EMMA [8] and Clover [33] do not support the visualization of CFGs. They automate control flow-based criteria like statement and branch coverage, and create highlighted code coverage views the user can browse through. Finally, widely used IDEs such as Eclipse and NetBeans offer facilities related to method call hierarchy browsing. This enables the programmer to visualize method call chains in a tree-based representation that can be expanded or collapsed through mouse clicks. However, these IDEs neither provide native coverage analysis nor a program graph representation as rich in detail as the MLSG model.

72 VII. CONCLUSIONS In this paper we have presented an approach for visualization, analysis, and structural testing of Java and AspectJ programs. We have defined a model called Multi-Level System Graph (MLSG) that represents the structure of a system, and can be constructed interactively. We have also implemented the proposed approach in a tool, and provided initial evidence of its good performance. Currently, the tool supports visualization of the system s structure and structural testing at unit level. However, we intend to make SysGraph4AJ a basic framework for implementing other structural testing approaches, such as integration testing. Since, in general, most professional developers do not have time to invest in understanding whole systems with the type of approach presented in this paper, we believe the MLSG can be especially useful for testers at this moment. However, we also believe that if the MLSG could be seamlessly integrated with development environments, the approach would also be interesting for other types of users. For instance, by providing direct links from the MLSG nodes to the source code of the related structures, users could navigate through the system and also easily edit its code. In the future we intend to extend our tool to provide such type of functionality. ACKNOWLEDGEMENTS The authors would like to thank FAPESP (Otavio Lemos, grant 2010/ ), for financial support. REFERENCES [1] E. S. F. Najumudheen, R. Mall, and D. Samanta, A dependence representation for coverage testing of objectoriented programs. Journal of Object Technology, vol. 9, no. 4, pp. 1 23, [2] O. A. L. Lemos and P. C. Masiero, A pointcutbased coverage analysis approach for aspect-oriented programs, Inf. Sci., vol. 181, no. 13, pp , Jul [3] D. Grove, G. DeFouw, J. Dean, and C. Chambers, Call graph construction in object-oriented languages, in Proc. of the 12th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, ser. OOPSLA 97. New York, NY, USA: ACM, 1997, pp [4] O. A. L. Lemos, I. G. Franchin, and P. C. Masiero, Integration testing of object-oriented and aspect-oriented programs: A structural pairwise approach for java, Sci. Comput. Program., vol. 74, no. 10, pp , Aug [5] A. M. R. Vincenzi, J. C. Maldonado, W. E. Wong, and M. E. Delamaro, Coverage testing of java programs and components, Science of Computer Programming, vol. 56, no. 1-2, pp , Apr [6] B. B. P. Cafeo and P. C. Masiero, Contextual integration testing of object-oriented and aspect-oriented programs: A structural approach for java and aspectj, in Proc. of the th Brazilian Symposium on Software Engineering, ser. SBES 11. Washington, DC, USA: IEEE Computer Society, 2011, pp [7] M. Doliner, Cobertura tool website, Online, 2006, http: //cobertura.sourceforge.net/index.html - last accessed on 06/02/2013. [8] V. Roubtsov, EMMA: A free Java code coverage tool, Online, 2005, - last accessed on 06/02/2013. [9] The H Open, Weak performance of eclipse 4.2 criticised, Online, 2013, Weak-performance-of-Eclipse-4-2-criticised html - last accessed on 19/04/2013. [10] C. Andersson and P. Runeson, A replicated quantitative analysis of fault distributions in complex software systems, IEEE Trans. Softw. Eng., vol. 33, no. 5, pp , May [11] F. Ferrari, R. Burrows, O. Lemos, A. Garcia, E. Figueiredo, N. Cacho, F. Lopes, N. Temudo, L. Silva, S. Soares, A. Rashid, P. Masiero, T. Batista, and J. Maldonado, An exploratory study of fault-proneness in evolving aspect-oriented programs, in Proc. 
of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE 10. New York, NY, USA: ACM, 2010, pp [12] G. J. Myers, C. Sandler, T. Badgett, and T. M. Thomas, The Art of Software Testing, 2nd ed. John Wiley & Sons, [13] S. Rapps and E. J. Weyuker, Selecting software test data using data flow information, IEEE Trans. Softw. Eng., vol. 11, no. 4, pp , [14] IEEE, IEEE standard for software unit testing, Institute of Electric and Electronic Engineers, Standard , [15] T. Elrad, R. E. Filman, and A. Bader, Aspect-oriented programming: Introduction, Communications of the ACM, vol. 44, no. 10, pp , [16] G. Kiczales, J. Irwin, J. Lamping, J.-M. Loingtier, C. Lopes, C. Maeda, and A. Menhdhekar, Aspectoriented programming, in Proc. of the European Conference on Object-Oriented Programming, M. Akşit and S. Matsuoka, Eds., vol Berlin, Heidelberg, and New York: Springer-Verlag, 1997, pp [17] R. Bodkin and R. Laddad, Enterprise AspectJ tutorial using eclipse, Online, 2005, eclipsecon Available from: EclipseCon2005 EnterpriseAspectJTutorial9.pdf (accessed 12/3/2007). [18] The Eclipse Foundation, AJDT: Aspectj development tools, Online, 2013, - last accessed on 19/04/2013. [19] A. Rountev, S. Kagan, and M. Gibas, Static and dynamic analysis of call chains in java, in Proc. of the 2004 ACM SIGSOFT international symposium on Software testing and analysis, ser. ISSTA 04. New York, NY, USA:

73 ACM, 2004, pp [20] E. Hilsdale and J. Hugunin, Advice weaving in aspectj, in Proceedings of the 3rd international conference on Aspect-oriented software development, ser. AOSD 04. New York, NY, USA: ACM, 2004, pp [21] P. Greenwood, T. Bartolomei, E. Figueiredo, M. Dosea, A. Garcia, N. Cacho, C. Sant Anna, S. Soares, P. Borba, U. Kulesza, and A. Rashid, On the impact of aspectual decompositions on design stability: an empirical study, in Proc. of the 21st European conference on Object- Oriented Programming, ser. ECOOP 07. Berlin, Heidelberg: Springer-Verlag, 2007, pp [22] E. Figueiredo, N. Cacho, C. Sant Anna, M. Monteiro, U. Kulesza, A. Garcia, S. Soares, F. Ferrari, S. Khan, F. Castor Filho, and F. Dantas, Evolving software product lines with aspects: an empirical study on design stability, in Proc. of the 30th international conference on Software engineering, ser. ICSE 08. New York, NY, USA: ACM, 2008, pp [23] F. C. Filho, N. Cacho, E. Figueiredo, R. Maranhão, A. Garcia, and C. M. F. Rubira, Exceptions and aspects: the devil is in the details, in Proc. of the 14th ACM SIGSOFT international symposium on Foundations of software engineering, ser. SIGSOFT 06/FSE-14. New York, NY, USA: ACM, 2006, pp [24] F. C. Ferrari, J. C. Maldonado, and A. Rashid, Mutation testing for aspect-oriented programs, in Proc. of the 2008 International Conference on Software Testing, Verification, and Validation, ser. ICST 08. Washington, DC, USA: IEEE Computer Society, 2008, pp [25] Mountainminds GmbH & Co. KG and Contributors, Control flow analysis for java methods, Online, 2013, available from: trunk/doc/flow.html (accessed 02/04/2013). [26] R. T. Alexander, J. M. Bieman, and A. A. Andrews, Towards the systematic testing of aspect-oriented programs, Dept. of Computer Science, Colorado State University, Tech. Report CS , [27] M. Ceccato, P. Tonella, and F. Ricca, Is AOP code easier or harder to test than OOP code? in Proceedings of the 1 st Workshop on Testing Aspect Oriented Programs (WTAOP) - held in conjunction with AOSD, Chicago/IL - USA, [28] A. van Deursen, M. Marin, and L. Moonen, A systematic aspect-oriented refactoring and testing strategy, and its application to JHotDraw, Stichting Centrum voor Wiskundeen Informatica, Tech.Report SEN-R0507, [29] O. A. L. Lemos, F. C. Ferrari, P. C. Masiero, and C. V. Lopes, Testing aspect-oriented programming pointcut descriptors, in Proceedings of the 2 nd Workshop on Testing Aspect Oriented Programs (WTAOP). Portland/- Maine - USA: ACM Press, 2006, pp [30] O. A. L. Lemos, A. M. R. Vincenzi, J. C. Maldonado, and P. C. Masiero, Control and data flow structural testing criteria for aspect-oriented programs, The Journal of Systems and Software, vol. 80, no. 6, pp , [31] J. Zhao, Data-flow-based unit testing of aspect-oriented programs, in Proceedings of the 27 th Annual IEEE International Computer Software and Applications Conference (COMPSAC). IEEE Computer Society, 2003, pp [32] F. Wedyan and S. Ghosh, A dataflow testing approach for aspect-oriented programs, in Proceedings of the 12 th IEEE International High Assurance Systems Engineering Symposium (HASE). San Jose/CA - USA: IEEE Computer Society, 2010, pp [33] Atlassian, Inc., Clover: Java and Groovy code coverage, Online, overview - last accessed on 06/02/2013.

A Method for Model Checking Context-Aware Exception Handling

Lincoln S. Rocha, Grupo de Pesquisa GREat, UFC, Quixadá-CE, Brasil
Rossana M. C. Andrade, Grupo de Pesquisa GREat, UFC, Fortaleza-CE, Brasil
Alessandro F. Garcia, Grupo de Pesquisa OPUS, PUC-Rio, Rio de Janeiro-RJ, Brasil

Resumo O tratamento de exceção sensível ao contexto (TESC) é uma técnica de recuperação de erros empregada na melhoria da robustez de sistemas ubíquos. No projeto do TESC, os projetistas especificam condições de contexto que são utilizadas para caracterizar situações de anormalidade e estabelecem critérios para a seleção das medidas de tratamento. A especificação errônea dessas condições representa faltas de projeto críticas. Elas podem fazer com que o mecanismo de TESC, em tempo de execução, não seja capaz de identificar as situações de anormalidade desejadas ou reagir de forma adequada quando estas são detectadas. Desse modo, para que a confiabilidade do TESC não seja comprometida, é necessário que estas faltas de projeto sejam rigorosamente identificadas e removidas em estágios iniciais do desenvolvimento. Contudo, embora existam abordagens formais para verificação de sistemas ubíquos sensíveis ao contexto, estas não proveem suporte apropriado para a verificação do TESC. Nesse cenário, este trabalho propõe um método para verificação de modelos do TESC. As abstrações propostas pelo método permitem aos projetistas modelarem aspectos comportamentais do TESC e, a partir de um conjunto de propriedades pré-estabelecidas, identificar a existência de faltas de projeto. Com o objetivo de avaliar a viabilidade do método: (i) uma ferramenta de suporte à verificação foi desenvolvida; e (ii) cenários recorrentes de faltas em TESC foram injetados em modelos de um sistema de forma a serem analisados com a abordagem de verificação proposta.

Index Terms Sistemas Ubíquos, Tratamento de Exceção, Verificação de Modelos

Abstract Context-aware exception handling (CAEH) is an error recovery technique employed to improve the robustness of ubiquitous software. In the design of CAEH, context conditions are specified to characterize abnormal situations and used to select the proper handlers. The erroneous specification of such conditions represents a critical design fault that can lead the CAEH mechanism to behave erroneously or improperly at runtime (e.g., abnormal situations may not be recognized and the system's reaction may deviate from what is expected). Thus, in order to improve the CAEH reliability, this kind of design fault must be rigorously identified and eliminated from the design in the early stages of development. However, despite the existence of formal approaches to verify context-aware ubiquitous systems, such approaches lack specific support to verify the CAEH behavior. This work proposes a method for model checking CAEH. This method provides a set of modeling abstractions and 3 (three) formally defined properties that can be used to identify existing design faults in the CAEH design. In order to assess the method feasibility: (i) a support tool was developed; and (ii) fault scenarios that are recurring in CAEH were injected into a correct model and verified using the proposed approach.

Index Terms Ubiquitous Systems, Exception Handling, Model Checking

I. INTRODUÇÃO

Os sistemas ubíquos sensíveis ao contexto são sistemas capazes de observar o contexto em que estão inseridos e reagir de forma apropriada, adaptando sua estrutura e comportamento ou executando tarefas de forma automática [1].
Nesses sistemas, o contexto representa um conjunto de informações sobre o ambiente (físico ou lógico, incluindo os usuários e o próprio sistema), que pode ser usado com o propósito de melhorar a interação entre o usuário e o sistema ou manter a sua execução de forma correta, estável e otimizada [2]. Devido ao seu amplo domínio de aplicação (e.g., casas inteligentes, guias móveis de visitação, jogos e saúde) e por tomarem decisões de forma autônoma no lugar das pessoas, os sistemas ubíquos sensíveis ao contexto precisam ser confiáveis para cumprir com a sua função. Tal confiabilidade requer que esses sejam robustos (i.e., capazes de lidar com situações anormais) [3]. O tratamento de exceção sensível ao contexto (TESC) é uma abordagem utilizada para recuperação de erros que vem sendo explorada como uma alternativa para melhorar os níveis de robustez desse tipo de sistema [4][3][5][6][7][8]. No TESC, o contexto é usado para caracterizar situações anormais nos sistemas (e.g., uma falha de software/hardware ou a quebra de algum invariante do sistema), denominadas de exceções contextuais, e estruturar as atividades de tratamento, estabelecendo critérios para a seleção e execução de tratadores. De um modo geral, a ocorrência de uma exceção contextual requer que o fluxo normal do sistema seja desviado para que o tratamento apropriado seja conduzido. Entretanto, dependendo da situação e do mecanismo de TESC adotado, o fluxo de controle pode ser retomado, ou não, após o término do tratamento. No projeto de sistemas ubíquos sensíveis ao contexto, o uso de formalismos e abstrações apropriados faz-se necessário para lidar com questões complexas inerentes a esses sistemas (e.g., abertura, incerteza, adaptabilidade, volatilidade e gerenciamento de contexto) em tempo de projeto [9][10][11][12][13][14]. Em particular, o projetista de TESC é responsável por especificar as condições de contexto utilizadas [5][3]: (i) na definição das exceções contextuais; e (ii) na seleção das medidas de tratamento. No caso (i), as condições de contexto especificadas são usadas pelo mecanismo de TESC para detectar a ocorrência de exceções contextuais em tempo de execução. Por outro lado, no caso

75 (ii), as condições de contexto funcionam como pré-condições que são estabelecidas para cada possível tratador de uma exceção contextual. Assim, quando uma exceção é detectada, o mecanismo seleciona, dentre os possíveis tratadores daquela exceção, aqueles cujas pré-condições são satisfeitas no contexto corrente do sistema. Entretanto, a falibilidade humana e o conhecimento parcial sobre a forma como o contexto do sistema evolui, podem levar os projetistas a cometerem erros de especificação, denominadas de faltas de projeto (design faults). Por exemplo, devido a negligência ou lapsos de atenção, contradições podem ser facilmente inseridas pelos projetistas na especificação das condições de contexto ou, mesmo não existindo contradições, essas condições podem representar situações de contexto que nunca ocorrem em tempo de execução, devido a forma como o contexto evolui. Nessa perspectiva, a inserção de faltas de projeto, e a sua eventual propagação até a fase de codificação, podem fazer com que o mecanismo de TESC seja configurado de maneira imprópria, comprometendo a sua confiabilidade em detectar exceções contextuais ou selecionar os tratadores apropriados. Estudos recentes relatam que a validação do código de tratamento de exceção de sistemas que utilizam recursos externos não confiáveis, a exemplo dos sistemas ubíquos sensíveis ao contexto, é uma atividade difícil e extremamente desafiadora [15]. Isso decorre do fato de que para testar todo o espaço de exceções levantadas no sistema, é necessário gerar, de forma sistemática, todo o contexto que desencadeia essas exceções. Essa atividade, além de complexa, pode se tornar proibitiva em alguns casos devido aos altos custos associados. Desse modo, uma análise rigorosa do projeto do TESC, buscando identificar e remover faltas de projeto, podem contribuir para a melhoria dos níveis de confiabilidade do TESC e para a diminuição dos custos associados à identificação e correção de defeitos decorrentes da inserção de faltas de projeto. Contudo, embora existam abordagens formais baseadas em modelos voltadas para a análise do comportamento de sistemas ubíquos sensíveis ao contexto [16][11][17], essas estão voltadas somente para o comportamento adaptativo. Elas não provêem abstrações e suporte apropriado para modelagem e análise do comportamento do TESC, tornando essa atividade ainda mais complexa e sujeita a introdução de faltas. Nesse cenário, este trabalho propõe um método baseado em verificação de modelos para apoiar a identificação automática de faltas de projeto no TESC (Seção IV). O método proposto provê um conjunto de abstrações que permitem aos projetistas modelarem aspectos comportamentais do TESC e mapeá-los para um modelo formal de comportamento (estrutura de Kripke) compatível com a técnica de verificação de modelos [18]. O formalismo adotado é baseado em estados, transições e lógica temporal devido as necessidades peculiares de projeto e verificação de modelos de TESC (Seção III). Um conjunto de propriedades comportamentais é estabelecido e formalmente definido com lógica temporal no intuito de auxiliar os projetistas na identificação de determinados tipos de faltas de projeto (Seção II). Além disso, com o objetivo de avaliar a viabilidade do método (Seção V): (i) uma ferramenta de suporte à verificação foi desenvolvida e (ii) cenários recorrentes de faltas em TESC foram injetados em modelos de um sistema de forma a serem analisados com a abordagem de verificação proposta. 
Ao final, a Seção VI descreve os trabalhos relacionados e a Seção VII conclui o artigo. II. TRATAMENTO DE EXCEÇÃO SENSÍVEL AO CONTEXTO No tratamento de exceção sensível ao contexto (TESC), o contexto e a sensibilidade ao contexto são utilizados pelo mecanismo de tratamento de exceção para definir, detectar e tratar condições anormais em sistemas ubíquos, chamadas de exceções contextuais. Na Seção II-A desta seção são descritos os principais tipos de exceções contextuais. Além disso, uma discussão sobre onde e como faltas 1 de projeto podem ser cometidas no projeto do TESC é oferecida na Seção II-B. A. Tipos de Exceções Contextuais As exceções contextuais representam situações anormais que requerem que um desvio no fluxo de execução seja feito para que o tratamento da excepcionalidade seja conduzido. A detecção da ocorrência de uma exceção contextual pode indicar uma eventual falha em algum dos elementos (hardware ou software) que compõem o sistema ou que alguma invariante de contexto, necessária a execução de alguma atividade do sistema, tenha sido violada. Neste trabalho, as exceções contextuais foram agrupadas em 3 (três) categorias: infraestrutura, invalidação de contexto e segurança. Elas são descritas nas próximas subseções. 1) Exceções Contextuais de Infraestrutura: Esse tipo de exceção contextual está relacionada com a detecção de situações de contexto que indicam que alguma falha de hardware ou software ocorreu em algum dos elementos que constituem o sistema ubíquo. Um exemplo desse tipo de exceção contextual é descrito em [4] no escopo de um sistema de aquecimento inteligente. A função principal daquele sistema é ajustar a temperatura do ambiente às preferências dos usuários. Naquele sistema, uma situação de excepcionalidade é caracterizada quando a temperatura do ambiente atinge um valor acima do limite estabelecido pelas preferencias do usuário. Esse tipo de exceção contextual ajuda a identificar, de forma indireta, a ocorrência de falhas cuja origem pode ser o sistema que controla o equipamento de aquecimento (falha de software) ou o próprio equipamento (falha de hardware). O mau funcionamento do sistema de aquecimento é considerado uma situação anormal, pois pode colocar em risco a saúde dos usuários. Observe que para detectar essa exceção contextual é necessário ter acesso à informações de contexto sobre a temperatura do ambiente e as preferências dos usuários. 2) Exceções Contextuais de Invalidação de Contexto: Esse tipo de exceção contextual está relacionada com a violação de determinadas condições de contexto durante a execução de alguma tarefa do sistema. Essas condições de contexto funcionam como invariantes da tarefa e, quando violadas, 1 Este trabalho adota a nomenclatura de [19], na qual uma falta (fault) é a causa física ou algorítmica de um erro (error), que, se propagado até a interface de serviço do componente ou sistema, caracteriza uma falha (failure).

76 caracterizam uma situação de anormalidade. Por exemplo, os autores de [3] descrevem esse tipo de exceção em uma aplicação de leitor de música sensível ao contexto. O leitor de música executa no dispositivo móvel do usuário, enviando um fluxo continuo de som para a saída de áudio do dispositivo. Entretanto, quando o usuário entra em uma sala vazia, o aplicativo busca por algum dispositivo de áudio disponível no ambiente e transfere o fluxo de som para aquele dispositivo. Nessa aplicação, é estabelecido como contexto invariante a necessidade do usuário estar sozinho dentro da sala. Para os autores de [3], a violação desse invariante é considerado uma situação excepcional, pois o seu não cumprimento pode trazer desconforto ou aborrecimento para as demais pessoas presentes na sala. Note que a detecção dessa exceção depende de informações de contexto sobre a localização do usuário e o número de pessoas que estão na mesma sala que ele. 3) Exceções Contextuais de Segurança: Esse tipo de exceção está relacionada com situações de contexto que ajudam a identificar a violação de políticas de segurança (e.g., autenticação, autorização e privacidade) e demais situações que podem colocar em risco a integridade física ou financeira dos usuários do sistema. Por exemplo, o sistema de registro médico sensível ao contexto apresentado em [3] descreve esse tipo de exceção. Nesse sistema existem três usuários envolvidos: os pacientes, os enfermeiros e os médicos. Os médicos podem fazer registros sobre seus pacientes e os enfermeiros podem ler e atualizar esses registros ao tempo em que assistem aos pacientes. Entretanto, os enfermeiros só podem ter acesso aos registros se estiverem dentro da enfermaria em que o paciente se encontra e se o médico responsável estiver presente. Naquele sistema, quando um enfermeiro tenta acessar os registros do paciente, porém não se encontra na mesma enfermaria que este paciente ou encontra-se na enfermaria, mas o médico responsável não está presente, caracteriza-se uma situação excepcional. Perceba que a detecção desse tipo de exceção depende das informações de contexto sobre a localização e o perfil do paciente, do enfermeiro e do médico. B. Propensão a Faltas de Projeto Com base em trabalhos existentes na literatura [4][3][5][6][7][8], é possível dividir o projeto do TESC em duas grandes atividades: (i) especificação do contexto excepcional; e (ii) especificação do tratamento sensível ao contexto. Na atividade (i), os projetistas especificam as condições de contexto que caracterizam as situações de anormalidade identificadas no sistema. Dessa forma, em tempo de execução, quando uma dessas situações são detectadas pelo mecanismo de TESC, diz-se que uma ocorrência da exceção contextual associada foi identificada. Por outro lado, na atividade (ii), os projetistas especificam as ações de tratamento a serem executadas para mitigar a situação de excepcionalidade detectada. Entretanto, dependendo do contexto corrente do sistema quando a exceção é detectada, um conjunto de ações de tratamento podem ser mais apropriado do que outro para tratar aquela determinada ocorrência excepcional. Desse modo, faz parte do trabalho do projetista na atividade (ii), agrupar as ações de tratamento e estabelecer condições de contexto que ajudem o mecanismo de TESC, em tempo de execução, a selecionar o conjunto de ações de tratamento apropriado para lidar com uma ocorrência excepcional em função do contexto corrente. 
A falibilidade dos projetistas, o conhecimento parcial sobre a forma como o contexto do sistema evolui em tempo de execução, a inexistência de uma notação apropriada e a falta de suporte ferramental, tornam o projeto do TESC uma atividade extremamente propensa a faltas de projeto. Por exemplo, devido a negligência ou lapsos de atenção, contradições podem ser facilmente inseridas pelos projetistas durante a especificação das condições de contexto construídas nas atividades (i) e (ii) do projeto do TESC. Além disso, mesmo que os projetistas criem especificações livres de contradições, essas podem representar situações de contexto que nunca ocorrerão em tempo de execução devido a forma como o sistema e o seu contexto evoluem. Faltas de projeto como estas podem fazer com que o mecanismo de TESC seja mal configurado, comprometendo a sua confiabilidade em detectar as situações de anormalidade desejadas e selecionar as ações de tratamento adequadas para lidar com ocorrências excepcionais específicas. Adicionalmente, existe outro tipo de falta de projeto pode ser facilmente cometida por projetistas. Por exemplo, considere o projeto do TESC para uma exceção contextual em que as especificações das condições de contexto construídas nas atividades (i) e (ii) estejam livres de faltas de projeto como as descritas anteriormente. Perceba que, mesmo nesse caso, pode ocorrer do projetista especificar a condição de contexto que caracteriza a situação de anormalidade e as condições de seleção das ações de tratamento de tal forma que estas nunca sejam satisfeitas, simultaneamente, em tempo de execução. Isso pode acontecer nos casos em que essas condições de contexto sejam contraditórias entre si ou que não seja possível o sistema atingir um estado em que seu contexto satisfaça a ambas ao mesmo tempo. Desse modo, face a propensão à falta de projetos, uma abordagem rigorosa deve ser empregada pelos projetistas para que faltas de projeto sejam identificadas e removidas, antes que sejam propagadas até a fase de codificação. III. VERIFICAÇÃO DE MODELOS A verificação de modelos é um método formal empregado na verificação automática de sistemas reativos concorrentes com número finito de estados [18]. Nessa abordagem, o comportamento do sistema é modelado através de algum formalismo baseado em estados e transições e as propriedades a serem verificadas são especificadas usando lógicas temporais. A verificação das propriedades comportamentais é dada através de uma exaustiva enumeração (implícita ou explícita) de todos os estados alcançáveis do modelo do sistema. A estrutura de Kripke (Definição 1) é um formalismo para modelagem de comportamento, onde os estados são rotulados ao invés das transições. Na estrutura de Kripke, cada rótulo representa um instantâneo (estado) da execução do sistema. Essa característica foi preponderante para sua escolha neste

trabalho, uma vez que os aspectos comportamentais do projeto do TESC que se deseja analisar são influenciados pela observação do estado do contexto do sistema, e não pelas ações que o levaram a alcançar um estado de contexto em particular.

Definição 1 (Estruturas de Kripke). Uma estrutura de Kripke K = ⟨S, I, L, →⟩ sobre um conjunto finito de proposições atômicas AP é dada por um conjunto finito de estados S, um conjunto de estados iniciais I ⊆ S, uma função de rótulos L : S → 2^AP, a qual mapeia cada estado em um conjunto de proposições atômicas que são verdadeiras naquele estado, e uma relação de transição total → ⊆ S × S, isto é, que satisfaz a restrição ∀s ∈ S, ∃s′ ∈ S tal que (s, s′) ∈ →.

Usualmente, as propriedades que se deseja verificar são divididas em dois tipos: (i) de segurança (safety), que buscam expressar que nada ruim acontecerá durante a execução do sistema; e (ii) de progresso (liveness), que buscam expressar que, eventualmente, algo bom acontecerá durante a execução do sistema. Essas propriedades são expressas usando lógicas temporais que são interpretadas sobre uma estrutura de Kripke. Dessa forma, dada uma estrutura de Kripke K e uma fórmula temporal ϕ, uma formulação geral para o problema de verificação de modelos consiste em verificar se ϕ é satisfeita na estrutura K, formalmente K ⊨ ϕ. Nesse caso, K representa o modelo do sistema e ϕ a propriedade que se deseja verificar. Neste trabalho, devido à sua expressividade e ao verificador de modelos utilizado, a lógica temporal CTL (Computation Tree Logic) foi escolhida para a especificação de propriedades sobre o comportamento do TESC. CTL é uma lógica temporal de tempo ramificado que permite expressar propriedades sobre estados. As fórmulas de CTL são construídas sobre proposições atômicas utilizando operadores proposicionais (¬, ∧, ∨, → e ↔) e operadores temporais (EX, EF, EG, EU, AX, AF, AG e AU). Sendo φ e ϕ fórmulas CTL, a intuição para os operadores temporais é dada na Tabela I. Para obter mais detalhes sobre CTL e verificação de modelos, consulte [18].

Tabela I
INTUIÇÃO PARA OS OPERADORES TEMPORAIS DE CTL.
EXφ: existe um caminho tal que no próximo estado φ é verdadeira.
EFφ: existe um caminho tal que no futuro φ será verdadeira.
EGφ: existe um caminho tal que φ é sempre verdadeira.
EU(φ, ϕ): existe um caminho tal que φ é verdadeira até que ϕ passe a ser.
AXφ: para todo caminho, no próximo estado φ é verdadeira.
AFφ: para todo caminho, φ é verdadeira no futuro.
AGφ: para todo caminho, φ é sempre verdadeira.
AU(φ, ϕ): para todo caminho, φ é verdadeira até que ϕ passe a ser.

IV. O MÉTODO PROPOSTO

Nesta seção é apresentado o método proposto para a verificação de modelos do TESC. Uma visão geral do método é oferecida na Seção IV-A. A Seção IV-B aborda a forma como o espaço de estados a ser explorado é derivado. Além disso, a atividade de modelagem (Seção IV-C), a derivação da estrutura de Kripke (Seção IV-D) e a atividade de especificação (Seção IV-E) do método são detalhadas.

A. Visão Geral

O método proposto provê um conjunto de abstrações e convenções que permitem aos projetistas expressarem de forma rigorosa o comportamento excepcional sensível ao contexto e mapeá-lo para uma estrutura de Kripke particular, formalismo apresentado na Seção III que serve de base para a técnica de verificação de modelos. Além disso, o método oferece uma lista de propriedades comportamentais, a serem verificadas sobre o comportamento excepcional sensível ao contexto, com o intuito de auxiliar os projetistas na descoberta de determinados tipos de faltas de projeto no TESC.
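Apenas para tornar o formalismo mais concreto, segue um esboço mínimo, em Java, de como a estrutura de Kripke da Definição 1 poderia ser representada. Trata-se de uma ilustração com nomes hipotéticos, e não da representação usada pela ferramenta descrita adiante; o método existsFuture mostra como uma simples verificação de alcançabilidade corresponde ao operador EF para uma única proposição.

import java.util.*;

// Esboço de uma estrutura de Kripke K = <S, I, L, ->> sobre proposições atômicas AP (Definição 1).
public final class KripkeStructureSketch {

    private final Set<Integer> states = new HashSet<>();                    // S
    private final Set<Integer> initialStates = new HashSet<>();             // I (subconjunto de S)
    private final Map<Integer, Set<String>> labels = new HashMap<>();       // L : S -> 2^AP
    private final Map<Integer, Set<Integer>> transitions = new HashMap<>(); // relação de transição (subconjunto de S x S)

    public void addState(int s, Set<String> trueProps, boolean initial) {
        states.add(s);
        labels.put(s, new HashSet<>(trueProps));
        if (initial) initialStates.add(s);
    }

    public void addTransition(int from, int to) {
        transitions.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    public Set<Integer> successors(int s) {
        return transitions.getOrDefault(s, Collections.emptySet());
    }

    // Totalidade exigida pela Definição 1: todo estado possui ao menos um sucessor.
    public boolean isTotal() {
        return states.stream().noneMatch(s -> successors(s).isEmpty());
    }

    // EF p, restrito a uma única proposição p: algum estado rotulado com p é alcançável
    // a partir de um estado inicial? (A fórmula (3), mais adiante, usa uma fórmula de
    // contexto completa no lugar de p.)
    public boolean existsFuture(String p) {
        Deque<Integer> work = new ArrayDeque<>(initialStates);
        Set<Integer> seen = new HashSet<>(initialStates);
        while (!work.isEmpty()) {
            int s = work.pop();
            if (labels.get(s).contains(p)) return true;
            for (int t : successors(s)) if (seen.add(t)) work.push(t);
        }
        return false;
    }
}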
O método é decomposto em duas atividades: modelagem e especificação. Na atividade de modelagem (Seção IV-C), o comportamento do TESC é modelado utilizando um conjunto de construtores próprios que ajudam a definir as exceções contextuais e estruturar as ações de tratamento. Na atividade de especificação (Seção IV-E), um conjunto de propriedades que permitem identificar um conjunto bem definido de faltas de projeto no TESC, são apresentadas e formalizadas utilizando a lógica CTL. Entretanto, o fato do método conseguir representar o modelo de comportamento do TESC como uma estrutura de Kripke, permite que outros tipos de propriedades comportamentais possam ser definidas pelos projetistas utilizando CTL. B. Determinando o Espaço de Estados Nos estágios iniciais do projeto do TESC, um dos principais esforços está na identificação das informações de contexto que podem ser úteis para projetar o TESC. Nesses estágios, por não haver um conhecimento detalhado sobre o tipo, a origem e a estrutura dessas informações, é pertinente abstrair esses detalhes e buscar lidar com informações de contexto mais alto nível. Essas informações de contexto de alto nível, podem ser vistas como proposições sobre o contexto do sistema, que recebem, em tempo de execução, uma interpretação, verdadeira ou falsa, de acordo com a valoração assumida pelas variáveis de contexto de mais baixo nível observadas pelo sistema. No método proposto, essas proposições são denominadas de proposições contextuais e compõem o conjunto CP, que representa a base de conhecimento utilizada para caracterizar situações contextuais relevantes para o projeto do TESC. Nesse cenário, para que o espaço de estados a ser explorado seja obtido é preciso criar uma função de valoração que atribua valores para as proposições contextuais em CP. Entretanto, construir essa função não é uma atividade trivial, pois essas proposições contextuais representam informações de contexto de baixo nível que assumem uma valoração de forma não determinística, seguindo leis que extrapolam a fronteira e o controle do sistema (e.g., tempo, condições climáticas e mobilidade) e que podem estar relacionadas entre si de forma dependente ou conflitante. Desse modo, embora as proposições contextuais permitam abstrair os detalhes das variáveis de contexto de baixo nível, elas trazem consigo problemas de dependência semântica que dificultam a construção de uma

função de valoração. Para lidar com essa questão, o método proposto adota a técnica de programação por restrições [20] como função de valoração para as proposições contextuais. Essa técnica permite ao projetista estabelecer restrições semânticas (Definição 2) sobre CP, garantindo que todas as soluções geradas (o espaço de estados a ser explorado) satisfazem as restrições estabelecidas. Por convenção, a função csp(CP, C) será utilizada para designar o espaço de estados derivado a partir do conjunto C de restrições definido sobre CP.

Definição 2 (Restrição). Uma restrição é definida como uma fórmula lógica sobre o conjunto CP de proposições contextuais tal como descrito na gramática em (1), onde p ∈ CP é uma proposição contextual, φ e ϕ são fórmulas lógicas e ¬ (negação), ∧ (conjunção), ∨ (disjunção), ⊕ (disjunção exclusiva), → (implicação) e ↔ (dupla implicação) são operadores lógicos.

φ, ϕ ::= p | ¬φ | φ ∧ ϕ | φ ∨ ϕ | φ ⊕ ϕ | φ → ϕ | φ ↔ ϕ   (1)

C. Atividade de Modelagem

Durante a modelagem do comportamento do TESC, algumas questões de projeto relacionadas com a definição e a detecção de exceções contextuais e com o agrupamento, a seleção e a execução das medidas de tratamento precisam ser pensadas. No método proposto, a atividade de modelagem tem como objetivo tratar essas questões e possibilitar o mapeamento do modelo de comportamento do TESC para uma estrutura de Kripke. Para isso, são propostas as abstrações de exceções contextuais, casos de tratamento e escopos de tratamento.

1) Exceções Contextuais: No método proposto, uma exceção contextual é definida por um nome e uma fórmula lógica utilizada para caracterizar o seu contexto excepcional (Definição 3). Uma exceção contextual é detectada quando a fórmula ecs é satisfeita em algum dado estado de contexto do sistema. Nesse momento, diz-se que a exceção contextual foi levantada. Por convenção, dada uma exceção contextual e = ⟨name, ecs⟩, a função ecs(e) é definida para recuperar a especificação de contexto excepcional (ecs) da exceção e.

Definição 3 (Exceção Contextual). Dado um conjunto de proposições contextuais CP, uma exceção contextual é definida pela tupla ⟨name, ecs⟩, onde name é o nome da exceção contextual e ecs é uma fórmula lógica definida sobre CP que especifica o contexto excepcional de detecção.

2) Casos de Tratamento: Como discutido anteriormente, uma exceção contextual pode ser tratada de formas diferentes dependendo do contexto em que o sistema se encontra. Os casos de tratamento (Definição 4) definem as diferentes estratégias que podem ser empregadas para tratar uma exceção contextual em função do contexto corrente do sistema. Um caso de tratamento é composto por uma condição de seleção e um conjunto de fórmulas lógicas que são utilizadas para descrever a situação de contexto esperada após a execução de cada ação (ou bloco de ações) de tratamento de forma sequencial. Por convenção, os constituintes de um caso de tratamento serão referenciados de agora em diante como condição de seleção e conjunto de medidas de tratamento, respectivamente.

Definição 4 (Caso de Tratamento). Dado um conjunto de proposições contextuais CP, um caso de tratamento é definido como uma tupla hcase = ⟨α, H⟩, onde α é uma fórmula lógica definida sobre CP e H é um conjunto ordenado de fórmulas lógicas definidas sobre CP.

3) Escopos de Tratamento: Tipicamente, os tratadores de exceção encontram-se vinculados a áreas específicas do código do sistema onde exceções podem ocorrer. Essa estratégia ajuda a delimitar o escopo de atuação de um tratador durante a atividade de tratamento.
No método proposto, o conceito de escopos de tratamento (Definição 5) é criado para delimitar a atuação dos casos de tratamento e estabelecer uma relação de precedência entre estes. Essa relação de precedência é essencial para resolver situações de sobreposição entre condições de seleção de casos de tratamento (i.e., situações em que mais de um caso de tratamento pode ser selecionado num mesmo estado do contexto). Dessa forma, no método proposto, o caso de tratamento de maior precedência é avaliado primeiro; se este não tiver a sua condição de seleção satisfeita, o próximo caso de tratamento com maior precedência é avaliado, e assim por diante.

Definição 5 (Escopo de Tratamento). Dado um conjunto de proposições contextuais CP, um escopo de tratamento é definido pela tupla ⟨e, HCASE⟩, onde e é uma exceção contextual e HCASE é um conjunto ordenado de casos de tratamento.

A noção de conjunto ordenado, mencionada na Definição 5, está relacionada com a existência de uma relação de ordem entre os casos de tratamento. Essa relação permite estabelecer a ordem de precedência em que cada caso de tratamento será avaliado quando uma exceção contextual for levantada. No método proposto, a ordem de avaliação utilizada leva em consideração a posição ocupada por cada caso de tratamento dentro do conjunto HCASE. Portanto, para os casos de tratamento hcase_i e hcase_j, se i < j, então hcase_i tem precedência sobre hcase_j (i.e., hcase_i ≺ hcase_j). Essa relação de ordem é obrigatória, mas não é fixa, podendo ser alterada pelo projetista com o propósito de obter algum tipo de benefício.

D. Derivando a Estrutura de Kripke

Como apresentado na Seção III, uma estrutura de Kripke é uma tupla K = ⟨S, I, L, →⟩ definida sobre um conjunto finito de proposições atômicas AP. Desse modo, o processo de derivação de uma estrutura de Kripke consiste em estabelecer os elementos que a constituem, observando todas as restrições impostas pela sua definição, quais sejam: (i) o conjunto S de estados deve ser finito; e (ii) a relação de transição → deve ser total. Ao longo desta seção são descritos os procedimentos adotados pelo método para obter cada um dos constituintes da estrutura de Kripke que representa o TESC, chamada de EK. Inicialmente, de forma direta, o conjunto AP de proposições atômicas sobre o qual EK é definida é formado pelo

conjunto CP de proposições contextuais (i.e., AP = CP). Além disso, o conjunto S de estados de EK é obtido a partir dos conjuntos CP de proposições contextuais e G de restrições estabelecidas sobre CP por meio da função S = csp(CP, G). Já o conjunto I de estados iniciais é estabelecido como segue: I = {s | s ∈ S, ∃e ∈ E, val(s) ⊨ ecs(e)}, onde S é o conjunto de estados, E é o conjunto de todas as exceções contextuais modeladas no sistema e val(s) significa a valoração das proposições contextuais no estado s. Desse modo, os estados iniciais são os estados em que a valoração das proposições contextuais (val(s)) satisfaz (⊨) as especificações de contexto excepcional das exceções modeladas (ecs(e)), i.e., o conjunto I é composto pelos estados excepcionais do modelo. Já a função de rótulos L é composta pela valoração de todos os estados do sistema: L = {val(s) | s ∈ S}. Antes de apresentar a forma como a relação de transição de EK é derivada, duas funções auxiliares são introduzidas. O objetivo dessas funções é construir um conjunto de transições entre pares de estados. Em (2a), as transições são construídas a partir de um dado estado s e uma fórmula lógica φ, onde o estado de partida é o estado s e os estados de destino são todos aqueles cujo rótulo satisfaz (⊨) a fórmula φ. Já em (2b), as transições são construídas a partir de um par de fórmulas, φ e ϕ, onde os estados de partida são todos aqueles que satisfazem φ e os de destino são os que satisfazem ϕ.

ST(s, φ, S) = {(s, r) | r ∈ S, L(r) ⊨ φ}   (2a)
FT(φ, ϕ, S) = {(s, r) | s, r ∈ S, L(s) ⊨ φ, L(r) ⊨ ϕ}   (2b)

A relação de transição de EK representa a sequência de ações realizadas durante a atividade de tratamento para cada exceção contextual detectada e tratada pelo mecanismo de TESC. Essas transições entre estados iniciam em um estado excepcional e terminam em um estado caracterizado pela última medida de tratamento do caso de tratamento selecionado para tratar aquela exceção. O Algoritmo 1 descreve como as transições em EK são geradas, recebendo como entrada os conjuntos: Γ, de escopos de tratamento; I, de estados iniciais (excepcionais); e S, de todos os estados. Desse modo, para cada escopo de tratamento ⟨e, HCASE⟩ (linha 4) e para cada estado em I (linha 5), verifica-se se a exceção e do escopo de tratamento corrente pode ser levantada no estado s (linha 6). Em caso afirmativo, para cada caso de tratamento ⟨α, H⟩ (linha 7), verifica-se se este pode ser selecionado (linha 8). As transições entre o estado excepcional e os estados que satisfazem a primeira medida de tratamento do caso de tratamento são construídas por meio de uma chamada à função ST(s, H(0), S) (linha 9), sendo armazenadas em um conjunto auxiliar (AUX). Caso este conjunto auxiliar seja não vazio (linha 10), essas transições são guardadas no conjunto de retorno TR (linha 12). Adicionalmente, o mesmo é feito para cada par de medidas de tratamento por meio de chamadas à função FT(H(i−1), H(i), S) (linhas 13 e 15). Perceba que os laços mais interno e intermediário são interrompidos (linhas 17 e 20) quando não é possível realizar transições entre estados. Além disso, o comando break (linha 22) garante que apenas um caso de tratamento é selecionado para tratar a exceção e, levando em consideração a relação de ordem baseada nos índices.
Por fim, antes de retornar o conjunto final de relações de transição (linha 36), o fragmento de código compreendido entre as linhas 29 e 35 adiciona uma auto-transição (transição de loop) nos estados terminais (i.e., nos estados que não possuem sucessores) para garantir a restrição de totalidade imposta pela definição de estrutura de Kripke.

Algoritmo 1 Geração da Relação de Transição de EK.
1: function TRANSICAOEK(Γ, I, S)
2:     TR ← ∅
3:     AUX ← ∅
4:     for all ⟨e, HCASE⟩ ∈ Γ do
5:         for all s ∈ I do
6:             if L(s) ⊨ ecs(e) then
7:                 for all ⟨α, H⟩ ∈ HCASE do
8:                     if L(s) ⊨ α then
9:                         AUX ← ST(s, H(0), S)
10:                        if AUX ≠ ∅ then
11:                            TR ← TR ∪ AUX
12:                            for i = 1, |H| − 1 do
13:                                AUX ← FT(H(i−1), H(i), S)
14:                                if AUX ≠ ∅ then
15:                                    TR ← TR ∪ AUX
16:                                else
17:                                    break
18:                                end if
19:                            end for
20:                            break
21:                        else
22:                            break
23:                        end if
24:                    end if
25:                end for
26:            end if
27:        end for
28:    end for
29:    if TR ≠ ∅ then
30:        for all s ∈ S do
31:            if ∄t ∈ S, (s, t) ∈ TR then
32:                TR ← TR ∪ {(s, s)}
33:            end if
34:        end for
35:    end if
36:    return TR
37: end function

E. Atividade de Especificação

A atividade de especificação consiste na determinação de propriedades sobre o comportamento do TESC com o intuito de encontrar faltas de projeto. Neste trabalho foram catalogadas 3 propriedades comportamentais que, se violadas, indicam a existência de faltas de projeto no TESC; são elas: progresso de detecção, progresso de captura e progresso de tratador. Cada uma dessas propriedades é apresentada a seguir.

1) Progresso de Detecção: Essa propriedade determina que, para cada exceção contextual modelada, deve existir pelo menos um estado da estrutura de Kripke do contexto onde essa exceção é detectada. A violação dessa propriedade indica a existência de exceções contextuais que não são detectadas. Esse tipo de falta de projeto é denominado de exceção morta. Essa propriedade deve ser verificada para cada uma das exceções contextuais modeladas no sistema. Desse modo, seja e uma exceção contextual, a fórmula (3), escrita em CTL, especifica formalmente essa propriedade.

EF(ecs(e))   (3)

2) Progresso de Captura: Essa propriedade estabelece que, para cada exceção contextual levantada, deve existir pelo menos um caso de tratamento habilitado a capturar aquela exceção. A violação dessa propriedade indica que existem estados do contexto onde exceções contextuais são levantadas, mas não podem ser capturadas e, consequentemente, tratadas. Esse tipo de falta de projeto é denominado de tratamento nulo. É importante observar que, mesmo que existam situações de contexto onde o sistema não pode tratar aquela exceção, o projetista deve estar ciente de que esse fenômeno ocorre no seu modelo. Sendo assim, seja ⟨e, HCASE⟩ um escopo de tratamento com casos de tratamento ⟨α_0, H_0⟩, ⟨α_1, H_1⟩, ..., ⟨α_n, H_n⟩ ∈ HCASE, a fórmula (4), escrita em CTL, especifica formalmente essa propriedade.

EF(ecs(e) ∧ (∨_{0 ≤ i < |HCASE|} α_i))   (4)

3) Progresso de Tratador: Essa propriedade determina que, para cada estado do contexto onde uma exceção contextual é levantada, deve existir pelo menos um destes estados onde cada caso de tratamento é selecionado para tratar aquela exceção. A violação dessa propriedade indica que existem casos de tratamento, definidos em um escopo de tratamento de uma exceção contextual particular, que nunca serão selecionados. Esse tipo de falta de projeto é denominado de tratador morto. Desse modo, seja ⟨e, HCASE⟩ um escopo de tratamento com casos de tratamento ⟨α_0, H_0⟩, ⟨α_1, H_1⟩, ..., ⟨α_n, H_n⟩ ∈ HCASE, a fórmula (5), escrita em CTL, especifica formalmente essa propriedade.

∧_{0 ≤ i < |HCASE|} EF(ecs(e) ∧ α_i)   (5)

V. AVALIAÇÃO

Nesta seção é feita uma avaliação do método proposto. Na Seção V-A é apresentada a ferramenta desenvolvida de suporte ao método. A Seção V-B descreve o sistema exemplo utilizado na avaliação. Na Seção V-C o projeto do TESC de duas exceções do sistema exemplo é detalhado. Por fim, na Seção V-D os cenários de injeção de faltas são descritos e um sumário dos resultados é oferecido na Seção V-E.

A. A Ferramenta

A ferramenta foi implementada na plataforma Java e provê uma API para que o projetista especifique as proposições de contexto (Seção IV-B), as restrições semânticas (Seção IV-B), as exceções contextuais (Seção IV-C1), os casos de tratamento (Seção IV-C2) e os escopos de tratamento (Seção IV-C3). Essas especificações são enviadas ao módulo conversor, que gera os estados do contexto e constrói o modelo de comportamento do TESC e o conjunto de propriedades descritas pelo método. É importante mencionar que o projetista pode informar propriedades adicionais a serem verificadas sobre o modelo, além daquelas já predefinidas por nosso método (Seção IV-E). De posse do modelo de comportamento e das propriedades, a ferramenta submete estas entradas ao módulo de verificação de modelos, o qual executa o processo de verificação e gera um relatório de saída contendo os resultados da verificação. Para fazer a geração dos estados do contexto, a ferramenta faz uso do Choco Solver, uma das implementações de referência da JSR 331: Constraint Programming API. Já no processo de verificação, a ferramenta utiliza o verificador de modelos MCiE, desenvolvido no projeto The Model Checking in Education. Esse verificador foi escolhido, principalmente, pelo fato de ser implementado na plataforma Java, o que facilitou a sua integração com a ferramenta desenvolvida.

B. A Aplicação UbiParking

O UbiParking é uma aplicação baseada no conceito de Cidades Ubíquas.
B. The UbiParking Application

UbiParking is an application based on the concept of Ubiquitous Cities. The idea behind this concept is to provide ubiquitous services related to the daily life of cities and of the people who live in them, in order to improve urban life in several respects, such as traffic, safety, and citizen services. The goal of UbiParking is to help drivers park their vehicles. To this end, UbiParking provides a map plotting all the parking spaces available in each region. This map of free spaces is updated with information collected by sensors installed along the roadsides and in public parking lots. The sensors detect whether a parking space is occupied or free and send this information to the system. Using UbiParking on their mobile devices or on the on-board computers of their vehicles, citizens can obtain information about the distribution of free spaces per region, reserve a space, and ask the system for the most appropriate route according to a criterion of their preference (e.g., shortest distance, largest number of free spaces, or lowest price). Upon arriving at the chosen parking lot, UbiParking guides the driver to the reserved space or to the nearest free space, covering the cases where the reserved space is unexpectedly taken by another vehicle. Likewise, when the driver returns to the vehicle, UbiParking guides him to the nearest exit, saving time. UbiParking parking lots have a spatial layout composed of entrances, a parking area, and exits. In addition, the ubiquitous parking lot is equipped with temperature sensors, smoke detectors, and automatically controlled sprinklers, for the case of fire.

C. TESC Design for UbiParking

This section describes the use of the method in the TESC design of two contextual exceptions of the UbiParking application: the fire exception and the unavailable-space exception.

The fire exception models a fire condition inside the parking lot. Through the context information collected by the smoke and temperature sensors, UbiParking is able to detect the occurrence of this type of contextual exception inside the parking lot. To handle this exception, the system turns on the sprinklers and guides the drivers out of the parking lot. The unavailable-space exception, in turn, models a situation in which the vehicle is moving inside the parking area toward its reserved space, but another vehicle takes that space. In this case, if the space was the last free space available in the parking lot, an abnormal situation is characterized. This contextual exception is detected by the system through context information about space reservations, the location of the vehicle, and the data coming from the occupied-space detection sensors. To handle this contextual exception, UbiParking guides the vehicle out of the parking lot, where a free space in another parking lot can be reserved. Based on these two exceptional scenarios, the propositions described in Table II were established.

TABLE II. UbiParking CONTEXTUAL PROPOSITIONS
inmovement: Is the vehicle moving?
atparkentrance: Is the vehicle at the parking lot entrance?
atparkplace: Is the vehicle in the parking area?
atparkexit: Is the vehicle at the parking lot exit?
hasspace: Is there a free space in the parking lot?
ishot: Is it hot in the parking lot?
hassmoke: Is there smoke in the parking lot?
issprinkleron: Are the sprinklers on?

Note that the contextual propositions atparkplace, atparkexit, and atparkentrance (Table II) have a particular semantic relation. In UbiParking, from a spatio-temporal point of view, the vehicle can only be either outside or inside the parking lot at a given instant. If it is outside the parking lot, the three propositions must be false. If it is inside the parking lot, it can only be in one of the following places: at the entrance, in the parking area, or at the exit, but never in more than one place simultaneously. This kind of semantic relation between contextual propositions must be taken into account during modeling. Thus, the following restriction must be derived while modeling UbiParking to guarantee semantic consistency: (atparkentrance ⊕ atparkplace ⊕ atparkexit) ∨ (¬atparkentrance ∧ ¬atparkplace ∧ ¬atparkexit).

The fire contextual exception is described by the tuple ⟨Fire, hassmoke ∧ ishot⟩. Two handling cases were identified to handle this contextual exception, depending on the vehicle's context. For the context situation in which the vehicle is at the parking lot entrance, the following handling case can be formulated: hcase_0^r = ⟨α_0^r, H_0^r⟩, where α_0^r = inmovement ∧ atparkentrance and H_0^r = {issprinkleron ∧ (¬atparkentrance ∧ ¬atparkplace ∧ ¬atparkexit)}. The handling case hcase_0^r is selected when the vehicle is entering the parking lot. Hence, if it is selected, the expected effect after executing the handling measure (H_0^r) is that the system reaches a state in which the sprinklers are on and the vehicle is outside the parking lot. For the situation in which the vehicle is inside the parking area, another handling case can be derived: hcase_1^r = ⟨α_1^r, H_1^r⟩, where α_1^r = inmovement ∧ atparkplace and H_1^r = {issprinkleron ∧ atparkexit, issprinkleron ∧ (¬atparkentrance ∧ ¬atparkplace ∧ ¬atparkexit)}.
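Gathering the elements defined so far, a hypothetical encoding of the fire handling scope could look as follows. These record types are our own illustration and are not the supporting tool's API; the formulas are kept as plain strings copied from the definitions above.

import java.util.List;

// Hypothetical encoding of the <exception, handling cases> abstractions;
// NOT the supporting tool's API, only an illustration of their structure.
record HandlingCase(String selectionCondition, List<String> handlingMeasures) { }
record HandlingScope(String exceptionName, String exceptionalContextSpec,
                     List<HandlingCase> handlingCases) { }

final class UbiParkingFireScope {
    static HandlingScope fire() {
        HandlingCase atEntrance = new HandlingCase(
            "inmovement && atparkentrance",
            List.of("issprinkleron && !atparkentrance && !atparkplace && !atparkexit"));
        HandlingCase insideParkingArea = new HandlingCase(
            "inmovement && atparkplace",
            List.of("issprinkleron && atparkexit",
                    "issprinkleron && !atparkentrance && !atparkplace && !atparkexit"));
        return new HandlingScope("Fire", "hassmoke && ishot",
                                 List.of(atEntrance, insideParkingArea));
    }
}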
In hcase_1^r, the vehicle is moving inside the parking area. In this handling case, two handling measures are expected to take place sequentially (H_1^r). The first consists in taking the system to a state in which the vehicle is at the parking lot exit and the sprinklers are on. The second consists in taking the system to a state in which the sprinklers remain on and the vehicle is outside the parking lot. The handling scope of this exception is given by ⟨Fire, {hcase_0^r, hcase_1^r}⟩.

The unavailable-space contextual exception is described by the tuple ⟨NoFreeSpace, inmovement ∧ atparkplace ∧ (¬hasspace)⟩. For this exception, only one handling case was defined: hcase_0^n = ⟨α_0^n, H_0^n⟩, where α_0^n = inmovement ∧ atparkplace and H_0^n = {inmovement ∧ atparkexit}. The condition of this handling case establishes that it is only selected if the vehicle is moving inside the parking area. The handling measure associated with this case defines that, after handling, the vehicle must be moving at the parking lot exit. The handling scope of this exception is given by ⟨NoFreeSpace, {hcase_0^n}⟩.

D. Fault Injection Scenarios

Fault injection is a technique employed in the dependability evaluation of computing systems [21]. It consists of the controlled insertion of faults into a model or computing system in order to evaluate robustness and dependability aspects. This technique was used in this work to evaluate the effectiveness of the proposed method. To that end, the TESC design of the fire and unavailable-space exceptions of UbiParking (Section V-C) was modeled with the tool that supports the method. This design was submitted to the tool's model checker and none of the design faults targeted by the method was found (i.e., dead exception, null handling, and dead handler); it is therefore a correct model. Starting from this correct model, for each property to be checked (i.e., detection progress, capture progress, and handler progress), a deliberate change was made to the model in order to violate it. These changes represent design faults similar to those described in Section II-B, which designers are prone to commit.
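For instance (our own illustration, anticipating the contradiction faults used in Scenario 1 below), violating the detection progress of the fire exception amounts to replacing its exceptional context specification by the unsatisfiable formula

\[ ecs'(\mathit{Fire}) \;=\; (\mathit{hassmoke} \land \mathit{ishot}) \land \lnot(\mathit{hassmoke} \land \mathit{ishot}). \]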

1) Scenario 1: Violating Detection Progress: This property is violated when there is not at least one context state of the system in which the exception in question can be detected. This may occur when the designer (i) introduces a contradiction into the exceptional context specification, or (ii) writes a specification that represents a context situation that will never occur at runtime. Although these two design faults differ in meaning, they represent the same situation for the TESC behavior model: an expression that cannot be satisfied within the system model. Therefore, to inject a design fault that violates this property, it is sufficient to make the exceptional context specification unsatisfiable in the TESC model, regardless of whether this is caused by a fault of type (i) or (ii). In this work we chose to use faults of type (i): the contradictions were built as the conjunction of the exceptional context specification with its negation.

2) Scenario 2: Violating Capture Progress: This property is violated when it is not possible to select at least one handling case when an exception is detected. This may occur when the selection conditions of the handling cases represent (i) a contradiction; (ii) a context situation that never occurs; or (iii) a contradiction between the selection condition and the exceptional context specification of the associated contextual exception. Although these design faults differ in meaning, they represent the same situation for the TESC behavior model: an expression that cannot be satisfied within the system model when a contextual exception is detected. Thus, to inject a design fault that violates this property, it is sufficient to make the selection conditions of the handling cases unsatisfiable in the TESC model, regardless of whether this is caused by a fault of type (i), (ii) or (iii). In this work, faults of type (i) were used, built as the conjunction of each handling-case selection condition with its negation.

3) Scenario 3: Violating Handler Progress: This property is violated when there is at least one handling case that is never selected when an exception is detected. The situations in which this may occur are exactly the same as those described for the property of Scenario V-D2. The difference is that, to violate the handler progress property, it is enough for a single handling case to be badly designed (i.e., to contain a design fault), whereas to violate the capture progress property, described in Scenario V-D2, all handling cases must be badly designed. Hence, the same type of design fault as in Scenario V-D2 was used.

E. Summary of the Results

Each scenario was executed individually and three types of permutations, called rounds, were considered: (i) injecting the fault only in the Fire exception; (ii) injecting the fault only in the NoFreeSpace exception; and (iii) injecting the fault in both exceptions simultaneously. In the first round of Scenario V-D1, as expected, the injected fault was detected through the identification of a dead exception design fault in the design of the fire exception. Besides this one, other design faults were detected: null handling and dead handler.
The fact that these other two faults were detected in the design of the fire exception is understandable, since a handling case can only be selected to handle an exception when that exception is detected. The same result occurred in the second round of Scenario V-D1, but with respect to the unavailable-space exception. In the third round, on the other hand, no design fault was identified: since the faults were inserted in both exceptions, no exceptional state was generated and, consequently, the behavior model could not be derived and its verification could not be carried out.

In the first round of Scenario V-D2, the injected fault was detected through the identification of the null handling and dead handler design faults. No dead exception fault was identified, since exceptions were still detected in the model. In the second round of Scenario V-D2, the same result was found for the unavailable-space exception. Finally, in the third round, as expected, a pair of null handling and dead handler design faults was identified for each contextual exception.

Regarding Scenario V-D3, in the first round, as expected, the injected fault was detected through the identification of a dead handler design fault. In the second round of Scenario V-D3, the same result was found for the unavailable-space exception. Finally, in the third round, as expected, one dead handler design fault was identified for each contextual exception.

VI. RELATED WORK

Within the scope of the literature review carried out, no work was found that addresses the same problem tackled in this paper. However, the works [22][16][11][17] are closely related to the solution proposed here. In particular, [16][11][17] are concerned with discovering design faults in the adaptation mechanism of context-aware ubiquitous systems. That problem consists in the poor specification of adaptation rules at design time. Such rules are composed of a guard condition (antecedent) and a set of associated actions (consequent). The guard condition describes the context situations to which the system must react, while the actions characterize the system's reaction to the detected context. Hence, an erroneous specification of the guard conditions can drive the system into an improper configuration and, later, into a failure. In [22] a way of specifying the semantics of adaptive behavior by means of temporal logic formulas is proposed; however, it provides no tool support for property verification. The works [16][11][17], in turn, seek to represent the adaptive behavior by means of some state-and-transition-based formalism. Given such a model, formal analysis techniques (e.g., symbolic algorithms and model checkers) are employed in order to identify design faults.

In [16] the focus is on the domain of context-aware applications formed by compositions of Web Services. The objective in that work is to find inconsistencies in the composition of the services and in their interactions. To that end, a mapping is proposed from the BPEL-based specification of the application to a formal model used to carry out the analyses and verifications, called CA-STS (Context-Aware Symbolic Transition System). [11], in turn, seeks to identify specific problems of poorly specified adaptation rules. For that, the authors propose a state-machine-based formalism called A-FSM (Adaptation Finite-State Machine). This formalism is used to model the adaptive behavior and serves as the basis for property verification and design fault detection. [17] extends [11] by proposing a method to improve the effectiveness of the A-FSM through constraint programming, data mining, and pattern matching techniques. It is important to mention, however, that all of these works, except [22], are limited with respect to the kinds of properties that can be verified: because they propose their own formalisms and implement tools that analyze only a particular set of properties, their extensibility ends up being limited, unlike the proposed method, which allows new properties to be incorporated.

VII. CONCLUSIONS AND FUTURE WORK

This paper presented a method for the verification of TESC designs. The abstractions of the method allow important aspects of the TESC behavior to be modeled and mapped onto a Kripke structure, so that it can be analyzed by a model checker. Additionally, a set of properties capturing the semantics of certain types of design faults was formally specified and presented as a way of helping designers identify design faults. Moreover, the supporting tool and the fault injection scenarios used to evaluate the method yielded results that demonstrate the feasibility of the proposal. As future work, we intend to address issues related to the handling of concurrent exceptions in the model, with the definition of a resolution function that allows the most appropriate handling measures to be selected in face of the set of raised exceptions. Another direction for future work is the extension of the model so that the composition of the adaptive and exceptional behaviors can be represented, with the goal of analyzing the influence of one on the other. Finally, another line to be investigated is the creation of a DSL for TESC design, so that an experiment involving users can be conducted.

REFERÊNCIAS
[1] S. W. Loke, Building taskable spaces over ubiquitous services, IEEE Pervasive Computing, vol. 8, no. 4, Oct.-Dec.
[2] A. K. Dey, Understanding and using context, Personal Ubiquitous Computing, vol. 5, no. 1, pp. 4-7.
[3] D. Kulkarni and A. Tripathi, A framework for programming robust context-aware applications, IEEE Trans. Softw. Eng., vol. 36, no. 2.
[4] K. Damasceno, N. Cacho, A. Garcia, A. Romanovsky, and C. Lucena, Context-aware exception handling in mobile agent systems: The MoCA case, in Proceedings of the 2006 International Workshop on Software Engineering for Large-Scale Multi-Agent Systems, ser. SELMAS 06. New York, NY, USA: ACM, 2006.
[5] J. Mercadal, Q. Enard, C. Consel, and N.
Loriant, A domain-specific approach to architecturing error handling in pervasive computing, in Proceedings of the ACM international conference on Object oriented programming systems languages and applications, ser. OOPSLA 10. New York, NY, USA: ACM, 2010, pp [6] D. M. Beder and R. B. de Araujo, Towards the definition of a context-aware exception handling mechanism, in Fifth Latin-American Symposium on Dependable Computing Workshops, 2011, pp [7] L. Rocha and R. Andrade, Towards a formal model to reason about context-aware exception handling, in 5th International Workshop on Exception Handling (WEH) at ICSE 2012, 2012, pp [8] E.-S. Cho and S. Helal, Toward efficient detection of semantic exceptions in context-aware systems, in 9th International Conference on Ubiquitous Intelligence Computing and 9th International Conference on Autonomic Trusted Computing (UIC/ATC), sept. 2012, pp [9] J. Whittle, P. Sawyer, N. Bencomo, B. H. C. Cheng, and J.-M. Bruel, Relax: Incorporating uncertainty into the specification of self-adaptive systems, in Proceedings of the th IEEE International Requirements Engineering Conference, RE, ser. RE 09. Washington, DC, USA: IEEE Computer Society, 2009, pp [10] D. Cassou, B. Bertran, N. Loriant, and C. Consel, A generative programming approach to developing pervasive computing systems, in Proceedings of the 8th International Conference on Generative Programming and Component Engineering, ser. GPCE 09. ACM, 2009, pp [11] M. Sama, S. Elbaum, F. Raimondi, D. Rosenblum, and Z. Wang, Context-aware adaptive applications: Fault patterns and their automated identification, IEEE Trans. Softw. Eng., vol. 36, no. 5, pp , [12] C. Bettini, O. Brdiczka, K. Henricksen, J. Indulska, D. Nicklas, A. Ranganathan, and D. Riboni, A survey of context modelling and reasoning techniques, Pervasive Mob. Comput., vol. 6, pp , April [13] A. Coronato and G. De Pietro, Formal specification and verification of ubiquitous and pervasive systems, ACM Transactions on Autonomous and Adaptive Systems, vol. 6, no. 1, pp. 9:1 9:6, Feb [14] F. Siewe, H. Zedan, and A. Cau, The calculus of context-aware ambients, J. Comput. Syst. Sci., vol. 77, pp , Jul [15] P. Zhang and S. Elbaum, Amplifying tests to validate exception handling code, in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE Piscataway, NJ, USA: IEEE Press, 2012, pp [16] J. Cubo, M. Sama, F. Raimondi, and D. Rosenblum, A model to design and verify context-aware adaptive service composition, in Proceedings of the 2009 IEEE International Conference on Services Computing, ser. SCC 09. Washington, DC, USA: IEEE, 2009, pp [17] Y. Liu, C. Xu, and S. C. Cheung, Afchecker: Effective model checking for context-aware adaptive applications, Journal of Systems and Software, vol. 86, no. 3, pp , [18] E. M. Clarke, Jr., O. Grumberg, and D. A. Peled, Model Checking. Cambridge, MA, USA: MIT Press, [19] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp , [20] P. Van Hentenryck and V. Saraswat, Strategic directions in constraint programming, ACM Comput. Surv., vol. 28, no. 4, pp , [21] J. Ezekiel and A. Lomuscio, Combining fault injection and model checking to verify fault tolerance in multi-agent systems, in Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, Richland, SC, 2009, pp [22] J. Zhang and B. 
Cheng, Using temporal logic to specify adaptive program semantics, J. Syst. Software, vol. 79, no. 10, pp , 2006.

84 Prioritization of Code Anomalies based on Architecture Sensitiveness Roberta Arcoverde, Everton Guimarães, Isela Macía, Alessandro Garcia Informatics Department, PUC-Rio Rio de Janeiro, Brazil {rarcoverde, eguimaraes, ibertran, Yuanfang Cai Department of Computer Science Drexel University Philadelphia, USA Abstract Code anomalies are symptoms of software maintainability problems, particularly harmful when contributing to architectural degradation. Despite the existence of many automated techniques for code anomaly detection, identifying the code anomalies that are more likely to cause architecture problems remains a challenging task. Even when there is tool support for detecting code anomalies, developers often invest a considerable amount of time refactoring those that are not related to architectural problems. In this paper we present and evaluate four different heuristics for helping developers to prioritize code anomalies, based on their potential contribution to the software architecture degradation. Those heuristics exploit different characteristics of a software project, such as change-density and error-density, for automatically ranking code elements that should be refactored more promptly according to their potential architectural relevance. Our evaluation revealed that software maintainers could benefit from the recommended rankings for identifying which code anomalies are harming architecture the most, helping them to invest their refactoring efforts into solving architecturally relevant problems. Keywords Code anomalies, Architecture degradation and Refactoring; I. INTRODUCTION Code anomalies, commonly referred to as code smells [9], are symptoms in the source code that may indicate deeper maintainability problems. The presence of code anomalies often represents structural problems, which make code harder to read and maintain. Those anomalies can be even more harmful when they impact negatively on the software architecture design [4]. When that happens, we call those anomalies architecturally relevant, as they represent symptoms of architecture problems. Moreover, previous studies [14][35] have confirmed that the progressive manifestation of code anomalies is a key symptom of architecture degradation [14]. The term architecture degradation is used to refer to the continuous quality decay of architecture design when evolving software systems. Thus, as the software architecture degrades, the maintainability of software systems can be compromised irreversibly. As examples of architectural problems, we can mention Ambiguous Interface and Component Envy [11], as well as cyclic dependencies between software modules [27]. In order to prevent architecture degradation, software development teams should progressively improve the system maintainability by detecting and removing architecturally relevant code anomalies [13][36]. Such improvement is commonly achieved through refactoring [6][13] - a widely adopted practice [36] with well known benefits [29]. However, developers often focus on removing or prioritizing - a limited subset of anomalies that affect their projects [1][16]. Furthermore, most of the remaining anomalies are architecturally relevant [20]. Thus, when it is not possible to distinguish which code anomalies are architecturally relevant, developers can spend more time fixing problems that are not harmful to the architecture design. 
This problem occurs even in situations where refactoring need to be applied in order to improve the adherence of the source code to the intended architecture [1][19][20]. Several code analysis techniques have been proposed for automatically detecting code anomalies [18][25][28][32]. However, none of them help developers to prioritize anomalies with respect to their architectural relevance, as they present the following limitations: first, most of these techniques focus on the extraction and combination of static code measures. The analysis of the source code structure alone is often not enough to reveal whether an anomalous code element is related to the architecture decomposition [19][20]. Second, they do not provide means to support the prioritization or ranking of code anomalies. Finally, most of them disregard: (i) the exploitation of software project factors (i.e. frequency of changes and number of errors) that may be an indicator of the architectural relevance of a module, and (ii) the role that code elements play in the architectural design. In this context, this paper proposes four prioritization heuristics to help identifying and ranking architecturally relevant code anomalies. Moreover, we assessed the accuracy of the proposed heuristics when ranking code anomalies based on their architecture relevance. The assessment was carried out in the context of four target systems from heterogeneous domains, developed by different teams using different programming languages. Our results show that the proposed heuristics were able to accurately prioritize the most code relevant anomalies of the target systems mainly for scenarios where: (i) there were architecture problems involving groups of

85 classes that changed together; (ii) changes were not predominantly perfective; (iii) there were code elements infected by multiple anomalies; and (iv) the architecture roles are well-defined and have distinct architectural relevance. The remainder of this paper is organized as follows. Section II introduces the concepts involved in this work, as well as the related work. Section III introduces the study settings. Section IV describes the prioritization heuristics proposed in this paper. Section V presents the evaluation of the proposed heuristics, and Section VI the evaluation results. Finally, Section VII presents the threats to validity, while Section VIII discusses the final remarks and future work. II. BACKGROUND AND RELATED WORK This section introduces basic concepts related to software architecture degradation and code anomalies (Section II.A). It also discusses researches that investigate the interplay between code anomaly and architectural problems (Section II.B). Finally, the section introduces existing ranking systems for code anomalies (Section II.C). A. Basic Concepts One of the causes for architecture degradation [14] on software projects is the continuous occurrence of code anomalies. The term code anomaly or code smell [9] is used to define structures on the source code that usually indicate maintenance problems. As examples of code anomalies we can mention Long Methods and God Classes [9]. In this work, we consider a code anomaly as being architecturally relevant when it has a negative impact in the system architectural design. That is, the anomaly is considered relevant when it is harmful or related to problems in the architecture design. Therefore, the occurrence of an architecturally relevant code anomaly can be observed if the anomalous code structure is directly realizing an architectural element exhibiting an architecture problem [19-21]. Once a code anomaly is identified, the corresponding code may suffer some refactoring operations, so the code anomaly is correctly removed. When those code anomalies are not correctly detected, prioritized and removed in the early stage of software development, the ability of these systems to evolve can be compromised. In some cases, the architecture has to be completely restructured. For this reason, the effectiveness of automatically detected code anomalies using strategies has been studied under different perspectives [16][18][19][26][31]. However, most techniques and tools disregard software project factors that might indicate the relevance of an anomaly in terms of its architecture design, number of errors and frequency of changes. Moreover, those techniques do not help developers to distinguish which anomalous element are architecturally harmful without considering the architectural role that a given code element plays on the architectural design. B. Harmful Impact and Detection of Code Anomalies The negative impact of code anomalies on the system architecture has been investigated by many studies in the stateof-art. For instance, the study developed in [23] reported that the Mozilla s browser code was overly complex and tightly coupled therefore hindering its maintainability and ability to evolve. This problem was the main cause of its complete reengineering, and developers took about five years to rewrite over 7 thousand source files and 2 million lines of code [12]. Another study [7] showed how the architecture design of a large telecommunication system degraded in 7 years. 
Particularly, the relationships between the system modules increased over time. This was the main reason why the system modules were no longer independent and, as a consequence, further changes were not possible. Finally, a study performed in [14] investigated the main causes of architecture degradation; as a result, the study indicated that refactoring specific code anomalies could help to avoid it. Another study [35] identified that duplicated code was related to design violations. In this sense, several detection strategies have been proposed in order to provide means for the automatic detection of code anomalies [18][25][28]. However, most of them are based on source code information and rely on a combination of static code metrics and thresholds into logical expressions. This is the main limitation of those detection strategies, since they disregard architecture information that could be exploited to reveal architecturally relevant code anomalies. In addition, current detection strategies only consider individual occurrences of code anomalies, instead of analyzing the relationships between them. Such limitations are the main reasons why the current detection strategies are not able to support the detection of the code anomalies responsible for introducing architectural problems [19]. Finally, a recent study [19] investigated to what extent architecture-sensitive detection strategies can better identify code anomalies related to architectural problems [22].
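As an illustration of such a detection strategy (our own sketch in the spirit of the metrics-based strategies of [18][25]; the metric set and the thresholds below are illustrative, not the ones calibrated in this study), a God Class detector can be expressed as a logical expression over class-level metrics.

// Minimal sketch of a detection strategy as a logical expression over
// static code metrics and thresholds; values below are illustrative only.
record ClassMetrics(int wmc, double tcc, int atfd) { }

final class GodClassStrategy {
    private static final int    WMC_VERY_HIGH = 47;        // illustrative threshold
    private static final double TCC_ONE_THIRD = 1.0 / 3.0; // illustrative threshold
    private static final int    ATFD_FEW      = 5;         // illustrative threshold

    static boolean isGodClass(ClassMetrics m) {
        return m.wmc() >= WMC_VERY_HIGH
            && m.tcc() <  TCC_ONE_THIRD
            && m.atfd() > ATFD_FEW;
    }
}

WMC, TCC and ATFD stand for weighted methods per class, tight class cohesion and access to foreign data, respectively.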

C. Ranking Systems for Code Anomalies

As previously mentioned, many tools and techniques provide support for automatically detecting code anomalies. However, the number of anomalies tends to increase as the system grows and, in some cases, the high number of anomalies can become unmanageable. Moreover, maintainers are expected to choose which code anomalies should be prioritized. Some of the reasons why this is necessary are (i) time constraints and (ii) attempts to find the correct solution when restructuring a large system in order to perform refactoring operations that solve those code anomalies. The problem is that the existing detection strategies do not focus on ranking or prioritizing code anomalies. Nevertheless, there are two tools that provide ranking capabilities for different development platforms: Code Metrics and InFusion. The first is a .NET-based add-in for the Visual Studio development environment and is able to calculate a limited set of metrics. Once the metrics are calculated, the tool assigns a maintainability index score to each of the analyzed code elements; this score is based on the combination of the metrics for each code element. The second tool, InFusion, can be used for analyzing Java, C and C++ systems and allows more than 60 metrics to be calculated. Besides the statistical analysis for calculating code metrics, it also provides numerical scores in order to detect the code anomalies. Those scores provide means to measure the negative impact of code anomalies on the software system. When the scores are combined, a deficit index is calculated for the entire system; the index takes into consideration size, encapsulation, complexity, coupling and cohesion metrics. However, the main concern with these tools is that the techniques they implement have some limitations: (i) they usually consider only the source code structure as input for detecting code anomalies; (ii) their ranking systems disregard the architecture roles of the code elements; and (iii) the user cannot define or customize their own criteria for prioritizing code anomalies. In this sense, our study proposes prioritization heuristics for ranking and prioritizing code anomalies. Moreover, our heuristics are not based only on source code information for detecting code anomalies; they also consider information about the architectural relevance of the detected code anomalies. For this, we analyze different properties of the source code they affect, such as information about changes to software modules, bugs observed during the system evolution, and the responsibility of each module in the system architecture.

III. STUDY SETTINGS

This section describes our study hypotheses and variable selection, as well as the target systems used to evaluate the accuracy of the proposed heuristics. The main goal of this study is to evaluate whether the proposed heuristics for the prioritization of architecturally relevant code anomalies can help developers in the ranking and prioritization process. It is important to emphasize that the analysis of the proposed heuristics is carried out in terms of accuracy. Table I defines our study using the GQM format [34].

TABLE I. STUDY DEFINITION USING THE GQM FORMAT
Analyze: the proposed set of prioritization heuristics
For the purpose of: understanding their accuracy for ranking code anomalies based on their architecture relevance
With respect to: rankings previously defined by developers or maintainers of each analyzed system
From the viewpoint of: researchers, developers and architects
In the context of: four software systems from different domains with different architectural designs

Our study was performed in three phases: (i) since we have proposed prioritization heuristics for ranking code anomalies, in the first phase we performed the detection and classification of code anomalies according to their architecture relevance for each of the target systems; for such detection, we used a semi-automatic process based on detection strategies and thresholds, which has been broadly used in previous studies [2][11][16][20]; (ii) in the second phase, we applied the proposed heuristics and computed their scores for each detected code anomaly; the output of this phase is an ordered list with the high-priority anomalies; finally, (iii) in the third phase, we compared the heuristics' results with rankings previously defined by developers or maintainers of each target system. The ranking list provided by the developers represents the ground truth data in our analysis and was produced manually.

A. Hypotheses

In this section, we describe the study hypotheses used to test the accuracy of the proposed heuristics for ranking code anomalies based on their relevance. First, we defined thresholds for what we consider an acceptable accuracy: (i) low accuracy, 0-40%; (ii) acceptable accuracy, 40-80%; and (iii) high accuracy, 80-100%.
These thresholds are based on the ranges defined in [37], where the values are applied in statistical tests (e.g., Pearson's correlation). We adapted these values in order to better evaluate our heuristics, since we are only interested in values that indicate a high correlation. Moreover, we analyzed the three levels of accuracy in order to investigate to what extent the prioritization heuristics would be helpful. For instance, a heuristic with an accuracy level of 50% means the ranking produced by the heuristic should be able to identify at least half of the architecturally relevant code anomalies. In order to test the accuracy of the prioritization heuristics, we defined 4 hypotheses (see Table II).

TABLE II. STUDY HYPOTHESES
H1: H1.0 The change-density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten. H1.1 The change-density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H2: H2.0 The error-density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten. H2.1 The error-density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H3: H3.0 The anomaly density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten. H3.1 The anomaly density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H4: H4.0 The architecture role heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten. H4.1 The architecture role heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.

B. Target Systems

In order to test the study hypotheses, we selected 4 target systems from different domains: (i) MIDAS [24], a lightweight middleware for distributed sensor applications; (ii) Health Watcher [13], a web system used for registering complaints about health issues in public institutions; (iii) PDP, a web application for managing scenographic sets in television productions; and (iv) Mobile Media [6], a software product line that manages different types of media in mobile devices. All the selected target systems have been previously analyzed in other studies that address problems such as architectural degradation and refactoring [11][20]. The target systems were selected based on 4 criteria: (i) the availability of either the architecture specification or the original developers; the architectural information is essential to the application of the architecture role heuristic, which directly depends on architecture information to compute the ranks of code anomalies; (ii) availability of the source version control systems of the selected applications; the information from the version control system provides input for the change-density heuristic;

(iii) availability of an issue tracking system; although this is not a mandatory criterion, it is highly recommended for providing input for the error-density heuristic; and (iv) the applications should present different design and architectural structures. This restriction allows us to better understand the impact of the proposed heuristics on a diverse set of code anomalies, emerging from different architectural designs.

IV. PRIORITIZATION OF CODE ANOMALIES

In this section, we describe the 4 prioritization heuristics proposed in this work. These heuristics are intentionally simple, in order to be feasible on most software projects. Their main goal is to help developers in identifying and ranking architecturally relevant code anomalies.

A. Change Density Heuristic

This heuristic is based on the idea that anomalies infecting unstable code elements are more likely to be architecturally relevant. An unstable element can be defined as a code element that suffers multiple changes during the system evolution [15]. In some cases, for instance, changes occur in cascade and affect the same code elements. Those cases are a sign that such changes are violating the "open-closed principle", which according to [27] is the main principle for the preservation of the architecture throughout the system evolution. In this sense, the change-density heuristic calculates the ranking based on the number of changes performed on the anomalous code element. The change-density heuristic is defined as follows: given a code element c, the heuristic looks for every revision in the software evolution path where c has been modified; the number of different revisions represents the number of changes performed on the element. Thus, the higher the number of changes, the higher the element's priority. The only input required by this heuristic is the set of change sets that occurred during the system evolution. The change sets are composed of the list of existing revisions and the code elements that were modified in each revision. For this heuristic, we are only able to compute the changes performed on an entire file; under this scoring mechanism, all code anomalies present in the same file receive the same score. We adopted this strategy because none of the studied code anomalies emerged as the best indicator of architectural problems across the systems [20]. However, it is possible to differentiate between two classes by ranking the ones that changed the most as high-priority. In order to calculate the score for each code anomaly, the heuristic assigns to it the number of changes performed on the infected class. Once the number of changes is computed, we order the list of resources by their respective number of changes, thus producing the final ranking. This information was obtained by extracting the change log from the version control system of each target application.
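A minimal sketch of this scoring, under our own assumptions (change sets given as a map from revision identifiers to the files modified in that revision; this is an illustration, not the tooling used in the study):

import java.util.*;

// Count how many revisions touched each file and rank files by that count.
final class ChangeDensityHeuristic {

    static List<Map.Entry<String, Integer>> rank(Map<String, List<String>> changeSets) {
        Map<String, Integer> changes = new HashMap<>();
        for (List<String> files : changeSets.values()) {
            for (String file : files) {
                changes.merge(file, 1, Integer::sum);   // one more revision touching this file
            }
        }
        List<Map.Entry<String, Integer>> ranking = new ArrayList<>(changes.entrySet());
        ranking.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return ranking;   // most-changed (highest-priority) elements first
    }
}

The top entries of the returned ranking are the candidates for high-priority refactoring.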
B. Error-Density Heuristic

This heuristic is based on the idea that code elements that exhibited a high number of errors during the system evolution should be considered high-priority. The error-density heuristic is defined as follows: given a resolved bug b, the heuristic looks for the code elements c that were modified in order to solve b. Thus, the higher the number of errors solved as a consequence of changes applied to c, the higher its position in the prioritization ranking. This heuristic requires two different inputs: (i) change log inspection: our first analysis was based on change log inspection, looking for common terms like "bug" or "fix". Once those terms are found in commit messages, we increment the scores of the classes involved in the given change. This technique has been successfully applied in other relevant studies [17]; and (ii) bug detection tool: as we could not rely on change log inspection for all systems, we decided to use a bug detection tool, namely FindBugs, to automatically detect blocks of code that could be related to bugs. Once possible bugs are identified, we collect the code elements causing them and increment their scores. Basically, the heuristic works as follows: (i) first, the information about the bugs that were fixed is retrieved from the revisions; (ii) after that, the heuristic algorithm iterates over all classes changed in those revisions, and the score is incremented for each anomaly that infects those classes. In summary, when a given class is related to several bug fixes, its code anomalies will have a high score.

C. Anomaly Density Heuristic

This heuristic is based on the idea that each code element can be affected by many anomalies, and that a high number of anomalous elements concentrated in a single component indicates a deeper maintainability problem. In this sense, the classes internal to a component with a high number of anomalies should be prioritized. Furthermore, it is known that developers seem to care less about classes that present too many code anomalies [27] when they need to modify them. Thus, anomalous classes tend to remain anomalous or get worse as the system evolves, and prioritizing classes with many anomalies should avoid the propagation of problems. This heuristic might also be worthwhile when classes have become brittle and hard to maintain due to the number of anomalies infecting them. Computing the scores for this heuristic is rather straightforward: it calculates the number of anomalies found per code element, and elements with a high number of anomalies are considered high-priority targets for refactoring. The anomaly density heuristic is defined as follows: given a code element c, the heuristic looks at the number of code anomalies that c contains. Thus, the higher the number of anomalies found in c, the higher its ranking in the prioritization result. This heuristic requires only one input: the set of detected code anomalies for each code element in the system. Moreover, the heuristic can be customized to consider only architecturally relevant anomalies, instead of the set of all anomalies infecting the system. In order to define whether an anomaly is relevant or not, our work relies on the detection mechanisms provided by the SCOOP tool [21].
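Analogously, a minimal sketch of the anomaly-density scoring (our own illustration; Anomaly is a hypothetical placeholder type, and the flag restricts the count to architecturally relevant anomalies when desired):

import java.util.*;

// Count the (optionally architecture-relevant) anomalies detected per code
// element and rank the elements by that count.
record Anomaly(String kind, boolean architecturallyRelevant) { }

final class AnomalyDensityHeuristic {

    static List<Map.Entry<String, Long>> rank(Map<String, List<Anomaly>> anomaliesPerElement,
                                              boolean onlyArchitecturallyRelevant) {
        Map<String, Long> density = new HashMap<>();
        anomaliesPerElement.forEach((element, anomalies) -> density.put(element,
            anomalies.stream()
                     .filter(a -> !onlyArchitecturallyRelevant || a.architecturallyRelevant())
                     .count()));
        List<Map.Entry<String, Long>> ranking = new ArrayList<>(density.entrySet());
        ranking.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return ranking;   // elements infected by the most anomalies come first
    }
}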

D. Architecture Role Heuristic

Finally, this heuristic proposes a ranking mechanism based on the architectural role a given class plays in the system. When architecture information is available, the architectural role influences the priority level. The architecture role heuristic is defined as follows: given a code element c, this heuristic examines the architectural role r performed by c. The relevance of that architectural role in the system determines the rank of c. In other words, if r is defined as a relevant architecture role and it is performed by c, the code element c will be ranked as high priority. The architecture role heuristic depends on two kinds of information regarding the system's design: (i) which roles each class plays in the architecture; and (ii) how relevant those roles are to architecture maintainability. For this study setting, we first had to leverage architecture design information in order to map code elements to their architecture roles. Part of this information extraction had already been performed in our previous studies [19][20]. Then, we asked the original architects to assign different levels of importance to those roles, according to the architecture patterns implemented. Moreover, we defined score levels for each architecture role. To do so, we considered the number of roles identified by the architects and distributed them over a fixed interval from 0 to 10. Code anomalies that infected elements playing critical architecture roles were assigned the highest score. On the other hand, when a code anomaly affected elements related to less critical architecture roles, it was assigned a lower score, according to the number of architecture roles provided by the original architects.
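A minimal sketch of this scoring, under our own assumptions (roles are plain strings, each element plays exactly one role, and the scores of the ranked roles are spread linearly over the 0-10 interval):

import java.util.*;

// Spread the architect-ranked roles over a fixed 0-10 interval and let each
// code element inherit the score of the role it plays.
final class ArchitectureRoleHeuristic {

    // rolesByRelevance: most critical role first, as provided by the architects.
    static Map<String, Double> roleScores(List<String> rolesByRelevance) {
        Map<String, Double> scores = new LinkedHashMap<>();
        int n = rolesByRelevance.size();
        for (int i = 0; i < n; i++) {
            double score = (n == 1) ? 10.0 : 10.0 * (n - 1 - i) / (n - 1); // 10 down to 0
            scores.put(rolesByRelevance.get(i), score);
        }
        return scores;
    }

    static double score(String element, Map<String, String> rolePlayedBy,
                        Map<String, Double> roleScores) {
        return roleScores.getOrDefault(rolePlayedBy.get(element), 0.0);
    }
}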
V. EVALUATION

This section describes the main steps for evaluating the proposed heuristics, as well as for testing the study hypotheses. The evaluation is organized into three main activities: (i) detection of code anomalies; (ii) identification of the rankings representing the ground truth; and (iii) collection of scores for each anomaly under the perspective of the prioritization heuristics.

A. Detecting Code Anomalies

The first step was the automatic identification of code anomalies for each of the 4 target systems, using well-known detection strategies and thresholds [16][31]. The detection strategies and thresholds used in our study have been applied previously in other studies [6][19][20]. The metrics required by the detection strategies are mostly collected with current tools [30][33]. After that, the list of code anomalies is checked and refined by the original developers and architects of each target system. Through this validation we can make sure that the results produced by the detection tools do not include false positives [19]. We have also defined a ground truth ranking in order to compare the analysis provided by the software architects with the ranking produced by each of the proposed heuristics. The ground truth ranking is a list of anomalous elements in the source code ordered by their architecture relevance, defined by the original architects of each target application. Basically, the architects were asked to provide an ordered list of the top ten classes that, in their opinion, represented the main sources of maintainability problems of those systems. Besides providing a list of the high-priority code elements, the architects were also asked to provide information regarding the architectural design of each target system. That is, they should provide a list of the architectural roles present in each target system, ordered by their relevance from the architecture perspective.

B. Analysis Method

After applying the heuristics, we compared the rankings produced by each of them with the ground truth ranking. We decided to analyze only the top ten ranked code elements, for three main reasons: (i) it would be unviable to ask developers to rank an extensive list of elements; (ii) we wanted to evaluate our prioritization heuristics mainly for their ability to improve refactoring effectiveness, and the top ten anomalous code elements represent a significant sample of elements that could possibly cause architecture problems; and (iii) we focused on analyzing the top 10 code elements to assess whether they represent a useful subset of the sources of architecturally relevant anomalies. In order to analyze the rankings provided by the heuristics, we considered three measures: (i) Size of overlap: measures the number of elements that appear both in the ground truth ranking and in the heuristic ranking. It is fairly simple to calculate and tells us whether the prioritization heuristics are accurately distinguishing the top k items from the others; (ii) Spearman's footrule [5]: a well-known metric for permutations. It measures the distance between two ranked lists by computing the differences in the rankings of each item; and (iii) Fagin's extension to Spearman's footrule [8]: an extension of Spearman's footrule for top k lists. Fagin extended Spearman's footrule by assigning an arbitrary placement to elements that belong to one of the lists but not to the other. Such placement represents the position, in the resulting ranking, of all the items that do not overlap when comparing both lists. It is important to notice the main differences between the three measures. The number of overlaps indicates how effectively our prioritization heuristics are capable of identifying a set of k relevant code elements, disregarding the differences between them; this measure becomes more important as the number of elements under analysis grows. Thus, the number of overlaps might give us a good hint of the heuristics' capability of identifying good refactoring candidates. The purpose of the other two measures is to analyze the similarity between two rankings: unlike the number of overlaps, they take into consideration the position each item has in the compared rankings. It is also important to mention the main difference between those two measures: when calculating Spearman's footrule, we consider only the overlapping items, so when the lists are disjoint the original ranks are lost and a new ranking is produced. Fagin's measure, on the other hand, takes into consideration the positions of the overlapping elements in the original lists. Finally, we used the measures' results to calculate the similarity accuracy as defined in our hypotheses.
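A minimal sketch of the three comparison measures for two top-k rankings (our own illustration; Spearman's footrule is computed over the re-ranked overlap, and Fagin's extension places items missing from one of the lists at position k + 1, which is one common choice and an assumption of ours):

import java.util.*;

// Size of overlap, Spearman's footrule on the overlapping items, and Fagin's
// extension for top-k lists. Raw distances are returned; the normalized
// variants reported in the study (NSF, NF) are not computed here.
final class RankingComparison {

    static long overlap(List<String> groundTruth, List<String> heuristic) {
        return groundTruth.stream().filter(heuristic::contains).count();
    }

    static int spearmanFootrule(List<String> groundTruth, List<String> heuristic) {
        List<String> a = new ArrayList<>(groundTruth);
        a.retainAll(heuristic);                      // overlap, in ground-truth order
        List<String> b = new ArrayList<>(heuristic);
        b.retainAll(groundTruth);                    // overlap, in heuristic order
        int distance = 0;
        for (String item : a) {
            distance += Math.abs(a.indexOf(item) - b.indexOf(item));
        }
        return distance;
    }

    static int faginFootrule(List<String> groundTruth, List<String> heuristic, int k) {
        Set<String> all = new LinkedHashSet<>(groundTruth);
        all.addAll(heuristic);
        int distance = 0;
        for (String item : all) {
            int p1 = groundTruth.indexOf(item);
            int p2 = heuristic.indexOf(item);
            distance += Math.abs((p1 < 0 ? k : p1) - (p2 < 0 ? k : p2)); // missing item placed just below the top k
        }
        return distance;
    }
}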

VI. EVALUATING THE PROPOSED PRIORITIZATION HEURISTICS

The evaluation of the proposed heuristics involved two separate activities: (i) a quantitative analysis of the similarity results; and (ii) a quantitative evaluation of the results regarding their relation to actual architecture problems.

A. Change-Density Heuristic

Evaluation. This heuristic was applied to 3 out of the 4 target applications selected in our study. Our analysis was based on different versions of Health Watcher (10 versions), Mobile Media (8 versions) and PDP (409 versions). Our goal was to check whether the prioritization heuristics performed well on systems with both shorter and longer longevity. Additionally, it was not a requirement to embrace only projects with long histories, since we also wanted to evaluate whether the heuristics would be more effective in the preliminary versions of a software system. Table III shows the evolution characteristics analyzed for each system.

TABLE III. CHANGE CHARACTERISTICS FOR EACH TARGET APPLICATION (columns: Name, CE, N-Revisions, M-Revisions, AVG; rows: Health Watcher, Mobile Media, PDP)

As we can observe, Mobile Media and Health Watcher presented similar evolution behaviors. As the maximum number of revisions of a code element (M-Revisions) is limited by the total number of revisions of the system, neither Health Watcher nor Mobile Media could have 10 or more versions of a code element (CE). We can observe that Health Watcher had more revisions than Mobile Media; however, those changes were scattered across more files. Due to the reduced number of revisions available for both systems, we established a criterion for selecting items when there were ties in the top 10 rankings; for instance, we can use alphabetical order when the elements in the ground truth are ranked as equally harmful.

TABLE IV. RESULTS FOR THE CHANGE-DENSITY HEURISTIC (overlap, NSF and NF, each with value and accuracy; HW: overlap 8, 57%; MM: overlap 5, 50%, NSF 1, 0%; PDP: overlap 6, 60%)

Table IV shows the results observed when analyzing the change-density heuristic. As we can observe, the highest absolute overlap value was obtained for Health Watcher. This can be explained by the fact that the Health Watcher system has many files with the same number of changes. When computing the scores we did not consider only the 10 most changed files, as that approach would discard files with as many changes as the ones selected; so we decided to select 14 files, where the last 5 presented exactly the same number of changes. Moreover, Health Watcher presented the highest number of code elements, having a total of 137 items (see Table III) that could appear in the ranking produced by applying the heuristic. Another interesting finding was observed in the Mobile Media system. Although the change-density heuristic detected 5 overlaps, all of them were shifted by exactly two positions, thus resulting in a value of 1 for the NSF measure. On the other hand, when we considered the non-overlaps, the position of one item matched. Moreover, the results show us that the NSF measure is not adequate when the number of overlaps is small. When we compare the results of Mobile Media and Health Watcher to those obtained by PDP, we observe a significant difference: all PDP measures performed above our acceptable similarity threshold, which means a similarity value higher than 45%.
For this case, we observed that the similarity was related to a set of classes that were deeply coupled: an interface acting as a Facade and three realizations of this interface, implementing a server module, a client module and a proxy. When changes were performed on the interface, many other changes were triggered in those three classes; for this reason, they suffered many modifications during the system evolution. Moreover, the nature of the changes that the target applications underwent is different. For instance, in Health Watcher most of the changes were perfective (changes made aiming to improve the overall structure of the application). On Mobile Media, on the other hand, most of the changes were related to the addition of new functionalities, which was also the case for PDP. However, we observed that Mobile Media also had low accuracy rates. In summary, the results of applying the change-density heuristic showed us that it can be useful for detecting and prioritizing architecturally relevant anomalies in the following scenarios: (i) there are architecture problems involving groups of classes changing together; (ii) there are problems in the architecture related to Facade or communication classes; and (iii) changes were predominantly perfective. In this sense, from the results observed in the analysis, we can reject the null hypothesis H1. In fact, the change-density heuristic was able to produce rankings for PDP with at least acceptable accuracy in all the analyzed measures.

Correlation with Architectural Problems. Based on the results produced by the change-density heuristic, we also needed to evaluate whether there is a correlation between the rankings and actual architectural problems. We therefore performed the analysis by observing which ranked elements are related to actual architectural problems (see Table V). We can observe that elements containing architecturally relevant anomalies (Arch-Relevant) were likely to be change-prone. For the PDP system, all of the top 10 most changed elements were related to architectural problems. Also, if we consider that PDP has 97 code elements, 37 of which are related to architectural problems, the results give us a hint that change density is a good heuristic for detecting them.

TABLE V. RESULTS FOR THE CHANGE-DENSITY HEURISTIC VS. ARCHITECTURAL PROBLEMS (columns: Name, N-ranked CE, Arch-Relevant, % of Arch-Relevant; rows: HW, MM, PDP)

90 B. Error-Density Heuristic Evaluation. This heuristic is based on the assessment of bugs that are introduced by a code element. So, the higher the number of bugs observed in a code element, the higher its priority. Thus, in order to correctly evaluate the results produced by the error-density heuristic, a reliable set of detected bugs should be available for each target system. This was the case for the PDP system, where the set of bugs was well documented. On the other hand, for Mobile Media and Health Watcher, where the documentation of bugs was not available, we relied on the analysis of bug detection tools. The results of applying the error-density heuristic are shown in Table VI. It is important to highlight that for Health Watcher there were 14 ranked items, due to ties between some of them. Nevertheless, Health Watcher presented the highest overlap measures. This happens because the detected bugs were related to the behavior observed in every class implementing the Command pattern. Furthermore, each of the classes implementing this pattern was listed as high-priority in the ground-truth ranking. TABLE VI. RESULTS FOR THE ERROR-DENSITY HEURISTIC Name Overlap NSF NF Value Accuracy Value Accuracy Value Accuracy HW 10 71% 0 100% % MM 3 30% 0 100% % PDP 5 30% % % Another interesting finding we observed was that the priority order for the overlapping elements was exactly the same as the one pointed out in the ground truth. However, the 4 remaining non-overlapping elements were the top 4 elements in the ground-truth ranking. The fact that these top 4 elements are not listed in the ranking produced by the heuristic resulted in a low accuracy for the NF measure. For Mobile Media, we applied the same strategy, but all the measures also presented low accuracies. Due to the small number of overlaps, the results for NSF may not confidently represent the heuristic's accuracy. Finally, for PDP the results were evaluated from a different perspective, since we considered manually detected bugs. That is, the bugs were collected through its issue tracking system, instead of using automatic bug detection tools. However, even when we performed the analysis using a reliable set of bugs, the overall results presented low accuracy. In fact, of the 5 non-overlapping items, 4 were related to bugs in utility classes. Since those classes were neither related to any particular architectural role nor implementing an architecture component, they were not considered architecturally relevant. Correlation with Architectural Problems. Based on the results produced by the error-density heuristic, we could investigate the correlation between the rankings and actual architectural problems. That is, we could analyze whether the error-density heuristic presented better results towards detecting architecturally relevant anomalies. Table VII presents the results from applying this heuristic. As we can see, at least 80% of the ranked elements were related to architecture problems for all the analyzed systems. Moreover, the Health Watcher system reached the most significant results, with 85% of the ranked elements related to architectural problems. When we take into consideration that the ranking for Health Watcher was composed of 14 code elements (instead of 10), this result is even more significant. As mentioned before, the rankings for Health Watcher and Mobile Media were built over automatically detected bugs. 
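Purely as an illustration of the idea — the study's actual tooling is not shown — the sketch below assembles an error-density ranking from whichever bug evidence is at hand: issue-tracker reports traced to the code elements changed to fix them (as available for PDP), or warnings from a static-analysis tool otherwise (as for Health Watcher and Mobile Media). All data structures and names are hypothetical.

```python
from collections import Counter

def error_density_ranking(bug_reports=None, static_warnings=None, k=10):
    """Rank code elements by the number of bugs associated with them.

    `bug_reports` maps a bug id to the code elements changed to fix it (traced
    through the issue tracker); when no reliable reports exist, `static_warnings`
    -- a list of (element, warning) pairs from a bug-finding tool -- is used
    instead. Ties are broken alphabetically.
    """
    counts = Counter()
    if bug_reports:
        for fixed_elements in bug_reports.values():
            counts.update(set(fixed_elements))
    elif static_warnings:
        counts.update(element for element, _warning in static_warnings)
    return sorted(counts, key=lambda e: (-counts[e], e))[:k]

# PDP-like usage: documented bugs traced to the classes changed to fix them.
reports = {"BUG-12": ["Server.java", "Facade.java"],
           "BUG-15": ["Facade.java"],
           "BUG-21": ["StringUtil.java"]}
print(error_density_ranking(bug_reports=reports, k=3))
# ['Facade.java', 'Server.java', 'StringUtil.java']
```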
It means that even when formal bug reports are not available, the use of static analysis tools [3] for predicting possible bugs might be useful. TABLE VII. RESULTS FOR THE ERROR-DENSITY HEURISTIC VS. ACTUAL ARCHITECTURAL PROBLEMS Name N-ranked CE Arch-Relevant % of Arch-Relevant HW % MM % PDP % On the other hand, for the PDP system, where we considered actual bug reports, the results were also promising. From the top 10 ranked elements, 8 were related to architecture problems. When we consider that the PDP system has 97 code elements, with 37 of them related to architecture problems, it means that the remaining 29 were distributed among the 87 bottom-ranked elements. Moreover, if we extend the analysis to the top 20 elements, we observe a better correlation factor. That is, in this case the correlation showed us that around 85% of the top 20 most error-prone elements were related to architecture problems. C. Anomaly Density Heuristic Evaluation. The anomaly density heuristic was applied to the 4 target systems selected in our study. We observed good results in terms of accuracy on ranking the architecturally relevant anomalies. As we can see in Table VIII, good results were obtained not only on ranking the top 10 anomalies, but also on defining their positions. We observed that only 2 of 8 measures had low accuracy when compared to the thresholds defined in our work. Furthermore, the number of overlaps achieved by this heuristic can be considered highly accurate in 3 of the 4 target systems. This indicates that code elements affected by multiple code anomalies are often perceived as high priority. The only exception was Health Watcher, where we observed only 5 overlaps. When analyzing the number of anomalies for each element on the ground-truth ranking, we observed that many of them had exactly the same number of code anomalies, namely 8. Also, it is important to mention that for this heuristic, in contrast to the change-density and error-density heuristics, we only considered the top 10 elements for the Health Watcher system, since there were no ties to be taken into consideration. When analyzing the MIDAS system, we could not consider the number of overlaps a significant finding: although 9 out of 10 elements appeared in both rankings, this was expected, as the system is composed of only 21 code elements. Nevertheless, we observed that both NSF and NF presented a high accuracy, which means that the rankings were similarly ordered. Moreover, the NF measure presented a better result, which was influenced by the fact that the only mismatched element was ranked in the 10th position. On the other hand, when analyzing Mobile Media we observed discrepant

91 results regarding two ranking measures. We found an accuracy of 59% for the NSF measure, and 30% for the NF measure. This difference is also related to the position of the non-overlapping elements in the ranking generated by the heuristic. Therefore, the ranks for those elements were assigned to k+1 in the developers' list, which resulted in a large distance from their original positions. It is also important to mention that those elements comprised a data model class, a utility class and a base class for controllers. TABLE VIII. RESULTS FOR THE ANOMALY DENSITY HEURISTIC Name Overlap NSF NF Value Accuracy Value Accuracy Value Accuracy HW 5 50% % % MM 7 70% % % PDP 8 80% % % MIDAS 9 90% % % By analyzing the results for this heuristic, we observed that code elements infected by multiple code anomalies are often perceived as high priority. We also identified that many false positives could arise from utility classes, as those classes are often large and not cohesive. Finally, the results obtained in this analysis also helped us reject the null hypothesis H3, as the anomaly density heuristic was able to produce rankings with at least acceptable accuracy, for at least one measure, in all of the systems we analyzed. Furthermore, we obtained a high accuracy rate for the MIDAS system in 2 out of 3 measures, which means 90% for the overlaps and 80% for NF. Correlation with Architectural Problems. We also performed an analysis in order to evaluate whether the rankings produced by the anomaly density heuristic are related to actual architectural problems. However, when evaluating the results produced by this heuristic, we observed that they were not consistent when compared with the architecturally relevant anomalies. This is a valid conclusion for all target systems. TABLE IX. RESULTS FOR THE ANOMALY DENSITY HEURISTIC VS. ACTUAL ARCHITECTURAL PROBLEMS Name N-ranked CE Arch-Relevant % of Arch-Relevant HW % MM % PDP % MIDAS 10 6* 60%* For instance, Table IX shows that for the Health Watcher system only 5 out of the top 10 ranked elements were related to architectural problems. The 5 code elements related to architectural problems are exactly the same overlapping items between the compared ranks. This happens due to the high number of anomalies, which are concentrated in a small number of elements that are not architecturally relevant. Moreover, all the 5 non-architecturally relevant elements were data access classes responsible for communicating with the database. For the MIDAS system, we observed that from the top 10 code elements with the highest number of anomalies, 6 were architecturally relevant. In addition, the MIDAS system has exactly 6 elements that contribute to the occurrence of architecture problems. So, we can say that the anomaly density heuristic correctly ranked all of them in the top 10. D. Architecture Role Heuristic Evaluation We analyzed 3 of the 4 systems in order to evaluate the architecture role heuristic. As we can observe (see Table X), PDP achieved the most consistent results regarding the three similarity measures. The heuristic achieved around 60% accuracy when comparing the similarity between the rankings. Also, PDP is the only system where it was possible to divide classes and interfaces into more than three levels when analyzing the architectural roles. For instance, Table XI illustrates the four different architectural roles defined in the PDP system.
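As an illustration only — not the authors' implementation — the architecture role heuristic can be pictured as assigning each code element the score of the role it implements and ranking by that score, breaking ties alphabetically as discussed below; the role-to-score mapping mimics the PDP classification shown in Table XI, and all class and role names are hypothetical.

```python
# Hypothetical role scores, mirroring the PDP classification of Table XI.
ROLE_SCORES = {
    "utility": 1, "internal": 1,
    "presentation": 2, "data_access": 2,
    "domain_model": 4, "business": 4,
    "public_interface": 8, "communication": 8, "facade": 8,
}

def architecture_role_ranking(element_roles, k=10):
    """Rank elements by the relevance score of the architecture role they play.

    `element_roles` maps a code element to the role assigned to it by the
    architects; ties are broken alphabetically, which is why only part of a
    large group of equally scored elements may reach the top-k list.
    """
    score = lambda e: ROLE_SCORES.get(element_roles[e], 0)
    return sorted(element_roles, key=lambda e: (-score(e), e))[:k]

roles = {"RemoteFacade": "facade", "ClientProxy": "communication",
         "Employee": "domain_model", "DateUtil": "utility"}
print(architecture_role_ranking(roles, k=3))
# ['ClientProxy', 'RemoteFacade', 'Employee']
```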
TABLE X. RESULTS FOR THE ARCHITECTURE ROLE HEURISTIC Name Overlap NSF NF Value Accuracy Value Accuracy Value Accuracy HW 4 40% % % MM 6 60% % % PDP 6 60% % % TABLE XI. ARCHITECTURE ROLES IN PDP Architecture Roles Score # of CE Utility and Internal Classes 1 23 Presentation and Data access classes 2 28 Domain Model, Business classes 4 24 Public Interfaces, Communication classes, Facades 8 6 Based on the classification provided in Table XI, we can draw the architecture role heuristic ranking for PDP. As we can see, the ranking contains all of the 6 code elements (# of CE) from the highest category and 4 elements from the domain model and business classes. We ordered the elements alphabetically to break ties. Therefore, although 23 classes obtained the same score, we are only comparing 4 of them. However, it is important to mention that some of the elements ranked by the original architects belonged to the group of discarded elements. Had we chosen a different approach, such as considering all the tied elements as one item, we would have turned our top ten ranking into a list of 30 items and obtained a 100% overlap rate. On the other hand, we decided to follow a different scoring approach for Mobile Media and Health Watcher, by consulting the original architects of each target application. The architects provided us with the architecture roles and their relevance in the system architecture. Once we identified which classes were directly implementing which roles, we were able to produce the rankings for this heuristic. The worst results were observed in the Health Watcher system, where almost 20 elements were tied with the same score. So, we first selected the top 10 elements, and broke the ties according to alphabetical order. This led us to an unrealistically low number of overlaps, as some of the discarded items were present in the ground truth ranking. In fact, due to the low number of overlaps, it would not be fair to evaluate the NSF measure as well. Thus, we performed a second analysis, considering the top 20 items instead of the top 10, in order to analyze the whole set of elements that had the same score. In this second analysis, we observed that the number of overlaps went up to 6, but the accuracy for the NSF measure decreased to 17% - which indicates a larger distance between the compared rankings. In addition, this also shows us that the 50% accuracy for NSF obtained in the first comparison round was

92 misleading, as expected, due to the low number of overlaps. For the Mobile Media system, we observed high accuracy rates for both the NSF and NF measures. Furthermore, we observed that several elements of Mobile Media were documented as being of high priority in the implementation of architectural components. More specifically, there were 8 architecture components described in that documentation directly related to 9 out of the top 10 high-priority classes. It is important to notice that the results for this heuristic are dependent on the quality of the architecture roles defined by the software architect. Moreover, we observed that the PDP system achieved the best results, even with multiple architecture roles defined, as well as different levels of relevance. Finally, we conclude that the results of applying the architecture role heuristic helped to reject the null hypothesis H4. In other words, the heuristic was able to produce rankings with at least acceptable accuracy in all of the target applications. Correlation with Architectural Problems. Similarly to the other heuristics, we have also evaluated whether the rankings produced by the architecture role heuristic are related to actual architectural problems for each of the target applications (see Table XII). As we can observe, the results are discrepant between Health Watcher and the other two systems. However, the problem in this analysis is related to the analyzed data. We identified two different groups of architecture roles among the top 10 elements for Health Watcher, ranked as equally relevant. That is, 6 of the ranked elements were playing the role of repository interfaces. The 4 remaining elements were Facades [10], or elements responsible for communicating different architecture components. We then asked the original architects to elaborate on the relevance of those roles, as we suspected they were unequal. They decided to differentiate the relevance between them, and considered the repository role as less relevant. This refinement led to a completely different ranking, which went up from 4 to 7 elements related to architecture problems. TABLE XII. ARCHITECTURE ROLE HEURISTIC AND ACTUAL ARCHITECTURAL PROBLEMS Name # of ranked CE Arch-Relevant % of Arch-Relevant HW % MM % PDP % The results obtained for Health Watcher show us the importance of correctly identifying the architecture roles and their relevance for improving the accuracy of this heuristic. When that information is accurate, the results for this heuristic are highly positive. Furthermore, the other proposed prioritization heuristics could benefit from information regarding architecture roles in order to minimize the number of false positives, such as utility classes. This indicates the need to further analyze different combinations of prioritization heuristics. VII. THREATS TO VALIDITY This section describes some threats to validity observed in our study. The first threat is related to possible errors in the anomaly detection for each of the selected target systems. As the proposed heuristics consist of ranking previously detected code anomalies, the method for detecting these anomalies must be trustworthy. Although there are several kinds of detection strategies in the state of the art, many studies have shown that they are ineffective for detecting architecturally relevant code anomalies [19]. 
In order to reduce the risk of imprecision when detecting code anomalies: (i) the original developers and architects were involved in this process; and (ii) we used well-known metrics and thresholds for constructing our detection strategies [16][31]. The second threat is related to how we identified errors in the software systems in order to apply the error-density heuristic. Firstly, we relied on commit messages for identifying classes related to bug fixes, which implies that some errors might have been missed. In order to mitigate this threat, we also investigated issue-tracking systems. Basically, we looked for error reports and traces between these errors and the code changed to fix them. Furthermore, we investigated test reports in order to identify the causes of broken tests. Finally, for the cases where this information was not available, we relied on the use of static analysis methods for identifying bugs [3]. The third threat is related to the identification of the architectural roles for each of the target systems. The architecture role heuristic is based on identifying the relevance of code elements regarding the system architectural design. Thus, in order to compute the scores for this heuristic, we needed to assess the roles that each code element plays in the system architecture. In this sense, we considered the identification of architectural roles to be a threat to construct validity because the information regarding the architectural roles was extracted differently depending on the target system. Furthermore, we understand that the absence of architecture documentation reflects a common situation that might be inevitable when analyzing real-world systems. Finally, the fourth threat to validity is an external threat, and it is related to the choice of the target systems. The problem here is that our results are limited to the scope of the 4 target systems. In order to minimize this threat, we selected systems developed by different programmers, with different domains, programming languages, environments and architectural styles. In order to generalize our results, further empirical investigation is still required. In this sense, our study should be replicated with other applications, from different domains. VIII. FINAL REMARKS AND FUTURE WORK The presence of architecturally relevant code anomalies often leads to the decline of the software architecture quality. Furthermore, the removal of those critical anomalies is not properly prioritized, mainly due to the inability of current tools to identify and rank architecturally relevant code anomalies. Moreover, there is not enough empirical knowledge about factors that could ease the prioritization process. In this sense, our work has shown that developers can be guided through the prioritization of code anomalies according to architectural relevance. The main contributions of this work are: (i) four prioritization heuristics based on architecture relevance and (ii) the evaluation of the proposed heuristics on four different software systems.

93 In addition, during the evaluation of the proposed heuristics, we found that they were mostly useful in scenarios where: (i) there are architectural problems involving groups of classes that change together; (ii) there are architecture problems related to Facades or classes responsible for communicating different modules; (iii) changes are not predominantly perfective; (iv) there are architecture roles infected by multiple anomalies; and (v) the architecture roles are well defined in the software system and have distinct architecture relevance. Finally, in this work we evaluated the proposed heuristics individually. Thus, we have not evaluated how their combinations could benefit the prioritization results. In that sense, as future work, we aim to investigate whether the combination of two or more heuristics would improve the efficiency of the ranking results. We also intend to apply different weights when combining the heuristics, enriching the possible results and looking for an optimal combination. REFERENCES [1] R. Arcoverde, A. Garcia and E. Figueiredo, Understanding the Longevity of Code Smells: Preliminary Results of a Survey, in Proc. of 4th Int'l Workshop on Refactoring Tools, May 2011 [2] R. Arcoverde et al., Automatically Detecting Architecturally-Relevant Code Anomalies, 3rd Int'l Workshop on Recommendation Systems for Soft. Eng., June [3] N. Ayewah et al., Using Static Analysis to Find Bugs, IEEE Software, Vol. 25, Issue 5, pp , September [4] L. Bass, P. Clements and R. Kazman, Software Architecture in Practice, Second Edition, Addison-Wesley Professional, [5] P. Diaconis and R. Graham, Spearman's Footrule as a Measure of Disarray, in Journal of the Royal Statistical Society, Series B, Vol. 39, pp , [6] E. Figueiredo et al., Evolving Software Product Lines with Aspects: An Empirical Study on Design Stability, in Proc. of 30th Int'l Conf. on Software Engineering, New York, USA [7] S. Eick, T. Graves and A. Karr, Does Code Decay? Assessing the Evidence from Change Management Data, IEEE Transactions on Soft. Eng., Vol. 27, Issue 1, pp. 1-12, 2001 [8] R. Fagin, R. Kumar and D. Sivakumar, Comparing Top k Lists, in Proc. of 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp , USA [9] M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley, 1999. [10] E. Gamma et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Boston, USA, [11] J. Garcia, D. Popescu, G. Edwards and N. Medvidovic, Identifying Architectural Bad Smells, in Proc. of CSMR, Washington, USA [12] M. Godfrey and E. Lee, Secrets from the Monster: Extracting Mozilla's Software Architecture, in Proc. of 2nd Int'l Symp. on Constructing Software Engineering Tools, [13] P. Greenwood et al., On the Impact of Aspectual Decompositions on Design Stability: An Empirical Study, in Proc. of 21st Conf. on Object-Oriented Programming, Springer, pp , [14] L. Hochstein and M. Lindvall, Combating Architectural Degeneration: A Survey, Information and Software Technology, Vol. 47, July [15] D. Kelly, A Study of Design Characteristics in Evolving Software Using Stability as a Criterion, IEEE Transactions on Software Engineering, Vol. 32, Issue 5, pp , [16] F. Khomh, M. Di Penta and Y. Guéhéneuc, An Exploratory Study of the Impact of Code Smells on Software Change-Proneness, in Proc. of 16th Working Conf. on Reverse Eng., pp , 2009 [17] M. Kim, D. Cai and S. Kim, An Empirical Investigation into the Role of API-Level Refactorings during Software Evolution, in Proc. of 33rd Int'l Conf. on Software Engineering, USA [18] M. Lanza and R. Marinescu, Object-Oriented Metrics in Practice, Springer-Verlag, New York, USA 2006 [19] I. Macia et al., Are Automatically-Detected Code Anomalies Relevant to Architectural Modularity? An Exploratory Analysis of Evolving Systems, in Proc. of 11th AOSD, pp , Germany, [20] I. Macia et al., On the Relevance of Code Anomalies for Identifying Architecture Degradation Symptoms, in Proc. of 16th CSMR, Hungary, March [21] I. Macia et al., Supporting the Identification of Architecturally-Relevant Code Anomalies, in Proc. of 28th IEEE Int'l Conf. on Soft. Maint., Italy, [22] I. Macia et al., Enhancing the Detection of Code Anomalies with Architecture-Sensitive Strategies, in Proc. of the 17th CSMR, Italy, March [23] A. MacCormack, J. Rusnak and C. Baldwin, Exploring the Structure of Complex Software Design: An Empirical Study of Open Source and Proprietary Code, in Management Science, Vol. 52, Issue 7, pp , [24] S. Malek et al., Reconceptualizing a Family of Heterogeneous Embedded Systems via Explicit Architectural Support, in Proc. of the 29th Int'l Conf. on Soft. Eng., IEEE Computer Society, USA [25] M. Mantyla and C. Lassenius, Subjective Evaluation of Software Evolvability using Code Smells: An Empirical Study, Vol. 11, pp , 2006 [26] R. Marinescu, Detection Strategies: Metrics-Based Rules for Detecting Design Flaws, in Proc. Int'l Conf. on Soft. Maint., pp , [27] R. Martin, Agile Software Development: Principles, Patterns, and Practices, Prentice Hall, [28] M. J. Munro, Product Metrics for Automatic Identification of Bad Smell Design Problems in Java Source-Code, in Proc. of 11th Int'l Symposium on Soft. Metrics, pp. 15, September [29] E. Murphy-Hill, C. Parnin and A. Black, How We Refactor, and How We Know It, in Proc. of 31st Int'l Conf. on Software Engineering, [30] NDepend. Available at [31] S. Olbrich, D. Cruzes and D. Sjoberg, Are Code Smells Harmful? A Study of God Class and Brain Class in the Evolution of Three Open Source Systems, in Proc. of 26th Int'l Conf. on Soft. Maint., [32] J. Ratzinger, M. Fischer and H. Gall, Improving Evolvability through Refactoring, in Proc. of 2nd Int'l Workshop on Mining Soft. Repositories, ACM Press, pp , New York, [33] Understand. Available at: [34] C. Wohlin et al., Experimentation in Software Engineering: An Introduction, Kluwer Academic Publishers, [35] S. Wong, Y. Cai and M. Dalton, Detecting Design Defects Caused by Design Rule Violations, in Proc. of 18th ESEC/Foundations of Software Engineering, [36] Z. Xing and E. Stroulia, Refactoring Practice: How it is and How it should be Supported: An Eclipse Study, in Proc. of 22nd IEEE Int'l Conf. on Software Maintenance, pp , [37] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Chapman & Hall, 4th Edition, 2007.

94 Are domain-specific detection strategies for code anomalies reusable? An industry multi-project study Reuso de Estratégias Sensíveis a Domínio para Detecção de Anomalias de Código: Um Estudo de Múltiplos Casos Alexandre Leite Silva, Alessandro Garcia, Elder José Reioli, Carlos José Pereira de Lucena Opus Group, Laboratório de Engenharia de Software Pontifícia Universidade Católica do Rio de Janeiro (PUC - Rio) Rio de Janeiro/RJ, Brasil {aleite, afgarcia, ecirilo, lucena}@inf.puc-rio.br Resumo Para promover a longevidade de sistemas de software, estratégias de detecção são reutilizadas para identificar anomalias relacionadas a problemas de manutenção, tais como classes grandes, métodos longos ou mudanças espalhadas. Uma estratégia de detecção é uma heurística composta por métricas de software e limiares, combinados por operadores lógicos, cujo objetivo é detectar um tipo de anomalia. Estratégias prédefinidas são usualmente aplicadas globalmente no programa na tentativa de revelar onde se encontram os problemas críticos de manutenção. A eficiência de uma estratégia de detecção está relacionada ao seu reuso, dado o conjunto de projetos de uma organização. Caso haja necessidade de definir limiares e métricas para cada projeto, o uso das estratégias consumirá muito tempo e será negligenciado. Estudos recentes sugerem que o reuso das estratégias convencionais de detecção não é usualmente possível se aplicadas de forma universal a programas de diferentes domínios. Dessa forma, conduzimos um estudo exploratório em vários projetos de um domínio comum para avaliar o reuso de estratégias de detecção. Também avaliamos o reuso de estratégias conhecidas, com calibragem inicial de limiares a partir do conhecimento e análise de especialistas do domínio. O estudo revelou que, mesmo que o reuso de estratégias aumente quando definidas e aplicadas para um domínio específico, em alguns casos o reuso é limitado pela variação das características dos elementos identificados por uma estratégia de detecção. No entanto, o estudo também revelou que o reuso pode ser significativamente melhorado quando as estratégias consideram peculiaridades dos interesses recorrentes no domínio ao invés de serem aplicadas no programa como um todo. Abstract To prevent the quality decay, detection strategies are reused to identify symptoms of maintainability problems in the entire program. A detection strategy is a heuristic composed by the following elements: software metrics, thresholds, and logical operators combining them. The adoption of detection strategies is largely dependent on their reuse across the portfolio of the organizations software projects. If developers need to define or tailor those strategy elements to each project, their use will become time-consuming and neglected. Nevertheless, there is no evidence about efficient reuse of detection strategies across multiple software projects. Therefore, we conduct an industry multi-project study to evaluate the reusability of detection strategies in a critical domain. We assessed the degree of accurate reuse of previously-proposed detection strategies based on the judgment of domain specialists. The study revealed that even though the reuse of strategies in a specific domain should be encouraged, their accuracy is still limited when holistically applied to all the modules of a program. However, the accuracy and reuse were both significantly improved when the metrics, thresholds and logical operators were tailored to each recurring concern of the domain. 
Palavras-chave anomalias;detecção;reuso;acurácia I. INTRODUÇÃO Na medida em que sistemas de software são alterados, mudanças não planejadas podem introduzir problemas estruturais no código fonte. Estes problemas representam sintomas de manutenibilidade pobre do programa e, portanto, podem dificultar as atividades subsequentes de manutenção e evolução do programa [1]. Tais problemas são chamados de anomalias de código ou popularmente de bad smells [1]. Segundo estudos empíricos, módulos de programas com anomalias recorrentes, tais como métodos longos [1] e mudanças espalhadas [1], estão usualmente relacionados com introdução de falhas [17][25][26] e sintomas de degeneração de projeto [17][20][27]. Quando tais anomalias não são identificadas e removidas, é frequente a ocorrência de degradação parcial ou total do sistema [21]. À medida que um sistema cresce, identificar anomalias de código manualmente fica ainda mais difícil ou impeditivo. A automação do processo de detecção de anomalias em programas é usualmente suportada através de métricas [2][11]. Cada métrica quantifica um atributo de elementos do código fonte, tais como acoplamento [23], coesão [24] e complexidade ciclomática [22]. A partir das métricas é possível identificar uma relação entre os valores de atributos e um sintoma de problema no código. Através dessa relação é possível definir uma estratégia de detecção para apoiar a descoberta de anomalias automaticamente [1][2]. Uma estratégia de detecção é uma condição composta por métricas e limiares, combinados através de operadores lógicos. Através desta condição é possível filtrar um conjunto específico de elementos do

95 programa. Este conjunto de elementos representa candidatos a anomalias de código nocivas a manutenibilidade do sistema [2]. Mesmo assim, nem todo sintoma representa necessariamente um problema relevante para o desenvolvedor do sistema [8]. Para facilitar a identificação das anomalias, algumas ferramentas foram propostas, a partir das estratégias de detecção conhecidas: [3], [4], [5], [6] e [7]. Mesmo com o apoio de ferramentas, detectar anomalias é difícil e custoso [8]. Além disso, a eficiência de uma estratégia de detecção está relacionada à facilidade do seu reuso dado o conjunto de projetos de uma organização. Em um extremo negativo, os desenvolvedores precisariam definir uma estratégia de detecção para cada tipo possível de anomalia, para cada projeto. Para isso, seria preciso rever as métricas e limiares apropriados, além das ocorrências identificadas pelas ferramentas que não representam necessariamente problemas no código. Essa tarefa, ao ser executada especificamente para cada projeto, vai custar muito tempo e fatalmente será negligenciada. Além disso, existem evidências empíricas de que o reuso das estratégias de detecção não é possível se aplicada em vários projetos de software de domínios totalmente diferentes. Para que fosse possível investigar o reuso de estratégias de detecção em vários projetos de software do mesmo domínio, este artigo apresenta um estudo de múltiplos casos da indústria. O estudo investigou o reuso de sete estratégias de detecção, relacionadas a três anomalias, em seis projetos de um domínio específico. O reuso das estratégias foi avaliado a partir do percentual de falsos positivos, classificados segundo a análise de três especialistas do domínio, a partir das ocorrências encontradas pelas estratégias de detecção de anomalias. A partir do grau de reuso das estratégias, foram investigadas as situações em que fosse possível aumentar o grau de reuso, tendo em vista os sistemas escolhidos para o estudo. Dessa forma, o estudo revelou que, mesmo que o reuso de estratégias de detecção em um domínio específico seja incentivado, em alguns casos o reuso é limitado devido à variação das características dos elementos identificados por uma estratégia de detecção. No entanto, a acurácia e o reuso foram ambos significativamente melhorados quando os limiares foram adaptados para certos interesses recorrentes no domínio. Nesse sentido, foi observado que os melhores resultados se deram nos interesses em que as características dos elementos variaram menos. Assim, o presente estudo inicia o estudo de estratégias de detecção tendo em vista conjuntos de elementos com responsabilidades bem definidas. O artigo está estruturado da seguinte maneira. Na seção II é apresentada a terminologia relacionada ao trabalho. Na seção III é apresentada a definição do estudo de caso. Na sessão IV são apresentados os resultados e as discussões e, na seção V, as conclusões. II. TERMINOLOGIA Esta seção apresenta conceitos associados a anomalias de código (seção II.A) e estratégias de detecção de anomalias (II.B). A. Anomalias de código Segundo Fowler, uma anomalia de código é um sintoma de manutenibilidade pobre do programa que pode dificultar futuras atividades de correção e evolução do código fonte [1]. Por exemplo, um sintoma que precisa ser evitado é a existência de classes que centralizam muito conhecimento sobre as funcionalidades do sistema. Este sintoma é muito conhecido como God Class e tem um grande potencial de impacto negativo no perfeito entendimento do sistema [2]. 
Outro sintoma que deve ser evitado é Long Method [1]: quanto maior um método é, mais difícil será entender o que ele se propõe a fazer. Espera-se então uma maior longevidade de programas com métodos curtos de código [1]. Estas anomalias estão relacionadas de uma forma ou de outra a fatos sobre um único elemento do código. Por outro lado, certas anomalias procuram correlacionar fatos sobre diversos elementos do código com possíveis problemas de manutenção, como é o caso da Shotgun Surgery. Esta anomalia identifica métodos que podem provocar muitas alterações em cascata, isto é, manutenções no código que levam a diversas mudanças pequenas em outras classes. Quando essas alterações estão espalhadas pelo código, elas são difíceis de encontrar assim como também é fácil neste caso para o desenvolvedor esquecer alguma mudança importante [1]. Para apoiar a descoberta de anomalias, Fowler propôs 22 metáforas de sintomas que indicam problemas no código, sendo que, cada metáfora está relacionada a uma anomalia de código [1][10]. B. Estratégias de detecção A detecção de anomalias oferece aos desenvolvedores a oportunidade de reestruturação do código para uma nova estrutura que facilite manutenções futuras. Um mecanismo bastante utilizado para detecção de anomalias é a descrição das mesmas através da composição de métricas associadas aos atributos dos elementos de código [2]. Uma composição de métricas descreve uma estratégia de detecção. A partir de uma estratégia de detecção é possível filtrar então um conjunto específico de elementos do programa. Este conjunto de elementos representa potenciais candidatos a anomalias de código [2][12][13]. A Fig. 1 descreve, de maneira sucinta, o processo de formação de uma estratégia de detecção, segundo [2], para reuso em diferentes sistemas. Primeiro, um conjunto de métricas relacionadas a sintomas que indicam um determinado problema é identificado (Fig. 1 a). Em um segundo passo, as métricas identificadas são associadas a limiares, para que seja possível filtrar os elementos de código. Uma métrica associada a um dado limiar elimina elementos para os quais os valores das métricas excedam os limiares (Fig. 1 b). Para a final formação de uma estratégia de detecção, as métricas e limiares são combinados entre si através de operadores lógicos (e.g., AND, OR) (Fig. 1 c e d).
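Apenas para ilustrar o processo descrito acima — e não a implementação das ferramentas citadas —, o esboço a seguir mostra como métricas, limiares e operadores lógicos podem ser combinados em uma estratégia de detecção executável; os nomes e os valores de limiar são hipotéticos (o exemplo segue o estilo de uma estratégia para God Class).

```python
# Esboço ilustrativo: uma estratégia de detecção como composição de pares
# (métrica, limiar) combinados por operadores lógicos (Fig. 1).

def excede(metrica, limiar):
    """Filtro 'métrica > limiar': retorna um predicado sobre um elemento de código."""
    return lambda elemento: elemento[metrica] > limiar

def todas(*condicoes):   # operador AND
    return lambda elemento: all(c(elemento) for c in condicoes)

def alguma(*condicoes):  # operador OR
    return lambda elemento: any(c(elemento) for c in condicoes)

# Exemplo hipotético no estilo de uma estratégia para God Class:
# classes grandes (LOC) e muito acopladas (CBO).
god_class = todas(excede("LOC", 150), excede("CBO", 6))

classes = [{"nome": "GeradorDeRelatorios", "LOC": 420, "CBO": 9},
           {"nome": "ConversorDeDatas", "LOC": 80, "CBO": 2}]
suspeitas = [c["nome"] for c in classes if god_class(c)]
print(suspeitas)  # candidatos a anomalia, ainda sujeitos à avaliação do especialista
```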

96 Fig. 1. Processo de formação de uma estratégia de detecção [2] e seu uso em diversos sistemas. Adaptado de [25]. Legenda: Mi: resultado da métrica i; Li: limiar associado à métrica i; ED: estratégia de detecção. Como se pode observar, uma estratégia de detecção codifica o conhecimento a respeito das características de uma determinada anomalia. Logo, escolher métricas e limiares apropriados é determinante para o sucesso da estratégia no apoio a descoberta de sintomas de problemas no código [2] [8][14]. Com isso, a principal intenção com esta abordagem é permitir que uma estratégia de detecção possa ser posteriormente aplicada em diversos sistemas (Fig. 1 e), isto é, espera-se que as características de uma anomalia se mantenham dentre diferentes sistemas. No entanto, observa-se que em determinados contextos em que as estratégias são aplicadas, algumas ocorrências não são necessariamente sintomas de problemas, isto é, elas indicam, na realidade, falsos positivos [8] [9]. III. DEFINIÇÃO DO ESTUDO O estudo objetiva investigar a viabilidade de reuso de estratégias de detecção de anomalias em vários sistemas do mesmo domínio. Portanto, a seção III.A descreve o objetivo do estudo em mais detalhes. A seção III.B descreve o contexto em que o estudo foi conduzido. A seção III.C descreve o projeto do estudo. A. Objetivo do estudo De acordo com o formato proposto por Wohlin (1999), o objetivo deste trabalho pode ser caracterizado da seguinte forma: O objetivo é analisar a generalidade das estratégias de detecção de anomalias de código para o propósito de reuso das mesmas com respeito à diminuição da ocorrência de falsos positivos do ponto de vista de mantenedores de software no contexto de sistemas web de apoio à tomada de decisão. O contexto desse estudo é formado por seis sistemas web de apoio à tomada de decisão. Esse conjunto de sistemas opera em um domínio crítico, pois realiza a análise de indicadores para o mercado financeiro (seção III.B). Em uma primeira etapa, busca-se calibrar ou definir estratégias de detecção para sistemas desse domínio, a partir de características conhecidas e observadas pelos desenvolvedores de um subconjunto de sistemas neste domínio. Esta fase tem o objetivo de calibrar estratégias existentes ou definir novas estratégias para serem utilizadas em sistemas do domínio alvo. Portanto, o conhecimento dos especialistas do domínio sobre o código fonte foi utilizado primeiro para calibrar os limiares de métricas usadas em estratégias convencionais existentes (ex. [2][13][16]). Tal conhecimento do especialista sobre o código foi também usado para definir novas estratégias com métricas não exploradas em tais estratégias convencionais. Em uma segunda etapa, avalia-se o reuso e a acurácia das estratégias em uma família de outros sistemas do mesmo domínio. Além do grau de reuso, a acurácia das estratégias é avaliada através da quantidade de falsos positivos encontrados. Falsos positivos são indicações errôneas de anomalias detectadas pela estratégia. Nossa pressuposição é que o reuso das estratégias aumenta na medida em que as mesmas são definidas em função de características de sistemas do mesmo domínio. Além disso, certos conjuntos recorrentes de classes, que implementam um mesmo interesse (responsabilidade) bem definido, em sistemas de um mesmo domínio, tendem a possuir características estruturais semelhantes.
De fato, em sistemas web de apoio à tomada de decisão, foco do nosso estudo (seção III.B), alguns conjuntos de classes possuem responsabilidades semelhantes e bem definidas. Portanto, também estudamos se o reuso e a eficácia poderiam ser melhorados se estratégias fossem aplicadas a classes com uma responsabilidade recorrente do domínio. Por exemplo, um conjunto de classes desse domínio é formado por classes que recebem as requisições do usuário e iniciam a geração de indicadores financeiros. Essas classes recebem os parâmetros necessários, calculam uma grande quantidade de informações e geram os resultados para serem exibidos na interface. Além disso, essas classes desempenham o papel de intermediário entre a interface do usuário e as classes de negócio. Mesmo assim, é preciso evitar que fiquem muito grandes. Além disso, é preciso evitar que o acoplamento dessas classes fique muito disperso com relação a outras classes da aplicação. Outro conjunto desse domínio é formado por classes responsáveis pela persistência dos dados. Assim, essas classes são formadas por muitos métodos de atribuição e leitura de valores de atributos (getters e setters). As classes de persistência devem evitar métodos muito longos que possam incorporar também a lógica da aplicação de forma indesejável. Uma classe da camada de persistência com essas características pode indicar um sintoma de problemas para a compreensão dos métodos, bem como acoplamentos nocivos à manutenção do programa. Nesse sentido, esse trabalho visa responder às seguintes questões de pesquisa: (1) É possível reusar estratégias de detecção de anomalias de forma eficaz em um conjunto de sistemas de um mesmo domínio? A partir de estratégias calibradas ou definidas com o apoio do especialista do domínio, faz-se necessário avaliar o reuso das estratégias em outros sistemas do mesmo domínio. Entretanto, o reuso de cada estratégia só é eficaz se a mesma é aplicada em um novo programa do mesmo domínio com baixa incidência de falsos positivos. Em nosso estudo, consideramos que a estratégia foi eficaz se o uso desta não resulta em mais que 33% de falsos positivos. Mais a frente justificamos o uso deste procedimento.

97 (2) É possível diminuir a ocorrência de falsos positivos ao considerar as características de classes com responsabilidade bem definida do domínio? Como justificamos com os exemplos acima, observa-se que certos elementos do programa implementam um interesse recorrente de um domínio de aplicações; estes elementos podem apresentar características estruturais parecidas, que não são aplicáveis aos outros elementos do programa como um todo. Portanto, também verificamos se a associação de estratégias específicas para classes de um mesmo interesse seriam mais reutilizáveis do que as mesmas que são definidas para um programa como um todo. Para responder essas questões, foi conduzido um estudo com múltiplos casos de programas do mesmo domínio. Esse estudo foi realizado para avaliar o reuso de sete estratégias de detecção, relacionados a três tipos de anomalias recorrentes em um domínio específico. B. Contexto de aplicação do estudo O presente estudo foi conduzido em uma empresa de consultoria e desenvolvimento em sistemas de missão-crítica. A empresa é dirigida por doutores e mestres em informática, e foi fundada em Em 2010, a empresa absorveu um conjunto de sistemas web de apoio à tomada de decisão, originalmente desenvolvidos por outra empresa. Esse conjunto de sistemas opera em um domínio crítico, pois realiza a análise de indicadores para o mercado financeiro. O tempo de resposta e a precisão dos dados são importantes, pois a apresentação de uma análise errada pode gerar uma decisão errada e a consequente perda de valores financeiros. De forma a propiciar a confiabilidade deste sistema em longo prazo, o mesmo também deve permanecer manutenível. Caso contrário, as dificuldades de manutenção facilitarão a introdução de faltas nos programas ao longo do histórico do projeto. Além disso, a baixa manutenibilidade dificulta que a empresa se adapte a mudanças nas regras de negócio ou incorpore inovações, perdendo, assim, competitividade no mercado. A seguir, apresentamos várias características destes programas, algumas delas sinalizando a importância de manter a manutenibilidade dos mesmos através, por exemplo, de detecção de anomalias de código. Os seis sistemas escolhidos para o estudo estão divididos entre três equipes distintas. Segundo a Tabela I, cada equipe é responsável por dois sistemas e é representada por um líder. Cada líder participa do estudo como especialista do domínio (E1, E2 e E3). Além disso, oito programadores distintos compõem as três equipes que mantêm os seis sistemas. Os sistemas que fazem parte desse estudo possuem uma estrutura direcionada à operação de grande quantidade de dados. A partir desses dados é possível gerar indicadores para a tomada de decisão no mercado financeiro. Os dados estão relacionados, por exemplo, com informações históricas de ativos financeiros e informações relacionadas à configuração e armazenamento de estruturas utilizadas pelos usuários. A partir das estruturas utilizadas pelos usuários é possível controlar: carteiras de ativos financeiros, tipos de relatório, variáveis utilizadas nos cálculos de indicadores, modos de interpolação de dados, entre outras informações. TABELA I. COMPOSIÇÃO DAS EQUIPES QUE MANTÉM OS SISTEMAS USADOS NO ESTUDO Sistemas Especialistas Programadores A e B E1 P1, P2 e P3 C e D E2 P4 e P5 E e F E3 P6, P7 e P8 Nesses sistemas, como a interface do usuário é bastante rica, também existem muitas classes que compõem os elementos visuais. 
Esses elementos recebem as requisições do usuário e dão início à geração de informações e processamento de dados. Ao final das operações necessárias, os dados são mostrados na interface e o usuário pode analisá-los através de gráficos e relatórios em diferentes formatos. O tempo de resposta das solicitações é fundamental para a tomada de decisões. Desse modo, algumas operações realizadas por esses sistemas utilizam tecnologias assíncronas e client-side operações executadas diretamente no navegador do cliente como, por exemplo, javascript e JQuery. A manutenibilidade das classes destes programas também é importante para não acarretar potenciais efeitos colaterais ao desempenho. Ainda, existe um conjunto de classes que garante o controle de acesso às informações por meio de autenticação. A autenticação é necessária, pois existem restrições para os diferentes perfis de usuários. Além disso, um grande conjunto de classes é usado para refletir o modelo do banco de dados. Da mesma forma que em vários outros sistemas existentes, essas classes são necessárias para garantir a integridade das informações. Ainda, nesses sistemas, é importante garantir a frequente comunicação com serviços de terceiros. Esses serviços fornecem dados provenientes de algumas fontes de dados financeiros como, por exemplo, Bloomberg ( Outro ponto importante para a escolha destes sistemas é a recorrência de conjuntos de elementos com responsabilidades bem definidas. Dessa forma, é possível garantir a proximidade estrutural dos conjuntos de classes dos sistemas em estudo, o que é fundamental para avaliar o reuso das estratégias e responder nossas duas questões de pesquisa (seção III.A). Além disso, através da recorrência desses conjuntos de elementos é possível avaliar o percentual de falsos positivos das estratégias, considerando as características específicas dos conjuntos de elementos. C. Projeto do estudo Segundo [13], um bom índice de acurácia de uma estratégia de detecção deveria estar acima dos 60%. De qualquer forma, o índice usado nesse estudo foi um pouco mais rigoroso e está um pouco acima deste índice sugerido na literatura: 66%, isto é, dois terços de acertos nas detecções feitas por cada estratégia. A escolha do índice de acurácia de 66% também se deu pelo fato de que é possível garantir que, a cada três ocorrências identificadas pelas estratégias de detecção, apenas uma é classificada como um falso positivo. Se o desenvolvedor encontra um número de erros (falso positivos) maior que dois terços, este será desencorajado a reusar a estratégia em outro

98 programa. Sendo assim, para avaliar se as estratégias de detecção de anomalias escolhidas podem ser reusadas com, no máximo, 33% de ocorrências de falsos positivos, foram definidas três etapas. O objetivo da primeira etapa, chamada de etapa de ajustes, é definir estratégias de detecção de anomalias que tenham percentual de falsos positivos abaixo de 33% para duas aplicações do domínio em estudo. A segunda etapa, chamada de etapa de reuso, tem por objetivo avaliar se as estratégias definidas na etapa de ajustes podem ser reusadas em outros quatro sistemas do mesmo domínio, com o resultado de falsos positivos ainda abaixo de 33%. Finalmente, a última etapa é chamada de etapa de análise por interesses do domínio. Esta tem como objetivo verificar se o percentual de falsos positivos das estratégias pode ser melhorado tendo em vista a aplicação das estratégias apenas em classes de um mesmo interesse recorrente nos programas do mesmo domínio. Nesse estudo, o percentual de falsos positivos é definido através da avaliação do especialista do domínio. Essa avaliação é realizada durante uma sessão de investigação. Em cada sessão de investigação realizada, as estratégias de detecção de anomalias são aplicadas a um dos sistemas do domínio. Assim, a partir de cada ocorrência indicada pela ferramenta de detecção, o especialista faz uma avaliação qualitativa, para indicar se a ocorrência é um falso positivo ou se realmente é um sintoma de problema para o domínio das aplicações em estudo. Dessa forma, o percentual de falsos positivos de uma estratégia de detecção é definido pelo nº de ocorrências classificadas como falso positivo pelo especialista do domínio, em relação ao nº de ocorrências identificadas pela ferramenta de detecção. Etapa de Ajustes. Na etapa de ajustes, os especialistas do domínio apoiaram as atividades de: (i) definição do domínio em estudo, para que fosse possível caracterizar os sistemas para os quais faria sentido avaliar o reuso de estratégias; (ii) escolha dos sistemas que caracterizam o domínio, para que fosse possível considerar sistemas que representam o domínio em estudo; e (iii) identificação dos interesses (responsabilidades) recorrentes do domínio, bem como do conjunto de classes que contribuem para a implementação de cada interesse. Em seguida, nessa mesma etapa, as definições de anomalias que são recorrentes na literatura [19] foram apresentadas aos especialistas do domínio. Isso foi feito para que fosse possível avaliar as anomalias que seriam interessantes investigar no domínio alvo, do ponto de vista dos especialistas. A partir da escolha das anomalias, foram definidas as estratégias de detecção de anomalias que seriam utilizadas. Conforme mencionado anteriormente, foram utilizadas estratégias definidas a partir da sugestão dos especialistas do domínio, além de estratégias conhecidas da literatura [2][13][16]. Neste último caso, os especialistas sugeriram refinamentos de limiares de acordo com experiências e observações feitas na etapa de ajustes. Ainda nesta etapa de ajustes, foi escolhida uma ferramenta de detecção de anomalias de código em que fosse possível avaliar as ocorrências identificadas pelas estratégias, tendo em vista o mapeamento das classes que implementavam cada interesse do domínio (conforme discutido acima). Também na etapa de ajustes, foram realizadas duas sessões de investigação, com a participação do especialista do domínio, para os dois sistemas escolhidos para essa etapa. 
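Apenas para tornar concreto o cálculo descrito acima, o esboço a seguir mostra como o percentual de falsos positivos de uma estratégia pode ser obtido a partir da classificação do especialista e comparado ao limiar de 33% adotado no estudo; os valores usados no exemplo são ilustrativos, no formato NO/FP das tabelas de resultados.

```python
def percentual_fp(num_ocorrencias, num_falsos_positivos):
    """%FP = nº de ocorrências classificadas como falso positivo pelo especialista
    em relação ao nº de ocorrências identificadas pela ferramenta de detecção."""
    if num_ocorrencias == 0:
        return 0.0
    return 100.0 * num_falsos_positivos / num_ocorrencias

def estrategia_eficaz(num_ocorrencias, num_falsos_positivos, limiar=33.0):
    """Estratégia considerada eficaz em um sistema se o %FP não excede o limiar."""
    return percentual_fp(num_ocorrencias, num_falsos_positivos) <= limiar

# Exemplo ilustrativo no formato NO/FP: 27 ocorrências, 6 falsos positivos.
print(round(percentual_fp(27, 6), 1))   # 22.2 -> abaixo de 33%
print(estrategia_eficaz(17, 7))         # 41.2% -> False
```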
A partir da classificação do especialista, foi possível definir o percentual de falsos positivos para cada uma das estratégias escolhidas, para cada um dos dois sistemas. Finalizando a etapa de ajustes, verificamos as estratégias que resultaram em no máximo 33% de falsos positivos (na média). Dessa forma, as estratégias que não excederam esse limiar para os dois sistemas foram aplicadas na etapa seguinte, chamada etapa de reuso. Etapa de Reuso. Como mencionamos, o objetivo da etapa de reuso é avaliar se as estratégias definidas na etapa de ajustes podem ser reusadas em outros quatro sistemas do mesmo domínio, com o resultado de falsos positivos ainda abaixo de 33%. Nessa etapa, o reuso das estratégias é definido através dos seguintes critérios: Reuso total: a estratégia foi aplicada nos sistemas do domínio e resultou diretamente em no máximo 33% de falsos positivos, em todos os sistemas; Reuso parcial: a estratégia foi aplicada nos sistemas do domínio, porém o percentual de falsos positivos excedeu 33% em um ou dois sistemas; isto é, as estratégias foram reusadas de forma eficaz em, pelo menos, a metade dos programas; Nenhum reuso: nesse caso, a estratégia foi aplicada nos sistemas do domínio e o percentual de falsos positivos excedeu 33% em mais de dois sistemas; isto é, as estratégias foram reusadas de forma eficaz em menos da metade dos programas. Da mesma forma, na etapa de reuso, o percentual de falsos positivos para todas as estratégias de detecção é determinado pela avaliação qualitativa do especialista do domínio. Dessa forma, o percentual de falsos positivos para cada estratégia de detecção é definido pelo nº de ocorrências classificadas como falso positivo pelo especialista do domínio, em relação ao nº total de ocorrências identificadas pela ferramenta de detecção. Assim, foram realizadas quatro sessões de investigação, sendo uma para cada um dos quatro sistemas escolhidos para essa etapa. A partir dos resultados das quatro sessões de investigação, procurou-se identificar quais estratégias tiveram um reuso total. Dessa forma, essa etapa procura indícios das estratégias que tiveram bons resultados, considerando o domínio das aplicações em estudo. Depois, em um segundo momento, foram investigados os casos em que o percentual de falsos positivos esteve acima de 33%. Nesses casos, procuramos entender quais fatores influenciaram a alta ocorrência de falsos positivos. Nesse sentido, foi realizada uma investigação nos valores das métricas dos elementos identificados, para observar quais fatores desmotivaram o reuso da estratégia para os seis sistemas do domínio em estudo. Etapa de Interesses do Domínio. Para finalizar o estudo, a etapa chamada de etapa de interesses tem como objetivo verificar se o percentual de falsos positivos das estratégias diminui ao considerar a aplicação das estratégias apenas em elementos de cada interesse recorrente nos sistemas do mesmo domínio. Através da última etapa, investigamos se seria

99 possível diminuir o percentual de falsos positivos ao aplicar as estratégias de detecção a um conjunto de elementos com responsabilidades bem definidas. Anomalias Investigadas. Nesse estudo as anomalias investigadas forma definidas juntamente com o especialista do domínio. Dessa forma, é possível investigar anomalias que são interessantes do ponto de vista de quem acompanha o dia a dia do desenvolvimento dos sistemas do domínio. Nesse estudo foram investigadas: uma anomalia em nível de classe, uma anomalia em nível de método e uma anomalia relacionada a mudanças. São elas, nessa ordem: 1) God Class (seção II.A): com o passar do tempo, é mais cômodo colocar apenas um método em uma classe que já existe, a criar uma classe nova. Dessa forma, é preciso evitar classes que concentram muito conhecimento, isto é, classes com várias responsabilidades distintas, chamadas de God Classes; 2) Long Method (seção II.A): da mesma forma que a anomalia anterior, existem métodos que acabam concentrando muita lógica do domínio. Assim, é importante identificar métodos que concentram muito conhecimento e dificultam a compreensão e manutenção do programa. Ocorrências de anomalias como estas são chamadas de Long Methods; 3) Shotgun Surgery (seção II.A): para prevenir que uma mudança em um método possa gerar várias pequenas mudanças em outros elementos do código, é preciso evitar que um método possua relação com vários outros métodos dispersos na aplicação. Caso esse relacionamento disperso ocorra, podemos ter ocorrências de anomalias chamadas de Shotgun Surgeries; Estratégias de Detecção Escolhidas. A partir das anomalias escolhidas, as estratégias de detecção definidas para o estudo foram concebidas em conjunto com o especialista. A partir da discussão com o especialista, foram escolhidas e calibradas estratégias de detecção conhecidas da literatura (Seção III.A). Foram formadas também novas estratégias de detecção, tendo como orientação o processo de formação de estratégias de detecção, proposto por [2] (seção II.B). Para definir as estratégias em conjunto com os especialistas, foi necessário decidir quais métricas identificam os sintomas que devem ser evitados, tendo em vista as características do domínio em estudo. Dessa forma, para definir as estratégias que avaliam God Class, os especialistas sugeriram uma métrica relacionada ao tamanho e uma métrica relacionada ao acoplamento. Além disso, os especialistas sugeriram que fosse possível variar a métrica de tamanho para avaliar qual estratégia poderia apresentar melhores resultados, tendo em vista os sistemas do domínio em estudo. Depois, para identificar Long Method, os especialistas do domínio sugeriram que fossem usadas uma métrica de tamanho e uma métrica de complexidade. Por último, para identificar Shotgun Surgery, os especialistas sugeriram uma métrica de complexidade e uma métrica de acoplamento. Depois de definir estratégias de detecção em conjunto com os especialistas do domínio, foram escolhidas três estratégias de detecção, a partir da literatura. Dessa forma, cada uma das estratégias da literatura está relacionada a uma das anomalias escolhidas para o estudo. Além disso, os limiares usados para todas as estratégias escolhidas para o estudo foram definidos segundo as opiniões dos três especialistas. Assim, as estratégias escolhidas na fase de ajustes, para detecção de anomalias definidas pelos especialistas, para o domínio em estudo, são apresentadas nas Tabelas II e III. Ferramenta de Detecção Escolhida. 
Entre as ferramentas disponíveis para a detecção de anomalias de código, diversas são baseadas nas estratégias de detecção propostas em [2]. Mesmo assim, para que fosse possível realizar um estudo de estratégias através do mapeamento de interesses, foi necessário escolher uma ferramenta que possibilitasse essa análise. Assim, a ferramenta escolhida foi SCOOP (Smells Co-Occurrences Pattern Analyzer) [15]. Além disso, SCOOP já foi usada com sucesso em estudos empíricos anteriores, tais como aqueles reportados em [16][17].

TABELA II. ESTRATÉGIAS DE DETECÇÃO SUGERIDAS INTEIRAMENTE PELOS ESPECIALISTAS

Anomalia | Estratégia
God Class EspLoc | (LOC > 150) and (CBO > 6)
God Class EspNom | (NOM > 15) and (CBO > 6)
Long Method Esp | (LOC > 50) and (CC > 5)
Shotgun Surgery Esp | (CC > 7) and (AM > 7)

a. Accessed Methods (AM) representa a quantidade de métodos externos utilizados por um método [2].

TABELA III. ESTRATÉGIAS DE DETECÇÃO SUGERIDAS NA LITERATURA COM LIMIARES AJUSTADOS PELOS ESPECIALISTAS

Anomalia | Estratégia
God Class Lit | (ATFD > 5) and (WMC > 46) and (TCC < 33)
Long Method Lit | (LOC > 50) and (CC > 6) and (MaxNesting > 5) and (NOAV > 3)
Shotgun Surgery Lit | (FanOut > 16)

b. Access to Foreign Data (ATFD) representa o nº de atributos de classes externas que são acessados diretamente ou através de métodos de acesso [12].
c. Weighted Method Count (WMC) representa a soma da complexidade ciclomática de todos os métodos de uma classe [22][23].
d. Tight Class Cohesion (TCC) representa o nº relativo de pares de métodos de uma classe que acessam em comum pelo menos um atributo da classe avaliada [24].
e. Number of Accessed Variables (NOAV) representa o nº total de variáveis acessadas diretamente pelo método avaliado [2].

TABELA IV. RELAÇÃO DOS INTERESSES MAPEADOS DOS SISTEMAS USADOS NO ESTUDO

Sistema | Interesses mapeados (a–j)
A | x x x x x x x x x x
B | x x x x x x x
C | x x x x x x x
D | x x x x x x x
E | x x x x x x x
F | x x x x x x x

TABELA V. DESCRIÇÃO DO TAMANHO DOS SISTEMAS USADOS NO ESTUDO

Sistema | NLOC | Nº de classes
A | |
B | |
C | |
D | |
E | |
F | |
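A título de ilustração, cada estratégia das Tabelas II e III pode ser vista como um predicado lógico sobre valores de métricas de uma classe ou método. O esboço em Java abaixo é apenas ilustrativo: os nomes ClassMetrics, loc, cbo e nom são hipotéticos e não correspondem à API da SCOOP.

// Esboço ilustrativo: estratégias de detecção como predicados sobre métricas.
class ClassMetrics {
    final String nome;
    final int loc;  // linhas de código
    final int cbo;  // Coupling Between Objects
    final int nom;  // Number of Methods

    ClassMetrics(String nome, int loc, int cbo, int nom) {
        this.nome = nome;
        this.loc = loc;
        this.cbo = cbo;
        this.nom = nom;
    }
}

final class EstrategiasDeDeteccao {
    // God Class EspLoc (Tabela II): (LOC > 150) and (CBO > 6)
    static boolean godClassEspLoc(ClassMetrics c) {
        return c.loc > 150 && c.cbo > 6;
    }

    // God Class EspNom (Tabela II): (NOM > 15) and (CBO > 6)
    static boolean godClassEspNom(ClassMetrics c) {
        return c.nom > 15 && c.cbo > 6;
    }
}

As estratégias da Tabela III seguem a mesma forma, apenas com outras métricas e outros limiares.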

100 Interesses Mapeados nos Sistemas em Estudo. Para avaliar se é possível diminuir a ocorrência de falsos positivos, ao considerar as características de conjuntos de elementos com responsabilidades bem definidas, foi necessário realizar o mapeamento dos interesses em classes dos seis sistemas, através do acompanhamento dos especialistas do domínio. Durante o mapeamento dos interesses, primeiro foram observados os interesses mais gerais, como, por exemplo, interface, persistência e recursos auxiliares. Em seguida, foi realizado o mapeamento dos interesses relacionados especificamente ao domínio das aplicações. Através do acompanhamento dos especialistas do domínio pode-se garantir a identificação dos elementos de código para cada um dos interesses mapeados para o domínio. Segundo os especialistas do domínio, existia, de fato, um conjunto razoável de interesses recorrentes do domínio. A partir da Tabela IV é possível observar o grau de recorrência dos interesses mapeados nos sistemas escolhidos para o estudo. Os interesses escolhidos são representados por letras minúsculas na tabela, mas nomeados em próximas subseções do artigo. A Tabela V descreve o tamanho dos sistemas escolhidos para o estudo, em número de linhas de código (NLOC) e número de classes. Mesmo que os sistemas variem em tamanho, segundo os especialistas do domínio, a proximidade estrutural dos sistemas é observada nos conjuntos de classes mapeadas para interesses recorrentes nos sistemas. IV. RESULTADOS E DISCUSSÕES Essa seção apresenta os resultados do estudo. A seção IV.A apresenta os resultados da etapa de ajustes. A seção IV.B apresenta os resultados sobre o reuso das estratégias e a seção IV.C apresenta os resultados sobre o percentual de falsos positivos ao considerar o mapeamento de interesses. Nas tabelas a seguir, a coluna NO/FP indica os valores de: nº de ocorrências de anomalias encontradas pelas estratégias / nº de falsos positivos identificados pelo especialista do domínio. A coluna %FP indica o percentual de falsos positivos identificados pelo especialista do domínio, em relação ao total de ocorrências de anomalias encontradas pelas estratégias. Destacamos em negrito os percentuais de falsos positivos acima de 33% para facilitar a identificação dos casos em que o resultado da estratégia não foi eficaz. A. Resultado da fase de ajustes Como mencionado, nesta fase avaliamos o percentual de falsos positivos de cada estratégia, de acordo com o julgamento do especialista. A Tabela VI apresenta o número e o percentual de falsos positivos indicados pelo especialista para cada estratégia, quando aplicadas aos sistemas A e B. A escolha desses sistemas para a fase de ajustes se deve especificamente à disponibilidade imediata do especialista E1 (Tabela I). Através da Tabela VI é possível perceber que apenas uma das estratégias (God Class Lit) excedeu o percentual de falsos positivos (33%), na média dos dois sistemas, proposto para o estudo. Isso significa que, embora as métricas utilizadas sejam recorrentes da literatura, os limiares propostos pelos especialistas não foram muito bons. TABELA VI. RESULTADO DA FASE DE AJUSTES. Estratégia Sistema A Sistema B NO/FP %FP NO/FP %FP God Class EspLoc 27/6 22% 17/7 41% God Class EspNom 15/3 20% 5/1 20% God Class Lit 4/3 75% 2/0 0% Long Method Esp 30/0 0% 19/3 16% Long Method Lit 1/0 0% 0/0 0% Shotgun Surgery Lit 61/12 20% 25/6 24% Shotgun Surgery Esp 21/0 0% 7/1 14% TABELA VII. 
ESTRATÉGIAS DE DETECÇÃO DA FASE DE REUSO Anomalia Estratégia God Class EspLoc (LOC > 150) and (CBO > 6) God Class EspNom (NOM > 15) and (CBO > 6) God Class Lit (ATFD > 6) and (WMC > 46) and (TCC < 11) Long Method Esp (LOC > 50) and (CC > 5) Long Method Lit (LOC > 50) and (CC > 6) and (MaxNesting > 5) and (NOAV > 3) Shotgun Surgery Lit (FanOut > 16) Shotgun Surgery Esp (CC > 7) and (AM > 7) Como um exercício, porém, investigamos se seria possível reduzir o percentual de falsos positivos para a estratégia God Class Lit apenas alterando os limiares dos seus componentes, preferencialmente sem criar novos falsos negativos. Neste caso, observamos que sim, realmente foi possível reduzir para 0% o percentual de falsos positivos da anomalia God Class Lit. Na fase de reuso, como veremos a seguir, as estratégias foram então reaplicadas a outros quatro sistemas do mesmo domínio. Objetivamos na fase de reuso observar se é possível reusar as estratégias diretamente (sem modificação nos limiares) em outros sistemas do mesmo domínio. B. Resultado da etapa de reuso Na segunda fase, as estratégias apresentadas anteriormente (Tabela VII) foram aplicadas aos sistemas C, D, E e F. As estratégias God Class EspLoc e God Class EspNom, quando aplicadas ao sistema D, resultaram em um percentual de falsos positivos de 80%. A estratégia Shotgun Surgery Lit, quando aplicada ao sistema C, resultou em 76% de falsos positivos. Mesmo assim, nenhuma das estratégias definidas para a segunda fase resultou em mais do que 30% de falsos positivos, quando aplicadas aos sistemas A e E. A partir da Tabela VIII, é importante observar então que God Class Lit e Long Method Lit mantiveram os resultados abaixo de 33% para todos os sistemas avaliados. As estratégias que não sofreram qualquer adaptação, por outro lado, variaram um pouco em termos do percentual de falsos positivos. De forma geral, é possível perceber que houve um reuso satisfatório (83%) tanto das estratégias definidas em conjunto com os especialistas (God Class EspLoc e EspNom, Long Method Esp e Shotgun Surgery Esp) quanto das estratégias com limiares definidos na literatura (God Class Lit, Long Method Lit e Shotgun Surgery Lit). Pode-se concluir pelos resultados desta fase de análise do reuso que existe certa tendência de comportamento padrão entre sistemas de um mesmo domínio, apesar de uns poucos casos peculiares que encorajaram e desencorajaram futuras adaptações nos limiares.
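Os critérios da etapa de reuso e o cálculo do percentual de falsos positivos descritos anteriormente podem ser resumidos, a título de ilustração, no esboço abaixo; trata-se apenas de um rascunho com nomes hipotéticos, e não do código usado no estudo.

import java.util.List;

// Esboço ilustrativo: %FP por sistema e classificação do reuso de uma
// estratégia segundo os critérios da etapa de reuso (limiar de 33%).
final class AvaliacaoDeReuso {
    static final double LIMIAR_FP = 33.0;

    // %FP = falsos positivos / total de ocorrências identificadas pela ferramenta
    static double percentualFalsosPositivos(int falsosPositivos, int ocorrencias) {
        return ocorrencias == 0 ? 0.0 : 100.0 * falsosPositivos / ocorrencias;
    }

    // Reuso total: nenhum sistema excede 33%; reuso parcial: um ou dois sistemas
    // excedem o limiar; nenhum reuso: mais de dois sistemas excedem o limiar.
    static String classificarReuso(List<Double> percentuaisPorSistema) {
        long excederam = percentuaisPorSistema.stream()
                .filter(p -> p > LIMIAR_FP)
                .count();
        if (excederam == 0) return "Reuso total";
        if (excederam <= 2) return "Reuso parcial";
        return "Nenhum reuso";
    }
}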

101 TABELA VIII. OCORRÊNCIAS DE FALSOS POSITIVOS E ANOMALIAS NA SEGUNDA FASE. Sistemas Estratégia A B C D E F NO/FP %FP NO/FP %FP NO/FP %FP NO/FP %FP NO/FP %FP NO/FP %FP God Class EspLoc 27/6 22% 17/7 41% 17/7 41% 10/8 80% 30/5 17% 24/7 29% God Class EspNom 15/3 20% 5/1 20% 6/2 33% 5/4 80% 10/3 30% 8/4 50% God Class Lit 4/3 0% 2/0 0% 0/0 0% 0/0 0% 3/0 0% 2/0 0% Long Method Esp 30/0 0% 19/3 16% 6/1 17% 5/2 40% 40/2 5% 26/3 12% Long Method Lit 1/0 0% 0/0 0% 0/0 0% 0/0 0% 4/0 0% 0/0 0% Shotgun Surgery Lit 61/12 20% 25/6 24% 17/13 76% 13/1 8% 48/1 2% 44/2 5% Shotgun Surgery Esp 21/0 0% 7/2 28% 0/0 0% 1/0 0% 12/0 0% 9/0 0% Aplicando novas adaptações nos limiares, observamos que certas características comuns entre os sistemas certamente podem influenciar positivamente no grau de reuso das estratégias. Por exemplo, considerando a estratégia God Class EspNom, quando aplicada o sistemas F, gerou um número de falsos positivos onde 75% dos casos o valor do componente CBO é igual a 10. Neste caso, alterando o valor do componente CBO para 10, o número de falsos positivos cai para 20%, no caso do sistema F, e aumenta para 27% no caso do sistema A. No entanto para o sistema B, o número de falsos positivos cai para 0%, para o sistema E 12% e sistema D 50%. Mesmo com uma piora no caso do sistema C para 40%, um pequeno ajuste nos limiares mostrou um melhor equilíbrio dentro dos sistemas para um mesmo domínio. Outro exemplo pode ser mais criterioso. Assim, considerando a estratégia God Class EspNom, quando aplicada ao sistema D, gerou um número de falsos positivos onde 100% dos casos o valor do componente CBO é menor que 18. Neste caso, alterando o valor do componente CBO para 18, o número de falsos positivos cai para 0%, nos sistemas B, D, E e F. Mesmo assim, o percentual de falsos positivos se mantém no sistema C e aumenta para 25% no sistema A. Dessa forma, com um ajuste mais criterioso é possível diminuir para 0% o percentual de falsos positivos em quatro dos seis sistemas. Em um segundo caso, analisando o resultado da aplicação da estratégia God Class EspLoc nos sistemas C e D, constatamos um número de falsos positivos relativamente maior do que para os demais sistemas (E e F), onde ela apresentou valores de falsos positivos menor que 33%. C. Resultados da etapa de interesses Ao avaliar a segunda questão de pesquisa, investigou-se a possibilidade de diminuir a ocorrência de falsos positivos das estratégias de detecção. Observamos se tal diminuição pode ocorrer caso fossem definidas estratégias para as classes de cada interesse do domínio. Com este propósito, nós aplicamos cada uma das estratégias de detecção, apresentadas anteriormente, em classes de cada interesse. As mesmas métricas e limiares foram mantidos. Desta forma, conseguimos observar se: (i) haveria potencial benefício em utilizar estratégias de detecção para cada interesse do domínio este caso foi observado quando as estratégias tiveram um número maior do que 33% falsos positivos, e (ii) foi suficiente o uso de estratégias no programa como um todo este caso foi observado quando as estratégias tiveram um número menor do que 33% falsos positivos. As tabelas a seguir apresentam o número de ocorrências (NO) e percentagem de falsos positivos (FP) para cinco estratégias da fase anterior. Duas delas tiveram 0% de falsos positivos na ampla maioria dos casos. 
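Apenas para tornar concreta a aplicação das estratégias por interesse, o esboço abaixo (com nomes hipotéticos, reutilizando o tipo ClassMetrics do esboço anterior) filtra as classes mapeadas para um interesse antes de avaliar o predicado da estratégia, mantendo as mesmas métricas e limiares.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Esboço ilustrativo: aplica uma estratégia de detecção apenas às classes
// mapeadas para um determinado interesse do domínio.
final class DeteccaoPorInteresse {
    static List<ClassMetrics> detectar(Map<String, List<ClassMetrics>> classesPorInteresse,
                                       String interesse,
                                       Predicate<ClassMetrics> estrategia) {
        return classesPorInteresse.getOrDefault(interesse, Collections.emptyList())
                .stream()
                .filter(estrategia)
                .collect(Collectors.toList());
    }
}

Por exemplo, detectar(mapa, "Persistência", EstrategiasDeDeteccao::godClassEspLoc) avaliaria a estratégia God Class EspLoc somente nas classes do interesse Persistência.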
A partir dos resultados apresentados nas Tabelas IX a XIII, percebe-se que não haveria necessidade de especialização das estratégias para cada interesse: (i) tanto para os casos de interesses Autenticação/Segurança e Auxiliar, que são mais gerais (isto é, podem ocorrer frequentemente em aplicações de outros domínios), (ii) como para os interesses Ações, Engine e Serviços, que são características mais específicas deste domínio. Nesse sentido, ajustar os limiares para as estratégias considerando o mapeamento de interesses não seria benéfico para reduzir significativamente o percentual de falsos positivos nos casos acima. Por outro lado, note que o contrário pode ser dito para o caso dos interesses Persistência, Interface, Indicadores e Tarefas. Em todos estes casos de interesses, notase nas tabelas que os números de falsos positivos, independentemente da anomalia analisada, estão bem acima do limiar de 33% em vários casos. TABELA IX. OCORRÊNCIAS DA ESTRATÉGIA SHOTGUN SURGERY LIT VISANDO O MAPEAMENTO DE INTERESSES Interesse NO/FP % FP Ações 83/15 18% Autenticação/segurança 1/0 0% Auxiliar 81/12 15% Engine 7/1 14% Exceção 1/0 0% Indicadores 1/1 100% Interface 5/2 40% Persistência 16/5 31% Serviços 11/2 18% Tarefas 2/1 50% TABELA X. OCORRÊNCIAS DA ESTRATÉGIA GOD CLASS ESPLOC VISANDO O MAPEAMENTO DE INTERESSES Interesse NO/FP % FP Ações 30/7 23% Autenticação/segurança 2/0 0% Auxiliar 52/12 23% Engine 1/0 0% Interface 10/6 60% Persistência 18/11 61% Serviços 10/3 30% Tarefas 2/1 50%

102 TABELA XI. OCORRÊNCIAS DA ESTRATÉGIA LONG METHOD ESP VISANDO O MAPEAMENTO DE INTERESSES Interesse NO/FP % FP Ações 46/3 7% Autenticação/segurança 2/0 0% Auxiliar 52/5 10% Engine 4/0 0% Persistência 13/5 38% Serviços 7/0 0% Tarefas 2/0 0% TABELA XII. OCORRÊNCIAS DA ESTRATÉGIA GOD CLASS ESPNOM VISANDO O MAPEAMENTO DE INTERESSES Interesse NO/FP % FP Ações 6/2 33% Auxiliar 19/3 16% Indicadores 2/1 50% Interface 10/6 60% Persistência 7/6 86% Servicos 4/0 0% Tarefas 1/0 0% TABELA XIII. OCORRÊNCIAS DA ESTRATÉGIA SHOTGUN SURGERY ESP VISANDO O MAPEAMENTO DE INTERESSES Interesse NO/FP % FP Ações 24/0 0% Auxiliar 16/2 13% Engine 3/0 0% Persistência 2/0 0% Serviços 5/0 0% D. Trabalhos relacionados Em 2011 [19], Zhang, Hall e Baddoo, realizaram uma revisão sistemática da literatura, para descrever o estado da arte sobre anomalias de código e refatoração. Esse trabalho foi baseado em artigos de conferências e revistas, entre 2000 e Junho de Segundo os autores, poucos trabalhos que relatam estudos empíricos sobre a detecção de anomalias. A grande maioria dos trabalhos tem o objetivo de mostrar novas ferramentas e métodos para apoiar a detecção de anomalias. Em 2010 [18], Guo, Seaman, Zazworka e Shull, propuseram a análise de características do domínio, para a adaptação das estratégias de detecção de anomalias. Esse trabalho foi realizado em um ambiente real de manutenção de sistemas. Além disso, a adaptação dos limiares das estratégias foi apoiada pela análise de especialistas do domínio. Mesmo assim, esse trabalho não avalia o reuso das estratégias de detecção para outras aplicações do mesmo domínio. Em 2012 [28], Ferreira, Bigonha, Bigonha, Mendes e Almeida, identificaram limiares para métricas de software orientado a objetos. Esse trabalho foi realizado em 40 sistemas Java, baixados a partir do SourceForge ( Nesse trabalho foram identificados limiares para seis métricas, para onze domínios de aplicações. A partir desse trabalho é necessário investigar o reuso desses limiares em projetos da indústria. Em 2012 [29], Fontana, Braione e Zanoni, revisaram o cenário atual das ferramentas de detecção automática de anomalias. Para isso, realizaram a comparação de quatro ferramentas de detecção, em seis versões de projetos de software de tamanho médio. Segundo os autores, é interessante refinar o uso das estratégias, considerando informações do domínio dos sistemas analisados. Ainda, existe um esforço manual para avaliar as anomalias que são caracterizadas como falsos positivos. Nesse sentido, percebe-se o esforço investido na adaptação das estratégias de detecção. Dessa forma, torna-se motivador investigar estratégias de detecção que possam ser reusadas com sucesso. E. Ameaças à validade Ameaças à Validade de Construto. Durante o experimento, os três especialistas do domínio participaram da definição das características do domínio em estudo, da escolha dos seis sistemas, do mapeamento de interesses de cada sistema, da escolha das anomalias, da definição das estratégias e os limiares e a classificação das ocorrências de anomalias. Ao avaliar um domínio específico, é necessária a participação de alguém que vive o desenvolvimento neste domínio no seu dia a dia. Além disso, os especialistas possuem conhecimento sobre boas práticas e experiências profissionais prévias no domínio escolhido de mais de dois anos. Validade de Conclusão e Validade Externa. 
Para a conclusão do estudo, o percentual de falsos positivos das estratégias é avaliado a partir da relação entre a quantidade de falsos positivos classificados pelos especialistas e a quantidade de ocorrências identificadas pela ferramenta. O limiar que define o reuso das estratégias é de 33% de falsos positivos. Dessa forma é possível garantir que a estratégia é capaz de identificar apenas um falso positivo, a cada três ocorrências. Para amenizar as ameaças à validade externa, é importante ratificar que os seis sistemas em estudo foram escolhidos a partir da especificação do domínio em estudo. Ainda, escolha dos sistemas teve o apoio de especialistas que possuem mais de dois anos de experiência no domínio. V. CONCLUSÕES Para que fosse possível investigar o reuso de estratégias de detecção em vários projetos de software do mesmo domínio, foi conduzido um estudo de múltiplos casos da indústria. O estudo investigou o reuso de sete estratégias de detecção, relacionadas a três anomalias, em seis projetos de um domínio específico. Segundo o nosso estudo, em alguns casos, o reuso das estratégias de detecção pode ser melhorado, se aplicadas a programas do mesmo domínio, sem gerar um efeito colateral. Mesmo assim, em outros casos, para realizar uma melhoria no reuso das estratégias, é possível que sejam criados falsos negativos. No total, dos sete casos que excederam o limiar de 33%, em quatro casos existe pelo menos um cenário onde duas classes com estruturas similares foram classificadas uma como anomalia e outra como falso positivo. Isso mostrou que em certos casos é impossível definir um limiar que elimine boa parte dos falsos positivos sem gerar falsos negativos. Como uma consequência direta, pode-se afirmar que existe um limite no grau de reuso das estratégias, isto é, uma nova adaptação na tentativa de diminuir o percentual de falsos positivos pode aumentar o número de falsos negativos.

103 Além disso, percebe-se que tanto em interesses como Autenticação/Segurança e Auxiliar, que são mais gerais, quanto os interesses Ações, Engine e Serviços, que são características mais específicas deste domínio, não existe a necessidade de especialização das estratégias. Nesse sentido, ajustar os limiares para as estratégias considerando o mapeamento de interesses não seria benéfico para reduzir significativamente o percentual de falsos positivos nos casos acima. Por outro lado, o contrário pode ser dito para o caso dos interesses Persistência, Interface, Indicadores e Tarefas. Ainda, a partir dos resultados, percebeu-se que duas estratégias de detecção de anomalias escolhidas a partir da literatura, resultaram em 0% de falsos positivos em todos os casos em que encontraram ocorrências. Mesmo assim, essas estratégias não detectaram algumas ocorrências identificadas pelas estratégias mais simples, para a mesma anomalia. Essas ocorrências das anomalias mais simples já haviam sido classificadas pelo especialista do domínio e não eram falsos positivos. Essa evidência motiva trabalhos futuros sobre a variedade da complexidade das estratégias de detecção de anomalias. Ainda, como trabalho futuro, o presente trabalho pode ser estendido a outros cenários (porém não limitados), como: (i) a investigação de estratégias de detecção com reuso em outros domínios e (ii) a investigação de outras estratégias de detecção neste e em outros domínios. REFERÊNCIAS [1] M. Fowler: Refactoring: Improving the Design of Existing Code. New Jersey: Addison Wesley, p. [2] R. Marinescu, M. Lanza: Object-Oriented Metrics in Practice. Springer, p. [3] N. Tsantalis, T. Chaikalis, A. Chatzigeorgiou: JDeodorant: Identification and removal of typechecking bad smells. In Proceedings of CSMR 2008, pp [4] PMD. Disponível em [5] iplasma. Disponível em [6] InFusion. Disponível em [7] E. Murphy-Hill, A. Black: An interactive ambient visualization for code smells, Proceedings of SOFTVIS '10, USA, October [8] F. Fontana, E. Mariani, A. Morniroli, R. Sormani, A. Tonello: An Experience Report on Using Code Smells Detection Tools. IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), [9] F. Fontana, V. Ferme, S. Spinelli: Investigating the impact of code smells debt on quality code evaluation. Third International Workshop on Managing Technical Debt, [10] E. Emden, L. Moonen: Java quality assurance by detecting code smells. In in Proceedings of the 9th Working Conference on Reverse Engineering, [11] N. Fenton, S. Pfleeger: Software metrics: a rigorous and practical approach. PWS Publishing Co., [12] R. Marinescu: Measurement and Quality in Object-Oriented Design. Proceedings of the 21st IEEE International Conference on Software Maintenance, [13] R. Marinescu: Detection strategies: Metrics-based rules for detecting design flaws. Proceedings of the 20th IEEE International Conference on Software Maintenance, [14] N. Moha, Y. Gu eh eneuc, A. Meur, L. Duchien, A. Tiberghien: From a domain analysis to the specification and detection of code and design smells, Formal Aspects of Computing, [15] Scoop. Disponível em: [16] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvicovic, A. Staa: "Are Automatically-Detected Code Anomalies Relevant to Architectural Modularity? An Exploratory Analysis of Evolving Systems". In Proceedings of the 11th International Conference on Aspect-Oriented Software Development (AOSD'12), Postdam, Germany, March [17] I. Macia, A. Garcia, A. 
Staa: An Exploratory Study of Code Smells in Evolving Aspect-Oriented Systems. Proceedings of the 10th International Conference on Aspect-Oriented Software Development, [18] Y. Guo, C. Seaman, N. Zazworka, F. Shull: Domain-specific tailoring of code smells: an empirical study. Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, [19] M. Zhang, T. Hall, e N. Baddoo: Code bad smells: a review of current knowledge. Journal of Software Maintenance and Evolution: research and practice, 23(3), , [20] I. Macia, R. Arcoverde A. Garcia, C. Chavez e A. von Staa: On the Relevance of Code Anomalies for Identifying Architecture Degradation Symptoms. In Software Maintenance and Reengineering (CSMR), th European Conference on (pp ). IEEE. [21] L. Hochstein e M. Lindvall: "Combating architectural degeneration: a survey." Information and Software Technology (2005): [22] T. J. McCabe: A Complexity Measure. IEEE Transactions on Software Engineering, 2(4): , [23] S. R. Chidamber e C. F. Kemerer: A metrics suite for object oriented design. Software Engineering, IEEE Transactions on, v. 20, n. 6, p , [24] J. Bieman e B. Kang: Cohesion and reuse in an object-oriented system. In Proceedings ACM Symposium on Software Reusability, [25] S. Olbrich, D. S. Cruzes, V. Basili, e N. Zazworka: The evolution and impact of code smells: A case study of two open source systems. In Proceedings of the rd International Symposium on Empirical Software Engineering and Measurement (pp ). IEEE Computer Society, [26] F. Khomh, M. Di Penta, e Y. G. Guéhéneuc: An exploratory study of the impact of code smells on software change-proneness. In Reverse Engineering, WCRE'09. 16th Working Conference on (pp ). IEEE [27] A. Lozano, M. Wermelinger, e B. Nuseibeh: "Assessing the impact of bad smells using historical information." In Ninth international workshop on Principles of software evolution: in conjunction with the 6th ESEC/FSE joint meeting (pp ). ACM, [28] K. A. Ferreira, M. A. Bigonha, R. S. Bigonha, L. F. Mendes e H. C. Almeida: Identifying thresholds for object-oriented software metrics. Journal of Systems and Software, 85(2), , [29] F. Fontana, P. Braione, M. Zanoni: Automatic detection of bad smells in code: An experimental assessment.. Publicação Electrônica em JOT: Journal of Object Technology, v. 11, n. 2, ago. 2012

104 F3T: From Features to Frameworks Tool Matheus Viana, Rosângela Penteado, Antônio do Prado Department of Computing Federal University of São Carlos São Carlos, SP, Brazil {matheus viana, rosangela, Rafael Durelli Institute of Mathematical and Computer Sciences University of São Paulo São Carlos, SP, Brazil Abstract Frameworks are used to enhance the quality of applications and the productivity of development process since applications can be designed and implemented by reusing framework classes. However, frameworks are hard to develop, learn and reuse, due to their adaptive nature. In this paper we present the From Features to Framework Tool (F3T), which supports the development of frameworks in two steps: Domain Modeling, in which the features of the framework domain are modeled; and Framework Construction, in which the source-code and the Domain-Specific Modeling Language (DSML) of the framework are generated from the features. In addition, the F3T also supports the use of the framework DSML to model applications and generate their source-code. The F3T has been evaluated in a experiment that is also presented in this paper. I. INTRODUCTION Frameworks are reusable software composed of abstract classes implementing the basic functionality of a domain. When an application is developed through framework reuse, the functionality provided by the framework classes is complemented with the application requirements. As the application is not developed from scratch, the time spent in its development is reduced and its quality is improved [1] [3]. Frameworks are often used in the implementation of common application requirements, such as persistence [4] and user interfaces [5]. Moreover, a framework is used as a core asset when many closely related applications are developed in a Software Product Line (SPL) [6], [7]. Common features of the SPL domain are implemented in the framework and applications implement these features reusing framework classes. However, frameworks are hard to develop, learn and reuse. Their classes must be abstract enough to be reused by applications that are unknown beforehand. Framework developers must define the domain of applications for which the framework is able to be instantiated, how the framework is reused by these applications and how it accesses application-specific classes, among other things [7], [8]. Frameworks have a steep learning curve, since application developers must understand their complex design. Some framework rules may not be apparent in its interface [9]. A framework may contain so many classes and operations that even developers who are conversant with it may make mistakes while they are reusing this framework to develop an application. In a previous paper we presented an approach for building Domain-Specific Modeling Languages (DSML) to support framework reuse [10]. A DSML can be built by identifying framework features and the information required to instantiate them. Thus, application models created with a DSML can be used to generate application source-code. Experiments have shown that DSMLs protect developers from framework complexities, reduce the occurrence of mistakes made by developers when they are instantiating frameworks to develop applications and reduce the time spent in this instantiation. In another paper we presented the From Features to Framework (F3) approach, which aims to reduce framework development complexities [11]. 
In this approach the domain of a framework is defined in a F3 model, which is a extended version of the feature model. Then a set of patterns, named F3 patterns, guides the developer to design and implement a white box framework according to its domain. One of the advantages of this approach is that, besides showing how developers can proceed, the F3 patterns systematizes the process of framework development. This systematization allowed that the development of frameworks could be automatized by a tool. Therefore, in this paper we present the From Features to Framework Tool (F3T), which is a plug-in for the Eclipse IDE that supports the use of the F3 approach to develop and reuse frameworks. This tool provides an editor for developers to create a F3 model of a domain. Then, the source-code and the DSML of a framework can be generated from the domain defined in this model. The source-code of the framework is generated as a Java project, while the DSML is generated as a set of Eclipse IDE plug-ins. After being installed, a DSML can be used to model applications. Then, the F3T can be used again to generate the application source-code from models created with the framework DSML. This application reuses the framework previously generated. We also have carried out an experiment in order to evaluate whether the F3T facilitates framework development or not. The experiment analyzed the time spent in framework development and the number of problems found the source-code of the outcome frameworks. The remainder of this paper is organized as follows: background concepts are discussed in Section II; the F3 approach is commented in Section III; the F3T is presented in Section IV; an experiment that has evaluated the F3T is presented in Section V; related works are discussed in Section VI; and conclusions and future works are presented in Section VII. II. BACKGROUND The basic concepts applied in the F3T and its approach are presented in this section. All these concepts have reuse as their basic principle. Reuse is a practice that aims: to reduce time spent in a development process, because the software

105 is not developed from scratch; and to increase the quality of the software, since the reusable practices, models or code were previously tested and granted as successful [12]. Reuse can occur in different levels: executing simple copy/paste commands; referencing operations, classes, modules and other blocks in programming languages; or applying more sophisticated concepts, such as patterns, frameworks, generators and domain engineering [13]. Patterns are successful solutions that can be reapplied to different contexts [3]. They provide reuse of experience helping developers to solve common problems [14]. The documentation of a pattern mainly contains its name, the context it can be applied, the problem it is intended to solve, the solution it proposes, illustrative class models and examples of use. There are patterns for several purposes, such as design, analysis, architectural, implementation, process and organizational patterns [15]. Frameworks act like skeletons that can be instantiated to implement applications [3]. Their classes embody an abstract design to provide solutions for domains of applications [9]. Applications are connected to a framework by reusing its classes. Unlike library classes, whose execution flux is controlled by applications, frameworks control the execution flux accessing the application-specific code [15]. The fixed parts of the frameworks, known as frozen spots, implement common functionality of the domain that is reused by all applications. The variable parts, known as hot spots, can change according to the specifications of the desired application [9]. According to the way they are reused, frameworks can be classified as: white box, which are reused by class specialization; black box, which work like a set of components; and gray box, which are reused by the two previous ways [2]. Generators are tools that transform an artifact into another [16], [17]. There are many types of generators. The most common are Model-to-Model (M2M), Model-to-Text (M2T) and programming language translators [18]. Such as frameworks, generators are related to domains. However, some generators are configurable, being able to change their domain [19]. In this case, templates are used to define the artifacts that can be generated. A domain of software consists of a set of applications that share common features. A feature is a distinguishing characteristic that aggregates value to applications [20] [22]. For example, Rental Transaction, Destination Party and Resource could be features of the domain of rental applications. Different domain engineering approaches can be found in the literature [20], [22] [24]. Although there are differences between them, their basic idea is to model the features of a domain and develop the components that implement these features and are reused in application engineering. The features of a domain are defined in a feature model, in which they are arranged in a tree-view notation. They can be mandatory or optional, have variations and require or exclude other features. The feature that most represents the purpose of the domain is put in the root and a top-down approach is applied to add the other features. For example, the main purpose of the domain of rental applications is to perform rentals, so Rental is supposed to be the root feature. The other features are arranged following it. Domains can also be modeled with metamodel languages, which are used to create Domain-Specific Modeling Languages (DSML). 
Metamodels, such as defined in the MetaObject Facility (MOF) [25], are similar to class models, which makes them more appropriate to developers accustomed to the UML. While in feature models, only features and their constraints are defined, metaclasses in the metamodels can contain attributes and operations. On the other hand, feature models can define dependencies between features, while metamodels depend on declarative languages to do it [18]. A generator can be used along with a DSML to transform models created with this DSML into code. When these models represent applications, the generators are called application generators. III. THE F3 APPROACH The F3 is a Domain Engineering approach that aims to develop frameworks for domains of applications. It has two steps: 1) Domain Modeling, in which framework domain is determined; and 2) Framework Construction, in which the framework is designed and implemented according to the features of its domain. In Domain Modeling step the domain is defined in a feature model. However, an extended version of feature model is used in the F3 approach, because feature models are too abstract to contain information enough for framework development and metamodels depend on other languages to define dependencies and constraints. These extended version, called F3 model, incorporates characteristics of both feature models and metamodels. As in conventional feature models, the features in the F3 models can also be arranged in a tree-view, in which the root feature is decomposed in other features. However, the features in the F3 models do not necessarily form a tree, since a feature can have a relationship targeting a sibling or even itself, as in metamodels. The elements and relationships in F3 models are: Feature: graphically represented by a rounded square, it must have a name and it can contain any number of attributes and operations; Decomposition: relationship that indicates that a feature is composed of another feature. This relationship specifies a minimum and a maximum multiplicity. The minimum multiplicity indicates whether the target feature is optional (0) or mandatory (1). The maximum multiplicity indicates how many instances of the target feature can be associated to each instance of the source feature. The valid values to the maximum multiplicity are: 1 (simple), for a single feature instance; * (multiple), for a list of a single feature instance; and ** (variant), for any number of feature instances. Generalization: relationship that indicates that a feature is a variation generalized by another feature. Dependency: relationship that defines a condition for a feature to be instantiated. There are two types of dependency: requires, when the A feature requires the B feature, an application that contains the A feature also has to include the B feature; and excludes, when the A feature excludes the B feature, no application can include both features.
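As an illustration of how these multiplicities and hooks typically surface in generated code, the sketch below shows one plausible Java mapping; the generic feature names (A, B, C, D) and the exact mapping are assumptions for illustration only, since the actual code units are prescribed by the F3 patterns presented next (Table I).

import java.util.List;

// Illustrative sketch (assumed mapping, generic feature names): how the
// multiplicities of decompositions could appear in the class generated for a
// source feature A.
abstract class A {
    protected B b;          // simple decomposition (max. 1): a single instance of B
    protected List<C> cs;   // multiple decomposition (max. *): a list of C instances
    // variant decomposition (max. **): the concrete variants of D are only known
    // in the applications, so the framework queries them through an abstract hook
    public abstract Class<?>[] getDClasses();
    // a minimum multiplicity of 0 (optional) simply allows b or cs to remain unset
}

abstract class B { }
abstract class C { }
abstract class D { }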

Framework Construction step has as output a white box framework for the domain defined in the previous step. The F3 approach defines a set of patterns to assist developers in designing and implementing frameworks from F3 models. The patterns treat problems that range from the creation of classes for the features to the definition of the framework interface. Some of the F3 patterns are presented in Table I.

TABLE I: Some of the F3 patterns.

Pattern | Purpose
Domain Feature | Indicates structures that should be created for a feature.
Mandatory Decomposition | Indicates code units that should be created when there is a mandatory decomposition linking two features.
Optional Decomposition | Indicates code units that should be created when there is an optional decomposition linking two features.
Simple Decomposition | Indicates code units that should be created when there is a simple decomposition linking two features.
Multiple Decomposition | Indicates code units that should be created when there is a multiple decomposition linking two features.
Variant Decomposition | Indicates code units that should be created when there is a variant decomposition linking two features.
Variant Feature | Defines a class hierarchy for features with variants.
Modular Hierarchy | Defines a class hierarchy for features with common attributes and operations.
Requiring Dependency | Indicates code units that should be created when a feature requires another one.
Excluding Dependency | Indicates code units that should be created when a feature excludes another one.

In addition to indicating the code units that should be created to implement the framework functionality, the F3 patterns also determine how the framework can be reused by the applications. For example, some patterns suggest including abstract operations in the framework classes that allow it to access application-specific information. In addition, the F3 patterns make the development of frameworks systematic, allowing it to be automated. Thus, the F3T tool was created to automate the use of the F3 approach, enhancing the process of framework development.

IV. THE F3T

The F3T assists developers in applying the F3 approach to develop white box frameworks and in reusing these frameworks through their DSMLs. The F3T is a plug-in for the Eclipse IDE, so developers can make use of the F3T resources, such as domain modeling, framework construction, application modeling through the framework DSML and application construction, as well as the other resources provided by the IDE. The F3T is composed of three modules, as seen in Figure 1: 1) Domain Module (DM); 2) Framework Module (FM); and 3) Application Module (AM).

Fig. 1: Modules of the F3T.

A. Domain Module

The DM provides a F3 model editor for developers to define domain features. This module has been developed with the support of the Eclipse Modeling Framework (EMF) and the Graphical Modeling Framework (GMF) [18]. The EMF was used to create a metamodel, in which the elements, relationships and rules of the F3 models were defined as described in Section III. The metamodel of F3 models is shown in Figure 2. From this metamodel, the EMF generated the source-code of the Model and Controller layers of the F3 model editor. The GMF has been used to define the graphical notation of the F3 models. This graphical notation can also be seen as the View layer of the F3 model editor.
With the GMF, the graphical figures and the menu bar of the editor were defined and linked to the elements and relationships defined in the metamodel of the F3 models. Then, the GMF generates the source-code of the graphical notation. The F3 model editor is shown in Figure 3 with an example of F3 model for the domain of trade and rental transactions. Fig. 2: Metamodel containing elements, relationships and rules of F3 models.

Fig. 3: F3 model for the domain of trade and rental transactions.

B. Framework Module

The FM is a M2T generator that transforms F3 models into framework source-code and DSML. Despite their graphical notation, F3 models are actually XML files, which makes them more accessible to other tools, such as a generator. The FM was developed with the support of the Java Emitter Templates (JET) in the Eclipse IDE [26]. The JET plug-in contains a framework that is a generic generator and a compiler that translates templates into Java files. These templates are XML files, in which tags are instructions to generate an output based on information in the input, and text is fixed content inserted in the output independently of the input. The Java files originated from the JET templates reuse the JET framework to compose a domain-specific generator. Thus, the FM depends on the JET plug-in to work. The templates of the FM are organized in two groups: one related to the framework source-code and another related to the framework DSML. Both groups are invoked from the main template of the DM generator. Part of the JET template which generates Java classes in the framework source-code from the features found in the F3 models can be seen as follows:

public <c:if test="($feature/@abstract)">abstract </c:if>
class <c:get select="$feature/@name"/> extends
<c:choose select="$feature/@variation">
  <c:when test=" true ">DVariation</c:when>
  <c:otherwise>
    <c:choose>
      <c:when test="$feature/dsuperfeature">
        <c:get select="$feature/dsuperfeature/@name"/>
      </c:when>
      <c:otherwise>DObject</c:otherwise>
    </c:choose>
  </c:otherwise>
</c:choose> {
  ...
}

The framework source-code that is generated by the FM is organized in a Java project identified by the domain name and the suffix .framework. The framework source-code is generated according to the patterns defined by the F3 approach. For example, the FM generates a class for each feature found in a F3 model. These classes contain the attributes and operations defined in their original features. All generated classes also, directly or indirectly, extend the DObject class, which implements non-functional requirements, such as persistence and logging. Generalization relationships result in inheritances and decomposition relationships result in associations between the involved classes. Additional operations are included in the framework classes to treat feature variations and constraints of the domains defined in the F3 models. For example, according to the Variant Decomposition F3 pattern, the getResourceTypeClasses operation was included in the code of the Resource class so that the framework can recognize which classes implement the ResourceType feature in the applications. Part of the code of the Resource class is presented as follows:

public abstract class Resource extends DObject {
  private int id;
  private String name;
  private List<ResourceType> types;

  public abstract Class<?>[] getResourceTypeClasses();
  ...
}

The framework DSML is generated as an EMF/GMF project identified only by the domain name. The FM generates the EMF/GMF models of the DSML, as seen in Figure 4.a, which was generated from the F3 model shown in Figure 3. Then, the source-code of the DSML must be generated by using the generator provided by the EMF/GMF in three steps: 1) using the EMF generator from the genmodel file (Figure 4.a); 2) using the GMF generator from the gmfmap file (Figure 4.b); and 3) using the GMF generator from the gmfgen file (Figure 4.c).
After this, the DSML will be composed of 5 plug-in projects in the Eclipse IDE. The projects that contain the source-code and the DSML plug-ins of the framework for the trade and rental transactions domain are shown in Figure 4.d. Fig. 4: Generation of the DSML plugins.
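The DObject superclass mentioned above is not shown in the paper. Purely as a hypothetical sketch of how such a class might centralize non-functional requirements like persistence and logging, one could imagine something along these lines:

import java.util.logging.Logger;

// Hypothetical sketch of DObject (its real API is not described in the paper):
// a common superclass concentrating non-functional concerns such as
// persistence and logging for all generated framework classes.
public abstract class DObject {
    protected final Logger logger = Logger.getLogger(getClass().getName());

    // Assumed persistence hooks; a real implementation would delegate to a
    // persistence mechanism (e.g. JDBC or an ORM).
    public void save() {
        logger.info("Saving " + getClass().getSimpleName());
    }

    public void delete() {
        logger.info("Deleting " + getClass().getSimpleName());
    }
}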

Fig. 5: Application model created with the framework DSML.

C. Application Module

The AM has also been developed with the support of JET. It generates application source-code from an application model based on a framework DSML. The templates of the AM generate classes that extend framework classes and override operations that configure framework hot spots. After the DSML plug-ins are installed in the Eclipse IDE, the AM recognizes the model files created from the DSML. An application model created with the DSML of the framework for the domain of trade and rental transactions is shown in Figure 5. Application source-code is generated in the source folder of the project where the application model is. The AM generates a class for each feature instantiated in the application model. Since the framework is white box, the application classes extend the framework classes indicated by the stereotypes in the model. It is expected that many class attributes requested by the application requirements have already been defined in the domain. Thus, these attributes are in the framework source-code and do not have to be defined in the application classes again. Part of the code of the Product class is presented as follows:

public class Product extends Resource {
  private float value;

  public Class<?>[] getResourceTypeClasses() {
    return new Class<?>[] { Category.class, Manufacturer.class };
  }
  ...
}

V. EVALUATION

In this section we present an experiment in which we evaluated the use of the F3T to develop frameworks, since the use of DSMLs to support framework reuse has been evaluated in a previous paper [10]. The experiment was conducted following all steps described by Wohlin et al. (2000) and it can be summarized as: (i) analyse the F3T, described in Section IV; (ii) for the purpose of evaluation; (iii) with respect to time spent and number of problems; (iv) from the point of view of the developer; and (v) in the context of MSc and PhD Computer Science students.

A. Planning

The experiment was planned to answer two research questions: RQ 1: Does the F3T reduce the effort to develop a framework?; and RQ 2: Does the F3T result in an outcome framework with a smaller number of problems? All subjects had to develop two frameworks, both applying the F3 approach, but one manually and the other with the support of the F3T. The context of our study corresponds to a multi-test within object study [27]; hence, the experiment consisted of experimental tests executed by a group of subjects to study a single tool. In order to answer the first question, we measured the time spent to develop each framework. Then, to answer the second question, we analyzed the frameworks developed by the subjects and identified and classified the problems found in the source-code. The planning phase was divided into seven parts, which are described in the next subsections:

1. Context Selection

26 MSc and PhD students of Computer Science participated in the experiment, which was carried out in an off-line setting. All participants had prior experience in software development, Java programming, patterns and framework reuse.

2. Formulation of Hypotheses

The experiment questions have been formalized as follows: RQ 1, Null hypothesis, H0: Considering the F3 approach, there is no significant difference, in terms of time, between developing frameworks with the support of F3T and doing it manually. Thus, the F3T does not reduce the time spent to develop frameworks.
This hypothesis can be formalized as: H0: µF3T = µmanual.

RQ 1, Alternative hypothesis, H1: Considering the F3 approach, there is a significant difference, in terms of time, between developing frameworks with the support of F3T and doing it manually. Thus, the F3T reduces the time spent to develop frameworks. This hypothesis can be formalized as: H1: µF3T ≠ µmanual.

RQ 2, Null hypothesis, H0: Considering the F3 approach, there is no significant difference, in terms of problems found in the outcome frameworks, between developing frameworks using the F3T and doing it manually. Thus, the F3T does not reduce the mistakes made by subjects while they are developing frameworks. This hypothesis can be formalized as: H0: µF3T = µmanual.

RQ 2, Alternative hypothesis, H1: Considering the F3 approach, there is a significant difference, in terms of problems found in the outcome frameworks, between developing frameworks using the F3T and doing it manually. Thus, the F3T reduces the mistakes made by subjects while they are developing frameworks. This hypothesis can be formalized as: H1: µF3T ≠ µmanual.

3. Variables Selection

The dependent variables of this experiment were time spent to develop a framework and number of problems found in the outcome frameworks. The independent variables were as follows:

109 Application: Each subject had to develop two frameworks: one (Fw1) for the domain of trade and rental transactions and the other (Fw2) for the domain of automatic vehicles. Both Fw1 and Fw2 had 10 features. Development Environment: Eclipse 4.2.1, Astah Community 6.4, F3T. Technologies: Java version Selection of Subjects The subjects were selected through a non probabilist approach by convenience, i.e., the probability of all population elements belong to the same sample is unknown. 5. Experiment Design The subjects were carved up in two blocks of 13 subjects: Block 1, development of Fw1 manually and development of Fw2 with the support of the F3T; Block 2, development of Fw2 manually and development of Fw1 with the support of the F3T. We have chosen use block to reduce the effect of the experience of the students, that was measured through a form in which the students answered about their level of experience in software development. This form was given to the subjects one week before the pilot experiment herein described. The goal of this pilot experiment was to ensure that the experiment environment and materials were adequate and the tasks could be properly executed. 6. Design Types The design type of this experiment was one factor with two treatments paired [27]. The factor in this experiment is the manner how the F3 approach was used to develop a framework and the treatments are the support of the F3T against the manual development. 7. Instrumentation All necessary materials to assist the subjects during the execution of this experiment were previously devised. These materials consisted of forms for collecting experiment data, for instance, time spent to develop the frameworks and a list of the problems were found in the outcome frameworks developed by each subject. In the end of the experiment, all subjects received a questionnaire to report about the F3 approach and the F3T. B. Operation The operation phase was divided into two parts, as described in the next subsections: 1. Preparation Firstly, the subjects received a characterization form, containing questions regarding their knowledge about Java programming, Eclipse IDE, patterns and frameworks. Then, the subjects were introduced to the F3 approach and the F3T. 2. Execution Initially, the subjects signed a consent form and then answered a characterization form. After this, they watched a presentation about frameworks, which included the description of some known examples and their hot spots. The subjects were also trained on how to develop frameworks using the F3 approach with or without the support of the F3T. Following the training, the pilot experiment was executed. The subjects were split into two groups considering the results of the characterization forms. Subjects were not told about the nature of the experiment, but were verbally instructed on the F3 approach and its tool. The pilot experiment was intended to simulate the real experiments, except that the applications were different, but equivalent. Beforehand, all subjects were given ample time to read about approach and to ask questions on the experimental process. This could affect the experiment validity, then, the data from this activity was only used to balance the groups. When the subjects understood what their had to do, they received the description of the domains and started timing the development of the frameworks. 
Each subject had to develop the frameworks applying the F3 approach, i.e., creating its F3 model from a document which describes its domain features and then applying the F3 patterns to implement it. C. Analysis of Data This section presents the experimental findings. The analysis is divided into two subsections: (1) Descriptive Statistics and (2) Hypotheses Testing. 1. Descriptive Statistics The time spent by each subject to develop a framework and the number of problems found in the outcome frameworks are shown in Table II. From this table, it can be seen that the subjects spent more time to develop the frameworks when they were doing it manually, approximately 72.5% against 27.5%. This result was expected, since the F3T generates framework source-code from F3 models. However, it is worth highlighting that most of the time spent in the manual framework development was due to framework implementation and the effort to fix the problems found in the frameworks, while most of the time spent in the framework development supported by the F3T was due to domain modeling. The dispersion of time spent by the subjects are also represented graphically in a boxplot on left side of Figure 6. In Table II it is also possible to visualize four types of problems that we analyzed in the outcome frameworks: (i) incoherence, (ii) structure, (iii) bad smells, (iv) interface. The problem of incoherence indicates that, during the experiment, the subjects did not model the domain of the framework as expected. Consequently, the subjects did not develop the frameworks with the correct domain features and constraints (mandatory, optional, and alternative features). As the capacity to model the framework domains depend more on the subject skills than on tool support, incoherence problems could be found in equivalent proportions, approximately 50%, when the framework was developed either manually or with the support of the F3T. The problem of structure indicates that the subjects did not implement the frameworks properly during the experiment. For example, they implemented classes with no constructor

and incorrect relationships, or when they forgot to declare the classes as abstract. This kind of problem occurred when the subjects did not properly follow the instructions provided by the F3 patterns.

TABLE II: Development timings and number of problems.

Fig. 6: Dispersion of the total time and number of problems.

In Table II it can be seen that the F3T helped the subjects to develop frameworks with fewer structure problems, i.e., 10% as opposed to 90%. The problem of bad smells indicates design weaknesses that do not affect functionality, but make the frameworks harder to maintain. In the experiment, this kind of problem occurred when the subjects forgot to apply some F3 patterns related to the organization of the framework classes, such as the Modular Hierarchy F3 pattern. By observing Table II we can remark that the F3T led to a design with higher quality than the manual approach, i.e., 0% against 100%, because the F3T automatically identified which patterns should be applied from the F3 models. The problem of interface indicates the absence of getter/setter operations, the lack of operations that allow the applications to reuse the framework, and so on. Usually, this kind of problem is a consequence of problems of structure, hence the numbers of problems of these two types are quite similar. As can be observed in Table II, the F3T helped the subjects to design a better framework interface than when they developed the framework manually, i.e., 8.6% against 91.4%. In the last two columns of Table II it can be seen that the F3T reduced the total number of problems found in the frameworks developed by the subjects. This is also graphically represented in the boxplot on the right side of Figure 6.

2. Testing the Hypotheses

The objective of this section is to verify, based on the data obtained in the experiment, whether it is possible to reject the null hypotheses in favor of the alternative hypotheses. Since some statistical tests are applicable only if the population follows a normal distribution, we applied the Shapiro-Wilk test and created a Q-Q chart to verify whether or not the experiment data departs from linearity before choosing a proper statistical test. The tests have been carried out as follows:

1) Time: We have applied the Shapiro-Wilk test on the experiment data that represents the time spent by each subject to develop a framework manually or using the F3T, as shown in Table II. Considering an α = 0.05, the p-values are and , and the Ws are and , respectively, for each approach. The test results confirmed that the experiment data related to the time spent in framework development is normally distributed, as can be seen in the Q-Q charts (a) and (b) in Figure 7. Thus, we decided to apply the Paired T-Test to these data. Assuming a Paired T-Test, we can reject H0 if t0 > tα/2, n−1.
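As an aside, the same two tests could also be computed with a statistics library. The sketch below assumes Apache Commons Math 3 and uses placeholder sample arrays; it is not the tooling used by the authors.

import org.apache.commons.math3.stat.inference.TTest;
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

// Illustrative sketch: paired t-test (time) and Wilcoxon signed-rank test
// (number of problems). Commons Math 3 and the sample values below are
// assumptions, not the authors' tooling or data.
public class HypothesisTests {
    public static void main(String[] args) {
        double[] timeManual = { 120, 115, 130, 125, 118 };  // placeholder minutes
        double[] timeF3T    = {  70,  68,  80,  75,  66 };  // placeholder minutes
        double[] problemsManual = { 9, 7, 11, 8, 10 };      // placeholder counts
        double[] problemsF3T    = { 3, 2,  4, 1,  2 };      // placeholder counts

        TTest tTest = new TTest();
        double t0    = tTest.pairedT(timeManual, timeF3T);      // t statistic
        double pTime = tTest.pairedTTest(timeManual, timeF3T);  // two-sided p-value

        WilcoxonSignedRankTest wilcoxon = new WilcoxonSignedRankTest();
        double pProblems =
                wilcoxon.wilcoxonSignedRankTest(problemsManual, problemsF3T, true);

        System.out.printf("t0 = %.3f, p(time) = %.4f, p(problems) = %.4f%n",
                t0, pTime, pProblems);
    }
}

The manual computation of the same statistics continues in the text below.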

In this case, tα,f is the upper α percentage point of the t-distribution with f degrees of freedom. Therefore, based on the samples, n = 26 and d = {46, 42, 52, 49, 41, 49, 55, 50, 53, 42, 42, 52, 48, 43, 45, 42, 47, 48, 44, 49, 51, 48, 52, 51, 48, 45}, Sd = 9.95 and t0 = . The average values of each data set are µmanual = and µF3T = . So, d̄ = 47.46, which implies that Sd = and t0 = . The number of degrees of freedom is f = n − 1 = 26 − 1 = 25. We take α = 0.05. Thus, according to StatSoft, it can be seen that t0.025,25 = . Since t0 > t0.025,25, it is possible to reject the null hypothesis with a two-sided test at the 0.05 level. Therefore, statistically, we can assume that, when the F3 approach is applied, the time needed to develop a framework using the F3T is less than doing it manually.

Fig. 7: Normality tests.

2) Problems: Similarly, we have applied the Shapiro-Wilk test on the experiment data shown in the last two columns of Table II, which represent the total number of problems found in the outcome frameworks that were developed either manually or using the F3T. Considering an α = 0.05, the p-values are and , and the Ws are and , respectively, for each approach. As can be seen in the Q-Q charts (c) and (d) in Figure 7, the test results confirmed that the data related to manual development are normally distributed, but the data related to the F3T cannot be considered normally distributed. Therefore, we applied a non-parametric test, the Wilcoxon signed-rank test, to these data. The signed ranks of these data are S/R of tproblems,manual − tproblems,F3T = {+3.5, +7.5, +7.5, +16.5, -3.5, +23, +3.5, +3.5, +10.5, +10.5, +3.5, +18.5, +10.5, +14, +24, +18.5, +3.5, +21, +21, +14, +21, +10.5, +14, +16.5}, where S/R stands for signed rank. As a result we got a p-value = with a significance level of 1%. Based on these data, we conclude there is a considerable difference between the means of the two treatments. We were able to reject H0 at the 1% significance level. The p-value is very close to zero, which further emphasizes that the F3T reduces the number of problems found in the outcome frameworks.

D. Opinion of the Subjects

We analyzed the opinion of the subjects in order to evaluate the impact of using the approaches considered in the experiment. After the experiment operation, all subjects received a questionnaire, in which they could report their perception about applying the F3 approach manually or with the support of the F3T. The answers in the questionnaire have been analyzed in order to identify the difficulties in the use of the F3 approach and its tool. As can be seen in Figure 8, when asked if they encountered difficulties in the development of the frameworks by applying the F3 approach manually, approximately 52% of the subjects reported having significant difficulty, 29% mentioned partial difficulty and 19% had no difficulty. In contrast, when asked the same question with respect to the use of the F3T, 73% of the subjects reported having no difficulty, 16% mentioned partial difficulty and only 11% had significant difficulty.

Fig. 8: Level of difficulty of the subjects.

The reduction of the difficulty to develop the frameworks, shown in Figure 8, reveals that the F3T assisted the subjects in this task. The subjects also answered in the questionnaire about the difficulties they found during framework development.
The most common difficulties pointed out by the subjects when they developed the frameworks manually were: 1) too much effort spent on coding; 2) mistakes made due to lack of attention; 3) lack of experience in developing frameworks; and 4) time spent identifying the F3 patterns in the F3 models. In contrast, the most common difficulties faced by the subjects when they used the F3T were: 1) lack of practice with the tool; and 2) some actions in the tool interface, for instance opening the F3 model editor, took many steps to execute. The subjects said that the F3 patterns helped them to identify which

structures were necessary to implement the frameworks in the manual development. They also said that the F3T automated the tasks of identifying which F3 patterns should be used and of implementing the framework source-code, so they could keep their focus on domain modeling.

E. Threats to Validity
Internal Validity: Experience level of the subjects: the subjects had different levels of knowledge, which could affect the collected data. To mitigate this threat, we divided the subjects into two balanced blocks considering their level of knowledge and rebalanced the groups considering the preliminary results. Moreover, all subjects had prior experience in developing applications by reusing frameworks, but not in developing frameworks. Thus, the subjects were trained in common framework implementation techniques and in how to use the F3 approach and the F3T. Productivity under evaluation: this might influence the experiment results because subjects often tend to think they are being evaluated by the experiment results. To mitigate this, we explained to the subjects that no one was being evaluated and that their participation was anonymous. Facilities used during the study: different computers and installations could affect the recorded timings. Thus, all subjects used the same hardware configuration and operating system.
Validity by Construction: Hypothesis expectations: the subjects already knew the researchers and knew that the F3T was supposed to ease framework development, which reflects one of our hypotheses. These issues could affect the collected data and make the experiment less impartial. In order to keep impartiality, we enforced that the participants keep a steady pace during the whole study.
External Validity: Interaction between configuration and treatment: it is possible that the exercises performed in the experiment are not representative of every framework development for real-world applications. Only two frameworks were developed and they had the same complexity. To mitigate this threat, the exercises were designed considering framework domains based on the real world.
Conclusion Validity: Measure reliability: this refers to the metrics used to measure the development effort. To mitigate this threat, we used only the time spent, which was captured in forms filled in by the subjects. Low statistical power: this refers to the ability of a statistical test to reveal reliable data. To mitigate this threat, we applied two tests: the Paired T-Test to statistically analyze the time spent to develop the frameworks and the Wilcoxon signed-rank test to statistically analyze the number of problems found in the outcome frameworks.

VI. RELATED WORK
In this section, some works related to the F3T and the F3 approach are presented. Amatriain and Arumi [28] proposed a method for the development of a framework and its DSL through iterative and incremental activities. In this method, the framework has its domain defined from a set of applications and is implemented by applying a series of refactorings to the source-code of these applications. The advantages of this method are a small initial investment and the reuse of the applications. Although it is not mandatory, the F3 approach can also be applied in iterative and incremental activities, starting from a small domain and then adding features. Applications can also be used to facilitate the identification of the features of the framework domain.
However, the advantage of the F3 approach is that the design and the implementation of the frameworks are supported by the F3 patterns and automated by the F3T.
Oliveira et al. [29] presented the ReuseTool, which assists framework reuse by manipulating UML diagrams. The ReuseTool is based on the Reuse Description Language (RDL), a language created by these authors to facilitate the description of framework instantiation processes. Framework hot spots can be registered in the ReuseTool with the use of the RDL. In order to instantiate the framework, application models can be created based on the framework description, and application source-code is generated from these models. Thus, the RDL works as a meta-language that registers framework hot spots, and the ReuseTool provides a friendlier interface for developers to build applications reusing the frameworks. In comparison, the F3T supports framework development through domain modeling and application development through the framework DSML.
Pure::variants [30] is a tool that supports the development of applications by modeling domain features (Feature Diagram) and the components that implement these features (Family Diagram). The applications are then developed by selecting a set of features of the domain. Pure::variants generates only application source-code, maintaining all domain artifacts at model level. Besides, this tool has a proprietary license and its free version (Community) has limited functionality. In comparison, the F3T is free, uses only one type of domain model (the F3 model), and generates frameworks as domain artifacts. Moreover, the frameworks developed with the support of the F3T can be reused in the development of applications with or without the support of the F3T.

VII. CONCLUSIONS
The F3T supports framework development and reuse through code generation from models. This tool provides an F3 model editor for developers to define the features of the framework domain. Then, framework source-code and a DSML can be generated from the F3 models. The framework DSML can be installed in the F3T to allow developers to model and to generate the source-code of applications that reuse

113 the framework. The F3T is a free software available at: matheus viana. The F3T was created to semi-automatize the applying of the F3 approach. In this approach, domain features are defined in F3 models in order to separate the elements of the framework from the complexities to develop them. F3 models incorporate elements and relationships from feature models and properties and operations from metamodels. Framework source-code is generated based on patterns that are solutions to design and implement domain features defined in F3 models. A DSML is generated along with the sourcecode and includes all features of the framework domain and in the models created with it developers can insert application specifications to configure framework hot spots. Thus, the F3T supports both Domain Engineering and Application Engineering, improving their productivity and the quality of the outcome frameworks and applications. The F3T can be used to help the construction of software product lines, providing an environment to model domains and create frameworks to be used as core assets for application development. The experiment presented in this paper has shown that, besides the gain of efficiency, the F3T reduces the complexities surrounding framework development, because, by using this tool, developers are more concerned about defining framework features in a graphical model. All code units that compose these features, provide flexibility to the framework and allows it to be instantiated in several applications are properly generated by the F3T. The current version of the F3T generates only the model layer of the frameworks and applications. In future works we intend to include the generation of a complete multi-portable Model-View-Controller architecture. ACKNOWLEDGMENT The authors would like to thank CAPES and FAPESP for sponsoring our research. REFERENCES [1] V. Stanojevic, S. Vlajic, M. Milic, and M. Ognjanovic. Guidelines for Framework Development Process. In 7th Central and Eastern European Software Engineering Conference, pages 1 9, Nov [2] M. Abi-Antoun. Making Frameworks Work: a Project Retrospective. In ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, [3] R. E. Johnson. Frameworks = (Components + Patterns). Communications of ACM, 40(10):39 42, Oct [4] JBoss Community. Hibernate. Jan [5] Spring Source Community. Spring Framework. Jan [6] S. D. Kim, S. H. Chang, and C. W. Chang. A Systematic Method to Instantiate Core Assets in Product Line Engineering. In 11th Asia- Pacific Conference on Software Engineering, pages 92 98, Nov [7] David M. Weiss and Chi Tau Robert Lai. Software Product Line Engineering: A Family-Based Software Development Process. Addison- Wesley, [8] D. Parsons, A. Rashid, A. Speck, and A. Telea. A Framework for Object Oriented Frameworks Design. In Technology of Object-Oriented Languages and Systems, pages , Jul [9] S. Srinivasan. Design Patterns in Object-Oriented Frameworks. ACM Computer, 32(2):24 32, Feb [10] M. Viana, R. Penteado, and A. do Prado. Generating Applications: Framework Reuse Supported by Domain-Specific Modeling Languages. In 14th International Conference on Enterprise Information Systems, Jun [11] M. Viana, R. Durelli, R. Penteado, and A. do Prado. F3: From Features to Frameworks. In 15th International Conference on Enterprise Information Systems, Jul [12] Sajjan G. Shiva and Lubna Abou Shala. Software Reuse: Research and Practice. In Fourth International Conference on Information Technology, pages , Apr [13] W. Frakes and K. 
Kang. Software Reuse Research: Status and Future. IEEE Transactions on Software Engineering, 31(7): , Jul [14] M. Fowler. Patterns. IEEE Software, 20(2):56 57, [15] R. S. Pressman. Software Engineering: A Practitioner s Approach. McGraw-Hill Science, 7th edition, [16] A. Sarasa-Cabezuelo, B. Temprado-Battad, D. Rodrguez-Cerezo, and J. L. Sierra. Building XML-Driven Application Generators with Compiler Construction. Computer Science and Information Systems, 9(2): , [17] S. Lolong and A.I. Kistijantoro. Domain Specific Language (DSL) Development for Desktop-Based Database Application Generator. In International Conference on Electrical Engineering and Informatics (ICEEI), pages 1 6, Jul [18] R. C. Gronback. Eclipse Modeling Project: A Domain-Specific Language (DSL) Toolkit. Addison-Wesley, [19] I. Liem and Y. Nugroho. An Application Generator Framelet. In 9th International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 08), pages , Aug [20] J. M. Jezequel. Model-Driven Engineering for Software Product Lines. ISRN Software Engineering, 2012, [21] K. Lee, K. C. Kang, and J. Lee. Concepts and Guidelines of Feature Modeling for Product Line Software Engineering. In 7th International Conference on Software Reuse: Methods, Techniques and Tools, pages 62 77, London, UK, Springer-Verlag. [22] K. C. Kang, S. G. Cohen, J. A. Hess, W. E. Novak, and A. S. Peterson. Feature-Oriented Domain Analysis (FODA): Feasibility Study. Technical report, Carnegie-Mellon University Software Engineering Institute, Nov [23] H. Gomaa. Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures. Addison-Wesley, [24] J. Bayer, O. Flege, P. Knauber, R. Laqua, D. Muthig, K. Schmid, T. Widen, and J. DeBaud. PuLSE: a Methodology to Develop Software Product Lines. In Symposium on Software Reusability, pages ACM, [25] OMG. OMG s MetaObject Facility. Jan [26] The Eclipse Foundation. Eclipse Modeling Project, Jan [27] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in Software Engineering: an Introduction. Kluwer Academic Publishers, Norwell, MA, USA, [28] X. Amatriain and P. Arumi. Frameworks Generate Domain-Specific Languages: A Case Study in the Multimedia Domain. IEEE Transactions on Software Engineering, 37(4): , Jul-Aug [29] T. C. Oliveira, P. Alencar, and D. Cowan. Design Patterns in Object- Oriented Frameworks. ReuseTool: An Extensible Tool Support for Object-Oriented Framework Reuse, 84(12): , Dec [30] Pure Systems. Pure::Variants. variants.49.0.html, Fev 2013.

A Metric of Software Size as a Tool for IT Governance
Marcus Vinícius Borela de Castro, Carlos Alberto Mamede Hernandes
Tribunal de Contas da União (TCU), Brasília, Brazil
{borela, carlosmh}@tcu.gov.br

Abstract - This paper proposes a new metric for software functional size, which is derived from Function Point Analysis (FPA) but overcomes some of its known deficiencies. The statistical results show that the new metric, Functional Elements (EF), and its submetric, Functional Elements of Transaction (EFt), have a higher correlation with the effort in software development than FPA in the context of the analyzed data. The paper also illustrates the application of the new metric as a tool to improve IT governance, specifically in assessing, monitoring, and giving directions to the software development area.
Index Terms - Function Points, IT governance, IT performance, Software engineering, Software metrics.
This work has been supported by the Brazilian Court of Audit (TCU).

I. INTRODUCTION
Organizations need to leverage their technology to create new opportunities and produce change in their capabilities [1, p. 473]. According to ITGI [2, p. 7], information technology (IT) has become an integral part of business for many companies, with a key role in supporting and promoting their growth. In this context, IT governance fulfills the important role of directing and boosting IT in order to achieve its goals aligned with the company's strategy. In order for IT governance to foster the success of IT and of the organization, ISO/IEC 38500 [3, p. 7] proposes three main activities: to assess the current and future use of IT; to direct the preparation and implementation of plans and policies to ensure that IT achieves organizational goals; and to monitor performance and compliance with those policies (Fig. 1).

Fig. 1. Cycle Assess-Direct-Monitor of IT Governance. Source: ISO/IEC 38500 [3, p. 7]

A metric of software size can compose several indicators that help reveal the real situation of the systems development area to the senior management of an organization, directly or through IT governance structures (e.g., an IT steering committee). Measures such as the production of software in a period (e.g., software size delivered per month) and the productivity of an area (e.g., software size per unit of effort) are examples of indicators that can support the three governance activities proposed by ISO/IEC 38500. For the formation of these indicators, one can use Function Point Analysis (FPA) to obtain function points (FP) as a metric of software size. Created by Albrecht [4], FPA has become an international standard for measuring the functional size of software, with the ISO/IEC 20926 [5] designation. Its rules are maintained and enhanced by a nonprofit international group of users called the International Function Point Users Group (IFPUG), responsible for publishing the Counting Practices Manual (CPM), now in version 4.3.1 [6]. Because it has a direct correlation with the effort expended in software development [7]-[8], FPA has been used as a tool for information technology management, not only in Brazil but worldwide. As identified in the Quality Research in Brazilian Software Industry report, 2009 [9, p. 93], FPA is the most widely used metric to evaluate the size of software among software companies in Brazil, used by 34.5% of them.
According to a survey by Dekkers and Bundschuh [10, p. 393], 80% of the projects registered with the International Software Benchmarking Standards Group (ISBSG), release 10, that applied a functional size metric used FPA. The FPA metric is considered a highly effective instrument for the measurement of contracts [11, p. 191]. However, it has the limitation of not treating non-functional requirements, such as quality criteria and response-time constraints. Brazilian federal government institutions also use FPA for the procurement of development and maintenance of systems. The Brazilian Federal Court of Audit (TCU) points out FPA as an example

115 of metric to be used in contracts. 2 The metrics roadmap of SISP [12], a federal manual for software procurement, recommends its application to federal agencies. Despite the extensive use of the FPA metric, a large number of criticism about its validity and applicability, described in Section II-B, put in doubt the correctness of its use in contracts and the reliability of its application as a tool for IT management and IT governance. So the question arises for the research: is it possible to propose a metric for software development, with the acceptance and practicality of FPA, that is, based on its concepts already widely known, without some of the flaws identified in order to maximize its use as a tool for IT governance, focusing on systems development and maintenance? The specific objectives of this paper are: 1) to present an overview of software metrics and FPA; 2) to present the criticisms to the FPA technique that motivated the proposal of a new metric; 3) to derive a new metric based on FPA; 4)to evaluate the new metric against FPA in the correlation with effort; 5) to illustrate the use of the proposed metric in IT governance in the context of systems development and maintenance. Following, each objective is covered in a specific section. A. Software Metrics II. DEVELOPMENT 1) Conceptualization, categorization, and application Dekkers and Bundschuh [10, p ] describe various interpretations for metric, measure, and indicator found in the literature. Concerning this study, no distinction is made among these three terms. We used Fenton and Pfleeger s definition [13, p. 5] for measure: a number or symbol that characterizes an attribute of a real world entity, object or event, from formally defined rules. Kitchenham et al. [14] present a framework for software metrics with concepts related to the formal model in which a metric is based, for example, the type of scale used. According to Fenton and Pfleeger [13, p. 74], software metrics can be applied to three types of entities: processes, products, and resources. The authors also differentiate direct metrics when only one attribute of an entity is used, from indirect metrics, the other way around [13, p. 39]. Indirect metrics are derived by rules based on other metrics. The speed of delivery of a team (entity type: resource) is an example of indirect metric because it is calculated from the ratio of two measures: size of developed software (product) development and elapsed time (process). The elapsed time is an example of direct metric. Moser [15, p. 32] differentiates size metrics from quality metrics: size metrics distinguish between the smallest and the largest whereas quality metrics distinguish between good and bad. Table I consolidates the mentioned categories of software metrics. 2 There are several rulings on the subject: 1.782/2007, 1.910/2007, 2.024/2007, 1.125/2009, 1.784/2009, 2.348/2009, 1.274/2010, 1.647/2010, all of the Plenary of the TCU. Moser [15, p.31] notes that, given the relationship between a product and the process that produced it, a product measure can be assigned to a process, and vice versa. For example, the percentage of effort in testing, which is a development process attribute, can be associated with the generated product as an indicator of its quality. And the number of errors in production in the first three months, a system attribute (product), can be associated to the development process as an indicative of its quality. Fenton and Pfleeger [13, p. 
12] set three goals for software metric: to understand, to control, and to improve the targeted entity. They call our attention to the fact that the definition of the metrics to be used depends on the maturity level of the process being measured: the more mature, more visible, and therefore more measurable [13, p. 83]. Chikofsky and Rubin [16, p. 76] highlight that an initial measurement program for a development and maintenance area should cover five key dimensions that address core attributes for planning, controlling, and improvement of products and processes: size, effort, time, quality, and rework. The authors remind us that what matters are not the metric itself, but the decisions that will be taken from them, refuting the possibility of measuring without foreseeing the goal [16, p. 75]. According to Beyers [17, p. 337], the use of estimates in metric (e.g., size, time, cost, effort, quality, and allocation of people) can help in decision making related to software development and to the planning of software projects. 2) FPA overview According to the categorization of in previous section, FPA is an indirect measure of product size. It measures the functional size of an application (system) as a gauge of the functionality requested and delivered to the user of the software. 3 This is a metric understood by users, regardless of the technology used. According to Gencel and Demirors [18, p. 4], all functional metrics ISO standards estimate software size based on the functionality delivered to users, 4 differing in the considered objects and how they are measured. TABLE I EXAMPLES OF CATEGORIES OF SOFTWARE METRICS Criterion Category Source Entity Of process [13, p. 74] Of product Of resource Number of attributes Direct [13, p. 39\ involved Indirect Target of differentiation Size Quality [15, p. 32] 3 The overview presented results from the experience of the author Castro with FPA. In 1993, he coordinated the implementation of FPA in the area of systems development at the Brazilian Superior Labor Court (TST). At TCU, he works with metric, albeit sporadically, without exclusive dedication. 4 Besides FPA, there are four other functional metrics that are ISO standards, as they meet the requirements defined in the six standards of ISO 14143: MKII FPA, COSMIC-FFP, FISMA, and NESMA. Non-functional attributes of a development process (e.g., development team experience, chosen methodology) are not in the scope of functional metrics. Functional requirements are only one dimension of several impacting the effort. All of them have to be taken into account in estimates. Estimates and non-functional requirements evaluations are not the goal of this paper.
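As a small, hypothetical illustration of the direct/indirect distinction above, the sketch below derives two indirect metrics, delivery speed and productivity, from direct measures of size, elapsed time, and effort; all values are invented.

# Direct measures (hypothetical values for one delivered product).
size_fp        = 480.0    # size of the developed software (product)
elapsed_weeks  = 16.0     # elapsed time of the development process
effort_hours   = 2_300.0  # total effort of the team (resource)

# Indirect metrics are derived by rules over direct ones.
delivery_speed = size_fp / elapsed_weeks  # e.g., fp delivered per week by the team
productivity   = size_fp / effort_hours   # e.g., fp per person-hour

print(f"delivery speed: {delivery_speed:.1f} fp/week")
print(f"productivity:   {productivity:.3f} fp/hour")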

Functionalities can be of two types: transactions, which implement data exchanges with users and other systems, and data files, which indicate the structure of stored data. There are three types of transactions: external inquiries (EQ), external outputs (EO), and external inputs (EI), according to whether the primary intent of the transaction is, respectively, a simple query, a more elaborate query (e.g., with calculated totals), or a data update. There are two types of logical data files: internal logical files (ILF) and external interface files (EIF), according to whether their data are, respectively, updated or just referenced (accessed) in the context of the application. Fig. 2 illustrates these five function types graphically.
To facilitate understanding, we can consider as an example of an EI an employee inclusion form, which inserts information into the employees data file (ILF) and validates the tax code (CPF) informed by the user by accessing the taxpayers file (EIF), which is external to the application. In the same application we could also have, hypothetically, an employee report, a simple query containing the names of the employees of a given organizational unit (EQ), and a more complex report with the number of employees per unit (EO).
In the FPA calculation rule, each function is evaluated for its complexity and receives one of three classifications: low, medium, or high complexity. Each level of complexity is associated with a size in function points. Table II illustrates the derivation rule for external inquiries, according to the number of files accessed (File Type Referenced - FTR) and the number of fields that cross the boundary of the application (Data Element Type - DET). As for EQ, each type of functionality (EO, EI, ILF, and EIF) has its own rules for the derivation of complexity and size, similar to Table II. Table III summarizes the categories of attributes used for calculating function points according to each type of functionality. The software size is the sum of the sizes of its functionalities. This paper is not an in-depth presentation of the concepts associated with FPA; details can be obtained in the Counting Practices Manual, version 4.3.1 [6].

Fig. 2. Visualization of the five types of functions in FPA: Employee Inclusion (EI), Employee Report (EQ), Totals per Unit (EO), Employee (ILF), and Taxpayer (EIF) within the application boundary, exchanged with the user or an external system.

TABLE II
DERIVATION RULE FOR COMPLEXITY AND SIZE IN FUNCTION POINTS OF AN EXTERNAL INQUIRY

FTR (file) \ DET (field)   1-5          6-19         20 or more
1                          low (3)      low (3)      medium (4)
2-3                        low (3)      medium (4)   high (6)
4 or more                  medium (4)   high (6)     high (6)

B. Criticisms to the FPA technique that motivated the proposal of a new metric
Despite the extensive use of the FPA metric, mentioned in Section I, there is a large body of criticism about its validity and applicability that calls into question the correctness of its use in contracts and the reliability of its application as a tool for IT management and governance ([19], [13], [20], [21], [14], [22], [23], [24], [25]). Several metrics have been proposed taking FPA as a basis for their derivation, either to adapt it to particular models or to improve it by fixing known flaws. To illustrate, there is the work of Antoniol et al. [26], proposing a metric for the object-oriented model, and the work of Kralj et al. [22], proposing a change in FPA to measure high-complexity functions more accurately (item 4 below).
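For concreteness, the derivation rule of Table II can be written as a small function. The sketch below assumes the range boundaries recoverable from the text (DET 1-5, 6-19, and 20 or more; FTR 1, 2-3, and 4 or more) and is only meant to make the step-wise nature of FPA sizing explicit; it is not an implementation of the full CPM rules.

def eq_function_points(ftr: int, det: int) -> int:
    """Size in function points of an External Inquiry, per Table II.
    ftr: number of files referenced; det: number of fields crossing the boundary."""
    ftr_band = 0 if ftr <= 1 else (1 if ftr <= 3 else 2)   # 1 | 2-3 | 4 or more
    det_band = 0 if det <= 5 else (1 if det <= 19 else 2)  # 1-5 | 6-19 | 20 or more
    size = [[3, 3, 4],   # FTR = 1
            [3, 4, 6],   # FTR = 2-3
            [4, 6, 6]]   # FTR >= 4
    return size[ftr_band][det_band]

# The application size is simply the sum of the sizes of its functionalities.
print(eq_function_points(1, 4))    # 3 fp (low)
print(eq_function_points(2, 19))   # 4 fp (medium)
print(eq_function_points(2, 20))   # 6 fp (high)  <- note the jump from 19 to 20 fields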
The objective of the metric proposed in this paper is not to solve all faults of FPA, but to help reduce the following problems related to its definition:
1) low representation: the metric restricts the size of a function to only three possible values, according to its complexity (low, medium, or high), although there is no limit on the number of possible combinations of functional elements considered in calculating the complexity of a function in FPA;
2) functions with different functional complexities have the same size: a consequence of the low representation. Pfleeger et al. [23, p. 36] say that if H is a measure of size, and if A is greater than B, then H(A) should be greater than H(B); otherwise, the metric would be invalid, failing to capture in the mathematical world the behavior we perceive in the empirical world. Xia et al. [25, p. 3] show examples of functions with different complexities that were improperly assigned the same value in function points because they fall into the same complexity classification, thus exposing the problem of ambiguous classification;
3) abrupt transition between functional element ranges: Xia et al. [25, p. 4] introduced this problem. They present two logical files, B and C, with apparently similar complexities, differing only in the number of fields: B has 20 fields and C has 19. B is classified as of medium complexity (10 fp) and C as of low complexity (7 fp). The difference lies in the transition between two ranges in the complexity derivation table: up to 19 fields, the file is considered of low complexity; from 20 fields on, of medium complexity. The addition of a single field leading to an increase of 3 fp is inconsistent, since varying from 1 to 19 fields does not involve any change in the function point size. A similar result occurs in the other range transitions;
4) limited sizing of high (and low) complexity functions: FPA sets an upper (and a lower) limit for the size of a function, at 6, 7, 10, or 15 fp, according to its type.

TABLE III
CATEGORIES OF FUNCTIONAL ATTRIBUTES FOR EACH TYPE OF FUNCTIONALITY

Function                    Functional Attributes
Transactions: EQ, EO, EI    referenced files (FTR) and fields (DET)
Logical files: ILF, EIF     logical registers (Record Element Type - RET) and fields (DET)

Kralj et al. [22, p. 83] describe high-complexity functions with improper sizes in FPA and propose a change in the FPA calculation to support larger sizes for functions of greater complexity;
5) undue operation on an ordinal scale: as previously seen, FPA classifies the complexity of functions as low, medium, or high, an ordinal scale. In the calculation process these labels are replaced by numbers; an internal logical file, for example, receives 7, 10, or 15 function points according to whether its complexity is low, medium, or high, respectively. Kitchenham [20, p. 29] criticizes the inadequacy of adding up values of an ordinal scale in FPA, arguing that it makes no sense to add the complex label to the simple label, even if 7 is used as a synonym for simple and 15 as a synonym for complex;
6) inability to measure changes in parts of a function: this characteristic, for example, does not allow measuring the function points of the part of a functionality that needs to be changed in a maintenance operation. Thus, a function addressed in several iterations of an agile method or another iterative process is always measured with its full size, even if the change in each iteration is small. For example, consider three maintenance requests, at different moments, for a report already at the maximum size of 7 fp, which initially showed 50 distinct fields. Suppose each request adds a single field. The three requests would be dimensioned at 7 fp each, the same size as the request that created the report, and would total 21 fp. Aware of this limitation, the FPA manual [6, vol. 4, p. 94] points to the Netherlands Software Metrics Association (NESMA) metric as an alternative for measuring maintenance requests. NESMA presents an approach to solve this problem. According to the Function Point Analysis for Software Enhancement [27], NESMA measures a maintenance request as the multiplication of the original size of a function by an impact factor of the change. The impact factor is the ratio between the number of attributes (e.g., fields and files) included, changed, or deleted and the original number of attributes of the function. The adjustment factor assumes values in multiples of 25%, varying up to a maximum of 150% (this rule is sketched below).
Given the deficiencies reported, the correlation between the size in function points of a software product and the effort required for its development tends not to be appropriate, since FPA has these deficiencies in the representation of the real functional size of software. If there are inaccuracies in measuring the size of what must be done, it is impossible to expect a proper definition of the effort and, therefore, accuracy in defining the cost of development and maintenance. The problems mentioned motivated the development of this work, in order to propose a quantitative metric, with infinite possible values, called Functional Elements (EF).

C. Derivation process of the new metric
The proposed metric, Functional Elements, adopts the same concepts as FPA but changes the mechanism used to derive the size of functions. The use of concepts widely known to metric specialists will facilitate acceptance and adoption of the new metric among these professionals.
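To make the NESMA enhancement rule mentioned in criticism 6 concrete, the sketch below applies one plausible reading of it (the ratio of touched attributes to original attributes, rounded up to the next multiple of 25% and capped at 150%) to the 50-field report example; it is an approximation for illustration only, not the full NESMA procedure [27].

import math

def nesma_enhancement_fp(original_fp: float, changed_attrs: int, original_attrs: int) -> float:
    """Approximate NESMA enhancement sizing as described in the text:
    original size x impact factor, where the impact factor is the share of
    attributes touched, expressed in 25% steps and capped at 150%.
    (Simplified reading, not the full NESMA procedure.)"""
    ratio = changed_attrs / original_attrs
    impact = min(math.ceil(ratio / 0.25) * 0.25, 1.50)
    return original_fp * impact

# The report from criticism 6: 7 fp, 50 fields, three requests adding one field each.
for request in range(1, 4):
    print(f"request {request}: {nesma_enhancement_fp(7, 1, 50):.2f} fp")
# Each request counts about 1.75 fp instead of the full 7 fp that plain FPA would assign.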
The reasoning process for deriving the new metric, as described in the following sections, implements a linear regression similar to that seen in Fig. 3. The objective is to derive a formula for calculating the number of EF for each type of function (Table VII in Section II-C-4) from the number of functional attributes considered in the derivation of its complexity, as indicated in Table III in Section II-A-2. In this paper, these attributes correspond to the concept of functional elements, which gives the proposed metric its name. The marked points in Fig. 3 indicate the size in fp (Z axis) of an external inquiry derived from the number of files (X axis) and the number of fields (Y axis), which are the attributes used in the derivation of its complexity (see Table II in Section II-A-2). The grid is the result of a linear regression over these points and represents the new value of the metric.
1) Step 1 - definition of the constants
If the values associated with the two categories of functional attributes are zero, the EF metric assumes the value of a constant. Attributes can be assigned the value zero, for example, in the case of a maintenance limited to the algorithm of a function, not involving changes in the number of fields and files involved. The values assigned to these constants come from the NESMA functional metric mentioned in Section II-B. This metric was chosen because it is an ISO standard and supports the maintenance case with zero-value attributes. For each type of functionality, the proposed metric uses the smallest possible value obtained by applying NESMA, that is, 25% of the number of fp of a low-complexity function of each type: EIF, 1.25 (25% of 5); ILF, 1.75 (25% of 7); EQ, 0.75 (25% of 3); EI, 0.75 (25% of 3); and EO, 1 (25% of 4).

Fig. 3. Derivation of the number of fp of an external inquiry from the attributes used in the calculation

2) Step 2 - treatment of ranges with an unlimited number of elements
In FPA, each type of function has its own table for deriving the complexity of a function. Table II in Section II-A-2 presents the values of the ranges of functional attributes for the derivation of the complexity of external inquiries. The third and last range of values of each functional attribute, for all types of functions, is unlimited: we see "20 or more" DET in the first cell of the fourth column of that table, and "4 or more" FTR in the last cell of the first column. The number of elements in the larger of the first two ranges was chosen to set an upper limit for the third range. In the case of the ranges for external inquiries, the number of fields was limited to 33, so that the third range has 14 elements (20 to 33), the same number as the second range (6 to 19), the largest of the first two. The number of referenced files was limited to 5, following the same reasoning. Bounding the ranges is a mathematical artifice that avoids imposing an upper limit on the new metric (4th criticism in Section II-B).
3) Step 3 - generation of points for regression
The objective of this phase was to generate, for each type of function, a set of data records with three values: the values of the two functional attributes and the derived fp, already decreased by the value of the constant defined in step 1. Table IV illustrates some points generated for the external inquiry. An application developed in MS Access generated a dataset with all possible points for the five types of functions, based on the complexity tables with bounded ranges produced in the previous step. Table V shows all the considered combinations of ranges for EQ.

TABLE IV: Partial extract of the dataset for external inquiry (columns: FTR, DET, and fp decreased by the constant of step 1).

4) Step 4 - linear regression
The points obtained by the procedure described in the previous section were imported into MS Excel for linear regression using the ordinary least squares (OLS) method. The regression was performed between the size in fp, the dependent variable, and the functional attributes, the independent variables, with the regression constant held at zero, since the constants were already defined in step 1 and subtracted from the expected value in step 3. The statistical results of the regression are shown in Table VI for each type of function. Table VII shows the derived formula for each type of function, with coefficient values rounded to two decimal places. Each formula calculates the number of functional elements, which is the proposed metric, from the functional attributes impacting the calculation and the constants indicated in step 1. The acronyms EFt and EFd represent the functional elements associated with transactions (EQ, EI, and EO) and with data (ILF and EIF), respectively. The functional elements metric, EF, is the sum of the functional elements of transaction, EFt, with the functional
TABLE V COMBINATIONS OF RANGES FOR CALCULATING FP OF EQ Function type Initial FTR Final FTR Initial DET Final DET Original FP PF decreased of constant EQ EQ EQ EQ EQ EQ EQ EQ EQ TABLE VI STATISTICAL REGRESSION - COMPARING RESULTS PER TYPES OF FUNCTIONS ILF EIF EO EI EQ R Records Coefficient p- value (FTR or 3.00E E E E E-60 RET) Coefficient p- value (DET) 2.28E E E E E-45 TABLE VII CALCULATION FORMULAS OF FUNCTIONAL ELEMENTS BY TYPE OF FUNCTION 5 Function type Formula ILF EFd = RET DET EIF EFd = RET DET EO EFt = FTR DET EI EFt = FTR DET EQ EFt = FTR DET elements of data, EFd, as explained in the formulas of Table VII. So the proposed metric is: EF = EFt + EFd. The EFt submetric considers logical files (ILF and EIF) as they are referenced in the context of transactions. Files are not counted in separate as in the EFd submetric. Similar to two other ISO standard metrics of functional size [10, p. 388], MKII FPA [28] and COSMIC-FFP [29], EFt does not take into account logical files. EFt is indicated for the cases where the effort of dealing with data structures (EFd) is not subject to evaluation or procurement. In the next section, the EF and EFt metrics were tested, counting and not counting logical files, respectively. Results show stronger correlation with effort for EFt. Although not evaluated, the EFd submetric has its role as it reflects the structural complexity of the data of an application. D. Evaluation of the new metric The new EF metric and its submetric EFt were evaluated for their correlation with effort in comparison to the FPA metric. 6 The goal was not to evaluate the quality of these correlations, but to compare their ability to explain the effort. We obtained a spreadsheet from a federal government agency with records of Service Orders (OS) contracted with private companies for coding and testing activities. An OS 5 The size of a request for deleting a function is equal to the constant value, since no specific attributes are impacted by this operation. 6 Kemerer [8, p. 421] justified linear regression as a means of measuring this correlation.
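Steps 2 to 4 can be reproduced in a few lines for the EQ type. The authors used MS Access to enumerate the points and MS Excel for the OLS fit, so the NumPy sketch below is only illustrative; it assumes the bounded EQ ranges described in step 2 (DET 1-33, FTR 1-5) and the step-1 constant of 0.75 fp, and its coefficients are not guaranteed to match Tables VI and VII exactly.

# Sketch of Steps 2-4 for the EQ function type (illustrative only).
import numpy as np

def eq_fp(ftr: int, det: int) -> int:
    """Function points of an External Inquiry according to Table II."""
    ftr_band = 0 if ftr <= 1 else (1 if ftr <= 3 else 2)
    det_band = 0 if det <= 5 else (1 if det <= 19 else 2)
    table = [[3, 3, 4],
             [3, 4, 6],
             [4, 6, 6]]
    return table[ftr_band][det_band]

CONSTANT_EQ = 0.25 * 3  # Step 1: 25% of a low-complexity EQ

# Steps 2-3: enumerate every (FTR, DET) combination within the bounded ranges
# and subtract the constant from the fp value (the regression target).
points = [(ftr, det, eq_fp(ftr, det) - CONSTANT_EQ)
          for ftr in range(1, 6) for det in range(1, 34)]
X = np.array([(ftr, det) for ftr, det, _ in points], dtype=float)
y = np.array([z for _, _, z in points])

# Step 4: ordinary least squares through the origin (no intercept term).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"EFt(EQ) ~ {CONSTANT_EQ:.2f} + {coef[0]:.2f}*FTR + {coef[1]:.2f}*DET")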

119 contained one or more requests for maintenance or development of functions of one system, such as: create a report, change a transaction. The spreadsheet showed for each OS the real allocated effort and, for each request, the size of the function handled. The only fictitious data were the system IDs, functionality IDs and OS IDs, as they were not relevant to the scope of this paper. Each system was implemented in a single platform: Java, DotNet or Natural. The spreadsheet showed the time spent in hours and the number of people allocated for each OS. The OS effort, in man-hours, was derived from the product of time by team size. Table VIII presents the structure of the received data. Data from 183 Service Orders were obtained. However, 12 were discarded for having dubious information, for example, undefined values for function type, number of fields, and operation type. The remaining 171 service orders were related to 14 systems and involved 505 requests that dealt with 358 different functions. To achieve higher quality in the correlation with effort, we decided to consider only the four systems associated with at least fifteen OS, namely, systems H, B, C, and D. Table IX indicates the number of OS and requests for each system selected. The data were imported into MS Excel to perform the linear regression using the ordinary least squares method after calculating the size in EF and EFt metrics for each request in an MS-Access application developed by the authors. 7 The regression considered the effort as the independent variable and the size calculated in the PF, EF, and EFT metrics as the dependent ones. As there is no effort if there is no size, the regression considered the constant with value zero, that is, the straight line crosses the origin of the axes. Independent regressions were performed for each system, since the variability of the factors that influence the effort is low within a single system, because the programming language is the same and the technical staff is generally also the same. 8 Fig. 4 illustrates the dispersion of points (OS) on the correlation between size and effort in EFt (man-hour) and the line derived by linear regression in the context of system H. The coefficient of determination R 2 was used to represent the degree of correlation between effort and size calculated for each of the evaluated metrics. According to Sartoris [30, p. 244], R 2 indicates, in a linear regression, the percentage of the variation of a dependent variable Y that is explained by the variation of a second independent variable X. Table IX shows the results of the linear regressions performed. From the results presented on Table IX, comparing the correlation of the metrics with effort, we observed that: 1) correlations of the new metrics (EF, EFt) were considered significant at a confidence level of 95% for all 7 A logistic nonlinear regression with constant was also performed using Gretl, a free open source tool ( However, the R 2 factor proved that this alternative was worse than the linear regression for all metrics. 8 The factors that influence the effort and the degree of this correlation are discussed in several articles. We suggest the articles available in the BestWeb database ( created as a result of the research of Jorgensen and Shepperd [31]. 
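A per-system evaluation of this kind (a zero-intercept regression of effort on size, summarized by R^2) can be sketched as follows. The service-order values are hypothetical, and the uncentered R^2 used here is one common definition for models without an intercept, which may differ slightly from a spreadsheet's output.

# Illustrative sketch: zero-intercept OLS of effort (man-hours) on size (EFt)
# for one system, plus its coefficient of determination. Hypothetical data.
import numpy as np

size_eft  = np.array([12.0, 30.5, 8.2, 55.1, 21.4, 40.0, 15.3, 27.8])     # hypothetical
effort_mh = np.array([35.0, 80.0, 30.0, 150.0, 60.0, 110.0, 45.0, 70.0])  # hypothetical

x = size_eft.reshape(-1, 1)
slope, *_ = np.linalg.lstsq(x, effort_mh, rcond=None)  # line through the origin
predicted = x @ slope

# Uncentered R^2, a usual choice when the regression has no intercept.
ss_res = np.sum((effort_mh - predicted) ** 2)
ss_tot = np.sum(effort_mh ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"slope = {slope[0]:.2f} mh per EFt, R^2 = {r2:.3f}")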
TABLE VIII STRUCTURE OF THE RECEIVED DATA TO EVALUATE THE METRIC Abbreviation Description Domain OS Identification Number of a service order up to 10 numbers Function Identification Number of a function up to 10 numbers Type Type (categorization) of a functionality according to FPA ALI, AIE, EE, SE or CE Operation Operation performed, which may be I or A inclusion (I) of a new feature or change (A) of a function (maintenance) Final FTR RET Value at the conclusion of the request implementation: if the function is a transaction, indicates the number of referenced logical files (FTR); if it is a logical file, indicates the number of logical records (RET) up to 3 numbers Operation FTR RET Original FTR RET Final DET Operation DET Original TD FP %Impact PM Number of FTR or RET that were included, changed or deleted in the scope of a maintenance of a functionality (only in change operation) Number of FTR or RET originally found in the functionality (only in change operation) Number of DET at the conclusion of the request implementation Number of DET included, changed or deleted in the scope of a functionality maintenance (only in change operation) Number of DET originally found in a functionality (only in change operation) Number of function points of the functionality at the conclusion of the request Percentage of the original function impacted by the maintenance, as measured by NESMA [27] Number of maintenance points of the functionality handled, as measured by NESMA [27] up to 3 numbers up to 3 numbers up to 3 numbers up to 3 numbers up to 3 numbers up to 2 numbers 25, 50, 75, 100, 125, 150 up to 4 numbers System Identification of a system one char Hours Hours dedicated by the team to implement the OS up to 5 numbers Team Number of team members responsible for the implementation of the OS up to 2 numbers systems (p-value less than 0.05). 9 However, the correlation of FPA was not significant for system B (p-value > 0.05); 2) correlations of the new metrics were higher in both systems with the highest number of OS (H and B). A better result in larger samples is an advantage, because the larger the sample size, the greater the reliability of the results, since the p-value has reached the lowest values for these systems; 3) although no metric got a high coefficient of determination (R 2 > 0.8), the new metrics achieved medium correlation (0.8 > R 2 > 0.5) in the four systems evaluated, whereas FPA obtained weak correlation (0.2 > R 2 ) in system B. We considered the confidence level of 91.2% in this correlation (p-value 0.88); 4) the correlation of the new metrics was superior in three out of the four systems (H, B, and D). (A correlation C1 is classified as higher than a correlation C2 if C1 is significant and C2 is not significant or if both correlations are significant and C1 has a higher R 2 than C2.) 9 To be considered a statistically significant correlation at a confidence level of X%, the p-value must be less than 1 - X [30, p.11]. For a 95% confidence level, the p-value must be less than 0.05.

120 OS Effort (mh) OS size (EFt) Fig.4. Dispersion of points (OS) of H system: effort (man-hour) x size (Functional Element of Transaction) TABLE IX RESULTS OF LINEAR REGRESSIONS - EFFORT VERSUS METRICS OF SIZE System H B C D Quantity of OS Quantity of Requests FP R % 11.2% 67.7% 51.8% p-value (test-f) 4.6E E E E-03 R 2 65,1% 60.3% 53.0% 54.7% EF p-value (test-f) 1.5E E E E-03 Proportion to FP s R 2 +10% +438% -22% +5% R % 60.3% 53.0% 54.7% EFt p-value (test-f) 8.5E E E E-03 Proportion to FP s R 2 +11% +438% -22% +5% Given the observations listed above,, we conclude for the analyzed data that the metrics proposed, EF and EFt, have better correlation with effort in comparison to FPA. A higher correlation of the EFt metric in comparison to the EF was perceived for system H. Only system H allowed a differentiation of the result for the two metrics by presenting requests for changing logical files in its service orders. Therefore, we see that the EFt submetric tends to yield better correlations if compared to the EF. This result reinforces the hypothesis that the EFd submetric, which composes the EF metric, does not impact the effort, at least not for coding and testing, which are tasks addressed in the evaluated service orders. Table X contains the explanation of how the proposed metrics, EF and EFt, address the criticisms presented in Section II-B. E. Illustration of the use of the new metrics in IT governance Kaplan and Norton [31, p. 71] claim that what you measure is what you get. According to COBIT 5 [34, p. 13], governance aims to create value by obtaining the benefits through optimized risks and costs. In relation to IT governance, the metrics proposed in this paper not only help to assess the capacity of IT but also enable the optimization of its processes to achieve the results. Metrics support the communication between the different actors of IT governance (see Fig. 5) by enabling the translation of objectives and results in numbers. The quality of a process can be increased by stipulating objectives and by measuring results through metrics [15, p. 19]. So, the production capacity of the process of information systems development can be enhanced to achieve the strategic objectives with the appropriate use of metrics and estimates. TABLE X JUSTIFICATIONS OF HOW THE NEW METRICS ADDRESS THE CRITIQUES PRESENTED IN SECTION II-B Critique Low representation Functions with different complexities have the same size Abrupt transition between functional element ranges Limited sizing of high (and low) complexity functions Undue operation on ordinal scale Inability to measure changes in parts of the function Solution Each possible combination of the functional attributes considered in deriving the complexity in FPA is associated with a distinct value. Functionalities with different complexities, as determined by the number of functional attributes, assume a different size. By applying the formulas of calculation described in Section II-C-4, the variation in size is uniform for each variation of the number of functional attributes, according to its coefficients. There is no limit on the size assigned to a function by applying the calculation formulas described in Section II-C-4. The metrics do not have a ordinal scale with finite values, but rather a quantitative scale with infinite discrete values, which provide greater reliability in operations with values. 
Enables the measurement of changes in part of a functionality considering in the calculation only the functional attributes impacted by the amendment. Software metrics contribute to the three IT governance activities proposed by ISO 38500, mentioned in Section I: to assess, to direct and to monitor. These activities correspond, respectively, to the goals of software metrics mentioned in Section II-A-1: to understand, to improve, and to control the targeted entity of a measurement. Regarding the directions of IT area, Weill and Ross [36, p. 188] state that the creation of metrics for the formalization of strategic choices is one of four management principles that summarize how IT governance helps companies achieve their strategic objectives. Metrics must capture the progress toward strategic goals and thus indicate if IT governance is working or not [36, p. 188]. Kaplan and Norton [37, pp ] claim that strategies need to be translated into a set of goals and metrics in order to have everyone s commitment. They claim that the Balanced Scorecard (BSC) is a tool which provides knowledge of longterm strategies at all levels of the organization and also promotes the alignment of department and individual goals with those strategies. According to ITGI [2, p. 29], BSC, besides being a holistic view of business operations, also contributes to connect long-term strategic objectives with short-term actions. To adapt the concepts of the BSC for the IT function, the perspectives of a BSC were re-established [38, p. 3]. Table XI presents the perspectives of a BSC-IT and their base questions. Owners and Stake holders Dele gate Account table Gover ning Body Set di rection Monitor Ma na ge ment Fig.5. Roles, activities and relationships of IT governance. Source: ISACA [35, p. 24] In struct Report Ope ra tions

121 TABLE XI PERSPECTIVES OF A BSC-IT Perspective Base question BSC corporative perspective Contribution to How do business executives see Financial the business the IT area? Customer How do customers see the IT Customer orientation area? Operational How effective and efficient are Internal Processes excellence the IT processes? Future How IT is prepared for future Learning orientation needs? Source: inspired in ITGI [2, p. 31] According to ITGI [2, p. 30], BSC-IT effectively helps the governing body to achieve alignment between IT and the business. This is one of the best practices for measuring performance [2, p. 46]. BSC-IT is a tool that organizes information for the governance committee, creates consensus among the stakeholders about the strategic objectives of IT, demonstrates the effectiveness and the value added by IT and communicates information about capacity, performance and risks [2, p. 30]. Van Grembergen [39, p.2] states that the relationship between IT and the business can be more explicitly expressed through a cascade of scorecards. Van Grembergen [39, p.2] divides BSC-IT into two: BSC-IT-Development and BSC-IT- Operations. Rohm and Malinoski [40], members of the Balanced Scorecard Institute, present a process with nine steps to build and implement strategies based on scorecard. Bostelman and Becker [41] present a method to derive objectives and metrics from the combination of BSC and the Goal Question Metric (GQM) technique proposed by Basili and Weiss [42]. The association with GQM method is consistent to what ISACA [43, p. 74] says: good strategies start with the right questions. The metric proposed in this paper can compose several indicators that can be used in BSC- IT - Development. Regarding the activities of IT monitoring and assessment [3, p. 7], metrics enable the monitoring of the improvement rate of organizations toward a mature and improved process [1, p. 473]. Performance measurement, which is object of monitoring and assessment, is one of the five focus areas of IT governance, and it is classified as a driver to achieve the results [2, p. 19]. To complement the illustration of the applicability of the new metric for IT governance, Table XII shows some indicators based on EF. 10 The same indicator can be used on different perspectives of a BSC-IT-Development, depending on the targeted entity and the objective of the measurement, such as the following examples. The productivity of a resource (e.g., staff, technology) may be associated with the Future Orientation perspective, as it seeks to answer whether IT is prepared for future needs. The same indicator, if associated with an internal process, encoding, for example, reflects a vision of its production capacity, in the Operational Excellence perspective. In the Customer Orientation perspective, production can be divided by client, showing the proportion of IT production to each business area. The evaluation of the variation in IT production in contrast to the production of business would be an example of using the indicator in the Contribution to the Business perspective. The choice of indicators aimed to encompass the five fundamental dimensions mentioned in Section II-A-1: size, effort, time, quality, and rework. A sixth dimension was added: the expected benefit. According to Rubin [44, p. 1], every investment in IT, from a simple training to the creation of a corporate system, should be aligned to a priority of the business whose success must be measured in terms of a specific value. 
Investigating the concepts and processes associated with the determination of the value of a function (or a system or the IT area) is not part of the scope of this work. This is a complex and still immature subject. The dimension of each indicator is shown in the third column of Table XII. Some measurements were normalized by being divided by the number of functional elements of the product or process, tactics used to allow comparison across projects and systems of different sizes. The ability to standardize comparisons, as in a BSC, is one of the key features of software metrics [45, p. 493]. It is similar to normalize construction metrics based on square meter, a common practice [46, p. 161]. As Dennis argues [47, p. 302], one should not make decisions based on a single indicator, but from a vision formed by several complementary indicators. As IT has assumed greater prominence as a facilitator to the achievement of business strategy, the use of dashboards to monitor its performance, under appropriate criteria, has become popular among company managers [43, p. 74]. Abreu and Fernandes [48, p. 167] propose some topics that may compose such strategic and tactical control panels of IT. Fig. 6 illustrates the behavior of the indicators shown in Table XII with annual verification for hypothetical systems A, TABLE XII DESCRIPTION OF ILLUSTRATIVE INDICATORS Metric Unit Dimension Description of the calculation for a Functional size Production in the period Production on rework Productivity system EF Size sum of the functional size of the functionalities that compose the system at the end of the period EF Effort sum of the functional size of requests for inclusion, deletion, and change implemented in the period EF Rework sum of the functional size of requests for deletion and change implemented in the period Functional Elements / Man hour Error density Failures / Functional Element Delivery speed Density of the expected benefit Effort Quality Functional Time Elements / Hour $ / EF Expected benefit sum of the functional size of requests implemented in the period / sum of the efforts of all persons allocated to the system activities in the period number of failures resulting from the use of the system in a period / size of the system at the end of the period sum of the size of the features implemented in the period / elapsed time benefit expected by the system in the period / system size 10 The illustration is not restricted to EF, as the indicators could use others software size metrics.
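The indicators of Table XII are simple ratios over a period's measurements. The sketch below computes them for one hypothetical system, with invented inputs, only to show how EF-based dashboard figures could be derived; it is not taken from the paper's data.

# Hypothetical period data for one system (all values invented for illustration).
system_size_ef       = 920.0   # functional size at the end of the period (EF)
implemented_ef       = 140.0   # EF of inclusion, deletion and change requests in the period
rework_ef            = 35.0    # EF of deletion and change requests only
effort_man_hours     = 560.0   # total effort of everyone allocated to the system
failures_in_period   = 12      # failures observed in production
elapsed_months       = 3.0
expected_benefit     = 250_000.0  # expected benefit attributed to the system in the period

indicators = {
    "production in the period (EF)": implemented_ef,
    "production on rework (EF)": rework_ef,
    "productivity (EF/man-hour)": implemented_ef / effort_man_hours,
    "error density (failures/EF)": failures_in_period / system_size_ef,
    "delivery speed (EF/month)": implemented_ef / elapsed_months,
    "density of expected benefit ($/EF)": expected_benefit / system_size_ef,
}
for name, value in indicators.items():
    print(f"{name}: {value:.3f}")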

122 B, C, and D. 11 The vertical solid line indicates how the indicator to the system was in the previous period, allowing a view of the proportion of the increasing or decreasing of the values over the period. In the productivity column (column 4), a short line at its base indicates, for example, a pattern value obtained by benchmark. The vertical dashed line metric associated with the production in the period (2) indicates the target set in the period for each system: system A reached it, system D exceeded it, and systems B and C failed. In one illustrative and superficial analysis of the indicators for system C, one can associate the cause of not achieving the production goal during that period (2) with the decrease of the delivery speed (6) and the increase of the production on rework (3), resulted, most likely, from the growth in the error density (5). The reduction on the delivery speed (6) which can be associated with decreased productivity (4) led to a low growth of the functional size of the system (1) during that period. These negative results led to a decrease in the density of the expected benefit (7). Fig. 6 represents an option of visualization of the governance indicators shown in Table XII: a multi-metrics chart of multi-instances of a targeted entity or a targeted attribute. The vertical column width is variable depending on the values of the indicators (horizontal axis) associated with the different instances of entities or attributes of interest (vertical axis). The same vertical space is allocated for each entity instance. The width of the colored area, which is traced from the left to the right, indicates graphically the value of the indicator for the instance. In the hands of the governance committee, correct indicators can help senior management, directly or through any governance structure, to identify how IT management is behaving and to identify problems and the appropriate course of action when necessary. D C B A 1 Functio nal Size 2 Produc tion in the period 3 Produc tion on rework 4 Producti vity 5 Error density 6 Deli very speed Fig. 6. Annual indicators of systems A, B, C and D 7 Density of the expected benefit 11 The fictitious values associated with the indicators were adjusted so that all vertical columns had the same maximum width. The adjustment was done by correlating the maximum value for the indicator with the width defined for the column. The other values were derived by a simple rule of three. III. FINAL CONSIDERATIONS The five specific objectives proposed for this work in Section I were achieved, albeit with limitations and with possibilities for improvement that are translated into proposals for future work. The main result was the proposition of a new metric EF and its submetric EFt. The new metrics, free of some deficiencies of the FPA technique taken as a basis for their derivation, reached a higher correlation with effort than the FPA metric, in the context of the analyzed data. The paper also illustrated the connection found between metrics and IT governance activities, either in assessment and monitoring, through use in dashboards, or in giving direction, through use in BSC-IT. There are possibilities for future work in relation to each of the five specific objectives. Regarding the conceptualization and the categorization of software metrics, a comprehensive literature research is necessary to the construction of a wider and updated categorization of software metrics. 
Regarding the presentation of the criticisms of FPA, only the criticisms addressed by the newly proposed metrics were presented. Further research on the theme, such as a bibliographic study to catalog the criticisms, would serve to encourage other propositions of software metrics. Regarding the process of creating the new metric, it could be improved, or it could be applied to other metrics of any area of knowledge based on ordinal values derived from complexity tables, as FPA is (e.g., the Use Case Points metric proposed by Karner [49]). Future works may also propose and evaluate changes in the rules and in the scope of the new metrics. Regarding the evaluation of the new metric, the limitation of using data from only one organization could be overcome in new works. Practical applications of the metric could also be illustrated. New works could compare the results of EF with the EFt submetric, as well as compare both with other software metrics. Different statistical models could be used to evaluate their correlation with effort, even in specific contexts (e.g., development, maintenance, development platforms). We expect to achieve a higher correlation of the new metric with effort than FPA in agile methods, considering its capacity for partial functionality sizing (6th criticism in Section II-B). Regarding the connection with IT governance, a work about the use of metrics in all IT governance activities is promising. The proposed graph for visualization of multiple indicators of multiple instances, through columns with varying widths along their length, can also be standardized and improved in future work.12

12 In (accessed on 04 November 2012) there is a graph that functionally resembles the proposed one: heatmap plotting. However, it differs in format and in its possibilities of evolution. As we did not find any similar graph, we presume it to be a new format for viewing the behavior of Multiple Indicators about Multiple Instances through Columns with Varying Widths along their Extension (MIMICoVaWE). An example of evolution would be a variation in the color tone of a cell according to a specific criterion (e.g., in relation to the achievement of a specified goal).
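As footnote 11 describes, the widths drawn in the MIMICoVaWE chart are obtained by scaling each indicator against its maximum value (a simple rule of three). The sketch below only illustrates that scaling; the column width and the sample values are hypothetical and are not taken from the paper.

```python
# Illustrative sketch of the width scaling described in footnote 11:
# the maximum value of an indicator maps to the full column width,
# and the other values are scaled proportionally (rule of three).

def column_widths(values, max_width=40):
    top = max(values)
    return [round(v / top * max_width, 1) for v in values]

# Hypothetical productivity values for systems A-D.
productivity = {"A": 12.0, "B": 7.5, "C": 4.0, "D": 15.0}
widths = dict(zip(productivity, column_widths(list(productivity.values()))))
print(widths)  # {'A': 32.0, 'B': 20.0, 'C': 10.7, 'D': 40.0}
```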

A suggestion for future work is noteworthy: the definition of an indicator that shows the level of maturity of a company regarding the use of metrics in IT governance. Among other aspects, it could consider: the breadth of the entities evaluated (e.g., systems, projects, processes, teams), the dimensions treated (e.g., size, rework, quality, effectiveness), and the effective use of the indicators (e.g., monitoring, assessment). Finally, we expect that the new metric EF and its submetric EFt help increase the contribution of IT to the business in an objective, reliable, and visible way.

REFERENCES

[1] H. A. Rubin, Software process maturity: measuring its impact on productivity and quality, in Proc. of the 15th int. conf. on Softw. Eng., IEEE Computer Society Press, pp , [2] ITGI - IT Governance Institute, Board briefing on IT Governance, 2nd ed, Isaca, [3] ISO/IEC, 38500: Corporate governance of information technology, [4] A. J. Albrecht, "Measuring application development productivity," in Guide/Share Application Develop. Symp. Proc., pp [5] ISO/IEC, 20926: Software measurement - IFPUG functional size measurement method, [6] IFPUG - International Function Point Users Group, Counting Practices Manual, Version 4.3.1, IFPUG, [7] A. Albrecht and J. Gaffney Jr., Software function, source lines of code, and development effort prediction: A software science validation, IEEE Trans. Software Eng., vol. 9, pp , [8] C. F. Kemerer, An empirical validation of software cost estimation models, Communications of the ACM, vol. 30, no. 5, pp , [9] Brazil. MCT - Ministério da Ciência e Tecnologia, Quality Research in Brazilian Software Industry; Pesquisa de Qualidade no Setor de Software Brasileiro 2009, Brasília. [Online]. 204p. Available: [10] M. Bundschuh and C. Dekkers, The IT measurement compendium: estimating and benchmarking success with functional size measurement, Springer, [11] C. E. Vazquez, G. S. Simões and R. M. Albert, Function Point Analysis: Measurement, Estimates and Project Management Software; Análise de Pontos de Função: Medição, Estimativas e Gerenciamento de Projetos de Software, Editora Érica, São Paulo, [12] Brazil. SISP - Sistema de Administração dos Recursos de Tecnologia da Informação (2012), Metrics Roadmap of SISP - Version 2.0; Roteiro de Métricas de Software do SISP Versão 2.0, Brasília: Ministério do Planejamento, Orçamento e Gestão. Secretaria de Logística e Tecnologia da Informação. [Online]. Available: gcie/download/file/roteiro_de_metricas_de_software_do_sisp_-_v2.0.pdf [13] N. E. Fenton and S. L. Pfleeger, Software metrics: a rigorous and practical approach, PWS Publishing Co, [14] B. Kitchenham, S. L. Pfleeger and N. Fenton, Towards a framework for software measurement validation, IEEE Trans. Softw. Eng., vol. 21, no. 12, pp , [15] S. Moser, Measurement and estimation of software and software processes, Ph.D. dissertation, University of Berne, Switzerland, [16] E. Chikofsky and H. A. Rubin, Using metrics to justify investment in IT, IT Professional, vol. 1, no. 2, pp [17] C. P. Beyers, "Estimating software development projects," in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp , [18] C. Gencel and O. Demirors, Functional size measurement revisited, ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 17, no. 3, p. 15, [19] A. Abran and P. N. Robillard, "Function Points: A Study of Their Measurement Processes and Scale Transformations," Journal of Systems and Software, vol. 25, pp [20] B. Kitchenham, The problem with function points, IEEE Software, vol. 14, no. 2, pp , [21] B.
Kitchenham, K. Känsälä, Inter-item correlations among function points, in Proc.15th Int. Conf. on Softw. Eng.,, IEEE Computer Society Press, pp , [22] T. Kralj, I. Rozman, M. Heričko and A.Živkovič, Improved standard FPA method resolving problems with upper boundaries in the rating complexity process, Journal of Systems and Software, vol. 77, no. 2, pp , [23] S. L. Pfleeger, R. Jeffery, B. Curtis and B. Kitchenham, Status report on software measurement, IEEE Software, vol. 14, no. 2, pp , [24] O. Turetken, O. Demirors, C. Gencel, O. O. Top, and B. Ozkan, The Effect of Entity Generalization on Software Functional Sizing: A Case Study, in Product-Focused Software Process Improvement, Springer Berlin Heidelberg, pp , [25] W. Xia, D. Ho, L. F. Capretz, and F. Ahmed, Updating weight values for function point counting, International Journal of Hybrid Intelligent Systems, vol. 6, no. 1, pp. 1-14, [26] G. Antoniol, R. Fiutem and C. Lokan, "Object-Oriented Function Points: An Empirical Validation," Empirical Software Engineering, vol. 8, no. 3, pp , 2003 [27] NESMA - Netherlands Software Metrics Association, Function Point Analysis For Software Enhancement, [Online], Available: re_enhancement_(v2.2.1).pdf [28] ISO/IEC, 20968: MkII Function Point Analysis - Counting Practices Manual, [29] ISO/IEC, 19761: COSMIC: a functional size measurement method, [30] A. Sartoris, Estatística e introdução à econometria; Introduction to Statistics and Econometrics, Saraiva S/A Livreiros Editores, [31] M. Jorgensen and M. Shepperd, A systematic review of software development cost estimation studies, IEEE Trans. Softw. Eng., vol. 33, no. 1, pp , [32] M. L. Orlov, Multiple Linear Regression Analysis Using Microsoft Excel, Chemistry Department, Oregon State University, [33] R. S. Kaplan and D. P. Norton, The balanced scorecard - measures that drive performance, Harvard business review, vol. 70, no. 1, pp , [34] Isaca, COBIT 5: Enabling Processes, Isaca, [35] Isaca, COBIT 5: A Business Framework for the Governance and Management of IT, Isaca, [36] P. Weill and J. W. Ross, IT governance: How top performers manage IT decision rights for superior results, Harvard Business Press, [37] R. S. Kaplan and D. P. Norton, Using the balanced scorecard as a strategic management system, Harvard business review, vol.74, no. 1, pp , [38] W. Van Grembergen and R. Van Bruggen, Measuring and improving corporate information technology through the balanced scorecard, The Electronic Journal of Information Systems Evaluation, vol. 1, no [39] W. Van Grembergen, The balanced scorecard and IT governance, Information Systems Control Journal, Vol 2, pp.40-43, [40] H. Rohm and M. Malinoski, Strategy-Based Balanced Scorecards for Technology, Balanced Scorecard Institute, [41] S. A. Becker and M. L. Bostelman, "Aligning strategic and project measurement systems," IEEE Software, vol. 16, no. 3, pp , May/Jun [42] V. R. Basili, and D. M. Weiss, "A Methodology for Collecting Valid Software Engineering Data," IEEE Trans. Softw. Eng., vol. SE-10, no. 6, pp , Nov [43] Isaca,. CGEIT Review Manual 2010, ISACA. [44] H. A. Rubin, How to Measure IT Value, CIO insight, [45] B. Hufschmidt, Software balanced scorecards: the icing on the cake, in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp [46] C. A. Dekkers, How and when can functional size fit with a measurement program?," in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp , [47] S. P. 
Dennis, Avoiding obstacles and common pitfalls in the building of an effective metrics program, in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp , [48] A. A. Fernandes and V. F. Abreu, Deploying IT governance: from strategy to process and services management; Implantando a governança de TI: da estratégia à gestão de processos e serviços, Brasport, [49] G. Karner. Metrics for Objectory, Diploma thesis, University of Link ping, Sweden, No. LiTH-IDA-Ex9344:21, December 1993.

124 An Approach to Business Processes Decomposition for Cloud Deployment Uma Abordagem para Decomposição de Processos de Negócio para Execução em Nuvens Computacionais Lucas Venezian Povoa, Wanderley Lopes de Souza, Antonio Francisco do Prado Departamento de Computação (DC) Universidade Federal de São Carlos (UFSCar) São Carlos, São Paulo - Brazil {lucas.povoa, desouza, prado}@dc.ufscar.br Resumo Devido a requisitos de segurança, certos dados ou atividades de um processo de negócio devem ser mantidos nas premissas do usuário, enquanto outros podem ser alocados numa nuvem computacional. Este artigo apresenta uma abordagem genérica para a decomposição de processos de negócio que considera a alocação de atividades e dados. Foram desenvolvidas transformações para decompor processos representados em WS-BPEL em subprocessos a serem implantados nas premissas do usuário e numa nuvem computacional. Essa abordagem foi demonstrada com um estudo de caso no domínio da Saúde. Palavras-chave Gerenciamento de Processos de Negócio; Computação em Nuvem; Decomposição de Processos; WS- BPEL; Modelo Baseado em Grafos. Abstract Due to safety requirements, certain data or activities of a business process should be kept within the user premises, while others can be allocated to a cloud environment. This paper presents a generic approach to business processes decomposition taking into account the allocation of activities and data. We designed transformations to decompose business processes represented in WS-BPEL into sub-processes to be deployed on the user premise and in the cloud. We demonstrate our approach with a case study from the healthcare domain. Keywords Business Process Management; Cloud Computing; Process Decomposition; WS-BPEL; Graph-based model. I. INTRODUÇÃO Atualmente várias organizações dispõem de grandes sistemas computacionais a fim de atenderem à crescente demanda por processamento e armazenamento de um volume cada vez maior de dados. Enquanto na indústria grandes companhias constroem centros de dados em larga escala, para fornecerem serviços Web rápidos e confiáveis, na academia muitos projetos de pesquisa envolvem conjuntos de dados em larga escala e alto poder de processamento, geralmente providos por supercomputadores. Dessa demanda por enormes centros de dados emergiu o conceito de Computação em Nuvem [1], onde tecnologias de Luís Ferreira Pires, Evert F. Duipmans Faculty of Electrical Engineering, Mathematics and Computing Science (EEMCS) University of Twente (UT) Enschede, Overijssel - The Netherlands l.ferreirapires@utwente.nl, e.f.duipmans@student.utwente.nl informação e comunicação são oferecidas como serviços via Internet. Google App Engine, Amazon Elastic Compute Cloud (EC2), Manjrasoft Aneka e Microsoft Azure são alguns exemplos de nuvens computacionais [2]. O cerne da Computação em Nuvem é oferecer recursos computacionais, de forma que seus usuários paguem somente pelo seu uso e tendo a percepção de que estes são ilimitados. 
O National Institute of Standards and Technology (NIST) identifica três modelos de serviço [3]: (a) Softwareas-a-Service (SaaS), um software hospedado num servidor é oferecido e usuários acessam-no via alguma interface através de uma rede local ou Internet (e.g., Facebook, Gmail); (b) Platform-as-a-Service (PaaS), uma plataforma é oferecida, usuários implantam suas aplicações na mesma e esta oferece recursos como servidor Web e bases de dados (e.g., Windows Azure, Google AppEngine); (c) Infrastructure-asa-Service (IaaS), uma máquina virtual com certa capacidade de armazenamento é oferecida e usuários alugam esses recursos (e.g., Amazon EC2, GoGrid). Embora muito promissora, a Computação em Nuvem enfrenta obstáculos que devem ser transpostos para que não impeçam o seu rápido crescimento. Segurança dos dados é uma grande preocupação dos usuários, quando estes armazenam informações confidenciais nos servidores das nuvens computacionais. Isto porque geralmente esses servidores são operados por fornecedores comerciais, nos quais os usuários não depositam total confiança [4]. Em alguns domínios de aplicação, a confidencialidade não é só uma questão de segurança ou privacidade, mas também uma questão jurídica. A Saúde é um desses domínios, já que a divulgação de informações devem satisfazer requisitos legais, tais como os presentes no Health Insurance Portability and Accountability Act (HIPAA) [5]. Business Process Management (BPM) tem sido bastante empregado por diversas empresas, nesta última década, para gerenciar e aperfeiçoar seus processos de negócio [6]. Um processo de negócio consiste de atividades exercidas por humanos ou sistemas e um Business Process Management System (BPMS) dispõe de um executor (engine), no qual

125 instâncias de um processo de negócio são coordenadas e monitoradas. A compra de um BPMS pode ser um alto investimento para uma empresa, já que software e hardware precisam ser adquiridos e profissionais qualificados contratados. Escalabilidade também pode ser um problema, já que um executor é somente capaz de coordenar um número limitado de instâncias de um processo simultaneamente, sendo necessária a compra de servidores adicionais para lidar com situações de pico de carga. BPMSs baseados em nuvens computacionais e oferecidos como SaaS via Internet podem ser uma solução para o problema de escalabilidade. Entretanto, o medo de perder ou expor dados confidenciais é um dos maiores obstáculos para a implantação de BPMSs em nuvens computacionais, além do que há atividades num processo de negócio que podem não se beneficiar dessas nuvens. Por exemplo, uma atividade que não exige intensa computação pode tornar-se mais onerosa se colocada numa nuvem, já que os dados a serem processados por essa atividade devem ser enviados à nuvem, o que pode levar mais tempo para a sua execução e custar mais caro, uma vez que transferência de dados é um dos fatores de faturamento das nuvens computacionais [7]. Outros modelos de seviço em nuvens computacionais, além dos identificados pelo NIST, são encontrados na literatura. Por exemplo, no modelo Process-as-a-Service um processo de negócio é executado parcial ou totalmente numa nuvem computacional [8]. Devido a requisitos de segurança, nesse modelo certos dados ou atividades devem ser mantidos nas premissas do usuário enquanto outros podem ser alocados numa nuvem, o que requer uma decomposição desse processo. Neste sentido, este artigo apresenta uma abordagem genérica para a decomposição de processos de negócio, oferecendo uma solução técnica para esse problema. A sequência do mesmo está assim organizada: a Seção II discorre sobre BPM; a Seção III apresenta a abordagem proposta; a Seção IV descreve um estudo de caso acompanhado de análises de desempenho e custo; a Seção V trata de trabalhos correlatos; e a Seção VI expõe as considerações finais apontando para trabalhos futuros. II. BUSINESS PROCESS MANAGEMENT BPM parte do princípio que cada produto oferecido por uma empresa é o resultado de um determinado número de atividades desempenhadas por humanos, sistemas ou ambos, e as metas do BPM são identificar, modelar, monitorar, aperfeiçoar e revisar processos de negócio dessa empresa. Identificando essas atividades via workflows, a empresa tem uma visão de seus processos, e monitorando e revisando os mesmos esta pode detectar problemas e realizar melhorias. O ciclo de vida de um processo de negócio possui as fases: Projeto, os processos de negócio são identificados e capturados em modelos geralmente gráficos, possibilitando aos stakeholders entendê-los e refinálos com certa facilidade. As atividades de um processo são identificadas supervisionando o processo existente e considerando a estrutura da empresa e os seus recursos técnicos, sendo que Business Process Model and Notation (BPMN)[9] é a linguagem mais usada nessa fase. Uma vez capturados nos modelos, os processos podem ser simulados e validados, fornecendo aos stakeholders uma visão da sua correção e adequação; Implementação, um modelo de processo de negócio é implementado manual, semi-automática ou automaticamente. Quando automação não é requerida ou possível, listas de trabalho são criadas com tarefas bem definidas, as quais são atribuídas a funcionários da empresa. 
O problema é que não há um sistema central para o monitoramento das instâncias do processo, devendo este ser realizado por cada funcionário envolvido. Com a participação de sistemas de informação, um BPMS pode usar o modelo desse processo e criar instâncias do mesmo, sendo capaz de monitorar cada uma destas e fornecer uma visão das atividades realizadas, do tempo consumido e da sua conclusão ou falha; Promulgação, o processo de negócio é executado e para cada iniciação uma instância do mesmo é criada. Tais instâncias são gerenciadas por um BPMS, que as acompanha via um monitor, fornecendo um quadro das que estão em execução e das que terminaram, e detectando eventuais problemas que podem ocorrer com essas instâncias; e Avaliação, a informação monitorada e coletada pelo BPMS é usada para revisar o processo de negócio, sendo que as conclusões obtidas nessa fase serão as entradas da próxima interação no ciclo de vida. A. WS-BPEL BPMSs precisam de linguagens executáveis, sobretudo nas três últimas fases, e uma vez que as usadas na fase de projeto são geralmente muito abstratas, linguagens tais como Web Services Business Process Execution Language (WS- BPEL) [10] tornam-se necessárias. Concebida pela Organization for the Advancement of Structured Information Standards (OASIS) para a descrição de processos de negócio e de seus protocolos, WS-BPEL foi definida a partir dos padrões Web WSDL 1.1, XML Schema, XPath 1.0, XSLT 1.0 e Infoset. As suas principais construções serão ilustradas com o exemplo do Picture Archiving and Communication System (PACS) [11], um sistema de arquivamento e comunicação para diagnóstico por imagem, cujo workflow é apresentado na Fig. 1. Fig. 1. Workflow do PACS descrito como um processo monolítico.

126 Um processo descrito em WS-BPEL é um container, onde são declaradas as atividades a serem executadas, dados, tipos de manipuladores (handlers) e as relações com parceiros externos. PACS pode ter sua descrição em WS- BPEL iniciada por <process name="pacsbusinessprocess" targetnamespace=" xmlns=" WS-BPEL permite agregar Web Services, definir a lógica de cada interação de serviço e orquestrar essas interações. Uma interação envolve dois lados (o processo e um parceiro), é descrita via o partnerlink, que é um canal de comunicação caracterizado por um partnerlinktype, myrole e partnerrole, sendo que essas informações identificam a funcionalidade a ser provida por ambos os lados. Em PACS pode ser definido um canal de comunicação entre esse processo e um cliente como <partnerlinks> <partnerlink name="client" partnerlinktype="tns:pacsbusinessprocess" myrole="pacsbusinessprocessprovider" partnerrole="pacsbusinessprocessrequester" /> </partnerlinks> Para a troca de mensagens emprega-se receive, reply e invoke. As duas primeiras permitem a um processo acessar componentes externos através de um protocolo de comunicação (e.g., SOAP), sendo que receive permite ao processo captar requisições desses componentes. Em PACS, a requisição de um radiologista pela persistência e detecção automática de nódulos de uma tomografia de pulmão, pode ser captada por <receive name= ImagePersistenceAndAnalysisReq partnerlink="processprovider" operation="initiate" variable="input" createinstance="yes"/> Para que um processo possa emitir uma resposta a um solicitante é necessário um reply relacionado a um receive. Um possível reply para o receive acima é <reply name= ImagePersistenceAndAnalysisResponse partnerlink="processprovider" operation="initiate" variable="output"/> Um processo requisita uma operação oferecida por um Web Service através de invoke. A operação de persistência de imagem médica pode ser requisitada por <invoke name= ImagePersistence partnerlink="imagpl" operation="persistimage" inputvariable="imagevar" outputvariable= imageresp /> Em geral um processo de negócio contém desvios condicionados a critérios. Em PACS, imageresp determina a invocação da função de detecção automática de nódulo ou o disparo de uma exceção. Esse desvio pode ser descrito como <if> <condition>imageresp</condition> <invoke name= AutomaticAnalysis /> <else> <throw faultname= PersistenceException /> </else> </if> Atividades executadas iterativamente devem ser declaradas via while, onde é realizada uma avaliação para uma posterior execução, ou via repeat until, onde a avaliação sucede a execução. Em PACS, a persistência de várias imagens pode ser descrita como <while> <condition> currentimagenumber <= numberofimages </condition> <invoke name= persistimage /> <assign> <copy> <from>$currentimagenumber + 1</from> <to>$currentimagenumber</to> </copy> </assign> </while> Atividades executadas paralelamente devem ser declaradas via flow. Em PACS, as operações de persistência de uma imagem e de análise desta podem ser declaradas para execução em paralelo como <flow name= parallelrequest > <invoke name= MedicalImagePersistence /> <invoke name= AutomaticAnalysis /> </flow> B. BPM em Nuvens Computacionais O modelo Process enactment, Activity execution and Data (PAD) é apresentado em [7], onde são investigadas possíveis distribuições de um BPM entre as premissas e uma nuvem, considerando a partição de atividades e dados, mas não considerando a partição do executor de processo. 
Em [12] o PAD é estendido, possibilitando também a partição do executor, conforme ilustrado na Fig. 2. É comum um processo de negócio conter atividades a serem executadas em sequência. Em PACS, a solicitação de persistência de imagem médica, a execução dessa tarefa e a emissão da resposta ao solicitante, podem ser descritas como <sequence name= ImagePersistenceSequence > <receive name= ImagePersistenceRequest /> <invoke name= ImagePersistence /> <reply name= ImagePersistenceResponse /> </sequence> Fig. 2. Possibilidades de partição e distribuição de BPM.

127 Processos de negócio definem fluxos de controle, que regulam as atividades e a sequência das mesmas, e fluxos de dados, que determinam como estes são transferidos de uma atividade a outra. Um executor tem que lidar com ambos os tipos e, se dados sensíveis estiverem presentes, os fluxos de dados devem ser protegidos. Na dissertação de mestrado [13] é proposto um framework para a decomposição de um processo em dois processos colaborativos, com base numa lista de distribuição de atividades e dados, onde restrições relativas aos dados podem ser definidas, para assegurar que dados sensíveis permaneçam nas premissas. A Fig. 3 ilustra essa decomposição. Fig. 3. Exemplo de decomposição. III. ABORDAGEM PROPOSTA O framework apresentado em [13], cujas fases são ilustradas na Fig. 4, contém uma Representação Intermediária (RI) baseada em grafos, na qual conceitos de processos de negócio são capturados. A decomposição de um processo passa pela RI, sendo que a adoção de uma linguagem de processos de negócio requer transformações da linguagem para a RI (lifting) e vice-versa (grounding). Em [13] foi adotada a linguagem abstrata Amber [14], efetuada uma análise para definir as regras de decomposição suportadas pelo framework, concebidos algoritmos para a sua implementação, os quais realizam transformações em grafos, e concebido um algoritmo para verificar se restrições relativas aos dados são violadas pela decomposição. base na avaliação de uma condição; mixagem simples, que une múltiplos ramos alternativos para um único desses ser executado; e ciclos arbitrários, que modela comportamento recursivo. Essa RI suporta também: dependência de dados, que representa explicitamente as dependências de dados entre os nós, que é necessária pois o processo original é decomposto em processos colaborativos e dados sensíveis podem estar presentes; e comunicação, que permite descrever como um processo invoca outro. A RI emprega um modelo baseado em grafos para representar processos, onde um nó representa uma atividade ou um elemento de controle e uma aresta representa uma relação entre dois nós. Esses nós e arestas foram especializados, definindo-se uma representação gráfica para cada especialização: Atividade, cada nó tem em geral uma aresta de controle de entrada e uma de saída; Comportamento paralelo, ilustrado na Fig. 5 (a), é modelado com nós flow e eflow. O primeiro divide um ramo de execução em vários ramos paralelos e possui no mínimo duas arestas de controle de saída. O segundo junta vários ramos paralelos num único ramo e possui duas ou mais arestas de controle de entrada e no máximo uma de saída; Comportamento condicional, ilustrado na Fig. 5 (b), é modelado com nós if e eif. O primeiro possui duas arestas de controle de saída, uma rotulada true a outra false, e após a avaliação da condição somente uma destas é tomada. O segundo junta ramos condicionais, convertendo-os num único ramo de saída; Comportamento repetitivo, ilustrado nas Fig. 5 (c) e (d), é modelado com um único nó loop e, após a avaliação da condição, o ramo do comportamento repetitivo é tomado ou abandonado. Esse nó pode estar antes ou depois do comportamento, sendo que no primeiro caso resulta em zero ou mais execuções e no segundo em pelo menos uma execução; Fig. 4. Etapas envolvidas no framework. A. 
Representação Intermediária Para definir os requisitos da RI, foram adotados os seguintes padrões de workflow [15]: sequência, que modela fluxos de controle e expressa a sequência de execução de atividades num processo; divisão paralela, que divide um processo em dois ou mais ramos para execução simultânea; sincronização, que junta múltiplos ramos num único ramo de execução; escolha condicional, que executa um ramo com Fig. 5. Construções para comportamentos paralelo (a), condicional (b) e repetitivo com loop antes (c) e loop depois (d). Comunicações síncrona e assíncrona são ilustradas na Fig. 6 (a) e Fig. 6 (b) respectivamente. Por exemplo, a síncrona é modelada com os nós ireq,

128 ires, rec e rep, através dos quais dois subprocessos, partes do processo global, se comunicam; Fig. 6. Comunicações síncrona (a) e assíncrona (b). Arestas de controle, representadas por setas sólidas, modelam o fluxo de controle. Uma aresta de controle é disparada pelo seu nó de origem, tão logo a ação associada ao mesmo termina, e o nó terminal dessa aresta aguarda pelo seu disparo para iniciar a sua ação associada. Caso o nó de origem seja if, essa aresta é rotulada true ou false, e caso a condição avaliada corresponda a esse rótulo, esta é disparada pelo nó; Arestas de dados possibilitam investigar os efeitos na troca de dados causados pelas mudanças das atividades de um processo a outro, permitindo verificar se alguma restrição aos dados foi violada durante a partição do processo original. Uma aresta de dados é representada por uma seta tracejada. Uma aresta de dados do nó de origem ao nó terminal implica que os dados definidos no primeiro são usados pelo segundo. Cada aresta possui um rótulo, que define o nome dos dados compartilhados; Arestas de comunicação permitem enviar controle e dados a diferentes processos e são rotuladas com nomes de itens de dados enviados via as mesmas. Formalmente, um grafo na RI é uma tupla (A, C, S, ctype, stype, E, L, nlabel, elabel), onde: A é um conjunto de nós de atividade; C é um conjunto de nós de comunicação; S é um conjunto de nós estruturais {flow, eflow, if, eif, loop}; Os conjuntos par a par A, C e S são disjuntos; ctype : C {InvokeRequest, InvokeResponse, Receive, Reply} atribui um tipo comunicador a um nó de comunicação; stype : S {Flow, EndFlow, If, EndIf, Loop} atribui um tipo nó controle a um nó de controle; E = E ctrl E data E com é o conjunto de arestas no grafo, sendo que uma aresta é definida como (n 1, etype, n 2 ), onde etype {Control, Data, Communication} é o tipo da aresta e n 1, n 2 A C S; E ctrl é o conjunto de arestas de fluxo de controle, onde e=(n 1, Control, n 2 ) com n 1, n 2 A C S; E data é o conjunto de arestas de dados, onde e = (n 1, Data, n 2 ) com n 1, n 2 A C S; E com é o conjunto de arestas de comunicação, onde e = (n 1, Communication, n 2 ) com n 1, n 2 C; L é um conjunto de rótulos textuais que podem ser atribuídos aos nós e arestas; nlabel : N L, onde N = A C S atribui um rótulo textual a um nó; elabel : E L atribui um rótulo textual a uma aresta; Os conjuntos N e E são disjuntos. B. Decomposição Em [13], para cada construção da RI foram identificadas decomposições para processos situados nas premissas, que possuem atividades a serem alocadas na nuvem, e vice-versa. A Fig. 7 ilustra um conjunto de atividades sequenciais, marcado para a nuvem, sendo alocado num único processo e substituído, no processo original, por nós de invocação síncrona. Fig. 7. Conjunto de atividades sequenciais movido como um bloco. Embora semanticamente diferentes, as construções paralelas e condicionais são generalizadas como compostas, pois possuem a mesma estrutura sintática, podendo ser decompostas de várias formas. Neste trabalho os nós de início e fim devem ter a mesma alocação e as atividades de um ramo, com a mesma alocação desses nós, permanecem com os mesmos. Se uma determinada construção é toda marcada para a nuvem, a decomposição é semelhante a das atividades sequenciais. Na Fig. 8, os nós de início e fim são marcados para a nuvem, e um ramo permanece nas premissas, sendo que a atividade desse ramo é colocada num novo processo nas premissas, o qual é invocado pelo processo na nuvem. Fig. 8. 
Um ramo da construção composta permanece nas premissas.

129 Na Fig. 9, os nós de início e fim são marcados para a nuvem e os ramos permanecem nas premissas, sendo criado para cada ramo um novo processo nas premissas. Fig. 11. Ramos iterativos. Fig. 9. Os ramos da construção composta permanecem nas premissas. Na Fig. 10, os nós de início e fim permanecem nas premissas e os ramos são marcados para a nuvem, sendo criado para cada ramo um processo na nuvem. Fig. 10. Os nós início e fim permanecem nas premissas. Laços usam o loop, e se um laço é todo marcado para a nuvem, a decomposição é semelhante a das atividades sequenciais. Quando loop e comportamento são marcados com alocações distintas, este último é tratado como um processo separado. A Fig. 11 ilustra um laço onde o nó loop é marcado para a nuvem e a atividade iterativa fica nas premissas. Em função da complexidade da decomposição, os algoritmos para a sua implementação foram concebidos em quatro etapas consecutivas: identificação, partição, criação de nós de comunicação e criação de coreografia. Tais algoritmos, apresentados em [13], foram omitidos aqui devido às limitações de espaço. Como já mencionado, a abordagem de decomposição aqui descrita emprega uma lista de distribuição de atividades e dados, que determina o que deve ser alocado nas premissas e numa nuvem computacional. Embora a definição dessa lista esteja fora do escopo deste trabalho, parte-se do princípio que esta é elaborada manual ou automaticamente de acordo com os seguintes critérios: Atividades sigilosas ou que contenham dados sigilosos devem ser alocadas nas premissas; Atividades com baixo custo computacional e volume de dados devem ser alocadas nas premissas; e Atividades com alto custo computacional, com uma alta relação entre tempo de processamento e tempo de transferência de dados e que não se enquadrem no primeiro critério, devem ser alocadas na nuvem. C. Lifting e Grounding Devido à base XML de WS-BPEL, o lifting e o grounding convertem estruturas de árvores em grafos e viceversa, sendo que lifting possui um algoritmo para cada tipo de construção WS-BPEL e grounding um algoritmo para cada tipo de elemento da RI. Dessa forma, os principais mapeamentos foram: assign e throw para nós Atividade; flow para Comportamento paralelo; if para Comportamento condicional, onde construções com mais de uma condição são mapeadas para Comportamentos condicionais aninhados junto ao ramo false; while e repeatuntil para Comportamento repetitivo; receive e reply para Comunicação com nós rec e res; sequence para um conjunto de nós que representam construções aninhadas interconectados por arestas de controle; invoke assíncrono para o nó ireq e síncrono para os nós ireq e ires. Os algoritmos para o lifting e o grouding foram implementados em Java 7 usando a API para XML, baseada nas especificações do W3C, e o framework para testes JUnit. Por exemplo, as estruturas de árvore e grafo para o if, apresentadas na Fig. 12, tiveram seus lifting e grounding implementados a partir dos Algoritmos 1 e 2 respectivamente, cujas entrada e saída estão ilustradas na Fig. 12. Os algoritmos para o lifting e o grounding de outras estruturas da RI e de WS-BPEL foram omitidos devido a limitações de espaço.

130 enviado a FalseGenerator. Caso contrário, se todas as condições foram assumidas false e havendo atividade para execução, um nó else é adicionado à árvore. Algoritmo 2 Grounding para o grafo if function IfGenerator(g) t IfTree() t.children t.children {CondGenerator(g.cond)} t.children t.children {Generator(g.true)} t.children t.children {FalseGenerator(g.false)} end function Fig. 12. Estruturas de árvore e grafo para a construção if. IfParser caminha nos nós aninhados da árvore verificando a condição e construindo o ramo true do grafo if com as atividades relacionadas, sendo que as construções restantes são enviadas a FalseParser para que o ramo false seja construído. Caso a árvore tenha mais de uma condição, o ramo false conterá um grafo if para a segunda condição, esse grafo terá um ramo false que conterá outro grafo if para a terceira condição, e assim sucessivamente. Algoritmo 1 Lifting para a árvore da construção if function IfParser(t) cond {} if t of type IfTree then cond IfGraph() for all c t.children do if c type of Condition then cond.cond CondParser(c) else if c type of ElseTree c type of ElseIfTree then cond.false FalseParser(t.children) return cond else if c type of Tree then cond.true Parser(c) end if t.children t.children {c} end for end if return cond end function function FalseParser(s) if s = {} then return s end if falsebranch Graph() if s.first of type ElseIfTree then cond IfBranch() cond.true ElseParse(s.first) cond.false FalseParse(s-{s.first}) falsebranch.nodes {cond} else if s.first of type ElseTree then falsebranch.nodes {ElseParse(s.first)} else return FalseParse(s-{s.first}) end if return falsebranch end function IfGenarator caminha no ramo true do grafo verificando e adicionando à árvore if a condição junto com as atividades relacionadas, sendo que o ramo false é enviado à FalseGenerator que verifica se há um nó if aninhado. Caso exista uma construção elseif, com a condição e as atividades relacionadas, esta é adicionada à árvore e seu ramo false é function FalseGenerator(f) r {} while f {} do if # of f.nodes = 1 ^ f of type ElseIfTree then t ElseIfTree() t.children CondGenerator(f.cond) Generator(f.true) r r t else r r ElseGenerator(f) end if f f.false end while return r end function IV. ESTUDO DE CASO O estudo de caso para validar a decomposição foi baseado no PACS, um processo na Saúde que tem por objetivo persistir diagnósticos e tomografias mamárias e aplicar uma função para a detecção de possíveis nódulos nas mesmas. O PACS aceita um conjunto de imagens e seus respectivos pré-diagnósticos e identificadores, efetua a persistência de cada imagem e diagnóstico, executa a função para detecção automática de nódulos sobre as tomografias mamárias e emite um vetor contendo os identificadores das imagens com nódulos em potencial. No workflow do processo monolítico do PACS, ilustrado na Fig. 1, as construções marcadas para alocação na nuvem estão com um fundo destacado. A Fig. 13(a) ilustra a RI do PACS monolítico após o lifting, enquanto a Fig. 13(b) ilustra a RI após a execução da decomposição. A Fig. 14 ilustra o PACS decomposto após o grounding com a adição de dois observadores: um externo, cuja visão é a mesma do observador do PACS monolítico, ou seja, só enxerga as interações entre Cliente e PACS; um interno que, além dessas interações, enxerga também as interações entre os processos nas premissas e na nuvem. A Fig. 
15 ilustra, via diagramas UML de comunicação, exemplos de traços obtidos executando o processo monolítico (a) e o processo decomposto (b), sendo que as interações destacadas neste último são visíveis somente ao observador interno. Se ocultadas tais interações, ambos os traços passam a ser equivalentes em observação para o observador externo.

A. Análise de Desempenho

A fim de comparar o desempenho entre os processos monolítico e decomposto, estes foram implementados empregando-se as seguintes ferramentas: sistema operacional Debian 6; servidor de aplicação Apache Tomcat 6; Java 6; mecanismo de processos BPEL Apache ODE; e o framework para disponibilizar os Web Services Apache AXIS 2. O processo monolítico e a parte nas premissas do processo decomposto foram executados sobre uma infraestrutura com 1 GB de RAM, 20 GB de disco e 1 núcleo virtual com 2,6 GHz. A parte na nuvem do processo decomposto foi executada sobre um modelo IaaS, em uma nuvem privada gerenciada pelo software OpenStack, com as diferentes configurações descritas na Tabela I.

Fig. 13. RIs dos processos monolítico (a) e decomposto (b).

TABELA I - CONFIGURAÇÕES DAS INSTÂNCIAS NA NUVEM
Código | Memória | HD | Núcleos | Frequência
conf#1 | 2 GB | 20 GB | ... | ... GHz
conf#2 | 2 GB | 20 GB | ... | ... GHz
conf#3 | 4 GB | 20 GB | ... | ... GHz
conf#4 | 4 GB | 20 GB | ... | ... GHz
conf#5 | 6 GB | 20 GB | ... | ... GHz
conf#6 | 6 GB | 20 GB | ... | ... GHz

As execuções dos processos empregaram uma carga de trabalho composta por duas tuplas na forma <id, diagnostic, image>, onde id é um identificador de 4 bytes, diagnostic é um texto de 40 bytes e image é uma tomografia mamária de 11,1 MB. Foram coletadas 100 amostras dos tempos de resposta dos processos monolítico e decomposto para cada configuração i. De acordo com [16], o percentual P_i de ganho de desempenho do processo decomposto em relação ao monolítico para a i-ésima configuração pode ser definido como

$P_i = 1 - \frac{T_i^{decomposto}}{T^{monolítico}}$

onde: $T_i^{decomposto}$ é o tempo de resposta médio do processo decomposto na configuração i; e $T^{monolítico}$ é o tempo de resposta médio do processo monolítico.

Fig. 14. PACS decomposto com observadores externo e interno.
Fig. 15. Diagramas UML de comunicação dos processos monolítico (a) e decomposto (b).

O tempo de comunicação adicional foi desconsiderado, pois essa medida é relativa a cada recurso disponível e ao tamanho da carga de trabalho. A Fig. 16 ilustra o percentual de ganho de desempenho do processo decomposto em relação ao monolítico para cada uma das configurações, sendo que o percentual mínimo é superior a 10%. Para verificar a hipótese de que as médias dos tempos de resposta do processo decomposto foram significativamente menores que a do processo monolítico, foi empregada a estatística do teste t [17] a um nível de significância de 5%. Os testes resultaram em valores da probabilidade p-value na ordem de 2, , confirmando essa hipótese.
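Apenas como ilustração do cálculo acima (o esboço não faz parte do artigo original), o trecho a seguir calcula o percentual de ganho P_i a partir de amostras hipotéticas de tempos de resposta e aplica um teste t unilateral; todos os valores são fictícios.

```python
# Esboço ilustrativo (valores fictícios): ganho de desempenho P_i e teste t
# comparando tempos de resposta do processo monolítico e do decomposto.
from statistics import mean
from scipy import stats

monolitico = [12.1, 11.8, 12.4, 12.0, 11.9]        # tempos de resposta (s), hipotéticos
decomposto_conf6 = [10.4, 10.6, 10.2, 10.5, 10.3]  # idem, para a conf#6

p_i = 1 - mean(decomposto_conf6) / mean(monolitico)
print(f"Ganho de desempenho P_i = {p_i:.1%}")

# Teste t unilateral: H1 = a média do decomposto é menor que a do monolítico.
t_stat, p_value = stats.ttest_ind(decomposto_conf6, monolitico, alternative="less")
print(f"p-value = {p_value:.3g}")
```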

132 Fig. 16. Percentual de desempenho ganho do processo decomposto. B. Custos Relativos à Nuvem Para determinar os custos adicionais agregados a esses ganhos de desempenho, foi criado um modelo de regressão linear [18] com os dados obtidos via 45 observações de preços de três grandes provedores de IaaS, o qual emprega as seguintes variáveis independentes: quantidade de RAM em MB; quantidade de disco em GB; o número de núcleos virtuais; e a frequência de cada um desses núcleos. Dessa forma, o valor estimado y do preço em dólar/hora do recurso alocado na nuvem é definido como onde: y = +βx = 2, é o intercepto do modelo; β = [ , , , ] é o vetor de coeficientes de regressão; e X = [memory_in_gb, number_of_virtual_cores, ghz_by_core, hard_disk_in_gb] é o vetor de variáveis independentes. Esse modelo possui o coeficiente de determinação R 2 de 89,62% e erro aleatório médio de US$ 0,0827, o qual foi determinado com a técnica de validação cruzada leave-oneout [19]. A Fig. 17 ilustra a aderência dos valores estimados, via a Equação (2), aos valores observados. Já a Fig. 18 ilustra a relação entre o custo adicional de cada configuração, definida via a Equação 2, e a porcentagem de desempenho ganho através da mesma. Fig. 17. Aderência dos valores estimados aos valores observados. Fig. 18. Percentual de desempenho ganho e custo/hora do recurso na nuvem. Observa-se na Fig. 18 que o maior ganho de desempenho é obtido com a conf#6, a qual proporciona uma redução maior que 12% no tempo de resposta do processo de negócio, sendo acompanhada de um custo adicional de aproximadamente US$ 0,20/hora do recurso alocado na nuvem. V. TRABALHOS CORRELATOS Em [20] novas orquestrações são criadas para cada serviço usado por um processo de negócio, resultando numa comunicação direta entre os mesmos ao invés destes terem uma coordenação única. O processo WS-BPEL é convertido para um grafo de fluxo de controle, que gera um Program Dependency Graph (PDG), a partir do qual são realizadas as transformações, e os novos grafos gerados são reconvertidos para WS-BPEL. Como no algoritmo cada serviço no processo corresponde a um nó fixo para o qual uma partição é gerada, este trabalho não é adequado para a abordagem aqui proposta, pois esta visa a criação de processos nos quais múltiplos serviços possam ser usados. Os resultados descritos em [21] focam na descentralização de orquestrações de processos WS-BPEL, usando Dead Path Elimination (DPE) para garantir a conclusão da execução de processos descentralizados, mas DPE também torna a abordagem muito dependente da linguagem empregada na especificação do processo de negócio. A RI aqui apresentada é independente dessa linguagem e, consequentemente, também a decomposição, bastando o desenvolvimento das transformações de lifting e grounding apropriadas. Em [22] é reportado que a maioria das pesquisas, em descentralização de orquestrações, foca em demasia em linguagens de processos de negócio específicas. Não focar tanto nessas linguagens foi um dos principais desafios da pesquisa aqui apresentada, sendo que outro desafio foi não se preocupar somente com problemas de desempenho, mas também com medidas de segurança reguladas por governos ou organizações. Consequentemente, a decisão de executar uma atividade nas premissas ou na nuvem, neste trabalho, é já tomada na fase de projeto do ciclo de vida do BPM. VI. CONSIDERAÇÕES FINAIS E TRABALHOS FUTUROS Este trabalho é uma continuação do apresentado na dissertação de mestrado [13] e focou nas regras de

133 decomposição de processos de negócio, sendo que as seguintes contribuições adicionais merecem destaque: Para demonstrar a generalidade da abordagem proposta, ao invés da linguagem Amber usada em [13], foi utilizada WS-BPEL para a especificação de processos de negócio; Para que essa abordagem pudesse ser empregada, transformações de lifting e grounding tiveram que ser desenvolvidas para WS-BPEL; O fato de WS-BPEL ser executável, possibilitou a implementação dos processos criados e a comparação de seus comportamentos ao comportamento do processo original, validando assim a abordagem proposta; e Essas implementações possibilitaram também a realização de uma analise comparativa de desempenho entre os processos original e decomposto e uma avaliação dos custos inerentes à alocação de parte do processo decomposto na nuvem. Os resultados obtidos com esse trabalho indicam que a abordagem proposta é genérica, viável e eficaz tando do ponto de vista de desempenho quanto financeiro. Atualmente, a RI está sendo estendida para suportar mais padrões de workflow e para modelar comportamentos de exceção de WS-BPEL. Num futuro próximo, esta pesquisa continuará nas seguintes direções: complementar as regras de decomposição para suportar construções compostas, nas quais os nós de início e fim tenham diferentes localizações, e para possibilitar a extensão do número de localizações, já que múltiplas nuvens podem ser usadas e/ou múltiplos locais nas premissas podem existir nas organizações; e desenvolver um framework de cálculo, que leve em consideração os custos reais do processo original e dos processos criados, visando recomendar quais atividades e dados devem ser alocados em quais localizações. AGRADECIMENTOS Os autores agradecem ao suporte do CNPq através do INCT-MACC. REFERÊNCIAS [1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," EECS Department, University of California, Berkeley, [2] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, vol. 25, no. 6, pp , June [3] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," National Institute of Standards and Technology, vol. 53, no. 6, pp. 1-50, [4] S. Yu, C. Wang, K. Ren and W. Lou, "Achieving secure, scalable, and fine-grained data access control in cloud computing," in Proceedings of the 29th conference on Information communications, Piscataway, NJ: IEEE Press, 2010, pp [5] D. L. Banks, "The Health Insurance Portability and Accountability Act: Does It Live Up to the Promise?," Journal of Medical Systems, vol. 30, no. 1, pp , February [6] R. K. L. Ko, "A computer scientist's introductory guide to business process management (BPM)," Crossroads, vol. 15, no. 4, pp , June [7] Y.-B. Han, J.-Y. Sun, G.-L. Wang and H.-F. Li, "A Cloud-Based BPM Architecture with User-End Distribution of Non-Compute- Intensive Activities and Sensitive Data," Journal of Computer Science and Technology, vol. 25, no. 6, pp , [8] D. S. Linthicum, Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide, Boston, MA, USA: Pearson Education Inc., [9] OMG, "Business Process Model and Notation (BPMN) Version 2.0," January [Online]. Available: [Accessed 17 março 2013]. [10] A. Alves, A. Arkin, S. Askary, C. Barreto, B. Bloch, F. Curbera, M. Ford, Y. 
Goland, A. Guízar, N. Kartha, C. K. Liu, R. Khalaf, D. König, M. Marin, V. Mehta, S. Thatte, D. van der Rijn, P. Yendluri and A. Yiu, "Web Services Business Process Execution Language Version 2.0," OASIS Standard, 11 April [Online]. Available: [Accessed 1 Março 2013]. [11] P. M. d. Azevedo-Marques and S. C. Salomão, "PACS: Sistemas de Arquivamento e Distribuição de Imagens," Revista Brasileira de Física Médica, vol. 3, no. 1, pp , [12] E. Duipmans, L. F. Pires and L. da Silva Santos, "Towards a BPM Cloud Architecture with Data and Activity Distribution," Enterprise Distributed Object Computing Conference Workshops (EDOCW), 2012 IEEE 16th International, pp , [13] E. F. Duipmans, Business Process Management in the Cloud with Data and Activity Distribution, master's thesis, Enschede, The Netherlands: Faculty of EEMCS, University of Twente, [14] H. Eertink, W. Janssen, P. O. Luttighuis, W. Teeuw and C. Vissers, "A business process design language," World Congress on Formal Methods, vol. I, pp , [15] W. v. d. Aalst, A. t. Hofstede, B. Kiepuszewski and A. Barros., "Workflow Patterns," Distributed and Parallel Databases, vol. 3, no. 14, pp. 5-51, [16] R. Jain, The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling, Wiley, 1991, pp [17] R Core Team, "R: A Language and Environment for Statistical Computing," [Online]. Available: [Accessed 5 Abril 2013]. [18] J. D.Kloke and J. W.McKean, "Rfit: Rank-based Estimation for Linear Models," The R Journal, vol. 4, no. 2, pp , [19] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in Proceedings of the 14th international joint conference on Artificial intelligence, vol. 2, San Francisco, CA: Morgan Kaufmann Publishers Inc., 1995, pp [20] M. G. Nanda, S. Chandra and V. Sarkar, "Decentralizing execution of composite web services," SIGNPLAN Notices, vol. 39, no. 10, pp , October [21] O. Kopp, R. Khalaf and F. Leymann, "Deriving Explicit Data Links in WS-BPEL Processes," Services Computing, SCC '08, vol. 2, pp , July [22] W. Fdhila, U. Yildiz and C. Godart, "A Flexible Approach for Automatic Process Decentralization Using Dependency Tables," Web Services, ICWS 2009, pp , 2009.

On the Influence of Model Structure and Test Case Profile on the Prioritization of Test Cases in the Context of Model-based Testing

João Felipe S. Ouriques, Emanuela G. Cartaxo, Patrícia D. L. Machado
Software Practices Laboratory/UFCG, Campina Grande, PB, Brazil
{jfelipe,

Abstract Test case prioritization techniques aim at defining an ordering of test cases that favors the achievement of a goal during test execution, such as revealing faults as early as possible. A number of techniques have already been proposed and investigated in the literature, and experimental results have discussed whether a technique is more successful than others. However, in the context of model-based testing, only a few attempts have been made towards either proposing or experimenting with test case prioritization techniques. Moreover, a number of factors that may influence the results obtained still need to be investigated before more general conclusions can be reached. In this paper, we present empirical studies that focus on observing the effects of two factors: the structure of the model and the profile of the test case that fails. Results show that the profile of the test case that fails may have a definite influence on the performance of the techniques investigated.

Keywords Experimental Software Engineering, Software Testing, Model-Based Testing, Test Case Prioritization.

I. INTRODUCTION

The artifacts produced and the modifications applied during software development and evolution are validated by the execution of test cases. Often, the produced test suites are also subject to extensions and modifications, making management a difficult task. Moreover, their use can become increasingly less effective due to the difficulty of abstracting and obtaining information from test execution, for instance, if test cases that fail are either run too late or are difficult to locate due to the size and complexity of the suite. To cope with this problem, a number of techniques have been presented in the literature. These techniques can be classified as test case selection, test suite reduction, and test case prioritization. The general test case selection problem is concerned with selecting a subset of the test cases according to a specific (stop) criterion, whereas test suite reduction techniques focus on selecting a subset of the test cases, but the selected subset must provide the same coverage as the original suite [1]. While the goal of selection and reduction is to produce a more cost-effective test suite, studies presented in the literature have shown that these techniques may not work effectively, since some test cases are discarded and, consequently, some failures may not be revealed [2]. On the other hand, test case prioritization techniques have been investigated in order to address the problem of defining an execution order of the test cases according to a given testing goal, particularly detecting faults as early as possible [3]. These techniques can be applied either in a general development context or in a more specific context, such as regression testing, depending on the information that is considered by the techniques [4]. Moreover, both code-based and specification-based test suites can be handled, although most techniques presented in the literature have been defined and evaluated for code-based suites in the context of regression testing [5] [6].
Model-based Testing (MBT) is an approach to automate the design and generation of black-box test cases from specification models, together with all the oracle information needed [7]. MBT can be applied to any model, with different purposes, from which specification-based test cases are derived, and also at different testing levels. As usual, automatic generation produces a large number of test cases that may also have a considerable degree of redundancy [8] [9]. Techniques for ordering the test cases may be required to support test case selection, for instance, to address constrained costs of running and analysing the complete test suite and also to improve the rate of fault detection. However, to the best of our knowledge, there are only a few attempts presented in the literature to define test case prioritization techniques based on model information [10] [11]. Generally, the empirical studies are preliminary, making it difficult to assess the current limitations and applicability of the techniques in the MBT context. To provide useful information that may influence the development of prioritization techniques, empirical studies must focus on controlling and/or observing factors that may determine the success of a given technique. Given the goals of prioritization in the context of MBT, a number of factors can be determinant, such as the size and the coverage of the suite, the structure of the model (that may determine the size and structure of test cases), the amount and distribution of failures, and the degree of redundancy of test cases. In this paper, we investigate mainly the influence of two factors: the structure of the model and the profile of the test cases that fail. For this, we conduct 3 empirical studies, where real application models, as well as automatically generated ones, are considered. The focus is on general prioritization techniques that can be applied to MBT test suites. The purpose of the first study was to acquire preliminary observations by considering real application models. From this study, we concluded that a number of different factors may influence the performance of the techniques. Therefore, the purpose of the second and third studies, the main contribution of this paper, was to investigate specific factors by controlling them through the use of generated models. Results from

these studies show that, despite the fact that the structure of the models may or may not present certain constructions (for instance, the presence of loops 1), it is not possible to differentiate the performance of the techniques when focusing on the presence of the construction investigated. On the other hand, depending on the profile of the test case that fails (longest, shortest, essential, and so on), one technique may perform better than the other. In the studies presented in this paper, we focus on system-level models, which can be represented as activity diagrams and/or as labelled transition systems with inputs and outputs as transitions. Models are generated according to the strategy presented by Oliveira Neto et al. [12]. Test cases are sequences of transitions extracted from a model by a depth-search algorithm, as presented by Cartaxo et al. [9] and Sapna and Mohanty [11]. Prioritization techniques receive as input a test suite and produce as output an ordering for the test cases. The paper is structured as follows. Section II presents fundamental concepts along with a quick definition of the prioritization techniques considered in this paper. Section III discusses related works. Section IV presents a preliminary study where techniques are investigated in the context of two real applications, varying the amount of faults. Sections V and VI present the main empirical studies conducted: the former reports a study with automatically generated models where the presence of certain structural constructions is controlled, whereas the latter depicts a study with automatically generated models that are investigated for different profiles of the test case that fails. Section VII presents concluding remarks about the results obtained and pointers for further research. Details on the input models and data collected in the studies can be found at the project site 2. Empirical studies have been defined according to the general framework proposed by Wohlin [13], and the R tool 3 has been used to support data analysis.

1 A number of loops distributed in a model may lead to huge test suites with a certain degree of redundancy between the test cases, even if the loops are traversed only once for each test case.

II. BACKGROUND

This section presents the test case prioritization concept (Subsection II-A) and the techniques considered in this paper (Subsection II-B).

A. Test Case Prioritization

Test case prioritization is a technique that orders test cases in an attempt to maximize an objective function. This problem was defined by Elbaum et al. as follows [14]:

Given: $TS$, a test suite; $PTS$, the set of permutations of $TS$; and $f$, a function that maps $PTS$ to the real numbers ($f : PTS \rightarrow \mathbb{R}$).

Problem: Find $TS' \in PTS$ such that $(\forall TS'')\,(TS'' \in PTS)\,(TS'' \neq TS')\,[f(TS') \geq f(TS'')]$.

The objective function is defined according to the goal of the test case prioritization. For instance, the manager may need to quickly increase the rate of fault detection or the coverage of the source code/model. Then, a set of permutations $PTS$ is obtained and the $TS'$ that has the highest value of $f$ is chosen. Note that the key point for test case prioritization is the goal, and the success of the prioritization is measured by this goal. However, it is necessary to have some data (according to the defined goal) to calculate the function for each permutation. Then, for each test case, a priority is assigned and test cases with the highest priority are scheduled to execute first.
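As a toy illustration of the problem statement above (not taken from the paper), the sketch below enumerates the permutations of a very small suite and returns the ordering that maximizes an example objective function; here f rewards covering new model branches early, purely as a stand-in for a real goal such as fault detection. The suite and its coverage data are hypothetical.

```python
# Illustrative sketch of the prioritization problem: pick the permutation of a
# tiny test suite that maximizes an example objective function f.
from itertools import permutations

# Each test case maps to the set of model branches it covers (hypothetical data).
coverage = {"tc1": {"b1", "b2"}, "tc2": {"b2", "b3", "b4"}, "tc3": {"b5"}}

def f(ordering):
    """Example objective: reward orderings that cover new branches early."""
    seen, score = set(), 0.0
    for position, tc in enumerate(ordering, start=1):
        new_branches = coverage[tc] - seen
        score += len(new_branches) / position   # earlier new coverage weighs more
        seen |= coverage[tc]
    return score

best = max(permutations(coverage), key=f)
print(best, f(best))
```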
When the goal is to increase fault detection, the Average Percentage of Fault Detection (APFD) metric has been largely used in the literature. The higher the APFD value, the faster and better the fault detection rate is [14]. Test case prioritization can be applied in code-based and specification-based contexts, but it has been applied more often in the code-based context and is often related to regression testing. Accordingly, Rothermel et al. [4] proposed the following classification: General test case prioritization - test case prioritization is applied at any time in the software development process, even in the initial testing activities; Regression testing prioritization - test case prioritization is performed after a set of changes was made. Therefore, test case prioritization can use information gathered in previous runs of existing test cases to help prioritize the test cases for subsequent runs.
B. Techniques
This subsection presents the general test case prioritization techniques that are considered in this paper.
Optimal. This technique is largely used in experiments as an upper bound on the effectiveness of the other techniques; it presents the best result that can be obtained. To obtain this result, it is necessary to have, for example, the faults (if the goal is to increase fault detection) as input, which are not available in practice (so the technique is not feasible in practice). For this reason, we can only use applications with known faults. This lets us determine the ordering of test cases that maximizes a test suite's rate of fault detection.
Random. This technique is largely used in experiments as a lower bound on the effectiveness of the other techniques [6], based on a random choice strategy.
Adaptive Random Testing (ART). This technique selects test cases spread as far apart as possible, based on a distance function [15]. To apply this technique, two sets of test cases are required: the executed set (the set of distinct test cases that have been executed without revealing any failure) and the candidate set (the set of test cases that are randomly selected without replacement). Initially, the executed set is empty and the first test case is randomly chosen from the input domain. Then, at each step, the element of the candidate set that is farthest away from all executed test cases is selected as the next one, and the executed set is updated with the selected element. There are several ways to implement the concept of farthest away. In this paper, we will consider:
Jaccard distance: The use of this function in the prioritization context was proposed by Jiang et al. [6].

It calculates the distance between two sets and is defined as 1 minus the size of the intersection divided by the size of the union of the sample sets. In our context, we consider a test case as an ordered set of edges (that represent transitions). Considering $p$ and $c$ as test cases and $B(p)$ and $B(c)$ as the sets of branches covered by the test cases $p$ and $c$, respectively, the distance between them can be defined as follows: $J(p, c) = 1 - \frac{|B(p) \cap B(c)|}{|B(p) \cup B(c)|}$
Manhattan distance: This distance, proposed by Zhou [16], is calculated by using two arrays. Each array has its size equal to the number of branches in the model. Since this function is used to evaluate the distance between two test cases, each test case is associated with one array. For each position of the array, 1 is assigned if the test case covers the branch and 0 otherwise.
Fixed Weights. This technique was proposed by Sapna and Mohanty [11] and is a prioritization technique based on UML activity diagrams. The structures of the activity diagram are used to prioritize the test cases. First of all, the activity diagram is converted into a tree structure. Then, weights are assigned according to the structure of the activity diagram (3 for fork-join nodes, 2 for branch-merge nodes, 1 for action/activity nodes). Finally, the weight of each path is calculated (the sum of the weights assigned to nodes and edges) and the test cases are prioritized according to the weight sums obtained.
STOOP. This technique was proposed by Kundu et al. [17]. The inputs are sequence diagrams. These diagrams are converted into a graph representation called a sequence graph (SG). After this, the SGs are merged. From the merged sequence graph, the test cases are generated. Lastly, the set of test cases is prioritized: the test cases are sorted in descending order taking into account the average weighted path length (AWPL) metric, defined as follows: $AWPL(p_k) = \frac{\sum_{i=1}^{m} eWeight(e_i)}{m}$, where $p_k = e_1; e_2; \ldots; e_m$ is a test case and $eWeight(e_i)$ is the number of test cases that contain the edge $e_i$.
III. RELATED WORK
Several test case prioritization techniques have been proposed and investigated in the literature. Most of them focus on code-based test suites and the regression testing context [18], [19]. The experimental studies presented have discussed whether a technique is more effective than others, comparing them mainly by the APFD metric. So far, there is no experiment that has presented general results. This evidences the need for further investigation and empirical studies that can contribute to advances in the state of the art. Regarding code-based prioritization, Zhou et al. [20] compared the fault-detection capabilities of Jaccard-distance-based ART and Manhattan-distance-based ART. Branch coverage information was used for test case prioritization, and the results showed that the Manhattan distance is more effective than the Jaccard distance in the context considered [20]. Also, Jeffrey and Gupta [21] proposed an algorithm that prioritizes test cases based on coverage of statements in relevant slices and discuss insights from an experimental study that also considers total coverage. Moreover, Do et al. [22] presented a series of controlled experiments evaluating the effects of time constraints and faultiness levels on the costs and benefits of test case prioritization techniques. The results showed that time constraints can significantly influence both the cost and effectiveness.
Moreover, when there are time constraints, the effects of increased faultiness are stronger. Furthermore, Elbaum et al. [5] compared the performance of 5 prioritization techniques in terms of effectiveness, and showed how the results of this comparison can be used to select a technique (regression testing) [18]. They applied the prioritization techniques to 8 programs. Characteristics of each program (such as: number of versions, KLOC, number and size of the test suites, and average number of faults) were taken into account. By considering the use of models in the regression testing context, Korel et al. [10], [19], [23] presented two model-based test prioritization methods: selective test prioritization and model dependence-based test prioritization. Both techniques focus on modifications made to the system and models. The inputs are the original EFSM system model and the modified EFSM. Models are run to perform the prioritization. On the other hand, our focus is on general prioritization techniques were modifications are not considered. Generally, in the MBT context, we can find proposals to apply general test case prioritization from UML diagrams, such as: i) the technique proposed by Kundu [17] et al. where sequence diagrams are used as input; and ii) the technique proposed by Sapna and Mohanty [11] where activity diagrams are used as input. Both techniques are investigated in this paper. In summary, the original contribution of this paper is to present empirical studies in the context of MBT that consider different techniques and factors that may influence on their performance such as the structure of the model and the profile of the test case that fails. IV. FIRST EMPIRICAL STUDY The main goal of this study is to analyze general prioritization techniques for the purpose of comparing their performances, observing the impact of the number of test cases that fail, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of MBT. We worked with the following research hypothesis: The general test case prioritization techniques present different abilities of revealing failures, considering different amount of failing test cases in the test suite. In the next sections, we present the study design and analysis on data collected. A. Planning We conducted this experiment in a research laboratory a controlled environment. This characteristic leads to an offline study. Moreover, all the techniques involved in the study only require the test cases to prioritize and the mapping from the branches of the system model to the test cases that satisfy each branch. Thus, no human intervention is required, eliminating the expertise influence.
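Since the techniques operate only on test cases represented by the branches they cover, the following sketch shows how the distance functions of Section II-B, and the ART "farthest away" selection step, can be evaluated on such a representation; the suite, branch names, and helper functions are illustrative assumptions, not the tooling used in the experiment.

```python
def jaccard_distance(p, c):
    """J(p, c) = 1 - |B(p) & B(c)| / |B(p) | B(c)|, with B(x) the set of branches covered by x."""
    bp, bc = set(p), set(c)
    return 1 - len(bp & bc) / len(bp | bc)

def manhattan_distance(p, c, all_branches):
    """Sum of absolute differences over the 0/1 branch-coverage arrays of the two test cases."""
    return sum(abs(int(b in p) - int(b in c)) for b in all_branches)

def farthest_candidate(candidates, executed, distance):
    """ART step: pick the candidate whose minimum distance to the executed set is the largest."""
    return max(candidates, key=lambda c: min(distance(c, e) for e in executed))

# Illustrative suite: test case -> sequence of covered branches of the model.
suite = {
    "tc1": ["b1", "b2", "b5"],
    "tc2": ["b1", "b3"],
    "tc3": ["b1", "b2", "b4"],
}
branches = sorted({b for bs in suite.values() for b in bs})

print(jaccard_distance(suite["tc1"], suite["tc2"]))              # 0.75
print(manhattan_distance(suite["tc1"], suite["tc2"], branches))  # 3
print(farthest_candidate(["tc2", "tc3"], ["tc1"],
                         lambda a, b: jaccard_distance(suite[a], suite[b])))  # 'tc2'
```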

As objects, from which the system models were derived, we considered real systems. Despite the fact that the applications are real ones, they do not compose a representative sample of the whole set of applications; thereby, this experiment deals with a specific context. In order to analyze the performance of the techniques, observing the influence of the number of test cases that fail, we defined the following variables:
Independent variables and factors:
General prioritization techniques: the techniques defined in Section II. We will consider the following short names for the sake of simplicity: optimal, random, ARTjac (Adaptive Random Testing with Jaccard distance), ARTman (Adaptive Random Testing with Manhattan distance), fixedweights, stoop;
Number of test cases that fail: low (lower than 5% of the total), medium (between 5% and 15% of the total), high (higher than 15% of the total);
Dependent variable:
Average Percentage of Fault Detection (APFD).
In this study, we used two system models from two real-world applications: i) the Labelled Transition System-Based Tool (LTS-BT) [24], an MBT activities support tool developed in the context of our research group; and ii) PDF Split and Merge (PDFsam), a tool for PDF file manipulation. They were modelled as UML activity diagrams, using the provided use case documents and the applications themselves. From each diagram, a graph model was obtained, from which test cases were generated by using the depth-search-based algorithm proposed by Sapna and Mohanty [11], where each loop is traversed at most twice. Table I shows some structural properties of the models and of the test cases that were generated from them to be used as input to the techniques. It is important to remark that the test cases for all techniques were obtained from the same model using a single algorithm. Also, even though the STOOP technique was originally proposed to be applied to sequence diagrams, the technique itself works on an internal model that combines the diagrams. Therefore, it is reasonable to apply STOOP in the context of this experiment.
TABLE I. STRUCTURAL PROPERTIES OF THE MODELS IN THE EXPERIMENT.
Property LTS-BT PDFSam
Branching Nodes
Loops 0 5
Join Nodes 7 6
Test Cases
Shortest Test Case
Longest Test Case
Defects 4 5
TC reveal failures
The number of test cases that fail variable was defined considering real and known defects in the models and was allocated as shown in Table II.
TABLE II. DEFINITION OF THE TEST CASES THAT FAIL VARIABLE
Level Failures in LTS-BT Failures in PDFSam
low 2 test cases (3.77%) 4 test cases (4.59%)
medium 4 test cases (7.54%) 7 test cases (8.04%)
high 8 test cases (15.09%) 16 test cases (18.39%)
The relationship between a defect (associated with a specific edge in the model) and a failure (a test case that fails) is that when a test case exercises the edge, it reveals the failure. For each level, we considered a different set of defects of each model, and in the high level two defects originate the failures. Moreover, these test cases do not reveal the two defects at the same time for the two models. By using the defined variables and detailing the informal hypothesis, we postulated eight pairs of statistical hypotheses (null and alternative): three pairs evaluating the techniques at each level of the number of test cases that fail (e.g. $H_0: APFD_{(low,i)} = APFD_{(low,j)}$ and $H_1: APFD_{(low,i)} \neq APFD_{(low,j)}$, for techniques $i$ and $j$, with $i \neq j$) and five pairs evaluating the levels for each technique (e.g. $H_0: APFD_{(random,k)} = APFD_{(random,l)}$ and $H_1: APFD_{(random,k)} \neq APFD_{(random,l)}$, for levels $k$ and $l$, with $k \neq l$), excluding the optimal technique. For lack of space, the hypotheses pairs are not written here. Based on the elements already detailed, the experimental design for this study is one-factor-at-a-time [25]. The data analysis for the hypotheses pairs is based on two-way ANOVA [26] [27], after checking the assumptions of normality of the residuals and equality of variances. If any assumption is not satisfied, a nonparametric analysis is performed. We calculated the number of replications based on a pilot sample, using the sample size formula proposed by Jain [27], $n = \left(\frac{100 \, z \, s}{r \, \bar{x}}\right)^2$, where $z$ is the normal quantile for the chosen significance, $s$ and $\bar{x}$ are the standard deviation and the mean of the pilot sample, and $r$ is the desired precision. We obtained 815 as the result, for a precision ($r$) of 2% of the sample mean and a significance ($\alpha$) of 5%. The following steps were executed to perform the experiment: 1) Instantiate lists for data collection for each replication needed; 2) Instantiate the failure models to be considered; 3) Generate test cases; 4) Map branches to test cases; 5) Execute each technique for each object considering the replications needed; 6) Collect data and compute the dependent variable; 7) Record and analyse results. All techniques were automatically executed.
B. Data Analysis
When analysing the collected data, we must verify the ANOVA assumptions. Figure 1 shows that the residuals are not normally distributed, since the black solid line should lie near the straight continuous line of the normal distribution. Thus, we proceeded with a nonparametric analysis. A confidence interval analysis of the 95% confidence intervals of the pseudomedians (a nonparametric estimator for the median of a population [28]) of the collected APFD values, shown in Table III, may give a first insight into the rejection of some null hypotheses. The set of hypotheses defined for this experiment compares the techniques under two points of view: i) the whole set of

techniques at each single level, and ii) each technique isolated in the different levels.
Fig. 1. QQ-plot of the residuals against the normal distribution.
TABLE III. CONFIDENCE INTERVALS OF THE PSEUDOMEDIANS
Low Medium High
optimal
random
ARTJac
ARTMan
fixedweights
stoop
For the first set of hypotheses, when considering the levels of the number of test cases that fail separately (the set of two columns for each level), some confidence intervals do not overlap; therefore, the corresponding null hypotheses of equality must be rejected. However, in the three levels there is an overlap between random and ARTMan, and the p-values of the Mann-Whitney tests between the two techniques for the low, medium and high levels are greater than the significance of 5%; thus, the performance of these techniques is statistically similar at this significance. For the second set of hypotheses, by analyzing each technique separately (lines of Table III), all the null hypotheses of equality must be rejected, since, for every technique, the confidence intervals at the different levels do not overlap. This means that the performance of the techniques can vary when more or fewer test cases fail. As general observations, ARTJac presented the best performance for the three levels. Moreover, the techniques presented slight variations when considering the three levels (by increasing or decreasing), except for fixedweights and stoop, which increase more than the other techniques. These techniques, which are mostly based on structural elements of the test cases, may be more affected by the number of test cases that fail than the random-based ones. Furthermore, by increasing the level of the number of test cases that fail, different evolution patterns in the techniques' performance arise; e.g., stoop increases its performance with the growth of the level, while fixedweights decreases its performance when the level goes from low to medium and increases it when the level goes from medium to high. These different patterns are evidence of the influence of other factors on the researched techniques, which motivated the execution of the experiments presented in Sections V and VI.
C. Threats to Validity
As a controlled experiment with statistical analysis, measures were rigorously taken to address conclusion validity regarding data treatment and assumptions, the number of replications, and the tests needed. For the internal validity of this experiment, it is often difficult to represent a defect at a high abstraction level, since a code defect may refer to detailed contents. Therefore, an abstract defect may correspond to one or more defects at code level, and so on. To mitigate this threat, we considered test cases that fail as the measure instead of counting defects (even though we had data on the real defects). This decision suits our experiment perfectly, since the APFD metric focuses on failures rather than defects. The construct validity regarding the set of techniques and the evaluation metric chosen to compose the study was supported by a systematic review [29] that revealed suitable techniques and evaluation metrics, properly representing the research context. The low number of system models used in this experiment threatens its external validity, since two models do not represent the whole universe of applications. However, as a preliminary study, we aimed at a specific context observation only.
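APFD is the dependent variable in all three studies; as a concrete illustration, the sketch below computes it following the standard definition by Elbaum et al. [14] (the test suite and fault data are invented for the example and are not taken from the studies).

```python
def apfd(ordering, faults_revealed_by):
    """APFD = 1 - (TF1 + ... + TFm) / (n * m) + 1 / (2n), where TFi is the
    1-based position in the ordering of the first test case revealing fault i."""
    n = len(ordering)
    faults = {f for fs in faults_revealed_by.values() for f in fs}
    m = len(faults)
    first_position = {}
    for pos, tc in enumerate(ordering, start=1):
        for fault in faults_revealed_by.get(tc, ()):
            first_position.setdefault(fault, pos)
    return 1 - sum(first_position[f] for f in faults) / (n * m) + 1 / (2 * n)

# Illustrative data: 4 test cases, 2 faults.
reveals = {"tc2": {"f1"}, "tc4": {"f1", "f2"}}
print(apfd(["tc1", "tc2", "tc3", "tc4"], reveals))  # 0.375
print(apfd(["tc4", "tc2", "tc1", "tc3"], reveals))  # 0.875
```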
V. SECOND EMPIRICAL STUDY
Motivated by the study reported in Section IV, this section contains a report of an empirical study that aims at analyzing general prioritization techniques for the purpose of observing the influence of the model structure on the studied techniques, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of Model-Based Testing. Complementing the definition, we postulated the following research hypothesis: The general test case prioritization techniques present different abilities to reveal failures, considering models with different structures.
A. Planning
We also conducted this experiment in a research environment, and the techniques involved in the study need the same artifacts as in the first experiment: the test suite generated through an MBT test case generation algorithm. Thus, the execution of the techniques does not need human intervention, which eliminates the experience level factor from the experiment. The models that originate the test suites processed in the experiment were generated randomly using a parametrized graph generator (Section V-B). Thus, the models do not represent real application models. For this study, we defined the following variables:
Independent variables:
General prioritization techniques (factor): ARTJac, stoop;

Number of branch constructions to be generated in the input models (factor): 10, 30, 80;
Number of join constructions to be generated in the input models (factor): 10, 20, 50;
Number of loop constructions to be generated in the input models (factor): 1, 3, 9;
Maximum depth of the generated models (fixed value equal to 25);
Rate of test cases that fail (fixed value equal to 10%);
Dependent variable:
Average Percentage of Fault Detection (APFD).
For the sake of simplicity of the experimental design required when considering all techniques and variables, in this study we decided to focus only on two techniques among the ones considered in Section IV, ARTJac and stoop, particularly the ones with the best and worst performance, respectively. They can be seen as representatives of the random-based and structure-based techniques considered, respectively. Moreover, we defined the values for the variables that shape the models based on the structural properties of the models considered in the motivational experiment reported in Section IV. In this experiment, we do not want to observe the effect of the failure location on the techniques, thus we selected failures randomly. To mitigate the effect of the number of test cases that fail, we assigned a constant rate of 10% of the test cases to reveal failures. In order to evaluate the model structure, we defined three different experimental designs; according to Wu and Hamada [25], each one is a one-factor-at-a-time design. The designs are described in the next subsections.
1) Branches Evaluation: In order to evaluate the impact of the number of branches on the capacity of revealing failures, we defined three levels for this factor and fixed the number of joins and loops at zero. For each considered level of the number of branches, with the other parameters fixed, 31 models were generated by the parameterized generator. For each model, the techniques were executed with 31 different random failure attributions, and we gathered the APFD value of each execution. We postulated five pairs of statistical hypotheses: three analyzing each level of the branches, with the null hypothesis of equality between the techniques and the alternative indicating that they have a different performance (e.g. $H_0: APFD_{(ARTJac,10\,branches)} = APFD_{(Stoop,10\,branches)}$ and $H_1: APFD_{(ARTJac,10\,branches)} \neq APFD_{(Stoop,10\,branches)}$), and two related to each technique in isolation, comparing the performance in the three levels, with the null hypothesis of equality and the alternative indicating some difference (e.g. $H_0: APFD_{(ARTJac,10\,branches)} = APFD_{(ARTJac,30\,branches)} = APFD_{(ARTJac,80\,branches)}$ and $H_1$: not all levels are equal).
2) Joins Evaluation: For the evaluation of the number of joins, we proposed a similar design, but varying just the number of joins and fixing the other variables. We fixed the number of branches at 50 and the number of loops at zero, and all the details presented in the branch evaluation apply to this design. The reason for allowing 50 branches is that branches may be part of a join; therefore, we cannot consider 0 branches. The corresponding set of hypotheses follows the same structure as in the branch evaluation, but considering the number of joins.
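A sketch of the replication scheme shared by these experimental designs is shown below, using the branches design as the example; generate_model, generate_tests, prioritize, and apfd are hypothetical placeholders for the tooling described in the text, not the authors' implementation.

```python
import random

LEVELS = [10, 30, 80]          # levels of the factor under study (here, branches)
MODELS_PER_LEVEL = 31
ASSIGNMENTS_PER_MODEL = 31
FAIL_RATE = 0.10               # fixed rate of test cases that fail

def run_design(generate_model, generate_tests, prioritize, apfd):
    """Collect one APFD sample per (level, model, failure assignment) combination."""
    samples = []
    for level in LEVELS:
        for _ in range(MODELS_PER_LEVEL):
            model = generate_model(branches=level, joins=0, loops=0, depth=25)
            suite = generate_tests(model)
            for _ in range(ASSIGNMENTS_PER_MODEL):
                k = max(1, round(FAIL_RATE * len(suite)))
                failing = set(random.sample(sorted(suite), k))
                ordering = prioritize(suite)
                samples.append((level, apfd(ordering, failing)))
    return samples
```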
3) Loops Evaluation: For the evaluation of the number of loops, once again we proposed a similar design, but varying only the number of loops and fixing the number of branches at 30 and the number of joins at 15 (again, these structures are commonly part of a loop, so it is not reasonable to consider 0 branches and 0 joins). We structured a similar set of hypotheses as in the branch evaluation, but considering the three levels of the number of loops variable. The following steps were executed to perform the experiment: 1) Generate test models as described in Section V-B; 2) Instantiate lists for data collection for each replication needed; 3) Instantiate the failure models to be considered; 4) Generate test cases; 5) Map branches to test cases; 6) Execute each technique for each object considering the replications needed; 7) Collect data and compute the dependent variable; 8) Record and analyse results. All techniques were automatically executed, and the test cases were generated by using the same algorithm as in Section IV.
B. Model Generation
The objects considered for this study are the randomly generated models. The generator receives five parameters: 1) Number of branch constructions; 2) Number of join constructions; 3) Number of loop constructions; 4) The maximum depth of the graphs; 5) The number of graphs to generate. The graph is created by executing operations that include the constructions in sequences of transitions (edges). The first step is to create an initial sequence using the fourth parameter; e.g., for a maximum depth equal to five, a sequence with five edges is created, as in Figure 2.
Fig. 2. Initial configuration of a graph with maximum depth equal to 5.
Over this initial configuration, the generator executes the operations. To increase the probability of generating structurally different graphs, the generator executes the operations randomly, but respecting the numbers passed as parameters. The generator performs the operations of adding branching, joining, and looping in the following way:
Branching: from a random non-leaf node x, create two new nodes y and z and create two new edges (x, y) and (x, z) (Figure 3a);
Joining: from two distinct random non-leaf nodes x and y, create a new node z and create two new edges (x, z) and (y, z) (Figure 3b);

Looping: from two distinct random non-leaf nodes x and y, with depth(x) > depth(y), create a new edge (x, y) (Figure 3c).
Fig. 3. Examples of operations performed by the parametrized graph generator: (a) branching the node 4 to nodes 7 and 8; (b) joining the nodes 2 and 5 to node 7; (c) looping the node 4 to 2.
The generator executes the same process as many times as indicated by the number of graphs to generate parameter.
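The three operations can be sketched as follows; the graph representation (a dictionary from integer nodes to successor lists) and the helper functions are illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import deque

def initial_graph(depth):
    """Initial configuration: a single chain 0 -> 1 -> ... -> depth (as in Figure 2)."""
    graph = {i: [i + 1] for i in range(depth)}
    graph[depth] = []
    return graph

def non_leaf_nodes(graph):
    return [n for n, successors in graph.items() if successors]

def node_depths(graph, root=0):
    """Depth of every reachable node from the root, via breadth-first search."""
    depth, queue = {root: 0}, deque([root])
    while queue:
        n = queue.popleft()
        for m in graph[n]:
            if m not in depth:
                depth[m] = depth[n] + 1
                queue.append(m)
    return depth

def branching(graph):
    """From a random non-leaf node x, create new nodes y and z and edges (x, y), (x, z)."""
    x = random.choice(non_leaf_nodes(graph))
    y, z = max(graph) + 1, max(graph) + 2
    graph[y], graph[z] = [], []
    graph[x] += [y, z]

def joining(graph):
    """From two distinct non-leaf nodes x and y, create a new node z and edges (x, z), (y, z)."""
    x, y = random.sample(non_leaf_nodes(graph), 2)
    z = max(graph) + 1
    graph[z] = []
    graph[x].append(z)
    graph[y].append(z)

def looping(graph):
    """Create a back edge (x, y) between non-leaf nodes with depth(x) > depth(y)."""
    depth = node_depths(graph)
    pairs = [(x, y) for x in non_leaf_nodes(graph)
                    for y in non_leaf_nodes(graph) if depth[x] > depth[y]]
    x, y = random.choice(pairs)
    graph[x].append(y)

# Apply the requested numbers of operations in random order (here 3 branches, 2 joins, 1 loop).
g = initial_graph(5)
operations = [branching] * 3 + [joining] * 2 + [looping]
random.shuffle(operations)
for op in operations:
    op(g)
print(g)
```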
C. Data Analysis
As we divided the whole experiment into three experimental designs, the data analysis respects this division. Basically, we followed the same chain of tests for the three designs. First, we tested the normality assumption over the samples using the Anderson-Darling test and the equality of variances through the F-test. Depending on the results of these tests, we chose the next test, which evaluates the equality of the samples: Mann-Whitney or the t-test. After evaluating the levels separately, we tested each technique separately across the three levels using ANOVA or the Kruskal-Wallis test. For each test, we considered a significance level of 5%. The objective in this work is to expose influences of the studied structural aspects of the models on the performance of the techniques; thus, if the p-value analysis in a hypothesis test suggests that the null hypothesis of equality may not be rejected, this is evidence that the variable considered alone does not affect the performance of the techniques. On the other hand, if the null hypothesis must be rejected, this represents evidence of some influence.
1) Branches Analysis: The first activity of the analysis is the normality test, and Table IV summarizes this step. The two samples from the low level had the null hypotheses of normality rejected.
TABLE IV. P-VALUES FOR THE ANDERSON-DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE FIRST EXPERIMENTAL DESIGN SAMPLES. NORMAL SAMPLES ARE IN BOLD FACE.
10 Branches 30 Branches 80 Branches
ART Jaccard
Stoop
Following the analysis, we performed three tests, as summarized in Table V. We chose the test according to the normality of the samples: for normal samples we performed the t-test and, for non-normal samples, the Mann-Whitney test.
TABLE V. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE FIRST EXPERIMENTAL DESIGN SAMPLES.
10 Branches 30 Branches 80 Branches
All the p-values in Table V are greater than the defined significance of 5%, so the null hypothesis of equality of the techniques cannot be rejected at the defined significance level; in other words, the two techniques presented similar performance at each level separately. The next step of the analysis is to evaluate each technique separately through the levels, and we proceeded with a nonparametric Kruskal-Wallis test for the corresponding hypotheses. The p-values calculated for ARTJac and stoop are greater than the significance level of 5%, so we cannot reject the null hypothesis of equality between the levels for each technique; the performance is thus similar at this significance level.
2) Joins Analysis: Following the same approach as in the first experimental design, Table VI shows the p-values of the normality tests. The bold-face p-values indicate the samples that are normally distributed at the considered significance.
TABLE VI. P-VALUES FOR THE ANDERSON-DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE SECOND EXPERIMENTAL DESIGN SAMPLES. NORMAL SAMPLES ARE IN BOLD FACE.
10 Joins 20 Joins 50 Joins
ART Jaccard
Stoop
Based on these normality tests, we tested the equality of the performance of the techniques at each level and, according to Table VII, the techniques perform statistically in a similar way at all levels.
TABLE VII. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE SECOND EXPERIMENTAL DESIGN SAMPLES.
10 Joins 20 Joins 50 Joins
The next step is to assess each technique separately. We executed a Kruskal-Wallis test comparing the three samples for ARTJac and for stoop; comparing the resulting p-values with the considered significance level of 5%, neither null hypothesis of equality was rejected, which means the techniques behave similarly through the levels.
3) Loops Analysis: Following the same line of argumentation, the first step is to evaluate the normality of the measured data, and Table VIII summarizes these tests. According to the results of the normality tests, we tested the equality of the techniques at each level of this experimental design. As we can see in Table IX, the null hypotheses for 1 loop, 3 loops and 9 loops cannot be rejected because they have p-values greater than 5%; thus, the techniques present similar behaviour for all levels of the factor.
TABLE VIII. P-VALUES FOR THE ANDERSON-DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE THIRD EXPERIMENTAL DESIGN SAMPLES. NORMAL SAMPLES ARE IN BOLD FACE.
1 Loop 3 Loops 9 Loops
ART Jaccard
Stoop
TABLE IX. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE THIRD EXPERIMENTAL DESIGN SAMPLES.
1 Loop 3 Loops 9 Loops
Analyzing the two techniques separately through the levels, we performed the nonparametric Kruskal-Wallis test; the p-values obtained for ARTJac and stoop, compared with the significance level of 5%, indicate that the null hypotheses of the considered pairs cannot be rejected. In other words, the techniques perform statistically similarly through the different levels of the number of looping operations.
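A simplified sketch of this chain of tests is shown below using scipy; the F-test step for the equality of variances is omitted for brevity, and the APFD samples are random placeholders rather than data from the studies.

```python
import numpy as np
from scipy import stats

def is_normal(sample):
    """Anderson-Darling normality check; index 2 of critical_values is the 5% level."""
    result = stats.anderson(sample, dist="norm")
    return result.statistic < result.critical_values[2]

def compare_techniques_at_level(sample_a, sample_b):
    """Equality test between the two techniques at one level:
    t-test if both samples look normal, Mann-Whitney otherwise."""
    if is_normal(sample_a) and is_normal(sample_b):
        return stats.ttest_ind(sample_a, sample_b).pvalue
    return stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided").pvalue

def compare_technique_across_levels(*samples, normal=False):
    """One technique across the three levels: one-way ANOVA or Kruskal-Wallis."""
    test = stats.f_oneway if normal else stats.kruskal
    return test(*samples).pvalue

# Illustrative APFD samples (random numbers, 31 observations per level).
rng = np.random.default_rng(0)
low, mid, high = (rng.uniform(0.6, 1.0, 31) for _ in range(3))
print(compare_techniques_at_level(low, mid))
print(compare_technique_across_levels(low, mid, high))
```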

D. Threats to Validity
Regarding the validity of the experiment, we can point out some threats. Concerning internal validity, we defined different designs to evaluate the factors separately; therefore, it is not possible to analyze the interaction between the number of joins and branches, for example. We did this because some of the combinations of the three variables might be unfeasible, e.g., a model with many joins and without any branch. Moreover, we did not calculate the number of replications needed to achieve a defined precision because the execution would be infeasible (conclusion validity). The executed configuration took several days because some test suites were huge. To deal with this limitation, we limited the generation to 31 graphs for each experimental design and 31 failure attributions for each graph, keeping the balancing principle [13]; samples with size greater than or equal to 31 are large enough to test for normality with confidence [26], [27]. Furthermore, the application models were generated randomly to deal with the lack of application models but, at the same time, this reduces the capability of representing reality, threatening the external validity. To deal with this, we used structural properties, e.g. depth and number of branches, from existing models.
VI. THIRD EMPIRICAL STUDY
This section contains a report of an experiment that aims at analyzing general prioritization techniques for the purpose of observing the influence of the failure profile on the studied techniques, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of Model-Based Testing. Complementing the definition, we postulated the following research hypothesis: The general test case prioritization techniques present different abilities to reveal failures, considering that the test cases that fail have different profiles. We consider profiles as the characteristics of the test cases that reveal failures.
A. Planning
We performed the current experiment in the same environment as the previous ones, and the application models used in this experiment are the same ones used in Section V. Since we do not aim at observing variations of the model structure, we considered the 31 models that were generated with 30 branches, 15 joins, 1 loop and maximum depth 25.
For this experiment, we defined the following variables:
Independent variables:
General prioritization techniques (factor): ARTJac, stoop;
Failure profiles, i.e., characteristics of the test cases that fail (factor):
Long test cases, with many steps (longtc);
Short test cases, with few steps (shorttc);
Test cases that contain many branches (manybr);
Test cases that contain few branches (fewbr);
Test cases that contain many joins (manyjoin);
Test cases that contain few joins (fewjoin);
Essential test cases (ESSENTIAL), the ones that uniquely cover a given edge in the model;
Number of test cases that fail: fixed value equal to 1;
Dependent variable:
Average Percentage of Fault Detection (APFD).
A special step is the failure assignment according to the profile. As the first step, the algorithm sorts the test cases according to the profile. For instance, for the longtc profile, the test cases are sorted in decreasing order of length (number of steps). If there is more than one test case with the greatest length (i.e., with the same profile), one of them is chosen randomly. For example, if the maximum size of the test cases is 15, the algorithm randomly selects one of the test cases with size equal to 15. Considering the factors, this experiment is a one-factor-at-a-time design, and we may proceed with analyses between the techniques at each failure profile and between the profiles for each technique. In the execution of the experiment, each one of the 31 models was executed with 31 different random failure assignments for each profile, with just one failure at a time (a total of 961 executions for each technique). This number of replications keeps the design balanced and gives confidence for testing normality [27]. Based on these variables and on the design, we defined the corresponding pairs of statistical hypotheses: i) to analyse each profile, with the null hypothesis of equality between the techniques and the alternative indicating that they have a different performance (e.g. $H_0: APFD_{(ARTJac,longtc)} = APFD_{(stoop,longtc)}$ and $H_1: APFD_{(ARTJac,longtc)} \neq APFD_{(stoop,longtc)}$); and also ii) to analyse each technique, with the null hypothesis of equality between the profiles ($\forall f_1, f_2 \in \{longtc, shorttc, manybr, fewbr, manyjoin, fewjoin, ESSENTIAL\}$, $f_1 \neq f_2$: $H_0: APFD_{(ARTJac,f_1)} = APFD_{(ARTJac,f_2)}$ and $H_1: APFD_{(ARTJac,f_1)} \neq APFD_{(ARTJac,f_2)}$). If the tests reject null hypotheses, this

fact will be considered as evidence of the influence of the failure profile on the techniques. The experiment execution followed the same steps defined in Section V. However, as mentioned before, each technique was run considering one failure profile at a time.
B. Data Analysis
The boxplots in Figures 4 and 5 summarize the trends of the collected data. The notches in the boxplots are a graphical representation of the confidence interval calculated by the R software. When these notches overlap, this suggests the need for a deeper investigation of the statistical similarity of the samples.
Fig. 4. Boxplot with the samples from ARTJac.
Fig. 5. Boxplot with the samples from stoop.
To compare the performance of the two techniques at every failure profile, from a visual analysis of the boxplots of the samples in Figures 4 and 5 we can see that there are no overlaps between the techniques in any profile (the notches in the boxplots do not overlap); in other words, at 5% of significance, ARTJac and stoop perform statistically differently in every researched profile. Comparing each technique separately through the failure profiles, both of them present differences between the profiles, which is enough to also reject the null hypotheses of equality. Observing the profiles longtc and manybr in Figures 4 and 5, they incur similar performances for the two techniques, because a test case among the longest ones is frequently also among the ones with the largest number of branches. The same happens with the profiles shorttc and fewbr, by the same reasoning. There is also a relationship between the profiles fewjoin and ESSENTIAL, as we can see in Figures 4 and 5. The essential test cases are the ones that cover some requirement uniquely, in this case a branch covered only by that test case; by this definition, the test cases among the ones with the fewest joins are frequently essential. In summary, the rejection of the null hypotheses is strong evidence of the influence of the failure profiles on the performance of the general prioritization techniques. Furthermore, the data suggest that ARTJac may not have a good performance when the test case that fails is either long or has many branches. In this case, stoop has a slightly better performance. In the other cases, ARTJac has a better performance, similarly to the results obtained in the first experiment with real applications (Section IV).
C. Threats to Validity
Regarding conclusion validity, we did not calculate the number of replications needed. To deal with this threat to precision, we limited the random failure attributions for each profile and each graph to 31, keeping the balancing principle [13]; samples with size greater than or equal to 31 are large enough to test for normality with confidence [26], [27]. Construct validity is threatened by the definition of the failure profiles. We chose the profiles based on data and observations from previous studies, not necessarily their specific results. Thus, we defined them according to our experience, and there might be other profiles not investigated yet. This threat is reduced by the experiment's objective, which is to expose the influence of different profiles on the performance of the prioritization techniques, and not to enumerate all possible profiles.
VII. CONCLUDING REMARKS
This paper presents and discusses the results obtained from empirical studies on the use of test case prioritization techniques in the context of MBT.
It is widely accepted that a number of factors may influence the performance of the techniques, particularly because the techniques can be based on different aspects and strategies, including random choice or not. In this sense, the main contribution of this paper is to investigate the influence of two factors: the structure of the model and the profile of the test case that fails. The intuition behind this choice is that the structure of the model may determine the size of the generated test suites and the degree of redundancy among their test cases. Therefore, this factor may affect all of the techniques involved in the experiment, due either to the use of distance functions or to the fact that the techniques consider certain structures explicitly. On the other hand, depending on the selection strategy, the techniques may favor the selection of certain profiles of test cases over others. Therefore, whether the test cases that fail have a certain structural property may also determine the success of a technique. To the best of our knowledge, there are no similar studies presented in the literature.

143 In summary, in the first study, performed with real applications in a specific context, different growth patterns of APFD for the techniques can be considered as evidence of influence of more factors in the performance of the general prioritization techniques other than the number of test cases that fail. This result motivated the execution of the other studies. On one hand, the second study, aimed at investigating the influence of the number of occurrences of branches, joins and loops over the performance of the techniques, showed that there is no statistical difference on the performance of the techniques studied with significance of 5%. On the other hand, in the third study, based on the profile of the test case that fail, the fact that all of the null hypotheses were rejected may indicate a high influence of the failure profile on the performance of the general prioritization techniques. Moreover, from the perspective of the techniques, this study exposed weaknesses associated with these profiles. For instance, ARTJac presented low performance when long test cases (and/or with many branches) reveal failures and high when short test cases (and/or with few branches) reveal failures. On the other hand, stoop showed low performance with almost all profiles. From these results, testers may opt to use one technique or the other based on failure prediction and the profile of the test cases. As future work, we will perform a more complex factorial experiment, calculating the interaction between the factors analyzed separately in the experiments reported in this paper. Moreover, we plan an extension of the third experiment to consider other techniques and also investigate other profiles of test cases that may be of interest. From the analysis of the results obtained, new (possibly hybrid) technique may emerge. ACKNOWLEDGMENT This work was supported by CNPq grants / and / Also, this work was partially supported by the National Institute of Science and Technology for Software Engineering 6, funded by CNPq/Brasil, grant / First author was also supported by CNPq. REFERENCES [1] M. J. Harrold, R. Gupta, and M. L. Soffa, A methodology for controlling the size of a test suite, ACM Trans. Softw. Eng. Methodol., vol. 2, no. 3, pp , Jul [2] D. Jeffrey and R. Gupta, Improving fault detection capability by selectively retaining test cases during test suite reduction, Software Engineering, IEEE Transactions on, vol. 33, no. 2, pp , [3] G. Rothermel, R. Untch, C. Chu, and M. Harrold, Test case prioritization: an empirical study, in Software Maintenance, (ICSM 99) Proceedings. IEEE International Conference on, 1999, pp [4] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, Prioritizing test cases for regression testing, IEEE Transactions on Software Engineering, vol. 27, pp , [5] S. G. Elbaum, A. G. Malishevsky, and G. Rothermel, Test case prioritization: A family of empirical studies, IEEE Transactions in Software Engineering, February [6] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse, Adaptive random test case prioritization, in ASE, 2009, pp [7] M. Utting and B. Legeard, Practical Model-Based Testing: A Tools Approach, 1st ed. Morgan Kauffman, [8] E. G. Cartaxo, P. D. L. Machado, and F. G. O. Neto, Seleção automática de casos de teste baseada em funções de similaridade, in XXIII Simpósio Brasileiro de Engenharia de Software, 2008, pp [9] E. G. Cartaxo, P. D. L. Machado, and F. G. 
Oliveira, On the use of a similarity function for test case selection in the context of model-based testing, Software Testing, Verification and Reliability, vol. 21, no. 2, pp , [10] B. Korel, G. Koutsogiannakis, and L. Tahat, Application of system models in regression test suite prioritization, in IEEE International Conference on Software Maintenance, 2008, pp [11] S. P. G. and H. Mohanty, Prioritization of scenarios based on uml activity diagrams, in CICSyN, 2009, pp [12] F. G. O. Neto, R. Feldt, R. Torkar, and P. D. L. Machado, Searching for models to test software technology, 2013, proc. of First International Workshop on Combining Modelling and Search-Based Software Engineering, CMSBSE/ICSE [13] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in software engineering: an introduction. Norwell, MA, USA: Kluwer Academic Publishers, [14] S. Elbaum, A. G. Malishevsky, and G. Rothermel, Prioritizing test cases for regression testing, in In Proc. of the Int. Symposium on Software Testing and Analysis. ACM Press, 2000, pp [15] T. Y. Chen, H. Leung, and I. K. Mak, Adaptive random testing, in Advances in Computer Science - ASIAN 2004, ser. Lecture Notes in Computer Science, vol. 3321/2005. Springer, 2004, pp [16] Z. Q. Zhou, Using coverage information to guide test case selection in adaptive random testing, in IEEE 34th Annual COMPSACW, 2010, july 2010, pp [17] D. Kundu, M. Sarma, D. Samanta, and R. Mall, System testing for object-oriented systems with test case prioritization, Softw. Test. Verif. Reliab., vol. 19, no. 4, pp , Dec [18] S. Elbaum, G. Rothermel, S. K, and A. G. Malishevsky, Selecting a cost-effective test case prioritization technique, Software Quality Journal, vol. 12, p. 2004, [19] B. Korel, L. Tahat, and M. Harman, Test prioritization using system models, in Software Maintenance, ICSM 05. Proceedings of the 21st IEEE International Conference on, 2005, pp [20] Z. Q. Zhou, A. Sinaga, and W. Susilo, On the fault-detection capabilities of adaptive random test case prioritization: Case studies with large test suites, in HICSS, 2012, pp [21] D. Jeffrey, Test case prioritization using relevant slices, in In the Intl. Computer Software and Applications Conf, 2006, pp [22] H. Do, S. Mirarab, L. Tahvildari, and G. Rothermel, The effects of time constraints on test case prioritization: A series of controlled experiments, IEEE Transactions on Software Engineering, vol. 36, no. 5, pp , [23] B. Korel, G. Koutsogiannakis, and L. H. Tahat, Model-based test prioritization heuristic methods and their evaluation, in Proceedings of the 3rd international workshop on Advances in model-based testing, ser. A-MOST 07. New York, NY, USA: ACM, 2007, pp [Online]. Available: [24] E. G. Cartaxo, W. L. Andrade, F. G. O. Neto, and P. D. L. Machado, LTS-BT: a tool to generate and select functional test cases for embedded systems, in Proc. of the 2008 ACM Symposium on Applied Computing, vol. 2. ACM, 2008, pp [25] C. F. J. Wu and M. S. Hamada, Experiments: Planning, Analysis, and Optimization, 2nd ed. John Wiley and Sons, [26] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. John Wiley and Sons, [27] R. K. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, [28] E. Lehmann, Nonparametrics, ser. Holden-Day series in probability and statistics, H. D Abrera, Ed. San Francisco u.a.: Holden-Day u.a., [29] J. F. S. 
Ouriques, Análise comparativa entre técnicas de priorização geral de casos de teste no contexto do teste baseado em especificação, Master s thesis, UFCG, Janeiro

144 The Impact of Scrum on Customer Satisfaction: An Empirical Study Bruno Cartaxo 1, Allan Araújo 1,2, Antonio Sá Barreto 1 and Sérgio Soares 1 Informatics Center - CIn / Federal University of Pernambuco - UFPE 1 Recife Center for Advanced Studies and Systems - C.E.S.A.R 2 Recife, Pernambuco - Brazil {arsa,bfsc,acsbn,scbs}@cin.ufpe.br Abstract In the beginning of the last decade, agile methodologies emerged as a response to software development processes that were based on rigid approaches. In fact, the flexible characteristics of agile methods are expected to be suitable to the lessdefined and uncertain nature of software development. However, many studies in this area lack empirical evaluation in order to provide more confident evidences about which contexts the claims are true. This paper reports an empirical study performed to analyze the impact of Scrum adoption on customer satisfaction as an external success perspective for software development projects in a software intensive organization. The study uses data from real-life projects executed in a major software intensive organization located in a nation wide software ecosystem. The empirical method applied was a cross-sectional survey using a sample of 19 real-life software development projects involving 156 developers. The survey aimed to determine whether there is any impact on customer satisfaction caused by the Scrum adoption. However, considering that sample, our results indicate that it was not possible to establish any evidence that using Scrum may help to achieve customer satisfaction and, consequently, increase the success rates in software projects, in contrary to general claims made by Scrum advocates. I. INTRODUCTION Since the term software engineering emerged in 1968 [1] it has motivated a tremendous amount of discussions, works, and research on processes, methods, techniques, and tools for supporting high-quality software development in a wide and industrial scale. Initially, industrial work based on manufacturing introduced several contributions to the software engineering body of knowledge. Many software processes has been supported by industrial work concepts such as functional decomposition and localized labor [2]. During the last decades, techniques and tools has been created as an analogy to the production lines. The first generation of software processes family was based on the waterfall life cycle assuming that the software development life cycle was a linear and sequential similar to a production line [3]. Then, in the early 90 s, other initiatives were responsible for creating iterative and incremental processes such as the Unified Process [4]. Despite these efforts and investments, software projects success rate has presented a dramatic situation in which less than 40% of projects achieve success (Figure 1). Obviously, these numbers may not be compared to other profitable industries [5]. Fig Chaos Report - Extracted from [5] Some specialists argue that software development is different from the traditional industrial work in respect to its nature. Software engineering may be described as knowledge work which is focused on information and collaboration rather than manufacturing placing value on the ownership of knowledge and the ability to use that knowledge to create or improve goods and services [2]. There are several differences between these two kinds of work. While the work is visible and stable in industrial work; it is invisible and changing in knowledge work. 
Considering that knowledge work (including software development) is more uncertain and less defined than industrial work, which is based on predictability, the application of industrial work techniques to knowledge work may lead to projects with increased failure rates. Since 2001, agile methods have emerged as a response for overcoming the difficulties related to software development. Some preliminary results show that agile methodologies may increase success rates, as shown in Figure 2 [5]. Although some results may indicate that agile methodologies help to achieve success in software development, many of these studies fail to present evidence through empirical evaluation. Only through such evaluation is it possible to establish whether and in which context the proposed method or technique is efficient, effective, and can be applied [6] [7] [8]. In particular, in the agile context, only a minority of studies contain an empirical evaluation, as shown in Figure 3 [9].

145 results obtained from this research. Fig. 2. Waterfall vs. Agile - Extracted from [5] Fig. 3. Agile empirical evaluation rate - Extracted from [9] Thus, the scope for this work was defined intending to provide a comparison between agile methods and traditional software development approaches. First, it necessary to point out that there are several agile methodologies such as Scrum, Extreme Programming (XP), Feature-Driven Development, Dynamic Systems Development Method (DSDM), Lean Software Development that are intended to support knowledge work (less defined and more uncertain) [2]. In parallel, it also exist many traditional approaches that are intended to support industrial work (more defined and less uncertain). These methods and processes are usually based on the remarkable frameworks such as PMBoK (Project Management Body of Knowledge) [10] and Unified Process [4]. These methods may include several perspectives such software engineering, project management, design and so on. For an objective analysis, it was chosen the project management perspective. On one hand, for agile methods, it was selected Scrum (project management based); on the other hand, it was chosen any traditional approach that include a perspective for the project management. In this context, a survey was executed in the C.E.S.A.R (Recife Center for Advanced Studies and Systems) using a random sample containing 19 different projects adopting Scrum or any other traditional approach for managing the initiative involving 156 developers. The main expected contributions by this study are listed below: Increase the body of knowledge about Scrum and agile methods using a systematic approach through evidences within an industrial environment. In particular it is intended to reduce the lack of empirical evaluation in software development discussions. Help the organization to understand how to increase internal success rates by analyzing and discussing the Hence, this paper is organized as following. Sections 2 and 3 present the definition for this study, including the conceptual model and the research method used for the survey, respectively. Section 4 is aimed to find out the results obtained from the survey execution. Limitations of this study as well as possible future studies are discussed at Section 5. Section 6 introduces some related studies and, finally, Section 7 presents the conclusion. Additionally, we present the applied questionnaire as well as the used likert anchoring scheme at the appendix. II. CONCEPTUAL MODEL OF CUSTOMER SATISFACTION The research model presented by this study verifies the impact of an independent variable (software development approach) on the project s success indexes considering the customer point of view. This independent variable may be assigned with two different values: Scrum and not Scrum (traditional approaches for software project management). In particular, it is necessary to recognize that customers probably have different definitions for success within a software project. In order to establish an external perspective, the model assumes seven critical factors for customer satisfaction (dependent variables), and consequently, for project success: time, goals, quality, communication and transparency, agility, innovation and benchmark. The next subsections provide more details for each one. A. Time In general, time to market is a critical variable within a software project. Thus, we define a project as successful if agreed and negotiated deadlines are met. 
Since Scrum is based on small iterations, it is expected anticipated delivery of valuable software [11] and also short time-to-market. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates regarding to the time constraints by meeting the agreed and negotiated deadlines. Hypotheses 1: Scrum-based projects provide increased customer satisfaction from the time perspective. B. Goals Software projects are launched for strategic purposes, such as costs reduction, legal compliance, market-share increase, etc. Thus, we define a project as successful if the goals that motivated the endeavor are met. Since Scrum considers a deeper and frequent stakeholder participation and collaboration, it is expected a continuous goals adjustment [11]. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding to the defined goals within a project. Hypotheses 2: Scrum-based projects provide increased customer satisfaction from the goals perspective. C. Quality By definition, quality is the degree to which a set of inherent characteristics fulfill requirements [10]. Product and process quality depend on the software project criticality demanded by the customers. Thus, we define a project as a

146 successful if the required quality standards for that specific situation are met. So, regular inspections (one of the Scrum pillars) are one of most effective quality tools within a software development project [2]. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding to the defined quality standards within a project. Hypotheses 3: Scrum-based projects provide increased customer satisfaction from the quality perspective. D. Communication and Transparency Software projects are expected to create intangible products under a dynamic and uncertain environment. Therefore, frequent and continuous communication is required in order to provide confidence to the stakeholders regarding to the work progress. One of the Scrum pillars is transparency [11]. Thus, we define a project as successful if the customers feel themselves confident as a result of the communication and transparency. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding to the expected level of communication and transparency within a project. Hypotheses 4: Scrum-based projects provide increased customer satisfaction from the communication and transparency perspective. E. Agility Some projects occurring in a fast-moving or timeconstrained environments, call for an agile approach [2]. The main characteristics of an agile software project are the early and continuous delivery of valuable software and ability to provide fast response to changes. Thus, we define a project as successful if the agility expected by the customers is met. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the agility demanded by the customer. Hypotheses 5: Scrum-based projects provide increased customer satisfaction from the agility perspective. F. Innovation Software projects are expected to deliver new softwarebased products and services for users/customers existing and emerging needs. Therefore, the innovation comes through new ways of work, study, entertainment, healthcare, etc. supported by software. Since Scrum also supports the principle of early and continuous delivery of valuable software it is expected that Scrum software development might help to create innovative products and services for the customer business. Thus, we define a project as successful if the innovation expected by the customer is met. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer expectation through innovative products and services generated by the project. Hypotheses 6: Scrum-based projects provide increased customer satisfaction from the innovation perspective. G. Benchmark Usually, software projects are launched as a procurement initiative in which an organization (buyer) hires a development organization (seller) to create a product or service that may be developed by several companies. It is natural that seller organizations do comparison between their suppliers. In this sense, we consider benchmark as a comparison between organizations that develop software. Thus, we define a project as successful if customers may recommend a development organization when comparing its project results to other organizations project results. 
Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates when a project executed by a specific organization is compared with projects executed by other organizations. Hypothesis 7: Scrum-based projects provide increased customer satisfaction from the benchmark perspective.

III. RESEARCH METHOD

In order to define a methodology to guide this study, we chose a survey-based approach and followed five of the six steps recommended by Kitchenham [12], as below:

Setting the objectives: This study investigates the relationship between Scrum adoption (as a software development approach) and customer satisfaction;

Survey design: Cross-sectional, since the survey instrument was applied only once, at a fixed point in time. It is not intended to provide a forward-looking view of changes in the specific population over time;

Developing the survey instrument: It was based on a questionnaire designed to identify customer satisfaction within a particular project, which determines its degree of success from the external point of view;

Obtaining valid data: The questionnaire was sent by e-mail to each project's customer business representatives (e.g. sponsor, product or project managers);

Analyzing the data: Finally, the data analysis was performed using techniques from descriptive and inferential statistics.

The following subsections present discussions related to the population, sample, variables, data collection procedure, and data analysis techniques used in this study.

A. Population

The population targeted by this study consists of software-intensive organizations, including companies of different sizes that develop software-based solutions for a wide variety of markets.

B. Sample

A random sample of projects executed by C.E.S.A.R (Recife Center for Advanced Studies and Systems), which belongs to the target population, was selected.

C.E.S.A.R is an innovation institute with more than 500 employees working on projects from different business domains (e.g. finance, third sector, manufacturing, services, energy, government, telecommunications, etc.), creating solutions for several platforms (mobile, embedded, web, etc.). The number of projects varies from 70 to 100 per year.

Initially, the sample contained 27 projects, but it was reduced to 19 projects because incomplete questionnaire responses were eliminated from the sample. Even so, this represents an effective response rate of 70.3%, which is above the minimum norm of 40% suggested by [13] for academic studies. Furthermore, additional information was collected for each project, including project type, team size, and project nature, as below (Figure 4):

Project type: 5 private and 14 public (Brazilian tax incentive law).
Team size: from 4 to 21.
Project nature: Consulting: 4; Information Systems: 3; Telecommunications: 4; Maintenance: 1; Research & Development (R&D): 6; Embedded Systems: 4.

Notice that one project may have more than one nature. For this reason, the numbers may differ slightly from the sample size.

Fig. 4. Contextual variables

C. Variables

This study contains the following variables:

Independent Variable: The software process is the independent variable and may assume two different values: Scrum (agile method) and Non-Scrum (any traditional approach).

Dependent Variables: The success of a software project is the result of customer satisfaction from an external point of view, considering several aspects: time, goals, quality, communication and transparency, agility, innovation and benchmark. In order to measure customer satisfaction, a Likert scale was used, assuming values from 1 (poor) to 5 (excellent).

Contextual Variables: Project type, team size, and project nature were identified as variables that may potentially influence the results. The project type and nature categorization was previously defined. The team size is the number of people involved during the development, including engineers, designers and managers.

D. Data Collection Procedure

First, the questionnaires were sent to the customer business representatives by e-mail in a Microsoft Excel spreadsheet format. Each document contained the project categorization regarding the contextual variables (project type, nature, and team size) and the independent variable (Scrum/Non-Scrum). The customer business representatives were responsible for answering the questionnaire and then sending it back to the C.E.S.A.R project management office (PMO).

E. Data Analysis Techniques

The data analysis considered two different techniques. First, an exploratory data analysis (descriptive statistics) was performed using tools such as barplots and boxplots in order to gain preliminary insights into the data characteristics regarding measures such as mean, position and variation. Then, hypotheses tests (inferential statistics) were conducted to provide more robust information for the data analysis process, as shown in Table I (a minimal sketch of this setup in R follows the table). Since the exploratory data analysis did not reveal an apparent relevant difference in the obtained results, the alternative hypotheses were modified to verify inequality instead of superiority.

TABLE I. STUDY HYPOTHESES (subscript s: Scrum group; ns: Non-Scrum group)
Null Hypotheses (NH)     Alternative Hypotheses (AH)
(NH1) Ts = Tns           (AH1) Ts ≠ Tns
(NH2) Gs = Gns           (AH2) Gs ≠ Gns
(NH3) Qs = Qns           (AH3) Qs ≠ Qns
(NH4) CTs = CTns         (AH4) CTs ≠ CTns
(NH5) As = Ans           (AH5) As ≠ Ans
(NH6) Is = Ins           (AH6) Is ≠ Ins
(NH7) Bs = Bns           (AH7) Bs ≠ Bns
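To make the setup of Sections III-C and III-E concrete, the following R sketch (R being the language in which the tests were later run) shows one way the graded questionnaires could be organized and explored. Everything in it is illustrative: the data frame `projects`, the column names (time, goals, quality, ct, agility, innovation, benchmark) and the generated grades are our own placeholders, not the study's actual data.

# Minimal sketch of the exploratory step (descriptive statistics).
# All grades are hypothetical Likert values (1-5); the real questionnaire
# responses of the 19 projects are not reproduced here.
set.seed(1)                                 # reproducible fake data
n_scrum <- 7; n_nonscrum <- 12              # group sizes reported in the study
aspects <- c("time", "goals", "quality", "ct",
             "agility", "innovation", "benchmark")

projects <- data.frame(group = factor(c(rep("Scrum", n_scrum),
                                        rep("NonScrum", n_nonscrum))))
for (a in aspects) {                        # one Likert-graded column per aspect
  projects[[a]] <- sample(1:5, n_scrum + n_nonscrum, replace = TRUE)
}

# Central tendency: mean grade of each aspect per group
group_means <- aggregate(projects[aspects],
                         by = list(group = projects$group), FUN = mean)
print(group_means)

# Barplot of the means and boxplot of the dispersion per group
barplot(as.matrix(group_means[, aspects]), beside = TRUE,
        legend.text = as.character(group_means$group),
        ylab = "Mean grade (Likert 1-5)")
boxplot(time ~ group, data = projects, ylab = "Time grade (Likert 1-5)")

The per-group means and boxplots produced this way correspond to the kind of summaries reported as Figures 5 and 6 in the next section.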
IV. RESULTS

A. Descriptive Statistics - Exploratory Data Analysis

Initially, the final sample (the one containing 19 projects) was divided into two groups (Scrum and Non-Scrum). Then, exploratory data analysis techniques (descriptive statistics) were applied in order to determine the central tendency, position and dispersion of the data set. On the one hand, barplots (Figure 5) helped to identify the means (central tendency) for each variable representing a different aspect of customer satisfaction. On the other hand, boxplots (Figure 6) helped to reveal the data dispersion and position [14].

Fig. 5. Dependent variables means

Fig. 6. Dependent variables boxplots

According to the barplots in Figure 5, the projects using Scrum presented better results for the following aspects: time, communication and transparency, and agility. The projects that did not use Scrum presented better results for the quality, goals, innovation and benchmark aspects. Despite these results, it is not possible to assume that either group (Scrum or Non-Scrum) has an absolute advantage.

According to the boxplots in Figure 6, some comments can be made about each aspect of customer satisfaction, considering the grades obtained from the sample observations:

Time (T): For the Scrum projects, the grades presented a dispersion from two to five, and the second and third quartiles coincide, showing that grade four was given by many customers. For the Non-Scrum projects, the grades presented a more concentrated behavior, with a dispersion from three to five, and the first and second quartiles coincide.

Goals (G): For both groups, it was possible to identify a more concentrated data dispersion: from four to five in the Scrum projects and from three to four in the Non-Scrum projects. Besides, there are many occurrences of grade four in both groups. In particular, for the Non-Scrum group, an outlier (the grade five) can be seen.

Quality (Q): For the Scrum group, the variation (dispersion) was from three to four and the mode (most frequent value) was four, with two outliers (the grades two and five). For the Non-Scrum group, there was a lot of data dispersion, from grade one to five, and three was the mode.

Communication and Transparency (CT): For the Scrum group, there was a variation (data dispersion) from grade two to five, without a predominance of any value. For the Non-Scrum group, the grades were more concentrated, from four to five, and the mode was five.

Agility (A): The boxplots of both groups (Scrum and Non-Scrum) for the agility variable were extremely similar, presenting a variation from grade three to five, and the mode was grade four.

Innovation (I): For the Scrum group, the variation was from grade four to five, with an outlier (the grade three). For the Non-Scrum group, the grades presented a dispersion from grade two to five.

Benchmark (B): For both groups, the variation was the same, from grade three to five, without any additional information.

Finally, it is not possible to determine a relevant difference between the results of the two groups considering the seven dependent variables as aspects of customer satisfaction. Therefore, there is no evidence of an advantage for the projects in which Scrum was applied.

B. Inferential Statistics - Hypotheses Tests

Since the exploratory data analysis (descriptive statistics) was not able to provide any conclusion for this study, we decided to proceed with another method. Hypotheses tests (inferential statistics) were then used, intending to establish a systematic basis for a decision about the behavior of the data set. First, the same segmentation as before was applied, separating the sample into two groups: Scrum (seven elements) and Non-Scrum (12 elements) projects. Thus, we treated them as two independent samples containing ordinal data. In this case, a nonparametric test for ordinal variables is recommended. In particular, the Mann-Whitney U test was chosen [15]. When performing a nonparametric (or distribution-free) test, there is no need to perform any kind of normality test (goodness of fit).
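The authors' original script is not given in the paper; as a minimal sketch, the Mann-Whitney U test is available in base R as wilcox.test. The call below compares the two groups for one aspect of the hypothetical `projects` data frame introduced earlier, using a two-sided alternative to match the inequality form of the alternative hypotheses in Table I.

# Mann-Whitney U test (wilcox.test in base R) for one aspect, e.g. Time.
# Uses the hypothetical `projects` data frame from the earlier sketch;
# alternative = "two.sided" corresponds to the inequality hypotheses (AH1-AH7).
time_test <- wilcox.test(time ~ group, data = projects,
                         alternative = "two.sided", exact = FALSE)
print(time_test)   # the reported p-value drives the decision on NH1

Because Likert grades produce many ties, the exact = FALSE option forces the normal approximation and avoids the exact p-value warning.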
The choice of the Mann-Whitney U test did not harm the analysis: in situations where the data are normal, the loss of efficiency compared to Student's t test is only 5%, and in situations where the data distribution has a heavier tail than the normal distribution, the U test is the more efficient one [14]. Thus, hypotheses tests were performed (using the U test) in the R language to determine equality or inequality of the sample means for each group (Scrum and Non-Scrum) from the perspective of each aspect (dependent variable). According to the previous hypothesis definitions, equality was to be accepted if the null hypothesis could not be rejected; otherwise (in case of null hypothesis rejection), inequality was assumed.
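Continuing the sketch (still on the hypothetical data, and assuming a conventional 0.05 significance level, which the excerpt above does not state), the decision rule can be applied to all seven dependent variables at once:

# Apply the U test to all seven aspects and derive the NH/AH decisions.
# alpha = 0.05 is an assumed conventional threshold, not taken from the paper.
aspects <- c("time", "goals", "quality", "ct",
             "agility", "innovation", "benchmark")
alpha <- 0.05
p_values <- sapply(aspects, function(a)
  wilcox.test(projects[[a]] ~ projects$group,
              alternative = "two.sided", exact = FALSE)$p.value)
decisions <- ifelse(p_values < alpha,
                    "reject NH (inequality)", "do not reject NH (equality)")
print(data.frame(aspect = aspects, p.value = round(p_values, 3),
                 decision = decisions))

A row whose p-value stays at or above the threshold keeps the corresponding null hypothesis (equality); a row below it rejects the null hypothesis in favor of inequality.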
