Monday, October 27, 2014

Using Regular Expressions for Program Transformation


I have been investigating the problem of practical code transformation to deal with small changes that have to be made across large collection of programs in different programming languages. So the problem is not to transform a program entirely but doing small transformation to a large set of programs that too when we have C, Perl, Java etc source programs in that set. This has practical application in software development, code refactoring and testing. A lot of cost savings in software testing is possible if test suits are well maintained and are automagically transformed using some method. I used a method and published findings on ip.com which is rather a full system explanation of what I described here in 2011 There are some interesting obstacles to transforming programs. I will try to explain as well as understand some of them in this post.

When we to want transform a set of programs in different languages what we are dealing with is a bunch of Context Free Grammars (CFG) implemented using syntax symbols that are a result of 'wild' imagination of authors (wild characters can not be domesticated that is!). The transformation therefore has to deal with two things: 1) CFG 2) syntax symbols. It is not obvious, but I have worked out after a some effort and intuition that best chance to modify these programs is only through full knowledge of each of these CFGs and their syntax. However, we can make some complex changes of practical importance without their knowledge using Regular Expressions (RE).

Regular expressions are nothing but representation of Finite Automata (FA). From the knowledge of Formal Systems we know that not all CFGs have a corresponding REs. Therefore even partial CFG parsing using REs may be impossible depending on the portion of CFG we are looking to transform. In other words, regex alone can not make some modifications no matter how small the change is. For a programer this would mean, there will always be mismatches for some REs when used across multiple programs (same language). The problem of coming up with a RE to always match patterns across programming languages or in single programming language defined by CFG are not solvable.

It is also tempting to think that, using some specially designed syntax symbols, modification of a program using RE may be possible. Unfortunately as long as what we use is RE we can not modify a language defined by CFG no matter what syntax symbols we use. The problem lies in the grammar, not syntax symbols.

As stated earlier, we can still make many meaningful and useful changes to programs across languages. We used two heuristic approaches successfully (very!):

1) Identify if what needs modified (portion) across the programs is an FA. If it is, write a regular expression to modify it. This will not fully succeed because syntax symbols of various languages may make it impossible to identify an FA. Thus this as well will work partially.

2) Identify a subset of CFG that can be matched easily using a set of REs and custom parsing. This can be used to modify the subset of CFG across programs and languages again partially.

If we learn to make better use of successful transformation we can achieve a lot of practical advantages and save labour. For idealists, a plugin can be implemented for all languages in question to program the compiler front end to do the transformation for us. I have written a small interpreter that can accept instructions on what to change in a program. I have called it the Test Mutation Language.

Ragu Kattinakere