Use Regular Expression To Match C Comments
We want to match multi-line comments in the C programming language with regular expressions. This is the code we need to deal with:
/**a*/b/*c**/
A first attempt might be a regular expression like /\*.*\*/ (using .* to match the body of the comment).
But it does not work. We get this result:
/**a*/b/*c**/
instead of the expected result:
/**a*/
This happens because quantifiers in regular expressions are greedy by default: they always try to match as much as possible, so .* runs on to the last */ in the input.
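To see the greedy behavior for ourselves, here is a minimal JavaScript sketch. The variable names are only for illustration, and the slashes are escaped because the pattern is written as a JavaScript regex literal:

```javascript
// Naive pattern: "/*", then anything, then "*/". The `.*` part is greedy.
const greedy = /\/\*.*\*\//;
const source = "/**a*/b/*c**/";

// The greedy quantifier runs to the LAST "*/" on the line,
// so the match swallows both comments and the code between them.
const greedyMatch = greedy.exec(source)[0];
console.log(greedyMatch); // prints "/**a*/b/*c**/"
```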
Modern regular expression engines have added extensions to traditional regular expressions. For example, many engines support non-greedy quantifiers.
Regular expressions in JavaScript support non-greedy quantifiers such as *?, +?, {n,m}?, etc.
With those non-greedy quantifiers, matching C comments is easy:
/\/\*.*?\*\//.exec("/**a*/b/*c**/")[0];
We will get the expected result:
'/**a*/'
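One caveat when the comment spans several lines: in JavaScript, . does not match line terminators unless the s (dotAll) flag is set. The sketch below uses a made-up input string to show the difference, along with the [\s\S] character class that is commonly used when the flag is not available:

```javascript
// Hypothetical two-line comment, used only for this illustration.
const multiLine = "/* first line\n * second line */ int x;";

// Without the `s` flag, `.` stops at "\n", so the pattern finds nothing.
const lazyNoFlag = /\/\*.*?\*\//.exec(multiLine);
console.log(lazyNoFlag); // prints null

// With the `s` (dotAll) flag, `.` also matches line terminators.
const lazyDotAll = /\/\*.*?\*\//s.exec(multiLine)[0];

// [\s\S] matches any character, newline included, without needing the flag.
const lazyClass = /\/\*[\s\S]*?\*\//.exec(multiLine)[0];

console.log(lazyDotAll === lazyClass); // prints true
```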
Besides non-greedy quantifiers, advanced regular expressions support negative lookahead, (?!...), which gives us another solution:
/\/\*([^*]|\*(?!\/))*\*\//.exec("/**a*/b/*c**/")[0];
We will also get the expected result here:
'/**a*/'
But these solutions are not good enough, because many important regular expression engines do NOT support these advanced features.
For example, flex and leex only support basic regular expressions.
When we use these tools, basic regular expressions are all we have.
We can read the documents of flex and the documents of leex if we are not sure about anything.
So how do we solve this problem?
The Good Solution
The final solution can be explained by this pseudo regular expression:
{START}({NOT_WORRYING}*{WORRYING}+{NOT_WORRYING_NOR_FINAL})*{NOT_WORRYING}*{WORRYING}+{FINAL}
With this solution, the regular expression can be written as:
/\*([^*]*\*+[^*/])*[^*]*\*+/
The details of this solution can be found on this site.
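To make it clearer, we can split the expression into pieces. Here {WORRYING} is a * (it might be the start of the closing */), {NOT_WORRYING} is any other character, {NOT_WORRYING_NOR_FINAL} is any character that is neither * nor /, and {FINAL} is the / that ends the comment: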
{START}                                                 /\*
({NOT_WORRYING}*{WORRYING}+{NOT_WORRYING_NOR_FINAL})*   ([^*]*\*+[^*/])*
{NOT_WORRYING}*                                         [^*]*
{WORRYING}+{FINAL}                                      \*+/
Now it's much easier to understand this solution.
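As a quick check, we can assemble the expression from its named pieces in JavaScript and try it on the first example. This is only a sketch: the constant names mirror the placeholders above, and building the pattern with new RegExp is our own choice for illustration (tools like flex have their own syntax for named definitions).

```javascript
// Named pieces, following the pseudo regular expression in the text.
const START = String.raw`/\*`;                // the opening "/*"
const NOT_WORRYING = String.raw`[^*]`;        // anything except "*"
const WORRYING = String.raw`\*`;              // a "*" that might start "*/"
const NOT_WORRYING_NOR_FINAL = String.raw`[^*/]`; // neither "*" nor "/"
const FINAL = "/";                            // the "/" that closes the comment

const commentSource =
  `${START}(${NOT_WORRYING}*${WORRYING}+${NOT_WORRYING_NOR_FINAL})*` +
  `${NOT_WORRYING}*${WORRYING}+${FINAL}`;

console.log(commentSource); // prints /\*([^*]*\*+[^*/])*[^*]*\*+/

// Only basic features are used: character classes, grouping, * and +.
const commentRe = new RegExp(commentSource);
console.log(commentRe.exec("/**a*/b/*c**/")[0]); // prints /**a*/
```

The same expression also accepts the shortest comments, such as /**/ and /***/, because the repeated group may match zero times.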
Complex Example
Let's write a C source file that contains some tricky comments:
/*** this is c comment ** /
 **/
int blah(struct myobj **p) {
    return (*p)->f(p);
}
/* /* /*
 * other c comment
 */
Let's match the comments in JavaScript code:
// The contents of the C file have been read into variable `c`.
// No `s` flag is needed: negated character classes such as [^*] already match newlines.
/\/\*([^*]*\*+[^*/])*[^*]*\*+\//.exec(c)[0];
The result of the JavaScript expression:
'/*** this is c comment ** /\n **/'
As we can see, we have matched the complete comment.
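exec without the g flag only returns the first comment. To collect every comment in the file, one option is String.prototype.matchAll with a global regular expression. In the sketch below, the file contents are inlined as a string (with the indentation we assume the file has) so that the example is self-contained:

```javascript
// The sample C file from above, inlined for the sake of a runnable example.
const cSource = [
  "/*** this is c comment ** /",
  " **/",
  "int blah(struct myobj **p) {",
  "    return (*p)->f(p);",
  "}",
  "/* /* /*",
  " * other c comment",
  " */",
  "",
].join("\n");

// Same expression as before, with the `g` flag so that matchAll can iterate.
const allComments = /\/\*([^*]*\*+[^*/])*[^*]*\*+\//g;

// Keep only the full text of each match.
const found = Array.from(cSource.matchAll(allComments), (m) => m[0]);

console.log(found.length); // prints 2
for (const comment of found) {
  console.log(JSON.stringify(comment));
}
```

Note that the second comment contains several /* sequences. C comments do not nest, so the expression correctly treats them as ordinary text inside one comment.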
Using sed
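We can also use this solution with the sed command. The -E option enables extended regular expressions, and -z (a GNU sed option) makes sed treat the whole file as a single record, so a comment can span several lines.
For example, if we want to remove all comments (replace each one with an empty string), we can run: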
sed -z -E "s#/\*([^*]*\*+[^*/])*[^*]*\*+/##g" a.c
The result:
int blah(struct myobj **p) {
    return (*p)->f(p);
}
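The same replacement can be sketched in JavaScript with String.prototype.replace. The function name stripComments and the sample input are ours, not part of any library:

```javascript
// Remove every /* ... */ comment from a string of C source code.
// Hypothetical helper for illustration only.
function stripComments(code) {
  return code.replace(/\/\*([^*]*\*+[^*/])*[^*]*\*+\//g, "");
}

const sample = "/* header */\nint x; /* note */\nint y;\n";
console.log(JSON.stringify(stripComments(sample)));
// prints "\nint x; \nint y;\n"
```

Like the sed command, this works purely on text: a /* ... */ sequence inside a C string literal would be removed as well, and // line comments are left untouched. Handling those cases needs a real lexer, not a single regular expression.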