protozoa ([info]protozoa) wrote in [info]advancedcpp,
@ 2003-09-12 18:20:00
A Challenge
Who can make the shortest C++ program to syntax color and escape C++ for the web using only standard C++? I think it's a difficult problem because of the scanning. C++ needs regular expressions.

Updated: I added what I meant to in the original challenge; only standard C++ and syntax coloring.


#!/usr/bin/perl

print "<pre><code>\n";
while(<>)
{
    # C style comments
    if(/\/\*/)
    {
        s|(/\*.*$)|<font color="#008000">$1|g;
        print $_;
        while(<>)
        {
            if(/\*\//)
            {
                last;   # Perl exits a loop with "last", not "break"
            }
            print $_;
        }
        s|(^.*?\*\/)|$1</font>|g;
    }
    

    # The & character
    s/&/&amp;/g;

    # The < and > characters
    s/</&lt;/g;
    s/>/&gt;/g;

    # Double quoted strings
    s|(".*?")|<font color="#800000">$1</font>|g;

    # The # preprocessor statements
    s|(^\s*#\w+\b)|<font color="#000080">$1</font>|g;
    
    # C++ comments
    s|(//.*?$)|<font color="#008000">$1</font>|g;
    
    # Keywords
    s/( const| public| template| class| typename)/<font color="#000080">$1<\/font>/g;
    s/( if| switch| break| case| default| else)/<font color="#000080">$1<\/font>/g;
    s/( int| double| for| while| return| using)/<font color="#000080">$1<\/font>/g;
    s/( namespace| void| bool| operator| struct)/<font color="#000080">$1<\/font>/g;
    
    print $_;
}
print "</code></pre>\n";





[info]maharg
2003-09-12 07:00 pm UTC (link)
Well, you could always use boost regular expressions. Not quite as thoroughly integrated into the language as perl RE, but that's not the sort of thing C++ is designed for.

You could probably do it in about twice as many lines or less, but not by the same method. You would, rather, tokenize it and then work with the tokens instead. Maybe I'll try my hand at it a bit later.
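A minimal sketch of the tokenize-first idea, assuming a token is either a run of identifier characters or a single other character. The name `tokenize` is hypothetical and this is not the attempt mentioned above; it only shows why whole-token matching avoids coloring the `if` inside `ifstream`.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits source text into identifier-like words and
// single non-word characters, so keywords can be compared as whole tokens.
std::vector<std::string> tokenize( const std::string& src )
{
    std::vector<std::string> tokens;
    std::string::size_type i = 0;
    while( i < src.size() ) {
        unsigned char c = static_cast<unsigned char>( src[i] );
        if( std::isalnum( c ) || c == '_' ) {
            // Consume the whole run of identifier characters as one token.
            std::string::size_type start = i;
            while( i < src.size() &&
                   ( std::isalnum( static_cast<unsigned char>( src[i] ) ) ||
                     src[i] == '_' ) )
                ++i;
            tokens.push_back( src.substr( start, i - start ) );
        } else {
            // Whitespace, operators and punctuation become one-character tokens.
            tokens.push_back( std::string( 1, src[i] ) );
            ++i;
        }
    }
    return tokens;
}
```

Each token can then be looked up in a keyword list and wrapped in markup before being written out, with everything else passed through the HTML escaping step.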



[info]protozoa
2003-09-13 12:48 am UTC (link)
I meant to put a condition on the challenge but I forgot: you have to use only the standard library. Boost would make it easier.



[info]maharg
2003-09-13 01:24 am UTC (link)
I actually felt this went without saying, but only because this is a for fun sort of challenge.

I wonder at the reasoning of restricting it in this way, however. The fact is, C++ (not to mention C in the first place) was designed to be built on with libraries. Most languages would have a set of generic containers built into the language in order to avoid the troubles that come with generalized generics, but C++ makes it possible to reimplement nearly all of its standardized types in situ.

I should also point out that a regex library is in the Library Technical Report released this year by the standards committee, which does mean it stands a pretty good chance of becoming part of the standard library.



[info]protozoa
2003-09-13 01:33 am UTC (link)
I'm very happy that regexes will make it into the next C++. I hope threads make it in too.



[info]maharg
2003-09-13 01:25 am UTC (link)
This also applies to what you said in the original post. C++ doesn't need regular expressions, it already has them via boost. That's the way it works with this language.



[info]kenshinhimura
2003-09-12 07:13 pm UTC (link)
I created this code for my first post (to make sure that the cpp code actually compiles and is exactly identical to what I'll be posting).

This code handles the tab, less-than, and greater-than characters. (I think I forgot to implement the & handling.)


    ifstream fileReader( argv[1], ios_base::binary );

    ostringstream reader;
    ostreambuf_iterator<char> readerIter( reader );

    istreambuf_iterator<char> cursor( fileReader );
    istreambuf_iterator<char> end;

    reader << "<pre><font face=\"courier new\">";
    for( ; cursor != end; ++cursor ) {
        switch( *cursor ) {
        case '<':
            reader << "&lt;";
            break;
        case '>':
            reader << "&gt;";
            break;
        case '\t':
            reader << "    ";
            break;
        default:
            readerIter = *cursor;
        }
    }

    reader << "</font></pre>" << endl;


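To implement the keyword coloring you'll probably need to copy the contents of the ostringstream to a string and then do something like:
string::size_type pos = 0;
const string sReplacement = "<font color=blue>while</font>";
while( (pos = sMyStringBuffer.find( "while", pos )) != string::npos ) {
    sMyStringBuffer.replace( pos, 5, sReplacement );  // 5 == strlen("while"); sizeof would also count the '\0'
    pos += sReplacement.size();                       // skip the inserted markup so the loop terminates
}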


A more elegant solution is to extract that code into its own function.

I haven't gotten around to implementing the comment and preprocessor coloring though. Any ideas?



[info]protozoa
2003-09-13 01:07 am UTC (link)
I've written a character by character scanner in C++ and it wasn't a couple of lines. I think a good solution will have to be around 100 lines of code. That's still pretty short.
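For the comment coloring, a character-by-character scanner might look roughly like the sketch below. It is not the scanner mentioned above; `colorizeComments` and the color value are hypothetical, and it assumes quotes in the input have not been HTML-escaped. It tracks whether it is in code, a `//` comment, a `/* */` comment, or a string/character literal, so a `//` inside a string is left alone.

```cpp
#include <string>

// Hypothetical scanner: wraps // and /* */ comments in <font> tags and
// skips over string and character literals so "//" inside them is ignored.
std::string colorizeComments( const std::string& src )
{
    enum State { Code, LineComment, BlockComment, Literal };
    const std::string open = "<font color=\"green\">";
    const std::string close = "</font>";
    std::string out;
    State state = Code;
    char quote = 0;  // the quote character that opened the current literal
    for( std::string::size_type i = 0; i < src.size(); ++i ) {
        char c = src[i];
        char next = i + 1 < src.size() ? src[i+1] : '\0';
        switch( state ) {
        case Code:
            if( c == '/' && next == '*' ) {
                out += open + "/*";   // consume both characters so "/*/" does not close
                ++i;
                state = BlockComment;
                continue;
            }
            if( c == '/' && next == '/' ) { out += open; state = LineComment; }
            else if( c == '"' || c == '\'' ) { quote = c; state = Literal; }
            out += c;
            break;
        case LineComment:
            if( c == '\n' ) { out += close; state = Code; }
            out += c;
            break;
        case BlockComment:
            out += c;
            if( c == '*' && next == '/' ) { out += '/'; ++i; out += close; state = Code; }
            break;
        case Literal:
            out += c;
            if( c == '\\' && next != '\0' ) { out += next; ++i; }  // copy the escaped character
            else if( c == quote ) state = Code;
            break;
        }
    }
    if( state == LineComment || state == BlockComment )
        out += close;  // close the tag if the input ends inside a comment
    return out;
}
```

Preprocessor lines and keywords would need extra states or a second pass, which is where the line count starts to climb.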



[info]maharg
2003-09-13 01:17 am UTC (link)
My first-pass attempt was about 150 lines, without having been refactored (which it needs), with comments, and with each keyword on a separate line. It's also more robust than the perl example presented.

I'll paste it tomorrow after I refactor it.



[info]kenshinhimura
2003-09-13 07:08 am UTC (link)
Part 1:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <iterator>   // istreambuf_iterator
#include <cctype>     // isspace
#include <cstring>    // strlen
using namespace std;

void htmulateStream( istream& inStream, ostream& htmulated )
{
    htmulated << "<pre><font face= courier>";

    istreambuf_iterator<char> cursor( inStream );
    istreambuf_iterator<char> end;
    for( ; cursor != end; ++cursor ) {
        switch( *cursor ) {
        case '<':
            htmulated << "&lt;";
            break;
        case '>':
            htmulated << "&gt;";
            break;
        case '\t':
            htmulated << "    ";
            break;
        case '&':
            htmulated << "&amp;";
            break;
        default:
            htmulated << *cursor;
        }
    }

    htmulated << "</font></pre>" << endl;
}

bool isCppOperator( char c )
{
    switch( c ) {
    case '*': case '/': case '+': case '-':
    case '&': case '|': case ':': case '.':
    case '(': case ')': case '<': case '>': case '{': case '}':
        return true;
    default:
        return false;
    }
}

bool isPossibleKeywordSeparator( char cPre, char cPost )
{
    return isspace( cPre ) || isCppOperator( cPre ) || cPost == '\'' &&
           isspace( cPost ) || isCppOperator( cPost ) || cPost == '\'';
}



[info]kenshinhimura
2003-09-13 07:15 am UTC (link)
part 2:
string colorizeKeywords( const string& sBuffer )
{
    // Incomplete keyword list; just plug in more keywords here.
    static const char * szKeyWords[] = {
        "int", "bool", "string", "double", "void", "char",
        "static", "short", "long", "virtual", "const", "unsigned",
        "if", "for", "while", "do", "switch", "case", "default",
        "class", "struct", "union", "namespace",
        "private", "protected", "public",
        "friend", "throw",   // "throws" is Java, not C++
        "template", "typename", "typedef",
        "using",
        "return", "break", "continue"
    };

    string sColorized = sBuffer;

    int keyWordListSize = sizeof szKeyWords / sizeof szKeyWords[0];
    for( int i = 0; i < keyWordListSize; ++i ) {
        string sReplacement = "<font color = blue>";
        sReplacement += szKeyWords[i];
        sReplacement += "</font>";
        string::size_type pos = string::npos;
        while( (pos = sColorized.find( szKeyWords[i], pos+1 )) != string::npos )
            if( isPossibleKeywordSeparator( sColorized[pos-1] ,
                                            sColorized[pos+strlen(szKeyWords[i])] ) )
            {
                cout << "Found keyword " << szKeyWords[i]
                     << " at pos = " << pos << " replacing..." << endl;
                sColorized.replace( pos, strlen( szKeyWords[i] ),
                                    sReplacement );
                pos += sReplacement.size();
            }
    }

    return sColorized;
}



[info]kenshinhimura
2003-09-13 07:18 am UTC (link)
Part 3:
string colorizeSequence( const string& sFirst, const string& sLast,
                         const string& sBuffer, const string& sColor )
{
    string sColorized = sBuffer;
    string::size_type pos = string::npos;

    while( (pos=sColorized.find( sFirst, pos+1 )) != string::npos ) {
        string::size_type lastPos = sColorized.find( sLast, pos+1 );
        if( lastPos != string::npos ) {
            string sReplacement = "<font color = ";
            sReplacement += sColor;
            sReplacement += ">";
            string::size_type length = lastPos+sLast.size() - pos;
            sReplacement += sColorized.substr( pos, length );
            sReplacement += "</font>";

            cout << "found sequence ( " << sColorized.substr( pos, length )
                 << " ) at pos = " << pos
                 << " replacing..." << endl;

            sColorized.replace( pos, length, sReplacement );
            pos+=sReplacement.size()-1;
        }
    }

    return sColorized;
}

int main( int argc, char* argv[] )
{
    if( argc != 2 ) {
        cerr << "Usage: CppToHtm <cppfile>" << endl;
        return 1;
    }

    ifstream fileReader( argv[1], ios_base::binary );
    ostringstream reader;

    htmulateStream( fileReader, reader );

    string sColorized = colorizeKeywords( reader.str() );
    sColorized = colorizeSequence( "//", "\n", sColorized, "green" );
    sColorized = colorizeSequence( "/*", "*/", sColorized, "green" );
    sColorized = colorizeSequence( "\"", "\"", sColorized, "000080" );
    sColorized = colorizeSequence( "'", "'", sColorized, "000080" );
    sColorized = colorizeSequence( "#", "\n", sColorized, "blue" );

    string outputFilename = argv[1];
    outputFilename += ".htm";
    ofstream fileWriter( outputFilename.c_str(), ios_base::binary );
    fileWriter << sColorized;
}


Statistics:
LOC = 146 (including blank lines)
Used STL = yes
User Input = only when invoking the program
Used IOStreams = yes
Used External Libraries = no
Used Platform Specific API = no
Cross Platform = too lazy to try

Bugs:
1. The coloring for sequences is not perfect because I'm too lazy to implement stack parsing.
2. Escape sequences will cause coloring problems, especially for single and double quotes.
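One way the second bug could be handled, as a rough sketch: rather than searching for the next quote with `find`, scan forward and skip whatever follows a backslash. `findClosingQuote` is a hypothetical helper, not part of the program above, and it assumes `openPos` is the index of an opening `"` or `'`.

```cpp
#include <string>

// Hypothetical helper: given the index of an opening quote, returns the index
// of the matching closing quote, skipping backslash-escaped characters.
// Returns std::string::npos if the literal is never closed.
std::string::size_type findClosingQuote( const std::string& text,
                                         std::string::size_type openPos )
{
    const char quote = text[openPos];
    for( std::string::size_type i = openPos + 1; i < text.size(); ++i ) {
        if( text[i] == '\\' )
            ++i;               // skip the escaped character, whatever it is
        else if( text[i] == quote )
            return i;
    }
    return std::string::npos;
}
```

colorizeSequence could call something like this in place of the second `find` when the delimiter is a quote character.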

I'm tired from working out so I'm just going to sleep after this... Or maybe watch adult swim, whichever comes first.



[info]kenshinhimura
2003-09-13 07:37 am UTC (link)
additional comment:

I hate std::string's find mechanism. As much as I've pretty much adapted to it, I still get bugs every time I use the damn thing.

It's just not intuitive.



[info]maharg
2003-09-13 10:48 pm UTC (link)
You know, the reason std::string is so much different from the other containers is because it was based on an entirely different library than the rest.

Basically, the standard library as it exists now came originally from 3 places:
- iostream, which predates the standardization effort and can almost be thought of as a sort of coevolution by the various compiler vendors.
- the string class, the origins of which are not exactly clear to me. It may have simply been a proposal that had no prior art, but I can't say for sure. In its original incarnation it did not conform to the iterator-based container model it does now; all accesses were done with indexes.
- the STL, by Stepanov and Lee, which had most of the other containers and what is now .

This is why these three components tend to communicate with each other poorly or not at all (no overloads for std::string in any of the iostream functions -- including the ones for opening files, the multiple personalities of the string class, the fact that the locale library is not used at all by the string class, which uses the, imo unfortunate, char_traits mechanism exclusively).



[info]kenshinhimura
2003-09-14 05:41 am UTC (link)
I did not know that. Here I assumed that the class design sprang forth from the stl classes, but was convoluted by "design by committee" crap.

Hmm... char_traits is also used in the standard facets (part of C++'s i18n mechanism). I have been reading Standard C++ IOStreams and Locales recently, and I saw some heavy use of std::string there. Maybe they just hammered it into the library and tried to make it fit?

Interesting facts though.

Oh hey if anyone is going to use the code I posted above here's a bug fix to a function:

bool isPossibleKeywordSeparator( char cPre, char cPost )
{
    return ( isspace( cPre ) || isCppOperator( cPre ) ) &&
           ( isspace( cPost ) || isCppOperator( cPost ) );
}


Bug Note: certain non-keyword substrings were being "colorized" because of operator precedence problems. Oh, and I removed a useless condition that I put in because... I forgot why, actually.

eg.

ifstream         //had problems
ostringstream    //had no problems
stringstream     //had problems
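An alternative to listing separator characters, sketched here under the assumption that a keyword match only counts when it is not glued to other identifier characters. `isIdentChar` and `isWholeWord` are hypothetical names, not part of the posted program. This variant also treats the start and end of the buffer as boundaries, so it never reads `text[pos-1]` at position 0.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: true if c can be part of a C++ identifier.
bool isIdentChar( char c )
{
    return std::isalnum( static_cast<unsigned char>( c ) ) || c == '_';
}

// True if text[pos, pos+len) has no identifier character directly
// before or after it, i.e. it stands alone as a word.
bool isWholeWord( const std::string& text,
                  std::string::size_type pos,
                  std::string::size_type len )
{
    bool startOk = pos == 0 || !isIdentChar( text[pos-1] );
    bool endOk = pos + len >= text.size() || !isIdentChar( text[pos+len] );
    return startOk && endOk;
}
```

With this check the `if` in `ifstream` and the `string` in `stringstream` are rejected, while `int` in `vector&lt;int&gt;` is still accepted because `;` and `&` are not identifier characters.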



[info]maharg
2003-09-14 02:51 pm UTC (link)
The locale library was added explicitly as part of the standard to make iostreams more flexible. It had no prior art, so it takes better advantage of the string classes in a lot of places (and not at all in others; see codecvt for example).

The iostreams themselves don't really provide anything in that way, with all the std::string overloads being in the <string> header (std::getline(istream&, string&), operator>>(istream&, string&), etc.). Like I said, not even the fstream functions for opening a file take a string; you have to use .c_str() to convert.

Locales were a good idea, but I personally think they were poorly implemented. No matter what, they are a guaranteed performance bottleneck, with multiple virtual calls taking place for any given operator<</>>. Doesn't seem like a big deal until you need it in a server for a network protocol that is line/string based.

Char_traits, on the other hand, was a complete flub. Its facilities are really quite useless and don't, in the end, allow for any useful expansion of functionality of the std::string class or even the facets (for example, there is no way to implement a useful UTF-8 std::string because char_traits doesn't do enough to allow for shifted multibyte characters).

I'm very fond of the STL component of the standard library, but the iostream, locale, and string aspects of it are underdesigned, most likely in order to have some level of conformance with pre-standard (non-templated) iostreams and string classes.



[info]maharg
2003-09-14 05:22 pm UTC (link)
Here's mine. Quite a bit longer than the other two (230 lines), but also more correct as it's character based rather than line&regex based.


