100,000 strong - CodeFactory server

12 Mar 2013

The Inception

January 2012 was an idyllic time for us. Three of us had just teamed up to build something cool. There was no planning for the future, no sort of agreement - we were just three geeks sitting in the dorm room who wanted to build a product. We started working on MyCareerStack where there was supposed to be interview questions, tutorials etc. Soon, we realized that there was code editor needed, but that was easy. The harder part was that people don’t only want to write code online, they want to run them. Now that I had never done before!

What resulted then was a hacky few hundred lines of code in C, which could compile and run code in just C, C++ and Java. It might have been even the worst piece of code I had ever written, but it worked and it was sweet!

  /* create FIFO to be used here & later for IPC. */
    char fifo_1[NAME_MAX];
    createFilePath(fifo_1, id, dir, FIFO_FILE_1);

    umask(0);      /* reset file creation mask. */
    
    int readfd_1, writefd_1, readfd_2;
 
    if (mkfifo(fifo_1, S_IRWXU | S_IRWXG | S_IRWXO) < 0)
        err(1, NULL);

    /* Fork a child to exec the process: {id}_a.out */
    pid_t pid;
    int status;

    signal(SIGALRM, handle_tle);

    /* change current working directory to mycareerstack */
    chdir(RUNTIME_DIR);


Wait, you haven't seen the ugly part yet! This will make you cringe, it made me too. See what the hell I am doing with re-tries, but it was required!

    int tries = 0;

    /* try the exec 100 times. */
    while(tries < 100) {
        /* run the program. */
        if(execl(executable, executable, (char*) 0) < 0) {
            //
        }
        ++tries;
    }

    printf("Couldn't run the program!");
    _exit(0);


There were some serious reasons for which this hack was done. We were hosted on a shared webfaction server, with no access to root. This meant I couldn't drop the user process privilege to any user with limited access permissions. I needed to create another user then with limited access which could be controlled by the original user. And again, I didn't know of any way to know if there is already a process running with privileges dropped to the limited user. And that explains those number of retries.

By now, this hacky code-checker had executed over 10,000 code. This was a big achievement in itself.


Apache Thrift

Around August, I came to realize that this was going to be one of the core part of our future product - HackerEarth. I researched a little about existing framework which can allow cross-language services development and can help in easily building a distributed system. Apache Thrift came out to be an obvious choice. This was the first commit that I did, had stubs and other files - but it laid out the foundation for one of the most robust engine.

    class CodeCheckerHandler : virtual public CodeCheckerIf {
     public:
      CodeCheckerHandler() {
        // Your initialization goes here
      }

      void ping() {
        // Your implementation goes here
        printf("cpp-server got pinged...\n");
      }

      void run_code(RunResult& _return, const CodeInfo& code) {
        // Your implementation goes here
        printf("run_code\n");
      }

    };

    int main(int argc, char **argv) {
      int port = 9090;
      shared_ptr<CodeCheckerHandler> handler(new CodeCheckerHandler());
      shared_ptr<TProcessor> processor(new CodeCheckerProcessor(handler));
      shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
      shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
      shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());

      TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
      server.serve();
      return 0;
    }


What resulted next was the implementation of whole client-server architecture using Thrift which came to be know as the CodeFactory server. I rewrote the whole evaluation part in C, added many languages to it like Perl, PHP, Python, Ruby, Haskell, etc. which came to be known as CodeFactory engine. But this was not the end.


The MVP - CodeTable

To test the CodeFactory server, it was pointless to wait for the HackerEarth to be launched. So on a weekend, I built the CodeTable and put the server out to be tested by real humans in this real world. I even posted on HackerNews and it brought so much traffic that we weren't ready for it. And boy you can't comprehend the bugs that will knock your doors. This was a very good example of MVP which allowed us to fix the bugs in no time.

There was a serious issue still left to be resolved. The current architecture allowed the python client to send request to the CodeFactory server, but it also left the python client hanging for response. When there were large number of submission, it meant more and more python client hanging for response which dramatically increased the memory consumption of the machine. After a while, the machine could even stop to respond. But we had not hit that phase yet, and there was not much point in building the complex system if it was not going to be used.


The Asynchronous System

But since December 2012, traffic had strarted increasing significantly. We could see now more and more failure rate in the submissions. It was now absolutely mandatory to resolve it. This resulted in the implementation of message queue and redefining the whole architecture.

    def _call(self, message):
        self.channel.basic_publish(exchange='',
                                   routing_key=self.routing_key,
                                   properties=pika.BasicProperties(
                                       # Make message persistent
                                       delivery_mode = 2,
                                       content_type = 'application/json',
                                       ),
                                   body=message)

        if DEBUG:
            print '[%s] Sent %r' % (self.routing_key, message,)
        self.connection.close()


The new architecture is completely asynchronous - which means that all the testcases for a submission are evaluated asynchronously, and all of them are evaluated even if you exceed the time limit or memory limit for a few ones. This was also a feature request as it helped the users to know more detail about their code execution. But hold on, pause a moment, and think about asynchronous system. It brings with it totally new set of challenges. In layman terms, you need to define and detect when your operation gets over. And I needed to detect it as soon as possible and send the submission result to the user.


The Realtime Server

I could have done it in using one of the easiest but nasty ways - polling. But then these decisions affect the product in subtle ways. I decided to implement a generic push server instead which came to be known as realtime product. nowjs appeared to be a good choice. I wrote the first few hundred lines of node.js server in it which could maintain the client information and could also receive the notifications from message queue itself and deliver to the client's browser.

    // Receive messages
    q.subscribe(function (message, headers, deliveryInfo) {
        message = JSON.stringify(message);
        var data = JSON.parse(message);
        if (DEBUG) {
            console.log('Received message for ' + data.name);
        } 
        sendMessage(data.name, data.message, data.html_id);
    });

It's this little piece of code coupled with other utility code thrown in which updates your browser with the result of each testcase in realtime. It made the evaluation process more engaging for the users, and reduced the first response time by orders of magnitude. Realtime server is still in a very nascent stage, obviously with some bugs, and it even stops working sometime. But I am sure it will result as the backbone of all our live push systems and many more in the upcoming platform. We are working hard to make your experience as seamless and simpler as possible.

Just a few days ago, the CodeFactory server processed 100,000th run request, and an almost equal number of compile request. The stats are growing exponentially and they are pretty interesting with submissions in all the languages supported on the platform. This is really exciting. We also released our API last month. It's still in alpha stage and we are working on making it robust and more useful. Feel free to use the API to build your own codechef/spoj/topcoder :)

P.S. I am the co-founder of HackerEarth. Reach out to me at vivek@hackerearth.com for any suggestion, bugs or even if you just want to chat! Follow me @vivekprakash

Posted by Vivek Prakash


blog comments powered by Disqus