"The Great" World: C vs. Java in number crunching

Introduction

Java being the most popular programming language of application programming is usually thought to be a good (usually the best) choice for any task. Six or seven months ago one of my instructor Dr. Hoermann, who is a real expert in random number generation techniques, ask me to reimplement one of his earlier C projects in Java by using the merits of object oriented design. I won’t dwell into the details of the project but shortly it was an object oriented implementation of a set of universal random number generators.

Initially, I did not suspect about the performance of Java because I previously use some data mining tools, simulators implemented by Java especially by some universities. Such that Weka – one of the most popular Java data mining applications—implemented by New Zealand University . So I start to design a model. You know those things. Interfaces, classes, so and so for…

Finally, design and implementation were completed and it seems that our new generator was solely doing pretty well. I use the generators to estimate some stock options and to do some Monte Carlo simulations. Everything was fine but one thing. What was the speed of previously implemented project in C by Hoermann? As I check the result, it was disgusting enough to make me irritate Java. I don’t tell you to irritate Java but just see below to not to see any problem as a nail.

Since the generator I implemented requires too much theory of statistics and probability theory, I won’t deal with it. But rather I will be implementing a pretty simple PI estimator using Monte Carlo technique in both Java and C.

Some rough idea on Monte Carlo

O! many a shaft at random sent

Finds mark at the archer little meant!

And many a word, at random spoken.

May soothe or wound a heart that’s broken!

Sir Walter Scott, the Lord of the Isles

A Monte Carlo method is an algorithm that solves a problem through the use of statistical sampling. Just as I said I won’t bother doing some probabilistic proof (it’s the job of http://www.systemsimulation.blogspot.com/). But to make you believe, think that in a raining day you and your son are walking through the street with your own umbrellas. And since he is just a kid, his umbrella is smaller than yours in terms of it is radius (think that we are looking from the top). So how can you compare the number of droplets hitting to your umbrella and his umbrella.

Believe or not this will be in proportion with the square of your umbrellas’ radius. Are the umbrellas magic? What if some guy squirt water on you by using garden trunk on you while your umbrellas are in your hand? This time the ratio of droplets will not be in proportion with r square. Why? Umbrellas, water, kid, parent are all the same. So what is the problem? The key word in here is uniformity. Think some you will realize what the problem isJ

So this much mix-up in your mind about Monte Carlo is enough. Let me finish this section by showing you how well a Monte Carlo estimator performs. Blue dots are our estimates for different numbers of replications(go on reading for the meaning of replication) in the chart and the dashed line is the exact value of PI (It is a bad saying that PI’s exact value. Why?).

Performance

Monte Carlo simulation we use for estimating PI requires many uniform number tuples (Imagine that we will control the rain. But before start we calculate all the droplets’ hit coordinates(X,Y) on the earth such that they will spread uniformly on the ground). Moreover method requires some multiplications and comparisons simply. Finally it requires building up a confidence interval in order to see how satisfactory its result is. It’s the statistical way of saying “I am sure that with 95% probability real PI value is some where between A and B”. The length of the confidence interval is determined fundamentally by the number of replications (This is the successive numbers of days we use our power to rain exactly 100000 droplets on the earth). The more replications we did the less distant A and B are (The more day we rain, the more we learn how to rain). Let me show the shrinking of CI (confidence interval), as the number of replications increase:

If you look close enough, you will see that as we multiply the number of replications by 4, the CI shrinks only by half.

All those in mind, let me show you the performance of Java and C with respect to number of replications (rainy days):

From now on I will not talk any more but one thing. Notice that the response time graphics of Java and C are both linear. This is good that our algorithm is scaleable. But isn’t it nice that C’s response time slope is one eighth of the Java. J You may ask or better may search for the reasons of this result: Intermediate machine code, virtual machine, overhead of class hierarchy, garbage collector…

That’s all for today... See you in the future!

Source Codes

Java

import java.util.Random;

import java.math.*;

public class MonteCarlo {

static final int iter = 100000;

static final int max_rep = 5000;

/*I aware of the fact that such public variable declarations are bad for OO design.But for the sake of performance…*/

public static int in_region[];

public static Random r;

public static void EstimatePI(int rep) {

double x, y;

int i, j;

double mean = 0, var = 0, hl;

for (j = 0; j < rep; j++) {

in_region[j]=0;

for (i = 0; i < iter; i++) {

x = 2 * r.nextDouble() - 1;

y = 2 * r.nextDouble() - 1;

if (x * x + y * y <= 1) {

in_region[j]++;

}

mean += in_region[j] * 1.0 / rep;

}

if (rep > 1) {

for (i = 0; i < rep; i++) {

var += (in_region[i] - mean) * (in_region[i] - mean) / (rep - 1);

}

mean = 4 * mean / iter;

hl = 7.84 * Math.sqrt(var/rep) / iter;

System.out.println("PI estimation is " + mean + " with half-length " + hl);

}

public static void main(String[] args) {

MonteCarlo mc = new MonteCarlo();

mc.r=new Random();

mc.in_region = new int[mc.max_rep];

long start;

for (int r = 1; r <= mc.max_rep; r *= 4) {

start = System.currentTimeMillis();

MonteCarlo.EstimatePI(r);

System.out.println("Response time is " + (System.currentTimeMillis() - start)/1000.0 + " sec.");

System.out.println();

}

C

#include <stdio.h>

#include <stdlib.h>

#include <time.h>

#include <math.h>

#define ITER 100000

#define MAX_REP 5000

int in_region[MAX_REP];

void EstimatePI(int rep){

int i,j;

double mean=0,var=0,hl,x,y;

srand(time(NULL));

for (j = 0; j < rep; j++) {

in_region[j]=0;

for (i = 0; i < ITER; i++) {

x = 2.0 * rand()/RAND_MAX - 1;

y = 2.0 * rand()/RAND_MAX - 1;

if (x * x + y * y <= 1) {

in_region[j]++;

}

mean += in_region[j] * 1.0 / rep;

}

if (rep > 1) {

for (i = 0; i < rep; i++) {

var += (in_region[i] - mean) * (in_region[i] - mean) / (rep - 1);

}

mean = 4 * mean / ITER;

hl = 7.84 * sqrt(var/rep) / ITER;

printf("PI estimation is %7.5lf with half-length %7.5lf\n",mean,hl);

}

int main(int argc, char* argv[]){

int r;

unsigned int start;

srand(time(NULL));

for (r = 1; r <= MAX_REP; r *= 4) {

start = time(NULL);

EstimatePI(r);

printf("Response time is %d sec.\n\n",time(NULL) - start);

}

return 0;

}

2 Comments:

Tonguç said...: akademik bir çalışma olmuş bu :) Ama C nin Java 'ya bu anlamda üstünlüğünü hemen herkes kabül ediyor fakat geliştirme kolaylığı, jvm ile ortam bağımzıszlığı ve nesneye yönelim özellikleri nedeniyle kebapçı geliştiriciler java yı tercih ediyorlar. öncelik performans olduğunda c ve c++ alternatifsiz, bu yüzden ki compute intensive işler söz konusun olduğunda plsql native compile ediyoruz ve tcell provisioning api lerini atos c++ ile yazıyor gibi.
yakışır bir başlangıç, devamını bekliyorum özellikle performans çalışmaların benim ilgimi çok çekiyor :); 3:46 AM
memilthegreat said...: Best answers to those has been given by Mr. Stroustrup:

Much of the relative simplicity of Java is - like for most new languages - partly an illusion and partly a function of its incompleteness. As time passes, Java will grow significantly in size and complexity. It will double or triple in size and grow implementation-dependent extensions or libraries. That is the way every commercially successful language has developed. Just look at any language you consider successful on a large scale. I know of no exceptions, and there are good reasons for this phenomenon. [I wrote this before 2000; now see a preview of Java 1.5.]

Java isn't platform independent; it is a platform. Like Windows, it is a proprietary commercial platform. That is, you can write programs for Windows/Intel or Java/JVM, and in each case you are writing code for a platform owned by a single corporation and tweaked for the commercial benefit of that corporation. It has been pointed out that you can write programs in any language for the JVM and associated operating systems facilities. However, the JVM, etc., are heavily biased in favor of Java. It is nowhere near being a general reasonably language-neutral VM/OS.

For more please refer to http://www.research.att.com/~bs/bs_faq.html; 4:17 PM

<< Home

"The Great" World

About Me

Sunday, June 11, 2006

C vs. Java in number crunching