# Which CPU is faster, 3.2 GHz Nehalem W5580 or 2.6 GHz Istanbul?

Yes, this is a loaded question. The context I use is a very simple double-precision floating point loop, with the interior rewritten to use SSE2. The idea is this: if we run the identical program on each machine, on one core, doing little else but double-precision FP operations (in this case, computing the Riemann zeta function), with little to no memory traffic, which CPU core will win this very simple sprint?

The larger context is a set of reports we were hired to generate. Unfortunately, it's looking like the business side of things (for the reports) is falling through, so there is little I can do about this. We can’t/won’t work for free (we have enough people trying to get us to do free consulting for them as it is).

The code is our rzf-sse2.c. Running it basically computes π, by summing the series for ζ(2) = π²/6.
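The source of rzf-sse2.c is not reproduced in this post, so the following is only a minimal sketch of the kind of loop being timed, not the actual code. The function names (`zeta2_scalar`, `zeta2_sse2`, `pi_from_zeta2`) are made up for illustration, and the two-wide layout is an assumption. It needs an x86 compiler with SSE2 (with gcc on 32-bit targets, add `-msse2`).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference: sum 1/k^2 for k = n-1 down to 1.
 * Starting from the smallest terms keeps rounding error low. */
double zeta2_scalar(long n)
{
    double sum = 0.0;
    for (long k = n - 1; k >= 1; k--) {
        double x = (double)k;
        sum += 1.0 / (x * x);
    }
    return sum;
}

/* SSE2 sketch: two terms per iteration, one per 64-bit lane. */
double zeta2_sse2(long n)
{
    long k = n - 1;
    __m128d acc = _mm_setzero_pd();
    const __m128d one = _mm_set1_pd(1.0);
    const __m128d two = _mm_set1_pd(2.0);

    /* High lane holds k, low lane holds k-1. */
    __m128d idx = _mm_set_pd((double)k, (double)(k - 1));
    for (; k >= 2; k -= 2) {
        acc = _mm_add_pd(acc, _mm_div_pd(one, _mm_mul_pd(idx, idx)));
        idx = _mm_sub_pd(idx, two);
    }

    /* Combine the two partial sums. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];

    /* An odd number of terms leaves k == 1 unprocessed. */
    if (k == 1)
        sum += 1.0;
    return sum;
}

/* pi = sqrt(6 * zeta(2)); the SSE2 square root avoids a libm dependency. */
double pi_from_zeta2(double z)
{
    return _mm_cvtsd_f64(_mm_sqrt_pd(_mm_set1_pd(6.0 * z)));
}
```

Calling `pi_from_zeta2(zeta2_sse2(1000000000L))` mirrors the `-l 1000000000 -n 2` run. The loop touches no arrays, so it is bound almost entirely by the floating point divide and add units.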

Here is the Nehalem 3.2 GHz machine (running at full clock speed: I turned off the speed/power governor and confirmed from /proc/cpuinfo that it was off). The CPU and motherboard were loaned to us by our friends at Intel for other purposes, but it's there, so we ran it.

```
[landman@cx1-2 rzftest]$ ./rzf-sse2.exe -l 1000000000 -n 2
D: checking arguments: N_args=5
D: arg[0] = ./rzf-sse2.exe
D: arg[1] = -l
D: infinity found to be = 0
D: should be 1000000000
D: arg[2] = 1000000000
D: arg[3] = -n
D: N found to be = 2
D: should be 2
D: arg[4] = 2
D: running on machine = cx1-2
D: start_index = 1000000000
D: end_index   = 2
D: unroll      = 2
D: inf-1       = 999999999
zeta(2)  = 1.644934065848227
pi = 3.141592652634864
error in pi = 0.000000000954929
relative error in pi = 0.000000000303963
Milestone 0 to 1: time = 0.000s
Milestone 1 to 2: time = 3.440s
```

The core loop is running on this machine in 3.44 seconds.

and

```
[landman@cx1-2 rzftest]$ grep MHz /proc/cpuinfo  | uniq
cpu MHz		: 3200.133
```
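The error reported in the run above is what truncating the series predicts, rather than a floating point artifact. Assuming the loop sums the terms k = 1 through 999999999 (as the `inf-1` line suggests), so that N = 10^9, the neglected tail is about 1/N, and propagating it through π = √(6 ζ(2)) gives:

```latex
\zeta(2) = \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6},
\qquad
\Delta\zeta = \sum_{k=N}^{\infty} \frac{1}{k^2} \approx \frac{1}{N}

\Delta\pi \approx \frac{d\pi}{d\zeta}\,\Delta\zeta
          = \frac{3}{\pi}\cdot\frac{1}{N}
          = \frac{3}{\pi \times 10^{9}}
          \approx 9.549 \times 10^{-10},
\qquad
\frac{\Delta\pi}{\pi} \approx \frac{3}{\pi^{2} N} \approx 3.040 \times 10^{-10}
```

Both values agree with the `error in pi` and `relative error in pi` lines in the output.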

What about the Istanbul?

```
landman@istanbul4p2us:~/rzftest> ./rzf-sse2.exe -l 1000000000 -n 2
D: checking arguments: N_args=5
D: arg[0] = ./rzf-sse2.exe
D: arg[1] = -l
D: infinity found to be = 0
D: should be 1000000000
D: arg[2] = 1000000000
D: arg[3] = -n
D: N found to be = 2
D: should be 2
D: arg[4] = 2
D: running on machine = istanbul4p2us
D: start_index = 1000000000
D: end_index   = 2
D: unroll      = 2
D: inf-1       = 999999999
zeta(2)  = 1.644934065848227
pi = 3.141592652634864
error in pi = 0.000000000954929
relative error in pi = 0.000000000303963
Milestone 0 to 1: time = 0.000s
Milestone 1 to 2: time = 3.273s
```

The core loop is running on this machine in 3.27 seconds.

and

```
landman@istanbul4p2us:~/rzftest> grep MHz /proc/cpuinfo  | uniq
cpu MHz		: 2600.000
```
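To put the two timings on a per-clock footing, here is a small sketch of the arithmetic. The struct and function names are made up for illustration. The inputs are the milestone times and `cpu MHz` values shown above, and the term count assumes the loop sums k = 1 through 999999999.

```c
/* Timings and clocks as reported in the runs above. */
typedef struct {
    const char *name;
    double seconds;  /* "Milestone 1 to 2" wall time */
    double mhz;      /* "cpu MHz" from /proc/cpuinfo */
} run_t;

static const run_t NEHALEM  = { "Nehalem W5580", 3.440, 3200.133 };
static const run_t ISTANBUL = { "Istanbul",      3.273, 2600.000 };

/* Assumed number of terms summed (the "inf-1" debug line). */
static const double TERMS = 999999999.0;

/* Clock cycles elapsed = seconds * MHz * 1e6. */
double total_cycles(run_t r)
{
    return r.seconds * r.mhz * 1.0e6;
}

/* Average clock cycles spent per series term. */
double cycles_per_term(run_t r)
{
    return total_cycles(r) / TERMS;
}

/* Wall-time ratio: how many times faster `fast` finished than `slow`. */
double speedup(run_t slow, run_t fast)
{
    return slow.seconds / fast.seconds;
}
```

With these inputs, `speedup(NEHALEM, ISTANBUL)` comes out at about 1.05, and `cycles_per_term` gives roughly 11.0 cycles on the Nehalem against about 8.5 on the Istanbul, assuming both cores really ran at the reported clock for the whole run.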

This code was compiled with GCC 4.3.2. Our makefile is

```
CC	= gcc

# use -g to turn on debugging
DEBUG	=

CFLAGS	= -O3  ${DEBUG}   -I. -Bstatic
FFLAGS	=  ${DEBUG}
LFLAGS  = -O3  ${DEBUG}  -Bstatic

NAME	= Makefile.${PROGRAM}-${CC}

all:	${PROGRAM}-${CC}.exe ${PROGRAM}-${CC}.s

${PROGRAM}-${CC}.exe:	${PROGRAM}.o
	$(CC) ${LFLAGS} -o ${PROGRAM}.exe ${PROGRAM}.o -lm

${PROGRAM}.o: ${PROGRAM}.c
	$(CC) ${CFLAGS} -c ${PROGRAM}.c

${PROGRAM}-${CC}.s: ${PROGRAM}.c
	$(CC) -dA -dp -S ${CFLAGS} -c ${PROGRAM}.c -o ${PROGRAM}-${CC}.s

clean:
	rm -f ${PROGRAM}-${CC}.exe ${PROGRAM}.o ${PROGRAM}-${CC}.s

rebuild:
	${MAKE} -f ${NAME} clean
	${MAKE} -f ${NAME} all

.c.o:
	$(CC) -c $(CFLAGS) $<

.f.o:
	$(F77) -c $(FFLAGS) $<
```

The Istanbul is, from what I can see, a formidable computational competitor to Nehalem. Dismissing it out of hand (as I have seen many do, based on Shanghai and Barcelona results and other performance metrics) would not be in anyone's interests as users and consumers of high performance computing gear.

It's unfortunate that it does not look like the reports will be generated, though, as we will have to give up access to the Istanbul soon. I would have liked to see what it can do.


## 2 thoughts on “Which CPU is faster, 3.2 GHz Nehalem W5580 or 2.6 GHz Istanbul?”

1. I’m sad you won’t be able to expand the results. This is a good data point, but unfortunately it doesn’t really establish much in terms of either processor’s appropriateness for a given task, right? Most applications need access to memory, and eventually to disk. This is reminiscent of the Advanced Clustering Technologies benchmarks pitting AMDs against Intels for different tasks, with the winner varying depending upon what part of the system mattered (Nehalem faster for memory bound computations, etc.).

2. @John

We have some data, but it doesn’t look like we are going to be able to do anything with it. I agree, it is interesting. I agree, I’d like to see it expanded.

The Advanced Clustering benchmarks weren’t (for the most part) application level benchmarks (stream?). The LAMMPS benchmark was real, but not conclusive, and was used to draw general conclusions where it probably should not have been. There are many different operations in MD codes, and some will do better on some processors than others. You want to (generally) test multiple related code paths.

I do agree that all applications need memory and disk. My arguments won out on the former, but lost out on the latter. Hence the test cases were already restricted.