big memory machines

Haven’t finished debugging this unit yet, but thought you might like to see the top output. These are physical CPUs, BTW, not SMT threads.

top - 09:21:29 up 3 min,  2 users,  load average: 0.22, 0.21, 0.09
Tasks: 219 total,   1 running, 218 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.7%us,  0.3%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us, 13.2%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu24 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu25 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu26 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu27 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu28 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu29 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu30 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu31 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  529366036k total,  9816336k used, 519549700k free,        0k buffers
Swap:        0k total,        0k used,        0k free,    70116k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 2126 root      39  19     0    0    0 S   13  0.0   0:08.42 kipmi0             
 1882 root      20   0 15780  736  520 S    0  0.0   0:00.17 irqbalance         
 2388 root      20   0 79340 3760 2944 S    0  0.0   0:00.05 sshd               
 2467 root      20   0 19476 1520 1068 R    0  0.0   0:00.05 top                
    1 root      20   0 24008 2196 1340 S    0  0.0   0:07.03 init               
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd           
    3 root      20   0     0    0    0 S    0  0.0   0:00.05 ksoftirqd/0        
    4 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/0:0        
    5 root      20   0     0    0    0 S    0  0.0   0:00.14 kworker/u:0        

ahhhh

It really has 1 TB; I probably need some boot options or some other bits to get the kernel to see all the RAM.
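A quick way to cross-check what the kernel has mapped against what the DIMMs themselves report (a minimal sketch, assuming a Linux /proc; dmidecode needs root, so it is left commented here):

```shell
# What the kernel has actually mapped, from /proc/meminfo (MemTotal is in kB).
awk '/^MemTotal/ {printf "kernel sees: %.1f GiB\n", $2 / 1024 / 1024}' /proc/meminfo

# What the SMBIOS/SPD data claims is installed (needs root; run on the box):
#   dmidecode -t memory | grep -i size
```

If the two disagree by roughly half, as here, the missing memory usually never made it past POST, which points at bad DIMMs, risers, or a memory-map boot option rather than anything the OS is doing.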


11 thoughts on “big memory machines”

  1. @kirjoittaessani

    Sadly, something like half the memory isn’t showing up. I’ll have to run into the lab today and test the RAM. I’m guessing a mixture of bad DIMMs and memory cards. Ugh.

  2. As I said in an earlier comment, in case you missed it: there are some pretty serious issues, in my opinion, with anyone reading /proc on kernels from 2.6.32 forward, and I wrote it up here – http://collectl.sourceforge.net/SlowProc.html

    If this includes your system, perhaps you can try out my ‘strace -c’ test and confirm you’re seeing this issue too.

    -mark

  3. @Mark

    Good catch there … I am wondering if this is what I’ve been running into with Collectl on our 2.6.32 kernels.

    OK … this smells like a /proc–NUMA problem: the CPUs handling the /proc interface could be different ones, so it’s possible that reads are causing all sorts of joyous access issues.

  4. Re newer kernels – I believe it is still a problem. Nevertheless, it would be good to test it yourself if you have access to a many-core box.

    Joe – it would be very interesting to see if this is what you’re bumping into. Can you try some of the tests I outlined on that web page?

    I too thought it was a NUMA issue, but I think it’s more an issue of handling all the locking on the different memory sections one needs to traverse with a lot of cores. While it turns out you can’t have a lot of cores without a lot of sockets, and hence NUMA, it’s not really the NUMA code that is doing this. At least that’s my understanding.

    -mark

  5. @marc – not sure if you know, but the Red Hat Bugzilla was locked down a few months ago to prevent non-subscribed people from accessing bugs, apparently for “security reasons” (my understanding, based on what RH told me happened to a bug of ours). So the BZ you link to from your collectl page isn’t viewable by anyone else, I’m afraid.

    Is there a discussion on LKML about this kernel regression?
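For anyone who wants to check their own box, the gist of the test being discussed is simply timing repeated /proc reads. A rough self-contained sketch (assuming GNU date and a Linux /proc; the exact files and thresholds in Mark’s write-up may differ):

```shell
#!/bin/sh
# Time 100 back-to-back reads of /proc/stat. On a healthy kernel this is
# essentially instant; on the affected many-core 2.6.32+ kernels each
# read can take far longer, and the total balloons.
start=$(date +%s%N)
i=0
while [ "$i" -lt 100 ]; do
    cat /proc/stat > /dev/null
    i=$((i + 1))
done
end=$(date +%s%N)
echo "100 reads of /proc/stat: $(( (end - start) / 1000000 )) ms"

# Per-syscall breakdown, if strace is installed:
#   strace -c cat /proc/stat > /dev/null
```

Note that /proc/stat grows with the number of CPUs (one line per CPU), so a 32-core box reads far more data per open/read cycle than a laptop, which is part of why the slowdown only shows up at scale.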
