February 3, 2020, 05:41 |
Ubuntu instabilities and hardware settings
|
#1 |
New Member
Andrew
Join Date: Jun 2015
Posts: 20
Rep Power: 11 |
Hello,
Following up on my previous post from a while back, I went with a workstation with an i9-10940X. I have been using Code_Saturne for my simulations, and I compiled it the same way I have previously compiled it on Ubuntu. I installed Ubuntu 18.04 on a partition of the HDD (the system is dual boot, with Windows on a separate SSD). I tried running a simulation that I have run successfully on several other Ubuntu desktops and clusters, where it has always been very stable. On my new machine the simulation runs for a few hundred iterations and crashes. Running it again several times, the simulation crashes randomly at different iterations and at different locations in the code.
After recompiling the code in debug mode, the call stacks for the errors originate in openmpi. Up to this point I had been using the gcc compiler and openmpi from Ubuntu's repositories. Thinking there might be something wrong with the compiler and openmpi in regards to Code_Saturne, I installed several gcc/openmpi versions that I knew Code_Saturne should work with, but I still received crashes. I posted my problem on Code_Saturne's forum ( https://www.code-saturne.org/forum/v...php?f=3&t=2592 ) and their suggestion was that I might be accidentally mixing compilers/mpi libraries when building Code_Saturne and its prerequisite libraries, which for my latest tests was the case. However, my early tests only used Ubuntu's gcc/openmpi, so there were no other libraries to mix with. I also fixed the mixed-up compilers in my latest tests and still received mpi errors.
Also, in my earlier tests the mesh quality changed from simulation to simulation. I could never determine if the code was reading the mesh wrong or if the mesh was changing (the mesh file showed no modifications), but in later tests with different test cases I didn't have this problem.
In addition to these problems, Ubuntu has given random system errors and some applications will suddenly stop working (e.g. libreoffice, gedit, firefox, teamviewer). The screen will also freeze from time to time, requiring a manual restart. I reinstalled Ubuntu more than once and the problems still persist. I don't know if Ubuntu settings and files carry over to the next Ubuntu install if I install on the same hard drive partition. I checked the hard drive and no errors were reported.
Yesterday I tried running my code with mpich and it ran for 10ish iterations before it crashed. Running it a second time, the accompanying source files refused to compile, and checking Ubuntu's log file, the system was in a read-only state! After restarting the computer I was met with a fatal error message from Ubuntu and it was stuck at a command prompt. The Windows installation still works fine.
Before I try to install Ubuntu again, I have several questions. Does Ubuntu re-use files from previous installations? Are there any changes I should make in the BIOS for Ubuntu and mpi? I have the ASUS PRIME X299-DELUXE II-A motherboard and I have only turned off hyperthreading so far. Are there any additional hardware or system checks I can make besides checking the hard drive? Best regards, Andrew |
|
February 3, 2020, 08:15 |
|
#2 |
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
If you reinstall using the "erase" option then you should not have any previous files.
In regards to your problems, they may also be hardware related. I suggest that you run some diagnostics such as memtest to catch any possible problems. Also check the temperature of the CPU during a simulation. In Ubuntu you can do the following:
Code:
sudo apt update
sudo apt install lm-sensors -y
sudo sensors-detect
sudo watch sensors
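If lm-sensors is not installed, the same readings can usually be taken straight from the kernel's hwmon files, which store temperatures in millidegrees Celsius. The following is only a rough sketch: the function name `read_temps` and its optional directory argument are made up for illustration, and which sensors show up depends on the drivers loaded on the machine.

```shell
# Print every temperature input found under a hwmon directory tree,
# converted from millidegrees Celsius to degrees with one decimal.
# The optional first argument (hypothetical, for testing) overrides the
# default location /sys/class/hwmon.
read_temps() {
    local root="${1:-/sys/class/hwmon}"
    local f milli
    for f in "$root"/hwmon*/temp*_input; do
        # Skip the unexpanded glob pattern and unreadable files.
        [ -r "$f" ] || continue
        milli=$(cat "$f")
        printf '%s %d.%d\n' "$f" "$((milli / 1000))" "$(( (milli % 1000) / 100 ))"
    done
}
```

Running `while sleep 2; do read_temps; done` in a second terminal while the simulation is under load gives a simple temperature trace.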
|
February 3, 2020, 10:55 |
|
#3 |
Super Moderator
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,427
Rep Power: 49 |
I am not too familiar with diagnosing these kinds of errors on Linux. I try to avoid getting them.
The CPU is pretty new, so it is possible that Ubuntu 18.04 does not yet fully support it. You might want to post this in a forum specific to your distribution; they should be able to guide you through some diagnosing. The first thing I would do is flash the latest BIOS version. This can resolve a lot of issues with bleeding-edge hardware like your CPU. Next is making sure there are no hardware issues. My first guess would be memory. Which memory do you use exactly, is it populated as the manual suggests, and how fast do you run it? In case you run it above Intel's official spec of DDR4-2933, drop it down to DDR4-2933 and see if the errors persist. Memtest and the like are unfortunately not suitable for checking memory stability, as they do not really stress the memory subsystem. Better to run whatever simulation was causing the errors in the first place. |
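Using the simulation itself as the stability test is easier to judge when the runs are repeated and the failures are counted, for example once at DDR4-3200 and once at DDR4-2933. A minimal sketch, assuming the case can be started from a single command: the function name `count_failures`, the `LOG_DIR` variable and the `./run_case.sh` wrapper in the usage note are all hypothetical.

```shell
# Run a command a fixed number of times and print how many runs exited
# with a non-zero status. Output of each run is kept in run_<i>.log
# inside LOG_DIR (defaults to the current directory).
count_failures() {
    local runs="$1"
    shift
    local i=1 failed=0
    while [ "$i" -le "$runs" ]; do
        if ! "$@" > "${LOG_DIR:-.}/run_$i.log" 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed"
}
```

For example, `count_failures 10 ./run_case.sh` would print the number of crashed runs out of ten. Comparing that number before and after a single hardware change keeps the test to one variable at a time.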
|
February 3, 2020, 11:41 |
|
#4 |
Senior Member
Join Date: May 2012
Posts: 551
Rep Power: 16 |
In terms of kernel, Ubuntu 18.04 is on 5.3.x if you have hardware enablement (HWE) active. This is the default, so if your system is updated then you should be on 5.3.x, which most likely is sufficient for the 10900 series.
Updating the BIOS is good advice. Other than that I would still put my money on some memory problem. If you have damaged memory then memtest is OK. It may not catch all errors, but if you get lots of errors with memtest then you will have a good clue. |
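The kernel log can give another clue, since machine-check events and filesystem errors (such as a root partition being remounted read-only) are normally recorded there. The filter below is only a sketch: the function name `scan_hw_errors` is made up, and the pattern list is an assumption about typical message wording, not a complete list of what the kernel can print.

```shell
# Read log text on stdin and print the lines that look like hardware or
# filesystem trouble. Exit status is 0 if at least one line matched.
# The keyword list is an illustrative assumption, not exhaustive.
scan_hw_errors() {
    grep -E -i 'mce:|machine check|hardware error|EDAC|EXT4-fs error|remounting filesystem read-only|I/O error'
}
```

Typical use would be `dmesg | scan_hw_errors` for the current boot, or `journalctl -k -b -1 | scan_hw_errors` for the previous boot if the journal is kept across reboots. An empty result does not prove the hardware is healthy.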
|
February 5, 2020, 09:54 |
|
#5 |
New Member
Andrew
Join Date: Jun 2015
Posts: 20
Rep Power: 11 |
Thank you very much for your replies. There is indeed a BIOS update for my motherboard that was released on Jan 22nd with a description that it improves performance and stability. I will try to update the BIOS when I have the opportunity. I also tried running memory tests but none of the tests have returned any problems. I have been running the RAM at 3200 MHz, the design spec for the RAM. I will try lowering the RAM frequency to 2933 MHz to see if that helps.
Also, on Windows I installed VirtualBox with Ubuntu 18.04. I compiled all of my libraries and tried running my simulation on it, but I still received the same mpi errors after a random number of iterations. Monitoring the temperature during the simulation, the temperatures stayed below 70 deg C under load. The problem also occurs sooner when I use more cores (I experimented with between 4 and 10 cores). Are there any debug modes or software for running applications with mpi that will give more information on the crash? I experimented with activating debug options for openmpi and Code_Saturne, but no additional information is displayed. Best regards, Andrew |
|
May 7, 2020, 00:19 |
|
#6 |
New Member
Andrew
Join Date: Jun 2015
Posts: 20
Rep Power: 11 |
An update and solution to my problem in the event someone else experiences the same problem: After spending a lot of time troubleshooting, I finally tested my RAM sticks manually, because many of the errors I was receiving were due to memory allocation problems even though memory tests detected no problems. I tested each stick of RAM individually, and 2 out of the 4 RAM sticks caused errors when running my code. My code ran perfectly with the two good sticks of RAM and constantly crashed when using the two bad sticks. I finally got replacements for the two bad sticks and my code now runs perfectly with 4 good RAM sticks. Before I returned the bad RAM sticks, I ran memory tests on just the bad sticks and no errors were detected! I guess memory tests aren't completely comprehensive and it's best to test the RAM manually with the actual workload.
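When testing sticks one at a time, a crash is the obvious symptom, but output that changes between identical runs (like the mesh quality changing earlier in this thread) is worth checking too. Below is a rough sketch of such a check. It assumes the command is deterministic, which is often not bitwise true for parallel mpi runs, so a serial step is a safer candidate. The function name `outputs_match` and the `./check_mesh.sh` wrapper in the usage note are hypothetical.

```shell
# Run a command several times and compare a checksum of its standard
# output against the first run. Returns 0 if all runs agree, 1 otherwise.
# Only meaningful for commands whose output is expected to be identical.
outputs_match() {
    local runs="$1"
    shift
    local ref sum i=2
    ref=$("$@" 2>/dev/null | cksum)
    while [ "$i" -le "$runs" ]; do
        sum=$("$@" 2>/dev/null | cksum)
        if [ "$sum" != "$ref" ]; then
            echo "run $i differs from run 1" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

For example, `outputs_match 5 ./check_mesh.sh` run once per RAM stick would flag a stick for which the same input gives different results.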