OK, so when one of the machines restarts, the default behavior is apparently to install any pending kernel updates, whether we asked it to or not. That in turn seems to cause the CUDA stuff not to be found (not sure exactly where it goes wrong, but apparently some kernel-version-specific directory moves or gets renamed). UPDATE: It looks like some Nvidia files are missing from /lib/modules/[kernel version]/kernel/drivers/video. I (Matt) am not sure that's the only issue, though, so it still might be easiest to just reinstall everything rather than try to figure out how to relocate files after a kernel update.
Another update -- to get the current kernel version, use the uname -a command (uname -r prints just the release string). This helps when checking whether the drivers are missing from the path above. Also, we *might* be able to disable automatic kernel updates. Matt has some notes on this in nvALT and will maybe post them here if they turn out to work.
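Here's a rough sketch of that check as a shell function, in case it saves some typing. The function name find_nvidia_modules and its arguments are made up for this page, and it assumes the Nvidia files have names starting with "nvidia", which we haven't verified on every machine.

```shell
# Print any Nvidia kernel modules installed for a given kernel release.
# Args: $1 = kernel release (default: the running kernel, from uname -r)
#       $2 = modules root   (default: /lib/modules)
# ASSUMPTION: the module files are named nvidia* (not confirmed on all machines).
find_nvidia_modules() {
    local release="${1:-$(uname -r)}"
    local root="${2:-/lib/modules}"
    local dir="$root/$release/kernel/drivers/video"

    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    find "$dir" -name 'nvidia*'
}
```

Running find_nvidia_modules with no arguments checks the running kernel. Empty output would suggest the driver files did not make it into the new kernel's directory. ls /lib/modules shows which kernel versions are installed, so you can pass an older release as the first argument to compare.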
We're working on suppressing the auto-update behavior and/or finding an easier fix for relocating the missing files. In the meantime, the workaround is just to reinstall CUDA and CUDNN. In theory all the systems should already have most of their other setup done, so this shouldn't take too long and will not require a restart. Here are the basic steps (excerpted from the main DeepLearningSetup page).
Running the CUDA installer
(You should probably already have this downloaded if you're on this page. And don't delete it after installing -- you may need the installer again.)
In most cases, you can just run the cuda_8.0.44_linux.run (or whatever version) installer file and accept most of the defaults. However, note the following exception:
GTX 1080Ti GPU (current as of June 2017): None of the CUDA installers currently have the right drivers for this card. So when you install CUDA, do not let it install the GPU driver! Do everything else normally, then download the current driver for a 1080Ti card from the Nvidia website. (Lrrr is using NVIDIA-Linux-x86_64-381.22.run as of June 2017.) Install the driver -- if it says a driver is already installed (e.g., from a past failed CUDA installation attempt) and asks whether to overwrite it, allow the overwrite! Otherwise, CUDA installation and everything following it should be the same as written below.
Detour over; back to CUDA installation. When asked if you want to install samples, say yes and put them in /opt/cuda_samples.
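Once the installer finishes, a quick sanity check is to see whether the compiler it installed actually runs. This is only a sketch: check_cuda is a made-up helper name, and it assumes the toolkit went to the default /usr/local/cuda location (pass a different prefix if you changed it).

```shell
# Check that nvcc exists under a CUDA prefix and print its version banner.
# Args: $1 = CUDA install prefix (default: /usr/local/cuda, an assumption)
check_cuda() {
    local prefix="${1:-/usr/local/cuda}"

    if [ ! -x "$prefix/bin/nvcc" ]; then
        echo "nvcc not found under $prefix" >&2
        return 1
    fi
    "$prefix/bin/nvcc" --version
}
```

This only covers the toolkit. For the driver side, nvidia-smi should list the GPU if the driver is loaded for the running kernel.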
CUDA should now be installed. Next up is CUDNN -- download it from the Nvidia developer program or just get it from Agnew/Calculon/etc. We are currently using version 5.1 on all machines (even Lrrr, with the weird driver) as of June 2017.
Unzip/untar/whatever the CUDNN files, e.g. cudnn-8.0-linux-x64-v5.1.tar. This should yield a cuda directory with lib64 and include subdirectories. Copy the files in each of those to the corresponding /usr/local/cuda subdirectories (will require sudo), e.g. sudo cp lib64/* /usr/local/cuda/lib64/ and sudo cp include/* /usr/local/cuda/include/ (assuming you are already in the cuda directory).
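The extract-and-copy steps above can be sketched as one function. The name install_cudnn and the SUDO override are inventions for this page, not anything Nvidia ships. One deliberate difference from the plain cp above: it uses cp -P so that symlinks in lib64 stay symlinks instead of turning into full copies of the library.

```shell
# Extract a cuDNN archive and copy its headers and libraries into a CUDA tree.
# Args: $1 = cuDNN archive (e.g. cudnn-8.0-linux-x64-v5.1.tar)
#       $2 = CUDA prefix (default: /usr/local/cuda)
# Set SUDO="" to run without sudo, e.g. when trying it on a scratch directory.
# ASSUMPTION: the archive unpacks to cuda/include and cuda/lib64, as described above.
install_cudnn() {
    local archive="$1"
    local prefix="${2:-/usr/local/cuda}"
    local workdir

    workdir=$(mktemp -d) || return 1
    # GNU tar detects compression on its own, so -xf covers .tar and .tgz alike
    tar -xf "$archive" -C "$workdir" || return 1

    # -P copies symlinks as symlinks rather than following them
    ${SUDO-sudo} cp -P "$workdir"/cuda/include/* "$prefix/include/" || return 1
    ${SUDO-sudo} cp -P "$workdir"/cuda/lib64/* "$prefix/lib64/" || return 1

    rm -rf "$workdir"
}
```

Typical use would be install_cudnn cudnn-8.0-linux-x64-v5.1.tar, which prompts for the sudo password when it reaches the copy steps.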
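Also, it seems you need to run sudo ldconfig /usr/local/cuda/lib64 at some point after installing all this stuff. ldconfig rebuilds the system's cache of shared libraries, and a directory given on the command line is only included for that one run, so our best guess is that the CUDA entries drop out whenever something else (a package update, say) re-runs ldconfig without it. That would explain why we need to enter it periodically, though it's still not clear exactly when -- maybe after each restart???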
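One possible way to make this stick, which we have NOT tested on these machines: list the directory in a file under /etc/ld.so.conf.d so that every ldconfig run picks it up. The sketch below assumes the distro's /etc/ld.so.conf includes that directory (the common Linux distros do), and both the function name add_ld_conf and the file name cuda.conf are made up.

```shell
# Add a library directory to an ld.so.conf.d file, unless it is already listed.
# Args: $1 = library directory (default: /usr/local/cuda/lib64)
#       $2 = conf file         (default: /etc/ld.so.conf.d/cuda.conf, hypothetical name)
# Set SUDO="" to run without sudo, e.g. when trying it on a scratch file.
add_ld_conf() {
    local libdir="${1:-/usr/local/cuda/lib64}"
    local conf="${2:-/etc/ld.so.conf.d/cuda.conf}"

    # Nothing to do if this exact line is already in the file
    if [ -f "$conf" ] && grep -qxF "$libdir" "$conf"; then
        return 0
    fi
    echo "$libdir" | ${SUDO-sudo} tee -a "$conf" > /dev/null
}
```

After calling add_ld_conf, run sudo ldconfig (no arguments) to rebuild the cache. ldconfig -p | grep cudnn then shows whether the CUDNN library is in the cache, which is also a handy check when trying to work out whether the cache has been reset.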