Troubleshooting a malfunctioning computer is a tricky task. A computer is made up of many components, and determining exactly which one is faulty is like searching for a needle in a haystack. Luckily, there are techniques that reduce the amount of hay you have to go through. This is the story of how I determined a faulty motherboard was the cause of my aunt’s computer problems.
When troubleshooting, two of the most important things to do are to maintain a logical attitude and to document your work. Without a logical attitude, you will become frustrated and repeat steps needlessly, thinking that if you try “just one more time” you will solve the problem. Logic dictates that if you have eliminated the impossible, whatever remains, however improbable, must be the truth. Therefore, trying the same thing more than a few times is useless. Without documentation, you will surely retry a fruitless idea that you had already tried, and go around in circles. You will also be unable to reverse a mistake if you don’t know how you made that mistake in the first place. Thus, remember to document your results and maintain a logical attitude.
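To make the documentation habit concrete, here is a minimal sketch in Python of a troubleshooting log that records each step and its result, and warns when a step has already been tried "a few times". The class name, method names, and the threshold of three attempts are all assumptions made for illustration; nothing like this was used in the original repair.

```python
from dataclasses import dataclass, field


@dataclass
class TroubleshootingLog:
    """Hypothetical helper: a running record of steps tried and their results."""
    entries: list = field(default_factory=list)

    def record(self, step: str, result: str) -> None:
        """Append one (step, result) pair to the log."""
        self.entries.append((step, result))

    def times_tried(self, step: str) -> int:
        """Count how often a given step already appears in the log."""
        return sum(1 for tried, _ in self.entries if tried == step)

    def worth_retrying(self, step: str, max_attempts: int = 3) -> bool:
        """True while the step has been tried fewer than max_attempts times.

        max_attempts=3 is an assumed stand-in for "a few times".
        """
        return self.times_tried(step) < max_attempts


log = TroubleshootingLog()
log.record("reseat DVD power connector", "drive accepts discs again")
for _ in range(3):
    log.record("boot normally", "rebooted while loading Windows")

print(log.worth_retrying("boot normally"))        # False: already tried 3 times
print(log.worth_retrying("boot into Safe Mode"))  # True: not tried yet
```

A paper notebook does the same job; the point is only that every attempt and its outcome is written down before the next one starts.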
The first step in any form of troubleshooting is determining the symptoms; the next is formulating a plan of attack. In this case, the computer was rebooting itself randomly, and the slot-loading DVD drive was not accepting any discs. I decided on the following plan of attack: fix the DVD drive first, then work on the random reboots.
If an optical drive is not accepting media, then most likely it is lacking power. Unless the drive has been dropped or physically damaged, the likelihood of the moving parts not working is very low. Therefore, either the power connector is not connected properly, or a power surge has damaged the electronics. In this case, I determined the cause to be a loose power connector. I reconnected the Molex plug, making sure to insert it into the socket snugly, and powered the system back on. The DVD drive was now working again. Now, I could concentrate on the most difficult part: the random reboots.
When my aunt first told me of the random reboots, I had assumed a virus or a worm (such as the Sasser worm) was the culprit. That is, I assumed that it was a software problem rather than a hardware problem. After fixing the DVD drive, I rebooted the computer, only to find that Windows would not load. The computer kept rebooting while attempting to load Windows. Because at this time I believed I was dealing with a software problem, my initial diagnosis was that Windows was trying to load a faulty driver and rebooting when the driver could not be loaded. Windows has a feature that automatically restarts the computer when it encounters a serious problem, so my first task was to disable this feature. How would I do that if I couldn’t even boot into Windows? The answer: Safe Mode.
Pressing F8 right after POST brings up a menu of options, one of which is Safe Mode. When I tried this, the computer still rebooted before loading the GUI. Confused, I tried rebooting a few times until Safe Mode finally loaded, and checked the “Automatically restart” setting. It was not selected, and therefore not the cause of the reboots. So, it wasn’t a faulty driver. Was it something more fundamental? Perhaps a Windows service or a critical file needed by the OS? To test these possibilities, I used the Recovery Console to disable unnecessary services. Even after disabling them a few at a time, I was unable to stop the reboots. As a final resort, I tried reinstalling Windows, indicating to Setup that I wanted to attempt a repair of the existing installation. The text phase of Setup completed successfully, but after the required restart into the graphical phase, the computer rebooted again. I was now pretty convinced that a hardware issue was behind the random reboots.
To verify that a hardware problem was the source, I tried to load Knoppix, a Linux distribution that boots and runs from a single CD. Because Linux is an entirely different operating system, a software problem in Windows would most likely not occur under Linux. Although Knoppix didn’t reboot, it did freeze before it could load completely. Now, I was 100% sure that hardware was the problem.
Since I was onsite, I did not have other pieces of hardware to test with. The only thing I could do was attempt to run Memtest86, a free memory testing tool that stresses and tests a computer’s memory. I inserted the CD and rebooted, and Memtest86 loaded and ran its tests. After about 45 minutes, it had completed one pass without any problems. So, it seemed like memory was not at fault. I could no longer do anything without other hardware, so I disconnected everything and brought the computer back to my lab.
Once back at my lab, I connected a spare power supply to the computer to determine if the power supply was faulty. The computer rebooted at the same point in the boot process. I repeated the swap for the video card, sound card, network card, drive cables, and DVD drive, to no avail. I swapped the memory around, and ran a few more passes of Memtest86. Still nothing. At this point, I had determined that the DVD drive, power supply, memory, network card, video card, sound card, and drive cables were not at fault. The only components left were the hard drive, CPU, and motherboard.
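The swap-and-retest process above is simply set subtraction: start with every component as a suspect, and clear each one whose replacement with a known-good part did not stop the reboots. Here is a minimal sketch in Python. The component names come from this story, while the function name and the shape of the results dictionary are assumptions made for illustration.

```python
# Every component in the machine starts out as a suspect.
ALL_COMPONENTS = {
    "DVD drive", "power supply", "memory", "network card", "video card",
    "sound card", "drive cables", "hard drive", "CPU", "motherboard",
}


def remaining_suspects(components, swap_results):
    """Return the components that have not yet been cleared, sorted by name.

    swap_results maps a component name to True if the fault persisted after
    that component was swapped for a known-good part. A persisting fault
    means the swapped component was not the cause, so it is cleared.
    """
    cleared = {name for name, fault_persisted in swap_results.items()
               if fault_persisted}
    return sorted(components - cleared)


# Results of the swaps described so far: the reboots continued every time.
swap_results = {
    "DVD drive": True,
    "power supply": True,
    "memory": True,
    "network card": True,
    "video card": True,
    "sound card": True,
    "drive cables": True,
}

print(remaining_suspects(ALL_COMPONENTS, swap_results))
# ['CPU', 'hard drive', 'motherboard']
```

Note that this only works if each swap uses a part known to be good. With an intermittent fault, a single clean run does not clear a component, so each swap should be retested several times.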
To test the hard drive, I used a spare hard drive and tried to install Windows on it. Again, the installation went past the text phase, but failed before the graphical phase. I repeated this a few times, and a couple of times the graphical phase loaded, then froze. So, it wasn’t the hard drive. Only CPU and motherboard left.
To make sure the CPU wasn’t set up improperly, I removed the heatsink to ascertain the model and speed, so that I could reconcile them with the model and speed the BIOS reported. The BIOS was correct, but to try a few more possible solutions, I cleared the BIOS, reset the BIOS to failsafe defaults, and re-flashed the BIOS. Rebooting after each of those steps, I encountered the same problem. This was not unexpected, however, because the likelihood of a CPU failing is pretty low. The only thing left now was the motherboard.
To make sure there wasn’t a short circuit anywhere, I cleaned out all the dust in the computer, and removed the motherboard from the chassis. Then, I tried to reproduce the problem using the original components as well as my own components (repeating several of the above steps), and it was still there. I even tried Knoppix again, and this time it loaded completely, only to reboot while the computer idled at the desktop. It was now logical to conclude that the motherboard was faulty.
Because the reboots happened intermittently, the exact cause of the problem is difficult to ascertain. Reboots occurred during Windows startup, while Windows was running (as my aunt reported), and when Setup began its graphical phase. With Linux, the computer froze during startup on one attempt, and rebooted randomly once the OS was running on another. Since I don’t have a spare motherboard and CPU to test with, I cannot determine with confidence that the motherboard is the problem. On the other hand, I have tested every component other than the motherboard, and if you remember that one of the two most important things to do while troubleshooting is to maintain a logical attitude, then you will realize that we have eliminated the impossible, and only the motherboard remains. I replaced every other part of the computer that I could with known working parts, and the problem still occurred. This implies that the parts replaced were not causing the random reboots.
As I mentioned at the beginning of this story, troubleshooting is a tricky task. Troubleshooting is like a mystery that you must solve, with a trail of clues for you to follow along the way. Thus, we use logic and document our results as we proceed. We swap out parts, trying to determine if one is faulty. We run tests that stress certain components to determine if those components have failed. In the end, once we have eliminated all other possible causes, once we have burned away all the hay, we arrive at one conclusion: our needle.