
New Hardware diagnosis
imric wrote:

This whole thing started when the HDD where Debian used to live fried itself badly enough to keep the rest of the system from booting.

Depending on what you mean by "fried", this might point to the underlying cause.

A number of years ago, we were having a several-day heat wave in San Francisco, reaching about 40 Celsius (about 104 F). That day, I was in my apartment, stripped to the waist, and made the judgement error of trying to reload my 486 tower with OS/2 3.0 from CD media, with the case open (and thus no proper forced-air flow). Furthermore, I'd made the error of mounting one of the system's two fast SCSI drives with only about 5 mm of air space between the drive's electronics and a blocking steel bracket. As a result, the drive's electronics partially cooked, and some data writes thereafter were not reliably correct. Everything seemed OK at the software level: CRC values got calculated and confirmed, but in some cases the CRCs were generated and stored (accurately) on the basis of already incorrect datastreams. I didn't figure out what had happened for a few days, and could confirm it only indirectly after swapping out the drive.
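
To illustrate why the CRCs kept checking out: a checksum only proves that what is read back matches what was checksummed, not that what was checksummed was what you meant to write. Here is a minimal Python sketch of that failure mode. The function names and the single-bit-flip "corruption" are made up for illustration and are not a model of what that drive's electronics actually did; only `zlib.crc32` is a real library call.

```python
import zlib


def flip_first_bit(data: bytes) -> bytes:
    """Hypothetical corruption: flip the lowest bit of the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]


def store_block(data: bytes, corrupt=None):
    """Simulate a write path in which corruption happens BEFORE the CRC is computed.

    Returns the bytes that end up stored and the CRC stored alongside them.
    """
    if corrupt is not None:
        data = corrupt(data)
    return data, zlib.crc32(data)


def verify_block(stored: bytes, crc: int) -> bool:
    """Re-compute the CRC over the stored bytes and compare it with the stored CRC."""
    return zlib.crc32(stored) == crc


original = b"important payload"
stored, crc = store_block(original, corrupt=flip_first_bit)

crc_ok = verify_block(stored, crc)      # the checksum is consistent with the stored data
matches_original = stored == original   # but the stored data is not what was intended

print("CRC verifies:", crc_ok)
print("Data matches original:", matches_original)
```

The only way to catch this kind of damage is to compare against something computed from the data before it reached the faulty component, for example a checksum taken on the source side.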

Anyhow, my point is that if there has been a past heat episode in the machine we're talking about, you need to consider the possibility that more damage occurred at that time, and not just to the drive that got "fried". There may have been heat stress to other components, or electrical damage. (People are sometimes surprised when they ask me "May I test this board I think is bad on your motherboard?" and I reply "No!" Somehow "Oops, I'm sorry" doesn't suffice when someone damages my motherboard.)

If it eventually comes down to "Some hardware is acting up, I'm not sure which, and I've ruled out software/firmware configuration", then I guess you'll be spending a fun time honing your hardware diagnostic skills: You can try removing non-essential hardware, to see if the problem goes away. You can test suspect equipment a piece at a time in a different system that's known-good. You can swap in temporary replacements for suspect components in the suspect machine. There are no doubt other basic approaches that aren't readily coming to mind, but those are some of the classics.

Rick Moen
rick@linuxmafia.com


If you lived here, you'd be $HOME already.
Edited by rickmoen Dec. 2, 2002, 10:17:19 PM EST
New Yup! Fun...
The genesis of this machine. Why else build a machine out of trash? *grin*

Imric's Tips for Living
  • Paranoia Is a Survival Trait
  • Pessimists are never disappointed - but sometimes, if they are very lucky, they can be pleasantly surprised...
  • Even though everyone is out to get you, it doesn't matter unless you let them win.
New Smoke Emitting FPUs
I had an easy diagnosis once.

I learned UNIX from a guru, and he put me, his one and only student, in charge of the engineering development machine for a Star Wars project.

Well, the Main Programmer was a PhD in math - so he hated physicists. He'd always look at me in the lab as if to say - "Physicists! Bah!"

But root was the brains of the operation. Root was so called because he was ugly as sin but had a beautiful lady on his arm all the time, and drove a purple Cadillac. I drove a red Mustang 5-liter convertible. He was good at ballroom dancing. I wasn't. (But I did OK too :)

Anyway, for the development system root put me in charge of: one day the long sought-after external FPU for the 68030 (68882?) arrived, and root bade me install it in my own system! Well, Mr. Math PhD would not hear of it! He insisted on installing it himself! Well, he installed it sideways (it was square) and that was not good. The very expensive FPU was destroyed, but the rest of the system lived on and accepted a new FPU a week later, installed by me.

I never talked much to Mr. Math PhD after that. He said my C programming looked like FORTRAN - and he was right. He wouldn't admit it, but he was impressed with my VAX FORTRAN parser written in XEDIT on an IBM 4341 running VM/CMS.

Pretty soon Star Wars ended, and Business by Numbers began, and then it was now.
-drl
New Re: Smoke Emitting FPUs
Good story there. Thanks.

I did a bonehead manoeuvre, once, at the time I bought my K6/233 motherboard. Not as bad as your math Ph.D. colleague's, and more cheaply recoverable, but funny.

This was back when you paid only a small premium for building up from individual components of your choosing: The price gap has widened since then. I was trying to put together a system from quality, maximum-bang-for-buck components, and so got an FIC PA-2007 mini-AT-format mainboard, K6/233, a separate huge CPU heat sink with ball-bearing fan, big tower case, PC Power & Cooling TurboCool PSU, an Adaptec AHA-2940UW, Number Nine Imagine 128 video, and 64 MB Corsair SDRAM (augmented later on, but only 64 MB initially) rated for CAS2 operation at 100 MHz on the local memory bus. Carried forward from older boxen were an old Keytronics keyboard, Logitech Trackman Marble, a pair of 10k RPM IBM wide-SCSI hard drives, an old Toshiba ATAPI CD-ROM, a Toshiba floppy, and a pair of antique 3Com 3C509 ethernet cards.

I banged the machine together, flipped it on, loaded Debian, and started trying to use it. SIG11 errors started cropping up. Hmm, that's odd. Powered it down, checked for sources of heat problems, spaced out the cards and hard drives, made sure all fans were working, made sure there was no air blockage. Took a coping saw and made a cutout on the drive bracket, and mounted another muffin fan pointing upwards at the hard drives through it, just for an additional safety factor. Made sure nothing was jamming the CPU or power supply fans. Put the beast back together, powered up: No problems. Good. Half an hour later, a consistent string of SIG11s, especially if I started doing strenuous things like kernel compiles. Damn.

Open 'er up again. Recheck for heat concentrations: None. Everything seems to be running cool. OK, am I seeing a sensitive CPU, dodgy RAM, or what? Run memtest86, no problems. Hmm. I'm being tempted, at this point, to try exchanging the CPU. But I want to try one thing out: I check out where the heat sink was mounted onto the CPU: I had a sort of heat-transferring pad between them, that came with the heat sink. So, I go down to Fry's Electronics, and buy some more of that stuff, and also a tube of heat-conducting paste, which one can use instead, in such places. Get back to my machine, take the heat sink off. Try a freshly cut piece of conductive pad. Basically the same results, except a pattern is starting to emerge: I can do the heaviest number-crunching I like within the first 20-30 minutes from power-on, without trouble. Afterwards, anything even mildly strenuous (e.g., kernel compiles) will generate Signal 11 errors. Conclusion: Something is heating up, 1/2 hour from power-on. It's not a strictly operational thing, but rather heat build-up over time.

I shut down, and take the CPU heat sink off, again. Pull the pad out, and clean all surfaces. Put the heat-conducting paste onto the bottom of the heat sink, latch it onto the CPU. Lock everything back in place, re-test. Same results. Hmm, dodgy CPU?

Shortly before I jump to that (wrong) conclusion, I pull the heat sink off, again. I put the heat sink upside-down on my desk and stare at it. Oh, fsck. What a dumbass error.

The outline of the K6 is square, as is its socket. The heat sink is also square. Unlike your Ph.D. friend's FPU, the K6's socket is keyed to admit it in only one orientation, so you cannot fry it that way. So, you think, square CPU that can be inserted only the correct way; square heat sink that goes on top; lever that locks the heat sink down. Put heat sink A on chip B in socket C; what can go wrong? There are two orientations in which the heat sink can mount onto the CPU, 180-degree rotations of one another. Both must be right, right?

Wrong. The contact area on top of the CPU is offset towards one side, and is maybe 65% of the total square. The bottom surface of the heat sink projects downwards only on the side where the matching 65% of its surface can touch the CPU's contact area below it. You can verify that contact has happened when you yank the heat sink after applying fresh thermal paste, because the paste will be mashed from contact with the CPU. Likewise, the CPU's contact area will have some detached thermal paste smeared across it.

Only mine wasn't. On the heat sink bottom, only the paste near the centre of its contact area was mashed, with the remainder dangling into space exactly the way I'd dabbed it on. The CPU's upper surface sported only a little paste, near the middle: Most of the CPU's contact surface had been air-gapped, rather than heat-sinked. Luckily, trapped heat across most of the CPU had not turned it into silicon carne, and everything worked beautifully once I flipped the friggin' heat sink around the correct way.

The silver lining is that that machine emerged from the exercise with awesomely efficient cooling.

Rick Moen
rick@linuxmafia.com


If you lived here, you'd be $HOME already.
Edited by rickmoen Dec. 3, 2002, 04:18:55 PM EST
New Heh. Good story. :-)

"Ah. One of the difficult questions."

