Re: Adventures in Server restore



Ben,

Just for extra information, I dismounted the store and then ran Prime95 on
the machine to stress test the CPU, whilst monitoring the temperature. Once
it hit about 64C Prime95 generated an internal error:

"Fatal ERROR: Rounding was 0.4990234375, expected less than 0.4
Hardware failure detected"
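
For anyone wondering what that message is checking: Prime95 multiplies very large numbers using floating-point arithmetic, and every result should land very close to a whole number. The "rounding" figure is the distance from the nearest integer. The following is only a rough sketch of that kind of check, not Prime95's own code; the function names and sample values are made up for illustration, and only the 0.4 limit and the 0.4990234375 figure come from the message above.

```python
# Illustrative sketch only; this is NOT Prime95's actual implementation.
ROUNDOFF_LIMIT = 0.4  # threshold quoted in the error message


def max_roundoff(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check(values, limit=ROUNDOFF_LIMIT):
    """Return (ok, err); ok is False when the round-off reaches the limit."""
    err = max_roundoff(values)
    return err < limit, err


# Hypothetical sample data: results that should all be near-integers.
healthy = [12.0000001, 7.9999998, 3.0]
suspect = [12.0000001, 8.4990234375, 3.0]  # one value almost halfway between integers

print(check(healthy))
print(check(suspect))
```

A value almost exactly halfway between two integers cannot be rounded reliably, which is why a stress tester treats it as a sign that the hardware computed something wrong.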


So it looks to me like the processor is going strange at high temperatures.
I suspect that an app like Exchange, which relies on checksum calculations in
the database structure, may start writing errors into its database when
overheated. That would explain the crashes.
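
As a rough way to look at the numbers in the -1018 event quoted further down this thread, the sketch below XORs the expected and actual checksums to see which bits differ, and converts the byte offset into a page index. The values are copied from that event; the variable names are my own, and treating offset divided by page size as the page index is an assumption (it ignores any header pages at the start of the file). It does not diagnose anything by itself.

```python
# Values copied from the -1018 event log entry quoted later in the thread.
expected = 0x3389338991C82E2B
actual = 0x6B5814A791C8082B
offset = 6660096   # byte offset of the failed read
page_size = 4096   # bytes read, per the event

# XOR leaves a 1 wherever the two checksums disagree.
diff = expected ^ actual
differing_bits = bin(diff).count("1")

# Assumption: plain offset // page_size, ignoring any database header pages.
page_index = offset // page_size

print(f"differing bits mask: {diff:#018x}")
print(f"number of differing bits: {differing_bits}")
print(f"page index (offset // page_size): {page_index}")
```

Keeping a note of the page index from each event makes it easier to see whether the errors keep hitting the same place on disk or move around, which is what you would expect if the CPU or memory is at fault rather than the disk.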

The machine is back in the hardware maintenance area being looked at now.

The bottom line is, keep an eye on your CPU temperatures if you start
getting database errors.

Andy

"Ben Winzenz [Exchange MVP]" <ben_winzenz@NOSPAMdotmessageonedotcom> wrote
in message news:uGj3g8mdFHA.1600@xxxxxxxxxxxxxxxxxxxxxxx
> Intel's specs say that 70C is the absolute max that the proc should be
> http://support.intel.com/support/processors/pentium4/sb/CS-007999.htm#Specifications
> but I'd suspect variances in processors could induce instability prior to
> that limit. 64C is hot no matter what, though. The P4's should have
> thermal throttling though, so the processor shouldn't overheat - it is
> supposed to throttle down the speed as a measure to prevent thermal
> damage. I'd check a few different temp readings. If the case temp is
> normal, but the CPU temp is high, you may want to invest in a new
> heatsink/thermal compound.
>
> --
> Ben Winzenz
> Exchange MVP
> MessageOne
>
>
> "Andy Cobley" <acobley@xxxxxxxxxxxxxxxxxxxxxx> wrote in message
> news:eEorgVkdFHA.1448@xxxxxxxxxxxxxxxxxxxxxxx
>> Again, many thanks for the reply.
>>
>> This is in fact an in-house built server; these are usually very
>> reliable. It's a 3 GHz Pentium 4, not overclocked etc. I ran a copy of
>> Aida32 on it; the processor temperature last night was 60 C. Our other
>> servers (in different cases but with the same processor) are running at
>> 40 C, quite a difference.
>>
>> I've been running Exmerge today, moving users off this server to an older
>> 5.5 server so we can take this one down and rebuild it. I did notice
>> that at one point the CPU temperature hit 64 C and the server promptly
>> blue-screened. Am I right in saying a 3 GHz P4 starts overheating at 68 C
>> or thereabouts?
>>
>> Andy
>>
>>
>>
>> "Ben Winzenz [Exchange MVP]" <ben_winzenz@NOSPAMdotmessageonedotcom>
>> wrote in message news:%23lebqJddFHA.2984@xxxxxxxxxxxxxxxxxxxxxxx
>>> If it is a brand-name box (HP/Compaq, Dell, IBM, etc.) there should be
>>> some software (Dell OpenManage, Compaq Insight Manager) that can give
>>> you that information. Regardless, the chipsets on the motherboard (at
>>> least most that are a few years old or newer) should have integrated
>>> temperature sensors, so it's just a matter of getting an app to read
>>> those.
>>>
>>> Usually, -1018's are indicative of disk subsystem problems, but it could
>>> also be bad memory. In addition to checking the temperature, I'd
>>> suggest running comprehensive system diagnostics on it, as well as
>>> making sure all firmware and drivers are up to date, especially on the
>>> RAID controller and backplane, etc.
>>>
>>> --
>>> Ben Winzenz
>>> Exchange MVP
>>> MessageOne
>>>
>>>
>>> "Andy Cobley" <acobley@xxxxxxxxxxxxxxxxxxxxxx> wrote in message
>>> news:eTgn3qcdFHA.2556@xxxxxxxxxxxxxxxxxxxxxxx
>>>> Thanks for that Ben.
>>>>
>>>> We are in fact running SP1. These errors seem to occur at random at the
>>>> moment; I strongly suspect a hardware error due to overheating.
>>>> Time to get some sort of temp monitor in the box, I think.
>>>>
>>>> Andy
>>>>
>>>> "Ben Winzenz [Exchange MVP]" <ben_winzenz@NOSPAMdotmessageonedotcom>
>>>> wrote in message news:uvT0WVcdFHA.3052@xxxxxxxxxxxxxxxxxxxxxxx
>>>>> To be clear, SP1 for Exchange includes some additional error
>>>>> correcting that prevents -1018 errors due to flipped bits. If it is a
>>>>> physical RAID controller or disk subsystem problem, it will likely not
>>>>> solve the problem. Also, there is no need to run eseutil /d after
>>>>> installing SP1 for Exchange. The article references the need to run
>>>>> eseutil /d IF you choose to use eseutil to repair the old damaged
>>>>> database. Otherwise, no need.
>>>>>
>>>>> --
>>>>> Ben Winzenz
>>>>> Exchange MVP
>>>>> MessageOne
>>>>>
>>>>>
>>>>> "Andy Cobley" <acobley@xxxxxxxxxxxxxxxxxxxxxx> wrote in message
>>>>> news:eZPTkmadFHA.1292@xxxxxxxxxxxxxxxxxxxxxxx
>>>>>> Shai,
>>>>>>
>>>>>> Thanks for that information. Very useful.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> "Shai Netanel" <ShaiNetanel@xxxxxxxxx> wrote in message
>>>>>> news:34579114-676B-443D-8B74-F1384478E00F@xxxxxxxxxxxxxxxx
>>>>>>> Hello
>>>>>>> I have seen this problem before.
>>>>>>> It is a hardware problem; it can be a disk problem or a RAID/SCSI
>>>>>>> controller problem.
>>>>>>> SP1 for Exchange changes the database engine (Jet) and should fix this
>>>>>>> problem. (After installing SP1 you need to run eseutil /d.)
>>>>>>>
>>>>>>> http://www.kbalertz.com/kb_Q314917.aspx
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shai Netanel
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> "Andy Cobley" wrote:
>>>>>>>
>>>>>>>> I thought I'd pass on my experience in bringing back an Exchange server
>>>>>>>> that has gone bad. Although I had done some test backup and restore
>>>>>>>> exercises, a real restore turned out to be far from easy.
>>>>>>>>
>>>>>>>> On Friday night (around midnight) something happened to corrupt the
>>>>>>>> priv.edb. I'm not sure what happened (but an online backup was going on).
>>>>>>>> I started to get the following messages in the event viewer.
>>>>>>>>
>>>>>>>> "Information Store (1932) First Storage Group: The database page read from
>>>>>>>> the file "F:\mdbdata\priv1.edb" at offset 6660096 (0x000000000065a000) for
>>>>>>>> 4096 (0x00001000) bytes failed verification due to a page checksum mismatch.
>>>>>>>> The expected checksum was 3713556033693036075 (0x3389338991c82e2b) and the
>>>>>>>> actual checksum was 7734955069946726443 (0x6b5814a791c8082b). The read
>>>>>>>> operation will fail with error -1018 (0xfffffc06). If this condition
>>>>>>>> persists then please restore the database from a previous backup. This
>>>>>>>> problem is likely due to faulty hardware. Please contact your hardware
>>>>>>>> vendor for further assistance diagnosing the problem. "
>>>>>>>>
>>>>>>>> The mail service did continue to run though.
>>>>>>>>
>>>>>>>> On Saturday, whilst trying to do an RSG restore from backups, things went
>>>>>>>> from bad to worse and eventually the server crashed and continued to crash
>>>>>>>> with random errors in the event log. I thought it might be a virus, but a
>>>>>>>> check revealed not. I did manage to use Exmerge to extract the users'
>>>>>>>> mailboxes to PST.
>>>>>>>>
>>>>>>>> The server continues to crash, sometimes not letting me log in. I decided a
>>>>>>>> dialtone recovery would be best. However, I couldn't create a new blank
>>>>>>>> database. Thanks to Rich Matheisen for pointing out that to create the new
>>>>>>>> dialtone database I needed to move the log files to a new location,
>>>>>>>> essentially deleting them from the server. So now I've got a dialtone
>>>>>>>> database and people can send and receive mail.
>>>>>>>>
>>>>>>>> Next up, use Exmerge to bring back the PST files from Saturday. Bad move:
>>>>>>>> one of them must have contained the data that was corrupting the database,
>>>>>>>> and the server started crashing again with blue screen dumps. I have to
>>>>>>>> admit I was losing what little hair I had left. For good measure the IIS
>>>>>>>> file MetaBase.xml had become corrupt (presumably because of the crashes),
>>>>>>>> causing Exchange services not to start correctly. This was restored from a
>>>>>>>> system backup.
>>>>>>>>
>>>>>>>> The only solution was to disable the Exchange services to get the server
>>>>>>>> stable, then create a new dialtone database (thus losing any mail received
>>>>>>>> this morning; a backup couldn't be created because the server was only up
>>>>>>>> for a couple of minutes at a time). This has now allowed me to get the
>>>>>>>> server running again. I'm going to run it for a day like this just to
>>>>>>>> confirm it's stable.
>>>>>>>>
>>>>>>>> However, I can't restore from the PST files from Saturday, so I'm going to
>>>>>>>> have to go back to older backup files to get the users' mail back.
>>>>>>>>
>>>>>>>> Andy C
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

