So, now that we got SPU UNIX up and running on the service processor unit (see Convex C1 XP Service Processor power-on), it's time to do something with it.
The purpose of SPU UNIX is to serve as an environment to initialize the system (load microcode onto the job processor, or main CPU; initialize memory), as well as to diagnose the hardware. With a lot of help from ex-Convex employees (a special thanks goes to Doug Hosking), I learned a lot about what can be done in this environment.
The two main utilities used in diagnosing a system are margin and dshell. The margin command is used to check or alter the clock speed and power supply voltages. Used by itself, it checks and prints out the current status:
(spu)> margin
clk: n (10.0Mhz) ps1: n (+5.00) ps2: n (+4.95) ps3: n (+5.00)
But you can also use it to raise or lower power supply voltages and clock frequency slightly:
(spu)> margin -l clk
clk: l (9.0Mhz) ps1: n (+5.00) ps2: n (+4.95) ps3: n (+5.00)
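If you capture these status lines in a log, they are easy to pick apart. The following is a minimal Python sketch, not anything that ships with SPU UNIX. The function name parse_margin is made up for illustration, and reading the letters n and l as "nominal" and "low" is my assumption, based on the -l flag used above.

```python
import re

# Matches one field such as "clk: l (9.0Mhz)" or "ps2: n (+4.95)".
FIELD = re.compile(r"(\w+):\s+(\w)\s+\(([^)]*)\)")

def parse_margin(line):
    """Return {name: (margin_state, reading)} for one margin status line."""
    return {name: (state, reading) for name, state, reading in FIELD.findall(line)}

# Status line copied from the margin -l clk output above.
status = parse_margin("clk: l (9.0Mhz) ps1: n (+5.00) ps2: n (+4.95) ps3: n (+5.00)")
for name, (state, reading) in status.items():
    print(name, state, reading)
```

Anything with a state other than n is then a clock or supply that is currently margined.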
The dshell command is used to enter the diagnostic shell, where you can execute diagnostic tests. The following is a list of all tests (excluding those not applicable to my system):
CONVEX DIAGNOSTIC TEST MENU

Dir.       Test          Rev.  Description
____       ____________  ____  ___________
/mnt/test  cpu4000.t     1.30  CPU INSTRUCTION SET TESTS
           cpu4010.t     1.20  CPU INSTRUCTION SET TESTS
           cpu4030.t     1.6   CPU4030 Functional Test Program
           cpu4040.t     1.4   VECTOR CONCURRENCY TESTS
           dev4100.t     1.33  Xylogics 450/451 Controller Test Program
           dev4110.t     1.33  Xylogics 450/451 Controller Test Program
           dev4200.t     1.17  STC Tape Drive Test Program
           dev4300.t     1.22  Systech Comm Test Program
           dev4500.t     1.18  EXOS/201 Ethernet Controller Test
           io4000.t      1.21  IOP Functional Test Program
           mem4000.t     1.16  Main Memory Diagnostic Test
           spu4000.t     1.10  FITS Kernel Tests
           spu4100.t     1.7   File I/O and Seek Test
A test can be started by typing test followed by the test name (for example, test spu4000). A list of subtests can be obtained with test -s, after which a selection of subtests to run can be made. Since I only have the SPU card installed in the main card cage, running the spu4000 tests seems to be in order. However, this falls over pretty quickly:
: test spu4000
Test 'spu4000.t' Sun Aug 30 20:12:35 1970
Subtest 100 0:00:00 0:00:00 0:00:00 passed
Subtest 105 0:00:00 0:00:00 0:00:00 failed
***** Sun Aug 30 20:12:54 1970 *****
Test: spu4000.t 1.10 Class: 1 Subtest: 105 1.4 Count: 1 Error: 0
Failed: SPU RUN_ARM Circuitry
Expected RUN_ARM and RUN bits set
Address: 00ffb840 Exp: 00000009 Act: 00000008
scn_rp: Warning -- No board present in slot mcu
Test 'spu4000.t' failed
Elapsed time: 0:00:02
Well, Doug warned me that a lot of tests may require the MCU (memory control unit) and at least one MAU (memory array unit) to be installed. So, let's do a pwrdwn and install those cards. After this change, most of the SPU tests passed, except those that talk to the scan interfaces on the other (presently not installed) cards. Some of these tests also margin the power supplies low or high.
Now that we have the MCU and MAUs installed, we should also be able to run the mem4000 tests. Most of these pass, except the ones that require the job processor (main CPU) to be present, but the sizing test reports an error:
Subtest 340 0:00:00 0:00:00 0:00:00 failed
***** Fri Jan 2 16:33:08 1970 *****
Test: mem4000.t 1.16 Class: 3 Subtest: 340 1.2 Count: 1 Error: 1
Failed: Memory sizing test
Memory-Cop inconsistency on MAU 2
MAU size Expected 16 Mbytes, Actual 0 Mbytes
Expected:
MAU Allocated Blocks
2 00 01 02 03 04 05 06 07
Actual:
MAU Allocated Blocks
2
Test 'mem4000.t' failed
Elapsed time: 0:03:12
The COP is a small serial EEPROM on each board that stores configuration data. Apparently, the COP on this MAU tells the system there is no memory on the board, which is clearly wrong. Perhaps that EEPROM is broken?
Anyway, I took that MAU out, and re-ran the tests. All is fine now. Time to move on: let's add the IOP (I/O Processor) card, and plug the Multibus cards into the Multibus card cage. Now we can run the io4000 tests.
Subtest 202 0:00:00 SPU_UNIX(109):Trap: Soft error
0:00:05 0:00:10 0:00:10 0:00:10 failed
***** Sun Jan 17 16:31:32 1971 *****
Test: io4000.t 1.21 Class: 2 Subtest: 202 1.1 Count: 1 Error: 0
Failed: Main memory acess test
Ccu 7: Memory error @ 200000 Expected: 1 Actual: 0
*************** Test started Sun Jan 17 16:30:09 1971
*************** Test ended Sun Jan 17 16:31:32 1971
Another memory-related error. This one turned out to be relatively easy to fix: memory needs to be initialized before running the I/O tests. So, we do an mminit -s (the -s tells mminit not to use the job processor), and rerun the io4000 tests. Now they all pass. It should now be possible to let the I/O Processor (again based around a 68000) boot its own OS.
iosysload -livh /ioconfig iop
should do the trick. /ioconfig is the name of a text file that contains a definition of the installed I/O cards and their addresses. As a result, the heartbeat LED on the IOP starts blinking at a steady 1 Hz, which is good. Doug told me that the row of 8 LEDs on the Systech MTI-1650 16-line serial port card should do a little Knight Rider scrolling. They didn't. I spent some time trying to figure out what was wrong, and finally found out that the cables to the external box with the actual RS232 ports were not plugged in. I plugged them in, and the lights started scrolling as they should. I also ran the tests for the tape controller (dev4200), disk controller (dev4100), serial card (dev4300) and LAN interface (dev4500). The LAN interface tests passed; the others failed on tests that actually needed the devices to be hooked up. So far, so good.
Progress has been very encouraging, but we're running out of things we can try without the main CPU (job processor). So let's add the main CPU cards, a few at a time, running the appropriate SPU tests after installing each batch. With all cards installed, we should be able to load the microcode onto the CPU:
(spu)> initall -c
System Initialization
clk: n (10.0Mhz) ps1: n (+5.00) ps2: n (+4.85) ps3: n (+5.10)
Loading ASU control stores
Reading file /mnt/usr/ucode/base.fasu Rev 1.7
cs: Checksum error in /mnt/usr/ucode/base.fasu
Initialization Aborted
Drats, we have a corrupt file on the hard disk. So, to make a long story short, I opened the disk image in a hex editor, figured out where the directory entries and inodes were, figured out which blocks on the disk were part of the base.fasu file, re-read those blocks from the original SPU disk (using a slightly slower process, reading them one block at a time), and found a difference. I patched the new SPU disk with the corrected data, reran the initall, and:
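The comparison step can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure I actually followed in the hex editor: the function name differing_blocks is made up, the 512-byte block size is an assumption (the real SPU filesystem block size may differ), and the demonstration uses synthetic data rather than the real disk images.

```python
BLOCK_SIZE = 512  # assumption: the real SPU filesystem block size may differ

def differing_blocks(image_a, image_b, block_numbers, block_size=BLOCK_SIZE):
    """Return the block numbers whose contents differ between two images."""
    bad = []
    for n in block_numbers:
        start = n * block_size
        if image_a[start:start + block_size] != image_b[start:start + block_size]:
            bad.append(n)
    return bad

# Tiny synthetic demonstration: four blocks, one corrupted byte in block 2.
good = bytes(range(256)) * 8        # 2048 bytes = 4 blocks of 512
copy = bytearray(good)
copy[2 * BLOCK_SIZE + 17] ^= 0xFF   # flip all bits of one byte
print(differing_blocks(good, bytes(copy), [0, 1, 2, 3]))  # prints [2]
```

With real images you would pass in only the block numbers that belong to the suspect file, which is exactly the list that has to be dug out of the directory entries and inodes by hand.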
(spu)> initall
System Initialization
clk: n (10.0Mhz) ps1: n (+5.00) ps2: n (+4.90) ps3: n (+5.00)
Loading ASU control stores
Reading file /mnt/usr/ucode/base.fasu Rev 1.7 0:01
Loading wcs and epcs 0:10
Verifying wcs and epcs 0:19
Loading VCU control stores
Reading file /mnt/usr/ucode/lsctl.lcs Rev 2.12 0:01
Loading lcs 0:08
Verifying lcs 0:07
Reading file /mnt/usr/ucode/actl.acs Rev 1.22 0:01
Loading acs 0:08
Verifying acs 0:07
Reading file /mnt/usr/ucode/mctl.mcs Rev 1.17 0:01
Loading mcs 0:07
Verifying mcs 0:08
Memory interleave set to 4
Initializing memory 0:00:00
Hard error reported by MCU
Multiple bit error
Address: 04003000 Syndrome: 77 MAU: MAU4 Memory master: PCU
mminit: Unable to initialize main memory
Initialization Aborted
Ok, the microcode loads, but the main CPU does not like something about the memory on board MAU4. Let's take that one out.
Initializing memory 0:00:00
Hard error reported by MCU
Nonexistent memory error
Address: 0400b010
mminit: Unable to initialize main memory
Initialization Aborted
Drats, more memory problems, but no indication of which MAU(s) are involved. I won't bother you with everything I tried (no, the mem4000 tests did not help), but eventually I stumbled upon a combination of four MAUs that did work:
(spu)> initall
System Initialization
clk: n (10.0Mhz) ps1: n (+5.05) ps2: n (+4.95) ps3: n (+5.00)
Loading ASU control stores
Reading file /mnt/usr/ucode/base.fasu Rev 1.7 0:01
Loading wcs and epcs 0:10
Verifying wcs and epcs 0:19
Loading VCU control stores
Reading file /mnt/usr/ucode/lsctl.lcs Rev 2.12 0:00
Loading lcs 0:08
Verifying lcs 0:07
Reading file /mnt/usr/ucode/actl.acs Rev 1.22 0:00
Loading acs 0:08
Verifying acs 0:07
Reading file /mnt/usr/ucode/mctl.mcs Rev 1.17 0:01
Loading mcs 0:07
Verifying mcs 0:08
Memory interleave set to 4
Initializing memory 0:00:00 0:00:03 0:00:03
Main memory size: 67108864
MAU Allocated Blocks
0 00 01 02 03 04 05 06 07
1 00 01 02 03 04 05 06 07
2 00 01 02 03 04 05 06 07
3 00 01 02 03 04 05 06 07
4
5
6
7
Initialization Complete
Ok, we have 64MB of memory to play with... Let's try to run all of mem4000 (including the tests that use the job processor):
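The numbers in that output hang together nicely, as a quick Python check shows. The 16 MB per MAU matches the "Expected 16 Mbytes" in the earlier sizing test; the 2 MB per allocated block is only my inference from dividing a MAU's size by its eight listed blocks, not something the tools state.

```python
MAIN_MEMORY_BYTES = 67108864   # "Main memory size" reported by initall
MAUS_INSTALLED = 4             # MAUs 0-3 have allocated blocks
BLOCKS_PER_MAU = 8             # blocks 00-07 listed for each MAU

MIB = 1024 * 1024
total_mib = MAIN_MEMORY_BYTES // MIB
per_mau_mib = total_mib // MAUS_INSTALLED
per_block_mib = per_mau_mib // BLOCKS_PER_MAU   # inferred, see note above
print(total_mib, per_mau_mib, per_block_mib)    # prints 64 16 2
```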
Subtest 601 0:00:00 0:00:05 0:00:08 0:00:08 failed
***** Thu Jan 1 05:46:53 1970 *****
Test: mem4000.t 1.16 Class: 6 Subtest: 601 1.4 Count: 1 Error: 1
Failed: Vectorized address uniqueness test
Mode: EDC disabled
Address: 03265998 Exp: 00000000 03265998 Act: 00000000 83265998
MAU: MAU 3 RAMS: E0-017
Test 'mem4000.t' failed
Elapsed time: 0:03:00
That's a very useful error message. It indicates a single RAM chip (location E0-017 on MAU3) that's bad. I replaced that chip, and reran the tests. It pointed out another bad RAM chip. Replaced that too, and a third. After that, all mem4000 tests passed.
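The Exp and Act values in these reports are worth a second look: XOR-ing them shows exactly which bits came back wrong. Here is a small Python sketch; the helper name differing_bits is mine, and for the mem4000 failure I only compare the second 32-bit word, since the first word is identical in both values. How a bit position maps to a RAM location such as E0-017 is something only the diagnostic knows, so the sketch does not attempt that.

```python
def differing_bits(expected, actual):
    """Return the bit positions (0 = least significant) where two words differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]

# mem4000 subtest 601: Exp 03265998, Act 83265998 (low word only)
print(differing_bits(0x03265998, 0x83265998))  # prints [31]

# spu4000 subtest 105: Exp 00000009, Act 00000008
print(differing_bits(0x00000009, 0x00000008))  # prints [0]
```

In both cases a single bit is off, which fits the picture of one bad RAM chip in the first case and one missing status bit in the second.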
Let's try to test the CPU:
: test cpu4000
Test 'cpu4000.t' Thu Jan 1 05:37:48 1970
Run default options : y
libtest: SIGSEGV received, using SIGIOT to dump core
This one turned out to be another corrupt file on the disk. It crashes before it even starts the subtests. cpu4010 and cpu4030 did pass. I fixed the disk corruption using the find-the-blocks-in-these-files-and-reread-them routine, and got cpu4000 to get as far as the first vector test:
Subtest 925 0:00:00 0:00:00 0:00:00 failed
***** Thu Jan 1 05:38:55 1970 *****
Test: cpu4000.t 1.30 Class: 20 Subtest: 925 1.30 Count: 1 Error: 0
Failed: vl = 0/vec. inst. test, source: vlzero.s
Subtest Failure, JP halted
a0: 0007d0f0 s0: 00000000 00000010 t0: 00000000 pc: 0002ae08
a1: 00000100 s1: 00000000 00000000 t1: 00000000 ca: 0002ae08
a2: 0002acf8 s2: aaaaaaaa aaaaaaaa t2: 00000000 psw: 02000000
a3: 00000000 s3: 00000000 00000000 t3: 00000000 jpcr: 01d77f08
a4: 00000000 s4: 00000000 00000000 t4: 02000002 upc: bfd
a5: 0002adea s5: 00000000 00000000 t5: 00000000
a6: 00000000 s6: 00000000 00000080 t6: 00000014
a7: 00000000 s7: 00000000 00000000 t7: 01d77f08
LCTL: 4c4c0012000a96 MCTL: 300014000a90 ACTL: 280020002696
STLCTL,VL: 17fff207,7f STMCTL,VL: 20011207,7f STACTL,VL: 3bc11207,7f
DATA BUS: 0000000000000000 MISC: 383d1 VM REG STATE: 40000
VM: 0000000000000000 0000000000000000
C0,ST0: 0101,10101 C1,ST1: 0101,10101 C2,ST2: 0101,10101 C3,ST3: 0101,10101
Test 'cpu4000.t' failed
Elapsed time: 0:02:32
The cpu4040 tests also failed, so there appears to be something wrong with the vector units. At Doug's suggestion, I reran these tests with the clock margined down to 9 MHz, but that did not help. I then compared the board revisions between the C1 XP and the C1 XL, and found that the vector cards (two VPUs, or vector processor units, and one VCU, or vector control unit) had the same revision level. I moved the cards from the C1 XL into the C1 XP, and...
cpu4000 passes. cpu4040 is a very intensive, long-running vector test, but after 5 hours and 26 minutes, that passed as well. One thing left to do. On the SPU disk is a copy of the main Convex UNIX kernel (/mnt/os/vmunix). There is no root disk, so it will fall over, but let's see if we can boot that kernel:
(spu)> /mnt/os/boot
Booting Convex UNIX. Type ^C to abort
System Initialization
clk: n (10.0Mhz) ps1: n (+5.00) ps2: n (+4.95) ps3: n (+5.05)
Memory interleave set to 4
Initializing memory 0:00:00 0:00:04 0:00:04
Main memory size: 67108864
MAU Allocated Blocks
0 00 01 02 03 04 05 06 07
1 00 01 02 03 04 05 06 07
2 00 01 02 03 04 05 06 07
3 00 01 02 03 04 05 06 07
4
5
6
7
Initialization Complete
[SPU_@05:42:18] Errlog started: -l /mnt/errlog
[SPU_@05:42:19]
[SPU_@05:42:22] errintd(1.13) started, options:
Loading vmunix
[SPU_@05:42:27] mm_sniff: sniff rate: 32.14 MB/day (1.99 days/pass)
vmunix: text: 749568 data: 77824 bss: 1409024
[CPU_@05:42:47] Convex UNIX Version 7.1.0.10 Mon Mar 27 15:29:00 CST 1989
[CPU_@05:42:50] Device "DKC-001" (7/0/0x3f0 int 2)
[CPU_@05:42:50] Unit "DKD-005" unit number 0
[CPU_@05:42:50] Unit "DKD-005" unit number 1
[CPU_@05:42:50] Device "MTC-001" (7/0/0xc0 int 4)
[CPU_@05:42:50] Unit "MTD-001" unit number 0
[CPU_@05:42:53] Device "ACM-001" (7/0/0x3c0 int 7)
[CPU_@05:42:56] Device "LAN-001" (7/1/0x4c0 int 1)
[CPU_@05:42:56] Unit "ex" unit number 0
[CPU_@05:42:56] NX Firmware Version 5.5, Hardware Rev. 0.0
[CPU_@05:42:56] ex0: address = 08:00:14:12:92:32
[CCU7@05:42:56] da0: DrvR:Drive Not Ready [cyl=759 hd=18 sect=58 cnt=0]
[CPU_@05:42:57] WARNING: no swap space
[CCU7@05:42:57] da0: DrvR:Drive Not Ready [cyl=759 hd=18 sect=58 cnt=0]
[CPU_@05:42:58] FATAL CONVEX UNIX ERROR: (rootmount cannot mount root)
[CPU_@05:42:58] sp: 0b172f44 a1: 00000000
[CPU_@05:42:58] a2: 00000000 a3: 0038bb70
[CPU_@05:42:58] a4: 0038ab70 a5: 00000400
[CPU_@05:42:58] ap: 0b172f58 fp: 0b172f44
[CPU_@05:42:58] s0: ffffffff00000005 s1: 0000000040000001
[CPU_@05:42:59] s2: 400000ff00000000 s3: 8000000000000010
[CPU_@05:42:59] s4: 0000000000000000 s5: 0000000000000000
[CPU_@05:42:59] s6: ffffffff00000005 s7: ffffffff00000005
[CPU_@05:42:59] int. mask: 40000001
[CPU_@05:42:59] syncing disks...
[CPU_@05:42:59] done
[CPU_@05:42:59] halting processor
/mnt/os/boot: 104 Quit - core dumped
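As an aside, the mm_sniff line in that log can be sanity-checked against the memory size. The sketch below assumes that MB here means 2^20 bytes and that one "pass" covers all of main memory; both are my guesses, but the numbers do line up.

```python
MAIN_MEMORY_BYTES = 67108864      # "Main memory size" from the boot log
SNIFF_RATE_MB_PER_DAY = 32.14     # from the mm_sniff line

memory_mb = MAIN_MEMORY_BYTES / (1024 * 1024)      # assumes MB = 2**20 bytes
days_per_pass = memory_mb / SNIFF_RATE_MB_PER_DAY  # assumes a pass = all of memory
print(round(days_per_pass, 2))                     # prints 1.99
```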
There you go. This might well be the only working Convex C1 CPU on the planet...