
So, now that we have SPU UNIX up and running on the service processor unit (see Convex C1 XP Service Processor power-on), it's time to do something with it.

The purpose of SPU UNIX is to serve as an environment to initialize the system (load microcode onto the job processor, or main CPU; initialize memory), as well as to diagnose the hardware. With a lot of help from ex-Convex employees (a special thanks goes to Doug Hosking), I learned a lot about what can be done in this environment.

The two main utilities used in diagnosing a system are margin and dshell. The margin command is used to check or alter the clock speed and power supply voltages. Used by itself, it checks and prints out the current status:

(spu)> margin
clk: n (10.0Mhz)  ps1: n (+5.00)  ps2: n (+4.95)  ps3: n (+5.00)

But you can also use it to raise or lower power supply voltages and clock frequency slightly:

(spu)> margin -l clk
clk: l (9.0Mhz)  ps1: n (+5.00)  ps2: n (+4.95)  ps3: n (+5.00)
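
If you capture these status lines in a log, they are regular enough to pick apart with a small script. The following is a minimal Python sketch, not a Convex tool: the function name parse_margin is made up for illustration, and reading the single-letter field as the margin setting (presumably n for nominal and l for low) is an assumption based on the two outputs above.

```python
import re

# Matches fields such as "clk: n (10.0Mhz)" or "ps2: n (+4.95)".
FIELD = re.compile(r"(\w+):\s+(\w)\s+\(([+-]?\d+(?:\.\d+)?)(\w*)\)")

def parse_margin(line):
    """Return {name: (setting, value, unit)} for one margin status line.

    parse_margin is a hypothetical helper; 'setting' is the single letter
    printed by margin (assumed: n = nominal, l = low).
    """
    return {
        name: (setting, float(value), unit)
        for name, setting, value, unit in FIELD.findall(line)
    }

status = parse_margin(
    "clk: l (9.0Mhz)  ps1: n (+5.00)  ps2: n (+4.95)  ps3: n (+5.00)"
)
for name, (setting, value, unit) in status.items():
    print(name, setting, value, unit)
```

With the values in a dictionary, it is easy to flag any supply that has drifted from what you expect before starting a long test run.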

The dshell command is used to enter the diagnostic shell, where you can execute diagnostic tests. The following is a list of all tests (excluding those not applicable to my system):

        CONVEX DIAGNOSTIC TEST MENU
Dir.    Test            Rev.    Description
____    ____________    ____    ___________

/mnt/test
        cpu4000.t       1.30    CPU INSTRUCTION SET TESTS 
        cpu4010.t       1.20    CPU INSTRUCTION SET TESTS 
        cpu4030.t       1.6     CPU4030 Functional Test Program 
        cpu4040.t       1.4     VECTOR CONCURRENCY TESTS 
        dev4100.t       1.33    Xylogics 450/451 Controller Test Program 
        dev4110.t       1.33    Xylogics 450/451 Controller Test Program 
        dev4200.t       1.17    STC Tape Drive Test Program 
        dev4300.t       1.22    Systech Comm Test Program 
        dev4500.t       1.18    EXOS/201 Ethernet Controller Test
        io4000.t        1.21    IOP Functional Test Program
        mem4000.t       1.16    Main Memory Diagnostic Test 
        spu4000.t       1.10    FITS Kernel Tests 
        spu4100.t       1.7     File I/O and Seek Test 

A test can be started by typing test followed by the name of the test (for example, test spu4000). A list of subtests can be obtained with test -s. A selection of subtests to run can then be made. Since I only have the SPU card installed in the main card cage, running the spu4000 tests seems to be in order. However, this falls over pretty quickly:

: test spu4000
Test 'spu4000.t'                        Sun Aug 30 20:12:35 1970

        Subtest 100    0:00:00   0:00:00   0:00:00   passed
        Subtest 105    0:00:00   0:00:00   0:00:00   failed

*****  Sun Aug 30 20:12:54 1970  *****
Test:    spu4000.t  1.10   Class: 1    Subtest: 105 1.4   Count: 1    Error: 0
Failed:  SPU RUN_ARM Circuitry

Expected RUN_ARM and RUN bits set   
Address: 00ffb840 Exp: 00000009 Act: 00000008 
scn_rp: Warning -- No board present in slot mcu

Test 'spu4000.t' failed
Elapsed time:   0:00:02

Well, Doug warned me that a lot of tests may require the MCU (memory control unit) and at least one MAU (memory array unit) to be installed. So, let's do a pwrdwn and install those cards. After this change, most of the SPU tests passed, except those that talk to the scan interfaces on the other (presently not installed) cards. Some of these tests also margin the power supplies low or high.

Now that we have the MCU and MAUs installed, we should also be able to run the mem4000 tests. Most of these pass, except the ones that require the job processor (main CPU) to be present, but the sizing test reports an error:

        Subtest 340    0:00:00   0:00:00   0:00:00   failed

*****  Fri Jan  2 16:33:08 1970  *****
Test:    mem4000.t  1.16   Class: 3    Subtest: 340 1.2   Count: 1    Error: 1
Failed:  Memory sizing test

Memory-Cop inconsistency on MAU 2
MAU size Expected 16 Mbytes, Actual 0 Mbytes

Expected:
MAU     Allocated Blocks

 2      00 01 02 03 04 05 06 07 

Actual:
MAU     Allocated Blocks

 2      

Test 'mem4000.t' failed
Elapsed time:   0:03:12

The COP is a small serial EEPROM on each board that stores configuration data. Apparently, the COP on this MAU tells the system there is no memory on the board, which is clearly wrong. Perhaps that EEPROM is broken?

Anyway, I took that MAU out and re-ran the tests. All is fine now. Time to move on: let's add the IOP (I/O Processor) card, and plug the Multibus cards into the Multibus card cage. Now we can run the io4000 tests.

        Subtest 202    0:00:00 

SPU_UNIX(109):Trap: Soft error 

  0:00:05   0:00:10   0:00:10   0:00:10   failed

*****  Sun Jan 17 16:31:32 1971  *****
Test:    io4000.t  1.21    Class: 2    Subtest: 202 1.1   Count: 1    Error: 0
Failed:  Main memory acess test

Ccu 7: 
Memory error @ 200000
Expected: 1    Actual: 0

*************** Test started Sun Jan 17 16:30:09 1971
*************** Test ended   Sun Jan 17 16:31:32 1971

Another memory related error. This one turned out to be relatively easy to fix... Memory needs to be initialized before running the io tests. So, we do an mminit -s (the -s tells mminit not to use the job processor), and rerun the io4000 tests. Now they all pass. It should now be possible to let the IO Processor (again based around a 68000) boot its own OS.

 

iosysload -livh /ioconfig iop

should do the trick. /ioconfig is the name of a text file that contains a definition of the installed I/O cards and their addresses. As a result, the heartbeat LED on the IOP starts blinking at a steady 1 Hz, which is good. Doug told me that the row of 8 LEDs on the Systech MTI-1650 16-line serial port card should do a little knight-rider scrolling. They don't. I spent some time trying to figure out what was wrong, and finally found out that the cable to the external box with the actual RS232 ports was not plugged in. Plugged that in, and the lights started scrolling as they should.

I also ran the tests for the tape controller (dev4200), disk controller (dev4100), serial card (dev4300) and LAN interface (dev4500). The LAN interface tests passed; the others failed on tests that actually needed the devices to be hooked up. So far, so good.

Progress has been very encouraging, but we're running out of things we can try without the main CPU (job processor). So, let's add the main CPU cards (a few at a time, running the appropriate SPU tests after installing them). With all cards installed, we should be able to load the microcode onto the CPU:

(spu)> initall -c
System Initialization
clk: n (10.0Mhz)  ps1: n (+5.00)  ps2: n (+4.85)  ps3: n (+5.10)

Loading ASU control stores
Reading file /mnt/usr/ucode/base.fasu  Rev 1.7  
cs: Checksum error in /mnt/usr/ucode/base.fasu
Initialization Aborted

Drats, we have a corrupt file on the hard disk. So, to make a long story short, I opened the disk image in a hex editor, figured out where the directory entries and inodes were, figured out which blocks on the disk were part of the base.fasu file, re-read those blocks from the original SPU disk (using a slightly slower process, reading them one block at a time), and found a difference. I patched the new SPU disk with the corrected data, reran the initall, and:

(spu)> initall
System Initialization
clk: n (10.0Mhz)  ps1: n (+5.00)  ps2: n (+4.90)  ps3: n (+5.00)

Loading ASU control stores
Reading file /mnt/usr/ucode/base.fasu  Rev 1.7  0:01
Loading wcs and epcs    0:10
Verifying wcs and epcs  0:19
Loading VCU control stores
Reading file /mnt/usr/ucode/lsctl.lcs  Rev 2.12  0:01
Loading lcs             0:08
Verifying lcs           0:07
Reading file /mnt/usr/ucode/actl.acs  Rev 1.22  0:01
Loading acs             0:08
Verifying acs           0:07
Reading file /mnt/usr/ucode/mctl.mcs  Rev 1.17  0:01
Loading mcs             0:07
Verifying mcs           0:08


Memory interleave set to 4
Initializing memory   0:00:00 
Hard error reported by MCU
Multiple bit error
Address:  04003000   Syndrome: 77    MAU:  MAU4
Memory master: PCU


mminit: Unable to initialize main memory
Initialization Aborted
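
A side note on the disk repair that got us past the checksum error: the comparison step boils down to reading two copies of the same data block by block and noting where they differ. Here is a minimal Python sketch of that idea. It is not the procedure actually used (that was done by hand in a hex editor); the function name, the file names and the 512-byte block size are all assumptions for illustration.

```python
import os
import tempfile

BLOCK_SIZE = 512  # assumption: the real SPU disk block size is not stated here

def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Yield the index of every block whose contents differ between two images."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        index = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both images are exhausted
            if block_a != block_b:
                yield index
            index += 1

# Demo with two small stand-in images: 4 blocks, one byte damaged in block 2.
good = bytes(range(256)) * 8
bad = bytearray(good)
bad[2 * BLOCK_SIZE + 10] ^= 0xFF

with tempfile.TemporaryDirectory() as tmp:
    path_good = os.path.join(tmp, "original.img")  # hypothetical file names
    path_bad = os.path.join(tmp, "copy.img")
    with open(path_good, "wb") as f:
        f.write(good)
    with open(path_bad, "wb") as f:
        f.write(bad)
    suspect = list(differing_blocks(path_good, path_bad))

print(suspect)  # [2]
```

On a real image you would still have to map the reported block numbers back to files through the directory entries and inodes, which is the part that took the hex editor.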

Ok, the microcode loads, but the main CPU does not like something about the memory on board MAU4. Let's take that one out.

Initializing memory   0:00:00 
Hard error reported by MCU
Nonexistent memory error
Address: 0400b010


mminit: Unable to initialize main memory
Initialization Aborted

Drats, more memory problems, but no indication of which MAU(s) are involved. I won't bother you with everything I tried (no, the mem4000 tests did not help). Eventually, I stumbled upon a combination of four MAUs that did work:

(spu)> initall
System Initialization
clk: n (10.0Mhz)  ps1: n (+5.05)  ps2: n (+4.95)  ps3: n (+5.00)

Loading ASU control stores
Reading file /mnt/usr/ucode/base.fasu  Rev 1.7  0:01
Loading wcs and epcs    0:10
Verifying wcs and epcs  0:19
Loading VCU control stores
Reading file /mnt/usr/ucode/lsctl.lcs  Rev 2.12  0:00
Loading lcs             0:08
Verifying lcs           0:07
Reading file /mnt/usr/ucode/actl.acs  Rev 1.22  0:00
Loading acs             0:08
Verifying acs           0:07
Reading file /mnt/usr/ucode/mctl.mcs  Rev 1.17  0:01
Loading mcs             0:07
Verifying mcs           0:08


Memory interleave set to 4
Initializing memory   0:00:00   0:00:03   0:00:03 

Main memory size:  67108864

MAU     Allocated Blocks

 0      00 01 02 03 04 05 06 07 
 1      00 01 02 03 04 05 06 07 
 2      00 01 02 03 04 05 06 07 
 3      00 01 02 03 04 05 06 07 
 4      
 5      
 6      
 7      

Initialization Complete
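
The reported size and the block map are consistent with each other, which is easy to check with a few lines of Python. This is only a sanity-check sketch: the table text is copied from the output above, parse_block_map is a made-up helper name, and it assumes that all allocated blocks are the same size.

```python
TOTAL_BYTES = 67108864  # "Main memory size" from the initall output

# "MAU / Allocated Blocks" table as printed by initall
TABLE = """\
 0      00 01 02 03 04 05 06 07
 1      00 01 02 03 04 05 06 07
 2      00 01 02 03 04 05 06 07
 3      00 01 02 03 04 05 06 07
 4
 5
 6
 7
"""

def parse_block_map(text):
    """Return {mau_number: [block ids]} from the table text (hypothetical helper)."""
    block_map = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            block_map[int(fields[0])] = fields[1:]
    return block_map

block_map = parse_block_map(TABLE)
populated = [mau for mau, blocks in block_map.items() if blocks]
total_blocks = sum(len(blocks) for blocks in block_map.values())

mib_total = TOTAL_BYTES // 2**20
mib_per_mau = mib_total / len(populated)
mib_per_block = mib_total / total_blocks  # assumes equally sized blocks

print(populated, total_blocks, mib_total, mib_per_mau, mib_per_block)
# [0, 1, 2, 3] 32 64 16.0 2.0
```

The 16 MB per MAU matches the "MAU size Expected 16 Mbytes" line in the earlier mem4000 sizing failure.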

Ok, we have 64MB of memory to play with... Let's try to run all of mem4000 (including the tests that use the job processor):

        Subtest 601    0:00:00   0:00:05   0:00:08   0:00:08   failed

*****  Thu Jan  1 05:46:53 1970  *****
Test:    mem4000.t  1.16   Class: 6    Subtest: 601 1.4   Count: 1    Error: 1
Failed:  Vectorized address uniqueness test

Mode:    EDC disabled   
Address: 03265998   Exp: 00000000 03265998   Act: 00000000 83265998

MAU:     MAU 3
RAMS:    E0-017

Test 'mem4000.t' failed
Elapsed time:   0:03:00

That's a very useful error message. It indicates a single RAM chip (location E0-017 on MAU3) that's bad. I replaced that chip, and reran the tests. It pointed out another bad RAM chip. Replaced that too, and a third. After that, all mem4000 tests passed.
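
To see what such an error message is saying at the bit level, you can XOR the expected and actual values; every 1 in the result is a bit that came back wrong. The snippet below is a small Python sketch of that arithmetic (flipped_bits is a name invented here). How the diagnostics map a failing bit and address to a physical chip location such as E0-017 is not something this sketch knows about.

```python
def flipped_bits(expected, actual):
    """Return the positions (0 = least significant) of all bits that differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]

# Values from the mem4000 subtest 601 failure
exp = int("00000000 03265998".replace(" ", ""), 16)
act = int("00000000 83265998".replace(" ", ""), 16)
print(flipped_bits(exp, act))                 # [31]

# Values from the earlier spu4000 subtest 105 failure
print(flipped_bits(0x00000009, 0x00000008))   # [0]
```

In the memory failure a single bit (bit 31) is stuck high, which fits the diagnosis of one bad RAM chip.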

Let's try to test the CPU:

: test cpu4000
Test 'cpu4000.t'                        Thu Jan  1 05:37:48 1970

Run default options                                  : y

libtest:  SIGSEGV received, using SIGIOT to dump core

This one turned out to be another corrupt file on the disk. It crashes before it even starts the subtests. cpu4010 and cpu4030 did pass. I fixed the disk corruption using the find-the-blocks-in-these-files-and-reread-them routine, and got cpu4000 to get as far as the first vector test:

        Subtest 925    0:00:00   0:00:00   0:00:00   failed

*****  Thu Jan  1 05:38:55 1970  *****
Test:    cpu4000.t  1.30   Class: 20   Subtest: 925 1.30  Count: 1    Error: 0
Failed:  vl = 0/vec. inst. test, source: vlzero.s

Subtest Failure, JP halted

a0: 0007d0f0    s0: 00000000 00000010    t0: 00000000    pc:   0002ae08
a1: 00000100    s1: 00000000 00000000    t1: 00000000    ca:   0002ae08
a2: 0002acf8    s2: aaaaaaaa aaaaaaaa    t2: 00000000    psw:  02000000
a3: 00000000    s3: 00000000 00000000    t3: 00000000    jpcr: 01d77f08
a4: 00000000    s4: 00000000 00000000    t4: 02000002    upc:       bfd
a5: 0002adea    s5: 00000000 00000000    t5: 00000000
a6: 00000000    s6: 00000000 00000080    t6: 00000014
a7: 00000000    s7: 00000000 00000000    t7: 01d77f08
LCTL: 4c4c0012000a96      MCTL: 300014000a90          ACTL: 280020002696
STLCTL,VL: 17fff207,7f    STMCTL,VL: 20011207,7f      STACTL,VL: 3bc11207,7f
DATA BUS: 0000000000000000  MISC: 383d1
VM REG STATE: 40000       VM: 0000000000000000 0000000000000000
C0,ST0: 0101,10101  C1,ST1: 0101,10101  C2,ST2: 0101,10101  C3,ST3: 0101,10101


Test 'cpu4000.t' failed
Elapsed time:   0:02:32

The cpu4040 tests also failed, so there appears to be something wrong with the vector units. At Doug's suggestion, I reran these tests with the clock margined down to 9 MHz, but that did not help. I then compared the board revisions between the C1 XP and the C1 XL, and found that the vector cards (two VPUs, vector processor units, and one VCU, vector control unit) had the same revision level. I moved the cards from the C1 XL into the C1 XP, and...

cpu4000 passes. cpu4040 is a very intensive, long-running vector test, but after 5 hours and 26 minutes, that passed as well. One thing left to do. On the SPU disk is a copy of the main Convex UNIX kernel (/mnt/os/vmunix). There is no root disk, so it will fall over, but let's see if we can boot that kernel:

(spu)> /mnt/os/boot
Booting Convex UNIX.  Type ^C to abort
System Initialization
clk: n (10.0Mhz)  ps1: n (+5.00)  ps2: n (+4.95)  ps3: n (+5.05)

Memory interleave set to 4
Initializing memory   0:00:00   0:00:04   0:00:04 

Main memory size:  67108864

MAU     Allocated Blocks

 0      00 01 02 03 04 05 06 07 
 1      00 01 02 03 04 05 06 07 
 2      00 01 02 03 04 05 06 07 
 3      00 01 02 03 04 05 06 07 
 4      
 5      
 6      
 7      

Initialization Complete
[SPU_@05:42:18] Errlog started: -l /mnt/errlog
[SPU_@05:42:19] 
[SPU_@05:42:22] errintd(1.13) started, options:
Loading vmunix
[SPU_@05:42:27] mm_sniff: sniff rate: 32.14 MB/day (1.99 days/pass)
vmunix: text: 749568  data: 77824  bss: 1409024  
[CPU_@05:42:47] Convex UNIX Version 7.1.0.10 Mon Mar 27 15:29:00 CST 1989
[CPU_@05:42:50] Device "DKC-001" (7/0/0x3f0 int 2)
[CPU_@05:42:50] Unit   "DKD-005" unit number 0
[CPU_@05:42:50] Unit   "DKD-005" unit number 1
[CPU_@05:42:50] Device "MTC-001" (7/0/0xc0 int 4)
[CPU_@05:42:50] Unit   "MTD-001" unit number 0
[CPU_@05:42:53] Device "ACM-001" (7/0/0x3c0 int 7)
[CPU_@05:42:56] Device "LAN-001" (7/1/0x4c0 int 1)
[CPU_@05:42:56] Unit   "ex" unit number 0
[CPU_@05:42:56] NX Firmware Version 5.5, Hardware Rev. 0.0
[CPU_@05:42:56] ex0: address = 08:00:14:12:92:32
[CCU7@05:42:56] da0: DrvR:Drive Not Ready [cyl=759 hd=18 sect=58 cnt=0]
[CPU_@05:42:57] WARNING: no swap space
[CCU7@05:42:57] da0: DrvR:Drive Not Ready [cyl=759 hd=18 sect=58 cnt=0]
[CPU_@05:42:58]         FATAL CONVEX UNIX ERROR: (rootmount cannot mount root)
[CPU_@05:42:58]         sp:     0b172f44        a1:     00000000
[CPU_@05:42:58]         a2:     00000000        a3:     0038bb70
[CPU_@05:42:58]         a4:     0038ab70        a5:     00000400
[CPU_@05:42:58]         ap:     0b172f58        fp:     0b172f44
[CPU_@05:42:58]         s0: ffffffff00000005    s1: 0000000040000001
[CPU_@05:42:59]         s2: 400000ff00000000    s3: 8000000000000010
[CPU_@05:42:59]         s4: 0000000000000000    s5: 0000000000000000
[CPU_@05:42:59]         s6: ffffffff00000005    s7: ffffffff00000005
[CPU_@05:42:59]         int. mask: 40000001
[CPU_@05:42:59] syncing disks... 
[CPU_@05:42:59] done
[CPU_@05:42:59] halting processor

/mnt/os/boot: 104 Quit - core dumped

There you go. This might well be the only working Convex C1 CPU on the planet...
