APU2: Enable Hardware Watchdog? (Linux)

    b1k2 - Posted Dec 9th 2017

    Hi all,

    I've done some digging. According to the documentation, http://pcengines.ch/file/NCT5104D_Datasheet_V1_9.pdf, there are two revisions (chip IDs) of the NCT5104D: a B version (0xc452) and a C version (0xc453). My APU2 seems to have the C revision chip.

    The sp5100_tco module in Debian 9 (kernel 4.9) does not support the above chip (rev. B or C).

    The driver on PC Engines' GitHub page does recognise the chip, but it is a GPIO driver, not a watchdog driver: https://github.com/pcengines/linux-gpio-nct5104d

    insmod gpio-nct5104d.ko
    ...
    [ 1825.718466] gpio-nct5104d: Found nct5104d at 0x2e chip id 0xc453
    [ 1825.718582] gpio-nct5104d: platform_driver_register
    [ 1825.719506] gpio-nct5104d: Device added

    It seems that some people can use /dev/watchdog once i2c_piix4 is unloaded, but on my board that doesn't help:
    [   87.275613] sp5100_tco: failed to find MMIO address, giving up.

    In the meantime we could use the softdog workaround (the software watchdog module) to detect problems other than hardware or system lockups, such as network, filesystem or CPU load issues. The sysctl setting kernel.panic=<seconds> might also come in handy.
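A minimal sketch of how that workaround could be made persistent, assuming a systemd-based Debian where /etc/modules-load.d and /etc/sysctl.d are honoured. The function name, the file names and the 10-second default are illustrative choices, not something taken from this thread.

```shell
# Sketch: write the config files for the softdog workaround.
# $1: optional prefix directory, so the files can be staged somewhere harmless
#     before touching the real /etc (empty means the live system)
# $2: seconds the kernel waits after a panic before rebooting (assumed default: 10)
setup_softdog() {
    prefix="${1:-}"
    timeout="${2:-10}"
    mkdir -p "$prefix/etc/modules-load.d" "$prefix/etc/sysctl.d"
    # Load the software watchdog module at boot
    echo softdog > "$prefix/etc/modules-load.d/softdog.conf"
    # Reboot automatically after a kernel panic instead of hanging
    echo "kernel.panic = $timeout" > "$prefix/etc/sysctl.d/90-panic.conf"
}
```

Run as root without arguments, this writes to the live system; `modprobe softdog` and `sysctl --system` should then apply both settings without a reboot.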

    That's all for now, have a nice weekend.

     

     

    b1k2 - Posted Dec 10th 2017

    For reference the kernel log on the LEDE platform:

    [    2.799236] gpio-nct5104d: Unsupported device 0xc453
    [    2.804379] gpio-nct5104d: Unsupported device 0xffff

     

    ingo2 - Posted Dec 10th 2017

    On my system under Debian Stretch (after today's upgrade to 9.3):

    cat /var/log/messages | grep 51

    Dec 10 12:44:58 apu kernel: [    4.416913] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
    Dec 10 12:44:58 apu kernel: [    4.417150] sp5100_tco: PCI Vendor ID: 0x1022, Device ID: 0x780b, Revision ID: 0x42
    Dec 10 12:44:58 apu kernel: [    4.420768] sp5100_tco: Using 0xfeb00000 for watchdog MMIO address
    Dec 10 12:44:58 apu kernel: [    4.420783] sp5100_tco: Last reboot was not triggered by watchdog.
    Dec 10 12:44:58 apu kernel: [    4.422675] sp5100_tco: initialized (0xffffae2a006dd000). heartbeat=60 sec (nowayout=0)

    Don't forget: blacklist the i2c_piix4 module and also remove it from the initrd (update-initramfs -u).

     

    b1k2 - Posted Dec 10th 2017

    Hi Ingo2,

    Thanks, I did a clean install of Debian 9.3 and got the following results.

    After blacklisting the i2c module and rebooting, the driver did load correctly. However, after repeated power cycles and reboots, it seems the driver never loads correctly straight after a power off/on, and only sometimes starts working after one or more reboots.


    After clean install
    # lsmod |grep -E "i2c|sp5100"
    sp5100_tco             16384  0
    i2c_piix4              24576  0
    i2c_algo_bit           16384  1 igb

    # dmesg |grep -E "i2c|sp5100"
    [    4.890127] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
    [    4.890407] sp5100_tco: PCI Vendor ID: 0x1022, Device ID: 0x780b, Revision ID: 0x42
    [    4.890418] sp5100_tco: I/O address 0x0cd6 already in use

    Modification
    cat > /etc/modprobe.d/blacklist.conf <<EOF
    blacklist i2c_piix4
    EOF

    update-initramfs -u

    After shutdown, power off, power on
    # lsmod |grep -E "i2c|sp5100"
    sp5100_tco             16384  0
    i2c_algo_bit           16384  1 igb

    # dmesg |grep -E "i2c|sp5100"
    [    4.821515] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
    [    4.821746] sp5100_tco: PCI Vendor ID: 0x1022, Device ID: 0x780b, Revision ID: 0x42
    [    4.821802] sp5100_tco: failed to find MMIO address, giving up.

    After reboot
    # dmesg |grep sp5100
    [    4.826192] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
    [    4.826374] sp5100_tco: PCI Vendor ID: 0x1022, Device ID: 0x780b, Revision ID: 0x42
    [    4.826470] sp5100_tco: Using 0xfeb00000 for watchdog MMIO address
    [    4.826481] sp5100_tco: Last reboot was not triggered by watchdog.
    [    4.826912] sp5100_tco: initialized (0xffffb140c071d000). heartbeat=60 sec (nowayout=0)

     

    ingo2 - Posted Dec 10th 2017

    b1k2 - Posted 80 Minutes Ago

    ......

    After reboot
    # dmesg |grep sp5100
    [    4.826192] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
    [    4.826374] sp5100_tco: PCI Vendor ID: 0x1022, Device ID: 0x780b, Revision ID: 0x42
    [    4.826470] sp5100_tco: Using 0xfeb00000 for watchdog MMIO address
    [    4.826481] sp5100_tco: Last reboot was not triggered by watchdog.
    [    4.826912] sp5100_tco: initialized (0xffffb140c071d000). heartbeat=60 sec (nowayout=0)

    That's great, now all is fine.

    The root cause I found out by checking the I/O-port:

    cat /proc/ioports
                0cd6-0cd7 : smba_idx  -> and smba_idx is part of  i2c_piix4

    (It's a bug already known, see https://bugzilla.redhat.com/show_bug.cgi?id=1406844)

     

    Now you may install the package "watchdog" and configure it to your needs. For documentation on how to configure it, see:

    http://manpages.ubuntu.com/manpages/zesty/man8/watchdog.8.html
    http://manpages.ubuntu.com/manpages/zesty/man5/watchdog.conf.5.html

    For instance, I only configured CPU temperature monitoring. Uncomment and/or edit these lines in /etc/watchdog.conf:

    watchdog-device    = /dev/watchdog
    temperature-sensor = /sys/devices/pci0000:00/0000:00:18.3/hwmon/hwmon0/temp1_input
    max-temperature    = 80

     

    You can locate the temperature sensor using:

    find /sys -name 'temp*input' -print

     

    Then check if all is fine:

    service watchdog start

    service watchdog status

     

    I had some problems: every now and then the temperature sensor got a different device name in the sys filesystem. That appears to depend on the order in which the modules are loaded at boot. To finally solve the issue, I put the module sp5100_tco in /etc/modules to force it to load, and also blacklisted fam15h_power, which reports wildly fluctuating values from milliwatts to 100 watts:

    /etc/modprobe.d/blacklist.conf

        blacklist fam15h_power

     

    This way everything is stable, checked over tens of reboots.
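Since the hwmonN index can change between boots, one way to double-check which path is currently valid is to look the sensor up by driver name rather than by number. The sketch below assumes the CPU temperature comes from the k10temp driver (the device at 0000:00:18.3 suggests this, but it is an assumption), and the function name is made up for illustration.

```shell
# Print the temp1_input path of the hwmon device whose "name" file matches.
# $1: driver name as shown in the hwmon "name" file (k10temp is assumed here)
# $2: hwmon class directory (defaults to /sys/class/hwmon; overridable for testing)
find_temp_sensor() {
    want="${1:-k10temp}"
    base="${2:-/sys/class/hwmon}"
    for dir in "$base"/hwmon*; do
        [ -r "$dir/name" ] || continue
        if [ "$(cat "$dir/name")" = "$want" ] && [ -e "$dir/temp1_input" ]; then
            echo "$dir/temp1_input"
            return 0
        fi
    done
    # No matching sensor found
    return 1
}
```

Calling `find_temp_sensor` after boot prints the path to compare against the temperature-sensor line in the watchdog configuration; it does not update that file by itself.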

     

    Ingo

     

    b1k2 - Posted Dec 12th 2017

    Hi Ingo,

    It was not at all stable on my APU2: sometimes it worked, other times not. But I noticed something. Sometimes the ccp module was initialized before sp5100, and other times it was initialized after sp5100. When the sp5100 module is initialized first, the watchdog works.

    reboot:
    [    4.803941] sp5100_tco: initialized (0xffffaa6c80725000). heartbeat=60 sec (nowayout=0)
    [    4.822040] ccp 0000:00:08.0: initialization failed

    reboot:
    [    4.778918] ccp 0000:00:08.0: enabled
    [    4.779444] sp5100_tco: failed to find MMIO address, giving up.

     

    I have now blacklisted i2c_piix4 and ccp. Now I have the /dev/watchdog device at every reboot and power cycle. CCP is the Cryptographic Coprocessor device driver. It provides the interface to use the AMD Cryptographic Coprocessor which can be used to accelerate or offload encryption operations such as SHA, AES and more (according to the help text).

     

     

    ingo2 - Posted Dec 12th 2017

    Thanks for this information, I hadn't noticed the failure of ccp.

    I get the same here, except that on my board sp5100_tco always loads first:

    dmesg | egrep "ccp|5100"

    [    4.416913] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver v0.05
    [    4.417150] sp5100_tco: PCI Vendor ID: 0x1022, Device ID: 0x780b, Revision ID: 0x42
    [    4.420768] sp5100_tco: Using 0xfeb00000 for watchdog MMIO address
    [    4.420783] sp5100_tco: Last reboot was not triggered by watchdog.
    [    4.422675] sp5100_tco: initialized (0xffffae2a006dd000). heartbeat=60 sec (nowayout=0)
    [    4.714288] ccp 0000:00:08.0: BAR 0: can't reserve [mem 0xfeb00000-0xfeb1ffff 64bit pref]
    [    4.714296] ccp 0000:00:08.0: pci_request_regions failed (-16)
    [    4.720459] ccp 0000:00:08.0: initialization failed
    [    4.731126] ccp: probe of 0000:00:08.0 failed with error -16

    Maybe this is due to loading sp5100_tco directly in /etc/modules. You can also see that the timestamps at which the modules are loaded differ significantly here (about 0.3 s), compared to less than 0.02 s in your setup. As all this stuff is also part of the AMD Ryzen platform, we can expect it to get fixed in later kernels. Searching for "linux ccp ryzen" finds a lot of complaints about this incompatibility.

    I'll keep an eye on that. IIRC there is also a way to load modules in a certain sequence, but I haven't tried that so far.
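One mechanism for ordering module loads is the softdep directive described in modprobe.d(5). The sketch below is untested on the APU2 and only illustrates the idea; the function name and the file name are hypothetical, and whether this actually avoids the conflict with ccp is an open question.

```shell
# Write a modprobe.d fragment asking modprobe to load sp5100_tco before ccp.
# $1: target directory (defaults to /etc/modprobe.d; pass a scratch
#     directory to preview the result without touching the system)
write_watchdog_softdep() {
    dir="${1:-/etc/modprobe.d}"
    printf '%s\n' 'softdep ccp pre: sp5100_tco' > "$dir/sp5100-before-ccp.conf"
}
```

As with the blacklist entries, running update-initramfs -u afterwards is probably needed if either module ends up in the initrd.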

    Just to be on the safe side I now also blacklisted "ccp".

    Amazingly, the most important crypto functions are still available, as lsmod shows:

    crc32c_intel           24576  6
    aesni_intel           167936  1
    aes_x86_64             20480  1 aesni_intel

    glue_helper            16384  1 aesni_intel
    lrw                    16384  1 aesni_intel
    gf128mul               16384  1 lrw
    ablk_helper            16384  1 aesni_intel
    cryptd                 24576  3 ablk_helper,ghash_clmulni_intel,aesni_intel

    I also checked encryption speed, which shows no difference from earlier tests:

    # openssl speed -evp aes-256-gcm

    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
    aes-256-gcm      38941.90k   103331.86k   238510.51k   315883.52k   376974.47k   379966.81k

     

    # openssl speed -evp aes-256-cbc

    type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
    aes-256-cbc      69903.30k   119458.94k   151773.15k   161950.72k   165711.28k   164478.98k

     

    My personal guess is that ccp is used to manage the crypto functions in a kind of TPM, so it is just the AMD counterpart to Intel's TPM?

    Regards, Ingo

     


    Post last edited Dec 12th 2017
    ingo2 - Posted Dec 13th 2017

    Have we done something wrong?

     

    I just found this bug report against the Debian 4.9 kernel:

    https://lists.debian.org/debian-kernel/2017/06/msg00128.html

    In the paragraph of loaded modules I see all our modules (of which I have 3 blacklisted now) loaded:

    fam15h_power
    k10temp
    sp5100_tco
    sg
    ccp

    and apparently co-existing peacefully?

    b1k2 - Posted Dec 13th 2017

    ingo2 - Posted 7 Hours Ago

    Have we done something wrong?

    No, we're just some nerds playing with our electronics ^_^

    I just found this bug report against the Debian 4.9 kernel:

    https://lists.debian.org/debian-kernel/2017/06/msg00128.html

    In the paragraph of loaded modules I see all our modules (of which I have 3 blacklisted now) loaded:

    fam15h_power
    k10temp
    sp5100_tco
    sg
    ccp

    and apparently co-existing peacefully?

     


    Not quite. I don't think the watchdog device is working for Rouven (the reporter of that bug):

    [    5.428571] sp5100_tco: I/O address 0x0cd6 already in use

     

    My experience now is:

    sp5100_tco: I/O address 0x0cd6 already in use = unload i2c_piix4.
    sp5100_tco: failed to find MMIO address, giving up. = unload ccp.
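The two rules above can be turned into a small check over the kernel log. This is only a sketch of the mapping found in this thread, with a made-up function name; it knows nothing about other failure modes.

```shell
# Read kernel log lines on stdin and print the fix this thread found.
# Exit status: 0 = driver initialized, 1 = known failure, 2 = nothing recognised.
diagnose_sp5100() {
    log="$(cat)"
    case "$log" in
        *"sp5100_tco: initialized"*)
            echo "ok: watchdog driver initialized"; return 0 ;;
        *"I/O address 0x0cd6 already in use"*)
            echo "blacklist i2c_piix4"; return 1 ;;
        *"failed to find MMIO address"*)
            echo "blacklist ccp"; return 1 ;;
        *)
            echo "no sp5100_tco message found"; return 2 ;;
    esac
}
```

Typical use would be `dmesg | diagnose_sp5100` after a boot; remember to run update-initramfs -u after changing the blacklist.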

    M G-L - Posted Jan 9th 2018
    My experience now is:

    sp5100_tco: I/O address 0x0cd6 already in use = unload i2c_piix4.
    sp5100_tco: failed to find MMIO address, giving up. = unload ccp.


    I've started seeing the need for this as well.  But for reasons I don't understand, it seems to be hardware specific?  On some units I can have sp5100_tco and ccp co-existing, and on others I can't.  Trying to Google for this problem ... basically just finds this thread.

    One theory: A unit that is working fine has a WiFi card in one of the slots.  This seems to have shifted around the iomem reservations for things, including for the ccp device.  On one of the units that has the problem, trying to load ccp after sp5100_tco causes the ccp driver to complain it can't reserve some iomem that the sp5100_tco driver has reserved.
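To compare units, it may help to check which entries in /proc/iomem cover the watchdog MMIO address reported earlier in this thread (0xfeb00000). The helper below is a sketch with a made-up name; it only parses the "start-end : owner" lines and prints those that contain the address.

```shell
# Print the /proc/iomem entries whose range contains the given address.
# $1: address in hex, e.g. 0xfeb00000 (the watchdog MMIO address seen above)
# $2: file to read (defaults to /proc/iomem; read it as root, since the
#     kernel may hide the real addresses from unprivileged users)
iomem_owner() {
    addr=$(( $1 ))
    while read -r range sep owner; do
        # Skip anything that does not look like "start-end"
        case "$range" in *-*) ;; *) continue ;; esac
        start="${range%-*}"
        end="${range#*-}"
        if [ "$addr" -ge "$((0x$start))" ] && [ "$addr" -le "$((0x$end))" ]; then
            echo "$range $sep $owner"
        fi
    done < "${2:-/proc/iomem}"
}
```

Running `iomem_owner 0xfeb00000` on a working and a failing unit should show whether the ccp device (0000:00:08.0) claims that region on only one of them.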

    Oh I did miss the good old ISA days and "what device is in what slot makes for resource conflicts" ;) /s

    From looking up stuff, it seems that the ccp driver is probably not needed for the vast majority of use cases.  It doesn't seem to be related to aes-ni, but to other forms of crypto acceleration.


    Post last edited Jan 9th 2018