Camino de yuwen-c

A Small Connection Headache with an Intranet Host

#macos #ssh #network #troubleshooting notes

A while ago at work, I ran into a slightly annoying issue. From my Mac laptop on the company intranet, I was using SSH to connect to another machine on the same intranet, a DGX Spark. The connection would work normally for a while, say one or two hours, and then suddenly drop. At first I wondered if the host had gone to sleep, so I walked over, used the host directly, and tested the connection again. Once I did that, the connection between my laptop and the host recovered.

After trying this a few times, I came up with a temporary workaround: whenever pinging the host’s IP from my laptop started returning “Request timeout”, I would go to the host and ping my laptop from there. As soon as I did that, my laptop’s connection to the host went back to normal. It worked every single time, but only for about one or two hours before I had to do it again.
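
To pin down how often the drops really happen, a small watcher could log each ping result with a timestamp. The following is only a sketch under assumptions, not something from the original setup: the address 192.0.2.10 is a placeholder from the documentation range, and the function names are made up for illustration.

```python
import subprocess
import time
from datetime import datetime


def is_reachable(host: str, timeout_s: float = 3.0, run=subprocess.run) -> bool:
    """Return True if a single ping to `host` succeeds within `timeout_s`."""
    try:
        # "-c 1" sends one echo request; the time limit is enforced by Python
        # because ping's own timeout flag differs between macOS and Linux.
        result = run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def log_line(host: str, reachable: bool, now: datetime) -> str:
    """Format one timestamped log entry."""
    state = "ok" if reachable else "TIMEOUT"
    return f"{now.isoformat(timespec='seconds')} {host} {state}"


def watch(host: str, interval_s: float = 60.0, checks: int = 120) -> None:
    """Ping `host` every `interval_s` seconds, `checks` times, printing a log."""
    for _ in range(checks):
        print(log_line(host, is_reachable(host), datetime.now()), flush=True)
        time.sleep(interval_s)
```

Calling watch("192.0.2.10") with the real host address in place of the placeholder would print one line per minute for two hours, which is enough to see whether the drops follow a regular interval.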

sequenceDiagram
    participant Mac as Mac laptop
    participant Host as DGX Spark host

    Mac--xHost: SSH drops / ping times out
    Note over Host: Operate the host locally
    Host->>Mac: Ping my Mac from the host
    Mac->>Host: Connection recovers

My teammates were using Windows, and I was the only one on a Mac. They said they had not run into this problem. On top of that, I did not have access to the company’s router configuration, so I could not see the lower-level routing details of the intranet. After going back and forth with GPT many times, the best guess (and it stayed a guess, since I could not verify it) was that Windows and macOS handle the ARP cache differently, which led to different connection behavior.
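
One way to test the ARP guess would be to look at the laptop’s ARP cache at the moment the connection drops. The sketch below parses the text printed by arp -an and reports what is cached for one IP. The sample lines show the layout I would expect on macOS, but treat that layout, the addresses, and the helper name as assumptions rather than captured output.

```python
import re
from typing import Optional

# Matches "(ip) at mac" as printed by `arp -an`. An unresolved entry is shown
# as "(incomplete)" on macOS and "<incomplete>" on Linux. macOS may print
# octets without a leading zero, so 1 or 2 hex digits are accepted.
ARP_ENTRY = re.compile(
    r"\((?P<ip>\d{1,3}(?:\.\d{1,3}){3})\) at "
    r"(?P<mac>[(<]incomplete[)>]|[0-9A-Fa-f]{1,2}(?::[0-9A-Fa-f]{1,2}){5})"
)


def cached_mac(arp_output: str, ip: str) -> Optional[str]:
    """Return the cached MAC for `ip`, "incomplete" if unresolved, else None."""
    for line in arp_output.splitlines():
        match = ARP_ENTRY.search(line)
        if match and match.group("ip") == ip:
            mac = match.group("mac")
            return "incomplete" if "incomplete" in mac else mac.lower()
    return None


# Illustrative sample only (placeholder addresses, assumed macOS layout).
SAMPLE = """\
? (192.0.2.1) at a4:83:e7:0:1:2 on en0 ifscope [ethernet]
? (192.0.2.10) at (incomplete) on en0 ifscope [ethernet]
"""
```

The live text could come from subprocess.run(["arp", "-an"], capture_output=True, text=True).stdout. If the host’s entry is missing or incomplete exactly when ping times out, that would support the ARP explanation; if a valid MAC is still cached, the cause is probably elsewhere.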

So for a while I was stuck in this endless loop: “Ah, it is disconnected again ⭢ plug a keyboard, mouse, and monitor into the host ⭢ ping my laptop ⭢ connection works.” That was far too much trouble. After more discussion, GPT suggested another method: add a static ARP entry on my laptop that maps the host’s IP address to its MAC address, so macOS no longer has to rely on dynamic ARP lookup. One thing to watch out for, though: at one point I switched the host from Wi-Fi to wired networking to make downloading LLM models faster. The Wi-Fi and wired interfaces are separate network adapters, so each has its own MAC address, and the host could also receive a different IP on each. I had to add a mapping for the wired interface as well before I could connect again.
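
For reference, here is roughly what adding those mappings could look like. On macOS a static entry is added with arp -s followed by the IP and the MAC address, and it needs root. The sketch below only builds and validates the commands. Every address in it is a placeholder, and the names HOST_MAPPINGS and static_arp_command are invented for this example.

```python
import ipaddress
import re
from typing import Dict, List

MAC_RE = re.compile(r"^[0-9A-Fa-f]{1,2}(:[0-9A-Fa-f]{1,2}){5}$")

# Placeholder values: one entry per host interface, since Wi-Fi and wired
# Ethernet each have their own IP and MAC address.
HOST_MAPPINGS: Dict[str, str] = {
    "192.0.2.10": "02:00:00:00:00:01",  # host on Wi-Fi (placeholder)
    "192.0.2.11": "02:00:00:00:00:02",  # host on wired Ethernet (placeholder)
}


def static_arp_command(ip: str, mac: str) -> List[str]:
    """Build the `arp -s` command for one IP/MAC pair after validating both."""
    ipaddress.IPv4Address(ip)  # raises ValueError if the IP is malformed
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ["sudo", "arp", "-s", ip, mac]


def all_commands(mappings: Dict[str, str]) -> List[List[str]]:
    """Build one command per mapping, in insertion order."""
    return [static_arp_command(ip, mac) for ip, mac in mappings.items()]
```

Printing " ".join(cmd) for each result gives lines that can be pasted into a terminal. As far as I know, entries added this way do not survive a reboot, so they would need to be re-applied afterwards.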

flowchart LR
    host["<div class='dgx-gold-metallic'>DGX Spark host</div>"]

    subgraph laptop["Mac laptop"]
        arp["Add the host mapping to the laptop<br/>Host IP + MAC address"]
    end

    laptop -->|Find the host directly and connect| host

    style laptop fill:#e8f2ff,stroke:#3b82f6,color:#1e3a8a
    classDef dgx fill:transparent,stroke:transparent,color:#3f2a00
    class host dgx

Not long after that, I found an even more convenient solution: connecting through the host’s mDNS name. The host advertises its .local hostname over multicast DNS (mDNS) on the local network, and any device on the same local network can resolve that name and connect to the host with it, without needing to know the IP address. This completely solved the problem for me. Later, I realized that the DGX Spark’s .local name was printed right on the cover of its manual. The answer had been right in front of me the whole time, haha.
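
A quick way to confirm that the name resolves before pointing SSH at it is to ask the system resolver, which on macOS answers .local names through mDNS (on Linux this usually depends on Avahi or a similar service being set up). This is a minimal sketch: the hostname spark-example is hypothetical, and the helper names are mine.

```python
import socket
from typing import Callable, List


def to_mdns_name(hostname: str) -> str:
    """Append ".local" unless the name already ends with it."""
    name = hostname.strip().rstrip(".")
    return name if name.lower().endswith(".local") else name + ".local"


def resolve_addresses(
    hostname: str,
    port: int = 22,
    lookup: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the unique IP addresses the resolver reports for the mDNS name.

    `lookup` defaults to the system resolver; socket.gaierror is raised if
    the name cannot be resolved.
    """
    infos = lookup(to_mdns_name(hostname), port, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses
```

If resolve_addresses("spark-example") returns an address, then ssh user@spark-example.local should reach the same machine, whichever IP the host currently has.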

flowchart LR
    subgraph lan["Same local network"]
        host["<div class='dgx-gold-metallic'>DGX Spark host</div>"]
        mdns(("mDNS multicast<br/>Advertises hostname and IP"))
        laptop["Mac laptop"]
        other1["Other device"]
        other2["Other device"]
    end

    host -.->|Advertises| mdns
    mdns -.->|Devices on the same network can discover it| laptop
    mdns -.-> other1
    mdns -.-> other2
    laptop -->|Connect with the .local hostname| host

    classDef mac fill:#e8f2ff,stroke:#3b82f6,color:#1e3a8a
    classDef dgx fill:transparent,stroke:transparent,color:#3f2a00
    class laptop mac
    class host dgx

Postscript: I originally thought setting up mDNS would be the end of this issue. But later, after a driver update on the host, the disconnect problem disappeared as well. In hindsight, all the workarounds above could have been skipped. It was a completely unexpected, brute-force fix 🙃.