A four-hour home-manager adventure

I spent roughly four hours resolving a misconfiguration issue and ended up rebuilding my nix environment.

It started while I was debugging a Lambda function locally with sam. The function reads its secrets from Vault at runtime for authentication. In production, the Vault Lambda extension exposes the Vault instance at http://127.0.0.1:8200. On my Windows workstation, I ran a Vault proxy bound to localhost:8200 inside WSL, then used sam local invoke to trigger the function. It should have worked, but the function complained that the connection was refused. I verified that I could reach the endpoint from the browser. Weird.

Info

Spoiler alert: sam packs the Lambda function into a Docker image with the specified runtime, so the function is invoked inside a Docker container. Therefore, the host’s localhost should be referenced as host.docker.internal.

I assumed this was yet another WSL network routing issue, so I decided to give it a try on macOS. By the way, I use nix and home-manager to manage packages and dotfiles across WSL and macOS. Installing vault failed because of a broken package, valgrind:

**error:** Package ‘valgrind-3.24.0’ in /nix/store/w25dkvl0vhdhngipf8pk8lqwy60cqbwc-nixpkgs/nixpkgs/pkgs/top-level/all-packages.nix:7751 is marked as broken, refusing to evaluate.

But the package had already been removed in this change. How could I remove a package that was not installed?

Info

In retrospect, I suspect this might have come from nix-shell: valgrind was referenced in the llvm shell, and perhaps a recent update broke it? It is still a mystery to me.

I recalled that NixOS supports rollback. Although home-manager does not officially support this feature, home-manager generations lists the recent generations, and I could run a generation's activate script to roll back. I tried a couple of recent generations but still hit the same error, so I decided to go deeper and rolled back to a generation built two years ago. Maybe I could use a strategy similar to git bisect to identify the issue?
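
As a rough illustration of the manual rollback described above, the sketch below pulls one generation's store path out of the home-manager generations listing so that its activate script can be run. The helper name generation_path is hypothetical, and the sketch assumes each line of the listing looks like "DATE TIME : id N -> /nix/store/...", so check the output on your machine before relying on it.

```shell
# Hypothetical helper: print the store path of the generation with the given id.
# Expects the output of `home-manager generations` on stdin, assumed to look like:
#   2018-01-04 11:56 : id 765 -> /nix/store/...-home-manager-generation
generation_path() {
  awk -v id="$1" '$4 == "id" && $5 == id { print $7 }'
}

# Usage (commented out so the snippet has no side effects):
#   path="$(home-manager generations | generation_path 765)"
#   "$path/activate"
```

Bisecting then means picking an id halfway between a known-good and a known-bad generation and repeating, which only helps if the store paths of the older generations have not been garbage collected.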

No. home-manager complained about missing dependencies. I figured the packages referenced by the old generation might have been garbage collected. Could I just reinstall home-manager to address this?

Warning

Up to this moment, I still had a working command line. The next operation was a one-way door: once through it, I had no choice but to fix the issue!

I tried removing home-manager:

nix-env -e home-manager-path

The command nuked the packages installed by home-manager, um, including the zsh shell I was using. Any new shell session would crash immediately because of missing zsh components such as starship. It was already 11pm, but I had to fix this to have a workable environment for the next day's development!

Restore the nix store

First, I launched Terminal and used New Command to run /bin/bash, then switched the login shell: chsh -s /bin/bash. Finally I had a working shell to start with.

I needed a nix environment, so I ran the official install script several times, thanks to the NixOS/nix#10892 issue on macOS 15 Sequoia. Since I had already run the migration script discussed in the thread, I needed to delete the existing _nixbld users to make the script happy:

dscl . list /Users UniqueID | grep -E '\b_nixbld' | cut -d' ' -f1 | xargs -L 1 -I{} sudo dscl . delete  "/Users/{}"

After nix was installed, I tried to install home-manager:

nix-shell '<home-manager>' -A install

**error:** opening file '**/nix/store/pspbxs6hsaaqdvhgwci3730fdl9wfadw-gnu-config-2024-01-01.drv**': **No such file or directory**

It looked like the nix store was corrupted, so I tried to clean it up:

nix-collect-garbage -d

nix-store --verify --check-contents --repair
warning: $HOME ('/Users/kun.xi') is not owned by you, falling back to the one defined in the 'passwd' file ('/var/root')
reading the Nix store...
checking path existence...
path '/nix/store/00mgj7rvgyl1qpq10lx26rnv0gmzp91y-python3.12-pytest-httpbin-2.0.0.drv' disappeared, but it still has valid referrers!
warning: cannot repair path '/nix/store/00mgj7rvgyl1qpq10lx26rnv0gmzp91y-python3.12-pytest-httpbin-2.0.0.drv'
path '/nix/store/00rcwjk7ykxyyycb4y0za688070nvpg7-lutok-0.4.drv' disappeared, but it still has valid referrers!
... ...

I retried installing home-manager and got the following error:

nix-shell '<home-manager>' -A install

this derivation will be built:
  /nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv
building '/nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv'...
/nix/store/vj1c3wf9c11a0qs6p3ymfvrnsdgsdcbq-source-stdenv.sh: line 3: /nix/store/shkw4qm9qcw5sc5n1k5jznc83ny02r39-default-builder.sh: No such file or directory
error: Cannot build '/nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv'.
       Reason: builder failed with exit code 1.
       Output paths:
         /nix/store/5a4vls80b9jd4z6bfkdim8ww90ljwxqs-home-manager
       Last 1 log lines:
       > /nix/store/vj1c3wf9c11a0qs6p3ymfvrnsdgsdcbq-source-stdenv.sh: line 3: /nix/store/shkw4qm9qcw5sc5n1k5jznc83ny02r39-default-builder.sh: No such file or directory
       For full logs, run:
         nix-store -l /nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv

This revealed that some fundamental building block was missing. Reinstalling nix would NOT bring the system back to a stable state.

Info

In retrospect, I think I might still have been referencing the old profile, which was incompatible with the latest nix-2.29.0:

ls -al ~/.nix-profile/etc/profile.d
lr-xr-xr-x  1 root  nixbld  68 Dec 31  1969 /Users/kun.xi/.nix-profile/etc/profile.d -> /nix/store/1dc38w5wn3z19yjy3jal6s4grgv7rzba-nix-2.13.3/etc/profile.d
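
One way to spot this kind of mismatch is to compare the Nix version embedded in the profile's store path with the installed one. The sketch below is only an illustration: the helper name nix_version_of_path is hypothetical, and it assumes the store path has the shape /nix/store/<hash>-nix-<version>/... as in the listing above.

```shell
# Hypothetical helper: extract the Nix version from a store path such as
#   /nix/store/<hash>-nix-2.13.3/etc/profile.d
# Prints nothing if the path does not have that assumed shape.
nix_version_of_path() {
  printf '%s\n' "$1" | sed -n 's|^/nix/store/[^/-]*-nix-\([0-9][0-9.]*\).*|\1|p'
}

# Usage (commented out so the snippet has no side effects):
#   profile_version="$(nix_version_of_path "$(readlink ~/.nix-profile/etc/profile.d)")"
#   echo "profile points at nix $profile_version"
#   nix --version   # compare against the version printed above
```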

Rebuild everything

The attempt to salvage the working environment surgically had failed; the only option left was to nuke and rebuild everything.

I followed this uninstall manual to fully remove nix, reinstalled nix for the nth time, reinstalled home-manager, and got exactly the same error I had encountered about three hours earlier. #wtf?

I realized that the root cause was still my home-manager config, because home-manager built successfully with the default home-manager/home.nix. I vigorously removed unused packages and shells, as in this PR, and it finally worked.

Then I hit exactly the same connection refused issue on macOS. Only then did I realize that the localhost binding worked across WSL and the Windows host, but not inside the Docker container started by sam! With VAULT_ADDR=http://host.docker.internal:8200, it worked on both Windows and macOS.
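
The fix amounts to rewriting a loopback address before handing it to the container. Here is a minimal sketch with a hypothetical helper name, to_docker_host. It assumes Docker Desktop, where host.docker.internal resolves to the host (plain Linux Docker may need extra configuration), and it keeps whatever port and path the original address carried.

```shell
# Hypothetical helper: rewrite a loopback URL so that it points at the Docker
# host instead of the container itself. Other URLs are printed unchanged.
to_docker_host() {
  printf '%s\n' "$1" |
    sed -E 's#^(https?://)(127\.0\.0\.1|localhost)(:|/|$)#\1host.docker.internal\3#'
}

# Usage (commented out so the snippet has no side effects):
#   VAULT_ADDR="$(to_docker_host http://127.0.0.1:8200)"
#   echo "$VAULT_ADDR"
```

The value still has to reach the function's environment, for example through the environment variables passed to sam local invoke.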


It was a long and painful four hours, especially as I was in the middle of intensive development. I think I made several mistakes at the pivotal moments:

  1. Confirmation bias. I had once spent hours tracking down why the VPN did not work correctly in WSL (GlobalProtect supports a feature called “Split Channel” on the Windows host that automatically routes internal traffic to the VPN network), so I was more inclined to attribute the issue to WSL.

  2. Rush, rush. I skipped several due-diligence steps: I could have cut home-manager/home.nix down to the bare minimum for validation; I could have tried to undo the rollback for a more stable starting point; I could have inspected .zshrc to identify the incorrect profile being sourced; … I did none of those, and instead recklessly jumped into an action plan to see whether it would stick.

  3. Fixation on sunk cost. Since I had two workstations, I had the luxury of not fixing the development environment on the MacBook Pro and focusing on my development work instead. It was not a problem I had to fix right away.