Four hour home-manager adventure
I spent roughly four hours resolving a misconfiguration issue, and ended up rebuilding my nix environment.
It started with debugging a Lambda locally with sam. The function reads its secrets from Vault at runtime for authentication. Production uses the Vault Lambda extension to expose the Vault instance at http://127.0.0.1:8200. On my Windows workstation, I deployed a vault proxy bound to localhost:8200 in WSL, and then ran sam local invoke to trigger the Lambda.
It should have worked, but the Lambda complained that the connection was rejected. I verified that I could access the endpoint in the browser. Weird.
Spoiler alert: sam packs the Lambda function into a docker image with the specified runtime, so the function is invoked inside a docker container. Therefore, the host’s localhost should be referenced as host.docker.internal.
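To make the spoiler concrete, here is a minimal sketch of the address rewrite. The helper name to_container_addr is made up for illustration, and it assumes a Docker setup where host.docker.internal resolves to the host (as Docker Desktop provides on Windows and macOS).

```shell
# Hypothetical helper: rewrite a loopback URL so it can be reached from
# inside a container. Assumes host.docker.internal resolves to the host.
to_container_addr() {
  printf '%s\n' "$1" |
    sed -E 's#^(https?://)(localhost|127\.0\.0\.1)([:/]|$)#\1host.docker.internal\3#'
}

# Example: derive the address the containerized function should use.
VAULT_ADDR="$(to_container_addr "http://127.0.0.1:8200")"
echo "$VAULT_ADDR"
```

Only the host part is swapped; scheme, port, and path are kept, and addresses that are not loopback pass through unchanged.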
I assumed this was yet another WSL network routing issue, so I decided to give it a try on macOS.
By the way, I use nix and home-manager to manage packages and dotfiles across WSL and macOS. The installation of vault failed due to a broken package, valgrind:
error: Package ‘valgrind-3.24.0’ in /nix/store/w25dkvl0vhdhngipf8pk8lqwy60cqbwc-nixpkgs/nixpkgs/pkgs/top-level/all-packages.nix:7751 is marked as broken, refusing to evaluate.
But the package had already been removed in this change, so how could I remove an uninstalled package?
In retrospect, I wonder whether this came from nix-shell: valgrind was referenced in the llvm shell, and perhaps a recent update broke it? It is still a mystery to me.
I recalled that NixOS supports rollback. Though home-manager does not officially support this feature, home-manager generations lists the recent generations, and I could run a generation's activate script to roll back.
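The rollback described above can be sketched as follows. This assumes home-manager generations prints lines of the form "DATE TIME : id N -> /nix/store/...-home-manager-generation"; the helper name generation_path and the id 40 are only for illustration.

```shell
# Hypothetical helper: read `home-manager generations` output on stdin and
# print the store path of the generation whose id is $1.
# Assumes each line looks like: DATE TIME : id N -> /nix/store/...
generation_path() {
  awk -v id="$1" '$4 == "id" && $5 == id { print $7 }'
}

# Usage (commented out because it changes the active environment):
# path="$(home-manager generations | generation_path 40)"
# [ -n "$path" ] && "$path/activate"
```

Running the activate script of an older generation only succeeds if everything that generation references is still in the nix store.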
I tried a couple of recent generations but still hit the same error, so I decided to go deeper and rolled back to a generation built two years ago. Maybe I could use a strategy similar to git bisect to identify the issue?
No. home-manager complained about missing dependencies; I figured the packages referenced by the old generation had been garbage collected. Could I just reinstall home-manager to address this issue?
Up to this moment, I still had a viable command line. The next operation was a one-way door: once through it, I had to keep going until the issue was fixed!
I tried removing home-manager:
nix-env -e home-manager-path
The command nuked the packages installed by home-manager, including, um, the zsh shell I was using.
Any new shell session would immediately crash due to missing zsh components, such as starship.
It was already 11 pm, but I had to fix this issue to have a workable environment for the next day’s development!
Restore the nix store
First, I launched Terminal, ran New Command with /bin/bash, and then switched the shell with chsh -s /bin/bash. Finally, I had a working shell to start with.
I needed a nix environment, so I ran the official install script several times, thanks to the NixOS/nix#10892 issue on macOS 15 Sequoia. Since I had already run the migration script discussed in the thread, I needed to delete the _nixbld users to make the script happy:
dscl . list /Users UniqueID | grep -E '\b_nixbld' | cut -d' ' -f1 | xargs -L 1 -I{} sudo dscl . delete "/Users/{}"
After nix was installed, I tried to install home-manager:
nix-shell '<home-manager>' -A install
error: opening file '/nix/store/pspbxs6hsaaqdvhgwci3730fdl9wfadw-gnu-config-2024-01-01.drv': No such file or directory
It looked like the nix store was corrupted, so I tried to clean it up:
nix-collect-garbage -d
nix-store --verify --check-contents --repair
warning: $HOME ('/Users/kun.xi') is not owned by you, falling back to the one defined in the 'passwd' file ('/var/root')
reading the Nix store...
checking path existence...
path '/nix/store/00mgj7rvgyl1qpq10lx26rnv0gmzp91y-python3.12-pytest-httpbin-2.0.0.drv' disappeared, but it still has valid referrers!
warning: cannot repair path '/nix/store/00mgj7rvgyl1qpq10lx26rnv0gmzp91y-python3.12-pytest-httpbin-2.0.0.drv'
path '/nix/store/00rcwjk7ykxyyycb4y0za688070nvpg7-lutok-0.4.drv' disappeared, but it still has valid referrers!
... ...
I retried installing home-manager and got the following errors:
nix-shell '<home-manager>' -A install
this derivation will be built:
/nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv
building '/nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv'...
/nix/store/vj1c3wf9c11a0qs6p3ymfvrnsdgsdcbq-source-stdenv.sh: line 3: /nix/store/shkw4qm9qcw5sc5n1k5jznc83ny02r39-default-builder.sh: No such file or directory
error: Cannot build '/nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv'.
Reason: builder failed with exit code 1.
Output paths:
/nix/store/5a4vls80b9jd4z6bfkdim8ww90ljwxqs-home-manager
Last 1 log lines:
> /nix/store/vj1c3wf9c11a0qs6p3ymfvrnsdgsdcbq-source-stdenv.sh: line 3: /nix/store/shkw4qm9qcw5sc5n1k5jznc83ny02r39-default-builder.sh: No such file or directory
For full logs, run:
nix-store -l /nix/store/5fjaqcd5q7lskiq1a7ssr4wa3rip981k-home-manager.drv
This revealed that some fundamental build blocks were missing. Reinstalling nix would NOT bring the system back to a stable state.
In retrospect, I think I might still have been referencing the old profile, which was incompatible with the latest nix-2.29.0:
ls -al ~/.nix-profile/etc/profile.d
lr-xr-xr-x 1 root nixbld 68 Dec 31 1969 /Users/kun.xi/.nix-profile/etc/profile.d -> /nix/store/1dc38w5wn3z19yjy3jal6s4grgv7rzba-nix-2.13.3/etc/profile.d
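A quick way to spot this kind of mismatch is to pull the nix version out of the store path the profile points at. The sketch below assumes the usual /nix/store/HASH-nix-VERSION layout with a 32-character hash; nix_version_from_path is a made-up helper name.

```shell
# Hypothetical helper: extract the nix version from a store path such as
# /nix/store/<32-char hash>-nix-2.13.3/etc/profile.d
# Prints nothing if the path does not follow that layout.
nix_version_from_path() {
  printf '%s\n' "$1" |
    sed -nE 's#^/nix/store/[a-z0-9]{32}-nix-([0-9][0-9.]*)(/.*)?$#\1#p'
}

# Usage: compare the profile's nix with the one on PATH.
# nix_version_from_path "$(readlink ~/.nix-profile/etc/profile.d)"
# nix --version
```

If the two versions differ, the shell is probably still sourcing a profile left over from an earlier installation.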
Rebuild everything
The attempt to salvage the working environment surgically had failed; the only option was to nuke and rebuild everything.
I followed the uninstall manual to fully remove nix, then reinstalled nix for the nth time, reinstalled home-manager, and got exactly the same error I had encountered about three hours earlier. #wtf ?
I realized that the root cause was still my home-manager config, because home-manager built successfully with the default home-manager/home.nix. I removed unused packages and shells vigorously, as in this PR, and it finally worked.
Then I encountered exactly the same connection rejected issue on macOS. Only then did I realize that the localhost bind worked across WSL and the host, but not in the docker container started by sam! With the updated VAULT_ADDR=http://host.docker.internal, it worked on both Windows and macOS.
It was a long and painful four hours, especially since I was in the middle of intensive development. I think I made several mistakes at the pivotal moments:
- Confirmation bias. I once spent hours tracking down why the VPN did not work correctly in WSL (GlobalProtect supports a feature called “Split Channel” on the Windows host to automatically route internal traffic to the VPN network), so I was more inclined to attribute the issue to WSL.
- Rush, rush. There were several due-diligence tasks I skipped: I could have cut home-manager/home.nix down to the bare minimum for validation; I could have tried to undo the rollback for a more stable starting point; I could have inspected the .zshrc to identify the incorrect profile being sourced; … I did none of those, but recklessly jumped to an action plan to see whether it stuck.
- Fixated on the sunk cost. As I had two workstations, I had the luxury of not fixing the development environment on the MacBook Pro and focusing on my development instead. It was not a problem I had to fix right away.