Serve Troubleshooting
This document covers problems that appear when hermit serve --adapter feishu has been running for a long time, especially when:
- the service appears to have “died unexpectedly”
- Feishu still shows the last reply, but later messages get no response
- the menubar is still running while serve itself is gone
- the logs have no traceback and only a stale PID remains
Start with Three Perspectives
1. Process Perspective: Is the service still alive?
Use the environment control script first:
make env-status ENV=dev
Focus on:
- whether [service] is empty
- whether PID_FILE= still contains an old PID
- whether [menubar] is still running
Common patterns:
- service is empty but menubar is still running: the control layer is still alive, but the serve process has already exited
- PID_FILE has a value but the process does not exist: that is a stale PID, not a live process
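The stale-PID pattern can also be checked mechanically. The following is a minimal sketch for POSIX systems, not part of Hermit: pid_is_alive and classify_pid_file are hypothetical helper names, and the input is assumed to be whatever value make env-status prints after PID_FILE=.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process: the PID is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify_pid_file(text: str) -> str:
    """Classify PID file contents as 'empty', 'invalid', 'live', or 'stale'."""
    text = text.strip()
    if not text:
        return "empty"
    try:
        pid = int(text)
    except ValueError:
        return "invalid"
    return "live" if pid_is_alive(pid) else "stale"
```

A result of live only means that some process owns that PID. After a long outage the operating system may have reused the number, so confirm that the process is really serve before trusting it.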
2. Lifecycle Perspective: What caused the last exit?
From this version onward, serve writes its current state and the last exit reason to:
- ~/.hermit/logs/serve-feishu-status.json
- ~/.hermit/logs/serve-feishu-exit-history.jsonl
For dev / test, the corresponding files are:
- ~/.hermit-dev/logs/serve-feishu-status.json
- ~/.hermit-dev/logs/serve-feishu-exit-history.jsonl
Start with:
cat ~/.hermit-dev/logs/serve-feishu-status.json
Common fields:
- phase: starting / running / reloading / stopped / crashed
- reason: startup / signal / adapter_stopped / exception
- signal: for example SIGTERM / SIGHUP / SIGINT
- detail: human-readable exit explanation
- exception_type / exception_message / traceback: written for unhandled exceptions
Notes:
- SIGTERM, SIGHUP, and SIGINT are now recorded
- SIGKILL cannot be caught by the process, so a SIGKILL case usually leaves only a stale PID, not a graceful exit record
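To read these files without scanning raw JSON by eye, a small helper can reduce a record to one line. This is a sketch under assumptions: summarize_status and parse_exit_history are hypothetical names, only the fields listed above are used, and each line of the exit history file is assumed to be one JSON object with the same fields as the status file.

```python
import json


def summarize_status(status: dict) -> str:
    """Turn a status record into a one-line explanation of the last exit."""
    phase = status.get("phase", "unknown")
    reason = status.get("reason", "unknown")
    if reason == "signal":
        return f"{phase}: received {status.get('signal', 'an unrecorded signal')}"
    if reason == "exception":
        exc_type = status.get("exception_type", "unknown exception")
        return f"{phase}: {exc_type}: {status.get('exception_message', '')}"
    detail = status.get("detail")
    return f"{phase}: {reason}" + (f" ({detail})" if detail else "")


def parse_exit_history(text: str, limit: int = 5) -> list:
    """Parse JSON Lines text and return the last `limit` records, oldest first."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return records[-limit:]
```

To use it, read serve-feishu-status.json, pass the result of json.loads to summarize_status, and pass the text of serve-feishu-exit-history.jsonl to parse_exit_history. A record that still says running while no serve process exists is consistent with the SIGKILL case described above.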
3. Business Perspective: Did a task fail, or did the service die?
If Feishu shows that “the last message was sent successfully,” do not immediately assume the business logic failed. Check the kernel ledger first:
sqlite3 -header -column ~/.hermit-dev/kernel/state.db \
"select event_type, actor, datetime(occurred_at,'unixepoch','localtime') as occurred_local, substr(payload_json,1,220) as payload
from events
order by occurred_at desc
limit 40;"
If you see:
- approval.granted
- receipt.issued
- task.completed
then the task path itself succeeded.
If there are no new task.created / step.started events after that point, and later Feishu messages never enter the kernel, the usual explanation is that the serve host process died, not that a single task failed.
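The same reasoning can be expressed as a query over the ledger. The sketch below assumes only the events columns used in the sqlite3 command above (event_type, occurred_at); activity_after_last_completion is a hypothetical helper name, and the verdict strings are illustrative.

```python
import sqlite3


def activity_after_last_completion(conn: sqlite3.Connection) -> dict:
    """Count task.created / step.started events newer than the last task.completed."""
    (last_done,) = conn.execute(
        "SELECT MAX(occurred_at) FROM events WHERE event_type = 'task.completed'"
    ).fetchone()
    if last_done is None:
        return {
            "last_completed_at": None,
            "new_work_events": 0,
            "verdict": "no completed task found",
        }
    (new_work,) = conn.execute(
        "SELECT COUNT(*) FROM events "
        "WHERE occurred_at > ? AND event_type IN ('task.created', 'step.started')",
        (last_done,),
    ).fetchone()
    if new_work:
        verdict = "kernel is still receiving work"
    else:
        verdict = "no new work reached the kernel; suspect the serve process"
    return {
        "last_completed_at": last_done,
        "new_work_events": new_work,
        "verdict": verdict,
    }
```

Open the connection with sqlite3.connect on the full path to kernel/state.db. A count of zero is only meaningful if messages were actually sent in Feishu after the last completion.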
Work Backward from “the Last Reply”
This is the most useful path for issues like “it replied at 14:03 and then died.”
1. Find the last conversation
sqlite3 -header -column ~/.hermit-dev/kernel/state.db \
"select conversation_id, status, datetime(updated_at,'unixepoch','localtime') as updated_local,
total_input_tokens, total_output_tokens
from conversations
order by updated_at desc
limit 10;"
2. Inspect that conversation’s messages and tasks
sqlite3 -line ~/.hermit-dev/kernel/state.db \
"select id, role, created_at, content_json
from conversation_messages
where conversation_id='your conversation_id'
order by id;"
sqlite3 -header -column ~/.hermit-dev/kernel/state.db \
"select task_id, status, datetime(created_at,'unixepoch','localtime') as created_local,
datetime(updated_at,'unixepoch','localtime') as updated_local, title
from tasks
where conversation_id='your conversation_id'
order by updated_at desc;"
3. Check whether the task actually completed
sqlite3 -header -column ~/.hermit-dev/kernel/state.db \
"select task_id, event_type, actor,
datetime(occurred_at,'unixepoch','localtime') as occurred_local,
substr(payload_json,1,220) as payload
from events
where task_id='your task_id'
order by occurred_at;"
If the end of the event stream includes task.completed, that means:
- the task itself finished
- “the service disappeared after the last reply” is more likely a process lifecycle problem than a task logic failure
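The three lookups in this section can be chained into one pass. This is a sketch rather than a Hermit API: trace_last_conversation is a hypothetical name, and it relies only on the tables and columns that appear in the queries above (conversations, tasks, events).

```python
import sqlite3


def trace_last_conversation(conn: sqlite3.Connection) -> dict:
    """Walk from the most recently updated conversation to its tasks' events."""
    row = conn.execute(
        "SELECT conversation_id FROM conversations ORDER BY updated_at DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return {"conversation_id": None, "tasks": []}
    conversation_id = row[0]
    task_rows = conn.execute(
        "SELECT task_id, status FROM tasks "
        "WHERE conversation_id = ? ORDER BY updated_at DESC",
        (conversation_id,),
    ).fetchall()
    tasks = []
    for task_id, status in task_rows:
        last_event = conn.execute(
            "SELECT event_type FROM events "
            "WHERE task_id = ? ORDER BY occurred_at DESC LIMIT 1",
            (task_id,),
        ).fetchone()
        completed = conn.execute(
            "SELECT 1 FROM events "
            "WHERE task_id = ? AND event_type = 'task.completed' LIMIT 1",
            (task_id,),
        ).fetchone()
        tasks.append({
            "task_id": task_id,
            "status": status,
            "last_event": last_event[0] if last_event else None,
            "completed": completed is not None,
        })
    return {"conversation_id": conversation_id, "tasks": tasks}
```

If every task of the last conversation reports completed as True, the evidence points at the process lifecycle rather than at task logic.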
Check What the Tools Actually Did
If the task involved approvals, shell commands, or file writes, do not guess. Check the receipts and artifacts directly.
1. Check receipts
sqlite3 -header -column ~/.hermit-dev/kernel/state.db \
"select receipt_id, action_type, result_summary,
datetime(created_at,'unixepoch','localtime') as created_local
from receipts
where task_id='your task_id'
order by created_at;"
2. Check the actual command input / output
sqlite3 -header -column ~/.hermit-dev/kernel/state.db \
"select artifact_id, kind, uri,
datetime(created_at,'unixepoch','localtime') as created_local
from artifacts
where task_id='your task_id'
order by created_at;"
Then open the corresponding files directly:
- tool_input: the actual tool input
- tool_output: the actual stdout / stderr / returncode
- approval_packet: the command preview shown to the user during approval
This lets you clearly distinguish:
- whether only read-only commands were executed
- whether files were actually deleted
- whether an approval mismatch triggered a new approval request
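Grouping a task's artifacts by kind makes these distinctions easier to see. The helper below is a sketch: artifacts_by_kind is a hypothetical name, and it uses only the artifacts columns from the query above (task_id, kind, uri, created_at).

```python
import sqlite3
from collections import defaultdict


def artifacts_by_kind(conn: sqlite3.Connection, task_id: str) -> dict:
    """Group a task's artifact URIs by kind, oldest first within each kind."""
    grouped = defaultdict(list)
    rows = conn.execute(
        "SELECT kind, uri FROM artifacts WHERE task_id = ? ORDER BY created_at",
        (task_id,),
    )
    for kind, uri in rows:
        grouped[kind].append(uri)
    return dict(grouped)
```

Reading more than one approval_packet entry for a single task as a repeated approval request is an assumption, not something the ledger states directly; open the files behind the URIs to confirm.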
How to Read the Logs
serve-related logs are usually in:
- ~/.hermit-dev/logs/dev-restart-service.out
- ~/.hermit-dev/logs/serve-feishu-status.json
- ~/.hermit-dev/logs/serve-feishu-exit-history.jsonl
menubar-related logs are usually in:
- ~/.hermit-dev/logs/companion.log
- ~/.hermit-dev/logs/feishu-menubar-stdout.log
- ~/.hermit-dev/logs/feishu-menubar-stderr.log
Notes:
- historically, dev-restart-service.out could be affected by block buffering under redirected output, so “nothing appeared in the last few minutes” was not always trustworthy
- serve now forces unbuffered stdout/stderr at startup, so the logs are more reliable for diagnosis
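How serve disables buffering is not shown here. As an illustration of the underlying problem, the sketch below shows one common way to stop a Python process from holding redirected output in memory; reduce_buffering and unbuffer_standard_streams are hypothetical names, and this is not necessarily what Hermit does.

```python
import io
import sys


def reduce_buffering(stream: io.TextIOWrapper) -> None:
    """Hand each write to the binary buffer at once and flush it at every newline."""
    stream.reconfigure(line_buffering=True, write_through=True)


def unbuffer_standard_streams() -> None:
    """Apply reduce_buffering to stdout and stderr when they support it."""
    for stream in (sys.stdout, sys.stderr):
        if isinstance(stream, io.TextIOWrapper):  # skip replaced or captured streams
            reduce_buffering(stream)
```

Starting the interpreter with python -u or setting PYTHONUNBUFFERED=1 has a similar effect without code changes. Without one of these, a process whose stdout is redirected to a file is block-buffered, so the last lines written before a hard kill may never reach the log.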
The Most Useful Decision Path for This Class of Issue
If you hit a problem similar to this one, use this order:
- make env-status ENV=dev
- inspect serve-feishu-status.json
- inspect the latest events in kernel/state.db
- inspect the matching task’s receipts and artifacts
- only then go back to dev-restart-service.out
The reason is simple:
- status.json tells you how the process died
- kernel tells you whether the last task actually completed
- artifacts tell you what the tools really did
- plain stdout logs are only supporting evidence, not the single source of truth
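The decision path can be condensed into a small function that combines the three sources of evidence. This is a heuristic sketch, not Hermit behavior: diagnose is a hypothetical name, its inputs are assumed to come from the process check, the status file, and the kernel ledger, and the wording of the results is illustrative.

```python
from typing import Optional


def diagnose(process_alive: bool, status: Optional[dict], last_task_completed: bool) -> str:
    """Combine process, lifecycle, and kernel evidence into one conclusion."""
    if process_alive:
        return "serve is running; look at the task's receipts and artifacts instead"
    if status is None:
        return "serve is gone and left no status file; check system-level traces"
    reason = status.get("reason")
    phase = status.get("phase")
    if reason == "signal":
        cause = f"serve exited on {status.get('signal', 'a signal')}"
    elif reason == "exception":
        cause = f"serve crashed with {status.get('exception_type', 'an exception')}"
    elif phase in ("starting", "running", "reloading"):
        # The file still describes a live phase, so no graceful exit was recorded.
        cause = "serve vanished without an exit record; SIGKILL is a likely cause"
    else:
        cause = f"serve stopped ({reason})"
    if last_task_completed:
        scope = "the last task had completed"
    else:
        scope = "the last task did not complete"
    return f"{cause}; {scope}"
```

For example, a dead process whose status file still says phase running, combined with a completed last task, yields the SIGKILL hypothesis together with confirmation that task logic was not at fault.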
Known Boundaries
- SIGTERM / SIGHUP / SIGINT: now recorded
- unhandled Python exceptions: now record traceback
- SIGKILL: cannot be caught gracefully; you can only infer it from stale PIDs, missing kernel events, and system-level traces
- if the service was started temporarily by an external host and later reclaimed by that host, Hermit can only record the signal it received itself; it cannot always know who sent it
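The boundary between catchable and uncatchable signals follows from how signal handlers work. The sketch below shows the general pattern in Python; it is not Hermit's implementation. RECORDED_SIGNALS, build_exit_record, and install_recorders are hypothetical names, and pairing phase stopped with reason signal is an assumption based on the field values listed earlier.

```python
import signal

# Signals that a handler can observe. SIGKILL is deliberately absent:
# the OS never delivers it to a handler, and
# signal.signal(signal.SIGKILL, ...) raises OSError.
RECORDED_SIGNALS = ("SIGTERM", "SIGHUP", "SIGINT")


def build_exit_record(signum: int) -> dict:
    """Build the record a handler could persist before the process exits."""
    return {
        "phase": "stopped",
        "reason": "signal",
        "signal": signal.Signals(signum).name,
    }


def install_recorders(write_record) -> None:
    """Register handlers that pass an exit record to write_record, then exit."""

    def handler(signum, frame):
        write_record(build_exit_record(signum))
        raise SystemExit(128 + signum)  # conventional status for exit on a signal

    for name in RECORDED_SIGNALS:
        if hasattr(signal, name):  # SIGHUP does not exist on every platform
            signal.signal(getattr(signal, name), handler)
```

install_recorders must be called from the main thread, which is the only place Python allows signal handlers to be registered. The handler receives only the signal number, which is why the sender cannot be identified from inside the process.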