dev #1

Merged
tom merged 174 commits from dev into main 2025-09-05 10:18:39 +01:00
No description provided.
tom self-assigned this 2025-09-05 10:18:19 +01:00
tom added 174 commits 2025-09-05 10:18:19 +01:00
Inference build
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 1h14m59s
471d8ce919
Forgejo build now
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
b71cb205c7
Bump dependencies
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
42e23dc098
Implement audio filters
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
8789088c2c
Rename to neuromancer
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 17s
b3a0eb7d4b
Chatterbox GGUF docs
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 37s
502cd7e940
GGUF implementation
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 15s
9d0a24c8c3
Refactor audio service
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 18s
c991a75b4e
Linting
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 1h10m19s
fe363859c8
Tweak readme
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
fdc7a3cb62
UK docstrings
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
154171df44
Bump dependencies
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
3022523a28
Docstring improvements
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
5df927115a
Tidy audio service
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
4eb4646865
Doc on how OpenAPI MCP works
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
7d38bca128
Improve OpenAPI doc
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
2b50de6dda
OpenAPI MCP client
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 1h14m50s
00dddbd55a
Update config defaults
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
78f23ce67d
Agent calling
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
4882771a76
Readme tweaks
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
33e3ff5811
Build test
All checks were successful
Build and Publish Docker Image / build-and-push (push) Successful in 53m40s
31dbc50f6d
Test config
All checks were successful
Build and Publish Docker Image / build-and-push (push) Successful in 19m29s
81fd809e95
Implement Kyutai audio pipeline
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 1h12m39s
0a602e8a27
Linting
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
6d22390aba
Improve docstrings
Some checks failed
Build and Publish Docker Image / build-and-push (push) Has been cancelled
924724ba9f
Retire audio module
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 18s
067e6e80ac
Tidy
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 18s
d4f3b95e45
Model manager
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 16s
42f497de3c
New tools module inside models
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 17s
a9d0163f6c
Refactor API ingress
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 17s
482177b0f0
Refactor voice module
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 22s
071875c5ac
Update module docstrings
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 14s
5deab28adf
Update realtime progress
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 16s
1091a7d30c
Remove extra dependencies
Some checks failed
Build and Publish Docker Image / build-and-push (push) Failing after 26s
dfcc4aee77
Healthcheck and uvicorn
All checks were successful
Build and Publish Docker Image / build-and-push (push) Successful in 20s
0d9e0b8fc0
Clean up build
Some checks failed
Docker Build / Lint (push) Failing after 13s
Docker Build / Build and Push (push) Failing after 0s
Build and Publish Docker Image / build-and-push (push) Successful in 1m3s
93f9835549
Health fixes
All checks were successful
Docker Build and Publish / Lint (push) Successful in 16s
Docker Build and Publish / Build and Push (push) Successful in 1m4s
f9db3870de
Support API keys
All checks were successful
Docker Build and Publish / Lint (push) Successful in 28s
Docker Build and Publish / Build and Push (push) Successful in 57s
ec194d0319
Cleanup old containers
All checks were successful
Docker Build and Publish / Lint (push) Successful in 12s
Docker Build and Publish / Build and Push (push) Successful in 59s
7bd8a54110
Switch to Llama.cpp
All checks were successful
Docker Build and Publish / Lint (push) Successful in 24s
Docker Build and Publish / Build and Push (push) Successful in 56s
903b14e94d
Use new Moshi STT/TTS
All checks were successful
Docker Build and Publish / Lint (push) Successful in 17s
Docker Build and Publish / Build and Push (push) Successful in 1m3s
e81822a2ca
Test model paths
All checks were successful
Docker Build and Publish / Lint (push) Successful in 14s
Docker Build and Publish / Build and Push (push) Successful in 1m4s
c4ffb7c69f
Test voice
All checks were successful
Docker Build and Publish / Lint (push) Successful in 17s
Docker Build and Publish / Build and Push (push) Successful in 58s
bbcf2a67da
Remove vLLM and stream responses
All checks were successful
Docker Build and Publish / Lint (push) Successful in 1m55s
Docker Build and Publish / Build and Push (push) Successful in 1m4s
ddbec62825
Tool/streaming fixes
All checks were successful
Docker Build and Publish / Lint (push) Successful in 13s
Docker Build and Publish / Build and Push (push) Successful in 1m2s
b886d73603
Testing
All checks were successful
Docker Build and Publish / Lint (push) Successful in 11s
Docker Build and Publish / Build and Push (push) Successful in 1m4s
2eec3c1e56
Refactor tool calling
All checks were successful
Docker Build and Publish / Lint (push) Successful in 12s
Docker Build and Publish / Build and Push (push) Successful in 56s
1ceacef42f
Voice fixes
All checks were successful
Docker Build and Publish / Lint (push) Successful in 19s
Docker Build and Publish / Build and Push (push) Successful in 1m24s
c3bd60fa36
Enable mypy for type-checking
Some checks failed
Docker Build and Publish / Lint (push) Failing after 18s
Docker Build and Publish / Build and Push (push) Failing after 0s
8ef8313260
Refactor
All checks were successful
Docker Build and Publish / Lint (push) Successful in 14s
Docker Build and Publish / Build and Push (push) Successful in 1m1s
ec4bdae2e8
Voice and OIDC endpoints
All checks were successful
Docker Build and Publish / Lint (push) Successful in 16s
Docker Build and Publish / Build and Push (push) Successful in 56s
8791d2339e
Better tests
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 21s
Docker Build and Publish / Build and Push (push) Successful in 1m51s
4a59f8e680
Clarify auth requirements in README
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 25s
Docker Build and Publish / Build and Push (push) Successful in 1m29s
4a3d5abd43
Tool calling fixes
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 17s
Docker Build and Publish / Build and Push (push) Successful in 1m36s
c512ada482
Shared directory
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 17s
Docker Build and Publish / Build and Push (push) Successful in 1m30s
60d61576c3
Central Tool Manager
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 18s
Docker Build and Publish / Build and Push (push) Successful in 1m33s
a2f2da906d
Tidy up codebase
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 19s
Docker Build and Publish / Build and Push (push) Successful in 1m40s
6974203066
Expand pytests
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 26s
Docker Build and Publish / Build and Push (push) Successful in 1m32s
54378b7bbb
Ensure tools are handled
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 28s
Docker Build and Publish / Build and Push (push) Successful in 1m38s
a544291e1a
Reorganise tool calling
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 25s
Docker Build and Publish / Build and Push (push) Failing after 0s
24ce822d79
Doc/config updates
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 27s
Docker Build and Publish / Build and Push (push) Failing after 0s
70d5621c2b
Refactor llama.cpp client
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 30s
Docker Build and Publish / Build and Push (push) Successful in 1m33s
62343ca385
Standardise logging
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 31s
Docker Build and Publish / Build and Push (push) Successful in 1m37s
2e6b84fe0f
Remove unnecessary JSON indents
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 31s
Docker Build and Publish / Build and Push (push) Successful in 1m39s
3b797188e2
refactor: centralise Pydantic models into unified module
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 20s
Docker Build and Publish / Build and Push (push) Failing after 0s
3bbd0bcc8d
- Create models package with organised submodules for messages, chat, audio, tools
- Replace parallel API/handler model hierarchies with single source of truth
- Fix missing role field in message serialisation for llama.cpp
Accommodate refactor
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 15s
Docker Build and Publish / Build and Push (push) Failing after 0s
46373aa711
More pytest/linting fixes
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 14s
Docker Build and Publish / Build and Push (push) Failing after 0s
b3014d709f
fix: ensure role field is always included in message dictionaries for llama.cpp
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 19s
Docker Build and Publish / Build and Push (push) Failing after 0s
74de57323a
The BaseMessage.to_dict() method was not consistently including the role field
when converting messages to dictionaries, causing llama.cpp to reject messages
with 'Missing role in message' errors. This fix ensures the role field is
always present in the output dictionary.
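A minimal sketch of the fix this commit describes, assuming a Pydantic-style base model (the real BaseMessage is not shown in this log; names beyond role and to_dict are illustrative):

```python
from pydantic import BaseModel


class BaseMessage(BaseModel):
    role: str
    content: str | None = None

    def to_dict(self) -> dict:
        # Serialise as usual, then guarantee the role survives even when
        # subclass overrides or exclude_none filtering would drop it;
        # llama.cpp rejects any message with a missing role.
        data = self.model_dump(exclude_none=True)
        data["role"] = self.role
        return data
```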
fix: resolve CI linting errors and maintain runtime imports for Pydantic
Some checks failed
Docker Build and Publish / Lint & Test (push) Successful in 29s
Docker Build and Publish / Build and Push (push) Failing after 0s
a9a67467d6
- Add noqa directives for legitimate complexity and circular import cases
- Fix TRY300 by using proper try/except/else structure in discovery.py
- Remove TODO with fake GitHub issue link in chat_integration.py
- Add required runtime imports for Pydantic model_rebuild() with TC001/F401 suppressions
- Add PLC0415 suppressions for imports inside functions to avoid circular dependencies
fix: replace gitea context with github context for Forgejo Actions compatibility
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 4m10s
408224661e
feat: add logging for complete streaming responses
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 29s
Docker Build and Publish / Build and Push (push) Successful in 1m35s
cd11698c46
- Log accumulated response content after streaming completes
- Truncate long responses to 500 chars for readability
- Include tool call summary if present
- Remove redundant import after ruff auto-fix
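The truncation the first bullet describes reduces to something like this sketch (the 500-character limit is from the commit; function and parameter names are invented):

```python
import logging

logger = logging.getLogger(__name__)


def log_streamed_response(content: str, tool_names: list[str]) -> None:
    # Truncate long responses so the final log line stays readable.
    preview = content if len(content) <= 500 else content[:500] + "..."
    logger.info("Assistant response: %s", preview)
    if tool_names:
        logger.info("Tool calls: %s", ", ".join(tool_names))
```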
fix: preserve anyOf nullable patterns in tool schemas
Some checks failed
Docker Build and Publish / Lint & Test (push) Failing after 15s
Docker Build and Publish / Build and Push (push) Failing after 0s
aa2e452837
Stop recursively processing items inside anyOf/oneOf/allOf composition patterns
to prevent corrupting valid OpenAPI schemas. The previous logic was incorrectly
transforming {type: "null"} to {type: "string"} inside anyOf patterns, breaking
nullable field handling.

Changes:
- Skip recursive processing of composition pattern items
- Don't add spurious type fields to schemas using composition
- Preserve original OpenAPI schema structure for llama.cpp
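A sketch of the skip logic these changes describe, assuming schemas are plain dicts (helper name is invented):

```python
COMPOSITION_KEYS = ("anyOf", "oneOf", "allOf")


def normalise(schema: dict) -> dict:
    # Leave composition patterns untouched: rewriting their items (e.g.
    # {"type": "null"} -> {"type": "string"}) breaks nullable handling.
    if any(key in schema for key in COMPOSITION_KEYS):
        return schema
    out = dict(schema)
    if isinstance(out.get("properties"), dict):
        out["properties"] = {k: normalise(v) for k, v in out["properties"].items()}
    return out
```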
style: fix line length for ruff linting
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 25s
Docker Build and Publish / Build and Push (push) Successful in 1m38s
038546e9f0
debug: add detailed logging for response accumulation
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 29s
Docker Build and Publish / Build and Push (push) Successful in 1m38s
f9b06b3c82
Add debug and warning logs to understand why response content isn't being
logged at stream completion. This will help diagnose whether messages are
being accumulated properly.
chore: add pre-commit hooks for quality checks
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 26s
Docker Build and Publish / Build and Push (push) Successful in 1m39s
4d2b0b45c5
Configure prek to run pytest, mypy, ruff check, and ruff format
before each commit to ensure code quality standards are met.
fix: properly collapse anyOf nullable patterns for OpenAI format
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 1m34s
4d9736377e
Convert OpenAPI-style anyOf patterns with null types into simple types
for OpenAI tool format compatibility. OpenAI uses the 'required' array
to indicate optionality, not anyOf with null. This prevents llama.cpp
crashes from invalid schema patterns.
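A minimal sketch of the collapse described above (collapse_anyof_nullable is named in a later commit; the exact handling of multi-variant unions is assumed):

```python
def collapse_anyof_nullable(schema: dict) -> dict:
    # {"anyOf": [{"type": "string"}, {"type": "null"}]} -> {"type": "string"}.
    # OpenAI signals optionality through the "required" array, and llama.cpp
    # crashes on the anyOf-with-null pattern, so collapse to the single
    # non-null variant when that is unambiguous.
    variants = schema.get("anyOf", [])
    non_null = [v for v in variants if v.get("type") != "null"]
    if variants and len(non_null) == 1:
        collapsed = {k: v for k, v in schema.items() if k != "anyOf"}
        collapsed.update(non_null[0])
        return collapsed
    return schema
```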
fix: add debug logging to track assistant response accumulation
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 1m36s
6f5c17c32c
- Add logging in ChatChunk.from_json_data to track parsed content
- Add logging in StreamingState.accumulate_chunk to track merging
- Add raw delta logging in parse_chunk to see what llama.cpp sends
- Add detailed debugging in stream completion handler to diagnose why responses aren't logged
- Add noqa comments for unavoidable linting issues
fix: reduce debug logging verbosity for assistant responses
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 26s
Docker Build and Publish / Build and Push (push) Successful in 1m27s
dbbb353d5d
- Remove per-delta logging (too spammy for short chunks)
- Remove per-chunk parsing logs
- Remove per-merge logging
- Keep only the final accumulated response logging
- Add better debugging for cases where no content is accumulated
fix: implement strict OpenAI tool schema validation
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 34s
Docker Build and Publish / Build and Push (push) Successful in 1m38s
dc744c55ad
- Create Pydantic models based on OpenAI SDK type definitions
- Separate validation from transformation with helper functions
- Add collapse_anyof_nullable to handle OpenAPI/Pydantic patterns
- Add filter_to_basic_json_schema for llama.cpp compatibility
- Ensure clean, valid tool schemas for llama.cpp backend
fix: include reasoning_content in ChatChunk content accumulation
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 39s
Docker Build and Publish / Build and Push (push) Successful in 1m41s
266dbaaa10
The critical issue was that reasoning_content (thinking) from llama.cpp
wasn't being accumulated into messages, causing:
- 'No content accumulated' warnings
- Tool calls embedded in thinking never being extracted
- Tool execution failing completely

Now both content and reasoning_content are combined and accumulated,
allowing proper tool call detection and execution.
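The combined accumulation amounts to something like this sketch (delta here is the raw JSON delta from llama.cpp; the buffer-based shape is an assumption):

```python
def accumulate_delta(delta: dict, parts: list[str]) -> None:
    # DeepSeek R1 emits thinking as reasoning_content rather than content;
    # gather both, or tool calls embedded in the thinking are never seen.
    for field in ("content", "reasoning_content"):
        piece = delta.get(field)
        if piece:
            parts.append(piece)
```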
fix: add missing Any import and fix line length
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 32s
Docker Build and Publish / Build and Push (push) Successful in 1m42s
a956c5527c
feat: dual-parser architecture for tool call handling
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 28s
Docker Build and Publish / Build and Push (push) Successful in 1m33s
d97f4c9024
Implements clean separation of concerns using two StreamParsers:
- Container parser tracks tool_calls_begin/end boundaries
- Tool call parser extracts individual calls
- Immediate execution for standalone calls (low latency)
- Batch execution for calls inside containers

This provides optimal performance using StreamParser while maintaining
clean, modular code that handles both single and multiple tool calls.
fix: handle edge cases and add error handling
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 1m24s
4e50deffd0
- Execute tool calls from unclosed containers when stream ends
- Validate JSON syntax before accepting tool calls
- Better error logging for malformed tool calls
- Include content preview in debug logs for failed extractions
- Refactor to reduce nesting depth in stream completion handler
debug: add strategic logging to trace content accumulation
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 26s
Docker Build and Publish / Build and Push (push) Successful in 1m37s
f2bea55299
Add debug logging to understand why content isn't being accumulated:
- Log when chunks are skipped or merged
- Show what content is being merged
- Track reasoning content extraction
- Log final accumulated state

This will help identify where the content flow is breaking.
fix: improve content accumulation in streaming messages
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 1m24s
40501186e5
- Fix merge logic to handle empty string initialization
- Add detailed logging to trace accumulation issues
- Handle both None and empty string cases in content merging
feat: implement stream halting for immediate tool execution
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 26s
Docker Build and Publish / Build and Push (push) Successful in 1m39s
d05cb96999
- Add detection for complete tool calls (single and batch)
- Halt stream immediately when tool calls are ready
- Execute tools without waiting for model to finish rambling
- Prevents model from closing thinking tags prematurely
refactor: simplify ToolCall model with response field
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 1m34s
25550674d4
- Add public response field to ToolCall (None = unanswered)
- Simplify is_answered property to check response field
- Prepare for cleaner tool response handling in streaming
- Add process_tool_calls method to AssistantMessage
- Support both structured and extracted tool call modes
- Tool responses stored in ToolCall.response field
- AssistantMessage handles formatting and appending outputs
- Extracted tools append formatted outputs to content
- Structured tools return ToolMessage objects
- Break long lines for mode assignment and logging statements
- Extract tool count calculation to separate variable
- Auto-format with ruff for consistency
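A sketch of the simplified ToolCall shape the first three bullets describe, assuming Pydantic (fields beyond response and is_answered are guesses):

```python
from pydantic import BaseModel


class ToolCall(BaseModel):
    id: str
    name: str
    arguments: str = "{}"
    # None means the call has not been answered yet.
    response: str | None = None

    @property
    def is_answered(self) -> bool:
        return self.response is not None
```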
feat: integrate ConversationHandler architecture into model_handler
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 31s
Docker Build and Publish / Build and Push (push) Successful in 1m33s
a4dbebc71a
Replace LlamaCppStreamingProcessor with ConversationHandler to enable
transparent multi-cycle completion handling for tool execution. The new
architecture orchestrates client streams across multiple backend requests,
executing tools seamlessly without exposing implementation details to clients.

Key changes:
- Replace processor with ConversationHandler in model_handler.py
- Implement proper tool execution using ToolExecutor in conversation.py
- Fix type compatibility with ToolRegistryAdapter pattern
- Make tool marker detection static in stream_turn.py
- Remove obsolete streaming methods from model_handler.py
fix: remove duplicate /v1 from backend URL path
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 29s
Docker Build and Publish / Build and Push (push) Successful in 1m27s
3bf398112f
fix: add tool injection and correct add_generation_prompt in new streaming flow
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 32s
Docker Build and Publish / Build and Push (push) Successful in 1m35s
a3d294b7e2
- Pass tools from model_handler through ConversationHandler to CompletionHandler
- Set add_generation_prompt correctly: true for initial requests, false for continuations
- Tools are now properly included in backend requests
- Fixes issue where models couldn't see available tools
fix: handle reasoning_content in ChatChunk for DeepSeek R1 thinking mode
All checks were successful
Docker Build and Publish / Lint & Test (push) Successful in 27s
Docker Build and Publish / Build and Push (push) Successful in 1m26s
90c5d873d7
DeepSeek R1 sends reasoning_content instead of content in deltas when
in thinking mode. Update ChatChunk.from_json_data to check for both
fields and use whichever is present.
ci: optimise Docker workflow and build caching
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 58s
0672fe245d
Consolidate CI pipeline to reduce runtime and improve build cache efficiency.

Switch to persistent BuildKit container at tcp://buildkit:8125 for shared cache
across builds, eliminating slow registry cache transfers. Fix Docker layer
invalidation by copying only source code instead of entire repository.

Add Python toolchain caching and uv-lock pre-commit hook to ensure consistent
dependencies. Configure prek for comprehensive linting and testing in single job.
fix: set correct ownership for virtual environment in Docker container
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 7s
3bd09ffc9f
Add --chown=appuser:appuser to ensure runtime user can execute uvicorn
fix: downgrade setup-uv action to v5 for Node 18 compatibility
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m41s
385afa5c5f
The v6 action requires Node 20+ due to File API usage
fix: resolve container startup permission errors for virtual environment
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 5m1s
a24fab3da5
Container was failing with 'Permission denied' when trying to execute uvicorn
from the virtual environment. Changed to keep root ownership of .venv for
security while ensuring all binaries are executable by non-root users.

The uv.lock update includes a minor coverage package version bump.
ci: add markdown linting and automatic hook updates to pre-commit
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 34s
8594f63f61
Integrate markdownlint-cli2 with comprehensive style rules to enforce
consistent markdown formatting across documentation. Pin specific hook
versions and add automatic update check to detect outdated dependencies.

The autoupdate hook runs first to ensure all tools are current before
executing quality checks, preventing version drift in CI pipeline.
docs: fix markdown formatting to comply with linting standards
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 2m7s
fb2d6b4303
Resolve all markdownlint violations across documentation files following the
newly added linting configuration. Changes include wrapping long lines at 100
characters, converting asterisk lists to dashes, and adding language
specifiers to code blocks.
refactor: use specific message types instead of generic BaseMessage
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 3m52s
993c7be03b
Replace direct BaseMessage instantiation with appropriate concrete message
types (UserMessage, AssistantMessage, SystemMessage, ToolMessage) throughout
the codebase. This follows the architectural principle of preferring explicit
types over generic base classes for better type safety and code clarity.

Extract message building logic in realtime session to reduce method complexity,
addressing ruff C901 violation. Enhance httpx mock fixture to support GET
requests alongside existing streaming functionality for comprehensive testing.
ci: optimise pre-commit hook execution in CI pipeline
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 12s
6d36fa764a
Configure prek to run with CI-specific settings using --hook-stage flag,
allowing selective hook execution. Skip autoupdate check in CI environments
to reduce noise and unnecessary network calls during automated builds.

Add --no-progress flag to autoupdate hook for cleaner output when checking
repository updates during local commits.
fix: correct pre-commit stage configuration for CI compatibility
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 4m32s
d5aedd7a61
Fix invalid 'ci' stage value that caused workflow failures. Use 'manual' stage
for CI runs and 'pre-commit' for local commits. Configure autoupdate hook to
only run during local commits, preventing unnecessary network calls in CI.

All quality checks (pytest, mypy, ruff) now run in both local and CI contexts,
while autoupdate remains local-only for efficiency.
fix: improve Docker build caching and optimise container build time
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m10s
b957c3b130
Enable BuildKit registry caching to persist uv package downloads between CI runs,
preventing redundant downloads. Add sharing=locked to cache mounts to prevent
concurrent build corruption.

Optimise chmod operation to only make binaries executable rather than recursively
processing the entire virtual environment, significantly reducing build time.
ci: reduce pytest verbosity in pre-commit hooks
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 53s
3a961f5fab
Replace verbose pytest output (-xvs) with quiet mode (-x -q) to show only
failures and summary. Reduces CI log noise while maintaining visibility of
actual test failures.
ci: suppress ffmpeg warning during pytest runs
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m0s
14cd8346a1
Add filter for the RuntimeWarning about missing ffmpeg/avconv from pydub.
This is expected behaviour as ffmpeg is only installed in the Docker
container, not in the CI runner where tests execute.
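One possible form of that filter, placed in a conftest.py (the project may instead use pytest's filterwarnings ini option; the message pattern matches pydub's import-time warning):

```python
# conftest.py
import warnings

# pydub warns when ffmpeg/avconv is absent; harmless on the CI runner,
# where ffmpeg lives only inside the Docker image.
warnings.filterwarnings(
    "ignore",
    message="Couldn't find ffmpeg or avconv.*",
    category=RuntimeWarning,
)
```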
ci: use uv run instead of uvx for ruff pre-commit hooks
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 47s
379fbd73bb
Switch from uvx to uv run for ruff commands to leverage existing dependency
caching. Since ruff is installed via uv sync, using uvx causes redundant
downloads as a separate tool, defeating our caching strategy.
fix: add cache-bust ARGs to force rebuild with corrected BuildKit umask
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m46s
2df273f161
BuildKit container now has umask 022 configured, but cached layers still have
the old 770 permissions. Adding ARG CACHE_BUST forces fresh layer builds.
fix: explicitly create /app with correct permissions
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 2m10s
9415f5f14e
Even with BuildKit umask fixed, WORKDIR creates directories with wrong
permissions. Explicitly creating /app with chmod 755 before WORKDIR.
feat(pre-commit): enhance hooks with comprehensive code quality checks
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 2m2s
0988cd3143
Add conventional commit message validation, typo checking, and markdown
linting to enforce consistent code and documentation standards. Remove
default_stages restriction that was preventing commit-msg hooks from
running. Fix duplicate reasoning content accumulation in streaming.

Also remove appuser from Dockerfile as container now runs as root,
eliminating permission issues with BuildKit.
fix(streaming): allow reasoning content with tool markers to stream to client
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m10s
f48498bee8
DeepSeek R1 models generate reasoning content that should be visible to
clients wrapped in <think> tags. Previously, any content containing tool
markers was suppressed entirely, preventing reasoning from being displayed.

Now, when in thinking mode, content is always streamed to the client even
if it contains tool markers, ensuring reasoning visibility while still
suppressing actual tool call invocations.
fix(streaming): remove duplicate reasoning content accumulation
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m5s
ed527095db
ChatChunk.from_json_data already converts reasoning_content to content when
there's no regular content. The additional accumulation logic was causing
content to be doubled internally ("OkayOkay", "the the"), which was interfering
with tool call extraction.

Simplified by removing the redundant second accumulation stage entirely.
feat(streaming): add comprehensive logging for tool call detection
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m17s
47187bfeb0
- Add WARNING level logs when tool calls are detected in various forms
- Add DEBUG_SHOW_ALL_TOKENS flag to disable suppression for debugging
- Log final accumulated content with length
- Log when tool_calls appear in delta
- Log when content is yielded despite is_tool_calling being true

This will help identify why the stream appears to pause when processing
DeepSeek R1 reasoning that mentions tools.
fix(streaming): properly attach tool calls from delta to ChatChunk message
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 56s
ecd51b3788
Tool calls were being detected and logged but not actually attached to
the message object, preventing accumulation and execution. Now converts
delta tool_calls to ToolCall objects and attaches them to the message.
feat(streaming): add comprehensive logging for tool call detection
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 59s
7ff30aa0fe
Added detailed logging to track tool call creation and attachment to messages.
Also added fallback detection for tool_calls on message when has_tool_calls
flag is false, which shouldn't happen but helps with debugging.
fix(streaming): properly handle partial tool call deltas from streaming API
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 58s
9ba005bbc7
Tool calls arrive as fragments across multiple deltas (first with id/name,
then subsequent chunks with argument pieces). Now correctly creates partial
ToolCall objects that get merged by the accumulation logic.
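Merging by index, as a later commit in this log also describes, looks roughly like this sketch (OpenAI-style delta fragments assumed; the pending-slot dict is illustrative):

```python
def merge_tool_call_delta(pending: dict[int, dict], fragment: dict) -> None:
    # The first fragment for an index carries the id and function name;
    # later fragments append argument text to the same slot.
    slot = pending.setdefault(
        fragment["index"], {"id": "", "name": "", "arguments": ""}
    )
    if fragment.get("id"):
        slot["id"] = fragment["id"]
    function = fragment.get("function") or {}
    if function.get("name"):
        slot["name"] = function["name"]
    if function.get("arguments"):
        slot["arguments"] += function["arguments"]
```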
refactor(chat): unify streaming backend for all chat completions
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 59s
8934a9cac6
Refactor non-streaming chat handler to use the streaming backend internally
and collect responses. This ensures consistent tool handling and model
configuration loading whether clients request streaming or not.

Extract response accumulation to separate helper function to reduce complexity.
Fix test mocks to properly handle async iteration of streaming chunks.
fix(streaming): add logging to debug tool call accumulation in non-streaming mode
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 54s
de1e874b75
fix(streaming): transfer tool calls from ChatChunk to streaming delta
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 56s
eacae9b99f
Tool calls detected in ChatChunk messages weren't being included in the
streaming delta when converting to ChatStreamingChunk format. This caused
tool calls to be lost when using non-streaming mode, despite being properly
detected and logged during chunk processing.
fix(streaming): add diagnostics to trace tool call handling from llama.cpp
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 57s
911c036551
Add targeted logging to understand exact format of tool call deltas from llama.cpp.
Logs full JSON when tool_calls field detected in delta, and tracks extraction
results to identify why DeepSeek R1 tool calls within think tags aren't being
properly parsed and forwarded to clients.
fix(streaming): yield accumulated tool calls for server-side execution
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 54s
06dc977c5a
When llama.cpp sends tool_calls as streaming deltas, they are accumulated
into current_message.tool_calls but never yielded back through the streaming
pipeline. This causes _collect_streaming_response to receive no tool_calls
and create an AssistantMessage with tool_calls=None.

Add logic to yield a final streaming chunk containing the accumulated tool
calls when the stream completes, ensuring they can be executed server-side.
fix(streaming): ensure message is created when tool_calls are present
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 55s
0a5227d856
ChatChunk.from_json_data was only creating a message when content was present
in the delta. When llama.cpp sends tool_calls without content, no message was
created, preventing tool_calls from being attached. This caused accumulated
tool_calls to be lost.

Now create an AssistantMessage whenever tool_calls are present in the delta,
ensuring they can be properly accumulated via the merge method.
fix(streaming): merge tool call deltas by index to properly accumulate arguments
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 58s
ed563240d4
refactor(streaming): eliminate inefficient message merging pattern
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 57s
48280c1459
Replace the creation and merging of hundreds of AssistantMessage objects
per streaming response with a single message that accumulates deltas directly.
Tool calls now properly accumulate by index without creating intermediate objects.

Breaking change: ChatChunk transformed from message factory to lightweight delta
carrier. ChatStreamingDelta.tool_calls changed from ToolCall objects to raw dicts
for efficiency.

Removes all merge() methods following YAGNI principle - direct accumulation is
simpler and more performant.
fix(streaming): enable server-side tool execution in conversation flow
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 58s
53f4786bc1
Tool calls were being detected but not properly triggering server-side execution.
The CompletionHandler now checks for pending tool calls after stream completion
and signals the ConversationHandler to handle them before continuing.

Added source tracking to ToolCall model to distinguish between structured
(API-provided) and extracted (content-parsed) tool calls, ensuring correct
processing mode for each type.
fix(streaming): add assistant message before tool messages in structured mode
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 53s
3168b8b2fb
The conversation handler was not adding the assistant message containing
tool calls before adding the tool response messages. This caused llama.cpp
to reject subsequent requests with 'Cannot have 2 or more assistant
messages at the end of the list' error.

Now properly adds both the assistant message with tool calls and the
subsequent tool messages to maintain correct conversation flow.
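A hypothetical example of the ordering this fix enforces (tool and call contents invented):

```python
call = {"id": "call_0", "type": "function",
        "function": {"name": "get_time", "arguments": "{}"}}

# The assistant message that requested the tools must precede the tool
# responses; llama.cpp rejects two trailing assistant messages in a row.
messages = [
    {"role": "user", "content": "What time is it?"},
    {"role": "assistant", "content": None, "tool_calls": [call]},
    {"role": "tool", "tool_call_id": "call_0", "content": "09:15"},
]
```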
fix(streaming): remove duplicate assistant message addition in completion handler
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 56s
95c83a69ff
The completion handler was adding the assistant message to the conversation
before signaling tool_execution_needed, and then conversation handler was
also adding it when processing structured tool calls. This caused llama.cpp
to reject the request with 'Cannot have 2 or more assistant messages at
the end of the list' error.

Now only conversation.py handles adding messages to maintain single
responsibility and avoid duplication.
fix(streaming): enable server-side tool execution in conversation flow
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m3s
4debd032f6
Refactored tool message handling to properly maintain conversation structure:
- AssistantMessage.process_tool_calls now returns complete message sequence
- For structured mode: returns [AssistantMessage, ToolMessage1, ToolMessage2, ...]
- Preserves tool_calls on AssistantMessage (no longer cleared)
- Conversation handler now extends messages with the complete sequence

Also updated ruff config to globally ignore C901 and PLR1702 rules that
were being consistently overridden with noqa comments.

This ensures the correct message order for llama.cpp compatibility while
maintaining tool call information on assistant messages.
fix(streaming): halt immediately on tool detection and filter malformed markers
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 57s
0fce773e27
- Stop stream processing immediately when tool calls are detected
- Don't close thinking tags when halting for tools (continue after execution)
- Filter out malformed tool call markers with Unicode characters
- Add detection for additional malformed tool call variants
- Prevent model-generated malformed tool syntax from leaking to users
fix(llamacpp): disable native tool parsing to prevent DeepSeek R1 crashes
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 57s
4807f53f5c
Adds parse_tool_calls: false to all llama.cpp requests to prevent the server from
attempting to parse tool calls with its grammar system. DeepSeek R1 and other
models use custom tool formats that conflict with llama.cpp's grammar parser,
causing crashes when the server tries to constrain output.

Our extraction method handles all tool formats correctly, so we bypass llama.cpp's
native parsing entirely. Also documents other available llama.cpp parameters for
future use (reasoning_format, thinking_forced_open, chat_template_kwargs).
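The request change reduces to one extra field on every backend payload, roughly as sketched here (surrounding payload fields are illustrative):

```python
payload = {
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "stream": True,
    # Keep llama.cpp's grammar-based tool parser out of the loop; tool
    # calls are extracted from the raw text stream on our side instead.
    "parse_tool_calls": False,
}
```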
fix(streaming): improve tool extraction to handle structured separators
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 54s
71e9468232
Enhanced tool call extraction to recognise explicit separator tokens like
<|tool▁sep|> used by models to delineate function names. Now extracts all
tool calls including placeholder names (e.g. FUNCTION_NAME) so the execution
layer can return appropriate errors, guiding models to correct behaviour.

Also defaults to empty JSON object when no parameters are present, as many
tools have optional parameters. Added logging to track separator detection
for debugging model-specific formats.
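A rough sketch of the separator-aware extraction, under the assumption that the name follows the separator token and any JSON object after it is the arguments (the real format handling is model-specific):

```python
import re

TOOL_SEP = "<|tool▁sep|>"  # explicit separator token named in the commit


def extract_call(fragment: str) -> tuple[str, str]:
    # Default the arguments to "{}" because many tools take no parameters.
    # Placeholder names such as FUNCTION_NAME are extracted anyway so the
    # execution layer can answer with a corrective error.
    _, _, rest = fragment.partition(TOOL_SEP)
    match = re.search(r"(\w+)", rest)
    name = match.group(1) if match else "FUNCTION_NAME"
    args = re.search(r"\{.*\}", rest, re.DOTALL)
    return name, args.group(0) if args else "{}"
```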
refactor(messages): remove embed_tool_responses config and fix continuation handling
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 55s
f4bf2cdedd
Since we now handle all tool parsing ourselves with parse_tool_calls: false,
the embed_tool_responses config is unnecessary. Tool handling is determined
purely by whether calls are extracted (embedded in assistant) or structured
(separate messages).

Fixed duplicate assistant message issue by preventing _add_assistant_starter
from adding new messages during continuations. The assistant message already
exists with embedded tool responses for extracted mode, so we just continue
it rather than creating a new one.
refactor(logging): improve tool call and conversation logging clarity
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 55s
6581665b82
Replace misleading warning-level logs with appropriate debug/info levels throughout
the streaming pipeline. Focus logging on actionable issues and state transitions
rather than implementation details.

Key improvements:
- Add state machine tracking for tool extraction process
- Remove noisy warnings for expected conditions (e.g. word 'tool' in content)
- Log tool batches with names for better execution visibility
- Track conversation cycles and continuation states clearly
- Reserve warnings for genuine issues (unknown tools, malformed JSON)
- Simplify accumulated content logging to reduce debug spam
refactor(llamacpp): unify tool handling via inject_tools_in_prompt config
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 56s
403296efda
Simplify tool extraction control by using inject_tools_in_prompt as the
single configuration point. When enabled, tools are injected into the prompt
using DeepSeek R1's exact format and parse_tool_calls is disabled to prevent
llama.cpp from parsing them natively.

This eliminates the need for special-case logic and provides a consistent
approach for models requiring custom tool formats like DeepSeek R1 and Gemma.
fix(streaming): ensure extracted tool calls are processed through unified pipeline
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 25s
a22768e25f
Extracted tool calls were being generated but never added to the streaming
state's pending_tool_calls, causing them to be skipped during execution.
Now properly stores extracted calls for processing alongside structured ones.

Includes duplicate detection to handle malformed model outputs where the same
tool call appears multiple times with incorrect separators.
fix(llamacpp): inject tools into system message instead of user message
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 16s
0a82d4afde
Tools should always be injected into the system message to provide context
without affecting the user's actual query. Creates a system message if the
first message isn't already a system message, ensuring proper separation
of tool instructions from user content.
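A minimal sketch of that injection, assuming plain dict messages (function and parameter names are invented):

```python
def inject_tools(messages: list[dict], tool_block: str) -> list[dict]:
    # Tool instructions belong in the system message so the user's actual
    # query is left untouched; create a system message if there isn't one.
    if messages and messages[0]["role"] == "system":
        head = dict(messages[0])
        head["content"] = f"{head['content']}\n\n{tool_block}"
        return [head, *messages[1:]]
    return [{"role": "system", "content": tool_block}, *messages]
```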
ci(hooks): remove problematic typos hook from pre-commit configuration
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m43s
d2d078b039
The typos hook was causing CI authentication and Python version conflicts
during the build process. Removing this hook resolves runner issues and
allows proper execution of other quality checks.
fix(thinking): synchronise think tag injection between model and user
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m15s
58e79d56b0
Previously, the model received <think> tags via assistant starter messages but
users only saw the closing </think> tag. This caused confusion as the opening
tag was missing from the user's perspective.

Now both model and user receive identical <think> tags when enforce_thinking
is enabled. The assistant starter conditionally includes <think> based on the
backend configuration, and the streaming handler manually injects the same tag
to the user's stream.
refactor(streaming): remove deprecated LlamaCppStreamingProcessor
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m18s
e53a873822
The old streaming processor was replaced by ConversationHandler in commit
a4dbebc7 but never removed. This processor duplicated thinking tag injection
logic and was not used in production.

ConversationHandler now handles all streaming with proper tool execution
and thinking tag synchronisation between model and user.
fix(streaming): preserve conversation state across tool execution cycles
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m37s
6d7d8044bc
Shared ConversationContext between completion cycles prevents creating fresh
AssistantMessage on continuation. This allows proper accumulation of new
content after tool execution, fixing infinite loop where tool_calls_end
markers were incorrectly re-detected in subsequent streaming passes.

Also restructures pre-commit hooks for better performance and reliability
by running format checks before expensive test suites.
fix(streaming): clear tool state after execution to prevent re-detection
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m6s
49b050294d
Tool calls were persisting in ConversationState.current_message after
execution, causing continuations to immediately re-detect and re-execute
the same tools. Now properly clears tool_calls, pending states, and flags
after processing to allow clean continuation.
fix(streaming): append assistant message before tool execution in extracted mode
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m8s
1bd7f02d67
Messages were being lost during tool execution because the assistant message
wasn't added to the conversation before being modified with tool results.
This caused the message count to reset on continuation, breaking the
conversation flow after tool calls.
fix(streaming): process only new chunks through StreamParser, not accumulated content
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m17s
fc4d6b8c52
StreamParser was being misused - instead of feeding it new chunks incrementally
as designed, the entire accumulated content was being re-processed on every
chunk. This caused old tool call markers to be re-detected on continuation
cycles, resulting in infinite loops with repeated tool execution.

Modified _handle_content_chunk to feed only new content chunks to the parsers,
allowing them to maintain state properly. Removed _extract_complete_tool_calls
method entirely as it was the source of the re-scanning behaviour.
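A toy illustration of the incremental-feeding contract (the project's real StreamParser API is not shown in this log; this stand-in just demonstrates why each chunk must be fed exactly once):

```python
class MarkerScanner:
    """Stateful scanner fed each new chunk once, keeping only a small
    tail so markers split across chunk boundaries are still caught."""

    def __init__(self, marker: str) -> None:
        self.marker = marker
        self.tail = ""

    def feed(self, chunk: str) -> bool:
        window = self.tail + chunk
        found = self.marker in window
        self.tail = window[-(len(self.marker) - 1):] if len(self.marker) > 1 else ""
        return found
```

Re-feeding the whole accumulated content on every chunk, by contrast, re-detects markers consumed in earlier cycles, which is exactly the infinite loop this commit fixes.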
fix(streaming): clean up output and improve thinking block tool handling
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m17s
4e097cd839
Removed DEBUG_SHOW_ALL_TOKENS flag that was leaking all content to users,
causing duplicated thinking content and malformed tool markers in output.

Enhanced think tag buffering to keep tool calls inside thinking blocks when
they immediately follow </think> tags, allowing models to continue reasoning
after tool execution before presenting final output to users.
test(streaming): add comprehensive raw content logging for tool call diagnosis
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m14s
a1a61c5ade
Log the raw content from llama.cpp before any processing to identify why
tool calls aren't being generated or extracted. Tracks content through
filtering decisions to reveal whether the model produces tool syntax that
gets incorrectly suppressed.
feat(vllm): add vLLM backend support as alternative to llama.cpp
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 49s
dd83621e36
Implement vLLM as an additional backend provider option alongside llama.cpp.
vLLM provides an OpenAI-compatible API with native GGUF support, offering
an alternative approach to handling tool calls and model inference.

- Create vLLM service lifecycle handler for container management
- Add unified LLM handler factory to route between providers
- Update configuration to use vLLM for tools and agent models
- Maintain backwards compatibility with existing llama.cpp backends
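The factory mentioned in the second bullet might look like this sketch (class and function names are invented; the real handlers are not shown in this log):

```python
class LlamaCppHandler:
    def __init__(self, config: dict) -> None:
        self.config = config


class VLLMHandler:
    def __init__(self, config: dict) -> None:
        self.config = config


def make_llm_handler(provider: str, config: dict):
    # Route each model to its configured backend provider.
    handlers = {"llama.cpp": LlamaCppHandler, "vllm": VLLMHandler}
    try:
        return handlers[provider](config)
    except KeyError:
        raise ValueError(f"Unknown LLM provider: {provider}") from None
```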
fix(types): resolve mypy errors in vLLM integration
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m44s
9f31cd88a7
Fixed type checking issues that bypassed pre-commit hooks in the vLLM feature
commit. Corrected ServiceType enum usage, added type ignore for legitimate
inheritance override, and made UnifiedLLMHandler properly extend base class
to satisfy ModelManager's type requirements.
fix(vllm): handle missing backend config attributes safely
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m17s
4c04a925bb
Prevent AttributeError when optional vLLM backend configuration fields
are not specified. Uses getattr with sensible defaults for quantisation
and tensor_parallel_size to allow minimal configurations to work.
fix(vllm): prevent NoneType comparison for tensor parallelism config
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m12s
6a0e66c646
The tensor_parallel_size attribute may not exist on BackendConfig objects.
When missing, getattr returns None which cannot be compared with integers.
Added explicit None check before numerical comparison to prevent TypeError.
fix(vllm): ensure GPU count defaults to 1 when tensor_parallel_size is None
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m10s
6f88406d69
The tensor_parallel_size attribute may be None after getattr with None default.
Using 'or 1' ensures we always pass a valid GPU count to Docker device requests,
preventing vLLM container startup failures when GPU detection is required.
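The defaulting pattern across these three vLLM config fixes boils down to one expression, sketched here with a stand-in config object:

```python
from types import SimpleNamespace

backend_config = SimpleNamespace()  # attribute deliberately absent here

# getattr yields None when the field is missing, and the config may also
# set it to None explicitly; "or 1" covers both cases so Docker always
# receives a valid GPU count.
gpu_count = getattr(backend_config, "tensor_parallel_size", None) or 1
assert gpu_count == 1
```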
fix(vllm): use v0.8.5 image for CUDA 12.4 compatibility
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m19s
836eed42d2
The latest vLLM images require CUDA 12.8 but TrueNAS driver 550.142 only
supports CUDA 12.4. Pin to v0.8.5 which maintains CUDA 12.4 compatibility
whilst providing stable tool calling functionality.
fix(vllm): add .gguf extension to GGUF model paths for proper file resolution
All checks were successful
Docker Build and Publish / Build and Push (push) Successful in 1m18s
a108f74b31
GGUF models require explicit file paths with .gguf extension rather than
directory paths. vLLM was failing to load models because the constructed
path pointed to a directory instead of the actual .gguf file.
ci(vllm): add Docker image build pipeline with CUDA 12.4 support
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 1m41s
Build vLLM Images / build-vllm (devel, 12.4.1, ubuntu22.04) (push) Has been cancelled
545115a147
Create dedicated CI workflow to build vLLM images compatible with TrueNAS
CUDA 12.4 driver constraints. Avoids upstream version incompatibilities by
compiling vLLM against specific CUDA toolkit versions.

Multi-stage Dockerfile minimises runtime image size whilst including all
necessary CUDA libraries. Weekly rebuilds ensure latest vLLM improvements
whilst maintaining driver compatibility.
refactor(ci): consolidate Dockerfiles and optimise CI triggers
Some checks failed
Docker Build and Publish / Build and Push (push) Failing after 1m24s
Build vLLM Images / build-vllm (devel, 12.4.1, ubuntu22.04) (push) Failing after 9m47s
2fc66b5b11
Move all Dockerfiles to dedicated docker/ directory for better organisation.
Configure workflows to use shared buildkit container for faster builds.

Add path-specific triggers to main Docker workflow to prevent unnecessary
rebuilds when only documentation or non-code files change. Both workflows
now use the same shared buildkit infrastructure for consistency.
fix(docker): limit vLLM build parallelization to prevent OOM
Some checks failed
Build vLLM Images / build-vllm (devel, 12.4.1, ubuntu22.04) (push) Failing after 2m17s
11b0b4bc47
- Set MAX_JOBS=8 to limit cmake parallel compilation
- Add --no-build-isolation flag for better dependency control
- Add cmake to build dependencies
- Prevents build failures on high-core count systems
fix(docker): add setuptools-scm build dependency for vLLM
Some checks failed
Build vLLM Images / build-vllm (devel, 12.4.1, ubuntu22.04) (push) Failing after 3m27s
046f67c6b2
vLLM requires setuptools-scm for version management during the build process.
Without it, the build fails with ModuleNotFoundError when compiling from source.
perf(docker): add BuildKit cache mounts for pip wheels
Some checks failed
Build vLLM Images / build-vllm (devel, 12.4.1, ubuntu22.04) (push) Failing after 3m16s
7dac61e934
Enable Docker BuildKit cache for all pip install operations to significantly
speed up rebuilds. PyTorch alone is ~2GB of downloads that will now be cached
between builds, reducing both build time and bandwidth usage.
fix(docker): install CMake 4.0.3 to meet vLLM requirements
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04) (push) Failing after 17m55s
ff42234e04
vLLM requires CMake 3.26+ but Ubuntu 22.04 ships with 3.22.1. Install latest
CMake 4.0.3 from official binary distribution. Made version configurable via
build argument for easier future updates.
fix(docker): pin PyTorch 2.4.0 and optimise GPU architecture targets
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04) (push) Failing after 15m34s
115547b8dd
vLLM requires PyTorch 2.4.0 but was pulling 2.6.0, causing version mismatch
errors. Also targets only modern GPU architectures (Ampere and newer) to
reduce build time and binary size by ~60%. Supports RTX 3090 and all newer
NVIDIA GPUs including Ada Lovelace and Hopper architectures.
feat(docker): build PyTorch from source for CUDA 12.4 compatibility
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04, v2.8.0) (push) Failing after 3m51s
3b53b6c1ab
Build PyTorch v2.8.0 from source to get Float8_e8m0fnu support required by
latest vLLM. Uses ccache with BuildKit cache mounts to speed up rebuilds.
Limited to 8 parallel jobs to avoid memory exhaustion on the build server.
fix(docker): add python-is-python3 package for PyTorch build
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04, v2.8.0) (push) Failing after 5m27s
f705b94b83
PyTorch build scripts expect 'python' command to be available. Ubuntu 22.04
provides python-is-python3 package which creates the necessary symlinks.
fix(docker): use ccache PATH method to avoid NCCL build failures
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04, v2.8.0) (push) Failing after 29s
23aa887d06
- Replace CC/CXX environment variables with PATH manipulation
- Create symlinks in /usr/lib/ccache for all compilers
- This avoids NCCL's Makefile mishandling ccache wrapper syntax
fix(docker): remove redundant ccache symlinks
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04, v2.8.0) (push) Has been cancelled
09976ff1ee
Ubuntu's ccache package already creates symlinks in /usr/lib/ccache
feat(docker): increase PyTorch build parallelization to 16 jobs
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04, v2.8.0) (push) Has been cancelled
ac7c46f847
PyTorch compilation is CPU-bound and uses under 5GB RAM with 8 jobs,
so we can safely increase to 16 jobs while keeping vLLM builds at 8
fix(docker): reduce PyTorch build parallelization to 12 jobs
Some checks failed
Build vLLM Images / build-vllm (4.0.3, devel, 12.4.1, ubuntu22.04, v2.8.0) (push) Has been cancelled
09ef8be941
Build with 16 jobs peaked at over 37GB RAM, causing OOM failures on the
64GB build server. Reducing to 12 jobs should provide better balance
between build speed and memory usage.
feat(docker): implement PyTorch wheel build and package registry workflow
Some checks failed
Build vLLM Images / build-vllm (12.4.1, ubuntu22.04, 2.8.0) (push) Failing after 57s
Docker Build and Publish / Build and Push (pull_request) Has been skipped
afe2d2eb4c
Create separate PyTorch wheel builder that compiles once and publishes to
Forgejo's generic package registry. This avoids recompiling PyTorch for
every vLLM build and handles both dev/runtime image requirements.

The wheel is built via workflow dispatch and stored in the package registry,
then referenced by URL in downstream Dockerfiles. This provides versioned,
cached PyTorch builds that persist beyond build cache expiry.
tom merged commit 3eab905674 into main 2025-09-05 10:18:39 +01:00
tom deleted branch dev 2025-09-05 10:18:39 +01:00
Reference: tom/neuromancer#1