LLMs are stateless APIs — each call is completely independent. The model has no memory of previous exchanges. To create a conversational experience, you must explicitly pass the entire conversation history in every call.
Pattern: Maintain a chat_history list. After each turn, append the user's HumanMessage and the AI's AIMessage. On the next call, prepend history to the messages list.
```python
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

# `model` is assumed to be any LangChain chat model (e.g. an instantiated provider client).
chat_history = []

def chat(user_input):
    # Rebuild the full conversation on every call: the model itself is stateless.
    messages = [SystemMessage('You are a helpful assistant.')] + chat_history + [HumanMessage(user_input)]
    response = model.invoke(messages)
    # Persist the new turn pair so the next call can replay it.
    chat_history.append(HumanMessage(user_input))
    chat_history.append(AIMessage(response.content))
    return response.content
```
This is why LLM conversation UIs like ChatGPT send the full history on every request — they're building this list and passing it each time. The context window limit is the practical ceiling: long enough conversations hit the token limit and older messages must be truncated or summarised.
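A minimal sketch of budget-based truncation, assuming messages are (role, content) tuples and using a crude whole-word count as a token proxy (a real system should use the provider's tokenizer; the name `trim_history` is illustrative):

```python
def approx_tokens(text):
    # Crude proxy for token count; substitute the provider's tokenizer in practice.
    return len(text.split())

def trim_history(history, max_tokens):
    """Drop the oldest messages until the conversation fits the budget.

    history is a list of (role, content) tuples, oldest first.
    """
    trimmed = list(history)
    while trimmed and sum(approx_tokens(c) for _, c in trimmed) > max_tokens:
        trimmed.pop(0)  # discard the oldest message first
    return trimmed

history = [
    ("user", "one two three four"),
    ("assistant", "five six"),
    ("user", "seven eight nine"),
]
short = trim_history(history, max_tokens=6)  # oldest turn no longer fits
```

Summarisation works the same way structurally: instead of dropping the popped messages, they are condensed into a single synthetic message that stays at the front of the list.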
Interview-Ready Deepening
Source-backed reinforcement: these points restate the core ideas above and emphasize the production tradeoffs worth being able to explain.
- How LLMs simulate memory: the application passes the full conversation list on each call.
- LLMs are stateless APIs; each call is completely independent, and the model has no memory of previous exchanges.
- After each turn, append the user's HumanMessage and the AI's AIMessage; on the next call, prepend the history to the messages list.
- The context window limit is the practical ceiling: long enough conversations hit the token limit, and older messages must be truncated or summarised.
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Build deterministic baseline chains first (prompt -> model -> parser), then add retrieval, memory, or tools only when the baseline is stable.
Production note: Keep contracts explicit at each boundary: input variables, output schema, retries, and logs. This is what keeps orchestration reliable at scale.
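One way to make that boundary contract concrete is to validate inputs before they reach the chain. This is a sketch only; the field names and `validate_input` helper are assumptions for illustration, not a library API:

```python
from dataclasses import dataclass

@dataclass
class ChainInput:
    # Explicit input contract: the variables the prompt expects.
    question: str
    session_id: str

def validate_input(payload: dict) -> ChainInput:
    # Fail loudly at the boundary instead of deep inside the chain.
    missing = {"question", "session_id"} - payload.keys()
    if missing:
        raise ValueError(f"missing input variables: {sorted(missing)}")
    return ChainInput(question=payload["question"], session_id=payload["session_id"])
```

The same idea applies on the output side: a declared schema for the model's reply means downstream code depends on a contract, not on whatever the model happened to emit.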
What actually creates conversational memory: the model does not keep a hidden chat session between API calls. The application reconstructs continuity by sending a structured list of SystemMessage, HumanMessage, and AIMessage objects every time. The system message sets the durable behavioral frame, the human message carries the new request, and earlier AI and human turns are replayed to preserve context. That is why role labels matter. If a prior assistant answer is accidentally replayed as a user message, the model is being given the wrong conversation history.
Architecture reading: system instruction -> prior turns -> new user turn -> model invocation -> assistant reply -> append new turn pair -> next request. This is a state loop implemented by the application layer, not the model provider. Once you understand that loop, products like terminal chats, customer-support bots, and multi-step copilots become easier to reason about because they all differ mainly in how they store, trim, and validate the message list.
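The state loop above can be sketched with a stand-in model callable; `fake_model` is a placeholder for a real provider client, and the (role, content) tuples stand in for message objects:

```python
SYSTEM = ("system", "You are a helpful assistant.")

def run_turn(model, history, user_input):
    """One pass through the loop: build messages, invoke, append the new turn pair."""
    messages = [SYSTEM] + history + [("user", user_input)]
    reply = model(messages)                      # model invocation
    history.append(("user", user_input))         # append new turn pair...
    history.append(("assistant", reply))         # ...so the next request replays it
    return reply

def fake_model(messages):
    # Stand-in model: reports how much context it was handed.
    return f"saw {len(messages)} messages"

history = []
first = run_turn(fake_model, history, "hello")
second = run_turn(fake_model, history, "and again")
```

Because the loop lives in the application layer, storing `history` per session, trimming it, and validating it are all ordinary backend concerns.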
Failure modes and production design: chat history grows linearly with conversation length, so every serious system eventually needs truncation, summarization, or retrieval-backed memory. Duplicate appends create repeated context; missing appends make the assistant appear forgetful; missing system prompts make behavior drift. A practical design rule is to treat message construction as a first-class piece of backend logic with tests around role ordering, session isolation, and context-window budgets.
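A sketch of the kind of role-ordering check such tests might assert, again assuming (role, content) tuples (the function name is illustrative):

```python
def check_roles(messages):
    """Validate an outgoing request: one leading system message, then strictly
    alternating user/assistant turns, ending with the new user turn."""
    if not messages or messages[0][0] != "system":
        return False  # missing system prompt -> behavior will drift
    roles = [role for role, _ in messages[1:]]
    expected = ["user" if i % 2 == 0 else "assistant" for i in range(len(roles))]
    # A duplicate append breaks the alternation; a missing append breaks the ending.
    return roles == expected and roles[-1:] == ["user"]
```

Checks like this catch duplicate appends and dropped system prompts before the request is sent, rather than after the assistant starts behaving oddly.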