
Because of the non-deterministic nature of generative AI models, intelligent agents must be guided and controlled so that their results match expectations.
In Part I of this series, we described a first approach to using AI multi-agents with the AutoGen framework to automate the management of a "domain report" procedure and improve a public sector body's service capacity for citizens.
Because, based on the tests carried out, some results were unsatisfactory in terms of the accuracy of the information extracted from low-resolution documents, we undertook a series of improvements and strategies to guide and control the outcome of the AI multi-agents.
In this Part II, we describe those strategies.
Strategy #1 - Use of different coordination patterns between agents
A multi-agent coordination pattern is a model that describes how agents interact with each other to solve problems, and how the scope and flow of context between agents is defined. Some interesting articles describe coordination patterns in detail, specifically in AutoGen1 and, more generally, with a certain analogy to Akka's concurrency models2. There is also an online course3 that explains how to use each AutoGen coordination pattern.
As we saw in Part I, our first example used the Sequential pattern, where control is transferred between agents; for example, from the user_proxy to the OCR_agent, which then returns control to the user_proxy. Only one agent is active at a time. The context accumulates or is transformed as it moves down the chain, until the final result is reached.
Because the AutoGen framework supports different coordination patterns between intelligent agents, we ran tests using each of them.
The second pattern we used is Delegation, where a coordinator assigns sub-tasks to other worker agents. Workers operate in isolated contexts, and their results are gathered for synthesis.
Within this pattern we can mention GroupChat, a design pattern where a group of agents share a common message thread: all of them subscribe and post on the same topic. Each participating agent specializes in a specific task. Participants take turns posting messages, and the process is sequential: only one agent works at a time. Internally, the turn order is maintained by an agent that coordinates the group chat and selects the next agent to speak whenever a message is received. The exact algorithm for selecting the next agent can vary with application requirements. In our case, we used round-robin selection.
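The round-robin turn-taking can be pictured with a few lines of plain Python. This is a toy illustration of the selection rule only, not AutoGen's internal implementation; the agent names mirror the ones in our example:

```python
from itertools import cycle

# Toy illustration of round-robin speaker selection:
# on each turn, the next agent in a fixed circular order speaks.
agents = ["User_Proxy", "OCR_Agent", "Formatter_Agent"]
turn_order = cycle(agents)

def next_speaker() -> str:
    """Return the agent whose turn it is to post in the group chat."""
    return next(turn_order)

# max_round=5, as in the configuration below
schedule = [next_speaker() for _ in range(5)]
print(schedule)
# → ['User_Proxy', 'OCR_Agent', 'Formatter_Agent', 'User_Proxy', 'OCR_Agent']
```

This is why a round-robin group chat is predictable: the conversation length is bounded by max_round, regardless of message content.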
Below is the code fragment where we define the group chat's coordinating agent:
```python
self.manager = GroupChatManager(
    groupchat=GroupChat(
        agents=[self.user_proxy, self.ocr_agent, self.formatter_agent],
        messages=[],
        max_round=5,
        speaker_selection_method="round_robin",
    ),
    llm_config=self.llm_config,
    system_message=(
        """You are the group manager responsible for coordinating a two-step process to convert pdf or images files to structured JSON responses.
        Step 1: Ask the OCR_Agent to extract the text from the image or PDF file.
        Step 2: Once the OCR_Agent provides the extracted text, if confidence > 0.8 send that content to the Formatter_Agent to generate response in JSON format.
        Make sure the process follows this order strictly:
        1. OCR_Agent → extract raw text.
        2. Formatter_Agent → create JSON output from the extracted text.
        """
    ),
)
```
And the definition of the function that initiates the call to it:
```python
def process_file(self, file_path: str):
    message = f"extracts text from file '{file_path}', and provides it in raw text format."
    response = self.user_proxy.initiate_chat(self.manager, message=message)
    return response.chat_history[-1]['content']
```
We also tested other Delegation coordination patterns.
We implemented these patterns using AgentChat4, AutoGen's high-level API for building multi-agent applications, which is built on the autogen-core package. At the end of the series, we will share a repository with the examples.
With these changes, we were able to observe how messages are exchanged between agents: explicit conversations between them, with their back and forth. We found that, depending on the pattern used, the conversation could run longer than with another, and in some cases it did not end at all and we had to cut it off from the outside, or with the intervention of a person (human-in-the-loop).
In short, this strategy did not improve or solve the accuracy problem in the results, because the precision issue lay in the recognition of the texts, not in the interaction between agents.
Strategy #2 - Use of Retrieval-Augmented Generation (RAG)
The owner's identification number (CUIL/CUIT) is one of the most important pieces of data to extract, and precisely the one that low-resolution images made problematic. Since we could obtain a list of CUIL/CUIT and DNI (personal identification data) values in advance, it occurred to us to improve data retrieval by augmenting the models with this data, using the RAG (Retrieval-Augmented Generation) technique5.
To implement this strategy, we used ChromaDB6, an open-source vector database that is a standard in the Python world. It is the default vector database for Retrieval-Augmented Generation (RAG) applications within the AutoGen framework, allowing AI agents to have long-term memory and access to custom knowledge.
First, we populated the new vector database with the CUIL/CUIT values through an ingestion process from a .csv file. To keep things simple, we used only three columns: CUIL/CUIT, DNI, and first and last name. The ingestion process converts these values into vectors with an embedding technique using the "all-MiniLM-L6-v2" model, whose details are beyond the scope of this article; at the end of the series, the source code can be consulted in the shared repository.
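The ingestion step boils down to turning each .csv row into a document string plus metadata that the vector store can index. A minimal sketch follows; the column names and sample values are assumptions, and the actual embedding call is left to ChromaDB:

```python
import csv
import io

# Sample rows mirroring the assumed .csv layout: CUIL/CUIT, DNI, full name.
raw_csv = """cuit,dni,nombre
20123456786,12345678,Juan Perez
27998877665,99887766,Ana Gomez
"""

def rows_to_documents(text: str):
    """Convert csv rows into (ids, documents, metadatas) ready for a vector-store add()."""
    ids, documents, metadatas = [], [], []
    for row in csv.DictReader(io.StringIO(text)):
        ids.append(row["cuit"])
        # The document string is what gets embedded (with all-MiniLM-L6-v2 in our case).
        documents.append(f"CUIT {row['cuit']} DNI {row['dni']} {row['nombre']}")
        metadatas.append({"dni": row["dni"], "nombre": row["nombre"]})
    return ids, documents, metadatas

ids, docs, metas = rows_to_documents(raw_csv)
print(ids)  # → ['20123456786', '27998877665']
```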
Once the vector database is populated, just three lines instantiate it for use:
```python
from autogen.agentchat.contrib.vectordb.chromadb import ChromaVectorDB

chroma = ChromaVectorDB(path="./chroma_db")
collection = chroma.get_collection("cuits_collection")
```
Then, we replaced the user_proxy agent definition with rag_proxy_agent as follows:
```python
rag_proxy_agent = RetrieveUserProxyAgent(
    name="Rag_Proxy_Agent",
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    human_input_mode="NEVER",
    code_execution_config=False,
    system_message="Context retrieval assistant.",
    retrieve_config={
        "task": "qa",
        "docs_path": None,
        "vector_db": chroma,
        "model": "qwen2.5vl:7b",
        "embedding_model": "all-MiniLM-L6-v2",
        "collection_name": "cuits_collection",
        "update_context": True,
        "context_max_tokens": 10000,
        "get_or_create": False,
    },
)
```
As a result of the application of this strategy, we can highlight the following:
This led us to continue exploring other alternatives and strategies.
Strategy #3 - Use of validation rules + extraction tools
Continuing our quest to improve the accuracy of key information extracted from image files, we found some approaches that brought us closer to this objective.
First, it is possible to validate a CUIL/CUIT with an algorithm10; this lets us verify, quickly and easily, although not perfectly, whether the extracted value is valid.
Below is the code that implements the algorithm, courtesy of the Python community in Argentina11:
```python
# CUIL/CUIT validation function (digits only, without dashes)
# Python community from Argentina - https://wiki.python.org.ar/recetario/validarcuit/
def validar_cuit(cuil: str) -> bool:
    # minimum validations
    if len(cuil) != 11:
        return False
    base = [5, 4, 3, 2, 7, 6, 5, 4, 3, 2]
    # calculation of the check digit:
    aux = 0
    for i in range(10):
        aux += int(cuil[i]) * base[i]
    aux = 11 - (aux - (int(aux / 11) * 11))
    if aux == 11:
        aux = 0
    if aux == 10:
        aux = 9
    return aux == int(cuil[10])
```
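To sanity-check the algorithm, here is a compact, standalone restatement with a worked example; it adds a digit-only guard the original omits, and the sample numbers are invented:

```python
def cuit_valida(cuil: str) -> bool:
    """Compact equivalent of validar_cuit, plus a guard against non-numeric input."""
    if len(cuil) != 11 or not cuil.isdigit():
        return False
    base = [5, 4, 3, 2, 7, 6, 5, 4, 3, 2]
    aux = 11 - sum(int(d) * b for d, b in zip(cuil, base)) % 11
    aux = {11: 0, 10: 9}.get(aux, aux)  # same edge-case mapping as the function above
    return aux == int(cuil[10])

# For prefix 20 and DNI 12345678, the weighted sum is 148 and 148 % 11 = 5,
# so the check digit is 11 - 5 = 6.
print(cuit_valida("20123456786"))  # → True
print(cuit_valida("20123456780"))  # → False (wrong check digit)
```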
Inside the run_ocr function that performs the data extraction, we implemented the logic to parse the CUIL/CUIT value and validate it with this algorithm.
Having this identification value validated allowed us to infer whether the extraction was correct, at least for the most critical values, and with that, proceed to deliver the results. If validation failed, we had the option of aborting the workflow with an error for subsequent manual processing, or of using another, more precise extraction tool for those cases.
We followed the latter approach, using a powerful agent-based document extraction tool called ADE - Agentic Document Extraction12, from the firm Landing.ai.
The tests carried out with ADE showed very satisfactory results, both in extraction accuracy across different document formats and in response times.
However, because ADE is a cloud service consumed through an API, it does not meet one of the non-functional requirements, "do not disclose the citizen's personal information (PII)", and it also has a per-document processing cost. Its use was therefore restricted exclusively to cases of invalid extraction with the VLMs, that is, to the exceptions where CUIL/CUIT validation fails. As an additional mitigating factor, the firm guarantees that it does not retain personal information, through its ZDR - Zero Data Retention option13.
As an additional control, prior to verifying the CUIL and calling the ADE function, a check was added to the run_ocr() prompt that the document contains a paragraph beginning with "El Registro Nacional de la Propiedad del Automotor" (the National Registry of Automotive Property); this avoids calling the ADE service, which has a cost, when the document does not match the expected one.
As for the changes in our code, calling the service is very simple.
Below is a fragment of the modified run_ocr function, which adds the parsing logic for the CUIL/CUIT value to be validated and, if the value is not valid, the call to the ADE API.
```python
def run_ocr(file_path: str) -> str:
    ocr = OCRProcessor(model_name='qwen2.5vl:7b')
    result = ocr.process_image(
        image_path=file_path,
        format_type="markdown",
        custom_prompt="""Extract all visible text from this image in Spanish **without any changes**.
- **Do not summarize, paraphrase, or infer missing text.**
- Retain all spacing, punctuation, and formatting exactly as in the image.
- If the text is unclear or partially visible, extract as much as possible without guessing.
- **Include all text, even if it seems irrelevant or repeated.**
- Check that there exists a paragraph that starts with 'El Registro Nacional de la Propiedad del Automotor'. If it is not present or readable, reply "TERMINATE".
- The value for the key 'Dominio' is located between 'número de dominio' and 'el Automotor'.
- The value for the key 'Nombre' is located after 'Nombre:' in the section 'TITULAR' only.
- The value for the key 'Cuil' is located after 'Cuil:'.
- The value for the key 'Nro. Doc.:' is located after 'Nro. Doc.:' in the section 'TITULAR' only.
- Extract the rest of the keys-values you can find in the full text.""",  # optional custom prompt
        language="Spanish"
    )
    # Parse the OCR output into key/value pairs.
    data = {}
    for line in result.strip().split('\n'):
        match = re.match(r'^(.*?):\s*(.*)$', line)
        if match:
            key = match.group(1).strip()
            value = match.group(2).strip()
            data[key] = value
    # Validate the extracted CUIL/CUIT; .get() avoids a KeyError when it is missing.
    is_cuil = validar_cuit(data.get('Cuil', ''))
    if not is_cuil:
        # Fall back to the ADE service for a better extraction.
        result = agentic_doc_extraction(file_path, ExtractedFields)
    return result

# Agentic Document Extraction
# Landing.ai - https://landing.ai/
def agentic_doc_extraction(file_path: str, extraction_model: BaseModel) -> str:
    result = parse(file_path)
    return result[0].markdown
```
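The key/value parsing step inside run_ocr can be exercised in isolation on a snippet of OCR-style output; the sample text below is invented:

```python
import re

# Simulated OCR output in 'key: value' lines, as run_ocr expects.
ocr_text = """Dominio: AB123CD
Nombre: Juan Perez
Cuil: 20123456786
Nro. Doc.: 12345678"""

def parse_key_values(text: str) -> dict:
    """Same regex-based parsing as in run_ocr: split each line at the first ':'."""
    data = {}
    for line in text.strip().split("\n"):
        match = re.match(r"^(.*?):\s*(.*)$", line)
        if match:
            data[match.group(1).strip()] = match.group(2).strip()
    return data

data = parse_key_values(ocr_text)
print(data["Cuil"])  # → 20123456786
```

Note that the non-greedy `(.*?)` splits at the first colon, so a key like 'Nro. Doc.' survives its embedded periods intact.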
Results
Tests with ADE - Agentic Document Extraction from Landing.ai achieved an accuracy of approximately 0.9999 (99.99%), which considerably reduces the possibility of processing errors and the consequent return of the procedure for manual processing.
On the other hand, at a cost of $0.03 per page processed, using this tool is reasonable for an estimated 10% error rate over a total volume of around 1,500 procedures per day, that is, only for the cases where Ollama's local VLMs do not correctly recognize the information in the documents.
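That estimate works out as follows, using the figures above and assuming one page per document:

```python
# Cost estimate for the ADE fallback path.
cost_per_page = 0.03   # USD per ADE-processed page
daily_volume = 1500    # procedures per day
error_rate = 0.10      # fraction falling back to ADE when the local VLMs fail

fallback_docs = daily_volume * error_rate    # 150 documents per day
daily_cost = fallback_docs * cost_per_page   # assuming one page per document
print(f"${daily_cost:.2f} per day")  # → $4.50 per day
```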
We then continued to explore other document-extraction libraries and solutions, including Docling14 15, the open-source document-processing solution developed by IBM; however, the results obtained with ADE from Landing.ai remained superior.
Next, the complete flow implemented during this proof of concept, which includes the interaction of different intelligent agents, calls to tools and services, and the application of different controls to guide and limit the non-deterministic nature of the models:
Summary and lessons learned
We can conclude that this proof of concept achieved the stated objective of the use case: automating the processing of "Vehicle Transfer Report" procedures to avoid human intervention in reading and analyzing the vehicle's title. For this purpose, intelligent agents from the AutoGen framework were used, and different strategies were implemented over several iterations, each with its pros and cons, as summarized in the following table:
As lessons learned from this proof of concept, we can indicate:
In the next article, Part III, we will describe some important considerations and frameworks available for production environments that provide guarantees of resilience, availability, security, and so on.
Credits: cover photo by Shubham Dhage on Unsplash
References:
1 - AutoGen Conversation Patterns: https://www.gettingstarted.ai/autogen-conversation-patterns-workflows/
2 - Multi-agent Coordination Patterns (Akka): https://www.linkedin.com/posts/tylerjewell_agenticai-ai-akka-activity-7434604332383272960-Z7Gk/
3 - AI Agentic Design Patterns with AutoGen: https://learn.deeplearning.ai/courses/ai-agentic-design-patterns-with-autogen/lesson/pcet5/introduction
4 - AgentChat API: https://microsoft.github.io/autogen/stable//user-guide/agentchat-user-guide/index.html
5 - RAG: https://cloud.google.com/use-cases/retrieval-augmented-generation
6 - ChromaDB: https://www.trychroma.com/
7 - Vision RAG Sucks - Text is Still King: https://chunkr.ai/blog/vision-rag-sucks-text-is-still-king
8 - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models: https://arxiv.org/abs/2305.07895
9 - Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts: https://arxiv.org/abs/2406.16851
10 - Unique Tax Identification Key (CUIT): https://es.wikipedia.org/wiki/Clave_%C3%9Anica_de_Identificaci%C3%B3n_Tributaria
11 - Python algorithm for validating CUIL/CUIT: https://wiki.python.org.ar/recetario/validarcuit/
12 - Agentic Document Extraction: https://landing.ai/ade
13 - ZDR - Zero Data Retention: https://docs.landing.ai/ade/zdr
14 - Docling documentation: https://www.docling.ai/
15 - Docling repo: https://github.com/docling-project/docling
16 - Microsoft Semantic Kernel: https://learn.microsoft.com/en-us/semantic-kernel/overview/