
How to Build a Data Ingestion Pipeline for Your RAG AI Agent

In our previous tutorial, we showed you how to build a RAG-based AI agent in Appmixer, designed to help you easily navigate internal company policies.

As with any RAG-based AI agent, a critical part is providing private data as context. So how do you do that?

In this tutorial, we'll show you how to build a data ingestion pipeline that feeds documents from a Google Drive folder into a vector database—enabling Appmixer’s AI agents to efficiently access large volumes of private data.

Before we dive into the configuration details, watch our RAG-based AI agent in action.

[Video: the RAG-based AI agent in action]

What is a Data Ingestion Pipeline?

A data ingestion pipeline is an automated process that collects, imports, and processes data from various sources into a destination system (such as a data warehouse, data lake, or database), often in preparation for further analysis or use. In our case, it means feeding the internal policies stored in Google Drive into a vector database.
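
If it helps to see the idea in code, here is a minimal, purely illustrative sketch of those stages: collect the documents, transform them into embeddings, and load them into the destination. The function names below are placeholders for this article, not Appmixer or vendor APIs.

```python
# Purely illustrative: the three stages of a data ingestion pipeline.

def collect_documents():
    # Collect: e.g. list and download files from a Google Drive folder
    return [{"id": "policy-001", "text": "Remote work policy ..."}]

def embed(text):
    # Transform: e.g. turn the text into a numerical vector with an embedding model
    return [0.1, 0.2, 0.3]

def load(doc_id, vector):
    # Load: e.g. upsert the vector into a vector database such as Pinecone
    print(f"stored {doc_id} as a {len(vector)}-dimensional vector")

for doc in collect_documents():
    load(doc["id"], embed(doc["text"]))
```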

What might a data ingestion pipeline look like when built in Appmixer?

[Screenshot: the complete data ingestion pipeline in Appmixer's flow builder]

Workflow Explanation

To build our pipeline, we need to start by identifying where our data resides and where it needs to be stored for further use. In the example above, internal policy documents are stored in a Google Drive folder and transferred to a Pinecone vector database.

A key part of an efficient pipeline is ensuring that the data always stays up to date. To achieve this, we built four "sub-workflows" in Appmixer's no-code builder, each triggered by a different event:

  1. Saving all internal policies to the vector database when the flow starts (triggered by the On Start event). This ensures that all existing documents are ingested at once.


  2. Updating the vector database whenever a new file or folder is created in the Google Drive folder. This ensures that new documents are automatically added to the vector database.


  3. Updating the vector database when a document is modified in Google Drive. This ensures that any changes to policy documents are reflected in the vector database.


  4. Removing documents from the vector database when they are deleted from the Google Drive folder. This prevents deleted documents from remaining in the database.


To demonstrate how we built the pipeline, we'll do a deep dive into the first "sub-workflow", which saves all existing documents to the vector database when the flow starts. This will help you understand the basics of building such workflows so you can replicate them for other events and adapt them to your own use case.

Step 1: Add an On Start Trigger

The On Start component triggers the workflow as soon as it's published. You can select it from the trigger selector or drag it directly onto the canvas.


Step 2: Find All Files in a Google Drive Folder

Next, add the Google Drive connector and select the Find Files or Folders action. Authenticate your account, then choose the folder that contains the documents you want to store in the vector database.

You can enable recursive search if you want to include files in subfolders, and you can also filter by file type—for example, limiting the search to only Google Sheets.
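
Outside Appmixer, the same lookup could be done directly against the Google Drive v3 API. The sketch below is only a rough, hypothetical equivalent of the Find Files or Folders step: the token file and folder ID are placeholders, and it lists a single folder rather than searching recursively.

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# OAuth credentials obtained earlier via Google's auth flow (path is a placeholder)
creds = Credentials.from_authorized_user_file("token.json")
drive = build("drive", "v3", credentials=creds)

FOLDER_ID = "your-google-drive-folder-id"  # placeholder

# List all non-trashed files whose parent is the chosen folder
response = drive.files().list(
    q=f"'{FOLDER_ID}' in parents and trashed = false",
    fields="files(id, name, mimeType)",
).execute()

for f in response.get("files", []):
    print(f["id"], f["name"], f["mimeType"])
```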


Step 3: Download All Files

From the out port of the previous component, add another Google Drive connector. This time, select the Download File action. Map the Google Drive File ID using the dynamic value from the previous component. Make sure to set Output File Data to true (to return the actual file content), and enable Convert Google Workspace Document to convert the file to plain text.
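
For reference, here is roughly what the Download File step with Convert Google Workspace Document enabled amounts to when done by hand with the Drive API. This is a sketch, not Appmixer's implementation; it assumes the `drive` client from the previous sketch and that plain-text export suits your document types.

```python
import io

from googleapiclient.http import MediaIoBaseDownload

def download_as_text(drive, file_id, mime_type):
    if mime_type.startswith("application/vnd.google-apps"):
        # Google Workspace document: export (convert) it to plain text
        request = drive.files().export_media(fileId=file_id, mimeType="text/plain")
    else:
        # Regular file: download the raw bytes
        # (for binary formats such as PDF you would extract text with a parser instead)
        request = drive.files().get_media(fileId=file_id)

    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()
    return buffer.getvalue().decode("utf-8", errors="ignore")

# Example: text = download_as_text(drive, f["id"], f["mimeType"])
```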


Step 4: Generate Embeddings via OpenAI Component

The next step is to generate embeddings, which is essential for storing and querying data in a vector database. Add the OpenAI connector and select the Generate Embeddings action. In the Text field, map the File Data variable from the previous component and select the embedding model, in this case text-embedding-ada-002.
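
The Appmixer component handles this call for you. If you want to see what it looks like in code, here is a minimal sketch using the OpenAI Python SDK; it reads OPENAI_API_KEY from the environment, and file_text stands in for the File Data value produced by the previous step.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder for the plain-text file content from the previous step
file_text = "Employees may work remotely up to three days per week ..."

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=file_text,
)

embedding = response.data[0].embedding  # a list of floats (1536 dimensions for ada-002)
print(len(embedding))
```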


"Generating embeddings" refers to the process of converting textual, visual, or other types of data into dense vector representations (numerical arrays) that capture semantic meaning and relationships.

Step 5: Insert Vectors into Pinecone

The last step in our first "sub-workflow" is to insert the vectors into Pinecone. Add the Pinecone connector and select the Insert Vectors action. After authenticating your account, choose the Index and Namespace, then map the Embeddings output from the previous component into the Vectors field.
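
Equivalently, inserting vectors boils down to an upsert call against your Pinecone index. The following sketch uses the Pinecone Python SDK; the API key, index name, namespace, and metadata are placeholders, and the embedding is the vector produced in the previous step.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder
index = pc.Index("internal-policies")           # placeholder index name

embedding = [0.0] * 1536  # replace with the vector from the embeddings step

index.upsert(
    vectors=[
        {
            "id": "google-drive-file-id",  # reusing the Drive file ID makes later updates and deletes easy
            "values": embedding,
            "metadata": {"source": "remote-work-policy"},  # optional, placeholder
        }
    ],
    namespace="policies",  # placeholder namespace
)
```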


Next Steps: Keep Your Database Up to Date

If you've been following the steps above, you've just created a pipeline that ingests all your documents into a vector database. To keep your database up to date, follow similar steps for other scenarios, such as updating your vector database with new files or deleting removed files.
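
For example, the deletion scenario only needs to remove the vector whose ID matches the deleted Google Drive file. Assuming you reused the Drive file ID as the vector ID when inserting (as in the sketch above), that is a single Pinecone call; updates need no special handling, because re-running the download, embed, and insert steps with the same ID simply overwrites the previous entry.

```python
def remove_deleted_file(index, drive_file_id, namespace="policies"):
    # Delete the vector whose ID matches the file removed from Google Drive
    index.delete(ids=[drive_file_id], namespace=namespace)
```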

Ready to Build Your RAG AI Agent?

The data ingestion pipeline serves an important purpose when building a RAG-based AI agent. If your goal is to build an agent that lets you or your users "talk to" private data, follow our tutorial on how to build it in Appmixer.

Authors
Marek Hozak
Marketing guy, father, and sports fanatic who loves to learn about new technologies.