In our previous tutorial, we showed you how to build a RAG-based AI agent in Appmixer, designed to help you easily navigate internal company policies.
As with any RAG-based AI agent, a critical part is providing private data as context. So how do you do that?
In this tutorial, we'll show you how to build a data ingestion pipeline that feeds documents from a Google Drive folder into a vector database—enabling Appmixer’s AI agents to efficiently access large volumes of private data.
Before we dive into the configuration details, watch our RAG-based AI agent in action.
A data ingestion pipeline is an automated process of collecting, importing, and processing data from various sources into a destination system (such as a data warehouse, data lake, or database), often in preparation for further analysis or use. In our case, it means feeding internal policies stored in Google Drive into a vector database.
What might a data ingestion pipeline look like when built in Appmixer?
To build our pipeline, we first need to identify where our data resides and where it needs to be stored for further use. In our example, internal policy documents live in a Google Drive folder and are transferred to a Pinecone vector database.
A key part of an efficient pipeline is ensuring that the data always stays up to date. To achieve this, we built four "sub-workflows" in Appmixer's no-code builder, each triggered by a different event: the flow starting (to ingest all existing documents), a new file being added to the folder, an existing file being updated, and a file being removed.
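Conceptually, each of these sub-workflows maps a trigger event to a single vector-database operation. The Python sketch below illustrates that mapping only; the event names and helper functions (ingest_all_files, upsert_file, delete_vectors) are hypothetical placeholders, since in Appmixer this logic is modeled visually with connectors rather than code.

```python
# Conceptual sketch of how the four sub-workflows map events to
# vector-database operations. The helper bodies are placeholders.

def ingest_all_files() -> None:
    print("Bulk-ingest every existing document (runs once when the flow starts)")

def upsert_file(file_id: str) -> None:
    print(f"Embed and insert/update vectors for document {file_id}")

def delete_vectors(file_id: str) -> None:
    print(f"Remove vectors belonging to document {file_id}")

def handle_event(event: dict) -> None:
    if event["type"] == "flow_started":
        ingest_all_files()
    elif event["type"] in ("file_created", "file_updated"):
        upsert_file(event["file_id"])
    elif event["type"] == "file_deleted":
        delete_vectors(event["file_id"])

handle_event({"type": "flow_started"})
```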
To demonstrate how we built the pipeline, we'll do a deep dive into the first "subflow", which saves all existing documents to the vector database when the flow starts. This will help you understand the basics of building such workflows so you can replicate them for other events and adapt them to your own use case.
The On Start component triggers the workflow as soon as it's published. You can select it from the trigger selector or drag it directly onto the canvas.
Next, add the Google Drive connector and select the Find Files or Folders action. Authenticate your account, then choose the folder that contains the documents you want to store in the vector database.
You can enable recursive search if you want to include files in subfolders, and you can also filter by file type—for example, limiting the search to only Google Sheets.
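Outside of Appmixer, the Find Files or Folders action roughly corresponds to a Drive API files.list query scoped to a folder. Below is a minimal sketch using the google-api-python-client library; the service-account file, the POLICY_FOLDER_ID placeholder, and the Google Sheets MIME-type filter are illustrative assumptions, not values from the tutorial.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Illustrative credentials and folder ID; substitute your own.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
drive = build("drive", "v3", credentials=creds)

# List files in the policies folder, filtered to Google Sheets only
# (equivalent to the file-type filter mentioned above).
query = (
    "'POLICY_FOLDER_ID' in parents "
    "and mimeType = 'application/vnd.google-apps.spreadsheet'"
)
result = drive.files().list(q=query, fields="files(id, name)").execute()
for f in result.get("files", []):
    print(f["id"], f["name"])
```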
From the out port of the previous component, add another Google Drive connector. This time, select the Download File action. Map the Google Drive File ID using the dynamic value from the previous component. Make sure to set Output File Data to true (to return the actual file content), and enable Convert Google Workspace Document to convert the file to plain text.
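The Download File action with Convert Google Workspace Document enabled is roughly equivalent to the Drive API files.export call, which returns a Workspace document converted to a requested MIME type. A sketch reusing the authenticated drive client from the previous snippet, with an illustrative file ID:

```python
# Export a Google Workspace document (Doc, Sheet, ...) as plain text.
# files.export only works on Workspace documents, which is why the
# conversion option matters for this step.
file_id = "EXAMPLE_FILE_ID"  # one of the IDs returned by files().list
content = drive.files().export(fileId=file_id, mimeType="text/plain").execute()
file_text = content.decode("utf-8")
print(file_text[:200])  # preview the extracted text
```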
The next step is to generate embeddings, which is essential for storing and querying data in a vector database. To do this, we use the OpenAI connector and the "Generate Embeddings" action. In the Text field, we input the File Data variable from the previous component and select the embedding model—in this case, text-embedding-ada-002.
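The same step can be reproduced outside the builder with the OpenAI Python SDK, where embeddings.create returns the vector for a given input. A minimal sketch, assuming OPENAI_API_KEY is set in the environment and reusing file_text from the previous step:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=file_text,  # plain-text content of the downloaded document
)
vector = response.data[0].embedding  # list of floats, 1536 dimensions for ada-002
print(len(vector))
```

Note that very long documents are usually split into smaller chunks before embedding so that each chunk stays within the model's token limit.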
"Generating embeddings" refers to the process of converting textual, visual, or other types of data into dense vector representations (numerical arrays) that capture semantic meaning and relationships.
The last step in our first "subflow" is to insert the vectors into Pinecone. For this, we use the Pinecone connector and the "Insert Vectors" action. After authenticating our account, we choose the Index and Namespace, and map the Embeddings from the previous component into the Vectors field.
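Outside of Appmixer, the Insert Vectors action corresponds to an upsert call in the Pinecone Python SDK. A sketch under assumed values (an existing index named "policies" whose dimension matches text-embedding-ada-002, and an "internal-policies" namespace), reusing file_id and vector from the previous steps:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("policies")  # assumed index name; its dimension must match the model

# Key the vector by the Google Drive file ID so it can be updated or
# deleted later when the source document changes.
index.upsert(
    vectors=[{"id": file_id, "values": vector, "metadata": {"source": "google-drive"}}],
    namespace="internal-policies",
)
```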
If you've been following the steps above, you've just created a pipeline that ingests all your documents into a vector database. To keep your database up to date, follow similar steps for other scenarios, such as updating your vector database with new files or deleting removed files.
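For instance, the "deleted file" scenario reduces to removing the document's vectors, which the Pinecone SDK exposes as index.delete. A sketch under the same assumptions as above (vectors keyed by the Google Drive file ID, an index named "policies", and an "internal-policies" namespace):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("policies")

# Remove the vectors of a document that was deleted from the Drive folder.
index.delete(ids=["EXAMPLE_FILE_ID"], namespace="internal-policies")
```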
The data ingestion pipeline serves an important purpose when building a RAG-based AI agent. If your goal is to build an agent that lets you or your users "talk to" private data, follow our tutorial on how to build it in Appmixer.