
Datasets & Index
Most language models are trained on outdated data and limit the context length of each request. For example, GPT-3.5 was trained on corpora up to 2021 and accepts roughly 4k tokens per request. This means that developers who want their AI applications to work with up-to-date or private content must use techniques such as embedding.
talentbot-enterprise-llmops-alops' dataset feature allows developers (and even non-technical users) to easily manage datasets and automatically integrate them into AI applications. All you need to do is prepare text content, such as:
- Long text content (TXT, Markdown, JSONL, or even PDF files)
- Structured data (CSV, Excel, etc.)
Additionally, we are gradually supporting syncing data from various data sources to datasets, including:
- Notion
- GitHub
- Databases
- Webpages
Practice: If your company wants to build an AI customer service assistant based on existing knowledge bases and product documentation, you can upload those documents to a dataset in Dify and create a conversational application. In the past, this might have taken several weeks and been difficult to maintain over time.
Datasets and Documents
In talentbot-enterprise-llmops-alops, datasets are collections of documents. A dataset can be integrated into an application as a whole and used as context. Documents can be uploaded by developers or operations staff, or synced from other data sources (a document typically corresponds to one file unit in the data source).
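If it helps to picture this relationship, here is a rough sketch in Python; the class and field names are illustrative assumptions, not Dify's actual data model:

```python
# Illustrative data model only; field names are assumptions, not Dify's schema.
from dataclasses import dataclass, field

@dataclass
class Document:
    name: str
    source: str        # e.g. "upload" or "notion"
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Dataset:
    # A dataset is a collection of documents that an application can use as context.
    name: str
    description: str
    documents: list[Document] = field(default_factory=list)
```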
Steps to upload a document:
1. Upload your file, usually a long text file or a spreadsheet.
2. Segment, clean, and preview the content.
3. Dify submits it to the LLM provider to be embedded as vector data and stored (see the sketch after these steps).
4. Set metadata for the document.
5. The document is ready to use in the application!
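Step 3 is handled by Dify internally. Conceptually it works roughly like the sketch below, which assumes the OpenAI Python SDK and an in-memory list standing in for a real vector store; the model name is illustrative:

```python
# Conceptual sketch only: Dify performs this step for you.
# Assumes the OpenAI Python SDK (>= 1.0); model name and storage are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Turn cleaned text segments into embedding vectors."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # illustrative embedding model
        input=chunks,
    )
    return [item.embedding for item in response.data]

# A toy "vector store"; in practice a real vector database is used.
vector_store = []
chunks = ["Segment one of the document...", "Segment two of the document..."]
for chunk, vector in zip(chunks, embed_chunks(chunks)):
    vector_store.append({"text": chunk, "vector": vector})
```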
How to write a good dataset description
When multiple datasets are referenced in an application, the AI uses the datasets' descriptions together with the user's question to decide which dataset to use for answering. A well-written dataset description therefore improves the accuracy with which the AI selects datasets.
The key to writing a good dataset description is to describe the dataset's content and characteristics clearly. It is recommended that the description begin like this: "Useful only when the question you want to answer is about the following: [specific description]". Here is an example description for a real estate dataset:
Useful only when the question you want to answer is about the following: global real estate market data from 2010 to 2020. This data includes information such as the average housing price, property sales volume, and housing types for each city. In addition, this dataset also includes some economic indicators such as GDP and unemployment rate, as well as some social indicators such as population and education level. These indicators can help analyze the trends and influencing factors of the real estate market. With this data, we can understand the development trends of the global real estate market, analyze the changes in housing prices in various cities, and understand the impact of economic and social factors on the real estate market.
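Conceptually, such descriptions can be combined with the user's question into a routing prompt along the lines of the sketch below. This only illustrates the idea; it is not Dify's internal implementation, and the dataset names and prompt wording are made up:

```python
# Conceptual sketch of how dataset descriptions can guide dataset selection.
# Not Dify's internal implementation; names and wording are illustrative.
datasets = {
    "real_estate": (
        "Useful only when the question you want to answer is about the following: "
        "global real estate market data from 2010 to 2020..."
    ),
    "hr_policies": (
        "Useful only when the question you want to answer is about the following: "
        "the company's internal HR policies and benefits."
    ),
}

def build_routing_prompt(question: str) -> str:
    """Ask the LLM which dataset (if any) matches the user's question."""
    lines = ["Choose the dataset whose description matches the question, or answer 'none'."]
    for name, description in datasets.items():
        lines.append(f"- {name}: {description}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

print(build_routing_prompt("How did housing prices in Berlin change after 2015?"))
```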
Create a dataset
Click Datasets in the main navigation bar of talentbot-enterprise-llmops-alops. On this page, you can see the existing datasets. Click "Create Dataset" to enter the creation wizard.
If you have already prepared your files, you can start by uploading the files.
If you haven't prepared your documents yet, you can create an empty dataset first.
Uploading Documents
By uploading a file
- Select the file you want to upload; batch uploads are supported.
- Preview the full text.
- Perform segmentation and cleaning (a rough sketch of this step follows the list).
- Wait for talentbot-enterprise-llmops-alops to process the data for you; this step usually consumes tokens from the LLM provider.
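For intuition, segmentation and cleaning behave roughly like the following sketch. The separator, segment length, and cleaning rules here are illustrative; the real settings are configured in the creation wizard:

```python
# A minimal sketch of segmentation and cleaning; the actual rules
# (separators, segment length, cleaning options) are configured in the wizard.
import re

def clean(text: str) -> str:
    """Collapse extra whitespace and strip control characters."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def segment(text: str, max_chars: int = 500) -> list[str]:
    """Split cleaned text into segments of roughly max_chars characters,
    preferring paragraph boundaries."""
    paragraphs = [clean(p) for p in text.split("\n\n") if p.strip()]
    segments, current = [], ""
    for paragraph in paragraphs:
        if len(current) + len(paragraph) + 1 <= max_chars:
            current = f"{current} {paragraph}".strip()
        else:
            if current:
                segments.append(current)
            current = paragraph
    if current:
        segments.append(current)
    return segments
```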

Modify Documents
For technical reasons, if developers make the following changes to a document, Dify will create a new document for you, and the old document will be archived and deactivated:
- Adjust segmentation and cleaning settings
- Re-upload the file

Maintain Datasets via API
TODO
Dataset Settings
Click Settings in the left navigation of the dataset. You can change the following settings for the dataset:
- Dataset name, for identifying the dataset.
- Dataset description, to help the AI use the dataset appropriately. If the description is empty, Dify's automatic indexing strategy will be used.
- Permissions, which can be set to Only Me or All Team Members. Those without permission cannot view or edit the dataset.
- Indexing mode. In High Quality mode, OpenAI's embedding interface is called to process the data, providing higher accuracy when users query. In Economic mode, offline vector engines, keyword indexing, and so on are used instead, reducing accuracy but consuming no tokens. (The sketch after the note below illustrates the difference.)

Note: Upgrading the indexing mode from Economic to High Quality will incur additional token consumption. Downgrading from High Quality to Economic will not consume tokens.
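The difference between the two modes can be pictured roughly as follows. These are toy stand-ins for illustration only, not the actual engines Dify uses:

```python
# Illustrative contrast between the two indexing modes; not Dify's actual engines.
# High Quality: embeddings + similarity search (consumes tokens at the provider).
# Economic: a local keyword (inverted) index, no token consumption.
from collections import defaultdict

def build_keyword_index(segments: list[str]) -> dict[str, set[int]]:
    """Economic mode sketch: map each keyword to the segments containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for i, segment in enumerate(segments):
        for word in segment.lower().split():
            index[word].add(i)
    return index

def keyword_search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return indices of segments sharing at least one keyword with the query."""
    hits: set[int] = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return hits

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """High Quality mode sketch: compare query and segment embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0
```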
Integrate into Applications
Once the dataset is ready, it needs to be integrated into the application. When the AI application handles a query, it will automatically use the associated dataset content as reference context.
1. Go to the application's Prompt Arrangement page.
2. In the context options, select the dataset you want to integrate.
3. Save the settings to complete the integration.
Q&A
Q: What should I do if the PDF upload is garbled?
A: If your PDF appears garbled after parsing due to certain formatted content, consider converting the PDF to Markdown, which currently offers higher accuracy, or reduce the use of images, tables, and other formatted content in the PDF. We are researching ways to optimize the PDF experience.
Q: Why does a query hit in hit testing but not in the application?
A: You can troubleshoot issues by following these steps:
1. Make sure you have added text on the prompt page and clicked the Save button in the top right corner.
2. Test whether it responds normally in the prompt debugging interface.
3. Try again in a new WebApp session window.
4. Optimize your data format and quality; for practical references, see https://github.com/langgenius/dify/issues/90

If none of these steps solve your problem, please join our community for help.
Q: How does the consumption mechanism of context work?
A: With a dataset added, each query consumes tokens for the retrieved segments (currently two segments are embedded as context), plus the question, the prompt, and the chat history combined. The total will not exceed the model's limit, such as 4096 tokens.
Q: Where does the embedded dataset appear when asking questions?
A: It will be embedded as context before the question.
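Putting the two answers above together, the request assembly can be pictured roughly like this sketch; the limit value and the token counter are illustrative, not Dify's exact accounting:

```python
# Rough sketch of the consumption mechanism described above; the limit and the
# token counter are illustrative, not Dify's exact accounting.
MODEL_LIMIT = 4096  # e.g. a 4k-token model

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (e.g. tiktoken)."""
    return max(1, len(text) // 4)

def build_request(prompt: str, context_segments: list[str],
                  history: list[str], question: str) -> str:
    """Context is placed before the question; history is trimmed first
    so the total stays within the model limit."""
    context = "\n".join(context_segments)  # e.g. the two retrieved segments
    while history and (
        count_tokens(prompt) + count_tokens(context)
        + count_tokens("\n".join(history)) + count_tokens(question)
    ) > MODEL_LIMIT:
        history = history[1:]  # drop the oldest turns first
    return "\n".join([prompt, context, *history, question])
```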
Q: How do I add multiple datasets?
A: Due to short-term performance considerations, we currently support only one dataset per application. If you have multiple sets of data, you can upload them into the same dataset.
Q: Is there any priority between the added dataset and OpenAI's answers?
A: The dataset serves as context and is used together with the question for the LLM to understand and answer; there is no priority relationship.
Q: Will APIs related to hit testing be opened up so that Dify can be used to access knowledge bases and implement dialogue generation with custom models?
A: We plan to open up webhooks later, but there are no current plans for this feature. You can achieve this requirement by connecting to any vector database yourself.
Sync from Notion
Datasets in talentbot-enterprise-llmops-alops support importing from Notion and setting up a sync so that data is automatically synced to talentbot-enterprise-llmops-alops after it is updated in Notion.
Authorization verification
1. When creating a dataset, select the data source, click Sync from Notion--Go to connect, and complete the authorization verification according to the prompts.
2. Alternatively, click Settings--Data Sources--Add a Data Source, then click Connect on the Notion source to complete authorization verification.
Import Notion data
After completing authorization verification, go to the dataset creation page, click Sync from Notion, and select the authorized pages you want to import.
Segmentation and cleaning
Next, select your segmentation settings and indexing method, then save and process. Wait for Dify to process the data; this step usually consumes tokens from the LLM provider. Dify supports importing not only ordinary page types but also summarizing and saving the page properties under database-type pages.
Note: Images and files are not currently supported for import. Table data will be converted to text.
Sync Notion data
If your Notion content has been modified, you can click Sync directly on the Dify dataset document list page to sync the data with one click (note that each click syncs the current content). This step consumes tokens.
