Knowledge base API batch upload #30395

shuming-dev · 2025-12-30T15:00:42Z

Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit the issue in English; otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I built a pipeline knowledge base and want to upload documents to it in batches through the API, but the request returns a 401 error. Dify is deployed locally, version 1.11.2, and the api_key is a generated string starting with "dataset-". I believe the request body is correct and the key is valid: listing the knowledge bases with it returns a response, but uploading fails. Where is the problem?

import requests

url = "http://xxx.46.xx.xxx:xxxx/v1/files/upload"

payload = {"user": "admin@xxxx.cn"}
headers = {"Authorization": "Bearer dataset-VW0at8WN5kUMjfdR7Yl1GqYm"}

# Open the file in a context manager so the handle is closed after the request.
with open("data/test.pdf", "rb") as f:
    files = {"file": ("example-file", f)}
    response = requests.post(url, data=payload, files=files, headers=headers)

print(response.text)

{"code":"unauthorized","message":"Access token is invalid","status":401}

2. Additional context or comments

No response

@shuming-dev · 2025-12-30T15:08:00Z

dosubot[bot]
bot Dec 30, 2025

Hi @shuming-dev! I'm Dosu and I’m helping the dify team.

The 401 error happens because the API key you’re using (starting with dataset-) is dataset-scoped and only supports read operations like listing knowledge bases—not write operations like file uploads. For uploading documents in batch via API, you need a user-scoped API token with editor or operator privileges, which you can generate in the Dify web UI if you’re self-hosting [details].

The correct endpoints for uploading files to the knowledge base are:

/v1/datasets/{dataset_id}/document/create-by-file
/v1/datasets/pipeline/file-upload

Your request should be a multipart/form-data POST with:

Authorization: Bearer {user_api_token} (not the dataset key)
file: the actual file to upload
data: a JSON string for extra parameters (like doc_form, indexing_technique, etc.), if needed

Example using Python requests:

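import json

import requests

dataset_id = "your-dataset-id"  # placeholder: replace with the real dataset ID
url = f"http://your-dify-server/v1/datasets/{dataset_id}/document/create-by-file"
headers = {"Authorization": "Bearer user-api-token"}
# Extra parameters are sent as a JSON string in the "data" form field.
data = {
    "data": json.dumps({"doc_form": "text_model", "indexing_technique": "high_quality"})
}
# Open the file in a context manager so the handle is closed after the request.
with open("data/test.pdf", "rb") as f:
    response = requests.post(url, headers=headers, files={"file": f}, data=data)
print(response.text)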

Common pitfalls: expired tokens, incorrect endpoint, wrong token type, or malformed multipart requests. The API automatically sets user context from the token, so don’t include user ID/email in the payload [usage tips].

If you regenerate your token and use the correct endpoint and format, batch uploads should work [troubleshooting].
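For the batch case, a minimal sketch of looping over a folder and sending one create-by-file request per PDF might look like the following. The helper names (`build_upload_url`, `build_data_field`, `upload_directory`), the `*.pdf` filter, and the `post` parameter are illustrative assumptions, not part of Dify; the endpoint path and the `data` form field come from the example above, and the token is whichever one your deployment accepts for this endpoint.

```python
import json
from pathlib import Path


def build_upload_url(base_url: str, dataset_id: str) -> str:
    """Join the server address and dataset ID into the create-by-file URL."""
    return f"{base_url.rstrip('/')}/v1/datasets/{dataset_id}/document/create-by-file"


def build_data_field(doc_form: str = "text_model",
                     indexing_technique: str = "high_quality") -> dict:
    """Return the form field that carries the extra parameters as a JSON string."""
    return {"data": json.dumps({"doc_form": doc_form,
                                "indexing_technique": indexing_technique})}


def upload_directory(directory, base_url, dataset_id, token, post=None):
    """Upload every *.pdf in `directory`; return {file name: HTTP status code}.

    `post` is a hypothetical hook for injecting a stand-in during testing;
    by default the real requests.post is used.
    """
    if post is None:
        import requests  # third-party; imported lazily so the helpers work without it
        post = requests.post
    url = build_upload_url(base_url, dataset_id)
    headers = {"Authorization": f"Bearer {token}"}
    results = {}
    for path in sorted(Path(directory).glob("*.pdf")):
        # One request per file; the handle is closed as soon as the request returns.
        with path.open("rb") as f:
            resp = post(url, headers=headers,
                        files={"file": (path.name, f)},
                        data=build_data_field())
        results[path.name] = resp.status_code
    return results
```

Calling `upload_directory("data", "http://your-dify-server", dataset_id, token)` returns a mapping from file name to status code, so any file that came back with a 401 or another error is easy to spot and retry.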

4 replies

shuming-dev Dec 31, 2025
Author

I adjusted the request and uploading now works, but the document is not parsed according to the configured pipeline. I use MinerU to parse documents. Here is my code; please help me find the problem:

import json
import os

import requests

# file_path, DATASET_ID, and API_KEY are defined elsewhere in the script.
file_name = os.path.basename(file_path)
data = {
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": True}
            ],
            "segmentation": {"separator": "\n\n", "max_tokens": 1024},
            "subchunk_segmentation": {"separator": "\n", "max_tokens": 512},
            "parent_mode": "paragraph"
        }
    }
}

with open(file_path, "rb") as f:
    resp = requests.post(
        url=f"http://172.16.xxx.xxx:xx/v1/datasets/{DATASET_ID}/document/create-by-file",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "X-Tenant-ID": DATASET_ID
        },
        files={
            "file": (file_name, f, "application/pdf"),
            "data": (None, json.dumps(data, separators=(",", ":")), "application/json")
        },
        verify=False
    )

dosubot[bot] bot Dec 31, 2025

Your payload structure looks correct, but there are a few key reasons why the custom pipeline (including MinerU/hierarchical parsing) might not be applied as expected:

The dataset must have been created with doc_form: "hierarchical_model" and indexing_technique: "high_quality". All uploads must match these values—if the dataset was created with a different doc_form, hierarchical parsing will be ignored or rejected. You can’t change the doc_form of an existing dataset after creation; if it doesn’t match, you’ll need to create a new dataset for hierarchical mode.
See discussion
The process_rule structure must be exactly as expected: parent_mode and subchunk_segmentation must be at the same level as segmentation under rules, not nested inside segmentation or elsewhere. Your example matches this, but double-check for typos or extra nesting.
See example
The API endpoint /v1/datasets/{dataset_id}/document/create-by-file does not always reliably trigger the full Knowledge Pipeline (including MinerU) as the Web UI does. Sometimes it falls back to traditional parsing and chunking, even if the API returns 200 OK.
See known limitation
Make sure you are using a user-scoped API token (with editor/operator privileges), not a dataset-scoped token, for uploads.
Token requirements
If you need guaranteed pipeline execution (especially for custom extractors like MinerU), uploading via the Web UI is currently the only reliable method. There is a feature request for a dedicated pipeline API endpoint, but it’s not available yet.
Pipeline API request
For debugging, check backend logs for any errors or warnings about process_rule validation or pipeline invocation. Also, ensure your Celery worker is running with the pipeline queue included, or jobs may get stuck.

If you want to experiment with direct pipeline invocation, you can try the /datasets/{dataset_id}/pipeline/run API for more control, but this is more advanced and may require additional setup.
Direct pipeline API

If you confirm the dataset config and payload structure are correct and still see issues, it’s likely due to the current API limitation with pipeline triggering.
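The structural point about process_rule above (parent_mode and subchunk_segmentation sitting next to segmentation under rules) can be checked locally before sending a request. The following is a rough sketch only: `check_process_rule` is a hypothetical helper, and it tests nothing beyond the layout described in this reply, not Dify's actual server-side validation.

```python
def check_process_rule(data: dict) -> list:
    """Return a list of layout problems; an empty list means the layout looks right."""
    process_rule = data.get("process_rule")
    rules = process_rule.get("rules") if isinstance(process_rule, dict) else None
    if not isinstance(rules, dict):
        return ["process_rule.rules is missing"]

    problems = []
    # All three keys are expected directly under "rules".
    for key in ("segmentation", "subchunk_segmentation", "parent_mode"):
        if key not in rules:
            problems.append(f"rules.{key} is missing")

    # Catch the nesting mistake: sibling keys placed inside "segmentation".
    segmentation = rules.get("segmentation")
    if isinstance(segmentation, dict):
        for key in ("subchunk_segmentation", "parent_mode"):
            if key in segmentation:
                problems.append(f"{key} is nested inside segmentation")
    return problems
```

Running this on the `data` dictionary from the question returns an empty list, which is consistent with the observation that the posted payload already has the expected layout.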

shuming-dev Dec 31, 2025
Author

Hello, I would like to ask what settings are needed for using the /datasets/{dataset_id}/pipeline/run API. Are there any templates for reference? I don't see any relevant introductions in the documentation.

dosubot[bot] bot Dec 31, 2025

The /datasets/{dataset_id}/pipeline/run API lets you trigger a knowledge pipeline run on a dataset for advanced workflows. There’s no official documentation, but the code shows the required settings and payload structure.

Prerequisites:

The dataset must exist and have API access enabled.
You need a user-scoped API token (Bearer token) with editor or dataset operator privileges.
The pipeline must be configured for your dataset.
Pipeline worker must be running.

Request:
Send a POST request to /v1/datasets/{dataset_id}/pipeline/run with the following headers:

Authorization: Bearer {your_user_api_token}
Content-Type: application/json

Payload template:

{
  "inputs": { "your_variable": "value" },             // mapping of input variables for the pipeline
  "datasource_type": "online_document",               // type of datasource (example: "online_document")
  "datasource_info_list": [                           // list of datasource info objects
    { "url": "https://example.com/doc.pdf" }
  ],
  "start_node_id": "node-uuid-or-name",               // ID of the pipeline node to start from
  "is_published": true,                               // true for published pipeline, false for draft
  "response_mode": "streaming"                        // "streaming" or "blocking"
}

All fields are required. Adjust values to match your pipeline configuration.
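As a sketch of how the template could be assembled in Python, the helper below builds the payload and reports any missing field before it is sent. The field names are taken from the template in this reply and have not been checked against official documentation; `build_pipeline_run_payload`, `missing_fields`, and the default values are hypothetical.

```python
import json

# Field names as listed in the template above (unverified against official docs).
REQUIRED_FIELDS = (
    "inputs",
    "datasource_type",
    "datasource_info_list",
    "start_node_id",
    "is_published",
    "response_mode",
)


def build_pipeline_run_payload(inputs, datasource_type, datasource_info_list,
                               start_node_id, is_published=True,
                               response_mode="blocking"):
    """Assemble the request body; reject response modes the template does not list."""
    if response_mode not in ("streaming", "blocking"):
        raise ValueError("response_mode must be 'streaming' or 'blocking'")
    return {
        "inputs": dict(inputs),
        "datasource_type": datasource_type,
        "datasource_info_list": list(datasource_info_list),
        "start_node_id": start_node_id,
        "is_published": bool(is_published),
        "response_mode": response_mode,
    }


def missing_fields(payload: dict) -> list:
    """Return the template fields that are absent from the payload."""
    return [name for name in REQUIRED_FIELDS if name not in payload]


payload = build_pipeline_run_payload(
    inputs={"your_variable": "value"},
    datasource_type="online_document",
    datasource_info_list=[{"url": "https://example.com/doc.pdf"}],
    start_node_id="node-uuid-or-name",
)
body = json.dumps(payload)  # JSON string to send with Content-Type: application/json
```

The resulting `body` would be sent in a POST to /v1/datasets/{dataset_id}/pipeline/run with the Authorization header described above.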

Response:

On success: 200 OK with either a streaming or blocking response containing pipeline output.
On error: 401 Unauthorized (invalid token), 403 Forbidden (insufficient privileges), or a pipeline error message.

For more details, see the code for PipelineRunApi and the payload model.

Let me know if you need help with specific field values or pipeline setup!
