URL Parser

The URL parser converts many kinds of URL content into Markdown that large language models (LLMs) can read. The API handles URLs that point to text, PDFs, audio, video, and images, so content from different media types can be extracted and formatted through a single interface. It supports both standard output and stream output (SSE) modes.

Supported Content Types

The API is optimized for specific websites to improve conversion quality.

Web Page Text Parsing

For text content from web pages, the API extracts information such as the title, author, and publication date. It also extracts the web page content and converts it into Markdown format.

Streaming Media Parsing

For video content from streaming sites, the API extracts information such as the title, uploader, and publication date. If the site provides subtitles, the API extracts them, converts them into Markdown format, and bills the request per extraction. If the site does not provide subtitles, the API downloads the media and transcribes it in full, and bills the request based on its duration.

Note: If the site does not provide subtitles, the platform needs to download the entire video during the request, which may result in longer processing times.

Supported sites include, but are not limited to:

Bilibili
YouTube
Xiaoyuzhou FM
Ximalaya FM
Apple Podcasts
Spotify
...

If you have specific site requirements, please contact us.

Content Site Parsing

For content sites, the API extracts information such as the title, author, and publication date. It also extracts the content and converts it into Markdown format.

Supported sites include, but are not limited to:

WeChat Official Accounts
Zhihu Columns
Jianshu
CSDN
Juejin
36Kr
Sspai
PMCAFF
Jike
Weibo
Douban
Xiaohongshu
Toutiao
Feishu
...

File Parsing

For URLs that serve binary data files, the API extracts information such as the title, author, and publication date. It also extracts the content and converts it into Markdown format. If the file contains text content, that text is extracted directly. If it does not (for example, a PDF with no text content), the API downloads the file and transcribes it in full. In both cases the request is billed based on file size.

API Protocol

Endpoint

GET https://huntwiz/v1/scraper

Parameters

Parameter	Type	Description	Default
url*	string	URL to be parsed (required)	-
stream	boolean	Whether to use stream output (SSE)	false

Code Example

# -G appends the --data-urlencode values to the URL as a query string
# instead of sending them in a request body.
curl -G 'https://huntwiz/v1/scraper' \
  --header 'Authorization: Bearer {api_key}' \
  --header 'Content-Type: application/json' \
  --data-urlencode 'url=https://www.bilibili.com/video/BV1PK41127WS/?vd_source=8f86260f8e7527910b7d27006e3a9f5d' \
  --data-urlencode 'stream=false'
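
The same request can be sketched in Python using only the standard library. The helper name `build_scraper_request` is hypothetical, and serializing the boolean as the strings `true` / `false` is an assumption, since the documentation does not state the accepted spelling. The block only builds the request object; it does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint copied verbatim from the documentation above.
ENDPOINT = "https://huntwiz/v1/scraper"


def build_scraper_request(target_url: str, api_key: str, stream: bool = False) -> Request:
    """Build (but do not send) a GET request for the scraper endpoint."""
    # urlencode percent-encodes the target URL, so its own '?' and '&'
    # cannot be mistaken for part of the scraper's query string.
    # Assumption: the API accepts "true"/"false" for the boolean parameter.
    query = urlencode({"url": target_url, "stream": "true" if stream else "false"})
    return Request(
        f"{ENDPOINT}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = build_scraper_request(
    "https://www.bilibili.com/video/BV1PK41127WS/",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
)
print(req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call.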

Response Format

Standard Response

For non-streaming requests, the API returns a JSON object:

{
  "status": "success",
  "markdown": "# Parsed Content\n\nThis is the content converted into Markdown format...",
  "metadata": {
    "title": "Original Content Title",
    "author": "Author Name",
    "date": "2024-08-03",
    "type": "text"
  }
}
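
A minimal sketch of reading this object in Python follows. The `ParsedDocument` class and `parse_standard_response` function are illustrative names, not part of the API, and the sketch assumes every metadata field may be absent for some content types.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedDocument:
    markdown: str
    title: Optional[str]
    author: Optional[str]
    date: Optional[str]
    content_type: Optional[str]


def parse_standard_response(body: str) -> ParsedDocument:
    """Turn a non-streaming JSON response body into a ParsedDocument."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    # Assumption: metadata keys can be missing, so use .get() throughout.
    meta = payload.get("metadata") or {}
    return ParsedDocument(
        markdown=payload["markdown"],
        title=meta.get("title"),
        author=meta.get("author"),
        date=meta.get("date"),
        content_type=meta.get("type"),
    )


# The sample response from above (raw string keeps the JSON "\n" escapes intact).
sample_body = r"""
{
  "status": "success",
  "markdown": "# Parsed Content\n\nThis is the content converted into Markdown format...",
  "metadata": {
    "title": "Original Content Title",
    "author": "Author Name",
    "date": "2024-08-03",
    "type": "text"
  }
}
"""

doc = parse_standard_response(sample_body)
print(doc.title, doc.content_type)
```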

Streaming Response (SSE)

For streaming requests, the API returns data using the Server-Sent Events (SSE) format:

event: markdown
data: # Parsed Content

event: markdown
data:
data: This is the content converted into Markdown format...

event: metadata
data: {"title":"Original Content Title","author":"Author Name","date":"2024-08-03","type":"text"}

event: done
data:
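
The stream above could be consumed with a small parser like the following sketch. It is a simplified reader of the SSE line format (it handles only the `event` and `data` fields and comment lines), and the function names are hypothetical. Joining successive `markdown` chunks by plain concatenation is an assumption; the documentation does not say how chunks should be reassembled.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE lines."""
    event, data, seen_field = "message", [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if seen_field:
                yield event, "\n".join(data)
            event, data, seen_field = "message", [], False
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event, seen_field = value, True
        elif field == "data":
            data.append(value)  # multiple data lines are joined with "\n"
            seen_field = True
    if seen_field:
        # Flush a final event that was not followed by a blank line.
        yield event, "\n".join(data)


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Reassemble the Markdown text and metadata from a stream."""
    parts: List[str] = []
    metadata = None
    for event, data in iter_sse_events(lines):
        if event == "markdown":
            parts.append(data)  # assumption: chunks concatenate directly
        elif event == "metadata":
            metadata = json.loads(data)
        elif event == "done":
            break
    return "".join(parts), metadata


sample_lines = [
    "event: markdown",
    "data: # Parsed Content",
    "",
    "event: markdown",
    "data:",
    "data: This is the content converted into Markdown format...",
    "",
    "event: metadata",
    'data: {"title":"Original Content Title","author":"Author Name","date":"2024-08-03","type":"text"}',
    "",
    "event: done",
    "data:",
]

markdown, metadata = collect_stream(sample_lines)
print(markdown)
```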

Billing

Text Parsing

Billed per extraction
Each extraction deducts 3 Credits
Each URL is considered one extraction regardless of text length

Audio and Video Parsing

Billed per minute of media duration
Each minute deducts 3 Credits
Less than one minute is rounded up to one minute
For example: A 2-minute 30-second video will deduct 9 Credits (3 minutes * 3 Credits/minute)

Image, PDF, and Other File Parsing

Billed by file size
Each 1MB deducts 3 Credits
Less than 1MB is rounded up to 1MB
For example: A 2.7MB PDF file will deduct 9 Credits (3MB * 3 Credits/MB)
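
The three billing rules above reduce to round-up arithmetic. The sketch below reproduces the two worked examples; the function names are illustrative, and treating a zero-length input as one billable unit is an assumption the documentation does not confirm.

```python
import math

CREDITS_PER_EXTRACTION = 3  # text parsing
CREDITS_PER_MINUTE = 3      # audio and video parsing
CREDITS_PER_MB = 3          # image, PDF, and other file parsing


def text_credits() -> int:
    """One URL is one extraction, regardless of text length."""
    return CREDITS_PER_EXTRACTION


def media_credits(duration_seconds: float) -> int:
    """Round the duration up to whole minutes, then multiply."""
    # Assumption: anything shorter than a minute, including 0, bills as 1 minute.
    minutes = max(1, math.ceil(duration_seconds / 60))
    return minutes * CREDITS_PER_MINUTE


def file_credits(size_mb: float) -> int:
    """Round the size up to whole megabytes, then multiply."""
    megabytes = max(1, math.ceil(size_mb))
    return megabytes * CREDITS_PER_MB


print(media_credits(2 * 60 + 30))  # 2-minute 30-second video
print(file_credits(2.7))           # 2.7MB PDF
```

Both calls print 9, matching the examples in this section.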

Notes:

Credits are deducted based on the actual content processed, not based on the request parameters.
Before processing begins, the system checks if the account has sufficient Credits. If not, the API returns an error and does not start processing.
For stream output, Credits are deducted once the entire content processing is complete.
You can view the number of Credits consumed for each request in the API response metadata.

Error Handling

In case of errors, the API returns a JSON object containing error details:

{
  "status": "error",
  "error": {
    "code": "invalid_url",
    "message": "The provided URL is invalid or inaccessible"
  }
}

Common error codes include:

invalid_url: The provided URL is invalid or inaccessible
unsupported_content_type: Unsupported content type
insufficient_credits: Insufficient Credits to process the request
file_too_large: File size exceeds the supported limit
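
On the client side, an error object can be turned into an exception so callers can branch on the code. This is a minimal sketch: `ScraperError` and `raise_for_error` are hypothetical names, and the `"unknown"` fallback for a missing code is an assumption rather than documented behavior.

```python
import json

# Error codes listed in this section; the API may return others.
KNOWN_ERROR_CODES = {
    "invalid_url",
    "unsupported_content_type",
    "insufficient_credits",
    "file_too_large",
}


class ScraperError(Exception):
    """Raised when the API response has status "error"."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(payload: dict) -> None:
    """Raise ScraperError if the decoded response describes an error."""
    if payload.get("status") != "error":
        return
    error = payload.get("error") or {}
    # Assumption: fall back to "unknown" if the code field is absent.
    raise ScraperError(error.get("code", "unknown"), error.get("message", ""))


sample = json.loads(
    '{"status": "error", "error": {"code": "invalid_url", '
    '"message": "The provided URL is invalid or inaccessible"}}'
)

try:
    raise_for_error(sample)
except ScraperError as exc:
    documented = exc.code in KNOWN_ERROR_CODES
    print(f"request failed ({exc.code}, documented={documented}): {exc.message}")
```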

Limitations

The API is limited to 60 requests per minute
Maximum URL length is 2048 characters
Supported file size limits: Text (10MB), Audio (50MB), Video (100MB), Image (20MB)
Each account may have daily API call limits; check your account settings for details
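
A client can check the first two limits before sending anything. The sketch below is one way to do that with a sliding-window counter; the class and function names are hypothetical, and it assumes the 2048-character limit applies to the target URL passed in the `url` parameter, which the documentation does not specify.

```python
import time
from collections import deque
from typing import Callable, Deque

MAX_URL_LENGTH = 2048
MAX_REQUESTS_PER_MINUTE = 60


def check_url_length(url: str) -> None:
    """Reject URLs longer than the documented maximum."""
    # Assumption: the limit is measured on the target URL, not the full request URL.
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"URL is {len(url)} characters; maximum is {MAX_URL_LENGTH}")


class SlidingWindowLimiter:
    """Tracks call times so at most max_calls happen in any period."""

    def __init__(
        self,
        max_calls: int = MAX_REQUESTS_PER_MINUTE,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._calls: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop calls that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            return 0.0
        return self._period - (now - self._calls[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._calls.append(self._clock())


limiter = SlidingWindowLimiter()
check_url_length("https://www.bilibili.com/video/BV1PK41127WS/")
delay = limiter.wait_time()  # 0.0 on a fresh limiter
limiter.record()
print(delay)
```

Before each request, a caller would sleep for `wait_time()` seconds if it is non-zero and then call `record()`. The server remains the authority on rate limits, so this only reduces avoidable rejections.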