URL Parser

The URL parser converts content from many kinds of URLs into Markdown that large language models (LLMs) can read. The API handles URLs that point to text, PDFs, audio, video, and images, so content from different media types can be extracted and formatted through a single call. It supports both standard output and streaming output via Server-Sent Events (SSE).


Supported Content Types

The API includes site-specific optimizations for the websites listed below, which improves conversion quality on those sites.

Web Page Text Parsing

For text content from web pages, the API extracts information such as the title, author, and publication date. It also extracts the web page content and converts it into Markdown format.

Streaming Media Parsing

For audio and video content from streaming sites, the API extracts information such as the title, uploader, and publication date. It also extracts subtitles from the site and converts them into Markdown format. If the web page provides subtitles, the request is billed per extraction. If it does not, the API downloads the media, transcribes it in full, and bills based on its duration.

Note: If the site does not provide subtitles, the platform needs to download the entire video during the request, which may result in longer processing times.

Supported sites include, but are not limited to:

  • Bilibili
  • YouTube
  • Xiaoyuzhou FM
  • Ximalaya FM
  • Apple Podcasts
  • Spotify
  • ...

If you have specific site requirements, please contact us.

Content Site Parsing

For content sites, the API extracts information such as the title, author, and publication date. It also extracts the content and converts it into Markdown format.

Supported sites include, but are not limited to:

  • WeChat Official Accounts
  • Zhihu Columns
  • Jianshu
  • CSDN
  • Juejin
  • 36Kr
  • Sspai
  • PMCAFF
  • Jike
  • Weibo
  • Douban
  • Xiaohongshu
  • Toutiao
  • Feishu
  • ...

File Parsing

For URLs that serve binary data files, the API extracts information such as the title, author, and publication date. It also extracts the content and converts it into Markdown format. Files are billed by size in both cases: if the file already contains text content, the API extracts that text; if a PDF file contains no text content, the API downloads the file and transcribes it in full.

API Protocol

Endpoint

GET
https://huntwiz/v1/scraper

params

parameter   type      description                           default
url*        string    URL to be parsed
stream      boolean   Use stream output, default is false   false

Code Example

curl -G 'https://huntwiz/v1/scraper' \
  --header 'Authorization: Bearer {api_key}' \
  --header 'Content-Type: application/json' \
  --data-urlencode 'url=https://www.bilibili.com/video/BV1PK41127WS/?vd_source=8f86260f8e7527910b7d27006e3a9f5d' \
  --data-urlencode 'stream=false'
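
The same request can be built from Python. The following is a minimal sketch using only the standard library, not an official client. The helper name build_scraper_request is hypothetical, the endpoint is copied as printed in this document, and sending the boolean as the string "true" or "false" is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint exactly as printed in this document; substitute your real host.
SCRAPER_ENDPOINT = "https://huntwiz/v1/scraper"


def build_scraper_request(target_url: str, api_key: str, stream: bool = False) -> Request:
    """Build a GET request that carries `url` and `stream` in the query string.

    Hypothetical helper: encoding the boolean as "true"/"false" is an assumption.
    """
    query = urlencode({"url": target_url, "stream": "true" if stream else "false"})
    return Request(
        f"{SCRAPER_ENDPOINT}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


if __name__ == "__main__":
    req = build_scraper_request(
        "https://www.bilibili.com/video/BV1PK41127WS/", "YOUR_API_KEY"
    )
    print(req.full_url)
    # To actually send it: urllib.request.urlopen(req).read()
```

urlencode percent-encodes the target URL, so any query string inside it (such as vd_source=...) is passed through as part of the url parameter rather than being mistaken for a parameter of the API itself.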

Response Format

Standard Response

For non-streaming requests, the API returns a JSON object:

{
  "status": "success",
  "markdown": "# Parsed Content\n\nThis is the content converted into Markdown format...",
  "metadata": {
    "title": "Original Content Title",
    "author": "Author Name",
    "date": "2024-08-03",
    "type": "text"
  }
}
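
A client might unpack this object as follows. This is a sketch based only on the fields shown in the sample above; the names ParsedPage and parse_standard_response are hypothetical, and treating missing metadata fields as empty strings is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedPage:
    """Fields taken from the sample response; nothing else is assumed to exist."""
    markdown: str
    title: str
    author: str
    date: str
    content_type: str


def parse_standard_response(body: str) -> ParsedPage:
    """Decode a non-streaming response body into a ParsedPage."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    meta = payload.get("metadata", {})
    return ParsedPage(
        markdown=payload["markdown"],
        # Assumption: absent metadata fields default to an empty string.
        title=meta.get("title", ""),
        author=meta.get("author", ""),
        date=meta.get("date", ""),
        content_type=meta.get("type", ""),
    )
```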

Streaming Response (SSE)

For streaming requests, the API returns data using the Server-Sent Events (SSE) format:

event: markdown
data: # Parsed Content

event: markdown
data:
data: This is the content converted into Markdown format...

event: metadata
data: {"title":"Original Content Title","author":"Author Name","date":"2024-08-03","type":"text"}

event: done
data:
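
One way to consume such a stream is sketched below. It assumes events are separated by blank lines, as the SSE format specifies, and that markdown chunks are meant to be concatenated in arrival order; both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from SSE lines; a blank line ends an event."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # events without any data line are not dispatched
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip the single optional space after the colon
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
    if data:  # lenient: flush a final event that lacks a trailing blank line
        yield event, "\n".join(data)


def collect_stream(lines: Iterable[str]) -> Tuple[str, Dict[str, str]]:
    """Join markdown chunks and decode the metadata event; stop at `done`."""
    chunks: List[str] = []
    metadata: Dict[str, str] = {}
    for event, data in iter_sse_events(lines):
        if event == "markdown":
            chunks.append(data)  # assumption: chunks concatenate in order
        elif event == "metadata":
            metadata = json.loads(data)
        elif event == "done":
            break
    return "".join(chunks), metadata
```

Multiple data lines inside one event are joined with a newline, which is how an empty data line followed by text yields a chunk that begins with a line break.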

Billing

  1. Text Parsing
     • Billed per extraction
     • Each extraction deducts 3 Credits
     • Each URL is considered one extraction regardless of text length
  2. Audio and Video Parsing
     • Billed per minute of media duration
     • Each minute deducts 3 Credits
     • Less than one minute is rounded up to one minute
     • For example: A 2-minute 30-second video will deduct 9 Credits (3 minutes * 3 Credits/minute)
  3. Image, PDF, and Other File Parsing
     • Billed per file size
     • Each 1MB deducts 3 Credits
     • Less than 1MB is rounded up to 1MB
     • For example: A 2.7MB PDF file will deduct 9 Credits (3MB * 3 Credits/MB)
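
The rounding rules above can be sketched as a small estimator for budgeting before a request. The function names are hypothetical, and charging a minimum of one unit for zero-length input is an assumption; the figures the API actually deducts are authoritative.

```python
import math

CREDITS_PER_UNIT = 3  # 3 Credits per extraction, per minute, or per MB


def text_credits() -> int:
    """One URL counts as one extraction, regardless of text length."""
    return CREDITS_PER_UNIT


def media_credits(duration_seconds: float) -> int:
    """Audio/video: partial minutes are rounded up to a whole minute."""
    minutes = max(1, math.ceil(duration_seconds / 60))  # assumption: minimum 1
    return minutes * CREDITS_PER_UNIT


def file_credits(size_mb: float) -> int:
    """Images, PDFs, other files: partial megabytes are rounded up."""
    megabytes = max(1, math.ceil(size_mb))  # assumption: minimum 1
    return megabytes * CREDITS_PER_UNIT


if __name__ == "__main__":
    print(media_credits(150))  # 2 min 30 s, billed as 3 minutes
    print(file_credits(2.7))   # 2.7MB, billed as 3MB
```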

Notes:

  • Credits are deducted based on the actual content processed, not based on the request parameters.
  • Before processing begins, the system checks if the account has sufficient Credits. If not, the API returns an error and does not start processing.
  • For stream output, Credits are deducted once the entire content processing is complete.
  • You can view the number of Credits consumed for each request in the API response metadata.

Error Handling

In case of errors, the API returns a JSON object containing error details:

{
  "status": "error",
  "error": {
    "code": "invalid_url",
    "message": "The provided URL is invalid or inaccessible"
  }
}

Common error codes include:

  • invalid_url: The provided URL is invalid or inaccessible
  • unsupported_content_type: Unsupported content type
  • insufficient_credits: Insufficient Credits to process the request
  • file_too_large: File size exceeds the supported limit
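
On the client side, an error object like the one above can be turned into an exception so that callers can branch on the code. This is a sketch; the names ScraperAPIError and raise_for_error are hypothetical, and falling back to the code "unknown" when the error object is missing is an assumption.

```python
from typing import Any, Dict


class ScraperAPIError(Exception):
    """Carries the `code` and `message` fields of an error response."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the payload unchanged, or raise if its status is "error"."""
    if payload.get("status") == "error":
        err = payload.get("error", {})
        # Assumption: fall back to "unknown" if the error object is incomplete.
        raise ScraperAPIError(err.get("code", "unknown"), err.get("message", ""))
    return payload
```

A caller could, for example, catch ScraperAPIError and check whether exc.code equals "insufficient_credits" before deciding to retry.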

Limitations

  • The API is limited to 60 requests per minute
  • Maximum URL length is 2048 characters
  • Supported file size limits: Text (10MB), Audio (50MB), Video (100MB), Image (20MB)
  • Each account may have a daily API call limit; check your account settings for details
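
To stay under the per-minute limit, a client can throttle itself before sending requests. The sketch below uses a sliding 60-second window; whether the server counts requests in a sliding or a fixed window is not stated here, so treat that as an assumption. The class name SlidingWindowLimiter is hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(
        self,
        max_calls: int = 60,
        period: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self._period:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            return 0.0
        return self._period - (now - self._calls[0])

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._calls.append(self._clock())
```

Calling limiter.acquire() immediately before each request keeps the client at or below 60 requests in any one-minute span. The clock and sleep parameters exist only so the limiter can be exercised without real waiting.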