URL Parser
The URL parser converts content from many kinds of URLs into Markdown, a format that large language models (LLMs) read well. The API handles URLs that point to text, PDFs, audio, video, and images, which makes extracting and formatting content from different media types simple and efficient. It supports both standard output and streaming output (SSE) modes.
Supported Content Types
The API is optimized for specific websites to improve conversion quality.
Web Page Text Parsing
For text content from web pages, the API extracts information such as the title, author, and publication date. It also extracts the web page content and converts it into Markdown format.
Streaming Media Parsing
For video content from streaming sites, the API extracts information such as the title, uploader, and publication date. It also extracts subtitles from the site and converts them into Markdown format. If the web page provides subtitles, the request is billed per extraction. If subtitles are not provided, the API downloads the video and fully transcribes its audio, and the request is billed by media duration.
Note: If the site does not provide subtitles, the platform needs to download the entire video during the request, which may result in longer processing times.
Supported sites include, but are not limited to:
- Bilibili
- YouTube
- Xiaoyuzhou FM
- Ximalaya FM
- Apple Podcasts
- Spotify
- ...
If you have specific site requirements, please contact us.
Content Site Parsing
For content sites, the API extracts information such as the title, author, and publication date. It also extracts the content and converts it into Markdown format.
Supported sites include, but are not limited to:
- WeChat Official Accounts
- Zhihu Columns
- Jianshu
- CSDN
- Juejin
- 36Kr
- Sspai
- PMCAFF
- Jike
- Douban
- Xiaohongshu
- Toutiao
- Feishu
- ...
File Parsing
For URLs that provide binary data files, the API extracts information such as the title, author, and publication date. It also extracts the content and converts it into Markdown format. If the file provides text content, the request is billed by file size. If a PDF file does not provide text content, the API downloads the PDF and fully transcribes it, and the request is also billed by file size.
API Protocol
Endpoint
GET https://huntwiz/v1/scraper
Parameters
| parameter | type | description | default |
|---|---|---|---|
| url* | string | URL to be parsed | |
| stream | boolean | Use stream output, default is false | false |
Code Example
curl -G 'https://huntwiz/v1/scraper' \
  --header 'Authorization: Bearer {api_key}' \
  --header 'Content-Type: application/json' \
  --data-urlencode 'url=https://www.bilibili.com/video/BV1PK41127WS/?vd_source=8f86260f8e7527910b7d27006e3a9f5d' \
  --data-urlencode 'stream=false'
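The same request can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: `build_request` and `parse_url` are hypothetical helper names, the endpoint host is copied as written on this page, and sending `stream` as the text `true`/`false` is an assumption about how the API reads booleans.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied verbatim from this page; adjust the host for your deployment.
ENDPOINT = "https://huntwiz/v1/scraper"


def build_request(target_url: str, api_key: str, stream: bool = False) -> urllib.request.Request:
    """Build a GET request carrying `url` and `stream` as query parameters."""
    # Assumption: the boolean is accepted as the lowercase text "true"/"false".
    query = urllib.parse.urlencode({"url": target_url, "stream": str(stream).lower()})
    return urllib.request.Request(
        f"{ENDPOINT}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def parse_url(target_url: str, api_key: str) -> dict:
    """Send a non-streaming request and decode the JSON body."""
    with urllib.request.urlopen(build_request(target_url, api_key)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Building the request separately from sending it keeps the URL encoding easy to inspect before any Credits are spent.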
Response Format
Standard Response
For non-streaming requests, the API returns a JSON object:
{
  "status": "success",
  "markdown": "# Parsed Content\n\nThis is the content converted into Markdown format...",
  "metadata": {
    "title": "Original Content Title",
    "author": "Author Name",
    "date": "2024-08-03",
    "type": "text"
  }
}
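A short sketch of reading a standard response, assuming the body has exactly the fields shown in the sample. `extract_markdown` is a hypothetical helper name, and raising `ValueError` on a non-success status is a design choice for this example rather than documented behavior.

```python
import json

# Sample body copied from this page (raw string so \n stays a JSON escape).
SAMPLE_BODY = r'''{"status": "success",
"markdown": "# Parsed Content\n\nThis is the content converted into Markdown format...",
"metadata": {"title": "Original Content Title", "author": "Author Name",
             "date": "2024-08-03", "type": "text"}}'''


def extract_markdown(body: str) -> tuple[str, dict]:
    """Return (markdown, metadata) from a standard, non-streaming response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    # Metadata is treated as optional here; the sample always includes it.
    return payload["markdown"], payload.get("metadata", {})


markdown, metadata = extract_markdown(SAMPLE_BODY)
```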
Streaming Response (SSE)
For streaming requests, the API returns data using the Server-Sent Events (SSE) format:
event: markdown
data: # Parsed Content

event: markdown
data:
data: This is the content converted into Markdown format...

event: metadata
data: {"title":"Original Content Title","author":"Author Name","date":"2024-08-03","type":"text"}

event: done
data:
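The stream can be consumed with a small line-based SSE parser. The sketch below assumes the event names shown in the sample (`markdown`, `metadata`, `done`), that events are separated by blank lines as the SSE format requires, and that `markdown` chunks are meant to be concatenated in arrival order. `iter_sse_events` and `collect` are hypothetical names.

```python
import json
from typing import Iterable, Iterator, Tuple

# Sample stream from this page, one list item per line.
SAMPLE = [
    "event: markdown",
    "data: # Parsed Content",
    "",
    "event: markdown",
    "data:",
    "data: This is the content converted into Markdown format...",
    "",
    "event: metadata",
    'data: {"title":"Original Content Title","author":"Author Name","date":"2024-08-03","type":"text"}',
    "",
    "event: done",
    "data:",
    "",
]


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs; multiple data lines are joined with newlines."""
    event, data, seen = "message", [], False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line ends the current event
            if seen:
                yield event, "\n".join(data)
            event, data, seen = "message", [], False
            continue
        if line.startswith(":"):  # SSE comment line
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "event":
            event, seen = value, True
        elif field == "data":
            data.append(value)
            seen = True
    if seen:  # lenient: flush a final event that lacks a trailing blank line
        yield event, "\n".join(data)


def collect(lines: Iterable[str]) -> Tuple[str, dict]:
    """Concatenate markdown chunks and decode metadata until `done` arrives."""
    parts, metadata = [], {}
    for event, data in iter_sse_events(lines):
        if event == "markdown":
            parts.append(data)
        elif event == "metadata":
            metadata = json.loads(data)
        elif event == "done":
            break
    return "".join(parts), metadata
```

For a live response, the `lines` argument could be any iterable of decoded text lines from the HTTP body.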
Billing
- Text Parsing
  - Billed per extraction
  - Each extraction deducts 3 Credits
  - Each URL is considered one extraction regardless of text length
- Audio and Video Parsing
  - Billed per minute of media duration
  - Each minute deducts 3 Credits
  - Less than one minute is rounded up to one minute
  - For example: A 2-minute 30-second video will deduct 9 Credits (3 minutes * 3 Credits/minute)
- Image, PDF, and Other File Parsing
  - Billed per file size
  - Each 1MB deducts 3 Credits
  - Less than 1MB is rounded up to 1MB
  - For example: A 2.7MB PDF file will deduct 9 Credits (3MB * 3 Credits/MB)
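The rounding rules above can be checked with a small estimator. This is a sketch for budgeting only: the function names are hypothetical, 1MB is assumed to mean 1024 * 1024 bytes (the page does not say which definition applies), and zero-length inputs are assumed to count as one unit.

```python
import math

CREDITS_PER_UNIT = 3  # per extraction, per minute, and per MB, as listed above


def text_credits() -> int:
    """One URL is one extraction, regardless of text length."""
    return CREDITS_PER_UNIT


def media_credits(duration_seconds: float) -> int:
    """Audio/video: round the duration up to whole minutes."""
    minutes = max(1, math.ceil(duration_seconds / 60))
    return minutes * CREDITS_PER_UNIT


def file_credits(size_bytes: int) -> int:
    """Image/PDF/other files: round the size up to whole MB (assumed 1024 * 1024 bytes)."""
    megabytes = max(1, math.ceil(size_bytes / (1024 * 1024)))
    return megabytes * CREDITS_PER_UNIT


video = media_credits(2 * 60 + 30)            # 2 min 30 s is billed as 3 minutes
pdf = file_credits(int(2.7 * 1024 * 1024))    # 2.7MB is billed as 3MB
```

Both worked examples from the list come out to 9 Credits under these assumptions.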
Notes:
- Credits are deducted based on the actual content processed, not based on the request parameters.
- Before processing begins, the system checks if the account has sufficient Credits. If not, the API returns an error and does not start processing.
- For stream output, Credits are deducted once the entire content processing is complete.
- You can view the number of Credits consumed for each request in the API response metadata.
Error Handling
In case of errors, the API returns a JSON object containing error details:
{
  "status": "error",
  "error": {
    "code": "invalid_url",
    "message": "The provided URL is invalid or inaccessible"
  }
}
Common error codes include:
- invalid_url: The provided URL is invalid or inaccessible
- unsupported_content_type: Unsupported content type
- insufficient_credits: Insufficient Credits to process the request
- file_too_large: File size exceeds the supported limit
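One way to surface these errors in client code is to turn the error object into an exception. The sketch below assumes the error body has the shape shown in this section; `ScraperError` and `raise_for_error` are hypothetical names, not part of any published SDK.

```python
import json

# Sample error body copied from this section.
SAMPLE_ERROR = (
    '{"status": "error", "error": {"code": "invalid_url", '
    '"message": "The provided URL is invalid or inaccessible"}}'
)


class ScraperError(Exception):
    """Carries the API's error code and message."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(payload: dict) -> dict:
    """Return the payload unchanged, or raise ScraperError if status is "error"."""
    if payload.get("status") == "error":
        err = payload.get("error", {})
        raise ScraperError(err.get("code", "unknown"), err.get("message", ""))
    return payload


try:
    raise_for_error(json.loads(SAMPLE_ERROR))
except ScraperError as exc:
    if exc.code == "insufficient_credits":
        pass  # e.g. stop sending requests until the account is topped up
```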
Limitations
- The API allows at most 60 requests per minute
- Maximum URL length is 2048 characters
- Supported file size limits: Text (10MB), Audio (50MB), Video (100MB), Image (20MB)
- Each account may have daily API call limits; check your account settings for details
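A client can enforce the first two limits locally before sending anything. The sketch below is illustrative: the names are hypothetical, and the sliding 60-second window is an assumption, since this page does not say how the server measures "per minute".

```python
import time
from collections import deque
from typing import Callable, Deque

MAX_URL_LENGTH = 2048          # characters, from the list above
MAX_REQUESTS_PER_MINUTE = 60   # from the list above


def check_url_length(url: str) -> bool:
    """True if the URL is within the documented length limit."""
    return len(url) <= MAX_URL_LENGTH


class MinuteRateLimiter:
    """Client-side sliding-window limiter (assumed window model)."""

    def __init__(
        self,
        limit: int = MAX_REQUESTS_PER_MINUTE,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window
        self._clock = clock
        self._sent: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a request and return True if one may be sent now."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) < self._limit:
            self._sent.append(now)
            return True
        return False
```

The injectable `clock` exists so the limiter can be tested without waiting in real time.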