# HuntWiz

Convert web content into LLM-friendly markdown or structured data. Developed by HuntWiz Inc, HuntWiz includes advanced scraping, crawling, and data extraction capabilities.

> **Note:** This service is currently in its early development stages. We are still integrating custom modules into the main repository.
## How to use it?
We offer an easy-to-use API with our hosted version.
### API Key
To use the API, you need to sign up on HuntWiz and obtain an API key.
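Once you have a key, every request carries it in an `Authorization: Bearer` header, as shown in the curl examples below. As a minimal sketch, the header can be built from the `HUNTWIZ_API_KEY` environment variable (the same variable the SDKs read) instead of hard-coding the key:

```python
import os

# Read the key from the environment rather than hard-coding it;
# fall back to a placeholder for illustration only.
api_key = os.environ.get("HUNTWIZ_API_KEY", "YOUR_API_KEY")

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
print(headers["Authorization"])
```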
### Crawling
Use this endpoint to crawl a URL and all its accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
```sh
curl -X POST https://api.huntwiz.ai/v0/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{"url": "https://example.com"}'
```
Returns a jobId:
```json
{ "jobId": "1234-5678-9101" }
```
### Check Crawl Job
Use this endpoint to check the status of a crawl job and get its result.
```sh
curl -X GET https://api.huntwiz.ai/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'
```
```json
{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Example Title",
        "description": "Example Description",
        "language": null,
        "sourceURL": "https://www.example.com/"
      }
    }
  ]
}
```
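The `current` and `total` fields report crawl progress, and `data` holds the scraped pages once `status` is `completed`. A minimal sketch of handling a status payload like the one above (parsed here from a literal string for illustration):

```python
import json

# Example status payload, shaped like the crawl status response
payload = json.loads("""
{"status": "completed", "current": 22, "total": 22,
 "data": [{"markdown": "# Markdown Content"}]}
""")

done = payload["status"] == "completed"
progress = payload["current"] / payload["total"]  # 1.0 when finished

if done:
    for page in payload["data"]:
        print(page["markdown"])
```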
### Scraping

Use this endpoint to scrape a URL and obtain its content.
```sh
curl -X POST https://api.huntwiz.ai/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{"url": "https://example.com"}'
```
Response:
```json
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Example Title",
      "description": "Example Description",
      "language": null,
      "sourceURL": "https://www.example.com/"
    }
  }
}
```
### Search (Beta)
Use this endpoint to search the web, get the most relevant results, scrape each page, and return the markdown.
```sh
curl -X POST https://api.huntwiz.ai/v0/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "query": "huntwiz",
    "pageOptions": {
      "fetchPageContent": true
    }
  }'
```

Set `fetchPageContent` to `false` for a faster search that returns results without scraping each page.
```json
{
  "success": true,
  "data": [
    {
      "url": "https://example.com",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Example Title",
        "description": "Example Description",
        "language": null,
        "sourceURL": "https://www.example.com/"
      }
    }
  ]
}
```
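Each search result pairs a `url` with its scraped `markdown`. As a small sketch, the results can be collected into a URL-to-markdown mapping (the response is parsed from a literal string here for illustration):

```python
import json

# Example search response, shaped like the payload above
response = json.loads("""
{"success": true,
 "data": [{"url": "https://example.com",
           "markdown": "# Markdown Content",
           "provider": "web-scraper"}]}
""")

# Map each result URL to its markdown content
pages = {item["url"]: item["markdown"] for item in response["data"]}
print(pages)
```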
### Intelligent Extraction (Beta)
Use this endpoint to extract structured data from scraped pages.
```sh
curl -X POST https://api.huntwiz.ai/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://www.example.com/",
    "extractorOptions": {
      "mode": "llm-extraction",
      "extractionPrompt": "Based on the information on the page, extract the following schema",
      "extractionSchema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "supports_sso": { "type": "boolean" },
          "is_open_source": { "type": "boolean" },
          "is_in_yc": { "type": "boolean" }
        },
        "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
      }
    }
  }'
```
```json
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Example Title",
      "description": "Example Description",
      "robots": "follow, index",
      "ogTitle": "Example Title",
      "ogDescription": "Example Description",
      "ogURL": "https://example.com/",
      "ogImage": "https://example.com/image.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Example Site",
      "sourceURL": "https://example.com/"
    },
    "llm_extraction": {
      "company_mission": "Example mission statement",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}
```
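Since the schema marks all four fields as `required`, the `llm_extraction` object can be sanity-checked before use. A minimal sketch against a response shaped like the one above (parsed from a literal string for illustration):

```python
import json

required = ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]

# Example response, shaped like the payload above
response = json.loads("""
{"success": true,
 "data": {"llm_extraction": {"company_mission": "Example mission statement",
                             "supports_sso": true,
                             "is_open_source": false,
                             "is_in_yc": true}}}
""")

extraction = response["data"]["llm_extraction"]
missing = [key for key in required if key not in extraction]
assert not missing, f"schema fields missing from extraction: {missing}"
print(extraction["company_mission"])
```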
## Using the Python SDK

### Installing the Python SDK

```sh
pip install huntwiz-py
```
### Crawl a Website

```python
from huntwiz import HuntWizApp

app = HuntWizApp(api_key="YOUR_API_KEY")
crawl_result = app.crawl_url('example.com', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the markdown
for result in crawl_result:
    print(result['markdown'])
```
### Scraping a URL

To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.

```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```
### Extracting Structured Data from a URL
With LLM extraction, you can easily extract structured data from any URL. We support Pydantic schemas to simplify the process. Here is how to use it:
```python
from pydantic import BaseModel, Field
from typing import List

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")

data = app.scrape_url('https://news.example.com', {
    'extractorOptions': {
        'extractionSchema': TopArticlesSchema.model_json_schema(),
        'mode': 'llm-extraction'
    },
    'pageOptions': {'onlyMainContent': True}
})
print(data["llm_extraction"])
```
### Search for a Query
Perform a web search, retrieve the top results, and extract data from each page, returning their markdown.
```python
query = 'What is HuntWiz?'
search_result = app.search(query)
```
## Using the Node SDK
### Installation
To install the HuntWiz Node SDK, use npm:
```sh
npm install @huntwiz/huntwiz-js
```
### Usage

- Obtain an API key from huntwiz.ai
- Set the API key as an environment variable named `HUNTWIZ_API_KEY`, or pass it as a parameter to the `HuntWizApp` class.
### Scraping a URL

To scrape a single URL with error handling, use the `scrapeUrl` method. It takes the URL as a parameter and returns the scraped data as an object.

```js
try {
  const url = 'https://example.com';
  const scrapedData = await app.scrapeUrl(url);
  console.log(scrapedData);
} catch (error) {
  console.error('Error occurred while scraping:', error.message);
}
```
### Crawling a Website

To crawl a website with error handling, use the `crawlUrl` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

```js
const crawlUrl = 'https://example.com';
const params = {
  crawlerOptions: {
    excludes: ['blog/'],
    includes: [], // leave empty for all pages
    limit: 1000,
  },
  pageOptions: {
    onlyMainContent: true,
  },
};
const waitUntilDone = true;
const timeout = 5;
const crawlResult = await app.crawlUrl(
  crawlUrl,
  params,
  waitUntilDone,
  timeout
);
```
### Checking Crawl Status

To check the status of a crawl job with error handling, use the `checkCrawlStatus` method. It takes the job ID as a parameter and returns the current status of the crawl job.

```js
const jobId = '1234-5678-9101';
const status = await app.checkCrawlStatus(jobId);
console.log(status);
```
### Extracting Structured Data from a URL
With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to streamline the process. Here is how to use it:
```js
import HuntWizApp from '@huntwiz/huntwiz-js';
import { z } from 'zod';

const app = new HuntWizApp({
  apiKey: 'hw-YOUR_API_KEY',
});

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
});

const scrapeResult = await app.scrapeUrl('https://news.example.com', {
  extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data['llm_extraction']);
```
### Search for a Query

With the `search` method, you can submit a query to a search engine, fetch the top results, extract data from each page, and return the markdown. The method takes the query as a parameter and returns the search results.

```js
const query = 'What is HuntWiz?';
const searchResults = await app.search(query, {
  pageOptions: {
    fetchPageContent: true, // Fetch the page content for each search result
  },
});
```
## Contributing

We welcome contributions! Please read our contributing guide before submitting a pull request.

It is the sole responsibility of the end users to respect websites' policies when scraping, searching, and crawling with HuntWiz. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, HuntWiz respects the directives specified in the websites' robots.txt files when crawling. By utilizing HuntWiz, you expressly agree to comply with these conditions.