# HuntWiz

Convert web content into LLM-friendly markdown or structured data. Developed by HuntWiz Inc, HuntWiz includes advanced scraping, crawling, and data extraction capabilities.

> **Note:** This service is currently in its early development stages. We are still integrating custom modules into the main repository.
## How to use it?
We offer an easy-to-use API with our hosted version.
### API Key
To use the API, you need to sign up on HuntWiz and obtain an API key.
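Once you have a key, every request carries it in an `Authorization: Bearer` header, as shown in the curl examples below. As a minimal sketch, the header can be built from the `HUNTWIZ_API_KEY` environment variable (the same variable the SDKs read) instead of hard-coding the key:

```python
import os

# Read the key from the environment rather than hard-coding it;
# fall back to a placeholder for illustration only.
api_key = os.environ.get("HUNTWIZ_API_KEY", "YOUR_API_KEY")

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
print(headers["Authorization"])
```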
### Crawling
Use this endpoint to crawl a URL and all its accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
```sh
curl -X POST https://api.huntwiz.ai/v0/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{"url": "https://example.com"}'
```
Returns a jobId:
```json
{ "jobId": "1234-5678-9101" }
```
### Check Crawl Job
Use this endpoint to check the status of a crawl job and get its result.
```sh
curl -X GET https://api.huntwiz.ai/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'
```
```json
{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Example Title",
        "description": "Example Description",
        "language": null,
        "sourceURL": "https://www.example.com/"
      }
    }
  ]
}
```
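The `current` and `total` fields report crawl progress, and `data` holds the scraped pages once `status` is `completed`. A minimal sketch of handling a status payload like the one above (parsed here from a literal string for illustration):

```python
import json

# Example status payload, shaped like the crawl status response
payload = json.loads("""
{"status": "completed", "current": 22, "total": 22,
 "data": [{"markdown": "# Markdown Content"}]}
""")

done = payload["status"] == "completed"
progress = payload["current"] / payload["total"]  # 1.0 when finished

if done:
    for page in payload["data"]:
        print(page["markdown"])
```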
### Scraping

Use this endpoint to scrape a URL and obtain its content.
```sh
curl -X POST https://api.huntwiz.ai/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{"url": "https://example.com"}'
```
Response:
```json
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "markdown": "# Markdown Content",
    "provider": "web-scraper",
    "metadata": {
      "title": "Example Title",
      "description": "Example Description",
      "language": null,
      "sourceURL": "https://www.example.com/"
    }
  }
}
```
### Search (Beta)
Use this endpoint to search the web, get the most relevant results, scrape each page, and return the markdown.
```sh
curl -X POST https://api.huntwiz.ai/v0/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "query": "huntwiz",
    "pageOptions": {
      "fetchPageContent": true
    }
  }'
```

Set `fetchPageContent` to `false` for a faster search that returns results without scraping each page.
```json
{
  "success": true,
  "data": [
    {
      "url": "https://example.com",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Example Title",
        "description": "Example Description",
        "language": null,
        "sourceURL": "https://www.example.com/"
      }
    }
  ]
}
```
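Each search result pairs a `url` with its scraped `markdown`. As a small sketch, the results can be collected into a URL-to-markdown mapping (the response is parsed from a literal string here for illustration):

```python
import json

# Example search response, shaped like the payload above
response = json.loads("""
{"success": true,
 "data": [{"url": "https://example.com",
           "markdown": "# Markdown Content",
           "provider": "web-scraper"}]}
""")

# Map each result URL to its markdown content
pages = {item["url"]: item["markdown"] for item in response["data"]}
print(pages)
```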
### Intelligent Extraction (Beta)
Use this endpoint to extract structured data from scraped pages.
```sh
curl -X POST https://api.huntwiz.ai/v0/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://www.example.com/",
    "extractorOptions": {
      "mode": "llm-extraction",
      "extractionPrompt": "Based on the information on the page, extract the following schema",
      "extractionSchema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "supports_sso": { "type": "boolean" },
          "is_open_source": { "type": "boolean" },
          "is_in_yc": { "type": "boolean" }
        },
        "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
      }
    }
  }'
```
```json
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Example Title",
      "description": "Example Description",
      "robots": "follow, index",
      "ogTitle": "Example Title",
      "ogDescription": "Example Description",
      "ogURL": "https://example.com/",
      "ogImage": "https://example.com/image.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Example Site",
      "sourceURL": "https://example.com/"
    },
    "llm_extraction": {
      "company_mission": "Example mission statement",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}
```
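Since the schema marks all four fields as `required`, the `llm_extraction` object can be sanity-checked before use. A minimal sketch against a response shaped like the one above (parsed from a literal string for illustration):

```python
import json

required = ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]

# Example response, shaped like the payload above
response = json.loads("""
{"success": true,
 "data": {"llm_extraction": {"company_mission": "Example mission statement",
                             "supports_sso": true,
                             "is_open_source": false,
                             "is_in_yc": true}}}
""")

extraction = response["data"]["llm_extraction"]
missing = [key for key in required if key not in extraction]
assert not missing, f"schema fields missing from extraction: {missing}"
print(extraction["company_mission"])
```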
## Using the Python SDK

### Installing the Python SDK

```sh
pip install huntwiz-py
```
### Crawl a Website

```python
from huntwiz import HuntWizApp

app = HuntWizApp(api_key="YOUR_API_KEY")
crawl_result = app.crawl_url('example.com', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the markdown
for result in crawl_result:
    print(result['markdown'])
```
### Scraping a URL

To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.

```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```
### Extracting Structured Data from a URL
With LLM extraction, you can easily extract structured data from any URL. We support Pydantic schemas to simplify the process. Here is how to use it:
```python
from pydantic import BaseModel, Field
from typing import List

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_items=5, description="Top 5 stories")

data = app.scrape_url('https://news.example.com', {
    'extractorOptions': {
        'extractionSchema': TopArticlesSchema.model_json_schema(),
        'mode': 'llm-extraction'
    },
    'pageOptions': {'onlyMainContent': True}
})
print(data["llm_extraction"])
```
### Search for a Query
Perform a web search, retrieve the top results, and extract data from each page, returning their markdown.
```python
query = 'What is HuntWiz?'
search_result = app.search(query)
```
## Using the Node SDK
### Installation
To install the HuntWiz Node SDK, use npm:
```sh
npm install @huntwiz/huntwiz-js
```
### Usage

- Obtain an API key from huntwiz.ai
- Set the API key as an environment variable named `HUNTWIZ_API_KEY`, or pass it as a parameter to the `HuntWizApp` class.
### Scraping a URL

To scrape a single URL with error handling, use the `scrapeUrl` method. It takes the URL as a parameter and returns the scraped data as an object.

```js
try {
  const url = 'https://example.com';
  const scrapedData = await app.scrapeUrl(url);
  console.log(scrapedData);
} catch (error) {
  console.error('Error occurred while scraping:', error.message);
}
```
### Crawling a Website

To crawl a website with error handling, use the `crawlUrl` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

```js
const crawlUrl = 'https://example.com';
const params = {
  crawlerOptions: {
    excludes: ['blog/'],
    includes: [], // leave empty for all pages
    limit: 1000,
  },
  pageOptions: {
    onlyMainContent: true,
  },
};
const waitUntilDone = true;
const timeout = 5;
const crawlResult = await app.crawlUrl(
  crawlUrl,
  params,
  waitUntilDone,
  timeout
);
```
### Checking Crawl Status

To check the status of a crawl job with error handling, use the `checkCrawlStatus` method. It takes the job ID as a parameter and returns the current status of the crawl job.

```js
const jobId = '1234-5678-9101';
const status = await app.checkCrawlStatus(jobId);
console.log(status);
```
### Extracting Structured Data from a URL
With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to streamline the process. Here is how to use it:
```js
import HuntWizApp from '@huntwiz/huntwiz-js';
import { z } from 'zod';

const app = new HuntWizApp({
  apiKey: 'hw-YOUR_API_KEY',
});

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
});

const scrapeResult = await app.scrapeUrl('https://news.example.com', {
  extractorOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data['llm_extraction']);
```
### Search for a Query

With the `search` method, you can submit a query to a search engine, fetch the top results, extract data from each page, and return the markdown. The method takes the query as a parameter and returns the search results.

```js
const query = 'What is HuntWiz?';
const searchResults = await app.search(query, {
  pageOptions: {
    fetchPageContent: true, // Fetch the page content for each search result
  },
});
```
## Contributing

We welcome contributions! Please read our contributing guide before submitting a pull request.

It is the sole responsibility of the end users to respect websites' policies when scraping, searching, and crawling with HuntWiz. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, HuntWiz respects the directives specified in the websites' robots.txt files when crawling. By utilizing HuntWiz, you expressly agree to comply with these conditions.