Volantis Crawler Console

API Reference

Use the Crawler as a REST service from any language or tool. No authentication required.

Overview

Base URL

http://localhost:8000/api

Content Type

application/json

Authentication

None built-in. Restrict via firewall or reverse proxy for production use. CORS allows all origins when CORS_ORIGINS=* is set.

Async Model

Crawls run in the background (202 Accepted). Poll /jobs/{id} for status.

Expose to network (remote service mode)

CORS_ORIGINS=* crawler admin --host 0.0.0.0 --port 8000

Then call the API from any machine using the server's IP. Secure behind a reverse proxy in production.

Supported Platforms

x | instagram | tiktok | facebook | linkedin

Quick Start

The integration flow: add a seed → trigger a crawl → poll until done → fetch data.

# 1. Add a seed
SEED=$(curl -sX POST http://localhost:8000/api/seeds \
  -H "Content-Type: application/json" \
  -d '{"platform":"instagram","profile_id":"nasa","post_limit":20}')
SEED_ID=$(echo "$SEED" | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

# 2. Trigger a crawl
JOB=$(curl -sX POST http://localhost:8000/api/seeds/$SEED_ID/crawl)
JOB_ID=$(echo "$JOB" | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

# 3. Poll until done
while true; do
  STATUS=$(curl -s http://localhost:8000/api/jobs/$JOB_ID | python3 -c "import sys,json; print(json.load(sys.stdin)['status'])")
  echo "Status: $STATUS"
  [ "$STATUS" = "done" ] || [ "$STATUS" = "failed" ] && break
  sleep 3
done

# 4. Fetch data
curl "http://localhost:8000/api/profiles/instagram/nasa"
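The poll-until-done step from the shell script above can be factored into a reusable Python helper. This is a sketch: it takes any zero-argument `get_status` callable (wrap whatever HTTP client you prefer), and the terminal status strings match the job lifecycle documented in the Jobs section.

```python
import time

TERMINAL_STATUSES = {"done", "failed"}

def poll_until_done(get_status, interval=3.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() every `interval` seconds until the job reaches a
    terminal status ("done" or "failed") or `timeout` seconds elapse.
    Returns the final status string."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(interval)
        waited += interval
```

Usage might look like `poll_until_done(lambda: requests.get(f"{BASE}/jobs/{job_id}").json()["status"])`, where `BASE` is your base URL (a placeholder, not part of the API).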

Sessions

Several platforms (X, Instagram, Facebook, LinkedIn) require a logged-in browser session to return full data. Sessions are saved once on the server via the crawler save-session CLI command — not through the API. Once saved, every subsequent crawl uses the session automatically.

How to save a session (run on the crawler server)

  1. SSH into the machine running the crawler (or open a terminal if it's local).
  2. Activate the virtual environment: source .venv/bin/activate
  3. Run the save-session command for the platform (see per-platform commands below).
  4. A browser window opens — log in manually inside that window.
  5. Once fully logged in, return to the terminal and press Enter.
  6. The session is saved to .sessions/{platform}/default.json.
# Replace {platform} with: x | instagram | tiktok | facebook | linkedin
crawler save-session --platform {platform}
# Optional: name the session (useful for multiple accounts)
crawler save-session --platform instagram --identity myaccount

X (login required)

X (Twitter) requires a logged-in session to serve GraphQL responses. Without a session, profile and post pages return null.

crawler save-session --platform x

Login URL (opens automatically)

https://x.com/login

Notes

Use a regular personal account. X business/API accounts are not required.

After saving the session, verify it worked by running a crawl — if profile_name is still null, the session may have expired.

Instagram (login required)

Instagram's GraphQL API only returns full profile data (biography, follower_count, etc.) when the browser is authenticated.

crawler save-session --platform instagram

Login URL (opens automatically)

https://www.instagram.com/accounts/login/

Notes

Use a regular personal account. Business accounts work too. Two-factor authentication is supported — complete the 2FA flow before pressing Enter.

Instagram sessions expire after several weeks. Re-run save-session if crawls start returning empty profiles.

TikTok (login optional)

TikTok's public API returns profile and post data without authentication for public accounts.

crawler save-session --platform tiktok

Login URL (opens automatically)

https://www.tiktok.com/login

Notes

A session is only needed for private accounts or to avoid aggressive rate limits.

Facebook (login required)

Facebook requires login to access profile data via its internal API.

crawler save-session --platform facebook

Login URL (opens automatically)

https://www.facebook.com/login

Notes

Use a personal account. Make sure the account can view the profiles you want to crawl.

Facebook frequently challenges automated browsers. If the crawl fails, re-save the session.

LinkedIn (login required)

LinkedIn requires login to access profile and post data.

crawler save-session --platform linkedin

Login URL (opens automatically)

https://www.linkedin.com/login

Notes

Use a personal LinkedIn account. The account must be connected to or able to view the target profiles.

LinkedIn is aggressive about bot detection. Use a real, active account and avoid crawling at high frequency.

Check session status via API

GET /sessions

Check which platforms have a saved login session. Use this to verify a session was saved correctly before triggering a crawl.

curl http://localhost:8000/api/sessions

Response

[
  { "platform": "x",         "has_session": true,  "identities": ["default"], "save_command": "crawler save-session --platform x" },
  { "platform": "instagram", "has_session": true,  "identities": ["default"], "save_command": "crawler save-session --platform instagram" },
  { "platform": "tiktok",    "has_session": false, "identities": [],          "save_command": "crawler save-session --platform tiktok" },
  { "platform": "facebook",  "has_session": false, "identities": [],          "save_command": "crawler save-session --platform facebook" },
  { "platform": "linkedin",  "has_session": false, "identities": [],          "save_command": "crawler save-session --platform linkedin" }
]
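A small pre-flight check over this response: given the platforms you intend to crawl, return the save commands still missing. Sketch only; the field names are taken from the response shown above.

```python
def missing_session_commands(sessions, platforms):
    """Given the GET /sessions response list and an iterable of platforms
    you plan to crawl, return {platform: save_command} for every platform
    that has no saved session yet."""
    by_platform = {s["platform"]: s for s in sessions}
    missing = {}
    for p in platforms:
        entry = by_platform.get(p)
        if entry is None or not entry["has_session"]:
            # No entry at all, or has_session is false: needs save-session
            missing[p] = entry["save_command"] if entry else f"crawler save-session --platform {p}"
    return missing
```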

Seeds

Seeds are profiles you want to crawl. Add a seed once, then trigger crawl jobs on demand.

GET /seeds

Return all seed profiles.

curl http://localhost:8000/api/seeds

Response

[
  {
    "id": 1,
    "platform": "instagram",
    "profile_id": "nasa",
    "label": "NASA official",
    "is_active": true,
    "post_limit": 50,
    "comment_limit": 100,
    "crawl_status": "done",
    "created_at": "2025-01-01T00:00:00Z",
    "last_crawled_at": "2025-01-02T10:00:00Z"
  }
]

POST /seeds

Register a new profile to crawl. profile_id is normalized — pass a URL, @handle, or plain username.

Request Body

platform       string    required   x | instagram | tiktok | facebook | linkedin
profile_id     string    required   Handle, URL, or @mention, e.g. "nasa", "@nasa", "https://instagram.com/nasa"
label          string    optional   Human-readable label
post_limit     integer   optional   Max posts per crawl (1–1000, default 50)
comment_limit  integer   optional   Max comments per post (0–5000, default 100)

curl -X POST http://localhost:8000/api/seeds \
  -H "Content-Type: application/json" \
  -d '{"platform":"instagram","profile_id":"nasa","post_limit":20,"comment_limit":50}'

Response

{
  "id": 3,
  "platform": "instagram",
  "profile_id": "nasa",
  "label": null,
  "is_active": true,
  "post_limit": 20,
  "comment_limit": 50,
  "crawl_status": null,
  "created_at": "2025-03-01T12:00:00Z",
  "last_crawled_at": null
}
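The server normalizes profile_id for you, so clients can pass any of the three forms. If you want to pre-normalize on the client anyway (say, to deduplicate seeds in your own code), a rough equivalent might look like the sketch below. The exact server-side rules are not documented here, so treat this as an approximation.

```python
from urllib.parse import urlparse

def normalize_profile_id(raw):
    """Reduce a URL, @handle, or plain username to a bare username.
    Approximates the server-side normalization described above."""
    raw = raw.strip()
    if "://" in raw:
        # e.g. https://instagram.com/nasa or https://www.tiktok.com/@nasa
        path = urlparse(raw).path
        raw = path.strip("/").split("/")[0]
    return raw.lstrip("@").lower()
```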
POST /seeds/{seed_id}/crawl

Enqueue a crawl job for the seed. Returns 202 immediately — the job runs in the background. Poll GET /jobs/{job_id} to track progress.

curl -X POST http://localhost:8000/api/seeds/3/crawl

Response

{
  "id": 7,
  "platform": "instagram",
  "profile_id": "nasa",
  "status": "pending",
  "post_limit": 20,
  "comment_limit": 50,
  "posts_crawled": 0,
  "comments_crawled": 0,
  "errors": [],
  "started_at": null,
  "finished_at": null,
  "created_at": "2025-03-01T12:01:00Z"
}

PATCH /seeds/{seed_id}

Update seed settings. All fields are optional — only provided fields are changed.

Request Body

label          string    optional   New label
is_active      boolean   optional   Enable or pause this seed
post_limit     integer   optional   1–1000
comment_limit  integer   optional   0–5000

curl -X PATCH http://localhost:8000/api/seeds/3 \
  -H "Content-Type: application/json" \
  -d '{"is_active":false,"post_limit":100}'

Response

{ "id": 3, "is_active": false, "post_limit": 100, ... }

DELETE /seeds/{seed_id}

Remove a seed. Does not delete already-crawled profile/post data.

curl -X DELETE http://localhost:8000/api/seeds/3

Response

204 No Content

Jobs

Jobs represent a single crawl run. Created via POST /seeds/{id}/crawl, they run asynchronously.

GET /jobs

List crawl jobs, newest first.

Query Parameters

status     string    optional   pending | running | done | failed
platform   string    optional   Filter by platform
limit      integer   optional   Max results (1–500, default 50)

curl "http://localhost:8000/api/jobs?status=done&limit=10"

Response

[
  {
    "id": 7,
    "platform": "instagram",
    "profile_id": "nasa",
    "status": "done",
    "post_limit": 20,
    "comment_limit": 50,
    "posts_crawled": 18,
    "comments_crawled": 342,
    "errors": [],
    "started_at": "2025-03-01T12:01:05Z",
    "finished_at": "2025-03-01T12:03:22Z",
    "created_at": "2025-03-01T12:01:00Z"
  }
]

GET /jobs/{job_id}

Get a single job by ID. Use this to poll for completion after triggering a crawl.

curl http://localhost:8000/api/jobs/7

Response

{
  "id": 7,
  "status": "running",
  "posts_crawled": 5,
  "comments_crawled": 80,
  "errors": [],
  "started_at": "2025-03-01T12:01:05Z",
  "finished_at": null,
  ...
}

Profiles

Profiles are populated automatically after a successful crawl. Query them to retrieve normalized data.

GET /profiles

List all crawled profiles. Supports search and platform filter with pagination.

Query Parameters

platform   string    optional   x | instagram | tiktok | facebook | linkedin
search     string    optional   Partial match on profile_name (case-insensitive)
limit      integer   optional   1–1000, default 20
offset     integer   optional   Pagination offset, default 0

curl "http://localhost:8000/api/profiles?platform=instagram&search=nasa&limit=20"

Response

[
  {
    "id": 12,
    "platform": "instagram",
    "platform_user_id": "528817151",
    "profile_name": "nasa",
    "display_name": "NASA",
    "bio": "Explore the universe and discover our home planet.",
    "location": null,
    "website_url": "https://www.nasa.gov",
    "avatar_url": "https://...",
    "banner_url": null,
    "total_posts": 3812,
    "total_followers": 97400000,
    "total_following": 62,
    "is_verified": true,
    "is_private": false,
    "joined_at": null,
    "first_crawled_at": "2025-01-01T00:00:00Z",
    "last_crawled_at": "2025-03-01T12:03:22Z"
  }
]

GET /profiles/{platform}/{profile_name}

Full profile detail: normalized profile + up to 200 posts + interaction graph edges (received and made).

curl http://localhost:8000/api/profiles/instagram/nasa

Response

{
  "profile": {
    "id": 12,
    "platform": "instagram",
    "profile_name": "nasa",
    "display_name": "NASA",
    "total_followers": 97400000,
    ...
  },
  "posts": [
    {
      "id": 55,
      "post_id": "CXabcdef",
      "content": "Hubble captures a new nebula...",
      "media_type": "image",
      "post_url": "https://www.instagram.com/p/CXabcdef/",
      "like_count": 94200,
      "comment_count": 812,
      "share_count": null,
      "view_count": null,
      "hashtags": ["#space","#hubble"],
      "mentions": [],
      "is_reply": false,
      "posted_at": "2025-02-14T18:30:00Z",
      "last_crawled_at": "2025-03-01T12:03:00Z"
    }
  ],
  "interactions_received": [
    {
      "id": 88,
      "source_profile_name": "spacex",
      "target_profile_name": "nasa",
      "interaction_type": "commented_on",
      "weight": 3,
      "post_id": "CXabcdef",
      "occurred_at": "2025-02-15T09:12:00Z",
      "last_seen_at": "2025-03-01T12:03:00Z"
    }
  ],
  "interactions_made": [ ... ]
}
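The interaction edges lend themselves to simple graph summaries. For example, ranking who engages with a profile most by summed edge weight (field names taken from the response above):

```python
from collections import Counter

def top_interactors(interactions_received, n=5):
    """Sum edge weights per source profile and return the n heaviest
    as (profile_name, total_weight) pairs, heaviest first."""
    totals = Counter()
    for edge in interactions_received:
        totals[edge["source_profile_name"]] += edge["weight"]
    return totals.most_common(n)
```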

Stats

Aggregated counts across the entire database.

GET /stats

Aggregated counts for the whole database — useful for monitoring.

curl http://localhost:8000/api/stats

Response

{
  "total_profiles": 24,
  "total_posts": 1840,
  "total_comments": 12305,
  "total_interactions": 4821,
  "total_seeds": 6,
  "active_seeds": 4,
  "jobs_by_status": {
    "done": 18,
    "failed": 2,
    "running": 0,
    "pending": 0
  }
}
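For monitoring, jobs_by_status is usually the field to alert on. A minimal failure-rate check over this payload (the threshold you alarm at is your own choice; nothing here is part of the API):

```python
def job_failure_rate(stats):
    """Fraction of finished jobs that failed, from the GET /stats payload.
    Returns 0.0 when no job has finished yet."""
    by_status = stats["jobs_by_status"]
    finished = by_status.get("done", 0) + by_status.get("failed", 0)
    return by_status.get("failed", 0) / finished if finished else 0.0
```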

Settings

Read and write runtime configuration — database backend, proxy list, browser options, and crawl defaults. Settings are stored in the database and override environment defaults at runtime. DATABASE_URL is additionally written to .env so it survives restarts.

DATABASE_URL formats

Backend           URL format                                                Use case
SQLite            sqlite+aiosqlite:///./crawler.db                          Local / single-node, zero config
PostgreSQL        postgresql+asyncpg://user:pass@host:5432/db               Production / multi-node / remote
PostgreSQL + SSL  postgresql+asyncpg://user:pass@host:5432/db?ssl=require   Managed cloud DB (RDS, Supabase, etc.)

PROXY_LIST — supported URL formats

Proxies are rotated round-robin across crawls. A proxy is automatically evicted after 3 consecutive failures.

# HTTP proxy (no auth)
http://proxy.example.com:8080
# HTTP proxy with credentials
http://user:password@proxy.example.com:8080
# HTTPS proxy
https://user:password@proxy.example.com:8080
# SOCKS5 proxy
socks5://user:password@proxy.example.com:1080

Pass an empty array [] to clear all proxies and use a direct connection.
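The rotation policy described above (round-robin, evict after 3 consecutive failures) can be sketched as follows. This is an illustration of the documented behavior, not the server's actual implementation:

```python
class ProxyRotator:
    """Round-robin over a proxy list; a proxy is dropped after 3
    consecutive failures, matching the behavior described above."""

    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self._i = 0

    def next(self):
        """Return the next live proxy, or None for a direct connection."""
        if not self.proxies:
            return None
        proxy = self.proxies[self._i % len(self.proxies)]
        self._i += 1
        return proxy

    def report(self, proxy, ok):
        """Record a crawl result; a success resets the failure counter."""
        if ok:
            self.failures[proxy] = 0
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.MAX_CONSECUTIVE_FAILURES and proxy in self.proxies:
            self.proxies.remove(proxy)
```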

GET /settings

Return all current runtime settings — merged from environment defaults and any DB overrides saved via the admin UI.

curl http://localhost:8000/api/settings

Response

{
  "DATABASE_URL": "sqlite+aiosqlite:///./crawler.db",
  "BROWSER_MAX_CONTEXTS": 8,
  "BROWSER_HEADLESS": false,
  "PROXY_LIST": [],
  "CRAWL_DEFAULT_POST_LIMIT": 50,
  "CRAWL_DEFAULT_COMMENT_LIMIT": 100,
  "CRAWL_COMMENT_CONCURRENCY": 5,
  "LOG_LEVEL": "INFO",
  "REMOTE_WORKERS": false
}

PUT /settings

Persist one or more settings. All fields are optional — only provided fields are updated. Changes take effect immediately except DATABASE_URL, which also requires a restart. DATABASE_URL is automatically written to the .env file for restart persistence.

Request Body

DATABASE_URL                 string     optional   SQLite: sqlite+aiosqlite:///./file.db | PostgreSQL: postgresql+asyncpg://user:pass@host:5432/db?ssl=require
BROWSER_MAX_CONTEXTS         integer    optional   Parallel browser contexts (1–32)
BROWSER_HEADLESS             boolean    optional   Run browser without visible UI
PROXY_LIST                   string[]   optional   HTTP/HTTPS/SOCKS5 proxy URLs, rotated round-robin. Pass [] to clear.
CRAWL_DEFAULT_POST_LIMIT     integer    optional   Default post limit for new seeds (1–1000)
CRAWL_DEFAULT_COMMENT_LIMIT  integer    optional   Default comment limit for new seeds (0–5000)
CRAWL_COMMENT_CONCURRENCY    integer    optional   Concurrent comment pages per crawl (1–20)
LOG_LEVEL                    string     optional   DEBUG | INFO | WARNING | ERROR
REMOTE_WORKERS               boolean    optional   false (default) = admin runs crawls locally. true = admin only queues jobs; worker processes execute them.

# Switch to PostgreSQL and add proxies
curl -X PUT http://localhost:8000/api/settings \
  -H "Content-Type: application/json" \
  -d '{
    "DATABASE_URL": "postgresql+asyncpg://crawler:secret@db.host:5432/crawler",
    "PROXY_LIST": [
      "http://user:pass@proxy1.example.com:8080",
      "socks5://proxy2.example.com:1080"
    ],
    "BROWSER_HEADLESS": true
  }'

Response

{
  "DATABASE_URL": "postgresql+asyncpg://crawler:secret@db.example.com:5432/crawler?ssl=require",
  "BROWSER_MAX_CONTEXTS": 8,
  "BROWSER_HEADLESS": true,
  "PROXY_LIST": [
    "http://user:pass@proxy1.example.com:8080",
    "socks5://proxy2.example.com:1080"
  ],
  "CRAWL_DEFAULT_POST_LIMIT": 50,
  "CRAWL_DEFAULT_COMMENT_LIMIT": 100,
  "CRAWL_COMMENT_CONCURRENCY": 5,
  "LOG_LEVEL": "INFO",
  "REMOTE_WORKERS": true
}

POST /settings/test-db

Test a database URL before saving it. Opens a temporary connection, runs SELECT 1, and returns whether the connection succeeded. Supports both SQLite and PostgreSQL URLs.

Request Body

url   string   required   Full SQLAlchemy async URL to test, e.g. postgresql+asyncpg://user:pass@host:5432/db

curl -X POST http://localhost:8000/api/settings/test-db \
  -H "Content-Type: application/json" \
  -d '{"url":"postgresql+asyncpg://crawler:secret@db.host:5432/crawler"}'

Response

// Success
{ "ok": true, "message": "Connection successful" }

// Failure
{ "ok": false, "message": "could not connect to server: Connection refused\n\tIs the server running on host \"db.host\" (10.0.0.5) and accepting TCP/IP connections on port 5432?" }
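Before POSTing to test-db you can cheaply reject URLs with an unsupported driver scheme on the client side. The two accepted schemes below are taken from the DATABASE_URL formats table above; the real connection test still happens server-side:

```python
SUPPORTED_SCHEMES = ("sqlite+aiosqlite", "postgresql+asyncpg")

def validate_db_url(url):
    """Return (ok, message) for a quick client-side scheme check before
    calling POST /settings/test-db (which does the real connection test)."""
    scheme = url.split("://", 1)[0] if "://" in url else ""
    if scheme not in SUPPORTED_SCHEMES:
        return False, f"unsupported scheme {scheme!r}; expected one of {SUPPORTED_SCHEMES}"
    return True, "scheme looks valid"
```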

Workers

Optional distributed crawl processes. When REMOTE_WORKERS=true, the admin only creates job records and workers pick them up via the shared PostgreSQL database. Workers communicate through the database only — no direct HTTP between admin and workers.

Communication model

Worker → PostgreSQL   Required     Workers read pending jobs, write crawl results, send heartbeats (port 5432)
Admin → PostgreSQL    Required     Admin reads job status and worker heartbeats (port 5432)
Worker → Admin        Not needed   Workers never call the admin API; no inbound port required on admin
Admin → Worker        Not needed   Admin never pushes to workers; no inbound port required on workers

Worker object fields

Field            Type       Description
id               string     Unique worker ID; auto-generated as hostname-<hex> on first start, or set via --worker-id
hostname         string     OS hostname of the VM running the worker
region           string?    Optional label set via the --region flag, e.g. "us-east"
status           string     "online" if last_heartbeat < 2 min ago, "offline" otherwise
current_job_id   integer?   ID of the job currently being executed, or null if idle
last_heartbeat   datetime   UTC timestamp of the most recent heartbeat (updated every ~20 s during a crawl, every ~5 s when idle)
registered_at    datetime   UTC timestamp when the worker first connected to this database
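Note that status is derived, not stored: a worker counts as online when its last heartbeat is under 2 minutes old. A client-side equivalent of that rule, operating on the ISO-8601 timestamps the API returns:

```python
from datetime import datetime, timedelta, timezone

ONLINE_WINDOW = timedelta(minutes=2)

def worker_status(last_heartbeat_iso, now=None):
    """Derive "online"/"offline" from a last_heartbeat value such as
    "2025-03-01T12:05:28Z", using the 2-minute rule described above."""
    now = now or datetime.now(timezone.utc)
    hb = datetime.fromisoformat(last_heartbeat_iso.replace("Z", "+00:00"))
    return "online" if now - hb < ONLINE_WINDOW else "offline"
```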

Starting a worker (CLI)

Run on each worker VM after installing the base package (pip install crawler) and Playwright browsers (playwright install chromium).

# Minimal — connects to shared PostgreSQL, auto-generates worker ID
DATABASE_URL="postgresql+asyncpg://user:pass@db-host:5432/crawler" \
crawler worker
# With region label and stable ID
DATABASE_URL="postgresql+asyncpg://user:pass@db-host:5432/crawler" \
crawler worker --region us-east --worker-id my-worker-1
# Flags
crawler worker --help
  --region TEXT         Label shown in admin UI (e.g. us-east, eu-west)
  --worker-id TEXT      Stable ID; auto-generated as hostname-<hex> if omitted
  --poll-interval INT   Seconds between DB polls when idle (default 5)

Workers only need the base crawler package; FastAPI and Uvicorn are not installed on workers. The admin server requires pip install "crawler[admin]".

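Conceptually, each worker loops: poll the shared database for a pending job, atomically claim it, run the crawl, write results. The sketch below illustrates only the claim step, using a guarded UPDATE so two workers can never claim the same job. The table and column names are hypothetical (the real schema is internal to the crawler), and it uses SQLite purely so the example is self-contained; production workers talk to PostgreSQL.

```python
import sqlite3

def claim_next_job(conn, worker_id):
    """Atomically claim the oldest pending job for this worker.
    Returns the claimed job id, or None if nothing is pending.
    Hypothetical schema: jobs(id, status, worker_id)."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id = row[0]
    # The status check in the WHERE clause makes the claim safe: if another
    # worker grabbed the job between SELECT and UPDATE, rowcount is 0.
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', worker_id = ? "
        "WHERE id = ? AND status = 'pending'",
        (worker_id, job_id),
    )
    conn.commit()
    return job_id if cur.rowcount == 1 else None
```

On PostgreSQL the same idea is usually written with SELECT ... FOR UPDATE SKIP LOCKED inside one transaction, which lets many workers poll concurrently without blocking each other.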
GET /workers

List all registered worker processes. Workers are considered online if their last heartbeat is less than 2 minutes ago. Workers register themselves automatically on first start — no manual registration needed.

curl http://localhost:8000/api/workers

Response

[
  {
    "id": "worker-vm-us-east-a1b2c3d4",
    "hostname": "worker-vm-us-east",
    "region": "us-east",
    "status": "online",
    "current_job_id": 42,
    "last_heartbeat": "2025-03-01T12:05:28Z",
    "registered_at": "2025-03-01T09:00:00Z"
  },
  {
    "id": "worker-vm-eu-west-e5f6a7b8",
    "hostname": "worker-vm-eu-west",
    "region": "eu-west",
    "status": "online",
    "current_job_id": null,
    "last_heartbeat": "2025-03-01T12:05:30Z",
    "registered_at": "2025-03-01T09:01:00Z"
  }
]

Errors & Status

HTTP Status Codes

200   OK                     Request succeeded
201   Created                Seed created successfully
202   Accepted               Crawl job enqueued; runs in background
204   No Content             Seed deleted
400   Bad Request            Invalid platform or missing required field
404   Not Found              Seed / job / profile not found
409   Conflict               Seed already exists for platform + profile_id
422   Unprocessable Entity   Validation error; check the errors array in the response body

Job Status Values

pending   Queued, not yet started
running   Browser open, currently crawling
done      Completed successfully
failed    Failed; check the errors[] array in the job response

Note: A job with status done may still have entries in errors[] if some individual posts or comment pages failed while the overall crawl completed. Always check posts_crawled and errors together.
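Following that note, a client should distinguish three outcomes rather than two. A small classifier over the job payload (the "partial" label is this example's own convention, not an API status):

```python
def classify_job(job):
    """Collapse a finished job into "failed", "partial" (done but some
    posts or comment pages errored), or "ok". Raises if still in flight."""
    status = job["status"]
    if status in ("pending", "running"):
        raise ValueError(f"job {job.get('id')} not finished yet ({status})")
    if status == "failed":
        return "failed"
    return "partial" if job.get("errors") else "ok"
```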