API Reference
Use the Crawler as a REST service from any language or tool. No authentication required.
Overview
Base URL
http://localhost:8000/api
Content Type
application/json
Authentication
None built-in. Restrict via firewall or reverse proxy for production use. CORS allows all origins when CORS_ORIGINS=* is set.
Async Model
Crawls run in the background (202 Accepted). Poll /jobs/{id} for status.
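The poll-until-done step can be wrapped in a small helper. A minimal Python sketch, assuming a `fetch_job` callable that returns the parsed JSON of `GET /api/jobs/{id}` (the HTTP client is up to you; `wait_for_job` is an illustrative name, not part of the API):

```python
import time

def wait_for_job(fetch_job, job_id, interval=3.0, timeout=600.0):
    """Poll a crawl job until it reaches a terminal status.

    fetch_job: callable taking a job ID and returning the parsed JSON
    of GET /api/jobs/{id}. Terminal statuses are 'done' and 'failed'.
    Raises TimeoutError if the job does not finish within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_job(job_id)
        if job["status"] in ("done", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```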
Expose to network (remote service mode)
CORS_ORIGINS=* crawler admin --host 0.0.0.0 --port 8000
Then call the API from any machine using the server's IP. Secure behind a reverse proxy in production.
Supported Platforms
x, instagram, tiktok, facebook, linkedin
Quick Start
The integration flow: add a seed → trigger a crawl → poll until done → fetch data.
# 1. Add a seed
SEED=$(curl -sX POST http://localhost:8000/api/seeds \
-H "Content-Type: application/json" \
-d '{"platform":"instagram","profile_id":"nasa","post_limit":20}')
SEED_ID=$(echo "$SEED" | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
# 2. Trigger a crawl
JOB=$(curl -sX POST http://localhost:8000/api/seeds/$SEED_ID/crawl)
JOB_ID=$(echo "$JOB" | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
# 3. Poll until done
while true; do
  STATUS=$(curl -s http://localhost:8000/api/jobs/$JOB_ID | python3 -c "import sys,json; print(json.load(sys.stdin)['status'])")
  echo "Status: $STATUS"
  if [ "$STATUS" = "done" ] || [ "$STATUS" = "failed" ]; then break; fi
  sleep 3
done
# 4. Fetch data
curl "http://localhost:8000/api/profiles/instagram/nasa"
Sessions
Several platforms (X, Instagram, Facebook, LinkedIn) require a logged-in browser session to return
full data. Sessions are saved once on the server via the crawler save-session CLI
command — not through the API. Once saved, every subsequent crawl uses the session automatically.
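Before triggering crawls, you can verify programmatically which sessions exist via GET /api/sessions (documented under "Check session status via API" below). A Python sketch; the `REQUIRES_SESSION` set mirrors the platform list above, and `missing_sessions` is an illustrative helper, not part of the API:

```python
# Platforms that need a logged-in session for full data, per the docs.
REQUIRES_SESSION = {"x", "instagram", "facebook", "linkedin"}

def missing_sessions(sessions, platforms):
    """Given the parsed GET /api/sessions response and the platforms you
    plan to crawl, return the platforms that require a login session but
    have none saved yet."""
    saved = {s["platform"] for s in sessions if s["has_session"]}
    return sorted(p for p in platforms
                  if p in REQUIRES_SESSION and p not in saved)
```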
How to save a session (run on the crawler server)
- SSH into the machine running the crawler (or open a terminal if it's local).
- Activate the virtual environment:
source .venv/bin/activate
- Run the save-session command for the platform (see per-platform commands below).
- A browser window opens — log in manually inside that window.
- Once fully logged in, return to the terminal and press Enter.
- The session is saved to .sessions/{platform}/default.json.
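The save path follows a fixed layout, which a trivial helper can compute (illustrative only; `identity` corresponds to the identities field of GET /api/sessions and is "default" unless you have saved other identities):

```python
def session_path(platform: str, identity: str = "default") -> str:
    """Relative path where a saved session lands on the crawler server,
    per the layout above: .sessions/{platform}/default.json."""
    return f".sessions/{platform}/{identity}.json"
```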
X (Twitter): login required
X (Twitter) requires a logged-in session to serve GraphQL responses. Without a session, profile and post pages return null.
Login URL (opens automatically)
https://x.com/login
Notes
Use a regular personal account. X business/API accounts are not required.
Instagram: login required
Instagram's GraphQL API only returns full profile data (biography, follower_count, etc.) when the browser is authenticated.
Login URL (opens automatically)
https://www.instagram.com/accounts/login/
Notes
Use a regular personal account. Business accounts work too. Two-factor authentication is supported — complete the 2FA flow before pressing Enter.
TikTok: login optional
TikTok's public API returns profile and post data without authentication for public accounts.
Login URL (opens automatically)
https://www.tiktok.com/login
Notes
A session is only needed for private accounts or to avoid aggressive rate limits.
Facebook: login required
Facebook requires login to access profile data via its internal API.
Login URL (opens automatically)
https://www.facebook.com/login
Notes
Use a personal account. Make sure the account can view the profiles you want to crawl.
LinkedIn: login required
LinkedIn requires login to access profile and post data.
Login URL (opens automatically)
https://www.linkedin.com/login
Notes
Use a personal LinkedIn account. The account must be connected to or able to view the target profiles.
Check session status via API
GET /sessions
Check which platforms have a saved login session. Use this to verify a session was saved correctly before triggering a crawl.
curl http://localhost:8000/api/sessions
Response
[
{ "platform": "x", "has_session": true, "identities": ["default"], "save_command": "crawler save-session --platform x" },
{ "platform": "instagram", "has_session": true, "identities": ["default"], "save_command": "crawler save-session --platform instagram" },
{ "platform": "tiktok", "has_session": false, "identities": [], "save_command": "crawler save-session --platform tiktok" },
{ "platform": "facebook", "has_session": false, "identities": [], "save_command": "crawler save-session --platform facebook" },
{ "platform": "linkedin", "has_session": false, "identities": [], "save_command": "crawler save-session --platform linkedin" }
]
Seeds
Seeds are profiles you want to crawl. Add a seed once, then trigger crawl jobs on demand.
GET /seeds
Return all seed profiles.
curl http://localhost:8000/api/seeds
Response
[
{
"id": 1,
"platform": "instagram",
"profile_id": "nasa",
"label": "NASA official",
"is_active": true,
"post_limit": 50,
"comment_limit": 100,
"crawl_status": "done",
"created_at": "2025-01-01T00:00:00Z",
"last_crawled_at": "2025-01-02T10:00:00Z"
}
]
POST /seeds
Register a new profile to crawl. profile_id is normalized — pass a URL, @handle, or plain username.
Request Body
| platform | string | required | x | instagram | tiktok | facebook | linkedin |
| profile_id | string | required | Handle, URL, or @mention e.g. "nasa", "@nasa", "https://instagram.com/nasa" |
| label | string | optional | Human-readable label |
| post_limit | integer | optional | Max posts per crawl (1–1000, default 50) |
| comment_limit | integer | optional | Max comments per post (0–5000, default 100) |
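The server handles profile_id normalization itself; the following illustrative Python mirrors the equivalent logic for the three accepted forms (not the server's actual code):

```python
from urllib.parse import urlparse

def normalize_profile_id(raw: str) -> str:
    """Reduce "nasa", "@nasa", or "https://instagram.com/nasa" to the
    bare handle, mirroring the server-side normalization described above."""
    raw = raw.strip()
    if raw.startswith(("http://", "https://")):
        # The handle is the first non-empty path segment of the profile URL.
        raw = next(seg for seg in urlparse(raw).path.split("/") if seg)
    return raw.lstrip("@")
```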
curl -X POST http://localhost:8000/api/seeds \
-H "Content-Type: application/json" \
-d '{"platform":"instagram","profile_id":"nasa","post_limit":20,"comment_limit":50}'
Response
{
"id": 3,
"platform": "instagram",
"profile_id": "nasa",
"label": null,
"is_active": true,
"post_limit": 20,
"comment_limit": 50,
"crawl_status": null,
"created_at": "2025-03-01T12:00:00Z",
"last_crawled_at": null
}
POST /seeds/{seed_id}/crawl
Enqueue a crawl job for the seed. Returns 202 immediately — the job runs in the background. Poll GET /jobs/{job_id} to track progress.
curl -X POST http://localhost:8000/api/seeds/3/crawl
Response
{
"id": 7,
"platform": "instagram",
"profile_id": "nasa",
"status": "pending",
"post_limit": 20,
"comment_limit": 50,
"posts_crawled": 0,
"comments_crawled": 0,
"errors": [],
"started_at": null,
"finished_at": null,
"created_at": "2025-03-01T12:01:00Z"
}
PATCH /seeds/{seed_id}
Update seed settings. All fields are optional — only provided fields are changed.
Request Body
| label | string | optional | New label |
| is_active | boolean | optional | Enable or pause this seed |
| post_limit | integer | optional | 1–1000 |
| comment_limit | integer | optional | 0–5000 |
curl -X PATCH http://localhost:8000/api/seeds/3 \
-H "Content-Type: application/json" \
-d '{"is_active":false,"post_limit":100}'
Response
{ "id": 3, "is_active": false, "post_limit": 100, ... }
DELETE /seeds/{seed_id}
Remove a seed. Does not delete already-crawled profile/post data.
curl -X DELETE http://localhost:8000/api/seeds/3
Response
204 No Content
Jobs
Jobs represent a single crawl run. Created via POST /seeds/{id}/crawl, they run asynchronously.
GET /jobs
List crawl jobs, newest first.
Query Parameters
| status | string | optional | pending | running | done | failed |
| platform | string | optional | Filter by platform |
| limit | integer | optional | Max results (1–500, default 50) |
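Query strings are easiest to assemble with urllib rather than hand-concatenation; a small illustrative helper (the `jobs_url` name is ours, not part of the API):

```python
from urllib.parse import urlencode

def jobs_url(base="http://localhost:8000/api", **filters):
    """Build a GET /jobs URL from the query parameters above,
    dropping any filter that is left unset (None)."""
    params = {k: v for k, v in filters.items() if v is not None}
    query = urlencode(params)
    return f"{base}/jobs" + (f"?{query}" if query else "")
```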
curl "http://localhost:8000/api/jobs?status=done&limit=10"
Response
[
{
"id": 7,
"platform": "instagram",
"profile_id": "nasa",
"status": "done",
"post_limit": 20,
"comment_limit": 50,
"posts_crawled": 18,
"comments_crawled": 342,
"errors": [],
"started_at": "2025-03-01T12:01:05Z",
"finished_at": "2025-03-01T12:03:22Z",
"created_at": "2025-03-01T12:01:00Z"
}
]
GET /jobs/{job_id}
Get a single job by ID. Use this to poll for completion after triggering a crawl.
curl http://localhost:8000/api/jobs/7
Response
{
"id": 7,
"status": "running",
"posts_crawled": 5,
"comments_crawled": 80,
"errors": [],
"started_at": "2025-03-01T12:01:05Z",
"finished_at": null,
...
}
Profiles
Profiles are populated automatically after a successful crawl. Query them to retrieve normalized data.
GET /profiles
List all crawled profiles. Supports search and platform filter with pagination.
Query Parameters
| platform | string | optional | x | instagram | tiktok | facebook | linkedin |
| search | string | optional | Partial match on profile_name (case-insensitive) |
| limit | integer | optional | 1–1000, default 20 |
| offset | integer | optional | Pagination offset, default 0 |
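The limit/offset scheme above can be wrapped in a generator that walks all pages. A sketch assuming a `fetch_page(limit, offset)` callable that returns the parsed JSON list for GET /api/profiles (the HTTP client is up to you):

```python
def iter_profiles(fetch_page, page_size=100):
    """Yield every profile by paging through GET /api/profiles with
    limit/offset. Stops when a page comes back shorter than page_size,
    which signals the last page."""
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```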
curl "http://localhost:8000/api/profiles?platform=instagram&search=nasa&limit=20"
Response
[
{
"id": 12,
"platform": "instagram",
"platform_user_id": "528817151",
"profile_name": "nasa",
"display_name": "NASA",
"bio": "Explore the universe and discover our home planet.",
"location": null,
"website_url": "https://www.nasa.gov",
"avatar_url": "https://...",
"banner_url": null,
"total_posts": 3812,
"total_followers": 97400000,
"total_following": 62,
"is_verified": true,
"is_private": false,
"joined_at": null,
"first_crawled_at": "2025-01-01T00:00:00Z",
"last_crawled_at": "2025-03-01T12:03:22Z"
}
]
GET /profiles/{platform}/{profile_name}
Full profile detail: normalized profile + up to 200 posts + interaction graph edges (received and made).
curl http://localhost:8000/api/profiles/instagram/nasa
Response
{
"profile": {
"id": 12,
"platform": "instagram",
"profile_name": "nasa",
"display_name": "NASA",
"total_followers": 97400000,
...
},
"posts": [
{
"id": 55,
"post_id": "CXabcdef",
"content": "Hubble captures a new nebula...",
"media_type": "image",
"post_url": "https://www.instagram.com/p/CXabcdef/",
"like_count": 94200,
"comment_count": 812,
"share_count": null,
"view_count": null,
"hashtags": ["#space","#hubble"],
"mentions": [],
"is_reply": false,
"posted_at": "2025-02-14T18:30:00Z",
"last_crawled_at": "2025-03-01T12:03:00Z"
}
],
"interactions_received": [
{
"id": 88,
"source_profile_name": "spacex",
"target_profile_name": "nasa",
"interaction_type": "commented_on",
"weight": 3,
"post_id": "CXabcdef",
"occurred_at": "2025-02-15T09:12:00Z",
"last_seen_at": "2025-03-01T12:03:00Z"
}
],
"interactions_made": [ ... ]
}
Stats
Aggregated counts across the entire database.
GET /stats
Aggregated counts for the whole database — useful for monitoring.
curl http://localhost:8000/api/stats
Response
{
"total_profiles": 24,
"total_posts": 1840,
"total_comments": 12305,
"total_interactions": 4821,
"total_seeds": 6,
"active_seeds": 4,
"jobs_by_status": {
"done": 18,
"failed": 2,
"running": 0,
"pending": 0
}
}
Settings
Read and write runtime configuration — database backend, proxy list, browser options, and crawl defaults.
Settings are stored in the database and override environment defaults at runtime. DATABASE_URL is additionally written to .env so it survives restarts.
DATABASE_URL formats
| Backend | URL format | Use case |
|---|---|---|
| SQLite | sqlite+aiosqlite:///./crawler.db | Local / single-node, zero config |
| PostgreSQL | postgresql+asyncpg://user:pass@host:5432/db | Production / multi-node / remote |
| PostgreSQL + SSL | postgresql+asyncpg://user:pass@host:5432/db?ssl=require | Managed cloud DB (RDS, Supabase, etc.) |
PROXY_LIST — supported URL formats
HTTP, HTTPS, and SOCKS5 proxy URLs are supported, e.g. http://user:pass@proxy1.example.com:8080 or socks5://proxy2.example.com:1080.
Proxies are rotated round-robin across crawls. A proxy is automatically evicted after 3 consecutive failures.
Pass an empty array [] to clear all proxies and use a direct connection.
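The rotation and eviction rules can be sketched as follows (illustrative only; the crawler's internal implementation is not exposed, and `ProxyRotator` is a name we made up):

```python
class ProxyRotator:
    """Round-robin proxy rotation with eviction after 3 consecutive
    failures, per the rules described above."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures
        self._i = 0

    def next_proxy(self):
        """Return the next proxy in rotation, or None for a direct
        connection when the list is empty."""
        if not self.proxies:
            return None
        proxy = self.proxies[self._i % len(self.proxies)]
        self._i += 1
        return proxy

    def report(self, proxy, ok):
        """Record a crawl result; a success resets the failure streak,
        and 3 consecutive failures evict the proxy."""
        if ok:
            self.failures[proxy] = 0
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            self.proxies.remove(proxy)
```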
GET /settings
Return all current runtime settings — merged from environment defaults and any DB overrides saved via the admin UI.
curl http://localhost:8000/api/settings
Response
{
"DATABASE_URL": "sqlite+aiosqlite:///./crawler.db",
"BROWSER_MAX_CONTEXTS": 8,
"BROWSER_HEADLESS": false,
"PROXY_LIST": [],
"CRAWL_DEFAULT_POST_LIMIT": 50,
"CRAWL_DEFAULT_COMMENT_LIMIT": 100,
"CRAWL_COMMENT_CONCURRENCY": 5,
"LOG_LEVEL": "INFO",
"REMOTE_WORKERS": false
}
PUT /settings
Persist one or more settings. All fields are optional — only provided fields are updated. Changes take effect immediately except DATABASE_URL, which also requires a restart. DATABASE_URL is automatically written to the .env file for restart persistence.
Request Body
| DATABASE_URL | string | optional | SQLite: sqlite+aiosqlite:///./file.db | PostgreSQL: postgresql+asyncpg://user:pass@host:5432/db?ssl=require |
| BROWSER_MAX_CONTEXTS | integer | optional | Parallel browser contexts (1–32) |
| BROWSER_HEADLESS | boolean | optional | Run browser without visible UI |
| PROXY_LIST | string[] | optional | HTTP/HTTPS/SOCKS5 proxy URLs, rotated round-robin. Pass [] to clear. |
| CRAWL_DEFAULT_POST_LIMIT | integer | optional | Default post limit for new seeds (1–1000) |
| CRAWL_DEFAULT_COMMENT_LIMIT | integer | optional | Default comment limit for new seeds (0–5000) |
| CRAWL_COMMENT_CONCURRENCY | integer | optional | Concurrent comment pages per crawl (1–20) |
| LOG_LEVEL | string | optional | DEBUG | INFO | WARNING | ERROR |
| REMOTE_WORKERS | boolean | optional | false (default) = admin runs crawls locally. true = admin only queues jobs; worker processes execute them. |
# Switch to PostgreSQL and add proxies
curl -X PUT http://localhost:8000/api/settings \
-H "Content-Type: application/json" \
-d '{
"DATABASE_URL": "postgresql+asyncpg://crawler:secret@db.host:5432/crawler",
"PROXY_LIST": [
"http://user:pass@proxy1.example.com:8080",
"socks5://proxy2.example.com:1080"
],
"BROWSER_HEADLESS": true
}'
Response
{
"DATABASE_URL": "postgresql+asyncpg://crawler:secret@db.example.com:5432/crawler?ssl=require",
"BROWSER_MAX_CONTEXTS": 8,
"BROWSER_HEADLESS": true,
"PROXY_LIST": [
"http://user:pass@proxy1.example.com:8080",
"socks5://proxy2.example.com:1080"
],
"CRAWL_DEFAULT_POST_LIMIT": 50,
"CRAWL_DEFAULT_COMMENT_LIMIT": 100,
"CRAWL_COMMENT_CONCURRENCY": 5,
"LOG_LEVEL": "INFO",
"REMOTE_WORKERS": true
}
POST /settings/test-db
Test a database URL before saving it. Opens a temporary connection, runs SELECT 1, and returns whether the connection succeeded. Supports both SQLite and PostgreSQL URLs.
Request Body
| url | string | required | Full SQLAlchemy async URL to test, e.g. postgresql+asyncpg://user:pass@host:5432/db |
curl -X POST http://localhost:8000/api/settings/test-db \
-H "Content-Type: application/json" \
-d '{"url":"postgresql+asyncpg://crawler:secret@db.host:5432/crawler"}'
Response
// Success
{ "ok": true, "message": "Connection successful" }
// Failure
{ "ok": false, "message": "could not connect to server: Connection refused\n\tIs the server running on host \"db.host\" (10.0.0.5) and accepting TCP/IP connections on port 5432?" }
Workers
Optional distributed crawl processes. When REMOTE_WORKERS=true,
the admin only creates job records and workers pick them up via the shared PostgreSQL database.
Workers communicate through the database only — no direct HTTP between admin and workers.
Communication model
| Worker → PostgreSQL | Required | Workers read pending jobs, write crawl results, send heartbeats (port 5432) |
| Admin → PostgreSQL | Required | Admin reads job status and worker heartbeats (port 5432) |
| Worker → Admin | Not needed | Workers never call the admin API — no inbound port required on admin |
| Admin → Worker | Not needed | Admin never pushes to workers — no inbound port required on workers |
Worker object fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique worker ID — auto-generated as hostname-<hex> on first start, or set via --worker-id |
| hostname | string | OS hostname of the VM running the worker |
| region | string? | Optional label set via --region flag, e.g. "us-east" |
| status | string | "online" if last_heartbeat < 2 min ago, "offline" otherwise |
| current_job_id | integer? | ID of the job currently being executed, or null if idle |
| last_heartbeat | datetime | UTC timestamp of the most recent heartbeat (updated every ~20 s during a crawl, every ~5 s when idle) |
| registered_at | datetime | UTC timestamp when the worker first connected to this database |
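The status field is derived from last_heartbeat with the 2-minute rule above; a Python sketch (timestamps are datetime objects here, though the API serializes them as ISO 8601 strings):

```python
from datetime import datetime, timedelta, timezone

def worker_status(last_heartbeat, now=None, threshold=timedelta(minutes=2)):
    """'online' if the last heartbeat is less than 2 minutes old,
    'offline' otherwise, per the worker table above."""
    now = now or datetime.now(timezone.utc)
    return "online" if now - last_heartbeat < threshold else "offline"
```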
Starting a worker (CLI)
Run on each worker VM after installing the base package (pip install crawler) and Playwright browsers (playwright install chromium).
Workers need only the base crawler package — FastAPI and Uvicorn are not installed on workers. The admin server requires pip install "crawler[admin]".
GET /workers
List all registered worker processes. Workers are considered online if their last heartbeat is less than 2 minutes ago. Workers register themselves automatically on first start — no manual registration needed.
curl http://localhost:8000/api/workers
Response
[
{
"id": "worker-vm-us-east-a1b2c3d4",
"hostname": "worker-vm-us-east",
"region": "us-east",
"status": "online",
"current_job_id": 42,
"last_heartbeat": "2025-03-01T12:05:28Z",
"registered_at": "2025-03-01T09:00:00Z"
},
{
"id": "worker-vm-eu-west-e5f6a7b8",
"hostname": "worker-vm-eu-west",
"region": "eu-west",
"status": "online",
"current_job_id": null,
"last_heartbeat": "2025-03-01T12:05:30Z",
"registered_at": "2025-03-01T09:01:00Z"
}
]
Errors & Status
HTTP Status Codes
| 200 | OK | Request succeeded |
| 201 | Created | Seed created successfully |
| 202 | Accepted | Crawl job enqueued — runs in background |
| 204 | No Content | Seed deleted |
| 400 | Bad Request | Invalid platform or missing required field |
| 404 | Not Found | Seed / job / profile not found |
| 409 | Conflict | Seed already exists for platform + profile_id |
| 422 | Unprocessable | Validation error — check the errors array in response body |
Job Status Values
| pending | Queued, not yet started |
| running | Browser open, currently crawling |
| done | Completed successfully |
| failed | Failed — check the errors[] array in the job response |
done may still have entries in errors[] if some individual posts or comment pages failed while the overall crawl completed.
Always check posts_crawled and errors together.
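That check can be folded into one helper; a sketch over the job fields documented above (the `done_with_errors` label is ours, not an API status value):

```python
def summarize_job(job: dict) -> str:
    """Classify a job using status plus errors[]: per the note above,
    a 'done' job with entries in errors[] completed only partially."""
    status = job["status"]
    if status in ("pending", "running"):
        return "in_progress"
    if status == "failed":
        return "failed"
    return "done_with_errors" if job["errors"] else "done"
```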