API Reference
Use the Crawler as a REST service from any language or tool. No authentication required.
Overview
Base URL
http://localhost:8000/api
Content Type
application/json
Authentication
None built-in. Restrict via firewall or reverse proxy for production use. CORS allows all origins when CORS_ORIGINS=* is set.
Async Model
Crawls run in the background (202 Accepted). Poll /jobs/{id} for status.
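The poll-until-done step can be wrapped in a small helper. A minimal Python sketch, assuming a `fetch_job` callable that returns the parsed JSON of `GET /api/jobs/{id}` (the HTTP client is up to you; `wait_for_job` is an illustrative name, not part of the API):

```python
import time

def wait_for_job(fetch_job, job_id, interval=3.0, timeout=600.0):
    """Poll a crawl job until it reaches a terminal status.

    fetch_job: callable taking a job ID and returning the parsed JSON
    of GET /api/jobs/{id}. Terminal statuses are 'done' and 'failed'.
    Raises TimeoutError if the job does not finish within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_job(job_id)
        if job["status"] in ("done", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```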
Expose to network (remote service mode)
CORS_ORIGINS=* crawler admin --host 0.0.0.0 --port 8000
Then call the API from any machine using the server's IP. Secure behind a reverse proxy in production.
Supported Platforms
x, instagram, tiktok, facebook, linkedin
Quick Start
The integration flow: add a seed → trigger a crawl → poll until done → fetch data.
# 1. Add a seed
SEED=$(curl -sX POST http://localhost:8000/api/seeds \
-H "Content-Type: application/json" \
-d '{"platform":"instagram","profile_id":"nasa","post_limit":20}')
SEED_ID=$(echo "$SEED" | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
# 2. Trigger a crawl
JOB=$(curl -sX POST http://localhost:8000/api/seeds/$SEED_ID/crawl)
JOB_ID=$(echo "$JOB" | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")
# 3. Poll until done
while true; do
  STATUS=$(curl -s http://localhost:8000/api/jobs/$JOB_ID | python3 -c "import sys,json; print(json.load(sys.stdin)['status'])")
  echo "Status: $STATUS"
  if [ "$STATUS" = "done" ] || [ "$STATUS" = "failed" ]; then break; fi
  sleep 3
done
# 4. Fetch data
curl "http://localhost:8000/api/profiles/instagram/nasa"
Sessions
Several platforms (X, Instagram, Facebook, LinkedIn) require a logged-in browser session to return
full data. Sessions are saved once on the server via the crawler save-session CLI
command — not through the API. Once saved, every subsequent crawl uses the session automatically.
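Before triggering crawls, you can verify programmatically which sessions exist via GET /api/sessions (documented under "Check session status via API" below). A Python sketch; the `REQUIRES_SESSION` set mirrors the platform list above, and `missing_sessions` is an illustrative helper, not part of the API:

```python
# Platforms that need a logged-in session for full data, per the docs.
REQUIRES_SESSION = {"x", "instagram", "facebook", "linkedin"}

def missing_sessions(sessions, platforms):
    """Given the parsed GET /api/sessions response and the platforms you
    plan to crawl, return the platforms that require a login session but
    have none saved yet."""
    saved = {s["platform"] for s in sessions if s["has_session"]}
    return sorted(p for p in platforms
                  if p in REQUIRES_SESSION and p not in saved)
```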
How to save a session (run on the crawler server)
- SSH into the machine running the crawler (or open a terminal if it's local).
- Activate the virtual environment:
source .venv/bin/activate
- Run the save-session command for the platform (see per-platform commands below).
- A browser window opens — log in manually inside that window.
- Once fully logged in, return to the terminal and press Enter.
- The session is saved to .sessions/{platform}/default.json.
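The save path follows a fixed layout, which a trivial helper can compute (illustrative only; `identity` corresponds to the identities field of GET /api/sessions and is "default" unless you have saved other identities):

```python
def session_path(platform: str, identity: str = "default") -> str:
    """Relative path where a saved session lands on the crawler server,
    per the layout above: .sessions/{platform}/default.json."""
    return f".sessions/{platform}/{identity}.json"
```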
X (Twitter): login required
X (Twitter) requires a logged-in session to serve GraphQL responses. Without a session, profile and post pages return null.
Login URL (opens automatically)
https://x.com/login
Notes
Use a regular personal account. X business/API accounts are not required.
Instagram: login required
Instagram's GraphQL API only returns full profile data (biography, follower_count, etc.) when the browser is authenticated.
Login URL (opens automatically)
https://www.instagram.com/accounts/login/
Notes
Use a regular personal account. Business accounts work too. Two-factor authentication is supported — complete the 2FA flow before pressing Enter.
TikTok: login optional
TikTok's public API returns profile and post data without authentication for public accounts.
Login URL (opens automatically)
https://www.tiktok.com/login
Notes
A session is only needed for private accounts or to avoid aggressive rate limits.
Facebook: login required
Facebook requires login to access profile data via its internal API.
Login URL (opens automatically)
https://www.facebook.com/login
Notes
Use a personal account. Make sure the account can view the profiles you want to crawl.
LinkedIn: login required
LinkedIn requires login to access profile and post data.
Login URL (opens automatically)
https://www.linkedin.com/login
Notes
Use a personal LinkedIn account. The account must be connected to or able to view the target profiles.
Check session status via API
GET /sessions
Check which platforms have a saved login session. Use this to verify a session was saved correctly before triggering a crawl.
curl http://localhost:8000/api/sessions
Response
[
{ "platform": "x", "has_session": true, "identities": ["default"], "save_command": "crawler save-session --platform x" },
{ "platform": "instagram", "has_session": true, "identities": ["default"], "save_command": "crawler save-session --platform instagram" },
{ "platform": "tiktok", "has_session": false, "identities": [], "save_command": "crawler save-session --platform tiktok" },
{ "platform": "facebook", "has_session": false, "identities": [], "save_command": "crawler save-session --platform facebook" },
{ "platform": "linkedin", "has_session": false, "identities": [], "save_command": "crawler save-session --platform linkedin" }
]
Seeds
Seeds are profiles you want to crawl. Add a seed once, then trigger crawl jobs on demand.
GET /seeds
Return all seed profiles.
curl http://localhost:8000/api/seeds
Response
[
{
"id": 1,
"platform": "instagram",
"profile_id": "nasa",
"label": "NASA official",
"is_active": true,
"post_limit": 50,
"comment_limit": 100,
"crawl_status": "done",
"created_at": "2025-01-01T00:00:00Z",
"last_crawled_at": "2025-01-02T10:00:00Z"
}
]
POST /seeds
Register a new profile to crawl. profile_id is normalized — pass a URL, @handle, or plain username.
Request Body
| platform | string | required | x | instagram | tiktok | facebook | linkedin |
| profile_id | string | required | Handle, URL, or @mention e.g. "nasa", "@nasa", "https://instagram.com/nasa" |
| label | string | optional | Human-readable label |
| post_limit | integer | optional | Max posts per crawl (1–1000, default 50) |
| comment_limit | integer | optional | Max comments per post (0–5000, default 100) |
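The server handles profile_id normalization itself; the following illustrative Python mirrors the equivalent logic for the three accepted forms (not the server's actual code):

```python
from urllib.parse import urlparse

def normalize_profile_id(raw: str) -> str:
    """Reduce "nasa", "@nasa", or "https://instagram.com/nasa" to the
    bare handle, mirroring the server-side normalization described above."""
    raw = raw.strip()
    if raw.startswith(("http://", "https://")):
        # The handle is the first non-empty path segment of the profile URL.
        raw = next(seg for seg in urlparse(raw).path.split("/") if seg)
    return raw.lstrip("@")
```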
curl -X POST http://localhost:8000/api/seeds \
-H "Content-Type: application/json" \
-d '{"platform":"instagram","profile_id":"nasa","post_limit":20,"comment_limit":50}'
Response
{
"id": 3,
"platform": "instagram",
"profile_id": "nasa",
"label": null,
"is_active": true,
"post_limit": 20,
"comment_limit": 50,
"crawl_status": null,
"created_at": "2025-03-01T12:00:00Z",
"last_crawled_at": null
}
POST /seeds/{seed_id}/crawl
Enqueue a crawl job for the seed. Returns 202 immediately — the job runs in the background. Poll GET /jobs/{job_id} to track progress.
curl -X POST http://localhost:8000/api/seeds/3/crawl
Response
{
"id": 7,
"platform": "instagram",
"profile_id": "nasa",
"status": "pending",
"post_limit": 20,
"comment_limit": 50,
"posts_crawled": 0,
"comments_crawled": 0,
"errors": [],
"started_at": null,
"finished_at": null,
"created_at": "2025-03-01T12:01:00Z"
}
PATCH /seeds/{seed_id}
Update seed settings. All fields are optional — only provided fields are changed.
Request Body
| label | string | optional | New label |
| is_active | boolean | optional | Enable or pause this seed |
| post_limit | integer | optional | 1–1000 |
| comment_limit | integer | optional | 0–5000 |
curl -X PATCH http://localhost:8000/api/seeds/3 \
-H "Content-Type: application/json" \
-d '{"is_active":false,"post_limit":100}'
Response
{ "id": 3, "is_active": false, "post_limit": 100, ... }
DELETE /seeds/{seed_id}
Remove a seed. Does not delete already-crawled profile/post data.
curl -X DELETE http://localhost:8000/api/seeds/3
Response
204 No Content
Jobs
Jobs represent a single crawl run. Created via POST /seeds/{id}/crawl, they run asynchronously.
GET /jobs
List crawl jobs, newest first.
Query Parameters
| status | string | optional | pending | running | done | failed |
| platform | string | optional | Filter by platform |
| limit | integer | optional | Max results (1–500, default 50) |
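Query strings are easiest to assemble with urllib rather than hand-concatenation; a small illustrative helper (the `jobs_url` name is ours, not part of the API):

```python
from urllib.parse import urlencode

def jobs_url(base="http://localhost:8000/api", **filters):
    """Build a GET /jobs URL from the query parameters above,
    dropping any filter that is left unset (None)."""
    params = {k: v for k, v in filters.items() if v is not None}
    query = urlencode(params)
    return f"{base}/jobs" + (f"?{query}" if query else "")
```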
curl "http://localhost:8000/api/jobs?status=done&limit=10"
Response
[
{
"id": 7,
"platform": "instagram",
"profile_id": "nasa",
"status": "done",
"post_limit": 20,
"comment_limit": 50,
"posts_crawled": 18,
"comments_crawled": 342,
"errors": [],
"started_at": "2025-03-01T12:01:05Z",
"finished_at": "2025-03-01T12:03:22Z",
"created_at": "2025-03-01T12:01:00Z"
}
]
GET /jobs/{job_id}
Get a single job by ID. Use this to poll for completion after triggering a crawl.
curl http://localhost:8000/api/jobs/7
Response
{
"id": 7,
"status": "running",
"posts_crawled": 5,
"comments_crawled": 80,
"errors": [],
"started_at": "2025-03-01T12:01:05Z",
"finished_at": null,
...
}
Profiles
Profiles are populated automatically after a successful crawl. Query them to retrieve normalized data.
GET /profiles
List all crawled profiles. Supports search and platform filter with pagination.
Query Parameters
| platform | string | optional | x | instagram | tiktok | facebook | linkedin |
| search | string | optional | Partial match on profile_name (case-insensitive) |
| limit | integer | optional | 1–1000, default 20 |
| offset | integer | optional | Pagination offset, default 0 |
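The limit/offset scheme above can be wrapped in a generator that walks all pages. A sketch assuming a `fetch_page(limit, offset)` callable that returns the parsed JSON list for GET /api/profiles (the HTTP client is up to you):

```python
def iter_profiles(fetch_page, page_size=100):
    """Yield every profile by paging through GET /api/profiles with
    limit/offset. Stops when a page comes back shorter than page_size,
    which signals the last page."""
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```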
curl "http://localhost:8000/api/profiles?platform=instagram&search=nasa&limit=20"
Response
[
{
"id": 12,
"platform": "instagram",
"platform_user_id": "528817151",
"profile_name": "nasa",
"display_name": "NASA",
"bio": "Explore the universe and discover our home planet.",
"location": null,
"website_url": "https://www.nasa.gov",
"avatar_url": "https://...",
"banner_url": null,
"total_posts": 3812,
"total_followers": 97400000,
"total_following": 62,
"is_verified": true,
"is_private": false,
"joined_at": null,
"first_crawled_at": "2025-01-01T00:00:00Z",
"last_crawled_at": "2025-03-01T12:03:22Z"
}
]
GET /profiles/{platform}/{profile_name}
Full profile detail: normalized profile + up to 200 posts + interaction graph edges (received and made).
curl http://localhost:8000/api/profiles/instagram/nasa
Response
{
"profile": {
"id": 12,
"platform": "instagram",
"profile_name": "nasa",
"display_name": "NASA",
"total_followers": 97400000,
...
},
"posts": [
{
"id": 55,
"post_id": "CXabcdef",
"content": "Hubble captures a new nebula...",
"media_type": "image",
"post_url": "https://www.instagram.com/p/CXabcdef/",
"like_count": 94200,
"comment_count": 812,
"share_count": null,
"view_count": null,
"hashtags": ["#space","#hubble"],
"mentions": [],
"is_reply": false,
"posted_at": "2025-02-14T18:30:00Z",
"last_crawled_at": "2025-03-01T12:03:00Z"
}
],
"interactions_received": [
{
"id": 88,
"source_profile_name": "spacex",
"target_profile_name": "nasa",
"interaction_type": "commented_on",
"weight": 3,
"post_id": "CXabcdef",
"occurred_at": "2025-02-15T09:12:00Z",
"last_seen_at": "2025-03-01T12:03:00Z"
}
],
"interactions_made": [ ... ]
}
Stats
Aggregated counts across the entire database.
GET /stats
Aggregated counts for the whole database — useful for monitoring.
curl http://localhost:8000/api/stats
Response
{
"total_profiles": 24,
"total_posts": 1840,
"total_comments": 12305,
"total_interactions": 4821,
"total_seeds": 6,
"active_seeds": 4,
"jobs_by_status": {
"done": 18,
"failed": 2,
"running": 0,
"pending": 0
}
}
Settings
Read and write runtime configuration — database backend, proxy list, browser options, and crawl defaults.
Settings are stored in the database and override environment defaults at runtime. DATABASE_URL is additionally written to .env so it survives restarts.
DATABASE_URL formats
| Backend | URL format | Use case |
|---|---|---|
| SQLite | sqlite+aiosqlite:///./crawler.db | Local / single-node, zero config |
| PostgreSQL | postgresql+asyncpg://user:pass@host:5432/db | Production / multi-node / remote |
| PostgreSQL + SSL | postgresql+asyncpg://user:pass@host:5432/db?ssl=require | Managed cloud DB (RDS, Supabase, etc.) |
PROXY_LIST — supported URL formats
HTTP, HTTPS, and SOCKS5 proxy URLs are supported, e.g. http://user:pass@proxy1.example.com:8080 or socks5://proxy2.example.com:1080.
Proxies are rotated round-robin across crawls. A proxy is automatically evicted after 3 consecutive failures.
Pass an empty array [] to clear all proxies and use a direct connection.
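The rotation and eviction rules can be sketched as follows (illustrative only; the crawler's internal implementation is not exposed, and `ProxyRotator` is a name we made up):

```python
class ProxyRotator:
    """Round-robin proxy rotation with eviction after 3 consecutive
    failures, per the rules described above."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures
        self._i = 0

    def next_proxy(self):
        """Return the next proxy in rotation, or None for a direct
        connection when the list is empty."""
        if not self.proxies:
            return None
        proxy = self.proxies[self._i % len(self.proxies)]
        self._i += 1
        return proxy

    def report(self, proxy, ok):
        """Record a crawl result; a success resets the failure streak,
        and 3 consecutive failures evict the proxy."""
        if ok:
            self.failures[proxy] = 0
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            self.proxies.remove(proxy)
```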
GET /settings
Return all current runtime settings — merged from environment defaults and any DB overrides saved via the admin UI.
curl http://localhost:8000/api/settings
Response
{
"DATABASE_URL": "sqlite+aiosqlite:///./crawler.db",
"BROWSER_MAX_CONTEXTS": 8,
"BROWSER_HEADLESS": false,
"PROXY_LIST": [],
"CRAWL_DEFAULT_POST_LIMIT": 50,
"CRAWL_DEFAULT_COMMENT_LIMIT": 100,
"CRAWL_COMMENT_CONCURRENCY": 5,
"LOG_LEVEL": "INFO",
"REMOTE_WORKERS": false
}
PUT /settings
Persist one or more settings. All fields are optional — only provided fields are updated. Changes take effect immediately except DATABASE_URL, which also requires a restart. DATABASE_URL is automatically written to the .env file for restart persistence.
Request Body
| DATABASE_URL | string | optional | SQLite: sqlite+aiosqlite:///./file.db | PostgreSQL: postgresql+asyncpg://user:pass@host:5432/db?ssl=require |
| BROWSER_MAX_CONTEXTS | integer | optional | Parallel browser contexts (1–32) |
| BROWSER_HEADLESS | boolean | optional | Run browser without visible UI |
| PROXY_LIST | string[] | optional | HTTP/HTTPS/SOCKS5 proxy URLs, rotated round-robin. Pass [] to clear. |
| CRAWL_DEFAULT_POST_LIMIT | integer | optional | Default post limit for new seeds (1–1000) |
| CRAWL_DEFAULT_COMMENT_LIMIT | integer | optional | Default comment limit for new seeds (0–5000) |
| CRAWL_COMMENT_CONCURRENCY | integer | optional | Concurrent comment pages per crawl (1–20) |
| LOG_LEVEL | string | optional | DEBUG | INFO | WARNING | ERROR |
| REMOTE_WORKERS | boolean | optional | false (default) = admin runs crawls locally. true = admin only queues jobs; worker processes execute them. |
# Switch to PostgreSQL and add proxies
curl -X PUT http://localhost:8000/api/settings \
-H "Content-Type: application/json" \
-d '{
"DATABASE_URL": "postgresql+asyncpg://crawler:secret@db.host:5432/crawler",
"PROXY_LIST": [
"http://user:pass@proxy1.example.com:8080",
"socks5://proxy2.example.com:1080"
],
"BROWSER_HEADLESS": true
}'
Response
{
"DATABASE_URL": "postgresql+asyncpg://crawler:secret@db.example.com:5432/crawler?ssl=require",
"BROWSER_MAX_CONTEXTS": 8,
"BROWSER_HEADLESS": true,
"PROXY_LIST": [
"http://user:pass@proxy1.example.com:8080",
"socks5://proxy2.example.com:1080"
],
"CRAWL_DEFAULT_POST_LIMIT": 50,
"CRAWL_DEFAULT_COMMENT_LIMIT": 100,
"CRAWL_COMMENT_CONCURRENCY": 5,
"LOG_LEVEL": "INFO",
"REMOTE_WORKERS": true
}
POST /settings/test-db
Test a database URL before saving it. Opens a temporary connection, runs SELECT 1, and returns whether the connection succeeded. Supports both SQLite and PostgreSQL URLs.
Request Body
| url | string | required | Full SQLAlchemy async URL to test, e.g. postgresql+asyncpg://user:pass@host:5432/db |
curl -X POST http://localhost:8000/api/settings/test-db \
-H "Content-Type: application/json" \
-d '{"url":"postgresql+asyncpg://crawler:secret@db.host:5432/crawler"}'
Response
// Success
{ "ok": true, "message": "Connection successful" }
// Failure
{ "ok": false, "message": "could not connect to server: Connection refused\n\tIs the server running on host \"db.host\" (10.0.0.5) and accepting TCP/IP connections on port 5432?" }
Workers
Optional distributed crawl processes. When REMOTE_WORKERS=true,
the admin only creates job records and workers pick them up via the shared PostgreSQL database.
Workers communicate through the database only — no direct HTTP between admin and workers.
Communication model
| Worker → PostgreSQL | Required | Workers read pending jobs, write crawl results, send heartbeats (port 5432) |
| Admin → PostgreSQL | Required | Admin reads job status and worker heartbeats (port 5432) |
| Worker → Admin | Not needed | Workers never call the admin API — no inbound port required on admin |
| Admin → Worker | Not needed | Admin never pushes to workers — no inbound port required on workers |
Worker object fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique worker ID — auto-generated as hostname-<hex> on first start, or set via --worker-id |
| hostname | string | OS hostname of the VM running the worker |
| region | string? | Optional label set via --region flag, e.g. "us-east" |
| status | string | "online" if last_heartbeat < 2 min ago, "offline" otherwise |
| current_job_id | integer? | ID of the job currently being executed, or null if idle |
| last_heartbeat | datetime | UTC timestamp of the most recent heartbeat (updated every ~20 s during a crawl, every ~5 s when idle) |
| registered_at | datetime | UTC timestamp when the worker first connected to this database |
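The status field is derived from last_heartbeat with the 2-minute rule above; a Python sketch (timestamps are datetime objects here, though the API serializes them as ISO 8601 strings):

```python
from datetime import datetime, timedelta, timezone

def worker_status(last_heartbeat, now=None, threshold=timedelta(minutes=2)):
    """'online' if the last heartbeat is less than 2 minutes old,
    'offline' otherwise, per the worker table above."""
    now = now or datetime.now(timezone.utc)
    return "online" if now - last_heartbeat < threshold else "offline"
```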
Starting a worker (CLI)
Run on each worker VM after installing the base package (pip install crawler) and Playwright browsers (playwright install chromium).
Workers need only the base crawler package — FastAPI and Uvicorn are not installed on workers. The admin server requires pip install "crawler[admin]".
GET /workers
List all registered worker processes. Workers are considered online if their last heartbeat is less than 2 minutes ago. Workers register themselves automatically on first start — no manual registration needed.
curl http://localhost:8000/api/workers
Response
[
{
"id": "worker-vm-us-east-a1b2c3d4",
"hostname": "worker-vm-us-east",
"region": "us-east",
"status": "online",
"current_job_id": 42,
"last_heartbeat": "2025-03-01T12:05:28Z",
"registered_at": "2025-03-01T09:00:00Z"
},
{
"id": "worker-vm-eu-west-e5f6a7b8",
"hostname": "worker-vm-eu-west",
"region": "eu-west",
"status": "online",
"current_job_id": null,
"last_heartbeat": "2025-03-01T12:05:30Z",
"registered_at": "2025-03-01T09:01:00Z"
}
]
Errors & Status
HTTP Status Codes
| 200 | OK | Request succeeded |
| 201 | Created | Seed created successfully |
| 202 | Accepted | Crawl job enqueued — runs in background |
| 204 | No Content | Seed deleted |
| 400 | Bad Request | Invalid platform or missing required field |
| 404 | Not Found | Seed / job / profile not found |
| 409 | Conflict | Seed already exists for platform + profile_id |
| 422 | Unprocessable | Validation error — check the errors array in response body |
Job Status Values
| pending | Queued, not yet started |
| running | Browser open, currently crawling |
| done | Completed successfully |
| failed | Failed — check the errors[] array in the job response |
done may still have entries in errors[] if some individual posts or comment pages failed while the overall crawl completed.
Always check posts_crawled and errors together.
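That check can be folded into one helper; a sketch over the job fields documented above (the `done_with_errors` label is ours, not an API status value):

```python
def summarize_job(job: dict) -> str:
    """Classify a job using status plus errors[]: per the note above,
    a 'done' job with entries in errors[] completed only partially."""
    status = job["status"]
    if status in ("pending", "running"):
        return "in_progress"
    if status == "failed":
        return "failed"
    return "done_with_errors" if job["errors"] else "done"
```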