US-Artificial-Intelligence / scraper: An API that takes a URL and gives you a file with the website data and browser screenshots.

You run the API on your machine, you send it a URL, and you get back the website data plus screenshots of the site. Simple as that.
This project was made to support the Abbey AI platform. Its author is Gordon Kamer.
Some features:
- Scrolls the page and takes screenshots of different sections
- Runs in a Docker container
- Browser-based (runs JavaScript on the website)
- Gives you the HTTP status code of the initial request
- Automatically handles 302 redirects
- Handles download links correctly
- Processes tasks in a queue with configurable memory allocation
- Lets you lock down the API with API keys
- No state or other complexity
This web scraper is higher fidelity than many alternatives: websites are scraped using Playwright, which launches a new browser context for each job.
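As an illustration of that per-job isolation idea, here is a minimal Python sketch using Playwright's public API. It is not the repo's actual worker code; the choice of Firefox, the JPEG settings, and the function name are assumptions made for the example:

from playwright.sync_api import sync_playwright

def scrape_once(url: str) -> tuple[str, bytes]:
    # A fresh browser context per job means cookies, storage, and cache
    # are never shared between scrapes.
    with sync_playwright() as p:
        browser = p.firefox.launch()        # browser choice is an assumption
        context = browser.new_context()     # new, isolated context for this job
        page = context.new_page()
        page.goto(url)
        html = page.content()               # page source after JavaScript has run
        screenshot = page.screenshot(type="jpeg", quality=85)
        context.close()
        browser.close()
    return html, screenshot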
You must have Docker and Docker Compose installed.
- Clone this repo
- Run docker compose up (a docker-compose.yml file is provided for your use)
… and the service will be available at http://localhost:5006. See the usage section below for details on how to interact with it.
You can set an API key using a .env file placed in the /scraper folder (at the same level as app.py).
You can add as many API keys as you'd like; authorized API keys are those whose variable names start with SCRAPER_API_KEY. For example, here is a .env file with three usable keys:
SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too
API keys are sent to the service using the Authorization: Bearer scheme.
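For example, a server-side check in that spirit might look like the following sketch. It is not necessarily how app.py implements it; the function name and the open-when-no-keys behavior are assumptions:

import os

def is_authorized(auth_header):
    # Any env var whose name starts with SCRAPER_API_KEY defines a valid key.
    valid_keys = {v for k, v in os.environ.items() if k.startswith("SCRAPER_API_KEY")}
    if not valid_keys:
        return True  # assumption: no keys configured means the API is open
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):].strip()
    return token in valid_keys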
The root path / returns status 200 if the service is online, along with some Gilbert and Sullivan lyrics (you can visit it in your browser to check that it's up).
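For example, a quick liveness check from Python:

import requests

resp = requests.get("http://localhost:5006/", timeout=5)
print("online" if resp.status_code == 200 else f"unexpected status: {resp.status_code}")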
The only other path is /scrape, to which you send a JSON-formatted POST request and (if all goes well) receive a multipart/mixed response.
The response will be one of:
- Status 200: a multipart/mixed response where the first part is application/json with information about the request; the second part is the website data (usually text/html); and the remaining parts are up to 5 screenshots.
- Status other than 200: an application/json response with an error message under the "error" key.
Here’s a sample curl request:
curl -X POST "http://localhost:5006/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://us.ai"}'
Here is example code using Python and the requests_toolbelt library that will let you interact with the API properly:
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder
import sys
import json

data = {
    'url': "https://us.ai"
}

# Optional if you're using an API key
headers = {
    'Authorization': f'Bearer Your-API-Key'
}

response = requests.post('http://localhost:5006/scrape', json=data, headers=headers, timeout=30)

if response.status_code != 200:
    my_json = response.json()
    message = my_json['error']
    print(f"Error scraping: {message}", file=sys.stderr)
else:
    decoder = MultipartDecoder.from_response(response)
    resp = None
    for i, part in enumerate(decoder.parts):
        if i == 0:  # First is some JSON
            json_part = json.loads(part.content)
            req_status = json_part['status']  # An integer
            req_headers = json_part['headers']  # Headers from the request made to your URL
            metadata = json_part['metadata']  # Information like the number of screenshots and their compressed / uncompressed sizes
            # ...
        elif i == 1:  # Next is the actual content of the page
            content = part.content
            headers = part.headers  # Will contain info about the content (text/html, application/pdf, etc.)
            # ...
        else:  # Other parts are screenshots, if they exist
            img = part.content
            headers = part.headers  # Will tell you the image format
            # ...
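Building on that, here is one way you might save the screenshot parts to disk. The file-naming scheme and the JPEG/PNG guess from the Content-Type header are choices made for this example:

import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder

response = requests.post('http://localhost:5006/scrape', json={'url': "https://us.ai"}, timeout=30)
response.raise_for_status()
decoder = MultipartDecoder.from_response(response)

# Part 0 is the request JSON and part 1 is the page content; the rest are screenshots.
for i, part in enumerate(decoder.parts[2:]):
    content_type = part.headers.get(b'Content-Type', b'').decode()
    ext = 'jpg' if 'jpeg' in content_type else 'png'
    with open(f'screenshot_{i}.{ext}', 'wb') as f:
        f.write(part.content)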
Scraping untrusted websites is a serious security concern. The risk is mitigated in the following ways:
- Runs as an isolated Docker container (container isolation)
- Each website is scraped in a new browser context (process isolation)
- Strict memory limits and timeouts for each task
- The URL is checked to make sure it isn't suspicious (loopback address, non-HTTP scheme, etc.); a sketch of that kind of check follows this list
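The exact validation lives in the repo's code; purely as an illustration, a check in that spirit could look like the sketch below. The resolver call and the specific blocked ranges are assumptions for the example:

import ipaddress
import socket
from urllib.parse import urlparse

def looks_safe(url):
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):   # reject non-HTTP schemes
        return False
    if not parsed.hostname:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
    except (socket.gaierror, ValueError):
        return False
    # Reject loopback, private, and link-local targets.
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)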
You may take additional precautions depending on your needs, such as:
- Giving the API only trusted URLs (or otherwise screening URLs)
- Running the API on isolated VMs (hardware isolation)
- Using one API key per user
- Not making any secret files or keys available inside the container (other than the API key for the scraper itself)
If you want to make sure this API meets your security standards, please examine the code and open issues! It's not a big repo.
You can control memory limits and other variables at the top of scraper/worker.py. Here are the defaults:
MEM_LIMIT_MB = 4_000 # 4 GB memory threshold for child scraping process
MAX_SCREENSHOTS = 5
SCREENSHOT_JPEG_QUALITY = 85
BROWSER_HEIGHT = 2000
BROWSER_WIDTH = 1280