US-Artificial-Intelligence / scraper: An API that takes a URL and gives you a file with the website data and browser screenshots.

You run the API on your machine, you send it a URL, and you get back the website data plus screenshots of the site. Simple as that.
This project was made to support the Abbey AI platform. Its author is Gordon Kamer.
Some features:
- Scrolls the page and takes screenshots of different sections
- Runs in a Docker container
- Browser-based (runs JavaScript on the website)
- Gives you the HTTP status code of the initial request
- Automatically handles 302 redirects
- Handles download links correctly
- Processes tasks in a queue with configurable memory allocation
- Lets you lock down the API with API keys
- No state or other complexity
This web scraper is higher fidelity than many alternatives: websites are scraped using Playwright, which launches a new browser context for each job.
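As an illustration of that per-job isolation idea, here is a minimal Python sketch using Playwright's public API. It is not the repo's actual worker code; the choice of Firefox, the JPEG settings, and the function name are assumptions made for the example:

from playwright.sync_api import sync_playwright

def scrape_once(url: str) -> tuple[str, bytes]:
    # A fresh browser context per job means cookies, storage, and cache
    # are never shared between scrapes.
    with sync_playwright() as p:
        browser = p.firefox.launch()        # browser choice is an assumption
        context = browser.new_context()     # new, isolated context for this job
        page = context.new_page()
        page.goto(url)
        html = page.content()               # page source after JavaScript has run
        screenshot = page.screenshot(type="jpeg", quality=85)
        context.close()
        browser.close()
    return html, screenshot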
You must have Docker and Docker Compose installed.
- Clone this repo
- Run docker compose up (a docker-compose.yml file is provided for your use)
… and the service will be available at http://localhost:5006. See the usage section below for details on how to interact with it.
You can set an API key using a .env file placed in the /scraper folder (at the same level as app.py).
You can add as many API keys as you'd like; authorized API keys are those whose variable names start with SCRAPER_API_KEY. For example, here is a .env file with three usable keys:
SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too
API keys are sent to the service using the Authorization: Bearer scheme.
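For example, a server-side check in that spirit might look like the following sketch. It is not necessarily how app.py implements it; the function name and the open-when-no-keys behavior are assumptions:

import os

def is_authorized(auth_header):
    # Any env var whose name starts with SCRAPER_API_KEY defines a valid key.
    valid_keys = {v for k, v in os.environ.items() if k.startswith("SCRAPER_API_KEY")}
    if not valid_keys:
        return True  # assumption: no keys configured means the API is open
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):].strip()
    return token in valid_keys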
The root path / returns status 200 if the service is online, along with some Gilbert and Sullivan lyrics (you can visit it in your browser to check that it's up).
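For example, a quick liveness check from Python:

import requests

resp = requests.get("http://localhost:5006/", timeout=5)
print("online" if resp.status_code == 200 else f"unexpected status: {resp.status_code}")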
The only other path is /scrape, to which you send a JSON-formatted POST request and (if all goes well) receive a multipart/mixed response.
The response will be one of:
- Status 200: a multipart/mixed response where the first part is application/json with information about the request; the second part is the website data (usually text/html); and the remaining parts are up to 5 screenshots.
- Status other than 200: an application/json response with an error message under the "error" key.
Here’s a sample curl request:
curl -X POST "http://localhost:5006/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://us.ai"}'
Here is example code using Python and the requests_toolbelt library that will let you interact with the API properly:
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder
import sys
import json

data = {
    'url': "https://us.ai"
}

# Optional if you're using an API key
headers = {
    'Authorization': f'Bearer Your-API-Key'
}

response = requests.post('http://localhost:5006/scrape', json=data, headers=headers, timeout=30)

if response.status_code != 200:
    my_json = response.json()
    message = my_json['error']
    print(f"Error scraping: {message}", file=sys.stderr)
else:
    decoder = MultipartDecoder.from_response(response)
    resp = None
    for i, part in enumerate(decoder.parts):
        if i == 0:  # First is some JSON
            json_part = json.loads(part.content)
            req_status = json_part['status']  # An integer
            req_headers = json_part['headers']  # Headers from the request made to your URL
            metadata = json_part['metadata']  # Information like the number of screenshots and their compressed / uncompressed sizes
            # ...
        elif i == 1:  # Next is the actual content of the page
            content = part.content
            headers = part.headers  # Will contain info about the content (text/html, application/pdf, etc.)
            # ...
        else:  # Other parts are screenshots, if they exist
            img = part.content
            headers = part.headers  # Will tell you the image format
            # ...
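Building on that, here is one way you might save the screenshot parts to disk. The file-naming scheme and the JPEG/PNG guess from the Content-Type header are choices made for this example:

import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder

response = requests.post('http://localhost:5006/scrape', json={'url': "https://us.ai"}, timeout=30)
response.raise_for_status()
decoder = MultipartDecoder.from_response(response)

# Part 0 is the request JSON and part 1 is the page content; the rest are screenshots.
for i, part in enumerate(decoder.parts[2:]):
    content_type = part.headers.get(b'Content-Type', b'').decode()
    ext = 'jpg' if 'jpeg' in content_type else 'png'
    with open(f'screenshot_{i}.{ext}', 'wb') as f:
        f.write(part.content)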
Scraping untrusted websites is a serious security concern. The risk is mitigated in the following ways:
- Runs as an isolated Docker container (container isolation)
- Each website is scraped in a new browser context (process isolation)
- Strict memory limits and timeouts for each task
- The URL is checked to make sure it isn't suspicious (loopback address, non-HTTP scheme, etc.); a sketch of that kind of check follows this list
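The exact validation lives in the repo's code; purely as an illustration, a check in that spirit could look like the sketch below. The resolver call and the specific blocked ranges are assumptions for the example:

import ipaddress
import socket
from urllib.parse import urlparse

def looks_safe(url):
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):   # reject non-HTTP schemes
        return False
    if not parsed.hostname:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(parsed.hostname))
    except (socket.gaierror, ValueError):
        return False
    # Reject loopback, private, and link-local targets.
    return not (addr.is_loopback or addr.is_private or addr.is_link_local)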
You may take additional precautions depending on your needs, such as:
- Giving the API only trusted URLs (or otherwise screening URLs)
- Running the API on isolated VMs (hardware isolation)
- Using one API key per user
- Not making any secret files or keys available inside the container (other than the API key for the scraper itself)
If you want to make sure this API meets your security standards, please examine the code and open issues! It's not a big repo.
You can control memory limits and other variables at the top of scraper/worker.py. Here are the defaults:
MEM_LIMIT_MB = 4_000 # 4 GB memory threshold for child scraping process
MAX_SCREENSHOTS = 5
SCREENSHOT_JPEG_QUALITY = 85
BROWSER_HEIGHT = 2000
BROWSER_WIDTH = 1280