ABOUT
-------------------------
This is the base implementation of a full crawler that uses a spacetime
cache server to receive requests.

CONFIGURATION
-------------------------

### Step 1: Install dependencies

If you do not have Python 3.6+:

Windows: https://www.python.org/downloads/windows/

Linux: https://docs.python-guide.org/starting/install3/linux/

Mac: https://docs.python-guide.org/starting/install3/osx/

Check whether pip is installed by opening a terminal/command prompt and typing
the command `python3 -m pip`. This should show the help menu for all the
commands possible with pip. If it does not, get pip by following the
instructions at https://pip.pypa.io/en/stable/installing/

To install the dependencies for this project, run the following two commands
after ensuring pip is installed for the version of Python you are using.
Admin privileges might be required to execute the commands. Also make sure
that the terminal is at the root folder of this project.
```
python -m pip install packages/spacetime-2.1.1-py3-none-any.whl
python -m pip install -r packages/requirements.txt
```

### Step 2: Configuring config.ini

Set the options in the config.ini file. The following options exist
(a sample sketch follows the list).

**USERAGENT**: Set the user agent to `IR F19 uci-id1,uci-id2,uci-id3`.
It is important to set the user agent appropriately to get credit for
hitting our cache.

**HOST**: The host name of our caching server. Please set it as per the spec.

**PORT**: The port number of our caching server. Please set it as per the spec.

**SEEDURL**: The starting url that the crawler first downloads.

**POLITENESS**: The time delay each thread has to wait after each download.

**SAVE**: The file used to save crawler progress. If you want to restart the
crawler from the seed url, simply delete this file.

**THREADCOUNT**: The number of concurrent threads to use. Do not change it if
you have not implemented multithreading in the crawler. The crawler, as it is,
is deliberately not thread safe.
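For orientation, here is a sketch of the kind of values config.ini ends up
holding. The file shipped with the project already has the correct layout
(including any section headers), so edit that file in place; the bracketed
values below are placeholders rather than real settings.

```
USERAGENT = IR F19 uci-id1,uci-id2,uci-id3
HOST = <cache host given in the spec>
PORT = <cache port given in the spec>
SEEDURL = <seed url given in the spec>
POLITENESS = <delay in seconds>
SAVE = <progress file name>
THREADCOUNT = 1
```
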
### Step 3: Define your scraper rules.

Develop the definition of the function scraper in scraper.py:

```
def scraper(url: str, resp: utils.response.Response) -> list:
    pass
```

The scraper takes in two parameters:

**ARGS**

*url*:

The URL that was added to the frontier and downloaded from the cache.
It is of type str and is a url that was previously added to the
frontier.

*resp*:

This is the response given by the caching server for the requested URL.
The response is an object of type Response (see utils/response.py):
```
class Response:
    Attributes:
        url:
            The URL identifying the response.
        status:
            An integer that identifies the status of the response. This
            follows the same status codes as http
            (REF: https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).
            In addition, there are status codes provided by the caching
            server (600-606) that define caching-specific errors.
        error:
            If the status code is between 600 and 606, the reason for
            the error is provided in this attribute. Note that for status
            codes (400-599), the error message is not put in this error
            attribute; instead it must be picked up from the raw_response
            (if any, and if useful).
        raw_response:
            If the status is between 200-599 (standard http), the raw
            response object is the one defined by the requests library.
            Useful resources for understanding this raw response object:
            https://realpython.com/python-requests/#the-response
            https://requests.kennethreitz.org/en/master/api/#requests.Response
            HINT: raw_response.content gives you the webpage html content.
```
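
To make the status ranges above concrete, here is a minimal sketch of a guard a
scraper might apply before reading raw_response. can_parse is a hypothetical
helper name, and treating only status 200 as parseable is an assumption you may
want to relax.

```
def can_parse(resp):
    # Cache-specific errors (600-606): the reason is in resp.error.
    if 600 <= resp.status <= 606:
        return False
    # Standard http statuses: resp.error is not set; details, if any,
    # live on resp.raw_response. Here only 200 with a body is parsed.
    if resp.status != 200 or resp.raw_response is None:
        return False
    # resp.raw_response.content now holds the page html.
    return True
```
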
**Return Value**

This function needs to return a list of urls that are scraped from the
response (an empty list for responses that are empty). These urls will be
added to the Frontier and retrieved from the cache. These urls have to be
filtered so that urls that do not have to be downloaded are not added to the
frontier.

The first step of filtering the urls can be done using the **is_valid** function
provided in the same scraper.py file. Additional rules should be added to
is_valid to filter the urls; a sketch of a possible scraper is shown below.
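
As a starting point, here is a minimal sketch of what scraper could look like
using only the standard library. is_valid is the filter already provided in
scraper.py; LinkCollector and extract_next_links are illustrative helpers, not
part of the project skeleton.

```
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    # Collects href values from <a> tags.
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_next_links(url, resp):
    # Only parse successful responses that actually carry content.
    if resp.status != 200 or resp.raw_response is None:
        return []
    collector = LinkCollector()
    collector.feed(resp.raw_response.content.decode("utf-8", errors="ignore"))
    links = []
    for href in collector.hrefs:
        absolute = urljoin(resp.url, href)   # resolve relative links
        absolute, _ = urldefrag(absolute)    # drop the #fragment part
        links.append(absolute)
    return links


def scraper(url, resp):
    # Return only the urls that pass the is_valid filter.
    return [link for link in extract_next_links(url, resp) if is_valid(link)]
```
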
EXECUTION
-------------------------

To execute the crawler, run the launch.py command:
```python3 launch.py```

You can restart the crawler from the seed url
(all current progress will be deleted) using the command
```python3 launch.py --restart```

You can specify a different config file by using the command with the option
```python3 launch.py --config_file path/to/config```

ARCHITECTURE
-------------------------

### FLOW

The crawler receives a cache host and port from the spacetime servers
and instantiates the config.

It launches a crawler (defined in crawler/\_\_init\_\_.py L5) which creates a
Frontier and Worker(s) using the optional parameters frontier_factory and
worker_factory.

When the crawler is started, workers are created that pick up an
undownloaded link from the frontier, download it from our cache server, and
pass the response to your scraper function. The links returned by the scraper
are added to the list of undownloaded links in the frontier, and the url that
was downloaded is marked as complete. The cycle continues until there are no
more urls to be downloaded in the frontier.
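
If you define your own frontier or worker (see the sections below), they can be
supplied through those factory parameters. A hedged sketch follows, assuming the
class in crawler/\_\_init\_\_.py is named Crawler, accepts frontier_factory and
worker_factory keyword arguments as described above, and is started with a
start() method; check crawler/\_\_init\_\_.py L5 for the real names. MyFrontier
and MyWorker stand for your own classes.

```
# Assumption: crawler/__init__.py exposes a Crawler class with this signature
# and a start() method; verify against the actual file before relying on it.
from crawler import Crawler


def run_with_custom_parts(config, restart, MyFrontier, MyWorker):
    # MyFrontier and MyWorker must satisfy the interfaces described below.
    crawler = Crawler(config, restart,
                      frontier_factory=MyFrontier,
                      worker_factory=MyWorker)
    crawler.start()
```
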
### REDEFINING THE FRONTIER

You can make your own frontier to use with the crawler if it meets this
interface definition:
```
class Frontier:
    def __init__(self, config, restart):
        # Initializer.
        # config -> Config object (defined in utils/config.py L1)
        #           Note that the cache server is already defined at this
        #           point.
        # restart -> A bool that is True if the crawler has to restart
        #            from the seed url and delete any current progress.
        pass

    def get_tbd_url(self):
        # Get one url that has to be downloaded.
        # Can return None to signify the end of crawling.
        pass

    def add_url(self, url):
        # Adds one url to the frontier to be downloaded later.
        # Checks can be made to prevent downloading duplicates.
        pass

    def mark_url_complete(self, url):
        # Mark a url as completed so that on restart, this url is not
        # downloaded again.
        pass
```
A sample reference is given in utils/frontier.py L10. Note that this
reference is not thread safe.
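
For illustration, here is a minimal sketch of a frontier that adds locking so
that multiple workers can share it. It satisfies the interface above but keeps
everything in memory, so unlike utils/frontier.py it does not persist progress
to the SAVE file. config.seed_url is an assumption about the Config object;
check utils/config.py L1 for the actual attribute name.

```
from threading import RLock


class ThreadSafeFrontier:
    def __init__(self, config, restart):
        self.config = config
        self.lock = RLock()
        # restart is ignored here because nothing is persisted.
        # Assumption: the SEEDURL option is exposed as config.seed_url.
        self.to_be_downloaded = [config.seed_url]
        self.seen = set(self.to_be_downloaded)   # every url ever added
        self.completed = set()                   # urls fully processed

    def get_tbd_url(self):
        # Pop one undownloaded url, or return None to signal the end of crawling.
        with self.lock:
            return self.to_be_downloaded.pop() if self.to_be_downloaded else None

    def add_url(self, url):
        # Add a url only once to avoid downloading duplicates.
        with self.lock:
            if url not in self.seen:
                self.seen.add(url)
                self.to_be_downloaded.append(url)

    def mark_url_complete(self, url):
        # Record completion (a persistent frontier would also save this
        # so the url is skipped after a restart).
        with self.lock:
            self.completed.add(url)
```
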
### REDEFINING THE WORKER

You can make your own worker to use with the crawler if it meets this
interface definition:
```
import time
from threading import Thread

from scraper import scraper
from utils.download import download


class Worker(Thread):  # Worker must inherit from Thread or Process.
    def __init__(self, worker_id, config, frontier):
        # worker_id -> a unique id for the worker to self-identify.
        # config -> Config object (defined in utils/config.py L1)
        #           Note that the cache server is already defined at this
        #           point.
        # frontier -> Frontier object created by the Crawler. Base reference
        #             is shown in utils/frontier.py L10 but can be overloaded
        #             as detailed above.
        self.worker_id = worker_id
        self.config = config
        self.frontier = frontier
        super().__init__(daemon=True)

    def run(self):
        while True:
            # Get one undownloaded link from the frontier.
            url = self.frontier.get_tbd_url()
            if url is None:
                break
            # Download the url from the cache server.
            resp = download(url, self.config)
            # Pass the response to the scraper and add the returned links
            # to the frontier.
            next_links = scraper(url, resp)
            for next_link in next_links:
                self.frontier.add_url(next_link)
            self.frontier.mark_url_complete(url)
            # Respect the politeness delay between downloads.
            time.sleep(self.config.time_delay)
```
A sample reference is given in utils/worker.py L9.

THINGS TO KEEP IN MIND
-------------------------

1. It is important to filter out urls that do not point to a webpage: for
   example, PDFs, PPTs, css, js, etc. The is_valid function filters out a
   large number of such extensions, but there may be more.
2. It is important to filter out urls that are not within the ics.uci.edu
   domain (see the sketch after this list).
3. It is important to maintain politeness to the cache server (on a
   per-domain basis).
4. It is important to set the user agent in config.ini correctly to get
   credit for hitting the cache servers.
5. Launching multiple instances of the crawler will download the same urls in
   both. Mechanisms can be used to avoid that; however, the politeness limits
   still apply and will be checked.
6. Do not attempt to download the links directly from the ics servers.
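
A hedged sketch of the domain check mentioned in point 2: the allowed set of
domains must follow the assignment spec (ics.uci.edu is the only one named in
this README), and is_allowed_domain is an illustrative helper that is_valid
could call alongside its existing extension checks.

```
from urllib.parse import urlparse


def is_allowed_domain(url):
    # Accept ics.uci.edu itself and any of its subdomains,
    # e.g. www.ics.uci.edu or vision.ics.uci.edu.
    host = (urlparse(url).hostname or "").lower()
    return host == "ics.uci.edu" or host.endswith(".ics.uci.edu")
```
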