iNocturnis
|
e27b40f153
|
Merge branch 'main' of https://github.com/iNocturnis/webcrawler
|
2022-04-20 17:50:44 -07:00 |
|
iNocturnis
|
b495292b87
|
Merge remote-tracking branch 'origin/traps'
|
2022-04-20 17:49:34 -07:00 |
|
Lacerum
|
809b3dc820
|
moved robots ok to other file like datacollect
|
2022-04-20 13:29:18 -07:00 |
|
Lacerum
|
ab39c4b8c6
|
changed elif to if to speed up regex in is_valid
|
2022-04-20 12:18:25 -07:00 |
|
Lacerum
|
af26611ef4
|
hopeful fixes for issue #2,#3
|
2022-04-20 11:11:43 -07:00 |
|
Hieuhuy Pham
|
58d15918d5
|
Change more syntax to get data collection working, check extracturl and sorted links into sets instead of lists to significantly reduce url extractions
|
2022-04-20 04:03:58 -07:00 |
|
Hieuhuy Pham
|
d0dde4a4db
|
Fixes error in syntax for new merged code from data collection branch, fixed 'infinite loop', added timers to measure performance of functions.
|
2022-04-20 03:52:14 -07:00 |
|
iNocturnis
|
367a324ead
|
Merge remote-tracking branch 'origin/traps'
|
2022-04-20 02:17:27 -07:00 |
|
unknown
|
e31ad13d40
|
Moved stuff out of scraper.py into frontier.py
|
2022-04-20 01:00:04 -07:00 |
|
unknown
|
bdd61a373b
|
Moved stuff out of scraper
|
2022-04-20 00:49:49 -07:00 |
|
unknown
|
44c86eb51a
|
finished datacollection
|
2022-04-19 22:59:14 -07:00 |
|
Lacerum
|
0377265180
|
urls when opened download a file, keep or no, idk
|
2022-04-19 13:18:15 -07:00 |
|
Lacerum
|
56e74c6b4b
|
url len chg and added catch for repeating filter
|
2022-04-19 12:52:23 -07:00 |
|
Lacerum
|
8f260cb110
|
trap fixes based on internet and what I found
|
2022-04-19 03:02:14 -07:00 |
|
unknown
|
f2cdf66de1
|
added functionality for unique links
|
2022-04-18 19:01:07 -07:00 |
|
Lacerum
|
4ace2164f2
|
more todos
|
2022-04-18 18:38:16 -07:00 |
|
Lacerum
|
4080d46541
|
added my todo for traps so far
|
2022-04-18 18:04:11 -07:00 |
|
Lacerum
|
0e5af0a4c7
|
added commented out robot check in next link
|
2022-04-18 11:59:56 -07:00 |
|
Lacerum
|
1fbcb81fae
|
forgot to add robot check in is_valid
|
2022-04-18 11:54:47 -07:00 |
|
Lacerum
|
577fdb5a80
|
added robots.txt check
|
2022-04-18 11:29:43 -07:00 |
|
Lacerum
|
0e4187a5fa
|
added a looping and repeating trap fix
|
2022-04-18 02:25:03 -07:00 |
|
Lacerum
|
2efcb22c58
|
test create branch, place holder for trap fix
|
2022-04-17 13:00:07 -07:00 |
|
iNocturnis
|
e19f68a6a6
|
Add files via upload
First Upload
|
2022-04-15 17:55:11 -07:00 |
|