Commit Graph

24 Commits

Author SHA1 Message Date
iNocturnis
e27b40f153 Merge branch 'main' of https://github.com/iNocturnis/webcrawler 2022-04-20 17:50:44 -07:00
iNocturnis
b495292b87 Merge remote-tracking branch 'origin/traps' 2022-04-20 17:49:34 -07:00
Lacerum
809b3dc820 moved robots ok to other file like datacollect 2022-04-20 13:29:18 -07:00
Lacerum
ab39c4b8c6 changed elif to if to speed up regex in is_valid 2022-04-20 12:18:25 -07:00
Lacerum
af26611ef4 hopeful fixes for issue #2,#3 2022-04-20 11:11:43 -07:00
Hieuhuy Pham
58d15918d5 Change more syntax to get data collection working, check extracturl and sorted links into sets instead of lists to significantly reduce url extractions 2022-04-20 04:03:58 -07:00
Hieuhuy Pham
d0dde4a4db Fixes error in syntax for new merged code from data collection branch, fixed 'infinite loop', added timers to measure performance of functions. 2022-04-20 03:52:14 -07:00
iNocturnis
367a324ead Merge remote-tracking branch 'origin/traps' 2022-04-20 02:17:27 -07:00
unknown
e31ad13d40 Moved stuff out of scraper.py into frontier.py 2022-04-20 01:00:04 -07:00
unknown
bdd61a373b Moved stuff out of scraper 2022-04-20 00:49:49 -07:00
unknown
44c86eb51a finished datacollection 2022-04-19 22:59:14 -07:00
Lacerum
0377265180 urls when opened download a file, keep or no, idk 2022-04-19 13:18:15 -07:00
Lacerum
56e74c6b4b url len chg and added catch for repeating filter 2022-04-19 12:52:23 -07:00
Lacerum
8f260cb110 trap fixes based on internet and what I found 2022-04-19 03:02:14 -07:00
unknown
f2cdf66de1 added functionality for unique links 2022-04-18 19:01:07 -07:00
Lacerum
4ace2164f2 more todos 2022-04-18 18:38:16 -07:00
Lacerum
4080d46541 added my todo for traps so far 2022-04-18 18:04:11 -07:00
Lacerum
0e5af0a4c7 added commented out robot check in next link 2022-04-18 11:59:56 -07:00
Lacerum
1fbcb81fae forgot to add robot check in is_valid 2022-04-18 11:54:47 -07:00
Lacerum
577fdb5a80 added robot.txt check 2022-04-18 11:29:43 -07:00
Lacerum
0e4187a5fa added a looping and repeating trap fix 2022-04-18 02:25:03 -07:00
Lacerum
2efcb22c58 test create branch, place holder for trap fix 2022-04-17 13:00:07 -07:00
Lacerum
3e8f57bd34 added my uci id to useragent 2022-04-16 02:50:28 -07:00
iNocturnis
e19f68a6a6 Add files via upload (First Upload) 2022-04-15 17:55:11 -07:00