Commit Graph

33 Commits

Author SHA1 Message Date
Hieuhuy Pham
c1b7a50460 Locks are not racing anymore, locks work, multi-threading works; changed some information storage so it's more readable; added some new regex, but it will need to be trimmed later because it does not do its job 2022-04-23 18:49:24 -07:00
Hieuhuy Pham
74063e5d00 Fixed a lot of racing issues; there potentially could be a writer-reader confusion type of thing, but it should not matter that much; as long as the server is healthy we can let this bad boi loose 2022-04-23 02:13:12 -07:00
Hieuhuy Pham
90a5d16456 Load balancer installed, have not been able to test yet 2022-04-22 16:51:32 -07:00
Hieuhuy Pham
8b96a7c9f7 More refinement of frontier and worker for delicious multi-threading 2022-04-21 21:08:23 -07:00
iNocturnis
58c923f075 Merge branch 'data_collection' 2022-04-21 20:44:18 -07:00
Hieuhuy Pham
9301bd5ebe More locks and semaphore refinement 2022-04-21 20:41:25 -07:00
unknown
754d3b4af6 (andy) first move recent discussed issue 2022-04-21 20:31:38 -07:00
Hieuhuy Pham
320fe26c23 Added basic multi-threading, reader-first implementation 2022-04-21 19:44:30 -07:00
iNocturnis
9fcd9cfd99 Merged conflicts resolved 2022-04-20 17:54:37 -07:00
iNocturnis
e27b40f153 Merge branch 'main' of https://github.com/iNocturnis/webcrawler 2022-04-20 17:50:44 -07:00
iNocturnis
b495292b87 Merge remote-tracking branch 'origin/traps' 2022-04-20 17:49:34 -07:00
Lacerum
809b3dc820 moved robots ok to other file like datacollect 2022-04-20 13:29:18 -07:00
Lacerum
ab39c4b8c6 changed elif to if to speed up regex in is_valid 2022-04-20 12:18:25 -07:00
Lacerum
af26611ef4 hopeful fixes for issue #2,#3 2022-04-20 11:11:43 -07:00
Hieuhuy Pham
58d15918d5 Change more syntax to get data collection working, check extracturl and sorted links into sets instead of lists to significantly reduce url extractions 2022-04-20 04:03:58 -07:00
Hieuhuy Pham
d0dde4a4db Fixes error in syntax for new merged code from data collection branch, fixed 'infinite loop', added timers to measure performance of functions. 2022-04-20 03:52:14 -07:00
iNocturnis
367a324ead Merge remote-tracking branch 'origin/traps' 2022-04-20 02:17:27 -07:00
unknown
e31ad13d40 Moved stuff out of scraper.py into frontier.py 2022-04-20 01:00:04 -07:00
unknown
bdd61a373b Moved stuff out of scraper 2022-04-20 00:49:49 -07:00
unknown
44c86eb51a finished datacollection 2022-04-19 22:59:14 -07:00
Lacerum
0377265180 urls when opened download a file, keep or no, idk 2022-04-19 13:18:15 -07:00
Lacerum
56e74c6b4b url len chg and added catch for repeating filter 2022-04-19 12:52:23 -07:00
Lacerum
8f260cb110 trap fixes based on internet and what I found 2022-04-19 03:02:14 -07:00
unknown
f2cdf66de1 added functionality for unique links 2022-04-18 19:01:07 -07:00
Lacerum
4ace2164f2 more todos 2022-04-18 18:38:16 -07:00
Lacerum
4080d46541 added my todo for traps so far 2022-04-18 18:04:11 -07:00
Lacerum
0e5af0a4c7 added commented out robot check in next link 2022-04-18 11:59:56 -07:00
Lacerum
1fbcb81fae forgot to add robot check in is_valid 2022-04-18 11:54:47 -07:00
Lacerum
577fdb5a80 added robot.txt check 2022-04-18 11:29:43 -07:00
Lacerum
0e4187a5fa added a looping and repeating trap fix 2022-04-18 02:25:03 -07:00
Lacerum
2efcb22c58 test create branch, place holder for trap fix 2022-04-17 13:00:07 -07:00
Lacerum
3e8f57bd34 added my uci id to useragent 2022-04-16 02:50:28 -07:00
iNocturnis
e19f68a6a6 Add files via upload (First Upload) 2022-04-15 17:55:11 -07:00