Hieuhuy Pham
|
c1b7a50460
|
Locks no longer race; locking and multi-threading both work. Changed how some information is stored so it's more readable. Added some new regexes, but they will need to be trimmed later because they do not do their job
|
2022-04-23 18:49:24 -07:00 |
|
Hieuhuy Pham
|
74063e5d00
|
Fixed a lot of race conditions. There could still be some reader-writer confusion, but it should not matter much; as long as the server is healthy we can let this bad boi loose
|
2022-04-23 02:13:12 -07:00 |
|
Hieuhuy Pham
|
90a5d16456
|
Load balancer installed, have not been able to test it yet
|
2022-04-22 16:51:32 -07:00 |
|
Hieuhuy Pham
|
8b96a7c9f7
|
More refinement of frontier and worker for delicious multi-threading
|
2022-04-21 21:08:23 -07:00 |
|
iNocturnis
|
58c923f075
|
Merge branch 'data_collection'
|
2022-04-21 20:44:18 -07:00 |
|
Hieuhuy Pham
|
9301bd5ebe
|
More lock and semaphore refinement
|
2022-04-21 20:41:25 -07:00 |
|
unknown
|
754d3b4af6
|
(andy) first move recent discussed issue
|
2022-04-21 20:31:38 -07:00 |
|
Hieuhuy Pham
|
320fe26c23
|
Added basic multi-threading, reader-first implementation
|
2022-04-21 19:44:30 -07:00 |
|
iNocturnis
|
9fcd9cfd99
|
Merged conflicts resolved
|
2022-04-20 17:54:37 -07:00 |
|
iNocturnis
|
e27b40f153
|
Merge branch 'main' of https://github.com/iNocturnis/webcrawler
|
2022-04-20 17:50:44 -07:00 |
|
iNocturnis
|
b495292b87
|
Merge remote-tracking branch 'origin/traps'
|
2022-04-20 17:49:34 -07:00 |
|
Lacerum
|
809b3dc820
|
moved robots ok to other file like datacollect
|
2022-04-20 13:29:18 -07:00 |
|
Lacerum
|
ab39c4b8c6
|
changed elif to if to speed up regex in is_valid
|
2022-04-20 12:18:25 -07:00 |
|
Lacerum
|
af26611ef4
|
hopeful fixes for issue #2,#3
|
2022-04-20 11:11:43 -07:00 |
|
Hieuhuy Pham
|
58d15918d5
|
Changed more syntax to get data collection working, checked extracturl and sorted links into sets instead of lists to significantly reduce url extractions
|
2022-04-20 04:03:58 -07:00 |
|
Hieuhuy Pham
|
d0dde4a4db
|
Fixes error in syntax for new merged code from data collection branch, fixed 'infinite loop', added timers to measure performance of functions.
|
2022-04-20 03:52:14 -07:00 |
|
iNocturnis
|
367a324ead
|
Merge remote-tracking branch 'origin/traps'
|
2022-04-20 02:17:27 -07:00 |
|
unknown
|
e31ad13d40
|
Moved stuff out of scraper.py into frontier.py
|
2022-04-20 01:00:04 -07:00 |
|
unknown
|
bdd61a373b
|
Moved stuff out of scraper
|
2022-04-20 00:49:49 -07:00 |
|
unknown
|
44c86eb51a
|
finished datacollection
|
2022-04-19 22:59:14 -07:00 |
|
Lacerum
|
0377265180
|
urls when opened download a file, keep or no, idk
|
2022-04-19 13:18:15 -07:00 |
|
Lacerum
|
56e74c6b4b
|
url len chg and added catch for repeating filter
|
2022-04-19 12:52:23 -07:00 |
|
Lacerum
|
8f260cb110
|
trap fixes based on internet and what I found
|
2022-04-19 03:02:14 -07:00 |
|
unknown
|
f2cdf66de1
|
added functionality for unique links
|
2022-04-18 19:01:07 -07:00 |
|
Lacerum
|
4ace2164f2
|
more todos
|
2022-04-18 18:38:16 -07:00 |
|
Lacerum
|
4080d46541
|
added my todo for traps so far
|
2022-04-18 18:04:11 -07:00 |
|
Lacerum
|
0e5af0a4c7
|
added commented out robot check in next link
|
2022-04-18 11:59:56 -07:00 |
|
Lacerum
|
1fbcb81fae
|
forgot to add robot check in is_valid
|
2022-04-18 11:54:47 -07:00 |
|
Lacerum
|
577fdb5a80
|
added robots.txt check
|
2022-04-18 11:29:43 -07:00 |
|
Lacerum
|
0e4187a5fa
|
added a looping and repeating trap fix
|
2022-04-18 02:25:03 -07:00 |
|
Lacerum
|
2efcb22c58
|
test create branch, place holder for trap fix
|
2022-04-17 13:00:07 -07:00 |
|
Lacerum
|
3e8f57bd34
|
added my uci id to useragent
|
2022-04-16 02:50:28 -07:00 |
|
iNocturnis
|
e19f68a6a6
|
Add files via upload
First Upload
|
2022-04-15 17:55:11 -07:00 |
|