Commit Graph

  • 5b0a9bfbe2 Git pushed after crawling #1 main Hieuhuy Pham 2022-04-25 20:19:40 -0700
  • 8d5a669d9e Added some trap detection for really bad links Hieuhuy Pham 2022-04-25 15:54:57 -0700
  • c1b7a50460 Locks are not racing anymore, locks work multi-thread works, change some storing information stuff so its more readble, add some new regex but it will need to be trim later because it does not do its job Hieuhuy Pham 2022-04-23 18:49:24 -0700
  • 9c31a901b7 another attempt at robots, merged regex as well traps Lacerum 2022-04-23 14:44:47 -0700
  • 74063e5d00 Fixed a lot of racing issues, there potentially could be a writer reader confusion type of thing, but it should not matter that much, as long as server is healthy we can let this bad boi lose Hieuhuy Pham 2022-04-23 02:13:12 -0700
  • 90a5d16456 Load balancer installed, havent not been able to test yet Hieuhuy Pham 2022-04-22 16:51:32 -0700
  • 8b96a7c9f7 More refinement of frontier and worker for delicious multi-threading Hieuhuy Pham 2022-04-21 21:08:23 -0700
  • 58c923f075 Merge branch 'data_collection' iNocturnis 2022-04-21 20:44:18 -0700
  • 9301bd5ebe More locks and sempahore refinement Hieuhuy Pham 2022-04-21 20:41:25 -0700
  • 754d3b4af6 (andy) first move recent discussed issue data_collection unknown 2022-04-21 20:31:38 -0700
  • 320fe26c23 Added basic multi-threading, reader-first implementation Hieuhuy Pham 2022-04-21 19:44:30 -0700
  • 9fcd9cfd99 Merged conflicts resolved iNocturnis 2022-04-20 17:54:37 -0700
  • e27b40f153 Merge branch 'main' of https://github.com/iNocturnis/webcrawler iNocturnis 2022-04-20 17:50:44 -0700
  • b495292b87 Merge remote-tracking branch 'origin/traps' iNocturnis 2022-04-20 17:49:34 -0700
  • 809b3dc820 moved robots ok to other file like datacollect Lacerum 2022-04-20 13:29:18 -0700
  • ab39c4b8c6 changed elif to if to speed up regex in is_vaild Lacerum 2022-04-20 12:18:25 -0700
  • af26611ef4 hopeful fixes for issue #2,#3 Lacerum 2022-04-20 11:11:43 -0700
  • 58d15918d5 Change more syntax to get data collection working, check extracturl and sorted links into sets instead of lists to signifcantly reduce url extractions Hieuhuy Pham 2022-04-20 04:03:58 -0700
  • d0dde4a4db Fixes error in syntax for new merged code from data collection branch, fixed 'infinite loop', added timers to measure performance of functions. Hieuhuy Pham 2022-04-20 03:52:14 -0700
  • 367a324ead Merge remote-tracking branch 'origin/traps' iNocturnis 2022-04-20 02:17:27 -0700
  • e31ad13d40 Moved stuff out of scraper.py into frontier.py unknown 2022-04-20 01:00:04 -0700
  • bdd61a373b Moved stuff out of scraper unknown 2022-04-20 00:49:49 -0700
  • 44c86eb51a finished datacollection unknown 2022-04-19 22:59:14 -0700
  • 0377265180 urls when opened download a file, keep or no, idk Lacerum 2022-04-19 13:18:15 -0700
  • 56e74c6b4b url len chg and added catch for repeating filter Lacerum 2022-04-19 12:52:23 -0700
  • 8f260cb110 trap fixes based on internet and what I found Lacerum 2022-04-19 03:02:14 -0700
  • f2cdf66de1 added functionality for unique links unknown 2022-04-18 19:01:07 -0700
  • 4ace2164f2 more todos Lacerum 2022-04-18 18:38:16 -0700
  • 4080d46541 added my todo for traps so far Lacerum 2022-04-18 18:04:11 -0700
  • 0e5af0a4c7 added commented out robot check in next link Lacerum 2022-04-18 11:59:56 -0700
  • 1fbcb81fae forgot to add robot check in is_valid Lacerum 2022-04-18 11:54:47 -0700
  • 577fdb5a80 added robot.txt check Lacerum 2022-04-18 11:29:43 -0700
  • 0e4187a5fa added a looping and repeating trap fix Lacerum 2022-04-18 02:25:03 -0700
  • 2efcb22c58 test create branch, place holder for trap fix Lacerum 2022-04-17 13:00:07 -0700
  • 3e8f57bd34 added my uci id to useragent Lacerum 2022-04-16 02:50:28 -0700
  • e19f68a6a6 Add files via upload iNocturnis 2022-04-15 17:55:11 -0700