test and readme txt

2022-05-27 21:37:38 -07:00
parent e325b9d810
commit 3e047aec45
2 changed files with 60 additions and 0 deletions
--- a/README.txt
+++ b/README.txt
@@ -0,0 +1,8 @@
+### To create index:
+ 1. Make sure all requirements are installed: check `requirements.txt` and install them with `pip install -r requirements.txt`.
+ 2. Run `python indexer.py` to build the index; this may take some time.
+ 3. The index is now created.
+### Start search interface:
+Run `python launcher.py` to start the search interface.
+### Perform query:
+To perform a search, enter a query in the textbox and click search. The top results will be displayed.
--- a/TEST.txt
+++ b/TEST.txt
@@ -0,0 +1,52 @@
+
+### Bad:
+- computer science - common
+- university of california irvine - common
+- donald bren - common
+- uci - common
+- informatics - common
+- The Donald Bren School of Information and Computer Sciences - long and common
+- toilet - not likely to be found easily 
+- perfume - not likely to be found
+- SPY×FAMILY - should not exist in data
+- undergraduate - likely to be on tons of pages
+### Good to Meh:
+- liquids in labs - uncommon word with common
+- Alberto Krone-Martins - should have a good amount of results but not absurd 
+- Advising & Planning - should be specific but not too common
+- Honors Program - ^
+- Papaefthymiou - similar to the martins query 
+- General information - there should be quite a few pages with this but not tons
+- Prerequisite Clearing System - has some common and uncommon terms
+- Recruiting - not stupid common
+- counseling - ^ and should only be on a subset of pages
+- social justice - specific terms that should appear without being costly
+### Others tested:
+- masters of computer science - not super common but will have a good amount of pages
+- thornton ics46 notes - name + class + common
+- Theory of Computation - two terms which have high count in papers
+- facility distribution - two terms which don't really make sense together
+- artificial intelligence history - two common terms with semi-common
+- prospective alumni - should have very few instances of both terms but should be found together
+- enrollment window - should be on only a couple of pages
+- available capstone sponsorship - ^
+- spring seminars - common with term that may be somewhat restricted
+- hackuci - two terms into one that exists in dataset
+- ucinetid help - specific term with common
+- course restrictions - specific pages
+- project management - a course name
+- yelan research - term should not exist + common
+- hybrid-learning - common phrase 
+- genshin is a computer game - contains terms that exist and others that don't 
+- computable AI machine learning big data - sentence of CS buzz words (really really common)
+- Publications & Technical Reports - in a JSON file
+- Tutor coordinators - in many JSON files (bold, title, and body)
+- Death Image Service - in some weird areas
+- send anonymous email - only in some
+### Things done for improvement
+ 1. Created an index of the index for a substantial gain in efficiency and speed.
+ 2. Split TF-IDF into separate TF and IDF so that each can be calculated on its own when needed, without the whole computation. This also removes the reliance on an external library for TF-IDF.
+ 3. Switched from using IDF & weight to TF & weight to help with the overall weight.
+ 4. Dropped indexing and searching of unigrams, bigrams, and trigrams.
+ 5. Added the length of each document during indexing, so the normalization calculation is faster.
+
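Review note on items 2 and 5 of "Things done for improvement": the following is a minimal sketch of what splitting TF from IDF and storing a per-document length at index time could look like. It is not the code in `indexer.py`. The function names, the log-scaled TF, the `log10(N / df)` IDF, and the use of the Euclidean norm of the TF weights as the "length" are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Log-scaled TF per term: 1 + log10(count). (Assumed weighting.)"""
    counts = Counter(tokens)
    return {term: 1 + math.log10(c) for term, c in counts.items()}


def inverse_document_frequency(num_docs, doc_freq):
    """IDF = log10(N / df), computed separately from TF. (Assumed formula.)"""
    return math.log10(num_docs / doc_freq)


def build_index(docs):
    """Build postings and per-document lengths.

    docs: dict mapping doc_id -> list of tokens (hypothetical input shape).
    Returns (postings, doc_lengths), where postings[term][doc_id] is the TF
    weight and doc_lengths[doc_id] is the Euclidean norm of the TF weights,
    stored once here so it is not recomputed per query.
    """
    postings = {}
    doc_lengths = {}
    for doc_id, tokens in docs.items():
        tf = term_frequencies(tokens)
        for term, weight in tf.items():
            postings.setdefault(term, {})[doc_id] = weight
        doc_lengths[doc_id] = math.sqrt(sum(w * w for w in tf.values()))
    return postings, doc_lengths


def search(query_tokens, postings, doc_lengths):
    """Rank documents by the sum of TF * IDF, divided by the stored length."""
    num_docs = len(doc_lengths)
    scores = {}
    for term in set(query_tokens):
        plist = postings.get(term)
        if not plist:
            continue  # the term does not occur in any document
        idf = inverse_document_frequency(num_docs, len(plist))
        for doc_id, tf in plist.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    normalized = {d: s / doc_lengths[d] for d, s in scores.items()}
    # Highest score first; ties are broken by doc_id for a stable order.
    return sorted(normalized.items(), key=lambda kv: (-kv[1], kv[0]))
```

With this layout, a query only touches the postings of its own terms, IDF is derived from the postings list size when the query runs, and normalization is a single lookup in `doc_lengths` per matching document.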