test and readme txt

Aaron 2022-05-27 21:37:38 -07:00 committed by GitHub
parent e325b9d810
commit 3e047aec45
2 changed files with 60 additions and 0 deletions

README.txt

@@ -0,0 +1,8 @@
### To create index:
1. Make sure all requirements are installed: check `requirements.txt` and install them with `pip install -r requirements.txt`.
2. Run `python indexer.py` to build the index. This may take some time.
3. The index is now created.
### Start search interface:
Run `python launcher.py` to start the search interface.
### Perform query:
To perform a search, enter a query in the text box and click Search. The top results will be displayed.

TEST.txt

@@ -0,0 +1,52 @@
### Bad:
- computer science - common
- university of california irvine - common
- donald bren - common
- uci - common
- informatics - common
- The Donald Bren School of Information and Computer Sciences - long and common
- toilet - not likely to be found easily
- perfume - not likely to be found
- SPY×FAMILY - should not exist in data
- undergraduate - likely to be on tons of pages
### Good to Meh:
- liquids in labs - uncommon word with common
- Alberto Krone-Martins - should have a good amount of results but not absurd
- Advising & Planning - should be specific but not too common
- Honors Program - ^
- Papaefthymiou - similar to the martins query
- General information - there should be quite a few pages with this but not tons
- Prerequisite Clearing System - has some common and uncommon terms
- Recruiting - not stupid common
- counseling - ^ and should only be on a subset of pages
- social justice - specific terms that should appear without being costly
### Others tested:
- masters of computer science - not super common but will have a good amount of pages
- thornton ics46 notes - name + class + common
- Theory of Computation - two terms which have high count in papers
- facility distribution - two terms which don't really make sense together
- artificial intelligence history - two common terms with semi-common
- prospective alumni - should have very few instances of both terms but should be found together
- enrollment window - should be on only a couple of pages
- available capstone sponsorship - ^
- spring seminars - common with term that may be somewhat restricted
- hackuci - two terms into one that exists in dataset
- ucinetid help - specific term with common
- course restrictions - specific pages
- project management - a course name
- yelan research - term should not exist + common
- hybrid-learning - common phrase
- genshin is a computer game - contains terms that exist and others that don't
- computable AI machine learning big data - sentence of CS buzz words (really really common)
- Publications & Technical Reports - in json file
- Tutor coordinators - in many json (bold, title, and body)
- Death Image Service - in some weird areas
- send anonymous email - only in some
### Things done for improvement
1. Created an index of the index for a substantial gain in efficiency and speed.
2. Split TF-IDF into separate TF and IDF steps so that each can be computed when needed without running the whole computation. This also removes the reliance on an external library for TF-IDF.
3. Switched from using IDF & weight to TF & weight to help with the overall weight.
4. Dropped indexing and searching of unigrams, bigrams, and trigrams.
5. Added the length of each document during indexing to speed up the normalization calculation.
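
A minimal sketch of how items 1, 2, and 5 above could fit together. It is an illustration under assumptions, not the code in `indexer.py`: the function names, the JSON-lines index layout, the log-scaled TF and IDF variants, and the sample data are all hypothetical. The index of the index is assumed to be a map from term to byte offset, so a lookup seeks to one line instead of scanning the file.

```python
import io
import json
import math


def write_index(postings, out):
    """Write one JSON line per term; return {term: byte offset} (the index of the index)."""
    offsets = {}
    for term in sorted(postings):
        offsets[term] = out.tell()
        line = json.dumps({"term": term, "docs": postings[term]}) + "\n"
        out.write(line.encode("utf-8"))
    return offsets


def read_postings(term, offsets, index_file):
    """Seek straight to the term's line; unknown terms yield no postings."""
    if term not in offsets:
        return {}
    index_file.seek(offsets[term])
    return json.loads(index_file.readline())["docs"]


def tf(count):
    """Log-scaled term frequency (one common variant, assumed here)."""
    return 1.0 + math.log10(count) if count > 0 else 0.0


def idf(num_docs, doc_freq):
    """Inverse document frequency (one common variant, assumed here)."""
    return math.log10(num_docs / doc_freq) if doc_freq else 0.0


def score(query_terms, offsets, index_file, doc_lengths):
    """Sum tf * idf per document, then divide by the length stored at index time."""
    scores = {}
    for term in query_terms:
        docs = read_postings(term, offsets, index_file)
        weight = idf(len(doc_lengths), len(docs))
        for doc_id, count in docs.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf(count) * weight
    return {doc_id: total / doc_lengths[doc_id] for doc_id, total in scores.items()}


# Hypothetical sample data: term -> {doc id: occurrence count}.
postings = {"uci": {"a": 2, "b": 1, "c": 1}, "hackuci": {"b": 1}}
doc_lengths = {"a": 10, "b": 5, "c": 8}  # token counts recorded during indexing

index_file = io.BytesIO()  # stands in for the on-disk index file
offsets = write_index(postings, index_file)
result = score(["hackuci", "uci"], offsets, index_file, doc_lengths)
```

In this toy data `uci` appears in every document, so its IDF is zero and only `hackuci` contributes to the ranking, which mirrors why the very common queries in the "Bad" list make poor test cases.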