test and readme txt
parent
e325b9d810
commit
3e047aec45
8
README.txt
Normal file
@ -0,0 +1,8 @@
### To create index:
1. Make sure that all requirements are installed: check `requirements.txt` and install them using `pip install -r requirements.txt`.
2. Run `python indexer.py` to build the index. This may take some time to run.
3. The index is now created.
### Start search interface:
Run `python launcher.py` to start the search interface.
### Perform query:
To perform a search, enter a query in the text box and click search. The top results will be displayed.
52
TEST.txt
Normal file
@ -0,0 +1,52 @@
### Bad:
- computer science - common
- university of california irvine - common
- donald bren - common
- uci - common
- informatics - common
- The Donald Bren School of Information and Computer Sciences - long and common
- toilet - not likely to be found easily
- perfume - not likely to be found
- SPY×FAMILY - should not exist in data
- undergraduate - likely to be on tons of pages
### Good to Meh:
- liquids in labs - uncommon word with common
- Alberto Krone-Martins - should have a good amount of results but not absurd
- Advising & Planning - should be specific but not too common
- Honors Program - same as above
- Papaefthymiou - similar to the Alberto Krone-Martins query
- General information - there should be quite a few pages with this but not tons
- Prerequisite Clearing System - has some common and uncommon terms
- Recruiting - not overly common
- counseling - same as above, and should only be on a subset of pages
- social justice - specific terms that should appear without being costly
### Others tested:
- masters of computer science - not super common but will have a good amount of pages
- thornton ics46 notes - name + class + common
- Theory of Computation - two terms which have a high count in papers
- facility distribution - two terms which don't really make sense together
- artificial intelligence history - two common terms with a semi-common one
- prospective alumni - should have very few instances of both terms but should be found together
- enrollment window - should be on only a couple of pages
- available capstone sponsorship - same as above
- spring seminars - common with a term that may be somewhat restricted
- hackuci - two terms joined into one that exists in the dataset
- ucinetid help - specific term with common
- course restrictions - specific pages
- project management - a course name
- yelan research - term that should not exist + common
- hybrid-learning - common phrase
- genshin is a computer game - contains terms that exist and others that don't
- computable AI machine learning big data - sentence of CS buzzwords (really really common)
- Publications & Technical Reports - in a JSON file
- Tutor coordinators - in many JSON files (bold, title, and body)
- Death Image Service - in some weird areas
- send anonymous email - only in some
### Things done for improvement
1. Created an index of the index for a substantial gain in efficiency and speed.
2. Split TF-IDF into separate TF and IDF calculations so that either can be computed when needed without the whole computation. This also removes the reliance on an external library for TF-IDF.
3. Switched from using IDF and weight to using TF and weight when computing the overall weight.
4. Dropped indexing and searching of unigrams, bigrams, and trigrams.
5. Store the length of each document during indexing to speed up the normalization calculation.
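Items 1, 2, and 5 above can be illustrated with a small Python sketch. It is not taken from `indexer.py`; the toy corpus, the JSON-lines layout, and every function name below are assumptions made for illustration, and the scoring is plain TF times IDF rather than the project's own weighting.

```python
import io
import json
import math

# Toy corpus, invented for this sketch.
DOCS = {
    1: "computer science advising and planning",
    2: "honors program advising",
    3: "computer science honors program",
}

def build_index(docs):
    """Return (postings, doc_lengths); postings maps term -> {doc_id: raw count}."""
    postings, doc_lengths = {}, {}
    for doc_id, text in docs.items():
        tokens = text.split()
        doc_lengths[doc_id] = len(tokens)  # item 5: stored once at index time
        for tok in tokens:
            counts = postings.setdefault(tok, {})
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return postings, doc_lengths

def write_index(postings, fh):
    """Write one JSON line per term; return term -> byte offset (item 1)."""
    offsets = {}
    for term in sorted(postings):
        offsets[term] = fh.tell()
        line = json.dumps({"term": term, "postings": postings[term]}) + "\n"
        fh.write(line.encode("utf-8"))
    return offsets

def read_postings(term, fh, offsets):
    """Seek directly to a term's line instead of scanning the whole file."""
    if term not in offsets:
        return {}
    fh.seek(offsets[term])
    record = json.loads(fh.readline().decode("utf-8"))
    # JSON object keys are strings, so convert document ids back to int.
    return {int(d): c for d, c in record["postings"].items()}

def tf(count, doc_length):
    """Term frequency normalized by the stored document length (item 2)."""
    return count / doc_length

def idf(doc_freq, n_docs):
    """Inverse document frequency, computed separately from TF (item 2)."""
    return math.log(n_docs / doc_freq)

def search(query, fh, offsets, doc_lengths):
    """Rank documents by the sum of TF * IDF over the query terms."""
    scores = {}
    n_docs = len(doc_lengths)
    for term in query.split():
        plist = read_postings(term, fh, offsets)
        if not plist:
            continue
        term_idf = idf(len(plist), n_docs)
        for doc_id, count in plist.items():
            score = tf(count, doc_lengths[doc_id]) * term_idf
            scores[doc_id] = scores.get(doc_id, 0.0) + score
    return sorted(scores, key=lambda d: (-scores[d], d))

postings, doc_lengths = build_index(DOCS)
index_file = io.BytesIO()  # stands in for an on-disk index file
offsets = write_index(postings, index_file)
results = search("honors advising", index_file, offsets, doc_lengths)
```

Here `offsets` plays the role of the index of the index: only this small mapping has to stay in memory, and each query term costs one seek and one line read. With a real file the same code would work on a handle opened in binary mode.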