From 3e047aec451fdcf69c6b69d358f4d00c57f86824 Mon Sep 17 00:00:00 2001
From: Aaron
Date: Fri, 27 May 2022 21:37:38 -0700
Subject: [PATCH] test and readme txt

---
 README.txt |  8 ++++++++
 TEST.txt   | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)
 create mode 100644 README.txt
 create mode 100644 TEST.txt

diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..8842680
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,8 @@
+### To create index:
+ 1. Make sure that all requirements are installed: check `requirements.txt` and install with `pip install -r requirements.txt`.
+ 2. Run `python indexer.py` to build the index; this may take some time.
+ 3. Index is now created.
+### Start search interface:
+Run `python launcher.py` to start the search interface.
+### Perform query:
+To perform a search, enter a query in the textbox and click search. The top results will be displayed.
\ No newline at end of file
diff --git a/TEST.txt b/TEST.txt
new file mode 100644
index 0000000..015cc98
--- /dev/null
+++ b/TEST.txt
@@ -0,0 +1,52 @@
+
+### Bad:
+- computer science - common
+- university of california irvine - common
+- donald bren - common
+- uci - common
+- informatics - common
+- The Donald Bren School of Information and Computer Sciences - long and common
+- toilet - not likely to be found easily
+- perfume - not likely to be found
+- SPY×FAMILY - should not exist in data
+- undergraduate - likely to be on tons of pages
+### Good to Meh:
+- liquids in labs - uncommon word with common
+- Alberto Krone-Martins - should have a good amount of results but not absurd
+- Advising & Planning - should be specific but not too common
+- Honors Program - ^
+- Papaefthymiou - similar to the martins query
+- General information - there should be quite a few pages with this but not tons
+- Prerequisite Clearing System - has some common and uncommon terms
+- Recruiting - not stupid common
+- counseling - ^ and should only be on a subset of pages
+- social justice - specific terms that should appear without being costly
+### Others tested:
+- masters of computer science - not super common but will have a good amount of pages
+- thornton ics46 notes - name + class + common
+- Theory of Computation - two terms which have high count in papers
+- facility distribution - two terms which don't really make sense together
+- artificial intelligence history - two common terms with semi-common
+- prospective alumni - should have very few instances of both terms but should be found together
+- enrollment window - should be on only a couple of pages
+- available capstone sponsorship - ^
+- spring seminars - common with term that may be somewhat restricted
+- hackuci - two terms into one that exists in dataset
+- ucinetid help - specific term with common
+- course restrictions - specific pages
+- project management - a course name
+- yelan research - term should not exist + common
+- hybrid-learning - common phrase
+- genshin is a computer game - contains terms that exist and others that don't
+- computable AI machine learning big data - sentence of CS buzz words (really really common)
+- Publications & Technical Reports - in json file
+- Tutor coordinators - in many json (bold, title, and body)
+- Death Image Service - in some weird areas
+- send anonymous email - only in some
+### Things done for improvement
+ 1. Created an index of the index for a substantial gain in efficiency and speed.
+ 2. Split TF-IDF into TF and IDF for more specific calculations when needed without the whole computation. This also removes the reliance on an external library for TF-IDF.
+ 3. Switched from using IDF & weight to TF & weight to help with the overall weight.
+ 4. Dropped indexing and searching of unigrams, bigrams, and trigrams.
+ 5. Added the document length during indexing to speed up the normalization calculation.
+
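
Note on improvement items 2 and 5: the patch does not include `indexer.py`, so the following is only a minimal sketch of what keeping TF and IDF as separate steps, and normalizing by a document length stored at index time, might look like. Every function name and the three-document toy corpus are hypothetical and are not taken from the project.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Return raw term counts for one document (the TF part)."""
    return Counter(tokens)


def inverse_document_frequency(doc_count, docs_with_term):
    """Return a log-scaled IDF; 0.0 when no document contains the term."""
    if docs_with_term == 0:
        return 0.0
    return math.log(doc_count / docs_with_term)


def score(query_terms, doc_tf, doc_length, doc_count, doc_freq):
    """Sum TF * IDF over the query terms, normalized by the stored length."""
    if doc_length == 0:
        return 0.0
    total = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0) / doc_length  # length saved at index time
        idf = inverse_document_frequency(doc_count, doc_freq.get(term, 0))
        total += tf * idf
    return total


# Hypothetical toy corpus, for illustration only.
docs = {
    "a": "honors program advising".split(),
    "b": "honors program honors thesis".split(),
    "c": "campus parking".split(),
}
tfs = {doc_id: term_frequencies(tokens) for doc_id, tokens in docs.items()}
lengths = {doc_id: len(tokens) for doc_id, tokens in docs.items()}
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

ranked = sorted(
    docs,
    key=lambda d: score(["honors"], tfs[d], lengths[d], len(docs), doc_freq),
    reverse=True,
)
print(ranked)
```

Because the lengths are computed once while indexing, a query only needs a dictionary lookup per document instead of re-tokenizing it, which is presumably the speed gain item 5 refers to.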