A web search engine for the UCI domain. Migrated from GitHub.
__init__.py First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ... 2022-05-04 12:22:20 -07:00
.gitignore Changed some files and tf_idf, added data storage, and finish the loop for indexing 2022-05-06 14:58:03 -07:00
importanttext.py added important tokens 2022-05-06 17:19:37 -07:00
indexer.py search functionality to obtain set of documents 2022-05-26 23:34:29 -07:00
mytest.py attempted fix for if-idf 2022-05-06 14:03:49 -07:00
posting.py created new tf-idf and changed posting class 2022-05-25 18:41:36 -07:00
README.md Update README.md 2022-05-03 21:32:33 -07:00
requirements.txt Stemmed done 2022-05-04 15:30:01 -07:00
save_1.shelve.bak Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_1.shelve.dat Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_1.shelve.dir Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_2.shelve.bak Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_2.shelve.dat Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_2.shelve.dir Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_3.shelve.bak Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_3.shelve.dat Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_3.shelve.dir Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_4.shelve.bak Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_4.shelve.dat Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_4.shelve.dir Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_5.shelve.bak Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_5.shelve.dat Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
save_5.shelve.dir Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
search.py search functionality to obtain set of documents 2022-05-26 23:34:29 -07:00
searchtesting.py search functionality to obtain set of documents 2022-05-26 23:34:29 -07:00
stemmer.py First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ... 2022-05-04 12:22:20 -07:00
tempCodeRunnerFile.py search functionality to obtain set of documents 2022-05-26 23:34:29 -07:00
test1.py Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
test.py First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ... 2022-05-04 12:22:20 -07:00
testfile.json added important tokens 2022-05-06 17:18:34 -07:00
urlID.pkl Added way to save ngrams to index 2022-05-13 16:42:33 -07:00
worker.py search functionality to obtain set of documents 2022-05-26 23:34:29 -07:00

Search_Engine

Developing a mini search engine in Python using an inverted index, stemming, and other search-engine techniques.

Part 1: The Inverted Index

Create an inverted index for the corpus with data structures designed by you.

  • Tokens: all alphanumeric sequences in the dataset.

  • Stop words: do not use stopping while indexing, i.e. use all words, even the frequently occurring ones.

  • Stemming: use stemming for better textual matches. Suggestion: Porter stemming, but it is up to you to choose.

  • Important text: text in bold (b, strong), in headings (h1, h2, h3), and in titles should be treated as more important than text in other places.

Verify which are the relevant HTML tags to select the important words.
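
The tokenizing and important-text rules above can be sketched with the standard library alone. This is a minimal sketch, not the repository's implementation: the names `tokenize`, `ImportantTextParser`, and `parse_html` are hypothetical, lowercasing tokens is an added assumption, and the set of important tags is taken directly from the list above and may need extending after you verify the tags in the corpus.

```python
import re
from html.parser import HTMLParser

# Tags named in the list above; treating them as the full set is an assumption.
IMPORTANT_TAGS = {"b", "strong", "h1", "h2", "h3", "title"}
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return all alphanumeric sequences, lowercased; stop words are kept."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


class ImportantTextParser(HTMLParser):
    """Collects every token, and separately those inside important tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of important tags currently open
        self.tokens = []      # every token in the document
        self.important = []   # tokens found inside an important tag

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        found = tokenize(data)
        self.tokens.extend(found)
        if self.depth > 0:
            self.important.extend(found)


def parse_html(html, stem=lambda token: token):
    """Return (all_tokens, important_tokens), each passed through `stem`."""
    parser = ImportantTextParser()
    parser.feed(html)
    parser.close()
    return ([stem(t) for t in parser.tokens],
            [stem(t) for t in parser.important])
```

The `stem` argument defaults to the identity function so the sketch runs without third-party packages; a Porter stemmer (for example the `stem` method of NLTK's `PorterStemmer`, if that library is installed) could be passed in its place.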

Building the inverted index

Now that you have been provided the HTML files to index, you may build your inverted index from them. The inverted index is simply a map with the token as the key and a list of its corresponding postings as the value. A posting is the representation of the token's occurrence in a document. A posting typically contains (but is not limited to) the following information; you are encouraged to think of other attributes that you could add to the index:

  • The document name/id the token was found in.
  • Its tf-idf score for that document (for MS1, add only the term frequency).
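
As an illustration of the map-of-postings structure described above, here is a minimal in-memory sketch. The names `Posting`, `build_index`, and `add_tfidf` are hypothetical and not taken from this repository, and the weighting `(1 + log10(tf)) * log10(N / df)` is one common tf-idf variant chosen as an assumption; the assignment does not prescribe a specific formula.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: int          # document the token was found in
    tf: int              # raw term frequency in that document
    tfidf: float = 0.0   # filled in later, once all documents are indexed


def build_index(docs):
    """docs maps doc_id -> list of tokens; returns token -> list of Postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # postings end up ordered by doc_id
        for token, tf in Counter(docs[doc_id]).items():
            index[token].append(Posting(doc_id, tf))
    return dict(index)


def add_tfidf(index, n_docs):
    """Score each posting in place (assumed log-weighted tf-idf variant)."""
    for postings in index.values():
        # Document frequency is the number of postings for the token.
        idf = math.log10(n_docs / len(postings))
        for p in postings:
            p.tfidf = (1 + math.log10(p.tf)) * idf
```

For the first milestone only `build_index` is needed, since the posting then carries just the term frequency. A token that appears in every document gets an idf of zero under this variant, so its tf-idf score is zero as well.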

Some tips:

  • When designing your inverted index, think about the structure of your posting first.
  • You would normally begin by implementing the code to calculate/fetch the elements which will constitute your posting.
  • Modularize. Use scripts/classes that each perform one function or a set of closely related functions. This helps with tracking your progress, debugging, and dividing work among teammates if you're in a group.
  • We recommend you use GitHub as a mechanism to work with your team members on this project, but you are not required to do so.