A UCI domain-related web search. Migrated from GitHub.
2022-05-12 17:58:40 -07:00
__init__.py First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ... 2022-05-04 12:22:20 -07:00
.gitignore Changed some files and tf_idf, added data storage, and finish the loop for indexing 2022-05-06 14:58:03 -07:00
importanttext.py added important tokens 2022-05-06 17:19:37 -07:00
indexer.py Update indexer.py 2022-05-12 17:58:31 -07:00
mytest.py tf-idf ngrams and now returns dict rather than 2022-05-11 14:46:32 -07:00
posting.py Changed some files and tf_idf, added data storage, and finish the loop for indexing 2022-05-06 14:58:03 -07:00
README.md Update README.md 2022-05-03 21:32:33 -07:00
requirements.txt Stemmed done 2022-05-04 15:30:01 -07:00
stemmer.py First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ... 2022-05-04 12:22:20 -07:00
test.py First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ... 2022-05-04 12:22:20 -07:00
testfile.json added important tokens 2022-05-06 17:18:34 -07:00
worker.py Changed tf_idf model into the new one, try it on the current dataset 2022-05-12 15:00:09 -07:00

Search_Engine

Developing a mini search engine in Python using an inverted index, stemming, and other search-engine techniques

Part 1: The Inverted Index

Create an inverted index for the corpus with data structures designed by you.

  • Tokens: all alphanumeric sequences in the dataset.

  • Stop words: do not use stopping while indexing, i.e. use all words, even the frequently occurring ones.

  • Stemming: use stemming for better textual matches. Suggestion: Porter stemming, but it is up to you to choose.

  • Important text: text in bold (b, strong), in headings (h1, h2, h3), and in titles should be treated as more important than text in other places.

Verify which are the relevant HTML tags to select the important words.
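The tokenization and important-text rules above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the names `tokenize`, `ImportantTextParser`, and `extract_tokens` are hypothetical, lowercasing is an added assumption, and the tag set simply mirrors the list above and should be adjusted once you have checked the corpus. Stemming is left as an optional callable so that any stemmer can be plugged in.

```python
import re
from html.parser import HTMLParser

# Assumed tag set, copied from the list above; verify it against the corpus.
IMPORTANT_TAGS = {"b", "strong", "h1", "h2", "h3", "title"}
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text, stem=None):
    """Return lowercase alphanumeric tokens; no stop words are removed.

    `stem` is an optional callable applied to each token, for example
    nltk.stem.PorterStemmer().stem if NLTK is installed (an assumption).
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return [stem(t) for t in tokens] if stem else tokens


class ImportantTextParser(HTMLParser):
    """Collects every token, plus the subset found inside important tags."""

    def __init__(self, stem=None):
        super().__init__()
        self.stem = stem
        self.depth = 0  # number of important tags currently open
        self.all_tokens = []
        self.important_tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IMPORTANT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Note: script and style contents are not filtered out in this sketch.
        tokens = tokenize(data, self.stem)
        self.all_tokens.extend(tokens)
        if self.depth > 0:
            self.important_tokens.extend(tokens)


def extract_tokens(html, stem=None):
    """Return (all_tokens, important_tokens) for one HTML document."""
    parser = ImportantTextParser(stem)
    parser.feed(html)
    parser.close()
    return parser.all_tokens, parser.important_tokens
```

Calling `extract_tokens(page_html)` returns two lists, so the indexer can count ordinary and important occurrences separately and weight them however the team decides.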

Building the inverted index

Now that you have been provided the HTML files to index, you may build your inverted index from them. The inverted index is simply a map with the token as the key and a list of its corresponding postings as the value. A posting is the representation of the token's occurrence in a document. A posting typically contains (but is not limited to) the following information; you are encouraged to think of other attributes that you could add to the index:

  • The document name/id the token was found in.
  • Its tf-idf score for that document (for MS1, add only the term frequency).
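As a rough sketch of the structure described above, the index can be a dictionary from token to a list of postings, where each posting stores the document id and the term frequency. The names `Posting`, `build_index`, and `tf_idf` are hypothetical, and the log-scaled tf-idf weighting shown is only one common variant, assumed here for illustration; the project may choose a different formula.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: int  # document the token was found in
    tf: int      # raw term frequency in that document (the MS1 score)


def build_index(docs):
    """Build token -> list of Posting from {doc_id: [tokens, ...]}.

    Tokens are assumed to be already tokenized and stemmed.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):  # keeps each posting list ordered by doc_id
        for token, count in Counter(docs[doc_id]).items():
            index[token].append(Posting(doc_id, count))
    return dict(index)


def tf_idf(tf, df, n_docs):
    """One common weighting (an assumption): (1 + log10(tf)) * log10(N / df)."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)


# Usage: two tiny documents with hypothetical ids 1 and 2.
docs = {1: ["search", "engine", "search"], 2: ["engine"]}
index = build_index(docs)
for posting in index["search"]:
    df = len(index["search"])  # document frequency = length of the posting list
    print(posting.doc_id, posting.tf, tf_idf(posting.tf, df, len(docs)))
```

Storing only `tf` in the posting is enough for MS1. The document frequency is the length of the posting list, so the tf-idf score can be computed later, once the whole corpus has been indexed.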

Some tips:

  • When designing your inverted index, think about the structure of your posting first.
  • You would normally begin by implementing the code to calculate/fetch the elements which will constitute your posting.
  • Modularize. Use scripts/classes that each perform a function or a set of closely related functions. This helps in keeping track of your progress, debugging, and dividing work amongst teammates if you're in a group.
  • We recommend you use GitHub as a mechanism to work with your team members on this project, but you are not required to do so.