A web search engine for the UCI domain. Migrated from GitHub.

unknown 60f6eb0df0 search functionality to obtain set of documents		2022-05-26 23:34:29 -07:00
__init__.py	First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ...	2022-05-04 12:22:20 -07:00
.gitignore	Changed some files and tf_idf, added data storage, and finish the loop for indexing	2022-05-06 14:58:03 -07:00
importanttext.py	added important tokens	2022-05-06 17:19:37 -07:00
indexer.py	search functionality to obtain set of documents	2022-05-26 23:34:29 -07:00
mytest.py	attempted fix for if-idf	2022-05-06 14:03:49 -07:00
posting.py	created new tf-idf and changed posting class	2022-05-25 18:41:36 -07:00
README.md	Update README.md	2022-05-03 21:32:33 -07:00
requirements.txt	Stemmed done	2022-05-04 15:30:01 -07:00
save_1.shelve.bak	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_1.shelve.dat	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_1.shelve.dir	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_2.shelve.bak	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_2.shelve.dat	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_2.shelve.dir	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_3.shelve.bak	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_3.shelve.dat	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_3.shelve.dir	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_4.shelve.bak	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_4.shelve.dat	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_4.shelve.dir	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_5.shelve.bak	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_5.shelve.dat	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
save_5.shelve.dir	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
search.py	search functionality to obtain set of documents	2022-05-26 23:34:29 -07:00
searchtesting.py	search functionality to obtain set of documents	2022-05-26 23:34:29 -07:00
stemmer.py	First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ...	2022-05-04 12:22:20 -07:00
tempCodeRunnerFile.py	search functionality to obtain set of documents	2022-05-26 23:34:29 -07:00
test1.py	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
test.py	First pushed, setup all the stuff we need, no launcher yet. So test your code in another place for now, because they are all codepended on each others ...	2022-05-04 12:22:20 -07:00
testfile.json	added important tokens	2022-05-06 17:18:34 -07:00
urlID.pkl	Added way to save ngrams to index	2022-05-13 16:42:33 -07:00
worker.py	search functionality to obtain set of documents	2022-05-26 23:34:29 -07:00

README.md

Search_Engine

A mini search engine in Python, built on an inverted index with stemming and other search-engine techniques.

Part 1: The Inverted Index

Create an inverted index for the corpus with data structures designed by you.

Tokens: all alphanumeric sequences in the dataset.
Stop words: do not use stopping while indexing, i.e. use all words, even the frequently occurring ones.
Stemming: use stemming for better textual matches. Suggestion: Porter stemming, but it is up to you to choose.
Important text: text in bold (b, strong), in headings (h1, h2, h3), and in titles should be treated as more important than text in other places.
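
The token and stemming rules above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `tokenize` and `naive_stem` are hypothetical, lowercasing is an assumption the requirements do not state, and `naive_stem` is a toy suffix stripper standing in for a real Porter stemmer.

```python
import re
from typing import Callable, List

# A token is any maximal run of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def naive_stem(token: str) -> str:
    """Toy suffix stripper for illustration only; this is NOT Porter stemming."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the word after stripping.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text: str, stem: Callable[[str], str] = lambda t: t) -> List[str]:
    """Return lowercased alphanumeric tokens, each passed through `stem`.

    No stop words are removed, so frequent words such as "the" are kept.
    """
    return [stem(m.group().lower()) for m in TOKEN_RE.finditer(text)]
```

For real use, the `stem` argument could instead be the `stem` method of a Porter stemmer instance, for example the one provided by NLTK, assuming that library is installed.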

Verify which are the relevant HTML tags to select the important words.
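
One possible way to separate important text from ordinary text is to track which tags are open while parsing. The sketch below uses only the standard library's `html.parser`; the class name `ImportantTextParser`, the helper `split_important`, and the exact tag set are assumptions made for illustration, and badly nested real-world HTML may need a more forgiving parser.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Tags whose text is treated as important (taken from the list above).
IMPORTANT_TAGS = {"b", "strong", "h1", "h2", "h3", "title"}


class ImportantTextParser(HTMLParser):
    """Collects text inside important tags separately from all other text."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # number of important tags currently open
        self.important: List[str] = []
        self.normal: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in IMPORTANT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags in malformed pages.
        if tag in IMPORTANT_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.important if self._depth else self.normal
            target.append(text)


def split_important(html: str) -> Tuple[List[str], List[str]]:
    """Return (important_text_chunks, normal_text_chunks) for one page."""
    parser = ImportantTextParser()
    parser.feed(html)
    parser.close()
    return parser.important, parser.normal
```

The two lists can then be tokenized separately, so that tokens from important text receive a higher weight in the posting.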

Building the inverted index

Now that you have been provided the HTML files to index, you can build your inverted index from them. The inverted index is simply a map with the token as the key and a list of its corresponding postings as the value. A posting represents the token’s occurrence in a document. It typically contains, but is not limited to, the following information (you are encouraged to think of other attributes that you could add to the index):

The document name/id the token was found in.
Its tf-idf score for that document (for MS1, add only the term frequency).
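
The map-of-postings structure described above might look like the following sketch. The `Posting` fields, `build_index`, and `add_tf_idf` are hypothetical names chosen for illustration and are not necessarily what `posting.py` or `indexer.py` in this repository contain. The weighting `(1 + log10(tf)) * log10(N / df)` is one common tf-idf variant, assumed here because the text does not fix a formula.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Posting:
    doc_id: int           # document the token was found in
    tf: int               # raw term frequency in that document
    tf_idf: float = 0.0   # filled in once the whole corpus is indexed


def build_index(docs: Dict[int, List[str]]) -> Dict[str, List[Posting]]:
    """Map each token to its postings, given doc_id -> list of tokens."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id in sorted(docs):  # keeps each posting list ordered by doc_id
        for token, count in Counter(docs[doc_id]).items():
            index[token].append(Posting(doc_id, count))
    return dict(index)


def add_tf_idf(index: Dict[str, List[Posting]], n_docs: int) -> None:
    """Fill in tf-idf scores in place (assumed log-weighted variant)."""
    for postings in index.values():
        idf = math.log10(n_docs / len(postings))  # len(postings) is the df
        for posting in postings:
            posting.tf_idf = (1 + math.log10(posting.tf)) * idf
```

Storing only `tf` at first matches the first-milestone requirement; the tf-idf pass has to wait until every document has been seen, because the document frequency is not known before then.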

Some tips:

When designing your inverted index, think about the structure of your posting first.
You would normally begin by implementing the code that calculates or fetches the elements that will make up your posting.
Modularize. Use scripts/classes that will perform a function or a set of closely related functions. This helps in keeping track of your progress, debugging, and also dividing work amongst teammates if you’re in a group.
We recommend you use GitHub as a mechanism to work with your team members on this project, but you are not required to do so.
