Update README.md

This commit is contained in:
iNocturnis 2022-05-03 21:00:01 -07:00 committed by GitHub
parent 884ae4a22f
commit 3a4359c53c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -1,3 +1,32 @@
# Search_Engine
Search Engine TIME !
Inversed-Index part 1
Developing a mini search-engine in python using reverse-indexed stemming and other SEOs implementations
## Part 1: The Reversed-Index
Create an inverted index for the corpus with data structures designed by you.
• Tokens: all alphanumeric sequences in the dataset.
• Stop words: do not use stopping while indexing, i.e. use all words, even
the frequently occurring ones.
• Stemming: use stemming for better textual matches. Suggestion: Porter
stemming, but it is up to you to choose.
• Important text: text in bold (b, strong), in headings (h1, h2, h3), and
in titles should be treated as more important than the in other places.
Verify which are the relevant HTML tags to select the important words.
Building the inverted index
Now that you have been provided the HTML files to index, you may build your
inverted index off of them. The inverted index is simply a map with the token
as a key and a list of its corresponding postings. A posting is the representation
of the tokens occurrence in a document. The posting typically (not limited to)
contains the following info (you are encouraged to think of other attributes that
you could add to the index):
• The document name/id the token was found in.
• Its tf-idf score for that document (for MS1, add only the term frequency).
Some tips:
• When designing your inverted index, you will think about the structure
of your posting first.
• You would normally begin by implementing the code to calculate/fetch
the elements which will constitute your posting.
• Modularize. Use scripts/classes that will perform a function or a set of
closely related functions. This helps in keeping track of your progress,
debugging, and also dividing work amongst teammates if youre in a group.
• We recommend you use GitHub as a mechanism to work with your team
members on this project, but you are not required to do so.