Update README.md
This commit is contained in:
parent
3a4359c53c
commit
5875ac0e79
33
README.md
33
README.md
@ -1,32 +1,39 @@
|
|||||||
# Search_Engine
|
# Search_Engine
|
||||||
Developing a mini search-engine in python using reverse-indexed stemming and other SEOs implementations
|
Developing a mini search-engine in python using reverse-indexed stemming and other SEOs implementations
|
||||||
## Part 1: The Reversed-Index
|
## Part 1: The Reversed-Index
|
||||||
Create an inverted index for the corpus with data structures designed by you.
|
|
||||||
• Tokens: all alphanumeric sequences in the dataset.
|
### Create an inverted index for the corpus with data structures designed by you.
|
||||||
• Stop words: do not use stopping while indexing, i.e. use all words, even
|
|
||||||
|
- Tokens: all alphanumeric sequences in the dataset.
|
||||||
|
|
||||||
|
- Stop words: do not use stopping while indexing, i.e. use all words, even
|
||||||
the frequently occurring ones.
|
the frequently occurring ones.
|
||||||
• Stemming: use stemming for better textual matches. Suggestion: Porter
|
|
||||||
|
- Stemming: use stemming for better textual matches. Suggestion: Porter
|
||||||
stemming, but it is up to you to choose.
|
stemming, but it is up to you to choose.
|
||||||
• Important text: text in bold (b, strong), in headings (h1, h2, h3), and
|
|
||||||
|
- Important text: text in bold (b, strong), in headings (h1, h2, h3), and
|
||||||
in titles should be treated as more important than the in other places.
|
in titles should be treated as more important than the in other places.
|
||||||
|
|
||||||
Verify which are the relevant HTML tags to select the important words.
|
Verify which are the relevant HTML tags to select the important words.
|
||||||
|
|
||||||
Building the inverted index
|
### Building the inverted index
|
||||||
Now that you have been provided the HTML files to index, you may build your
|
Now that you have been provided the HTML files to index, you may build your
|
||||||
inverted index off of them. The inverted index is simply a map with the token
|
inverted index off of them. The inverted index is simply a map with the token
|
||||||
as a key and a list of its corresponding postings. A posting is the representation
|
as a key and a list of its corresponding postings. A posting is the representation
|
||||||
of the token’s occurrence in a document. The posting typically (not limited to)
|
of the token’s occurrence in a document. The posting typically (not limited to)
|
||||||
contains the following info (you are encouraged to think of other attributes that
|
contains the following info (you are encouraged to think of other attributes that
|
||||||
you could add to the index):
|
you could add to the index):
|
||||||
• The document name/id the token was found in.
|
- The document name/id the token was found in.
|
||||||
• Its tf-idf score for that document (for MS1, add only the term frequency).
|
- Its tf-idf score for that document (for MS1, add only the term frequency).
|
||||||
Some tips:
|
|
||||||
• When designing your inverted index, you will think about the structure
|
### Some tips:
|
||||||
|
- When designing your inverted index, you will think about the structure
|
||||||
of your posting first.
|
of your posting first.
|
||||||
• You would normally begin by implementing the code to calculate/fetch
|
- You would normally begin by implementing the code to calculate/fetch
|
||||||
the elements which will constitute your posting.
|
the elements which will constitute your posting.
|
||||||
• Modularize. Use scripts/classes that will perform a function or a set of
|
- Modularize. Use scripts/classes that will perform a function or a set of
|
||||||
closely related functions. This helps in keeping track of your progress,
|
closely related functions. This helps in keeping track of your progress,
|
||||||
debugging, and also dividing work amongst teammates if you’re in a group.
|
debugging, and also dividing work amongst teammates if you’re in a group.
|
||||||
• We recommend you use GitHub as a mechanism to work with your team
|
- We recommend you use GitHub as a mechanism to work with your team
|
||||||
members on this project, but you are not required to do so.
|
members on this project, but you are not required to do so.
|
||||||
|
Loading…
Reference in New Issue
Block a user