COMP6714 2025T2 Project Specification

1. Project Overview

In this project, you will implement (in Python 3, on the CSE Linux machines) a simple search engine that ranks documents based on:

  • Query term coverage
  • Proximity of matched terms
  • Preservation of query term order

A search query consists of space-separated terms containing only alphanumeric characters (no punctuation).

2. Core Requirements

  • Implement an indexer (index.py) and a search program (search.py).
  • Use an inverted index with positional information (as described in Week 1 lectures).
  • Additional indexes may be implemented if needed.
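A positional inverted index maps each term to the documents that contain it, together with the token positions of every occurrence. The following is a minimal sketch only, assuming the documents have already been tokenised and normalised; the function name `build_positional_index` and the in-memory `docs` dictionary are hypothetical and not part of the specification.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Return {term: {doc_id: [positions]}} for already-normalised tokens.

    `docs` is a hypothetical mapping of doc_id -> list of tokens.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            # Positions are appended in increasing order within each document.
            index[term][doc_id].append(position)
    return index


# Toy data for illustration only.
docs = {
    1: ["apple", "stock", "rise"],
    2: ["fresh", "apple", "and", "apple", "juice"],
}
index = build_positional_index(docs)
```

Keeping the positions sorted per document is what later makes proximity and order checks between query terms cheap to compute.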

3. Term Matching Rules

  • Case insensitive (e.g., “Apple” = “apple”).
  • Abbreviations: Ignore full stops (e.g., “U.S.” = “US”).
  • Hyphenated terms:
    • Preserve the hyphen if the first part has fewer than 3 letters (e.g., “D-Kans”, “co-author”).
    • Split otherwise (e.g., “set-aside” → “set”, “aside”).
  • Singular/plural forms and tense are ignored (e.g., “cat” = “cats”; “breach” = “breached”).
  • Sentence endings: Only ., ?, ! mark sentence boundaries.
  • Numbers:
    • Decimal numbers can be ignored (. is invalid in search terms).
    • Years/integers should be indexed (commas ignored, e.g., “1,000,000” = “1000000”).
  • Other punctuation: Treated as token dividers.
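Several of the rules above can be sketched as a small tokeniser. This is an illustrative sketch under stated assumptions, not the reference behaviour: the function name `tokenize` is hypothetical, decimal numbers are simply dropped, every remaining full stop is deleted (so sentence boundaries are not tracked here), and the singular/plural/tense rule is left out because it would need a stemmer or lemmatiser.

```python
import re


def tokenize(text):
    """Hypothetical tokeniser covering case, full stops, commas and hyphens."""
    text = text.lower()                          # case insensitive
    text = re.sub(r"(?<=\d),(?=\d)", "", text)   # 1,000,000 -> 1000000
    text = re.sub(r"\b\d+\.\d+\b", " ", text)    # assumption: drop decimals
    text = text.replace(".", "")                 # u.s. -> us (simplification)
    tokens = []
    # Everything except letters, digits and hyphens acts as a token divider.
    for chunk in re.split(r"[^a-z0-9-]+", text):
        parts = [p for p in chunk.split("-") if p]
        if not parts:
            continue
        if len(parts) > 1 and len(parts[0]) < 3:
            tokens.append("-".join(parts))       # keep "co-author" whole
        else:
            tokens.extend(parts)                 # "set-aside" -> set, aside
    return tokens


sample = "The U.S. co-author set-aside 1,000,000 apples, costing 3.5 dollars."
result = tokenize(sample)
```

Deleting every full stop is the simplest way to handle abbreviations, but it would fuse two words separated only by a full stop with no space, so a real indexer would probably need to treat sentence-ending punctuation separately.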

4. Ranking Criteria

Documents are ranked by:

  1. Term coverage (proportion of query terms matched).
  2. Proximity (average distance between matched terms).
  3. Order preservation (consecutive query terms appearing in the same left-to-right order).

Scoring Formula:
\[
\mathrm{Score}(d) = \alpha \cdot \frac{\#\text{matched\_terms}}{\#\text{query\_terms}} + \beta \cdot \frac{1}{1 + \text{avg\_distance}} + \gamma \cdot \text{ordered\_pairs}
\]
Where:

  • Default values: \(\alpha = 1.0\), \(\beta = 1.0\), \(\gamma = 0.1\).
  • For single-term queries, proximity and order scores are 0.
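The formula can be written directly as a function. This is a sketch only: the function and parameter names are hypothetical, and treating a missing average distance (passed as `None`, for example when fewer than two query terms match in a document) as a proximity score of 0 is an assumption that the specification does not state.

```python
def score(matched_terms, query_terms, avg_distance, ordered_pairs,
          alpha=1.0, beta=1.0, gamma=0.1):
    """Combine coverage, proximity and order into one ranking score."""
    coverage = matched_terms / query_terms
    if query_terms == 1:
        # Single-term queries: proximity and order scores are 0.
        return alpha * coverage
    # Assumption: no measurable distance means no proximity contribution.
    proximity = 0.0 if avg_distance is None else 1.0 / (1.0 + avg_distance)
    return alpha * coverage + beta * proximity + gamma * ordered_pairs


# Two of two terms matched, average distance 1, one ordered pair:
# 1.0 * (2/2) + 1.0 * (1 / (1 + 1)) + 0.1 * 1
example = score(2, 2, 1.0, 1)
```

With the default weights, coverage and proximity each contribute at most 1.0, while each ordered pair adds 0.1.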

5. Indexer (index.py)

Command:

python3 index.py [folder-of-documents] [folder-of-indexes]

Output:

  • Total documents, tokens, and terms indexed.

Example:

$ python3 index.py ~cs6714/Public/data ./MyTestIndex  
Total number of documents: 1000  
Total number of tokens: 268,568  
Total number of terms: 259,182  

6. Search Program (search.py)

Command:

python3 search.py [folder-of-indexes]

Behavior:

  • Accepts queries from stdin until Ctrl-D.
  • Outputs ranked document IDs (one per line).
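Reading queries until Ctrl-D amounts to iterating over standard input until end of file. The loop below is a hedged sketch: `run_queries`, the `search` callback, and the `emit` parameter are hypothetical names, and skipping blank lines is an assumption rather than something the specification requires.

```python
def run_queries(lines, search, emit=print):
    """Run each query line through `search` and emit one doc ID per line.

    `lines` is any iterable of strings (for example sys.stdin), and
    `search` is a hypothetical callable returning ranked document IDs.
    """
    for line in lines:
        query = line.strip()
        if not query:
            continue  # assumption: ignore empty input lines
        for doc_id in search(query):
            emit(doc_id)


# Stand-in search function for illustration; it is not a real ranking.
collected = []
run_queries(["Apple\n", "\n", "Australia Technology\n"],
            lambda q: [len(q)], emit=collected.append)
```

In an actual search.py the call would look like `run_queries(sys.stdin, search)` after `import sys`; iteration over `sys.stdin` stops when Ctrl-D signals end of input.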

Example:

$ python3 search.py ~/Proj/MyTestIndex  
Apple  
1361  
Australia Technology  
3454  
10  
18  
...  

7. Displaying Matching Lines (Optional)

For queries starting with > :

  • Displays document IDs prefixed with > followed by lines containing the closest matching terms.
  • Only one line per matched term (prioritizing earliest occurrence).

Example:

$ python3 search.py ~/Proj/MyTestIndex  
> Apples  
> 1361  
The department said stocks of fresh apples in cold storage  

8. Marking (40 Points Total)

  • Correctness: Exact match of document IDs and order required for full marks.
  • Partial Marks: F-measure used for ranking errors (precision/recall).
  • Runtime Limits:
    • Indexer: 1 minute.
    • Search: 10 seconds per query.
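For self-checking against expected results, precision, recall and an F-measure can be computed over sets of document IDs. The specification does not say which F-measure variant the markers use or how ordering is factored in, so the sketch below assumes the balanced F1 score and ignores order; `f_measure` is a hypothetical helper name.

```python
def f_measure(returned, expected):
    """Balanced F1 over document-ID sets (assumed variant, order ignored)."""
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    if hits == 0:
        return 0.0  # also covers empty result lists without dividing by zero
    precision = hits / len(returned)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


# Two of three returned IDs are correct, and two of three expected IDs found.
partial = f_measure([1, 2, 3], [2, 3, 4])
```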

9. Submission

  • Deadline: Friday, 1st August 23:59.
  • Format: All .py files in a single .zip archive, submitted via Moodle.
  • Late Penalty: 5% deduction per day (up to 5 days).

10. Permitted Libraries

  • Python Standard Library only.
  • NLTK allowed (pre-downloaded for marking; remove nltk.download() calls).

11. Plagiarism Policy

  • Individual work only.
  • Penalties apply for copied code or public repositories.