
Develop a vertical search engine and document clustering system

Added on: 2023-01-16 05:05:23
Country: Australia

Task 1. Search Engine

Develop a vertical search engine similar to Google Scholar, but specialised to retrieve only papers/books published by a member of the School of Economics, Finance and Accounting (SEFA) at Coventry University:

https://pureportal.coventry.ac.uk/en/organisations/school-of-economics-finance-and-accounting

(That is, at least one of the co-authors is a member of SEFA.)

Your system crawls the relevant web pages and retrieves information about all available publications. For each publication, it extracts available data (such as authors, publication year, and title) and the links to both the publication page and the author’s profile (also called “pureportal” profile) page. 

Make sure that your crawler is polite, i.e. it obeys the robots.txt rules and does not hit the servers unnecessarily or too fast.
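As an illustration only, politeness could be sketched with Python's standard `urllib.robotparser` module plus a minimum delay between requests. The robots.txt content, the user-agent name and the `PoliteFetcher` class below are assumptions made for the example; they are not part of the brief, and a real crawler would download robots.txt from the target site first.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "sefa-crawler-example"  # assumed name, not prescribed by the brief


class PoliteFetcher:
    """Checks robots.txt rules and enforces a minimum delay between requests."""

    def __init__(self, robots_txt, user_agent, default_delay=1.0):
        self.parser = robotparser.RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        # Prefer the site's own Crawl-delay when it declares one.
        declared = self.parser.crawl_delay(user_agent)
        self.delay = float(declared) if declared is not None else default_delay
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self, clock=time.monotonic, sleep=time.sleep):
        """Sleep until at least `self.delay` seconds have passed since the last request."""
        now = clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                sleep(remaining)
        self._last_request = clock()
```

A crawler built this way would call `allowed(url)` before every request and `wait_turn()` immediately before fetching, skipping any URL that is disallowed.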

Because this information changes infrequently, your crawler may be scheduled to look for new information, say, once per week, but it should ideally do so automatically, as a scheduled task. Every time it runs, it should update the index with the new data.
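One possible way to think about the weekly run is sketched below. The function names and the record layout (a dict with a `"url"` key) are hypothetical. In practice the script could equally be triggered by an operating-system scheduler such as cron or Windows Task Scheduler.

```python
from datetime import datetime, timedelta

CRAWL_INTERVAL = timedelta(days=7)  # "once per week", as suggested in the brief


def crawl_is_due(last_run, now, interval=CRAWL_INTERVAL):
    """Return True if no crawl has run yet or the interval has elapsed."""
    return last_run is None or now - last_run >= interval


def merge_new_records(index, crawled):
    """Add or refresh records keyed by publication URL; return how many were new."""
    added = 0
    for record in crawled:
        if record["url"] not in index:
            added += 1
        index[record["url"]] = record
    return added
```

Keying records by publication URL means that a repeated crawl refreshes existing entries instead of duplicating them.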

Make sure you apply the required pre-processing tasks to both the crawled data and the users’ queries.  
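A minimal sketch of such a pre-processing pipeline follows, applied identically to documents and queries. The stop-word list is deliberately tiny and `crude_stem` is only a stand-in for a proper stemmer (for example the Porter stemmer); both are assumptions made for illustration.

```python
import re

# A deliberately tiny stop-word list for illustration; real lists are longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "for", "with"}

# Crude suffix stripping used as a stand-in for a real stemmer (e.g. Porter).
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenise, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The Pricing of Banking Risks"))  # ['pric', 'bank', 'risk']
print(preprocess("pricing risk"))                  # ['pric', 'risk']
```

Because the same function handles both sides, a query such as "pricing risk" reduces to stems that also occur in the processed title.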

From the user’s point of view, your system has an interface that is similar to the Google Scholar main page, where the user can type in their queries/keywords about the resources they want to find. Then, your system will display the results, sorted by relevance, in a way similar to Google Scholar. However, the search results are restricted to publications by SEFA members only.

NOTE: You must show in your report and viva that your system is accurate by trying various queries. For example, you must use both short and long queries, both with and without stop words, queries with various keywords and more challenging queries to prove the robustness of your system.

Task 2. Document Clustering

Develop a document clustering system. 

First, collect a number of documents that belong to different categories, namely Sport, Health and Politics. Each document should be at least one sentence long (longer is usually better). The total number of documents is up to you but should be at least 100 (more is usually better). You may collect these documents from publicly available websites such as the BBC News website, but make sure you respect their copyright and terms of use and clearly cite them in your work. You may simply copy and paste such texts manually; writing an RSS feed reader/crawler to do it automatically is NOT mandatory.

Once you have collected sufficient documents, cluster them using a standard clustering method (e.g. K-means). 

Finally, use the created model to assign a new document to one of the existing clusters. That is, the user enters a document (e.g. a sentence) and your system outputs the right cluster. 
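The two steps above (clustering the collection, then assigning a new document) can be sketched in plain Python as follows. This is a cosine-similarity variant of K-means on TF-IDF vectors (often called spherical K-means), not necessarily the variant a library would use. The seeding strategy, all function names and the four toy documents are assumptions for illustration, and the token lists are assumed to be already pre-processed.

```python
import math
from collections import Counter


def to_vector(tokens, idf):
    """Unit-length TF-IDF vector; terms unseen during training are dropped."""
    raw = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


def tfidf_vectors(token_lists):
    """Return (vectors, idf table) for a list of token lists."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    return [to_vector(tokens, idf) for tokens in token_lists], idf


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def nearest(vector, centroids):
    """Index of the most similar centroid (ties go to the lowest index)."""
    return max(range(len(centroids)), key=lambda i: cosine(vector, centroids[i]))


def centroid(vectors):
    """Normalised sum of the member vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def kmeans(vectors, k, iterations=20):
    # Deterministic farthest-first seeding: start from the first document,
    # then repeatedly add the document least similar to the chosen seeds.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = centroid(members)
    return labels, centroids


token_lists = [
    ["striker", "goal", "football", "match"],
    ["football", "team", "match", "penalty", "goal"],
    ["parliament", "budget", "election", "debate"],
    ["minister", "election", "parliament", "vote"],
]
vectors, idf = tfidf_vectors(token_lists)
labels, centroids = kmeans(vectors, k=2)
new_doc = to_vector(["goal", "football", "final"], idf)
print(labels, nearest(new_doc, centroids))  # [0, 0, 1, 1] 0
```

A new document is assigned by vectorising it with the IDF table learned from the collection and picking the most similar centroid. A document sharing no terms with the collection yields an empty vector and falls back to cluster 0, which a real system might prefer to report as "unknown".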

NOTE: You must show in your report and viva that your system suggests the right cluster for a variety of inputs, e.g. short and long inputs, those with and without stop words, inputs on different topics, as well as more challenging inputs to show the system is robust enough.

Appendix 1. Items to cover in your report and Video

Part 1 – Search engine

  1. Crawler:

1.1. Number of staff whose publications are crawled (approximately) and the maximum number of publications per staff

1.2. Information collected about each publication (e.g. links, title, year, author or any additional part)

1.3. Which pre-processing tasks are performed before passing data to Indexer/Elastic Search 

1.4. When the crawler operates, e.g. scheduled or run manually

1.5. Brief explanation of how it works

  2. Indexer

2.1. Whether you implemented the index or used Elastic Search (note that if Elastic Search is used you will lose the 15 marks for index construction, but the project becomes easier).

2.2. If you implemented it, which data structure is used (for example, incidence matrix or inverted index)

2.3. If you implemented it, whether it is incremental, i.e. it grows and gets updated over time, or it is constructed from scratch every time your crawler is run

2.4. If you implemented it, show some part of its content (e.g. the constructed dictionary).

2.5. Brief explanation of how it works
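For orientation, a hand-built inverted index that supports incremental updates might look like the sketch below. The class and method names are hypothetical, and the tokens are assumed to be already pre-processed.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_terms = {}  # doc id -> terms, needed to replace a document cleanly

    def add(self, doc_id, tokens):
        """Insert a document, replacing any earlier version with the same id."""
        self.remove(doc_id)
        terms = set(tokens)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Delete a document and drop any posting list that becomes empty."""
        for term in self.doc_terms.pop(doc_id, ()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def lookup(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term, ()))


index = InvertedIndex()
index.add("p1", ["bank", "risk"])
index.add("p2", ["risk", "audit"])
print(index.lookup("risk"))  # ['p1', 'p2']
```

Calling `add` again with an existing id replaces that document's postings, so the index can grow across crawler runs without being rebuilt from scratch.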

  3. Query processor

3.1. Which pre-processing tasks are applied to a given query

3.2. Whether you support only Boolean queries (using AND, OR, NOT, etc.) or accept free keyword queries as Google does (without any need for AND, OR, NOT, etc.)

3.3. If Elastic Search is used, how you convert a user query to an appropriate query for Elastic Search 

3.4. If Elastic Search is NOT used, whether or not you perform ranked retrieval; if yes, specify whether or not you used vector space and the method used to calculate the ranks

3.5. Demonstration of the running system (use screenshots in your report and run your software in your viva). You must run your system on numerous and various input queries to prove the accuracy and robustness of your system. For example, you must use appropriate queries to prove your system performs stop-word removal, stemming and ranked retrieval.

3.6. Brief explanation of how it works
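If Elastic Search is not used, ranked retrieval in the vector space model could be sketched as below: documents and the query become TF-IDF vectors and results are ordered by cosine similarity. The particular weighting formula, the function names and the three toy documents are assumptions for illustration; other TF-IDF variants are equally valid.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Unit-length TF-IDF vector; terms unknown to the collection are ignored."""
    raw = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


def build_model(docs):
    """docs maps doc id -> list of pre-processed tokens."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    vectors = {d: weigh(tokens, idf) for d, tokens in docs.items()}
    return idf, vectors


def search(query_tokens, idf, vectors):
    """Return (doc id, score) pairs with non-zero score, best first."""
    q = weigh(query_tokens, idf)
    scores = {d: sum(w * v.get(t, 0.0) for t, w in q.items()) for d, v in vectors.items()}
    hits = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "p1": ["bank", "risk", "pric"],
    "p2": ["risk", "audit", "tax", "report"],
    "p3": ["tax", "polic"],
}
idf, vectors = build_model(docs)
results = search(["bank", "risk"], idf, vectors)
print([doc_id for doc_id, _ in results])  # ['p1', 'p2']
```

Here `p1` outranks `p2` because it matches both query terms, while `p3` matches neither and is left out of the results.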

  4. (Optional)

Any other important point you may want to mention, including any restrictions, extras or issues

Part 2 – Document clustering

  1. How and how many input documents are collected
  2. Which document clustering method (e.g. K-means with appropriate K value) has been used and how its performance is measured
  3. Which type of clustering is used (hierarchical/flat and hard/soft)
  4. Screenshot and demonstration of its accuracy and robustness for numerous and various inputs
  5. Brief explanation of how it works
  6. (Optional) any other important point you may want to mention
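On measuring performance: since the collected documents come with known categories, one simple option is cluster purity, the fraction of documents that belong to the majority category of their cluster. The sketch below is one possible measure among several, and the function name and sample labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, true_categories):
    """Fraction of documents that fall in the majority category of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, category in zip(cluster_labels, true_categories):
        by_cluster[cluster][category] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_labels)


clusters = [0, 0, 0, 1, 1, 1]
categories = ["Sport", "Sport", "Politics", "Politics", "Politics", "Sport"]
print(purity(clusters, categories))  # 4 of 6 documents match their cluster's majority
```

A purity of 1.0 means every cluster contains a single category. Purity also rises as the number of clusters grows, so it is best compared across runs that use the same K.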
  • Uploaded By : Katthy Wills
