Develop a vertical search engine and document clustering system

Country :
Australia

Task 1. Search Engine

Develop a vertical search engine similar to Google Scholar, but specialised to retrieve only papers/books published by a member of the School of Economics, Finance and Accounting (SEFA) at Coventry University:

https://pureportal.coventry.ac.uk/en/organisations/school-of-economics-finance-and-accounting

(That is, at least one of the co-authors is a member of SEFA.)

Your system crawls the relevant web pages and retrieves information about all available publications. For each publication, it extracts the available data (such as authors, publication year, and title) and the links to both the publication page and the authors' profile pages (also called pureportal profiles).

Make sure that your crawler is polite, i.e. it obeys the robots.txt rules and does not hit the servers unnecessarily or too fast.
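As a rough illustration of what politeness can mean in code, the sketch below checks URLs against robots.txt rules and enforces a pause between requests, using Python's standard urllib.robotparser module. The crawler name sefa-search-bot, the class PoliteGate and the 2-second default delay are assumptions made for this example, not part of the task.

```python
import time
from urllib import robotparser

USER_AGENT = "sefa-search-bot"  # hypothetical crawler name


def build_robot_rules(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse the text of a robots.txt file into a rule checker."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PoliteGate:
    """Decides whether and when a URL may be requested."""

    def __init__(self, rules, default_delay=2.0):
        self.rules = rules
        # Honour Crawl-delay if the site declares one, else use our own default.
        declared = rules.crawl_delay(USER_AGENT)
        self.delay = float(declared) if declared is not None else default_delay
        self.last_request = None  # monotonic timestamp of the previous request

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this crawler to fetch the URL."""
        return self.rules.can_fetch(USER_AGENT, url)

    def wait_turn(self) -> float:
        """Sleep until `delay` seconds have passed since the last request.

        Returns the number of seconds actually waited.
        """
        pause = 0.0
        if self.last_request is not None:
            elapsed = time.monotonic() - self.last_request
            pause = max(0.0, self.delay - elapsed)
        if pause:
            time.sleep(pause)
        self.last_request = time.monotonic()
        return pause
```

In a real crawl the rules would be downloaded once with set_url() and read() on the RobotFileParser, and the crawler would call allowed() and then wait_turn() before every page request.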

Because this information changes infrequently, your crawler may be scheduled to look for new information, say, once per week. Ideally it should do so automatically, as a scheduled task. Every time it runs, it should update the index with the new data.
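One possible way to handle the update step is to keep the crawled records in a store keyed by publication URL and merge each new crawl into it. The sketch below is illustrative only: the function name merge_crawl and the record fields are assumptions, and the weekly trigger itself would come from an external scheduler such as cron or Windows Task Scheduler.

```python
def merge_crawl(store: dict, crawled: list) -> dict:
    """Merge freshly crawled publication records into `store`, keyed by URL.

    Returns counts of records that were added, updated or left unchanged.
    """
    counts = {"added": 0, "updated": 0, "unchanged": 0}
    for record in crawled:
        key = record["url"]  # assumed to be unique per publication
        if key not in store:
            counts["added"] += 1
        elif store[key] != record:
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
            continue  # nothing to write for an identical record
        store[key] = record
    return counts
```

Only the added and updated records then need to be passed on to the indexer, which keeps each scheduled run cheap.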

Make sure you apply the required pre-processing tasks to both the crawled data and the users' queries.
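A minimal sketch of such a pre-processing pipeline is given below, assuming lower-casing, tokenisation, stop-word removal and stemming are the tasks chosen. The stop-word list is deliberately tiny, and naive_stem is a crude suffix stripper standing in for a proper stemmer such as Porter's; both are assumptions for illustration. The important point is that the same function is applied to documents and to queries.

```python
import re

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "with"}

# Crude suffix stripping, standing in for a proper stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list:
    """Lower-case, tokenise, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The Pricing of Risks in Banking Markets"))
# → ['pric', 'risk', 'bank', 'market']
```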

From the user's point of view, your system has an interface similar to the Google Scholar main page, where the user can type in queries/keywords about the resources they want to find. Your system then displays the results, sorted by relevance, in a similar way to Google Scholar. However, the search results are restricted to publications by SEFA members only.

NOTE: You must show in your report and viva that your system is accurate by trying various queries. For example, you must use both short and long queries, queries with and without stop words, queries with various keywords, and more challenging queries to prove the robustness of your system.

Task 2. Document Clustering

Develop a document clustering system.

First, collect a number of documents that belong to different categories, namely Sport, Health and Politics. Each document should be at least one sentence long (longer is usually better). The total number of documents is up to you but should be at least 100 (more is usually better). You may collect these documents from publicly available websites such as the BBC News website, but make sure you respect their copyright and terms of use and clearly cite them in your work. You may simply copy and paste such texts manually; writing an RSS feed reader/crawler to do it automatically is NOT mandatory.

Once you have collected sufficient documents, cluster them using a standard clustering method (e.g. K-means).

Finally, use the created model to assign a new document to one of the existing clusters. That is, the user enters a document (e.g. a sentence) and your system outputs the right cluster.
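The two steps above (clustering, then assigning a new document) might be sketched as follows. This is not a reference solution: it uses a spherical variant of K-means (cosine similarity on unit-length bag-of-words vectors) written in plain Python, with a deterministic farthest-first seeding so that the toy run is reproducible. All function names and the six sample sentences are made up for the example, and a real submission would use at least 100 documents and pre-processed text.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Bag-of-words vector with unit length, stored as {term: weight}."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both are unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum the member vectors and rescale the result to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def pick_seeds(vectors, k):
    """Farthest-first seeding: start at document 0, then keep adding the
    document least similar to the seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds)))
    return [vectors[i] for i in seeds]


def kmeans(vectors, k, max_iter=20):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = pick_seeds(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_vector(members)
    return labels, centroids


def assign(text, centroids):
    """Return the index of the centroid most similar to a new document."""
    v = vectorise(text)
    return max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))


docs = [
    "football match goal team",
    "team wins football league",
    "hospital doctor vaccine patients",
    "doctor treats patients vaccine",
    "election parliament vote minister",
    "minister calls election vote",
]
labels, centroids = kmeans([vectorise(d) for d in docs], k=3)
print(labels)  # → [0, 0, 1, 1, 2, 2]
print(assign("The doctor gave the vaccine to patients", centroids))  # → 1
```

Note that cluster numbers are arbitrary; mapping them to the names Sport, Health and Politics requires inspecting which documents each cluster contains.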

NOTE: You must show in your report and viva that your system suggests the right cluster for a variety of inputs, e.g. short and long inputs, inputs with and without stop words, inputs on different topics, and more challenging inputs, to show that the system is robust enough.

Appendix 1. Items to cover in your report and Video

Part 1 Search engine

Crawler:

1.1. Number of staff whose publications are crawled (approximately) and the maximum number of publications per staff member

1.2. Information collected about each publication (e.g. links, title, year, author or any additional part)

1.3. Which pre-processing tasks are performed before passing data to Indexer/Elastic Search

1.4. When the crawler operates, e.g. scheduled or run manually

1.5. Brief explanation of how it works

Indexer

2.1. Whether you implemented the index or used Elastic Search (note that if Elastic Search is used you will lose the 15 marks for index construction, but the project becomes easier).

2.2. If you implemented it, which data structure is used (for example, incidence matrix or inverted index)

2.3. If you implemented it, whether it is incremental, i.e. it grows and gets updated over the time, or it is constructed from scratch every time your crawler is run

2.4. If you implemented it, show some part of its content (e.g. the constructed dictionary).

2.5. Brief explanation of how it works
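For items 2.2 to 2.4, a hand-built index could look something like the sketch below: an inverted index that maps each term to the documents containing it, supports incremental updates by replacing a document's old postings, and can print its dictionary. The class name InvertedIndex and its methods are assumptions for this illustration, and the terms are assumed to be already pre-processed.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_terms = {}  # doc_id -> terms, so a document can be replaced

    def add(self, doc_id, terms):
        """Add a document, replacing any earlier version of it."""
        self.remove(doc_id)
        terms = list(terms)
        self.doc_terms[doc_id] = terms
        for term in terms:
            bucket = self.postings[term]
            bucket[doc_id] = bucket.get(doc_id, 0) + 1

    def remove(self, doc_id):
        """Delete a document's postings; a no-op if it was never indexed."""
        for term in set(self.doc_terms.pop(doc_id, [])):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def dictionary(self):
        """Sorted vocabulary with document frequencies."""
        return [(term, len(docs)) for term, docs in sorted(self.postings.items())]


index = InvertedIndex()
index.add("p1", ["bank", "risk", "bank"])
index.add("p2", ["risk", "market"])
print(index.dictionary())  # → [('bank', 1), ('market', 1), ('risk', 2)]
```

Because add() first removes any earlier postings for the same document, re-running the crawler updates the index in place instead of rebuilding it from scratch.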

Query processor

3.1. Which pre-processing tasks are applied to a given query

3.2. Whether you only support Boolean queries (using AND, OR, NOT, etc.) or accept keywords like Google does (without any need for AND, OR, NOT, etc.)

3.3. If Elastic Search is used, how you convert a user query to an appropriate query for Elastic Search

3.4. If Elastic Search is NOT used, whether or not you perform ranked retrieval; if yes, specify whether or not you used the vector space model and the method used to calculate the ranks

3.5. Demonstration of the running system (use screenshots in your report and run your software in your viva). You must run your system on numerous and various input queries to prove the accuracy and robustness of your system. For example, you must use appropriate queries to prove that your system performs stop-word removal, stemming and ranked retrieval.

3.6. Brief explanation of how it works
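As an illustration of item 3.4, ranked retrieval in the vector space model can be sketched with TF-IDF weights and cosine similarity, as below. The weighting shown (logarithmic term frequency, and log(N/df) + 1 for inverse document frequency) is only one of several common choices, the function names are hypothetical, and both documents and query are assumed to be already pre-processed into term lists.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {doc_id: [term, ...]} -> ({doc_id: {term: weight}}, idf)."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        vectors[doc_id] = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def norm(vec):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def rank(query_terms, vectors, idf):
    """Return [(doc_id, score)] sorted by cosine similarity, best first."""
    q_tf = Counter(t for t in query_terms if t in idf)  # ignore unseen terms
    q_vec = {t: (1 + math.log(c)) * idf[t] for t, c in q_tf.items()}
    q_norm = norm(q_vec)
    results = []
    for doc_id, d_vec in vectors.items():
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        if dot > 0:  # only documents sharing at least one query term
            results.append((doc_id, dot / (q_norm * norm(d_vec))))
    return sorted(results, key=lambda item: item[1], reverse=True)


docs = {
    "p1": ["bank", "risk", "model"],
    "p2": ["bank", "market"],
    "p3": ["labour", "market", "wage"],
}
vectors, idf = tfidf_vectors(docs)
print([doc_id for doc_id, _ in rank(["bank", "risk"], vectors, idf)])
# → ['p1', 'p2']
```

Here p1 outranks p2 because it matches both query terms, including the rarer term "risk", while p3 shares no term with the query and is not returned.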

(Optional)

Any other important point you may want to mention, including any restriction, extras, issues

Part 2 Document clustering

How and how many input documents are collected
Which document clustering method (e.g. K-means with appropriate K value) has been used and how its performance is measured
Which type of clustering is used (hierarchical/flat and hard/soft)
Screenshot and demonstration of its accuracy and robustness for numerous and various inputs
Brief explanation of how it works
(Optional) any other important point you may want to mention
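One simple way to measure clustering performance, given that the true category of each collected document is known, is purity: the fraction of documents that land in a cluster whose majority category matches their own. The sketch below is one option among several (others include the Rand index and the silhouette score), and the function name is an assumption.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of documents that fall in their cluster's majority category."""
    by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


clusters = [0, 0, 1, 1, 2, 2]
truth = ["sport", "sport", "health", "health", "politics", "health"]
print(purity(clusters, truth))  # 5 of the 6 documents match their cluster's majority
```

Purity always increases as the number of clusters grows, so it is best reported for a fixed K (here, the three known categories).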

Uploaded By : Katthy Wills
Posted on : January 16th, 2023