March 2024
This is an individual piece of work that accounts for 70% of your overall module marks. The total achievable points for this assignment is 100, which will be converted to 70%.
To complete your assignment, you should submit the following wrapped in a zipped file and upload it to ELE:
A well-formatted report with the requested explanations and visualizations saved as a PDF document. An interpretation must accompany each visualization.
The Python scripts for all the assignment exercises and the visualizations provided in the PDF report.
These scripts will be run and your points will depend on the completeness and accuracy of the results produced by your code. Hard-coding in any of the exercises will lead to marks not being awarded.
Variable names should be intuitive. A modular structure in the Python scripts, with user-defined functions introduced as and when required, is encouraged. Ensure that your code is well commented and organized for readability and reproducibility.
The minimum word count of your report is 2000 words. This count excludes the title page, table of contents, tables, figure captions, code and bibliography.
1 Data analysis using random data
Generate a synthetic dataset that simulates monthly retail sales data for the period from January 2020 to April 2024. This dataset will be used for analysis and forecasting purposes. Simulate three main components of the data:
A trend component that represents an increase in sales, expressed as a polynomial function at^2 + bt + c, where t represents the time units and a, b, c are constants. You can choose the constants as you like, provided that the function is increasing.
A seasonal component capturing quarterly recurring patterns in sales over each year.
Random fluctuations representing noise or unexplained variance in the data.
Then, combine these components to create the retail sales data.
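As a starting point, the combination of the three components can be sketched as follows. This is only one possible reading of the task: the function name make_synthetic_sales, the constants a, b, c, the per-quarter offsets and the noise level are all illustrative assumptions, and you are expected to choose and justify your own.

```python
import numpy as np
import pandas as pd

def make_synthetic_sales(start="2020-01-01", end="2024-04-01",
                         a=0.5, b=10.0, c=1000.0,
                         seasonal_amplitude=50.0, noise_std=20.0, seed=42):
    """Return a DataFrame of monthly sales = trend + seasonal + noise."""
    months = pd.date_range(start=start, end=end, freq="MS")  # month-start dates
    t = np.arange(len(months))                                # time index 0, 1, 2, ...

    # Quadratic trend a*t^2 + b*t + c; increasing for t >= 0 because a > 0 and b > 0.
    trend = a * t**2 + b * t + c

    # Seasonal part (assumed form): one fixed offset per quarter, repeated every year.
    quarter_effects = np.array([-1.0, 0.0, 0.5, 1.5]) * seasonal_amplitude
    seasonal = quarter_effects[months.quarter.to_numpy() - 1]

    # Noise: zero-mean Gaussian draws from a seeded generator for reproducibility.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=len(months))

    return pd.DataFrame({
        "date": months,
        "trend": trend,
        "seasonal": seasonal,
        "noise": noise,
        "sales": trend + seasonal + noise,
    })

sales_df = make_synthetic_sales()
print(sales_df.head(10))
```

Keeping the individual components as columns makes it easier to check later that a fitted model recovers something close to the trend and seasonal pattern you put in.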
(a) Store the generated synthetic dataset in a pandas DataFrame and display the first 10 rows of data. (12 points)
(b) Perform visualizations to display the following: (8 points)
the change of sales over time.
the data points corresponding to the ten lowest values of sales over time.
the data points corresponding to the ten highest values of sales over time.
a comparison of the highest sales in any quarter (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec) per year over 2020-2023.
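Before plotting, it can help to isolate the rows each plot needs. The sketch below assumes a DataFrame with columns named date and sales (hypothetical names), and it reads "highest sales in any quarter" as the largest monthly value within each quarter of each year; other readings are possible.

```python
import pandas as pd

def summarise_sales(frame):
    """Pick out the rows needed for the plots: extremes and per-year quarterly peaks."""
    lowest_ten = frame.nsmallest(10, "sales")    # ten lowest monthly sales
    highest_ten = frame.nlargest(10, "sales")    # ten highest monthly sales

    # Restrict to full calendar years 2020-2023 before comparing quarters.
    in_range = frame[frame["date"].dt.year.between(2020, 2023)]
    quarterly_peak = (
        in_range.groupby([in_range["date"].dt.year.rename("year"),
                          in_range["date"].dt.quarter.rename("quarter")])["sales"]
        .max()
        .unstack("quarter")                      # rows: years, columns: quarters 1-4
    )
    return lowest_ten, highest_ten, quarterly_peak

# Small deterministic demo series (not real data).
dates = pd.date_range("2020-01-01", "2024-04-01", freq="MS")
demo = pd.DataFrame({"date": dates, "sales": [100 + 3 * i for i in range(len(dates))]})
lowest_ten, highest_ten, quarterly_peak = summarise_sales(demo)
```

The two extreme-value tables can be overlaid as markers on the sales line plot, and the year-by-quarter table lends itself to a grouped bar chart.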
(c) Fit an ETS model from statsmodels and provide a forecast of the next two quarters in 2024. Explain the algorithmic working of the ETS model in producing a forecast. Provide visualizations and interpret the quality of the results in your own words in the report. (10 points)
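To build intuition for the algorithmic explanation, the recursion behind an additive-trend exponential smoothing model (Holt's linear method) can be written out by hand. The sketch below is a simplification and not the statsmodels implementation: the function name and the fixed smoothing weights alpha and beta are assumptions, there is no seasonal term, and a fitted ETS model would estimate its parameters from the data rather than take them as given.

```python
def holt_additive_forecast(series, steps, alpha=0.5, beta=0.3):
    """Forecast with additive-trend exponential smoothing (Holt's linear method)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")

    # Simple initial state: first value as level, first difference as trend.
    level = series[0]
    trend = series[1] - series[0]

    for observed in series[1:]:
        previous_level = level
        # New level: blend the observation with the one-step-ahead prediction.
        level = alpha * observed + (1 - alpha) * (previous_level + trend)
        # New trend: blend the latest change in level with the old trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend

    # h-step-ahead forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, steps + 1)]

# With monthly data, two quarters ahead corresponds to six steps.
monthly_sales = [1000, 1012, 1025, 1041, 1050, 1068, 1079, 1095]
next_two_quarters = holt_additive_forecast(monthly_sales, steps=6)
```

Each update is a weighted average of new evidence and the previous estimate, which is why older observations receive exponentially decreasing weight in the forecast.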
2 Data analysis using actual stock data
Import the yfinance Python library for this assignment. Make sure to install it before you import this library. This library will provide a convenient way to fetch historical market data, including stock prices, from Yahoo Finance. If you need information regarding this library, please refer to the documentation, https://pypi.org/project/yfinance/. An example to retrieve stock data for the ticker AAPL:
import yfinance as yf
data = yf.download("AAPL", start="2021-01-01", end="2022-12-31")
2.1 Analysis of the pre-COVID era (before March 2020)
For any top three tech and healthcare companies of your choice, do the following:
(a) Write a Python function stock_retrieval() to retrieve the historical stock data for two years before March 2020 for all of your chosen companies from Yahoo Finance. Decide which arguments could be the best fit for your function. Use a dictionary to store the retrieved stock data where the date is the key and a nested dictionary contains the stock's attributes (e.g., Open, High, Low, Close, Volume, Adj Close) as the value. (10 points)
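The requested storage layout can be illustrated on a small hand-made table. The helper below is a hypothetical sketch of the conversion step only: it assumes a date-indexed DataFrame with one flat column per attribute, and the column layout actually returned by yf.download may differ depending on the library version and the number of tickers, so check it before reusing this idea.

```python
import pandas as pd

def frame_to_nested_dict(price_frame):
    """Convert a date-indexed price table into {date string: {attribute: value}}."""
    nested = {}
    for timestamp, row in price_frame.iterrows():
        # Use an ISO date string as the key and the row's attributes as the value.
        nested[timestamp.strftime("%Y-%m-%d")] = row.to_dict()
    return nested

# Hand-made example values (not real market data).
example_frame = pd.DataFrame(
    {"Open": [10.0, 11.0], "Close": [10.5, 10.0], "Volume": [1000.0, 1200.0]},
    index=pd.to_datetime(["2019-03-01", "2019-03-04"]),
)
example_dict = frame_to_nested_dict(example_frame)
```

For several companies, one such dictionary per ticker can be held in an outer dictionary keyed by the ticker symbol.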
(b) Using the retrieved stock data, perform statistical analyses like the average, maximum, and minimum opening and closing prices in that period in another Python function stock_stats(). In this function, compare the opening and closing prices for each trading day and identify any significant differences. You can iterate over each trading day, calculate the absolute difference between the opening and closing prices, and compare it to a predefined threshold (e.g., 5% of the closing price). If the difference exceeds the threshold, you can add the date and the price difference to a list of dates with significant differences. (10 points)
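The threshold comparison described above can be sketched on the nested-dictionary layout as follows. The function name find_significant_moves, the parameter threshold_fraction and the sample prices are illustrative assumptions; your stock_stats() function would also need to compute the average, maximum and minimum prices.

```python
def find_significant_moves(stock_data, threshold_fraction=0.05):
    """Return (date, difference) pairs where |Close - Open| exceeds the threshold."""
    significant = []
    for date in sorted(stock_data):              # ISO date strings sort chronologically
        day = stock_data[date]
        difference = abs(day["Close"] - day["Open"])
        # Threshold is a fraction of that day's closing price (5% by default).
        if difference > threshold_fraction * day["Close"]:
            significant.append((date, difference))
    return significant

# Made-up prices for illustration only.
sample = {
    "2019-03-01": {"Open": 100.0, "Close": 101.0},   # 1.0 vs threshold 5.05: ignored
    "2019-03-04": {"Open": 100.0, "Close": 110.0},   # 10.0 vs threshold 5.5: flagged
    "2019-03-05": {"Open": 120.0, "Close": 100.0},   # 20.0 vs threshold 5.0: flagged
}
moves = find_significant_moves(sample)
```

Passing the threshold as an argument rather than fixing it inside the function keeps the analysis free of hard-coded values.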
(c) Provide suitable and intuitive visualizations (e.g., line plots, histograms) to display the change in the statistical parameters over time for all the companies. Compare the trends between the healthcare and tech companies during this period using plots. (15 points)
2.2 Analysis of the COVID and post-COVID era (after March 2020)
(a) Perform regression analyses using a rolling window of 4 months to identify any significant trend changes during COVID (the year following March 2020) and post-COVID (the period beyond March 2021 until the present). Please justify with explanations and visualizations the differences and similarities between tech and healthcare companies. Explain the rationale behind the algorithmic structure of your analysis. The analysis should be based on the top three companies in each category chosen by you for the previous analyses. (15 points)
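One way to picture a rolling-window regression is to fit a straight line to each window of prices and track how its slope changes. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, prices are regressed against an evenly spaced time index, and the window is given as a number of observations, so a 4-month window on daily data would have to be translated into trading days by you.

```python
def ols_slope(values):
    """Least-squares slope of values regressed on the index 0, 1, ..., n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance

def rolling_slopes(values, window):
    """Slope of a separate linear fit for every full window of consecutive values."""
    if window < 2 or window > len(values):
        raise ValueError("window must be between 2 and len(values)")
    return [ols_slope(values[start:start + window])
            for start in range(len(values) - window + 1)]

# Made-up prices that rise and then fall: the slope changes sign along the way.
prices = [1, 2, 3, 4, 3, 2, 1]
slopes = rolling_slopes(prices, window=4)
```

A change of sign or a sharp change in magnitude of the slope between neighbouring windows is one possible indicator of a trend change, which you would still need to justify statistically.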
(b) Consider relevant features that may influence stock prices, such as market indices, trading volume, volatility, etc., from the dataset and apply PCA to reduce the dimensionality of the feature space and extract principal components capturing the variability in the data. Remember to standardize the features to ensure they have comparable scales. Provide explanations and visualizations to explain your findings. (10 points)
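The standardize-then-project sequence can be sketched with NumPy alone. This is an illustration under assumptions rather than a prescribed method: the function names are hypothetical, the demo features are random numbers rather than stock features, and a library implementation of PCA would be an equally valid choice.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit variance (columns must not be constant)."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

def pca(features, n_components):
    """Return component scores, component directions and explained-variance ratios."""
    z = standardize(np.asarray(features, dtype=float))
    # For centred data, the right singular vectors are the principal directions.
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    components = vt[:n_components]
    scores = z @ components.T                    # data projected onto the components
    return scores, components, variance_ratio[:n_components]

# Random demo data: the second column is nearly a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo_features = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])
scores, components, ratio = pca(demo_features, n_components=2)
```

The explained-variance ratios are the natural input for a scree plot, and the component directions show which original features contribute most to each component.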
2.3 Any foresights?
(a) Forecast the stock data trends as if COVID had not occurred between March 2020 and March 2021. Explain with visualizations the differences between the actual data and the forecasted data for each month during that period. (10 points)
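The comparison between a counterfactual forecast and the actual data can be organised as below. The sketch uses a plain polynomial trend fitted to the pre-COVID values as a stand-in forecasting model, which is an assumption made for brevity; the function name and the numbers are hypothetical, and a more suitable forecasting model may well be expected in your report.

```python
import numpy as np

def counterfactual_gap(pre_period, actual_period, degree=1):
    """Extrapolate a trend fitted on pre_period and compare it with actual_period."""
    pre = np.asarray(pre_period, dtype=float)
    actual = np.asarray(actual_period, dtype=float)

    # Fit the trend only on data observed before the event.
    t_pre = np.arange(len(pre))
    coefficients = np.polyfit(t_pre, pre, degree)

    # Project the fitted trend over the months covered by the actual data.
    t_future = np.arange(len(pre), len(pre) + len(actual))
    forecast = np.polyval(coefficients, t_future)

    # Positive gap: actual above the counterfactual; negative gap: below it.
    return forecast, actual - forecast

# Made-up monthly averages: 24 months of steady growth, then 12 flat months.
pre_covid = [100 + 2 * month for month in range(24)]
during_covid = [120] * 12
forecast, gap = counterfactual_gap(pre_covid, during_covid)
```

Plotting the forecast and the actual series on the same axes, with the monthly gap as a bar chart underneath, gives a direct visual answer to the question.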