Table of Contents
Abstract
Polynomial Regression
    Input
    Output
    Parameters
Linear Regression
Data Analysis and Linear Regression Analysis
Task 1: EDA process
Conclusion
Abstract
The process of preparing data is precisely what it sounds like. The act of putting data into a usable form for analysis is referred to as "data wrangling," and it encompasses everything involved in that procedure. It is a critical step in machine learning. Data preparation has always been laborious, and most data analysts and scientists agree that it is the least appealing part of the work. According to Forbes, up to 76 percent of data scientists consider data preparation the most challenging aspect of their profession. They put up with it because otherwise all the effort spent developing and collecting the data would yield findings that are misleading and riddled with gaps. There would be no need for data preparation if data could be analyzed as soon as it arrived, in exactly the format in which it was received, but this is seldom the case. Gartner describes more than eighty percent of the data available in the world today as unstructured, meaning it has not been cleaned, labeled, or organized in an orderly manner. There is no such thing as a "nice" data set; the vast majority will not be completely filled with proper and pertinent information in all of the appropriate fields. Before meaningful insights can be drawn from the data, it often takes a considerable amount of time, weeks or even months, to convert it into a usable format.
The user should not be burdened with data that contains invalid, out-of-range, or missing values, since this will lead to unsatisfactory outcomes; every effort should be made to prevent this from happening. There is no substitute for high-quality data, as it is the only thing that can ensure high-quality insights.
Polynomial Regression
Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-order polynomial.
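To make the definition concrete, the following is a minimal sketch of fitting a second-order polynomial by least squares in Python; NumPy, the synthetic data, and the chosen degree are assumptions for illustration, not the operator's own implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    # Noisy quadratic: y = 2x^2 - x + 1 plus Gaussian noise.
    y = 2.0 * x**2 - x + 1.0 + rng.normal(scale=1.0, size=x.shape)

    # Fit an nth-order polynomial (here n = 2) by least squares.
    coeffs = np.polyfit(x, y, deg=2)   # highest-order coefficient first
    y_hat = np.polyval(coeffs, x)      # evaluate the fitted polynomial
    print("fitted coefficients:", coeffs)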
Input
training set (Data Table)
This input port expects an Example Set. This operator cannot handle nominal attributes; it can only be applied to data sets with numeric attributes. Thus you may often have to use the Nominal to Numerical operator before applying this operator, as sketched below.
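As an illustration of that preprocessing step, here is a minimal sketch of one-hot encoding a nominal attribute, analogous to what the Nominal to Numerical operator does; pandas and the toy table are assumptions, not RapidMiner itself.

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "red"],
                       "price": [3.0, 4.5, 2.5]})

    # One-hot (dummy) encode the nominal attribute so that a
    # numeric-only regression learner can consume the table.
    df_numeric = pd.get_dummies(df, columns=["color"])
    print(df_numeric)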
Output
model (Linear Regression Model)
The regression model is delivered from this output port. This model can then be applied to unseen data sets.
example set (Data Table)
The Example Set that was given as input is passed through this port unchanged. This is usually used to reuse the same Example Set in further operators or to view it in the Results Workspace.
weights (Attribute Weights)
The attribute weights are delivered through this port.
Parameters
feature_selection: An expert parameter that specifies the feature selection technique to be used during the regression. The available choices are: none, M5 prime, greedy, T-Test, and iterative T-Test. Range: selection
alpha: This parameter is only available when the feature_selection parameter is set to 'T-Test'. It specifies the value of alpha to be used in the T-Test feature selection. Range: real
max_iterations: This parameter is only available when the feature_selection parameter is set to 'iterative T-Test'. It specifies the maximum number of iterations of the iterative T-Test to be used for feature selection. Range: integer
forward_alpha: This parameter is only available when the feature_selection parameter is set to 'iterative T-Test'. It specifies the value of forward alpha to be used in the T-Test feature selection. Range: real
backward_alpha: This parameter is only available when the feature_selection parameter is set to 'iterative T-Test'. It specifies the value of backward alpha to be used in the T-Test feature selection. Range: real
eliminate_colinear_features: This parameter indicates whether the algorithm should try to eliminate collinear features during the regression. Range: boolean
min_tolerance: This parameter is only available when the eliminate_colinear_features parameter is set to true. It specifies the minimum tolerance required for the elimination of collinear features. Range: real
use_bias: This parameter indicates whether an intercept value should be calculated. Range: boolean
ridge: This parameter specifies the ridge parameter to be used in ridge regression; a minimal sketch follows this parameter list. Range: real
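To show how the ridge and use_bias parameters map onto a familiar formulation, here is a minimal sketch of ridge regression in Python; scikit-learn and the toy data are assumptions, and the alpha and fit_intercept names belong to scikit-learn, not to this operator.

    import numpy as np
    from sklearn.linear_model import Ridge

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.1, 3.9, 6.2, 7.8])

    # alpha plays the role of the ridge parameter; fit_intercept of use_bias.
    model = Ridge(alpha=1.0, fit_intercept=True).fit(X, y)
    print(model.coef_, model.intercept_)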
Linear Regression
Regression is a statistical approach that may be used for numerical prediction. It attempts to determine the strength of the relationship between one dependent variable (the label attribute) and a series of other changing variables known as independent variables (regular attributes), arrived at by comparing the values of the dependent variable with those of the regular attributes. Classification predicts categorical labels, while regression predicts continuous values; data may be analyzed using either approach. For example, we may wish to estimate the number of potential purchasers for a new product based on its price, or to predict the average income of college graduates with five years of work experience. Regression analysis is often used to discover the extent to which particular variables, such as the price of a commodity, interest rates, or particular firms or sectors, influence the price movement of an asset, which can be a difficult undertaking. The objective of linear regression is to find the linear equation that best fits the observed data, in order to model the relationship between a scalar variable and one or more explanatory factors. For example, a linear regression model can be used to relate people's weights to their heights. This operator builds a linear regression model, using the Akaike criterion to select the most suitable model. The Akaike information criterion is a relative measure of how well a statistical model fits the available data. It is based on the concept of information entropy, first established by Claude Shannon, and in practice it offers a relative assessment of the information lost when a given model is used to represent reality. It can be said to describe the tradeoff between bias and variance in model construction, or, put more broadly, between the accuracy of the model and its complexity.
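Since the Akaike criterion drives the model selection described above, here is a minimal sketch of comparing two candidate linear models by AIC in Python; statsmodels and the synthetic data are assumptions, not this operator's internals.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 100)
    y = 1.5 * x + 2.0 + rng.normal(scale=1.0, size=x.shape)

    # Candidate 1: y ~ x.  Candidate 2: y ~ x + x^2.
    m1 = sm.OLS(y, sm.add_constant(x)).fit()
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

    # The model with the lower AIC wins the fit-versus-complexity tradeoff.
    print("AIC linear:", m1.aic, "AIC quadratic:", m2.aic)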
Before data can be used in data mining, machine learning, or any of the other subfields of data science, it must be converted into a format that can be handled quickly and correctly. This stage must be completed in its entirety before the data can be used in any form, and data preparation is the factor that contributes most substantially to the transformation. In the early stages of the machine learning and artificial intelligence development pipeline, it is common practice to apply these methodologies throughout all stages, deliberately, so that the reliability of the findings is not compromised.
Data Analysis and Linear Regression Analysis
Linear regression analysis can anticipate how one variable affects another. The "dependent variable" is the variable you wish to estimate; the "independent variable" is the one used to predict it. The method uses a linear equation in one or more independent variables to determine which factors best explain changes in the dependent variable, and the coefficients of that linear equation are estimated from the data. Linear regression fits a straight line or surface that minimizes the gap between the predicted and observed outputs. Simple linear regression calculators employ "least squares" to identify the line that best matches the paired data. Given X's value, we can then estimate Y's value.
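As a concrete illustration of the least-squares method just described, here is a minimal sketch computing the line of best fit from the closed-form formulas; the paired data values are assumptions.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

    # slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # intercept = mean(y) - slope * mean(x)
    intercept = y.mean() - slope * x.mean()

    # Given a new X value, estimate Y from the fitted line.
    print("estimate at x = 6:", intercept + slope * 6.0)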
Task 1: EDA process
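The task heading above does not spell out the steps, so the following is a minimal first-pass EDA sketch in Python; pandas and the file name data.csv are hypothetical assumptions about the task's data set.

    import pandas as pd

    df = pd.read_csv("data.csv")       # hypothetical file name

    print(df.shape)                    # rows and columns
    print(df.dtypes)                   # attribute types
    print(df.isna().sum())             # missing values per attribute
    print(df.describe())               # summary statistics
    print(df.corr(numeric_only=True))  # pairwise correlations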
Conclusion
Regression is a method for making accurate numerical predictions. It is a statistical measure that attempts to identify the strength of the relationship between one dependent variable (the label attribute) and a sequence of other changing variables known as independent variables (regular attributes). Like classification, regression may be used for prediction, but where classification predicts categorical labels, regression predicts continuous values. For instance, we could want to estimate the average annual income of people who have a bachelor's degree and five years of work experience, or the number of prospective customers for a new product based on its pricing. It is common practice to use regression to ascertain the degree to which particular variables, such as the price of a commodity, interest rates, or particular industries or sectors, impact the price movement of an asset.