BIG DATA ANALYSIS
Table of Contents
Introduction
Summary of the objectives
Business case scenario
Analytical design of a dimension data model
Implementation and testing of data warehouse
Implementation and testing of big data storage on HDFS
Reflective conclusion
Reference List
Introduction
Big data analysis refers to the examination of data sets that are too large, too fast, or too varied for traditional data processing software to manage. This report defines the objectives of the data analysis, presents a business case scenario, describes the analytical design of a dimensional data model, and covers the implementation of the data warehouse and the testing of big data storage. A discussion of the current state of big data analysis is also provided. Generally, big data consists of data that is greater in variety, arrives in increasing volumes, and moves with high velocity. These data sets are large and complex, and they frequently come from new data sources. They are voluminous to the point that traditional data processing software cannot manage them, yet this vast volume of data can be used to address business problems. The development of open-source frameworks such as Hadoop has been essential for the growth and implementation of big data analysis. Hadoop is used in this report because it makes big data easier to work with and cheaper to store. Big data analysis helps to address a range of business activities, from customer experience to analytics.
Summary of the objectives
In many ways, the information technologies behind big data analysis bring dramatic change, whether through cost reduction or through new products and service offerings. Like traditional analytics, it can support business decision making (Ali et al. 2022). Using big data analysis, a variety of objectives can be achieved, but it is necessary to focus on the goal of the report in the first place. Big data analysis has created substantial opportunities to develop products aligned with customer demands. Firms can develop products that are directly connected to consumers, which increases consumer value and decreases the risks associated with launching a new product. Such opportunities can also be identified through data mining. Where cost reduction is the primary aim, big data analysis makes the organization conscious of how its computer systems crunch the data.
Figure 1: Sample database connector
(Source: Self-Created)
Terabyte-scale storage for structured data is cheap when delivered through big data technologies such as the Hadoop framework, which is specially built for handling large amounts of data, even when hardware, software, and other expenses are included (Azeroual and Fabre 2021). This cost is lower than that of other types of data management systems, although such comparisons are not entirely fair once the effort to implement a Hadoop cluster and all the related tools is taken into account. Big data analysis makes it possible to obtain more complete answers because the user has more information. With more complete answers there is more confidence in the data sets, which makes problems simpler to solve and easier to tackle. Big data also tends to open up new opportunities and new business models. There are three key stages in big data analysis: integrate, manage, and analyze. In the integration stage, big data is brought together from different sources and applications; traditional data integration mechanisms that extract, transform, and load the data are generally not up to the task, and new strategies and technologies are required to analyze data sets at terabyte or even petabyte scale (Babar et al. 2019). In the management stage, big data requires storage; the data can be stored in any form, and it brings important requirements for the processing engine. The analysis stage is where the investment in big data pays off: the data is analyzed and acted upon, data models are built with machine learning and artificial intelligence, and the data sets are put to work.
Here, the analysis is carried out according to our own ideas and understanding, owing to the unavailability of proper resources or articles. The implementation of the SQL database connector is shown to illustrate the objectives of big data analytics with the chosen software. A simple analytical case connects the objective phases that have already been implemented and added to the report. A direct connection from SQL to big data Hive (Hadoop) was not technically possible because of errors in the connection parameters and in setting up Sqoop.
To link the two systems, the Hadoop Distributed File System (HDFS) is connected to SQL Hadoop through Hive. Hive must be set up to work with HDFS in order to facilitate data integration between the two systems. The objective is to move data from HDFS to SQL Hadoop efficiently and with data integrity preserved. Hive's data management and organization techniques allow for effective data retrieval and storage, and its SQL-like interface can be used to process and analyze the data with sophisticated queries and transformations. Performance optimization techniques ensure effective resource use and query execution.
By achieving these goals, organizations can take advantage of the strengths of both HDFS and SQL Hadoop, facilitating thorough big data processing and informed decision-making. Hive is used to transport records from HDFS to SQL Hadoop during the migration procedure. To do this, the data must be taken out of HDFS and transformed into a format that works with Hive's SQL-based data model. In order to preserve the correctness and reliability of the migrated data, it is important to ensure data integrity throughout the conversion process.
Effective data management and organization become crucial after the data has been moved. Hive offers a scalable and adaptable environment for building tables, partitions, and database structures. Organizations can improve data accessibility and retrieval by optimizing data storage inside Hive, facilitating effective data processing and analysis. Utilizing Hive's SQL-like interface for data processing and analysis is one of the key benefits of linking HDFS to SQL Hadoop through Hive. Users who are comfortable writing SQL queries can easily aggregate, join, and transform the moved data, and organizations can use their existing SQL experience to gain useful insights and make data-driven decisions. Optimization approaches such as parallel processing, query optimization, and data indexing can be used to boost performance, and organizations can accelerate query execution by tuning the system.
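As a minimal sketch of this approach, the HiveQL below declares an external table over a comma-delimited file assumed to already sit in HDFS and then queries it through Hive's SQL-like interface. The HDFS path and the table name dim_data are assumptions made for illustration, and the date column is named EventDate to avoid Hive's reserved keyword.
-- Declare a Hive external table over an assumed HDFS directory (illustrative only).
CREATE EXTERNAL TABLE IF NOT EXISTS dim_data (
    CustomerID  INT,
    Salary      INT,
    EventDate   STRING,
    Performance STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/dim_data/';   -- assumed HDFS location
-- Query the HDFS data in place, without moving the underlying files.
SELECT Performance, COUNT(*) AS customer_count
FROM dim_data
GROUP BY Performance;
Because the table is declared as EXTERNAL, dropping it in Hive would leave the underlying HDFS files untouched, which suits a migration scenario where HDFS remains the system of record.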
Business case scenario
Big data is a resource that grows without limit as long as there are enough places to store the data and enough computing power to handle it. Big data is, first of all, about size; beyond that, the data can take many forms, arrive at different velocities, and contain various levels of data quality, among other attributes (Ketu et al. 2020). The ability to deal with all facets of data analysis is therefore mandatory. Mastery of data brings many companies significant business value: it enables cost optimization, improves efficiency, and yields better insights into customers, in a world that is changing at an ever faster rate. As technology becomes ubiquitous, data footprints are everywhere.
Organizations that use big data analysis effectively focus more on their customers, users, and patients, and apply that understanding to fulfill their needs. Advanced analytics software and dashboards connected to big data deliver a more complete view of customer interactions and behavior (Khan and Malviya 2020). Many organizations combine data from a variety of internal and external sources to upgrade customer service, improve sales, optimize marketing, and enhance products and services, building more real intelligence into their operations and improving customer acquisition and retention. The organization gains a better understanding of what customers like and are interested in, and of how its products and services are used. With big data applications, companies can accurately identify what customers are looking for by observing their behavior patterns.
Analytical design of a dimension data model
A dimensional model in the data warehouse is designed for reading, summarizing, and analyzing numeric information such as values, balances, counts, and weights. The relational model, by contrast, is optimized for adding, updating, and deleting data in real-time transactional systems (Kumar and Singh 2019). The dimensional and relational models each have their own way of storing data, which carries its own advantages. In the relational model, normalization and ER modeling help to reduce redundancy in the data. In the dimensional model, the warehouse arranges the data so that it is easy to retrieve information and generate reports. Hence, the dimensional model is usually used in data warehouse systems, and it is not a good fit for relational (transactional) systems.
Figure 2: Creation of the database in SQL management server studio
(Source: Self-Created)
The dimensional data model consists of the following elements:
Fact
Facts are the measurements or metrics from the business process. For a sales business process, a measurement would be the quarterly sales number.
Dimension
The dimension provides the context surrounding a business process event. It gives the who, what, and where of a fact. For the quarterly sales number of a sales business process, the dimensions would be: who (the customer name), where (the customer location), and what (the product name) (Kumar et al. 2020). In other words, the dimension allows the customer-related details of a fact to be viewed.
Attributes
Attributes are the various characteristics of the dimensions in a dimensional data model. For example, a location dimension might have attributes such as the state, the country name, codes, and other details (Zhang and Wang 2021). These attributes are used for searching, filtering, and classifying facts. The dimension table holds the attributes.
Figure 3: Fact table data displayed
(Source: Self-Created)
Fact table
It is the primary table of the dimensional data model. The fact table contains the measurements (facts) and the foreign keys to the dimension tables.
Dimensional table
It contains the dimension attributes and is joined to the fact table through foreign keys. There is no limit on the number of dimensions.
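To make these elements concrete, the sketch below shows an illustrative star schema in SQL. The table and column names (DimCustomer, DimProduct, FactSales, and so on) are assumptions chosen to match the quarterly-sales example above, not the report's actual schema.
-- Illustrative star-schema sketch only; all names are assumed for this example.
CREATE TABLE DimCustomer (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(100),    -- "who": the customer
    CustomerCity VARCHAR(50)      -- "where": the customer location
);
CREATE TABLE DimProduct (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(100)      -- "what": the product
);
CREATE TABLE FactSales (
    SalesID     INT PRIMARY KEY,
    CustomerID  INT REFERENCES DimCustomer(CustomerID),  -- foreign key to a dimension
    ProductID   INT REFERENCES DimProduct(ProductID),    -- foreign key to a dimension
    SalesAmount DECIMAL(10,2),    -- the measurable fact
    SalesDate   DATE
);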
Implementation and testing of data warehouse
The implementation of the data warehouse follows these steps:
Requirement analysis and capacity planning: this is the first step in data warehousing. It involves assessing the needs of the enterprise, defining the architecture, carrying out capacity planning, and selecting the tools.
Hardware integration: once the hardware and software have been selected, the servers need to be integrated with the software tools that will be used.
Figure 4: Dim table data displayed
(Source: Self-Created)
Modeling: the third step is the design of the warehouse schema and the views.
Physical modeling: a physical model is necessary for the data warehouse to perform effectively.
Sources: the data warehouse model identifies and connects to the data sources by using gateways.
To implement and test the data warehouse in connection with the Dim table data, the Dim table first needs to be designed and created with the necessary columns. The following SQL command creates the table:
CREATE TABLE Dim (Performance VARCHAR(50), CustomerID INT, Salary INT, Date DATE);
Using INSERT statements, the table should be filled with pertinent data after it has been created. To analyze and extract data from the table, data manipulation procedures like SELECT, UPDATE, DELETE, and JOIN can then be used. Different SQL queries are run on the table to ensure correct results and to evaluate the data warehouse architecture. The SQL statements can be changed to suit particular needs and the selected database management system.
Data manipulation techniques can be used to gain important insights after filling the Dim table with data. Specific customer records can be retrieved using SELECT queries with conditions and filters based on factors such as salary ranges, dates, or performance levels. Statistical measures can be calculated over the data with aggregate functions such as SUM, AVG, or COUNT. UPDATE statements change the data in existing records, such as performance or salary values, and DELETE statements remove data that is no longer needed or relevant.
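As a minimal sketch of these operations against the Dim table defined above, the statements below insert sample rows and then query, update, and delete them; the values and the 5000 salary threshold are made up for illustration.
-- Hypothetical sample rows; values are illustrative only.
INSERT INTO Dim (Performance, CustomerID, Salary, Date)
VALUES ('Good', 101, 6500, '2022-01-15'),
       ('Average', 102, 4800, '2022-02-20');
-- Retrieve customers above an assumed salary threshold.
SELECT CustomerID, Salary, Performance
FROM Dim
WHERE Salary > 5000;
-- Aggregate: average salary and customer count per performance level.
SELECT Performance, AVG(Salary) AS AvgSalary, COUNT(*) AS CustomerCount
FROM Dim
GROUP BY Performance;
-- Correct a record and remove stale data.
UPDATE Dim SET Performance = 'Excellent' WHERE CustomerID = 101;
DELETE FROM Dim WHERE Date < '2020-01-01';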
The successful design, creation, and population of the Dim table are the main priorities in the implementation and testing of a data warehouse in relation to the Dim table data. The desired data columns, such as CustomerID, Salary, Date, and Performance, should be appropriately reflected in the table structure. The table must then be filled with pertinent data. This can be done with INSERT statements, where values are added either one row at a time or in bulk, depending on the size and structure of the dataset.
To preserve data integrity, it is essential to make sure the data being inserted is correct. Testing should be carried out to verify the implementation. Retrieval, filtering, and aggregation of data in the Dim table can be exercised by running various SQL queries. The testing stage determines whether the data warehouse implementation is reliable and produces accurate results. Potential problems such as data inconsistencies, erroneous mappings, or performance bottlenecks can be found and fixed by running these queries. Furthermore, it is vital to make sure that the data warehouse implementation is aligned with the project's planned goals.
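The queries below sketch the kind of checks described here. They assume only the Dim table from the earlier CREATE TABLE statement and are illustrative rather than a complete test plan.
-- Row count should match the number of source records that were loaded.
SELECT COUNT(*) AS TotalRows FROM Dim;
-- Key columns should not contain NULLs.
SELECT COUNT(*) AS MissingKeys FROM Dim WHERE CustomerID IS NULL;
-- The same customer should not appear twice for the same date.
SELECT CustomerID, Date, COUNT(*) AS Occurrences
FROM Dim
GROUP BY CustomerID, Date
HAVING COUNT(*) > 1;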
Implementation and testing of big data storage on HDFS
This is the testing process for big data applications, carried out to ensure that all of the functions of a big data application work as expected (Mahmud et al. 2020). The goal of testing big data storage is to make sure that the big data system runs smoothly and without errors while performance and security are maintained. Big data is a collection of data sets that cannot be processed using traditional techniques.
Figure 5: Interface of Hadoop through command-line operations
(Source: Self-Created)
The testing of these datasets involves many tools, techniques, and frameworks.
The big data validation testing process consists of the following phases:
Staging data validation
This phase involves validation of the incoming big data. Data from the different sources is validated to make sure that only correct data is loaded into the system.
Figure 6: SQL code for the DIM table
(Source: Self-Created)
MapReduce validation
In this phase, the business logic is validated on each node, and the data set is then validated again after it has been processed across multiple nodes (Mohammadpoor and Torabi 2020).
Output validation phase
The third step in the Hadoop testing process is output validation. The output data files are generated and are ready to be moved into the data warehouse, or to another system, depending on the requirement; a sample reconciliation check of the kind used here is sketched after this list.
Figure 7: SQL code for the FACT table
(Source: Self-Created)
Architecture testing
Hadoop processes very large volumes of big data and is highly resource intensive (Omar and Jumaa 2019). This phase covers job completion time, memory utilization, data throughput, and similar system metrics.
Performance testing
Performance testing includes two activities: data ingestion and data processing. Data ingestion testing verifies how fast the system can consume data from different types of sources (Wei and Chou 2020). Data processing testing verifies the speed at which the MapReduce jobs are executed.
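As an illustrative sketch of the output validation step described above, a row-count reconciliation between a staging table and the warehouse table can be written in Hive's SQL dialect; the table names staging_dim and warehouse_dim are hypothetical.
-- Compare record counts between the assumed staging and warehouse tables.
SELECT 'staging'   AS layer, COUNT(*) AS row_count FROM staging_dim
UNION ALL
SELECT 'warehouse' AS layer, COUNT(*) AS row_count FROM warehouse_dim;
-- Matching counts indicate that no records were lost during the load.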
The steps listed below can be used to implement and test big data storage on the Hadoop Distributed File System (HDFS) in conjunction with the Dim table SQL:
Apache Pig data processing
Apache Pig is a high-level scripting language used with Hadoop to analyze massive datasets. Example Pig Latin statements for processing the Dim table data stored in HDFS include:
a. Loading data:
dim_data = LOAD 'hdfs://<hdfs_path>/dim_data.csv' USING PigStorage(',') AS (CustomerID:int, Salary:int, Date:chararray, Performance:chararray);
b. Filtering data:
filtered_data = FILTER dim_data BY Salary > 5000;
c. Aggregating data:
grouped_data = GROUP dim_data BY Performance;
result = FOREACH grouped_data GENERATE group AS Performance, COUNT(dim_data) AS Count;
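For comparison, and assuming a Hive table named dim_data has been declared over the same HDFS file (as in the earlier external-table sketch), the same grouping can be expressed in Hive's SQL dialect:
-- HiveQL equivalent of the Pig aggregation above (table name assumed).
SELECT Performance, COUNT(*) AS Count
FROM dim_data
GROUP BY Performance;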
Reflective conclusion
After carrying out the implementation and testing of the big data analysis, it is clear that the availability of big data on low-cost commodity hardware, together with new information management and analytic software, has produced a unique moment in the history of big data analysis. The convergence of these trends means that the capabilities required to analyze big data sets quickly and cost-effectively are now available. These capabilities are neither theoretical nor trivial. The report represents a step forward, with clear opportunities to realize enormous gains in efficiency, profitability, revenue, and productivity.
In conclusion, Hadoop and SQL work well together to provide strong capabilities for big data processing and storage, particularly when HDFS and tools like Apache Pig are used. Large datasets can be stored and retrieved effectively using Hadoop's distributed file system, while SQL offers a familiar and expressive language for data manipulation and analysis. Organizations can take advantage of Hadoop's scalability and parallel processing capabilities as well as SQL's flexibility and analytical strength by combining the two technologies. This combination facilitates the management of enormous amounts of data and supports data-driven insights, making it a valuable option for big data analysis and storage across numerous industries.
Reference List
Journals
Ali, S.A.G., Al-Fayyadh, H.R.D., Mohammed, S.H. and Ahmed, S.R., 2022, June. A Descriptive Statistical Analysis of Overweight and Obesity Using Big Data. In 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
Azeroual, O. and Fabre, R., 2021. Processing big data with Apache Hadoop in the current challenging era of COVID-19. Big Data and Cognitive Computing, 5(1), p.12.
Babar, M., Arif, F., Jan, M.A., Tan, Z. and Khan, F., 2019. Urban data management system: Towards Big Data analytics for Internet of Things based smart urban environment using customized Hadoop. Future Generation Computer Systems, 96, pp.398-409.
Ketu, S., Mishra, P.K. and Agarwal, S., 2020. Performance analysis of distributed computing frameworks for big data analytics: Hadoop vs Spark. Computación y Sistemas, 24(2), pp.669-686.
Khan, M. and Malviya, A., 2020, February. Big data approach for sentiment analysis of twitter data using Hadoop framework and deep learning. In 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE) (pp. 1-5). IEEE.
Kumar, S. and Singh, M., 2019. A novel clustering technique for efficient clustering of big data in Hadoop Ecosystem. Big Data Mining and Analytics, 2(4), pp.240-247.
Kumar, Y., Sood, K., Kaul, S. and Vasuja, R., 2020. Big data analytics and its benefits in healthcare. In Big data analytics in healthcare (pp. 3-21). Springer, Cham.
Mahmud, M.S., Huang, J.Z., Salloum, S., Emara, T.Z. and Sadatdiynov, K., 2020. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics, 3(2), pp.85-101.
Mohammadpoor, M. and Torabi, F., 2020. Big Data analytics in oil and gas industry: An emerging trend. Petroleum, 6(4), pp.321-328.
Omar, H.K. and Jumaa, A.K., 2019. Big data analysis using Apache Spark MLlib and Hadoop HDFS with Scala and Java. Kurdistan Journal of Applied Research, 4(1), pp.7-14.
Wei, C.C. and Chou, T.H., 2020. Typhoon quantitative rainfall prediction from big data analytics by using the Apache Hadoop Spark parallel computing framework. Atmosphere, 11(8), p.870.
Zhang, X. and Wang, Y., 2021. Research on intelligent medical big data system based on Hadoop and blockchain. EURASIP Journal on Wireless Communications and Networking, 2021(1), pp.1-21.