Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing a variety of techniques under various names, and is used in different business, science, and social science domains.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data, while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis.
Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.
The process of data analysis
Analysis refers to breaking a whole down into its separate components for individual examination. Data analysis is the process of obtaining raw data and converting it into information useful for decision making by users. Data are collected and analyzed to answer questions, test hypotheses, or disprove theories.
The statistician John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."
There are several distinguishable phases, described below. The phases are iterative, in that feedback from later phases can result in additional work in earlier phases.
Data requirements
Data are required as inputs to the analysis, which is specified based on the requirements of those directing the analysis or of the customers (who will use the finished product of the analysis). The general type of entity upon which data will be collected is referred to as an experimental unit (for example, a person or a population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained. Data may be numerical or categorical (i.e., a text label for numbers).
Data collection â ⬠<â â¬
Data are collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. Data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. They may also be obtained through interviews, downloads from online sources, or reading documentation.
Data processing
Data initially obtained must be processed or organized for analysis. For instance, this may involve placing data into rows and columns in a table format (i.e., structured data) for further analysis, such as within a spreadsheet or statistical software.
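As an added illustration (not from the original article), the hypothetical snippet below uses the pandas library to place a handful of raw records into rows and columns; the field names and values are invented.

```python
# A minimal sketch of structuring raw records into tabular form with pandas.
# The field names and values here are hypothetical examples.
import pandas as pd

raw_records = [
    {"respondent_id": 1, "age": 34, "income": 52000},
    {"respondent_id": 2, "age": 28, "income": 41000},
    {"respondent_id": 3, "age": 45, "income": 67000},
]

# Place the data into rows and columns (structured, tabular format).
df = pd.DataFrame(raw_records)
print(df)
```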
Data cleaning
Once processed and organized, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning arises from problems in the way data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracy of data, assessing the overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published figures believed to be reliable. Unusual amounts above or below pre-determined thresholds may also be reviewed. There are several types of data cleaning that depend on the type of data, such as phone numbers, email addresses, employers, etc. Quantitative data methods for outlier detection can be used to get rid of data that appear to have been entered incorrectly. Textual data spell checkers can be used to reduce the number of mistyped words, but it is harder to tell whether the words themselves are correct.
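The sketch below is an added illustration, not the article's own example: it shows a few of these cleaning steps with pandas, namely dropping duplicates, comparing a total against a separately published figure, and flagging amounts above a pre-determined threshold. The column names, published total, and threshold are all hypothetical.

```python
# A hedged sketch of common cleaning steps: deduplication, checking totals
# against a trusted published figure, and flagging values outside a
# pre-determined threshold. Column names and thresholds are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "account": ["A", "A", "B", "C"],
    "amount":  [100.0, 100.0, 250.0, 9_999.0],
})

df = df.drop_duplicates()                      # remove duplicate records

published_total = 10_349.0                     # separately published figure
if abs(df["amount"].sum() - published_total) > 1e-6:
    print("Total does not match the published figure; review the data.")

threshold = 5_000.0                            # pre-determined review threshold
suspicious = df[df["amount"] > threshold]      # unusually large amounts
print(suspicious)
```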
Exploratory data analysis
Once the data are cleaned, they can be analyzed. Analysts may apply a variety of techniques, referred to as exploratory data analysis, to begin understanding the messages contained in the data. The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative. Descriptive statistics, such as the average or median, may be generated to help understand the data. Data visualization may also be used to examine the data in graphical format, to obtain additional insight into the messages within the data.
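A minimal exploratory sketch, assuming the cleaned data sit in a pandas DataFrame (the income figures below are simulated, not real data), might generate descriptive statistics and a quick histogram:

```python
# A small exploratory sketch: descriptive statistics plus a quick histogram.
# The data are simulated; in practice df would hold the cleaned dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50_000, 12_000, 500)})

print(df["income"].describe())        # mean, median (50%), spread, extremes

df["income"].plot(kind="hist", bins=30, title="Income distribution")
plt.xlabel("Income")
plt.show()
```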
Modeling and algorithms
Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variables in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).
Inferential statistics includes techniques to measure relationships between particular variables. For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as Y = aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X. Analysts may attempt to build models that are descriptive of the data to simplify analysis and communicate results.
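As a small added illustration of the Y = aX + b + error formulation, the snippet below fits simulated advertising and sales figures with scipy.stats.linregress; the data and choice of tool are assumptions, not part of the original text.

```python
# A minimal illustration of the Y = aX + b + error model from the text,
# using scipy.stats.linregress. Advertising and sales figures are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
advertising = rng.uniform(10, 100, 50)                  # independent variable X
sales = 3.0 * advertising + 20 + rng.normal(0, 5, 50)   # dependent variable Y

fit = stats.linregress(advertising, sales)
print(f"a (slope)     = {fit.slope:.2f}")
print(f"b (intercept) = {fit.intercept:.2f}")
print(f"R-squared     = {fit.rvalue**2:.3f}")
```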
Data products
A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.
Communications
Once the data are analyzed, they may be reported in many formats to the users of the analysis to support their requirements. The users may have feedback, which results in additional analysis. As such, much of the analytical cycle is iterative.
When determining how to communicate the results, the analyst may consider data visualization techniques to help clearly and efficiently communicate the message to the audience. Data visualization uses information displays (such as tables and charts) to help communicate key messages contained in the data. Tables are helpful to a user who might look up specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.
Quantitative messages
Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data, along with the associated graphs used to help communicate each message. Customers specifying requirements and analysts performing the data analysis may consider these messages during the course of the process. A brief plotting sketch of two of these message types follows the list.
- Time series: One variable is taken over a period of time, such as the unemployment rate over a 10-year period. Line charts can be used to show trends.
- Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by salespersons (the category, with each salesperson a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the salespersons.
- Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
- Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business over a given time period. A bar chart can show the comparison of the actual versus the reference amount.
- Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis.
- Correlation: Comparison between observations represented by two variables (X, Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
- Nominal comparison: Compares subdivisions of categories in no particular order, such as sales volumes by product code. Bar charts can be used for this comparison.
- Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.
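As a hedged illustration of two of these message types, the snippet below draws a time series as a line chart and a ranking as a bar chart with matplotlib; all figures are invented for the example.

```python
# A hedged sketch of two of the message types above: a time series shown as a
# line chart and a ranking shown as a bar chart. All figures are invented.
import matplotlib.pyplot as plt

years = list(range(2011, 2021))
unemployment = [8.9, 8.1, 7.4, 6.2, 5.3, 4.9, 4.4, 3.9, 3.7, 8.1]  # illustrative
plt.figure()
plt.plot(years, unemployment)                  # time series -> line chart
plt.title("Unemployment rate over a 10-year period")
plt.xlabel("Year")
plt.ylabel("Rate (%)")

staff = ["Ana", "Ben", "Caro", "Dee"]
sales = [120, 95, 180, 60]                     # illustrative sales per person
order = sorted(range(len(sales)), key=sales.__getitem__, reverse=True)
plt.figure()
plt.bar([staff[i] for i in order], [sales[i] for i in order])  # ranking -> bar chart
plt.title("Sales by salesperson (ranked)")
plt.show()
```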
Techniques for analyzing quantitative data
Author Jonathan Koomey has recommended a range of best practices for understanding quantitative data. These include:
- Check raw data for anomalies before doing your analysis;
- Re-perform important calculations, such as verifying columns of data that are formula-driven;
- Confirm that main totals are the sum of subtotals;
- Check relationships between numbers that should be related in a predictable way, such as ratios over time;
- Normalize numbers to make comparisons easier, such as analyzing amounts per person, relative to GDP, or as an index value relative to a base year;
- Break problems into component parts by analyzing the factors that drive the results, such as DuPont analysis of return on equity.
For the variables under examination, analysts typically obtain descriptive statistics, such as the mean (average), median, and standard deviation. They may also analyze the distribution of the key variables to see how the individual values cluster around the mean.
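A brief added sketch of two of the practices mentioned above, descriptive statistics and normalization to an index relative to a base year, is shown below; the revenue figures and the use of pandas are assumptions made for illustration.

```python
# A small sketch of obtaining descriptive statistics for a key variable and
# normalizing a series as an index relative to a base year. Figures invented.
import pandas as pd

revenue = pd.Series([410, 455, 430, 520, 580],
                    index=[2019, 2020, 2021, 2022, 2023], name="revenue")

print(revenue.describe()[["mean", "50%", "std"]])   # mean, median, std. dev.

index_vs_base = 100 * revenue / revenue.loc[2019]   # base year 2019 = 100
print(index_vs_base)
```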
Consultants at McKinsey and Company named a technique for breaking a quantitative problem down into its component parts the MECE principle. Each layer can be broken down into its components; each of the sub-components must be mutually exclusive of each other and collectively add up to the layer above them. This relationship is referred to as "Mutually Exclusive and Collectively Exhaustive" or MECE. For example, profit by definition can be broken down into total revenue and total cost. In turn, total revenue can be analyzed by its components, such as the revenue of divisions A, B, and C (which are mutually exclusive of each other) and should add up to the total revenue (collectively exhaustive).
Analysts may use robust statistical measurements to solve certain analytical problems. Hypothesis testing is used when a particular hypothesis about the true state of affairs is made by the analyst and data are gathered to determine whether that state of affairs is true or false. For example, the hypothesis might be that "Unemployment has no effect on inflation", which relates to an economics concept called the Phillips curve. Hypothesis testing involves considering the likelihood of Type I and Type II errors, which relate to whether the data support accepting or rejecting the hypothesis.
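As an added, hedged example of a hypothesis test (not the article's own), the snippet below compares two simulated samples with a two-sample t-test from scipy and weighs the p-value against an accepted Type I error rate:

```python
# A hedged sketch of a two-sample hypothesis test with scipy. The samples are
# simulated; in practice they might be, e.g., inflation under two regimes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(2.0, 0.5, 40)   # e.g., inflation in low-unemployment months
group_b = rng.normal(2.3, 0.5, 40)   # e.g., inflation in high-unemployment months

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05                          # accepted Type I error rate
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject the null hypothesis" if p_value < alpha
      else "Fail to reject the null hypothesis")
```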
Regression analysis may be used when the analyst is trying to determine the extent to which independent variable X affects dependent variable Y (e.g., "To what extent do changes in the unemployment rate (X) affect the inflation rate (Y)?"). This is an attempt to fit a line or curve equation to the data, such that Y is a function of X.
Necessary condition analysis (NCA) may be used when the analyst is trying to determine the extent to which independent variable X allows variable Y (e.g., "To what extent is a certain unemployment rate (X) necessary for a certain inflation rate (Y)?"). Whereas (multiple) regression analysis uses additive logic, in which each X-variable can produce the outcome and the X's can compensate for each other (they are sufficient but not necessary), necessary condition analysis uses necessity logic, in which one or more X-variables allow the outcome to exist but may not produce it (they are necessary but not sufficient). Each single necessary condition must be present, and compensation is not possible.
Analytical activities of data users
Users may have particular data points of interest within a data set, as opposed to the general messaging outlined above. Such low-level user analytic activities are presented in the following table. The taxonomy can also be organized by three poles of activities: retrieving values, finding data points, and arranging data points.
Barriers to effective analysis
Barriers to effective analysis may exist among the analysts performing the data analysis or among the audience. Distinguishing fact from opinion, cognitive biases, and innumeracy are all challenges to sound data analysis.
Confusing fact and opinion
Effective analysis requires obtaining relevant facts to answer questions, support a conclusion or formal opinion, or test hypotheses. Facts by definition are irrefutable, meaning that anyone involved in the analysis should be able to agree upon them. For example, in August 2010, the Congressional Budget Office (CBO) estimated that extending the Bush tax cuts of 2001 and 2003 for the 2011-2020 time period would add approximately $3.3 trillion to the national debt. Everyone should be able to agree that this is indeed what the CBO reported; they can all examine the report. This makes it a fact. Whether persons agree or disagree with the CBO is their own opinion.
As another example, the auditor of a public company must arrive at a formal opinion on whether the financial statements of the publicly traded corporation are "fairly stated, in all material respects." This requires extensive analysis of factual data and evidence to support the opinion. When making the leap from facts to opinions, there is always the possibility that the opinion is erroneous.
Cognitive bias
There are a variety of cognitive biases that can adversely affect analysis. For example, confirmation bias is the tendency to search for or interpret information in a way that confirms one's preconceptions. In addition, individuals may discredit information that does not support their views.
Analysts may be trained specifically to be aware of these biases and how to overcome them. In his book The Psychology of Intelligence Analysis, retired CIA analyst Richards Heuer wrote that analysts should clearly delineate their assumptions and chains of inference and specify the degree and source of the uncertainty involved in the conclusions. He emphasized procedures to help surface and debate alternative points of view.
Innumeracy
Effective analysts are generally adept with a variety of numerical techniques. However, audiences may not have such literacy with numbers, or numeracy; they are said to be innumerate. Persons communicating the data may also be attempting to mislead or misinform, deliberately using bad numerical techniques.
For example, whether a number is rising or falling may not be the key factor. More important may be the number relative to another number, such as the size of government revenue or spending relative to the size of the economy (GDP), or the amount of cost relative to revenue in corporate financial statements. This numerical technique is referred to as normalization or common-sizing. There are many such techniques employed by analysts, whether adjusting for inflation (i.e., comparing real vs. nominal data) or considering population increases, demographics, etc. Analysts apply a variety of techniques to address the various quantitative messages described in the section above.
Analysts may also analyze data under different assumptions or scenarios. For example, when analysts perform financial statement analysis, they will often recast the financial statements under different assumptions to help arrive at an estimate of future cash flow, which they then discount to present value based on some interest rate, to determine the valuation of the company or its stock. Similarly, the CBO analyzes the effects of various policy options on the government's revenue, outlays and deficits, creating alternative future scenarios for key measures.
More topics
Smart building
A data analytics approach can be used to predict energy consumption in buildings. The different steps of the data analysis process are carried out to realize smart buildings, where building management and control operations, including heating, ventilation, air conditioning, lighting and security, are realized automatically by mimicking the needs of the building users and optimizing resources such as energy and time.
Analytics and business intelligence
Analytics is "an extensive use of data, quantitative statistics and analysis, explanative and predictive models, and fact-based management to drive decisions and actions." This is part of business intelligence, which is a set of technologies and processes that use data to understand and analyze business performance.
Education
In education, most educators have access to a data system for the purpose of analyzing student data. These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system, and making key package/display and content decisions) to improve the accuracy of educators' data analyses.
Practitioner notes
This section contains a rather technical explanation that can help the practitioner but is beyond the scope of a typical Wikipedia article.
Initial data analysis
The most important distinction between the initial data analysis phase and the main analysis phase is that during initial data analysis one refrains from any analysis aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:
Data quality
The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis:
- Frequency counts
- Descriptive statistics (mean, standard deviation, median)
- Normality (skewness, kurtosis, frequency histograms)
- Comparison and correction of differences in coding schemes: variables are compared with the coding schemes of variables external to the data set, and possibly corrected if the coding schemes are not comparable
- Test for common method variance.
The choice of analysis to assess the quality of data during the initial data analysis phase depends on the analysis to be performed in the main analysis phase.
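A hedged sketch of some of these quality checks is shown below, assuming a hypothetical dataset with one categorical and one numeric column; frequency counts, descriptive statistics, and the normality indicators skewness and kurtosis are computed with pandas.

```python
# A hedged sketch of simple data quality checks: frequency counts,
# descriptive statistics, and normality indicators (skewness, kurtosis).
# The dataset and column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "gender": rng.choice(["F", "M"], 200),
    "score":  rng.normal(100, 15, 200),
})

print(df["gender"].value_counts())                      # frequency counts
print(df["score"].agg(["mean", "std", "median"]))       # descriptive statistics
print("skewness:", df["score"].skew())                  # normality indicators
print("kurtosis:", df["score"].kurtosis())
```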
Measurement quality
The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether the structure of the measurement instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:
- Confirmatory factor analysis
- Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's alpha of the scales, and the change in Cronbach's alpha when an item is deleted from a scale (a small computation sketch follows this list).
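The following minimal sketch computes Cronbach's alpha for an invented three-item scale scored 1-5; the scores are hypothetical and the snippet only illustrates the standard formula.

```python
# A minimal sketch of a homogeneity (internal consistency) check: computing
# Cronbach's alpha for a small, invented 3-item scale scored 1-5.
import numpy as np

items = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 4, 5],
    [2, 2, 3],
    [4, 4, 4],
])  # rows = respondents, columns = scale items

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1).sum()
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```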
Initial transformation
After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.
Possible transformations of variables, illustrated in the sketch after this list, are:
- Square root transformation (if the distribution differs moderately from normal)
- Log transformation (if the distribution differs substantially from normal)
- Inverse transformation (if the distribution differs severely from normal)
- Make categorical (ordinal/dichotomous) (if the distribution differs severely from normal, and no transformations help)
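The sketch below is an added illustration: it applies the first three transformations to a simulated right-skewed variable and reports the skewness after each one, so the simulated data and the use of numpy/scipy are assumptions, not the article's own example.

```python
# A sketch of the transformations listed above applied to a skewed variable.
# The data are simulated log-normal values for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed, positive

sqrt_x = np.sqrt(x)       # square root transformation (moderate deviation)
log_x = np.log(x)         # log transformation (substantial deviation)
inv_x = 1.0 / x           # inverse transformation (severe deviation)

for name, values in [("raw", x), ("sqrt", sqrt_x), ("log", log_x), ("inverse", inv_x)]:
    print(f"{name:8s} skewness = {stats.skew(values):.2f}")
```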
Did the implementation of the study fulfill the intentions of the research design?
One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.
If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:
- Dropout (this should be identified during the initial data analysis phase)
- Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
- Treatment quality (using manipulation checks).
Characteristics of the data sample
In any report or article, the structure of the sample must be accurately described. It is especially important to determine the exact structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.
The characteristics of the data sample can be assessed by looking at the following (a brief sketch in code follows this list):
- Basic statistics of important variables
- Scatter plots
- Correlations and associations
- Cross-tabulations
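A short sketch of these checks, using an invented sample with hypothetical variables and pandas as an assumed tool, might look as follows:

```python
# A hedged sketch of inspecting sample characteristics: basic statistics,
# a correlation, and a cross-tabulation. The variables are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "group": rng.choice(["control", "treatment"], 120),
    "gender": rng.choice(["F", "M"], 120),
    "age": rng.integers(18, 65, 120),
    "score": rng.normal(50, 10, 120),
})

print(df[["age", "score"]].describe())            # basic statistics
print(df["age"].corr(df["score"]))                # correlation / association
print(pd.crosstab(df["group"], df["gender"]))     # cross-tabulation (subgroup sizes)
```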
Final stage of initial data analysis
During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.
Also, the original plan for the main data analyses can and should be specified in more detail or rewritten. In order to do this, several decisions about the main data analyses can and should be made:
- In the case of non-normals: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
- In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
- In the case of outliers: should one use robust analysis techniques?
- In case an item does not fit the scale: should one adapt the measurement instrument by omitting the item, or rather ensure comparability with other (uses of the) measurement instrument(s)?
- In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small-sample techniques, such as exact tests or bootstrapping?
- In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?
Analysis
Several analyses can be used during the initial data analysis phase:
- Univariate statistics (single variable)
- Bivariate associations (correlations)
- Graphical techniques (scatter plots)
It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:
- Nominal and ordinal variables
  - Frequency counts (numbers and percentages)
  - Associations
    - Cross-tabulations
    - Hierarchical loglinear analysis (restricted to a maximum of 8 variables)
    - Loglinear analysis (to identify relevant and important variables and possible confounders)
  - Exact tests or bootstrapping (in case subgroups are small)
  - Computation of new variables
- Continuous variables
  - Distribution
    - Statistics (M, SD, variance, skewness, kurtosis)
    - Stem-and-leaf displays
    - Box plots
Nonlinear analysis
Nonlinear analysis will be required when data is recorded from a nonlinear system. Nonlinear systems can exhibit complex dynamic effects including bifurcation, chaos, harmonics and subharmonics that can not be analyzed using simple linear methods. Analysis of nonlinear data is closely related to nonlinear system identification.
Main data analysis
In the main analysis phase, analyses aimed at answering the research question are performed, as well as any other relevant analyses needed to write the first draft of the research report.
Exploratory and confirmatory approaches
In the main analysis phase, either an exploratory or a confirmatory approach can be adopted. Usually the approach is decided before the data are collected. In an exploratory analysis, no clear hypothesis is stated before analyzing the data, and the data are searched for models that describe the data well. In a confirmatory analysis, clear hypotheses about the data are tested.
Exploratory data analysis should be interpreted carefully. When testing multiple models at once, there is a high chance of finding at least one of them to be significant, but this can be due to a Type I error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratively in a dataset, following it up with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same Type I error that produced the exploratory model in the first place. The confirmatory analysis will therefore not be more informative than the original exploratory analysis.
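As a minimal added illustration of such an adjustment, the snippet below applies a Bonferroni correction to a set of invented p-values:

```python
# A minimal sketch of a Bonferroni adjustment when several models or tests
# are evaluated at once, as cautioned above. The p-values are invented.
p_values = [0.012, 0.030, 0.047, 0.200]    # from, say, four exploratory models
alpha = 0.05
bonferroni_alpha = alpha / len(p_values)   # adjusted significance level

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < bonferroni_alpha else "not significant"
    print(f"model {i}: p = {p:.3f} -> {verdict} at adjusted alpha = {bonferroni_alpha:.4f}")
```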
Results stability
It is important to obtain some indication of how far the results can be generalized. While this is often hard to check, one can look at the stability of the results: are the results reliable and reproducible? There are two main ways of doing that, both sketched in code after this list:
- Cross-validation: By splitting the data into multiple parts, we can check if an analysis (like a fitted model) based on one part of the data also generalizes to another part of the data.
- Sensitivity analysis: A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do that is via bootstrapping.
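Both checks are sketched below under simplified assumptions: a split-sample cross-validation of a simulated linear fit, and a bootstrap of the fitted slope as a simple form of sensitivity analysis. The data and model are invented for illustration.

```python
# A hedged sketch of both stability checks: a simple split-sample
# cross-validation of a linear fit, and a bootstrap of the fitted slope.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)

# Cross-validation: fit on one half, check prediction error on the other half.
half = len(x) // 2
slope, intercept = np.polyfit(x[:half], y[:half], deg=1)
holdout_rmse = np.sqrt(np.mean((y[half:] - (slope * x[half:] + intercept)) ** 2))
print(f"hold-out RMSE: {holdout_rmse:.2f}")

# Sensitivity via bootstrap: refit on resampled data and inspect slope spread.
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x), len(x))
    b_slope, _ = np.polyfit(x[idx], y[idx], deg=1)
    boot_slopes.append(b_slope)
print(f"bootstrap slope: mean {np.mean(boot_slopes):.2f}, "
      f"std {np.std(boot_slopes):.3f}")
```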
Statistical methods
Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:
- General linear model: A widely used model on which various methods are based (e.g., t-test, ANOVA, ANCOVA, MANOVA). Usable for assessing the effect of several predictors on one or more continuous dependent variables.
- Generalized linear model: An extension of the general linear model for discrete dependent variables.
- Structural equation modeling: Can be used to assess the latent structure of measurable manifest variables.
- Item response theory: Models for (mostly) assessing one latent variable from several binary measured variables (e.g., an exam).
Free software for data analysis
- DevInfo - a database system supported by the United Nations Development Group to monitor and analyze human development.
- ELKI - data mining framework in Java with data mining oriented visualization function.
- KNIME - Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
- Orange - A visual programming tool that displays interactive data visualizations and methods for analyzing statistical data, data mining, and machine learning.
- PAST - free software for scientific data analysis
- PAW - FORTRAN/C data analysis framework developed at CERN
- R - a programming language and software environment for statistical computing and graphics.
- ROOT - C++ data analysis framework developed at CERN
- SciPy and Pandas - Python libraries for data analysis
International data analysis contests
Source of the article : Wikipedia