Friday, November 29, 2019

AP Chemistry Course and Exam Topics

This is an outline of the chemistry topics covered by the AP (Advanced Placement) Chemistry course and exam, as described by the College Board. The percentage given after each topic is the approximate percentage of multiple-choice questions on the AP Chemistry Exam about that topic.

I. Structure of Matter (20%)
II. States of Matter (20%)
III. Reactions (35–40%)
IV. Descriptive Chemistry (10–15%)
V. Laboratory (5–10%)

I. Structure of Matter (20%)

Atomic Theory and Atomic Structure
Evidence for the atomic theory
Atomic masses; determination by chemical and physical means
Atomic number and mass number; isotopes
Electron energy levels: atomic spectra, quantum numbers, atomic orbitals
Periodic relationships including atomic radii, ionization energies, electron affinities, oxidation states

Chemical Bonding
Binding forces
a. Types: ionic, covalent, metallic, hydrogen bonding, van der Waals (including London dispersion forces)
b. Relationships to states, structure, and properties of matter
c. Polarity of bonds, electronegativities
Molecular models
a. Lewis structures
b. Valence bond: hybridization of orbitals, resonance, sigma and pi bonds
c. VSEPR
Geometry of molecules and ions, structural isomerism of simple organic molecules and coordination complexes; dipole moments of molecules; relation of properties to structure

Nuclear Chemistry
Nuclear equations, half-lives, and radioactivity; chemical applications

II. States of Matter (20%)

Gases
Laws of ideal gases
a. Equation of state for an ideal gas
b. Partial pressures
Kinetic-molecular theory
a. Interpretation of ideal gas laws on the basis of this theory
b. Avogadro's hypothesis and the mole concept
c. Dependence of kinetic energy of molecules on temperature
d. Deviations from ideal gas laws

Liquids and Solids
Liquids and solids from the kinetic-molecular viewpoint
Phase diagrams of one-component systems
Changes of state, including critical points and triple points
Structure of solids; lattice energies

Solutions
Types of solutions and factors affecting solubility
Methods of expressing concentration (the use of normalities is not tested)
Raoult's law and colligative properties (nonvolatile solutes); osmosis
Non-ideal behavior (qualitative aspects)

III. Reactions (35–40%)

Reaction Types
Acid-base reactions; concepts of Arrhenius, Brønsted-Lowry, and Lewis; coordination complexes; amphoterism
Precipitation reactions
Oxidation-reduction reactions
a. Oxidation number
b. The role of the electron in oxidation-reduction
c. Electrochemistry: electrolytic and galvanic cells; Faraday's laws; standard half-cell potentials; Nernst equation; prediction of the direction of redox reactions

Stoichiometry
Ionic and molecular species present in chemical systems: net ionic equations
Balancing of equations, including those for redox reactions
Mass and volume relations with emphasis on the mole concept, including empirical formulas and limiting reactants

Equilibrium
Concept of dynamic equilibrium, physical and chemical; Le Chatelier's principle; equilibrium constants
Quantitative treatment
a. Equilibrium constants for gaseous reactions: Kp, Kc
b. Equilibrium constants for reactions in solution
(1) Constants for acids and bases; pK; pH
(2) Solubility product constants and their application to precipitation and the dissolution of slightly soluble compounds
(3) Common ion effect; buffers; hydrolysis

Kinetics
Concept of rate of reaction
Use of experimental data and graphical analysis to determine reactant order, rate constants, and reaction rate laws
Effect of temperature change on rates
Energy of activation; the role of catalysts
The relationship between the rate-determining step and a mechanism

Thermodynamics
State functions
First law: change in enthalpy; heat of formation; heat of reaction; Hess's law; heats of vaporization and fusion; calorimetry
Second law: entropy; free energy of formation; free energy of reaction; dependence of change in free energy on enthalpy and entropy changes
Relationship of change in free energy to equilibrium constants and electrode potentials

IV. Descriptive Chemistry (10–15%)

A. Chemical reactivity and products of chemical reactions.
B. Relationships in the periodic table: horizontal, vertical, and diagonal, with examples from alkali metals, alkaline earth metals, halogens, and the first series of transition elements.
C. Introduction to organic chemistry: hydrocarbons and functional groups (structure, nomenclature, chemical properties). Physical and chemical properties of simple organic compounds should also be included as exemplary material for the study of other areas such as bonding, equilibria involving weak acids, kinetics, colligative properties, and stoichiometric determinations of empirical and molecular formulas.

V. Laboratory (5–10%)

The AP Chemistry Exam includes some questions based on experiences and skills students acquire in the laboratory: making observations of chemical reactions and substances; recording data; calculating and interpreting results based on the quantitative data obtained; and communicating effectively the results of experimental work. AP Chemistry coursework and the AP Chemistry Exam also include working some specific types of chemistry problems.

AP Chemistry Calculations

When performing chemistry calculations, students are expected to pay attention to significant figures, the precision of measured values, and the use of logarithmic and exponential relationships. Students should also be able to determine whether a calculated result is reasonable. According to the College Board, the following types of chemical calculations may appear on the AP Chemistry Exam:

Percentage composition
Empirical and molecular formulas from experimental data
Molar masses from gas density, freezing-point, and boiling-point measurements
Gas laws, including the ideal gas law, Dalton's law, and Graham's law
Stoichiometric relations using the concept of the mole; titration calculations
Mole fractions; molar and molal solutions
Faraday's law of electrolysis
Equilibrium constants and their applications, including their use for simultaneous equilibria
Standard electrode potentials and their use; Nernst equation
Thermodynamic and thermochemical calculations
Kinetics calculations
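As a quick illustration of the gas-law calculations listed above, here is a worked ideal-gas-law example with attention to significant figures (the numbers are hypothetical, not taken from College Board materials):

% Ideal gas law, solved for the amount of substance n (hypothetical values)
% P = 1.00 atm, V = 22.4 L, T = 273 K, R = 0.0821 L atm mol^{-1} K^{-1}
n = \frac{PV}{RT}
  = \frac{(1.00\,\text{atm})(22.4\,\text{L})}{(0.0821\,\text{L atm mol}^{-1}\text{K}^{-1})(273\,\text{K})}
  \approx 1.00\ \text{mol} \quad \text{(three significant figures)}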

Monday, November 25, 2019

Blindness in Oedipus essays

Blindness is literally the inability to see, but blindness can also mean overlooking details or ignoring the facts. Both kinds can be seen in the play Oedipus the King by Sophocles, in which many characters demonstrate their blindness. One of them is the Herdsman. Because of his blindness, the Herdsman is one person to be blamed for the murder of Laius and for the plague that struck Thebes. On page sixty-three, line 1177, the Herdsman tells Oedipus that a man came from another country to adopt him: "O master, I pitied it, and thought that I could send it off to another country and this man was from another country." Assuming that the man's story was true, the Herdsman simply gave him the baby Oedipus. Although the Herdsman knew what Oedipus' fate was (to marry his mother after killing his father), he allowed the man to take the baby. The Herdsman should have asked more questions, or even told the man what the child's fate was, to make sure that Oedipus would never return to Thebes. I chose this passage because the agony that Thebes went through could have been avoided if the Herdsman had made sure that Oedipus was killed or out of the country for good. The blindness of the Herdsman caused the city to go through this suffering. I found this passage to be important because the irresponsible Herdsman was blind to the fact that he might be the cause of the extinction of a city. He single-handedly could have saved Thebes by making sure that Polybus and Merope were from another country, or by allowing Oedipus to die the way Laius and Jocasta wanted him to. ...

Thursday, November 21, 2019

Sales Management Assignment Essay Example | Topics and Well Written Essays - 2000 words

Sales Management Assignment - Essay Example. This paper argues that, keeping the current scenario in mind, Total Gas & Power Ltd depends on its people being able to work together, and systematically planned induction training will greatly accelerate this. Hence an induction plan is set out accordingly, covering a long checklist. A formal induction will be carried out. In a formal orientation the program is highly structured and systematic: everything in the program is laid down in advance and the flow follows that plan. In it, the sales executive is made knowledgeable about four aspects: the organization, the job, the employees, and other matters. The essay concludes that the sales personnel should be made more knowledgeable about their product's features and functions, for example through seminars. Good performance should be rewarded with monetary incentives. Next, the performance appraisal scheme that has been introduced should be communicated to the employees properly, leaving no room for confusion. Then, a senior sales executive should take responsibility for handling the complaints received from customers, evaluating their causes, and finding solutions. Finally, the organization is looking forward to opening another office in mainland Europe. For this, they should be highly motivated and informed about market trends. Their risk-bearing capacity should be high in the beginning, and they should strive towards building good customer relations.

Wednesday, November 20, 2019

Supply Chain Management Essay Example | Topics and Well Written Essays - 2250 words - 2

Supply Chain Management - Essay Example. ...that health care units are normally used by upper-class people of high-income groups (HIG) who are health conscious and wish to remain fit and trim. They could be used by children, adolescents, men and women, and also elderly people. In the case of the prospective location, Newcastle in the UK, the percentage of the population above the age of 60 years is 18.7%, while children under 18 constitute 18.3% of the population (Newcastle upon Tyne Central 2007). In the context of current competitors' charges, in most cases the prices are more or less equal for the fitness services rendered, whether through use of machines or physical training, swimming pools, etc. Coming now to the second part of the question, it could be said that the main duty of a Geographical Information System (GIS) is to store, analyze, manage and present data which is connected with location. "Geography plays a role in nearly every decision we make. Choosing sites, targeting market segments, planning distribution networks, responding to emergencies, or redrawing country boundaries - all of these problems involve questions of geography." (What is GIS). It is necessary that information be associated with locations; location therefore plays a decisive part in business strategy, whether for bonding with customers or siting new plants. Again, choosing a site, targeting a market segment, planning a distribution network, zoning a neighbourhood, allocating resources, and responding to emergencies all involve questions of geography (Geography matters 2008). It is believed that retail business of the kind carried out by Tesco could greatly benefit from GIS. "Businesses maintain information about sales, customers, inventory, demographic profiles, and mailing lists, all of which have geographic locations. Therefore, business managers, marketing strategists, financial analysts, and professional planners increasingly rely on GIS to organize, analyze,

Monday, November 18, 2019

Catwalk Versus Visual Fashion Shows Dissertation

Catwalk Versus Visual Fashion Shows - Dissertation Example. The paper "Catwalk Versus Visual Fashion Shows" examines young designers' preferences regarding fashion shows. The research is done by considering the choices of young fashion designers from the UK and Korea. In the introductory portion, the content, aims and objectives of this study and the limitations faced while conducting the research work are clearly stated. The literature review comes next, where all aspects related to visual fashion shows are dealt with in detail, including current trends, influencing factors, recent shows, etc. Next is the methodology, where the procedure followed while doing this research work is outlined; research approaches, philosophies, the data collection process, sampling, etc. are depicted in this section. Several people were interviewed. The 'findings' section summarizes the answers given by the respondents. In the 'discussion', the answers of the respondents are interpreted, analyzed and compared, and the views that came up through their answers are related to the trends, practices and concepts referred to in the 'findings'. In the final section, a brief conclusion is provided to the whole study from the point of view of the researcher, along with a justification for young designers' preference for visual fashion shows over live catwalk shows. Visual fashion shows are gradually taking over live fashion shows in recent times. Unlike the traditional fashion shows where the model catwalks down the ramp, visual fashion shows present digitalized images of the same, projected on a screen. Thus, on one hand people fail to perceive the liveliness, but on the other hand designers can easily showcase their creations. For this reason, one can find an increasing popularity of visual fashion shows among designers, especially among young fashion designers. Content: In the present times, the fashion world is experiencing a new type of fashion show, the visual fashion show, which involves digitalized images of models decked out in fashion items. Contrary to conventional catwalk shows, visual fashion shows save time, energy and organization, and are flexible in nature. At the same time, visual fashion shows are also cost effective and interactive. This is the reason why many young designers at present are choosing visual fashion shows over the traditional ones (Menkes, 2010). Aims: This study aims at finding out whether young fashion designers from the UK and Korea prefer live catwalks on ramps or visual fashion shows. Objectives: 1. To find out the choices and preferences of young designers from the UK and Korea. 2. To explore what factors have affected their choice. 3. To examine the relevance and effectiveness of visual fashion shows in the contemporary fashion industry. 4. To analyze the success of visual fashion shows in terms of marketing, promotion and popularity. Methodology: For this paper, both primary and secondary sources will be used; data collected by interviewing some people, as well as data collected from books, journals, reviews and articles from fashion magazines, will be interpreted and analyzed. Findings: It was found out that the new concept of

Saturday, November 16, 2019

Data Pre-processing Tool

Data Pre-processing Tool Chapter- 2 Real life data rarely comply with the necessities of various data mining tools. It is usually inconsistent and noisy. It may contain redundant attributes, unsuitable formats etc. Hence data has to be prepared vigilantly before the data mining actually starts. It is well known fact that success of a data mining algorithm is very much dependent on the quality of data processing. Data processing is one of the most important tasks in data mining. In this context it is natural that data pre-processing is a complicated task involving large data sets. Sometimes data pre-processing take more than 50% of the total time spent in solving the data mining problem. It is crucial for data miners to choose efficient data preprocessing technique for specific data set which can not only save processing time but also retain the quality of the data for data mining process. A data pre-processing tool should help miners with many data mining activates. For example, data may be provided in different formats as discussed in previous chapter (flat files, database files etc). Data files may also have different formats of values, calculation of derived attributes, data filters, joined data sets etc. Data mining process generally starts with understanding of data. In this stage pre-processing tools may help with data exploration and data discovery tasks. Data processing includes lots of tedious works, Data pre-processing generally consists of Data Cleaning Data Integration Data Transformation And Data Reduction. In this chapter we will study all these data pre-processing activities. 2.1 Data Understanding In Data understanding phase the first task is to collect initial data and then proceed with activities in order to get well known with data, to discover data quality problems, to discover first insight into the data or to identify interesting subset to form hypothesis for hidden information. The data understanding phase according to CRISP model can be shown in following . 2.1.1 Collect Initial Data The initial collection of data includes loading of data if required for data understanding. For instance, if specific tool is applied for data understanding, it makes great sense to load your data into this tool. This attempt possibly leads to initial data preparation steps. However if data is obtained from multiple data sources then integration is an additional issue. 2.1.2 Describe data Here the gross or surface properties of the gathered data are examined. 2.1.3 Explore data This task is required to handle the data mining questions, which may be addressed using querying, visualization and reporting. These include: Sharing of key attributes, for instance the goal attribute of a prediction task Relations between pairs or small numbers of attributes Results of simple aggregations Properties of important sub-populations Simple statistical analyses. 2.1.4 Verify data quality In this step quality of data is examined. It answers questions such as: Is the data complete (does it cover all the cases required)? Is it accurate or does it contains errors and if there are errors how common are they? Are there missing values in the data? If so how are they represented, where do they occur and how common are they? 2.2 Data Preprocessing Data preprocessing phase focus on the pre-processing steps that produce the data to be mined. Data preparation or preprocessing is one most important step in data mining. 
2.2 Data Preprocessing

The data preprocessing phase focuses on the pre-processing steps that produce the data to be mined. Data preparation or preprocessing is one of the most important steps in data mining. Industrial practice indicates that when data is well prepared, the mined results are much more accurate; this step is therefore critical for the success of a data mining method. Among others, data preparation mainly involves data cleaning, data integration, data transformation, and data reduction.

2.2.1 Data Cleaning

Data cleaning is also known as data cleansing or scrubbing. It deals with detecting and removing inconsistencies and errors from data in order to get better quality data. When a single data source such as a flat file or database is used, data quality problems arise due to misspellings during data entry, missing information or other invalid data. When the data comes from the integration of multiple data sources such as data warehouses, federated database systems or global web-based information systems, the requirement for data cleaning increases significantly, because the multiple sources may contain redundant data in different formats. Consolidation of different data formats and elimination of redundant information become necessary in order to provide access to accurate and consistent data. Good quality data requires passing a set of quality criteria. Those criteria include:

Accuracy: an aggregated value over the criteria of integrity, consistency and density.
Integrity: an aggregated value over the criteria of completeness and validity.
Completeness: achieved by correcting data containing anomalies.
Validity: approximated by the amount of data satisfying integrity constraints.
Consistency: concerns contradictions and syntactical anomalies in the data.
Uniformity: directly related to irregularities in the data.
Density: the quotient of missing values in the data to the number of total values that ought to be known.
Uniqueness: related to the number of duplicates present in the data.

2.2.1.1 Terms Related to Data Cleaning

Data cleaning: the process of detecting, diagnosing, and editing damaged data.
Data editing: changing the value of data which are incorrect.
Data flow: the passing of recorded information through succeeding information carriers.
Inliers: data values falling inside the projected range.
Outliers: data values falling outside the projected range.
Robust estimation: evaluation of statistical parameters using methods that are less responsive to the effect of outliers than more conventional methods.

2.2.1.2 Definition: Data Cleaning

Data cleaning is a process used to identify imprecise, incomplete, or irrational data and then to improve the quality through correction of detected errors and omissions. This process may include:

Format checks
Completeness checks
Reasonableness checks
Limit checks
Review of the data to identify outliers or other errors
Assessment of data by subject area experts (e.g. taxonomic specialists)

By this process suspected records are flagged, documented and checked subsequently, and finally these suspected records can be corrected. Sometimes validation checks also involve checking for compliance against applicable standards, rules, and conventions. The general framework for data cleaning is given as:

Define and determine error types;
Search and identify error instances;
Correct the errors;
Document error instances and error types; and
Modify data entry procedures to reduce future errors.
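A minimal sketch of the format, completeness and limit checks listed above, flagging suspect records for documentation and later correction rather than silently changing them (column names and the phone-number pattern are hypothetical):

import pandas as pd

df = pd.read_csv("customer_sales.csv")  # hypothetical input

suspect = pd.DataFrame(index=df.index)

# Format check: phone numbers must match a hypothetical NNNNN NNNNNN pattern
suspect["bad_phone"] = ~df["phone"].astype(str).str.match(r"^\d{5} \d{6}$")

# Completeness check: required fields must not be missing
suspect["missing_income"] = df["income"].isna()

# Limit (range) check: age must fall inside a plausible range
suspect["bad_age"] = ~df["age"].between(0, 120)

# Flag and document suspect records for later review and correction
flagged = df[suspect.any(axis=1)]
print(f"{len(flagged)} suspect records flagged for review")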
The data cleaning process is referred to by a number of terms; it is a matter of preference which one uses. These terms include: error checking, error detection, data validation, data cleaning, data cleansing, data scrubbing and error correction. We use "data cleaning" to encompass three sub-processes, viz.:

Data checking and error detection;
Data validation; and
Error correction.

A fourth, improvement of the error prevention processes, could perhaps be added.

2.2.1.3 Problems with Data

Here we note some key problems with data.

Missing data: This problem occurs for two main reasons: data are absent from a source where they are expected to be present, or data are present but not available in an appropriate form. Detecting missing data is usually straightforward and simple.

Erroneous data: This problem occurs when a wrong value is recorded for a real-world value. Detection of erroneous data can be quite difficult (for instance, the incorrect spelling of a name).

Duplicated data: This problem occurs for two reasons: repeated entry of the same real-world entity with somewhat different values, or the same real-world entity having different identifications. Repeated records are common and frequently easy to detect. Different identifications of the same real-world entity, however, can be a very hard problem to identify and solve.

Heterogeneities: When data from different sources are brought together in one analysis, heterogeneity may occur. Structural heterogeneity arises when the data structures reflect different business usage; semantic heterogeneity arises when the meaning of data is different in each system being combined. Heterogeneities are usually very difficult to resolve because they involve a lot of contextual data that is not well defined as metadata.

Information dependencies in the relationships between the different sets of attributes are commonly present, and wrong cleaning mechanisms can further damage the information in the data. Various analysis tools handle these problems in different ways. Commercial offerings are available that assist the cleaning process, but these are often problem specific. Uncertainty in information systems is a well-recognized hard problem.

Extensive support for data cleaning must be provided by data warehouses. Data warehouses have a high probability of "dirty data" since they load and continuously refresh huge amounts of data from a variety of sources. Since these data warehouses are used for strategic decision making, the correctness of their data is important to avoid wrong decisions. A data warehouse is built through the ETL (Extraction, Transformation, and Loading) process. Data transformations are related to schema or data translation and integration, and to filtering and aggregating data to be stored in the data warehouse. All data cleaning is classically performed in a separate data staging area prior to loading the transformed data into the warehouse. A large number of tools of varying functionality are available to support these tasks, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain.
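To make the missing-data and duplicate-record problems above concrete, a small pandas sketch of how they might be detected (file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("customer_sales.csv")  # hypothetical source

# Missing data: usually straightforward to detect
print(df.isna().sum())

# Duplicated data, case 1: exact repeated records are easy to find
exact_dups = df[df.duplicated(keep=False)]
print(f"{len(exact_dups)} exact duplicate rows")

# Duplicated data, case 2: the same real-world entity entered with slightly
# different values; a crude heuristic is to compare records that agree on a
# few identifying attributes
near_dups = df[df.duplicated(subset=["name", "birth_date"], keep=False)]
print(f"{len(near_dups)} rows sharing name and birth date (candidate duplicates)")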
A data cleaning method should assure the following:

It should identify and eliminate all major errors and inconsistencies, both in an individual data source and when integrating multiple sources.
It should be supported by tools that bound manual examination and programming effort, and it should be extensible so that it can cover additional sources.
It should be performed in association with schema-related data transformations based on metadata.
Data cleaning mapping functions should be specified in a declarative way and be reusable for other data sources.

2.2.1.4 Data Cleaning: Phases

1. Analysis: To identify errors and inconsistencies in the database there is a need for detailed analysis, which involves both manual inspection and automated analysis programs. This reveals where (most of) the problems are present.

2. Defining Transformation and Mapping Rules: After discovering the problems, this phase is concerned with defining the manner in which the solutions for cleaning the data will be automated. The analysis phase yields various problems that translate into a list of activities. Example: remove all entries for J. Smith because they are duplicates of John Smith; find entries with 'bule' in the colour field and change these to 'blue'; find all records where the phone number field does not match the pattern (NNNNN NNNNNN) and apply further cleaning steps to them; etc.

3. Verification: In this phase we check and assess the transformation plans made in phase 2. Without this step, we may end up making the data dirtier rather than cleaner. Since data transformation is the main step that actually changes the data itself, there is a need to be sure that the applied transformations will do it correctly; therefore test and examine the transformation plans very carefully. Example: suppose we have a very thick C++ book in which it says "strict" in all the places where it should say "struct".

4. Transformation: Once it is certain that cleaning will be done correctly, apply the transformations verified in the last step. For large databases, this task is supported by a variety of tools.

Backflow of Cleaned Data: In data mining the main objective is to convert and move clean data into the target system. This creates a requirement to purify legacy data. Cleansing can be a complicated process depending on the technique chosen and has to be designed carefully to achieve the objective of removing dirty data. Some methods to accomplish the task of data cleansing of a legacy system include:

Automated data cleansing
Manual data cleansing
A combined cleansing process

2.2.1.5 Missing Values

Data cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values are one important problem to be addressed. The missing value problem occurs because many tuples may have no recorded value for several attributes. For example, consider a customer sales database consisting of a large number of records (say around 100,000) where some of the records have certain fields missing; customer income, say, may be missing from the sales data. The goal here is to find a way to predict what the missing data values should be (so that these can be filled in) based on the existing data.
Missing data may be due to the following reasons:

Equipment malfunction
Data inconsistent with other recorded data and therefore deleted
Data not entered due to misunderstanding
Certain data not considered important at the time of entry
No record kept of the history or changes of the data

How to Handle Missing Values?

Dealing with missing values is a regular question that has to do with the actual meaning of the data. There are various methods for handling missing entries:

1. Ignore the data row. One solution for missing values is to just ignore the entire data row. This is generally done when the class label is missing (assuming that the data mining goal is classification), or when many attributes are missing from the row (not just one). However, if the percentage of such rows is high we will definitely get poor performance.

2. Use a global constant to fill in for missing values. We can fill in a global constant for missing values, such as "unknown", "N/A" or minus infinity. This is done because at times it just doesn't make sense to try to predict the missing value. For example, if the office address is missing for some customers in the customer sales database, filling it in doesn't make much sense. This method is simple but not foolproof.

3. Use the attribute mean. For example, if the average family income is X, we can use that value to replace missing income values in the customer sales database.

4. Use the attribute mean for all samples belonging to the same class. Suppose we have a car pricing database that, among other things, classifies cars into Luxury and Low budget, and we are dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value we would get if we factored in the low-budget cars.

5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, clustering algorithms, etc.
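A minimal sketch of strategies 1 to 4 above in pandas (column names are hypothetical; strategy 5 would substitute any predictive model for the simple means used here):

import pandas as pd

df = pd.read_csv("customer_sales.csv")  # hypothetical data with missing values

# 1. Ignore the data row: drop rows whose class label is missing
df1 = df.dropna(subset=["class_label"])

# 2. Fill with a global constant
df2 = df.fillna({"office_address": "unknown"})

# 3. Fill with the attribute mean
df3 = df.fillna({"income": df["income"].mean()})

# 4. Fill with the attribute mean of samples in the same class
class_means = df.groupby("class_label")["income"].transform("mean")
df4 = df.assign(income=df["income"].fillna(class_means))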
2.2.1.6 Noisy Data

Noise can be defined as a random error or variance in a measured variable. Due to this randomness it is very difficult to follow a single strategy for removing noise from the data. Real-world data is not always faultless; it can suffer from corruption which may affect the interpretations of the data, the models created from the data, and the decisions made based on the data. Incorrect attribute values may be present for the following reasons:

Faulty data collection instruments
Data entry problems
Duplicate records
Incomplete data
Inconsistent data
Incorrect processing
Data transmission problems
Technology limitations
Inconsistency in naming conventions
Outliers

How to Handle Noisy Data?

The methods for removing noise from data are as follows.

1. Binning: this approach first sorts the data and partitions it into (equal-frequency) bins; one can then smooth the data using bin means, bin medians, bin boundaries, etc.
2. Regression: in this method smoothing is done by fitting the data to regression functions.
3. Clustering: clustering detects and removes outliers from the data.
4. Combined computer and human inspection: in this approach the computer detects suspicious values which are then checked by human experts (e.g., to deal with possible outliers).

These methods are explained in detail as follows.

Binning: a data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For instance, age can be changed to bins such as 20 or under, 21-40, 41-65 and over 65. Binning methods smooth a sorted data set by consulting the values around each value; this is therefore called local smoothing.

Binning Methods

Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. This is the most straightforward method, but outliers may dominate the presentation and skewed data is not handled well.

Equal-depth (frequency) partitioning: divides the range (the values of a given attribute) into N intervals, each containing approximately the same number of samples (elements). It gives good data scaling, but managing categorical attributes can be tricky.

Smoothing by bin means: each bin value is replaced by the mean of the values in the bin.
Smoothing by bin medians: each bin value is replaced by the median of the values in the bin.
Smoothing by bin boundaries: each bin value is replaced by the closest boundary value.

Example. Let the sorted data for price (in dollars) be: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34.

Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34

Smoothing by bin means:
Bin 1: 9, 9, 9, 9 (for example, the mean of 4, 8, 9, 15 is 9)
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
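The binning example above can be reproduced with a short script (plain Python; the rounding of the bin means follows the worked example):

# Sorted prices from the example above
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Partition into equal-frequency (equi-depth) bins
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closest boundary
def smooth_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [smooth_boundaries(b) for b in bins]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]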
Regression: regression is a data mining technique used to fit an equation to a dataset. The simplest form is linear regression, which uses the formula of a straight line (y = b + wx) and determines the suitable values for b and w to predict the value of y based upon a given value of x. More sophisticated techniques, such as multiple regression, permit the use of more than one input variable and allow for the fitting of more complex models, such as a quadratic equation. Regression is described further in a subsequent chapter when discussing prediction.

Clustering: clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. These algorithms automatically partition the data space into a set of regions or clusters. The goal of the process is to find all sets of similar examples in the data in some optimal fashion. Values that fall outside the clusters are outliers.

Combined computer and human inspection: these methods find the suspicious values using computer programs, and the values are then verified by human experts. By this process all outliers are checked.

2.2.1.7 Data Cleaning as a Process

Data cleaning is the process of detecting, diagnosing, and editing data. It is a three-stage method involving a repeated cycle of screening, diagnosing, and editing of suspected data abnormalities. Many data errors are detected incidentally during study activities; however, it is more efficient to discover inconsistencies by actively searching for them in a planned manner. It is not always immediately clear whether a data point is erroneous; many times it requires careful examination. Likewise, missing values require additional checks. Therefore, predefined rules for dealing with errors and with true missing and extreme values are part of good practice. One can monitor for suspect features in survey questionnaires, databases, or analysis data.

In small studies, with the examiner intimately involved at all stages, there may be little or no difference between a database and an analysis dataset. During as well as after treatment, the diagnostic and treatment phases of cleaning need insight into the sources and types of errors at all stages of the study. The data flow concept is therefore crucial in this respect. After measurement, the research data go through repeated steps of being entered into information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarized, and presented. It is essential to understand that errors can occur at any stage of the data flow, including during data cleaning itself. Most of these problems are due to human error. Inaccuracy of a single data point and measurement may be tolerable, and related to the inherent technological error of the measurement device. Therefore the process of data cleaning must focus on those errors that are beyond small technical variations and that form a major shift within or beyond the population distribution. In turn, it must be based on an understanding of technical errors and of the expected ranges of normal values. Some errors are worthy of higher priority, but which ones are most significant is highly study-specific. For instance, in most medical epidemiological studies, errors that need to be cleaned at all costs include missing gender, gender misspecification, birth date or examination date errors, duplications or merging of records, and biologically impossible results. Another example is in nutrition studies, where date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressure or if resources for data cleaning are limited.

2.2.2 Data Integration

This is the process of taking data from one or more sources and mapping it, field by field, onto a new data structure. The idea is to combine data from multiple sources into a coherent form. Various data mining projects require data from multiple sources because:

Data may be distributed over different databases or data warehouses (for example, an epidemiological study that needs information about hospital admissions and car accidents).
Data may be required from different geographic distributions, or there may be a need for historical data (e.g. integrating historical data into a new data warehouse).
There may be a need to enhance the data with additional (external) data, for improving data mining precision.

2.2.2.1 Data Integration Issues

There are a number of issues in data integration. Imagine two database tables, Table 1 and Table 2 (the tables themselves are not reproduced here), that are to be integrated. In integrating these two tables a variety of issues are involved, such as:

1. The same attribute may have different names (for example, Name and Given Name are the same attribute with different names).
2. An attribute may be derived from another (for example, the attribute Age is derived from the attribute DOB).
3. Attributes might be redundant (for example, the attribute PID is redundant).
4. Values in attributes might be different (for example, for PID 4791 the values in the second and third fields differ between the two tables).
5. The same record may appear under different keys (there is a possibility of replication of the same record with different key values).

Therefore schema integration and object matching can be tricky. The question here is how equivalent entities from different sources are matched; this problem is known as the entity identification problem. Conflicts have to be detected and resolved. Integration becomes easier if unique entity keys are available in all the data sets (or tables) to be linked. Metadata can help in schema integration (examples of metadata for each attribute include the name, meaning, data type and range of values permitted for the attribute).
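A minimal sketch of how issues 1, 2 and 4 above might be handled when merging two such tables in pandas (the table fragments, values and column names are hypothetical, chosen to match the attributes mentioned in the list):

import pandas as pd

# Hypothetical fragments of the two tables described above
t1 = pd.DataFrame({"PID": [4791, 4802], "Name": ["A. Kumar", "B. Singh"],
                   "DOB": ["1980-05-01", "1975-11-23"]})
t2 = pd.DataFrame({"PID": [4791, 4802], "Given Name": ["A. Kumar", "B. Singh"],
                   "Age": [35, 44]})

# Issue 1: same attribute, different names -> align the schema
t2 = t2.rename(columns={"Given Name": "Name"})

# Issue 2: Age is derivable from DOB -> derive it once, consistently
ref_date = pd.Timestamp("2019-11-16")  # hypothetical reference date
t1["Age"] = (ref_date - pd.to_datetime(t1["DOB"])).dt.days // 365

# Issue 4: merge on the shared key and flag conflicting values
merged = t1.merge(t2, on="PID", suffixes=("_t1", "_t2"))
conflicts = merged[merged["Age_t1"] != merged["Age_t2"]]
print(conflicts)  # rows where the two sources disagree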
2.2.2.2 Redundancy

Redundancy is another important issue in data integration. Two given attributes (such as DOB and Age, for instance) may be redundant if one is derived from the other attribute or set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the given data sets.

Handling Redundant Data

We can handle data redundancy problems in the following ways:

Use correlation analysis.
Consider different codings / representations (e.g. metric / imperial measures).
Careful (manual) integration of the data can reduce or prevent redundancies (and inconsistencies).
De-duplication (also called internal data linkage): if no unique entity keys are available, analyse the values in attributes to find duplicates.
Process redundant and inconsistent data (easy if the values are the same): delete one of the values, average the values (only for numerical attributes), or take the majority value (if there are more than two duplicates and some values are the same).

Correlation analysis is explained in detail here. Some redundancies can be detected by using correlation analysis (Pearson's product moment coefficient): given two attributes, such analysis can measure how strongly one attribute implies the other. For numerical attributes we can compute the correlation coefficient of two attributes A and B to evaluate the correlation between them. This is given by

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}

where

n is the number of tuples,
\bar{A} and \bar{B} are the respective means of A and B,
\sigma_A and \sigma_B are the respective standard deviations of A and B, and
\sum(a_i b_i) is the sum of the AB cross-product.

a. If r_{A,B} is greater than zero (the coefficient always lies between -1 and +1), then A and B are positively correlated: the values of one attribute increase as the values of the other increase, and the higher the value, the stronger the correlation.
b. If r_{A,B} is equal to zero, it indicates that A and B are independent of each other and there is no correlation between them.
c. If r_{A,B} is less than zero, then A and B are negatively correlated: when the value of one attribute increases, the value of the other attribute decreases. This means that one attribute discourages the other.

It is important to note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily mean that A causes B or that B causes A. For example, in analyzing a demographic database, we may find that attributes representing the number of accidents and the number of car thefts in a region are correlated. This does not mean that one causes the other; both may be related to a third attribute, namely population.
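A short sketch of this redundancy check on two numeric attributes (the data are hypothetical; numpy's corrcoef computes the same Pearson coefficient as the formula above):

import numpy as np

# Hypothetical numeric attributes A and B from an integrated data set
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.1, 2.0, 2.9, 4.2, 5.0])

n = len(A)
# Direct evaluation of the formula: r = (sum(A*B) - n*mean(A)*mean(B)) / (n*std(A)*std(B))
r = (np.sum(A * B) - n * A.mean() * B.mean()) / (n * A.std() * B.std())

print(round(r, 3))                        # close to +1: strong positive correlation
print(round(np.corrcoef(A, B)[0, 1], 3))  # same value via the library routine

# A high |r| suggests one attribute is (nearly) derivable from the other,
# i.e. a candidate redundancy to remove before mining.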
For discrete data, a correlation relationship between two attributes can be discovered by a χ² (chi-square) test. Suppose A has c distinct values a1, a2, ..., ac and B has r distinct values b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Each (Ai, Bj) cell in the table has an observed frequency and an expected frequency, and the χ² value is computed as

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

where O_{i,j} is the observed frequency (i.e. the actual count) of the joint event in cell (i, j) and E_{i,j} is the expected frequency, which can be computed as

E_{i,j} = \frac{\left(\sum_{k=1}^{c} O_{i,k}\right)\left(\sum_{k=1}^{r} O_{k,j}\right)}{N}

where N is the total number of data tuples, \sum_{k=1}^{c} O_{i,k} is the total of row i, and \sum_{k=1}^{r} O_{k,j} is the total of column j; in other words, the expected frequency of a cell is its row total times its column total divided by N. The larger the χ² value, the more likely it is that the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Chi-Square Calculation: An Example

Suppose a group of 1,500 people were surveyed. The gender of each person was noted, and each person was polled on whether their preferred type of reading material was fiction or non-fiction. The observed frequency of each possible joint event is summarized in the following contingency table (the numbers in parentheses are the expected frequencies):

                 Male        Female       Sum (row)
Fiction          250 (90)    200 (360)    450
Non-fiction      50 (210)    1000 (840)   1050
Sum (col.)       300         1200         1500

The expected frequencies are computed from the marginal totals; for example E11 = count(male) × count(fiction) / N = 300 × 450 / 1500 = 90, and so on. The computed statistic is χ² = (250-90)²/90 + (50-210)²/210 + (200-360)²/360 + (1000-840)²/840 = 284.44 + 121.90 + 71.11 + 30.48 ≈ 507.93. For this 2×2 table the degrees of freedom are (2-1)(2-1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the χ² distribution, typically available in any statistics textbook). Since the computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the given group.

Duplication must also be detected at the tuple level. The use of denormalized tables is another source of redundancy, and redundancies may further lead to data inconsistencies (due to updating some copies but not others).

2.2.2.3 Detection and Resolution of Data Value Conflicts

Another significant issue in data integration is the detection and resolution of data value conflicts. For the same entity, attribute values from different sources may differ; for example, weight can be stored in metric units in one source and British imperial units in another. For instance, for a hotel cha
In this stage pre-processing tools may help with data exploration and data discovery tasks. Data processing includes lots of tedious works, Data pre-processing generally consists of Data Cleaning Data Integration Data Transformation And Data Reduction. In this chapter we will study all these data pre-processing activities. 2.1 Data Understanding In Data understanding phase the first task is to collect initial data and then proceed with activities in order to get well known with data, to discover data quality problems, to discover first insight into the data or to identify interesting subset to form hypothesis for hidden information. The data understanding phase according to CRISP model can be shown in following . 2.1.1 Collect Initial Data The initial collection of data includes loading of data if required for data understanding. For instance, if specific tool is applied for data understanding, it makes great sense to load your data into this tool. This attempt possibly leads to initial data preparation steps. However if data is obtained from multiple data sources then integration is an additional issue. 2.1.2 Describe data Here the gross or surface properties of the gathered data are examined. 2.1.3 Explore data This task is required to handle the data mining questions, which may be addressed using querying, visualization and reporting. These include: Sharing of key attributes, for instance the goal attribute of a prediction task Relations between pairs or small numbers of attributes Results of simple aggregations Properties of important sub-populations Simple statistical analyses. 2.1.4 Verify data quality In this step quality of data is examined. It answers questions such as: Is the data complete (does it cover all the cases required)? Is it accurate or does it contains errors and if there are errors how common are they? Are there missing values in the data? If so how are they represented, where do they occur and how common are they? 2.2 Data Preprocessing Data preprocessing phase focus on the pre-processing steps that produce the data to be mined. Data preparation or preprocessing is one most important step in data mining. Industrial practice indicates that one data is well prepared; the mined results are much more accurate. This means this step is also a very critical fro success of data mining method. Among others, data preparation mainly involves data cleaning, data integration, data transformation, and reduction. 2.2.1 Data Cleaning Data cleaning is also known as data cleansing or scrubbing. It deals with detecting and removing inconsistencies and errors from data in order to get better quality data. While using a single data source such as flat files or databases data quality problems arises due to misspellings while data entry, missing information or other invalid data. While the data is taken from the integration of multiple data sources such as data warehouses, federated database systems or global web-based information systems, the requirement for data cleaning increases significantly. This is because the multiple sources may contain redundant data in different formats. Consolidation of different data formats abs elimination of redundant information becomes necessary in order to provide access to accurate and consistent data. Good quality data requires passing a set of quality criteria. Those criteria include: Accuracy: Accuracy is an aggregated value over the criteria of integrity, consistency and density. 
Integrity: Integrity is an aggregated value over the criteria of completeness and validity. Completeness: completeness is achieved by correcting data containing anomalies. Validity: Validity is approximated by the amount of data satisfying integrity constraints. Consistency: consistency concerns contradictions and syntactical anomalies in data. Uniformity: it is directly related to irregularities in data. Density: The density is the quotient of missing values in the data and the number of total values ought to be known. Uniqueness: uniqueness is related to the number of duplicates present in the data. 2.2.1.1 Terms Related to Data Cleaning Data cleaning: data cleaning is the process of detecting, diagnosing, and editing damaged data. Data editing: data editing means changing the value of data which are incorrect. Data flow: data flow is defined as passing of recorded information through succeeding information carriers. Inliers: Inliers are data values falling inside the projected range. Outlier: outliers are data value falling outside the projected range. Robust estimation: evaluation of statistical parameters, using methods that are less responsive to the effect of outliers than more conventional methods are called robust method. 2.2.1.2 Definition: Data Cleaning Data cleaning is a process used to identify imprecise, incomplete, or irrational data and then improving the quality through correction of detected errors and omissions. This process may include format checks Completeness checks Reasonableness checks Limit checks Review of the data to identify outliers or other errors Assessment of data by subject area experts (e.g. taxonomic specialists). By this process suspected records are flagged, documented and checked subsequently. And finally these suspected records can be corrected. Sometimes validation checks also involve checking for compliance against applicable standards, rules, and conventions. The general framework for data cleaning given as: Define and determine error types; Search and identify error instances; Correct the errors; Document error instances and error types; and Modify data entry procedures to reduce future errors. Data cleaning process is referred by different people by a number of terms. It is a matter of preference what one uses. These terms include: Error Checking, Error Detection, Data Validation, Data Cleaning, Data Cleansing, Data Scrubbing and Error Correction. We use Data Cleaning to encompass three sub-processes, viz. Data checking and error detection; Data validation; and Error correction. A fourth improvement of the error prevention processes could perhaps be added. 2.2.1.3 Problems with Data Here we just note some key problems with data Missing data : This problem occur because of two main reasons Data are absent in source where it is expected to be present. Some times data is present are not available in appropriately form Detecting missing data is usually straightforward and simpler. Erroneous data: This problem occurs when a wrong value is recorded for a real world value. Detection of erroneous data can be quite difficult. (For instance the incorrect spelling of a name) Duplicated data : This problem occur because of two reasons Repeated entry of same real world entity with some different values Some times a real world entity may have different identifications. Repeat records are regular and frequently easy to detect. The different identification of the same real world entities can be a very hard problem to identify and solve. 
Heterogeneities: When data from different sources are brought together in one analysis problem heterogeneity may occur. Heterogeneity could be Structural heterogeneity arises when the data structures reflect different business usage Semantic heterogeneity arises when the meaning of data is different n each system that is being combined Heterogeneities are usually very difficult to resolve since because they usually involve a lot of contextual data that is not well defined as metadata. Information dependencies in the relationship between the different sets of attribute are commonly present. Wrong cleaning mechanisms can further damage the information in the data. Various analysis tools handle these problems in different ways. Commercial offerings are available that assist the cleaning process, but these are often problem specific. Uncertainty in information systems is a well-recognized hard problem. In following a very simple examples of missing and erroneous data is shown Extensive support for data cleaning must be provided by data warehouses. Data warehouses have high probability of â€Å"dirty data† since they load and continuously refresh huge amounts of data from a variety of sources. Since these data warehouses are used for strategic decision making therefore the correctness of their data is important to avoid wrong decisions. The ETL (Extraction, Transformation, and Loading) process for building a data warehouse is illustrated in following . Data transformations are related with schema or data translation and integration, and with filtering and aggregating data to be stored in the data warehouse. All data cleaning is classically performed in a separate data performance area prior to loading the transformed data into the warehouse. A large number of tools of varying functionality are available to support these tasks, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain. A data cleaning method should assure following: It should identify and eliminate all major errors and inconsistencies in an individual data sources and also when integrating multiple sources. Data cleaning should be supported by tools to bound manual examination and programming effort and it should be extensible so that can cover additional sources. It should be performed in association with schema related data transformations based on metadata. Data cleaning mapping functions should be specified in a declarative way and be reusable for other data sources. 2.2.1.4 Data Cleaning: Phases 1. Analysis: To identify errors and inconsistencies in the database there is a need of detailed analysis, which involves both manual inspection and automated analysis programs. This reveals where (most of) the problems are present. 2. Defining Transformation and Mapping Rules: After discovering the problems, this phase are related with defining the manner by which we are going to automate the solutions to clean the data. We will find various problems that translate to a list of activities as a result of analysis phase. Example: Remove all entries for J. Smith because they are duplicates of John Smith Find entries with `bule in colour field and change these to `blue. Find all records where the Phone number field does not match the pattern (NNNNN NNNNNN). Further steps for cleaning this data are then applied. Etc †¦ 3. Verification: In this phase we check and assess the transformation plans made in phase- 2. 
Without this step, we may end up making the data dirtier rather than cleaner. Since data transformation is the main step that actually changes the data itself so there is a need to be sure that the applied transformations will do it correctly. Therefore test and examine the transformation plans very carefully. Example: Let we have a very thick C++ book where it says strict in all the places where it should say struct 4. Transformation: Now if it is sure that cleaning will be done correctly, then apply the transformation verified in last step. For large database, this task is supported by a variety of tools Backflow of Cleaned Data: In a data mining the main objective is to convert and move clean data into target system. This asks for a requirement to purify legacy data. Cleansing can be a complicated process depending on the technique chosen and has to be designed carefully to achieve the objective of removal of dirty data. Some methods to accomplish the task of data cleansing of legacy system include: n Automated data cleansing n Manual data cleansing n The combined cleansing process 2.2.1.5 Missing Values Data cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values is one important problem to be addressed. Missing value problem occurs because many tuples may have no record for several attributes. For Example there is a customer sales database consisting of a whole bunch of records (lets say around 100,000) where some of the records have certain fields missing. Lets say customer income in sales data may be missing. Goal here is to find a way to predict what the missing data values should be (so that these can be filled) based on the existing data. Missing data may be due to following reasons Equipment malfunction Inconsistent with other recorded data and thus deleted Data not entered due to misunderstanding Certain data may not be considered important at the time of entry Not register history or changes of the data How to Handle Missing Values? Dealing with missing values is a regular question that has to do with the actual meaning of the data. There are various methods for handling missing entries 1. Ignore the data row. One solution of missing values is to just ignore the entire data row. This is generally done when the class label is not there (here we are assuming that the data mining goal is classification), or many attributes are missing from the row (not just one). But if the percentage of such rows is high we will definitely get a poor performance. 2. Use a global constant to fill in for missing values. We can fill in a global constant for missing values such as unknown, N/A or minus infinity. This is done because at times is just doesnt make sense to try and predict the missing value. For example if in customer sales database if, say, office address is missing for some, filling it in doesnt make much sense. This method is simple but is not full proof. 3. Use attribute mean. Let say if the average income of a a family is X you can use that value to replace missing income values in the customer sales database. 4. Use attribute mean for all samples belonging to the same class. Lets say you have a cars pricing DB that, among other things, classifies cars to Luxury and Low budget and youre dealing with missing values in the cost field. 
2.2.1.6 Noisy Data

Noise can be defined as a random error or variance in a measured variable. Because of this randomness it is very difficult to follow a single strategy for removing noise from data. Real-world data are not always faultless; they can suffer from corruption that affects the interpretation of the data, the models created from the data, and the decisions made based on them. Incorrect attribute values may be present for reasons such as:
- faulty data collection instruments;
- data entry problems;
- duplicate records;
- incomplete or inconsistent data;
- incorrect processing;
- data transmission problems;
- technology limitations;
- inconsistent naming conventions;
- outliers.

How to handle noisy data? The main methods are:
1. Binning: first sort the data and partition it into (equal-frequency) bins, then smooth each bin using bin means, bin medians, or bin boundaries.
2. Regression: smooth by fitting the data to regression functions.
3. Clustering: detect and remove outliers from the data.
4. Combined computer and human inspection: the computer detects suspicious values, which are then checked by human experts (this approach deals with possible outliers, for example).

These methods are explained in more detail below.

Binning is a data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For instance, age can be mapped to bins such as 20 or under, 21-40, 41-65, and over 65. Binning methods smooth a sorted data set by consulting the values around each value, which is why this is called local smoothing. Consider the binning methods and the example below.

Binning methods:
- Equal-width (distance) partitioning divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N. This is the most straightforward method, but outliers may dominate the result and skewed data are not handled well.
- Equal-depth (frequency) partitioning divides the range (the values of a given attribute) into N intervals, each containing approximately the same number of samples. It gives good data scaling, but managing categorical attributes can be tricky.
- Smoothing by bin means: each value in a bin is replaced by the mean of the bin's values.
- Smoothing by bin medians: each value in a bin is replaced by the median of the bin's values.
- Smoothing by bin boundaries: each value in a bin is replaced by the closest boundary value.

Example. Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34.
- Partition into equal-frequency (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
- Smoothing by bin means (for example, the mean of 4, 8, 9, 15 is 9):
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
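The equal-frequency binning and the two smoothing variants in the example can be reproduced with a few lines of plain Python; the price list is the one given above.

    # Sorted prices from the example above.
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    n_bins = 3
    size = len(prices) // n_bins          # 4 values per equal-frequency bin
    bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

    # Smoothing by bin means: replace every value by its bin's (rounded) mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: replace every value by the closer of min/max.
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]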
Regression: regression is a data mining technique used to fit an equation to a dataset. The simplest form is linear regression, which uses the formula of a straight line (y = b + wx) and determines the suitable values for b and w to predict the value of y from a given value of x. More sophisticated techniques, such as multiple regression, permit the use of more than one input variable and allow more complex models, such as a quadratic equation, to be fitted. Regression is described further in a subsequent chapter when discussing prediction.
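As a small illustration of fitting y = b + wx, the sketch below uses NumPy's polyfit on a handful of invented points; it is only meant to show the mechanics, not a full treatment of regression.

    import numpy as np

    # Invented points lying roughly on y = 2 + 3x.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

    # Degree-1 polyfit returns the slope w and intercept b of y = b + w*x.
    w, b = np.polyfit(x, y, 1)

    print(f"fitted line: y = {b:.2f} + {w:.2f}x")
    print("prediction at x = 6:", b + w * 6.0)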
Clustering: clustering is a method of grouping data into different groups so that the data in each group share similar trends and patterns. Clustering algorithms constitute a major class of data mining algorithms; they automatically partition the data space into a set of regions, or clusters. The goal of the process is to find all sets of similar examples in the data in some optimal fashion. A typical illustration shows a few clusters; values that fall outside every cluster are outliers.

Combined computer and human inspection: these methods find suspicious values using computer programs, which are then verified by human experts, so that all flagged outliers are checked.

2.2.1.7 Data Cleaning as a Process

Data cleaning is the process of detecting, diagnosing, and editing data. It is a three-stage method involving repeated cycles of screening, diagnosing, and editing suspected data abnormalities. Many data errors are detected incidentally during study activities, but it is more efficient to discover inconsistencies by actively searching for them in a planned manner. It is not always immediately clear whether a data point is erroneous; many cases require careful examination. Likewise, missing values require additional checks. Predefined rules for dealing with errors, true missing values, and extreme values are therefore part of good practice. One can monitor for suspect features in survey questionnaires, databases, or analysis data. In small studies, where the examiner is intimately involved at all stages, there may be little or no difference between a database and an analysis dataset.

During as well as after treatment of the data, the diagnostic and treatment phases of cleaning require insight into the sources and types of errors at all stages of the study, which is why the concept of data flow is crucial. After measurement, research data go through repeated steps: they are entered onto information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarized, and presented. It is essential to understand that errors can occur at any stage of this flow, including during data cleaning itself; most of these problems are due to human error.

Inaccuracy of a single data point or measurement may be tolerable and related to the inherent technical error of the measurement device. The process of data cleaning must therefore focus on those errors that are beyond small technical variations and that form a major shift within or beyond the population distribution. In turn, it must be based on an understanding of technical errors and of the expected ranges of normal values. Some errors deserve higher priority, but which ones are most significant is highly study-specific. For instance, in most medical epidemiological studies, errors that need to be cleaned at all costs include missing gender, gender misspecification, birth date or examination date errors, duplication or merging of records, and biologically impossible results. Another example: in nutrition studies, date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressure or if resources for data cleaning are limited.

2.2.2 Data Integration

Data integration is the process of taking data from one or more sources and mapping it, field by field, onto a new data structure. The idea is to combine data from multiple sources into a coherent form. Many data mining projects require data from multiple sources because:
- data may be distributed over different databases or data warehouses (for example, an epidemiological study that needs information about both hospital admissions and car accidents);
- data may be required from different geographic locations, or historical data may be needed (e.g. integrating historical data into a new data warehouse);
- the data may need to be enhanced with additional (external) data (to improve data mining precision).

2.2.2.1 Data Integration Issues

A number of issues arise in data integration. Imagine integrating two database tables (Table 1 and Table 2) describing the same customers. Typical issues include:
1. The same attribute may have different names (for example, Name and Given Name may be the same attribute under different names).
2. An attribute may be derived from another (for example, Age may be derived from DOB).
3. Attributes may be redundant (for example, a PID attribute may appear in both tables).
4. Values of the same attribute may differ (for example, for PID 4791 the values of some fields may differ between the two tables).
5. The same record may be duplicated under different keys.

Schema integration and object matching can therefore be tricky. The question of how equivalent entities from different sources are matched is known as the entity identification problem. Conflicts have to be detected and resolved; integration becomes easier if unique entity keys are available in all the data sets (or tables) to be linked. Metadata can help with schema integration (metadata for each attribute may include its name, meaning, data type, and the range of values permitted).

2.2.2.2 Redundancy

Redundancy is another important issue in data integration. Two given attributes (such as DOB and Age) may be redundant if one can be derived from the other attribute or from a set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the given data sets.

Handling redundant data. Redundancy problems can be handled in the following ways (a small code sketch of the last two options follows this list):
- Use correlation analysis.
- Consider different codings/representations (e.g. metric versus imperial measures).
- Careful (manual) integration of the data can reduce or prevent redundancies (and inconsistencies).
- De-duplication (also called internal data linkage), used when no unique entity keys are available, analyses the values in attributes to find duplicates.
- Process redundant and inconsistent data (easy if the values are the same): delete one of the values, average the values (only for numerical attributes), or take the majority value (if there are more than two duplicates and some values agree).
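A rough pandas sketch of the last two options (de-duplication and averaging conflicting numerical values) is shown below. The table, the name column, and the values are hypothetical; only the key 4791 echoes the PID example mentioned earlier.

    import pandas as pd

    # Hypothetical integrated table with one entity recorded twice.
    people = pd.DataFrame({
        "pid":    [4791, 4791, 5012],
        "name":   ["A. Khan", "A. Khan", "B. Lee"],
        "weight": [70.0, 72.0, 65.0],
    })

    # Exact duplicates (all columns identical) can simply be dropped.
    deduplicated = people.drop_duplicates()

    # Conflicting numerical values for the same key can be averaged instead.
    reconciled = people.groupby("pid", as_index=False).agg(
        name=("name", "first"),
        weight=("weight", "mean"),
    )

    print(deduplicated)
    print(reconciled)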
Correlation analysis (also called Pearson's product-moment coefficient) can detect some redundancies directly. Given two attributes, such analysis measures how strongly one attribute implies the other. For numerical attributes we can compute the correlation coefficient of two attributes A and B to evaluate the correlation between them:

    r(A,B) = (Σ(AB) - n·Ā·B̄) / (n·σA·σB)

where
- n is the number of tuples,
- Ā and B̄ are the respective means of A and B,
- σA and σB are the respective standard deviations of A and B, and
- Σ(AB) is the sum of the AB cross-product.

The value of r(A,B) lies between -1 and +1:
a. If r(A,B) is greater than zero, A and B are positively correlated; the higher the value, the stronger the correlation.
b. If r(A,B) is equal to zero, A and B are independent of each other and there is no correlation between them.
c. If r(A,B) is less than zero, A and B are negatively correlated: when the value of one attribute increases, the value of the other decreases, i.e. one attribute discourages the other.

It is important to note that correlation does not imply causality. If A and B are correlated, this does not necessarily mean that A causes B or that B causes A. For example, in analysing a demographic database we may find that the attributes representing the number of accidents and the number of car thefts in a region are correlated. This does not mean that one causes the other; both may be related to a third attribute, namely population.
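The coefficient can be checked numerically. The sketch below computes it both from the formula above and with NumPy's built-in corrcoef, using two invented attributes.

    import numpy as np

    # Invented values of two numerical attributes A and B over the same tuples.
    A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    B = np.array([1.1, 2.0, 2.9, 4.2, 5.0])

    # r(A,B) = (sum(A*B) - n * mean(A) * mean(B)) / (n * std(A) * std(B))
    n = len(A)
    r_manual = (np.sum(A * B) - n * A.mean() * B.mean()) / (n * A.std() * B.std())

    # The same value via NumPy's correlation matrix.
    r_numpy = np.corrcoef(A, B)[0, 1]

    print(round(r_manual, 4), round(r_numpy, 4))  # both close to +1

A value this close to +1 would flag the two attributes as strongly positively correlated, and hence as candidates for redundancy.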
For discrete (categorical) data, a correlation relationship between two attributes A and B can be discovered by a χ² (chi-square) test. Suppose A has c distinct values a1, a2, ..., ac and B has r distinct values b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let o(i,j) be the observed frequency (the actual count) of the joint event (A = ai, B = bj), recorded in the corresponding cell of the table. The χ² statistic is computed as

    χ² = Σi Σj (o(i,j) - e(i,j))² / e(i,j)

where the sum runs over all cells and e(i,j) is the expected frequency of the joint event,

    e(i,j) = count(A = ai) × count(B = bj) / N

with N the total number of data tuples, count(A = ai) the number of tuples having value ai for A, and count(B = bj) the number of tuples having value bj for B. The larger the χ² value, the more likely the variables are related. The cells that contribute most to the χ² value are those whose actual count differs most from the expected count.

Chi-square calculation: an example. Suppose a group of 1,500 people was surveyed. The gender of each person was noted, and each person was asked whether their preferred type of reading material was fiction or non-fiction. The observed frequency of each joint event is summarized in the following contingency table (the numbers in parentheses are the expected frequencies):

                   male        female       Total (row)
    fiction        250 (90)    200 (360)      450
    non-fiction     50 (210)  1000 (840)     1050
    Total (col.)   300        1200           1500

The expected frequencies are computed from the marginal totals, for example e(1,1) = count(male) × count(fiction) / N = 300 × 450 / 1500 = 90, and so on. The χ² value is then

    χ² = (250-90)²/90 + (50-210)²/210 + (200-360)²/360 + (1000-840)²/840
       ≈ 284.44 + 121.90 + 71.11 + 30.48 ≈ 507.93.

Because the table is 2×2, the degrees of freedom are (2-1)(2-1) = 1. For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from a table of upper percentage points of the χ² distribution, available in any statistics textbook). Since the computed value is far above this threshold, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the surveyed group.

Duplication must also be detected at the tuple level. The use of denormalized tables is another source of redundancy, and redundancies may further lead to data inconsistencies (when some copies are updated but not others).

2.2.2.3 Detection and Resolution of Data Value Conflicts

Another significant issue in data integration is the detection and resolution of data value conflicts: for the same real-world entity, attribute values from different sources may differ. For example, weight may be stored in metric units in one source and in British imperial units in another. Similarly, for a hotel chain, room prices in different cities may be quoted in different currencies and may include different services and taxes.

Wednesday, November 13, 2019

Comparing The Buried Life and A Room Of One's Own

Comparing The Buried Life and A Room Of One's Own

Victorian writers did ask difficult and unsettling questions, and the modern writers continued on with the quest to display these unsettling thoughts and feelings in their works even more so. You can see this continue easily from "The Buried Life" to the ideas of "A Room Of One's Own."

In "The Buried Life," Arnold questions why men in society bury their emotions and innermost thoughts from one another like they are the only ones with these qualities, even though every man has them: "I knew the mass of men concealed their thoughts, for fear that if they revealed they would by other men be met with blank indifference, or with blame reproved; I knew they lived and moved tricked in disguises, alien to the rest of men, and alien to themselves--and yet the same heart beats in every human breast" (p. 2021). He doesn't understand why this is the case, and believes humanity would be better if we let this buried life out of its cage, freeing us to be our true selves. The way to reach this goal is through open love by a fellow human being: "When a beloved hand is placed on ours...the heart lies plain, and what we mean, we say" (p. 2201).

In "A Room Of One's Own," Woolf questions society's view on how geniuses of art are created. She shows that this is a natural gift, but it is one that can either be stifled or let prosper and grow, depending on how the members of society rule and treat the artist with the gift. She says that these artists need to be allowed to garner knowledge in order to feed their ideas for their art, and they must be allowed to be free in mind and spirit so that they can create their masterpieces: "The mind of an artist, in order to achieve the prodigious effort of freeing whole and entire the work that is in him, must be incandescent...There must be no obstacle in it, no foreign matter unconsumed" (p. 2472).

As you can see, both of these works question society in the matter of chaining up its members' true feelings and ideas.

Monday, November 11, 2019

Doctor Faustus' Damnation Essay

Doctor Faustus chose to be damned; although the evil spirits may have influenced him, Faustus always wanted wealth and honor. Faustus was very intelligent, but with all the knowledge he had pertaining to logic, medicine, and law, it was never enough for him. In his quest for all that he could know, he would never be satisfied unless he was a magician of the black arts. The damnation of Faustus's soul was his own doing; it is exactly what he wanted. Only by selling his soul to Lucifer could Faustus obtain all that he desired: ultimate knowledge.

The beginning of the play shows that Doctor Faustus is already interested in the black arts and magicians. "These metaphysics of magicians / and necromantic books are heavenly! / Lines, circles, schemes, letters, and characters! / Ay, these are those Faustus most desires." (lines 49-52). Faustus has an undying need for knowledge that he can only get through the means of selling his soul. Obviously Faustus had no hesitation when he summoned Mephastophilis for the first time and demanded that he be his servant for twenty-four years. This shows that Faustus does not care what he must do to become an honored and wealthy person.

However, the good angel and the evil angel appear to him and try to influence his deal with Lucifer. Faustus has indecisive thoughts about whether he has done the right thing. "Ah Christ my Savior! seek to save / Distressed Faustus' soul!" (line 256). The good angel is trying to tell Faustus that he can still repent and his soul will be saved, but the evil angel is reminding him that if he stays with the deal he made, he will be wealthy and honorable. Faustus only considers repenting for a moment and then disregards it. "O this feeds my soul!" (line 330).

Through his own thoughts and free will, Doctor Faustus brought the damnation onto himself. He had the opportunity to repent more than once, but even then that wasn't enough to make him see his fate. Faustus was not a sympathetic figure; he was simply out to do whatever was necessary for his own personal gain. His dearest friends, the German Valdes and Cornelius, also greatly helped Faustus along his journey to damnation.

Friday, November 8, 2019

Acct. Term Paper

Acct. Term Paper

Kelly Okamura
U57912123
SM 323 (A4)
Core Pre-Assignment

Part I

Health Consciousness

We live in an entirely progressive generation where we are bombarded by a plethora of newly emerging social trends. In fact, it seems that as soon as we are made aware of what is currently "trending," a new social trend is discovered. However, what has seemed to steadfastly hold as trends are the fitness and health related ones. A few years ago, Atkins and the South Beach Diet were the most promising diet regimens. Now, we hear the terms "paleo" and "raw vegan" being thrown around daily. While these diet regimens have certainly made some headway towards more restrictive and extreme measures, the underlying purpose remains unchanged; that is, Americans who follow these dietary trends do so as a means of weight loss or maintaining a healthy body weight. As of this year, as many as two-thirds of Americans are categorized as overweight, whereby a third are considerably obese. We, as Americans, are becoming increasingly aware of the dangers of maintaining unhealthy body weights, thereby offsetting a surge of diet regimens, workout routines, and the like; each "scientifically proven" to restore your health and bring you back to a healthy weight level. For anyone who has gone forth with one of these health trends, they can attest to the fact that it is not, in fact, an easy feat. Had it been so, we would not be faced with this sort of epidemic of American obesity. That being said, weight loss is not something that simply happens at the snap of the fingers or on a whim: more so if our genetics are not wired in such a way. Bearing this in mind, I believe it is safe to conclude that the social trend of health consciousness and diet will remain intact for years to come. Something that many Americans consistently struggle with is the ability to exercise self-control, specifically when it comes to food. We've all been there: lurking in front of the snack cabinet, making compromises with ourselves and promising that we'll stop after one cookie. Yet in most cases, stopping after just one morsel is simply wishful thinking. The product that I have thought up for the market for the health conscious is called the Moral Support. In essence, it is a contraption that helps to promote a healthier lifestyle by exercising self-control and portion control for those who need the assistance of an external factor. The Moral Support is a programmable food dispenser, whereby users would put their favorite junk foods in and set a time and day whence the Moral Support will dispense a single serving of the food item. Upon dispensing the snack, the container will lock itself so as to prevent binge eating. It would also have an air tight seal on the lid, which would help keep the food items from being exposed to excessive moisture and becoming stale.

Grooming/Hygiene

Many of us start our days off by washing our faces as a means to wake up. For those of us who do so, we have all experienced the inevitable annoyance that comes from washing our faces: the trickle of water that streams down our elbows, wetting our shirtsleeves and our countertops. However small this annoyance may be, it is still an undesirable experience the first thing in the morning. It doesn't make sense, though, to do away with proper hygiene because of this. A product that may appeal to those who run into this problem is some sort of wristband or wrist attachment that catches water right at the bottom of the palm.
The market for this is promising, as nearly everyone undertakes this simple act of good hygiene, and this product may also be used for dishwashing.

Part II

Williams Sonoma
Location: 100 Huntington Ave, #9C Boston, MA 02116
Phone: (617) 262-3080
Clerk: Rachel
Date: 08/01/2013

Many of the clerks at Williams Sonoma were stuck in the same conundrum after I had approached them with this pre-assignment: most clerks were used to being asked very

Wednesday, November 6, 2019

Morin Surname Meaning and Family History

Morin Surname Meaning and Family History

The Morin surname derives from the Old French morin, a diminutive of the name More, meaning dark and swarthy [as a moor]. It may also have originated as a topographical surname for one who lived on or near a moor. The Morin surname could also possibly originate as an adaptation of Irish surnames such as O'Morahan and O'Moran, or as a patronymic surname meaning the son of Maurice.

Surname Origin: French

Alternate Surname Spellings: MOREN, MORRIN, MORREN, MORINI, MORAN, O'MORAN, MURRAN, MORO

Famous People with the Morin Surname
- Jean-Baptiste Morin - French mathematician, astrologer, and astronomer
- Jean-Baptiste Morin - French composer
- Arthur Morin - French physicist
- James C. Morin - American Pulitzer Prize-winning editorial cartoonist
- René Morin - head of the Canadian Broadcasting Corporation during World War II
- Jean Morin - French Baroque artist
- Lee Morin - American astronaut

Where is the Morin Surname Most Common?

The Morin surname, according to surname distribution information from Forebears, is the 3,333rd most common surname in the world. It is most commonly found today in Canada, where it ranks as the 24th most common surname in the country. It is also very prevalent in France (ranked 47th) and the Seychelles (97th). WorldNames PublicProfiler indicates the Morin surname is most common in France, particularly in the regions of Poitou-Charentes, Basse-Normandie, Bretagne, Haute-Normandie, Centre, Pays-de-la-Loire, and Bourgogne. It is also fairly prevalent in Canada, particularly in the Northwest Territories, as well as Maine and New Hampshire in the United States.

Genealogy Resources for the Surname Morin

Morin Family Crest - It's Not What You Think: Contrary to what you may hear, there is no such thing as a Morin family crest or coat of arms for the Morin surname. Coats of arms are granted to individuals, not families, and may rightfully be used only by the uninterrupted male-line descendants of the person to whom the coat of arms was originally granted.

MORIN Family Genealogy Forum: This free message board is focused on descendants of Morin ancestors around the world. Search the forum for posts about your Morin ancestors, or join the forum and post your own queries.

FamilySearch - MORIN Genealogy: Explore over 2.4 million results from digitized historical records and lineage-linked family trees related to the Morin surname on this free website hosted by the Church of Jesus Christ of Latter-day Saints.

MORIN Surname Mailing List: Free mailing list for researchers of the Morin surname and its variations; includes subscription details and a searchable archive of past messages.

GeneaNet - Morin Records: GeneaNet includes archival records, family trees, and other resources for individuals with the Morin surname, with a concentration on records and families from France and other European countries.

The Morin Genealogy and Family Tree Page: Browse genealogy records and links to genealogical and historical records for individuals with the Morin surname from the website of Genealogy Today.

Genealogy of Canada: Morin Family Tree: A collection of links and information for Morin ancestors shared by researchers.

Ancestry.com: Morin Surname: Explore over 1.2 million digitized records and database entries, including census records, passenger lists, military records, land deeds, probates, wills and other records for the Morin surname on the subscription-based website, Ancestry.com.

References: Surname Meanings and Origins
- Cottle, Basil. Penguin Dictionary of Surnames. Baltimore, MD: Penguin Books, 1967.
- Dorward, David. Scottish Surnames. Collins Celtic (Pocket edition), 1998.
- Fucilla, Joseph. Our Italian Surnames. Genealogical Publishing Company, 2003.
- Hanks, Patrick and Flavia Hodges. A Dictionary of Surnames. Oxford University Press, 1989.
- Hanks, Patrick. Dictionary of American Family Names. Oxford University Press, 2003.
- Reaney, P.H. A Dictionary of English Surnames. Oxford University Press, 1997.
- Smith, Elsdon C. American Surnames. Genealogical Publishing Company, 1997.

Monday, November 4, 2019

Creative writing Essay Example | Topics and Well Written Essays - 1500 words

Creative writing - Essay Example

The introduction of the essay appropriately identifies the features and elements of the disease in question, GSD. The authors of the article appropriately establish the role of the disease in the context of the community and society at hand. The sociological and biological elements of the disease are discussed thoroughly and three important variables are introduced. These issues are discussed critically and appropriately through the evaluation of existing data and other scientific journals. This is authoritative because the utilisation of secondary sources builds on credible and established sources, and it is applied appropriately to provide important guidance to existing research (MacFarlene, et al., 2014). The critical review of concepts and theories in the introduction gives way to the formulation of a hypothesis. A hypothesis is a tentative statement that is tested for its truthfulness or falsity in a research study (Lam, 2013). In this paper, the writer seems to make a very vague statement which does not provide a very strong hypothesis that can be used as a basis for proper theorisation. It states that "There will be a significant association between selected demographic variables and risk factors of cholelithiasis". In simple terms, the study is to evaluate the relationship between demographic variables and risks of GSD. The independent variables are the demographic variables whilst the dependent variables are the risks of GSD. However, they are not clearly defined and aligned appropriately, showing some tendencies of randomness and arbitrariness in the eventual conclusion. The study utilises a cross-sectional study method. A cross-sectional study is a descriptive study in which disease and exposure statuses are measured simultaneously in a given population (Kern, et al., 2013). This comes with some inherent

Saturday, November 2, 2019

Fallow the instrcsion Case Study Example | Topics and Well Written Essays - 500 words

Fallow the instrcsion - Case Study Example

Based on the results of the ratio analysis, Seward Inc. is a weak player in the international trade finance market due to the loss made from its trading activities. After having realized net sales of $4,500, Seward Inc. made a net income of $315 despite the $1,700 gross profit realized before deducting corresponding expenses. Unfortunately, net income is prone to criticism from managers as it increases with earnings gained from discontinued operations. Investors ought to focus on measures such as cash flows, sales, or profits before considering interest and taxes. Efficiency in a company's operation shows through total asset turnover, fixed asset turnover, and equity turnover. Total asset turnover is a ratio used to measure the ability of a business to generate sales given its investment in total assets. The rate tends to be lower in capital-intensive businesses compared to non-capital-intensive businesses. A firm is said to be efficient if it meets a total asset turnover of 1 and above. Having a total asset turnover of 1.6, Seward Inc. is a profitable company. On the other hand, liquidity is the ability of a business to cater adequately for its financial obligations as they fall due. The current ratio is the best liquidity determinant, followed by the acid-test ratio. The industry recommends a current ratio of 1 and above, and the rate increases with the financial position of the company. Seward Inc. is in a sound financial position, as its current ratio of three indicates that it can pay its short-term obligations. Damodaran (2012) acknowledges operating leverage as an indicator of the change in operating income caused by a change in sales. A leverage ratio is any ratio used in calculating the company's financial leverage to know how it can meet its financial obligations. An operating profit margin of 8.0 percent indicates better performance of the firm. Seward Inc. is capable of meeting its short-term obligations due to its current ratio of 3.0 and acid-test ratio of 1.5. The rates