Gorbachev short course on methods of mathematical statistics. Solving problems in mathematical statistics. Methods of mathematical statistics

“Some people think they are always right. Such people could neither be good scientists nor have any interest in statistics... Chance was brought down from heaven to earth, where it became part of the world of science.” (Diamand S.)

“Chance is only the measure of our ignorance. Random phenomena, if we define them, will be those whose laws we do not know.” (H. Poincaré, “Science and Hypothesis”)

“Thank goodness. Isn't it the case
Always on par with the immutable...
Chance often rules the event,
Generates both joy and pain.
And life sets a task before us:
How to comprehend the role of chance"
(from the book “Mathematics studies randomness” by B.A. Kordemsky)

The world itself is governed by laws: this is how we usually regard it when we study physics, chemistry and so on. Yet nothing happens without the intervention of chance, which arises under the influence of unstable, incidental causal connections that change the course of a phenomenon or experiment when it is repeated. A “random effect” arises with its own inherent regularity of “hidden predetermination”, i.e. chance itself tends toward a law-governed outcome.

Mathematicians consider random events only in the dilemma “to be or not to be” - whether it will happen or not.

Definition. A branch of applied mathematics that studies the quantitative characteristics of mass random events or phenomena is called mathematical statistics.

Definition. The combination of elements of probability theory and mathematical statistics is called stochastics.

Definition. Stochastics is a branch of mathematics that arose and is developing in close connection with practical human activity. Today, elements of stochastics are included in mathematics for everyone, becoming a new, important aspect of mathematical and general education.

Definition. Mathematical statistics is the science of mathematical methods for the systematization, processing and use of statistical data for scientific and practical conclusions.

Let's talk about this in more detail.

The generally accepted view now is that mathematical statistics is the science of general methods for processing experimental results. In solving these problems, a further question arises: what properties must an experiment have for the judgments based on it to be correct? Mathematical statistics thus becomes, in part, the science of experimental design.

“The meaning of the word ‘statistics’ has undergone significant changes over the past two centuries,” write the well-known modern scholars Hodges and Lehmann. “The word ‘statistics’ has the same root as the word ‘state’ and originally meant the art and science of government: the first teachers of statistics in the universities of 18th-century Germany would today be called specialists in the social sciences. Because government decisions are to some extent based on data about population, industry, etc., statisticians naturally began to take an interest in such data, and gradually the word ‘statistics’ came to mean the collection of data about the population and the state, and then the collection and processing of data in general. There is no point in extracting data unless something useful comes from it, and statisticians naturally became involved in interpreting the data.”

The modern statistician studies methods by which inferences can be made about a population from data typically obtained from a sample of the “population.”

Definition. A statistician is a person who works in the science of mathematical methods for the systematization, processing and use of statistical data for scientific and practical conclusions.

Mathematical statistics arose in the 17th century and developed in parallel with probability theory. Its further development (second half of the 19th and early 20th centuries) is due, first of all, to P.L. Chebyshev, A.A. Markov, A.M. Lyapunov, C.F. Gauss, A. Quetelet, F. Galton, K. Pearson and others. In the 20th century, the most significant contributions to mathematical statistics were made by A.N. Kolmogorov, V.I. Romanovsky, E.E. Slutsky, N.V. Smirnov and B.V. Gnedenko, as well as by the English scientists Student, R. Fisher and E. Pearson and the American scientists J. Neyman and A. Wald.

Problems of mathematical statistics and the meaning of error in the world of science

The establishment of patterns to which mass random phenomena are subject is based on the study of statistical data from observational results using the methods of probability theory.

The first task of mathematical statistics is to indicate ways of collecting and grouping statistical information obtained as a result of observations or as a result of specially designed experiments.

The second task of mathematical statistics is to develop methods for analyzing statistical data depending on the objectives of the study.

Modern mathematical statistics is developing ways to determine the number of necessary tests before the start of the study (experiment planning) and during the study (sequential analysis). It can be defined as the science of decision making under uncertainty.

Briefly, we can say that the task of mathematical statistics is to create methods for collecting and processing statistical data.

When studying a mass random phenomenon, it is assumed that all tests are carried out under the same conditions, i.e. the group of main factors that can be taken into account (measured) and that significantly affect the test result is kept as constant as possible.

Random factors distort the result that would have been obtained if only the main factors were present, making it random. The deviation of the result of each test from the true one is called observation error, which is a random variable. It is necessary to distinguish between systematic and random errors.

A scientific experiment is as unthinkable without error as an ocean without salt. Any flow of facts that adds to our knowledge brings some kind of error. According to a well-known saying, in life most people cannot be sure of anything except death and taxes, and the scientist adds: “And the errors of experience.”

A statistician is a “bloodhound” who hunts for error. Statistics is a tool for detecting error.

The word “error” does not mean a simple “miscalculation”. The consequences of a miscalculation are a small and relatively uninteresting source of experimental error.

Indeed, our instruments break; our eyes and ears can deceive us; our measurements are never completely accurate, sometimes even our arithmetic calculations are erroneous. An experimental error is something more significant than an inaccurate tape measure or an optical illusion. And since the most important job of statistics is to help scientists analyze the error of an experiment, we must try to understand what an error really is.

Whatever problem a scientist works on, it will certainly turn out to be more complex than he would like. Let's say he measures radioactive fallout at different latitudes. Results will depend on the altitude of where samples are collected, the amount of local rainfall and the altitude of cyclones over a wider area.

Experimental error is an integral part of any truly scientific experiment.

The same result can be error or information, depending on the problem and the point of view. If a biologist wishes to investigate how changes in nutrition affect growth, then differences in hereditary constitution are a source of error; if he studies the relationship between heredity and growth, the source of error will be differences in nutrition. If a physicist wants to study the relationship between electrical conductivity and temperature, differences in the density of the conducting material are a source of error; if he studies the relationship between this density and electrical conductivity, temperature changes will be a source of error.

This use of the word error may seem dubious, and it might be preferable to say that the effects obtained are confounded by “unintended” or “undesirable” influences. We design an experiment to study known influences, but random factors that we cannot predict or analyze skew the results by adding their own effects.

The difference between planned effects and effects due to random causes is like the difference between the movements of a ship at sea, sailing along a certain course, and a ship drifting aimlessly under the will of changing winds and currents. The movement of the second vessel can be called random movement. It is possible that this ship may arrive at some port, but it is more likely that it will not arrive at any specific place.

Statisticians use the word “random” to denote a phenomenon whose outcome at the next moment in time is completely impossible to predict.

The error caused by effects not foreseen in the experiment is sometimes systematic rather than random.

Systematic error is more misleading than random error. Interference coming from another radio station can create a systematic musical accompaniment that you can sometimes predict if you know the tune. But this “accompaniment” may be the reason why we may make an incorrect judgment about the words or the music of the program we are trying to hear.

However, the discovery of a systematic error often leads us to the trail of a new discovery. Knowing how random errors occur helps us detect systematic errors and therefore eliminate them.

The same nature of reasoning is common in our everyday affairs. How often do we notice: “This is not an accident!” Whenever we can say this, we are on the path to discovery.

For example, A.L. Chizhevsky, analyzing historical processes (increased mortality, epidemics, outbreaks of wars, great migrations of peoples, sudden changes of climate, etc.), discovered a relationship between these seemingly unrelated processes and periods of solar activity, which has cycles of 11 years and 33 years.

Definition. A systematic error is an error that is repeated and is the same for all tests. It is usually associated with improper conduct of the experiment.

Definition. Random errors are errors that arise under the influence of random factors and vary randomly from experiment to experiment.

Typically, the distribution of random errors is symmetrical about zero, from which an important conclusion follows: in the absence of systematic errors, the true test result is the mathematical expectation of a random variable, the specific value of which is fixed in each test.
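The difference between the two kinds of error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value, the size of the constant shift and the normal shape of the random error are all hypothetical choices made for illustration, not data from the text.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

TRUE_VALUE = 10.0        # hypothetical "true" test result
SYSTEMATIC_SHIFT = 0.5   # hypothetical constant bias, e.g. a miscalibrated instrument
N_TESTS = 10_000         # number of repeated tests


def measure(shift):
    """One observation: true value + constant shift + random error symmetric about zero."""
    return TRUE_VALUE + shift + random.gauss(0.0, 1.0)


unbiased = [measure(0.0) for _ in range(N_TESTS)]
biased = [measure(SYSTEMATIC_SHIFT) for _ in range(N_TESTS)]

mean_unbiased = sum(unbiased) / N_TESTS
mean_biased = sum(biased) / N_TESTS

# Random errors average out; the systematic shift does not.
print(f"random errors only:    mean = {mean_unbiased:.3f}")
print(f"with systematic error: mean = {mean_biased:.3f}")
```

With random errors alone, the average of many tests settles near the true value; with a systematic error, it settles near the true value plus the shift, however many tests are made.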

The objects of study in mathematical statistics can be qualitative or quantitative characteristics of the phenomenon or process being studied.

In the case of a qualitative characteristic, the number of occurrences of this characteristic in the series of experiments considered is counted; this number is the (discrete) random variable being studied. Examples of qualitative characteristics include defects in a finished part, demographic data, etc. If the characteristic is quantitative, then direct or indirect measurements are made in the experiment by comparison with some standard (a unit of measurement) using various measuring instruments. For example, for a batch of parts, whether a part meets the standard can serve as a qualitative characteristic, and the controlled dimension of the part can serve as a quantitative characteristic.

Basic definitions

A significant part of mathematical statistics is associated with the need to describe a large collection of objects.

Definition. The entire set of objects to be studied is called the general population.

The general population can be the entire population of the country, the monthly production of a plant, the population of fish living in a given reservoir, etc.

But the population is not just a set. If the set of objects we are interested in is too numerous, or the objects are difficult to access, or there are other reasons that do not allow us to study all the objects, we resort to studying some part of the objects.

Definition. The part of the objects that is actually inspected, studied, etc. is called the sample population, or simply the sample.

Definition. The number of elements in the population or in the sample is called its size (volume).

How can we ensure that the sample represents the whole as well as possible, i.e. that it is representative?

If the whole, i.e. the general population, is little known or completely unknown to us, we can offer nothing better than purely random selection. Greater knowledge allows us to do better, but at some stage ignorance still sets in and, as a result, we fall back on random choice.

But how to make a purely random choice? As a rule, selection occurs according to easily observable characteristics, for the sake of which research is conducted.

Violations of the principles of random selection have led to serious errors. A poll conducted by the American magazine Literary Digest on the outcome of the 1936 presidential election became famous for its failure. The candidates in this election were F.D. Roosevelt and A.M. Landon.

Who has won?

The editors used telephone directories as the general population. After randomly selecting 4 million addresses, the magazine sent out postcards across the country asking about attitudes toward the presidential candidates. After spending a large sum on mailing and processing the postcards, the magazine announced that Landon would win the upcoming presidential election by a landslide. The election result was the opposite of this forecast.

Two mistakes were made here at once. First, telephone directories do not provide a representative sample of the US population: they list mostly well-off heads of households. Second, not everyone sent back an answer; those who did were largely representatives of the business world, who supported Landon.

At the same time, the sociologists G. Gallup and E. Roper correctly predicted the victory of F.D. Roosevelt on the basis of only four thousand questionnaires. The reason for this success was not only the correct construction of the sample. They took into account that society is divided into social groups (strata) that are more homogeneous in their attitude toward the presidential candidates. Therefore, a sample from a stratum can be relatively small and still give the same accuracy. In the end Roosevelt, who advocated reforms in the interests of the less wealthy sections of the population, won.

Having the results of the survey by strata, it is possible to characterize society as a whole.
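One simple way to pass from results by strata to a figure for society as a whole is to weight each stratum's result by the stratum's share of the population. The sketch below assumes this weighting scheme; the stratum names, shares and support figures are hypothetical and are not the 1936 data.

```python
# Hypothetical strata: name -> (share of the stratum in the population,
#                               share of respondents in the stratum supporting a candidate)
strata = {
    "stratum 1": (0.50, 0.70),
    "stratum 2": (0.30, 0.55),
    "stratum 3": (0.20, 0.30),
}


def overall_share(strata):
    """Weighted sum of stratum results; the weights are the population shares."""
    total_weight = sum(weight for weight, _ in strata.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("stratum shares must sum to 1")
    return sum(weight * support for weight, support in strata.values())


estimate = overall_share(strata)
print(f"estimated overall support: {estimate:.3f}")
```

The design point is that each stratum only needs enough respondents to estimate its own share reliably; the population shares then do the rest.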

What are samples?

These are series of numbers.

Let us dwell in more detail on the basic concepts that characterize the sample series.

Suppose a sample of size n has been drawn from the general population, in which the value x_1 was observed n_1 times, x_2 was observed n_2 times, and so on, so that n_1 + n_2 + ... = n.

The observed values x_i are called variants, and the sequence of variants written in ascending order is called a variation series. The numbers of observations n_i are called frequencies, and the ratios n_i/n are called relative frequencies.

Definition. The different values of a random variable are called variants.

Definition. A variation series is a series of variants arranged in ascending (or descending) order together with their corresponding frequencies (or relative frequencies).

When studying variation series, the concept of accumulated frequency is used along with the concept of frequency. The accumulated frequency (or accumulated relative frequency) for each interval is found by adding the frequency of that interval to the frequencies of all previous intervals.

Definition. The accumulation of frequencies or relative frequencies is called cumulation. Both frequencies and relative frequencies can be cumulated.
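The definitions above can be sketched in a few lines of code. The sample here is hypothetical and chosen only to keep the numbers small; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample of observed values
sample = [3, 1, 2, 3, 3, 5, 2, 1, 3, 2]
n = len(sample)  # sample size

counts = Counter(sample)                       # how many times each value occurs
variants = sorted(counts)                      # distinct values in ascending order
frequencies = [counts[x] for x in variants]    # n_i
relative = [f / n for f in frequencies]        # n_i / n
cumulative = list(accumulate(frequencies))     # accumulated frequencies

for x, f, w, c in zip(variants, frequencies, relative, cumulative):
    print(f"x = {x}: frequency {f}, relative frequency {w:.1f}, accumulated {c}")
```

The last accumulated frequency always equals the sample size, and the relative frequencies always sum to 1, which gives a quick check on any hand-built table.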

The characteristics of a series can be quantitative and qualitative.

Quantitative (variational) characteristics are characteristics that can be expressed in numbers. They are divided into discrete and continuous.

Qualitative (attributive) characteristics are characteristics that are not expressed in numbers.

Continuous variables are variables whose values can be any real numbers within some interval.

Discrete variables are variables that take only separate, isolated values, for example integers.

Samples are characterized by measures of central tendency: the mean, the mode and the median. The mean of a sample is the arithmetic mean of all its values. The mode of a sample is the value (or values) that occurs most often. The median of a sample is the number that splits the ordered set of all sample values in half.

The variation series can be discrete or continuous.

Task

Sample given: 1.3; 1.8; 1.2; 3.0; 2.1; 5; 2.4; 1.2; 3.2;1.2; 4; 2.4.

This is a series of variants. Arranging these variants in ascending order, we get the variation series: 1.2; 1.2; 1.2; 1.3; 1.8; 2.1; 2.4; 2.4; 3.0; 3.2; 4; 5.

The mean value of this series is 2.4.

The median of the series is 2.25.

The mode of the series is 1.2.

Let us define these concepts.

Definition. The median of a variation series (Me) is the value of the random variable that falls in the middle of the variation series.

The median of an ordered series of numbers with an odd number of terms is the number written in the middle, and the median of an ordered series of numbers with an even number of terms is the arithmetic mean of the two numbers written in the middle. The median of an arbitrary series of numbers is the median of the corresponding ordered series.

Definition. The mode of a variation series (Mo) is the variant (the value of the random variable) that has the highest frequency, i.e. that occurs more often than the others.

Definition. The arithmetic mean of a variation series is the result of dividing the sum of the values of the statistical variable by the number of these values, that is, by the number of terms.

The rule for finding the arithmetic mean of a sample:

  1. multiply each variant by its frequency (multiplicity);
  2. add up all the resulting products;
  3. divide the found sum by the sum of all frequencies.

Definition. The range of a series is the difference R = x_max - x_min between the largest and smallest values of the variants.

Let's check whether we correctly found the mean value of this series, median and mode, based on the definitions.

We count the terms: there are 12 of them, an even number, so we need the arithmetic mean of the two numbers in the middle, that is, the 6th and 7th variants: (2.1 + 2.4)/2 = 2.25. This is the median.

The mode is 1.2, because only this number occurs 3 times, and the rest occur fewer than 3 times.

We find the arithmetic mean like this:

(1.2·3 + 1.3 + 1.8 + 2.1 + 2.4·2 + 3.0 + 3.2 + 4 + 5)/12 = 28.8/12 = 2.4
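The same check can be done by machine. The sketch below uses Python's standard `statistics` module on the sample from the task; only the variable names are introduced here for illustration.

```python
import statistics

# Sample from the task
sample = [1.3, 1.8, 1.2, 3.0, 2.1, 5, 2.4, 1.2, 3.2, 1.2, 4, 2.4]

variation_series = sorted(sample)               # variants in ascending order
mean = statistics.mean(sample)                  # arithmetic mean
median = statistics.median(sample)              # mean of the 6th and 7th ordered values
mode = statistics.mode(sample)                  # most frequent value
sample_range = max(sample) - min(sample)        # R = x_max - x_min

print("variation series:", variation_series)
print(f"mean = {mean:.2f}, median = {median:.2f}, mode = {mode}, range = {sample_range:.1f}")
```

Because the values are stored as floating-point numbers, the results are printed rounded; the exact answers agree with the hand calculation (2.4, 2.25, 1.2 and 3.8).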

Let's make a table:

x_i 1.2 1.3 1.8 2.1 2.4 3.0 3.2 4 5
n_i 3 1 1 1 2 1 1 1 1

Such tables are called frequency tables. In them, the numbers in the second line are frequencies; they show how often each value occurs in the sample.

Definition. The relative frequency of a sample value is the ratio of its frequency to the number of all sample values.

Frequencies and relative frequencies are also called weights. Let's find the range of the series: R = 5 - 1.2 = 3.8.

Food for thought

The arithmetic mean is a conventional value. In reality it doesn't exist. In reality there is a total amount. Therefore, the arithmetic mean is not a characteristic of one observation; it characterizes the series as a whole.

The mean value can be interpreted as the center of dispersion of the values of the observed characteristic, i.e. the value around which all observed values fluctuate. The algebraic sum of the deviations from the mean is always zero, i.e. the total of the upward deviations from the mean equals the total of the downward deviations.

The mean is an abstract (generalizing) quantity. Even when a series consists only of natural numbers, the mean can turn out to be a fraction. Example: the average mark for a test is 3.81.

The mean is found not only for homogeneous quantities. The average grain yield across the country (corn: 50-60 centners per hectare, buckwheat: 5-6 centners per hectare, rye, wheat, etc.), average food consumption, average national income per capita, average housing provision, weighted average cost of housing, average labor intensity of building construction, etc. are characteristics of the state as a single national economic system; these are the so-called system averages.

In statistics, such characteristics as the mode and the median are also used. They are called structural averages, because their values are determined by the overall structure of the data series.

Sometimes a series may have two modes, sometimes a series may have no mode.

The mode is the most suitable indicator for identifying the packaging of a certain product that buyers prefer; the most common market price for goods of a given type; the shoe or clothing size in greatest demand; the sport that most of the population of a country, city, village, school, etc. prefers.

In construction, there are 8 widths of slabs, of which 3 are used most often: 1 m, 1.2 m and 1.5 m. There are 33 lengths of slabs, but slabs 4.8 m, 5.7 m and 6.0 m long are used most often, and the mode of slab length is usually found among these 3 sizes. The same can be said about window types.

The mode of a data series is found when one wants to identify some typical indicator.

A mode can be expressed in numbers and words; from a statistical point of view, a mode is an extremum of frequency.

The median complements the information about a data series given by the arithmetic mean, and vice versa.

Within a university curriculum you are unlikely to find a separate discipline called “mathematical statistics”; elements of mathematical statistics are usually studied together with probability theory, and only after the basic course in probability theory.

Mathematical statistics: general information

Mathematical statistics is a branch of mathematics that develops methods for recording, describing and analyzing data from any observations and experiments, the purpose of which is to build probabilistic models of mass random phenomena.

Mathematical statistics as a science arose in the 17th century and developed in parallel with probability theory. Great contributions to its development in the 19th and 20th centuries were made by P.L. Chebyshev, C.F. Gauss, A.N. Kolmogorov and others.

The general task of mathematical statistics is to create methods for collecting and processing statistical data to obtain scientific and practical conclusions.

The main sections of mathematical statistics are:

  • sampling method (familiarization with the concept of sampling, methods of collecting and processing data, etc.);
  • statistical assessment of sample parameters (estimates, confidence intervals, etc.);
  • calculation of summary characteristics of the sample (calculation of options, moments, etc.);
  • correlation theory (regression equations, etc.);
  • statistical testing of hypotheses;
  • one-way analysis of variance.

The most common problems of mathematical statistics that are studied at university and often encountered in practice include:

  • problems of determining estimates of sampling parameters;
  • tasks to test statistical hypotheses;
  • the problem of determining the type of distribution law based on statistical data.

Problems of determining sample parameter estimates

The study of mathematical statistics begins with the definition of such concepts as “sample”, “frequency”, “relative frequency”, “empirical function”, “polygon”, “cumulate”, “histogram”, etc. Next comes the study of the concepts of estimates (biased and unbiased): sample mean, variance, corrected variance, etc.

Task

The heights of children in the junior group of a kindergarten are given by the sample:
92, 96, 95, 96, 94, 97, 98, 94, 95, 96.
Let's find some characteristics of this sample.

Solution

Sample size (number of measurements; N): 10.
Lowest sample value: 92. Highest sample value: 98.
Sample range: 98 - 92 = 6.
Let's write down the ranked series (variants in ascending order):
92, 94, 94, 95, 95, 96, 96, 96, 97, 98.
Let's group the series and write it in a table (we assign to each variant the number of its occurrences):

x_i 92 94 95 96 97 98 N
n_i 1 2 2 3 1 1 10

Let's calculate the relative frequencies and accumulated frequencies and write the results in the table:

x_i 92 94 95 96 97 98 Total
n_i 1 2 2 3 1 1 10
n_i/N 0.1 0.2 0.2 0.3 0.1 0.1 1
Accumulated frequencies 1 3 5 8 9 10

Let's build a polygon of sampling frequencies (mark on the graph the options along the OX axis, frequencies along the OY axis, connect the points with a line).

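We calculate the sample mean and the sample variance using the formulas (respectively):

$\bar{x} = \frac{1}{N}\sum_i n_i x_i = \frac{953}{10} = 95.3$

$D = \frac{1}{N}\sum_i n_i (x_i - \bar{x})^2 = \frac{26.1}{10} = 2.61$

Other sample characteristics can be found, but for a general idea the characteristics found are quite sufficient.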
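The whole solution can be reproduced with a short script. This is a sketch that follows the steps of the task; the variable names are illustrative, and the corrected variance (division by N - 1) is added as one of the "other characteristics" mentioned above.

```python
from collections import Counter
from itertools import accumulate

# Heights from the task
heights = [92, 96, 95, 96, 94, 97, 98, 94, 95, 96]
N = len(heights)                                   # sample size

counts = Counter(heights)
variants = sorted(counts)                          # x_i
freqs = [counts[x] for x in variants]              # n_i
rel_freqs = [f / N for f in freqs]                 # n_i / N
cum_freqs = list(accumulate(freqs))                # accumulated frequencies

sample_range = max(heights) - min(heights)
mean = sum(x * f for x, f in zip(variants, freqs)) / N
variance = sum(f * (x - mean) ** 2 for x, f in zip(variants, freqs)) / N
corrected_variance = variance * N / (N - 1)        # unbiased estimate

print("x_i:        ", variants)
print("n_i:        ", freqs)
print("accumulated:", cum_freqs)
print(f"range = {sample_range}, mean = {mean:.1f}, "
      f"variance = {variance:.2f}, corrected variance = {corrected_variance:.2f}")
```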

Problems testing statistical hypotheses

Problems of this type are more difficult than those of the previous type, and their solutions are often longer and more labor-intensive. Before starting to solve them, one first studies the concepts of a statistical hypothesis, the null and competing hypotheses, etc.

Let's consider the simplest problem of this type.

Task

Two independent samples of sizes 11 and 14 are given, drawn from normal populations X and Y. The corrected variances are also known: 0.75 and 0.4, respectively. It is required to test the null hypothesis that the general variances are equal at the significance level γ = 0.05. Choose a competing hypothesis as desired.

Solution

The null hypothesis for our problem is written as follows:

$H_0: D(X) = D(Y)$

As a competing hypothesis, consider the following:

$H_1: D(X) > D(Y)$

Let us calculate the ratio of the larger corrected variance to the smaller one and obtain the observed value of the criterion:

$F_{obs} = \frac{0.75}{0.4} = 1.875$

Since the competing hypothesis we have chosen is of the form $D(X) > D(Y)$, the critical region is right-sided.
Using the table for a significance level of 0.05 and the numbers of degrees of freedom equal to 10 (11 - 1 = 10) and 13 (14 - 1 = 13), we find the critical point:

$F_{cr}(0.05; 10; 13) = 2.67$

Since the observed value of the criterion is less than the critical value (1.875 < 2.67), there is no reason to reject the hypothesis that the general variances are equal. Thus, the corrected variances differ insignificantly.

The problem considered is not easy at first glance, but it is quite standard and can be solved according to a template. Such problems differ from each other, as a rule, in the values ​​of the criteria and the critical region.
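The template for this kind of problem can be sketched as follows. The critical value is not computed here: it is entered as a constant taken from the table of critical points of the F distribution, as in the solution above. The function and variable names are illustrative.

```python
# Data from the task
n_x, n_y = 11, 14            # sample sizes
s2_x, s2_y = 0.75, 0.4       # corrected variances

# Critical point for significance level 0.05 and degrees of freedom (10, 13),
# taken from a table of the F distribution rather than computed.
F_CRITICAL = 2.67


def f_observed(s2_a, s2_b):
    """Ratio of the larger corrected variance to the smaller one."""
    return max(s2_a, s2_b) / min(s2_a, s2_b)


f_obs = f_observed(s2_x, s2_y)

# Degrees of freedom: numerator belongs to the sample with the larger variance.
if s2_x >= s2_y:
    df_num, df_den = n_x - 1, n_y - 1
else:
    df_num, df_den = n_y - 1, n_x - 1

reject_h0 = f_obs > F_CRITICAL   # right-sided critical region

print(f"F observed = {f_obs:.3f}, degrees of freedom = ({df_num}, {df_den})")
print("reject H0" if reject_h0 else "no reason to reject H0")
```

For other data only the inputs and the table value change; the comparison step stays the same.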

More labor-intensive (as they contain a lot of calculations, some of which are summarized in tables) are tasks to test the hypothesis about the type of distribution of the population. When solving such problems, various criteria are used, for example, the Pearson criterion.

Problems of determining the type of distribution law from statistical data

This type of problem belongs to the section that studies the elements of correlation theory. If we consider the dependences of Y on X, then we could recall the least squares method to determine the type of dependence. However, in mathematical statistics everything is much more complicated and in the theory of correlation two-dimensional quantities are considered, the values ​​of which are usually given in the form of tables.

Y \ X x_1 x_2 ... x_n n_y
y_1 n_11 n_21 ... n_n1
y_2 n_12 n_22 ... n_n2
...
y_m n_1m n_2m ... n_nm
n_x N

Let us give the formulation of one of the tasks of this section.

Task

Determine the sample equation of the straight line of regression of Y on X. The data are given in the correlation table.

Y X n y
10 20 30 40
5 1 3 4
6 2 1 3
7 3 2 5
8 1 1
n x 1 5 4 3 N=13

Conclusion

In conclusion, we note that the level of complexity of problems in mathematical statistics varies quite a lot when moving from one type to another. Problems of the first type are quite simple and do not require a special understanding of the theory; you can simply write down the formulas and solve almost any problem. Problems of the second and third types are a little more complicated and to successfully solve them, a certain amount of “knowledge” in this discipline is required.

We will give a list of only two books, but these books have long become reference books for the author of this article.

  1. Gmurman V.E. Probability theory and mathematical statistics: textbook. – 12th ed., revised. – M.: Publishing House Yurayt, 2010. – 479 p.
  2. Gmurman V.E. A guide to solving problems in probability theory and mathematical statistics. – M.: Higher School, 2005. – 404 p.


RANDOM VARIABLES AND THE LAWS OF THEIR DISTRIBUTION.

A random variable is a quantity whose values depend on a combination of random circumstances. Random variables are divided into discrete and continuous.

A variable is called discrete if it takes a countable set of values. (Example: the number of patients at a doctor's appointment, the number of letters on a page, the number of molecules in a given volume.)

A continuous variable is one that can take any value within a certain interval. (Example: air temperature, body weight, human height, etc.)

The distribution law of a random variable is the set of possible values of this variable together with the probabilities (or frequencies of occurrence) corresponding to these values.

EXAMPLE:

x x_1 x_2 x_3 x_4 ... x_n
p p_1 p_2 p_3 p_4 ... p_n

x x_1 x_2 x_3 x_4 ... x_n
m m_1 m_2 m_3 m_4 ... m_n

NUMERICAL CHARACTERISTICS OF RANDOM VARIABLES.

In many cases, along with the distribution of a random variable or instead of it, information about these quantities can be provided by numerical parameters called numerical characteristics of a random variable . The most common of them:

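1. The expected value (mean value) of a random variable is the sum of the products of all its possible values and the probabilities of these values:

$M(X) = \sum_{i=1}^{n} x_i p_i$

2. The variance (dispersion) of a random variable is the expected value of the squared deviation of the variable from its expected value:

$D(X) = M\left[(X - M(X))^2\right] = \sum_{i=1}^{n} (x_i - M(X))^2 p_i$

3. The standard deviation:

$\sigma = \sqrt{D(X)}$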
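The three characteristics can be computed directly from a distribution table. The table below is hypothetical and serves only as an illustration; the function names are not from the text.

```python
import math

# Hypothetical distribution law of a discrete random variable
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

assert math.isclose(sum(probs), 1.0), "probabilities must sum to 1"


def expected_value(xs, ps):
    """Sum of the products of the values and their probabilities."""
    return sum(x * p for x, p in zip(xs, ps))


def variance(xs, ps):
    """Expected squared deviation from the expected value."""
    m = expected_value(xs, ps)
    return sum((x - m) ** 2 * p for x, p in zip(xs, ps))


m_x = expected_value(values, probs)
d_x = variance(values, probs)
sigma = math.sqrt(d_x)

print(f"M(X) = {m_x:.2f}, D(X) = {d_x:.2f}, sigma = {sigma:.2f}")
```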

“THREE SIGMA” rule: if a random variable is distributed according to the normal law, then the absolute deviation of this variable from its mean value practically never exceeds three standard deviations (the probability of a larger deviation is about 0.0027).

GAUSS LAW – NORMAL DISTRIBUTION LAW

Quantities distributed according to the normal law (Gauss's law) are encountered often. Its main feature: it is the limiting law that other distribution laws approach.

A random variable is distributed according to the normal law if its probability density has the form:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - M(X))^2}{2\sigma^2}}$

where M(X) is the expected value of the random variable;

σ is the standard deviation.

The probability density (the differential distribution function) shows how the probability dP that the random variable falls in an interval dx changes depending on the value of the variable itself:

$f(x) = \frac{dP}{dx}$
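The normal density and the “three sigma” rule can be checked numerically. The sketch below writes the density out by hand and compares it with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the parameter values are hypothetical.

```python
import math
from statistics import NormalDist

MU = 0.0      # hypothetical expected value M(X)
SIGMA = 1.0   # hypothetical standard deviation


def normal_pdf(x, mu, sigma):
    """Density of the normal law with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


dist = NormalDist(MU, SIGMA)

# The hand-written formula agrees with the library implementation.
pdf_at_mean = normal_pdf(MU, MU, SIGMA)
library_pdf_at_mean = dist.pdf(MU)

# Probability that the deviation from the mean does not exceed three sigma.
p_within_3_sigma = dist.cdf(MU + 3 * SIGMA) - dist.cdf(MU - 3 * SIGMA)

print(f"f(M(X)) = {pdf_at_mean:.4f} (library: {library_pdf_at_mean:.4f})")
print(f"P(|X - M(X)| <= 3 sigma) = {p_within_3_sigma:.4f}")
```

The probability of staying within three standard deviations comes out at about 0.9973, which is why larger deviations are treated as practically impossible.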


BASIC CONCEPTS OF MATHEMATICAL STATISTICS

Mathematical statistics is a branch of applied mathematics directly adjacent to probability theory. The main difference between mathematical statistics and probability theory is that mathematical statistics deals not with operations on distribution laws and numerical characteristics of random variables, but with approximate methods for finding these laws and numerical characteristics from the results of experiments.

The basic concepts of mathematical statistics are:

1. general population;

2. sample;

3. variation series;

4. mode;

5. median;

6. percentile;

7. frequency polygon;

8. histogram.

The general population is a large statistical population from which some of the objects are selected for study.

(Example: the entire population of a region, the university students of a given city, etc.)

A sample (sample population) is a set of objects selected from the general population.

A variation series is a statistical distribution consisting of variants (values of a random variable) and their corresponding frequencies.

Example:

X,kg
m

x- value of a random variable (weight of girls aged 10 years);

m- frequency of occurrence.

Fashion– the value of the random variable that corresponds to the highest frequency of occurrence. (In the example above, the fashion corresponds to the value 24 kg, it is more common than others: m = 20).

Median - the value of a random variable that divides the distribution in half: half of the values lie to the right of the median and half (no more) lie to the left.

Example:

1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 10, 10

In the example we observe 40 values of the random variable. All values are arranged in ascending order, taking into account the frequency of their occurrence. The 20th value, 7, has 20 values (half of the 40) to its right. Therefore, 7 is the median.

To characterize the scatter, we find the values below which 25% and 75% of the measurement results lie. These values are called the 25th and 75th percentiles. If the median divides the distribution in half, then the 25th and 75th percentiles each cut off a quarter of it. (The median itself, by the way, can be considered the 50th percentile.) As can be seen from the example, the 25th and 75th percentiles are equal to 3 and 8, respectively.
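The median and percentiles of the 40 values in the example can be checked with Python's standard statistics module. This is a minimal sketch: statistics.quantiles is used with its default interpolation method, and other percentile conventions may give slightly different values on other data sets.

```python
import statistics

# The 40 ordered values from the example, written as value * count.
data = ([1] * 6 + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 4 +
        [6] * 2 + [7] * 6 + [8] * 6 + [9] * 3 + [10] * 6)

median = statistics.median(data)

# n=4 splits the data into quarters: 25th, 50th and 75th percentiles.
p25, p50, p75 = statistics.quantiles(data, n=4)

print("number of values:", len(data))
print("median:", median)
print("25th percentile:", p25)
print("75th percentile:", p75)
```

For this data set the script reports a median of 7 and 25th and 75th percentiles of 3 and 8, in agreement with the text.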

Both discrete (point) and continuous (interval) statistical distributions are used.

For clarity, statistical distributions are depicted graphically in the form of a frequency polygon or a histogram.

Frequency polygon - a broken line whose segments connect the points with coordinates $(x_1, m_1), (x_2, m_2), \ldots$, or, for the relative frequency polygon, with coordinates $(x_1, p^*_1), (x_2, p^*_2), \ldots$ (Fig. 1).


Fig. 1. Frequency polygon (vertical axis: $m$ or $m_i/n$). Fig. 2. Frequency histogram (vertical axis: $f(x)$).

Frequency histogram - a set of adjacent rectangles built on one straight line (Fig. 2); the bases of the rectangles are the same and equal to dx, and the heights are equal to the ratio of the frequency to dx, or of the relative frequency p* to dx (probability density).

Example:

x, kg 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4
m

Frequency polygon

The ratio of the relative frequency to the interval width is called the probability density: $f(x) = \frac{m_i}{n\,dx} = \frac{p^*_i}{dx}$

An example of constructing a histogram.

Let's use the data from the previous example.

1. Calculation of the number of class intervals

where n is the number of observations. In our case n = 100. Hence:

2. Calculation of the interval width dx:

,

3. Drawing up an interval series:

dx, kg   2.7-2.9  2.9-3.1  3.1-3.3  3.3-3.5  3.5-3.7  3.7-3.9  3.9-4.1  4.1-4.3  4.3-4.5
m        6        15       25       17       11       12       8        5        1
f(x)     0.3      0.75     1.25     0.85     0.55     0.6      0.4      0.25     0.05
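The relation f(x) = m / (n · dx) can be checked against the interval series. In the sketch below, the frequencies m are the ones implied by the f(x) row for n = 100 and dx = 0.2 (an inference from those numbers, not an independent data source); the variable names are illustrative.

```python
n = 100    # number of observations, as stated in the example
dx = 0.2   # interval width, read off the interval boundaries

# Frequencies implied by the f(x) row for n = 100 and dx = 0.2.
m = [6, 15, 25, 17, 11, 12, 8, 5, 1]
lower_edges = [2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9, 4.1, 4.3]

# Height of each histogram bar: frequency divided by (n * dx).
density = [m_i / (n * dx) for m_i in m]

for left, m_i, f_i in zip(lower_edges, m, density):
    print(f"{left:.1f}-{left + dx:.1f} kg  m = {m_i:2d}  f(x) = {f_i:.2f}")

# The total area under the histogram equals 1.
print("total area:", round(sum(density) * dx, 10))
```

Because the bar heights are densities rather than raw counts, the total area of the rectangles equals 1, which makes histograms with different interval widths comparable.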

Histogram

1. Mathematical statistics. Introduction

Mathematical statistics is a discipline that is applied in all areas of scientific knowledge.

Statistical methods are designed to understand the “numerical nature” of reality (Nisbett, et al., 1987).

Definition of the concept

Mathematical statistics is a branch of mathematics devoted to methods of data analysis, mainly of a probabilistic nature. It deals with the systematization, processing and use of statistical data for theoretical and practical conclusions.

Statistical data refers to information about the number of objects in any more or less extensive collection that have certain characteristics. It is important to understand here that statistics deals specifically with the number of objects, and not with their descriptive characteristics.

The purpose of statistical analysis is to study the properties of a random variable. To do this, the values of the random variable being studied must be measured several times. The resulting group of values is treated as a sample from a hypothetical population.

The sample is processed statistically, and a decision is then made. It is important to note that, because of the initial uncertainty, the decision always has the character of a “fuzzy statement”. In other words, statistical processing deals with probabilities rather than precise statements.

The main thing in the statistical method is to count the number of objects included in different groups. Objects are collected into a group according to a certain common characteristic, and then the distribution of these objects within the group according to the quantitative expression of this characteristic is considered. In statistics, the sampling method of analysis is often used: not the entire group of objects is analyzed, but a small sample, that is, several objects taken from the large group. Probability theory is widely used in the statistical assessment of observations and in drawing conclusions.

The main subject of mathematical statistics is the calculation of statistics (may the reader forgive us for the tautology), which serve as criteria for assessing the reliability of a priori assumptions, hypotheses or conclusions drawn from empirical data.

Another definition is “Statistics are instructions by which a certain number is calculated from a sample - the value of the statistic for a given sample” [Sachs, 1976]. The sample mean and variance, the ratio of the variances of two samples, or any other function of the sample can be regarded as statistics.

Calculating a “statistic” gives a “single number” representation of a complex stochastic (probabilistic) process.

Student distribution

Statistics are also random variables. The distributions of statistics (test distributions) underlie the criteria built on these statistics. For example, W. Gosset, who worked at the Guinness brewery and published under the pseudonym “Student,” established in 1908 the very useful properties of the distribution of the ratio of the difference between the sample mean $\bar{x}$ and the population mean $\mu$ to the standard error of the mean $s/\sqrt{n}$, known as the t-statistic (Student distribution):

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}. \qquad (5.7)$$

As the sample size grows, the Student distribution approaches the normal distribution in shape.
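To make the t-statistic concrete, here is a minimal sketch that computes the ratio of the difference between the sample mean and an assumed population mean to the standard error of the mean. The function name, the sample values and the assumed population mean are hypothetical and serve only as an illustration.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """Difference between sample mean and mu0, divided by the standard error."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 in the denominator)
    standard_error = s / math.sqrt(n)
    return (mean - mu0) / standard_error

# Hypothetical measurements and a hypothetical population mean.
sample = [9.8, 10.4, 10.1, 9.6, 10.6, 10.2, 9.9, 10.3]
t = t_statistic(sample, mu0=10.0)
print(f"t = {t:.3f} with {len(sample) - 1} degrees of freedom")
```

The resulting value is then compared with the Student distribution for n - 1 degrees of freedom to judge how plausible the assumed population mean is.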

The other two important distributions of sample statistics are the $\chi^2$ distribution and the F distribution, widely used in a number of branches of statistics to test statistical hypotheses.

So, the subject of mathematical statistics is the formal quantitative side of the objects being studied, regardless of their specific nature.

For this reason, the examples given here are about groups of data, about numbers, and not about specific measurable things. Therefore, the sample calculations given here can be applied to your own data, whatever objects they were obtained from.

The main thing is to choose a statistical processing method suitable for your data.

Depending on the specific results of observations, mathematical statistics is divided into several sections.

Sections of mathematical statistics

        Statistics of numbers.

        Multivariate statistical analysis.

        Analysis of functions (processes) and time series.

        Statistics of objects of non-numerical nature.

In modern science, it is believed that any field of research cannot be a real science until mathematics penetrates it. In this sense, mathematical statistics is the authorized representative of mathematics in any other science and provides a scientific approach to research. We can say that the scientific approach begins where mathematical statistics appear in the study. This is why mathematical statistics is so important for any modern researcher.

If you want to be a real modern researcher, study and apply mathematical statistics in your work!

Statistics necessarily appear where there is a transition from a single observation to a multiple one. If you have a lot of observations, measurements and data, then you cannot do without mathematical statistics.

Mathematical statistics is divided into theoretical and applied.

Theoretical statistics proves the scientific nature and correctness of statistics itself.

Theoretical mathematical statistics - the science that studies methods of revealing patterns inherent in large populations of homogeneous objects on the basis of samples drawn from them.

This branch of statistics is dealt with by mathematicians, and they like to use their theoretical mathematical proofs to convince us that statistics itself is scientific and can be trusted. The trouble is that only other mathematicians can understand these proofs, and for ordinary people who need to use mathematical statistics, these proofs are still not available, and are completely unnecessary!

Conclusion: If you are not a mathematician, then do not waste your energy on understanding the theoretical calculations regarding mathematical statistics. Study the actual statistical methods, not their mathematical justifications.

Applied Statistics teaches users to work with any data and obtain generalized results. It doesn't matter what kind of data it is, what matters is how much of it you have at your disposal. In addition, applied statistics will tell us how much we can trust that the results obtained reflect the actual state of affairs.

Different disciplines in applied statistics use different sets of specific methods. Therefore, the following sections of applied statistics are distinguished: biological, psychological, economic and others. They differ from each other in the set of examples and techniques, as well as in their favorite calculation methods.

The following is an example of the differences between the application of applied statistics for different disciplines. Thus, the statistical study of the regime of turbulent water flows is carried out on the basis of the theory of stationary random processes. However, applying the same theory to the analysis of economic time series can lead to gross errors due to the fact that the assumption that the probability distribution remains unchanged in this case is, as a rule, completely unacceptable. Therefore, these different disciplines will require different statistical methods.

So, any modern scientist should use mathematical statistics in research, even one who works in areas very far from mathematics. Such a scientist must be able to apply applied statistics to the data even without knowing its mathematical justification.

© Sazonov V.F., 2009.

Methods of mathematical statistics


1. Introduction

Mathematical statistics is a science that deals with the development of methods for obtaining, describing and processing experimental data in order to study the patterns of random mass phenomena.

In mathematical statistics, two areas can be distinguished: descriptive statistics and inductive statistics (statistical inference). Descriptive statistics deals with the accumulation, systematization and presentation of experimental data in a convenient form. Inductive statistics based on these data allows one to draw certain conclusions regarding the objects about which data are collected or estimates of their parameters.

Typical areas of mathematical statistics are:

1) sampling theory;

2) estimation theory;

3) testing statistical hypotheses;

4) regression analysis;

5) analysis of variance.

Mathematical statistics is based on a number of initial concepts without which it is impossible to study modern methods of processing experimental data. Among the first of these is the concept of a general population and a sample.

In mass industrial production, it is often necessary to determine whether the quality of the product meets the standards without checking each product produced. Since the quantity of products produced is very large or the testing of products is associated with rendering them unusable, a small number of products are checked. Based on this check, it is necessary to give a conclusion about the entire series of products. Of course, you cannot say that all transistors from a batch of 1 million pieces are good or bad by checking one of them. On the other hand, since the process of selecting samples for testing and the tests themselves can be time-consuming and lead to high costs, the scope of product testing should be such that it can give a reliable representation of the entire batch of products, while being of minimal size. For this purpose, we introduce a number of concepts.

The entire set of objects being studied or experimental data is called the general population. We will denote by N the number of objects or the amount of data that makes up the general population. The value N is called the volume of the population. If $N \gg 1$, that is, N is very large, then $N = \infty$ is usually assumed.

A random sample, or simply a sample, is a portion of a population selected at random from it. The word "random" means that the probability of selecting any object from the population is the same. This is an important assumption, but it is often difficult to test in practice.
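A simple random sample without replacement, in which every object of the population has the same chance of being selected, can be sketched with Python's standard random module. The population of numbered items, the sample size and the seed are hypothetical choices made for the illustration.

```python
import random

# Hypothetical population: items numbered 1..1000, so N = 1000.
population = list(range(1, 1001))

rng = random.Random(42)                  # fixed seed so the draw is reproducible
sample = rng.sample(population, k=20)    # n = 20 distinct items, equal selection chance

print("population size N:", len(population))
print("sample size n:", len(sample))
print("selected items:", sorted(sample))
```

In practice the difficulty is rarely the drawing itself but making sure that the list being sampled from really covers the whole population.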

The sample size is the number of objects or the amount of data that makes up the sample and is denoted by n. In the future, we will assume that the sample elements can be assigned numerical values $x_1, x_2, \ldots, x_n$. For example, in the process of quality control of manufactured bipolar transistors, these could be measurements of their DC gain.


2. Numerical characteristics of the sample

2.1 Sample mean

For a particular sample of size n, its sample mean $\bar{x}$ is determined by the relation

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

where $x_i$ are the values of the sample elements. Usually it is necessary to describe the statistical properties of arbitrary random samples, and not of one particular sample. This means that a mathematical model is considered which assumes a sufficiently large number of samples of size n. In this case, the sample elements are treated as random variables $X_i$ taking values $x_i$ with probability density f(x), which is the probability density of the general population. Then the sample mean is also a random variable $\bar{X}$, equal to

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

As before, we denote random variables by capital letters and their values by lowercase letters.

The average value of the population from which the sample is drawn will be called the general average and denoted by $\mu_x$. It can be expected that if the sample size is large, the sample mean will not differ significantly from the population mean. Since the sample mean is a random variable, its mathematical expectation can be found:

$$M[\bar{X}] = M\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} M[X_i] = \frac{1}{n}\, n\mu_x = \mu_x.$$

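Thus, the mathematical expectation of the sample mean is equal to the general mean. In this case, the sample mean is said to be an unbiased estimate of the population mean. We will return to this term later. Since the sample mean is a random variable that fluctuates around the general mean, it is desirable to estimate this fluctuation using the variance of the sample mean. Consider a sample whose size n is significantly smaller than the population size N ($n \ll N$). Assume that the characteristics of the population do not change while the sample is being formed, which is equivalent to assuming $N = \infty$. Then

$$D[\bar{X}] = M\left[(\bar{X} - \mu_x)^2\right] = M\left[\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_x)\right)^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} M\left[(X_i - \mu_x)(X_j - \mu_x)\right].$$

Random variables $X_i$ and $X_j$ ($i \ne j$) can be considered independent, therefore

$$M\left[(X_i - \mu_x)(X_j - \mu_x)\right] = \begin{cases} 0, & i \ne j, \\ \sigma^2, & i = j. \end{cases}$$

Let us substitute this result into the formula for the variance:

$$D[\bar{X}] = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n},$$

where $\sigma^2$ is the variance of the population.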
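The statement that the variance of the sample mean equals the population variance divided by n can be checked by simulation. The sketch below assumes a normally distributed population (the derivation itself does not require normality) and uses illustrative values for the mean, standard deviation, sample size and number of trials.

```python
import random
import statistics

def variance_of_sample_mean(mu, sigma, n, trials, seed=0):
    """Estimate the variance of the mean of n independent draws by simulation."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.pvariance(means)

# Illustrative parameters: population variance 9, samples of size 25.
mu, sigma, n = 10.0, 3.0, 25
estimated = variance_of_sample_mean(mu, sigma, n, trials=4000)
theoretical = sigma ** 2 / n

print(f"simulated variance of the sample mean: {estimated:.3f}")
print(f"population variance divided by n:      {theoretical:.3f}")
```

The two printed numbers should be close; increasing the number of trials narrows the gap further.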

From this formula it follows that, as the sample size increases, the fluctuations of the sample mean around the general mean decrease as $\sigma^2/n$. Let us illustrate this with an example. Let there be a random signal with mathematical expectation and variance equal to $\mu_x = 10$ and $\sigma^2 = 9$, respectively.

Signal samples are taken at equally spaced times $t_1, t_2, \ldots, t_n$.

[Figure: the signal X(t) sampled at times $t_1, t_2, \ldots, t_n$.]

Since the samples are random variables, we will denote them $X(t_1), X(t_2), \ldots, X(t_n)$.

Let us determine the number of samples so that the standard deviation of the estimate of the mathematical expectation of the signal does not exceed 1% of its mathematical expectation. Since $\mu_x = 10$, it is necessary that

$$\sigma_{\bar{X}} \le 0.01\,\mu_x = 0.1.$$

On the other hand, $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 3/\sqrt{n}$, therefore $3/\sqrt{n} \le 0.1$, or $\sqrt{n} \ge 30$. From here we obtain $n \ge 900$ samples.
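The sample-size calculation in this example can be reproduced in a few lines. The sketch uses exact fractions to avoid rounding effects at the boundary; the function name and parameter names are illustrative.

```python
import math
from fractions import Fraction

def min_sample_size(mu, variance, rel_error):
    """Smallest n for which sqrt(variance / n) <= rel_error * mu."""
    max_std = rel_error * mu               # allowed standard deviation of the mean
    return math.ceil(variance / max_std ** 2)

# Values from the example: mean 10, variance 9, allowed deviation 1% of the mean.
n_min = min_sample_size(mu=Fraction(10), variance=Fraction(9),
                        rel_error=Fraction(1, 100))
print("minimum number of samples:", n_min)
```

Halving the allowed relative error quadruples the required number of samples, since n grows as the inverse square of the tolerance.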

2.2 Sample variance

For sample data, it is important to know not only the sample mean, but also the spread of sample values around the sample mean. If the sample mean is an estimate of the population mean, then the sample variance must be an estimate of the population variance. Sample variance

for a sample consisting of random variables is determined as follows

Using this representation of the sample variance, we find its mathematical expectation