Some thoughts on using big data to alleviate the practical problems of homogeneity and comparability of price statistics
■ Wang Jiangming
"Homogeneous comparability" is an important principle followed in the statistical investigation of PPI, CPI and HPI in China at present, and it is the basis for comparability between the price in the reporting period and the price in the base period. With the development of economy and society, the needs of the masses are more abundant and diverse, and the principle of "homogeneity and comparability" is also facing more challenges in practice. Under the background of the continuous updating of information technology and the continuous advancement of statistical modernization reform, how to make full use of big data, expand the sources of price statistics data, improve the system and methods of price statistics, keep pace with the times and improve the practical operability of the principle of homogeneity and comparability and the accuracy of price comparison is testing the wisdom of all parties.
First, the homogeneity of price statistics is facing the "difficulty" of practice.
At present, the price indexes that attract the most attention in China are the industrial producer price index (PPI), the consumer price index (CPI) and the residential sales price index (HPI). All of these are "pure price indexes": they reflect only the market price changes caused by changes in market supply and demand and in the purchasing power of money, and they are relative measures of the trend and degree of change in the price level across periods. This requires eliminating non-price factors such as quality, appearance, production and supply in the statistical process to achieve "homogeneity and comparability". According to feedback from front-line price survey work, if only the traditional price survey methods are used, the practical operability of "homogeneous comparability" will become weaker and weaker, and the methods will struggle to keep up with rapid social change. Without optimization, the credibility of statistical surveys will inevitably suffer.
(A) Missing prices affect "homogeneity and comparability".
1. The previous-period price in the HPI survey cannot be obtained. At present, the sales price surveys of newly built houses in the 70 large and medium-sized cities in China are all comprehensive surveys, and the basic data come directly from the online signing and filing records of the local real estate authorities, including the project name, project address, building number, total number of floors, floor number, residential structure, total transaction price, construction area, signing time, administrative division, etc. The authenticity and reliability of the basic data are therefore guaranteed. However, the online signing records naturally contain no previous-period prices for newly opened projects or for projects sold intermittently, so those prices must be imputed when the price index is calculated. This places high demands on the imputation method, and when big data information is not fully utilized, the imputation involves a degree of subjective judgment and estimation error.
2. Some conditions in the PPI survey have changed. Under the existing system, the industrial producer price survey is carried out monthly, and "representative enterprises" regularly report the prices of "representative products". The report submitted by an enterprise includes the unit price in the reporting period and the average unit price in the previous month. The unit price in the reporting period is the simple arithmetic average of the unit prices collected on the 5th and the 20th of the reporting month. In front-line PPI survey work, all localities try their best to choose products that have a large influence on the national economy and people's livelihood, stable production, good development prospects and local characteristics, that is, "representative products". But the PPI survey covers many categories. Taking the survey of industrial producers' ex-factory prices as an example, it covers industrial products in 41 major industrial categories, 207 medium categories and 666 minor categories, divided into 1310 basic categories. In practice, it is difficult to ensure that the same product of the same enterprise has completely homogeneous sales records in both the reporting month and the previous month. Even the same product of the same enterprise is priced differently for different customers: for example, when the customer's credit is good, the cooperation is long-standing or the order quantity is large, the price is often lower. Such a situation can be regarded as a missing price for a "representative product" at a certain price collection point, and the price has to be imputed after the non-price factors are removed. Under the current traditional approach, which does not use big data, this work is done mainly by hand by enterprise statisticians. However, limited by their professional knowledge, work experience, market judgment and other conditions, enterprise statisticians understand "homogeneity and comparability" differently, which may affect the accuracy of the price survey.
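The price arithmetic described above can be illustrated with a minimal sketch. The function names and sample prices are hypothetical and are not taken from the survey system. The sketch assumes only what the text states: the reporting-period unit price is the simple average of the prices collected on the 5th and the 20th, and it is compared with the previous month's average unit price.

```python
def reporting_period_price(price_on_5th: float, price_on_20th: float) -> float:
    """Simple arithmetic average of the two prices collected in the reporting month."""
    return (price_on_5th + price_on_20th) / 2.0


def price_relative(reporting_price: float, previous_price: float) -> float:
    """Reporting-period price as a percentage of the previous month's average price."""
    if previous_price <= 0:
        raise ValueError("previous_price must be positive")
    return reporting_price / previous_price * 100.0


# Hypothetical figures for one representative product sold on identical terms.
current = reporting_period_price(100.0, 104.0)
print(current, price_relative(current, 100.0))
```

The relative is meaningful only when both prices refer to the same product sold on the same terms. If the previous month's only sale went to a large-volume customer at a discount, that price would first have to be adjusted or imputed, which is exactly the manual step discussed above.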
(B) The special attributes of products affect "homogeneity and comparability".
1. The heterogeneity of residential products is pronounced. In the HPI survey, it is difficult to follow the principle of homogeneity and comparability in the price statistics of both newly built commercial houses and second-hand houses. Housing is a special commodity: unlike other commodities, it can hardly be made fully consistent in specification, model and size. Any two houses differ in geographical location, supporting facilities and residential environment. The heterogeneity of second-hand houses is even more prominent: even within the same residential area, they differ greatly in building location, floor, unit orientation, building depreciation, interior decoration and other qualities.
2. Some categories are updated frequently. Clothing products in the CPI survey and electronic products in the PPI survey are both updated rapidly. At present, CPI prices are collected mainly with hand-held data collectors, directly in the field, by fixed personnel, at fixed outlets and at fixed times. Clothing follows fashion, and clothing of the same brand keeps changing in style, fabric, color matching, etc. In addition, clothing is seasonal and generally has a short life cycle, so it is often difficult to collect sustained and stable prices under the fixed-personnel, fixed-outlet, fixed-time conditions. The PPI survey likewise stipulates that products with relatively stable production should be selected as representative products, and that "once selected as representative products, it is necessary to continuously investigate for a period of time." By the time a product enters the statistics, its price has often already reached a relatively stable stage, which easily leads to underestimation of the price index. This is one of the important reasons why the clothing price index in CPI and the ex-factory price index of the communication equipment, computer and other electronic equipment manufacturing industry in PPI have shown a slow, persistent decline for many years, at some distance from what the public actually experiences.
(C) The impact of big data indirectly affects "homogeneity and comparability".
In the era of big data, a huge amount of information is generated, and people have more and wider channels for obtaining it, which poses a great challenge to the traditional government statistical system. At present, many institutions and enterprises publish price indices of various kinds, with diverse data sources and different calculation methods, and their figures for each period differ to some degree from the data officially published by the National Bureau of Statistics. Most people do not know or understand the principle of "homogeneity and comparability" in government price statistics; they judge by subjective impression and question the government statistics. In this situation, if online public opinion is not monitored and the data are not interpreted promptly, the authority and credibility of government statistics will inevitably be weakened. Take CPI as an example. A few years ago, some enterprises published online shopping price indexes, and some researchers used them to evaluate the CPI released monthly by the National Bureau of Statistics, concluding that the CPI data contained a calculation error. In fact, such a company's price index is calculated only from the transaction data of its own e-commerce platform, which is far from meeting the requirement of a full sample. Another example is the house price index. The National Bureau of Statistics officially releases the sales price indexes of commercial housing in 70 large and medium-sized cities, while the price indexes released by other institutions can also be found on Internet platforms. Apart from those of the National Bureau of Statistics, these house price indexes are mostly calculated from the average prices of transaction samples in each city, without distinguishing between apartment structures or fully considering "homogeneity and comparability", so they are bound to differ from the official statistics.
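Why an index based on average transaction prices can diverge from a homogeneous comparison is easy to show with a small sketch. The data and function names below are hypothetical, and the stratified calculation is only one simple way of comparing like with like; it is not the method used by the National Bureau of Statistics or by any named institution.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def average_price_index(base, current):
    """Ratio of overall average prices, ignoring what kind of home was sold."""
    return 100.0 * mean(p for _, p in current) / mean(p for _, p in base)


def stratified_index(base, current):
    """Average the price relatives within each stratum, weighted by each
    stratum's share of base-period transactions."""
    def by_stratum(transactions):
        groups = defaultdict(list)
        for stratum, price in transactions:
            groups[stratum].append(price)
        return groups

    base_groups, current_groups = by_stratum(base), by_stratum(current)
    common = [s for s in base_groups if s in current_groups]
    if not common:
        raise ValueError("no stratum is observed in both periods")
    total = sum(len(base_groups[s]) for s in common)
    return 100.0 * sum(
        len(base_groups[s]) / total * mean(current_groups[s]) / mean(base_groups[s])
        for s in common
    )


# Hypothetical prices per square metre; no stratum changes price,
# but the mix of sales shifts from the suburb to the center.
base_sales = [("suburb", 10000.0)] * 3 + [("center", 20000.0)]
current_sales = [("suburb", 10000.0)] + [("center", 20000.0)] * 3

print(average_price_index(base_sales, current_sales))  # driven by the change in mix
print(stratified_index(base_sales, current_sales))     # like-for-like comparison
```

In this constructed case the average-price index rises sharply even though no home of either type became more expensive, while the stratified index stays at 100.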
Second, government statistics share the "opportunity" of big data development
For government statistical work, big data is an integration of data, methods and technology: data that are processed and mined at high speed with modern information technology and architecture, and that have high application value and a decision-support function. Generally speaking, it has "multi-V" characteristics, namely huge data Volume, Variety of data types, fast processing speed (Velocity), great application Value, and authenticity (Veracity). It is undeniable that big data provides unprecedented conditions and opportunities for the informatization reform of government statistical source data and for macroeconomic measurement. Because its samples are collected over a wide range and at high frequency, price statistics has become one of the areas most directly and significantly affected by big data. The exploration of using big data to improve price surveys such as CPI, PPI and HPI is under way, and the practical dilemma of "homogeneous comparability" is expected to be continuously alleviated and even effectively solved.
(1) The application of big data to price statistics is one of the important contents of promoting the reform of statistical modernization.
In 2020, the Fifth Plenary Session of the 19th CPC Central Committee made a major deployment to promote the reform of statistical modernization. In 2021, in order to build a modern statistical survey system suited to the modernization of the national governance system and governance capacity, the National Bureau of Statistics formulated the Reform Plan of Statistical Modernization in the 14th Five-Year Plan period. The plan points out that "a new round of scientific and technological revolution has developed in depth, which has provided a strong impetus for improving statistical productivity, changing statistical production methods and reshaping statistical production relations". At the same time, it recognizes that the digital transformation of statistical work is lagging behind, and it puts forward "improving and perfecting price statistics" and "promoting the application of departmental statistical data". In the same year, the "Work Plan for Big Data Application of National Bureau of Statistics (2021)" was issued, which clearly stated that it was necessary to "play the role of big data in expanding data sources, improving the efficiency of statistical investigation, improving the quality of statistical data, and achieving new breakthroughs in the statistical application of big data." It can be seen that the government has fully recognized the historical opportunities and important challenges that big data brings to government statistics.
(2) The application of big data to price statistics has been the subject of cutting-edge research.
Foreign scholars began studying the application of big data in price statistics earlier. In 1993, Diewert proposed that scanning data could be used in compiling price indexes, thereby reducing substitution bias and new-product bias. In recent years, domestic experts, scholars and statisticians have also put forward pertinent opinions from various angles. He Qiang (2015) argued that the future application of big data in China's government statistics should build on the wide application of big data, especially cloud computing, and should establish a data quality evaluation mechanism for big data, so as to create a more scientific and informative "second track" of government statistical data sources. Xie Zuozheng and Wang Kelin (2016) suggested using e-commerce data, scanning data and other data sources to gain an overall grasp of the structure and ex-factory prices of industrial products. Dong Qian (2017) combined the features of the characteristic price method and the repeated transaction method, compiled a second-hand housing price index with the repeated characteristic "R-H" transaction method, and selected different matching spaces to achieve maximum homogeneity and comparability under the existing data conditions. Yu Fangdong (2018) held that compiling CPI from online data differs from traditional sampling theory and methods in product matching, comparability and index compilation method, and that new theories and methods need to be created.
(3) The application of big data to price statistics has been explored in practice.
Abroad, Australian Bureau of Statistics, American Bureau of Labor Statistics, Statistics New Zealand, etc. have formally used scanning data to compile their own CPI. In recent years, China’s statistical departments have taken the lead in the development and utilization of big data in government agencies, and steadily promoted the research and application of big data in government statistics in accordance with the core application ideas of "overall design, leading research, easy before difficult, and professional breakthrough". As early as 2014, the National Bureau of Statistics signed a big data strategic cooperation framework agreement with six companies including Tencent, and carried out substantive cooperation in the fields of public opinion monitoring and house price statistics. In recent years, the system and methods of price statistics have been constantly improved and updated. In terms of CPI survey, in December 2020, in order to meet the requirements of big data and informatization for price survey, the Urban Department of the National Bureau of Statistics formulated and issued the Measures for the Application and Management of Scanning Data, which set clear requirements for the national CPI survey professional norms to apply scanning data to price collection, and encouraged all provinces and cities to actively carry out pilot work in cities and counties with mature conditions. Taking Fujian Province as an example, the Fujian Investigation Corps of the National Bureau of Statistics has carried out pilot work in 17 investigation outlets in 8 cities and counties in the province, and the form of "coexistence of old and new" has been taken as an important supplement to the traditional pricing method.
Third, using big data to improve the "road" of price statistics homogeneity and comparability
As mentioned above, there have been many researches and practical explorations on the application of big data in price statistics at home and abroad, but there are not many achievements in the specific research on the homogeneity and comparability of price statistics. The author tries to combine some research achievements at home and abroad and the first-line practical experience of price statistics, and puts forward some path ideas for solving the practical problems of homogeneity and comparability of price statistics by using big data for reference.
(1) Follow the principle.
1. The principle of bold exploration and long-term gradual progress. "Homogeneous comparability" has always been a key and difficult issue in the practice of price statistics. It is obviously impossible to find the optimal solution by relying solely on traditional survey methods, and big data has opened up a new data source for price statistics with its advantages of high frequency, fine granularity and diversification. Under the wave of the new era, only by boldly exploring and promoting in-depth cooperation between statistical departments and social institutions and big data enterprises in accordance with the path of "complementary advantages, mutual benefit and win-win, data orientation and gradual progress" can big data become an important supplementary source of price statistics. At the same time, however, at this stage, China’s use of big data to alleviate the dilemma of "homogeneity and comparability" of price statistics in practice has difficulties in data acquisition and quality assurance, as well as bottlenecks in technology and methods, which can not completely replace the traditional survey methods for the time being, and can only be used as a useful supplement to the existing methods. Using big data to optimize the homogeneity and comparability of price statistics should be a long-term and gradual process, which requires repeated experiments and research. It is necessary to prevent "big data arrogance" and avoid damaging the science and rigor of government price statistics due to rashness.
2. The principle of security, confidentiality and continuous stability. The Statistics Law of the People’s Republic of China clearly stipulates that the objects of statistical investigation must "provide true, accurate, complete and timely information needed for statistical investigation", and also requires "statistical institutions and statisticians to keep confidential the state secrets, business secrets and personal information they know in their statistical work." The use of big data is beneficial to alleviate the problem of "homogeneity and comparability" in price statistics. However, big data has a wide range of sources, and it often needs the help of private enterprises and institutions outside government departments to implement applications. Enterprises pay attention to commercial interests, which is different from the purpose of government departments to serve the public. Therefore, in the process of data cooperation, how to prevent potential leakage risks and security risks becomes the key, and a complete legal system needs to be established to regulate them. In addition, the characteristics of CPI, PPI, HPI and other price statistics require the samples to be as stable as possible for a certain period of time. If Internet companies and data asset companies engaged in data cooperation fail to survive for a long time, it will inevitably affect the continuity of price statistics. Therefore, it is an indispensable and important principle to ensure the continuous stability of data acquisition channels.
(2) Expand data collection methods.
1. Make full use of electronic scanning data. Electronic scanning data are obtained by scanning the EAN code of goods at sales outlets with scanning equipment, which yields product feature information such as the product name, product number and product model. At the time of a transaction, the retailer's electronic processing system can also record the retail outlet and its type, the price, the quantity sold, the transaction time and other information.
This collection method has the following advantages. First, the discrete price data collected by fixed personnel, at fixed outlets and at fixed times are replaced by high-frequency, continuous scanning data, which eliminates the deviation of discrete data. Second, manually collected data are replaced by electronic data, which avoids the measurement error of manual price collection and the response burden on price collection outlets. Third, product update information is more accurate, and new products enter the statistics in a more timely way. More comprehensive scanning data can give the CPI survey more support in achieving "homogeneous comparability" accurately.
This collection method has the following shortcomings. First, its field of application is relatively limited: it is mainly used in the CPI survey and is of little use in PPI, HPI and other price surveys. Second, it places high demands on outlets, which need a complete database system. Scanning data are easy to collect from large shopping malls, supermarkets, hospitals, etc., but hard to obtain from everyday farmers' markets and small shops because of equipment limitations. Third, the data are difficult to maintain. Compared with other countries that have adopted electronic scanning data, China has a vast territory, strong regional characteristics, obvious regional differences in development and price levels, and many chain enterprises in the consumer market. Statistical departments need the effective cooperation of more enterprises to collect scanning data, which greatly increases the difficulty of data collection and of data security.
At present, this method has been piloted in some areas of China where conditions permit, but only in the form of "coexistence of old and new", as a reference for compiling the price index. When conditions are ripe at a later stage, CPI could be calculated by gradually weighting the electronic scanning data together with the price data collected in the traditional way (fixed personnel, fixed outlets, fixed times, direct collection).
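The idea of weighting scanning data together with traditionally collected prices can be sketched as follows. This is only an illustration under stated assumptions: the unit value (total sales value divided by total quantity) is one common way to summarize scanner transactions, the linear blending rule and the weight of 0.2 are hypothetical, and none of the names or numbers come from the pilot work described above.

```python
def unit_value(transactions):
    """Quantity-weighted average price of one product over a period.

    transactions: iterable of (price, quantity) pairs from scanner records.
    """
    records = list(transactions)
    total_quantity = sum(quantity for _, quantity in records)
    if total_quantity <= 0:
        raise ValueError("total quantity must be positive")
    return sum(price * quantity for price, quantity in records) / total_quantity


def blended_relative(scanner_relative, traditional_relative, scanner_weight):
    """Weighted average of two price relatives (both expressed with base = 100)."""
    if not 0.0 <= scanner_weight <= 1.0:
        raise ValueError("scanner_weight must lie between 0 and 1")
    return (scanner_weight * scanner_relative
            + (1.0 - scanner_weight) * traditional_relative)


# Hypothetical scanner records for one product in two months.
base_scans = [(9.0, 10), (11.0, 10)]
current_scans = [(10.0, 30), (11.0, 10)]
scanner_relative = 100.0 * unit_value(current_scans) / unit_value(base_scans)

# Hypothetical relative from traditional collection, and an assumed weight.
print(blended_relative(scanner_relative, 104.0, scanner_weight=0.2))
```

A small scanner weight keeps the traditional series dominant during a trial period; the weight could then be raised step by step as data coverage and quality improve.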
2. Make appropriate use of network capture data. Network capture data, also known as network crawling data, refers to data sets with a degree of targeting, specialization and accuracy that are collected with Internet search engine technology and classified according to certain rules and screening criteria.
This collection method has the following advantages. First, the data sources are rich, which can greatly increase the sample size of the basket of goods and services. Combined with a suitable index compilation method, it can effectively address the problem in HPI, PPI and other price statistics that the price for the reporting month or the previous month is missing and "homogeneous comparability" cannot be achieved. Second, the collection frequency is high, so the compilation frequency of the price index can be raised from monthly to semi-monthly, weekly or even daily, improving the timeliness of price data release and better serving the public and decision-makers. Third, the labor cost of the survey is reduced: capturing price data with Internet technology overcomes geographical and time constraints and greatly reduces the price collection burden on grass-roots statisticians, surveyed enterprises and auxiliary investigators.
This collection method has the following shortcomings. First, it is strongly affected by technical factors. For example, website changes and interception technology may leave the captured data interrupted, duplicated or incomplete, so the capture technology must be improved continuously to make it more stable. Second, the samples are not stable enough: products captured online are updated faster than those in the traditional collection mode, and they are difficult to match effectively if the current index calculation method is still used. Third, the transaction price is difficult to identify. Whether on online shopping platforms or house trading platforms, most online data are sellers' quotations or listing prices, which still differ from actual transaction prices. If the two cannot be told apart effectively, the accuracy of the data may suffer.
At present, Norway, Britain, the Netherlands and other countries have partially used network capture data in compiling CPI and have made breakthrough progress. In China, the application of network capture data in price statistics is still at the exploratory stage. It is suggested that pilots be run in areas with mature information technology: in regions where online shopping is well developed, such as Jiangsu and Zhejiang, network capture data could be given a certain weight in compiling CPI; in developed areas such as Shanghai and Shenzhen, HPI could be compiled on a trial basis from data captured from housing brokerage platforms and POI data derived from geographic information systems, combined with the repeated transaction method and the characteristic price method.
(3) Improve the index compilation method.
1. Try comparing the prices of fixed groups. In the new era, as data collection methods expand and price statistics face ever larger and faster-growing volumes of data, we should appropriately break through the constraints of traditional statistical theories and methods. Drawing on the experience of Belgium, Britain and other countries, the prices of fixed groups of relatively homogeneous and comparable products can be compared on the basis of network capture data and electronic scanning data, so as to observe and reflect the price changes faced by consumers buying homogeneous or similar products. During the comparison period, the product groups are fixed, but the specific products within them may vary. Under this method, what is calculated within each basic category is the price ratio of the same product group in different periods, rather than the price ratio of specific products. The premise of this method is that the huge number of priced products is clustered so as to maximize homogeneity and similarity within each group, ensuring that products within a group do not differ significantly and reducing the deviation of the price index. At the same time, because the method breaks through the traditional framework and moves from "one-to-one" comparison to "group-to-group" comparison, index methods better suited to the new data sources need to be studied and explored.
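A "group-to-group" comparison can be sketched in a few lines. The sketch rests on assumptions that the text does not specify: products are taken to be already clustered into groups, each group's price in a period is summarized by its unit value, and the group price ratios are combined with an unweighted geometric mean. The records and names are hypothetical.

```python
from collections import defaultdict
from math import exp, log


def group_unit_values(records):
    """records: iterable of (group, product_id, price, quantity).

    Returns the unit value (sales value / quantity) of each product group.
    """
    value = defaultdict(float)
    quantity = defaultdict(float)
    for group, _product_id, price, qty in records:
        value[group] += price * qty
        quantity[group] += qty
    return {g: value[g] / quantity[g] for g in value if quantity[g] > 0}


def group_to_group_index(base_records, current_records):
    """Geometric mean of group price ratios, over groups priced in both periods."""
    base = group_unit_values(base_records)
    current = group_unit_values(current_records)
    common = sorted(set(base) & set(current))
    if not common:
        raise ValueError("no product group is priced in both periods")
    mean_log_ratio = sum(log(current[g] / base[g]) for g in common) / len(common)
    return 100.0 * exp(mean_log_ratio)


# The groups are fixed, but the products inside them change between periods.
base_period = [
    ("t-shirt", "A1", 50.0, 10),
    ("t-shirt", "A2", 70.0, 10),
    ("jeans", "B1", 200.0, 5),
]
current_period = [
    ("t-shirt", "A9", 66.0, 20),   # a new style that replaced A1 and A2
    ("jeans", "B1", 180.0, 5),
    ("jeans", "B7", 220.0, 5),
]

print(group_to_group_index(base_period, current_period))
```

No t-shirt product is sold in both periods, so a one-to-one comparison would have nothing to match, yet the group-level ratio can still be computed. How reliable it is depends entirely on how homogeneous the products within each group really are.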
2. Promote the practical application of the characteristic price method. The characteristic price method, also known as the hedonic model method, homogenizes samples by means of a characteristic price (hedonic) model. At present, France, Germany, the Netherlands and other European countries generally use the characteristic price method to calculate house price indexes. The method is also applicable to price indexes with many categories, such as CPI and PPI. It holds that price is determined by the utility a product brings to people, and that each utility corresponds to a certain characteristic price. After regression analysis on a large amount of actual transaction data, the influence of changes in characteristics is removed item by item from the total price change, leaving the pure price change caused by supply and demand and by the purchasing power of money, that is, the "homogeneous and comparable" price change. Hedonic functions mainly take linear, semi-logarithmic, exponential and double-logarithmic forms, which can be chosen according to specific needs. Many mature theoretical studies of this method exist at home and abroad, but it presupposes a large amount of product price and characteristic information, and the calculation tends to be complicated, which places high demands on back-end data processing capacity and on the competence of operators; practical experience in China is still limited. Electronic scanning data and network capture data make it possible to obtain commodity information on a large scale, which will provide more favorable conditions for implementing the characteristic price method and make it an important way to improve the homogeneity and comparability of price statistics. It is suggested that the method be tried out in some areas and then gradually promoted.
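A minimal sketch of the semi-logarithmic form may make the idea concrete. It assumes a time-dummy specification, ln(price) = a + b × area + d × period, with a single characteristic (floor area) and two periods; the index is then 100 × exp(d). The specification, the synthetic data and all names are illustrative assumptions, and a real application would use many more characteristics and observations.

```python
from math import exp, log


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("characteristics are collinear; cannot estimate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hedonic_time_dummy_index(observations):
    """observations: (price, area, period) with period 0 (base) or 1 (current).

    Fits ln(price) = a + b*area + d*period by ordinary least squares
    and returns 100 * exp(d), the quality-adjusted price index.
    """
    rows = [[1.0, float(area), float(period)] for _, area, period in observations]
    y = [log(price) for price, _, _ in observations]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    _, _, time_coefficient = solve_linear_system(xtx, xty)
    return 100.0 * exp(time_coefficient)


# Synthetic data: a pure price rise of 5 percent, while the homes sold
# in the current period happen to be larger than those sold in the base period.
sales = [(100.0 * exp(0.01 * area), area, 0) for area in (60, 80, 100)]
sales += [(105.0 * exp(0.01 * area), area, 1) for area in (90, 110, 130)]

print(hedonic_time_dummy_index(sales))
```

Because the synthetic prices follow the assumed model exactly, the regression recovers the built-in 5 percent pure price change, whereas a comparison of average prices would also pick up the shift toward larger homes.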
(Author: Fujian Investigation Corps of National Bureau of Statistics)