‘Even in the Current Climate, there are Ways to Study Russia, and Many of them Lie in the Realm of Digital Methods’
June 4, 2024
  • Arnold Khachaturov

    Editor of the data section at Novaya Gazeta Europe and cofounder of the Center for Data and Research on Russia (Cedar)

  • Maria Popova

    Independent researcher

Data journalist and sociologist Arnold Khachaturov discusses methods for getting reliable information out of Russia in the current conditions where the state systematically conceals and distorts data.
What is happening to Russian data journalism and data research now that so many sources of information have been closed off?

The broad picture looks pretty bad, but still, certain areas of journalism are developing rapidly. And there are a lot of new media projects in Russian: according to JX Fund, which supports Russian-language media in exile, at the end of 2023, 30% of the 93 media projects in exile had been created after the start of the full-scale war in Ukraine. Overall, the diversity of media has increased, which is a bit paradoxical given the visible decline in the total audience.

As for data journalism, on the one hand, demand for it has clearly grown: when you are cut off from the country and working in exile, and many have the status of “foreign agents” and “undesirable organizations,” data becomes an important window into reality. Today, in my view, this is just about the most popular area, at least judging by how quickly it is developing.

On the other hand, after 2022 a kind of “state of emergency” was introduced with regard to Russian open data: a law adopted in February 2023 allowed the government to conceal any statistics without providing special justification. According to our estimates, over the past two and a half years, more than 50 government agencies have concealed approximately 600 datasets of varying degrees of importance.

Excluding broadly technical sets of data, we see that financial and economic indicators have suffered the most – for example, foreign trade statistics, reporting of companies and banks, government procurement, updates on budget execution, income of officials, and so on. All this data is no longer published under the pretext of sanctions risks.
Because of this, in particular, it is becoming less and less clear how financial flows move in the country, and who is managing the ballooning state expenditures and how. And in the occupied territories of Ukraine, it is basically the “wild West” – huge amounts of money are being poured in, without control by civil society and sometimes even the Ministry of Finance. Still, though all these processes started before 2022, they really picked up steam after.

The second group of indicators [that have been concealed] are those directly or indirectly related to combat operations, for example, detailed data on causes of death, statistics on payments for disabilities and to families of victims, and the size of the prison population. As a rule, they were hidden after journalistic investigations. For example, Novaya Gazeta Europe had estimated the number of victims in the war based on the size of payments for victims, while Mediazona had gauged the scale of prisoner recruitment using open data from the Federal Penitentiary Service.

And the third group is potentially “dangerous” data, which is concealed preemptively, apparently for fear of public attention.
“For example, the Prosecutor General has stopped updating its portal with detailed crime statistics, the Judicial Department has deleted data on military service crimes, Rosaviation has deleted information on the condition of Russian civil aircraft, and so on.”
Sometimes, when the state hides socially important data, journalists and researchers manage to “save” it. This, for example, is the work of projects like Infoculture by Ivan Begtin and To be Precise (“Yesli byt’ tochnym”; note that Khachaturov led the latter – MP). Here you can find archival data from the Federal Penitentiary Service on the prison population and the structure of correctional institutions, which is no longer publicly available. Or data on emissions of harmful substances that Rosprirodnadzor has concealed.

The Rosprirodnadzor data had been used by activists to monitor the environment, but it was deleted after the full-scale war started. We managed to download it using a hidden API (application programming interface) on the Rosprirodnadzor website. Most likely, this is also a temporary solution: sooner or later, such loopholes will be closed.
The Rosstat office in Moscow. Source: Wiki Commons
How pervasive is the concealment of data?

There are many cases, but I would not call it full-on. A lot of data is still not concealed, and some officials are even calling for previously classified statistics to be declassified. Central Bank Chair Elvira Nabiullina noted that withholding data is fraught with losses for the entire economy. And there are already some cases of restrictions being eased: for example, the government has partially started publishing customs data that disappeared at the beginning of the war.

Government agencies operate very differently. The Ministry of Health, for example, is rather closed, does not communicate well and hides even what could be made public without big risks. This also includes the entire security bloc [of the government]. Compared with them, Rosstat is much more open and communicative, although there are also many complaints about it. Meanwhile, the Bank of Russia even launched a new service for accessing statistics.

Data is rather deeply integrated into decision-making systems in the executive branch. You cannot dismantle this infrastructure overnight, so for now we have seen a relatively cautious approach toward concealing data.

Fortunately for us, the process of “cleansing” open data is so far being handled unsystematically, chaotically and very selectively. Bureaucratic inertia also plays a role: the agencies that are responsible for digitalization and the development of open data do not want to lose their “place in the sun.”
So, even with the military censorship, compared to other countries of the former Soviet Union and not only them, the availability of government data in Russia is still high.
Ella Pamfilova, chair of the Central Election Commission. Source: Wiki Commons
Of course, this availability is more a by-product of administrative processes and authoritarian digitalization than a desire for openness and transparency on the part of the state, as had been the case 10-12 years ago, when the Russian authorities were taking steps to make data more open. But thanks to the fact that there is still a lot of data, we understand where and what they are not telling us.
Let’s take the example of election statistics. We can quantify the scale of fraud in the last presidential election (see our study) precisely because the Central Election Commission (CEC) continues to publish results from precinct election commissions. Accessing them is deliberately made more difficult; the CEC uses data obfuscation (i.e., it changes the data when you try to copy it – MP), but it is possible to get around these problems. In other words, the availability of statistics in Russia in 2024 is still such that you can detect fraud even in sensitive areas like elections.

Data in Russia is often manipulated. How do you recognize reliable and unreliable data sources?

There is the classic example of statistics related to Putin’s May 2012 decrees being fudged. A KPI for governors to reduce the number of suicides or deaths from tuberculosis in Russia’s regions led to a situation where local doctors began to record different causes of death. In reality, nothing had changed – deaths from suicide were simply recorded as being caused by “injuries of undetermined intent” – but on paper, in reports to the government, the results were good. Basically the same thing happened during the pandemic, when incidence statistics were significantly understated in nearly every region.

As is the case with the election results released by the CEC, this does not mean that the data is completely useless. At the very least, these manipulations can be seen, since if you massively reclassify the cause of death from one category to another, this will definitely be visible. It’s the same story with Covid: you can make up the number of sick people, but you cannot make up the total number of deaths, meaning that the approximate scale of manipulation will still be clear from the level of excess mortality.
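The excess-mortality check described above can be sketched in a few lines. This is a minimal illustration, not the method used in any particular study: it assumes a flat baseline (the plain average of a few pre-pandemic years), whereas real analyses usually fit a trend, and all figures and function names below are hypothetical.

```python
def excess_deaths(baseline_years, observed):
    """Observed deaths minus the average of the baseline years.

    Assumption: a simple mean serves as the expected level; a trend or
    seasonal model would normally be used instead.
    """
    expected = sum(baseline_years) / len(baseline_years)
    return observed - expected


def underreporting_ratio(baseline_years, observed, official_covid_deaths):
    """How many excess deaths there are per officially reported Covid death."""
    return excess_deaths(baseline_years, observed) / official_covid_deaths


# Hypothetical regional figures, for illustration only.
baseline = [1000, 1020, 980]   # total deaths in three pre-pandemic years
observed = 1300                # total deaths in a pandemic year
official = 100                 # officially reported Covid deaths that year

ratio = underreporting_ratio(baseline, observed, official)
print(f"Excess deaths: {excess_deaths(baseline, observed):.0f}")
print(f"Excess deaths per official Covid death: {ratio:.1f}")
```

A ratio well above 1 suggests that the reported incidence figures understate the real toll, even though the sketch says nothing about why.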
The reliability of Russian data is the subject of our recent study. We took 30 important indicators from different areas – from GDP and unemployment to crime rates and the number of people with HIV – and evaluated them in terms of the completeness of the methodology, the quality of counting, the presence of signs of manipulation, and so on.
This is the pilot phase of the project. We want to answer the question of what data about Russia can be trusted and with what reservations, and also develop practices for “scrubbing” Russian statistics.

The main outcome of our work was a tool for identifying red flags, which you need to pay attention to so as to understand where the defect in the indicator lies.
Twelve indicators fell in the “red” zone, including poverty, migration, and crime rates – these are the most unreliable gauges and must be treated with the utmost caution. Some of them, like poverty numbers, are intentionally embellished by the state, while others, like migration data, simply reflect reality very poorly due to counting problems.

There were eleven indicators in the “yellow” zone, in particular, GDP, population and abortion statistics. We are not aware of any direct manipulation of Russian GDP data. The problem is that given the current structure of the Russian economy, GDP does not reflect the real quality of life among people but rather the volume of defense production.

Or unemployment: when Vladimir Putin says that it is at record lows, he is relying on more or less reliable data but drawing the wrong conclusions. Low unemployment indicates that there is a shortage of labor due to mobilization, emigration and an aging population, so this is more bad than good. There is no manipulation at the level of statistics – it is at the level of interpretation.

The remaining seven indicators are in the “green” zone, i.e., we did not find any significant distortions. They are almost all purely “administrative,” such as the birth rate or the number of court sentences.

Our takeaway is that manipulation is not as extreme as many people think. Also, we often have a way to separate “junk” from more or less reliable data. It is wrong to say that all official data about Russia should be thrown out. Even from distorted data you can sometimes extract useful knowledge, if you know how to use it.
As long as the Russian state still cares about effectiveness and technocrats remain in the government, the country will not completely turn into a black box.
A maternity ward in Moscow. Source: VK
In one of your data investigations, you talk about how, due to a change in methodology, three million poor people simply dropped out of Rosstat’s statistics. What methods do state statistics use to overestimate or underestimate indicators?

This case shows the multidirectional trends within the Russian bureaucracy. Before the full-scale war started, there was a desire to make the methodology for calculating poverty more modern and similar to Western ones, i.e., to take as a basis the median salary in the country, rather than a fixed poverty threshold.

Rosstat even enacted these changes, but it turned out that with this approach, poverty in Russia is rising, not falling. I do not know what happened next, but the methodology was changed again and the concept of a “poverty line” reintroduced. The new formula is hard to justify, but it produces better numbers, pleasing the government, which has reported record-low poverty for two years in a row.

Since Rosstat publishes the methodology, we know exactly how these figures were obtained and we can calculate poverty ourselves using different methodologies and compare the results, which is what we did. It turned out that at least three million poor people were excluded from the official data. And if we use more multidimensional approaches, the figure doubles or even triples.
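Comparing a fixed poverty threshold with a median-based one can be sketched as follows. This is an illustration under stated assumptions, not Rosstat's formula or the calculation from the investigation: the incomes, the fixed line, and the 50% share of the median are all hypothetical choices.

```python
from statistics import median


def poverty_headcount(incomes, line):
    """Number of people whose income falls below the given poverty line."""
    return sum(1 for income in incomes if income < line)


def relative_line(incomes, share=0.5):
    """Relative poverty line as a share of the median income.

    Assumption: share=0.5 is a placeholder, not an official parameter.
    """
    return share * median(incomes)


# Hypothetical monthly incomes (thousand rubles), for illustration only.
incomes = [8, 10, 12, 14, 20, 30, 40, 50, 60, 100]
fixed_line = 11  # hypothetical fixed threshold

poor_fixed = poverty_headcount(incomes, fixed_line)
poor_relative = poverty_headcount(incomes, relative_line(incomes))
print(f"Poor under fixed line: {poor_fixed}")
print(f"Poor under relative line: {poor_relative}")
print(f"Dropped out of the count: {poor_relative - poor_fixed}")
```

The gap between the two headcounts is the kind of figure that can be reported as people "excluded" by a methodology change.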

There are a lot of ways to get a desired figure – it all depends on the ingenuity of those doing the counting. Another example: the government is trying to reduce the number of schoolchildren who attend the second (late) session, as such an order has come from higher-ups. But in reality there are not enough seats, so the term “staggered schedule” has appeared. According to statistics, classes on a “staggered schedule” are part of the first (early) session, but in reality they begin at 1:00 in the afternoon (when the second session starts).

Sometimes data is distorted in a more “natural” way. For example, in Moscow there is “maternity tourism,” when women come from other regions to give birth, which inflates the birth rate in the capital. Another story is the “republics” in the North Caucasus, where there are also quite a few anomalies in the statistics. And there are also innocent mistakes when firms enter data.

You launched an open data portal Cedar, a nongovernmental digital archive of data about Russia. How is it different from other, existing platforms?

Our project is aimed at academic researchers, experts and analysts who are engaged in Russian studies in the broad sense of the word. After 2022, many of them lost the ability to conduct field research in Russia, with the loss of sources and other problems associated with physical inaccessibility. All this negatively affects the level of expertise and reduces the quality of decision-making.

Cedar’s goal is to show that even in the current climate, there are ways to study Russia, and many of them lie in the realm of digital methods. Sure, data and computational methods cannot replace traditional sociology and ethnography, but they can productively complement them and provide new insights. We use a variety of approaches, from OSINT techniques and critical work with official statistics to scraping social media data and natural language processing.

For example, we have a complete database of court decisions from the beginning of the 2000s, which we have scraped from the websites of a couple of thousand Russian courts. Or data on the results of federal elections since 2000 by precinct election commission with a calculated share of anomalous votes.
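One way a "share of anomalous votes" can be estimated from precinct-level results is a turnout-histogram heuristic. The interview does not say how Cedar computes its figure, so the sketch below is an assumption: it groups precincts into turnout bins, takes the leader-to-others vote ratio at the bin where the other candidates peak, and treats leader votes above that ratio in higher-turnout bins as anomalous. The function name, bin count, and data are hypothetical, and invalid ballots are ignored.

```python
from collections import defaultdict


def anomalous_share(precincts, n_bins=20):
    """Estimate the share of the leader's votes that look anomalous.

    precincts: iterable of (registered, leader_votes, other_votes).
    Assumption: turnout is approximated as (leader + other) / registered.
    """
    leader = defaultdict(int)
    others = defaultdict(int)
    for registered, lead, other in precincts:
        ballots = lead + other
        # Integer arithmetic avoids floating-point errors at bin edges.
        b = min(ballots * n_bins // registered, n_bins - 1)
        leader[b] += lead
        others[b] += other

    # Reference bin: where the other candidates collect the most votes.
    peak = max(others, key=others.get)
    ratio = leader[peak] / others[peak]

    # Leader votes beyond the reference ratio in higher-turnout bins.
    excess = 0.0
    for b in list(leader):
        if b > peak:
            excess += max(0.0, leader[b] - ratio * others[b])
    return excess / sum(leader.values())


# Hypothetical precincts, for illustration only.
sample = [
    (1000, 300, 300),  # 60% turnout, even split
    (1000, 300, 300),  # 60% turnout, even split
    (1000, 850, 100),  # 95% turnout, lopsided result
]
print(f"Estimated anomalous share: {anomalous_share(sample):.1%}")
```

The heuristic only flags a statistical pattern; it cannot by itself distinguish fraud from genuine local differences in turnout and support.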

You collect a thematically wide array of datasets. What data have you found to be the most difficult to find or not available at all?

I would identify four main stages in working with data: searching, downloading, processing and analyzing. Then “combinatorics” begins. There is data that is difficult to find but relatively easy to download. A big chunk of official data falls into this category, because it is not always easy to find the indicator you need on government websites, [since] they are not very user-friendly, and the form you need can be buried very deeply. Or it is spread across dozens of Word files, inside which scans of tables are inserted... And this becomes a processing problem.

Sometimes the data is not that hard to find but you need to get creative to download it. This was the case with pollution statistics, which we obtained using a hidden API, or with the latest data from the e-budget, which could be downloaded in a roundabout way through a graphic widget on the website. In the case of court data, for example, the difficulty is that it is located on many separate sites and is protected by captchas.
From the social networks that are popular in Russia – VK and Telegram – it is very easy to download data but not so easy to analyze it. We strive to make the work of researchers and journalists easier at all four stages.

You probably monitor views and citations in the media. What is the Russian-speaking reader following closely now?

If we look at readers inside Russia, I would highlight two trends. First, there is war fatigue. Second, there is a feeling of the media in exile “losing touch” with readers who remain in the country. Obviously, the war in Ukraine remains the number one topic for Russian-language independent media, but it is important to respond to the trends that I mentioned.

Many newsrooms are trying to water down political content with social issues, add elements of solutions journalism and find “good news” as an alternative to doomscrolling. Everyone has their own answers, and I believe that data journalism can also be one of the “antidotes.” Numbers are often more credible, more objective and more accepted than opinions. Therefore, working with data can be a way to attract a less politicized reader.