In an era when big data and AI make headlines every day, everyone seems to want to pick up some programming. Master it, the thinking goes, and reaching the top is no longer a dream. In practice, though, even the initial choice of which programming language to learn deserves care. After analyzing visits to different technology tags on the programming Q&A site Stack Overflow, a data analyst concluded that the global software development ecosystem is effectively split in two, and that the programming languages popular in different countries differ considerably. You may think you chose your programming language, but perhaps the language chose you.
At Stack Overflow, we are interested in using our data to share insights about the global software development community. A recently published post about mobile developers is a good example: it examined traffic to Android-related questions around the world and found that Android is visited more often from lower-income countries than from high-income ones.
That post got us thinking about how programming technologies differ between rich and poor countries, and how this affects our view of the global software development industry. In this post, we explore these differences and show that it is useful to distinguish high-income countries from the rest of the world when studying the software development industry.
All analyses in this post cover January through August 2017. We studied the 250 tags with the most traffic during that period. To reduce noise, we analyzed only the 64 countries that had at least 5 million question visits in that time. Note that these data represent the activity of developers who read English. Some analysis of the Spanish and Portuguese language sites showed similar trends in non-English-speaking countries such as Mexico and Brazil.
Technologies correlated with GDP per capita
In a recent post, we saw that traffic to Android questions, measured as a percentage of a country's Stack Overflow visits, tends to be negatively correlated with that country's GDP per capita. This made us wonder whether other tags show a similar correlation.
When we look at the major programming languages and platforms, several stand out besides Android, including PHP, Python, and R.
Tag traffic vs. GDP per capita
Traffic to Android and PHP is negatively correlated with a country's per capita income, while traffic to Python and R is positively correlated with it. In each case there are exceptions (South Korea visits Android more than we would expect, and China visits Python more), but the correlations are generally pronounced (after adjusting for multiple testing, each R^2 value is between 0.5 and 0.6).
We should emphasize that we are not claiming any causal relationship here: neither that the choice of programming language affects a country's average income, nor that a country's wealth directly determines the technologies it uses. We suspect these correlations are confounded by a mix of economic and social factors (for example, education level, the age of the local software industry, and the amount of outsourcing). In general, these factors are themselves related to a country's wealth.
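To make the R^2 figure above concrete, here is a minimal sketch of how such a value could be computed for one tag. It is not the original analysis: the function names, the choice to put GDP per capita on a log scale, and the toy numbers are all assumptions made for illustration.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


def tag_vs_gdp_r2(rows):
    """rows: (gdp_per_capita, tag_visits, total_visits), one tuple per country.

    Assumption: GDP is log-transformed before correlating; the source
    does not say which scale was used.
    """
    log_gdp = [math.log10(gdp) for gdp, _, _ in rows]
    share = [tag / total for _, tag, total in rows]
    return r_squared(log_gdp, share)


# Hypothetical, made-up numbers; not real Stack Overflow or GDP data.
toy_rows = [
    (2_000, 90, 1_000),
    (8_000, 70, 1_000),
    (20_000, 50, 1_000),
    (50_000, 30, 1_000),
]
toy_r2 = tag_vs_gdp_r2(toy_rows)
```

R^2 alone does not give the direction of the relationship. In the toy data the tag's share falls as GDP rises, which is the Android and PHP pattern; the sign of the underlying correlation would have to be checked separately.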
How do we divide the software development industry into two parts?
Having seen this trend, we divide countries into two groups, high-income and non-high-income, instead of treating all countries together. For the income level we can rely on an existing classification published by the World Bank. The map below shows countries' income levels, classified by GNI (gross national income) per capita.
Map of the World Bank's income classification
The map contains 78 high-income economies. Besides the US and Canada, they include a number of Western European countries, some Middle Eastern and East Asian countries, and Australia and New Zealand. I did some basic exploration of the differences between countries (using principal component analysis, for instance), which showed this to be a reasonable division, and a more meaningful one than alternatives such as classifying countries geographically into eastern and western hemispheres. For example, the technology tags visited from Australia resemble those visited from the US and Europe rather than those visited from China or Indonesia.
Share of Stack Overflow traffic by country and income classification
This division splits Stack Overflow traffic into roughly two-thirds and one-third: 63.7% of traffic comes from high-income countries. This may be because a larger share of the world's software development takes place in high-income countries, and because those countries have greater internet access and more English speakers. Most traffic from non-high-income countries comes from India, followed by Brazil, Russia, and China.
How do the technologies used in high-income countries differ?
Now that we have divided the software development world into two parts, how do high-income countries and non-high-income countries differ in the technologies they use?
Differences in technology tag visits between high-income countries and the rest of the world
We can draw some interesting insights from the diagram.
Data science technologies: as we saw earlier, Python and R are positively correlated with a country's income. In high-income countries the Python tag is visited about twice as often as in the rest of the world, and the R tag about three times as often. Among the smaller tags, many of those with the largest differences are data science technologies written in Python and R, such as pandas, NumPy, Matplotlib, and ggplot2. This suggests that part of the gap may come from the greater weight of science and academic research in high-income countries, which would explain why these two languages are more common in wealthier industrialized countries. Scientific research usually makes up a larger share of the economy in high-income countries, and programmers there are more likely to have advanced degrees.
C / C++: C and C++ are two other well-known programming languages that are visited more in high-income countries. One hypothesis is that this is related to education: as we saw in a previous post, C and C++ are particularly popular at American universities. It may also be related to the global geographic distribution of electronics and manufacturing.
PHP and Android: in the previous post we explored Android development around the world and found that Android is more popular in lower-income countries. PHP is likewise favored in lower-income countries. CodeIgniter, an open source PHP framework, is the tag visited most disproportionately from lower-income countries, far more than from anywhere else, which makes its geographic distribution very unbalanced. Further examination showed that the tag draws a very large number of visits from South and Southeast Asia (especially India, Indonesia, Pakistan, and the Philippines) but very few from the US and Europe. One possibility is that outsourcing companies often choose CodeIgniter when building websites.
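The "twice as often" and "three times as often" comparisons above are ratios of a tag's share of traffic within each income group. A minimal sketch of that calculation follows; the function name and the visit counts are hypothetical and are not taken from the original analysis.

```python
def tag_share_ratio(high_visits, rest_visits):
    """Return {tag: share in high-income group / share in the rest}.

    Each argument maps a tag name to its visit count within one group.
    A ratio above 1 means the tag takes a larger share of traffic in
    high-income countries; below 1 means the opposite.
    """
    high_total = sum(high_visits.values())
    rest_total = sum(rest_visits.values())
    ratios = {}
    # Only tags present in both groups can be compared.
    for tag in high_visits.keys() & rest_visits.keys():
        high_share = high_visits[tag] / high_total
        rest_share = rest_visits[tag] / rest_total
        if rest_share > 0:  # skip tags with no visits in the rest of the world
            ratios[tag] = high_share / rest_share
    return ratios


# Hypothetical counts chosen so the arithmetic is easy to follow.
toy_high = {"python": 200, "php": 100, "other": 700}
toy_rest = {"python": 50, "php": 100, "other": 350}
toy_ratios = tag_share_ratio(toy_high, toy_rest)
```

Comparing shares rather than raw counts matters because the two groups have very different total traffic. In the toy data PHP has the same raw count in both groups, yet its share is only half as large in the high-income group.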
Conclusion: why is it necessary to do this research?
I find these results interesting in their own right, because they reveal something about the programming language ecosystem. They will also inform other data studies that we plan to publish in the near future.
When we ask questions about the software development industry, it helps to keep in mind that the industry divides into two parts; knowing about this division gives us more information to work with.
For example, we are often interested in which technology tags draw the most traffic. Visits to the Flash tag, for instance, have gradually declined over time. If we list the most visited programming technologies, the lists for high-income countries and for the rest of the world look quite different:
Most visited programming technology tags by national income group
For example, in the 2017 data Python was the second most visited technology tag in high-income countries, but ranked only eighth in the rest of the world. My own language of choice, R, is the fifteenth most visited tag in high-income countries, but does not even make the top 50 in the rest of the world.
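Rank comparisons like the ones above can be produced by ranking tags separately within each income group. The sketch below assumes per-group visit counts are available as dictionaries; the function name, tag names, and counts are made up for illustration.

```python
def tag_ranks(visits):
    """Map each tag to its rank (1 = most visited) within one group.

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(visits, key=lambda tag: (-visits[tag], tag))
    return {tag: rank for rank, tag in enumerate(ordered, start=1)}


# Hypothetical visit counts; not real Stack Overflow data.
toy_high = {"javascript": 900, "python": 800, "java": 700, "r": 100}
toy_rest = {"javascript": 900, "java": 850, "php": 800, "python": 300}

high_ranks = tag_ranks(toy_high)
rest_ranks = tag_ranks(toy_rest)

# A tag missing from one group simply has no rank there (None).
python_ranks = (high_ranks.get("python"), rest_ranks.get("python"))
r_ranks = (high_ranks.get("r"), rest_ranks.get("r"))
```

Ranking within each group, rather than across all traffic pooled together, is what lets the same tag land in very different positions in the two lists.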
When we use Stack Overflow data to understand the developer ecosystem, these two worlds of software development are important background to keep in mind. An American technical recruiter interested in where the industry is heading, a student in India unsure which programming language to learn, and an investor trying to understand a Kenyan technology company may each see the same programming languages very differently.
In future posts we will return to this division from time to time, as it helps us continue to explore the global developer ecosystem.
Produced by the translation team. Editor: Hao Pengcheng