June 2021, Mohsin Raza & Paul van Puijenbroek, DPulse
We compare the four most commonly used statistical analysis tools: two open source languages (Python and R) and two commercial packages (SPSS and SAS). The data world is moving fast, so it is high time to take stock of where Python, R, SPSS and SAS stand. We also look at Power BI and Tableau, applications for dashboarding and reporting, and note where they fit in this discussion.
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data scientists for developing statistical software and data analysis. R is open source and has a huge community. It is this community that makes R so strong on both the explanatory and the predictive side of data analytics.
Python is also an open source language with an ever-increasing open source community. Python was conceived in the late 1980s at Centrum Wiskunde & Informatica (CWI) in the Netherlands. Its implementation began in December 1989, and its first release came in 1991. Python has developed strongly in the field of machine learning and can be used especially well in advanced analytics.
SAS was developed at North Carolina State University from 1966 until 1976, when SAS Institute was incorporated. SAS delivers analytics solutions that include data mining, statistical analysis, forecasting and econometrics, optimization and simulation, and text analytics.
SPSS Statistics is a software package used for interactive, or batched, statistical analysis. Long produced by SPSS Inc., it was acquired by IBM in 2009. SPSS offers advanced statistical analysis, a vast library of machine learning algorithms, text analysis, open source extensibility, integration with big data and seamless deployment into applications.
In Table 1, we summarize the comparison of the four technologies, with SAS and SPSS rated together. The comparison covers the following dimensions: Methods and Techniques, Ease of Learning, Graphical Capabilities, Customer Service Support & Community, Deep Learning Support, and Cost.
Criterion | SAS & SPSS | R | Python |
---|---|---|---|
Methods & Techniques | Medium | High | High |
Ease of Learning | High | Medium | Medium |
Graphical Capabilities | Low | High | Medium |
Customer Service Support & Community | Medium | High | Medium |
Deep Learning Support | Low | Medium | High |
Cost | High | Low | Low |
Table 1: Comparison of R & Python vs. SAS & SPSS
We rate each technology as “High”, “Medium”, or “Low” on every criterion to compare their relative strengths. Appendix 1 explains the scores.
R: Facebook, Google, Twitter, Microsoft, Uber, Airbnb, IBM, ANZ, Stack Overflow, AstraZeneca
Python: Google, Facebook, Instagram, Spotify, Quora, Netflix, Dropbox, Reddit
SAS: Honda, Apple, Bank of America, DSM, HP, Lenovo, Rabobank, VISA, AEGON
SPSS: eBay, KPMG, Cognizant, IBM, Accenture, Baylor Scott & White Health, Weill Cornell Medicine, BMO Financial Group, The American Red Cross
These lists suggest that the more “traditional” companies still work with SPSS or SAS, while the “younger” companies tend to work with R and Python.
Many companies use applications like Power BI and Tableau to generate their dashboards and reports, and might also use them for predictive analytics, so we briefly outline where they stand in the data science ecosystem. Power BI and Tableau focus on Business Intelligence and provide only limited capabilities in advanced analytics. To compensate for this, both packages support a limited integration with the R and Python ecosystems. The limitations are further explained in Appendix 2.
These companies evidently realize that nothing is more powerful than R and Python when it comes to predictive analytics and statistical analysis, including the visualization of results, both now and probably for many years to come. This is thanks to their open source code and their huge communities, which continuously contribute to the improvement of R and Python.
Stack Overflow is the most widely used platform where developers ask questions about tooling. Looking at how often the Python, R and SAS tags appear (SPSS is not included), a clear trend can be seen in Figure 1:
Figure 1: Stack Overflow Trends, 19/06/2021
The graph in Figure 1 shows the questions asked about Python, R and SAS on Stack Overflow as a share of all questions asked on the platform, from 2008 onwards.
Given the popularity of Stack Overflow among developers and data scientists, it is a useful indicator of the popularity of a particular technology. Figure 1 shows that 16% of the questions relate to Python and almost 4% to R, and the trend is still upward. Questions about SAS are hardly asked at all: their share is close to 0%. This clearly shows the popularity of R and Python over SAS.
The open source community is also growing through large organizations, such as Netflix, Airbnb and Google, that develop programs and packages and publish them as open source. An important recent trend is the call for explainability in predictive machine learning models, which is another argument for using R and Python. R has long been strong in this respect, and interest within the Python community is growing as well. Both environments contribute to transparency about what happens inside the algorithms. It is for good reason that 'Explainable Artificial Intelligence', or XAI, is an emerging term in the field.
In recent years there has been a clear decrease in the use of SAS and SPSS and an increase in the use of R and especially Python. The communities behind these open source tools are what make the languages so strong, which in turn attracts an even bigger community. As a result the range of possibilities keeps growing rapidly, which, combined with the developments in deep learning, gives R and Python an edge that SAS and SPSS cannot match. Looking forward, this edge will only grow as new techniques such as deep learning become more important.
https://insights.stackoverflow.com/trends?tags=r%2Cpython%2Csas
https://www.analyticsvidhya.com/blog/2017/09/sas-vs-vs-python-tool-learn/
https://www.tableau.com/learn/whitepapers/using-r-and-tableau
https://docs.microsoft.com/en-us/power-bi/create-reports/desktop-r-visuals#known-limitations
All four ecosystems offer the basic and most commonly needed functions. Due to their open nature, R and Python get the latest features quickly. SAS and SPSS, on the other hand, update their capabilities only in new version roll-outs. Since R has been widely used in academia, new techniques are developed for it quickly. The Python community also responds quickly to new developments. This matters particularly for the most advanced recent developments, which arrive in the form of packages and algorithms.
R and Python clearly dominate on methods and techniques thanks to the large community that helps develop packages and algorithms for them. SAS and SPSS are lagging behind.
SAS and SPSS provide a good, stable GUI with a point-and-click interface, which means the user does not necessarily have to code. In terms of resources, tutorials are available on the websites of various universities, and both packages come with comprehensive documentation.
R has the steepest learning curve of the four technologies listed here, as it requires you to learn and understand coding. Python is known in the programming world for its simplicity, and this holds true for data analysis as well. While there are no widespread GUIs for it as of now, Python notebooks will become more and more mainstream.
SAS and SPSS have decent graphical capabilities, but they are purely functional. Any customization of plots is difficult and requires you to understand the intricacies of the packages.
R, along with Python, has highly advanced graphical capabilities, provided by numerous packages.
Graphical capabilities also go to R and Python. Not only do graphs look a lot slicker when made with, for example, ggplot or D3, but R and Python also give you the opportunity to develop interactive tools.
SAS and SPSS have dedicated customer support that helps with all issues around installation and usage. However, due to their cost, their communities are not that large.
R does not have a dedicated customer service team, but it does have a massive community. The R community has people from almost all industries and from all over the world. A solution for any issue can be provided by the large community. Python is also open-source, and therefore, it also has a large community.
In practice, the R and Python communities respond better and much faster. Their breadth brings not only faster development but also better support via dedicated platforms.
Deep learning is a collection of techniques that have proven themselves over the past few years to be the algorithms with the most potential. Deep learning support in SAS and SPSS is still in its early phase and leaves a lot to be developed. Python, on the other hand, has seen great advancements in the field and has numerous packages such as TensorFlow and Keras. R has recently added support for those packages.
SAS and SPSS are commercial software. They are expensive and still beyond the reach of most professionals and organizations. R and Python, on the other hand, are completely free.
Cost also goes to R and Python. Both are open source, making them free to use. Nevertheless, they do have a longer learning curve than the GUIs of SPSS and SAS, which often makes training more expensive.
The R runtime currently supported by Power BI is Microsoft R 3.4.4 (the latest R release is 4.1.0).
The Power BI service supports only a limited number of R packages.
R visuals in Power BI Desktop have the following limitations:
Power BI only supports 150,000 rows for plotting an R visual. For any more rows, only the top 150,000 rows are used (data loss).
Power BI has an output size limit of 2MB for R visuals.
Power BI only supports 72 DPI for R visuals.
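The row and output size limits listed above can be checked before data is handed to an R visual. The following is a minimal Python sketch and not part of Power BI or its API: the function name `check_r_visual_limits` and the two constants are hypothetical, and reading "2MB" as 2 × 1024 × 1024 bytes is an assumption.

```python
from typing import List

# Limits as quoted in the list above for R visuals in Power BI Desktop.
MAX_ROWS = 150_000
# Assumption: "2MB" is interpreted as 2 * 1024 * 1024 bytes.
MAX_OUTPUT_BYTES = 2 * 1024 * 1024


def check_r_visual_limits(row_count: int, output_bytes: int) -> List[str]:
    """Return warnings for values that exceed the quoted limits.

    Hypothetical helper for illustration only; it does not call Power BI.
    """
    warnings = []
    if row_count > MAX_ROWS:
        dropped = row_count - MAX_ROWS
        warnings.append(
            f"{dropped} of {row_count} rows would be ignored "
            f"(only the first {MAX_ROWS} are plotted)"
        )
    if output_bytes > MAX_OUTPUT_BYTES:
        warnings.append(
            f"output of {output_bytes} bytes exceeds the "
            f"{MAX_OUTPUT_BYTES}-byte limit"
        )
    return warnings


if __name__ == "__main__":
    # Example: a dataset larger than the row limit, with a small output.
    for message in check_r_visual_limits(row_count=200_000, output_bytes=1_000_000):
        print(message)
```

A check like this makes the silent truncation visible: an empty list means the data fits within both limits, while any warning indicates that the visual would be built on incomplete data or rejected for size.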
Tableau supports only functions that return one of the four data types for the integration with R, i.e., a real number, a string, an integer, or a Boolean. This means there are several core capabilities of R that users will not be able to utilize directly in Tableau itself.
Data cannot be exported from Tableau into R directly to run a new model, other than through functions of the form mentioned above.
Visualizations created in R cannot be imported directly into Tableau. However, image files of R visualizations or URLs to R visualizations may be used in Tableau dashboards.
At the current time, Tableau Online and Tableau Public do not support R, so R's statistical capabilities are not available through these services.
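The first Tableau limitation above, that only a real number, a string, an integer, or a Boolean can be returned, can be illustrated with a small type check. This is a simplified Python sketch, not Tableau's own logic: the function name `is_tableau_returnable` is hypothetical, Python's `float` stands in for "real number" as an assumption, and only a single value is inspected rather than a vector of results.

```python
def is_tableau_returnable(value: object) -> bool:
    """Check whether a single value has one of the four types named above.

    Hypothetical helper for illustration; simplification: it looks at one
    scalar value, not at a vector of results.
    """
    # bool is a subclass of int in Python, so int already covers it;
    # it is listed explicitly to mirror the four types named in the text.
    return isinstance(value, (bool, int, float, str))


if __name__ == "__main__":
    # Scalars pass; richer structures such as lists or dicts
    # (standing in for model objects or tables) do not.
    candidates = [3.14, "cluster A", 42, True, [1, 2, 3], {"coef": 0.5}]
    for candidate in candidates:
        print(repr(candidate), is_tableau_returnable(candidate))
```

The rejected cases are the point of the sketch: anything richer than a plain value, such as a fitted model or a table of coefficients, has to be reduced to one of the four types before it can be used in Tableau.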