Minor Miner 49er ⛏️
Let us get ready to do some data mining! Rapidly!
This week was an introduction to data mining and RapidMiner (RapidMiner, n.d.), a data analytics software package, shown in Figure 1, that brings the power of data science to almost anyone. The “almost anyone” tag simply means that you must have a computing platform, some data to mine, and at least the faintest idea of what you are looking for (this last part is critical). If that is you, then you are all set!
Figure 1 – First Run of RapidMiner
My last post discussed my introduction to R, the programming language of data analytics. R is a great tool for those who can code, but coding ability alone is not enough: you also need to envision the output you want from the code before it can generate useful information.
With RapidMiner, even a first pass through the data can begin to generate useful insights in just a few clicks.
First-time users get a side-by-side tutorial, in the same window, that walks through a sample project using the ship’s registry data from the ill-fated Titanic. See the screenshots below. For example, in Figure 4, simply pulling the data into RapidMiner causes insights and advice to appear, such as a heads-up that some data may be biased, along with a warning to make sure those data points do not corrupt the expected output. Visual representations also make anomalies jump out quickly, such as why some ages are whole numbers and others are fractional, as we see in Figure 5, and the data can be quickly re-sorted as in a spreadsheet, as done in Figure 6.
Figure 2 – RapidMiner Vocabulary Orientation
Figure 3 – Scatter plot of Titanic Fares vs Ages
Figure 4 – Data Bias Warning
Figure 5 – Interesting Real Number Age
Figure 6 – Sorted by Age – Oldest Survived
Once the data has been pulled into RapidMiner, the powerful tools native to the platform are available with just a couple of clicks, such as the decision tree generator shown in Figure 7.
Figure 7 – Decision Tree Example
Figure 8 – Import Excel Data to the Local Repository
Figure 9 – Life Boats vs Passenger Class from Imported Data
In my previous post on R, I mentioned the graphical abilities of the programming language, specifically the geom_smooth() function (Wickham et al., n.d.), which takes data rendered in a scatter plot like that of Figure 3 and lets the programmer smooth it to reveal trends hidden in the data (Mailman, 2021). The default locally estimated scatterplot smoothing (LOESS) (EPA, n.d.) method of geom_smooth() requires some coding that, while short and sweet, is rather cryptic and hard to follow (though it could be harder still in base R (Zach, 2022)). Related filters, such as LOWESS or Savitzky-Golay (Mailman, 2021), and dedicated engineering calculation software such as MATLAB (MathWorks, 2006) do not make this any easier, because their functions also have to be coded and the syntax is easy to forget for anyone who does not use them often. RapidMiner, however, provides the same function through a simple drop-down menu selection, as demonstrated by Figure 3 (before) and Figure 10 (after).
Figure 10 – Scatter plot of Titanic Fares vs Ages with Loess Applied
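For the curious, here is a minimal sketch of the idea behind a LOESS-style smoother like the one described above, written in plain Python. It is not RapidMiner's or ggplot2's actual implementation: the function names (tricube, loess_smooth), the span value, and the toy numbers are my own assumptions for illustration, and it uses simple local straight-line fits rather than the full method.

```python
def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_smooth(xs, ys, span=0.75):
    """Smooth ys by fitting a weighted straight line around each x.

    span is the fraction of points used in each local fit
    (an illustrative parameter name, not taken from any tool).
    """
    n = len(xs)
    k = max(2, min(n, int(span * n + 0.5)))  # neighbourhood size
    smoothed = []
    for x0 in xs:
        # Bandwidth = distance to the k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h > 0:
            w = [tricube((x - x0) / h) for x in xs]
        else:
            # Degenerate case: all neighbours share the same x.
            w = [1.0 if x == x0 else 0.0 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Evaluate the local line at x0.
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


if __name__ == "__main__":
    # Made-up toy data: a straight trend with alternating +1/-1 noise.
    xs = list(range(20))
    ys = [x + (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]
    for x, y, s in zip(xs, ys, loess_smooth(xs, ys, span=0.5)):
        print(x, y, round(s, 3))
```

Each smoothed point comes from its own small weighted regression, where nearby points count the most. A larger span gives a smoother, flatter curve, and a smaller span follows the raw points more closely.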
These features are just the tip of the pick when it comes to the data mining RapidMiner makes possible without a graduate minor in statistics or a career as a full-time miner. In fact, I became a minor miner within 49 minutes of installing the software. Not bad! ⛏️⛏️
References
EPA. (n.d.). LOESS (or LOWESS). Retrieved from Environmental Protection Agency: https://www.epa.gov/sites/default/files/2016-07/documents/loess-lowess.pdf
Mailman, J. B. (2021, March 26). Data Smoothing in Data Science Visualization (The Goldilocks Trio). Retrieved from Towards Data Science: https://towardsdatascience.com/data-smoothing-for-data-science-visualization-the-goldilocks-trio-part-1-867765050615
MathWorks. (2006). Savitzky-Golay filter design. Retrieved from MathWorks: https://www.mathworks.com/help/signal/ref/sgolay.html
RapidMiner. (n.d.). RapidMiner. Retrieved from RapidMiner: https://rapidminer.com/
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., . . . Dewey. (n.d.). Smoothed conditional means. Retrieved from Tidyverse: https://ggplot2.tidyverse.org/reference/geom_smooth.html#ref-examples
Zach. (2022, May 17). How to Perform LOESS Regression in R (With Example). Retrieved from Statology: https://www.statology.org/loess-regression-in-r/