Minor Miner 49er ⛏️

Futuristic Graphics for decoration only.

Let us get ready to do some data mining! Rapidly!

Brunette woman looking at a holographic computer display.
Data mining is in progress!
Generated by DALL-E via Bing Chat by my commission.

This week was an introduction to data mining and RapidMiner (RapidMiner, n.d.), a data analytics software package, shown in Figure 1, that brings the power of data science to almost anyone. The “almost anyone” tag simply means that you need a computing platform, some data to mine, and at least the faintest idea of what you are looking for (this last part is critical). If that is you, then you are all set!

Screenshot of RapidMiner software

Figure 1 – First Run of RapidMiner

My last post discussed my introduction to R, a programming language popular in data analytics. While R is a great tool for those who can code, coding ability alone is not enough: you also need to envision the output you want before the code can generate useful information.

With RapidMiner, even a first pass through the data can begin to generate useful insights in just a few clicks.

First-time users get a side-by-side, same-screen tutorial that walks them through a sample project using passenger data from the ill-fated Titanic (see the screenshots below). Simply pulling the data into RapidMiner causes certain insights and advice to appear. In Figure 4, for example, RapidMiner gives a heads-up that some data may be biased and warns you to make sure these data points do not corrupt the expected output. Visual representations also let anomalies jump out quickly, such as why some ages are whole numbers and others are fractional, as we see in Figure 5. The data can also be quickly re-sorted, as in a spreadsheet, as done in Figure 6.

Screenshot of RapidMiner software

Figure 2 – RapidMiner Vocabulary Orientation

Screenshot of RapidMiner software

Figure 3 – Scatter plot of Titanic Fares vs Ages

Screenshot of RapidMiner software

Figure 4 – Data Bias Warning

Screenshot of RapidMiner software

Figure 5 – Interesting Real Number Age

Screenshot of RapidMiner software

Figure 6 – Sorted by Age – Oldest Survived
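For anyone who prefers to see the idea in code, the whole-number versus fractional age anomaly from Figure 5 and the re-sort from Figure 6 can be sketched in a few lines of Python. This is only an illustration: the ages below are made up rather than taken from the Titanic sample data, and the split_ages helper is a hypothetical name of my own, not anything RapidMiner exposes.

```python
# Hypothetical ages, made up for illustration; not taken from the Titanic sample data.
ages = [22.0, 0.75, 38.0, 34.5, 71.0, 4.0]


def split_ages(values):
    """Separate whole-number ages from fractional ones."""
    whole = [v for v in values if float(v).is_integer()]
    fractional = [v for v in values if not float(v).is_integer()]
    return whole, fractional


whole, fractional = split_ages(ages)
print("Whole-number ages:", whole)
print("Fractional ages:", fractional)

# Re-sort oldest first, much like clicking a column header in a spreadsheet.
print("Oldest first:", sorted(ages, reverse=True))
```

RapidMiner surfaces the same kind of oddity visually, without any of this typing.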

Once the data has been pulled into RapidMiner, the platform's powerful native tools, such as the decision tree generator shown below in Figure 7, are available with just a couple of clicks.

Screenshot of RapidMiner software

Figure 7 – Decision Tree Example

Screenshot of RapidMiner software

Figure 8 – Import Excel Data to the Local Repository

Screenshot of RapidMiner software

Figure 9 – Life Boats vs Passenger Class from Imported Data

In my previous post on R, I mentioned the graphical abilities of the programming language, specifically the geom_smooth() function from ggplot2 (Wickham et al., n.d.), which takes data rendered in a scatter plot like that of Figure 3 and lets the programmer smooth it to find trends hidden in the data (Mailman, 2021). Locally estimated scatterplot smoothing (LOESS) (EPA, n.d.), which geom_smooth() uses by default for smaller data sets, requires some coding that, while short and sweet, is rather cryptic and hard to follow (though fitting LOESS directly in R could be harder (Zach, 2022)). Related filters such as LOWESS or Savitzky-Golay (Mailman, 2021), and even dedicated engineering calculation software such as MATLAB (MathWorks, 2006), are no easier to remember, because their functions also require code that is easy to forget if you do not use it often. RapidMiner, however, can provide the same smoothing with a simple drop-down menu selection, as demonstrated by Figure 3 (before) and Figure 10 (after).

Figure 10 – Scatter plot of Titanic Fares vs Ages with Loess Applied
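To give a feel for what a LOESS-style smoother does under the hood, here is a minimal sketch in plain Python. It rests on several assumptions: it fits a straight line (not the quadratic that R's loess() uses by default) around each point, weights neighbors with the tricube function, and skips the robustness iterations that full LOWESS adds. The loess_sketch function, its frac parameter, and the sample data are all hypothetical names and values of my own, so treat this as an illustration rather than what RapidMiner or geom_smooth() actually runs.

```python
import math


def loess_sketch(xs, ys, frac=0.5):
    """Smooth ys at each x with a tricube-weighted local linear fit.

    frac is the share of points used in each local fit (an assumed default).
    """
    n = len(xs)
    k = max(2, math.ceil(frac * n))  # neighbors considered per local fit
    smoothed = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h > 0:
            weights = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
        else:
            # All k nearest points share the same x; use only those points.
            weights = [1.0 if d == 0 else 0.0 for d in dists]
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Made-up zigzag data around a rising trend, for illustration only.
xs = [float(i) for i in range(10)]
ys = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]
smoothed = loess_sketch(xs, ys, frac=0.6)
for x, y, s in zip(xs, ys, smoothed):
    print(f"x={x:4.1f}  raw={y:5.2f}  smoothed={s:5.2f}")
```

A larger frac pulls in more neighbors and gives a smoother curve, while a smaller one follows the raw points more closely. That is the knob RapidMiner hides behind its drop-down menu.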

These features are just the tip of the pick when it comes to the data mining RapidMiner makes possible without a graduate minor in statistics or a full-time mining job. In fact, I became a minor miner within 49 minutes of installing the software. Not bad! ⛏️⛏️

Cyborg miner in a mine for decoration only.
Data mining doesn’t have to be this hard!
Generated by DALL-E via Bing Chat by my commission.

References

EPA. (n.d.). LOESS (or LOWESS). Retrieved from Environmental Protection Agency: https://www.epa.gov/sites/default/files/2016-07/documents/loess-lowess.pdf

Mailman, J. B. (2021, March 26). Data Smoothing in Data Science Visualization (The Goldilocks Trio). Retrieved from Towards Data Science: https://towardsdatascience.com/data-smoothing-for-data-science-visualization-the-goldilocks-trio-part-1-867765050615

MathWorks. (2006). Savitzky-Golay filter design. Retrieved from MathWorks: https://www.mathworks.com/help/signal/ref/sgolay.html

RapidMiner. (n.d.). RapidMiner. Retrieved from RapidMiner: https://rapidminer.com/

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., . . . Dewey. (n.d.). Smoothed conditional means. Retrieved from Tidyverse: https://ggplot2.tidyverse.org/reference/geom_smooth.html#ref-examples

Zach. (2022, May 17). How to Perform LOESS Regression in R (With Example). Retrieved from Statology: https://www.statology.org/loess-regression-in-r/
