Town planners will tell you traffic expands to fill the roads available. The establishment of big data systems has meant that there is now a great deal more data sloshing around and it’s exponentially more difficult to analyse these volumes, extract trends and information, or predict answers. Sufficiently accurate machine decisions can greatly reduce the need for manual interventions and can speed up throughput and increase profit.
The term machine learning was first coined, with admirable foresight, in 1959. It evolved from computational statistical mathematics and pattern recognition research. With regard to queries and expressions, it’s not SQL. There are two computer languages of choice; the statisticians tend to use ‘R’ whereas the computer professionals go for Python. As Python is an excellent general computing language that can be ‘glued’ into most systems and that we are already familiar with – combined with the fact that the libraries and tools for machine learning are usually written in it – we use Python. The interactive nature of Python tools and shells also makes it a natural choice. If you are using Apache Spark, then Scala is also an interactive option.
It’s probably some sort of an unofficial crime to write a technical comment on machine learning without including a single equation, but in order to facilitate you scrolling to the end that’s what I’ve done. Stephen Hawking was told that for every equation he wrote in A Brief History of Time, the readership would halve – so he went with just one, E=mc².
I’ll start with one of the most pressing questions on the internet today: can you tell the difference between a muffin and a Chihuahua?
Photo: Twitter/Karen Zack/@teenybiscuit
This question is just the sort of thing machine learning is good at. Google obviously knows how to do this as the above image is the result of my search in relation to this question.
Other (perhaps more appropriate) machine learning example questions could be:
- How does Google or Apple automatically categorise photos?
- How do credit card companies detect fraud?
- How does Apple validate your fingerprint?
- Given a few facts, can you estimate someone’s income?
- Given the number of rooms and location, can you estimate the value of a house?
- Is an email spam?
Perhaps instead of hard currency limits being set for a trader, a machine learning system could constantly analyse his or her performance and move the limits accordingly (ballsy, but possible). Or maybe, by token analysis of news feeds that mention a specific company and watching its historic stock price, future performance could be predicted? This would be really difficult to achieve to any degree of confidence, but could be worth a punt…
So, can I write a program that can correctly deduce whether an image is a Chihuahua or a muffin with a certainty that’s, for example, greater than 70% (a common minimum)? Answering this question will lead us through some of the major ideas and techniques in machine learning. But firstly, an overview of the process.
The machine learning process
The shape and volume of the data along with the type of answer is crucial and will lead you to a particular approach. Imagine all the data you need in a single massive Excel spreadsheet; how many columns would it have? How many rows? Do I want a Yes/No type of answer, a numerical value or a collection of categories with data in each? Estimating salaries or house prices is clearly a numerical solution, whereas aberrant behaviour or mortgage approval is a binary answer.
Experience shows that during a machine learning project, 50% to 80% of time will be used refining, filtering and cleaning the data so that intelligent deductions can be made from it. The degree of interactivity you can have with the data is also very important as you will be able to shape the data on-the-fly and achieve a feel for the complexity of the problem far more quickly – and ultimately, reduce actual development time. This is where the software tool selection is crucial.
The data will have to be sculpted into an optimal structure. Not only must you check for blank values, but what if some of the columns contain correlated data (the value in one column is dependent on another), or there are essential columns that have logically invalid values? Do you just delete the row or do you impute (substitute a value – the machine learning libraries can do this for you automatically based on, e.g. the mean of other rows’ values)? For binary answers, what is the balance between one value and the other? You don’t want 999,999 photos of Chihuahuas and only one of a muffin as you’ll never get the algorithm to work. Are all the columns necessary or can you delete some?
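As a minimal sketch of those checks, assuming pandas and Scikit-learn are installed, here is a small made-up table being inspected, trimmed and imputed. The column names and values are purely illustrative and not from a real project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: two features and a binary answer, with some blanks.
df = pd.DataFrame({
    "rooms": [3, 4, np.nan, 2, 5],
    "distance_km": [1.2, np.nan, 3.4, 0.8, 2.1],
    "approved": [1, 0, 1, 1, 0],
})

# Count the blank values in each column.
missing_per_column = df.isna().sum()

# Option 1: simply delete any row that has a blank value.
dropped = df.dropna()

# Option 2: impute, replacing each blank with the mean of its column.
features = ["rooms", "distance_km"]
imputer = SimpleImputer(strategy="mean")
imputed = df.copy()
imputed[features] = imputer.fit_transform(df[features])

# Check the balance between the two possible answers.
class_balance = df["approved"].value_counts(normalize=True)

print(missing_per_column, dropped.shape, imputed, class_balance, sep="\n\n")
```

Deleting rows is the simplest option but here it throws away two of the five rows, which is why imputing is often preferred on small datasets.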
A suitable algorithm will then be chosen according to your findings from your analysis and these have been well established over the past few decades. Each algorithm has a variety of parameters, speed and accuracy. Almost all the algorithms for the types of problems we need to solve are of the classification type (fixed ‘buckets’) or regression type (any values). Common algorithms have names such as naive Bayes, Random Forest and Logistic Regression.
Some algorithms (called ‘ensemble algorithms’) are actually a collection of algorithms inside a unifying façade and these are usually more accurate but more complex, with many more parameters.
The algorithm is applied to about 75% of the training data to generate a model. This model is then validated against the remainder of the data. The results are the deduced outputs expected and a confusion matrix. This indicates the performance of the algorithm (actual versus predicted, false positives/negatives versus true positives/negatives) and assuming the confidence is sufficient, it’s then applied to actual data to deduce real outputs. The training data can then be thrown away as all we’re looking for are the correct parameters to the algorithm.
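The split-train-validate cycle can be sketched as follows. This assumes Scikit-learn, uses synthetic data from make_classification as a stand-in for real labelled data, and picks Logistic Regression arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labelled data: 400 rows, 5 features, binary answer.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Hold back 25% of the rows for validation; train on the other 75%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
y_predicted = model.predict(X_test)

# Rows are the actual classes, columns the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
matrix = confusion_matrix(y_test, y_predicted)
print(matrix)
```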
The system may have too low a confidence in the answers, so alternative algorithms or refinement of the parameters will be required and the whole process can be repeated many times. As some of the algorithms have many parameters, a common technique is to pass a collection of values for each parameter and have every combination tried against the algorithm to find the best one (called a grid search).
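A grid search might look something like this in Scikit-learn. The data is synthetic and the candidate values in param_grid are hypothetical, chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical candidate values; every combination (3 x 3 = 9) is tried.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

# The winning combination and its average cross-validated score.
print(search.best_params_, search.best_score_)
```

The cost grows quickly: each extra parameter multiplies the number of combinations, and each combination is trained once per fold.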
Certain algorithms have a tendency to work really well with the training data but then fail to achieve the same accuracy with live data. This is called ‘over-fitting’ and is explained in the next section. Another technique to avoid over-fitting is to slice your training data into small chunks (‘folds’), hold each chunk back in turn for validation while training on the rest, and average the results to get a better estimate of how the parameters will perform (this is called ‘cross-validation’).
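To illustrate, here is a sketch (assuming Scikit-learn and synthetic data) that compares a decision tree's accuracy on the data it was trained on with its cross-validated accuracy. An unrestricted tree is used deliberately because it memorises its training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Each of the 5 folds is held out once while the tree trains on the other 4.
scores = cross_val_score(tree, X, y, cv=5)

# Scoring against the same rows used for training flatters the model.
train_accuracy = tree.fit(X, y).score(X, y)

print("accuracy on training data:", train_accuracy)
print("cross-validated accuracy:", scores.mean())
```

The gap between the two figures is a rough measure of how much the model has over-fitted.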
Once a confident output is achieved, the steps taken to clean the data and the algorithm parameters are then passed to the developers to implement in code – potentially employing a clustered solution to achieve the specified throughput.
That’s a brief overview, now here’s a bit of theory.
The Tao of machine learning
It’s all about finding a formula that will plot a line on a chart that separates or categorises values. In data science parlance, ‘features’ are columns that contribute to the output. So imagine you have just two features and some combinations give rise to answer ‘blue star’ and the rest ‘yellow circle’. An initial plot could look like this:
What you want is a formula that, as inclusively as possible, describes the categories. A straight line approximation could be:
Illustration 2: with a straight-line categorisation
But what you really want is something more like this:
Illustration 3: perfect categorisation
This is what the algorithms try to achieve; the closest fit to your data, so that when a new data point arrives, it can be immediately put into the correct category. In the case of illustration 2, a straight line is a simple formula and categorisation is simple, but not perfect – reducing the confidence. In the case of illustration 3, the categorisation is perfect but requires a far more complex equation to describe it, with many higher order terms (e.g. x⁴, x⁵ with suitable divisors). Over-fitting was mentioned earlier and that’s where the formula very accurately categorises the training data with a very high confidence, but that turns out to not match the real data very well – essentially, the algorithm has categorised almost every data point yet it only describes the training data. When this happens, you can specify additional algorithm parameters to generalise it.
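The effect of those higher order terms is easiest to see with curve fitting rather than categorisation, so the following NumPy sketch swaps in that simpler setting. The data is made up (a straight-line trend plus noise) and the degrees 1 and 9 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a straight-line trend plus random noise.
x_train = np.linspace(-1, 1, 12)
y_train = 2 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_new = np.linspace(-0.95, 0.95, 50)
y_new = 2 * x_new + rng.normal(scale=0.3, size=x_new.size)

def mean_squared_error(coefficients, x, y):
    """Average squared gap between the fitted curve and the points."""
    return float(np.mean((np.polyval(coefficients, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # straight line
complex_ = np.polyfit(x_train, y_train, deg=9)  # many higher order terms

for name, coefficients in [("degree 1", simple), ("degree 9", complex_)]:
    print(name,
          "training error:", mean_squared_error(coefficients, x_train, y_train),
          "new-data error:", mean_squared_error(coefficients, x_new, y_new))
```

The degree 9 curve always hugs the training points at least as closely as the straight line; comparing the two new-data errors shows whether that extra closeness actually generalises.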
In reality, you usually have many features, not just two, so imagine the above charts as ‘n-dimensional’.
Machine learning categories
The two main categories of machine learning are ‘Supervised’ and ‘Unsupervised’ with the latter being rarely used within fintech.
- Supervised learning: is where you know the inputs and outputs for a volume of data and you ‘train’ the system so that it can learn the rules to give answers when supplied with real data. The vast majority of machine learning, and everything we’ll ever do, is in this category.
- Unsupervised learning: is where you want the system to find patterns in the data or you want it to group the data into clusters. Given a recording of a conversation between several people, an unsupervised learning system could analyse the speech patterns and frequency of each person into a ‘cluster’ and then transcribe that to text. It didn’t need to know what each person’s voice sounded like to begin with, it just matched patterns. This is where data-mining differs from machine learning – when you don’t quite know what you’re looking for and trends and clusters of similar things may be enough.
As mentioned earlier, there are several factors that dictate the algorithm and approach. In our experience with fintech and as we’re only concerned with supervised learning, it usually boils down to one of these families of algorithm:
- Classification: is where the required result falls into one of two or any one of several discrete ‘buckets’, e.g. Logical (true/false).
- Regression: is where the required result can be any numeric value, e.g. a financial cost/value.
This is where a great deal of time is taken. From this point onward, all code references and data formats will relate to Python. In Python, the package of choice and de-facto standard for machine learning (probably on any platform/language) is called ‘Scikit-learn’.
The tool of choice is ‘Jupyter Notebook’ (It’s ‘py’ as it’s Python – I know, it hurts. Sometimes I think people name their programs like this just to confuse Microsoft Word). This used to be called IPython Notebook and is an excellent tool for analysing data – it is to data science what Microsoft Excel is to accountants. It works with over 40 languages but by default, it’s Python (with R a close second). Essentially, the product is a single column spreadsheet where each cell is either Markdown text or a data command. It keeps each step of how you transformed the data together with your text explaining your thinking and these can be passed around and the transformations copied out and placed directly into the code. There are over 100,000 Jupyter notebooks for anyone to experiment with on GitHub alone.
Working with images – eigenfaces and principal component analysis
Cleaning images is obviously different to cleaning text. Most images are far too complex for algorithms to work with within a useful timescale. Too many pixels (‘dimensionality’), in colour, etc. These sorts of things need to be unified to best represent the unique parameters in a photo. We need to generate a Principal Component Analysis (PCA) image for each real image – called an eigenpicture. When applied to portrait photos, the effect is ghostly and these are called eigenfaces.
Illustration 4: eigenface example. Photo credit: Scikit-learn.
This simplification is possible as most people’s faces have the same features in roughly the same position and size. It is the relationship between these dark and light areas that describes a person’s face. For example, the eye region is darker than the upper-cheeks and the nose bridge region is brighter than the eyes – facial hair doesn’t matter. Eigenfaces can be thought of as ‘recipes’ for the human face. Principal Component Analysis (PCA) isn’t just used for human faces though, it’s also used for handwriting recognition, voice recognition, even lip reading. In the City of London, there’s an operational system that tracks the movement of people via CCTV cameras using this approach in near real-time.
From a non-mathematical point of view, PCA removes commonality from a series of observations to leave a set of key differences that can be worked with.
In Python, the code is as simple as follows:
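    from sklearn.decomposition import PCA

    n_components = 200
    # RandomizedPCA has been removed from Scikit-learn; PCA with svd_solver='randomized' replaces it
    pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)
    eigenfaces = pca.components_.reshape((n_components, h, w))
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)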
Where X_train is the data to be trained and h & w are the image dimensions. The ‘_pca’ variables contain the ‘unified’ training and test data.
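For a version that runs end to end, here is a sketch on Scikit-learn's bundled handwritten-digits dataset rather than on faces. That swap is an assumption made purely so the example needs no downloads; the choice of 16 components and the name eigendigits are my own.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

digits = load_digits()            # 1,797 images of 8 x 8 pixels
h, w = digits.images.shape[1:]
X_train, X_test = train_test_split(digits.data, test_size=0.25, random_state=0)

n_components = 16
pca = PCA(n_components=n_components, svd_solver="randomized",
          whiten=True, random_state=0).fit(X_train)

# Each component can be viewed as an image: the digit equivalent of an eigenface.
eigendigits = pca.components_.reshape((n_components, h, w))

# Every 64-pixel image is now described by just 16 numbers.
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

print(eigendigits.shape, X_train_pca.shape,
      pca.explained_variance_ratio_.sum())
```

The last printed figure is the share of the original variation that the 16 components retain, which is a handy guide when choosing n_components.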
Websites exist specifically for machine learning that contain vast repositories of textual datasets and categorised images that algorithms can be refined against. One of these essential sites is the Labeled Faces in the Wild site. This contains over 13,000 images of famous faces scraped off the internet. Each face is named and there are up to 70 images for each person. When it comes to face detection, there’s a standard algorithm used, called Viola-Jones, which was proposed in 2001 by Paul Viola and Michael Jones. It is around 80% accurate in the Python implementation I used (it doesn’t work so well on rotated or tilted faces and an improved version called the KLT – from ‘Kanade, Lucas and Tomasi’ – algorithm exists to combat this at the expense of speed).
Another essential site is the UCI Machine Learning Repository. It has loads of datasets, from iris patterns to wine-quality, via poker hands, car evaluations and more health and disease collections than you can shake a stethoscope at.
So, in order for me to be able to train my system to classify muffins or Chihuahuas, I need a large number of images (roughly 1,000) and it would help if the dogs’ heads were in the same position (the muffins can stare in any direction they feel like).
Working with text
In Python, the package for scientific computing is NumPy and the package for data analysis is ‘pandas’. It’s difficult to overestimate how brilliant and useful pandas is. Within it, the core object for manipulating textual data is the DataFrame. Think of it as a SQL Table with rows, columns, indexes, labels, etc. In Jupyter Notebook, once you’ve loaded the pandas package, the data cleaning stage is all about operating on DataFrames.
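To give a flavour of the SQL-table comparison, here is a small sketch with made-up rows. The column names and the rule that a negative age is invalid are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows, as they might arrive from a CSV file.
df = pd.DataFrame({
    "customer": ["alice", "bob", "carol", "dave"],
    "region": ["north", "south", "north", "south"],
    "age": [34, -1, 51, 28],          # -1 is logically invalid
    "income": [42000, 38000, 47000, 51000],
})

# Filter out invalid rows, much like a SQL WHERE clause.
clean = df[df["age"] >= 0]

# Keep only the columns of interest, like a SQL SELECT.
clean = clean[["region", "age", "income"]]

# Aggregate, like a SQL GROUP BY.
mean_income = clean.groupby("region")["income"].mean()

print(clean)
print(mean_income)
```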
Apache Spark also has the same concept, also called DataFrame, but a quite different implementation and focus. The Spark version is designed to represent a table of data even though it’s distributed across any number of nodes and doesn’t have the table manipulation features of pandas. Both DataFrames support the ingestion of data in a variety of standard text formats.
When it comes to textual analysis, there can be several pre-processing stages before the machine learning is started, e.g. word categorisation or written language detection.
The role of the algorithm is to build the model that best describes the data. Most machine learning systems feature 50-100 different algorithms. There is no one ‘right’ algorithm to do a particular task and many Computer Scientists will have preferences and would disagree amongst themselves about a particular choice. The algorithm has a method (often called ‘fit()’ or ‘train()’) that generates the parameters that match the data the closest. It’s those parameters that form the model. Real data is then passed into the algorithm with the model and the ‘predict()’ method is called and this generates the desired output.
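In Scikit-learn terms, that fit-then-predict pattern looks roughly like this. The data is synthetic and Logistic Regression is an arbitrary pick; the last 50 rows simply stand in for 'real' data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_known, y_known = X[:150], y[:150]   # data with known answers
X_real = X[150:]                      # stands in for 'real' data

algorithm = LogisticRegression()
model = algorithm.fit(X_known, y_known)   # fit() learns the parameters

# The learned parameters are what make up the model.
print(model.coef_, model.intercept_)

# predict() applies those parameters to data the model has not seen.
predictions = model.predict(X_real)
print(predictions[:10])
```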
We’re really only interested in regression or classification problems so that narrows down the choice of algorithms we can use, although some will work with both types. Common examples are Logistic Regression, Decision Trees and Support Vector Machine (SVM). Which algorithm you choose will depend on the volume of test data, the number of features and the result type. They each have their strengths and weaknesses:
- Decision Trees: are simple to understand and tools can show a visualised output. They handle both regression and classification. They have a tendency to overfit and work better with balanced data as they can generate biased outputs. What’s better than a single Decision Tree? A forest of them. Random Forest is an ensemble algorithm that works on sub-samples of its training data and then averages out the answers to improve accuracy.
- Logistic Regression: which despite its name is a classification algorithm (often positive/negative). It’s from a family called Generalized Linear Models (GLM). They’re ‘linear’ as the target value is expected to be some combination of linear values. Certain GLMs are useful where you have a lot of features and relatively little data. You’d probably want to use the Cross Validation versions if you can (they have ‘CV’ on the ends of their names, e.g. LogisticRegressionCV). These split the data into k-folds and train against these shards to get a better match.
- Support Vector Machines: are a good first choice for either regression or classification, thanks to their versatility. They perform very well with large numbers of dimensions both with text and images. They are memory efficient but you might still want to allocate 1GB to them. They have a plugin mechanism called kernel functions to specify the approach.
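Because Scikit-learn gives every algorithm the same interface, trying several against the same data takes only a few lines. This sketch uses synthetic data and default-ish settings, so the scores it prints say nothing about which algorithm would suit your own problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(kernel="rbf"),
}

# Average accuracy over 5 cross-validation folds for each candidate.
mean_accuracy = {
    name: cross_val_score(estimator, X, y, cv=5).mean()
    for name, estimator in candidates.items()
}

for name, score in mean_accuracy.items():
    print(f"{name}: {score:.2f}")
```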
So, to classify my images in Python:
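    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # param_grid is the dictionary of parameter arrays prepared earlier
    classifier = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
    classifier = classifier.fit(X_train_pca, y_train)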
I create an SVC (Support Vector Classifier), passing in a kernel function called ‘rbf’ (which uses exponential terms to find a match and is the default if you don’t mention it). I specified a class_weight of ‘balanced’ so that each class is weighted in inverse proportion to how often it appears; this stops the result curve being biased in a particular direction if the ratio of Chihuahuas to muffins is far from 50/50. I previously made a dictionary of parameter arrays so that I could get the package to choose the best results. The GridSearchCV does a lot of work here. It runs the SVC using every combination of the parameter arrays I gave it. The eventual parameters chosen are those with the highest scores. The data is cross validated (hence the ‘CV’ at the end), so it’s sliced into folds (5 by default in recent Scikit-learn releases, 3 in older ones); each fold is held back in turn while the SVC algorithm is trained on the rest, and then the results are averaged out.
With the model generated by the fit() method, I then predict the answers for the test data:
Y_predicted = classifier.predict(X_test_pca)
Y_predicted now contains my answer and the quality of my results can be viewed in the ‘confusion matrix’, which counts how often the algorithm’s predictions matched the actual answers for each category. ‘Y’s and ‘X’s appear a lot as these things boil down to y=f(x) type functions with x being the input and y the output.
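To show how the confusion matrix is read, here is a sketch with ten hand-written labels. They are hypothetical and not my actual results; 1 stands for Chihuahua and 0 for muffin.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical labels: 1 = Chihuahua, 0 = muffin.
y_test      = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
Y_predicted = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# Rows are the actual classes, columns the predicted classes.
matrix = confusion_matrix(y_test, Y_predicted)
tn, fp, fn, tp = matrix.ravel()

# Recall: of the real Chihuahuas, how many were found?
recall = tp / (tp + fn)
# Precision: of the predicted Chihuahuas, how many were real?
precision = tp / (tp + fp)

print(matrix)
print(classification_report(y_test, Y_predicted,
                            target_names=["muffin", "Chihuahua"]))
```

Here four of the six real Chihuahuas were found, giving a recall of about 0.67, and four of the five Chihuahua predictions were right, giving a precision of 0.8.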
With regards to the volume of data used for training; the more, the merrier. The algorithms will be more accurate if more data is used for training although by changing the algorithm, you can mitigate having smaller volumes.
Spark has a machine learning module called MLlib. It comes in two flavours, one based on Resilient Distributed Datasets (RDDs) and the other on Spark DataFrames; the former is in maintenance mode, so only use DataFrames. MLlib contains a standard collection of algorithms and in the latest Spark release (v2.0), Python has caught up with Java/Scala with regards to implementation and support. The advantage that Spark gives for machine learning is that you can use all the other parts of Spark: streaming/micro-batching, distributed file system, great file ingestion and scalability, as the training and predicting can be achieved from its distributed grid. However, if you don’t want to use MLlib, then you can use Scikit-learn over the top, via the Spark-sklearn package, and reap the benefits of a very well-known library with all of Spark’s benefits. Spark-sklearn is backed by Databricks – the team that originally wrote Spark – and it offers the added benefit of not requiring any code changes to the Scikit code to make it run distributed on Spark.
Although Scikit-learn is the Python standard, an excellent alternative is available from H2O. It has written its highly optimised algorithms in a Java core and wrapped them in a number of languages. They clearly know what they’re doing so if you’re not on a Python project, then here’s a great alternative. If you’re running Apache Spark, they also have a product called Sparkling Water (Spark on H2O, that’s a data science joke, there) that implements all their APIs on Spark.
According to the classification report I receive as a result, I have a ‘recall’ (or true positive rate) value of 0.71 at deciding whether an image is a Chihuahua or a muffin, but I know I could get a higher score. I need more images but as it was such a grind getting a hundred or so, that’s not going to happen. I could also investigate additional parameters for the GridSearchCV method or give it wider ranges to play with if I had more time.