How an uninformed market gambled on data science to help combat the rise of tech-enabled disruptors. Why it failed, and how it’s finally paying off.

Image by mohamed Hassan from Pixabay

Disruption and the rise of data science

Financial crises act as accelerators of both creative destruction and technological accumulation. In the years following the 2008 financial crisis, widespread tech-enabled disruption produced a metamorphosis in the established order of America’s mega-corporations. Tech companies’ unrivalled ability to scale cheaply allowed them to de-perch established dinosaur competitors and define the zeitgeist for a new generation; think Uber, Airbnb, WhatsApp, Dropbox, Instagram, etc.

Incumbents are rightly terrified of disruptors because — typically, though certainly not always — their large size, established business model, low risk-appetite, and disproportionate focus on their most profitable customer segment…

Image by Pexels from Pixabay

What are the longest words spelled out from the first letters of names in football tables?

Given infinite time, a monkey typing at random will almost surely write out the full works of, say, Shakespeare.

In our problem setup, we don’t have infinite time, nor infinite monkeys, but we do have football tables, and plenty of them.

If that is still unclear, here is a more long-winded version of the question: what is the longest English word spelled by the first letters of consecutive team names in a league table? Here’s an example:
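To make the question concrete, here is a minimal sketch of the search in Python. It is not the method used in the article: the function name, the tiny word list and the table ordering are all made up for illustration, and a real run would need a proper English dictionary and real league tables.

```python
def longest_initials_word(team_names, words):
    """Return the longest word spelled by the initials of consecutive teams.

    team_names: team names in table order (top to bottom).
    words: any iterable of candidate English words (hypothetical word list).
    Returns "" when no word is found.
    """
    # Build one string from the first letter of every team, in table order.
    initials = "".join(
        name.strip()[0].lower() for name in team_names if name.strip()
    )
    vocabulary = {w.lower() for w in words}

    best = ""
    for start in range(len(initials)):
        # Only look at substrings strictly longer than the best found so far.
        for end in range(start + len(best) + 1, len(initials) + 1):
            candidate = initials[start:end]
            if candidate in vocabulary:
                best = candidate
    return best


# Hypothetical table ordering, purely for illustration.
table = ["Chelsea", "Arsenal", "Tottenham", "Southampton", "Everton"]
toy_words = {"at", "cat", "cats", "set"}
print(longest_initials_word(table, toy_words))
```

The initials of this toy table read "catse", so the longest match against the toy word list is "cats". The brute-force scan is quadratic in the number of teams, which is harmless for tables of twenty or so rows.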

Image by Maria_Domnina from Pixabay

At-home projects should be free. Here’s how.

The ROI for a hobbyist is rarely measured in dollar bills. Instead, people start hobby projects because they’re interested; maybe to learn, maybe for fun, maybe to build a PoC for an idea which could eventually yield riches.

Consequently, people are generally unwilling to pay for services to get their hobby projects up and running. At least, I sure am. Recently, I built a web scraper with GitHub Actions that would run every hour, dropping the data in a Google Sheets workbook, and sending me a Telegram message summarising the scrape job. All free.

With all that being said, here’s…

Image by TheAndrasBarta, license.

In 7 simple steps

All innovative technologies will eventually pass over a ‘peak of inflated expectations’, a phenomenon especially true when said technology is backed by heavy marketing war-chests, which serve chiefly to fuel hype and exacerbate public expectations.

The landscape of Machine Learning is slowly maturing beyond its ‘I’m a hammer and everything is a nail’ framing, born out of that explosion of marketing fervor over the past decade or so.

This is also true for Graph Databases; once lauded as the hot new technology that every business simply must have for Christmas, they’ve reached that happy place where they’re part and parcel…

An apt summary of public discourse about A.I. — by Tabor (license).

The public debate around A.I. is consequential for funding, research, regulation, and the extent of its malign misuse. Our discourse is failing because we collectively flaunt several definitions of the term.

The shiny hype train

Ever since the overworked and cringeworthy remark that data science is the sexiest job of the 21st century, and the resulting hype train that catapulted the modern conception of Machine Learning (Breiman’s conception) from the academic fringes to the dizzying lights of the mainstream labour market, the frantic bum’s-rush to repackage Logistic Regressions as bleeding-edge A.I. has been under way in earnest. I’m looking at you, IBM Watson.

Most found-in-the-wild, lay sentiments towards A.I. seesaw between extremes of the utopian and the dystopian. The utopian vision is marked by unrealistic expectations for the short-term potential of what can be…

Data science teams tend to pull in two competing directions. On one side, there are the data engineers, who value highly reliable, robust code which carries low technical debt. On the other, there are the data scientists, who value the rapid prototyping of ideas and algorithms in Proof-of-Concept-like settings.

While more mature data science functions enjoy a fruitful working partnership between the two sides, have sophisticated CI/CD pipelines in place, and have a well-defined segregation of responsibilities, early-stage teams are often dominated by a high ratio of inexperienced data scientists. …

Disclaimer: not all data scientists do, or even should have to, write production-grade code. Whether they should is ultimately down to context. But if they could, it would make the field a much better place.

Common knowledge would have you think a data scientist spends the majority of their time modelling and evaluating those models. This is a falsehood. For many data scientists, the majority of their time is spent developing data pipelines which act as a requisite precursor to machine learning. Such pipelines do not come out of thin air, and failing the use of some third party…

Linear regression. It’s the first type of regression analysis ever to be studied intensely, the foundation of any supervised learning course, the cornerstone of… you get the picture. Well, it sucks.

In real-world settings, Linear Regression (GLS) underperforms for multiple reasons:

  • It is sensitive to outliers and poor-quality data: real-world data is often contaminated with both. If outliers make up more than a small fraction of the data points, the fitted line will be skewed away from the true underlying relationship.
  • It requires all variables to be multivariate…
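The outlier point can be illustrated with a small sketch. It assumes a plain least-squares fit of a single variable, written out by hand so that no library is needed; the data and the helper name `ols_fit` are invented for the example.

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for a single explanatory variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on the line y = 2x + 1.
xs = list(range(10))
clean_ys = [2 * x + 1 for x in xs]

# Same data, with one corrupted reading at the last point.
contaminated_ys = clean_ys[:]
contaminated_ys[-1] = 100

clean_slope, _ = ols_fit(xs, clean_ys)
skewed_slope, _ = ols_fit(xs, contaminated_ys)
print(clean_slope, skewed_slope)
```

On the clean data the fitted slope is 2. With a single corrupted point out of ten, it jumps to about 6.4, which is the kind of skew the first bullet describes.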

The main benefit of using the median as opposed to other averaging aggregates like the mean is that it is less skewed by extremely large or small values. It offers a better approximation of what is called a typical value of the data.

To compute the median of a dataset exactly, the straightforward approach is to read all the elements into memory, sort them, and then find the middle value; a process which is inefficient and often impractical if the dataset is very large.
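For reference, the read-everything-and-sort procedure looks roughly like the sketch below. The function name and the sample values are illustrative only; Python’s standard library offers the same calculation as `statistics.median`.

```python
def exact_median(values):
    """Exact median: hold every element in memory, sort, take the middle."""
    ordered = sorted(values)  # O(n log n) time, O(n) extra memory
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Made-up values with one extreme entry.
sample = [21, 23, 25, 27, 1000]
print(exact_median(sample))        # 25
print(sum(sample) / len(sample))   # 219.2
```

The single extreme value drags the mean to 219.2 while the median stays at 25. The cost is that every element has to be held and sorted, which is the bottleneck for very large datasets.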

Ideally, what we want is to estimate a value for the median that uses as few comparisons and arithmetic…

One of the most crucial pieces of any data science puzzle is perhaps also the least glamorous: feature engineering. It can be protracted and frustrating, but if it’s not done right, it can spell disaster for any modelling or analysis that follows. In this post, I hope to shed some light on a delightful inference technique that can be used.

“Applied machine learning” is basically feature engineering — Andrew Ng

It’s not an uncommon tactic to ‘bucket’ together data points within a feature. Depending on how you want a feature bucketed, it can range from the simple to the laborious…
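At the simple end of that range, bucketing a numeric feature by fixed edges takes only a few lines. The sketch below is an illustration, not the technique from the post: the age edges, the labels and the helper name `bucket_label` are all assumptions.

```python
from bisect import bisect_right


def bucket_label(value, edges, labels):
    """Map a numeric value to the label of the bucket it falls into.

    edges must be sorted ascending; len(labels) must be len(edges) + 1.
    A value equal to an edge goes into the bucket that starts at that edge.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Hypothetical age buckets, chosen only for illustration.
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["under 18", "18-34", "35-64", "65+"]

ages = [17, 18, 34, 35, 80]
print([bucket_label(a, AGE_EDGES, AGE_LABELS) for a in ages])
```

Using `bisect_right` keeps the lookup logarithmic in the number of edges and makes the boundary rule explicit: 18 lands in "18-34" and 35 lands in "35-64".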

Andy Greatorex

London based data scientist @Revolut. Formerly in NYC @Barclays. Building stuff for the fun of it.
