Open Pretrained Transformer

Meta AI’s release of Open Pretrained Transformer (OPT-175B), which is on par with OpenAI’s GPT-3 at 175 billion parameters/weights, emphasizes responsible compute and claims one-seventh the computational cost in terms of carbon footprint. Pretrained model weights are free to download (link in the comments). This is good news for open collaboration and better news for the environment.

Source

When reverse causation is more profitable

You may have heard of ESG (Environmental, Social, and Governance) investing. It’s also called “socially responsible investing” when ethics is added to the picture. Public companies are assigned an ESG score, which is meant to quantify social impact. What social impact though? You would probably expect ESG ratings to quantify the societal impact of (not on) a company, right? Well, you’ll be disappointed. “Socially responsible investing” is a misnomer when associated with ESG ratings, at least those reported by MSCI, a leading global provider of ESG ratings.

MSCI basically quantifies the impact of environmental, social, and governance risks on a company’s operations (not the other way around!). In other words, if we rely on ESG ratings while making investment decisions, we may not be doing any social good. We are essentially ensuring that our investments are protected from the environmental, social, and other risks such as climate change. After all, why would we care about the carbon footprint of our investments on the environment as long as profits are good?

MSCI’s plot offers some takeaways on how to generate data and model it. Apparently, measuring reverse causation and packaging it so that cause and effect appear to be in the right place can be quite profitable. To be fair, MSCI is explicit about its data generation and modeling process residing on the dark side.

Source

On the proof-of-concept to production gap

A valuable insight on the proof-of-concept to production gap in computer vision that underlines again the importance of context:

“It turns out,” Ng said, “that when we collect data from Stanford Hospital, then we train and test on data from the same hospital, indeed, we can publish papers showing [the algorithms] are comparable to human radiologists in spotting certain conditions.”

But, he said, “It turns out [that when] you take that same model, that same AI system, to an older hospital down the street, with an older machine, and the technician uses a slightly different imaging protocol, that data drifts to cause the performance of AI system to degrade significantly. In contrast, any human radiologist can walk down the street to the older hospital and do just fine.”

Source

99/1 is the new 80/20

An obvious but often neglected fact is that accuracy is overemphasized as a performance metric. In a two-class problem where 99% of the cases belong to class 0 (not a spam email), achieving an accuracy of 99% is as easy as classifying all emails as safe. Sensitivity, specificity, and other metrics exist for a reason.
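
A minimal sketch of the point above, using made-up counts (1,000 emails, 10 of them spam) and a hypothetical “classifier” that simply labels every email as safe:

```python
# Hypothetical data: 1,000 emails, 10 of which are spam (class 1).
labels = [1] * 10 + [0] * 990
# A "classifier" that marks every email as safe (class 0).
predictions = [0] * len(labels)

pairs = list(zip(labels, predictions))
tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # spam caught
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # safe emails passed
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # safe emails flagged
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # spam missed

accuracy = (tp + tn) / len(labels)
sensitivity = tp / (tp + fn)  # share of spam that was caught
specificity = tn / (tn + fp)  # share of safe emails that were passed

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Accuracy looks excellent while sensitivity shows that not a single spam email was caught.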

The story of Waymo, Google’s self-driving car, illustrates the value of solving the remaining 1% of the problem, where conventional machine learning gets stuck due to the limitations of training data. If 1% of the error turns into a make-or-break point, one needs to get creative. On a long tail that extends to infinity, walking faster or running probably does not help as much as a leap of imagination.

I must note that it’s not fair to expect an autonomous car to be “error-free” given that we do not expect human drivers to be error-free on driver’s license exams and road tests. The two will just make different errors.

When to normalize / apply weights

To me, this is interesting not because of the lack of transparency in the methodology but because of the potential reason for the rankings to be wrong.

I want to believe that this is a mistake, not fraud, but really? Applying the weights before normalizing the scores? And the Bloomberg Businessweek spokesperson says “the magazine’s methodology was vetted by multiple data scientists.”

I have created a quick scenario as a reminder to my former (and current) students (posted in the comments, as LinkedIn doesn’t allow it here). In the example, the scores are standardized across the five items (which are randomly generated and assigned weights). In the Businessweek rankings, standardization is supposed to be across institutions so that the weights proportionately affect each institution’s score on the corresponding item. Nevertheless, the source of the error is the same. If the weights are applied before normalizing the data, the scores are adjusted by the weights disproportionately. The rankings change accordingly.
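
The order-of-operations problem can be sketched as follows. This is not the scenario posted in the comments, nor Businessweek’s actual method; the institutions, scores, and weights are made up, and z-score standardization across institutions is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical raw scores: one row per institution, three items on different scales.
raw = {
    "A": [82.0, 640.0, 3.1],
    "B": [75.0, 720.0, 3.9],
    "C": [91.0, 580.0, 2.4],
    "D": [68.0, 700.0, 3.5],
}
weights = [0.6, 0.3, 0.1]  # assumed item weights

def standardize_columns(rows):
    """Z-score each item (column) across institutions (rows)."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

names = list(raw)
rows = [raw[n] for n in names]

# Intended order: standardize first, then apply the weights.
z = standardize_columns(rows)
normalize_then_weight = [sum(w * v for w, v in zip(weights, r)) for r in z]

# Flawed order: apply the weights first, then standardize.
weighted = [[w * v for w, v in zip(weights, r)] for r in rows]
weight_then_normalize = [sum(r) for r in standardize_columns(weighted)]

# Unweighted sum of z-scores, for comparison.
unweighted = [sum(r) for r in z]

for n, a, b, u in zip(names, normalize_then_weight, weight_then_normalize, unweighted):
    print(f"{n}: normalize->weight={a:+.3f}  weight->normalize={b:+.3f}  unweighted={u:+.3f}")
```

Under these assumptions, weighting first is undone by the standardization step: the flawed composite matches the unweighted sum, so the intended weights never reach the final score.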

Algorithmic fashioning

For years, Zara has been my go-to case to discuss data centricity in fashion retail. Zara is a staple example of how a focus on data and analytics combined with the right, complementary business processes can create wonders even in a market with high degrees of demand uncertainty due to the hedonic nature of consumption.

Shein seems to be emerging as a contender, moving further into data-driven (not only data-informed) fast fashion. Its operation is also called real-time fashion rather than fast fashion. Shein doesn’t own any physical stores (none at all) and ships all of its products directly from China.

Bloomberg reports that “Shein has developed proprietary technology that harvests customers’ search data from the app and shares it with suppliers, to help guide decisions about design, capacity and production. It generates recommendations for raw materials and where to buy them, and gives suppliers access to a deep database of designs for inspiration.”

Shein reduces the design-to-customer turnaround to 10 days, a record compared to already-fast Zara’s two- to three-week lead time. It’s not a niche operation either, given the reports of $10 billion in annual sales and a potential $30 billion valuation.

I’ve found the whole story interesting. It all sounds impressive but also dangerous. The article already mentions some of the “accidents” its algorithm-driven fashion caused along with sustainability concerns.

“But it would be naïve to predict that unpredictable events won’t happen in the future.”

“Zillow Quits Home-Flipping Business, Cites Inability to Forecast Prices,” WSJ reports.* I try to avoid passing along news stories but it’s not every day I receive a predictive analytics story as breaking news.

I wonder whether the reason is really “an inability to forecast the prices” or “relying too much on an ability to forecast the prices” for a “$20 billion a year” venture, as it was billed at its debut.

Zillow announced plans for this data-driven venture in 2018 by citing consumers who “expect magic to happen with a simple push of a button.” In a statement yesterday, Zillow seems to have realized magic is not happening: “But it would be naïve to predict that unpredictable events won’t happen in the future.”

Maybe it is never a good idea to develop a whole business model that grossly underestimates the changes in error (both reducible and irreducible) due to potential bifurcations in market forces.

Source

If tech is everything, then it is nothing

What do #Facebook, #Tesla, #DoorDash, #Nvidia, and #GM* have in common? They are all “tech” companies.

Alex Webb of Bloomberg offers a linguistic explanation for why technology ceased to be meaningful:

“English lacked an equivalent to the French technique and German Technik. The English word “technique” hadn’t caught up with the innovations of the Industrial Revolution, and it still applied solely to the way in which an artist or artisan performed a skill.”

He contrasts technique as in “artistic technique” in English with technique as in “Lufthansa Technik” in German and argues that technology emerged in the early 20th century for the lack of a better alternative.

Whether the reason is linguistic, sheer overhype, or semantic satiation, we may be better off dropping the “tech company” reference at this point unless it is elaborated further. For the companies that are more tech than your average tech, a good alternative may be “deep tech.”

Data-driven paralysis

Data-driven decision making can lead to paralysis. Last week, the FDA and CDC committees couldn’t make a decision about the booster shots because (complete) data was not available. Well, making decisions in the absence of complete data is a process of imagination and deep thinking, one that puts hypothesis development at the center and one in which humans continue to prevail over machines.

To avoid such a paralysis, more focus can be put on developing and rethinking hypotheses and their likelihoods. In emergent problems, an in-depth discussion on hypotheses and likelihoods is probably more helpful than an obsession with accessing complete data. Otherwise, by defining complete data as a prerequisite, as it would be in data-driven decision making, we will continue to be paralyzed looking into the future.

If we turn to data-informed decision making, however, hypotheses would take more control (not gut feeling but properly developed hypotheses*). We could then make decisions to be improved as more data becomes available without being paralyzed in the present. Rather than seeking the truth, we would seek probable truths (as in Bayesian thinking).
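
As a rough sketch of what “decisions to be improved as more data becomes available” can look like, here is a Beta-Binomial update of a belief about a success rate. The uniform prior and the batch counts are made up for illustration.

```python
def update(belief, successes, failures):
    """Update a Beta(a, b) belief with newly observed successes and failures."""
    a, b = belief
    return (a + successes, b + failures)

def posterior_mean(belief):
    """Expected success rate under a Beta(a, b) belief."""
    a, b = belief
    return a / (a + b)

belief = (1, 1)              # uniform prior: no strong initial view (an assumption)
batches = [(7, 3), (15, 5)]  # hypothetical (successes, failures) per data batch

history = [posterior_mean(belief)]
for successes, failures in batches:
    belief = update(belief, successes, failures)
    history.append(posterior_mean(belief))

print([round(m, 3) for m in history])
```

A decision can be made from the current belief at any point and revisited after the next batch, rather than postponed until the data is complete.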

While we may be able to remain strictly data-driven for some problems and decisions, we should be comfortable proceeding informed (not driven) by data for others.

* This post made me think of a book I enjoyed reading last Fall: Defense of the Scientific Hypothesis: From Reproducibility Crisis to Big Data

To log or how to log

I avoid posting technical notes here. This is an exception because I have an agenda.

Log transformation is widely used in modeling data for several reasons: making data “behave,” calculating elasticity, etc.

When an outcome variable naturally has zeros, however, log transformation is tricky. Many data modelers (including seasoned researchers) instinctively add a positive constant to each value in the outcome variable. One popular idea is to add 1 to the variable and transform raw zeros to log-transformed zeros. Another idea is to add a very small constant, especially when the scale of the outcome variable is small.

Well, the bad news is that these are arbitrary choices and the resulting estimations may be biased. To me, if an analysis is correlational (as most are), a small bias may not be a big concern. If it is causal and, for example, an estimated elasticity will be used to take action (with an intention to change an outcome), that’s trouble waiting to happen. This is a problem of data centricity.
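
To see how much the arbitrary constant can matter, here is a small simulation. It is only an illustration: the Poisson count outcome, the true elasticity of 1, and the two constants are assumptions, and the OLS slope is computed by hand.

```python
import math
import random

random.seed(42)

def poisson(mu):
    """Draw one Poisson(mu) count (Knuth's method; fine for small mu)."""
    limit, k, p = math.exp(-mu), 0, random.random()
    while p > limit:
        k += 1
        p *= random.random()
    return k

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

TRUE_ELASTICITY = 1.0  # assumed data-generating elasticity
x = [random.uniform(0.5, 5.0) for _ in range(5000)]
# Count outcome with mean exp(0.2) * x^elasticity; zeros occur naturally.
y = [poisson(math.exp(0.2) * xi ** TRUE_ELASTICITY) for xi in x]
log_x = [math.log(xi) for xi in x]

slopes = {}
for c in (1.0, 0.01):
    # Regress log(y + c) on log(x); the slope is the "estimated elasticity".
    slopes[c] = ols_slope(log_x, [math.log(yi + c) for yi in y])
    print(f"c={c}: estimated elasticity = {slopes[c]:.3f}")
```

The data is identical in both runs; only the added constant changes, yet the estimated elasticity moves with it.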

What is a solution (other than deserting to Poisson etc.)? A recent study by Christophe Bellégo and his coauthors offers a solution called iOLS (iterated OLS). To avoid bias, the iOLS algorithm adds an observation-specific value to the outcome variable. Voila! I haven’t tested it yet but I like the idea. Read their nicely written paper here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3444996

My (not so hidden) agenda is regarding the implementation. The authors offer a Stata implementation (https://github.com/ldpape/iOLS). I would love to see it in R (or Python). Hence this is a call to action.

In defense of Amazon (Trends)

#WSJ continues to report on #Amazon’s shady practices. An earlier article said Amazon used sales data on third-party sellers to offer copycat, private-label products (like AmazonBasics). It was a coherent story but made hasty generalizations. Another piece showed how Amazon manipulates product search ads to favor its products. Both articles (linked within) underlined a data access problem: Amazon has access to the data on its rivals and exploits it for competitive advantage.

This latest article is not as coherent and a bit all over the place, but Amazon’s response is not helping either. Amazon says “Offering products inspired by the trends to which customers are responding is a common practice across the retail industry.” Amazon needs to nurture trust in its ecosystem but seems to be doing the opposite.

I don’t actually see any rampant issues except for access to product search data. Amazon is the dominant leader of the product search market (above Google and others). As a sign of good faith in building trust, Amazon could make (aggregated, anonymous) search data available and offer “Amazon Trends” like Google Trends. Needless to say, third-party sellers may be offered more in-depth access.

Visualizing the death of James Wolfe

History paintings are like data visualizations. Here, NYT’s Jason Farago presents Benjamin West’s 1770 painting “The Death of General Wolfe.”

If your dashboard looks like West’s painting, you are in trouble. Then you need a Jason Farago to make it accessible to the management team. Dashboards summarize data, as West did in this history painting in 1770 (accurately or not; see Jason’s walkthrough on that). The higher the density of information, the lower the chances of communicating successfully. Businesses increasingly need data translators or communicators, not so much “data artists.” West is the data artist. Jason is the data translator. West skillfully abuses ggplot and matplotlib for the sake of art. Jason further masters Plotly, Shiny, and Dash.

Even guesswork starts with “I don’t know”

To guess is to admit not knowing in the first place.

The problem with Dilbert’s coworkers and with most managerial teams is their resistance to admitting they don’t know. Even horoscopes and guesswork should start with the acknowledgment of a knowledge gap. Without such an acknowledgment, the time and effort needed to formulate and solve a problem are not justified. To guess is then to pretend to know.

Guesswork supersedes learning from data because there is nothing to learn when it is all known. Successful data centric companies need a culture that encourages not knowing as much as knowing.

Data worker vs. intelligent agent of AI

Absent imagination, data workers perform at best on par with intelligent agents, finding associations but failing in causality. Identifying causal links requires thinking in counterfactuals, which, in turn, requires imagining what could have been.

What is absent must be imagined while what is present remains obvious, even to an algorithm. Data centric companies should invest at least as much in the thinking skills and imaginative ability as in the coding skills of their data workers for value creation.

Swimming in data but blindly

Data show that masks can slow down the spread. Getting our economy back on its feet depends on slowing down the spread. Yet, wearing masks is not mandated, not at the federal level, not decisively. We are swimming in data but blindly. In addition to the likely direct effect on the spread, behavioral change following such a mandate can potentially help regain consumer confidence, increase spending, and boost the economy (or not, but it is an experiment worth pursuing given there is little, if anything, to lose).

Data centricity requires a shift in mindset, no matter whether it is policy making or business strategy making. Without this shift, decision makers may swim in a pool of charts and tables but can’t see.

From lock-in to “Trust us”

What struck me in this opinion piece is the depiction of how multisided (e.g., two-sided) platforms evolve, in an animated GIF by Ryan Kuo. Platform owners feel the need to say “Trust us” at some point, long after contractual relationships are established.

Platform owners gain power and lock in participants (e.g., sellers, buyers, app developers, users) by accumulating network effects and creating switching costs*. More power leads to governance decisions that are increasingly one-sided (e.g., decisions on application approval, product listings, content sharing, or commission/fees). Conflict of interest arises quickly. Trust deteriorates.

Lack of trust can make data centric companies vulnerable to disruption in the long term, even if network effects offer a protection in the short term. One sure way not to gain trust is having to say “Trust us.”

*Cross-side network effects: The more sellers on a platform, the more value for buyers. More buyers join and more sellers follow. As a seller builds a profile full of five star reviews, switching becomes costly. Lock-in can also arise from same-side network effects. In a social platform, value for a user increases with more users.

Mistaken like a human

Traditionally, computers process data quite differently from how human brains do. Computers are designed for precision while human brains rely on intuition. With artificial intelligence (#AI), or more specifically, deep learning and neural networks, one idea is to mimic the way human brains work. Does this mean that the hardware, or the body, also needs to change? Are CPUs and GPUs not up to the task anymore?

Graphcore.ai claims so, arguing that CPUs and even GPUs are out and IPUs are in. Graphcore’s #IPU stands for intelligence processing unit and is imprecise by design. It is a high-performance computing unit that processes data very imprecisely.

Consider a task like going to a restaurant. A human brain wouldn’t calculate the GPS coordinates but use associations; e.g., recall the restaurant’s name, its neighborhood, and neighboring shops. The difference resembles one between Boolean logic and fuzzy logic, and is true.

What is under the hood, or hardware, matters. One component of achieving data centricity is building an infrastructure that fits the objective, and successful data centric companies know they need to invest in it.

Analyzing data to do nothing

With an increasing availability of data and off-the-shelf analysis tools, interventions are thriving.

Interventions rarely create value. Rarity is expected simply because the probability of noise is often disproportionately higher. However, larger amounts of data exacerbate the problem of finding value in interventions where none exists. E.g., a frequentist test using a 0.01 p-value threshold would justify an intervention if the probability of observing an effect at least this large by chance alone is less than 1%. For a given effect, this probability gets smaller with more data, not because the intervention gains value*. 1% should be a moving target, but it is often treated as a fixed one. It should be adjusted also for other reasons, such as running multiple tests.
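
A small sketch of the sample size point, assuming a two-sample z-test with known, equal standard deviations and a made-up effect of 0.02 standard deviations, which is practically negligible:

```python
import math

def two_sided_p(effect, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test when the observed difference
    in means equals `effect` (known, equal sd in both groups)."""
    z = effect / (sd * math.sqrt(2.0 / n_per_group))
    return math.erfc(abs(z) / math.sqrt(2.0))

EFFECT, SD = 0.02, 1.0  # hypothetical tiny effect, in standard deviation units
sample_sizes = [1_000, 10_000, 100_000, 1_000_000]
p_values = [two_sided_p(EFFECT, SD, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    verdict = "intervene" if p < 0.01 else "do nothing"
    print(f"n per group = {n:>9,}: p = {p:.3g} -> {verdict}")
```

The effect is the same in every row; only the sample size changes, yet a fixed 0.01 threshold flips from “do nothing” to “intervene.”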

More importantly, it should be adjusted for unintended consequences. While quantifying the consequences is difficult, we can incentivize analytics teams for finding out what not to do. Action is visible but inaction is not. Successful data centric companies should not mistake thoughtful inaction for idleness. On the contrary, they should encourage and reward it.

*Assuming the actual effect is not zero. Valid for most (if not all) problems outside natural sciences.

Has Apple become the -old- Microsoft?

Why old? Well, it would be unfair to compare #Apple with today’s #Microsoft, the owner of #GitHub, a sponsor of the Open Source Initiative, and a proponent of innovation through collaboration and co-creation (!). The exclamation will have to stay for a while.

The fight between #Apple and #Hey (hey.com, a contender to #Gmail) is not a surprise but a reminder that Apple is increasingly in the business of value capture, not creation. The gist of the story is that Apple forces Hey to sell subscriptions on its iOS platform, but Hey refuses because the cost of doing so is a 30% commission for every subscriber. You can find the details in Kara Swisher’s article: nyti.ms/3ebfyvL

Apple seems to be stuck with incremental one-sided ideas, such as another iPhone with a larger screen or “dark mode” on its iOS platform, and has forgotten the value of co-creation, which propelled the company in the first place. Apple should be encouraging, not oppressing, experiments like Hey. For that, it is time for Apple to analyze its data from a fresh perspective that is not short-sighted on quarterly revenue, and rethink its model to embrace diversity again. That is what a successful data centric company would do.