Theft, stagnation and sweatshops in the world of AI

AI might be the buzzword right now, but the pitfalls are already visible.

Welcome to this edition of Over a Cup of Coffee!

In this newsletter:
- Stanford’s $500 AI tool is copied from China
- AI running out of training data
- Data for AI is coming from sweatshops

Stanford’s $500 AI tool is copied from China

Last week, researchers at Stanford University unveiled a new AI model called Llama 3V, claiming it matched the capabilities of GPT-4, Gemini Ultra, and Claude Opus.

It gained instant fame because the researchers claimed it had been trained for under $500, instead of the billions that tech companies are currently spending.

Members of the AI community soon noticed similarities between the tool's architecture and that of MiniCPM-Llama3-V 2.5, a model released much earlier by researchers at Tsinghua University and ModelBest, a China-based startup.

Interestingly, the Chinese-developed AI model had a hidden feature: it could identify text written on bamboo slips dating to between 475 and 221 BC, and the Stanford University model turned out to have the same ability.

Tsinghua University had acquired the bamboo slips in 2008 and never made the data public, so the Stanford team could not have developed this ability independently. Even the mistakes of the original AI model were repeated by the “Stanford-developed” AI model.

Stanford researchers have since apologized for the blatant plagiarism but largely shifted the blame to one of the research team members, who has been incommunicado since the controversy blew up.

You can read more about this in the South China Morning Post.

AI running out of training data

AI models have been trained on the tens of trillions of words humans have written and shared online. As models get bigger, they consume ever more of this publicly available data and could exhaust it entirely within the next six to eight years.
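To see why an exponentially growing appetite hits a fixed ceiling so quickly, here is a back-of-envelope sketch in Python. The stock, usage, and growth figures are hypothetical placeholders chosen only to land in the six-to-eight-year ballpark; they are not numbers from the Epoch AI paper linked below.

```python
# Back-of-envelope sketch: how long until an exponentially growing
# training set exhausts a fixed stock of public text? Every number
# here is a hypothetical placeholder, not a measured figure.
stock = 300e12   # assumed stock of usable public text, in tokens
used = 15e12     # assumed tokens consumed by today's largest training runs
growth = 1.6     # assumed yearly growth factor of training-set sizes

years = 0
while used < stock:
    used *= growth
    years += 1

print(f"stock exhausted after roughly {years} years")  # ~7 with these inputs
```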

Tech firms building AI models are already scraping the internet for data and are also willing to pay (more on this later) for data that can help improve their models.

This could include blogs, news articles, social media posts, and other human-generated and publicly shared information. Other bits of data, such as emails and text messages, are considered private and out of their reach (at least as of now).

Somewhere near the end of the decade, tech companies will hit a dead end: without enough fresh data, their models' capabilities will begin to stagnate.

One option available to companies is to let AI generate more content, referred to as ‘synthetic’ content, which can then be used to train the model further. But so far, synthetic content has been found to be inferior in quality to human-made content.

And when models are trained over and over on their own synthetic output, their quality degrades with each generation, a failure mode known as “model collapse”.
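Here is a minimal sketch of that mechanism, assuming a toy “model” that does nothing but memorize a word-frequency distribution; the vocabulary and corpus sizes are arbitrary. Once a rare word draws zero samples in one generation's synthetic corpus, its probability becomes zero forever, so the distribution's tail erodes generation by generation.

```python
import numpy as np

# Toy "model collapse": a model that just memorizes a word-frequency
# distribution is retrained, generation after generation, on its own
# synthetic corpus. Rare words that draw zero samples vanish for good.
rng = np.random.default_rng(42)

VOCAB = 1_000                         # hypothetical vocabulary size
CORPUS = 5_000                        # hypothetical synthetic-corpus size
freq = 1.0 / np.arange(1, VOCAB + 1)  # Zipf-like long-tailed frequencies
probs = freq / freq.sum()             # the "human" word distribution

for generation in range(11):
    alive = int((probs > 0).sum())
    print(f"generation {generation:2d}: {alive} distinct words survive")
    counts = rng.multinomial(CORPUS, probs)  # sample a synthetic corpus
    probs = counts / counts.sum()            # retrain on synthetic data
```

Run it and the number of surviving words drops every generation; the rare, informative tail of the distribution is the first thing to go, which is why synthetic data alone tends to impoverish a model rather than enrich it.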

You can read more about it in this research paper from Epoch AI, a team of scientists investigating the future of AI.

Data for AI is coming from sweatshops

Remember the backlash Nike faced in the 1990s for sourcing its goods from sweatshops in developing nations?

The world may have advanced technologically in the decades since, but the way the work is organized is much the same. Instead of warehouses, these workers sit at laptops; the hours are still long, and the wages meager.

Representational image created using AI. Image credit: Microsoft Designer

The job involves labeling data or translating text so that AI models can be trained on it, but wages have been stuck at as little as $1.70 an hour for years.

The data processed at these dirt-cheap rates goes to companies like Microsoft and OpenAI, which are valued in the billions, while the workers struggle to earn even a minimum wage.

World Bank estimates suggest that between 150 million and 430 million people are employed this way.

You can read more about this in this Bloomberg piece.

Thank you for reading till the end.

If you found this edition interesting, share it with others.

Until next time,
Ameya
