

Opinion

When I first heard the phrase, “data is the new oil”, I wrote it off as clever marketing.


But having worked in machine learning for a few years, I take it back.


It’s an understatement.


The right data can launch a company, create jobs, and solve real problems.


If only it were that easy.


You cannot just build an AI company

The smartest AI scientists cannot train a model without data.


This poses a chicken and egg problem for startups.


You need data to build an “AI Company”. But you need a functioning company to collect data in a specific domain.


This explains why:

  • companies pretend to do AI

  • very few products have AI at their core

  • big companies (with data) have a huge advantage


Solutions include data partnerships with larger companies and using public datasets.


But I propose an alternative.

Build a business (software or otherwise), collect data, and THEN use ML to augment that business.


Make data collection a priority from day one. Lack of data is a bigger impediment to AI than lack of talent.


You do not need big data, you need niche data

Self-driving cars and AI-powered drug discovery require huge amounts of data.


But if you minimize the scope of a problem, you often don’t need much data.


Recipe generation in a specific cuisine, optimizing water levels for greenhouse tomatoes and brewing the perfect shot of espresso likely don’t require a million data points.


If you can automate a tiresome/time-consuming piece of work with a combination of narrow models, domain knowledge and hardcoded logic, you’ve built something valuable.
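As a rough illustration of that combination, here is a minimal sketch built on the greenhouse tomato example. Everything in it is an assumption made for illustration: the readings in `HISTORY` are invented, the frost rule and the 0 to 5 litre limits stand in for whatever a real grower would tell you, and the "narrow model" is just a least-squares line fitted to four points.

```python
from statistics import mean

# Hypothetical readings: (soil moisture %, litres of water that worked well).
# These numbers are invented for illustration, not real agronomy data.
HISTORY = [(20.0, 4.0), (30.0, 3.0), (40.0, 2.0), (50.0, 1.0)]


def fit_line(points):
    """Narrow model: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in points)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def litres_to_water(moisture_pct, frost_warning, model):
    """Combine a hardcoded rule, a tiny model, and expert limits."""
    # Hardcoded logic: skip watering when frost is forecast (assumed rule).
    if frost_warning:
        return 0.0
    slope, intercept = model
    predicted = slope * moisture_pct + intercept
    # Domain knowledge: clamp to a range a grower would accept (assumed limits).
    return max(0.0, min(5.0, predicted))


model = fit_line(HISTORY)
print(litres_to_water(35.0, frost_warning=False, model=model))
```

The point is not the model, which is trivial, but that a handful of niche data points plus rules from someone who knows the domain can already drive a useful decision.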


General AI is overrated. Solve a specific problem in a specific domain.

The democratization of AI is overhyped

By definition, democratization means increasing access for those without knowledge and resources.

In reality, it’s marketing from companies providing AI-powered APIs.


The ability to easily add AI-powered chat, image recognition, or sentiment analysis is great for mildly augmenting an existing product.


But not for building the core of a product.


  • It provides no moat against other companies using the same API

  • You’re giving away hard-earned data that a larger company will use to train its models

  • The API could be deprecated one day


You need to own your models to build a sustainable enterprise.


Silo your own data

I wish there were enough open data for everyone. There isn’t.

For an AI startup, data is your moat.


Big companies have huge amounts of siloed data.

  • Google has browsing history

  • Facebook has your images, friends and interests

  • Amazon has purchase history


This combined data could spawn a hundred new companies. But it won’t.


You need your own private store of data. You can open source what you’ve built after you’re successful.


Supercharge your data with domain knowledge

If data is a moat, data + domain knowledge is an ocean.

Most of the real opportunities lie in solving problems you wouldn’t know about unless you had worked in the field.


Given 20 years of granular weather data, I could come up with a few potential startup ideas. But a farmer, a general contractor, or a logistics company could come up with use cases I couldn’t imagine. The world’s real problems are better suited to domain experts than to teams of engineers at FAAMG.

Labelling data that requires domain expertise is nearly impossible to outsource. I’ve tried. Outsourcing the labelling of dog vs. cat images is easy. But classifying legal cases takes an expert.

You probably need to label your own data. It sucks. But it’s a good thing if only you can do it.


Do not over-rely on public data sets

If you find a great opportunity in a public dataset, take it. But anecdotally, these are few and far between.

A business built on a public dataset is less defensible because anyone can use the same data, and you likely can’t generate additional data points unless the dataset is updated.


As a proportion of data in the world, public datasets make up a tiny fraction of a tiny fraction.


To riff on my previous example, finding images of dogs vs. cats is easy. Finding images of hamburger buns without enough seeds is hard.

In my experience, this effect is even more pronounced in NLP than in images.


Collect your own data.


This is an opportunity

AI has a data problem. And we know the future has more AI. So solving that is an opportunity.


Governments can spur innovation with data

Governments own a lot of data. Not all of it is sensitive.


Open data and data partnerships can attract companies to solve specific problems and, given the right data, generate economic value.


This Fish Hackathon comes to mind.


Sell shovels to the miners

During historical gold rushes, selling shovels was more profitable than mining. Supporting AI companies with data is a product. We need more:


  • data marketplaces

  • the ability to rent domain expertise

  • data curation


Solving the data problem is an integral part of solving problems with AI.


Conclusion

Disclaimer: I conflate AI with ML.


There are a lot of problems that artificial intelligence has the potential to solve. AI requires data to do that.


Being a machine learning expert is not enough. Acquiring and creatively using data is a business problem that comes first.


This isn’t easy and that’s a good thing. When you have the data, it gives you an advantage over the competition and over others with technical expertise.

Translated from: https://towardsdatascience.com/data-over-everything-abbeb9ee758
