The What and the Why
The vast majority of data scientists nowadays operate in the land of Python and pandas. With Python’s gold-star scientific computing stack, it’s really difficult to make the case for why it should be any different. The combination of “convenience when you can” and “performance when you must” via bindings is difficult to beat.
However, there is a similar channel of popular usage that serves application developers more-so than bread-and-butter data scientists: machine learning in javascript.
Wait…what?
Machine Learning in JavaScript
It’s not as ridiculous as it sounds. Well, mostly.
For sure, the vast majority of data scientists should continue to use the existing and popular Python-based frameworks (PyTorch, TensorFlow, ONNX, etc.). These frameworks have been highly optimized to support fast research and application, both at small scale and at large scale.
However, machine learning democratization is well underway, and everyone wants a piece of the action, from enthusiasts to application developers to physicists. Application developers have experienced an explosion of end-to-end functionality since the rise of NodeJS, so it makes sense that the desire arose for machine learning to be possible in the browser as well. Note that this means the browser truly processes samples into predictions, not just forks off requests to a remote back end.
This in-browser ML movement has been developing for some time now, largely with TensorflowJS. Google’s JavaScript-based Tensorflow library has brought machine learning to the masses of full-stack developers who don’t want to spin up multiple services just so a Python-based service can do all the algorithmic processing.
A Grungy Data Munge
There has still been a key piece missing in the pipeline, though. Providing ML modeling to common developers only means so much when it is not accompanied by powerful and flexible data processing libraries.
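To make that gap concrete, here is a rough sketch of one small cleaning step written in plain JavaScript with no data library at all: parsing a tiny CSV string and filling a missing value with the column mean. The data, the column names, and the helpers `parseCsv` and `fillWithMean` are all made up for illustration, and the comma splitting is deliberately naive (it would not survive quoted fields).

```javascript
// Plain JavaScript, no libraries. All names and data here are hypothetical.
const csv = [
  "name,age,fare",
  "Ann,30,10",
  "Bob,,20",
  "Cy,50,30",
].join("\n");

// Turn CSV text into row objects keyed by header.
// Cells in numeric columns become numbers; empty numeric cells become NaN.
function parseCsv(text, numericColumns) {
  const [headerLine, ...lines] = text.trim().split("\n");
  const headers = headerLine.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    const row = {};
    headers.forEach((header, i) => {
      const cell = cells[i] ?? "";
      row[header] = numericColumns.includes(header)
        ? (cell === "" ? NaN : Number(cell))
        : cell;
    });
    return row;
  });
}

// Replace NaN entries in one column with the mean of the observed values.
function fillWithMean(rows, column) {
  const observed = rows.map((r) => r[column]).filter((v) => !Number.isNaN(v));
  const mean = observed.reduce((sum, v) => sum + v, 0) / observed.length;
  return rows.map((r) => ({
    ...r,
    [column]: Number.isNaN(r[column]) ? mean : r[column],
  }));
}

const rows = fillWithMean(parseCsv(csv, ["age", "fare"]), "age");
console.log(rows);
```

That is roughly thirty lines for a single column, and every project would end up writing its own slightly different version.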
Enter DanfoJS.
“Danfo.js is an open-source, JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.” (DanfoJS Documentation)
Imagine the verbosity of pre-processing and ETL code without our beloved pandas, NumPy, scikit-learn, and others making it much easier and more concise. Besides many additional hours of custom work, everyone’s ML pipelines end up looking even more different and fragmented than in the python landscape. The developers of Danfo understand. Look no further than some of the key points advertised on the main page:
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be inserted/deleted from DataFrame
- Automatic and explicit alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert Arrays, JSONs, List or Objects, Tensors, and differently-indexed data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and querying of large data sets
- Intuitive merging and joining data sets
- Robust IO tools for loading data from flat-files (CSV and delimited) and JSON data format.
- Powerful, flexible, and intuitive API for plotting DataFrames and Series interactively.
- Timeseries-specific functionality: date range generation and date and time properties.

(DanfoJS Documentation)
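For readers who have not met the term, “split-apply-combine” from the list above can be sketched in plain JavaScript. This is only an illustration of the idea, not Danfo's API: the `passengers` data and the helpers `groupBy` and `meanByGroup` are hypothetical names invented for this example.

```javascript
// Split-apply-combine in plain JavaScript (illustrative only, not Danfo's API).
const passengers = [
  { pclass: 1, fare: 80 },
  { pclass: 3, fare: 8 },
  { pclass: 1, fare: 60 },
  { pclass: 3, fare: 12 },
];

// Split: bucket rows by the value found under `key`.
function groupBy(rows, key) {
  const groups = new Map();
  for (const row of rows) {
    const k = row[key];
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(row);
  }
  return groups;
}

// Apply + combine: reduce each bucket to its mean, then collect one row per group.
function meanByGroup(rows, key, column) {
  const result = [];
  for (const [k, members] of groupBy(rows, key)) {
    const total = members.reduce((sum, r) => sum + r[column], 0);
    result.push({ [key]: k, [column + "_mean"]: total / members.length });
  }
  return result;
}

const meanFare = meanByGroup(passengers, "pclass", "fare");
console.log(meanFare);
```

A DataFrame library wraps this pattern in a single group-by call, which is the kind of conciseness the list is advertising.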
A First Look
Take a look at the snippet below, taken from an example notebook that trains a Titanic survival prediction model with TensorflowJS. Not too different from typical pandas syntax, if you ask me.
// see more at https://danfo.jsdata.org/examples/titanic-survival-prediction-using-danfo.js-and-tensorflow.js
const dfd = require("danfojs-node")
const tf = require("@tensorflow/tfjs-node")

async function load_process_data() {
  let df = await dfd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")

  // Feature engineering: extract the title (the text before the first ".") from each name
  let title = df['Name'].apply((x) => { return x.split(".")[0] }).values
  // Replace the Name column with the extracted titles
  df.addColumn({ column: "Name", value: title })

  // Label-encode the categorical Sex and Name features
  let encoder = new dfd.LabelEncoder()
  let cols = ["Sex", "Name"]
  cols.forEach(col => {
    encoder.fit(df[col])
    const enc_val = encoder.transform(df[col])
    df.addColumn({ column: col, value: enc_val })
  })

  // Features are every column after the first; the label is the Survived column
  let Xtrain, ytrain;
  Xtrain = df.iloc({ columns: [`1:`] })
  ytrain = df['Survived']

  // Scale the features to the [0, 1] range with MinMaxScaler
  let scaler = new dfd.MinMaxScaler()
  scaler.fit(Xtrain)
  Xtrain = scaler.transform(Xtrain)

  return [Xtrain.tensor, ytrain.tensor] // return the data as tensors
}

load_process_data()
A few things pop out, just from this snippet:
- the syntax is very familiar to Python data ecosystem users
- the code also has a MinMaxScaler helper
- the data types have first-class support for tensors
Nice, no more googling “np.array to tensor”.
The library’s offerings include additional scaling/labeling helpers typically present in modeling libraries: OneHotEncoder, StandardScaler, MinMaxScaler, and LabelEncoder.
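As a rough idea of what two of these helpers compute, here is a plain JavaScript sketch of label encoding and min-max scaling. These are my own minimal re-implementations for illustration, not Danfo's code: the function names are hypothetical, and details such as the order in which labels are assigned may differ in the real library.

```javascript
// Illustrative re-implementations; the library's own classes may differ in detail.

// Label encoding: map each distinct category to an integer.
// Assumption for this sketch: integers are assigned in order of first appearance.
function fitLabelEncoder(values) {
  const mapping = new Map();
  for (const v of values) {
    if (!mapping.has(v)) mapping.set(v, mapping.size);
  }
  return mapping;
}

function labelEncode(values, mapping) {
  return values.map((v) => mapping.get(v));
}

// Min-max scaling: rescale each value to [0, 1] via (x - min) / (max - min).
function fitMinMax(values) {
  return { min: Math.min(...values), max: Math.max(...values) };
}

function minMaxScale(values, { min, max }) {
  const range = max - min;
  // A constant column has zero range; map it to 0 rather than dividing by zero.
  return values.map((v) => (range === 0 ? 0 : (v - min) / range));
}

const sex = ["male", "female", "female", "male"];
const encodedSex = labelEncode(sex, fitLabelEncoder(sex));

const ages = [20, 30, 40];
const scaledAges = minMaxScale(ages, fitMinMax(ages));

console.log(encodedSex, scaledAges);
```

The fit and transform steps are kept separate on purpose, mirroring the `fit` / `transform` pattern in the snippet above, so that parameters learned on training data can be reused on new data.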
Conclusion
Many view in-browser machine learning as a lame duck. I think it has valid use cases, such as end-to-end JavaScript applications. With the development of DanfoJS, ML pipelines built on TensorflowJS can be cleaned up a lot, and data processing code that currently lives elsewhere can move into the same codebase where appropriate.
Beyond validating the movement, I’m most interested in seeing what projects come of this as time goes on. Do you have any project ideas? Let me know here.
Resources
DanfoJS Home Page
10 Minutes to DanfoJS
DanfoJS API Reference
TensorflowJS Examples
Titanic Survival Prediction with DanfoJS & TensorflowJS
Stay Up To Date
Aside from here on Medium, keep yourself updated with the LifeWithData blog, the Machine Learning UTD Newsletter, and my Twitter. Through those platforms, I provide more long-form and short-form thoughts, respectively.
If you’re not a fan of emails and social media, but still want to stay in the loop, consider adding lifewithdata.org/blog and lifewithdata.org/newsletter to a Feedly aggregation setup.
Translated from: https://towardsdatascience.com/hello-danfo-pandas-for-javascript-from-tensorflow-3d1d0ea3f3be