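David Bourgin, a PhD from UC Berkeley, hand-rolled a wide range of machine learning algorithms in plain numpy, and the repo has pulled in 13.3k stars on GitHub.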
https://github.com/ddbourgin/numpy-ml/tree/master
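Why am I recommending this resource?
There are plenty of open-source machine learning frameworks and libraries out there, such as sklearn, scipy, and tensorflow.
But when you want to debug, or see how a particular detail is actually implemented, you find that these frameworks all depend on many other libraries.
numpy-ml, by contrast, depends only on numpy.
Because it uses no other third-party libraries, many methods are implemented from scratch, so numpy-ml is a good choice when you want to check theory against source code.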
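For example, for ALS matrix factorization, you can read through the code to follow the iterations that solve for each factor matrix in turn.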
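To give a feel for the kind of iteration you would be reading, here is a minimal sketch of regularized alternating least squares in plain numpy. It is not numpy-ml's actual code: the function name `als_factorize` and its parameters are made up for illustration, and it assumes a dense matrix with no missing entries.

```python
import numpy as np

def als_factorize(X, rank=2, reg=0.01, n_iters=30, seed=0):
    """Approximate X (n x m) as W @ H, with W (n x rank) and H (rank x m).

    Hypothetical helper for illustration only, not part of numpy-ml.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.standard_normal((n, rank))
    H = rng.standard_normal((rank, m))
    ridge = reg * np.eye(rank)
    losses = []
    for _ in range(n_iters):
        # Hold H fixed and solve the ridge problem for W:
        #   W = X H^T (H H^T + reg * I)^-1
        W = np.linalg.solve(H @ H.T + ridge, H @ X.T).T
        # Hold W fixed and solve the ridge problem for H:
        #   H = (W^T W + reg * I)^-1 W^T X
        H = np.linalg.solve(W.T @ W + ridge, W.T @ X)
        # Regularized squared reconstruction error after this sweep
        loss = np.sum((X - W @ H) ** 2) + reg * (np.sum(W ** 2) + np.sum(H ** 2))
        losses.append(loss)
    return W, H, losses

# Toy data: an exactly rank-2 matrix
rng = np.random.default_rng(42)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
W, H, losses = als_factorize(X, rank=2, reg=0.01, n_iters=30)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Each half-step is an exact least-squares solve with the other factor held fixed, so the regularized loss should not increase from one sweep to the next (up to floating-point noise).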
For building decision trees, the code that picks a split condition from information gain is also spelled out in detail.
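As a rough illustration of that split search, the sketch below computes entropy-based information gain and scans candidate thresholds. It is a simplified stand-in, not numpy-ml's implementation: `entropy`, `information_gain`, and `best_split` are hypothetical names, and it assumes numeric features and a classification target.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(y)
    n_left = int(left_mask.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty tells us nothing
    child = (n_left / n) * entropy(y[left_mask]) \
          + (n_right / n) * entropy(y[~left_mask])
    return entropy(y) - child

def best_split(X, y):
    """Try every midpoint between sorted unique values of every feature."""
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])  # sorted unique values
        thresholds = (values[:-1] + values[1:]) / 2
        for t in thresholds:
            gain = information_gain(y, X[:, j] <= t)
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Toy data: feature 0 separates the classes perfectly, feature 1 does not
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, gain = best_split(X, y)
print(feature, threshold, gain)
```

On this toy data the search settles on feature 0 with a threshold of 2.5, since that split leaves both children pure.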
Main contents:
Gaussian mixture model
EM training
Hidden Markov model
Viterbi decoding
Likelihood computation
MLE parameter estimation via Baum-Welch/forward-backward algorithm
Latent Dirichlet allocation (topic model)
Standard model with MLE parameter estimation via variational EM
Smoothed model with MAP parameter estimation via MCMC
Neural networks
Layers / Layer-wise ops:
  Add
  Flatten
  Multiply
  Softmax
  Fully-connected/Dense
  Sparse evolutionary connections
  LSTM
  Elman-style RNN
  Max + average pooling
  Dot-product attention
  Embedding layer
  Restricted Boltzmann machine (w. CD-n training)
  2D deconvolution (w. padding and stride)
  2D convolution (w. padding, dilation, and stride)
  1D convolution (w. padding, dilation, stride, and causality)
Modules:
  Bidirectional LSTM
  ResNet-style residual blocks (identity and convolution)
  WaveNet-style residual blocks with dilated causal convolutions
  Transformer-style multi-headed scaled dot product attention
Regularizers:
  Dropout
Normalization:
  Batch normalization (spatial and temporal)
  Layer normalization (spatial and temporal)
Optimizers:
  SGD w/ momentum
  AdaGrad
  RMSProp
  Adam
Learning Rate Schedulers:
  Constant
  Exponential
  Noam/Transformer
  Dlib scheduler
Weight Initializers:
  Glorot/Xavier uniform and normal
  He/Kaiming uniform and normal
  Standard and truncated normal
Losses:
  Cross entropy
  Squared error
  Bernoulli VAE loss
  Wasserstein loss with gradient penalty
  Noise contrastive estimation loss
Activations:
  ReLU
  Tanh
  Affine
  Sigmoid
  Leaky ReLU
  ELU
  SELU
  GELU
  Exponential
  Hard Sigmoid
  Softplus
Models:
  Bernoulli variational autoencoder
  Wasserstein GAN with gradient penalty
  word2vec encoder with skip-gram and CBOW architectures
Utilities:
  col2im (MATLAB port)
  im2col (MATLAB port)
  conv1D
  conv2D
  deconv2D
  minibatch
Tree-based models
Decision trees (CART)
[Bagging] Random forests
[Boosting] Gradient-boosted decision trees
Linear models
Ridge regression
Logistic regression
Ordinary least squares
Weighted linear regression
Generalized linear model (log, logit, and identity link)
Gaussian naive Bayes classifier
Bayesian linear regression w/ conjugate priors:
  Unknown mean, known variance (Gaussian prior)
  Unknown mean, unknown variance (Normal-Gamma / Normal-Inverse-Wishart prior)
n-Gram sequence models
Maximum likelihood scores
Additive/Lidstone smoothing
Simple Good-Turing smoothing
Multi-armed bandit models
UCB1
LinUCB
Epsilon-greedy
Thompson sampling w/ conjugate priors:
  Beta-Bernoulli sampler
Reinforcement learning models
Cross-entropy method agent
First visit on-policy Monte Carlo agent
Weighted incremental importance sampling Monte Carlo agent
Expected SARSA agent
TD-0 Q-learning agent
Dyna-Q / Dyna-Q+ with prioritized sweeping
Nonparametric models
Nadaraya-Watson kernel regression
k-Nearest neighbors classification and regression
Gaussian process regression
Matrix factorization
Regularized alternating least-squares
Non-negative matrix factorization
Preprocessing
Discrete Fourier transform (1D signals)
Discrete cosine transform (type-II) (1D signals)
Bilinear interpolation (2D signals)
Nearest neighbor interpolation (1D and 2D signals)
Autocorrelation (1D signals)
Signal windowing
Text tokenization
Feature hashing
Feature standardization
One-hot encoding / decoding
Huffman coding / decoding
Byte pair encoding / decoding
Term frequency-inverse document frequency (TF-IDF) encoding
MFCC encoding
Utilities
Similarity kernels
Distance metrics
Priority queue
Ball tree
Discrete sampler
Graph processing and generators
If numpy already supports all kinds of data operations, why do we still need other machine learning frameworks?
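NumPy is a powerful library that supports many kinds of data operations, but its focus is array manipulation and numerical computation. Machine learning involves many other complex tasks and algorithms on top of basic numerical work, which is why dedicated machine learning frameworks are needed. The main reasons include: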
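Advanced machine learning algorithms: NumPy itself provides numerical building blocks rather than ready-made models, so it covers far fewer algorithms than a full framework. If you need more advanced functionality and optimization, you need a dedicated machine learning framework such as TensorFlow, PyTorch, or scikit-learn.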
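Automatic differentiation and gradient computation: When training deep learning models such as neural networks, computing gradients is the key step that drives parameter updates during backpropagation. NumPy has no automatic differentiation, whereas deep learning frameworks such as TensorFlow and PyTorch compute gradients for you.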
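Advanced data processing and preprocessing: Data processing and preprocessing matter a great deal in machine learning tasks. Dedicated frameworks provide a rich set of tools and functions for data loading, transformation, feature engineering, and data augmentation, which makes preparing data more convenient and flexible.
Distributed and accelerated computing: Large datasets and complex models call for distributed computing and high-performance acceleration. Some machine learning frameworks support distributed computing and can run training and inference on clusters or on accelerators such as GPUs to improve efficiency and speed.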
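To make the automatic differentiation point above concrete: in plain numpy you derive and code the gradient yourself, which is exactly what a library like numpy-ml has to do for every layer and loss. The sketch below does this for logistic regression and checks the hand-written gradient against finite differences. The function names and toy data are assumptions made for illustration, not anything taken from numpy-ml.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean logistic loss and its hand-derived gradient with respect to w."""
    z = X @ w
    # log(1 + exp(z)) - y * z is the per-sample cross entropy,
    # written with logaddexp for numerical stability
    loss = np.mean(np.logaddexp(0.0, z) - y * z)
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid(z)
    grad = X.T @ (p - y) / len(y)   # d(loss)/dw, worked out by hand
    return loss, grad

def numerical_grad(w, X, y, h=1e-6):
    """Central finite differences, used only to check the manual gradient."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (loss_and_grad(w + step, X, y)[0]
                - loss_and_grad(w - step, X, y)[0]) / (2 * h)
    return g

# Toy data (assumed for this example)
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.standard_normal(3)

initial_loss, grad = loss_and_grad(w, X, y)
max_diff = np.max(np.abs(grad - numerical_grad(w, X, y)))

# Plain gradient descent using the manual gradient
for _ in range(100):
    _, g = loss_and_grad(w, X, y)
    w = w - 0.1 * g
final_loss, _ = loss_and_grad(w, X, y)
print("gradient check diff:", max_diff)
print("loss before/after:", initial_loss, final_loss)
```

With an autodiff framework, the `grad` line would be produced for you from the loss definition. Here a mistake in the derivation would go unnoticed unless you run a check like `numerical_grad`.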