This figure shows my classification and summary of these papers.
My reading notes are below. Each note follows the paper's headline and is divided into several parts: a summary of the paper, its advantages, an evaluation, and possible improvements.
Learning representations by back-propagating errors
2. In its simplest form, the learning procedure uses a network divided into three levels: a layer of input units at the bottom, any number of intermediate layers, and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. The states of the units in each layer are then determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.
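A minimal sketch (not the paper's original code) of this bottom-up pass, assuming illustrative layer sizes and random weights: equation (1) computes each unit's total input as a weighted sum of the states of lower-layer units, and equation (2) squashes it with the logistic function.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))           # equation (2): y_j = 1 / (1 + e^{-x_j})

def forward(input_vector, weights, biases):
    """Set layer states sequentially, from the input layer up to the output layer."""
    y = np.asarray(input_vector, dtype=float)  # states of the input units
    for W, b in zip(weights, biases):
        x = W @ y + b                          # equation (1): weighted sum of lower-layer states
        y = logistic(x)                        # states of this layer, set in parallel
    return y                                   # states of the output units

# Illustrative 3-2-1 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
print(forward([1.0, 0.0, 1.0], weights, biases))
```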
3. The most obvious limitation of this learning procedure concerns local versus global minima: the error surface may contain local minima, so gradient descent is not guaranteed to find a global minimum. In practice, however, this rarely causes problems.
4. Although the learning procedure, in its current form, is not a plausible model of learning in brains, its application to various tasks shows that interesting internal representations can be constructed by gradient descent in weight space. This suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.
Attention Is All You Need
1. The team proposes the Transformer, a new and simple network architecture based entirely on attention mechanisms, dispensing with recurrence and convolutions altogether. Like the dominant sequence transduction models built on complex recurrent or convolutional neural networks, it consists of an encoder and a decoder, but the two are connected purely through an attention mechanism. The experiments show that these models are superior in quality, are more parallelizable, and require significantly less time to train.
2. Recurrent language models and encoder-decoder architectures are the established approaches to sequence modeling and transduction problems such as language modeling and machine translation. Although recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, the constraints imposed by sequential computation and the vanishing-gradient problem remain. Attention mechanisms allow dependencies to be modeled without regard to their distance in the input or output sequences, which lets the Transformer eschew recurrence entirely. The Transformer therefore relies solely on attention to draw global dependencies between input and output, which allows significantly more parallelization and reaches a new state of the art in translation quality after sufficient training.
3. The Transformer follows the overall encoder-decoder architecture. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, from which the decoder generates an output sequence. The whole process is auto-regressive: each output depends on the previously generated outputs. The encoder is composed of a stack of N = 6 identical layers, and so is the decoder.
An attention function maps a query and a set of key-value pairs to an output, where the output is a weighted sum of the values. The Transformer uses multi-head attention in three different ways. In the “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. The encoder contains self-attention layers. Self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. A sketch of the underlying attention computation follows.
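A minimal NumPy sketch of the attention function described above, following the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; the shapes and random inputs are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Illustrative self-attention: queries, keys and values all come from the same sequence.
rng = np.random.default_rng(0)
seq = rng.standard_normal((5, 64))                  # 5 positions, d_model = 64
out = scaled_dot_product_attention(seq, seq, seq)
print(out.shape)                                    # (5, 64)
```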
4. Considering the total computational complexity per layer, the amount of computation that can be parallelized, and the path length between long-range dependencies in the network, the team chose self-attention. Self-attention enables parallel computation, has lower per-layer complexity for typical sequence lengths, and tends to yield more interpretable models.
5. The team trained on the standard WMT 2014 translation datasets and optimized with Adam. Residual dropout and label smoothing were used for regularization. Further experiments show that the big Transformer model achieves higher scores at lower training cost than the previously published best models. As expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.
As the first sequence transduction model based entirely on attention, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers for translation tasks. The team plans to extend the Transformer to problems involving input and output modalities other than text.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Scaling up the capacity of deep neural networks is known to be an effective way to improve model quality for several different machine learning tasks. In many instances, however, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure, and such solutions are usually architecture-specific and do not transfer to other tasks. Considering these constraints and limitations, the team introduced GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers, meeting the requirements of efficient and task-independent model parallelism. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility to scale a variety of different networks efficiently to very large sizes. Moreover, when a model is divided across multiple accelerators, GPipe achieves almost linear speedup using a novel batch-splitting pipelining algorithm. The team trained large-scale neural networks on image classification, attaining a top-1 accuracy of 84.4% on ImageNet-2012. The approach also works well for multilingual neural machine translation: a single large-scale Transformer model trained on a corpus spanning over 100 languages achieved better quality than all bilingual models.
Deep learning has made great progress thanks to methods that scale up the effective capacity of neural networks. In general, the larger the model, the better its performance on the task; it is widely acknowledged that there is a strong correlation between model size and classification accuracy. However, although larger models have brought remarkable quality improvements across many fields, scaling neural networks introduces significant practical challenges. Large models face hardware constraints such as memory limitations and communication bandwidth on accelerators. In practice, users have to divide large models into parts and assign them to different accelerators (e.g., GPUs) to deal with the problem, but efficient model-parallel algorithms are extremely difficult to design and implement. Practitioners are therefore often forced to make difficult trade-offs among scaling capacity, flexibility (or specificity to particular tasks and architectures), and training efficiency. Model-parallel algorithms can be very efficient when tailored to a specific architecture and task, but they do not generalize. With the development of deep learning and the increasing demand for reliable and flexible infrastructure, researchers need to be able to scale neural networks easily so that they can tackle a wide variety of machine learning tasks.
GPipe addresses the memory limitation by partitioning the model across different accelerators. Each model can be specified as a sequence of layers, and consecutive groups of layers are partitioned into cells, each placed on a separate accelerator. On top of this partitioned setup, the team proposes a novel pipeline parallelism algorithm that splits a mini-batch of training examples into smaller micro-batches and pipelines the execution of each set of micro-batches over the cells. This algorithm lets researchers train increasingly large models simply by deploying more accelerators. GPipe can also be combined with data parallelism in a complementary manner, using even more accelerators to expand the scale of neural network training. The team applied RMSProp during training, and GPipe itself introduces very little additional communication overhead.
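A simplified, illustrative sketch (not GPipe's actual implementation) of the batch-splitting idea described above: a mini-batch is split into micro-batches, and each micro-batch flows through the sequence of cells (layer groups), so different cells can work on different micro-batches at the same time. The pipeline is only simulated on one device, and the cell functions are placeholders.

```python
import numpy as np

def split_into_microbatches(minibatch, num_micro):
    return np.array_split(minibatch, num_micro)

def pipeline_forward(microbatches, cells):
    """Schedule forward steps so that cell k processes micro-batch i at step i + k."""
    num_micro, num_cells = len(microbatches), len(cells)
    buffers = [list(microbatches)] + [[] for _ in range(num_cells)]
    for step in range(num_micro + num_cells - 1):
        # At each step, every cell that has an input available processes one micro-batch.
        for k in reversed(range(num_cells)):
            if buffers[k]:
                buffers[k + 1].append(cells[k](buffers[k].pop(0)))
    return np.concatenate(buffers[-1])

# Illustrative example: 4 micro-batches flowing through 2 "cells".
cells = [lambda x: x * 2.0, lambda x: x + 1.0]
minibatch = np.arange(8, dtype=float)
print(pipeline_forward(split_into_microbatches(minibatch, 4), cells))
```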
3. In such scenarios, imperfect partitioning algorithms might lead to load imbalance, and better partitioning algorithms could improve performance over the heuristic approach. The authors found that re-computation time was the main contributor to GPipe's overhead, taking up to 23% of the total step time; another source of overhead was load imbalance.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
The paper proposes a new language representation model named BERT, which stands for Bidirectional Encoder Representations from Transformers.
As is well known, pre-trained models have proven effective for many natural language tasks, including sentence-level tasks such as natural language inference and paraphrasing, which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models must produce fine-grained output at the token level. In previous work, the two approaches for applying pre-trained language representations to downstream tasks, feature-based and fine-tuning, share the same objective function during pre-training, and both use unidirectional language models to learn general language representations. The team argues that it is these current techniques that restrict the power of the pre-trained representations. The core limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. These restrictions are sub-optimal for sentence-level tasks and can be devastating for token-level tasks, where it is crucial to incorporate context from both directions.
In this paper, the team proposes BERT to improve the fine-tuning based approach. Inspired by the Cloze task, BERT uses a “masked language model” pre-training objective to remove the unidirectionality constraint mentioned above. Unlike left-to-right language model pre-training, the masked objective enables the representation to incorporate context from both directions, which allows the team to pre-train a deep bidirectional Transformer. In their work, they demonstrate the importance of bidirectional pre-training and show that pre-trained representations reduce the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based model to achieve state-of-the-art performance on a large suite of both sentence-level and token-level tasks, obtaining new state-of-the-art results on 11 NLP tasks.
The BERT framework contains two steps: pre-training and fine-tuning. In pre-training, the model is trained on unlabeled data over different tasks. For fine-tuning, BERT is first initialized with the pre-trained parameters, and all of the parameters are then fine-tuned using labeled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialized with the same pre-trained parameters. Pre-training and fine-tuning use the same architecture except for the output layers, so the difference between the pre-trained architecture and the final downstream architecture is minimal; a significant strength of BERT is this unified architecture across different tasks. In addition, to train a model that understands sentence-level relationships, they also introduce a next-sentence prediction task for pre-training.
3. In their experiments, they mask 15% of all WordPiece tokens. Compared with denoising auto-encoders, BERT only predicts the masked words instead of reconstructing the whole input, which leads to slower convergence than a left-to-right model. There is also a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning; if [MASK] is used too heavily in pre-training, the model's performance suffers. BERT also consumes a lot of hardware resources.
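A rough sketch of the masking step described above, assuming a toy whitespace tokenizer rather than BERT's actual WordPiece machinery: 15% of the tokens are selected, and the model is trained to predict only those positions rather than reconstructing the whole input. (The paper additionally replaces some selected tokens with random or unchanged tokens instead of [MASK] to reduce the pre-training/fine-tuning mismatch; the sketch omits that detail.)

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the prediction target is the original token
            masked[i] = mask_token   # the input sees [MASK] instead
    return masked, labels

tokens = "the model predicts only the masked words".split()
masked, labels = mask_tokens(tokens)
print(masked)   # input with some positions replaced by [MASK]
print(labels)   # targets are non-None only at masked positions
```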
NASPipe: High-Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism
In this work, inspired by classic CPU instruction pipelining, the team presents NASPipe, the first high-performance and reproducible distributed parallel supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction. NASPipe partitions a supernet across GPUs and executes multiple generated sub-tasks (subnets) concurrently in a pipelined manner; in the meantime, it monitors the correlations between subnets and deterministically resolves all causal dependencies caused by the subnets' layer sharing.
Neural Architecture Search (NAS) has contributed significantly to building high-quality Deep Neural Networks (DNNs) for different applications and devices. Because of its low computational cost and high quality, the supernet paradigm, which composes the entire search space into a single supernet and trains it repeatedly, is the most prevalent and widely accepted. Traditional NAS paradigms, which train each explored DNN to convergence, cost too much; instead of training tens of thousands of standalone DNNs, the supernet paradigm trains one monolithic network while preserving the quality of the searched DNNs. Unfortunately, although industry and academia have created systems for easily defining NAS supernets and for training large standalone models, such as GPipe, DeepSpeed and PipeDream, none of these is designed to train extremely large supernets effectively.
There are two main problems. One is that it is difficult to deterministically resolve the dependencies between subnets activated in parallel: currently available large-scale DNN training systems are designed to parallelize the training of multiple batches within the same DNN model, not to capture and enforce causal dependencies. The other is to manage the extra-large supernet context among GPUs efficiently, so as to leave more cache space for larger-batch training and achieve higher GPU utilization.
To enforce high-quality and reproducible supernet training, NASPipe concurrently executes the subnets generated by supernet-based exploration algorithms, oversees the relations between subnets, and resolves all causal dependencies resulting from layer sharing. As DNNs grow larger, a single GPU can no longer hold an entire subnet, which makes pipeline parallelism one of the most efficient ways to train large models. To parallelize subnet execution, the team divides each subnet into parts (subsets of layers) instead of deploying each subnet on a single GPU; each GPU executes one part, turning the subnet executions into a pipeline. Thanks to pipeline parallelism, the team can efficiently resolve causal dependencies and perform the necessary synchronizations in supernet training locally on each GPU, in a decentralized manner.
The team introduces a pipeline scheduler that promotes the subnet tasks with larger chronological order into execution to improve pipeline efficiency. By leveraging the status of the current stage and the status passed from other stages, NASPipe predicts the upcoming subnets most likely to be scheduled in the next few steps, so that it can manage GPU memory efficiently and swap in the contexts of the subnets to be executed precisely.
The team has prototyped NASPipe on PyTorch, one of the most prevalent DNN execution frameworks, where it serves as the training system for a supernet-based NAS algorithm. NASPipe has three advantages over recent pipeline training systems: first, it is reproducible, producing the same training process and results regardless of the number of GPUs; second, it executes efficiently despite the dependencies that exist across subnets; third, it is scalable, providing roughly linearly increasing computation power as the number of GPUs grows.
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
2. They propose a novel interleaved pipeline schedule, which can increase throughput by more than 10% with memory consumption comparable to existing approaches.
Data-parallel scale-out usually works well, but it has two limitations: a) beyond a certain point, the per-GPU batch size becomes too small, reducing GPU utilization and increasing communication cost; and b) the maximum number of devices that can be used equals the batch size, which limits the number of accelerators usable for training.
Tensor (intra-layer) model parallelism, in which the matrix multiplications within each transformer layer are split across multiple GPUs, can be used to overcome these limitations, but it is not suitable for the largest models: such models must be split across multiple multi-GPU servers, which makes tensor parallelism across servers inefficient.
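An illustrative NumPy sketch of the tensor (intra-layer) parallelism idea above: the weight matrix of a single matrix multiplication is split column-wise across two "devices", each device computes its shard, and the shards are concatenated. Device placement is only simulated here, and the names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))        # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 16))       # full weight matrix of one linear layer

W_dev0, W_dev1 = np.hsplit(W, 2)       # each "GPU" holds half of the columns
Y_dev0 = X @ W_dev0                    # computed on device 0
Y_dev1 = X @ W_dev1                    # computed on device 1
Y = np.concatenate([Y_dev0, Y_dev1], axis=1)  # gather the partial outputs

assert np.allclose(Y, X @ W)           # identical result to the unsplit multiplication
```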
Pipeline model parallelism is another technique that supports large-scale model training, in which the model's layers are striped over multiple GPUs. To achieve high efficiency it usually requires a larger batch size. In this work, they also introduce a new pipeline schedule that improves efficiency at small batch sizes.
They consider the performance impact of combining pipeline and tensor model parallelism with data parallelism, including the interaction between data parallelism and the two forms of model parallelism. With data parallelism, each worker has a copy of the complete model, the input dataset is sharded, and workers periodically aggregate their gradients. With pipeline parallelism, the model's layers are partitioned across multiple devices; when the model consists of repeated transformer blocks, each device can be assigned an equal number of transformer layers. Periodic pipeline flushes are introduced to preserve strict optimizer semantics, so that optimizer steps stay synchronized across devices; at the beginning and end of each batch, devices are idle. With tensor model parallelism, each layer of the model is divided over multiple devices. A small worked example of how these degrees compose is given below.
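A small worked example, with illustrative numbers rather than the paper's configurations, of how the three parallelism degrees compose: with pipeline degree p, tensor degree t and data-parallel degree d, the total number of GPUs is n = p * t * d, and a model built from repeated transformer layers can assign the same number of layers to every pipeline stage.

```python
p, t, d = 8, 4, 2                      # pipeline, tensor and data parallel degrees (illustrative)
n_gpus = p * t * d
num_layers = 48                        # transformer layers in the model (illustrative)
layers_per_stage = num_layers // p     # equal split across pipeline stages

print(f"total GPUs: {n_gpus}")                           # 64
print(f"layers per pipeline stage: {layers_per_stage}")  # 6
```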
They implemented three model-specific optimizations on the computation graph to achieve high performance.
PipeDream: Generalized Pipeline Parallelism for DNN Training
Single Path One-Shot Neural Architecture Search with Uniform Sampling
This work proposes a single-path one-shot model to address the challenges in supernet training. The central idea is to build a simplified supernet in which all architectures are single paths, thus reducing the weight coupling problem. Training is performed by uniform path sampling, so all architectures (and their weights) are trained fully and equally.
Neural architecture search (NAS) aims to automate architecture engineering by solving the architecture design problem automatically. Recent methods use a weight sharing strategy to reduce the amount of computation: a supernet subsuming all architectures is trained only once, each architecture inherits its weights from the supernet, and only fine-tuning is performed, so the computational cost is greatly reduced. Most weight sharing methods use continuous relaxation to parameterize the search space, which raises two issues. First, the weights in the supernet are deeply coupled. Second, joint optimization introduces further coupling between the architecture parameters and the supernet weights; the greedy nature of gradient-based methods inevitably introduces bias into the optimization and can easily mislead the architecture search. Existing one-shot methods still have coupled weights in the supernet, their optimization is complex and involves sensitive hyper-parameters, and they have not shown competitive results on large datasets.
In order to alleviate these problems, they proposed a simple but effective single-path one-shot method.
This work revisits the one-shot paradigm and proposes a new approach to further simplify training and enhance architecture search. Based on the observation that the accuracy of an architecture using inherited weights should be predictive of its accuracy with optimized weights, they argue that supernet training should be stochastic, with all architectures optimizing their weights simultaneously. This leads to a uniform sampling strategy, sketched below.
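A condensed PyTorch-style sketch of the uniform path sampling described above; the choice blocks, hyper-parameters and data are all placeholders. At every training step a single path is drawn uniformly at random from the choice blocks, and only the weights on that path receive gradients.

```python
import random
import torch
import torch.nn as nn

class ChoiceBlock(nn.Module):
    """One supernet block holding several candidate operations."""
    def __init__(self, channels, candidates=3):
        super().__init__()
        self.ops = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, channels), nn.ReLU())
            for _ in range(candidates)
        )

    def forward(self, x, choice):
        return self.ops[choice](x)        # only the sampled candidate is executed

class Supernet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(ChoiceBlock(channels) for _ in range(depth))

    def forward(self, x, path):
        for block, choice in zip(self.blocks, path):
            x = block(x, choice)
        return x

net = Supernet()
opt = torch.optim.SGD(net.parameters(), lr=0.1)
for step in range(10):                                   # toy training loop
    path = [random.randrange(3) for _ in net.blocks]     # uniform single-path sampling
    x, y = torch.randn(8, 16), torch.randn(8, 16)
    loss = nn.functional.mse_loss(net(x, path), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```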
They discuss the disadvantages of existing NAS methods that use nested or joint optimization, and propose a single-path one-shot method with uniform sampling that overcomes the drawbacks of existing one-shot methods. Its simplicity supports a rich search space, including novel designs for channel size and bit width, all handled in a unified way. They propose a simple search space composed of single-path architectures to reduce weight coupling in the supernet. The training is hyperparameter-free and easy to converge.
Comprehensive experiments show that their method is flexible and effective: it is easy to train and fast to search, and it effortlessly supports complex search spaces (e.g., building blocks, channels, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency), so it can be conveniently used for various needs. It defines the supernet and performs weight inheritance in a manner similar to prior one-shot work, but the two steps are performed sequentially, combining the advantages of nested and joint optimization methods. Architecture search is efficient and flexible, and it achieves state-of-the-art performance on ImageNet, a large dataset.
Comprehensive experiments show that their method achieves better results than other methods in several different search spaces. They also analyze the search cost and the correlation behavior of their method; the approach is more efficient, especially when multiple searches are required.
Scaling Distributed Machine Learning with the Parameter Server
2. There are two key challenges in building a high-performance parameter server system: 1) communication: although parameters can be represented as key-value pairs as in traditional data stores, using this abstraction naively is inefficient; 2) fault tolerance: fault tolerance is critical at scale, and for efficient operation it must not require a complete restart of long-running computations.
The parameter server framework provides developers with two advantages: first, it keeps application-specific code simple by factoring out the common components of machine learning systems; second, as a shared platform for system-level optimization, it provides a robust, versatile and high-performance implementation that can handle algorithms ranging from sparse logistic regression to topic models and distributed sketching. The design is guided by workloads found in real systems, and its novelty lies in the synergy achieved by choosing the right system techniques, adapting them to the machine learning algorithms, and modifying the machine learning algorithms to be more system-friendly. The parameter server offers five key features, the first being efficient communication: the asynchronous communication model does not block computation (unless requested) and is optimized for machine learning tasks to reduce network traffic and overhead.
The framework is also flexible, scalable, fault tolerant, durable and easy to use, and is designed for long-term deployment.
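A toy, in-process sketch of the parameter-server pattern summarized above (this is not the paper's system): workers pull the current key-value parameters, compute updates on their data shard, and push them back; the server applies the updates. Real implementations make push/pull asynchronous and distributed; the "gradient" here is a placeholder.

```python
import numpy as np

class ParameterServer:
    def __init__(self, keys, dim, lr=0.1):
        self.params = {k: np.zeros(dim) for k in keys}   # parameters stored as key-value pairs
        self.lr = lr

    def pull(self, keys):
        return {k: self.params[k].copy() for k in keys}

    def push(self, grads):
        for k, g in grads.items():                       # apply gradient updates
            self.params[k] -= self.lr * g

def worker_step(server, keys, data):
    w = server.pull(keys)                                # fetch the latest parameters
    grads = {k: w[k] - data[k] for k in keys}            # placeholder "gradient" computation
    server.push(grads)                                   # send updates back to the server

server = ParameterServer(keys=["w1", "w2"], dim=4)
data = {"w1": np.ones(4), "w2": 2 * np.ones(4)}
for _ in range(50):
    worker_step(server, ["w1", "w2"], data)
print(server.params["w1"])   # converges toward the worker's target values
```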
They present experiments on several challenging tasks using real datasets with billions of variables, demonstrating the system's efficiency.
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
In recent years, much of the progress in deep learning has been directly related to significant increases in model scale: language models have been scaled to hundreds of billions of parameters and trained on larger datasets, enabling new capabilities. However, training these very large models on distributed clusters currently requires a large amount of engineering work specific to both the model definition and the cluster environment, such as tuning and selecting parallelization dimensions and choosing pipeline schemes (partition selection).
Automated parallelization of large-scale models would significantly accelerate ML research and production by enabling model developers to quickly explore new model designs without worrying about the underlying system challenges. Unfortunately, it requires navigating a complex plan space that grows exponentially with the parallelism dimensions and the size of the model and cluster.
This paper proposes Alpa, which automates model-parallel training of large-scale deep learning (DL) models by generating execution plans that unify data, operator and pipeline parallelism.
Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model-parallel configurations; neither is sufficient to scale complex deep learning models out onto distributed compute devices.
Their main observation is that the different parallelization techniques can be organized into a hierarchical space and mapped onto the hierarchical structure of the computing cluster. A parallel execution plan can then be expressed hierarchically by specifying a plan within each parallelism category, which brings many advantages.
They designed and implemented Alpa, the first compiler to automatically generate parallel execution plans covering data, operator and pipeline parallelism. Given a model description and a cluster configuration, Alpa works by dividing the cluster into multiple device meshes and treating parallelism at two levels: inter-operator parallelism and intra-operator parallelism. On this basis, Alpa builds a new hierarchical space of model-parallel execution plans, designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level, and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. An illustrative sketch of such a hierarchical plan follows.
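A purely illustrative data structure, not Alpa's actual API, showing the two-level idea described above: the model is first cut into pipeline stages (inter-operator parallelism), each stage is assigned a device mesh, and within every stage an intra-operator sharding choice is recorded per operator. All names are made up.

```python
hierarchical_plan = {
    "stages": [
        {   # inter-operator level: which layers form this stage and on which mesh it runs
            "layers": ["embed", "block_0", "block_1"],
            "device_mesh": {"hosts": 1, "gpus_per_host": 4},
            # intra-operator level: how each operator's tensors are sharded on the mesh
            "sharding": {"block_0.matmul": "column", "block_1.matmul": "row"},
        },
        {
            "layers": ["block_2", "block_3", "lm_head"],
            "device_mesh": {"hosts": 1, "gpus_per_host": 4},
            "sharding": {"block_2.matmul": "column", "block_3.matmul": "row"},
        },
    ],
}

# A compiler-like search would score many such candidate plans and keep the best;
# this only shows the shape of one candidate.
print(len(hierarchical_plan["stages"]), "pipeline stages")
```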
Their evaluation shows that the parallelization plans generated by Alpa match or outperform hand-tuned model-parallel training systems, even on the models those systems were designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and to models without manually designed plans. They evaluated Alpa on training large models with billions of parameters and achieved efficient execution.
vPIPE: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training
The team introduces vPIPE, the first dynamic layer live-partitioning and memory management system for pipeline parallelism, which serves as a transparent acceleration layer between a typical pipeline-parallel system and its underlying execution engine. To meet both design goals transparently, G1 (carefully managing every stage's training memory so that no GPU exceeds its physical memory capacity) and G2 (enforcing a “balanced” partition), it automatically finds a globally near-optimal plan that migrates layers among stages and relocates each layer's activations and parameters to its current stage's GPU or CPU memory. In this way, vPIPE significantly relieves memory pressure on tense stages and improves pipeline throughput in a balanced manner.
vPIPE makes two key contributions. The first is an online search algorithm for layer partitioning and memory management plans, which is fast, near-optimal, runs distributed across stages, and looks for globally effective swap, recompute and partition strategies; it improves the efficiency and scalability of vPIPE. The second is a transparent live-migration protocol for rebalancing layer distribution across training pipelines, which neither stalls the system above it nor changes the staleness of its parameters.
In recent years, to obtain higher modeling capacity, the scale of large deep neural networks has increased explosively, with more layers and more parameters per layer. Pipeline parallelism is an effective method for training large DNNs, and an efficient pipeline system should achieve the two key goals G1 and G2. Although previous work has put a lot of effort into building pipeline-parallel systems, it remains difficult to achieve these complex and dynamic design goals at the same time.
Existing pipeline-parallel systems fall into two types. The first type stores the activation tensors generated during the forward passes directly in GPU memory; it has to keep the batch size moderate, even though a larger training batch size would lead to higher GPU ALU utilization and higher throughput. The second type discards all activation tensors after the forward passes and recomputes them in the backward passes; this significantly alleviates the imbalance in GPU memory utilization between earlier and later stages, but at the cost of an extra forward pass. In addition, when NAS is enabled in the DNN model, both types of pipeline-parallel systems suffer more severe throughput degradation, because the number and layout of the model's layers can be modified by the runtime search algorithm and evaluated by running the NAS-enabled transformer.
The team believes the root cause is that the memory management and layer partitioning strategies of these systems are static. When a stage becomes tense because GPU memory explodes or newly activated layers arrive, static strategies cannot use the idle GPU resources available in adjacent stages to relieve the pressure. vPIPE instead computes a mixed plan of swapping and recomputation for all layers on each stage rather than an all-recompute strategy, and instead of using a static partition, it generates a new partition plan and transparently live-migrates layers from the tense stage to an adjacent stage. This not only reduces the memory burden of the tense stage (G1) but also yields a more balanced partition (G2) with higher throughput. A simplified sketch of the per-layer swap-versus-recompute choice follows.
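A simplified sketch of the per-layer swap-versus-recompute idea described above (not vPIPE's actual algorithm): for each layer on a stage, the estimated cost of swapping its activations to CPU memory is compared against recomputing them in the backward pass, and the cheaper choices are kept until the stage fits in GPU memory. The costs and sizes are invented for illustration.

```python
def plan_swap_or_recompute(layers, memory_budget):
    """layers: list of dicts with activation size, swap cost and recompute cost."""
    plan, used = {}, sum(l["act_size"] for l in layers)
    # Consider layers whose eviction frees the most memory per unit of overhead first.
    for layer in sorted(layers, key=lambda l: -l["act_size"] / min(l["swap_cost"], l["recompute_cost"])):
        if used <= memory_budget:
            break
        plan[layer["name"]] = "swap" if layer["swap_cost"] < layer["recompute_cost"] else "recompute"
        used -= layer["act_size"]        # evicting this layer's activations frees GPU memory
    return plan

layers = [
    {"name": "conv1", "act_size": 4, "swap_cost": 2.0, "recompute_cost": 1.0},
    {"name": "conv2", "act_size": 8, "swap_cost": 1.5, "recompute_cost": 3.0},
    {"name": "fc",    "act_size": 2, "swap_cost": 0.5, "recompute_cost": 0.2},
]
print(plan_swap_or_recompute(layers, memory_budget=6))
```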
Achieving this goal raises two challenges. The first is to find globally effective swap, recompute and repartition (SRP) strategies across all stages; using a powerful decomposition method, they created a fast-converging, near-optimal search algorithm to cope with it. The second is how to migrate layers live (without GPU pauses or pipeline flushes) while keeping vPIPE transparent to the general upper-level pipeline-parallel system; they propose a new live-migration protocol. The key observation is that the time window between an activation's generation (in the forward pass) and its final use (in the corresponding backward pass) allows vPIPE to perform subtle interleaving that migrates layers transparently without changing the parameter staleness of the upper system.
vPIPE has two limitations. First, it assumes that a single layer fits within the memory limit of a single GPU for any DNN workload trained with vPIPE. Second, vPIPE's layer-migration protocol remains live only when the time cost of transferring a layer's tensors can overlap with the computation time of DNN training.