The three classic reinforcement learning methods introduced in the previous two posts, Reinforcement Learning Basics: Basic Concepts and Dynamic Programming and Reinforcement Learning Basics: Monte Carlo and Temporal-Difference Learning (dynamic programming, Monte Carlo, and temporal-difference methods), apply to a finite state set $\mathcal{S}$. Taking the Q-Learning algorithm from the temporal-difference family as an example, the action-value function is usually stored in a matrix (the Q table) with n rows (n = number of states) and m columns (m = number of actions), as shown in the figure below:
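To make the tabular case concrete before moving on, here is a minimal sketch (not from the original series; the sizes and hyperparameters are illustrative) of a Q table stored as an n×m NumPy array together with one Q-Learning update:

```python
# Minimal tabular Q-Learning sketch: one row per state, one column per action.
# The environment size and hyperparameters here are purely illustrative.
import numpy as np

n_states, n_actions = 6, 4                 # n rows (states), m columns (actions)
Q = np.zeros((n_states, n_actions))        # the Q table

alpha, gamma = 0.1, 0.99                   # learning rate and discount factor

def q_learning_update(s, a, r, s_next, done):
    """One Q-Learning step: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# e.g. after observing (s=0, a=2, r=1.0, s'=3, not terminal):
q_learning_update(0, 2, 1.0, 3, False)
print(Q[0, 2])   # 0.1
```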
For a continuous state set $\mathcal{S}$ these methods no longer apply. In that case a neural network can be introduced to estimate the Q values, which gives Deep Q-Learning, as shown in the figure below:
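The core idea is to replace the table lookup with a function approximator that maps a state to one Q value per action. Below is a minimal TF1-style sketch of such a Q-network (this is not the network used later in this post, which is a dueling architecture; the state dimension and layer size are made up for illustration):

```python
# Minimal Q-network sketch: state in, one Q value per action out.
import tensorflow as tf

state_dim, n_actions = 8, 4   # illustrative sizes

inputs_ = tf.placeholder(tf.float32, [None, state_dim], name="inputs")
hidden = tf.layers.dense(inputs_, 64, activation=tf.nn.relu, name="hidden")
q_values = tf.layers.dense(hidden, n_actions, activation=None, name="q_values")

# Acting greedily: pick the action with the largest estimated Q value
best_action = tf.argmax(q_values, axis=1)
```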
The rest of this section introduces several techniques commonly used in Deep Q-Learning to improve learning performance:
- Stacked states: with a continuous state set, a single state often does not describe the overall situation well. For example, as shown in the figure below, a single image is not enough to tell which way the black square is moving; only from several consecutive frames can we tell that it is moving to the right.
- Experience Replay: as shown in the figure below, to keep the algorithm from forgetting the experience gained in earlier scenarios during training, a Replay Buffer is created and past experiences are continually replayed to train the algorithm. In addition, adjacent experiences (e.g. $[S_{t},A_{t},R_{t+1},S_{t+1}]$ and $[S_{t+1},A_{t+1},R_{t+2},S_{t+2}]$) are correlated; to keep the algorithm from being locked into particular parts of the state space, experiences are sampled at random from the Replay Buffer (shuffling their order reduces the correlation between adjacent experiences).
- Fixed Q-targets: the weight update of the neural network that estimates Q in Deep Q-Learning follows the formula shown in the figure on the left, in which the TD target is used as an approximation of the true value of $q_{\pi}(S,A)$. But as the weights keep being updated, the TD target keeps changing too, so during training the estimate of $q_{\pi}(S,A)$ has to chase a moving value, which makes training harder and less efficient. The solution, shown in the figure on the right, is to estimate the TD target with a relatively fixed set of parameters.
- Double DQNs: addresses the problem that the TD target may overestimate the true value of $q_{\pi}(S,A)$. The solution is to use two different neural networks when computing the TD target, separating the selection of the action $a$ from the evaluation of the TD target. When combined with Fixed Q-targets, the two sets of parameters $w$ and $w^{-}$ can be used directly (as shown in the figure below).
- Dueling DQN: as shown in the figure below, instead of computing the action-value function $Q(s,a)$ directly, $Q(s,a)$ is decomposed into the sum of a state-value function $V(s)$ and an advantage function $A(s,a)$. The motivation is that in many states the choice of action has little effect on $Q(s,a)$, so it pays to estimate the state value $V(s)$ directly and then add the effect of each action on top of it. In practice, to make $V(s)$ and $A(s,a)$ uniquely recoverable from $Q(s,a)$, the mean of $A(s,a)$ (i.e. $\frac{1}{|\mathcal{A}|}\sum_{a^{\prime}} A(s,a^{\prime})$) is constrained to be 0.
- Prioritized Experience Replay: as shown in the figure below, each experience in the Replay Buffer is given a priority; the larger an experience's TD error, the higher its priority. Here $e$ is a positive constant that keeps the sampling probability from being 0, and $a$ controls the balance between prioritized and uniform sampling: $a=1$ samples purely by priority, while $a=0$ degenerates to uniform sampling. Because sampling is now priority-based, the weight-update rule of the neural network also has to be adjusted: the update step size is reduced for high-priority experiences so that the weight updates stay consistent with the true frequency at which experiences occur (especially late in training, controlled by the parameter $b$), which avoids overfitting to high-priority experiences. A small sketch of these formulas is given right after this list.
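The sketch below (not from the original post; the TD errors and hyperparameter values are made up) spells out the Prioritized Experience Replay formulas from the last item: priority $p_i = |\delta_i| + e$, sampling probability $P(i) = p_i^{a} / \sum_k p_k^{a}$, and importance-sampling weight $w_i = (N \cdot P(i))^{-b}$ normalized by the largest weight:

```python
# Illustrative Prioritized Experience Replay arithmetic (values are made up).
import numpy as np

td_errors = np.array([0.5, 0.1, 2.0, 0.0])          # TD errors of stored experiences
e, a, b = 0.01, 0.6, 0.4                             # the e, a, b parameters from the text

priorities = np.abs(td_errors) + e                   # e > 0 keeps every probability non-zero
probs = priorities ** a / np.sum(priorities ** a)    # a=1: fully prioritized, a=0: uniform

N = len(td_errors)
is_weights = (N * probs) ** (-b)                     # larger P(i) -> smaller update step
is_weights /= is_weights.max()                       # normalize so the largest weight is 1

print(probs)        # the experience with the largest TD error is sampled most often
print(is_weights)   # ...but its gradient update is scaled down the most
```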
Code Implementation
The implementation uses the vizdoom reinforcement learning environment and trains on one of its scenarios, in which the player must reach the target location alive (which requires dodging enemy fire or killing the enemies along the way; otherwise success is impossible). The per-step reward depends on the change in distance between the player and the target (+dX for getting closer, -dX for getting further), and there is an additional -100 penalty if the player dies.
Note that the code stores the priorities of the Replay Buffer in a SumTree, a binary tree in which each parent node holds the sum of its two children, structured as shown in the figure below. This makes priority-proportional sampling convenient: each sampling step and each priority update costs $O(\log n)$. During sampling, the leaf $x$ that is drawn satisfies $P(x\leq{k})=\frac{\sum_{i=1}^{k}\text{Priority}_i}{\sum_{i=1}^{n}\text{Priority}_i},\ k\in\{1,2,\cdots,n\}$.
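As a quick illustration of that sampling rule, here is a small self-contained sketch (separate from the implementation below; the priorities are made up) of proportional sampling on a sum tree with four leaves, using the same array layout as the SumTree class that follows:

```python
# Toy sum tree with 4 leaves. Leaf priorities [3, 1, 4, 2] sum to 10 at the root,
# so leaf i should be drawn with probability priority_i / 10.
import numpy as np

priorities = np.array([3.0, 1.0, 4.0, 2.0])
tree = np.zeros(2 * len(priorities) - 1)       # internal nodes: 0..2, leaves: 3..6
tree[3:] = priorities
tree[1] = tree[3] + tree[4]                    # parent of leaves 0 and 1
tree[2] = tree[5] + tree[6]                    # parent of leaves 2 and 3
tree[0] = tree[1] + tree[2]                    # root = total priority = 10

def get_leaf(v):
    """Descend from the root: go left if v fits in the left subtree, otherwise subtract it and go right."""
    idx = 0
    while 2 * idx + 1 < len(tree):             # while idx still has children
        left = 2 * idx + 1
        if v <= tree[left]:
            idx = left
        else:
            v -= tree[left]
            idx = left + 1
    return idx - (len(priorities) - 1)         # convert tree index to leaf index

draws = [get_leaf(np.random.uniform(0, tree[0])) for _ in range(100000)]
print(np.bincount(draws) / len(draws))         # roughly [0.3, 0.1, 0.4, 0.2]
```

Sampling a uniform value in $[0,\text{total priority}]$ and walking down the tree therefore draws each leaf with probability proportional to its priority. The full training code follows.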
```python
import tensorflow as tf           # Deep Learning library
import numpy as np                # Handle matrices
from vizdoom import *             # Doom Environment
import random                     # Handling random number generation
import time                       # Handling time calculation
from skimage import transform     # Help us to preprocess the frames
from collections import deque     # Ordered collection with ends
import matplotlib.pyplot as plt   # Display graphs

import warnings
# Ignore the warning messages that are normally printed during training because of skimage
warnings.filterwarnings('ignore')

### Here we create our environment
def create_environment():
    game = DoomGame()

    # Load the correct configuration
    game.load_config("deadly_corridor.cfg")

    # Load the correct scenario (in our case the deadly_corridor scenario)
    game.set_doom_scenario_path("deadly_corridor.wad")

    game.init()

    # Create a one-hot encoded version of our actions (7 possible actions)
    possible_actions = np.identity(7, dtype=int).tolist()

    return game, possible_actions

game, possible_actions = create_environment()

### Preprocess (reduce the complexity of states and the training time)
def preprocess_frame(frame):
    # Grayscale frame (color does not add important information, already done by the config file)
    # Crop the screen (remove the part that contains no information)
    cropped_frame = frame[15:-5, 20:-20]  # [Up:Down, Left:Right]

    # Normalize pixel values
    normalized_frame = cropped_frame / 255.0

    # Resize
    preprocessed_frame = transform.resize(normalized_frame, [100, 120])

    return preprocessed_frame  # 100x120x1 frame

### Stack frames
stack_size = 4  # stack 4 frames

# Initialize deque with zero-images, one array for each image
stacked_frames = deque([np.zeros((100, 120), dtype=np.int) for i in range(stack_size)], maxlen=4)

def stack_frames(stacked_frames, state, is_new_episode):
    frame = preprocess_frame(state)  # preprocess frame

    if is_new_episode:
        # Clear our stacked_frames
        stacked_frames = deque([np.zeros((100, 120), dtype=np.int) for i in range(stack_size)], maxlen=4)

        # Because we're in a new episode, copy the same frame 4x
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)
        stacked_frames.append(frame)

        # Stack the frames
        stacked_state = np.stack(stacked_frames, axis=2)
    else:
        # Append frame to deque, automatically removes the oldest frame
        stacked_frames.append(frame)

        # Build the stacked state
        stacked_state = np.stack(stacked_frames, axis=2)

    return stacked_state, stacked_frames

### Set the hyperparameters
# MODEL HYPERPARAMETERS
state_size = [100, 120, 4]    # Our input is a stack of 4 frames hence 100x120x4 (width, height, channels)
action_size = game.get_available_buttons_size()  # 7 possible actions
learning_rate = 0.00025       # Alpha (i.e., learning rate)

# TRAINING HYPERPARAMETERS
total_episodes = 5000         # Total episodes for training
max_steps = 5000              # Max possible steps in an episode
batch_size = 64

# FIXED Q TARGETS HYPERPARAMETERS
max_tau = 10000               # The number of steps after which we update our target network

# EXPLORATION HYPERPARAMETERS for epsilon greedy strategy
explore_start = 1.0           # exploration probability at start
explore_stop = 0.01           # minimum exploration probability
decay_rate = 0.00005          # exponential decay rate for exploration prob

# Q LEARNING hyperparameters
gamma = 0.95                  # Discounting rate

# MEMORY HYPERPARAMETERS (if you have a GPU change to 1 million)
pretrain_length = 100000      # Number of experiences stored in the Memory when initialized for the first time
memory_size = 100000          # Number of experiences the Memory can keep

# MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT
training = True

### Set up the Deep Q network and the Target network (both are Dueling Networks)
class DDDQNNet:
    def __init__(self, state_size, action_size, learning_rate, name):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.name = name

        # use tf.variable_scope to know which network we're using (DQN or target_net)
        # it will be useful when we update our w- parameters (by copying the DQN parameters)
        with tf.variable_scope(self.name):
            # create the placeholders
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")  # [None,100,120,4]
            self.ISWeights_ = tf.placeholder(tf.float32, [None, 1], name='IS_weights')
            self.actions_ = tf.placeholder(tf.float32, [None, action_size], name="actions_")

            # Remember that target_Q is the R(s,a) + max Qhat(s', a')
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")

            # first conv layer
            self.conv1 = tf.layers.conv2d(inputs=self.inputs_, filters=32, kernel_size=[8, 8],
                                          strides=[4, 4], padding="VALID", name="conv1",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d())
            self.conv1_out = tf.nn.elu(self.conv1, name="conv1_out")

            # second conv layer
            self.conv2 = tf.layers.conv2d(inputs=self.conv1_out, filters=64, kernel_size=[4, 4],
                                          strides=[2, 2], padding="VALID", name="conv2",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d())
            self.conv2_out = tf.nn.elu(self.conv2, name="conv2_out")

            # third conv layer
            self.conv3 = tf.layers.conv2d(inputs=self.conv2_out, filters=128, kernel_size=[4, 4],
                                          strides=[2, 2], padding="VALID", name="conv3",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d())
            self.conv3_out = tf.nn.elu(self.conv3, name="conv3_out")

            self.flatten = tf.layers.flatten(self.conv3_out)

            # Here we separate into two streams (Dueling Network)
            # The one that calculates V(s)
            self.value_fc = tf.layers.dense(inputs=self.flatten, units=512, activation=tf.nn.elu,
                                            kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                            name="value_fc")
            self.value = tf.layers.dense(inputs=self.value_fc, units=1, activation=None,
                                         kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                         name="value")

            # The one that calculates A(s,a)
            self.advantage_fc = tf.layers.dense(inputs=self.flatten, units=512, activation=tf.nn.elu,
                                                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                                name="advantage_fc")
            self.advantage = tf.layers.dense(inputs=self.advantage_fc, units=self.action_size, activation=None,
                                             kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                             name="advantages")

            # Aggregating layer
            # Q(s,a) = V(s) + (A(s,a) - 1/|A| * sum A(s,a'))
            self.output = self.value + tf.subtract(self.advantage,
                                                   tf.reduce_mean(self.advantage, axis=1, keepdims=True))

            # Predicted Q value
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)

            # For computing the priority and updating the SumTree
            self.absolute_errors = tf.abs(self.target_Q - self.Q)

            # The loss is modified because of Prioritized Experience Replay
            self.loss = tf.reduce_mean(self.ISWeights_ * tf.squared_difference(self.target_Q, self.Q))

            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)

# Reset the graph
tf.reset_default_graph()

# Instantiate the DQNetwork
DQNetwork = DDDQNNet(state_size, action_size, learning_rate, name="DQNetwork")

# Instantiate the target network
TargetNetwork = DDDQNNet(state_size, action_size, learning_rate, name="TargetNetwork")

### Data structure to store experience and priority (SumTree)
class SumTree(object):
    data_pointer = 0

    # Here we initialize the tree with all nodes = 0, and initialize the data with all values = 0
    def __init__(self, capacity):
        self.capacity = capacity  # Number of leaf nodes (final nodes) that contain experiences

        # Generate the tree with all node values = 0
        # Parent nodes = capacity - 1, Leaf nodes = capacity
        self.tree = np.zeros(2 * capacity - 1)

        # Contains the experiences (so the size of data is capacity)
        self.data = np.zeros(capacity, dtype=object)

    # Here we add our priority score in the sumtree leaf and add the experience in data
    def add(self, priority, data):
        # Look at what index we want to put the experience
        tree_index = self.data_pointer + self.capacity - 1  # the leaves from left to right

        self.data[self.data_pointer] = data   # Update data frame
        self.update(tree_index, priority)     # Update the leaf
        self.data_pointer += 1                # Add 1 to data_pointer

        # If we're above the capacity, we go back to the first index (we overwrite)
        if self.data_pointer >= self.capacity:
            self.data_pointer = 0

    # Update the leaf priority score and propagate the change through the tree
    def update(self, tree_index, priority):
        # Change = new priority score - former priority score
        change = priority - self.tree[tree_index]
        self.tree[tree_index] = priority

        # Propagate the change through the tree
        while tree_index != 0:
            tree_index = (tree_index - 1) // 2
            self.tree[tree_index] += change

    # Here we get the leaf and the associated experience
    # the returned index is the smallest index satisfying: sum(leaf priority) >= v for leaf index <= returned index
    def get_leaf(self, v):
        parent_index = 0

        while True:
            left_child_index = 2 * parent_index + 1
            right_child_index = left_child_index + 1

            # If we reach the bottom, end the search
            if left_child_index >= len(self.tree):
                leaf_index = parent_index
                break
            else:  # downward search
                if v <= self.tree[left_child_index]:
                    parent_index = left_child_index
                else:
                    v -= self.tree[left_child_index]
                    parent_index = right_child_index

        data_index = leaf_index - self.capacity + 1

        return leaf_index, self.tree[leaf_index], self.data[data_index]

    @property
    def total_priority(self):
        return self.tree[0]  # Returns the root node

### Create the Replay Buffer with Prioritized Experience Replay
class Memory(object):
    PER_e = 0.01  # Hyperparameter that we use to avoid some experiences having 0 probability of being taken
    PER_a = 0.6   # Hyperparameter that trades off taking only high-priority experiences vs. sampling randomly
    PER_b = 0.4   # importance-sampling, from initial value increasing to 1
    PER_b_increment_per_sampling = 0.001

    absolute_error_upper = 1.  # clipped abs error

    def __init__(self, capacity):
        # Making the tree
        self.tree = SumTree(capacity)

    # Store a new experience in our tree
    # Each new experience gets a score of max_priority (it will then be improved when we use this exp to train our DDQN)
    def store(self, experience):
        # Find the max priority
        max_priority = np.max(self.tree.tree[-self.tree.capacity:])

        # If the max priority = 0 we can't put priority = 0 since this exp will never have a chance to be selected
        # So we use an upper limit
        if max_priority == 0:
            max_priority = self.absolute_error_upper

        self.tree.add(max_priority, experience)  # set the max p for the new exp

    # First, to sample a minibatch of size n, the range [0, priority_total] is split into n ranges.
    # Then a value is uniformly sampled from each range.
    # We search the sumtree: the experiences whose priority scores correspond to the sampled values are retrieved.
    # Finally, we calculate the IS weights for each minibatch element.
    def sample(self, n):
        memory_b = []  # Create a sample array that will contain the minibatch

        b_idx, b_ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, 1), dtype=np.float32)

        priority_segment = self.tree.total_priority / n  # priority segment

        # Here we increase PER_b each time we sample a new minibatch
        self.PER_b = np.min([1., self.PER_b + self.PER_b_increment_per_sampling])  # max = 1

        # Calculating the max_weight
        p_min = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.total_priority
        max_weight = (p_min * n) ** (-self.PER_b)

        for i in range(n):
            # A value is uniformly sampled from each range
            a, b = priority_segment * i, priority_segment * (i + 1)
            value = np.random.uniform(a, b)

            # The experience that corresponds to each value is retrieved
            index, priority, data = self.tree.get_leaf(value)

            sampling_probabilities = priority / self.tree.total_priority  # P(j)

            # IS = (1/N * 1/P(i))**b / max wi == (N*P(i))**-b / max wi
            b_ISWeights[i, 0] = np.power(n * sampling_probabilities, -self.PER_b) / max_weight
            b_idx[i] = index

            experience = [data]
            memory_b.append(experience)

        return b_idx, memory_b, b_ISWeights

    # Update the priorities on the tree
    def batch_update(self, tree_idx, abs_errors):
        abs_errors += self.PER_e  # add a small constant to avoid 0
        clipped_errors = np.minimum(abs_errors, self.absolute_error_upper)
        ps = np.power(clipped_errors, self.PER_a)

        for ti, p in zip(tree_idx, ps):
            self.tree.update(ti, p)

### Deal with the empty memory problem (pre-populate memory by taking random actions and storing the experience)
memory = Memory(memory_size)  # Instantiate memory

game.new_episode()  # Render the environment

for i in range(pretrain_length):
    # If it's the first step
    if i == 0:
        # First we need a state
        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)

    action = random.choice(possible_actions)  # Random action
    reward = game.make_action(action)         # Get the reward
    done = game.is_episode_finished()         # Look if the episode is finished

    # If the player is dead
    if done:
        # the episode ends so there is no next state
        next_state = np.zeros((120, 140), dtype=np.int)
        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

        # Add experience to memory
        experience = state, action, reward, next_state, done
        memory.store(experience)

        # Start a new episode
        game.new_episode()

        # First we need a state
        state = game.get_state().screen_buffer

        # Stack the frames
        state, stacked_frames = stack_frames(stacked_frames, state, True)
    else:
        # Get the next state
        next_state = game.get_state().screen_buffer
        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

        # Add experience to memory
        experience = state, action, reward, next_state, done
        memory.store(experience)

        # Our state is now the next_state
        state = next_state

### Choose action from Q (use the ϵ-greedy strategy)
def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, actions):
    exp_exp_tradeoff = np.random.rand()  # First we randomize a number

    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)

    if (explore_probability > exp_exp_tradeoff):
        # Make a random action (exploration)
        action = random.choice(actions)
    else:
        # Get action from the Q-network (exploitation)
        # Estimate the Q values for the state
        Qs = sess.run(DQNetwork.output,
                      feed_dict={DQNetwork.inputs_: state.reshape((1, *state.shape))})

        # Take the biggest Q value (= the best action)
        choice = np.argmax(Qs)
        action = actions[int(choice)]

    return action, explore_probability

### Copy the parameters of the DQN to the Target_network (used for Fixed Q-targets and Double DQN)
def update_target_graph():
    # Get the parameters of our DQNetwork
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "DQNetwork")

    # Get the parameters of our Target_network
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "TargetNetwork")

    op_holder = []

    # Update our target_network parameters with the DQNetwork parameters
    for from_var, to_var in zip(from_vars, to_vars):
        op_holder.append(to_var.assign(from_var))
    return op_holder

### Train the agent
saver = tf.train.Saver()  # Saver will help us save our model

if training == True:
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # Initialize the variables

        decay_step = 0  # Initialize the decay step
        tau = 0         # Set tau = 0

        game.init()  # Init the game

        # Update the parameters of our TargetNetwork with the DQN weights
        update_target = update_target_graph()
        sess.run(update_target)

        for episode in range(total_episodes):
            step = 0              # Set step to 0
            episode_rewards = []  # Initialize the rewards of the episode

            game.new_episode()  # Make a new episode and observe the first state
            state = game.get_state().screen_buffer
            state, stacked_frames = stack_frames(stacked_frames, state, True)

            while step < max_steps:
                step += 1
                tau += 1
                decay_step += 1

                # ϵ-greedy strategy
                action, explore_probability = predict_action(explore_start, explore_stop, decay_rate,
                                                             decay_step, state, possible_actions)

                reward = game.make_action(action)   # Do the action
                done = game.is_episode_finished()   # Look if the episode is finished
                episode_rewards.append(reward)      # Add the reward to the total reward

                # If the game is finished
                if done:
                    # the episode ends so there is no next state
                    next_state = np.zeros((120, 140), dtype=np.int)
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                    # Set step = max_steps to end the episode
                    step = max_steps

                    # Get the total reward of the episode
                    total_reward = np.sum(episode_rewards)

                    print('Episode: {}'.format(episode),
                          'Total reward: {}'.format(total_reward),
                          'Training loss: {:.4f}'.format(loss),
                          'Explore P: {:.4f}'.format(explore_probability))

                    # Add experience to memory
                    experience = state, action, reward, next_state, done
                    memory.store(experience)
                else:
                    next_state = game.get_state().screen_buffer  # Get the next state

                    # Stack the frame of the next_state
                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)

                    # Add experience to memory
                    experience = state, action, reward, next_state, done
                    memory.store(experience)

                    state = next_state

                # LEARNING PART
                tree_idx, batch, ISWeights_mb = memory.sample(batch_size)  # Obtain a prioritized mini-batch from memory

                states_mb = np.array([each[0][0] for each in batch], ndmin=3)
                actions_mb = np.array([each[0][1] for each in batch])
                rewards_mb = np.array([each[0][2] for each in batch])
                next_states_mb = np.array([each[0][3] for each in batch], ndmin=3)
                dones_mb = np.array([each[0][4] for each in batch])

                target_Qs_batch = []

                # DOUBLE DQN
                # Use DQNetwork to select the action a' to take at next_state s' (the action with the highest Q-value)
                # Use TargetNetwork to calculate the Q value of Q(s',a')

                # Get Q values for next_state
                q_next_state = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: next_states_mb})

                # Calculate Qtarget for all actions at that state
                q_target_next_state = sess.run(TargetNetwork.output,
                                               feed_dict={TargetNetwork.inputs_: next_states_mb})

                # Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma * Qtarget(s',a')
                for i in range(0, len(batch)):
                    terminal = dones_mb[i]

                    # We got a'
                    action = np.argmax(q_next_state[i])

                    # If we are in a terminal state, the target only equals the reward
                    if terminal:
                        target_Qs_batch.append(rewards_mb[i])
                    else:
                        # Take the Qtarget for action a'
                        target = rewards_mb[i] + gamma * q_target_next_state[i][action]
                        target_Qs_batch.append(target)

                targets_mb = np.array([each for each in target_Qs_batch])

                # Optimize
                _, loss, absolute_errors = sess.run([DQNetwork.optimizer, DQNetwork.loss, DQNetwork.absolute_errors],
                                                    feed_dict={DQNetwork.inputs_: states_mb,
                                                               DQNetwork.target_Q: targets_mb,
                                                               DQNetwork.actions_: actions_mb,
                                                               DQNetwork.ISWeights_: ISWeights_mb})

                # Update the priorities
                memory.batch_update(tree_idx, absolute_errors)

                # Fixed Q-targets
                if tau > max_tau:
                    # Update the parameters of our TargetNetwork with the DQN weights
                    update_target = update_target_graph()
                    sess.run(update_target)
                    tau = 0
                    print("Model updated")

            # Save the model every 5 episodes
            if episode % 5 == 0:
                save_path = saver.save(sess, "./models/model.ckpt")
                print("Model Saved")

### Watch the agent play
with tf.Session() as sess:
    game = DoomGame()

    # Load the correct configuration (TESTING)
    game.load_config("deadly_corridor_testing.cfg")

    # Load the correct scenario (in our case the deadly_corridor scenario)
    game.set_doom_scenario_path("deadly_corridor.wad")

    game.init()

    # Load the model
    saver.restore(sess, "./models/model.ckpt")

    for i in range(10):
        game.new_episode()

        state = game.get_state().screen_buffer
        state, stacked_frames = stack_frames(stacked_frames, state, True)

        while not game.is_episode_finished():
            # EPSILON GREEDY STRATEGY
            exp_exp_tradeoff = np.random.rand()
            explore_probability = 0.01

            if (explore_probability > exp_exp_tradeoff):
                # Make a random action (exploration)
                action = random.choice(possible_actions)
            else:
                # Get action from the Q-network (exploitation)
                Qs = sess.run(DQNetwork.output,
                              feed_dict={DQNetwork.inputs_: state.reshape((1, *state.shape))})

                choice = np.argmax(Qs)  # Take the biggest Q value (= the best action)
                action = possible_actions[int(choice)]

            game.make_action(action)
            done = game.is_episode_finished()

            if done:
                break
            else:
                next_state = game.get_state().screen_buffer
                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)
                state = next_state

        score = game.get_total_reward()
        print("Score: ", score)

    game.close()
```