# A prediction case study based on metro ridership; this is a project from a previous course.
#!/usr/bin/env python
# coding: utf-8
# # 地铁人流量预测
#
#
#
# Can we predict how many people will use the metro in New York City (NYC) at any given time? In particular, in this article we will explore whether the current weather conditions help predict the number of people entering and exiting subway stations.
#
# Answering this question could help the Metro Transit Authority (MTA) in NYC save money and make the metro safer. More accurate predictions of how many people will visit certain stations could allow the MTA to better assign staff and predict rush periods that otherwise would be unexpected.
#
# This article will both answer this question and introduce readers to the basics of data science. We will:
#
# - explore a large and complex set of data, using numeric and visual techniques,
# - use appropriate statistical methods to develop meaningful answers to questions,
# - use basic machine learning methods to develop and test predictive models of data, and
# - understand what MapReduce is and how it would help in processing larger amounts of data.
#
# ### Prerequisites
#
# A full understanding of this article requires a background in mathematics and statistics. If you have followed the "Intro to Data Science" course on Udacity you have this background.
#
# You can still follow this article without such a background, but don't fret if many of the details don't make sense!
# ## 0. Imports that allow the Python code to work
# In[1]:
from __future__ import division
import datetime
import itertools
import operator
import brewer2mpl
import ggplot as gg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab
import scipy.stats
import statsmodels.api as sm
get_ipython().run_line_magic('matplotlib', 'inline')
# ## 1. Obtaining and cleaning data
# The data is in a regular CSV file, the kind you can open in Microsoft Excel, with a large number of columns:
# In[2]:
with open("turnstile_data_master_with_weather.csv") as f_in:
for i in xrange(3):
print f_in.readline().strip()
# However, we're going to use the pandas library to make it much easier to analyse and chart the data contained in this CSV file.
# In[3]:
turnstile_data = pd.read_csv("turnstile_data_master_with_weather.csv")
turnstile_data.head()
# For convenience we add a new column that combines the date and time columns. This will make our plotting just below a little easier:
# In[4]:
turnstile_data["DATETIMEn"] = pd.to_datetime(turnstile_data["DATEn"] + " " + turnstile_data["TIMEn"], format="%Y-%m-%d %H:%M:%S")
# ## 2. Exploring data
#
# The first question we always try to answer is "what does it look like?". Let's plot the number of people who enter and exit all metro stations throughout NYC over time:
# In[5]:
turnstile_dt = (turnstile_data[["DATETIMEn", "ENTRIESn_hourly", "EXITSn_hourly"]]
                .set_index("DATETIMEn")
                .sort_index())
fig, ax = pylab.subplots(figsize=(12, 7))
set1 = brewer2mpl.get_map('Set1', 'qualitative', 3).mpl_colors
turnstile_dt.plot(ax=ax, color=set1)
ax.set_title("Entries/exits per hour across all stations")
ax.legend(["Entries", "Exits"])
ax.set_ylabel("Entries/exits per hour")
ax.set_xlabel("Date")
pass
# From this chart we can already tell that:
#
# - some days are less busy than others, but it's difficult to tell which days in particular are less busy, and
# - there seems to be several spikes per day in both entries and exits, but it's difficult to tell when these spikes occur.
#
# Time to dig in a bit deeper. Firstly, let's see how the day of the week affects the number of entries/exits per hour:
# In[6]:
turnstile_day = turnstile_dt.copy()
turnstile_day["day"] = turnstile_day.index.weekday  # Monday=0 ... Sunday=6
turnstile_day = (turnstile_day[["day", "ENTRIESn_hourly", "EXITSn_hourly"]]
                 .groupby("day")
                 .sum())
fig, ax = plt.subplots(figsize=(12, 7))
set1 = brewer2mpl.get_map('Set1', 'qualitative', 3).mpl_colors
turnstile_day.plot(ax=ax, kind="bar", color=set1)
ax.set_title("Total entries/exits per hour by day across all stations")
ax.legend(["Exits", "Entries"])
ax.set_ylabel("Entries/exits per hour")
ax.set_xlabel("Day")
ax.set_xticklabels(["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"],
rotation=45)
pass
# Saturdays and Sundays are the least busy days.
#
# Secondly, we want to plot total entries and exits grouped by hour of the day, to see which hours are busier than others. Our data won't make this easy, however, because it's only sampled once every four hours (1AM, 5AM, 9AM, ...). We need to resample our data and estimate what the entries/exits at e.g. 2AM, 3AM, etc. would be, and we estimate these values by "drawing straight lines" between known values.
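# As a minimal sketch of that idea (an illustration added to this write-up, not part of the original analysis), pandas can build a true hourly series with linear interpolation from the DATETIMEn index created earlier:
# In[ ]:
hourly_sketch = (turnstile_dt[["ENTRIESn_hourly", "EXITSn_hourly"]]
                 .resample("1H")                    # one row per hour of the month
                 .mean()                            # average across stations within each hour
                 .interpolate(method="linear"))     # "draw straight lines" between known values
hourly_sketch.head()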
# In[7]:
turnstile_dt["hour"] = turnstile_dt.index.hour
turnstile_by_hour = turnstile_dt
turnstile_by_hour = (turnstile_by_hour[["hour", "ENTRIESn_hourly", "EXITSn_hourly"]]
                     .groupby("hour")
                     .sum())
fig, ax = pylab.subplots(figsize=(12, 7))
set1 = brewer2mpl.get_map('Set1', 'qualitative', 3).mpl_colors
turnstile_by_hour.plot(ax=ax, color=set1)
ax.set_title("Total entries/exits per hour by hour across all stations")
ax.legend(["Entries", "Exits"])
ax.set_ylabel("Entries/exits per hour (1e6 is a million)")
ax.set_xlabel("Hour (0 is midnight, 12 is noon, 23 is 11pm)")
ax.set_xlim(0, 23)
pass
# By plotting the total number of hourly entries and exits grouped by hour we can see that there are indeed several spikes during the day. Since we had to fill in the blanks to draw this chart, though, it's not worth reading too much into it; higher-precision data would make this chart far more useful.
#
# Rather than dividing up our data by hour and day, let's look at all of it at once: what effect does rain have on the number of people who use the subway? Our Weather Underground data has a "rain" column that is 0 if it wasn't raining and 1 if it was during a particular hourly turnstile measurement:
# In[8]:
turnstile_rain = turnstile_data[["rain", "ENTRIESn_hourly", "EXITSn_hourly"]].copy()
turnstile_rain["rain2"] = np.where(turnstile_rain["rain"] == 1, "raining", "not raining")
turnstile_rain.groupby("rain2").describe()
# The median and mean number of entries and exits per hour are higher when it is raining than when it is not raining. The numeric difference, however, is very small, as shown below when we plot Kernel Density Estimates (KDEs) of the data. A KDE can be thought of as a "smooth histogram"; more formally it is an estimate of the probability density function (PDF) of our data, i.e. an estimate of how likely values near each point are to occur.
# In[9]:
turnstile_rain = turnstile_data[["rain", "ENTRIESn_hourly", "EXITSn_hourly"]].copy()
turnstile_rain["ENTRIESn_hourly_log10"] = np.log10(turnstile_rain["ENTRIESn_hourly"] + 1)
turnstile_rain["rain2"] = np.where(turnstile_rain["rain"] == 1, "raining", "not raining")
set1 = brewer2mpl.get_map('Set1', 'qualitative', 3).mpl_colors
plot = (gg.ggplot(turnstile_rain, gg.aes(x="ENTRIESn_hourly_log10", color="rain2"))
        + gg.geom_density()
        + gg.xlab("log10(entries per hour)")
        + gg.ylab("Density")
        + gg.ggtitle("Entries per hour whilst raining and not raining"))
plot
# Note what we've plotted on the x-axis. This time we didn't plot the number of entries per hour by itself, but rather applied "log10" (log base 10) to it. That means that 0 = 1, 1 = 10, 2 = 100, 3 = 1,000, 4 = 10,000, and 5 = 100,000. Plotting it this way makes it easier to fit data with extreme values into the same chart.
#
# log10 came in handy, because a lot of turnstiles are idle during the day! Applying a function to our data and then charting it allows us to still use charts to visualise strange data. This is a very common technique in, for example, electrical engineering, where you often deal with both very small and very large numbers at the same time.
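# As a quick numeric check of that mapping (added for illustration):
# In[ ]:
# log10 of successive powers of ten gives back the exponents 0 through 5.
np.log10([1, 10, 100, 1000, 10000, 100000])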
#
# Although our numeric estimate above indicates that more people travel per hour when it is raining, this conclusion isn't clear when looking at these charts. However drawing a KDE gives us another interesting perspective: the "pattern" of the numbers of people traveling on the MTA doesn't change according to the rain. It isn't as if the MTA, or the way people use it, magically changes when it starts raining.
# ## 3. Analysing data
# We sounded very confident above when we said that more people ride the subway when it's raining. Rather than relying on a chart, is there a numerical way of seeing how sure we are that there is a difference?
#
# This is a common statistical problem. Given two groups of data you need to figure out whether they have the same averages or not. We have a veritable goody bag full of statistical techniques which will let us answer this question. However, different techniques make different assumptions!
#
# One method, the t-test, assumes that the data we're analysing is "normal". What does normal mean? Data like the heights of people in a room often look like this:
# In[10]:
np.random.seed(42)
data = pd.Series(np.random.normal(loc=180, scale=40, size=600))
data.hist()
pass
# Notice how it's neatly symmetrical, with only one peak in the middle? You may also know this as the "bell curve". However, it's clear that our data is not normally distributed:
# In[11]:
p = turnstile_data["ENTRIESn_hourly"].hist()
pylab.suptitle("Entries per hour across all stations")
pylab.xlabel("Entries per hour")
pylab.ylabel("Number of occurrences")
pass
# Again, though, the purpose of this section is to use numerical methods to answer our questions. Faced with the question "Is my data normally distributed, so that I can use the t-test on it?" we can do the following:
# In[12]:
(k2, pvalue) = scipy.stats.normaltest(turnstile_data["ENTRIESn_hourly"])
print(pvalue)
# Here we've avoided the Shapiro-Wilk test (`scipy.stats.shapiro`) because it does not work for data sets with more than 5000 values. The normality test above has returned a p-value close to 0, meaning that it is very likely that the data is not normally distributed. That means we can't use the t-test!
#
# Bummer! Fortunately, another test, the Mann-Whitney U test, doesn't assume the data is normally distributed and lets us answer the same question. Let's give it a whirl!
# In[13]:
not_raining = turnstile_data[turnstile_data["rain"] == 0]
raining = turnstile_data[turnstile_data["rain"] == 1]
(u, pvalue) = scipy.stats.mannwhitneyu(not_raining["ENTRIESn_hourly"],
                                       raining["ENTRIESn_hourly"])
print("median entries per hour when not raining: %s" % not_raining["ENTRIESn_hourly"].median())
print("median entries per hour when raining: %s" % raining["ENTRIESn_hourly"].median())
print("p-value of test statistic: %.4f" % (pvalue * 2, ))
# Since `scipy.stats.mannwhitneyu` returns a one-sided p-value, and we did not predict which group would have the higher average before gathering any data, we must multiply it by two to get a two-sided p-value \[1\] \[2\]. This allows us to say, at the 5% significance level, that there is a difference in the average number of entries per hour depending on whether it's raining or not \[3\].
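# As an aside (an addition to this write-up, not part of the original analysis): newer SciPy releases (0.17 and later) accept an `alternative` argument, so the two-sided p-value can be requested directly instead of doubling the one-sided value. A minimal sketch, assuming such a SciPy version is installed:
# In[ ]:
# Only run this with a SciPy version that supports the `alternative` keyword.
u2, pvalue_two_sided = scipy.stats.mannwhitneyu(not_raining["ENTRIESn_hourly"],
                                                raining["ENTRIESn_hourly"],
                                                alternative="two-sided")
print("two-sided p-value: %.4f" % pvalue_two_sided)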
# ## 4. Creating and evaluating models of data
# Both by drawing a variety of charts and by using some statistical tests we've determined that the day, the hour of the day, and whether it's raining all affect the number of people riding the MTA. However, our original goal was to predict how many people use the metro in NYC at any given time. Given that we have some information available that can help answer this question, but we're not precisely sure how this data maps onto our desired results, what is the best next step?
#
# In machine learning the task of predicting a continuous variable like height, weight, number of entries per hour, etc. is called **regression**. We're going to use a simple method called linear regression to attempt to predict the number of entries per hour, using variables we've already confirmed (using charts and tests) as having an effect on subway ridership:
#
# - whether it is raining ('rain'),
# - the amount of rain falling ('precipi'),
# - the hour of the day (0 to 23 inclusive) ('Hour'),
# - the mean temperature in Fahrenheit as an integer ('meantempi'), and
# - which particular metro station we're trying to predict ridership for ('UNIT').
#
# The last feature is odd because it is a **categorical variable**: a choice amongst a handful of values, akin to handedness, colour, etc. In order to represent a categorical variable in our numerical model we turn it into a set of **dummy variables**. For a categorical variable with *n* possible values we create *n* new columns, where exactly one of the columns is 1 for a particular row and the rest are 0. For example, if we had a categorical variable called "handedness" with possible values "left" and "right" we'd add two columns, "handedness_left" and "handedness_right", where only one of the columns could be 1 for a given row.
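# To make that concrete, here is a toy illustration of the handedness example (the values below are made up), using the same pandas function we later apply to the UNIT column:
# In[ ]:
# Each distinct value becomes its own 0/1 indicator column.
handedness = pd.Series(["left", "right", "right", "left"], name="handedness")
pd.get_dummies(handedness, prefix="handedness")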
# In linear regression we have:
#
# - A set of inputs called **features**, X. In here we put all our input variables and a column of 1's, to allow a constant value to be used.
# - A set of outputs called **values**, y. Here we'll be predicting ENTRIESn_hourly.
# - Values **theta** we get to choose to map one row of features onto a single value.
#
# Formally we say that:
#
# $
# \begin{aligned}
# y &= \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n \\
# &= \vec{\theta}^{T} \vec{x}
# \end{aligned}
# $
#
# In our case:
#
# $
# \vec{x}_i =
# \begin{bmatrix}
# 1 \\ \textrm{precipi}_i \\ \textrm{Hour}_i \\ \textrm{meantempi}_i \\
# \textrm{UNIT}_{1i} \\
# \textrm{UNIT}_{2i} \\
# \ldots \\
# \textrm{UNIT}_{ni} \\
# \end{bmatrix}
# $
#
# So to be clear what we're trying to figure out is some values for $\vec{\theta}$ that'll let us make a model such that:
#
# $
# \begin{aligned}
# \textrm{ENTRIESn\_hourly} &= \theta_0 \\ &+ \theta_1 \times \textrm{precipi} \\ &+ \theta_2 \times \textrm{Hour} \\ &+ \theta_3 \times \textrm{meantempi} \\ &+ \theta_4 \times \textrm{UNIT}_1 \\ &+ \theta_5 \times \textrm{UNIT}_2 \\ &+ \ldots
# \end{aligned}
# $
#
# In order to do this we want to find some values for $\vec{\theta}$ that minimize the **cost function** for linear regression:
#
# $
# \begin{aligned}
# J(\theta) &= \frac{1}{2m} \displaystyle \sum_{i=1}^{m} \left( \textrm{predictions} - \textrm{actuals} \right) ^ 2 \\
# &= \frac{1}{2m} \displaystyle \sum_{i=1}^{m} \left( \theta^T x - y \right) ^ 2
# \end{aligned}
# $
# In[14]:
def compute_cost(features, values, theta):
    # The cost J(theta): mean squared error with the 1/(2m) convention used above.
    m = len(values)
    predictions = features.dot(theta)
    return (1 / (2 * m)) * np.square(predictions - values).sum()
# In order to minimize the cost we start off at $\vec{\theta} = \vec{0}$ and then run a **gradient descent** step many times. Each time we run the following step we get closer and closer to the values of $\vec{\theta}$ that minimize the cost function:
#
# $
# \theta_j = \theta_j - \frac{\alpha}{m} \displaystyle \sum_{i=1}^{m} \left( \theta^T x - y \right ) x_j
# $
#
# Where $\alpha$ is the **learning rate**. Setting this value too small will cause us to converge very slowly but we'll always get the right answer eventually, whereas setting it too large goes faster but may cause us to *diverge* and never get the right answer!
# In[15]:
def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        # Take one step down the gradient of the cost function.
        loss = features.dot(theta) - values
        delta = (alpha / m) * loss.dot(features)
        theta -= delta
        cost_history.append(compute_cost(features, values, theta))
    return theta, pd.Series(cost_history)
# One catch with using gradient descent to minimize the cost function for linear regression is that if we don't use **feature scaling** to bring each input feature onto a comparable scale (here we standardise each feature to zero mean and unit standard deviation) then it will take a very long time for our algorithm to converge. Hence we will scale our input features in order to speed up convergence; note that this does not affect the correctness of our solution.
# In[16]:
def normalize_features(array):
    # Standardise each column to zero mean and unit standard deviation.
    mu = array.mean()
    sigma = array.std()
    normalized = (array - mu) / sigma
    return (normalized, mu, sigma)
# In[17]:
def predictions(features, values, alpha, num_iterations):
    m = len(values)
    theta = np.zeros(features.shape[1])
    theta, cost_history = gradient_descent(features, values, theta, alpha, num_iterations)
    predictions = features.dot(theta)
    # Predictions less than 0 make no sense, so just set them to 0
    predictions[predictions < 0] = 0
    return pd.Series(predictions)
dummy_units = pd.get_dummies(turnstile_data['UNIT'], prefix='unit')
features = turnstile_data[['rain', 'precipi', 'Hour', 'meantempi']].join(dummy_units)
features, mu, sigma = normalize_features(features)
features = np.array(features)
features = np.insert(features, 0, 1, axis=1)
values = np.array(turnstile_data[['ENTRIESn_hourly']]).flatten()
pred = predictions(features=np.array(features),
values=values,
alpha=0.1,
num_iterations=150)
turnstile_data[["UNIT", "DATETIMEn", "ENTRIESn_hourly"]].join(pd.Series(pred, name="predictions")).head(n=10)
# We have some predictions, but how are we going to evaluate how good these predictions are? One simple method is to compute the **coefficient of determination**, which measures how much of the variability in the data the model is able to capture. It varies from 0 to 1 and bigger values are better:
#
# $
# \begin{aligned}
# R^2 &= 1 - \frac{\sigma_\textrm{errors}^2}{\sigma_\textrm{data}^2} \\
# &= 1 - \frac{ \displaystyle \sum_{i=1}^{m} \left( y_i - f_i \right)^2 }{ \displaystyle \sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2 }
# \end{aligned}
# $
# In[18]:
def compute_r_squared(data, predictions):
    # R^2 = 1 - (sum of squared errors) / (total sum of squares).
    numerator = np.square(data - predictions).sum()
    denominator = np.square(data - data.mean()).sum()
    return 1 - numerator / denominator
# In[19]:
"%.3f" % compute_r_squared(turnstile_data["ENTRIESn_hourly"], pred)
# So, loosely speaking, our model explains about 46% of the variation present in the data it was trained on.
#
# Another way of assessing how good a statistical model is, is to plot its **residuals**, which are defined as the differences between the predicted and actual values. Residuals are the variation in the data left unexplained by the model. A good model's residuals will be approximately normally distributed with a mean of 0 and some finite variance \[4\].
# In[20]:
def plot_residuals(df, predictions):
    """Reference: http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
    """
    plt.figure()
    (df["ENTRIESn_hourly"] - predictions).hist(bins=50)
    pylab.title("Residual plot")
    pylab.xlabel("Residuals")
    pylab.ylabel("Frequency")
    return plt
plot_residuals(turnstile_data, pred)
# In[21]:
residuals = turnstile_data["ENTRIESn_hourly"] - pred
(zscore, pvalue) = scipy.stats.normaltest(residuals)
print "p-value of normaltest on residuals: %s" % pvalue
print "mean of residuals: %s" % residuals.mean()
# In summary, our current model seems to be a reasonable fit, but the residual plot and the normality test indicate that there is some structure left in our residual errors that could still be reduced.
# ## 5. Going Further
# How can we improve our model? In general there are two options available when we need to improve the performance of a statistical model: make the model more complex, or feed it more data.
#
# With our current linear regression model we're just passing in the original values for a small set of features. We could make our model more complex and pass in **more features** to see if that helps:
# In[22]:
dummy_units = pd.get_dummies(turnstile_data['UNIT'], prefix='unit')
features = turnstile_data[['rain', 'precipi', 'Hour', 'meantempi', 'mintempi', 'maxtempi',
'mindewpti', 'meandewpti', 'maxdewpti', 'minpressurei',
'meanpressurei', 'maxpressurei', 'meanwindspdi']].join(dummy_units)
features, mu, sigma = normalize_features(features)
features = np.array(features)
features = np.insert(features, 0, 1, axis=1)
values = np.array(turnstile_data[['ENTRIESn_hourly']]).flatten()
pred2 = predictions(features=np.array(features),
values=values,
alpha=0.1,
num_iterations=150)
print("original R^2: %.3f" % compute_r_squared(turnstile_data["ENTRIESn_hourly"], pred))
print("more existing features R^2: %.3f" %
compute_r_squared(turnstile_data["ENTRIESn_hourly"], pred2))
# Alternatively we can use **polynomial combinations** of features to give our linear regression model a better chance of fitting the data. One disadvantage of this approach is that the number of features increases dramatically, so it will take much longer to both train and use the model.
# In[23]:
def add_polynomial_features(df, degree, add_sqrt):
    # Add products of existing columns up to `degree` and, optionally, square roots of every column.
    for i in range(2, degree + 1):
        for combination in itertools.combinations_with_replacement(df.columns, i):
            name = " ".join(combination)
            value = np.prod(df[list(combination)], axis=1)
            df[name] = value
    if add_sqrt:
        for column in df.columns:
            df["%s_sqrt" % column] = np.sqrt(df[column])
dummy_units = pd.get_dummies(turnstile_data['UNIT'], prefix='unit')
features = turnstile_data[['rain', 'precipi', 'Hour', 'meantempi', 'mintempi', 'maxtempi',
                           'mindewpti', 'meandewpti', 'maxdewpti', 'minpressurei',
                           'meanpressurei', 'maxpressurei', 'meanwindspdi']].copy()
add_polynomial_features(features, 2, add_sqrt=True)
features = features.join(dummy_units)
features, mu, sigma = normalize_features(features)
features = np.array(features)
features = np.insert(features, 0, 1, axis=1)
values = np.array(turnstile_data[['ENTRIESn_hourly']]).flatten()
pred3 = predictions(features=np.array(features),
values=values,
alpha=0.025,
num_iterations=150)
print("original R^2: %.3f" % compute_r_squared(turnstile_data["ENTRIESn_hourly"], pred))
print("more existing features R^2: %.3f" %
compute_r_squared(turnstile_data["ENTRIESn_hourly"], pred2))
print("more existing features with polynomial combinations R^2: %.3f" %
compute_r_squared(turnstile_data["ENTRIESn_hourly"], pred3))
# The downside of increasing the complexity of the model is that there's a risk that our model is already complex enough, and that instead we need to feed it more data. In order to gather and process large amounts of data we may have to resort to a concurrency model like MapReduce, where massive amounts of data are processed by many computers that don't communicate with one another directly but instead read from and write to a shared file system.
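# To make the idea concrete, the cell below is a toy, purely in-memory sketch of the map/shuffle/reduce phases added to this write-up (a real MapReduce job would run the mappers and reducers on many machines over a distributed file system). Here we sum hourly entries per station UNIT on a small slice of the data:
# In[ ]:
def mapper(row):
    # The map phase: emit one (key, value) pair per input record.
    return (row["UNIT"], row["ENTRIESn_hourly"])

def reducer(key, values):
    # The reduce phase: combine all values emitted for a single key.
    return (key, sum(values))

pairs = [mapper(row) for _, row in turnstile_data.head(1000).iterrows()]
pairs.sort(key=operator.itemgetter(0))                      # the "shuffle" phase groups keys together
reduced = [reducer(key, [value for _, value in group])
           for key, group in itertools.groupby(pairs, key=operator.itemgetter(0))]
reduced[:5]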
#
# How can we tell whether our model is complex enough? In machine learning terms this is referred to as the **bias/variance** problem:
#
# - If our model is experiencing high bias it is underfitting: it is too simple to capture the patterns in the data. Adding more data will not help, because the model is already unable to represent the data it has. Instead you need to make the model more complex.
# - If our model is experiencing high variance it is overfitting: it is too complex. Adding more data will help, because more data constrains the model and reduces the likelihood of it overfitting.
#
# In order to determine whether our model is experiencing high bias or high variance we plot **learning curves**:
#
# - Rather than training our model on all available data we split the data set into three pieces: a **training set**, a **cross-validation** set, and a **testing** set.
# - A good default split is around 60% training, 20% cross-validation, and 20% testing. In our case, since we're just drawing learning curves and not evaluating our model's generalizability, let's use 60% training and 40% cross-validation.
# - Plot $J_\textrm{CV}(\theta)$ and $J_\textrm{train}(\theta)$ with respect to the number of training samples. Whilst feeding in an ever increasing amount of training samples, re-train for new values of $\theta$ and plot the cost function on the training set ($J_\textrm{train}(\theta)$) and the cross-validation set ($J_\textrm{CV}(\theta)$).
# - If $J_\textrm{CV}(\theta) \approx J_\textrm{train}(\theta)$ and both remain high as the number of samples increases then you have high bias: the model isn't able to capture the patterns in the data, and more data won't help.
# - If $J_\textrm{CV}(\theta) >> J_\textrm{train}(\theta)$ then your model is likely too complex and is overfitting your training data. Increasing the number of training samples will help improve your model's performance.
# In[24]:
def get_shuffled_df(df):
    # Return a copy of df with its rows in a random order.
    df = df.copy()
    df = df.reindex(np.random.permutation(df.index))
    return df
np.random.seed(42)
df = get_shuffled_df(turnstile_data)
# In[25]:
m = len(df)
df_train = df[:int(0.6*m)]
df_cv = df[int(0.6*m):]
feature_names = ['rain', 'precipi', 'Hour', 'meantempi']
value_name = 'ENTRIESn_hourly'
dummy_units = pd.get_dummies(df_train['UNIT'], prefix='unit')
features_train = df_train[feature_names].join(dummy_units)
features_train, mu, sigma = normalize_features(features_train)
features_train = np.array(features_train)
features_train = np.insert(features_train, 0, 1, axis=1)
values_train = df_train[value_name]
dummy_units = pd.get_dummies(df_cv['UNIT'], prefix='unit')
features_cv = df_cv[feature_names].join(dummy_units)
features_cv = (features_cv - mu) / sigma
features_cv = np.array(features_cv)
features_cv = np.insert(features_cv, 0, 1, axis=1)
values_cv = df_cv[value_name]
# In[26]:
def get_learning_curves(features_train, values_train, features_cv, values_cv,
                        num_points=10, alpha=0.1, num_iterations=50):
    cost_train = []
    r_squared_train = []
    cost_cv = []
    r_squared_cv = []
    m = len(features_train)
    number_of_samples = np.array(np.linspace(m / num_points, m - 1, num_points), dtype=int)
    for i in number_of_samples:
        # Train on the first i samples only, then evaluate on both the training subset and the CV set.
        features_train_subset = features_train[:i]
        values_train_subset = values_train[:i]
        theta = np.zeros(features_train_subset.shape[1])
        theta, _ = gradient_descent(features_train_subset, values_train_subset, theta, alpha, num_iterations)
        cost_train_subset = compute_cost(features_train_subset, values_train_subset, theta)
        cost_train.append(cost_train_subset)
        r_squared_train.append(compute_r_squared(values_train_subset, features_train_subset.dot(theta)))
        cost_cv_subset = compute_cost(features_cv, values_cv, theta)
        cost_cv.append(cost_cv_subset)
        r_squared_cv.append(compute_r_squared(values_cv, features_cv.dot(theta)))
    return pd.DataFrame({'Number of samples': number_of_samples,
                         'Training error': cost_train,
                         'Testing error': cost_cv,
                         'R^2 training': r_squared_train,
                         'R^2 testing': r_squared_cv})
np.random.seed(42)
costs = get_learning_curves(features_train, values_train, features_cv, values_cv)
# In[27]:
fig, (ax1, ax2) = pylab.subplots(nrows=2, ncols=1, figsize=(12, 7))
set1 = brewer2mpl.get_map('Set1', 'qualitative', 3).mpl_colors
costs.plot(x="Number of samples", y=['Training error', 'Testing error'],
ax=ax1, color=set1)
ax1.set_title("Traditional learning curves")
ax1.legend()
ax1.set_xlabel("Number of training samples")
ax1.set_ylabel("Value of cost function")
costs.plot(x="Number of samples", y=["R^2 training", "R^2 testing"],
ax=ax2, color=set1)
ax2.set_title("Learning curves using R^2")
ax2.legend()
ax2.set_xlabel("Number of training samples")
ax2.set_ylabel("R^2")
pass
# If you're confused as to why the training error can exceed the testing error, note that this is probably just sampling variation at play. A more robust way of drawing the learning curves would use bootstrapping to estimate the values of the cost function on the training and testing sets, rather than relying on a single sample.
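# As a minimal sketch of that bootstrapping idea (an addition to this write-up, reusing the helpers defined above): fit once on the training split, then estimate the cross-validation cost as a mean over resamples drawn with replacement.
# In[ ]:
theta_boot, _ = gradient_descent(features_train, np.asarray(values_train),
                                 np.zeros(features_train.shape[1]),
                                 alpha=0.1, num_iterations=50)
boot_costs = []
for _ in range(100):
    # Resample the cross-validation rows with replacement and recompute the cost.
    idx = np.random.randint(0, len(values_cv), len(values_cv))
    boot_costs.append(compute_cost(features_cv[idx], np.asarray(values_cv)[idx], theta_boot))
print("bootstrapped CV cost: %.0f +/- %.0f" % (np.mean(boot_costs), np.std(boot_costs)))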
#
# Looking at a non-traditional set of learning curves that use the coefficient of determination $R^2$ instead it's clear that our current model is experiencing high bias. The training and testing coefficients of determination are almost equal and not really changing with respect to the number of samples. Hence we would not expect gathering more training data to improve our learning algorithm's performance. Instead we should focus on increasing its complexity, as we were doing above by e.g. adding polynomial features. Alternatively one could use more advanced regression techniques such as neural networks or Support Vector Regression (SVR).
#
# However another unfortunate possibility is that the features available to us, relating to the weather, do not contain enough information to better predict the ridership of the MTA. It is a difficult judgement to make as to whether this is true.
# ### Why not try XGBoost?
# In[29]:
features_train.shape
# In[30]:
values_train.shape
# In[31]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
params = [3,5,6,9]
test_scores = []
for param in params:
    # 3-fold cross-validated RMSE for each candidate max_depth.
    clf = XGBRegressor(max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, features_train, values_train, cv=3, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
# In[32]:
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
plt.plot(params, test_scores)
plt.title("max_depth vs CV Error");
# ## 6. References
#
# \[1\] scipy.stats.mannwhitneyu (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu)
#
# \[2\] One tail vs Two tail p-values (http://graphpad.com/guides/prism/6/statistics/index.htm?one-tail_vs__two-tail_p_values.htm)
#
# \[3\] How the Mann-Whitney Test Works (http://graphpad.com/guides/prism/6/statistics/index.htm?how_the_mann-whitney_test_works.htm)
#
# \[4\] Are the model residuals well behaved? NIST Engineering Handbook. (http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm)
# In[29]:
get_ipython().run_line_magic('autosave', '10')
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
# I'm currently preparing for the postgraduate entrance exam. If you need the dataset, it's available for free download here: https://www.yuanpy.top/3333.html