Kaggle Competition: Quora Question Pairs

Table of Contents

  • Data Source
  • Data Analysis
    • Training Set
    • Test Set
    • Data Analysis of the Training Set
      • Character Counts
      • Word Counts
      • Word Cloud
  • Logistic Regression
      • Feature Extraction
      • Training the Model
      • ROC Evaluation
      • Precision-Recall Curve Evaluation
  • XGBoost
    • Words Shared by the Two Questions
    • TF-IDF
      • Word Statistics
      • Computing TF-IDF
    • Rebalancing the Data
    • Splitting the Data
    • XGBoost
    • An Alternative Approach
  • Random Forest
    • Bag-of-Words Model
  • LSTM
  • Decomposable Attention
  • Time-Distributed CNN
  • Reference

Data Source

Quora is a platform for asking questions and connecting with people who contribute unique insights and quality answers, which empowers people to learn from each other and to better understand the world. More than 100 million people visit Quora every month, so it is no surprise that many of them ask similarly worded questions. Multiple questions with the same intent make seekers spend more time finding the best answer and make writers feel they have to answer several versions of the same question. Quora currently uses a Random Forest model to identify duplicate questions. In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether a pair of questions are duplicates.

Data Analysis

Training Set

id: the id of the question pair in the training set
qid1, qid2: the unique id of each question
question1, question2: the full text of each question
is_duplicate: the target variable; 1 means question1 and question2 have the same meaning, 0 means they do not

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_train = pd.read_csv('./train.csv')
df_train.head()

[Figure: preview of df_train.head()]

print('Number of training samples: {}'.format(len(df_train)))
print('Duplicate pair ratio: {}%'.format(round(df_train['is_duplicate'].mean()*100, 2)))

qids = pd.Series(df_train['qid1'].tolist() + df_train['qid2'].tolist())

print('Number of unique questions in the training set: {}'.format(len(np.unique(qids))))
print("Number of questions that appear more than once: {}".format(np.sum(qids.value_counts() > 1)))

Number of training samples: 404290
Duplicate pair ratio: 36.92%
Number of unique questions in the training set: 537933
Number of questions that appear more than once: 111780

plt.figure(figsize=(12, 5))
plt.hist(qids.value_counts(), bins=50)
plt.yscale('log', nonposy='clip')
plt.title('Log-Histogram of question appearance counts')
plt.xlabel('Number of occurences of question')
plt.ylabel('Number of questions')

[Figure: log-histogram of question appearance counts]
As the figure shows, most questions appear only a few times and very few appear many times; one question appears more than 160 times, which is an outlier.

Test Set

df_test = pd.read_csv('./test.csv')
df_test.head()

[Figure: preview of df_test.head()]

print('Number of test samples: {}'.format(len(df_test)))

Number of test samples: 3563475

Data Analysis of the Training Set

Character Counts

train_qs = pd.Series(df_train['question1'].tolist() + df_train['question2'].tolist()).astype(str)
test_qs = pd.Series(df_test['question1'].tolist() + df_test['question2'].tolist()).astype(str)
print(train_qs[0])
print(train_qs.shape)

dist_train = train_qs.apply(len)  # get all the length of qs
dist_test = test_qs.apply(len)
print(dist_train[0])

What is the step by step guide to invest in share market in india?
(808580,)
66

plt.figure(figsize=(15, 10))
# pal is a seaborn color palette, e.g. pal = sns.color_palette() (not defined in this excerpt)
plt.hist(dist_train, bins=200, range=[0, 200], color=pal[2], normed=True, label='train')
plt.hist(dist_test, bins=200, range=[0, 200], color=pal[1], normed=True, alpha=0.5, label='test')
plt.title('Normalised histogram of character count in questions', fontsize=15)
plt.legend()
plt.xlabel('Number of characters', fontsize=15)
plt.ylabel('Probability', fontsize=15)
print('mean-train {:.2f} std-train {:.2f} mean-test {:.2f} std-test {:.2f} max-train {:.2f} max-test {:.2f}'.format(dist_train.mean(), 
                          dist_train.std(), dist_test.mean(), dist_test.std(), dist_train.max(), dist_test.max()))

mean-train 59.82 std-train 31.96
mean-test 60.07 std-test 31.63
max-train 1169.00 max-test 1176.00
[Figure: normalised histogram of character count in questions, train vs. test]
Most questions have between 15 and 150 characters, and the character-count distribution of the test set differs slightly from that of the training set. All questions have fewer than 1200 characters, and questions longer than 200 characters are already rare.

Word Counts

dist_train = train_qs.apply(lambda x: len(x.split(' ')))
dist_test = test_qs.apply(lambda x: len(x.split(' ')))

plt.figure(figsize=(15, 10))
plt.hist(dist_train, bins=50, range=[0, 50], color=pal[2], normed=True, label='train')
plt.hist(dist_test, bins=50, range=[0, 50], color=pal[1], normed=True, alpha=0.5, label='test')
plt.title('Normalised histogram of word count in questions', fontsize=15)
plt.legend()
plt.xlabel('Number of words', fontsize=15)
plt.ylabel('Probability', fontsize=15)

print('mean-train {:.2f} std-train {:.2f} mean-test {:.2f} std-test {:.2f} max-train {:.2f} max-test {:.2f}'.format(dist_train.mean(), 
                          dist_train.std(), dist_test.mean(), dist_test.std(), dist_train.max(), dist_test.max()))

mean-train 11.06 std-train 5.89
mean-test 11.02 std-test 5.84
max-train 237.00 max-test 238.00

[Figure: normalised histogram of word count in questions, train vs. test]
Most questions contain around 10 words.

Word Cloud

The word cloud shows the most frequently occurring words.

from wordcloud import WordCloud
cloud = WordCloud(width=1440, height=1080).generate(" ".join(train_qs.astype(str)))
plt.figure(figsize=(20, 15))
plt.imshow(cloud)
plt.axis('off')

[Figure: word cloud of the most frequent words]

Logistic Regression

  • 1 Load the data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from subprocess import check_output

%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

df = pd.read_csv("./train.csv").fillna("")
df.head() 
  • 2 Process the data
df['q1len'] = df['question1'].str.len()
df['q2len'] = df['question2'].str.len()

df['q1_n_words'] = df['question1'].apply(lambda row: len(row.split(" ")))
df['q2_n_words'] = df['question2'].apply(lambda row: len(row.split(" ")))

def normalized_word_share(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
    return 1.0 * len(w1 & w2)/(len(w1) + len(w2))


df['word_share'] = df.apply(normalized_word_share, axis=1)

df.head()
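
To see what word_share measures, here is a toy pair pushed through normalized_word_share; the two sentences are made up purely for illustration:

toy = {'question1': 'How do I learn Python quickly',
       'question2': 'What is the best way to learn Python'}
# w1 has 6 unique words, w2 has 8, and they share {'learn', 'python'}
print(normalized_word_share(toy))  # 2 / (6 + 8) ≈ 0.143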

plt.figure(figsize=(12, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_share', data = df[0:50000])
plt.subplot(1,2,2)
sns.distplot(df[df['is_duplicate'] == 1.0]['word_share'][0:10000], color = 'green')
sns.distplot(df[df['is_duplicate'] == 0.0]['word_share'][0:10000], color = 'red')

[Figure: violin plot and distributions of word_share by is_duplicate]

Feature Extraction

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(df[['q1len', 'q2len', 'q1_n_words', 'q2_n_words', 'word_share']])

X = scaler.transform(df[['q1len', 'q2len', 'q1_n_words', 'q2_n_words', 'word_share']])
y = df['is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Training the Model

clf = LogisticRegression()
grid = {
    'C': [1e-6, 1e-3, 1e0],
    'penalty': ['l1', 'l2']
}
cv = GridSearchCV(clf, grid, scoring='neg_log_loss', n_jobs=-1, verbose=1)
cv.fit(X_train, y_train)
for i in range(1, len(cv.cv_results_['params'])+1):
    rank = cv.cv_results_['rank_test_score'][i-1]
    s = cv.cv_results_['mean_test_score'][i-1]
    sd = cv.cv_results_['std_test_score'][i-1]
    params = cv.cv_results_['params'][i-1]
    print("{0}. Mean validation neg log loss: {1:.3f} (std: {2:.3f}) - {3}".format(
        rank,
        s,
        sd,
        params
    ))

Output

6. Mean validation neg log loss: -0.693 (std: 0.000) - {'C': 1e-06, 'penalty': 'l1'}
5. Mean validation neg log loss: -0.690 (std: 0.000) - {'C': 1e-06, 'penalty': 'l2'}
3. Mean validation neg log loss: -0.582 (std: 0.001) - {'C': 0.001, 'penalty': 'l1'}
4. Mean validation neg log loss: -0.586 (std: 0.001) - {'C': 0.001, 'penalty': 'l2'}
1. Mean validation neg log loss: -0.568 (std: 0.001) - {'C': 1.0, 'penalty': 'l1'}
2. Mean validation neg log loss: -0.569 (std: 0.001) - {'C': 1.0, 'penalty': 'l2'}

print(cv.best_params_)
print(cv.best_estimator_.coef_)

{'C': 1.0, 'penalty': 'l1'}
[[-13.54349298 4.32508628 10.35869457 -13.07001066 3.29722094]]

ROC Evaluation

colors = ['r', 'g', 'b', 'y', 'k', 'c', 'm', 'brown', 'r']
lw = 1
Cs = [1e-6, 1e-4, 1e0]

plt.figure(figsize=(12,8))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for different classifiers')

plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

labels = []
for idx, C in enumerate(Cs):
    clf = LogisticRegression(C = C)
    clf.fit(X_train, y_train)
    print("C: {}, parameters {} and intercept {}".format(C, clf.coef_, clf.intercept_))
    fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:,1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=lw, color=colors[idx])
    labels.append("C: {}, AUC = {}".format(C, np.round(roc_auc, 4)))

plt.legend(['random AUC = 0.5'] + labels)

Output
C: 1e-06, parameters [[-0.00419999 -0.00232428 -0.00354653 -0.00199889 -0.0018606 ]] and intercept [-0.03324753]
C: 0.0001, parameters [[-0.15977647 -0.09050407 -0.13253665 -0.08000122 0.68634253]] and intercept [-0.70425466]
C: 1.0, parameters [[-10.24038938 -0.91761175 6.78291511 -7.1645702 3.29874314]] and intercept [-1.35659649]

[Figure: ROC curves for different values of C]

Precision-Recall Curve Evaluation

pr, re, _ = precision_recall_curve(y_test, cv.best_estimator_.predict_proba(X_test)[:,1])
plt.figure(figsize=(12,8))
plt.plot(re, pr)
plt.title('PR Curve (AUC {})'.format(auc(re, pr)))
plt.xlabel('Recall')
plt.ylabel('Precision')

[Figure: precision-recall curve]

XGBoost

Words Shared by the Two Questions

from nltk.corpus import stopwords

stops = set(stopwords.words("english"))

def word_match_share(row):
    q1words = {}
    q2words = {}
    # remove stopwords
    for word in str(row['question1']).lower().split():
        if word not in stops:
            q1words[word] = 1
    for word in str(row['question2']).lower().split():
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    # count how many words in each question also appear in the other
    shared_words_in_q1 = [w for w in q1words.keys() if w in q2words]
    shared_words_in_q2 = [w for w in q2words.keys() if w in q1words]
    # shared words / total words
    R = (len(shared_words_in_q1) + len(shared_words_in_q2))/(len(q1words) + len(q2words))
    return R

plt.figure(figsize=(15, 5))
train_word_match = df_train.apply(word_match_share, axis=1, raw=True)
plt.hist(train_word_match[df_train['is_duplicate'] == 0], bins=20, normed=True, label='Not Duplicate')
plt.hist(train_word_match[df_train['is_duplicate'] == 1], bins=20, normed=True, alpha=0.7, label='Duplicate')
plt.legend()
plt.title('Label distribution over word_match_share', fontsize=15)
plt.xlabel('word_match_share', fontsize=15)

[Figure: label distribution over word_match_share]

TF-IDF

Word Statistics
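
The weights dictionary used below is not built anywhere in this excerpt. A minimal sketch of how it can be constructed, reusing the train_qs Series from the character-count section and assuming the inverse-count weighting of the referenced starter kernel (a word seen fewer than twice gets weight 0; with eps = 10000, a word seen exactly twice gets 1/10002 ≈ 9.998e-05, which matches the printed values):

from collections import Counter

def get_weight(count, eps=10000, min_count=2):
    # words that appear only once are probably typos or noise, so give them zero weight
    return 0 if count < min_count else 1.0 / (count + eps)

words = (" ".join(train_qs)).lower().split()
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}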

print('Most common words and their weights:')
print(sorted(weights.items(), key=lambda x: x[1] if x[1] > 0 else 9999)[:10])
print('Least common words and their weights: ')
print(sorted(weights.items(), key=lambda x: x[1], reverse=True)[:10])

Most common words and their weights:
[('the', 2.5891040146646852e-06), ('what', 3.115623919267953e-06), ('is', 3.5861702928825277e-06), ('how', 4.366449945201053e-06), ('i', 4.4805878531263305e-06), ('a', 4.540645588989843e-06), ('to', 4.671434644293609e-06), ('in', 4.884625153865692e-06), ('of', 5.920242493132519e-06), ('do', 6.070908207867897e-06)]
Least common words and their weights:
[('シ', 9.998000399920016e-05), ('し?', 9.998000399920016e-05), ('19-year-old.', 9.998000399920016e-05), ('1-855-425-3768', 9.998000399920016e-05), ('confederates', 9.998000399920016e-05), ('asahi', 9.998000399920016e-05), ('fab', 9.998000399920016e-05), ('109?', 9.998000399920016e-05), ('samrudi', 9.998000399920016e-05), ('fulfill?', 9.998000399920016e-05)]

Computing TF-IDF

def tfidf_word_match_share(row):
    q1words = {}
    q2words = {}
    for word in str(row['question1']).lower().split():
        if word not in stops:
            q1words[word] = 1
    for word in str(row['question2']).lower().split():
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # The computer-generated chaff includes a few questions that are nothing but stopwords
        return 0
    
    shared_weights = [weights.get(w, 0) for w in q1words.keys() if w in q2words] + [weights.get(w, 0) for w in q2words.keys() if w in q1words]
    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]
    
    R = np.sum(shared_weights) / np.sum(total_weights)
    return R
plt.figure(figsize=(15, 5))
tfidf_train_word_match = df_train.apply(tfidf_word_match_share, axis=1, raw=True)
plt.hist(tfidf_train_word_match[df_train['is_duplicate'] == 0].fillna(0), bins=20, normed=True, label='Not Duplicate')
plt.hist(tfidf_train_word_match[df_train['is_duplicate'] == 1].fillna(0), bins=20, normed=True, alpha=0.7, label='Duplicate')
plt.legend()
plt.title('Label distribution over tfidf_word_match_share', fontsize=15)
plt.xlabel('word_match_share', fontsize=15)

[Figure: label distribution over tfidf_word_match_share]

# roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
from sklearn.metrics import roc_auc_score
print('Original AUC:', roc_auc_score(df_train['is_duplicate'], train_word_match))
print('   TFIDF AUC:', roc_auc_score(df_train['is_duplicate'], tfidf_train_word_match.fillna(0)))

Original AUC: 0.7804327049353577
TFIDF AUC: 0.7704802292218704
The TF-IDF-weighted feature actually performs slightly worse than the plain word-match feature here.

Rebalancing the Data

# First we create our training and testing data
x_train = pd.DataFrame()
x_test = pd.DataFrame()
x_train['word_match'] = train_word_match
x_train['tfidf_word_match'] = tfidf_train_word_match
x_test['word_match'] = df_test.apply(word_match_share, axis=1, raw=True)
x_test['tfidf_word_match'] = df_test.apply(tfidf_word_match_share, axis=1, raw=True)

y_train = df_train['is_duplicate'].values
pos_train = x_train[y_train == 1]
neg_train = x_train[y_train == 0]

# oversample the negative class
# There is likely a much more elegant way to do this...
p = 0.165
scale = ((len(pos_train) / (len(pos_train) + len(neg_train))) / p) - 1
while scale > 1:
    neg_train = pd.concat([neg_train, neg_train])
    scale -=1
neg_train = pd.concat([neg_train, neg_train[:int(scale * len(neg_train))]])
print(len(pos_train) / (len(pos_train) + len(neg_train)))

x_train = pd.concat([pos_train, neg_train])
y_train = (np.zeros(len(pos_train)) + 1).tolist() + np.zeros(len(neg_train)).tolist()
del pos_train, neg_train
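
This rebalancing is only approximate. Using the 36.92% duplicate share found in the EDA section and the target p = 0.165, tracing the loop above shows the final positive share lands near 19% rather than exactly 16.5%:

# rough trace of the oversampling above (class shares taken from the EDA section)
P, N, p = 0.3692, 0.6308, 0.165
scale = (P / (P + N)) / p - 1     # ≈ 1.238, so the while loop runs once
neg_factor = 2.0                  # neg_train is doubled
scale -= 1                        # ≈ 0.238 left over
neg_factor *= (1 + scale)         # partial concat of the already-doubled negatives ≈ 2.475
print(P / (P + neg_factor * N))   # ≈ 0.191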

Splitting the Data

# split the training data into train and valid parts
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=4242)

XGBoost

import xgboost as xgb

# Set our parameters for xgboost
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 4

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=50, verbose_eval=10)

Saving the Results

d_test = xgb.DMatrix(x_test)
p_test = bst.predict(d_test)

sub = pd.DataFrame()
sub['test_id'] = df_test['test_id']
sub['is_duplicate'] = p_test
sub.to_csv('simple_xgb.csv', index=False)

An Alternative Approach

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy.sparse import csr_matrix

train_data = pd.read_csv('../input/train.csv')
print (train_data.shape)
train_data.head()
test_data = pd.read_csv('../input/test.csv')
print (test_data.shape)
test_data.head()
# process the data
train_data = train_data.drop(['id', 'qid1', 'qid2'], 1)
test_data = test_data.drop(['test_id'], 1)
train_data.isnull().sum()
train_data = train_data.fillna('empty question')
test_data.isnull().sum()
test_data = test_data.fillna('empty question')

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(analyzer = 'word', stop_words = 'english', lowercase = True, norm = 'l1')

train_data_q1_tfidf = tfidf.fit_transform(train_data.question1.values)
train_data_q2_tfidf = tfidf.fit_transform(train_data.question2.values)

test_data_q1_tfidf = tfidf.fit_transform(test_data.question1.values)
test_data_q2_tfidf = tfidf.fit_transform(test_data.question2.values)

print (train_data_q1_tfidf.shape)
print (train_data_q2_tfidf.shape)
# (404290, 67533)
# (404290, 62375)
train_data_q1_tfidf = csr_matrix((train_data_q1_tfidf.data, train_data_q1_tfidf.indices, train_data_q1_tfidf.indptr), shape=(404290,90824))
train_data_q2_tfidf = csr_matrix((train_data_q2_tfidf.data, train_data_q2_tfidf.indices, train_data_q2_tfidf.indptr), shape=(404290,90824))

print (train_data_q1_tfidf.shape)
print (train_data_q2_tfidf.shape)
# (404290, 90824)
# (404290, 90824)
test_data_q1_tfidf = csr_matrix((test_data_q1_tfidf.data, test_data_q1_tfidf.indices, test_data_q1_tfidf.indptr), shape=(2345796,90824))
test_data_q2_tfidf = csr_matrix((test_data_q2_tfidf.data, test_data_q2_tfidf.indices, test_data_q2_tfidf.indptr), shape=(2345796,90824))

print (test_data_q1_tfidf.shape)
print (test_data_q2_tfidf.shape)
# (2345796, 90824)
# (2345796, 90824)
X = abs(train_data_q1_tfidf-train_data_q2_tfidf)
y = train_data[['is_duplicate']]

X_test = abs(test_data_q1_tfidf-test_data_q2_tfidf)

from xgboost import XGBClassifier
xg_model = XGBClassifier()
xg_model.fit(X, y)
xg_pred = xg_model.predict(X_test)
xg_pred = pd.Series(xg_pred, name='is_duplicate')
submission = pd.concat([pd.Series(range(2345796), name='test_id'),xg_pred], axis = 1)
submission.to_csv('xg_tfidf_submission_file.csv', index=False)
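
Note that fit_transform is called separately on question1 and question2 above, so each matrix gets its own vocabulary; that is why the widths differ and have to be forced to a common shape with csr_matrix. A variant (not in the original post) that sidesteps this is to fit a single vectorizer on all questions and only transform each column, reusing the imports and dataframes above:

tfidf = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=True, norm='l1')
tfidf.fit(pd.concat([train_data.question1, train_data.question2,
                     test_data.question1, test_data.question2]).values)

# all four matrices now share the same vocabulary and width
train_data_q1_tfidf = tfidf.transform(train_data.question1.values)
train_data_q2_tfidf = tfidf.transform(train_data.question2.values)
test_data_q1_tfidf = tfidf.transform(test_data.question1.values)
test_data_q2_tfidf = tfidf.transform(test_data.question2.values)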

reference:
https://www.kaggle.com/ananthreddy/only-tf-idf-vectors

Random Forest

import numpy as np 
import pandas as pd 

import os

from sklearn.model_selection import train_test_split

def read_data():
    df = pd.read_csv("./train.csv")
    print ("Shape of base training File = ", df.shape)
    # Remove missing values and duplicates from training data
    df.drop_duplicates(inplace=True)
    df.dropna(inplace=True)
    print("Shape of base training data after cleaning = ", df.shape)
    return df

df = read_data()
df_train, df_test = train_test_split(df, test_size = 0.02)
print ("\n\n", df_train.head(10))
print ("\nTrain Shape : ", df_train.shape)
print ("Test Shape : ", df_test.shape)
# Shape of base training File =  (404290, 6)
#Shape of base training data after cleaning =  (404287, 6)
# Train Shape :  (396201, 6)
# Test Shape :  (8086, 6)

Bag-of-Words Model

import re
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import *

words = re.compile(r"\w+",re.I)
stopword = stopwords.words('english')
stemmer = PorterStemmer()

# Cleaning and tokenizing the queries.
def tokenize_questions(df):
    question_1_tokenized = []
    question_2_tokenized = []

    for q in df.question1.tolist():
        question_1_tokenized.append([stemmer.stem(i.lower()) for i in words.findall(q) 
                                     if i not in stopword])

    for q in df.question2.tolist():
        question_2_tokenized.append([stemmer.stem(i.lower()) for i in words.findall(q) 
                                     if i not in stopword])

    df["Question_1_tok"] = question_1_tokenized
    df["Question_2_tok"] = question_2_tokenized
    
    return df

df_train = tokenize_questions(df_train)
df_test = tokenize_questions(df_test)

Dictionary

def train_dictionary(df):
    
    questions_tokenized = df.Question_1_tok.tolist() + df.Question_2_tok.tolist()
    
    dictionary = corpora.Dictionary(questions_tokenized)
    dictionary.filter_extremes(no_below=5)
    dictionary.compactify()
    
    return dictionary
    
dictionary = train_dictionary(df_train)
print ("No of words in the dictionary = %s" %len(dictionary.token2id))

def get_vectors(df, dictionary):
    
    question1_vec = [dictionary.doc2bow(text) for text in df.Question_1_tok.tolist()]
    question2_vec = [dictionary.doc2bow(text) for text in df.Question_2_tok.tolist()]
    
    question1_csc = gensim.matutils.corpus2csc(question1_vec, num_terms=len(dictionary.token2id))
    question2_csc = gensim.matutils.corpus2csc(question2_vec, num_terms=len(dictionary.token2id))
    
    return question1_csc.transpose(),question2_csc.transpose()


q1_csc, q2_csc = get_vectors(df_train, dictionary)

print (q1_csc.shape)
print (q2_csc.shape)
# (396201, 21254)
# (396201, 21254)

q1_csc_test, q2_csc_test = get_vectors(df_test, dictionary)

Similarity Computation

'''
Similarity Measures:
    Cosine Similarity
    Manhattan Distance
    Euclidean Distance
'''

from sklearn.metrics.pairwise import cosine_similarity as cs
from sklearn.metrics.pairwise import manhattan_distances as md
from sklearn.metrics.pairwise import euclidean_distances as ed


def get_similarity_values(q1_csc, q2_csc):
    cosine_sim = []
    manhattan_dis = []
    eucledian_dis = []
        
    for i,j in zip(q1_csc, q2_csc):
        sim = cs(i,j)
        cosine_sim.append(sim[0][0])
        sim = md(i,j)
        manhattan_dis.append(sim[0][0])
        sim = ed(i,j)
        eucledian_dis.append(sim[0][0])
        
    return cosine_sim, manhattan_dis, eucledian_dis
cosine_sim, manhattan_dis, eucledian_dis = get_similarity_values(q1_csc, q2_csc)
y_pred_cos, y_pred_man, y_pred_euc = get_similarity_values(q1_csc_test, q2_csc_test)

print ("cosine_sim sample= \n", cosine_sim[0:5])
print ("\nmanhattan_dis sample = \n", manhattan_dis[0:5])
print ("\neucledian_dis sample = \n", eucledian_dis[0:5])

# cosine_sim sample=  [0.5773502691896258, 0.3086066999241839, 0.3086066999241838, 0.40089186286863654, 0.2886751345948129]
# manhattan_dis sample =  [6.0, 11.0, 18.0, 9.0, 10.0]
# eucledian_dis sample =  [2.449489742783178, 3.605551275463989, 5.0990195135927845, 3.0, 3.1622776601683795]

Classification Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

xtrain = pd.DataFrame({"cosine" : cosine_sim, "manhattan" : manhattan_dis,
                        "eucledian" : eucledian_dis})
ytrain = df_train.is_duplicate

xtest = pd.DataFrame({"cosine" : y_pred_cos, "manhattan" : y_pred_man,
                       "eucledian" : y_pred_euc})
ytest = df_test.is_duplicate

rf = RandomForestClassifier()
rf.fit(xtrain, ytrain)
rf_predicted = rf.predict(xtest)

logist = LogisticRegression(random_state=0)
logist.fit(xtrain, ytrain)
logist_predicted = logist.predict(xtest)

from sklearn.metrics import log_loss

def calculate_logloss(y_true, y_pred):
    loss_cal = log_loss(y_true, y_pred)
    return loss_cal

logloss_rf = calculate_logloss(ytest, rf_predicted)
log_loss_logist = calculate_logloss(ytest, logist_predicted)
print ("Log loss value using Random Forest is = %f" %logloss_rf)
print ("Log loss value using Logistic Regression is = %f" %log_loss_logist)

from sklearn.metrics import accuracy_score
test_acc_rf = accuracy_score(ytest, rf_predicted) * 100
test_acc_logist = accuracy_score(ytest, logist_predicted) * 100
print ("Accuracy of Random Forest Model : ", test_acc_rf)
print ("Accuracy of Logistic Regression Model : ", test_acc_logist)

Log loss value using Random Forest is = 11.067395
Log loss value using Logistic Regression is = 12.071184
Accuracy of Random Forest Model : 67.95696265149641
Accuracy of Logistic Regression Model : 65.05070492208756

LSTM

'''
Single model may achieve LB scores at around 0.29+ ~ 0.30+
Average ensembles can easily get 0.28+ or less
Don't need to be an expert of feature engineering
All you need is a GPU!!!!!!!

The code is tested on Keras 2.0.0 using Tensorflow backend, and Python 2.7

According to experiments by kagglers, Theano backend with GPU may give bad LB scores while
        the val_loss seems to be fine, so try Tensorflow backend first please
'''

########################################
## import packages
########################################
import os
import re
import csv
import codecs
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from string import punctuation

from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers.merge import concatenate
from keras.models import Model
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

########################################
## set directories and parameters
########################################
BASE_DIR = '../input/'
EMBEDDING_FILE = BASE_DIR + 'GoogleNews-vectors-negative300.bin'
TRAIN_DATA_FILE = BASE_DIR + 'train.csv'
TEST_DATA_FILE = BASE_DIR + 'test.csv'
MAX_SEQUENCE_LENGTH = 30
MAX_NB_WORDS = 200000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.1

num_lstm = np.random.randint(175, 275)
num_dense = np.random.randint(100, 150)
rate_drop_lstm = 0.15 + np.random.rand() * 0.25
rate_drop_dense = 0.15 + np.random.rand() * 0.25

act = 'relu'
re_weight = True # whether to re-weight classes to fit the 17.5% share in test set

STAMP = 'lstm_%d_%d_%.2f_%.2f'%(num_lstm, num_dense, rate_drop_lstm, \
        rate_drop_dense)

########################################
## index word vectors
########################################
print('Indexing word vectors')

word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, \
        binary=True)
print('Found %s word vectors of word2vec' % len(word2vec.vocab))

########################################
## process texts in datasets
########################################
print('Processing text dataset')

# The function "text_to_wordlist" is from
# https://www.kaggle.com/currie32/quora-question-pairs/the-importance-of-cleaning-text
def text_to_wordlist(text, remove_stopwords=False, stem_words=False):
    # Clean the text, with the option to remove stopwords and to stem words.
    
    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)
    
    # Return a list of words
    return(text)

texts_1 = [] 
texts_2 = []
labels = []
with codecs.open(TRAIN_DATA_FILE, encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)
    for values in reader:
        texts_1.append(text_to_wordlist(values[3]))
        texts_2.append(text_to_wordlist(values[4]))
        labels.append(int(values[5]))
print('Found %s texts in train.csv' % len(texts_1))

test_texts_1 = []
test_texts_2 = []
test_ids = []
with codecs.open(TEST_DATA_FILE, encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    header = next(reader)
    for values in reader:
        test_texts_1.append(text_to_wordlist(values[1]))
        test_texts_2.append(text_to_wordlist(values[2]))
        test_ids.append(values[0])
print('Found %s texts in test.csv' % len(test_texts_1))

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts_1 + texts_2 + test_texts_1 + test_texts_2)

sequences_1 = tokenizer.texts_to_sequences(texts_1)
sequences_2 = tokenizer.texts_to_sequences(texts_2)
test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1)
test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2)

word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))

data_1 = pad_sequences(sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
data_2 = pad_sequences(sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.array(labels)
print('Shape of data tensor:', data_1.shape)
print('Shape of label tensor:', labels.shape)

test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
test_ids = np.array(test_ids)

########################################
## prepare embeddings
########################################
print('Preparing embedding matrix')

nb_words = min(MAX_NB_WORDS, len(word_index))+1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

########################################
## sample train/validation data
########################################
#np.random.seed(1234)
perm = np.random.permutation(len(data_1))
idx_train = perm[:int(len(data_1)*(1-VALIDATION_SPLIT))]
idx_val = perm[int(len(data_1)*(1-VALIDATION_SPLIT)):]

data_1_train = np.vstack((data_1[idx_train], data_2[idx_train]))
data_2_train = np.vstack((data_2[idx_train], data_1[idx_train]))
labels_train = np.concatenate((labels[idx_train], labels[idx_train]))

data_1_val = np.vstack((data_1[idx_val], data_2[idx_val]))
data_2_val = np.vstack((data_2[idx_val], data_1[idx_val]))
labels_val = np.concatenate((labels[idx_val], labels[idx_val]))

weight_val = np.ones(len(labels_val))
if re_weight:
    weight_val *= 0.472001959
    weight_val[labels_val==0] = 1.309028344

########################################
## define the model structure
########################################
embedding_layer = Embedding(nb_words,
        EMBEDDING_DIM,
        weights=[embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)
lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm)

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_1 = embedding_layer(sequence_1_input)
x1 = lstm_layer(embedded_sequences_1)

sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences_2 = embedding_layer(sequence_2_input)
y1 = lstm_layer(embedded_sequences_2)

merged = concatenate([x1, y1])
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

merged = Dense(num_dense, activation=act)(merged)
merged = Dropout(rate_drop_dense)(merged)
merged = BatchNormalization()(merged)

preds = Dense(1, activation='sigmoid')(merged)

########################################
## add class weight
########################################
if re_weight:
    class_weight = {0: 1.309028344, 1: 0.472001959}
else:
    class_weight = None
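
########################################
## sanity check: where the class weights come from (added sketch, not in the original)
########################################
# The two constants above are consistent with re-weighting the 36.92% duplicate share of
# the training data toward the ~17.5% share assumed for the test set (see the re_weight
# comment earlier): weight = assumed test share / observed training share per class.
r = 0.3692                        # duplicate share in the training data (EDA section)
p = 0.175                         # assumed duplicate share in the test set
print(p / r, (1 - p) / (1 - r))   # ≈ 0.474 and ≈ 1.308, close to 0.472001959 and 1.309028344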

########################################
## train the model
########################################
model = Model(inputs=[sequence_1_input, sequence_2_input], \
        outputs=preds)
model.compile(loss='binary_crossentropy',
        optimizer='nadam',
        metrics=['acc'])
#model.summary()
print(STAMP)

early_stopping = EarlyStopping(monitor='val_loss', patience=3)
bst_model_path = STAMP + '.h5'
model_checkpoint = ModelCheckpoint(bst_model_path, save_best_only=True, save_weights_only=True)

hist = model.fit([data_1_train, data_2_train], labels_train, \
        validation_data=([data_1_val, data_2_val], labels_val, weight_val), \
        epochs=200, batch_size=2048, shuffle=True, \
        class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])

model.load_weights(bst_model_path)
bst_val_score = min(hist.history['val_loss'])

########################################
## make the submission
########################################
print('Start making the submission before fine-tuning')

preds = model.predict([test_data_1, test_data_2], batch_size=8192, verbose=1)
preds += model.predict([test_data_2, test_data_1], batch_size=8192, verbose=1)
preds /= 2

submission = pd.DataFrame({'test_id':test_ids, 'is_duplicate':preds.ravel()})
submission.to_csv('%.4f_'%(bst_val_score)+STAMP+'.csv', index=False)

Decomposable Attention

import numpy as np
import pandas as pd
from keras.layers import *
from keras.activations import softmax
from keras.models import Model
from keras.optimizers import Nadam, Adam
from keras.regularizers import l2
import keras.backend as K


MAX_LEN = 30


def create_pretrained_embedding(pretrained_weights_path, trainable=False, **kwargs):
    "Create embedding layer from a pretrained weights array"
    pretrained_weights = np.load(pretrained_weights_path)
    in_dim, out_dim = pretrained_weights.shape
    embedding = Embedding(in_dim, out_dim, weights=[pretrained_weights], trainable=trainable, **kwargs)
    return embedding


def unchanged_shape(input_shape):
    "Function for Lambda layer"
    return input_shape


def substract(input_1, input_2):
    "Substract element-wise"
    neg_input_2 = Lambda(lambda x: -x, output_shape=unchanged_shape)(input_2)
    out_ = Add()([input_1, neg_input_2])
    return out_


def submult(input_1, input_2):
    "Get multiplication and subtraction then concatenate results"
    mult = Multiply()([input_1, input_2])
    sub = substract(input_1, input_2)
    out_= Concatenate()([sub, mult])
    return out_


def apply_multiple(input_, layers):
    "Apply layers to input then concatenate result"
    if not len(layers) > 1:
        raise ValueError('Layers list should contain more than 1 layer')
    else:
        agg_ = []
        for layer in layers:
            agg_.append(layer(input_))
        out_ = Concatenate()(agg_)
    return out_


def time_distributed(input_, layers):
    "Apply a list of layers in TimeDistributed mode"
    out_ = []
    node_ = input_
    for layer_ in layers:
        node_ = TimeDistributed(layer_)(node_)
    out_ = node_
    return out_


def soft_attention_alignment(input_1, input_2):
    "Align text representation with neural soft attention"
    attention = Dot(axes=-1)([input_1, input_2])
    w_att_1 = Lambda(lambda x: softmax(x, axis=1),
                     output_shape=unchanged_shape)(attention)
    w_att_2 = Permute((2,1))(Lambda(lambda x: softmax(x, axis=2),
                             output_shape=unchanged_shape)(attention))
    in1_aligned = Dot(axes=1)([w_att_1, input_1])
    in2_aligned = Dot(axes=1)([w_att_2, input_2])
    return in1_aligned, in2_aligned


def decomposable_attention(pretrained_embedding='../data/fasttext_matrix.npy', 
                           projection_dim=300, projection_hidden=0, projection_dropout=0.2,
                           compare_dim=500, compare_dropout=0.2,
                           dense_dim=300, dense_dropout=0.2,
                           lr=1e-3, activation='elu', maxlen=MAX_LEN):
    # Based on: https://arxiv.org/abs/1606.01933
    
    q1 = Input(name='q1',shape=(maxlen,))
    q2 = Input(name='q2',shape=(maxlen,))
    
    # Embedding
    embedding = create_pretrained_embedding(pretrained_embedding, 
                                            mask_zero=False)
    q1_embed = embedding(q1)
    q2_embed = embedding(q2)
    
    # Projection
    projection_layers = []
    if projection_hidden > 0:
        projection_layers.extend([
                Dense(projection_hidden, activation=activation),
                Dropout(rate=projection_dropout),
            ])
    projection_layers.extend([
            Dense(projection_dim, activation=None),
            Dropout(rate=projection_dropout),
        ])
    q1_encoded = time_distributed(q1_embed, projection_layers)
    q2_encoded = time_distributed(q2_embed, projection_layers)
    
    # Attention
    q1_aligned, q2_aligned = soft_attention_alignment(q1_encoded, q2_encoded)    
    
    # Compare
    q1_combined = Concatenate()([q1_encoded, q2_aligned, submult(q1_encoded, q2_aligned)])
    q2_combined = Concatenate()([q2_encoded, q1_aligned, submult(q2_encoded, q1_aligned)]) 
    compare_layers = [
        Dense(compare_dim, activation=activation),
        Dropout(compare_dropout),
        Dense(compare_dim, activation=activation),
        Dropout(compare_dropout),
    ]
    q1_compare = time_distributed(q1_combined, compare_layers)
    q2_compare = time_distributed(q2_combined, compare_layers)
    
    # Aggregate
    q1_rep = apply_multiple(q1_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()])
    q2_rep = apply_multiple(q2_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()])
    
    # Classifier
    merged = Concatenate()([q1_rep, q2_rep])
    dense = BatchNormalization()(merged)
    dense = Dense(dense_dim, activation=activation)(dense)
    dense = Dropout(dense_dropout)(dense)
    dense = BatchNormalization()(dense)
    dense = Dense(dense_dim, activation=activation)(dense)
    dense = Dropout(dense_dropout)(dense)
    out_ = Dense(1, activation='sigmoid')(dense)
    
    model = Model(inputs=[q1, q2], outputs=out_)
    model.compile(optimizer=Adam(lr=lr), loss='binary_crossentropy', 
                  metrics=['binary_crossentropy','accuracy'])
    return model


def esim(pretrained_embedding='../data/fasttext_matrix.npy', 
         maxlen=MAX_LEN, 
         lstm_dim=300, 
         dense_dim=300, 
         dense_dropout=0.5):
             
    # Based on arXiv:1609.06038
    q1 = Input(name='q1',shape=(maxlen,))
    q2 = Input(name='q2',shape=(maxlen,))
    
    # Embedding
    embedding = create_pretrained_embedding(pretrained_embedding, mask_zero=False)
    bn = BatchNormalization(axis=2)
    q1_embed = bn(embedding(q1))
    q2_embed = bn(embedding(q2))

    # Encode
    encode = Bidirectional(LSTM(lstm_dim, return_sequences=True))
    q1_encoded = encode(q1_embed)
    q2_encoded = encode(q2_embed)
    
    # Attention
    q1_aligned, q2_aligned = soft_attention_alignment(q1_encoded, q2_encoded)
    
    # Compose
    q1_combined = Concatenate()([q1_encoded, q2_aligned, submult(q1_encoded, q2_aligned)])
    q2_combined = Concatenate()([q2_encoded, q1_aligned, submult(q2_encoded, q1_aligned)]) 
       
    compose = Bidirectional(LSTM(lstm_dim, return_sequences=True))
    q1_compare = compose(q1_combined)
    q2_compare = compose(q2_combined)
    
    # Aggregate
    q1_rep = apply_multiple(q1_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()])
    q2_rep = apply_multiple(q2_compare, [GlobalAvgPool1D(), GlobalMaxPool1D()])
    
    # Classifier
    merged = Concatenate()([q1_rep, q2_rep])
    
    dense = BatchNormalization()(merged)
    dense = Dense(dense_dim, activation='elu')(dense)
    dense = BatchNormalization()(dense)
    dense = Dropout(dense_dropout)(dense)
    dense = Dense(dense_dim, activation='elu')(dense)
    dense = BatchNormalization()(dense)
    dense = Dropout(dense_dropout)(dense)
    out_ = Dense(1, activation='sigmoid')(dense)
    
    model = Model(inputs=[q1, q2], outputs=out_)
    model.compile(optimizer=Adam(lr=1e-3), loss='binary_crossentropy', metrics=['binary_crossentropy','accuracy'])
    return model
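
Neither model builder is actually called in this excerpt. A minimal usage sketch, assuming padded integer inputs data_1/data_2, labels, and test inputs prepared as in the LSTM section above, and that ../data/fasttext_matrix.npy exists:

model = esim(pretrained_embedding='../data/fasttext_matrix.npy', maxlen=MAX_LEN)
model.fit([data_1, data_2], labels,
          validation_split=0.1, epochs=15, batch_size=1024, shuffle=True)
preds = model.predict([test_data_1, test_data_2], batch_size=1024, verbose=1)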

Time-Distributed CNN

# coding: utf-8

# # Predicting Duplicate Questions

# In[5]:

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import datetime, time, json
from string import punctuation

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Dense, Dropout, Reshape, Merge, BatchNormalization, TimeDistributed, Lambda, Activation, LSTM, Flatten, Convolution1D, GRU, MaxPooling1D
from keras.regularizers import l2
from keras.callbacks import Callback, ModelCheckpoint, EarlyStopping
from keras import initializers
from keras import backend as K
from keras.optimizers import SGD
from collections import defaultdict


# In[6]:

train = pd.read_csv("../input/train.csv")[:100]
test = pd.read_csv("../input/test.csv")[:100]


# In[7]:

train.head(6)


# In[8]:

test.head()


# In[9]:

print(train.shape)
print(test.shape)


# In[10]:

# Check for any null values
print(train.isnull().sum())
print(test.isnull().sum())


# In[11]:

# Add the string 'empty' to empty strings
train = train.fillna('empty')
test = test.fillna('empty')


# In[12]:

print(train.isnull().sum())
print(test.isnull().sum())


# In[13]:

# Preview some of the pairs of questions
for i in range(6):
    print(train.question1[i])
    print(train.question2[i])
    print()

# In[14]:

stop_words = ['the','a','an','and','but','if','or','because','as','what','which','this','that','these','those','then',
              'just','so','than','such','both','through','about','for','is','of','while','during','to','What','Which',
              'Is','If','While','This']

# In[191]:

def text_to_wordlist(text, remove_stop_words=True, stem_words=False):
    # Clean the text, with the option to remove stop_words and to stem words.
    
    # Convert words to lower case and split them
    #text = text.lower()

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"what's", "", text)
    text = re.sub(r"What's", "", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"I'm", "I am", text)
    text = re.sub(r" m ", " am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\0k ", "0000 ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e-mail", "email", text)
    text = re.sub(r"\s{2,}", " ", text)
    text = re.sub(r"quikly", "quickly", text)
    text = re.sub(r" usa ", " America ", text)
    text = re.sub(r" USA ", " America ", text)
    text = re.sub(r" u s ", " America ", text)
    text = re.sub(r" uk ", " England ", text)
    text = re.sub(r" UK ", " England ", text)
    text = re.sub(r"india", "India", text)
    text = re.sub(r"china", "China", text)
    text = re.sub(r"chinese", "Chinese", text) 
    text = re.sub(r"imrovement", "improvement", text)
    text = re.sub(r"intially", "initially", text)
    text = re.sub(r"quora", "Quora", text)
    text = re.sub(r" dms ", "direct messages ", text)  
    text = re.sub(r"demonitization", "demonetization", text) 
    text = re.sub(r"actived", "active", text)
    text = re.sub(r"kms", " kilometers ", text)
    text = re.sub(r"KMs", " kilometers ", text)
    text = re.sub(r" cs ", " computer science ", text) 
    text = re.sub(r" upvotes ", " up votes ", text)
    text = re.sub(r" iPhone ", " phone ", text)
    text = re.sub(r"\0rs ", " rs ", text) 
    text = re.sub(r"calender", "calendar", text)
    text = re.sub(r"ios", "operating system", text)
    text = re.sub(r"gps", "GPS", text)
    text = re.sub(r"gst", "GST", text)
    text = re.sub(r"programing", "programming", text)
    text = re.sub(r"bestfriend", "best friend", text)
    text = re.sub(r"dna", "DNA", text)
    text = re.sub(r"III", "3", text) 
    text = re.sub(r"the US", "America", text)
    text = re.sub(r"Astrology", "astrology", text)
    text = re.sub(r"Method", "method", text)
    text = re.sub(r"Find", "find", text) 
    text = re.sub(r"banglore", "Banglore", text)
    text = re.sub(r" J K ", " JK ", text)
    
    # Remove punctuation from text
    text = ''.join([c for c in text if c not in punctuation])
    
    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if not w in stop_words]
        text = " ".join(text)
    
    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)
    
    # Return a list of words
    return(text)


# In[192]:

def process_questions(question_list, questions, question_list_name, dataframe):
    '''transform questions and display progress'''
    for question in questions:
        question_list.append(text_to_wordlist(question))
        if len(question_list) % 100000 == 0:
            progress = len(question_list)/len(dataframe) * 100
            print("{} is {}% complete.".format(question_list_name, round(progress, 1)))


# In[193]:

train_question1 = []
process_questions(train_question1, train.question1, 'train_question1', train)


# In[194]:

train_question2 = []
process_questions(train_question2, train.question2, 'train_question2', train)


# In[165]:

test_question1 = []
process_questions(test_question1, test.question1, 'test_question1', test)


# In[166]:

test_question2 = []
process_questions(test_question2, test.question2, 'test_question2', test)


# In[195]:

# Preview some transformed pairs of questions
i = 0
for i in range(i,i+10):
    print(train_question1[i])
    print(train_question2[i])
    print()

# In[168]:

# Find the length of questions
lengths = []
for question in train_question1:
    lengths.append(len(question.split()))

for question in train_question2:
    lengths.append(len(question.split()))

# Create a dataframe so that the values can be inspected
lengths = pd.DataFrame(lengths, columns=['counts'])


# In[169]:

lengths.counts.describe()


# In[170]:

print(np.percentile(lengths.counts, 99.0))
print(np.percentile(lengths.counts, 99.4))
print(np.percentile(lengths.counts, 99.5))
print(np.percentile(lengths.counts, 99.9))


# In[171]:

# tokenize the words for all of the questions
all_questions = train_question1 + train_question2 + test_question1 + test_question2
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_questions)
print("Fitting is complete.")
train_question1_word_sequences = tokenizer.texts_to_sequences(train_question1)
print("train_question1 is complete.")
train_question2_word_sequences = tokenizer.texts_to_sequences(train_question2)
print("train_question2 is complete")


# In[172]:

test_question1_word_sequences = tokenizer.texts_to_sequences(test_question1)
print("test_question1 is complete.")
test_question2_word_sequences = tokenizer.texts_to_sequences(test_question2)
print("test_question2 is complete.")


# In[173]:

word_index = tokenizer.word_index
print("Words in index: %d" % len(word_index))


# In[174]:

# Pad the questions so that they all have the same length.

max_question_len = 36

train_q1 = pad_sequences(train_question1_word_sequences, 
                              maxlen = max_question_len)
print("train_q1 is complete.")

train_q2 = pad_sequences(train_question2_word_sequences, 
                              maxlen = max_question_len)
print("train_q2 is complete.")


# In[175]:

test_q1 = pad_sequences(test_question1_word_sequences, 
                             maxlen = max_question_len,
                             padding = 'post',
                             truncating = 'post')
print("test_q1 is complete.")

test_q2 = pad_sequences(test_question2_word_sequences, 
                             maxlen = max_question_len,
                             padding = 'post',
                             truncating = 'post')
print("test_q2 is complete.")


# In[30]:

y_train = train.is_duplicate


# In[31]:

# Load GloVe to use pretrained vectors

# Note for Kaggle users: Uncomment this - it couldn't be used on Kaggle

# From this link: https://nlp.stanford.edu/projects/glove/
#embeddings_index = {}
#with open('glove.840B.300d.txt', encoding='utf-8') as f:
#    for line in f:
#        values = line.split(' ')
#        word = values[0]
#        embedding = np.asarray(values[1:], dtype='float32')
#        embeddings_index[word] = embedding
#
#print('Word embeddings:', len(embeddings_index)) #151,250


# In[176]:

# Need to use 300 for embedding dimensions to match GloVe's vectors.
embedding_dim = 300

# Note for Kaggle users: Uncomment this too, because it relate to the code for GloVe.

nb_words = len(word_index)
#word_embedding_matrix = np.zeros((nb_words + 1, embedding_dim))
#for word, i in word_index.items():
#    embedding_vector = embeddings_index.get(word)
#    if embedding_vector is not None:
#        # words not found in embedding index will be all-zeros.
#        word_embedding_matrix[i] = embedding_vector
#
#print('Null word embeddings: %d' % np.sum(np.sum(word_embedding_matrix, axis=1) == 0)) #75,334


# In[177]:

units = 128 # Number of nodes in the Dense layers
dropout = 0.25 # Percentage of nodes to drop
nb_filter = 32 # Number of filters to use in Convolution1D
filter_length = 3 # Length of filter for Convolution1D
# Initialize weights and biases for the Dense layers
weights = initializers.TruncatedNormal(mean=0.0, stddev=0.05, seed=2)
bias = 'zeros' # Initializer for the Dense layers' biases

model1 = Sequential()
model1.add(Embedding(nb_words + 1,
                     embedding_dim,
                     #weights = [word_embedding_matrix], Commented out for Kaggle
                     input_length = max_question_len,
                     trainable = False))

model1.add(Convolution1D(filters = nb_filter, 
                         kernel_size = filter_length, 
                         padding = 'same'))
model1.add(BatchNormalization())
model1.add(Activation('relu'))
model1.add(Dropout(dropout))

model1.add(Convolution1D(filters = nb_filter, 
                         kernel_size = filter_length, 
                         padding = 'same'))
model1.add(BatchNormalization())
model1.add(Activation('relu'))
model1.add(Dropout(dropout))

model1.add(Flatten())



model2 = Sequential()
model2.add(Embedding(nb_words + 1,
                     embedding_dim,
                     #weights = [word_embedding_matrix],
                     input_length = max_question_len,
                     trainable = False))

model2.add(Convolution1D(filters = nb_filter, 
                         kernel_size = filter_length, 
                         padding = 'same'))
model2.add(BatchNormalization())
model2.add(Activation('relu'))
model2.add(Dropout(dropout))

model2.add(Convolution1D(filters = nb_filter, 
                         kernel_size = filter_length, 
                         padding = 'same'))
model2.add(BatchNormalization())
model2.add(Activation('relu'))
model2.add(Dropout(dropout))

model2.add(Flatten())



model3 = Sequential()
model3.add(Embedding(nb_words + 1,
                     embedding_dim,
                     #weights = [word_embedding_matrix],
                     input_length = max_question_len,
                     trainable = False))
model3.add(TimeDistributed(Dense(embedding_dim)))
model3.add(BatchNormalization())
model3.add(Activation('relu'))
model3.add(Dropout(dropout))
model3.add(Lambda(lambda x: K.max(x, axis=1), output_shape=(embedding_dim, )))



model4 = Sequential()
model4.add(Embedding(nb_words + 1,
                     embedding_dim,
                     #weights = [word_embedding_matrix],
                     input_length = max_question_len,
                     trainable = False))

model4.add(TimeDistributed(Dense(embedding_dim)))
model4.add(BatchNormalization())
model4.add(Activation('relu'))
model4.add(Dropout(dropout))
model4.add(Lambda(lambda x: K.max(x, axis=1), output_shape=(embedding_dim, )))




modela = Sequential()
modela.add(Merge([model1, model2], mode='concat'))
modela.add(Dense(units*2, kernel_initializer=weights, bias_initializer=bias))
modela.add(BatchNormalization())
modela.add(Activation('relu'))
modela.add(Dropout(dropout))

modela.add(Dense(units, kernel_initializer=weights, bias_initializer=bias))
modela.add(BatchNormalization())
modela.add(Activation('relu'))
modela.add(Dropout(dropout))

modelb = Sequential()
modelb.add(Merge([model3, model4], mode='concat'))
modelb.add(Dense(units*2, kernel_initializer=weights, bias_initializer=bias))
modelb.add(BatchNormalization())
modelb.add(Activation('relu'))
modelb.add(Dropout(dropout))

modelb.add(Dense(units, kernel_initializer=weights, bias_initializer=bias))
modelb.add(BatchNormalization())
modelb.add(Activation('relu'))
modelb.add(Dropout(dropout))



model = Sequential()
model.add(Merge([modela, modelb], mode='concat'))
model.add(Dense(units*2, kernel_initializer=weights, bias_initializer=bias))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(units, kernel_initializer=weights, bias_initializer=bias))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(units, kernel_initializer=weights, bias_initializer=bias))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(dropout))

model.add(Dense(1, kernel_initializer=weights, bias_initializer=bias))
model.add(BatchNormalization())
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


# In[178]:

# save the best weights for predicting the test question pairs
save_best_weights = 'question_pairs_weights.h5'

t0 = time.time()
callbacks = [ModelCheckpoint(save_best_weights, monitor='val_loss', save_best_only=True),
             EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')]
history = model.fit([train_q1, train_q2, train_q1, train_q2],
                    y_train,
                    batch_size=256,
                    epochs=2, #Use 100, I reduce it for Kaggle,
                    validation_split=0.15,
                    verbose=True,
                    shuffle=True,
                    callbacks=callbacks)
t1 = time.time()
print("Minutes elapsed: %f" % ((t1 - t0) / 60.))


# In[179]:

# Aggregate the summary statistics
summary_stats = pd.DataFrame({'epoch': [ i + 1 for i in history.epoch ],
                              'train_acc': history.history['acc'],
                              'valid_acc': history.history['val_acc'],
                              'train_loss': history.history['loss'],
                              'valid_loss': history.history['val_loss']})


# In[180]:

summary_stats


# In[181]:

plt.plot(summary_stats.train_loss) # blue
plt.plot(summary_stats.valid_loss) # green
plt.show()


# In[182]:

# Find the minimum validation loss during the training
min_loss, idx = min((loss, idx) for (idx, loss) in enumerate(history.history['val_loss']))
print('Minimum loss at epoch', '{:d}'.format(idx+1), '=', '{:.4f}'.format(min_loss))
min_loss = round(min_loss, 4)


# In[183]:

# Make predictions with the best weights
model.load_weights(save_best_weights)
predictions = model.predict([test_q1, test_q2, test_q1, test_q2], verbose = True)


# In[184]:

#Create submission
submission = pd.DataFrame(predictions, columns=['is_duplicate'])
submission.insert(0, 'test_id', test.test_id)
file_name = 'submission_{}.csv'.format(min_loss)
submission.to_csv(file_name, index=False)


# In[185]:

submission.head(10)

Reference

1 https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb
