Dynamically adding columns while processing data in a Python pandas DataFrame

I have the following problem.

Say this is my CSV:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

So I have rows that can be grouped by id.

I want to produce a CSV like the following as output:

f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4  5  5  3    1    0    7      4      4      1    4    6

So I want to be able to choose how many rows get turned into columns (always starting from the first row of an id). In this case I grab 3 rows.

Then I also skip one or more rows (here just one) so that the last set of columns comes from the last row of the same id group. For various reasons I want to use a DataFrame.

After 3-4 hours of struggling I came up with the solution given below.

But my solution is very slow. I have about 700,000 rows and probably around 70,000 id groups. On my 4 GB, 4-core Lenovo the code below takes almost an hour with model = 3, and I need to get to model = 10 or maybe 15. I am still new to Python, and I believe the code can be changed to speed things up. Can someone explain in depth how I can improve it?

Many thanks.

model: the number of rows to grab

import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)

# Get groups of rows with same id
csv_by_id = train.groupby('id')

modelTarget = {'f1_t', 'f2_t', 'f3_t'}

# modelFeatures is a list of features I am interested in the csv.
# The csv actually has hundreds
modelFeatures = {'f1', 'f2', 'f3'}

coreFeatures = list(modelFeatures)      # cloning
selectedFeatures = list(modelFeatures)  # cloning
newFeatures = list(selectedFeatures)    # cloning
finalFeatures = list(selectedFeatures)  # cloning

# Now create the column list depending on the number of rows I will grab from
for x in range(2, model + 1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures

# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget)

# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)

for id_group in csv_by_id:
    # id_group is a tuple with first element as the id itself
    # and second one a dataframe with the rows of a group
    group_data = id_group[1]

    # hmm - can this be better? I am picking up the rows which I need
    # from the first row onwards
    df = group_data[coreFeatures][0:model]

    # initialize a list
    tmp = []

    # now keep adding the column values into the list
    for index, row in df.iterrows():
        tmp = tmp + list(row)

    # Wow, this one below surely should have something better.
    # So I am picking up the feature column values from the last row
    # of the group of rows for a particular id
    targetValues = group_data[list({'f1','f2','f3'})][len(group_data.index)-1:len(group_data.index)].values

    # Think this can be done easier too? Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten())

    # converting the list to a dict.
    tmpDict = dict(zip(selectedFeatures, tmp))

    # then the dict to a dataframe.
    tmpDf = pd.DataFrame(tmpDict, index=[1])

    # I just could not find a better way of adding a dict or list directly
    # into a dataframe. And I went through lots and lots of blogs on this
    # topic, including some in StackOverflow.
    # Finally I add the frame to my main frame
    model_data = model_data.append(tmpDf)

# and write it
model_data.to_csv(wd + 'model_data' + str(model) + '.csv', index=False)

Solution:

This will scale nicely; it is only a small constant in the number of features, and roughly O(number of groups).

In [28]: features = ['f1','f2','f3']

Create some test data: groups of size 7-12, 70k groups in total

In [29]: def create_df(i):
   ....:     l = np.random.randint(7,12)
   ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
   ....:     df['A'] = i
   ....:     return df
   ....:

In [30]: df = concat([ create_df(i) for i in xrange(70000) ])
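(As an aside, not part of the original answer: this transcript is from Python 2 and an old pandas, with DataFrame, Series, and concat already imported into the session. Under Python 3 the equivalent setup would be roughly the following.)

import numpy as np
from pandas import DataFrame, Series, concat  # the bare names used in this transcript
# and replace xrange(70000) with range(70000), since Python 3 has no xrange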

In [39]: df.info()

Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1    629885 non-null int64
f2    629885 non-null int64
f3    629885 non-null int64
A     629885 non-null int64
dtypes: int64(4)

Create a frame where you select the first 3 rows and the last row of each group (note that this will handle groups of size < 4, but then your last row may overlap with one of the head rows; you may wish to use a groupby.filter to remedy this, as in the sketch below):

In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()

# This step is necessary in pandas < master/0.14 as the returned fields
# will include the grouping field (the A) (a bug/API issue)

In [33]: groups = groups[features]
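If overlapping rows from short groups are a concern, a minimal pre-filter along the lines hinted at above (my sketch, not from the original answer; it assumes the 3 + 1 selection here, hence a cutoff of 4 rows) could be applied to df before taking the heads and tails:

# Keep only groups with at least 4 rows, so that tail(1) can never
# return a row already covered by head(3).
df = df.groupby('A').filter(lambda g: len(g) >= 4)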

In [34]: groups.head(20)
Out[34]:
     f1  f2  f3
A
0 0   0   0   0
  1   1   1   1
  2   2   2   2
  7   7   7   7
1 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9
2 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
3 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
4 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9

[20 rows x 3 columns]

In [38]: groups.info()

MultiIndex: 280000 entries, (0, 0) to (69999, 9)
Data columns (total 3 columns):
f1    280000 non-null int64
f2    280000 non-null int64
f3    280000 non-null int64
dtypes: int64(3)

And it is quite fast:

In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()

1 loops, best of 3: 1.16 s per loop

For further manipulation you would usually stop here and work with this directly (it is in a nicely grouped format that is easy to handle).

If you want to translate it into wide format:

In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))

In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
1 loops, best of 3: 14.5 s per loop
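If that apply is too slow, a possibly faster alternative (my own sketch, not from the original answer) is to reshape the underlying array directly. This is only valid when every group contributes exactly 4 rows (guaranteed by the filter sketch above) and the frame is sorted by group; the column renaming below applies either way:

# Hypothetical vectorized pivot: each consecutive block of 4 rows is one
# group, so a row-major reshape yields the same 12 values per id as apply().
vals = groups.values.reshape(len(groups) // 4, 4 * len(features))
dfg = DataFrame(vals, index=groups.index.get_level_values(0)[::4])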

In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]

In [41]: dfg.head()
Out[41]:
   f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
A
0     0     0     0     1     1     1     2     2     2     7     7     7
1     0     0     0     1     1     1     2     2     2     9     9     9
2     0     0     0     1     1     1     2     2     2     8     8     8
3     0     0     0     1     1     1     2     2     2     8     8     8
4     0     0     0     1     1     1     2     2     2     9     9     9

[5 rows x 12 columns]

In [42]: dfg.info()

Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1    70000 non-null int64
f2_1    70000 non-null int64
f3_1    70000 non-null int64
f1_2    70000 non-null int64
f2_2    70000 non-null int64
f3_2    70000 non-null int64
f1_3    70000 non-null int64
f2_3    70000 non-null int64
f3_3    70000 non-null int64
f1_4    70000 non-null int64
f2_4    70000 non-null int64
f3_4    70000 non-null int64
dtypes: int64(12)
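Putting the answer together with the question's original data, a minimal end-to-end sketch (my consolidation, assuming the CSV has an id column as in the question, model = 3, and the answer's numbered-suffix column names rather than the question's _n/_t names; filename is a placeholder from the question) might look like:

import pandas as pd

model = 3
features = ['f1', 'f2', 'f3']

train = pd.read_csv(filename)  # placeholder, as in the question

# drop ids that cannot supply `model` leading rows plus a distinct last row
train = train.groupby('id').filter(lambda g: len(g) > model)

# first `model` rows and the last row of every id group, in original order
groups = pd.concat([train.groupby('id').head(model),
                    train.groupby('id').tail(1)]).sort_index()
groups = groups.set_index('id')[features]

# one wide row per id
model_data = groups.groupby(level=0).apply(lambda x: pd.Series(x.values.ravel()))
model_data.columns = ['{0}_{1}'.format(f, i)
                      for i in range(1, model + 2) for f in features]

model_data.to_csv('model_data{0}.csv'.format(model))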

Tags: python, pandas, dataframe, data-processing

Source: https://codeday.me/bug/20191013/1906664.html
