Python Pandas: How to convert a list of pairwise mappings to row-vector format?

Question

I have a two-column DataFrame: the first column is the customer, and the second is a city that customer has visited. The DataFrame looks like this:

print(df)

    customer    visited_city
0   John        London
1   Mary        Melbourne
2   Steve       Paris
3   John        New_York
4   Peter       New_York
5   Mary        London
6   John        Melbourne
7   John        New_York

I would like to convert this DataFrame into a row-vector format, in which each row represents a unique customer and the row vector indicates which cities that customer has visited.

print(wide_format_df)

          London  Melbourne  New_York  Paris
John      1.0        1.0       1.0      0.0
Mary      1.0        1.0       0.0      0.0
Steve     0.0        0.0       0.0      1.0
Peter     0.0        0.0       1.0      0.0

Below is the code I used to generate the wide format. It iterates through the customers one by one. Is there a more efficient way to do this?

import pandas as pd
import numpy as np

# Sorted array of all distinct cities; these become the columns.
UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()

X = []
for customer in unique_customers:
    x = np.zeros(p)
    # Distinct cities for this customer, mapped to their column positions.
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
    x[visited_idx] = 1
    X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df

Recommended answer

Please note that the question was edited after answers were posted, so the answers given before the edit no longer answer it. They need to be adjusted to return 1 for John in New_York, even though he has been there twice.
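One way to make a count-based answer fit the edited question is to cap the counts at 1. The following is only a minimal sketch: it assumes the sample data from the question is rebuilt by hand, and it uses pd.crosstab as the counting step purely for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# Raw visit counts: John / New_York is 2 because he appears there twice.
counts = pd.crosstab(df["customer"], df["visited_city"])

# Cap every count at 1 to turn the counts into visited / not-visited flags.
binary = counts.clip(upper=1)

print(counts.loc["John", "New_York"], binary.loc["John", "New_York"])
```

The same clip step is what the options below append after their own counting step.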

Option 1 pir1
I like this answer because I think it's elegant.

pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
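To see why the dot product works, it may help to split the one-liner into steps. This is a sketch under two assumptions: the sample data is rebuilt by hand, and dtype=int is passed to get_dummies so that the indicator columns are integers (newer pandas versions return booleans by default, which would change what the product computes).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# One indicator column per customer / per city, one row per original record.
cust = pd.get_dummies(df["customer"], dtype=int)
city = pd.get_dummies(df["visited_city"], dtype=int)

# (customers x records) . (records x cities) sums, for each pair,
# the number of records where both indicators are 1, i.e. the visit count.
counts = cust.T.dot(city)

# Clip so that repeated visits still show up as 1.
indicator = counts.clip(0, 1)
print(indicator)
```

Before clipping, the John / New_York cell holds 2, which is why the trailing clip(0, 1) is needed.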


Option 2 pir2
This answer should be fast.

i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1

pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1
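The factorize step is what makes this option work: it turns each label into an integer position, so the pair (i, j) can index straight into the output array. Below is a small sketch, assuming the sample data is rebuilt by hand, that prints those intermediate values.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# codes give the position of each label in the array of unique labels,
# numbered in order of first appearance.
i, r = pd.factorize(df["customer"].to_numpy())
j, c = pd.factorize(df["visited_city"].to_numpy())
print(i.tolist(), list(r))
print(j.tolist(), list(c))

# Assigning 1 at (i, j) is idempotent, so a repeated pair such as
# (John, New_York) still ends up as a single 1 and no clip is needed.
b = np.zeros((r.size, c.size), dtype=int)
b[i, j] = 1

wide = pd.DataFrame(b, index=r, columns=c).sort_index().sort_index(axis=1)
print(wide)
```

Because the labels come out in order of first appearance, the two sort_index calls are only there to give the same row and column order as the other options.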


Option 3 pir3
Practical and pretty quick.

df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

visited_city  London  Melbourne  New_York  Paris
customer                                        
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1
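The groupby version can also be read step by step. The sketch below assumes the sample data is rebuilt by hand; the last two lines, which drop the axis names so the frame looks like the layout requested in the question, are an optional addition and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

# Series indexed by (customer, visited_city) holding the number of records per pair.
pair_counts = df.groupby(["customer", "visited_city"]).size()

# Move the city level into the columns; pairs that never occur become 0,
# then clip so repeated visits count once.
wide = pair_counts.unstack(fill_value=0).clip(0, 1)

# Optional: remove the "customer" / "visited_city" axis labels.
wide.index.name = None
wide.columns.name = None
print(wide)
```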


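Timing
The table shows each method's run time as a multiple of the fastest method's time for that input size. The row labels are the number of times the sample df was repeated. The code that produced it follows the table.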

# Multiples of Minimum time
#
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856

from timeit import timeit

import numpy as np
import pandas as pd

wen = lambda d: d.pivot_table(index='customer', columns='visited_city', aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    n, m = r.size, c.size
    b = np.zeros((n, m), dtype=int)
    b[i, j] = 1

    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

# Divide each row by its minimum so the fastest method in that row reads 1.0.
print(results.div(results.min(axis=1), axis=0))

results.plot(loglog=True)
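A timing comparison is only meaningful if the functions being timed return the same thing. The following sketch, which assumes the sample data is rebuilt by hand and passes dtype=int to get_dummies for newer pandas versions, checks that the three pir options agree on that sample. The wen and vai functions are left out because, as written above, they return visit counts rather than 0/1 flags.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer": ["John", "Mary", "Steve", "John", "Peter", "Mary", "John", "John"],
    "visited_city": ["London", "Melbourne", "Paris", "New_York",
                     "New_York", "London", "Melbourne", "New_York"],
})

def pir1(d):
    # dtype=int is an assumption added for newer pandas versions.
    cust = pd.get_dummies(d.customer, dtype=int)
    city = pd.get_dummies(d.visited_city, dtype=int)
    return cust.T.dot(city).clip(0, 1)

def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    b = np.zeros((r.size, c.size), dtype=int)
    b[i, j] = 1
    return pd.DataFrame(b, r, c).sort_index().sort_index(axis=1)

def pir3(d):
    return d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

# All three sort customers and cities alphabetically, so the raw arrays can be compared.
out1, out2, out3 = (f(df).to_numpy() for f in (pir1, pir2, pir3))
print(np.array_equal(out1, out2) and np.array_equal(out2, out3))
```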
