Delete rows containing a string from a file using Python

I have a CSV file:

    ID,"address","used_at","active_seconds","pageviews"
    0a1d796327284ebb443f71d85cb37db9,"vk.com",2016-01-29 22:10:52,3804,115
    0a1d796327284ebb443f71d85cb37db9,"2gis.ru",2016-01-29 22:48:52,214,24
    0a1d796327284ebb443f71d85cb37db9,"yandex.ru",2016-01-29 22:14:30,4,2
    0a1d796327284ebb443f71d85cb37db9,"worldoftanks.ru",2016-01-29 22:10:30,41,2

I need to delete the rows that contain certain words. There are 117 words.

I tried:

    for line in df:
        if 'yandex.ru' in line:
            df = df.replace(line, '')

But with 117 words this is too slow. Afterwards I also build a pivot_table, and the words I tried to delete still show up as columns:

                                    aaa  10ruslake.ru  youtube.ru  1tv.ru  24open.ru
    0  0025977ab2998580d4559af34cc66a4e             0           0      34         43
    1  00c651e018cbcc8fe7aa57492445c7a2           230           0       0         23
    2  0120bc30e78ba5582617a9f3d6dfd8ca            12           0       0          0
    3  01249e90ed8160ddae82d2190449b773            25           0      13         25

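Such a column contains only 0.

How can I do this faster, and remove the rows so that these words do not end up as columns?

IIUC, you can use isin with boolean indexing: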

    print(df)
                                     ID          address              used_at  \
    0  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52
    1  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52
    2  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52
    3  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30
    4  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30

       active_seconds  pageviews
    0            3804        115
    1            3804        115
    2             214         24
    3               4          2
    4              41          2

    words = ['vk.com', 'yandex.ru']

    # True marks the rows to keep (address not in the word list)
    print(~df.address.isin(words))
    0    False
    1    False
    2     True
    3    False
    4     True
    Name: address, dtype: bool

    print(df[~df.address.isin(words)])
                                     ID          address              used_at  \
    2  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52
    4  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30

       active_seconds  pageviews
    2             214         24
    4              41          2
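With 117 words, the list passed to `isin` does not have to be typed by hand. The following is only a sketch: it assumes the words are available as text with one word per line (the `words_text` string is a hypothetical stand-in for such a file), and it reuses the sample rows from the question.

```python
import io
import pandas as pd

# Sample rows taken from the question's CSV.
csv_text = '''ID,"address","used_at","active_seconds","pageviews"
0a1d796327284ebb443f71d85cb37db9,"vk.com",2016-01-29 22:10:52,3804,115
0a1d796327284ebb443f71d85cb37db9,"2gis.ru",2016-01-29 22:48:52,214,24
0a1d796327284ebb443f71d85cb37db9,"yandex.ru",2016-01-29 22:14:30,4,2
0a1d796327284ebb443f71d85cb37db9,"worldoftanks.ru",2016-01-29 22:10:30,41,2
'''
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical word list: in practice this would hold all 117 entries,
# for example read from a text file with one word per line.
words_text = "vk.com\nyandex.ru\n"
words = [w.strip() for w in words_text.splitlines() if w.strip()]

# A single vectorized membership test replaces one loop per word.
filtered = df[~df["address"].isin(words)]
print(filtered["address"].tolist())
```

Because the whole list is checked in one pass over the column, the cost should grow far more slowly with the number of words than the line-by-line loop in the question.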

Then use pivot:

    print(df[~df.address.isin(words)].pivot(index='ID', columns='address', values='pageviews'))
    address                           2gis.ru  worldoftanks.ru
    ID
    0a1d796327284ebb443f71d85cb37db9       24                2
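One caveat worth illustrating: `pivot` requires every (ID, address) pair to be unique, while `pivot_table` aggregates duplicates. The sketch below uses a shortened, hypothetical ID (`"a"`) and a duplicated `vk.com` row like the one in the sample data; summing with `aggfunc="sum"` is an assumption about what the asker wants, not something stated in the question.

```python
import pandas as pd

# Hypothetical miniature of the data, with a duplicated (ID, address) pair.
df = pd.DataFrame({
    "ID": ["a", "a", "a"],
    "address": ["vk.com", "vk.com", "2gis.ru"],
    "pageviews": [115, 115, 24],
})

# pivot() cannot reshape when an (ID, address) pair occurs twice.
try:
    df.pivot(index="ID", columns="address", values="pageviews")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead; here page views are summed.
table = df.pivot_table(index="ID", columns="address",
                       values="pageviews", aggfunc="sum")
print(pivot_failed)
print(table)
```

If the filtered data can still contain repeated pairs, `pivot_table` is the safer of the two calls.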

Another solution is to drop the rows where some column is 0 (such as pageviews):

    print(df)
                                     ID          address              used_at  \
    0  0a1d796327284ebb443f71d85cb37db9       youtube.ru  2016-01-29 22:10:52
    1            0a1d796327284ebfsffsdf       youtube.ru  2016-01-29 22:10:52
    2  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52
    3  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52
    4  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30
    5  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30

       active_seconds  pageviews
    0            3804          0
    1            3804          0
    2            3804        115
    3             214         24
    4               4          2
    5              41          2

    # True marks the rows to keep (non-zero pageviews)
    print(df.pageviews != 0)
    0    False
    1    False
    2     True
    3     True
    4     True
    5     True
    Name: pageviews, dtype: bool

    print(df[(df.pageviews != 0)])
                                     ID          address              used_at  \
    2  0a1d796327284ebb443f71d85cb37db9           vk.com  2016-01-29 22:10:52
    3  0a1d796327284ebb443f71d85cb37db9          2gis.ru  2016-01-29 22:48:52
    4  0a1d796327284ebb443f71d85cb37db9        yandex.ru  2016-01-29 22:14:30
    5  0a1d796327284ebb443f71d85cb37db9  worldoftanks.ru  2016-01-29 22:10:30

       active_seconds  pageviews
    2            3804        115
    3             214         24
    4               4          2
    5              41          2

    print(df[(df.pageviews != 0)].pivot_table(index='ID', columns='address', values='pageviews'))
    address                           2gis.ru  vk.com  worldoftanks.ru  yandex.ru
    ID
    0a1d796327284ebb443f71d85cb37db9       24     115                2          2
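If the pivot table has already been built, the all-zero columns can also be removed from it directly. This is a sketch of a different technique from the answer above: it rebuilds the table shown in the question by hand (values copied from there) and keeps only the columns that have at least one non-zero entry.

```python
import pandas as pd

# The pivot table from the question, rebuilt by hand for illustration.
pivot = pd.DataFrame({
    "aaa": [
        "0025977ab2998580d4559af34cc66a4e",
        "00c651e018cbcc8fe7aa57492445c7a2",
        "0120bc30e78ba5582617a9f3d6dfd8ca",
        "01249e90ed8160ddae82d2190449b773",
    ],
    "10ruslake.ru": [0, 230, 12, 25],
    "youtube.ru": [0, 0, 0, 0],
    "1tv.ru": [34, 0, 0, 13],
    "24open.ru": [43, 23, 0, 25],
}).set_index("aaa")

# For each column, check whether any row holds a non-zero value.
has_nonzero = (pivot != 0).any(axis=0)

# Keep only the columns that passed the check.
cleaned = pivot.loc[:, has_nonzero]
print(cleaned.columns.tolist())
```

This only hides columns that happen to be empty; filtering the rows before pivoting, as shown earlier, is what actually removes the unwanted words.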

As far as I know, the fastest way to work with CSV files is to load them into a dataframe with the Pandas package.

    import pandas as pd

    df = pd.read_csv(the_path_of_your_file, header=0)
    # .ix has been removed from pandas; .loc does the same label-based assignment
    df.loc[df['address'] == 'yandex.ru', 'address'] = ''

This replaces the cells containing "yandex.ru" with an empty string. You can then write it back to CSV:

    # index=False keeps the row index from being written as an extra column
    df.to_csv(the_path_of_your_file, index=False)

If what you want is to delete the rows with that URL, use:

    df = df.drop(df[df.address == 'yandex.ru'].index)
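The `drop` call above matches one exact URL. If the goal is to remove rows whose address merely contains one of the words, a substring match may be closer to what the question asks. A minimal sketch, assuming made-up addresses such as `m.vk.com` and `yandex.ru/maps` (they are not in the original data) and combining the words into one escaped regular expression:

```python
import re
import pandas as pd

# Hypothetical rows: two of the addresses only contain a blocked word.
df = pd.DataFrame({
    "address": ["vk.com", "m.vk.com", "2gis.ru", "yandex.ru/maps"],
    "pageviews": [115, 3, 24, 2],
})
words = ["vk.com", "yandex.ru"]

# Escape each word so that '.' is matched literally, then join with '|'.
pattern = "|".join(re.escape(w) for w in words)

# True where the address contains any of the words as a substring.
mask = df["address"].str.contains(pattern, regex=True)
kept = df[~mask]
print(kept["address"].tolist())
```

For exact matches, `isin` remains the simpler choice; the regular expression is only needed when partial matches should be removed as well.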