Pandas taking too much time and memory when working with an Excel file

I am trying to work with an Excel sheet of fewer than 50k rows. What I want to do is this: for a specific column, get all of its unique values, and then, for each unique value, collect every row that contains it and put them in this format:

[{ "unique_field_value": [Array containing row data that match the unique value as dictionaries] },] 

The thing is, when I test with a small number of rows, say 1000, everything runs fine. As the row count grows, memory usage grows with it until it cannot grow any more and my computer freezes. So, is there something I am doing wrong with pandas? Here are the details of my platform:

    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=16.04
    DISTRIB_CODENAME=xenial
    DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
    NAME="Ubuntu"
    VERSION="16.04.3 LTS (Xenial Xerus)"
    ID_LIKE=debian
    VERSION_ID="16.04"

Here is the code I am running in a Jupyter Notebook:

    import pandas as pd
    import simplejson
    import datetime

    def datetime_handler(x):
        if isinstance(x, datetime.datetime):
            return x.isoformat()
        raise TypeError("Type not Known")

    path = "/home/misachi/Downloads/new members/my_file.xls"

    df = pd.read_excel(path, index_col=None, skiprows=[0])
    df = df.dropna(thresh=5)
    df2 = df.drop_duplicates(subset=['corporate'])

    schemes = df2['corporate'].values
    result_list = []
    result_dict = {}

    for count, name in enumerate(schemes):
        inner_dict = {}
        col_val = schemes[count]
        foo = df['corporate'] == col_val
        data = df[foo].to_json(orient='records', date_format='iso')
        result_dict[name] = simplejson.loads(data)
        result_list.append(result_dict)
        # print(result_list)
        # if count == 3:
        #     break

    dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)

    with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
        json_f.write(dumped)

EDIT

Here is a sample of the expected output:

 [{ "TABBY MEMORIAL CATHEDRAL": [{ "corp_id": 8494, "smart": null, "copay": null, "corporate": "TABBY MEMORIAL CATHEDRAL", "category": "CAT A", "member_names": "Brian Maombi", "member_no": "84984", "start_date": "2017-03-01T00:00:00.000Z", "end_date": "2018-02-28T00:00:00.000Z", "outpatient": "OUTPATIENT" }, { "corp_id": 8494, "smart": null, "copay": null, "corporate": "TABBY MEMORIAL CATHEDRAL", "category": "CAT A", "member_names": "Omula Peter", "member_no": "4784984", "start_date": "2017-03-01T00:00:00.000Z", "end_date": "2018-02-28T00:00:00.000Z", "outpatient": "OUTPATIENT" }], "CHECKIFY KENYA LTD": [{ "corp_id": 7489, "smart": "SMART", "copay": null, "corporate": "CHECKIFY KENYA LTD", "category": "CAT A", "member_names": "BENARD KONYI", "member_no": "ABB/8439", "start_date": "2017-08-01T00:00:00.000Z", "end_date": "2018-07-31T00:00:00.000Z", "outpatient": "OUTPATIENT" }, { "corp_id": 7489, "smart": "SMART", "copay": null, "corporate": "CHECKIFY KENYA LTD", "category": "CAT A", "member_names": "KEVIN WACHAI", "member_no": "ABB/67484", "start_date": "2017-08-01T00:00:00.000Z", "end_date": "2018-07-31T00:00:00.000Z", "outpatient": "OUTPATIENT" }] }] 

The complete and clean code is:

    import os
    import pandas as pd
    import simplejson
    import datetime

    def datetime_handler(x):
        if isinstance(x, datetime.datetime):
            return x.isoformat()
        raise TypeError("Unknown type")

    def work_on_data(filename):
        if not os.path.isfile(filename):
            raise IOError
        df = pd.read_excel(filename, index_col=None, skiprows=[0])
        df = df.dropna(thresh=5)
        result_list = [{n: g.to_dict('records')} for n, g in df.groupby('corporate')]
        dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)
        return dumped

    dumped = work_on_data('/home/misachi/Downloads/new members/my_file.xls')

    with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
        json_f.write(dumped)
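If the dumped string itself becomes large, one option is to write each group out as it is produced instead of building the whole list first. Below is a minimal sketch, assuming the same imports, datetime_handler, and 'corporate' column as in the code above; write_groups is just an illustrative name:

    def write_groups(filename, out_path):
        df = pd.read_excel(filename, index_col=None, skiprows=[0])
        df = df.dropna(thresh=5)
        with open(out_path, 'w') as json_f:
            json_f.write('[')
            for i, (name, group) in enumerate(df.groupby('corporate')):
                if i:
                    json_f.write(',')
                # Dump one corporate at a time so only a single group's
                # JSON string is held in memory at once.
                json_f.write(simplejson.dumps({name: group.to_dict('records')},
                                              ignore_nan=True,
                                              default=datetime_handler))
            json_f.write(']')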

To get the dictionary:

 result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}] 
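For example, that dictionary form can be serialized the same way as before to produce output shaped like the expected sample (a small sketch reusing the df, simplejson, and datetime_handler defined earlier):

    result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}]
    dumped = simplejson.dumps(result_dict, ignore_nan=True, default=datetime_handler)
    with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
        json_f.write(dumped)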

Specify the chunksize=10000 parameter with read_excel() and loop through the file until you reach the end of the data. This will help you manage memory when working with large files. If you have multiple sheets to manage, follow this example:

    for chunk in pd.read_excel(path, index_col=None, skiprows=[0], chunksize=10000):
        df = chunk.dropna(thresh=5)
        df2 = df.drop_duplicates(subset=['corporate'])
        # rest of your code
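Whichever way the chunks are produced, rows for the same corporate can be split across chunk boundaries, so the per-chunk groups have to be merged. Below is a minimal sketch of that merge, assuming pandas and the frame from the code above; split_frame is a hypothetical helper standing in for whatever chunked source your pandas version supports:

    import collections

    def split_frame(df, size=10000):
        # Hypothetical helper: yield consecutive row slices of an
        # already-loaded frame; replace it with your chunked source.
        for start in range(0, len(df), size):
            yield df.iloc[start:start + size]

    grouped = collections.defaultdict(list)
    for chunk in split_frame(df):
        chunk = chunk.dropna(thresh=5)
        for name, group in chunk.groupby('corporate'):
            # The same corporate may appear in several chunks, so extend
            # the list instead of overwriting it.
            grouped[name].extend(group.to_dict('records'))

    result_dict = [dict(grouped)]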