Pandas taking too much time and memory when working with an Excel file

I am trying to work with an Excel sheet of fewer than 50k rows. What I want to do is this: for a specific column, get all of its unique values, and then, for each unique value, collect every row that contains it and put them in this format:

[{ "unique_field_value": [Array containing row data that match the unique value as dictionaries] },] 

The thing is, when I test with a small number of rows, say 1000, everything runs fine. As the row count grows, memory usage grows with it until it cannot grow any more and my computer freezes. So, is there something I am doing wrong with pandas? Here are the details of my platform:

    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=16.04
    DISTRIB_CODENAME=xenial
    DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
    NAME="Ubuntu"
    VERSION="16.04.3 LTS (Xenial Xerus)"
    ID_LIKE=debian
    VERSION_ID="16.04"

Here is the code I am running in a Jupyter Notebook:

    import pandas as pd
    import simplejson
    import datetime

    def datetime_handler(x):
        if isinstance(x, datetime.datetime):
            return x.isoformat()
        raise TypeError("Type not Known")

    path = "/home/misachi/Downloads/new members/my_file.xls"

    df = pd.read_excel(path, index_col=None, skiprows=[0])
    df = df.dropna(thresh=5)
    df2 = df.drop_duplicates(subset=['corporate'])

    schemes = df2['corporate'].values
    result_list = []
    result_dict = {}

    for count, name in enumerate(schemes):
        inner_dict = {}
        col_val = schemes[count]
        foo = df['corporate'] == col_val
        data = df[foo].to_json(orient='records', date_format='iso')
        result_dict[name] = simplejson.loads(data)
        result_list.append(result_dict)
        # print(result_list)
        # if count == 3:
        #     break

    dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)

    with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
        json_f.write(dumped)

EDIT

Here is a sample of the expected output:

 [{ "TABBY MEMORIAL CATHEDRAL": [{ "corp_id": 8494, "smart": null, "copay": null, "corporate": "TABBY MEMORIAL CATHEDRAL", "category": "CAT A", "member_names": "Brian Maombi", "member_no": "84984", "start_date": "2017-03-01T00:00:00.000Z", "end_date": "2018-02-28T00:00:00.000Z", "outpatient": "OUTPATIENT" }, { "corp_id": 8494, "smart": null, "copay": null, "corporate": "TABBY MEMORIAL CATHEDRAL", "category": "CAT A", "member_names": "Omula Peter", "member_no": "4784984", "start_date": "2017-03-01T00:00:00.000Z", "end_date": "2018-02-28T00:00:00.000Z", "outpatient": "OUTPATIENT" }], "CHECKIFY KENYA LTD": [{ "corp_id": 7489, "smart": "SMART", "copay": null, "corporate": "CHECKIFY KENYA LTD", "category": "CAT A", "member_names": "BENARD KONYI", "member_no": "ABB/8439", "start_date": "2017-08-01T00:00:00.000Z", "end_date": "2018-07-31T00:00:00.000Z", "outpatient": "OUTPATIENT" }, { "corp_id": 7489, "smart": "SMART", "copay": null, "corporate": "CHECKIFY KENYA LTD", "category": "CAT A", "member_names": "KEVIN WACHAI", "member_no": "ABB/67484", "start_date": "2017-08-01T00:00:00.000Z", "end_date": "2018-07-31T00:00:00.000Z", "outpatient": "OUTPATIENT" }] }] 

The complete and clean code is:

    import os
    import pandas as pd
    import simplejson
    import datetime

    def datetime_handler(x):
        if isinstance(x, datetime.datetime):
            return x.isoformat()
        raise TypeError("Unknown type")

    def work_on_data(filename):
        if not os.path.isfile(filename):
            raise IOError
        df = pd.read_excel(filename, index_col=None, skiprows=[0])
        df = df.dropna(thresh=5)
        result_list = [{n: g.to_dict('records')} for n, g in df.groupby('corporate')]
        dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)
        return dumped

    dumped = work_on_data('/home/misachi/Downloads/new members/my_file.xls')

    with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
        json_f.write(dumped)
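If the dumped string itself becomes large, one option is to write each group out as it is produced instead of building the whole list first. Below is a minimal sketch, assuming the same imports, datetime_handler, and 'corporate' column as in the code above; write_groups is just an illustrative name:

    def write_groups(filename, out_path):
        df = pd.read_excel(filename, index_col=None, skiprows=[0])
        df = df.dropna(thresh=5)
        with open(out_path, 'w') as json_f:
            json_f.write('[')
            for i, (name, group) in enumerate(df.groupby('corporate')):
                if i:
                    json_f.write(',')
                # Dump one corporate at a time so only a single group's
                # JSON string is held in memory at once.
                json_f.write(simplejson.dumps({name: group.to_dict('records')},
                                              ignore_nan=True,
                                              default=datetime_handler))
            json_f.write(']')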

To get the dictionary:

 result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}] 
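For example, that dictionary form can be serialized the same way as before to produce output shaped like the expected sample (a small sketch reusing the df, simplejson, and datetime_handler defined earlier):

    result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}]
    dumped = simplejson.dumps(result_dict, ignore_nan=True, default=datetime_handler)
    with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
        json_f.write(dumped)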

Specify the chunksize=10000 parameter with read_excel() and loop through the file until you reach the end of the data. This will help you manage memory when working with large files. If you have multiple sheets to manage, follow this example:

    for chunk in pd.read_excel(path, index_col=None, skiprows=[0], chunksize=10000):
        df = chunk.dropna(thresh=5)
        df2 = df.drop_duplicates(subset=['corporate'])
        # rest of your code
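Whichever way the chunks are produced, rows for the same corporate can be split across chunk boundaries, so the per-chunk groups have to be merged. Below is a minimal sketch of that merge, assuming pandas and the frame from the code above; split_frame is a hypothetical helper standing in for whatever chunked source your pandas version supports:

    import collections

    def split_frame(df, size=10000):
        # Hypothetical helper: yield consecutive row slices of an
        # already-loaded frame; replace it with your chunked source.
        for start in range(0, len(df), size):
            yield df.iloc[start:start + size]

    grouped = collections.defaultdict(list)
    for chunk in split_frame(df):
        chunk = chunk.dropna(thresh=5)
        for name, group in chunk.groupby('corporate'):
            # The same corporate may appear in several chunks, so extend
            # the list instead of overwriting it.
            grouped[name].extend(group.to_dict('records'))

    result_dict = [dict(grouped)]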