RAM crashes in XML-to-DataFrame conversion function


I have created the following function, which converts an XML file to a DataFrame. It works well for files smaller than 1 GB; for anything larger, the RAM crashes (13 GB on Google Colab). The same happens when I run it locally in a Jupyter Notebook (4 GB laptop RAM). Is there a way to optimize the code?

Code

# Libraries
import pandas as pd
import xml.etree.cElementTree as ET

# Function to convert an XML file to a pandas DataFrame
def xml2df(file_path):

    # Parse the XML file and obtain the root
    tree = ET.parse(file_path)
    root = tree.getroot()

    dict_list = []

    # Stream the file again and collect every <row> element's attributes
    for _, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == "row":
            dict_list.append(elem.attrib)  # parse all attributes
            elem.clear()

    df = pd.DataFrame(dict_list)
    return df
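
Note that ET.parse() already builds the whole tree in memory before iterparse streams the same file a second time, and dict_list then buffers every row on top of that. Below is a minimal sketch of the same conversion without the upfront parse, flushing rows to a CSV on disk in chunks; the xml2csv name, the chunk size, and the write-to-CSV step are illustrative assumptions, not part of the original code.

import csv
import xml.etree.ElementTree as ET

# Hypothetical helper: stream <row> elements straight to a CSV on disk
def xml2csv(file_path, csv_path, chunk_size=50_000):
    context = ET.iterparse(file_path, events=("start", "end"))
    _, root = next(context)  # grab the root element (<badges>) once
    buffer, writer = [], None
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                buffer.append(dict(elem.attrib))  # copy before clearing
                root.clear()  # drop finished children so they can be freed
                if len(buffer) >= chunk_size:
                    if writer is None:  # assumes all rows share the same attributes
                        writer = csv.DictWriter(out, fieldnames=list(buffer[0]))
                        writer.writeheader()
                    writer.writerows(buffer)
                    buffer = []
        if buffer:
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=list(buffer[0]))
                writer.writeheader()
            writer.writerows(buffer)

The CSV can then be read back with bounded memory, e.g. pd.read_csv('Badges.csv', chunksize=100_000).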

Part of the XML file ('Badges.xml')

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />

This conversion is needed so that I can perform further data analysis.

I have asked this on Stack Overflow (Link), but the answers did not solve my problem. I hope to find a solution here.

Ishan Dutta

Posted 2020-08-06T12:01:51.057

Reputation: 193

Answers


import dask
import dask.bag as db

# Use the multiprocessing scheduler
# (dask.set_options is deprecated; dask.config.set is the current API)
dask.config.set(scheduler='processes')

# Read the XML file lazily as a bag of text lines
tags_xml = db.read_text('data/Tags.xml', encoding='utf-8')
tags_xml.take(10)  # peek at the first 10 lines without loading the whole file

Refer to this link for a complete tutorial: Dask with XML
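
read_text only gives back raw lines; to get from there to a dataframe, one minimal sketch (assuming each <row .../> element sits on its own line, as in the Badges.xml excerpt above; parse_row is a hypothetical helper, not a dask API):

import re
import dask.bag as db

# Hypothetical helper: pull the attributes out of one '<row ... />' line.
# Values are left XML-escaped (e.g. &amp;); a real parser would unescape them.
def parse_row(line):
    line = line.strip()
    if not line.startswith("<row"):
        return None
    return dict(re.findall(r'(\w+)="([^"]*)"', line))

rows = (db.read_text("data/Tags.xml", encoding="utf-8")
          .map(parse_row)
          .filter(lambda d: d is not None))  # drop non-row lines
df = rows.to_dataframe()  # a lazy dask DataFrame, computed in partitions
print(df.head())          # materializes only a small sample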

Syenix

Posted 2020-08-06T12:01:51.057

Reputation: 339

BTW, 4 GB is not sufficient for DS. And if it's provided by your organization, please let them know to give you proper computing power if they really want you to work on data analysis/data science. Most organizations just go stingy when it comes to shelling out $$ for a good computing rack or cloud services and force a DS/DA person to work on 4-6 GB laptops. If they can't, look for a better organization or ask them to shut their AI shops. – Syenix – 2020-08-06T21:23:11.673

Is it possible to create a pandas DataFrame from XML when the file has more than 500k rows? – Ishan Dutta – 2020-08-07T04:08:44.840

Did you even try the link that I posted? – Syenix – 2020-08-07T05:35:57.480