
I'm a newbie to Python and Pandas, I've spent a lot of time searching but haven't been able to find an answer to my particular problem.

I have a dataframe where the first few lines are just comments starting with '#', followed by the usual dataframe containing rows and columns. I have hundreds of such text files that I need to read in and manipulate. For example:

# blah1
# blah2
# blah3
Column1 Column2 Column3
a1 b1 c1
a2 b2 c2
etc.

I want to delete all the rows starting with '#'. Can somebody tell me how to do this, preferably in Pandas?

Alternatively, I tried to use the following code to read in the text file:

my_input=pd.read_table(filename, comment='#', header=80) 

But the problem was that the header row differs for each text file. Is there a way to generalize and tell Python that my header lies below the last line that starts with a '#'?

  • I think this may be a bug; I tried to use comment="'" (as your lines start with that?). The read_csv docs for comment seem pretty clear this should work. Commented Sep 5, 2014 at 1:58
  • Not merged yet: github.com/pydata/pandas/pull/7470 (a comment at the beginning of the line, which I think is fixed in master). Commented Sep 5, 2014 at 2:03
  • What version of pandas are you using? Normally this should work in 0.14.1 (Jeff, we split that PR; the comment part is already in 0.14.1). And following the docstring, the header kwarg should ignore fully commented lines. Commented Sep 5, 2014 at 6:37
  • @joris the above raises in 0.14.1; the docs say: "If found at the beginning of a line, the line will be ignored altogether." and "Also, fully commented lines are ignored by the parameter header". Commented Sep 5, 2014 at 7:22
  • So following the docs, the above should be possible, no? What does raise? With 0.14.1 this works for me: df = pd.read_csv(StringIO(s), sep=' ', comment="'") Commented Sep 5, 2014 at 7:50

1 Answer


Updating to pandas 0.14.1 or higher allows you to correctly skip commented lines.

Older versions would leave the commented lines in as NaN, which could be dropped with .dropna(), but that would still leave a broken header.
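On a recent pandas, the comment argument should handle this directly, with no need to know where the header row sits (a minimal sketch of the behavior described above):

```python
from io import StringIO
import pandas as pd

s = "# blah1\n# blah2\n# blah3\nCol1 Col2 Col3\na1 b1 c1\na2 b2 c2\n"

# Lines beginning with '#' are dropped entirely, so the first
# surviving line becomes the header automatically.
df = pd.read_table(StringIO(s), sep=' ', comment='#')
```

Because the fully commented lines are removed before parsing, df ends up with Col1/Col2/Col3 as the header and two data rows.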

For older versions of pandas you could use skiprows, assuming you know how many lines are commented.

In [3]:

from io import StringIO
import pandas as pd

s = "# blah1\n# blah2\n# blah3\nCol1 Col2 Col3\na1 b1 c1\na2 b2 c2\n"
pd.read_table(StringIO(s), skiprows=3, sep=' ')

Out [3]:

  Col1 Col2 Col3
0   a1   b1   c1
1   a2   b2   c2

1 Comment

I could use "skiprows" if I had one or two files, but the problem was that I had 300 files to extract data from, and each required skipping a different number of rows. But anyway, as you correctly said, the problem was with the version of Pandas that came installed with Anaconda. In the newer version, the 'comment' argument takes care of it.
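For anyone stuck on an older pandas, the per-file skip count this comment describes could be computed by scanning each file's leading '#' lines before reading it (a sketch; read_commented_table is a hypothetical helper name, not from this thread):

```python
import pandas as pd

def read_commented_table(filename):
    # Count how many leading lines start with '#' so the header row
    # can be located dynamically, whatever the file.
    n_comments = 0
    with open(filename) as f:
        for line in f:
            if line.startswith('#'):
                n_comments += 1
            else:
                break
    return pd.read_table(filename, sep=' ', skiprows=n_comments)
```

Looping this over all 300 files then needs no per-file knowledge of how many comment lines each one has.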
