Automated way to clean lots of .txt files?


I have hundreds of .txt files. I need to go through each one and remove certain paragraphs that start with specific words but, as a whole, are not exactly the same every time. Is there an automatic way to clean these parts out? If yes, what is it? If not, is it easy/quick to create my own AI tool for this job? I need to get this done very soon, so does it take a lot of time to learn how to create an AI tool that does the job for me?

Thanks in Advance!


Posted 2018-03-10T11:26:54.217

Reputation: 19

What does "remove certain paragraphs that have specific start words but they are not exactly the same each time" mean? – pasaba por aqui – 2018-03-10T14:48:22.280

I think some regex logic and the sed command can help you out. – Ugnes – 2018-03-11T03:18:22.530

@pasabaporaqui I edited the question; it means paragraphs that start with the same sentences but differ somewhere in the middle. – DuttaA – 2018-03-12T12:17:25.777

Yes, but never mind. thanks ! – user105139 – 2018-03-12T18:45:53.697



I think there's definitely a way to do what you want, but I'm not sure AI will work magic for you.

As far as I can tell, there are a few things you need.

1: A program that finds strings and removes paragraphs that start with the offending strings.
For this, some Python code:

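import os

path = '/some/path/to/file'
# placeholder start words; see point 2 for how to choose them
wordlist = {"words", "I", "don't", "like"}
for filename in os.listdir(path):
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(path, filename)) as f:
        # readlines() yields lines, so this treats each line as one paragraph
        paragraphs = f.readlines()
    for para in paragraphs:
        for word in wordlist:
            if para.strip().startswith(word):
                print(para)
                # create and write the modified file etc.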

This is how you can find the offending paragraphs in a file.
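
To go from finding the paragraphs to actually removing them, here is a minimal sketch. It assumes paragraphs are separated by blank lines, which may not match your files, and the names `remove_paragraphs` and `clean_folder` are made up for illustration. It writes cleaned copies into a separate folder instead of overwriting the originals, so you can check the results first.

```python
import os
import re


def remove_paragraphs(text, prefixes):
    """Return text without the paragraphs that start with any of the prefixes."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # str.startswith accepts a tuple of candidate prefixes.
    kept = [p for p in paragraphs if not p.lstrip().startswith(tuple(prefixes))]
    return "\n\n".join(kept)


def clean_folder(src_dir, dst_dir, prefixes):
    """Write a cleaned copy of every .txt file in src_dir into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(src_dir, name), encoding="utf-8") as f:
            text = f.read()
        with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as f:
            f.write(remove_paragraphs(text, prefixes))
```

You would call it as something like `clean_folder('/some/path/to/file', '/some/path/to/cleaned', ["Disclaimer:", "Note to editors"])`, where both the output folder and the prefixes are placeholders.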

2: You need the wordlist mentioned in the above program. If there are only a few words, just define wordlist = {"words", "I", "don't", "like","to","start","paragraphs"}. If it's more vague than that, I'd go to a thesaurus. If generating your wordlist is harder than that, then we might be talking about machine learning.
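
One thing to watch with a plain startswith check: a short entry like "I" also matches a paragraph that begins with "Imagine". If that matters for your files, a regular expression can require the start word to end at a word boundary. A rough sketch, where `is_offending` and the sample words are only illustrative:

```python
import re

# Hypothetical start words; replace them with your own.
wordlist = ["I", "don't", "like"]

# Escape each word so punctuation in it is matched literally.
alternatives = "|".join(re.escape(word) for word in wordlist)
# (?!\w) rejects a match that is followed by another word character.
starts_with_word = re.compile(r"\s*(?:" + alternatives + r")(?!\w)")


def is_offending(paragraph):
    """True if the paragraph starts with one of the words as a whole word."""
    return starts_with_word.match(paragraph) is not None
```

The same idea carries over to whole opening sentences: put the sentences in the list instead of single words.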

If you absolutely need it to be right, and you're "only" doing hundreds of files, it'll take a while to program an AI and then to verify that it's doing things correctly. I may very well have misunderstood your problem, though. If I did, please let me know.



Reputation: 51