Automated way to clean lots of .txt files?

0

I have hundreds of .txt files. I need to get into each one of them and remove certain paragraphs that start with specific words but as a whole, are not exactly the same every time. Is there an automatic way that can help me clean these parts out? If yes, what is it? If not, is it easy/quick to create my own AI tool for this job? Assuming that I need to get this done very soon, does it take a lot of time to learn how to create an AI tool to get the job done for me?

What does "remove certain paragraphs that have specific start words but they are not exactly the same each time" mean? – pasaba por aqui – 2018-03-10T14:48:22.280

I think some regex logic and the sed command can help you out. – Ugnes – 2018-03-11T03:18:22.530

@pasabaporaqui I modified that part. It means paragraphs that start with the same sentences but differ somewhere in the middle. – DuttaA – 2018-03-12T12:17:25.777

Yes, but never mind. Thanks! – user105139 – 2018-03-12T18:45:53.697

1

I think there's definitely a way to do what you want, but I'm not sure AI will work magic for you.

As far as I can tell, there are a few things you need.

1: A program that finds the paragraphs starting with the offending strings and removes them.
For this, some Python code:

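    import os

    path = '/some/path/to/file'
    # wordlist holds the starting words to look for (see point 2)
    for filename in os.listdir(path):
        # os.listdir returns bare names, so join them with the directory
        with open(os.path.join(path, filename)) as f:
            # treat blocks separated by a blank line as paragraphs
            paragraphs = f.read().split('\n\n')
        for para in paragraphs:
            for word in wordlist:
                if para.strip().startswith(word):
                    print(para)
                    # create and write the modified file etc.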


This is how you can find the offending paragraphs in a file.
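To go from finding the paragraphs to actually removing them, a minimal sketch along these lines might work. It assumes that paragraphs are separated by blank lines, that the files are UTF-8 text, and that cleaned copies go to a separate folder so the originals stay untouched. The names `remove_paragraphs` and `clean_folder` are made up for illustration.

```python
import os
import re


def remove_paragraphs(text, prefixes):
    """Return text without the paragraphs that start with any prefix."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', text)
    # str.startswith accepts a tuple and matches if any prefix fits.
    kept = [p for p in paragraphs
            if not p.strip().startswith(tuple(prefixes))]
    return '\n\n'.join(kept)


def clean_folder(src_dir, dst_dir, prefixes):
    """Write a cleaned copy of every .txt file in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for filename in os.listdir(src_dir):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(src_dir, filename), encoding='utf-8') as f:
            cleaned = remove_paragraphs(f.read(), prefixes)
        # Originals are left alone so the result can be checked first.
        with open(os.path.join(dst_dir, filename), 'w', encoding='utf-8') as f:
            f.write(cleaned)
```

Calling something like `clean_folder('in', 'out', ['Disclaimer:'])` (hypothetical folder names and starting phrase) would then process the whole folder in one go. Writing to a second folder is a deliberate choice here: with hundreds of files, it is safer to spot-check a few outputs before replacing anything.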

2: You need the wordlist mentioned in the above program. If there are only a few words, just define wordlist = {"words", "I", "don't", "like","to","start","paragraphs"}. If it's more vague than that, I'd go to a thesaurus. If generating your wordlist is harder than that, then we might be talking about machine learning.

If you absolutely need it to be right, and you're "only" doing hundreds of files, programming an AI and then verifying that it's doing things correctly will take a while. I may very well have misunderstood your problem, though; if I did, please let me know.