Organizing Results in Python

Question

Alright, so basically I have a Google script that searches for a keyword. The results look like:

    http://www.example.com/user/1234
    http://www.youtube.com/user/125
    http://www.forum.com/user/12

What could I do to organize these results like this?:

    Forums:
    http://www.forum.com/user/12
    YouTubes:
    http://www.youtube.com/user/125
    Unidentified:
    http://www.example.com/user/1234

By the way, I'm organizing them by keyword. If the URL has "forum" in it, it goes to the forum list; if it has "YouTube", it goes to the YouTube list; if no keyword matches, it goes to unidentified.

I don't understand the question. Are both input and output strings? What are your rules of organizing things? By domain? Why is example.com unidentified? And finally: what have you tried?
– freakish, Commented Feb 10, 2014 at 14:23
I'm organizing them with keywords. If the url has "forum" in it then it goes to the forum list, if it has youtube it goes to the youtube list, but if no keywords match up then it goes to unidentified.
– RydallCooper, Commented Feb 10, 2014 at 14:26
Yes, but I was using bash to run the Python script, then trying to organize the results with grep, sed, etc. All attempts have failed, lol. I have no idea how I would go about doing this solely in Python.
– RydallCooper, Commented Feb 10, 2014 at 14:28
What happens when a URL contains both "forum" and "youtube"?
– Kevin, Commented Feb 10, 2014 at 14:31

DhruvPathak · Accepted Answer · 2014-02-10 14:37:20Z

1. Create a dict and assign an empty list to each keyword you have, e.g. my_dict = {'forums': [], 'youtube': [], 'unidentified': []}

2. Iterate over your URLs.

3. Generate a key for each URL (the domain name in your case); you can extract it with the re regex module.

4. Check the dictionary from step 1 for this key. If it exists, append the URL to the list stored under that key; if it does not, append the URL to the list under the 'unidentified' key.
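
The steps above might be sketched roughly as follows. This is only an illustration for Python 3, not code from the answer: the function name `group_by_domain`, the regex, and the `KEY_FOR_DOMAIN` mapping from a domain label such as `forum` to the `forums` bucket are all assumptions made for the example.

```python
import re

# Hypothetical pattern: capture the label after an optional "www." in the host.
DOMAIN_RE = re.compile(r"//(?:www\.)?([^./]+)\.")

# Hypothetical mapping from the extracted domain label to a bucket name.
KEY_FOR_DOMAIN = {"forum": "forums", "youtube": "youtube"}


def group_by_domain(urls):
    # Step 1: one empty list per bucket.
    my_dict = {"forums": [], "youtube": [], "unidentified": []}
    # Step 2: iterate over the URLs.
    for url in urls:
        # Step 3: generate a key from the domain name.
        match = DOMAIN_RE.search(url)
        domain = match.group(1) if match else None
        # Step 4: fall back to 'unidentified' when the key is unknown.
        key = KEY_FOR_DOMAIN.get(domain, "unidentified")
        my_dict[key].append(url)
    return my_dict


grouped = group_by_domain([
    "http://www.example.com/user/1234",
    "http://www.youtube.com/user/125",
    "http://www.forum.com/user/12",
])
```

With the three URLs from the question, each bucket ends up holding exactly one URL, and example.com lands in `unidentified` because its label is not in the mapping.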

I don't think he always wants the domain name to be the key. For instance, the key of example.com is "Unidentified".

Colin Bernet · Accepted Answer · 2014-02-10 14:38:52Z

Something like this? I guess you will be able to adapt this example to your needs

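    import pprint
    import re

    urls = ['http://www.example.com/user/1234',
            'http://www.youtube.com/user/126',
            'http://www.youtube.com/user/125',
            'http://www.forum.com/useryoutube/12']

    # Capture the domain label that follows "//www." (raw string keeps the escapes literal).
    pattern = re.compile(r'//www\.(\w+)\.')

    # Only URLs whose domain label is one of these keys are kept.
    keys = ['forum', 'youtube']

    results = dict()
    for u in urls:
        ms = pattern.search(u)
        if ms is None:
            continue  # URL does not look like //www.<name>.<tld>
        key = ms.group(1)
        if key in keys:
            results.setdefault(key, []).append(u)

    pprint.pprint(results)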

It would be better not to hardcode the keys. They should be generated dynamically, since he doesn't know all the domain names or keys in advance.
Ah, I see, thanks. I edited my post. I gave the OP the possibility to select the keys he's interested in.
And now with more solid pattern matching, so that the last URL is classified as forum.
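
The last comment refers to the URL `http://www.forum.com/useryoutube/12`, whose path contains "youtube". One way to keep such a URL in the forum group is to test the keywords against the host name only. Here is a minimal Python 3 sketch of that idea. The names `classify` and `KEYWORDS` are hypothetical, and letting the first matching keyword win when a host matches several is an assumption, since the question leaves that case open.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; order decides ties when several keywords match.
KEYWORDS = ["forum", "youtube"]


def classify(url):
    # Look only at the host name so keywords in the path are ignored.
    host = (urlparse(url).hostname or "").lower()
    for keyword in KEYWORDS:
        if keyword in host:
            return keyword
    return "unidentified"
```

Calling `classify` on each URL and appending the URL to a list stored under the returned name gives the grouping the question asks for.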

Jakob Bowyer · Accepted Answer · 2014-02-10 14:34:38Z

    try:
        from urllib.parse import urlparse  # Python 3
    except ImportError:
        from urlparse import urlparse  # Python 2

    urls = """
    http://www.example.com/user/1234
    http://www.youtube.com/user/125
    http://www.forum.com/user/12
    """.split()

    categories = {
        "youtube.com": [],
        "forum.com": [],
        "unknown": [],
    }

    for url in urls:
        netloc = urlparse(url).netloc
        if netloc.count(".") == 2:
            # chop sub-domain
            netloc = netloc.split(".", 1)[1]
        if netloc in categories:
            categories[netloc].append(url)
        else:
            categories["unknown"].append(url)

    print(categories)

Parse the urls. Find the category. Append the full url

cjfaure · Accepted Answer · 2014-02-10 14:44:47Z

You should probably keep your sorted results in a dictionary and the unsorted ones in a list. You could then sort it like so:

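    categorized_results = {"forum": [], "youtube": []}
    uncategorized_results = []

    for url in results:  # results: the list of URL strings from the search
        parts = url.split(".")
        matched = False  # set once, before checking the keywords
        for keyword in categorized_results:
            if keyword in parts:
                # store the original URL string, not the split parts
                categorized_results[keyword].append(url)
                matched = True
        if not matched:
            uncategorized_results.append(url)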

If you'd like to output it neatly:

    category_aliases = {"forum": "Forums:", "youtube": "Youtubes:"}

    for i in categorized_results:
        print(category_aliases[i])
        for j in categorized_results[i]:
            print(j)
        print("\n")

    print("Unidentified:")
    print("\n".join(uncategorized_results))  # Let's not put in another for loop.

user2814648 · Accepted Answer · 2014-02-10 15:38:06Z

How about this:

    try:
        from urllib.parse import urlparse  # Python 3
    except ImportError:
        from urlparse import urlparse  # Python 2


    class Organizing_Results(object):
        # 'example' doubles as the catch-all bucket for URLs matching no other keyword.
        CATEGORY = ('example', 'youtube', 'forum')

        def __init__(self):
            self.url_list = []

        def add_single_url(self, url):
            self.url_list.append(urlparse(url))

        def _category_for(self, parsed):
            # Match the keywords against the host name (netloc) only.
            for name in self.CATEGORY:
                if name in parsed.netloc:
                    return name
            return 'example'

        def get_result(self):
            # Build a fresh dict on each call so repeated calls do not duplicate entries.
            result = dict((name, []) for name in self.CATEGORY)
            for parsed in self.url_list:
                result[self._category_for(parsed)].append(parsed)
            return result


    c = Organizing_Results()
    c.add_single_url('http://www.example.com/user/1234')
    c.add_single_url('http://www.youtube.com/user/1234')
    c.add_single_url('http://www.unidentified.com/user/1234')
    print(c.get_result())

You can easily extend the class with more functions as needed.
