Forum Classify Dataset

The dataset consists of 100 discussion threads crawled from Ubuntu Forums. Each message in each thread is assigned one of the following eight dialog labels: question, repeat question, clarification, further details, solution, positive feedback, negative feedback, junk.

The dataset was open-sourced by IBM Research India and is freely available for download on the IBM Developer Data Asset Exchange.

# importing prerequisites
!pip install xmltodict
import os
import requests
import tarfile
import xmltodict
import collections

Requirement already satisfied: xmltodict in /opt/conda/envs/Python36/lib/python3.6/site-packages (0.12.0)

Download and Extract the Dataset

We will download the archive and extract it.

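fname = 'forum_datasets.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-forum/1.0.7/' + fname
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
# Write the archive to disk and show the number of bytes written
with open(fname, 'wb') as f:
    n_bytes = f.write(r.content)
n_bytes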

165345
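For larger archives it can help to stream the download to disk instead of holding the whole response in memory. The following is a minimal sketch using only the standard library (urllib instead of requests); the function name download and its timeout default are assumptions made for illustration, not part of the dataset tooling.

```python
import os
import shutil
import urllib.request


def download(url, dest, timeout=30):
    """Stream the resource at `url` into the file `dest`; return its size in bytes."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)  # copies in chunks, not all at once
    return os.path.getsize(dest)
```

Calling download(url, fname) with the url and fname defined above would be a drop-in replacement for the requests-based cell.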

# Extracting the dataset
with tarfile.open(fname) as tar:
    tar.extractall()
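Extracting an archive fetched from the network writes whatever paths the archive contains, so a defensive check can be useful. The sketch below is one possible approach, not something the dataset requires: the helper name safe_extract is hypothetical, and it only checks member paths (it does not inspect symlink targets).

```python
import os
import tarfile


def safe_extract(archive_path, dest_dir="."):
    """Extract a tar archive, refusing members that would land outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # commonpath equals dest_root only when target lies inside it
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]
```

The returned list of member names also gives a quick way to confirm what was unpacked.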

# Verifying the file was extracted properly
data_path = "forum_datasets/"
os.path.exists(data_path)

True

Explore the Dataset

The data is in the folder forum_datasets/Ubuntu, which contains 100 XML files. Each file holds one thread: the original post and, in most but not all files, the corresponding responses.
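To make the file layout concrete, here is a small sketch that parses a hand-written miniature thread with the standard library's ElementTree (used here in place of xmltodict so the example has no extra dependency). The element names follow the fields visible in the parsed output in this notebook; the sample content and the helper name summarize_thread are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the element names used in the dataset files
SAMPLE_THREAD = """
<Thread>
  <ThreadID>0</ThreadID>
  <Title>Example title</Title>
  <InitPost>
    <UserID>user_a</UserID>
    <Date>January 1st, 2009</Date>
    <Class>1</Class>
    <icontent>How do I do X?</icontent>
  </InitPost>
  <Post>
    <UserID>user_b</UserID>
    <Date>January 2nd, 2009</Date>
    <Class>5</Class>
    <rcontent>Try Y.</rcontent>
  </Post>
</Thread>
"""


def summarize_thread(xml_text):
    """Return the title, the initial post, and the (class, text) pairs of all replies."""
    root = ET.fromstring(xml_text.strip())
    init = root.find("InitPost")
    return {
        "title": root.findtext("Title"),
        "init_class": init.findtext("Class") if init is not None else None,
        "init_text": init.findtext("icontent") if init is not None else None,
        # findall returns an empty list when a thread has no replies
        "replies": [(p.findtext("Class"), p.findtext("rcontent"))
                    for p in root.findall("Post")],
    }


summary = summarize_thread(SAMPLE_THREAD)
```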

path = 'forum_datasets/Ubuntu/'
label = []        # class label of each thread's initial post
content_txt = []  # text of each thread's initial post
for r, d, f in os.walk(path):
    for file in f:
        if file.endswith('.xml'):
            with open(os.path.join(r, file), encoding="utf-8", errors='ignore') as fl:
                xml_content = fl.read()
            try:
                rlt = xmltodict.parse(xml_content)
                init_post = rlt['Thread']['InitPost']
                # read both fields first so the two lists stay aligned
                cls, txt = init_post['Class'], init_post['icontent']
                label.append(cls)
                content_txt.append(txt)
            except Exception:
                # skip files that are malformed or lack an initial post
                pass

# Re-parse the last file read above and show the fields of its initial post
rlt = xmltodict.parse(xml_content)
print("\n" + str(rlt['Thread']['InitPost']))

OrderedDict([('UserID', 'biky2004indra'), ('Date', 'December 23rd, 2008'), ('Class', '1'), ('icontent', 'Hi, I am in a big problem. Please help me. my problem is that when i am trying to compile any c/c++ program using the command g++ -lm sample.c -o sample, where sample.c is my c program name, the the following error message is giving : /usr/bin/ld: 1: Syntax error: newline unexpected collect2: ld returned 2 exit status I am using UBUNTU 8.10 and it was updated regularly. I have tried in various way to debug, but no one is working. How can i solve it. someone suggested to uninstall everything and reinstall again. i have already uninstalled g++ and gcc & then reinstalled again. but no positive result. can anyone tell me how can i solve it without formatting the ubuntu. Thanks and waiting.')])

# The full parsed thread as an ordered dictionary (top-level keys sorted)
od = collections.OrderedDict(sorted(rlt.items(), key=lambda x: x[0]))
od

OrderedDict([('Thread',
              OrderedDict([('ThreadID', '1020249'),
                           ('Title',
                            '/usr/bin/ld: 1: Syntax error: newline unexpected. Help immediately'),
                           ('InitPost',
                            OrderedDict([('UserID', 'biky2004indra'),
                                         ('Date', 'December 23rd, 2008'),
                                         ('Class', '1'),
                                         ('icontent',
                                          'Hi, I am in a big problem. Please help me. my problem is that when i am trying to compile any c/c++ program using the command g++ -lm sample.c -o sample, where sample.c is my c program name, the the following error message is giving : /usr/bin/ld: 1: Syntax error: newline unexpected collect2: ld returned 2 exit status I am using UBUNTU 8.10 and it was updated regularly. I have tried in various way to debug, but no one is working. How can i solve it. someone suggested to uninstall everything and reinstall again. i have already uninstalled g++ and gcc & then reinstalled again. but no positive result. can anyone tell me how can i solve it without formatting the ubuntu. Thanks and waiting.')])),
                           ('Post',
                            [OrderedDict([('UserID', 'dcstar'),
                                          ('Date', 'December 24th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'Quote: There is a Development and Programming forum, it would be better to ask programming questions like this over there.')]),
                             OrderedDict([('UserID', 'Xiong Chiamiov'),
                                          ('Date', 'December 24th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           "I seem to remember seeing this cross-posted, but I'll forgive you. Since we know that it's not a problem with gcc (trust me, it's not), it's your code that has an issue, in which case the Programming forum is the best place to go, as dcstar suggested.")])])]))])

# Distribution of the reply posts among the categories
path = 'forum_datasets/Ubuntu/'
categories = []
for r, d, f in os.walk(path):
    for file in f:
        if file.endswith('.xml'):
            with open(os.path.join(r, file), encoding="utf-8", errors='ignore') as fl:
                xml_content = fl.read()
            try:
                rlt = xmltodict.parse(xml_content)
                posts = rlt['Thread']['Post']
                # xmltodict returns a single dict, not a list, for a thread with one reply
                if not isinstance(posts, list):
                    posts = [posts]
                for post in posts:
                    # category index = class label + 1
                    categories.append(int(post['Class']) + 1)
            except Exception:
                # skip files that are malformed or have no replies
                pass
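xmltodict returns a list for an element that repeats but a single dictionary for an element that occurs once, and it omits the key entirely when the element is absent, so threads with one reply or none need care. A helper along these lines normalizes all three shapes and also yields the initial post; the name iter_posts is hypothetical, and the input is assumed to be the dictionary returned by xmltodict.parse for one thread.

```python
def iter_posts(parsed):
    """Yield (class_label, text) for the initial post and every reply of one thread."""
    thread = parsed.get("Thread", {})
    init = thread.get("InitPost")
    if init:
        yield init.get("Class"), init.get("icontent")
    replies = thread.get("Post") or []
    if isinstance(replies, dict):
        # a single reply arrives as one dict, so wrap it in a list
        replies = [replies]
    for reply in replies:
        yield reply.get("Class"), reply.get("rcontent")
```

With this in place, list(iter_posts(rlt)) gives every labeled message of a thread, whatever the number of replies.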

Visualizing the Data

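# Plot the histogram of all categories
import matplotlib.pyplot as plt
# Weight each post by 100/N so that bar heights are percentages of all counted posts
weights = [100.0 / len(categories)] * len(categories)
plt.hist(categories, weights=weights, bins=8)
plt.ylabel('Percentage of posts (%)')
plt.xlabel('Category')
plt.show()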
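The numbers behind the histogram can also be tabulated directly. This is a minimal sketch using collections.Counter; the function name class_percentages is an assumption for illustration, and it works on any list of labels such as categories above.

```python
from collections import Counter


def class_percentages(labels):
    """Map each distinct label to its share of all labels, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # sort by label so the result reads in category order
    return {label: 100.0 * n / total for label, n in sorted(counts.items())}
```

For example, class_percentages(categories) returns a dictionary that can be printed or compared against the bar heights in the plot.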