Forum Summarization Dataset

The dataset consists of 100 discussion threads crawled from Ubuntu Forums. The crawled web pages were preprocessed: all pages belonging to the same thread were identified and parsed into structural units with their associated metadata (title, posts, user IDs, etc.). Each message in each thread is assigned one of eight dialog labels: question, repeat question, clarification, further details, solution, positive feedback, negative feedback, or junk. The gold standard summaries were written by two professionals with a computer science background.

The dataset was open-sourced by IBM Research India and can be downloaded free of charge from the IBM Developer Data Asset Exchange.

In [1]:
# importing prerequisites
!pip install xmltodict
import os
import sys
import requests
import tarfile
import xmltodict
import collections
Collecting xmltodict
  Downloading https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0

Download and Extract the Dataset

We will download the data and extract it.

In [2]:
fname = 'forum_datasets.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-forum/1.0.7/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)
Out[2]:
165345
In [3]:
# Extracting the dataset
tar = tarfile.open(fname)
tar.extractall()
tar.close()
In [4]:
# Verifying the file was extracted properly
data_path = "forum_datasets/"
os.path.exists(data_path)
Out[4]:
True
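
The extraction cell above calls `tar.extractall()` directly, which trusts every path stored in the archive. For archives fetched over the network, a slightly more defensive variant can check each member first. The sketch below is one way to do that; `safe_extract` is a hypothetical helper name introduced here, not part of `tarfile` or of the original notebook.

```python
import os
import tarfile


def safe_extract(archive_path, dest="."):
    """Extract a tar archive, refusing members that would land outside dest."""
    dest_root = os.path.realpath(dest)
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            # Resolve where this member would be written on disk
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(dest_root)
    return dest_root
```

With the file downloaded above, this would be called as `safe_extract(fname)` in place of the open/extractall/close sequence.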

Explore the Dataset

The threads are stored in the forum_datasets/Ubuntu folder as 100 XML files. Each file contains the initial post of a thread and, for most but not all threads, the replies to it.

In [5]:
path = 'forum_datasets/Ubuntu/'
label = []
content_txt = []
file_list= []
for r, d, f in os.walk(path):
    for file in f:
        if '.xml' in file:
            with open(os.path.join(r, file), encoding="utf-8", errors='ignore') as fl:
                file_list.append(file)
                xml_content = fl.read()  
                try:
                    rlt = xmltodict.parse(xml_content)
                    label.append(rlt['Thread']['InitPost']['Class'])
                    content_txt.append(rlt['Thread']['InitPost']['icontent'].split("\n"))
                except Exception:
                    # Skip files that cannot be parsed or lack the expected fields
                    pass
In [6]:
rlt = xmltodict.parse(xml_content)
# Show the initial post of the last thread that was read
print("\n" + str(rlt['Thread']['InitPost']))
OrderedDict([('UserID', 'cmay'), ('Date', 'April 14th, 2008'), ('Class', '1'), ('icontent', 'hi . i just wanna ask if there is any programs in ubuntu that helps peopel that can barely see. i have already the settings in window all to 23 in font size. however all seemed to be ok yeasterday but then i installed supertux becouse i was kinda bored around the evening and i wantede to play a bit supertux. ( yes i know i am way old to play supertux)  but than i had to install the restrited drivers and aafter restating the computer the fonts where incredebely small.  why they suddenly are i dont have an idea. i have such problems reading the fonts in firefox and i have as example tried to get an account whit the laroza learning to program forum . i can hardly read distorted code you have to fill in first. then after 3 days i can see that the forum has resgisterd me. i have no mail notification and i have not been able to get in whit the password. so i have probely given both bad email adress and cant login becouse i have enterd the wrong password. stuff like this happens all the time for me now. i am getting really sick and tired of this. is there some programs suitable for peopel whit really bad eyesight. this is really something that is getting a bit too frustrating. in windows xp there was some fetures that i did not have to use back then . but i suspect that there is something a bit more thourgtfull designet programs for this in the ubunut packages repos. any one have a suggestion. i would relly be gratefull if i could find something. i use my comupter all the time. thanks .')])
In [7]:
# Sort the top-level keys of the parsed thread and display the whole structure
od = collections.OrderedDict(sorted(rlt.items(), key=lambda x:x[0]))
od
Out[7]:
OrderedDict([('Thread',
              OrderedDict([('ThreadID', '755240'),
                           ('Title', 'any program for bad eyesight'),
                           ('InitPost',
                            OrderedDict([('UserID', 'cmay'),
                                         ('Date', 'April 14th, 2008'),
                                         ('Class', '1'),
                                         ('icontent',
                                          'hi . i just wanna ask if there is any programs in ubuntu that helps peopel that can barely see. i have already the settings in window all to 23 in font size. however all seemed to be ok yeasterday but then i installed supertux becouse i was kinda bored around the evening and i wantede to play a bit supertux. ( yes i know i am way old to play supertux)  but than i had to install the restrited drivers and aafter restating the computer the fonts where incredebely small.  why they suddenly are i dont have an idea. i have such problems reading the fonts in firefox and i have as example tried to get an account whit the laroza learning to program forum . i can hardly read distorted code you have to fill in first. then after 3 days i can see that the forum has resgisterd me. i have no mail notification and i have not been able to get in whit the password. so i have probely given both bad email adress and cant login becouse i have enterd the wrong password. stuff like this happens all the time for me now. i am getting really sick and tired of this. is there some programs suitable for peopel whit really bad eyesight. this is really something that is getting a bit too frustrating. in windows xp there was some fetures that i did not have to use back then . but i suspect that there is something a bit more thourgtfull designet programs for this in the ubunut packages repos. any one have a suggestion. i would relly be gratefull if i could find something. i use my comupter all the time. thanks .')])),
                           ('Post',
                            [OrderedDict([('UserID', 'friartuck'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'did you try to use orca just under system-----preferences-----universal acces-----assistive technology peferences then choose prefered applications choose the tab accessibility and here you see the option visual  click down here and you see orca with magnifier')]),
                             OrderedDict([('UserID', 'Sephoroth'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'By chance you keep Compiz Fusion enabled you can use the zoom or magnifier plugins. @friartuck: There is no "Universal Access" tab in System -> Preferences (though one can enable it in their Applications menu) so the path would be System -> Preferences -> Assistive technologies.')]),
                             OrderedDict([('UserID', 'bapoumba'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '8'),
                                          ('rcontent',
                                           'I\'ve moved the thread to " Assistive Technology and Accessibility Discussions", where you might get better help.')]),
                             OrderedDict([('UserID', 'friartuck'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '8'),
                                          ('rcontent',
                                           'you are right my friend must be the monday night alcohol session')]),
                             OrderedDict([('UserID', 'fooman'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'Quote: there is an extension for firefox called "no squint" that may help you:  https://addons.mozilla.org/en-US/firefox/addon/2592')]),
                             OrderedDict([('UserID', 'fooman'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'forgot to mention...it will even remember settings for individual sites...useful if you find some sites whose fonts are smaller than usual.')]),
                             OrderedDict([('UserID', 'edm1'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'My brother finds the zoom plugin for Compiz Fusion helpful.')]),
                             OrderedDict([('UserID', 'cmay'),
                                          ('Date', 'April 14th, 2008'),
                                          ('Class', '-1'),
                                          ('rcontent',
                                           'hi. thanks for the reply i dread the the day i have to get a laptop to work whit. i have a screen the size of a small tv set. and all the settings put to as large as it is not too big too see . anyway i will try to see if this compiz / orca is of any help .  i use my computer alot and i use it for both writing and makin music and almost my whole social life from time to time. i would not be able to live whitout it in the moment. so i really wanna get this problem fixed.  anyway i think i will reply back somehow if i find any solution that is right on the spot whit this.  i read in the ubuntu examples that ubunut filosofi is that everyone no matter language or handicap should be able to use it for free.  so i think it could be nice to see as many peopel reporting back of this being true . course i will read this treadh first . i did not know there was a such treadh.  thanks a lot for the help . i mean it really thanks a lot .  ps monday alcohol sessions ? try out tuxtyping for a real good laugh. this poor penguin looks so utterly amusing that you dont need any alcohol to get a coupel of hours of ha ha')]),
                             OrderedDict([('UserID', 'RKCole'),
                                          ('Date', 'April 23rd, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           "Hello, cmay.  Which version of Ubuntu are you using?  I also have very low vision, and I use the eZoom plugin which is in Compiz Fusion, and I turn on a feature for mouse-pointer scaling which makes the mouse pointer increase in size if you zoom the desktop. It si extremely helpful.  if you have Ubuntu 7.10 installed on your system, you can install CompizConfig Settings Manager (CCSM) to make setup a bit easier for you.  Please let me know what version of Ubuntu you are using, and maybe I can help you to get all set up.  If you have ever used programs such as ZoomText or Magic on Windows, (in my opinion) Compiz Fusion's eZoom plugin blows those programs way out of the water. It makes my life a WHOLE LOT esier. This was the thing I was waiting for to completely switch to Ubuntu, and when it arrived in v7.10, I jumped off of the Windows canoe and finally got myself a yacht.   Anyhow, take care, and please write back and et me know if I can be fo any help to you.")]),
                             OrderedDict([('UserID', 'stalkingwolf'),
                                          ('Date', 'June 26th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'In firefox you can increase the font size by going to View > Text Size > then click Increase.')]),
                             OrderedDict([('UserID', 's3a'),
                                          ('Date', 'July 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'A quicker way to do that would be to press ctrl and move the scroll up or down to zoom in or out.')]),
                             OrderedDict([('UserID', 's3a'),
                                          ('Date', 'July 14th, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           "Not that you'd do it but a nice ~40 inch HDTV with a resolution of 1920x1080 would do very well since it has a low resolution given the large screen size, it would have huge letters, images, etc while fitting a lot of things in it!")]),
                             OrderedDict([('UserID', 'AndreiP1'),
                                          ('Date', 'July 31st, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           "Hi, Cmay, I not got an answer to you qsn regards a program for ppl with bad vision - in that regards, I personally use a mac, and it's easy to enlarge anything there, but I can, however, give you and any1 else reading this a few tips on how to make sure our eyesight doesnt worsen. You can also restore your eyesight. When you work with a pc, read a book, or when you are doing anything else that requires you to stare at it for a long time, especially close up, you need look around the room, look into the distance now and again. If you do that, your eyesight will NOT get worse. Otherwise, it WILL - it's only normal, only common sense. If you do start noticing your eyesight worsening, then you can start doing more exercises - move your eyes up, down, left, right, look onto your nose etc - do this for a couple of mins once every hour or so, then close your eyes and relax. If you put your hands over your face, so that your eyes are right in the middle of your palms, with no light getting through, and stay like this for a minimum of 3 mins, then your eyes will relax and you'll see a huge improvement. It'll help a lot. If you do this, the n you'll never need to were glasses, no matter what age you are, or what work you do. Hope some1 will read this & it helps - I know, I've been there.  Andrei Petunin")]),
                             OrderedDict([('UserID', 'leona'),
                                          ('Date', 'July 31st, 2008'),
                                          ('Class', '2'),
                                          ('rcontent',
                                           "Quote: Hello there, I would very much like to know more about this plugin. you say its better then zoomtext? I use ZoomText at work, but can't use it at home (doesn't work under linux  ), so I would like to know more about this eZoom thingy and how you enable it. I'm using Ubuntu 8.04 LTS I have to say though, I've found Compiz, less then stable, in previous releases, it still messes up / corrupts the screen too often. Thank you")]),
                             OrderedDict([('UserID', 'Sef'),
                                          ('Date', 'July 31st, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'Have you tried resetting the default font sizes?   System > Preferences > Appearance > Font')]),
                             OrderedDict([('UserID', 'cmay'),
                                          ('Date', 'August 12th, 2008'),
                                          ('Class', '4'),
                                          ('rcontent',
                                           'hi. i lost this tread . thanks for all your help. i have migraines sometimes and its what is giving me these problems. i am looking a lot for somethings that can help not just for me . i know peopel that have trouble reading /learning to to read and i once found a distro called blinux but i cant find it anymore google searcing and i think its maybe inactive. i know i have a few linux converts if i can find a lot of helping tools for visual impaired or non reading peopel. thats why i post and also i get myselve these splitting headaches where i can not focuse on my eyes however for everyday use i can live very well whit fontsettings 16. i found out in the meantime.  thanks it means a lot to me that someone has anwered my post.')]),
                             OrderedDict([('UserID', 'iaskedalice09'),
                                          ('Date', 'August 12th, 2008'),
                                          ('Class', '1'),
                                          ('rcontent',
                                           "Quote: Hi,  I just moved to Ubuntu and don't rely on zoom things as much but there was a time when I could not live without ZoomText being on. So for future ref, how do you install compiz fusion and eZoom? I've heard so much about the former but cannot find a d/l link and if you say ezoom is better than ZT I am anxious to see its potential. don't think I'll use it since linux finally lets me tweak the computer and things so i can see it but for reference y'know.")]),
                             OrderedDict([('UserID', 'meganox'),
                                          ('Date', 'September 1st, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'If you set the font size in Firefox, make sure to use the Advanced options and set the minimum font size, and disable the option to let pages choose their own fonts. I have my default font size in firefox at 16pt and the minimum at 13. Many pages will look crappy because they aren\'t written to expand to acommodate user-chosen font sizes. If you use a site like that a lot you should contact them and let them know they are ignoring important accessibility guidelines that may even be legal obligations in their country, as well as alienating possible customers.  Don\'t forget you can zoom text in Firefox using the Ctrl-+ and Ctrl-- hotkeys or Ctrl-mousewheel, Ctrl-0 to reset. Firefox 3 remembers the zoom level on a per-site basis so you don\'t need the nosquint addon anymore. To stop images zooming with the text you need to enter about:config in the address bar, type "browser-zoom-full" in the filter field and double-click it set it to false. You can also change the "zoom.maxpercent" and "zoom.minpercent" settings here.  For magnification I use Compiz with the Enhanced Zoom Desktop and Place Windows plugins. I don\'t use the Magnifier plug in. To get a window double-size and filling the screen I use this combo, although I may have changed the key bindings from the defaults (Super means the "Windows" key"):  Super-Keypad-5 - centres the current window (Place Windows plugin) Super-2 - doubles the screen magnification (Zoom plugin) Super-v - resizes the window to the current screen size, i.e. x2 (Zoom plugin). It won\'t be centred any more though, so... Super-Keypad-5 to centre it again or...  Super-Keypad-N - move the window to N position on the screen, e.g. 7 is top-left, 3 is bottom-right and... Super-r - move and rezoom to fit the current window.  In general, Super-v resizes the window to the zoom level, and Super-r will move and resize the zoomed desktop to the current window. 
Alt-tabbing between windows will also move the zoomed desktop between them but not resize to fit them. Don\'t forget Super with the mouse wheel will zoom the desktop to arbitrary values.  You can also constrain the mouse to the zoomed area and scale the mouse with the window, so It\'s like every application has its own double size desktop. Personally I hate it when the screen moves around as I move the mouse, although you can leave the mouse unconstrained and use the mouse panning option so the screen only moves when you hit the edge, lessening the feeling of "motion sickness"  This approach saves you messing around with font sizes in lots of different places - I would say it\'s impossible to get all applications set up the way you want them, too many ignore the system font settings.  Using this method I can have all windows at double size and switch between them using alt-tab or any of the application switching plugins for Compiz. To reset to 100% zoom simply hit Super-1. You can in fact customise the hotkey zoom levels too, so I have them at 100%, 75%, and 50%. 75% is good for firefox otherwise the toolbars take up too much room and many pages just won\'t fit in the space. The downside is I can only see the panels when I zoom out to 100%, so I miss flashing icons in the notification area.  So you don\'t have to use the above combo on every window you open, the Window Rules plugin lets you set the size and position of any windows you open, so you could set it up so all windows are automatically 1/2 the width and height of your display, and position them in the centre, although you\'d have to do some math since it takes the top left corner as the position. This has much the same effect as setting your screen resolution to half, except your panels won\'t be visible so you\'d have a little more screen space. 
There is a problem in that you can\'t stop the Place Windows plugin from doing smart placement (where it places new windows as far as possible from existing ones - I shall report this as a bug), and this may mess with setting up fixed window placements in the Window Rules plugin.  On the important subject of eyestrain and other computer related disorders, there is a program in the Ubuntu repos called Workrave, I highly recommend it. It is a timer that nags you to take microbreaks, rest breaks, and limit your daily computer usage. It is much smarter than other similar programs that will nag you even if you have just returned from a break. If you are bad for skipping breaks you can even set it to lock the screen, although it will let you skip or postpone a break if you absolutely must.  I take a 20 second microbreak every 5 minutes where I stand up and look out the window at the farthest thing I can see. Every 45 minutes I take a 5 minute break where I walk around and relax my upper body, or do some household chores (I work from home). My eyes no longer water, my eyelids no longer twitch, my wrist and lower back pain are gone, and I seem to get my work done faster as a result. I know computers can enhance the lives of people who are differently abled in various ways, but you must take steps to ensure they don\'t negatively affect your health in the long term.  Hope this helps you some, I shall watch this thread so if you need any further clarification feel free to ask, it is a lot to take in all at once. Compiz has many accessibility features but that can make it difficult to find the ones you need and configure them however suits you best.   Code:  Quote:')]),
                             OrderedDict([('UserID', 'meganox'),
                                          ('Date', 'September 1st, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'Quote: Compiz is already installed, if you go to System -> Preferences -> Appearance -> Visual Effects and select "Extra" you will get more effects. You need the compiz control panel:   Code:  will install it. You can run it with "ccsm" or it should install an icon called "Advanced Desktop Effects Settings" in your preferences menu.')]),
                             OrderedDict([('UserID', 'notlistening'),
                                          ('Date', 'October 1st, 2008'),
                                          ('Class', '5'),
                                          ('rcontent',
                                           'I am using compiz with the ezoom plugin in 8.04 it comes pre installed not necessarily pre-configured   What i find useful:  Super/Windows Key + Scroll up on mouse :   This does not limit you to 3 zoon levels and allows you to modify the zoom level at will.  If you have a five button mouse you can set the Super / Windows key + back button to reset view.  Super/Windows Key + m: Inverts everything bar the desktop  Super/Windows Key + n: Inverts the current window useful for firefox when you have a page with black background.  Just as a side note getting this all to work is a little easier said than done. I have an ATI HD card connected through HDMI to a 32 inch TV screen. This was not straight forward.   Some hardware can be very troublesome and others not compiz sometime just works out of the box straight after install the best way to check is Super/Windows Key + m if everything goes inverted your in business. Otherwise you may need to start by installing proprietary driver support and or the GLX package for your graphics card.  ** GLX WARNING : This may break your graphics output on reeboot. Be sure you are aware and can see enough to do a xorg reset in recovery mode.  If the GLX goes in okay if you may need to enable it by going System -> Preferences -> Appearance. Then navigate to the last tab Visual Effects and change the button to normal. If it works then it will ask you to keep the settings otherwise it will complain that it can not start. Back to the drawing board at that point.   It can be worth checking the forums for your graphics card and if it supports compiz and how hard/easy it is to setup.  This is not a full gude but give you ideas of where you need to start looking. In my experience compiz will not work on anything less than 1Ghz machine and thats a push. Not many of those around, just in my room.   Tom')])])]))])
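
The output above shows the layout of one thread: a `Thread` root with an `InitPost` (text in `icontent`) followed by `Post` entries (text in `rcontent`), each carrying `UserID`, `Date` and `Class`. As a rough sketch, the same fields can be read with the standard library's `xml.etree.ElementTree` instead of `xmltodict`. The `SAMPLE` document and the `thread_to_rows` helper below are made up for illustration; only the element names are taken from the output above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature thread that mimics the element names seen in the dataset
SAMPLE = """<Thread>
  <ThreadID>1</ThreadID>
  <Title>example title</Title>
  <InitPost>
    <UserID>a</UserID><Date>April 14th, 2008</Date>
    <Class>1</Class><icontent>question text</icontent>
  </InitPost>
  <Post>
    <UserID>b</UserID><Date>April 14th, 2008</Date>
    <Class>5</Class><rcontent>answer text</rcontent>
  </Post>
</Thread>"""


def thread_to_rows(xml_text):
    """Return (user, class, text) tuples: the initial post first, then the replies."""
    root = ET.fromstring(xml_text)
    rows = []
    init = root.find("InitPost")
    if init is not None:
        rows.append((init.findtext("UserID"), init.findtext("Class"),
                     init.findtext("icontent")))
    # findall always returns a list, even when a thread has a single reply
    for post in root.findall("Post"):
        rows.append((post.findtext("UserID"), post.findtext("Class"),
                     post.findtext("rcontent")))
    return rows


rows = thread_to_rows(SAMPLE)
```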
In [8]:
# Collect the class label of every reply post across all threads
path = 'forum_datasets/Ubuntu/'
label = []
categories = []
for r, d, f in os.walk(path):
    for file in f:
        if '.xml' in file:
            with open(os.path.join(r, file), encoding="utf-8", errors='ignore') as fl:
                xml_content = fl.read()  
                try:
                    rlt = xmltodict.parse(xml_content)
                    od = collections.OrderedDict(sorted(rlt.items(), key=lambda x:x[0]))
                    
                    num_length=len(od['Thread']['Post'])
                    category = 0 
                    for i in range(num_length):
                        category = int(od['Thread']['Post'][i]['Class'])+1
                        categories.append(category)
                except Exception:
                    # Skip threads that fail to parse or whose 'Post' entry is not a list of posts
                    pass
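
One detail of the loop above: `xmltodict` returns a list for repeated `Post` elements but a single dict when a thread has exactly one reply, so such threads end up in the `except` branch. The sketch below shows one possible way to normalize both cases and count the labels. It works on already-parsed threads (the value of `rlt['Thread']`). The `CLASS_NAMES` mapping is an assumption: it presumes that classes 1 to 8 follow the order in which the dataset description lists the labels, which the source does not state explicitly. Other values, such as the `-1` seen in the sample thread, are left unnamed.

```python
from collections import Counter

# ASSUMPTION: numeric classes follow the order of the labels in the dataset description
CLASS_NAMES = {
    1: "question", 2: "repeat question", 3: "clarification",
    4: "further details", 5: "solution", 6: "positive feedback",
    7: "negative feedback", 8: "junk",
}


def post_classes(thread):
    """Return the integer class of every reply in one parsed thread."""
    posts = thread.get("Post") or []
    if isinstance(posts, dict):
        # A single <Post> element is parsed as a dict, not as a list
        posts = [posts]
    return [int(post["Class"]) for post in posts]


def count_classes(threads):
    """Count class labels over an iterable of parsed threads."""
    counts = Counter()
    for thread in threads:
        counts.update(post_classes(thread))
    return counts


# Hypothetical parsed threads: one reply, two replies, no replies
demo = [
    {"Post": {"Class": "5"}},
    {"Post": [{"Class": "5"}, {"Class": "-1"}]},
    {},
]
demo_counts = count_classes(demo)
```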

Visualizing the dataset

In [9]:
# Plot the histogram of all categories
import matplotlib.pyplot as plt
plt.hist(categories, density=True, bins=8)
# density=True normalizes the histogram, so the y-axis shows a density, not a percentage
plt.ylabel('Density of posts')
plt.xlabel('Category (class label + 1)')
plt.show()
<Figure size 640x480 with 1 Axes>
In [10]:
txtfiles = []
for flst in file_list:
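    # str.strip('.xml') would remove any of the characters '.', 'x', 'm', 'l' from both ends,
    # so drop the extension with os.path.splitext instead
    txtfiles.append(os.path.splitext(flst)[0] + '.txt')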

Process the Gold Standard Dataset

In [11]:
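path = 'forum_datasets/Gold_Summaries/Annotator_1/Ubuntu/'
gs_txt = []
index = 0
for r, d, f in os.walk(path):
    for file in f:
        if '.txt' in file and file in txtfiles:
            lineList = list()
            # Read the summaries in the order of txtfiles, which is meant to keep gs_txt aligned with file_list
            fileName = os.path.join(r, txtfiles[index])
            index += 1
            # Use a separate name for the file handle so it does not shadow the os.walk list f
            with open(fileName, 'r', encoding='utf8', errors="ignore") as fl:
                for line in fl:
                    lineList.append(line.rstrip('\n'))
            # Append only when a summary was actually read
            gs_txt.append(lineList)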

Preview of a gold standard summary

In [12]:
gs_txt[0]
Out[12]:
['The user wanted to install compiz theme and had downloaded it from the webpage',
 '(http://www.gnome-look.org/content/sh...ntent=91681). The user had installed compiz-fusion icon and emerald theme manager.',
 '',
 'The user had a wrong link which was for only cube caps.',
 '',
 'The information for the installation was available in the following webpage.',
 'http://www.gnome-look.org/content/sh...?content=65646 ',
 '',
 'The user can also try epidermis, details can be found in this thread about page 15/16 Below were the useful links ',
 'http://ubuntuforums.org/showthread.php?t=916410',
 'OR',
 'http://epidermis.tuxfamily.org']
In [13]:
content_txt[-1]
Out[13]:
['hi . i just wanna ask if there is any programs in ubuntu that helps peopel that can barely see. i have already the settings in window all to 23 in font size. however all seemed to be ok yeasterday but then i installed supertux becouse i was kinda bored around the evening and i wantede to play a bit supertux. ( yes i know i am way old to play supertux)  but than i had to install the restrited drivers and aafter restating the computer the fonts where incredebely small.  why they suddenly are i dont have an idea. i have such problems reading the fonts in firefox and i have as example tried to get an account whit the laroza learning to program forum . i can hardly read distorted code you have to fill in first. then after 3 days i can see that the forum has resgisterd me. i have no mail notification and i have not been able to get in whit the password. so i have probely given both bad email adress and cant login becouse i have enterd the wrong password. stuff like this happens all the time for me now. i am getting really sick and tired of this. is there some programs suitable for peopel whit really bad eyesight. this is really something that is getting a bit too frustrating. in windows xp there was some fetures that i did not have to use back then . but i suspect that there is something a bit more thourgtfull designet programs for this in the ubunut packages repos. any one have a suggestion. i would relly be gratefull if i could find something. i use my comupter all the time. thanks .']
In [14]:
len(content_txt)
Out[14]:
100

Calculate string similarity using cosine similarity

We use cosine similarity to compare two sentence vectors. It is the cosine of the angle between the vectors in an inner product space, so it depends on their direction and not on their magnitude.
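
For two vectors u and v, the measure is cos(u, v) = (u · v) / (‖u‖ ‖v‖). The notebook relies on scikit-learn for this; the pure-Python sketch below is only meant to make the formula concrete. The function name `cosine` and the choice to return 0.0 when either vector has zero length are decisions made for this illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; 0.0 is a convention chosen here
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine([1, 0], [0, 1])             # perpendicular vectors
opposite = cosine([1, 0], [-1, 0])              # opposite directions
```

Parallel vectors score 1 regardless of their length, perpendicular vectors score 0, and opposite vectors score -1.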

In [15]:
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
In [16]:
def cleanup(txt):
    # Drop punctuation characters
    txt = ''.join([w for w in txt if w not in string.punctuation])
    txt = txt.lower()
    # Drop English stopwords and collapse whitespace
    txt = ' '.join([w for w in txt.split() if w not in stopwords])
    return txt
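
To see what this cleaning step does without downloading the NLTK data, here is a self-contained sketch of the same three steps. `TOY_STOPWORDS` is a tiny made-up stopword set used only for this example, and `cleanup_demo` is a hypothetical name; the notebook itself uses the NLTK English list.

```python
import string

# Hypothetical stand-in for the NLTK English stopword list
TOY_STOPWORDS = {"the", "at"}


def cleanup_demo(txt, stopwords=TOY_STOPWORDS):
    """Remove punctuation, lowercase, and drop stopwords."""
    txt = ''.join(ch for ch in txt if ch not in string.punctuation)
    txt = txt.lower()
    return ' '.join(w for w in txt.split() if w not in stopwords)


cleaned = cleanup_demo("The user, at last, installed SuperTux!")
print(cleaned)  # user last installed supertux
```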
In [17]:
# Take the first line of the last gold summary and normalize its whitespace
gs_exp = [' '.join(gs_txt[-1][0].split())]
In [18]:
cleansent_gs = list(map(cleanup, gs_exp))
In [19]:
cleansent_gs
Out[19]:
['user wanted know programs help people could barely see user installed supertux restarted computer user observed change font sizes font size previously 23 restart observed incredibly small user registered himselfherself forum get help wrong password user unable login forum website']
In [20]:
cleansent_cnt = list(map(cleanup, content_txt[-1]))
In [21]:
cleansent_cnt
Out[21]:
['hi wanna ask programs ubuntu helps peopel barely see already settings window 23 font size however seemed ok yeasterday installed supertux becouse kinda bored around evening wantede play bit supertux yes know way old play supertux install restrited drivers aafter restating computer fonts incredebely small suddenly dont idea problems reading fonts firefox example tried get account whit laroza learning program forum hardly read distorted code fill first 3 days see forum resgisterd mail notification able get whit password probely given bad email adress cant login becouse enterd wrong password stuff like happens time getting really sick tired programs suitable peopel whit really bad eyesight really something getting bit frustrating windows xp fetures use back suspect something bit thourgtfull designet programs ubunut packages repos one suggestion would relly gratefull could find something use comupter time thanks']
In [22]:
! pip install sister
import sister
embedder = sister.MeanEmbedding(lang="en")
vec_cnt = embedder(cleansent_cnt[0])  # default is a 300-dim vector
Collecting sister
  Downloading https://files.pythonhosted.org/packages/6d/31/6c7c19362c36b20f4875bb0343ce6e8eefbfb7cd65b4f54a5b3e3cb07eb4/sister-0.1.5.tar.gz
Collecting fasttext (from sister)
  Downloading https://files.pythonhosted.org/packages/10/61/2e01f1397ec533756c1d893c22d9d5ed3fce3a6e4af1976e0d86bb13ea97/fasttext-0.9.1.tar.gz (57kB)
     |████████████████████████████████| 61kB 9.3MB/s eta 0:00:011
Requirement already satisfied: numpy in /opt/conda/envs/Python36/lib/python3.6/site-packages (from sister) (1.15.4)
Collecting Janome (from sister)
  Downloading https://files.pythonhosted.org/packages/79/f0/bd7f90806132d7d9d642d418bdc3e870cfdff5947254ea3cab27480983a7/Janome-0.3.10-py2.py3-none-any.whl (21.5MB)
     |████████████████████████████████| 21.5MB 9.6MB/s eta 0:00:01
Requirement already satisfied: pybind11>=2.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from fasttext->sister) (2.4.3)
Requirement already satisfied: setuptools>=0.7.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from fasttext->sister) (40.8.0)
Building wheels for collected packages: sister, fasttext
  Building wheel for sister (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/7d/ed/48/76cdf61511496a1cd2a2c56dc9b4e9667401dfb6084f153cc3
  Building wheel for fasttext (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/9f/f0/04/caa82c912aee89ce76358ff954f3f0729b7577c8ff23a292e3
Successfully built sister fasttext
Installing collected packages: fasttext, Janome, sister
Successfully installed Janome-0.3.10 fasttext-0.9.1 sister-0.1.5
Downloading from https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.simple.zip...
Loading model...

In [23]:
vec_gs = embedder(cleansent_gs[0])  # default is a 300-dim vector 
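
The `embedder` calls above turn each cleaned text into a single vector. As the name `MeanEmbedding` suggests, the underlying idea is to average the vectors of the individual words. The toy sketch below illustrates that idea only; it is not sister's implementation, the 3-dimensional vectors are invented, and skipping unknown words is a simplification made here.

```python
def mean_embedding(sentence, word_vectors, dim):
    """Average the vectors of the known words in a whitespace-tokenized sentence."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not vectors:
        # No known word: fall back to a zero vector (a choice made for this sketch)
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical 3-dimensional word vectors, for illustration only
TOY_VECTORS = {
    "font": [1.0, 0.0, 2.0],
    "small": [3.0, 2.0, 0.0],
}

sentence_vec = mean_embedding("font small", TOY_VECTORS, dim=3)
print(sentence_vec)  # [2.0, 1.0, 1.0]
```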
In [24]:
def cos_sim(vect1, vect2):
    vecta = vect1.reshape(1,-1)
    vectb = vect2.reshape(1,-1)    
    return cosine_similarity(vecta, vectb)[0][0]
In [25]:
similarity = cos_sim(vec_cnt, vec_gs)
In [26]:
similarity
Out[26]:
0.9228872