Bloatectomy: a method for identifying and removing duplicate text in the bloated notes of electronic health records and other documents. It takes in a list of notes, a single file (.docx, .txt, .rtf, etc.), or a single string, marks the duplicates, and can then highlight, bold, or remove them. It returns the marked-up text along with the tokens.
Where to find conda (anaconda) packages and libraries
Where to find Anaconda packages on OS X.
TED Talk Recommender (Part3): flask app
Topic modeling of TED talks using Latent Dirichlet Allocation and visualization with tSNE.
DCFem Tech Awards 2019 and Written Feedback
Moral of the story is this: give good and 'bad' feedback to your mentees often and in writing. It will make it easier for you and for them to do things like nominate themselves or write a recommendation. Also, when you write for a woman, read it and ask yourself if you would say those same things about a man. It will help you adjust any biased language that got in there.
YMCA VS. Obesity Part 3: Linear Regression Results
Linear regression of multiple county-level statistics on the obesity rate in the United States.
TED Talk Recommender (Part2): Topic Modeling and tSNE
Topic modeling of TED talks using Latent Dirichlet Allocation and visualization with tSNE.
Ted Talk Recommender (Part1): Cleaning text with NLTK
Above is a sample of the transcript from one of the most popular (highest-viewed) Ted Talks, 'Do Schools Kill Creativity?' by Sir Ken Robinson. I made a Ted Talk recommender using Natural Language Processing (NLP) and topic modeling. The recommender lets you enter key words from the title of a talk; it then finds that talk and returns the URLs of 5 Ted talks that are similar to yours. This post covers cleaning the text, Part 2 covers topic modeling, and Part 3 covers the recommender and the app. The code is in my github repository.
The data for this project consisted of transcripts from Ted and TedX talks. Thanks to Rounak Banik and his web scraping I was able to obtain 2467 transcripts from 355 different Ted and TedX events from 2001-2017. I downloaded this corpus from Kaggle, along with metadata about every talk.
The first thing I saw when looking at these transcripts was that there were a lot of parentheticals for various non-speech sounds, for example, (Laughter) or (Applause) or (Music). There were even some cute little notes when the lyrics of a performance were transcribed:
someone like him ♫♫ He was tall and strong
I decided that I wanted to look at only the words the speaker said, so I removed these parentheticals. It would be interesting, though, to collect these non-speech events and keep a count of them in the main matrix, especially for things like laughter, applause, or multimedia (present/not present), for use in making recommendations or calculating the popularity of a talk.
Lucky for me, all of the parentheses contained these non-speech sounds, and any of the speaker's words that required parentheses were in brackets, so I just removed them with a simple regular expression. Thank you, Ted transcribers, for making my life a little easier!!!
import re
clean_parens = re.sub(r'\([^)]*\)', ' ', text)  # strip (Laughter), (Applause), (Music), etc.
Cleaning Text with NLTK
Four important steps for cleaning the text and getting it into a format that we can analyze:
1. Tokenize
2. Remove stop words/punctuation
3. Lemmatize
4. Vectorize
NLTK (Natural Language Toolkit) is a Python library for NLP. I found it very easy to use and highly effective.
1. Tokenization
This is the process of splitting up the document (talk) into words. There are a few tokenizers in NLTK, and one called wordpunct was my favorite because it separated the punctuation as well.
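Here's a minimal sketch of that step, assuming text holds one transcript after the parentheticals have been stripped:

from nltk.tokenize import wordpunct_tokenize
tokens = wordpunct_tokenize(text.lower())  # splits words and punctuation into separate tokens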
2. Stopwords
The music notes were easy to remove by adding them to my stopwords. Stopwords are words that don't give us much information (i.e., the, and, it, she, as), along with the punctuation. We want to remove these from our text, too.
We can do this by importing NLTK's list of stopwords and then adding to it. I went through many iterations of cleaning in order to figure out which words to add to my stopwords. I added a lot of words and little things that weren't getting picked up, but this is a sample of my list.
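A minimal sketch of how the list gets built (the extra entries here are just illustrative, not my full list):

from nltk.corpus import stopwords  # requires nltk.download('stopwords') once
stop_words = stopwords.words('english')
stop_words += ['♫', 'laughter', 'applause', "'s", "n't", "''", '``', '--']  # sample additions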
3. Lemmatization
In this step, we get each word down to its root form. I chose the lemmatizer over the stemmer because it was more conservative and was able to change the ending to the appropriate one (e.g., children --> child, capacities --> capacity). This was at the expense of missing a few obvious ones (starting, unpredictability).
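A minimal sketch of that step with NLTK's WordNet lemmatizer, reusing the tokens and stop_words from the sketches above:

from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet') once
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]  # children --> child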
Now we have squeaky clean text! Here's the same excerpt that I showed you at the top of the post.
good morning great blown away whole thing fact leaving three
theme running conference relevant want talk one extraordinary
evidence human creativity
As you can see it no longer makes a ton of sense, but it will still be very informative once we process these words over the whole corpus of talks.
N-grams
Let's look at some of the n-grams. These are just pairs (or triplets) of words that show up together. They will tell us something about our corpus, but also guide us in our next step of vectorization. Here's what we get for the top 30 most frequent bi-grams.
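One way to pull those counts, sketched with NLTK's FreqDist (all_lemmas is assumed to be the cleaned tokens from every talk combined into one list):

from nltk import bigrams, FreqDist
top_bigrams = FreqDist(bigrams(all_lemmas)).most_common(30)  # 30 most frequent word pairs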
Tri-grams
The tri-grams were not very informative or useful aside from "new york city" and "0000 year ago", which also get picked up in the bi-grams.
4. Vectorization
Vectorization is the important step of turning our words into numbers. The method that gave me the best results was the count vectorizer. This function takes each word in each document and counts the number of times the word appears. You end up with each word (and n-gram) as a column and each document (talk) as a row, so the data are the frequency of each word in each document. As you can imagine, there will be a large number of zeros in this matrix; we call this a sparse matrix.
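A minimal sketch with scikit-learn's CountVectorizer, assuming docs is a list of cleaned transcripts (one string per talk) and stop_words is the list from step 2:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 2), stop_words=stop_words)  # single words and bi-grams
doc_word = cv.fit_transform(docs)  # sparse matrix: rows = talks, columns = words/n-grams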
Now we are ready for topic modeling! In Part 2 we do topic modeling with Latent Dirichlet Allocation and visualization with tSNE!
Classification of Meows and Woofs: Part 2
Dog and Cat sounds. I reduced the dimensionality of the 1D FFTs of the sounds and then used several models to see which one is able to classify them. Gaussian Naive Bayes won.
Classification of Meows and Woofs: Part 1 Spectrograms!
Dog and Cat sounds. This post is mostly spectrograms.