Let's look at the shape of our dataset:
hotel_reviews.shape
(35912, 19)
This tells us that we are working with 35,912 rows and 19 columns. Eventually, we will be concerned only with the column that contains the text data, but for now, let's look at the first few rows to get a better sense of what our data includes:
hotel_reviews.head()
This gives us the following table:
| | address | categories | city | country | latitude | longitude | name | postalCode | province | reviews. | reviews. | reviews. | reviews. | reviews. | reviews. | reviews.title | reviews.userCity | reviews.username | reviews.userProvince |
| 0 | Riviera San Nicol 11/a | Hotels | Mableton | US | 45.421611 | 12.376187 | Hotel Russo Palace | 30126 | GA | 2013-09-22T00:00:00Z | 2016-10-24T00:00:25Z | NaN | NaN | 4.0 | Pleasant 10 min walk along the sea front to th... | Good location away from the crouds | NaN | Russ (kent) | NaN |
| 1 | Riviera San Nicol 11/a | Hotels | Mableton | US | 45.421611 | 12.376187 | Hotel Russo Palace | 30126 | GA | 2015-04-03T00:00:00Z | 2016-10-24T00:00:25Z | NaN | NaN | 5.0 | Really lovely hotel. Stayed on the very top fl... | Great hotel with Jacuzzi bath! | NaN | A Traveler | NaN |
| 2 | Riviera San Nicol 11/a | Hotels | Mableton | US | 45.421611 | 12.376187 | Hotel Russo Palace | 30126 | GA | 2014-05-13T00:00:00Z | 2016-10-24T00:00:25Z | NaN | NaN | 5.0 | Ett mycket bra hotell. Det som drog ner betyge... | Lugnt läge | NaN | Maud | NaN |
| 3 | Riviera San Nicol 11/a | Hotels | Mableton | US | 45.421611 | 12.376187 | Hotel Russo Palace | 30126 | GA | 2013-10-27T00:00:00Z | 2016-10-24T00:00:25Z | NaN | NaN | 5.0 | We stayed here for four nights in October. The... | Good location on the Lido. | NaN | Julie | NaN |
| 4 | Riviera San Nicol 11/a | Hotels | Mableton | US | 45.421611 | 12.376187 | Hotel Russo Palace | 30126 | GA | 2015-03-05T00:00:00Z | 2016-10-24T00:00:25Z | NaN | NaN | 5.0 | We stayed here for four nights in October. The... | ������ ��������������� | NaN | sungchul | NaN |
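With 19 columns, a printed table like this is wide and hard to scan. As a minimal sketch of two complementary checks, the snippet below lists the column names and counts the missing values per column. It runs on a tiny made-up frame; the `toy` DataFrame and its three columns are illustrative assumptions, not the full dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotel_reviews (hypothetical, three columns only).
toy = pd.DataFrame({
    'city': ['Mableton', 'Mableton', 'Mableton'],
    'reviews.title': ['Good location on the Lido.', 'Great hotel with Jacuzzi bath!', None],
    'reviews.userCity': [np.nan, np.nan, np.nan],
})

# Column names as a plain list, easier to read than a wide table.
print(toy.columns.tolist())

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Columns that are almost entirely missing, as `reviews.userCity` appears to be in the rows above, are unlikely to matter for the text analysis that follows.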
Let's include only reviews from the United States, in an attempt to keep only English reviews. First, let's plot our data, like so:
# Plot the lats and longs of reviews
hotel_reviews.plot.scatter(x='longitude', y='latitude')
The output looks something like this:
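A scatter plot is only a visual check, so a quick numeric summary can complement it. The following is a minimal sketch on a tiny made-up frame (the `toy` DataFrame and its coordinates are illustrative assumptions, not rows from the dataset). It prints the range of each coordinate and counts how many rows fall inside a rough latitude/longitude box:

```python
import pandas as pd

# Toy stand-in for hotel_reviews; these coordinates are made up for illustration.
toy = pd.DataFrame({
    'latitude':  [45.42, 33.82, 40.71, 47.61],
    'longitude': [12.38, -84.58, -74.01, -122.33],
})

# Minimum and maximum of each coordinate column.
print(toy[['latitude', 'longitude']].agg(['min', 'max']))

# Boolean mask: True where a row falls inside the box.
# Series.between is inclusive on both ends by default.
in_box = (
    toy['latitude'].between(24.0, 50.0)
    & toy['longitude'].between(-122.0, -65.0)
)
print(in_box.sum(), 'of', len(toy), 'rows fall inside the box')
```

Note that a western bound of -122.0 leaves out points slightly farther west, such as the last toy row, so a box like this is an approximation rather than an exact border.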
To make our dataset a bit easier to work with, let's use pandas to subset the reviews and keep only those whose coordinates fall within a latitude/longitude box that roughly covers the United States:
# Filter to only include reviews within the US
hotel_reviews = hotel_reviews[
    ((hotel_reviews['latitude'] <= 50.0) & (hotel_reviews['latitude'] >= 24.0)) &
    ((hotel_reviews['longitude'] <= -65.0) & (hotel_reviews['longitude'] >= -122.0))
]
# Plot the lats and longs again
hotel_reviews.plot.scatter(x='longitude', y='latitude')  # Only looking at reviews that are coming from the US
The output is as follows:
It looks like a map of the U.S.! Let's check the shape of our filtered dataset now:
hotel_reviews.shape
We now have 30,692 rows and 19 columns.
When we write reviews for hotels, we usually write about several different things in the same review. For this reason, we will attempt to assign topics to individual sentences rather than to entire reviews.
To do so, let's grab the text column from our data, like so:
texts = hotel_reviews['reviews.text']
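Because topics will be assigned to sentences rather than to whole reviews, each review eventually has to be split into sentences. The following is a minimal sketch of one way to do that, assuming a naive rule that a sentence ends at `.`, `!`, or `?` followed by whitespace. The `split_sentences` helper and the sample reviews are hypothetical, and a dedicated sentence tokenizer would handle abbreviations and other edge cases better:

```python
import re

def split_sentences(review):
    """Naively split a review at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', review.strip())
    # Drop empty strings, which appear when the review itself is empty.
    return [part for part in parts if part]

# Made-up reviews, used only to illustrate the splitting rule.
sample_reviews = [
    "The room was clean. The staff were friendly!",
    "Great location. Breakfast was average. Would we return? Yes.",
]

# Flatten all reviews into a single list of sentences.
sentences = [s for review in sample_reviews for s in split_sentences(review)]
print(sentences)
```

With pandas, the same helper could be applied to the `texts` Series with `texts.apply(split_sentences)`, assuming every entry is a string; missing values would need to be dropped or filled first.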