What is a stop-word list, and what are the advantages of removing stop words?
Stop words are extremely common words
A stop word is a word that carries little information content of its own, such as “and”, “the”, or “www”. The spellings “stopword” and “stop word” are both in common use. Stop words occur very often in text, but they rarely add useful information to a search.
Advantages of removing stop words
- First, the content is reduced to the essentials.
- Second, it saves disk space in your data storage.
- Third, it makes it easier to rank the content by relevance.
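The idea of reducing content to the essentials can be illustrated with a minimal Python sketch. The stop-word set here is a small hand-picked subset chosen for illustration, and the function name `remove_stop_words` and the simple tokenizing rule are assumptions, not part of any particular search engine.

```python
import re

# Small illustrative subset of a stop-word list (an assumption for this sketch);
# a real list, like the English example further down, is much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "www"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("The history of the Internet is a long story"))
```

Nine words go in and four content words remain, which is the reduction the list above describes.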
Search without Stop-Words
In theory, a full-text search indexes every word. Stop words are the exception: the index should not be enlarged unnecessarily by entries for unimportant words. Search engines therefore simply ignore these words, along with punctuation marks other than Boolean operators. They can do so safely because of so-called stop-word lists, which can be extended at any time. Other words that are used very often, such as certain adjectives, verbs or pronouns, are also treated as stop words and added to the lists, as are abbreviations such as www, http or com, which most search engines regard as stop words. Because stop words are already left out when the text is indexed, search engines save space in their databases and speed up search queries.
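As a rough sketch of what skipping stop words at indexing time can look like, the following Python example builds a tiny inverted index that never stores stop words and applies the same filter to queries. The names `build_index` and `search`, the sample documents, and the stop-word subset are all hypothetical; real search engines use far more elaborate structures.

```python
import re
from collections import defaultdict

# Illustrative subset only (assumption); includes the abbreviations named above.
STOP_WORDS = {"a", "and", "the", "of", "to", "in", "is", "www", "http", "com"}


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each non-stop-word token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in STOP_WORDS:  # stop words never enter the index
                index[token].add(doc_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every non-stop-word query term."""
    terms = [t for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The history of the web",
    2: "A web of lies",
    3: "History is written by the victors",
}
index = build_index(docs)
print(sorted(index))
print(search(index, "the history of the web"))
```

The index holds six entries instead of eleven distinct words, and the query is reduced to the two terms “history” and “web” before lookup.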
Search with Stop-Words
If you do want stop words to be taken into account in a search, you can use a so-called phrase search: put the corresponding search phrase in quotation marks. Alternatively, the search terms can be linked together using the plus sign.
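A query parser that honours both conventions might look like the sketch below. The exact syntax is an assumption made for illustration: text inside double quotes is kept as one phrase with its stop words, a term prefixed with `+` is always kept, and every other term is filtered against the stop-word set. The function name `parse_query` is hypothetical.

```python
import re

# Illustrative subset only (assumption); "who" is included because it appears
# in the example English stop-word list.
STOP_WORDS = {"a", "and", "the", "of", "to", "in", "is", "who"}


def parse_query(query: str) -> list[str]:
    """Split a query into search terms, keeping stop words that the user
    protected with quotation marks or a leading plus sign."""
    terms = []
    # Each match is either a quoted phrase (group 1) or a bare token (group 2).
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            terms.append(phrase.lower())    # quoted: keep the whole phrase
        elif not word:
            continue                        # empty pair of quotes: nothing to add
        elif word.startswith("+") and len(word) > 1:
            terms.append(word[1:].lower())  # +term: keep even if a stop word
        elif word.lower() not in STOP_WORDS:
            terms.append(word.lower())      # ordinary term: keep if not a stop word
    return terms


print(parse_query("the who tickets"))
print(parse_query('"The Who" tickets'))
print(parse_query("+the +who tickets"))
```

Without protection only “tickets” survives; with quotation marks or plus signs the stop words stay in the query.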
Special Cases
Proper names that contain stop words are difficult. For example, the English article “the” is a stop word, but it is also an essential part of a search for “The Who”. If it is simply removed, the query no longer says what the user meant.
To prevent this problem, Google uses an advanced stopword detection that works as follows:
- Stopwords are determined from lists and removed as usual.
- Two search queries are generated, both with and without the detected stopwords.
- Search results are retrieved for these search queries.
- These search results are compared.
- If the search results are the same or similar, the removed terms are insignificant stopwords.
- If the search results differ, the stopwords contribute to the meaning of the query.
In this way, Google can avoid removing search terms that play an important role in the evaluation of search queries.
https://patents.google.com/patent/US7945579
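The comparison steps listed above can be sketched in Python over a toy corpus. This is not Google's implementation: the Jaccard similarity measure, the 0.5 threshold, the sample documents, and the names `search`, `jaccard` and `stop_words_matter` are all assumptions chosen to make the idea concrete.

```python
import re

# Illustrative subset only (assumption).
STOP_WORDS = {"a", "and", "the", "of", "to", "in", "is", "who"}


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def search(docs: dict[int, str], terms: list[str]) -> set[int]:
    """Toy search: ids of documents containing every term.
    An empty term list matches every document."""
    hits = set()
    for doc_id, text in docs.items():
        words = set(tokenize(text))
        if all(term in words for term in terms):
            hits.add(doc_id)
    return hits


def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap of two result sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stop_words_matter(docs: dict[int, str], query: str, threshold: float = 0.5) -> bool:
    """Run the query with and without stop words and compare the results.
    Returns True when the results differ enough that the stop words
    appear to carry meaning. The threshold is an arbitrary assumption."""
    terms = tokenize(query)
    stripped = [t for t in terms if t not in STOP_WORDS]
    if len(stripped) == len(terms):
        return False  # no stop words were detected, nothing to decide
    with_stop = search(docs, terms)
    without_stop = search(docs, stripped)
    return jaccard(with_stop, without_stop) < threshold


docs = {
    1: "The Who announce a farewell tour",
    2: "Tickets for the concert of The Who",
    3: "The history of the internet",
    4: "A short history of the web and the internet",
    5: "The history of internet protocols",
}
print(stop_words_matter(docs, "the who"))
print(stop_words_matter(docs, "the history of the internet"))
```

For “the who”, stripping the stop words leaves an empty query that matches everything, so the two result sets diverge and the stop words are kept. For “the history of the internet”, both variants return the same documents, so the stop words can be dropped.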
Jeff Atwood, co-founder of Stack Overflow, ran the following stopword experiment in 2004.
Over a period of a week, he searched for an entire dictionary of ~110k individual English words and recorded how many hits Google returned for each. Yes, this was probably a massive violation of the Google terms of service, but he tried to keep it polite and low impact: he used Gzip-compressed HTTP requests, requested only 10 search results per query (all he needed was the hit count), and added a healthy delay between queries so that he wasn’t querying too rapidly. He is not sure this kind of experiment would fly against today’s Google, but it worked in 2004. At any rate, he ended up with a MySQL database of 110,000 English words and their frequency in Google as of late summer 2004.
Most used words in Google (52)
the 522,000,000
of 515,000,000
and 508,000,000
to 507,000,000
in 479,000,000
for 468,000,000
internet 429,000,000
on 401,000,000
home 370,000,000
is 368,000,000
by 366,000,000
all 352,000,000
this 341,000,000
with 338,000,000
services 329,000,000
about 319,000,000
or 317,000,000
at 316,000,000
email 311,000,000
from 308,000,000
are 306,000,000
website 302,000,000
us 301,000,000
site 283,000,000
sites 279,000,000
you 276,000,000
information 276,000,000
contact 274,000,000
more 271,000,000
an 271,000,000
search 269,000,000
new 269,000,000
that 267,000,000
your 262,000,000
it 261,000,000
be 258,000,000
prices 258,000,000
as 255,000,000
page 246,000,000
hotels 240,000,000
products 234,000,000
other 222,000,000
have 219,000,000
web 219,000,000
copyright 218,000,000
download 218,000,000
not 214,000,000
can 209,000,000
reviews 209,000,000
our 206,000,000
use 205,000,000
women 200,000,000
Example of English Stop-Word List (153)
a about above after again against all am an and any are as at
be because been before being below between both but by
could did do does doing down during each few for from further
had has have having he he’d he’ll he’s her here here’s hers herself him himself his how how’s
I I’d I’ll I’m I’ve if in into is it it’s its itself let’s
me more most my myself nor of on once only or other ought our ours ourselves out over own
same she she’d she’ll she’s should so some such
than that that’s the their theirs them themselves then there there’s these they they’d they’ll they’re they’ve this those through to too
under until up very was we we’d we’ll we’re we’ve were what what’s when when’s where where’s which while who who’s whom why why’s with would
you you’d you’ll you’re you’ve your yours yourself yourselves
Download stop-word lists in several languages
https://github.com/gaffling/stopwords/