Case Sensitivity and Duplicate URLs Getting Crawled
I've seen several scenarios where a single document gets crawled twice and leads to duplicate results for this particular item – two entries in the Crawl Log with the same display URL, but with different Doc IDs. This isn't the typical scenario where multiple very similar documents get calculated as "Near Duplicate" items, but rather a single document being crawled as two unique items.
Spoiler: the likely cause is either a case-sensitive crawl rule or the "CaseSensitiveCrawling" property being enabled on the Search Service Application. For more details on both, see Crawling case sensitive repositories using SharePoint Server 2010.
Should this occur, compare the duplicate URLs in the crawl log. Is the capitalization of the duplicate URLs consistent? In other words, does one item have capitalization such as https://foo/lookAtMe.aspx whereas the other has something like https://foo/LOOKatME.aspx?
Also, find the duplicate items in the MSSCrawlURL table in the Crawl Store DB. For example, use the "Item ID" displayed in the Crawl Log UI, which corresponds to the DocId column (assume Item IDs 12 and 22 represent the duplicates), to run a SQL query such as:
SELECT * FROM [SSA_CrawlStoreDB].[dbo].[MSSCrawlURL] WHERE DocId IN (12, 22)
From this output, compare the capitalization of the AccessUrl for each, such as the following duplicates:
- sts4://foo/siteurl=sites/news/siteid={guid}/weburl=lookAtMe/webid={guid}/listid={guid}/folderurl=/itemid=123
- sts4://foo/siteurl=sites/News/siteid={guid}/weburl=LOOKatME/webid={guid}/listid={guid}/folderurl=/itemid=123
The results of that SQL query can also show where an item got emitted from: look at the ParentDocId and EnumerationDepth columns.
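If eyeballing two long AccessUrl values gets tedious, a small script can point at the segments that differ only by case. The following is a minimal Python sketch, not anything that ships with SharePoint. The function name case_only_differences is made up for illustration, and it assumes you have pasted in the two AccessUrl strings from the query output.

```python
def case_only_differences(url_a: str, url_b: str) -> list[tuple[str, str]]:
    """Return the "/"-separated segments that differ between two URLs only by case.

    An empty list means the URLs are identical or differ by more than case.
    """
    if url_a.casefold() != url_b.casefold():
        return []  # the URLs differ by more than capitalization
    # Case folding never adds or removes "/", so both splits have equal length.
    return [(a, b) for a, b in zip(url_a.split("/"), url_b.split("/")) if a != b]


# The two example AccessUrl values from this post.
first = "sts4://foo/siteurl=sites/news/siteid={guid}/weburl=lookAtMe/webid={guid}/listid={guid}/folderurl=/itemid=123"
second = "sts4://foo/siteurl=sites/News/siteid={guid}/weburl=LOOKatME/webid={guid}/listid={guid}/folderurl=/itemid=123"

for seg_a, seg_b in case_only_differences(first, second):
    print(f"{seg_a!r} vs {seg_b!r}")
```

For the two example URLs, this reports the "news" / "News" segment and the "weburl=lookAtMe" / "weburl=LOOKatME" segment.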
Finally, you can also find this in ULS when the item gets crawled, which emits an event such as the following:
mssdmn.exe SharePoint Server Search Connectors:SharePoint dv7f VerboseEx
Emit link sts4://foo/siteurl=sites/news/siteid={guid}/weburl=lookAtMe/webid={guid}/listid={guid}/folderurl=/itemid=123,
DisplayURL = https://foo/sites/News/lookAtMe/fake.aspx
Note: In SP2013, the event is dv7f as above, but in SP2010, this same message is logged under category PHSts as event dv8v.
In summary, this is a short post and admittedly piggybacks on the work of other folks. However, I've seen this scenario pop up on multiple occasions, so I wanted to reiterate it here.
In all the scenarios that I've seen, this has always come down to inconsistent capitalization. However, I suspect that it is also feasible (but not likely) for the AccessUrl values to differ in some way other than case and cause a similar outcome. In either case, you still have the tools above to isolate the items and find the difference between these URLs to troubleshoot why the duplicates occur.
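When more than a couple of items are affected, the same idea extends to a whole export. The sketch below is only an illustration under a few assumptions: the rows have already been exported from the crawl store as (DocId, AccessUrl) pairs, the helper name find_case_variants is hypothetical, and the sample rows are invented. It groups the URLs by their case-folded form and keeps only the groups that contain more than one spelling.

```python
from collections import defaultdict


def find_case_variants(rows):
    """Group (doc_id, access_url) rows whose URLs differ only by letter case."""
    groups = defaultdict(list)
    for doc_id, access_url in rows:
        # URLs that differ only in capitalization share one case-folded key.
        groups[access_url.casefold()].append((doc_id, access_url))
    # Keep only the groups that hold at least two distinct spellings.
    return [g for g in groups.values() if len({url for _, url in g}) > 1]


# Invented sample rows, standing in for an export of DocId and AccessUrl.
sample_rows = [
    (12, "sts4://foo/weburl=lookAtMe/itemid=123"),
    (22, "sts4://foo/weburl=LOOKatME/itemid=123"),
    (30, "sts4://foo/weburl=other/itemid=7"),
]

for group in find_case_variants(sample_rows):
    print(group)
```

With the sample rows, only DocIds 12 and 22 come back as a group. DocId 30 has no case variant, so it is left out.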
For capitalization-related causes:
Does this URL match a Crawl Rule and if so, does it specify "Match Case"?
Does the SSA have this globally enabled? To check, run the following:
$SSA = Get-SPEnterpriseSearchServiceApplication "<the Name of Your SSA>"
#Note: If this GetProperty method below fails, the value is not set for the SSA meaning the default is in effect – which is case insensitive crawling
$SSA.GetProperty("CaseSensitiveCrawling")
And, to change the case sensitivity (a value of 0 here turns case-sensitive crawling off)…
$SSA = Get-SPEnterpriseSearchServiceApplication "<the Name of Your SSA>"
$SSA.SetProperty("CaseSensitiveCrawling", 0)
$SSA.Update()
I hope this helps…
And for full disclosure and sanity check…
The first time I encountered this issue, my gut [incorrectly] said, "URLs are always case insensitive, right?" This then digressed into, "Well, maybe Unix/Linux based or Apache web servers could feasibly be case sensitive, but surely not IIS, right?"
As far as case sensitivity goes, here are some additional references:
https://www.w3.org/TR/WD-html40-970708/htmlweb.html
"URLs in general are case-sensitive (with the exception of machine names). There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive"
https://httpd.apache.org/docs/2.2/urlmapping.html
"An especially useful feature of mod_speling, is that it will compare filenames without respect to case. This can help systems where users are unaware of the case-sensitive nature of URLs and the unix filesystem. But using mod_speling for anything more than the occasional URL correction can place additional load on the server, since each "incorrect" request is followed by a URL redirection and a new request from the client."
And the best I could find regarding IIS:
https://stackoverflow.com/questions/5811021/how-to-enable-case-sensitivity-under-iis-express
"It is a misnomer that IIS is case-insensitive, it is the Windows file system that is case-insensitive, not IIS.... But for other than real file paths, IIS is 100% case-sensitive. The case of URL characters is passed to the IIS pipeline intact. It is up to the web application whether or not case-sensitivity exists. But good practice says you don't want /page1 to be different than /PAGE1."