How to crawl the same URL using Scrapy?

Let’s learn how to crawl the same URL using Scrapy. The most accurate or helpful solution comes from stackoverflow.com.

There are ten answers to this question.

Best solution

python - How to use scrapy to crawl same url by post ...

I want to crawl a website by POSTing different page numbers, but I only get the data of the first page and then the spider finishes. I think it may be because it crawls the same URL, it ...

stackoverflow.com
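The excerpt above is truncated, so the actual cause of the asker's problem is not confirmed here. If the spider stops because repeated requests to one URL are dropped as duplicates, the mechanism can be modelled roughly as below. This is a simplified stand-in written with the standard library, not Scrapy's own code: `fingerprint`, `DupeFilter`, and the example.com URL are hypothetical. In Scrapy itself, the corresponding opt-out is the `dont_filter=True` argument on `Request` and `FormRequest`.

```python
import hashlib
from urllib.parse import urlencode


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that make it 'the same' request."""
    h = hashlib.sha1()
    for part in (method.upper().encode(), url.encode(), body):
        h.update(part)
        h.update(b"\n")
    return h.hexdigest()


class DupeFilter:
    """Remembers fingerprints and rejects repeats unless filtering is disabled."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, method, url, body=b"", dont_filter=False):
        if dont_filter:
            # The request bypasses the filter and is not recorded.
            return True
        fp = fingerprint(method, url, body)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        return True


URL = "http://example.com/list"  # hypothetical endpoint
dupefilter = DupeFilter()

# Same URL, different POST bodies: the fingerprints differ, so all are scheduled.
posts = [
    dupefilter.should_schedule("POST", URL, urlencode({"page": n}).encode())
    for n in (1, 2, 3)
]

# Same URL requested three times with identical GETs: only the first passes.
gets = [dupefilter.should_schedule("GET", URL) for _ in range(3)]

# Identical requests again, but each one opts out of duplicate filtering.
forced = [dupefilter.should_schedule("GET", URL, dont_filter=True) for _ in range(3)]
```

In this model, a request repeated with the same method, URL, and body is silently skipped, which would look like a spider that finishes after the first page.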

Other solutions

Google Webmaster Tools: Why Duplicate Meta descriptions are increasing even after using URL parameters?

Our site is showing an increase in duplicate meta tags and descriptions even after we set "Paginates" with "No URLs" to crawl in the URL Parameters section. What else should we do so that Googlebot stops crawling these pages and adding to the list of duplicate tags?

Answer:

This is a canonicalization problem with the website. Define a preferred URL and use the canonical tag on each page....

Chhote Lal Lodh at Quora

When implementing a site-wide URL update, should on-site links be updated to the new format immediately, or should you wait a couple weeks for Google to associate the new URLs via the Canonical link element?

When you have to update/change just about every URL on your site, and you're using 301 redirects to send traffic going to the old style URLs to the new version of that URL, I'm wondering if it's best to update on-site links (like those in a top nav that...

Answer:

I don't see any benefit in leaving the internal links pointing to the old URLs. By switching over to...

Dan Cristo at Quora

How would you disallow a dynamic URL parameter in robots.txt?

Lots of documentation online about how to block /?q= but nothing conclusive. My worry about using /*?q is other parameters that begin with the letter q that we don't want in the no-crawl list. Example URL (how would you disallow it): example dot com...

Answer:

/*?q=* would be my first choice. I'm 99% sure the trailing * is me showing my age, and is totally redundant...

Ian Lurie at Quora
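The concern raised in the question can be checked with a small matcher. The sketch below assumes Google-style wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the end, and rules match from the start of the path); `robots_pattern_to_regex` and `is_blocked` are hypothetical helper names, and individual crawlers may differ in the details.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern using * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if a Disallow rule with this pattern would match the path."""
    return robots_pattern_to_regex(pattern).match(path) is not None


# "/*?q=" only matches when the query string starts with the parameter q.
blocks_q = is_blocked("/*?q=", "/search?q=shoes")

# A parameter that merely begins with "q" is left alone by "/*?q=" ...
blocks_query = is_blocked("/*?q=", "/search?query=shoes")

# ... but the shorter "/*?q" would catch it, which is the worry in the question.
short_blocks_query = is_blocked("/*?q", "/search?query=shoes")

# Under these semantics the trailing "*" does not change the result.
same_with_star = is_blocked("/*?q=*", "/search?q=shoes") == blocks_q
```

Note that under these assumptions `/*?q=` does not match a URL where `q` is not the first parameter, such as `/search?page=2&q=shoes`; covering that case would need an additional rule.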

Why can't Amazon prices be scraped?

I'm scraping the prices, author name, and title of the book from the URL: http://www.amazon.com/Alpha-Jasi... ... And this is my scraping code using Scrapy: [Python] from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkex - Pastebin...

Answer:

Because the prices are populated via JavaScript. Scrapy doesn't support JavaScript rendering. You have...

Shobhit Jain at Quora

How to scrape the Google cache

Had a plan to scrape a website, but now it's down indefinitely. Google has the site cached, but this makes things kind of complicated. Newbie questions about scraping websites and using the Google cache inside. I've read this question - should I be trying...

Answer:

You may want to check archive.org's Wayback Machine as well, which could help cover question 2) above...

hot soup at Ask.Metafilter.Com

I'm building a website that contains sorting when navigating from page to page. Should I add the sorting parameter to the URL or keep it in a cookie for SEO so that the same page URL is linked to every time?

The advantage of using a sorting parameter is that I can cache the entire page. Another option is to keep the page static by adding ajax calls for links that use the sorting algorithm, however this affects the ability to crawl the site for links I believe...

Answer:

Another option, besides those you mention, is to use rel canonical so that link juice flows to the default...

Julias Shaw at Quora
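To make the rel canonical suggestion concrete, here is a minimal sketch of how a sorted URL could be mapped to its default form. It assumes that `sort` and `order` are the only parameters that merely reorder the listing; those names, the helper functions, and the example.com URLs are hypothetical and would need to match the real site.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters change only the ordering, not the content.
PRESENTATION_PARAMS = {"sort", "order"}


def canonical_url(url: str) -> str:
    """Drop presentation-only parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link> element to place in the page's <head>."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


sorted_page = "http://example.com/products?page=2&sort=price"
default_page = canonical_url(sorted_page)
tag = canonical_link_tag("http://example.com/products?sort=price&order=desc")
```

With this approach the sorted variants can still be cached as separate pages, while each one points search engines at a single default URL.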

Does Google crawl http://goo.gl shortened URLs?

We are using Google's own url shortening service. All the target URLs which are shortened are publicly available but I don't want Google to index them. I understand that robots.txt is going to prevent indexing of such documents if I choose to. I was...

Answer:

No, shortened URLs will not appear in the search results. The content will. Google won't necessarily...

Jesse Leimgruber at Quora

virtual site not indexed by google

Yet another "Why don't I appear on Google?" We host a virtual site for a client on our web server machine. Our site (http://www.techsmiths.com/) is indexed and appears reasonably placed in Google. Our client's (http://www.phoenixoptions.com...

Answer:

Hi Techsmith ~ You're right, we do get a lot of questions about sites which don't show up in Google...

techsmith-ga at Google Answers

How do I order the nodes of a social network to get the best locality when running map-reduce graph algorithms?

The Schimmy approach to optimising the performance of graph algorithms uses a partitioning strategy that groups nodes together using data derived from some attributes of the nodes -- see http://www.cloudera.com/blog/201... For example, a web-crawl graph...

Answer:

it's been a while (meaning the graph was pretty small then and so was the hadoop cluster and the average...

Joydeep Sen Sarma at Quora
