GitHub ACM Citations

The project I worked on is a web crawler that gathers paper metadata in BibTeX format from ACM conference pages. The metadata includes each paper's title, authors, abstract, publication conference, and year.

To fetch web pages, the crawler uses the Requests library for Python. PhantomJS, a headless browser, executes the JavaScript on those pages so that dynamically rendered content is also available for parsing.
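
The snippet below is a minimal sketch of the two fetching paths, assuming Selenium's legacy PhantomJS driver (one common way to drive PhantomJS from Python) and an illustrative User-Agent and timeout; it is not the project's exact code.

```python
import requests
from selenium import webdriver

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; acm-citation-crawler)"}

def fetch_static(url):
    """Fetch a page whose content does not depend on JavaScript."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text

def fetch_rendered(url):
    """Fetch a page after letting PhantomJS execute its JavaScript."""
    driver = webdriver.PhantomJS()  # requires the phantomjs binary on PATH
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```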

To parse the paper metadata, the crawler combines regular expressions with BeautifulSoup, a Python library for parsing HTML. Together they extract the fields needed to assemble each BibTeX entry from the fetched pages.
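
As a sketch of this step, the parser below assumes hypothetical element classes ("citation__title", "author-name", "abstractSection", "CitationCoverDate"); the real ACM markup may differ. A regular expression pulls the four-digit year out of a free-form date string, and the fields are then formatted into a BibTeX entry.

```python
import re
from bs4 import BeautifulSoup

def parse_paper(html, conference):
    """Extract title, authors, abstract, and year, and format a BibTeX entry."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="citation__title").get_text(strip=True)
    authors = [a.get_text(strip=True) for a in soup.find_all("a", class_="author-name")]
    abstract = soup.find("div", class_="abstractSection").get_text(strip=True)
    date_text = soup.find("span", class_="CitationCoverDate").get_text(strip=True)

    # The year is buried in a free-form date string, so a regex extracts it.
    match = re.search(r"\b(19|20)\d{2}\b", date_text)
    year = match.group(0) if match else "unknown"

    key = (authors[0].split()[-1].lower() + year) if authors else year
    return (
        "@inproceedings{{{key},\n"
        "  title     = {{{title}}},\n"
        "  author    = {{{authors}}},\n"
        "  booktitle = {{{conference}}},\n"
        "  year      = {{{year}}},\n"
        "  abstract  = {{{abstract}}}\n"
        "}}\n"
    ).format(key=key, title=title, authors=" and ".join(authors),
             conference=conference, year=year, abstract=abstract)
```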

To speed up crawling, the crawler uses a map-reduce style multi-threading scheme in Python: multiple web pages are fetched and parsed concurrently (the map step) and the per-page results are merged into a single output (the reduce step), which significantly reduces the overall execution time.
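
Here is a minimal sketch of that scheme, reusing the hypothetical fetch_static and parse_paper helpers from the earlier sketches; the worker count is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_all(urls, conference, workers=8):
    # Map step: each worker thread turns one paper URL into one BibTeX entry.
    def crawl_one(url):
        return parse_paper(fetch_static(url), conference)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        entries = list(pool.map(crawl_one, urls))

    # Reduce step: merge the per-page entries into one BibTeX document.
    return "\n".join(entries)
```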

To avoid being mistaken for a Denial-of-Service (DoS) attack and getting blocked, the crawler can be configured with a set of proxies. Requests are spread across these proxies, so no single source IP sends an unusually high volume of traffic.
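
The sketch below rotates requests through a configured proxy list; the proxy addresses are placeholders, and how the actual crawler selects its proxies is not shown here.

```python
import itertools
import requests

# Placeholder proxy addresses; replace with a real proxy pool.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

def fetch_via_proxy(url):
    """Send each request through the next proxy in the rotation."""
    proxy = next(_proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=30).text
```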

The collected BibTeX files are available in my public GitHub repository, so the gathered paper metadata can easily be accessed and shared with others.

Techniques: HTTP, PhantomJS, HTML parsing, regex, map-reduce