Scraping your Google Scholar data

rstats
academia
dataviz
ggplot2
Author

Peder Braadland

Published

April 29, 2024

Just to be clear: Citations, H-index and other similar metrics do not indicate whether one’s research is good or important. The same goes for journal rankings - publishing in high level journals does not make you a good researcher, although it (unfortunately) still improves your chances of getting grants and tenure.

I do find it interesting, however, whether people are citing my work. In many ways I feel that my more impactful work is often the least cited. A better predictor of citations is whether the content is flashy and novel – regardless of whether the research is reproducible and generalizable.

In this post I will show you how to scrape Google Scholar profiles using the scholar package in R (available on CRAN). While Google Scholar’s dashboard gives you a brief overview of your citations, I will show you how to collect more detailed data, including citation data over time for individual articles, and how to create your own bibliography, which can be handy for a CV that is easy to keep up to date.

Creating a bibliography

We start off by finding our Google Scholar author ID and use it to retrieve all publications associated with that ID. We get eight pieces of information per publication: title, authors, journal, a field with volume, issue and pages, number of citations, year of publication, a citation ID and a publication ID.

Code for scraping your Google Scholar publication data
# You'll need to install the packages below prior to loading.
# install.packages("scholar") 
# You can also use the following code (not used here since it doesn't cooperate well with knitr)
# pkg <- c("tidyverse", "ggpubr", "scholar", "formattable", "kableExtra")
# lapply(pkg, FUN = function(X){do.call("require", list(X))})

library(kableExtra)
library(formattable)
library(scholar)
library(tidyverse)
library(ggpubr)


'%ni%' <- Negate('%in%')

author_id <- scholar::get_scholar_id(last_name = "Braadland", first_name = "Peder")
pubs <- scholar::get_publications(author_id) 
glimpse(pubs)
Rows: 35
Columns: 8
$ title   <chr> "β-adrenergic receptor signaling in prostate cancer", "Ex vivo…
$ author  <chr> "PR Braadland, H Ramberg, HH Grytli, KA Taskén", "PR Braadland…
$ journal <chr> "Frontiers in oncology", "British journal of cancer", "Molecul…
$ number  <chr> "4, 375", "117 (11), 1656-1664", "17 (11), 2154-2168", "26 (4)…
$ cites   <dbl> 80, 40, 29, 25, 20, 16, 13, 11, 5, 3, 3, 2, 1, 0, 0, 0, 0, 0, …
$ year    <dbl> 2015, 2017, 2019, 2019, 2023, 2020, 2021, 2016, 2020, 2022, 20…
$ cid     <chr> "16069725449159400664", "190629465548971635", "164841235280137…
$ pubid   <chr> "UeHWp8X0CEIC", "u5HHmVD_uO8C", "YsMSGLbcyi4C", "W7OEmFMy1HYC"…
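If the name search in get_scholar_id() does not find the right profile (for example with a common name), the ID can also be read from the profile URL, where it appears as the user parameter. Below is a minimal sketch in base R; the URL and the helper name extract_scholar_id are made up for illustration.

```r
# Hypothetical helper: pull the value of the `user` query parameter
# out of a Google Scholar profile URL.
extract_scholar_id <- function(url) {
  # Match "user=" preceded by ? or &, and capture everything up to the next & or #
  m <- regmatches(url, regexec("[?&]user=([^&#]+)", url))[[1]]
  if (length(m) < 2) {
    return(NA_character_)
  }
  m[2]
}

# Made-up profile URL (the ID below is a placeholder, not a real profile)
profile_url <- "https://scholar.google.com/citations?user=AbCdEfGhIjKl&hl=en"
author_id_from_url <- extract_scholar_id(profile_url)
author_id_from_url
```

The resulting string can then be passed to get_publications() in place of author_id.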

Note that Google Scholar does not distinguish between original research, review articles, conference abstracts, preprints, replies etc., so you will need to do some data wrangling to collect what you want to display. In my case, I want to display only original research and review articles. Since I have a rather short publication record, this was not a lot of work.

Code for cleaning your publication data
# Filter only original articles and aggregate duplicates

not_include <- c(
    "CIGARETTE SMOKING ASSOCIATES WITH SUPERIOR SURVIVAL AND INCREASED ILEAL MICROBIAL DIVERSITY IN PRIMARY SCLEROSING CHOLANGITIS",
    "Chromatin reprogramming as an adaptation mechanism in prostate cancer",
    "Mitochondrial dysfunction and lipid alterations in primary sclerosing cholangitis",
    "Multimarker analysis combining markers of fibrosis and inflammation in primary sclerosing cholangitis",
    "Potential systemic effects of beta-blocker use in prostate cancer patients",
    "Prostate Cancer Metabolism: Effects of \u03b22-Adrenergic Receptor Knockdown in LNCaP Cells",
    "Scavenging for lethal prostate cancer biomarkers in FFPE tissue",
    "Spatial transcriptomics reveals shared gene and cellular composition in recurrent and primary sclerosing cholangitis",
    "Strong stroma and epithelial expression of Leucine-rich α-2-glycoprotein 1 predicts contradictory outcomes in patients progressing to castration-resistant prostate cancer",
    "Pre-surgery blood levels of leucine-rich alpha-2-glycoprotein 1 identify patients with a high risk of progressing to castration-resistant prostate cancer",
    "Targeting therapy resistance in advanced prostate cancer"
)

pubs_clean <- 
pubs %>%
    # One article had been duplicated albeit with different titles
    mutate(title = ifelse(title == "Ex vivo metabolic fingerprinting identifies biomarkers predictive of prostate cancer recurrence",
                          "Ex vivo metabolic fingerprinting identifies biomarkers predictive of prostate cancer recurrence following radical prostatectomy", title)) %>%
    arrange(title) %>%
    # For duplicated articles, I summed the citations and filtered only one occurrence to avoid duplicates
    group_by(title) %>%
    mutate(cites = sum(cites)) %>%
    distinct(title, .keep_all = TRUE) %>%
    ungroup() %>%
    # Remove entries containing strings such as 'abstract', 'reply' etc. 
    filter(!str_detect(title, "Abstract")) %>%
    filter(!str_detect(title, "Back Cover")) %>%
    filter(!str_detect(title, "Reply")) %>%
    filter(!str_detect(title, "Prostate Cancer Metabolism:")) %>%
    filter(pubid != "mB3voiENLucC") %>%
    # Entries without a journal are likely not original research or review articles, but could be e.g. theses
    filter(!is.na(journal)) %>%
    # Remove entries manually
    filter(title %ni% not_include) %>%
    # Remove preprints
    filter(journal != "medRxiv") %>%
    # One article is assigned 2015 but was in fact published in 2014
    mutate(year = ifelse(title == "\u03b2-adrenergic receptor signaling in prostate cancer", 2014, year)) %>%
    arrange(year) 
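To make the de-duplication step easier to follow, here is a small sketch on made-up data, written in base R so that it runs without extra packages. It harmonises a variant title, sums citations across duplicates and drops entries whose titles match unwanted patterns. The titles and counts are invented.

```r
# Toy publication list (invented titles and citation counts)
toy_pubs <- data.frame(
  title = c("Paper A", "Paper A following surgery", "Paper B",
            "Abstract 123: Paper C", "Reply to: Paper D"),
  cites = c(40, 3, 10, 0, 1),
  stringsAsFactors = FALSE
)

# 1. Harmonise the title of the duplicated entry
toy_pubs$title[toy_pubs$title == "Paper A"] <- "Paper A following surgery"

# 2. Sum citations per title, keeping one row per title
toy_clean <- aggregate(cites ~ title, data = toy_pubs, FUN = sum)

# 3. Drop abstracts and replies with a single regular expression
toy_clean <- toy_clean[!grepl("Abstract|Reply", toy_clean$title), ]
toy_clean
```

Combining the patterns with | in one expression is an alternative to chaining several filter() calls; which reads better is a matter of taste.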

In the next step I will create a visually appealing table of my articles, and include a bar showing the number of citations per article. I use the kableExtra and formattable packages to achieve this.

Code for producing the table
# Prepare a data frame for the table
pubs_df <-
  data.frame(
    # For each article, create a character string including author, title, year, journal and page number, issue etc
    Information =
      pubs_clean %>%
        mutate(full_citation = paste0(
          author,
          " (", year, "). ",
          title, ". ",
          journal, ". ",
          number, ". "
        )) %>%
        arrange(year, title) %>%
        pull(full_citation),
    # Pull the number of citations
    citations =
      pubs_clean %>% arrange(year, title) %>% pull(cites),
    # Pull the publication ID (may be handy for later joining with other metrics)
    pubid =
      pubs_clean %>% arrange(year, title) %>% pull(pubid)
  )

# We use formattable::color_bar to assign a color to the bars reflecting the relative number of citations
pubs_df$citations <- formattable::color_bar("#ffd38b")(pubs_df$citations)

# Create the table
pubs_df %>%
  dplyr::select(-pubid) %>%
  kable("html", escape = FALSE, html_font = "Roboto") %>%
  kable_paper() %>%
  # Specify width of the citations bar cells
  column_spec(2, width = "4cm") %>%
  kable_styling(bootstrap_options = "striped", font_size = 13) %>%
  footnote("Shows only peer-reviewed articles and pre-prints") 
Information | Citations
PR Braadland, H Ramberg, HH Grytli, KA Taskén (2014). β-adrenergic receptor signaling in prostate cancer. Frontiers in oncology. 4, 375. | 80
PR Braadland, HH Grytli, H Ramberg, B Katz, R Kellman, ... (2016). Low β2-adrenergic receptor level may promote development of castration resistant prostate cancer and altered steroid metabolism. Oncotarget. 7 (2), 1878. | 11
PR Braadland, G Giskeødegård, E Sandsmark, H Bertilsson, LR Euceda, ... (2017). Ex vivo metabolic fingerprinting identifies biomarkers predictive of prostate cancer recurrence following radical prostatectomy. British journal of cancer. 117 (11), 1656-1664. | 43
PR Braadland, A Urbanucci (2019). Chromatin reprogramming as an adaptation mechanism in advanced prostate cancer. Endocrine-related cancer. 26 (4), R211-R235. | 25
PR Braadland, H Ramberg, HH Grytli, A Urbanucci, HK Nielsen, ... (2019). The β2-Adrenergic Receptor Is a Molecular Switch for Neuroendocrine Transdifferentiation of Prostate Cancer Cells. Molecular Cancer Research. 17 (11), 2154-2168. | 29
A Serguienko, P Braadland, LA Meza-Zepeda, B Bjerkehagen, ... (2020). Accurate 3-gene-signature for early diagnosis of liposarcoma progression. Clinical Sarcoma Research. 10, 1-11. | 5
IJ Guldvik, V Zuber, PR Braadland, HH Grytli, H Ramberg, W Lilleby, ... (2020). Identification and validation of leucine-rich α-2-glycoprotein 1 as a noninvasive biomarker for improved precision in prostate cancer risk stratification. European urology open science. 21, 51-60. | 16
H Ramberg, E Richardsen, GA de Souza, M Rakaee, ME Stensland, ... (2021). Proteomic analyses identify major vault protein as a prognostic biomarker for fatal prostate cancer. Carcinogenesis. 42 (5), 685-693. | 13
PR Braadland, A BERGQUIST, C Rupp, R Voitl, AK Dhillon, T Folseraas, ... (2021). Vitamin B6 deficiency associates with liver transplantation-free survival in primary sclerosing cholangitis. JOURNAL OF HEPATOLOGY. 75, S421-S422. | 0
IJ Guldvik, PR Braadland, S Sivanesan, H Ramberg, G Kristensen, ... (2022). Low blood levels of LRG1 before radical prostatectomy identify patients with high risk of progression to castration-resistant prostate cancer. European Urology Open Science. 45, 68-75. | 2
PR Braadland, KM Schneider, A Bergquist, A Molinaro, ... (2022). Suppression of bile acid synthesis as a tipping point in the disease course of primary sclerosing cholangitis. JHEP Reports. 4 (11), 100561. | 3
MJ Hole, KK Jørgensen, K Holm, PR Braadland, MH Meyer‐Myklestad, ... (2023). A shared mucosal gut microbiota signature in primary sclerosing cholangitis before and after liver transplantation. Hepatology. 77 (3), 715-728. | 20
PR Braadland, A Bergquist, M Kummen, L Bossen, LK Engesæter, ... (2023). Clinical and biochemical impact of vitamin B6 deficiency in primary sclerosing cholangitis before and after liver transplantation. Journal of Hepatology. 79 (4), 955-966. | 0
SC Raju, A Molinaro, A Awoyemi, SF Jørgensen, PR Braadland, A Nendl, ... (2024). Microbial-derived imidazole propionate links the heart failure-associated microbiome alterations to disease severity. Genome Medicine. 16 (1), 27. | 1
W Lin, L Gerullat, PR Braadland, A Fournier, JR Hov, D Globisch (2024). Rapid and Bifunctional Chemoselective Metabolome Analysis of Liver Patient Samples Using the Reagent 4‐Nitrophenyl‐2H‐azirine. Angewandte Chemie International Edition, e. e202318579. | 0
Note:
Shows only peer-reviewed articles and pre-prints
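Since the same citation strings can be reused in a CV, one option is to write them to a plain-text file. This is a sketch on invented entries; the numbering format and the file location are my own choices, not something the scholar package provides.

```r
# Invented entries standing in for pubs_clean
toy_bib <- data.frame(
  author  = c("A Author, B Author", "C Author"),
  year    = c(2021, 2019),
  title   = c("Second paper", "First paper"),
  journal = c("Journal Two", "Journal One"),
  number  = c("12 (3), 45-67", "4, 375"),
  stringsAsFactors = FALSE
)

# Newest first, as is common in CVs
toy_bib <- toy_bib[order(-toy_bib$year, toy_bib$title), ]

# Build one numbered citation string per article
bib_lines <- sprintf(
  "%d. %s (%d). %s. %s. %s.",
  seq_len(nrow(toy_bib)),
  toy_bib$author, toy_bib$year, toy_bib$title, toy_bib$journal, toy_bib$number
)

# Write to a temporary file (swap in a real path for actual use)
bib_file <- tempfile(fileext = ".txt")
writeLines(bib_lines, bib_file)
readLines(bib_file)
```

Re-running the scrape and this export whenever a new article appears keeps the CV list in sync with the Scholar profile.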

Visualizing cumulative publications

For fun I decided to look at how my article numbers are accumulating. The model fit for a successful researcher should be an exponential one, right? Just kidding.

Code for packages and plot theme
library(tidyverse)
library(lubridate)
library(ggpubr)

# Define a theme
theme_simple <- function() {
  theme_minimal(base_family = "Montserrat") +
    theme(
      axis.title = element_blank(),
      panel.grid.minor = element_blank(),
      plot.title = element_text(face = "bold", hjust = 0),
      plot.subtitle = element_text(hjust = 0),
      strip.text = element_text(hjust = 0, size = 5, color = "#444444")
    )
}

I decided to add career breaks and other achievements below the plot.

Code for wrangling and producing the plot
# Since I'm going to create a vertically aligned plot composite, I want the two to have the same axis breaks and limits.

# Plot the step plot (cumulative publications)
# --------------------------------------->
step_plot_df <-
  pubs_clean %>%
  group_by(year) %>%
  summarise(sum_year = n()) %>%
  mutate(cum_freq = cumsum(sum_year)) %>%
  bind_rows(data.frame(year = 2013, sum_year = 0, cum_freq = 0))

step_plot <-
  step_plot_df %>%
  ggplot(aes(x = year, y = cum_freq)) +
  geom_step(color = "#222222", size = 0.75) +
  geom_smooth(color = "#999999", size = 0.4, se = FALSE, linetype = "dashed") +
  scale_x_continuous(breaks = c(2014:max(step_plot_df$year)), limits = c(2013, max(step_plot_df$year) + 1), expand = c(0.05, 0.05)) +
  scale_y_continuous(breaks = seq(0, max(step_plot_df$cum_freq), 2)) +
  labs(
    title = "Cumulative sum of published articles\n\n",
    y = "",
    x = "Year of publication"
  ) +
  theme_simple() +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(vjust = -0.2, margin = margin(r = -11), hjust = 0)
  )



# Annotate periods of interest (parental leave, academic achievements)
# --------------------------------------->

# Create a data frame with start and stop dates
events <- data.frame(
  ev = c(rep("Parental leave", 3), "MSc project", "Research assistant", "PhD period", "Postdoc 1", "Postdoc 2"),
  start = c("15.06.2017", "15.09.2019", "15.03.2024", "15.08.2013", "01.08.2014", "01.01.2016", "15.08.2020", "15.08.2023"),
  stop = c("15.09.2017", "15.08.2020", "15.06.2024", "30.06.2014", "31.12.2015", "27.02.2020", "14.08.2023", "31.12.2024")
) %>%
  mutate(across(c(start, stop), dmy)) %>%
  # To ease plot alignment we convert dates into decimal years
  mutate(
    start = decimal_date(start),
    stop = decimal_date(stop)
  )

# Create a dataframe for the legend
legend_data <- data.frame(
  ev = events$ev,
  start = NA,
  stop = NA
)

# Combine events and legend_data
legend_data <- rbind(events[, c("ev", "start", "stop")], legend_data)

# Create the plot
p_events <-
  ggplot() +
  geom_rect(data = events, aes(
    xmin = start, xmax = stop,
    ymin =
      ifelse(ev %in% c("Parental leave"), 1,
        ifelse(ev %in% c("MSc project", "PhD period", "Postdoc 1", "Postdoc 2", "Research assistant"), 2.1, 3)
      ),
    ymax = ifelse(ev %in% c("Parental leave"), 2,
      ifelse(ev %in% c("MSc project", "PhD period", "Postdoc 1", "Postdoc 2", "Research assistant"), 3, 4)
    ),
    fill = ev
  ), color = "white", size = 0.2) +
  scale_x_continuous(breaks = c(2014:max(step_plot_df$year)), limits = c(2013, max(step_plot_df$year) + 1), expand = c(0.05, 0.05)) +
  theme_simple() +
  theme(
    legend.position = "bottom",
    legend.title = element_blank(),
    legend.key.height = unit(3, "mm"),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank()
  ) +
  scale_fill_manual(
    values =
      c(
        "Parental leave" = "#ff85a5",
        "MSc project" = "#d4e080",
        "Research assistant" = "#9db98b",
        "PhD period" = "#ffd38b",
        "Postdoc 1" = "#66c4ec",
        "Postdoc 2" = "#7392ba"
      ),
    breaks = c("Parental leave", "MSc project", "Research assistant", "PhD period", "Postdoc 1", "Postdoc 2")
  ) +
  guides(fill = guide_legend(
    label.position = "bottom", nrow = 1
  ))

ggarrange(step_plot + 
              theme(plot.margin = unit(c(2, 2, 0, 2), "mm")), 
          p_events + 
              theme(plot.margin = unit(c(0, 2, 2, 2), "mm")), 
          nrow = 2, align = "v", heights = c(2, 1))
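The cumulative count behind the step plot can be checked by hand on a small example. The sketch below uses invented publication years and base R; it also fills in years without publications, which is optional for geom_step() but makes the table easier to read.

```r
# Invented publication years (one entry per article)
pub_years <- c(2014, 2016, 2017, 2019, 2019, 2020, 2020, 2021)

# Count articles per year over the full range, so empty years get a zero
all_years <- seq(min(pub_years), max(pub_years))
n_per_year <- as.integer(table(factor(pub_years, levels = all_years)))

cum_df <- data.frame(
  year = all_years,
  n = n_per_year,
  cum_freq = cumsum(n_per_year)
)
cum_df
```

The last value of cum_freq should equal the number of articles, which is a quick sanity check on the wrangling.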

Visualizing citation data

We can look at how individual articles were cited over time as well as for all articles combined. To achieve this, you can loop over each article’s citation history and plot this as a faceted plot.

Code for wrangling and producing the plot
# Collect citation histories
citation_hist <- tibble(year = numeric(), cites = numeric(), pubid = character())
articles <- pubs_clean %>% pull(pubid)

for (i in seq_along(articles)) {
  article <- articles[i]
  article_info <- get_article_cite_history(id = author_id, article = article)
  citation_hist <- bind_rows(citation_hist, article_info)
}

# Append publication titles
citation_hist <-
  citation_hist %>%
  left_join(pubs_clean %>% dplyr::select(title, yr_pub = year, pubid), by = "pubid")

# Produce the plot
citation_hist %>%
  ggplot(aes(x = year, y = cites)) +
  geom_col(width = 0.3, color = "#444444", position = position_nudge(x = 0.15)) +
  scale_x_continuous(breaks = seq(min(citation_hist$year), max(citation_hist$year), 1)) +
  facet_wrap(~ reorder(title, year), ncol = 2, labeller = label_wrap_gen(width = 120)) +
  # Indicate year of publication as dotted lines
  geom_segment(aes(x = yr_pub, xend = yr_pub, y = 0, yend = max(cites) - 0.5 * max(cites)), size = 0.2, linetype = "dotted", color = "#444444") +
  geom_point(aes(x = yr_pub, y = max(cites) - 0.5 * max(cites)), shape = 21, color = "#444444", stroke = 0.1, fill = "white", size = 1.5) +
  theme_simple()+
    theme(panel.grid.major.x = element_blank())+
  labs(
      title = "Citation history for publications with at least one citation",
      y = "Number of citations"
      )
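Each call to get_article_cite_history() sends a request to Google Scholar, so for longer publication lists it may be worth pausing between requests and skipping articles that fail rather than aborting the whole loop. The sketch below shows that pattern with a stand-in function, fake_cite_history(), so it runs offline; the function, the IDs and the pause length are assumptions for illustration.

```r
# Stand-in for scholar::get_article_cite_history(); returns invented data
# and fails for one ID to mimic a request that did not go through.
fake_cite_history <- function(article) {
  if (article == "bad_id") stop("request failed")
  data.frame(year = c(2022, 2023), cites = c(1, 2), pubid = article,
             stringsAsFactors = FALSE)
}

collect_histories <- function(articles, fetch, pause = 0) {
  results <- lapply(articles, function(article) {
    Sys.sleep(pause) # be gentle with the server between requests
    tryCatch(
      fetch(article),
      error = function(e) {
        message("Skipping ", article, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop failed requests (NULL) and stack the rest into one data frame
  do.call(rbind, Filter(Negate(is.null), results))
}

toy_hist <- collect_histories(c("id_1", "bad_id", "id_2"), fake_cite_history)
toy_hist
```

With the real package, fetch could be something like function(a) get_article_cite_history(author_id, a), with pause set to a second or two.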

Some interesting patterns arise. First, most of my articles are rarely cited. Second, for all but one article there is an expected lag of a year or two before articles start getting cited - publishing takes time! The exception is the article to the lower right (“A shared mucosal …”). This article contains data on gut microbiota, a rather flashy subject which in my experience gets cited a lot (this work was led by Mikal J. Hole, a talented and hard-working PhD student I co-supervise).

Finally, we can look at overall citations. Here I’ve tried to reproduce Google Scholar’s own figures. I’ve also included a function to calculate the H-index. The trend line ends at the last completed year so as not to give the illusion of a downward trend.

Code for wrangling and H-index
# Calculate total number of citations
cits_total <- sum(citation_hist$cites)

# Total citations per article
cits <- citation_hist %>%
  group_by(pubid) %>%
  summarise(cites = sum(cites)) %>%
  pull(cites)

# Calculate the H-index: the largest h such that h articles have at least h citations each
hindex <- function(x) {
  tx <- sort(x, decreasing = TRUE)
  sum(tx >= seq_along(tx))
}

hind <- hindex(cits)
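The H-index is the largest number h such that h articles have at least h citations each. Here is a quick check of that logic on invented citation counts; the function is redefined under a different name so the snippet runs on its own.

```r
# Same logic as the function above
hindex_check <- function(x) {
  tx <- sort(x, decreasing = TRUE)
  sum(tx >= seq_along(tx))
}

# Sorted counts: 10, 8, 5, 4, 3 -> the first four satisfy cites >= rank
hindex_check(c(5, 3, 10, 8, 4)) # 4

# One highly cited paper does not raise the index on its own
hindex_check(c(100, 1, 0)) # 1

# No publications gives an index of zero
hindex_check(numeric(0)) # 0
```

Counting the TRUE values works because the sorted citation counts decrease while the ranks increase, so the comparison can only switch from TRUE to FALSE once.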
Code for producing the plot
# Plot citations per year
cit_hist_df <- 
citation_hist %>%
    group_by(year) %>%
    summarise(cites = sum(cites))
cit_hist_df %>%
    ggplot(aes(x = year, y = cites))+
    geom_col(color = "#444444", width = 0.5)+
    theme_simple()+
    theme(panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          panel.grid.minor.y = element_blank())+
    scale_y_continuous(position = "right")+
    scale_x_continuous(breaks = seq(min(citation_hist$year), max(citation_hist$year), 1))+
    labs(title = paste0("Citations (total): ", cits_total, " | H-index: ", hind),
         y = "Number of citations\n")+
    geom_smooth(data = cit_hist_df %>% filter(year <= max(cit_hist_df$year) - 1), aes(x = year, y = cites), se = FALSE, color = "#ff85a5", size = 0.8)
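The trend line above drops the most recent year in the data, which works as long as that year is the current, still incomplete one. A slightly more explicit variant compares against the calendar year instead. This is a sketch on invented numbers; completed_years() is a hypothetical helper, not part of any package.

```r
# Keep only rows from years that have fully ended
completed_years <- function(df, current_year = as.integer(format(Sys.Date(), "%Y"))) {
  df[df$year < current_year, , drop = FALSE]
}

# Invented yearly citation counts
toy_cites <- data.frame(year = 2020:2024, cites = c(5, 12, 20, 31, 9))

# Pretend "today" is in 2024, so 2024 is excluded from the trend
trend_df <- completed_years(toy_cites, current_year = 2024)
trend_df
```

The filtered data frame could then be passed to the data argument of geom_smooth() in place of the max(year) - 1 filter.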