Sample Analysis of NBA Play-by-Play Data

Overview

This is a sample analysis to test the viability of using play-by-play NBA data from BigDataBall.com. I’ve been looking for a source of play-by-play data that would allow me to calculate things like the effective field goal percentage (eFG%) of a player’s teammates while that player is on vs. off the court. The play-by-play data available at places like Basketball-Reference.com and NBA.com doesn’t have the necessary detail (e.g., it doesn’t describe who is on the court for a given play), but BigDataBall’s appears to. The data comes at a cost, though, so I wanted to run a sample analysis with their sample file to see whether I liked the format and was willing to purchase full seasons of the data.

Given that broader objective, this analysis calculates the relative eFG% of teammates while each player is on vs. off the court for both teams in Game 4 of the 2016 NBA Finals.

Note: Code can be forked from my GitHub repository here. Also, sorry for the poor code formatting. WordPress likes to massacre magrittr pipe operators.

Setup

Load Packages

library(tidyverse)
library(magrittr)
library(rvest)
library(ggthemes)
library(pander)

Custom Functions

To help with the analysis, I’ll create a few functions.

The first function will filter the game’s data frame down to just the records for which a given player is on the court.

filter_on = function(df, nm = "name", h_o_a = "home") {
  if (h_o_a == "home") {
    df %>%
      filter(
        h1 == nm | h2 == nm | h3 == nm | h4 == nm | h5 == nm
      )
  } else if (h_o_a == "away") {
    df %>%
      filter(
        a1 == nm | a2 == nm | a3 == nm | a4 == nm | a5 == nm
      )
  } else {
    # Raise an error rather than returning a string that later code would treat as data
    stop("Select 'home' or 'away' for h_o_a.")
  }
}

The next function will do the same for when a given player is off the court.

filter_off = function(df, nm = "name", h_o_a = "home") {
  if (h_o_a == "home") {
    df %>%
      filter(
        h1 != nm, h2 != nm, h3 != nm, h4 != nm, h5 != nm
      )
  } else if (h_o_a == "away") {
    df %>%
      filter(
        a1 != nm, a2 != nm, a3 != nm, a4 != nm, a5 != nm
      )
  } else {
    # Raise an error rather than returning a string that later code would treat as data
    stop("Select 'home' or 'away' for h_o_a.")
  }
}
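
To see how the two filters split a game, here’s a minimal sketch in base R on a made-up three-play data set. The on_court() helper and the single-letter player names are hypothetical stand-ins (they aren’t part of the BigDataBall file or the functions above). The point is just that the “on” and “off” conditions are complements, so every play with a complete lineup lands in exactly one of the two sets.

```r
# Hypothetical helper (not part of the original post): flags the rows in which
# `nm` appears in any of the five lineup columns for the chosen side.
on_court = function(df, nm, h_o_a = "home") {
  prefix = switch(h_o_a, home = "h", away = "a",
                  stop("Select 'home' or 'away' for h_o_a."))
  lineup_cols = paste0(prefix, 1:5)
  rowSums(df[, lineup_cols] == nm) > 0
}

# Made-up three-play data set with placeholder player names
toy = data.frame(
  h1 = c("A", "A", "F"), h2 = c("B", "B", "B"), h3 = c("C", "C", "C"),
  h4 = c("D", "D", "D"), h5 = c("E", "E", "E"),
  stringsAsFactors = FALSE
)

is_on = on_court(toy, "A", "home")
toy_on = toy[is_on, ]    # the rows filter_on() is meant to keep
toy_off = toy[!is_on, ]  # the rows filter_off() is meant to keep
print(c(on = nrow(toy_on), off = nrow(toy_off)))
```

Player "A" is in the lineup for the first two plays and is replaced by "F" in the third, so the sketch puts two plays in the “on” set and one in the “off” set.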

Since I’m interested in the shooting percentage of just the teammates while a player is on vs. off the court (in other words, I’m not interested in the player’s shooting percentage but rather just the rest of his teammates’ shooting percentages), I’ll need to filter out some records. This function will filter to just the shots attempted or made by the player’s team that were not attempted/made by the player himself.

# filters to just shots by the same team as the player by someone other than the player
filter_eFG = function(df, nm = "name", tm = "teamAbbrev")  {
  df %>%
    filter(
      team == tm,
      event_type %in% c("shot", "miss"), # want shots w/o free throws
      player != nm # want shots from teammates
    )
}
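
As a rough illustration of what this filter keeps, here’s the same three-part condition written in base R against a few made-up plays. The player names are placeholders, and the "free throw" event label is an assumption about how the file codes free throws; only "shot" and "miss" are confirmed by the function above.

```r
# Made-up plays: team codes mimic the sample file, player names are placeholders,
# and the "free throw" label is an assumed event type
toy = data.frame(
  team       = c("CLE", "CLE", "CLE", "GSW", "CLE"),
  event_type = c("shot", "miss", "free throw", "shot", "shot"),
  player     = c("Player A", "Player B", "Player B", "Player X", "Player C"),
  stringsAsFactors = FALSE
)

nm = "Player A"  # the player whose teammates we care about
tm = "CLE"       # that player's team

# Same three conditions as filter_eFG(): right team, field goal attempt, not the player himself
keep = toy$team == tm &
  toy$event_type %in% c("shot", "miss") &
  toy$player != nm

teammate_shots = toy[keep, ]
print(teammate_shots)
```

Only the miss by Player B and the make by Player C survive. Unlike dplyr’s filter(), base R indexing doesn’t drop rows where the condition is NA, so this sketch assumes there are no missing values in the three columns.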

I’ll also create a function to calculate the eFG%: eFG% = (FGM + 0.5 * 3PM) / FGA, where FGM is field goals made, 3PM is three-pointers made, and FGA is field goals attempted.

eFG = function(df, made_col = "event_type", made_text = "shot", points_col = "points") {
  (sum(df[made_col] == made_text) + 0.5 * sum(df[points_col] == 3)) / nrow(df)
}
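
As a quick sanity check on the formula, here’s a small worked example with made-up numbers: ten field goal attempts, four of them made, two of which were three-pointers. The function is copied from above so the snippet runs on its own; the toy_shots data frame is hypothetical.

```r
# Copy of the eFG() helper defined above, so the sketch runs on its own
eFG = function(df, made_col = "event_type", made_text = "shot", points_col = "points") {
  (sum(df[made_col] == made_text) + 0.5 * sum(df[points_col] == 3)) / nrow(df)
}

# Made-up shot log: 4 makes (two 2-pointers, two 3-pointers) and 6 misses
toy_shots = data.frame(
  event_type = c("shot", "shot", "shot", "shot", "miss", "miss", "miss", "miss", "miss", "miss"),
  points     = c(2, 2, 3, 3, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# (FGM + 0.5 * 3PM) / FGA = (4 + 0.5 * 2) / 10
toy_eFG = eFG(toy_shots)
print(toy_eFG)
```

The plain field goal percentage here would be 4 / 10 = 0.4; the two made threes lift the eFG% to 0.5.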

Finally, I’ll write a function to return the relative eFG% and a few more relevant pieces of information. I’m particularly interested in returning the number of shots attempted by teammates while a player is on/off the court. If there’s a large discrepancy, which indicates that a player is either on the court or off the court for the vast majority of the game, the relative eFG% may be unreliable due to random fluctuations. I’ll want to account for that in my analysis below.

eFGRatio = function(df, nm = "name", tm = "teamAbbrev", h_o_a = "home") {
  df_on = try(df %>% filter_on(nm, h_o_a) %>% filter_eFG(nm, tm))
  df_off = try(df %>% filter_off(nm, h_o_a) %>% filter_eFG(nm, tm))

  if (nrow(df_on) == 0) {
    # no plays that meet criteria -> return NA
    eFG_on = NA
    shots_on = NA
  } else {
    eFG_on = eFG(df_on)
    shots_on = nrow(df_on)
  }

  if(nrow(df_off) == 0) {
    # no plays that meet criteria -> return NA
    eFG_off = NA
    shots_off = NA
  } else {
    eFG_off = eFG(df_off)
    shots_off = nrow(df_off)
  }

  eFG_Ratio = eFG_on / eFG_off

  output = list(
    eFG_Ratio = eFG_Ratio, # ratio of eFG of teammates while on vs. while off court
    eFG_on = eFG_on, # eFG of teammates while player is on court
    eFG_off = eFG_off, # eFG of teammates while player is off court
    shots_on = shots_on, # number of shots by teammates while player is on court
    shots_off = shots_off # number of shots by teammates while player is off court
  )

  return(output)
}
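
One consequence of the NA handling above is worth spelling out with a couple of made-up values. If teammates attempted no shots in one of the two states, the ratio comes out as NA, and if they attempted shots while the player was off the court but made none of them, the denominator is zero and the ratio is Inf.

```r
# Made-up value for illustration; not taken from the game data
eFG_on = 0.55

# No teammate shots while the player was off the court: eFG_off is NA
ratio_no_off_shots = eFG_on / NA
# Teammates shot but made nothing while the player was off the court: eFG_off is 0
ratio_zero_off_makes = eFG_on / 0

print(c(ratio_no_off_shots, ratio_zero_off_makes))
```

Either case is a sign that the sample for that player is too small to say much.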

Inputs

location_of_data = "/Users/kylewurtz/Dropbox/R/NBA Play-By-Play/Sample Analysis/Data/Sample BigDataBall" # Update!
data_file_name = "Sample_BigDataBall.csv"

Read in Sample Data Set

BigDataBall offers a free sample data set of its play-by-play data, and that’s what I’ll be working with in this file. For convenience, I’ve downloaded a copy of the data set and stored it in the repository.

df = read_csv(file.path(location_of_data, data_file_name))
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   period = col_integer(),
##   away_score = col_integer(),
##   home_score = col_integer(),
##   remaining_time = col_time(format = ""),
##   elapsed = col_time(format = ""),
##   play_id = col_integer(),
##   num = col_integer(),
##   outof = col_integer(),
##   points = col_integer(),
##   shot_distance = col_integer(),
##   original_x = col_integer(),
##   original_y = col_integer(),
##   converted_x = col_double(),
##   converted_y = col_double()
## )
## See spec(...) for full column specifications.

Work

Initial Investigation

Now that I have the data read in, I’ll take a quick look at the structure of the data set before moving on with the analysis.

glimpse(df)
## Observations: 467
## Variables: 44
## $ game_id        <chr> "0041500404", "0041500404", "0041500404", "0041...
## $ data_set       <chr> "2016 Playoff", "2016 Playoff", "2016 Playoff",...
## $ date           <chr> "6/10/16", "6/10/16", "6/10/16", "6/10/16", "6/...
## $ a1             <chr> "Harrison Barnes", "Harrison Barnes", "Harrison...
## $ a2             <chr> "Draymond Green", "Draymond Green", "Draymond G...
## $ a3             <chr> "Andrew Bogut", "Andrew Bogut", "Andrew Bogut",...
## $ a4             <chr> "Klay Thompson", "Klay Thompson", "Klay Thompso...
## $ a5             <chr> "Stephen Curry", "Stephen Curry", "Stephen Curr...
## $ h1             <chr> "Richard Jefferson", "Richard Jefferson", "Rich...
## $ h2             <chr> "LeBron James", "LeBron James", "LeBron James",...
## $ h3             <chr> "Tristan Thompson", "Tristan Thompson", "Trista...
## $ h4             <chr> "J.R. Smith", "J.R. Smith", "J.R. Smith", "J.R....
## $ h5             <chr> "Kyrie Irving", "Kyrie Irving", "Kyrie Irving",...
## $ period         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ away_score     <int> 0, 0, 0, 0, 3, 3, 3, 4, 5, 5, 7, 7, 7, 7, 10, 1...
## $ home_score     <int> 0, 0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6, 8, 8, 10...
## $ remaining_time <time> 720 secs, 720 secs, 701 secs, 700 secs, 689 se...
## $ elapsed        <time> 0 secs, 0 secs, 19 secs, 20 secs, 31 secs, 46 ...
## $ play_length    <chr> "0:00:00", "0:00:00", "0:00:19", "0:00:01", "0:...
## $ play_id        <int> 0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 14, 15, ...
## $ team           <chr> NA, "GSW", "GSW", "GSW", "GSW", "CLE", "CLE", "...
## $ event_type     <chr> "start of period", "jump ball", "miss", "reboun...
## $ assist         <chr> NA, NA, NA, NA, "Stephen Curry", "LeBron James"...
## $ away           <chr> NA, "Andrew Bogut", NA, NA, NA, NA, NA, NA, NA,...
## $ home           <chr> NA, "Tristan Thompson", NA, NA, NA, NA, NA, NA,...
## $ block          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Dr...
## $ entered        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ left           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ num            <int> NA, NA, NA, NA, NA, NA, NA, 1, 2, NA, NA, NA, N...
## $ opponent       <chr> NA, NA, NA, NA, NA, NA, "Draymond Green", NA, N...
## $ outof          <int> NA, NA, NA, NA, NA, NA, NA, 2, 2, NA, NA, NA, N...
## $ player         <chr> NA, "Tristan Thompson", "Draymond Green", "Harr...
## $ points         <int> NA, NA, 0, NA, 3, 3, NA, 1, 1, 3, 2, 0, NA, 2, ...
## $ possession     <chr> NA, "Stephen Curry", NA, NA, NA, NA, NA, NA, NA...
## $ reason         <chr> NA, NA, NA, NA, NA, NA, "s.foul", NA, NA, NA, N...
## $ result         <chr> NA, NA, "missed", NA, "made", "made", NA, "made...
## $ steal          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ type           <chr> "start of period", "jump ball", "Jump Shot", "r...
## $ shot_distance  <int> NA, NA, 25, NA, 23, 24, NA, NA, NA, 23, 14, 14,...
## $ original_x     <int> NA, NA, -4, NA, 225, -240, NA, NA, NA, 232, -66...
## $ original_y     <int> NA, NA, 247, NA, -6, -19, NA, NA, NA, -11, 129,...
## $ converted_x    <dbl> NA, NA, 25.4, NA, 2.5, 1.0, NA, NA, NA, 48.2, 3...
## $ converted_y    <dbl> NA, NA, 29.7, NA, 4.4, 90.9, NA, NA, NA, 90.1, ...
## $ description    <chr> NA, "Jump Ball Thompson vs. Bogut: Tip to Curry...
summary(df)
##    game_id            data_set             date          
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       a1                 a2                 a3           
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       a4                 a5                 h1           
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       h2                 h3                 h4           
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       h5                period        away_score      home_score   
##  Length:467         Min.   :1.000   Min.   :  0.0   Min.   : 0.00  
##  Class :character   1st Qu.:2.000   1st Qu.: 32.0   1st Qu.:30.00  
##  Mode  :character   Median :3.000   Median : 50.0   Median :55.00  
##                     Mean   :2.548   Mean   : 54.7   Mean   :53.05  
##                     3rd Qu.:4.000   3rd Qu.: 81.0   3rd Qu.:81.00  
##                     Max.   :4.000   Max.   :108.0   Max.   :97.00  
##                                                                    
##  remaining_time      elapsed         play_length           play_id     
##  Length:467        Length:467        Length:467         Min.   :  0.0  
##  Class1:hms        Class1:hms        Class :character   1st Qu.:140.5  
##  Class2:difftime   Class2:difftime   Mode  :character   Median :280.0  
##  Mode  :numeric    Mode  :numeric                       Mean   :282.0  
##                                                         3rd Qu.:426.5  
##                                                         Max.   :583.0  
##                                                                        
##      team            event_type           assist         
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      away               home              block          
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    entered              left                num          opponent        
##  Length:467         Length:467         Min.   :1.000   Length:467        
##  Class :character   Class :character   1st Qu.:1.000   Class :character  
##  Mode  :character   Mode  :character   Median :1.000   Mode  :character  
##                                        Mean   :1.456                     
##                                        3rd Qu.:2.000                     
##                                        Max.   :2.000                     
##                                        NA's   :410                       
##      outof          player              points        possession       
##  Min.   :1.000   Length:467         Min.   :0.0000   Length:467        
##  1st Qu.:2.000   Class :character   1st Qu.:0.0000   Class :character  
##  Median :2.000   Mode  :character   Median :1.0000   Mode  :character  
##  Mean   :1.912                      Mean   :0.9361                     
##  3rd Qu.:2.000                      3rd Qu.:2.0000                     
##  Max.   :2.000                      Max.   :3.0000                     
##  NA's   :410                        NA's   :248                        
##     reason             result             steal          
##  Length:467         Length:467         Length:467        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      type           shot_distance     original_x        original_y    
##  Length:467         Min.   : 0.00   Min.   :-243.00   Min.   :-19.00  
##  Class :character   1st Qu.: 2.00   1st Qu.: -84.00   1st Qu.:  1.25  
##  Mode  :character   Median :14.00   Median :  -5.00   Median : 32.00  
##                     Mean   :13.44   Mean   : -16.80   Mean   : 81.24  
##                     3rd Qu.:24.00   3rd Qu.:  26.25   3rd Qu.:154.25  
##                     Max.   :31.00   Max.   : 238.00   Max.   :257.00  
##                     NA's   :305     NA's   :305       NA's   :305     
##   converted_x     converted_y    description       
##  Min.   : 0.70   Min.   : 3.40   Length:467        
##  1st Qu.:20.50   1st Qu.:12.88   Class :character  
##  Median :25.00   Median :47.25   Mode  :character  
##  Mean   :25.04   Mean   :48.77                     
##  3rd Qu.:32.08   3rd Qu.:87.40                     
##  Max.   :49.30   Max.   :90.90                     
##  NA's   :305     NA's   :305

The data set is rich with detail and has all the information needed to calculate effective field goal percentages of teammates while a player is on vs. off the court. I could spend hours playing with all the data in this file (x and y coordinates!), but for the purposes of this sample analysis I’ll limit the scope to eFG%.

Effective Field Goal Percentage When CLE Players are On Vs. Off the Court

First, I’ll take a look at the ratio of teammates’ eFG% while each Cleveland player is on vs. off the court. The goal here is to get a rough understanding of whether a player’s presence on the court makes his teammates more effective shooters. Certain players (e.g., LeBron) tend to draw a lot of attention from the other team’s defenders, and that attention may free up quality scoring opportunities for their teammates. This sort of analysis can supplement more traditional analyses that focus on an individual player’s efficiency.

I’ll start by creating an empty tibble that will house the results for Cleveland players.

cle_players = df %>%
  select(h1:h5) %>%
  gather(pos, player) %>%
  select(player) %>%
  unique() %>%
  arrange(player) %>%
  .[[1]]

cle_eFGs = tibble(
  player = cle_players,
  eFG_Ratio = NA, eFG_on = NA,  eFG_off = NA,
  shots_on = NA, shots_off = NA
)

Next, I’ll loop through each player in the cle_eFGs tibble (one row per player) and calculate the relevant metrics using the functions I created earlier in the file. I’ll also plot the ratio of the eFG% of teammates while a player is on vs. off the court for each of the players.

for (ply in cle_eFGs$player) {
  eFGRatio_output = eFGRatio(df, ply, "CLE", "home")
  cle_eFGs$eFG_Ratio[cle_eFGs$player == ply] = eFGRatio_output$eFG_Ratio
  cle_eFGs$eFG_on[cle_eFGs$player == ply] = eFGRatio_output$eFG_on
  cle_eFGs$eFG_off[cle_eFGs$player == ply] = eFGRatio_output$eFG_off
  cle_eFGs$shots_on[cle_eFGs$player == ply] = eFGRatio_output$shots_on
  cle_eFGs$shots_off[cle_eFGs$player == ply] = eFGRatio_output$shots_off
}
cle_eFGs %>%
  arrange(desc(eFG_Ratio)) %>%
  mutate(player = factor(player, levels = .[["player"]]),
         ratio_cred = sqrt(pmin(shots_on, shots_off) / pmax(shots_on, shots_off))) %>%
  ggplot(., aes(x = player, y = eFG_Ratio, fill = ratio_cred)) +
  geom_bar(stat = "identity") +
  theme_fivethirtyeight() +
  scale_fill_continuous("Credibility", limits = c(0, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("CLE: Ratio of eFG% of Teammates While On Court vs. Off Court")
## Warning: Removed 1 rows containing missing values (position_stack).

[Figure: CLE: Ratio of eFG% of Teammates While On Court vs. Off Court]

For the most part, the plot is fairly straightforward. The players are sorted from the best ratio to the worst, and the ratios are on the y-axis. Unsurprisingly, LeBron’s ratio is pretty solid at just over 2.0, which means his teammates’ eFG% is roughly twice as high when he’s on the court as when he’s off it. Interestingly, J.R. Smith’s and Kyrie Irving’s ratios are spectacular. These players are generally considered somewhat ineffective, but perhaps their recklessness draws attention and opens up some opportunities for their teammates. Of course, we’re just looking at one game, so this could easily be an anomaly.

Furthermore, this ratio shouldn’t be believed equally for each of the players. The fewer plays that are included in either the “on court” or the “off court” value, the more likely the ratio is to be thrown off by chance. For example, LeBron played pretty much the entire game, so it may simply be chance that CLE players didn’t shoot well during the couple of plays when he was off the court. Conversely, Kevin Love spent about half the time on the court and half the time off it, so there’s a higher likelihood that Love’s ratio reflects a real phenomenon than LeBron’s does.

To try to put a numeric value on the believability of each ratio, I took a page from the actuarial world and used the simple square root rule credibility measure. The standard for full credibility (100% believable) is a player whose teammates attempted as many shots while he was on the court as while he was off it. The more lopsided the on-court vs. off-court shot counts, the lower the credibility. This credibility amount is mapped onto the color scale, with lighter blue colors being more credible.
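
To make the credibility measure concrete, here’s a minimal sketch that wraps the expression from the plotting code in a small function and applies it to a few made-up shot counts. The ratio_cred() wrapper and the numbers are hypothetical, not values from the game.

```r
# Hypothetical wrapper around the expression used in the plotting code above
ratio_cred = function(shots_on, shots_off) {
  sqrt(pmin(shots_on, shots_off) / pmax(shots_on, shots_off))
}

# Made-up (on, off) shot counts: perfectly balanced, very lopsided, somewhat lopsided
cred = ratio_cred(shots_on = c(30, 54, 40), shots_off = c(30, 6, 10))
print(round(cred, 3))
```

A 30/30 split gets full credibility of 1, a 54/6 split drops to about 0.33, and a 40/10 split lands at 0.5, so the measure only reaches 1 when the two samples are the same size.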

Effective Field Goal Percentage When GSW Players are On Vs. Off the Court

I’ll also perform the same exercise for the Golden State players.

gsw_players = df %>%
  select(a1:a5) %>%
  gather(pos, player) %>%
  select(player) %>%
  unique() %>%
  arrange(player) %>%
  .[[1]]

gsw_eFGs = tibble(
  player = gsw_players,
  eFG_Ratio = NA, eFG_on = NA,  eFG_off = NA,
  shots_on = NA, shots_off = NA
)
for (ply in gsw_eFGs$player) {
  eFGRatio_output = eFGRatio(df, ply, "GSW", "away")
  gsw_eFGs$eFG_Ratio[gsw_eFGs$player == ply] = eFGRatio_output$eFG_Ratio
  gsw_eFGs$eFG_on[gsw_eFGs$player == ply] = eFGRatio_output$eFG_on
  gsw_eFGs$eFG_off[gsw_eFGs$player == ply] = eFGRatio_output$eFG_off
  gsw_eFGs$shots_on[gsw_eFGs$player == ply] = eFGRatio_output$shots_on
  gsw_eFGs$shots_off[gsw_eFGs$player == ply] = eFGRatio_output$shots_off
}
gsw_eFGs %>%
  arrange(desc(eFG_Ratio)) %>%
  mutate(player = factor(player, levels = .[["player"]]),
         ratio_cred = sqrt(pmin(shots_on, shots_off) / pmax(shots_on, shots_off))) %>%
  ggplot(., aes(x = player, y = eFG_Ratio, fill = ratio_cred)) +
  geom_bar(stat = "identity") +
  theme_fivethirtyeight() +
  scale_fill_continuous("Credibility", limits = c(0, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("GSW: Ratio of eFG% of Teammates While On Court vs. Off Court")

[Figure: GSW: Ratio of eFG% of Teammates While On Court vs. Off Court]

Interestingly, the results don’t look too great for some of the stars on the Warriors. The best values come from either players who didn’t get much playing time (McAdoo and Bogut) or Harrison Barnes, who lives in the shadows of Curry, Thompson, and Green. The results for McAdoo and Bogut could simply be chance error due to the small sample size (credibility isn’t very good for any of the GSW players due to their skewed rotation) or the fact that they were probably playing during times when Cleveland also had their reserves in the game. The results for Barnes are more interesting, though. As a starter, he probably played the majority of his time with other starters. Yet his ratio is much better than the rest of the starters. That could be worth investigating.

For now, though, I’ll wrap up this little analysis. I think there’s a sufficient level of granularity in this data to allow me to waste hours and hours nerding out, so I’ll be purchasing the BigDataBall subscription for the upcoming season and the data for historical seasons. I’ll be blogging more about random play-by-play analyses over the course of the season, so stay tuned!
