The Question

When you work on data science problems, you always start with a question you’re trying to answer. This question will affect the data you pick, your exploration process, and how you interpret the results.

The question for this article is: How much (in percentage) should you tip your taxi driver?

To answer the question, we’ll use a portion of the NYC Taxi dataset. The data file we’ll be using is taxi-01-2020-sample.csv.bz2.

Note: CSV is a horrible, horrible format. There’s no standard, no schema and everything is unmarshalled to text (unlike JSON where you have types). If you can, pick another format. My preferred data storage format is SQLite.

Exploration Code

We’re searching for the answer to our question, so we’ll focus on productivity. If at some point this code gets to production, go ahead and refactor it.

To simplify working with the input, we’ll pass the input file via standard input. We’ll explore the data in several stages, and each stage will have a corresponding command line switch. In the main function we have the following line:

r := bzip2.NewReader(os.Stdin)

and then we’ll call each exploration step with r.
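The article doesn’t show the rest of main, so here is a minimal sketch of how the switches might be wired up. The helper names pickStep, run and countBytes are made up for this illustration, only two of the switches are registered, and countBytes stands in for a real step such as firstLook.

```go
package main

import (
	"compress/bzip2"
	"flag"
	"fmt"
	"io"
	"os"
)

// pickStep parses the command line and returns the selected step name,
// or "" when no switch was given. (Hypothetical helper.)
func pickStep(args []string) (string, error) {
	fs := flag.NewFlagSet("taxi", flag.ContinueOnError)
	fs.SetOutput(os.Stderr) // flag errors and usage go to stderr
	firstLook := fs.Bool("first_look", false, "print the first lines and the size")
	stats := fs.Bool("stats", false, "print basic statistics")
	if err := fs.Parse(args); err != nil {
		return "", err
	}

	switch {
	case *firstLook:
		return "first_look", nil
	case *stats:
		return "stats", nil
	}
	return "", nil
}

// countBytes is a placeholder step; a real step such as firstLook has the
// same signature.
func countBytes(r io.Reader) error {
	n, err := io.Copy(io.Discard, r)
	if err != nil {
		return err
	}
	fmt.Printf("size: %.2fMB\n", float64(n)/1_000_000)
	return nil
}

// run decompresses in and hands the stream to the step called name.
func run(name string, in io.Reader) error {
	r := bzip2.NewReader(in) // same call as in the article
	switch name {
	case "first_look", "stats":
		// The article's program would call firstLook(r) or statistics(r).
		return countBytes(r)
	}
	return fmt.Errorf("unknown step %q", name)
}

func main() {
	name, err := pickStep(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if name == "" {
		fmt.Fprintln(os.Stderr, "no step selected, try -first_look or -stats")
		return
	}
	if err := run(name, os.Stdin); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

Keeping every step behind the same func(io.Reader) error signature means adding a stage is one more flag and one more case.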

First Contact

Before you start working with data, it’s important to have a quick look at it to see that it matches your expectations. You should also check that the data fits in memory.

Listing 1: First Look

19 func firstLook(r io.Reader) error {
20     var numLines, numBytes int
21     s := bufio.NewScanner(r)
22     for s.Scan() {
23         if numLines < 5 {
24             fmt.Println(s.Text())
25         }
26         numBytes += len(s.Text())
27         numLines++
28     }
29 
30     if err := s.Err(); err != nil {
31         return err
32     }
33 
34     fmt.Printf("size: %.2fMB\n", float64(numBytes)/1_000_000)
35     fmt.Printf("lines: %d\n", numLines)
36     return nil
37 }

Listing 1 shows the first look at the data. On line 21, we create a bufio.Scanner to scan line by line. On lines 23-25, we print the first 5 lines from the file. On line 34, we print the file size and on line 35, we print the number of lines.
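One thing to keep in mind with bufio.Scanner: by default it refuses lines longer than bufio.MaxScanTokenSize (64KB) and stops with an error. The taxi rows are short, so this doesn’t bite here, but for other files you may need to raise the limit. Below is a small sketch; countLines and countLinesString are names invented for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countLines counts lines and bytes (newlines excluded, as in firstLook).
// maxLine is the longest line, in bytes, that the scanner will accept.
func countLines(r io.Reader, maxLine int) (lines, size int, err error) {
	s := bufio.NewScanner(r)
	s.Buffer(make([]byte, 0, 64*1024), maxLine)
	for s.Scan() {
		size += len(s.Bytes())
		lines++
	}
	if err := s.Err(); err != nil {
		return lines, size, fmt.Errorf("scanning input: %w", err)
	}
	return lines, size, nil
}

// countLinesString is a convenience wrapper for in-memory data.
func countLinesString(data string, maxLine int) (int, int, error) {
	return countLines(strings.NewReader(data), maxLine)
}

func main() {
	long := strings.Repeat("x", 100_000) // one line above the 64KB default
	data := "a,b\n" + long + "\n"

	_, _, err := countLinesString(data, bufio.MaxScanTokenSize)
	fmt.Println("default limit:", err)

	lines, size, err := countLinesString(data, 1024*1024)
	fmt.Println("raised limit:", lines, size, err)
}
```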

Listing 2: Running First Look

$ go run taxi.go -first_look < taxi-01-2020-sample.csv.bz2
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
2,2003-01-01 00:07:17,2003-01-01 14:16:59,1.0,0.0,1.0,N,193,193,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2008-12-31 23:35:00,2008-12-31 23:36:53,1.0,0.42,1.0,N,263,263,2.0,3.5,0.5,0.5,0.0,0.0,0.3,7.3,2.5
2,2009-01-01 00:06:19,2009-01-01 00:10:22,1.0,0.85,1.0,N,107,137,2.0,5.0,0.0,0.5,0.0,0.0,0.3,8.3,2.5
2,2009-01-01 00:48:28,2009-01-01 00:57:48,1.0,0.93,1.0,N,100,186,2.0,7.5,0.0,0.5,0.0,0.0,0.3,10.8,2.5
size: 101.68MB
lines: 1000001

Listing 2 shows how to run the first step. We run the program with go run. In the output we see the first 5 lines and the uncompressed file size and the number of lines.

The file is a CSV file and small enough to fit in memory. To calculate the tip percentage we need only two columns: tip_amount and total_amount. If you’re curious about the data schema, see here.

Loading the Data

Once the initial look aligns with your assumptions, you can load the data. We’re going to use github.com/jszwec/csvutil to parse the CSV and gonum to calculate some statistics.

Listing 3: Imports

14     "github.com/jszwec/csvutil"
15     "gonum.org/v1/gonum/floats"
16     "gonum.org/v1/gonum/stat"

Listing 3 shows the external imports in our code. On line 14, we import csvutil and on lines 15-16, we import floats and stat from gonum.

Listing 4: Load Data

62 type Row struct {
63     Tip   float64 `csv:"tip_amount"`
64     Total float64 `csv:"total_amount"`
65 }
66 
67 func loadData(r io.Reader) ([]float64, []float64, error) {
68     var tip, total []float64
69     dec, err := csvutil.NewDecoder(csv.NewReader(r))
70     if err != nil {
71         return nil, nil, err
72     }
73 
74     for {
75         var row Row
76         err := dec.Decode(&row)
77 
78         if err == io.EOF {
79             break
80         }
81 
82         if err != nil {
83             return nil, nil, err
84         }
85 
86         tip = append(tip, row.Tip)
87         total = append(total, row.Total)
88     }
89 
90     return tip, total, nil
91 }

Listing 4 shows how we load the data. On lines 62-65, we define a Row struct with the fields that interest us. On line 68, we define the tip and total slices to hold the values from the tip_amount and total_amount fields in the CSV. On lines 74-88, we run a for loop to load the data, and finally on line 90, we return the data.
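If you’d rather not pull in a dependency, the same idea can be sketched with encoding/csv from the standard library. This is not what the article’s code does; loadColumns and loadColumnsString are hypothetical helpers that look columns up by header name and parse them as float64.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// loadColumns reads the named columns from a CSV stream as float64 slices.
func loadColumns(r io.Reader, names ...string) (map[string][]float64, error) {
	rd := csv.NewReader(r)
	header, err := rd.Read()
	if err != nil {
		return nil, err
	}

	// Map each header name to its position in a record.
	idx := make(map[string]int)
	for i, h := range header {
		idx[h] = i
	}
	for _, n := range names {
		if _, ok := idx[n]; !ok {
			return nil, fmt.Errorf("column %q not found", n)
		}
	}

	cols := make(map[string][]float64)
	for {
		rec, err := rd.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		for _, n := range names {
			v, err := strconv.ParseFloat(rec[idx[n]], 64)
			if err != nil {
				return nil, err
			}
			cols[n] = append(cols[n], v)
		}
	}
	return cols, nil
}

// loadColumnsString is a convenience wrapper for in-memory data.
func loadColumnsString(data string, names ...string) (map[string][]float64, error) {
	return loadColumns(strings.NewReader(data), names...)
}

func main() {
	// Two made-up rows with a subset of the taxi columns.
	data := "VendorID,tip_amount,total_amount\n2,0.0,7.3\n2,2.0,12.0\n"
	cols, err := loadColumnsString(data, "tip_amount", "total_amount")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cols["tip_amount"], cols["total_amount"])
}
```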

Listing 5: Statistics

39 func statistics(r io.Reader) error {
40     tip, total, err := loadData(r)
41     if err != nil {
42         return err
43     }
44 
45     fmt.Printf(
46         "tip: min=%.2f, mean=%.2f, max=%.2f\n",
47         floats.Min(tip),
48         stat.Mean(tip, nil),
49         floats.Max(tip),
50     )
51 
52     fmt.Printf(
53         "total: min=%.2f, mean=%.2f, max=%.2f\n",
54         floats.Min(total),
55         stat.Mean(total, nil),
56         floats.Max(total),
57     )
58 
59     return nil
60 }

Listing 5 shows the statistics step of our data exploration. On line 40, we load the data. On lines 45-50, we print the minimum, mean (average) and maximum values for the tip. On lines 52-57, we do the same for the total.
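To make it clear what these three numbers are, here is a standard-library sketch that computes them in one pass. The summary function is invented for this example and is not part of gonum; the input values are made up.

```go
package main

import "fmt"

// summary returns the minimum, mean and maximum of vals.
func summary(vals []float64) (lo, mean, hi float64, err error) {
	if len(vals) == 0 {
		return 0, 0, 0, fmt.Errorf("summary: no values")
	}
	lo, hi = vals[0], vals[0]
	sum := 0.0
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
		sum += v
	}
	return lo, sum / float64(len(vals)), hi, nil
}

func main() {
	// Made-up tip amounts, including one suspicious negative value.
	tips := []float64{0, 1.5, 2.0, -3.0, 4.5}
	lo, mean, hi, err := summary(tips)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("tip: min=%.2f, mean=%.2f, max=%.2f\n", lo, mean, hi)
}
```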

Listing 6: Running Statistics

$ go run taxi.go -stats < taxi-01-2020-sample.csv.bz2 
tip: min=-11.80, mean=2.21, max=333.50
total: min=-333.30, mean=18.47, max=4268.30

Listing 6 shows how to run the statistics step. We can see there are some bad values. Both minimum values are negative and the maximal value for the total amount is more than $4,000.

In any real life dataset you will have bad values, and you need to decide what to do with them. We’re going to take the easy approach and ignore them. We will filter out rows where the tip or the total is zero or negative. Also, since we’re not planning on taxi rides that cost more than $100, we’ll filter out rows with total_amount bigger than 100.
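Before committing to a filter, it can help to see how many rows each rule would drop. The sketch below does that on made-up data; rule and tally are hypothetical helpers, and the conditions follow the check used in loadDataFiltered.

```go
package main

import "fmt"

const maxTotal = 100.0 // we're not planning on rides above $100

// rule names the filter rule that drops a row, or "kept" if the row survives.
// The order mirrors the condition in loadDataFiltered.
func rule(tip, total float64) string {
	switch {
	case total <= 0:
		return "total <= 0"
	case tip <= 0:
		return "tip <= 0"
	case total > maxTotal:
		return "total > 100"
	}
	return "kept"
}

// tally counts how many rows fall under each rule.
func tally(tips, totals []float64) (map[string]int, error) {
	if len(tips) != len(totals) {
		return nil, fmt.Errorf("length mismatch: %d tips, %d totals", len(tips), len(totals))
	}
	counts := make(map[string]int)
	for i := range tips {
		counts[rule(tips[i], totals[i])]++
	}
	return counts, nil
}

func main() {
	// Made-up rows: (tip, total).
	tips := []float64{0, 2, -1, 3}
	totals := []float64{7.3, 12, -5, 150}

	counts, err := tally(tips, totals)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range []string{"kept", "total <= 0", "tip <= 0", "total > 100"} {
		fmt.Printf("%-12s: %d\n", name, counts[name])
	}
}
```

If one rule removes most of the data, that’s a hint to revisit the rule before trusting the result.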

Tip

Listing 7: Load Filtered Data

114 func loadDataFiltered(r io.Reader) ([]float64, []float64, error) {
115     var tip, total []float64
116     dec, err := csvutil.NewDecoder(csv.NewReader(r))
117     if err != nil {
118         return nil, nil, err
119     }
120 
121     for {
122         var row Row
123 
124         err := dec.Decode(&row)
125         if err == io.EOF {
126             break
127         }
128 
129         if err != nil {
130             return nil, nil, err
131         }
132 
133         if row.Total <= 0 || row.Tip <= 0 || row.Total > 100 {
134             continue
135         }
136 
137         tip = append(tip, row.Tip)
138         total = append(total, row.Total)
139     }
140 
141     return tip, total, nil
142 }

Listing 7 shows loading of filtered data. The only difference from loadData is lines 133-135, where we filter out rows.

Now we can calculate the tip we want to pay. We’d like to be on the generous side so we’ll use the 75% quantile value. The 75% quantile (or 75th percentile) is the value that 75% of the data falls below.
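To make the definition concrete, here is a standard-library sketch of a quantile using the “nearest rank” rule: sort the values and take the smallest one that has at least a fraction q of the data at or below it. The quantile function is written for this illustration. I’d expect gonum’s stat.Empirical to behave the same way for unweighted data, but treat that as an assumption and check the gonum documentation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the smallest value v in vals such that at least a
// fraction q of the values are <= v. The input slice is not modified.
func quantile(q float64, vals []float64) (float64, error) {
	if len(vals) == 0 {
		return 0, fmt.Errorf("quantile: no values")
	}
	if q < 0 || q > 1 {
		return 0, fmt.Errorf("quantile: q=%v is outside [0, 1]", q)
	}

	sorted := append([]float64(nil), vals...) // work on a copy
	sort.Float64s(sorted)

	i := int(math.Ceil(q*float64(len(sorted)))) - 1
	if i < 0 {
		i = 0 // q == 0 maps to the minimum
	}
	return sorted[i], nil
}

func main() {
	// Made-up tip fractions.
	pct := []float64{0.10, 0.15, 0.18, 0.20, 0.25, 0.30, 0.12, 0.22}
	v, err := quantile(0.75, pct)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("0.75 quantile: %.2f\n", v)
}
```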

Listing 8: Desired Tip

93  func desiredTip(r io.Reader) error {
94      tip, total, err := loadDataFiltered(r)
95      if err != nil {
96          return err
97      }
98  
99      fmt.Printf("%d filtered values\n", len(tip))
100 
101     pct := make([]float64, len(tip))
102     for i, t := range tip {
103         pct[i] = t / (total[i] - t)
104     }
105 
106     // stat.Quantile requires sorted values
107     sort.Float64s(pct)
108     q := 0.75
109     val := stat.Quantile(q, stat.Empirical, pct, nil)
110     fmt.Printf("%.2f quantile tip: %.2f\n", q, val)
111     return nil
112 }

Listing 8 shows the desiredTip function. On line 94, we load the filtered data. On line 99, we print how many values remain after filtering, to check that we didn’t filter out too many rows. On lines 101-104, we create a slice of percentages. Finally, on lines 107-109, we sort the percentages and calculate the 0.75 quantile, and on line 110, we print it.
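The percentage on line 103 is the tip divided by what the ride cost before the tip, which treats total_amount as already including the tip. A small worked sketch: tipFraction is a hypothetical helper, and the guard against a zero base is my addition (the listing divides without checking, so a row where the tip equals the total would produce +Inf).

```go
package main

import "fmt"

// tipFraction returns tip / (total - tip), the tip as a fraction of the
// pre-tip amount.
func tipFraction(tip, total float64) (float64, error) {
	base := total - tip
	if base <= 0 {
		return 0, fmt.Errorf("tip %.2f leaves no pre-tip amount in total %.2f", tip, total)
	}
	return tip / base, nil
}

func main() {
	// A $12.00 total that includes a $2.00 tip: the ride itself cost $10.00.
	f, err := tipFraction(2.0, 12.0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("tip: %.0f%%\n", f*100)
}
```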

Listing 9: Running Desired Tip

$ go run taxi.go -tip < taxi-01-2020-sample.csv.bz2 
716422 filtered values
0.75 quantile tip: 0.20

Listing 9 shows the tip step output. We see we filtered out about 30% of the rows. And finally, we see that the 75% quantile is 20%.

But wait! Maybe we should tip more on the weekend? Let’s have a look:

Listing 10: Loading Data With Time

145 func unmarshalTime(data []byte, t *time.Time) error {
146     var err error
147     *t, err = time.Parse("2006-01-02 15:04:05", string(data))
148     return err
149 }
150 
151 type TimeRow struct {
152     Tip   float64   `csv:"tip_amount"`
153     Total float64   `csv:"total_amount"`
154     Time  time.Time `csv:"tpep_pickup_datetime"`
155 }
156 
157 func loadDataWithTime(r io.Reader) ([]time.Time, []float64, []float64, error) {
158     var tip, total []float64
159     var times []time.Time
160     dec, err := csvutil.NewDecoder(csv.NewReader(r))
161     if err != nil {
162         return nil, nil, nil, err
163     }
164     dec.Register(unmarshalTime)
165 
166     for {
167         var row TimeRow
168 
169         err := dec.Decode(&row)
170         if err == io.EOF {
171             break
172         }
173 
174         if err != nil {
175             return nil, nil, nil, err
176         }
177 
178         if row.Total <= 0 || row.Tip <= 0 || row.Total > 100 {
179             continue
180         }
181 
182         tip = append(tip, row.Tip)
183         total = append(total, row.Total)
184         times = append(times, row.Time)
185     }
186 
187     return times, tip, total, nil
188 }

Listing 10 shows how to load the data with time. On lines 145-149, we write an unmarshalTime function to parse time from a []byte. On lines 151-155, we define TimeRow, which is a row that contains a Time field. On line 159, we define the times slice. Once the decoder is created on line 160, we register unmarshalTime to handle time.Time fields. Finally on line 187, we return the times, tip and total.
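The layout string passed to time.Parse is Go’s reference time written in the shape of the data. Here is a quick standalone sketch of the parsing step; pickupWeekday is a name made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// layout is Go's reference time (Jan 2 2006, 15:04:05) written in the
// format used by tpep_pickup_datetime.
const layout = "2006-01-02 15:04:05"

// pickupWeekday parses a pickup timestamp and returns its weekday.
func pickupWeekday(s string) (time.Weekday, error) {
	t, err := time.Parse(layout, s)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %w", s, err)
	}
	return t.Weekday(), nil
}

func main() {
	for _, s := range []string{"2020-01-04 10:30:00", "01/04/2020"} {
		wday, err := pickupWeekday(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s, "is a", wday)
	}
}
```

Note that time.Parse returns UTC when the input has no time zone, which is fine for grouping by weekday as long as all rows are treated the same way.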

Listing 11: Tip by Weekday

190 func weekdayTip(r io.Reader) error {
191     times, tip, total, err := loadDataWithTime(r)
192     if err != nil {
193         return err
194     }
195 
196     pct := make(map[time.Weekday][]float64)
197     for i, t := range tip {
198         wday := times[i].Weekday()
199         p := t / (total[i] - t)
200         pct[wday] = append(pct[wday], p)
201     }
202 
203     for wday := time.Sunday; wday <= time.Saturday; wday++ {
204         // stat.Quantile requires sorted values
205         p := pct[wday]
206         sort.Float64s(p)
207         q := 0.75
208         val := stat.Quantile(q, stat.Empirical, p, nil)
209         fmt.Printf("%-10s: %.2f quantile tip: %.2f (%6d samples)\n", wday, q, val, len(p))
210     }
211 
212     return nil
213 }

Listing 11 shows the “tip by weekday” calculation. On line 196, we use a map to hold percentages per weekday. On lines 197-201, we populate the percentages for each weekday. This is the equivalent of a “GROUP BY” operation in a database. On lines 203-209, we go over each weekday, calculate the 0.75 quantile and print it out.
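The grouping pattern is worth seeing on its own. Below is a tiny sketch with made-up timestamps and values; groupByWeekday is a hypothetical helper that takes the timestamps as strings so the example stays self-contained. The loop in main uses <= time.Saturday so that all seven days are visited.

```go
package main

import (
	"fmt"
	"time"
)

const layout = "2006-01-02 15:04:05"

// groupByWeekday appends each value to the slice of its timestamp's weekday,
// much like GROUP BY collects rows per key.
func groupByWeekday(stamps []string, vals []float64) (map[time.Weekday][]float64, error) {
	if len(stamps) != len(vals) {
		return nil, fmt.Errorf("length mismatch: %d timestamps, %d values", len(stamps), len(vals))
	}
	groups := make(map[time.Weekday][]float64)
	for i, s := range stamps {
		t, err := time.Parse(layout, s)
		if err != nil {
			return nil, err
		}
		// Appending to a missing key works: the zero value is a nil slice.
		groups[t.Weekday()] = append(groups[t.Weekday()], vals[i])
	}
	return groups, nil
}

func main() {
	stamps := []string{"2020-01-04 10:00:00", "2020-01-05 11:00:00", "2020-01-11 09:00:00"}
	pct := []float64{0.20, 0.10, 0.30}

	groups, err := groupByWeekday(stamps, pct)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for wday := time.Sunday; wday <= time.Saturday; wday++ {
		fmt.Printf("%-10s: %d samples\n", wday, len(groups[wday]))
	}
}
```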

On line 209, we use the %-10s verb so that every weekday name is left-aligned and padded to at least 10 characters. This might seem like a trivial detail, but aligned output is much easier for us humans to compare, as you can see in the output below. We also pad the number of samples with %6d for the same reason.
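Here is the same format string in isolation, so you can see what the width flags do. The line function is a small wrapper made up for this example; the sample counts are the Sunday and Wednesday values from the output in Listing 12.

```go
package main

import (
	"fmt"
	"time"
)

// line formats one row of the weekday report.
// %-10s left-aligns the weekday in a 10 character column,
// %6d right-aligns the sample count in a 6 character column.
func line(day time.Weekday, q, val float64, n int) string {
	return fmt.Sprintf("%-10s: %.2f quantile tip: %.2f (%6d samples)", day, q, val, n)
}

func main() {
	fmt.Println(line(time.Sunday, 0.75, 0.20, 77942))
	fmt.Println(line(time.Wednesday, 0.75, 0.20, 118497))
}
```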

Listing 12: Running Weekday Tip

$ go run taxi.go -daily < taxi-01-2020-sample.csv.bz2
Sunday    : 0.75 quantile tip: 0.20 ( 77942 samples)
Monday    : 0.75 quantile tip: 0.20 ( 82561 samples)
Tuesday   : 0.75 quantile tip: 0.20 ( 97634 samples)
Wednesday : 0.75 quantile tip: 0.20 (118497 samples)
Thursday  : 0.75 quantile tip: 0.20 (125692 samples)
Friday    : 0.75 quantile tip: 0.20 (125743 samples)

Listing 12 shows how to run the daily step. In the output above, which lists Sunday through Friday, the 0.75 quantile is 20% on every day, so we see no difference in the tip percentage over the weekend.

Conclusion

A little curiosity and a little Go will get you pretty far in your data science journey. You don’t have to use deep learning, decision trees, support vector machines and other algorithms to get useful answers.

How are you using Go for data science? I’d love to hear, ping me at miki@ardanlabs.com.
