Condividi tramite


Using PowerShell to generate a large test CSV file with random data

I recently posted a new blog that shows how to load a very large CSV file into Excel, breaking the limit of 1 million rows in a single Excel sheet. If you haven't seen it, you should check it out: Loading CSV/text files with more than a million rows into Excel.

One of the challenges for me while writing that blog post was finding sample data to demonstrate that Excel feature. I have a few files that are that big from work, but obviously I could not share those publicly. I also tried to find some sample data on public sites like https://data.gov but I could not find anything with more than 1 million rows.

The only option I had left was creating a sample file myself with some random data. I created a simple PowerShell script to create a file with a few columns filled with random data. You would think that it would be straightforward to do that, but my first version, which wrote one line to the file at a time, would take hours to generate a file with 2 million rows. Not good enough.

I optimized it by creating a thousand rows in memory and writing that batch to the file in one operation. Repeating that 2,000 times gave me a file with two million rows in under 20 minutes. It included 4 columns and about 36 characters per row. I ended up with a file size of 68.8 MB (72,144,471 bytes).

Note that I had to specify the encoding, since the default encoding in a regular PowerShell output would create double-byte characters and would make the resulting file twice as large.

Below is the PowerShell script, both as text and a picture with the color coding from the PowerShell ISE.

 $Start = Get-Date
$BaseDate = (Get-Date "2016/12/31 12:59:59")
$FullYear = 366*24*60*60

$File = ”c:\users\jose\file.csv” 
"Date,Customer,Type,Value" | Out-File -FilePath $File -Encoding utf8

1..2000 | % {

  Write-Progress -Activity “Generating File” -PercentComplete ($_/20) 
  $Lines = ""
  1..1000 | % {
    $Dt = $BaseDate.AddSeconds(-(Get-Random $FullYear))
    $Ct = (Get-Random 100)
    if ((Get-Random 5) -lt 4) {$Ty="Sale"} else { $Ty="Return"}
    $Vl = (Get-Random 100000) / 100
    $Lines += [string]$Dt + "," + [string]$Ct + "," + $Ty + "," + [string]$Vl + [char]10 + [char]13
  }

  $Lines | Out-File -FilePath $File -Encoding utf8 -Append
}

$End = Get-Date
"Started at $Start and ended at $End"
$Diff = ($End-$Start).TotalSeconds
"Processing took $Diff seconds"

large-file-powershell