Vivado Non-Project Mode

I. The only way to go for serious FPGA designers

https://hwjedi.wordpress.com/2017/01/04/vivado-non-project-mode-the-only-way-to-go-for-serious-fpga-designers/

Vivado has been a huge improvement over ISE. Good riddance. However, as with any powerful tool, there are many knobs you can turn, which can be daunting to the average user. Vivado’s GUI is okay and I use it a lot for analyzing my compiled designs at different stages. More on that later. For compiling projects, however, I prefer to use a TCL based build script.

There are two modes to run the tools: Project Mode and Non-Project Mode. The GUI design flow uses Project Mode, and you can also script in Project Mode using TCL commands. Project Mode is a more automated and guided flow and is a great starting point for most designs. But if you want more customization and want to achieve maximum timing and area performance, you need to use the Non-Project Mode flow to unlock all the capabilities Vivado offers. In Non-Project Mode, an in-memory project is created to let the Vivado tools manage various properties of a design, but the project file is not written to disk, and the project status is not preserved.

The simplest Non-Project Mode TCL build script looks something like this:

# filename: build.tcl

# Assign part to in-memory project (will also create the in-memory project)
# Used when generating ip and executing synth, impl.
set_part "xcku060-ffva1517-2-i"

# read all design files
read_verilog -sv ../rtl/lms.sv
read_verilog -sv ../rtl/filt.sv
read_verilog -sv ../rtl/top.sv
read_ip ../rtl/x_ip/A19xB17pP37iq/A19xB17pP37iq.xci
# read constraints
read_xdc ../rtl/clocks.xdc
read_xdc ../rtl/pblocks.xdc
read_xdc ../rtl/pins.xdc
# generate ip
generate_target all [get_ips]
# Synthesize Design
synth_design -top top -part xcku060-ffva1517-2-i

# Opt Design 
opt_design

# Place Design
place_design 

# Route Design
route_design

# Write out bitfile
write_debug_probes -force my_proj/my_proj.ltx
write_bitstream -force my_proj/my_proj.bit

To run this build script, source it in a Vivado TCL shell like this:

$ vivado -mode tcl

****** Vivado v2016.4 (64-bit)
  **** SW Build 1733598 on Wed Dec 14 22:35:42 MST 2016
  **** IP Build 1731160 on Wed Dec 14 23:47:21 MST 2016
    ** Copyright 1986-2016 Xilinx, Inc. All Rights Reserved.

Vivado% source build.tcl
... Build messages ...
Vivado% exit

This should be pretty self-explanatory even to users who have only built using the GUI. You may want to go over UG835 (TCL Command Reference Guide) and UG904 (Vivado Implementation Guide) to help you understand the commands and their various options.

For synth_design there are a bunch of options like the ones you see in the GUI plus more. I have found that setting -shreg_min_size to 5 instead of the default value of 3 helps my design.
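As a minimal sketch, the option is simply passed on the synth_design command line. The top module and part below are the ones from the build script above; whether 5 is the right value for your design is something you have to find out by trying it.

```tcl
# Only infer SRL-based shift registers for chains of 5 or more registers
# (the default threshold is 3); shorter chains stay as plain flip-flops.
synth_design -top top -part xcku060-ffva1517-2-i -shreg_min_size 5
```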

For the implementation related commands Vivado provides different directives that attack the problem differently. Look at the TCL guide for details. Depending on your design, different directives and combinations of directives will yield better results. Note that using the GUI or a Project Mode TCL script you can essentially only run the following commands in this order:

opt_design -directive
place_design -directive
phys_opt_design -directive
route_design -directive
phys_opt_design -directive

However, it turns out there are steps/directives that you can execute only in Non-Project Mode. Also, and most importantly, you can run steps again iteratively to improve timing. For example:

opt_design
place_design -directive Explore
phys_opt_design -directive AggressiveExplore
phys_opt_design -directive AggressiveFanoutOpt
phys_opt_design -directive AlternateReplication
route_design -directive Explore
phys_opt_design -directive AggressiveExplore
route_design -tns_cleanup

Someone who has only been using the GUI up until now is probably saying WTF!! Believe me, running the same commands again with different directives or after phys_opts can lead to significant improvements in timing. I frankly would not be able to fit my design in the FPGA I’m targeting without these capabilities. There are a lot of ways to skin the cat and it takes a while to get a feel for what works and what doesn’t for your design. But if you are having trouble fitting your design, it’s completely worth the effort.

Finally, another benefit of Non-Project Mode is the ability to write out design checkpoints (DCPs) at various steps. This is helpful (and necessary in Non-Project Mode, since the design is in memory and not stored to disk) because you can reopen your design at an intermediate stage and try to improve timing from that point instead of starting from scratch. With the many hours it can take to fit today’s large FPGAs, this is an invaluable asset.
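A minimal sketch of the checkpoint idea, assuming a checkpoint path of my_proj/my_proj_post_place.dcp (a placeholder name, not something from the script above):

```tcl
# In the build script, right after placement: save the in-memory design.
write_checkpoint -force my_proj/my_proj_post_place.dcp

# Later, in a fresh Vivado TCL session: reload the placed design and
# carry on from there without redoing synthesis, opt or place.
open_checkpoint my_proj/my_proj_post_place.dcp
phys_opt_design -directive AggressiveExplore
route_design -directive Explore
```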

In my next post I’ll go over tips to help you create a build script for your needs that takes advantage of the stuff mentioned here.

II. Building off a solid foundation

https://hwjedi.wordpress.com/2017/01/29/vivado-non-project-mode-part-ii-building-off-a-solid-foundation/

In my last post we talked about Vivado’s Non-Project Mode to build FPGA designs. Now let’s figure out how to come up with the right strategy to achieve your timing goals.

Before we dive into this, it’s important to recognize a general fact about building FPGA designs: improvements made earlier in the flow have a greater impact on your chances of closing timing and finishing faster. It’s crucial to first try to resolve timing issues by fixing RTL code and ensuring that your design constraints are proper. Over-constraining your design can lead to an unnecessary increase in build time and frustration. Also make sure critical warnings are dealt with. The less stress we put on the placer, the more the placer can do. With better placer QOR, the router does a better job. You get the point.

Take a look at the build script below. It’s complete in most ways and will work fine as-is. It’s also a great place to build off of. There is a “User Settings” section, which is where a user would typically make any modifications depending on their project. In the “Build Design” section you’ll see the familiar commands that we discussed in the last post. After each build step, any reports I want are generated and a design checkpoint (DCP) is created that I can go back to and analyze in the GUI later on if needed.
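As a rough sketch of what checking constraints early can look like, commands along these lines could be run right after synth_design. The report file names are placeholders, and which reports are actually worth reading depends on your design.

```tcl
# Flag missing or suspicious timing constraints, such as unconstrained
# endpoints or register clock pins with no clock defined.
check_timing -verbose -file post_synth_check_timing.rpt

# Show which clock pairs are timed against each other, so unintended
# cross-clock paths can be spotted and constrained properly.
report_clock_interaction -file post_synth_clock_interaction.rpt
```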

# filename: build.tcl

set BUILD_DATE [ clock format [ clock seconds ] -format %m%d%Y ]
set BUILD_TIME [ clock format [ clock seconds ] -format %H%M%S ]

#######################################################################################
# User Settings 
#######################################################################################

# global settings
set PROJ_NM "best_proj_everrr"
set PROJ_DIR "./$PROJ_NM"
set PART_NM "xcku060-ffva1517-2-i"

# synthesis related settings
set SYNTH_ARGS ""
append SYNTH_ARGS " " -flatten_hierarchy " " rebuilt " "
append SYNTH_ARGS " " -gated_clock_conversion " " off " "
append SYNTH_ARGS " " -bufg " {" 12 "} "
append SYNTH_ARGS " " -fanout_limit " {" 10000 "} "
append SYNTH_ARGS " " -directive " " Default " "
append SYNTH_ARGS " " -fsm_extraction " " auto " "
#append SYNTH_ARGS " " -keep_equivalent_registers " "
append SYNTH_ARGS " " -resource_sharing " " auto " "
append SYNTH_ARGS " " -control_set_opt_threshold " " auto " "
#append SYNTH_ARGS " " -no_lc " "
#append SYNTH_ARGS " " -shreg_min_size " {" 3 "} "
append SYNTH_ARGS " " -shreg_min_size " {" 5 "} "
append SYNTH_ARGS " " -max_bram " {" -1 "} "
append SYNTH_ARGS " " -max_dsp " {" -1 "} "
append SYNTH_ARGS " " -cascade_dsp " " auto " "
append SYNTH_ARGS " " -verbose

set DEFINES ""
append DEFINES -verilog_define " " USE_DEBUG " "

set TOP_MODULE "pin_top"

#######################################################################################
# Build Design
#######################################################################################

# Assign part to in-memory project (will also create the in-memory project)
# Used when generating ip and executing synth, impl.
set_part $PART_NM

# read all design files and constraints
source sources.tcl
source constraints.tcl

# Synthesize Design
eval "synth_design $DEFINES $SYNTH_ARGS -top $TOP_MODULE -part $PART_NM"
report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_synth_tim.rpt
report_utilization -file $PROJ_DIR/${PROJ_NM}_post_synth_util.rpt
write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_synth.dcp

# Opt Design 
opt_design -directive Explore
report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_opt_tim.rpt
report_utilization -file $PROJ_DIR/${PROJ_NM}_post_opt_util.rpt
write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_opt.dcp
# Upgrade DSP connection warnings (like "Invalid PCIN Connection for OPMODE value") to
# an error because this is an error post route
set_property SEVERITY {ERROR} [get_drc_checks DSPS-*]
# Run DRC on opt design to catch early issues like comb loops
report_drc -file $PROJ_DIR/${PROJ_NM}_post_opt_drc.rpt

# Place Design
place_design -directive Explore 
report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_place_tim.rpt
report_utilization -file $PROJ_DIR/${PROJ_NM}_post_place_util.rpt
write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_place.dcp

# Post Place Phys Opt
phys_opt_design -directive AggressiveExplore
report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_place_physopt_tim.rpt
report_utilization -file $PROJ_DIR/${PROJ_NM}_post_place_physopt_util.rpt
write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_place_physopt.dcp

# Route Design
route_design -directive Explore
report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_route_tim.rpt
report_utilization -hierarchical -file $PROJ_DIR/${PROJ_NM}_post_route_util.rpt
report_route_status -file $PROJ_DIR/${PROJ_NM}_post_route_status.rpt
report_io -file $PROJ_DIR/${PROJ_NM}_post_route_io.rpt
report_power -file $PROJ_DIR/${PROJ_NM}_post_route_power.rpt
report_design_analysis -logic_level_distribution \
                       -of_timing_paths [get_timing_paths -max_paths 10000 \
                       -slack_lesser_than 0] \
                       -file $PROJ_DIR/${PROJ_NM}_post_route_vios.rpt
write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_route.dcp

set WNS [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]
puts "Post Route WNS = $WNS"

# Write out bitfile
write_debug_probes -force $PROJ_DIR/${PROJ_NM}_${BUILD_DATE}_${BUILD_TIME}_${WNS}ns.ltx
write_bitstream -force $PROJ_DIR/${PROJ_NM}_${BUILD_DATE}_${BUILD_TIME}_${WNS}ns.bit \
 -bin_file

So let’s start with the place_design command. Falling back on what I mentioned earlier, before you try different strategies after place, you must figure out the best placer directive to get the most impact later on. As of version 2016.4 there are 18 different directives that you can run. How do you figure out which one to use?! One way is brute force: just run builds with all of these directives and see what works best. But we can be smart about this and use the tools provided to our advantage. If you aren’t using an FPGA with SLRs, then all the directives prefixed with SSI_ should be avoided. Assuming you are having some trouble meeting timing, avoid Default, RuntimeOptimized, and Quick. That leaves us with 9 directives. Now you can use the following script on an existing post opt_design DCP to try each directive out and figure out which gives you the best post place timing. You save a lot of time here by using the post opt DCP since you don’t have to run synthesis and opt_design every time.

# filename: place_directive_explore.tcl

set PROJ_NM "best_proj_everrr"
set PROJ_DIR "./$PROJ_NM"

# list of place_design directives we want to try out
set directives "Explore \
                WLDrivenBlockPlacement \
                ExtraNetDelay_high \
                ExtraNetDelay_low \
                AltSpreadLogic_high \
                AltSpreadLogic_medium \
                AltSpreadLogic_low \
                ExtraPostPlacementOpt \
                ExtraTimingOpt"

# empty list for results
set wns_results ""
# empty list for time elapsed messages
set time_msg ""

foreach j $directives {
    # open post opt design checkpoint
    open_checkpoint $PROJ_DIR/${PROJ_NM}_post_opt.dcp
    # run place design with a different directive
    place_design -directive $j
    # append time elapsed message to time_msg list
    lappend time_msg [exec grep "place_design: Time (s):" vivado.log | tail -1]
    # append wns result to our results list
    set WNS [ get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup] ]
    append wns_results $WNS " "
}

# print out results at end
set i 0
foreach j $directives {
    puts "Post Place WNS with directive $j = [lindex $wns_results $i] "
    # time_msg holds exactly one entry per directive, so index it with $i
    puts [lindex $time_msg $i]
    puts " "
    incr i
}

This is a great example of the power of using DCPs and Non-Project Mode commands. (Probably not the best example of TCL scripting but it gets the job done.)

For a project I am working on I got the following results:

Post Place WNS with directive Explore = -0.023
place_design: Time (s): cpu = 01:06:52 ; elapsed = 00:33:19 . Memory (MB): peak = 9731.234 ; gain = 3642.824 ; free physical = 130182 ; free virtual = 177071

Post Place WNS with directive WLDrivenBlockPlacement = -0.432
place_design: Time (s): cpu = 01:17:27 ; elapsed = 00:43:37 . Memory (MB): peak = 14935.910 ; gain = 2993.324 ; free physical = 119939 ; free virtual = 166867

Post Place WNS with directive ExtraNetDelay_high = -0.134
place_design: Time (s): cpu = 02:23:59 ; elapsed = 01:50:46 . Memory (MB): peak = 20106.434 ; gain = 2993.379 ; free physical = 114133 ; free virtual = 161160

Post Place WNS with directive ExtraNetDelay_low = -0.470
place_design: Time (s): cpu = 01:18:29 ; elapsed = 00:44:22 . Memory (MB): peak = 25258.746 ; gain = 2978.348 ; free physical = 108836 ; free virtual = 155867

Post Place WNS with directive AltSpreadLogic_high = -0.192
place_design: Time (s): cpu = 01:10:46 ; elapsed = 00:36:28 . Memory (MB): peak = 30464.715 ; gain = 3030.332 ; free physical = 103222 ; free virtual = 150286

Post Place WNS with directive AltSpreadLogic_medium = -0.135
place_design: Time (s): cpu = 01:12:09 ; elapsed = 00:39:40 . Memory (MB): peak = 35560.871 ; gain = 2972.355 ; free physical = 98236 ; free virtual = 145303

Post Place WNS with directive AltSpreadLogic_low = -0.179
place_design: Time (s): cpu = 01:20:01 ; elapsed = 00:45:25 . Memory (MB): peak = 40713.355 ; gain = 2984.340 ; free physical = 93140 ; free virtual = 140211

Post Place WNS with directive ExtraPostPlacementOpt = -0.023
place_design: Time (s): cpu = 01:20:41 ; elapsed = 00:40:32 . Memory (MB): peak = 45909.020 ; gain = 2982.355 ; free physical = 88140 ; free virtual = 135126

Post Place WNS with directive ExtraTimingOpt = -0.160
place_design: Time (s): cpu = 01:11:59 ; elapsed = 00:37:57 . Memory (MB): peak = 51082.590 ; gain = 2974.355 ; free physical = 83043 ; free virtual = 130032

That’s quite a spread there. Seems like Explore is the way to go for me. It has the best WNS and shortest run time. Though I may want to try out some of the other decent candidates like “ExtraPostPlacementOpt” because it may have done placement in such a way that helps with routing. In my experience the post place WNS numbers are approximate and are more of an indication of what to expect after route. But you can definitely recognize directives to avoid using this methodology.
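To make the comparison easier to eyeball, the results can be sorted by WNS. This is only a sketch in plain TCL: the numbers are copied from the run above, and the `results` list is typed in by hand here rather than collected by the exploration script.

```tcl
# {directive WNS} pairs, copied from the results listed above.
set results {
    {Explore                -0.023}
    {WLDrivenBlockPlacement -0.432}
    {ExtraNetDelay_high     -0.134}
    {ExtraNetDelay_low      -0.470}
    {AltSpreadLogic_high    -0.192}
    {AltSpreadLogic_medium  -0.135}
    {AltSpreadLogic_low     -0.179}
    {ExtraPostPlacementOpt  -0.023}
    {ExtraTimingOpt         -0.160}
}

# Sort on the WNS column (index 1), best (least negative) slack first.
foreach pair [lsort -real -decreasing -index 1 $results] {
    lassign $pair directive wns
    puts [format "%-24s %7.3f" $directive $wns]
}
```

The same sort could be applied inside the exploration script if it collected each directive and its WNS as a pair instead of keeping two separate lists.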

I think we have chewed on enough for now. In my next post I’ll go over what I think is the most advantageous feature of Non-Project Mode – PhysOpt Looping.

III. Phys Opt Looping

https://hwjedi.wordpress.com/2017/02/09/vivado-non-project-mode-part-iii-phys-opt-looping/

Physical Optimization (PhysOpt) looping is one of the most powerful capabilities provided by Vivado to help with timing closure, but it is surprisingly difficult to find much public information on it. Hopefully this post changes that :).

So what is Physical Optimization? PhysOpt algos specifically target failing timing paths with different optimizations to aid in meeting timing. After place these optimizations are things like replicating regs to fix fan out issues, moving regs out of SRLs, BRAMs and DSPs, retiming, and reducing logic levels on critical paths. After route the optimizations are fewer but include things like routing and clock fixes on failing paths. Check out UG904 for details if you are interested.

Because PhysOpt requires timing data that is only available after placement, it cannot be run prior to placement (though there is a way to do this using iphysopt – more on that later). Running phys_opt_design after place_design has a significant impact. After route_design I’ve had mixed results and it does take a long time.

Like the other implementation commands, there are different options and directives for phys_opt_design. With project mode you can only run phys_opt_design once after place and once after route. There are no such restrictions here. I typically run 3 different directives one after another after place – something like this:

...
place_design -directive Explore
phys_opt_design -directive AggressiveExplore
phys_opt_design -directive AggressiveFanoutOpt
phys_opt_design -directive AlternateReplication
route_design -directive Explore
...

Each directive emphasizes different optimizations and running them back to back like this has proven to be very powerful. But you can go even further by looping on this group of commands to get even more oomph. This is called PhysOpt Looping! The idea is to run post place PhysOpt as many times as you can until timing is met or until you can’t improve WNS and then TNS any further.

The basic post place PhysOpt Loop looks like this:

...
place_design -directive Explore
set WNS [ get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup] ]
...
# Post Place PhysOpt Looping
set NLOOPS 5 
if {$WNS < 0.000} {
    for {set i 0} {$i < $NLOOPS} {incr i} {
        phys_opt_design -directive AggressiveExplore 
        phys_opt_design -directive AggressiveFanoutOpt
        phys_opt_design -directive AlternateReplication
    }
    report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_place_physopt_tim.rpt
    report_design_analysis -logic_level_distribution \
        -of_timing_paths [get_timing_paths -max_paths 10000 \
        -slack_lesser_than 0] \
        -file $PROJ_DIR/post_place_physopt_vios.rpt
    write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_place_physopt.dcp
}
route_design -directive Explore
...

What we are doing here is running 5 loops of 3 different PhysOpts if the estimated WNS was negative after place. The behavior you’ll notice is that the WNS keeps improving (moving toward zero) until no number of additional PhysOpts helps. However, TNS can still improve because PhysOpts might still work on other failing paths. This works fine but there are 2 improvements that can be had:

  1. If at any point in the loop you meet timing or are no longer improving WNS and TNS, you are going to just be running the commands unnecessarily and wasting time, lots and lots of time.
  2. The WNS number reported after place_design is an estimate since the design hasn’t been routed. So to ensure that the pre-route netlist has better QOR, it’s recommended to over-constrain the design prior to running the PhysOpt looping.

To address these concerns I’ve modified the script to:

...
# Post Place PhysOpt Looping
set NLOOPS 5 
set TNS_PREV 0
set WNS_SRCH_STR "WNS="
set TNS_SRCH_STR "TNS="

if {$WNS < 0.000} {
    # add over constraining 
    set_clock_uncertainty 0.200 [get_clocks clk_out1_mmcm]
    set_clock_uncertainty 0.100 [get_clocks clk_out2_mmcm]

    for {set i 0} {$i < $NLOOPS} {incr i} {
        phys_opt_design -directive AggressiveExplore
        # get WNS / TNS by getting lines with the search string in it (grep),
        # get the last line only (tail -1),
        # extracting everything after the search string (sed), and
        # cutting just the first value out (cut). whew!
        set WNS [ exec grep $WNS_SRCH_STR vivado.log | tail -1 | sed -n -e "s/^.*$WNS_SRCH_STR//p" | cut -d\  -f 1]                                    
        set TNS [ exec grep $TNS_SRCH_STR vivado.log | tail -1 | sed -n -e "s/^.*$TNS_SRCH_STR//p" | cut -d\  -f 1]
        if {($TNS == $TNS_PREV && $i > 0) || $WNS >= 0.000} {
            break
        }
        set TNS_PREV $TNS

        phys_opt_design -directive AggressiveFanoutOpt 
        set WNS [ exec grep $WNS_SRCH_STR vivado.log | tail -1 | sed -n -e "s/^.*$WNS_SRCH_STR//p" | cut -d\  -f 1]
        set TNS [ exec grep $TNS_SRCH_STR vivado.log | tail -1 | sed -n -e "s/^.*$TNS_SRCH_STR//p" | cut -d\  -f 1]
        if {($TNS == $TNS_PREV && $i > 0) || $WNS >= 0.000} {
            break
        }
        set TNS_PREV $TNS

        phys_opt_design -directive AlternateReplication
        set WNS [ exec grep $WNS_SRCH_STR vivado.log | tail -1 | sed -n -e "s/^.*$WNS_SRCH_STR//p" | cut -d\  -f 1]
        set TNS [ exec grep $TNS_SRCH_STR vivado.log | tail -1 | sed -n -e "s/^.*$TNS_SRCH_STR//p" | cut -d\  -f 1]
        if {($TNS == $TNS_PREV) || $WNS >= 0.000} {
            break
        }
        set TNS_PREV $TNS
    }

    # remove over constraining 
    set_clock_uncertainty 0 [get_clocks clk_out1_mmcm]
    set_clock_uncertainty 0 [get_clocks clk_out2_mmcm]

    report_timing_summary -file $PROJ_DIR/${PROJ_NM}_post_place_physopt_tim.rpt
    report_design_analysis -logic_level_distribution \
                           -of_timing_paths [get_timing_paths -max_paths 10000 \
                                                              -slack_lesser_than 0] \
                           -file $PROJ_DIR/post_place_physopt_vios.rpt
    write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_place_physopt.dcp
}
...

The script is pretty self-explanatory but I’ll go over a couple of things. Some of the clocks have been over-constrained using the set_clock_uncertainty command before running the PhysOpts, and the over-constraining is removed after exiting the loop. In order to get the WNS and TNS I’m searching for these values in the vivado.log file. I couldn’t find a better way to get TNS. It’s actually pretty fast and much quicker than running get_timing_paths. After running each phys_opt_design command, I check if the WNS >= 0.000 or if the TNS hasn’t improved from the previous phys_opt_design. The only time the TNS check isn’t done is when you go through the loop the first time, because you want to make sure you run each of the directives at least once. This TNS check has saved me a lot of time because in situations where the WNS never gets to 0.000 and the TNS stops improving (pretty common), you would keep doing unnecessary PhysOpts and wasting loads of time.
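The repeated grep/sed/cut pipelines could also be folded into a small pure-TCL helper. The following is only a sketch: `last_metric` is a hypothetical name, and it assumes the number follows the search string directly in vivado.log, which is the same assumption the pipelines above make.

```tcl
# Hypothetical helper: return the last value that follows $key (e.g. "WNS=")
# in $logfile, or an empty string if $key never appears.
# $key is used inside a regular expression, so keep it free of regex
# metacharacters ("WNS=" and "TNS=" are fine).
proc last_metric {logfile key} {
    set fh [open $logfile r]
    set text [read $fh]
    close $fh
    # -all -inline returns {match value match value ...}; the final element
    # is the value captured from the last occurrence in the file.
    set hits [regexp -all -inline -- "${key}(\[-+0-9.\]+)" $text]
    if {[llength $hits] == 0} {
        return ""
    }
    return [lindex $hits end]
}

set WNS [last_metric vivado.log "WNS="]
set TNS [last_metric vivado.log "TNS="]
```

This reads the whole log into memory on every call, so for very large logs the grep approach may still be the lighter option. It does remove the dependency on external Unix tools.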

It’s worth mentioning that if your post place WNS is very negative, say worse than -0.700 ns, you probably aren’t going to succeed. You need to revisit your design or try different placer directives.

If you don’t close timing after route, it’s recommended to run PhysOpt only once. So no looping business at this point.
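A minimal sketch of that post route step, reusing the WNS query and the $PROJ_DIR / $PROJ_NM variables from the build script. The choice of AggressiveExplore and the checkpoint name are assumptions on my part; use whichever directive has worked for you.

```tcl
route_design -directive Explore
set WNS [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]

# Single post route PhysOpt pass, and only if timing is still failing.
if {$WNS < 0.000} {
    phys_opt_design -directive AggressiveExplore
    set WNS [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]
    puts "Post Route PhysOpt WNS = $WNS"
    write_checkpoint -force $PROJ_DIR/${PROJ_NM}_post_route_physopt.dcp
}
```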

One more thing…

You can run the report_phys_opt command after the last PhysOpt command to see each optimization made in detail. It may give you an idea of what’s wrong in your design. If you can fix some of the issues in RTL or with constraints, you’ll be able to get better QOR and shorten your build time because you may not have to run all those PhysOpts.

Another feature that was introduced recently, in 2015.3, is Interactive PhysOpt. With this flow you can save off the optimizations done in one run and apply them prior to running place! This significantly improves the place QOR and run time. I have yet to actually try this out myself, but it seems really promising and I will fill you in on my observations in the future.
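For example, something like this could go right after the PhysOpt loop. The report file name is a placeholder; $PROJ_DIR and $PROJ_NM are the variables from the build script.

```tcl
# Dump the list of optimizations performed by the preceding
# phys_opt_design runs to a file for later inspection.
report_phys_opt -file $PROJ_DIR/${PROJ_NM}_post_place_physopt_opts.rpt
```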

First use your last run by opening the post_place_physopt.dcp. Then write all the PhysOpt optimizations out to a tcl file.

set PROJ_NM "best_proj_everrr"
set PROJ_DIR "./$PROJ_NM"
open_checkpoint $PROJ_DIR/${PROJ_NM}_post_place_physopt.dcp
write_iphys_opt_tcl -place $PROJ_DIR/${PROJ_NM}_post_place_physopt.tcl

You’ll see that this tcl file has a bunch of iphys_opt_design commands that execute the optimizations. Use this tcl file prior to running place_design on your next run or try it out on a post_opt.dcp. You can choose to apply all the optimizations in the tcl file or a subset such as just fanout or BRAM optimizations. At this point I don’t have a good feeling for whether you should apply all or a subset and which subset of optimizations. Will have to play with this.

set PROJ_NM "best_proj_everrr"
set PROJ_DIR "./$PROJ_NM"
open_checkpoint $PROJ_DIR/${PROJ_NM}_post_opt.dcp
read_iphys_opt_tcl -fanout_opt -place $PROJ_DIR/${PROJ_NM}_post_place_physopt.tcl
place_design -directive Explore
...
